"""
If you use the VADER sentiment analysis tools, please cite:

Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for
Sentiment Analysis of Social Media Text. Eighth International Conference on
Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.
"""

### Package imports 

In [3]:
#General
import numpy as np 
import pandas as pd

#Sentiment Analysis
import nltk  # Natural Language Toolkit
import re  # regular expressions
import html.parser  # HTML/XHTML parsing from the standard library; also makes the html package (and html.unescape) available
from nltk.sentiment.vader import SentimentIntensityAnalyzer  # assigns sentiment intensity scores to sentences


VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media, and works well on texts from other domains. 

In [4]:
# Uncomment on first run to download the VADER lexicon
#nltk.downloader.download('vader_lexicon')

### Data import and processing

In [5]:
dataset = pd.read_csv('./vienna_data/raw/reviews_17_03_2020.csv')

In [10]:
dataset.head(5)

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,15883,29643839,2015-04-10,30537860,Robert,"If you need a clean, comfortable place to stay..."
1,15883,80590019,2016-06-19,37529754,Chuang,It's so nice to be in the house! It's a peace ...
2,15883,89583522,2016-07-29,3147341,Arber,"A beautiful place, uniquely decorated showing ..."
3,15883,93550424,2016-08-13,29518067,Raphaela,Eine sehr schöne Unterkunft in einem privaten ...
4,15883,114990769,2016-11-21,36016357,Chris,It was a very pleasant stay. Excellent locatio...


In [7]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 470279 entries, 0 to 470278
Data columns (total 6 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   listing_id     470279 non-null  int64 
 1   id             470279 non-null  int64 
 2   date           470279 non-null  object
 3   reviewer_id    470279 non-null  int64 
 4   reviewer_name  470279 non-null  object
 5   comments       469979 non-null  object
dtypes: int64(3), object(3)
memory usage: 21.5+ MB


In [8]:
dataset.isna().sum()

listing_id         0
id                 0
date               0
reviewer_id        0
reviewer_name      0
comments         300
dtype: int64

In [9]:
# comments is the only column with missing values, so the rows where it is missing are dropped
dataset=dataset.dropna()
dataset.reset_index(drop=True, inplace = True)

In [11]:
print(f'The remaining dataset contains {len(dataset)} reviews \n')

dataset.head()

The remaining dataset contains 469979 reviews 



Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,15883,29643839,2015-04-10,30537860,Robert,"If you need a clean, comfortable place to stay..."
1,15883,80590019,2016-06-19,37529754,Chuang,It's so nice to be in the house! It's a peace ...
2,15883,89583522,2016-07-29,3147341,Arber,"A beautiful place, uniquely decorated showing ..."
3,15883,93550424,2016-08-13,29518067,Raphaela,Eine sehr schöne Unterkunft in einem privaten ...
4,15883,114990769,2016-11-21,36016357,Chris,It was a very pleasant stay. Excellent locatio...


In [12]:
# keep only the listing_id, reviewer_id and comments columns
dataset=dataset.drop(['id','reviewer_name','date'],axis=1)
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 469979 entries, 0 to 469978
Data columns (total 3 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   listing_id   469979 non-null  int64 
 1   reviewer_id  469979 non-null  int64 
 2   comments     469979 non-null  object
dtypes: int64(2), object(1)
memory usage: 10.8+ MB


In [13]:
dataset.tail(20)

Unnamed: 0,listing_id,reviewer_id,comments
469959,42642091,299008765,La estancia fue increíble. Nos dejaron cosas p...
469960,42665072,14455623,The apartment is clean and simple. You have to...
469961,42665072,6036216,"Very cute and great apartment, price vs value ..."
469962,42665072,196311212,This accommodation is not so far from Vienna h...
469963,42666433,126455592,Me and my friend really liked it\nWe loved it ❤️
469964,42667250,141082281,"Super nette Gastgeberin, kleine feine Wohnung,..."
469965,42667250,30923871,Leticia ist sehr nett und hilfsbereit. Die Woh...
469966,42667250,224507883,The host canceled this reservation 21 days bef...
469967,42667250,33800333,The host canceled this reservation 23 days bef...
469968,42667250,294752327,The host canceled this reservation 27 days bef...


In [11]:
dataset['comments'][469962]

'This accommodation is not so far from Vienna hbf station. Alexander is very kind and contacted us frequently. If you choose this accommodation, you can be got a rare experience!\n\nこの宿はｳｨｰﾝ中央駅からそんなに遠く離れていません｡Alexanderさんは頻繁に連絡をくれるため､とても親切で安心できます｡この宿では少し珍しい体験をできるかもしれません!'

In [12]:
dataset['comments'][469963]

'Me and my friend really liked it\nWe loved it ❤️'

The first comment above was originally written in Japanese and was probably translated automatically by Airbnb.

As can be seen, the dataset contains comments in multiple languages as well as emojis.
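To get a rough idea of which writing systems appear in a comment, a small standard-library sketch like the one below could be used. The helper name `scripts_in` is hypothetical, and taking the first word of each character's Unicode name as its "script" is a simplifying assumption rather than a proper script lookup. Emojis are not letters, so they are ignored here.

```python
import unicodedata

def scripts_in(text):
    """Return the set of script labels found among the letters of text.

    The label is the first word of the Unicode character name
    (e.g. 'LATIN', 'HIRAGANA', 'CJK'), which is only a heuristic.
    """
    scripts = set()
    for ch in text:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, 'UNKNOWN').split()[0])
    return scripts

latin_only = scripts_in('We loved it')
mixed = scripts_in('Alexander この宿')
print(latin_only, mixed)
```

A comment whose set contains anything other than `'LATIN'` is a candidate for closer inspection before scoring.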

## Data cleaning 


The objective of this notebook is to compute polarity scores for each review.

One advantage of VADER is that it does not require training data. It also performs well on online text and can interpret emoticons and abbreviations typical of the language used on the web.

In [14]:
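# define a function that replaces line breaks and tabs with spaces and unescapes HTML entities
def unescape_data(textData):
    # inside a character class '|' is a literal pipe, so it is left out here
    textData = re.sub(r'[\r\n\t]', r' ', textData)
    # html.unescape is the documented function for decoding HTML entities
    textData = html.unescape(textData)
    return textData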
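
As a quick check of what this kind of helper does, the sketch below applies the same two steps to a made-up string. The function name `normalize_comment` and the sample text are hypothetical and only serve as illustration.

```python
import html
import re

def normalize_comment(text):
    # Replace carriage returns, newlines and tabs with one space each
    text = re.sub(r'[\r\n\t]', ' ', text)
    # Decode HTML entities such as &amp; or &#39;
    return html.unescape(text)

sample = 'Great host &amp; location!\nWe&#39;d come back.'
cleaned = normalize_comment(sample)
print(cleaned)
```

Each control character is replaced by exactly one space, so two consecutive newlines leave two spaces behind.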

In [14]:
## define function for cleaning reviews sentences 
#def clean_data(textData):
#    textData = re.sub(r'[?|$|.|!|&|\r|\n|;|-|,|:|\d+]', r' ', textData)
#    textData = html.parser.unescape(textData)
#    return textData

In [17]:
length = len(dataset.comments)

In [16]:
# unescape HTML entities and do a bit of cleaning (removing \n, \r and \t)
print("Review count", length)
dataset['comments'] = dataset['comments'].apply(unescape_data)

Review count 469979


In [19]:
dataset['comments'][469962]

'This accommodation is not so far from Vienna hbf station. Alexander is very kind and contacted us frequently. If you choose this accommodation, you can be got a rare experience!  この宿はｳｨｰﾝ中央駅からそんなに遠く離れていません｡Alexanderさんは頻繁に連絡をくれるため､とても親切で安心できます｡この宿では少し珍しい体験をできるかもしれません!'

In [18]:
dataset['comments'][469963]

'Me and my friend really liked it We loved it ❤️'

The pos, neg and neu scores are proportions between 0 and 1, and the compound score ranges from -1 to 1. All four metrics will be multiplied by 100 for better readability.
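
The scaling itself is simple arithmetic. The sketch below applies it to a dictionary shaped like the output of `polarity_scores()`. The helper name `to_percent` is hypothetical, the example values are chosen for illustration, and the rounding to two decimals is an addition to hide floating-point noise.

```python
def to_percent(score):
    """Multiply every metric by 100 and round to two decimals."""
    return {name: round(value * 100, 2) for name, value in score.items()}

# Illustrative values in the native VADER ranges:
# neg/neu/pos in [0, 1], compound in [-1, 1]
example = {'neg': 0.0, 'neu': 0.466, 'pos': 0.534, 'compound': 0.9604}
scaled = to_percent(example)
print(scaled)
```

After scaling, the three proportions still sum to 100 (up to rounding), while compound lies between -100 and 100.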

In [20]:
# polarity_scores returns a dict with the 'neg', 'neu', 'pos' and 'compound' scores of the input text

sentiment_analyzer = SentimentIntensityAnalyzer()

positive_rating = []
negative_rating = []
neutral_rating = []
compound_rating = []

for i in range(0,length):
    score = sentiment_analyzer.polarity_scores(dataset.comments.values[i])
    positive_rating.append(score['pos']*100)
    negative_rating.append(score['neg']*100)
    neutral_rating.append(score['neu']*100)
    compound_rating.append(score['compound']*100)

In [22]:
dataset["Positive"] = np.array(positive_rating)
dataset["Negative"] = np.array(negative_rating)
dataset["Neutral"] = np.array(neutral_rating)
dataset["Compound"] = np.array(compound_rating)

In [20]:
dataset.isna().sum()

listing_id     0
reviewer_id    0
comments       0
Positive       0
Negative       0
Neutral        0
Compound       0
dtype: int64

In [23]:
dataset.head(20)

Unnamed: 0,listing_id,reviewer_id,comments,Positive,Negative,Neutral,Compound
0,15883,30537860,"If you need a clean, comfortable place to stay...",24.1,5.8,70.0,94.8
1,15883,37529754,It's so nice to be in the house! It's a peace ...,53.4,0.0,46.6,96.04
2,15883,3147341,"A beautiful place, uniquely decorated showing ...",22.4,5.2,72.4,92.98
3,15883,29518067,Eine sehr schöne Unterkunft in einem privaten ...,0.0,0.0,100.0,0.0
4,15883,36016357,It was a very pleasant stay. Excellent locatio...,62.2,0.0,37.8,92.08
5,15883,5706500,Eva's place is the perfect place to stay while...,16.1,0.0,83.9,77.13
6,15883,118138020,Eva has created a unique and beautiful Bed and...,28.9,3.3,67.7,95.54
7,15883,11338741,Der Aufenthalt bei Eva war zu meiner vollsten ...,4.2,29.3,66.5,-97.61
8,15883,142176049,"Eva was a great host, very responsive and extr...",30.4,0.0,69.6,96.21
9,15883,14187109,Nice place to stay and relax and be taken care...,38.6,0.0,61.4,93.37


VADER works only for English texts.

The 8th review in the dataset (index 7) was originally written in German. Its compound score is -97.61, which would indicate a highly negative review, but the translation reads as follows:

*The stay with Eva was to my complete satisfaction. The communication was super friendly and Eva was always available and she replied promptly to messages. The property is a few streets from the main street, so there is no street noise and free parking in front of the door. The room was exactly as described. Breakfast was varied and more than sufficient. I would definitely come back.*

For this reason, the comments will be filtered to keep only English reviews.
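
One very rough way to approximate such a filter with the standard library alone is to measure how many tokens are common English function words. This is a naive heuristic, not the method used in this analysis; the word list, the threshold and the name `looks_english` are all assumptions made for illustration. A dedicated language-identification library would be more robust.

```python
import re

# Small hand-picked set of common English function words (an assumption,
# not an exhaustive stopword list)
ENGLISH_WORDS = {'the', 'and', 'is', 'was', 'it', 'to', 'a', 'of',
                 'we', 'very', 'this', 'for', 'with', 'you'}

def looks_english(text, threshold=0.2):
    """Return True if the share of common English words reaches the threshold."""
    tokens = re.findall(r'[^\W\d_]+', text.lower())  # runs of letters only
    if not tokens:
        return False
    hits = sum(token in ENGLISH_WORDS for token in tokens)
    return hits / len(tokens) >= threshold

print(looks_english('It was a very pleasant stay.'))
print(looks_english('Der Aufenthalt bei Eva war zu meiner vollsten Zufriedenheit.'))
```

With pandas, a boolean mask such as `dataset['comments'].apply(looks_english)` could then be used to select the rows to keep. Short or mixed-language comments are the most likely to be misclassified.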

The polarity metrics for non-English reviews are not reliable. Those reviews would have to be translated first, but [translating](https://pypi.org/project/google-cloud-translate/) this much data could take days (using [time.sleep()](https://realpython.com/python-sleep/) to stay under the request limit of the public Google Translate API) or would [cost 20 dollars per 1 million characters](https://cloud.google.com/translate/pricing), which is not feasible for the purpose of this analysis.
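
The cost side of this trade-off can be estimated with a few lines of arithmetic. The sketch below uses the price quoted above; the function name `translation_cost` is hypothetical and the input strings are placeholders, not actual reviews.

```python
PRICE_PER_MILLION_CHARS = 20.0  # USD, the figure quoted above

def translation_cost(comments, price_per_million=PRICE_PER_MILLION_CHARS):
    """Estimate the translation cost in USD from the total character count."""
    total_chars = sum(len(comment) for comment in comments)
    return total_chars / 1_000_000 * price_per_million

# Placeholder input: two strings with 500,000 characters in total
placeholder = ['a' * 300_000, 'b' * 200_000]
print(translation_cost(placeholder))
```

Passing only the non-English comments to such a function would give an upper bound on the spend before committing to a paid translation run.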

In [24]:
dataset['comments'][7]

'Der Aufenthalt bei Eva war zu meiner vollsten Zufriedenheit. Die Kommunikation war super freundlich und Eva war Jederzeit erreichbar und sie hat promt auf Nachrichten geantwortet. Die Unterkunft befindet sich ein paar Straßen von der Hauptstraße entfernt, somit gibt es keinen Straßenlärm und freie Parkplätze vor der Tür. Das Zimmer war genau so wie beschrieben. Das Frühstück war vielfältig und mehr als ausreichend. Ich würde auf alle Fälle wieder kommen.'
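
One plausible explanation for the strongly negative score is that frequent German words such as "war" (was) and "die" (the) are spelled like negative English words. Whether these exact tokens carry negative valence in VADER's lexicon is an assumption here and was not verified in this notebook. The sketch below simply counts such look-alike tokens in the first two sentences of the review.

```python
import re
from collections import Counter

# German words spelled like negative English words; their presence in
# VADER's lexicon is assumed, not checked
SUSPECT_TOKENS = {'war', 'die'}

german_excerpt = ('Der Aufenthalt bei Eva war zu meiner vollsten Zufriedenheit. '
                  'Die Kommunikation war super freundlich und Eva war Jederzeit erreichbar.')

tokens = re.findall(r'[^\W\d_]+', german_excerpt.lower())  # runs of letters only
suspect_counts = Counter(token for token in tokens if token in SUSPECT_TOKENS)
print(suspect_counts)
```

If the assumption holds, each of these occurrences would pull the compound score down even though the review is positive.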

In [25]:
dataset.to_csv('./vienna_data/reviews_polarity.csv',index=False)