# Data Analysis

The analysis uses scikit-learn, an open-source machine learning library for the Python programming language. It provides classification, regression and clustering algorithms, including support vector machines, logistic regression, naive Bayes classifiers, k-means and DBSCAN, and is designed to work with the NumPy and SciPy libraries.
As previously mentioned, the data set consists of about 300 reviews, of which 75% are used for training and the remaining 25% for testing.
After training, the model predicts the labels for new data through its `predict` method, which receives the vectorized test reviews as its argument.
The generated DataFrame will consist of the following columns:
* index: index that identifies the review within the original table
* nation: nationality of origin of the traveler
* Breakfast Y/N: binary value indicating whether the review mentions the breakfast service (1) or not (0)
* Rating: rating attributed to the review


In [None]:
import pandas as pd
import numpy as np

## Data preparation

In [3]:
df = pd.read_json('data/manual_annotations.json')
df.head(10)

Unnamed: 0,index,lang,review,amenity_1,amenity_2,amenity_3,amenity_4,amenity_5,amenity_6,hometown,rating
0,35689,da,The hotel is nicely located. Not out on Ortigi...,Roof terrace,Breakfast available,,,,,Danimarca,5
1,36725,da,The only good thing one can say about this hot...,Wifi,Air conditioning,Breakfast available,,,,Ungheria,1
2,37909,da,A wonderful little hotel that is very centrall...,Free parking,,,,,,Danimarca,5
3,38022,da,This hotel is surrounded by fields. There are ...,Pool,,,,,,Danimarca,5
4,39403,da,This small hotel located in the old town of Si...,Breakfast available,,,,,,Danimarca,4
6,41943,da,Here you will find a small oasis just inside t...,Breakfast available,parking area,Luggage storage,,,,Danimarca,4
7,42013,da,The hotel is in a green setting but there is n...,Restaurant,Breakfast available,,,,,Danimarca,4
8,42376,da,We have stayed here for a week and enjoyed eve...,Pool,Breakfast available,,,,,Danimarca,5
9,42471,da,We stayed there all 10 days we were in Sicily....,Breakfast available,Roof terrace,,,,,Danimarca,5
10,44728,da,Good location in the middle of the old town\nC...,,Breakfast available,,,,,Danimarca,4


In [4]:
sd = pd.read_json('data/amenities_translate_en.json')
sd

Unnamed: 0,language,amenity
0,en,Currency exchange
1,en,Private check-in / check-out
2,en,Breakfast available
3,en,Free breakfast
4,en,Breakfast in the room
...,...,...
178,en,Salt water swimming pool
179,en,Photocopier / fax in the congress center
180,en,Aerobics
181,en,Archery


## Preprocessing

In [5]:
# Lowercase the six amenity columns so that string comparisons are case-insensitive
for col in ['amenity_' + str(i) for i in range(1, 7)]:
    df[col] = df[col].str.lower()

In [6]:
s = sd['amenity'].str.lower()
servizi = s.to_list()

## Binary label assignment

In [7]:
amenity_columns = ['amenity_' + str(i) for i in range(1, 7)]

# Label a review 1 if any of its amenity columns is 'breakfast available', else 0
y = []
for _, row in df.iterrows():
    has_breakfast = any(row[col] == 'breakfast available' for col in amenity_columns)
    y.append(int(has_breakfast))
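
The labelling rule can also be written without an explicit loop. The sketch below applies it to a tiny hand-made table; the data, the two-column layout and the names `toy` and `toy_labels` are illustrative assumptions, not part of the original data set.

```python
import pandas as pd

# Hypothetical miniature of the annotation table (two amenity columns only).
toy = pd.DataFrame({
    "amenity_1": ["pool", "breakfast available", None],
    "amenity_2": ["breakfast available", None, None],
})
amenity_cols = ["amenity_1", "amenity_2"]

# A review gets label 1 if any amenity column equals the target string.
# Missing values compare as False, so they never trigger a positive label.
toy_labels = (
    toy[amenity_cols].eq("breakfast available").any(axis=1).astype(int).tolist()
)
print(toy_labels)
```

The comparison runs column-wise over the whole table, which tends to be faster than row-by-row access on larger tables.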

In [9]:
df['review'] = df['review'].str.lower()

## Train and test split

In [10]:
from sklearn.model_selection import train_test_split

X = df['review'].to_list()

# 75% of the reviews for training, 25% for testing; fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
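
The classification report further down shows 65 positive and 11 negative test reviews, so the classes are clearly imbalanced. One option worth considering in that situation is a stratified split, which `train_test_split` supports through its `stratify` parameter. The sketch below uses made-up data and hypothetical variable names; it is not the split used in this notebook.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 80 positive and 20 negative reviews.
toy_X = [f"review {i}" for i in range(100)]
toy_y = [1] * 80 + [0] * 20

# stratify=toy_y keeps the 80/20 class ratio in both partitions.
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=0, stratify=toy_y
)
print(Counter(ys_train), Counter(ys_test))
```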



## CountVectorizer

In [11]:
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()
X_train_vect = vect.fit_transform(X_train)
X_train_vect


<225x2858 sparse matrix of type '<class 'numpy.int64'>'
	with 15037 stored elements in Compressed Sparse Row format>

## Model

In [12]:
#from sklearn.tree import DecisionTreeClassifier
#from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
#from sklearn.svm import SVC

model = BernoulliNB()
model.fit(X_train_vect,y_train)



BernoulliNB()

## Predict

In [13]:
X_test_vect = vect.transform(X_test)

y_pred = model.predict(X_test_vect)

## Metrics

### Precision and recall

In [14]:
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

# y_pred was already computed in the Predict cell
print(precision_score(y_test, y_pred))
print(recall_score(y_test, y_pred))


0.8666666666666667
1.0


In [63]:
print(X_train[23])

the stay was fantastic, the room has every comfort and is very comfortable, it is in a great location and the staff is very friendly for every need! the wi-fi service, the air conditioning, the tv ... all functional !!! the bathroom is large with a spacious shower ... you deserve 5 stars!


In [65]:
y_train[23]

0

### AUC

In [167]:
from sklearn.metrics import roc_auc_score

# Reuse the hard 0/1 predictions for the test documents (labels, not probabilities)
predictions = y_pred

print('AUC: ', roc_auc_score(y_test, predictions))

AUC:  0.5454545454545454
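
The AUC above is computed from hard 0/1 predictions. In the binary case that value coincides with the balanced accuracy, that is, the mean of the recall on each class. A ranking-based AUC needs continuous scores, which `BernoulliNB` exposes through `predict_proba`. The sketch below illustrates the difference on invented numbers; the arrays are assumptions, not outputs of the model in this notebook.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Hypothetical ground truth and predicted probabilities for six reviews.
toy_true = np.array([0, 0, 1, 1, 1, 1])
toy_scores = np.array([0.40, 0.70, 0.60, 0.80, 0.90, 0.95])  # P(class 1)
toy_hard = (toy_scores >= 0.5).astype(int)  # thresholded 0/1 labels

auc_hard = roc_auc_score(toy_true, toy_hard)      # uses labels only
auc_scores = roc_auc_score(toy_true, toy_scores)  # uses the full ranking
bal_acc = balanced_accuracy_score(toy_true, toy_hard)

print(auc_hard, bal_acc, auc_scores)
```

With the fitted model, the scores would come from something like `model.predict_proba(X_test_vect)[:, 1]`.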


### Accuracy

In [159]:
from sklearn.metrics import accuracy_score

# Reuse y_pred from the Predict cell

print(accuracy_score(y_test, y_pred))

0.868421052631579


### Classification report

In [162]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      0.09      0.17        11
           1       0.87      1.00      0.93        65

    accuracy                           0.87        76
   macro avg       0.93      0.55      0.55        76
weighted avg       0.89      0.87      0.82        76
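
The metrics reported in this section can be cross-checked against each other. A recall of 0.09 on 11 negatives means that 1 negative review is classified correctly and 10 are not, while all 65 positives are recovered. The sketch below rebuilds label vectors with those counts and recomputes the scores; the vector names are hypothetical and the ordering of the reviews is arbitrary.

```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, roc_auc_score
)

# Counts implied by the report: TN=1, FP=10, FN=0, TP=65.
y_true_demo = [0] * 11 + [1] * 65
y_pred_demo = [0] * 1 + [1] * 10 + [1] * 65

precision_demo = precision_score(y_true_demo, y_pred_demo)  # 65 / 75
recall_demo = recall_score(y_true_demo, y_pred_demo)        # 65 / 65
accuracy_demo = accuracy_score(y_true_demo, y_pred_demo)    # 66 / 76
auc_demo = roc_auc_score(y_true_demo, y_pred_demo)          # (1/11 + 1) / 2

print(precision_demo, recall_demo, accuracy_demo, auc_demo)
```

The high precision and accuracy therefore come mostly from predicting the majority class, which the low AUC makes visible.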



## n-grams

In [131]:
# Fit the CountVectorizer to the training data, specifying a minimum
# document frequency of 5, removing English stop words and extracting
# 1-grams and 2-grams
vect = CountVectorizer(min_df=5, ngram_range=(1,2), stop_words='english').fit(X_train)

X_train_vectorized = vect.transform(X_train)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
len(vect.get_feature_names_out())

526

In [132]:
model = BernoulliNB()
model.fit(X_train_vectorized, y_train)

predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))

AUC:  0.7419580419580418


In [136]:
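feature_names = np.array(vect.get_feature_names_out())

# Recent scikit-learn releases no longer expose coef_ on BernoulliNB;
# feature_log_prob_[1] holds the same per-feature log-probabilities for class 1
sorted_coef_index = model.feature_log_prob_[1].argsort()

# The slice [:-2:-1] keeps only the single highest-scoring feature
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-2:-1]]))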

Largest Coefs: 
['breakfast']


In [140]:
y_pred[0] # predicted breakfast label for review 0
df.iloc[0]['hometown'] # hometown of the author of review 0


'Italia'

## Predict for 4000 reviews

In [139]:
df = pd.read_json('data/pred/reviews_for_predict.json')

X = df['TextEn'] 
X_vect = vect.transform(X)

y_pred = model.predict(X_vect)
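
New reviews must be encoded with the vocabulary learned during training, which is why the cell above calls `vect.transform` rather than `fit_transform`. One way to make that coupling explicit is a scikit-learn pipeline. The sketch below is a minimal illustration on invented reviews; the data and variable names are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical labelled reviews (1 = breakfast mentioned, 0 = not).
toy_train = [
    "lovely breakfast buffet",
    "breakfast was included",
    "quiet room near the beach",
    "small room, friendly staff",
]
toy_train_labels = [1, 1, 0, 0]

# The pipeline fits the vocabulary and the classifier together, so new
# text is always encoded with the training vocabulary.
clf = make_pipeline(CountVectorizer(), BernoulliNB())
clf.fit(toy_train, toy_train_labels)

new_reviews = ["the breakfast was excellent", "the room was quiet"]
toy_pred = clf.predict(new_reviews)
print(toy_pred)
```

Words that never appeared in training, such as "excellent" here, are silently ignored at prediction time.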

### Result

In [142]:
dataset = pd.DataFrame({'nation' : df['Hometown'], 
                        'Breakfast Y/N':y_pred, 
                        'Rating': df['Rating']})
dataset.reset_index()

Unnamed: 0,index,nation,Breakfast Y/N,Rating
0,11191,Italia,1,4
1,11192,Italia,1,4
2,11193,Italia,1,3
3,11194,Italia,1,1
4,11195,Italia,0,5
...,...,...,...,...
3995,7697,Argentina,1,5
3996,7702,Argentina,0,4
3997,7703,Argentina,1,5
3998,7707,Argentina,1,4


In [None]:
#dataset[dataset['Breakfast Y/N']==1]
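# Group the predictions by traveler nationality
group = dataset.groupby(by='nation')
argentina = group.get_group('Argentina')
# Count the Argentinian reviews that mention breakfast and carry a 5-star rating
a = argentina[(argentina['Breakfast Y/N']==1) & (argentina['Rating']==5)].count()
a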
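
The same count can be obtained for every nationality at once instead of one group at a time. The sketch below shows the idea on a small invented table with the same column names as `dataset`; the values are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of the result table.
toy_result = pd.DataFrame({
    "nation": ["Italia", "Italia", "Argentina", "Argentina", "Argentina"],
    "Breakfast Y/N": [1, 0, 1, 1, 0],
    "Rating": [5, 5, 5, 4, 5],
})

# Keep only five-star reviews that mention breakfast, then count per nation.
mask = (toy_result["Breakfast Y/N"] == 1) & (toy_result["Rating"] == 5)
per_nation = toy_result[mask].groupby("nation").size()
print(per_nation)
```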