# COGS 118A - Final Project

# Predicting Reviewer R8tings of Hotels

## Group 8 members

- Caitlin Connolly
- Carolyn Yatco
- Abigail Koornwinder
- Joshua Widjanarko
- Caike Campana

# Abstract 
Our project attempts to predict a reviewer's score for a hotel. We used the "515K Hotel Reviews Data in Europe" dataset from Kaggle, which contains 515,000 reviews of 1,493 different hotels. Each entry includes the review split into its positive and negative portions, the word count of each portion, the location of the hotel, and the number of days since the review. We used several techniques (Decision Tree, SVM, Perceptron) to predict the score and compared the different approaches to machine learning. In conclusion, we gathered meaningful outcomes from the analysis and evaluation metrics.


# Background

Beyond location, price, and amenities, consumers base their perception of a hotel they might stay at on the experiences, or reviews, of previous guests. For example, according to TripAdvisor, a site for hotel and restaurant reviews and booking, roughly 81% of people always or frequently check a hotel’s reviews before booking [[1]]. However, not all reviews and experiences are weighted equally. Sparks and Browning found that how information is framed within a review, as well as the focus of the review itself, makes the biggest difference in how much the review affects the overall view of a hotel [[2]]. This raises the question of whether a computer can read these reviews and determine the trust (in the form of ratings) placed in a hotel in a way similar to how humans prioritize certain ideas and information and weigh the overall sentiment. Sentiment analysis, the extraction of emotion, feeling, and other subjective states from text, has been applied to a wide range of reviews, from movies on Netflix to restaurants [[3]]. However, most work on reviews ends with extracting the overall positivity or negativity of the review, without asking whether we can predict or gain insight into the overall view or rating of the hotel, movie, or other thing being reviewed.

# Problem Statement

Our goal is to investigate whether an individual's hotel review score can be predicted from their review using supervised machine learning algorithms. We will use a range of models, such as Decision Trees, MLP, and Linear Regression, and see whether we can produce reasonably accurate models.

# Data

We are going to use the dataset "515K Hotel Reviews Data in Europe" from the repository Kaggle, provided by the user Jiashen Liu.

The dataset contains 515,000 reviews of 1,493 hotels throughout Europe and was collected from the hotel booking website Booking.com. It has 17 features; some of the most useful are the average score, hotel name, reviewer nationality, geographical latitude and longitude, days since the review, negative word counts, and positive word counts.

Each observation contains one customer review with all the features. Although the data is clean and has no missing values, we need to process the review text so that we have a quantifiable metric.

# Proposed Solution

Our proposed solution to predicting an individual’s score of a hotel will have multiple parts: Preprocessing, data cleaning, determining the best classifier for predicting scores, and finally, training, testing, and evaluation.

To preprocess and clean our data we have to remove any entries with missing values as well as remove those that correspond to reviewers who did not write out reviews and instead only gave numerical scores. Since we plan on using sentiment analysis to classify the written reviews, we need to remove the entries that did not provide any additional information aside from a score. We then need to remove any columns that correspond to features we do not need. Thus, the variables we plan on using are: the review date, the average hotel score, hotel name, reviewer nationality, negative review, positive review, the number of reviews the reviewer has given in the past, total number of valid reviews that the hotel has, tags the reviewer has given the hotel.

To see if we are able to use machine learning to predict hotel reviews, we will test multiple different models. These models include a linear regression model, a logistic regression model, an SVM, and a decision tree. We will generate predictions with all these models and compare using multiple metrics including mean squared error and accuracy. 

To split our data into train and test sets, we will use 80% for training and 20% for testing, choosing what entry goes into which set randomly. Once we train our model we will test it, run our evaluation metrics, and plot our results using Matplotlib.
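The random 80/20 split described above can be sketched as follows. This is a minimal illustration on synthetic data, not the project's actual split: `X_demo` and `y_demo` are made-up stand-ins for the hotel features and scores, and the cells later in this notebook use slightly different test fractions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the hotel data: 100 rows, 3 numeric features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.uniform(0.0, 10.0, size=100)  # arbitrary score-like targets

# test_size=0.2 sends a random 20% of the rows to the test set;
# random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)
```

Fixing `random_state` keeps the comparison between models fair, because every model then sees the same training and test rows.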


# Evaluation Metrics

For our evaluation metrics, we have chosen precision and recall: precision tells us how many of the reviews we predict to be positive are actually positive, and recall tells us how many of the actually positive reviews we identify. Together they allow us to determine how good our model is at distinguishing positive from negative reviews and therefore to gauge how well it performs. 


To determine the errors we will use a confusion matrix to display the summary of our evaluations and calculate the errors made in our classification model. Through our confusion matrix, we will determine the sensitivity and specificity of our model, and our goal is to have high specificity and high sensitivity. As we tweak our model, we will compare the confusion matrices and re-train our model according to the results. 


A confusion matrix has two columns and two rows, where the columns are the true positive and negative conditions and the rows are the predicted positive and negative conditions. We have provided an example of what our confusion matrix might look like below.

![img1](images/ex_confusion_matrix)

For a confusion matrix we want to reduce the amount of errors our model makes and maximize our precision and recall scores while running our model on our testing data.

The precision score (positive predictive value) is calculated as the number of true positives divided by the total number of predicted positives (true positives and false positives). 

        Precision = true positives / (true positives + false positives)

The recall score (true positive rate, or sensitivity) is calculated as the number of true positives divided by the total number of actual positives (true positives and false negatives). 

        Recall = true positives / (true positives + false negatives)
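As a small worked example of the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN)), the sketch below counts the confusion-matrix cells by hand for a toy binary problem. The labels are invented for illustration, and returning 0.0 when a denominator is zero is just one possible convention.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    # Convention chosen here: report 0.0 when the denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy labels: 1 = positive review, 0 = negative review.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 1, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)  # 3 TP, 1 FP, 1 FN -> 0.75 0.75
```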

The table below, from the Lecture 12 slides (originally taken from https://en.wikipedia.org/wiki/Precision_and_recall) [[4]], contains the mathematical definitions from which precision and recall are derived, as well as other evaluation metrics that may prove useful as we begin to evaluate our model.

![img2](images/wiki_prec_recall_table)

We also used mean squared error to determine how well our models worked. In addition to the above metrics, we evaluated our linear regression model using mean absolute error and R squared, our logistic regression model using the f1 score (which combines precision and recall, as seen in the wiki table above) and accuracy, and, for the Decision Tree, accuracy across all the methods. In our case, the Decision Tree was the best model, with a score of 82% (the R squared of the regression tree). 
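For reference, the metrics named above have the following standard definitions. The symbols are our own notation for this sketch rather than something taken from the lecture slides: $y_i$ is a true score, $\hat{y}_i$ the corresponding prediction, $\bar{y}$ the mean of the true scores, $n$ the number of test examples, and $P$ and $R$ are precision and recall.

$$
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert,
\qquad
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2
$$

$$
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2},
\qquad
F_1 = \frac{2PR}{P + R}
$$

An $R^2$ of 1 means perfect predictions, while an $R^2$ of 0 means the model does no better than always predicting $\bar{y}$.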

# Results


### Linear Regression Model
We decided to implement a basic linear regression model using sklearn to evaluate the performance of a simpler model. 


In [None]:
# Imports 
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
import datetime as dt
from sklearn import metrics

In [None]:
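# Load the cleaned data set
# NOTE: this cell was left empty; the local file name below is an assumption,
# taken from the Hotel_Reviews_clean.csv link at the end of this notebook
hotel = pd.read_csv('Hotel_Reviews_clean.csv')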

In [None]:
# Caitlin's One-Hot Encode of Country
# clean_hotel is an alias of hotel (not a copy), so the new columns appear on both
clean_hotel = hotel
country_cols = ['country_au', 'country_fr', 'country_it',
                'country_nl', 'country_sp', 'country_uk']

# np.unique returns the country names in sorted order; this assumes there are
# exactly six of them and that their sorted order matches country_cols
unique_x = np.unique(clean_hotel["Country"])

# one hot encode: compare every row against every unique country in one step
# (entry [i, j] is 1.0 when row i belongs to country j, else 0.0)
to_encode = (clean_hotel["Country"].to_numpy()[:, None] == unique_x).astype(float)

# set to correct countries
for j, col_name in enumerate(country_cols):
    clean_hotel[col_name] = to_encode[:, j]
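As an aside, pandas offers a built-in one-hot encoder that names the columns after the category values, which avoids relying on the sorted order of the countries. The sketch below is illustrative only: the tiny `demo` frame and the resulting column names (`country_France`, ...) are assumptions and differ from the `country_au`-style names used in this notebook.

```python
import pandas as pd

# Tiny made-up frame standing in for the hotel data.
demo = pd.DataFrame({"Country": ["France", "Italy", "Spain", "France"]})

# One column per distinct country, filled with 1.0 / 0.0.
dummies = pd.get_dummies(demo["Country"], prefix="country", dtype=float)
demo = pd.concat([demo, dummies], axis=1)

print(demo.columns.tolist())
```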

In [None]:
from sklearn.linear_model import LinearRegression 
from sklearn.model_selection import train_test_split

# Drop rows with NaNs (explicit copy so the column assignments below act on an independent frame)
lr_hotel = clean_hotel.dropna().copy()

# Convert date time to ordinal so we can perform linear regression
lr_hotel['Review_Date'] = pd.to_datetime(lr_hotel['Review_Date'])
lr_hotel['Review_Date'] = lr_hotel['Review_Date'].map(dt.datetime.toordinal)

# Apply ordinal encoder to encode nationalities
encoder2 = OrdinalEncoder()
nat_en = encoder2.fit_transform(lr_hotel[['Reviewer_Nationality']])
nat_en = nat_en[:,0]
lr_hotel['Reviewer_Nationality'] = nat_en

y = lr_hotel['Reviewer_Score']  
X = lr_hotel[['Average_Score', 'Review_Total_Negative_Word_Counts', 'Review_Total_Positive_Word_Counts', 'Total_Number_of_Reviews_Reviewer_Has_Given', 'days_since_review', 'lat', 'lng', 'Review_Date', 'Reviewer_Nationality', 'bed_list',
             'View', 'regular','deluxe', 'Single', 'Double', 'Group', 'leisure', 'business', 'country_au', 'country_fr', 'country_it',
       'country_nl', 'country_sp', 'country_uk']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42)
lr = LinearRegression()
lr.fit(X_train, y_train)
predictions = lr.predict(X_test) 

# Plot predicted versus actual
fig, ax = plt.subplots()
ax.scatter(predictions, y_test, edgecolors=(0, 0, 1))  
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=3) 
ax.set_xlabel('Predicted') 
ax.set_ylabel('Actual') 
plt.show()

# Calculate metrics
mae = metrics.mean_absolute_error(y_test, predictions) 
mse = metrics.mean_squared_error(y_test, predictions)
r2 = metrics.r2_score(y_test, predictions)

print("The model performance for testing set")
print("--------------------------------------")
print('MAE is {}'.format(mae))
print('MSE is {}'.format(mse))
print('R2 score is {}'.format(r2))

### Logistic Regression Model

In [None]:
# add column with categories 0-10 that groups reviewer scores
clean_hotel["int_score"] = clean_hotel["Reviewer_Score"].apply(np.floor).astype(str)


from sklearn.model_selection import train_test_split, cross_val_score, RepeatedKFold, GridSearchCV
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
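# plot_confusion_matrix exists only in scikit-learn < 1.2 (newer releases use ConfusionMatrixDisplay.from_estimator)
from sklearn.metrics import plot_confusion_matrix, classification_report, f1_score, confusion_matrix, mean_squared_error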
import matplotlib.pyplot as plt

clean_hotel.dropna(inplace=True) # drop nan values

y = clean_hotel.loc[:, "int_score"]

X = clean_hotel.loc[:, ~clean_hotel.columns.isin(["class", 'Unnamed: 0', 'Hotel_Address', 
                                                  'Additional_Number_of_Scoring',
       'Review_Date', "Hotel_Name", "Reviewer_Nationality",
       'Negative_Review', 'Review_Total_Negative_Word_Counts',
       'Total_Number_of_Reviews', 'Positive_Review',
       'Review_Total_Positive_Word_Counts',
       'Total_Number_of_Reviews_Reviewer_Has_Given', 'Reviewer_Score', 'Tags', "int_score",
        'lat', 'lng', 'Country', 'Formatted_Tags'])]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)
np.random.seed(31415) 

# create logistic regression model via pipeline
scaler = StandardScaler()
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
logistic = LogisticRegression()

pipe = Pipeline(steps=[("scaler", scaler), ("imputer", imp), ("logistic", logistic)])


In [None]:
# L2 penalty gridsearch
param_grid = {
    "logistic__solver" : ['saga', 'sag', 'lbfgs'],
    "logistic__penalty" : ['l2']
}
gs = GridSearchCV(pipe, param_grid, scoring='f1_micro', n_jobs = -1, cv = 7)

gs.fit(X_train, y_train)

plot_confusion_matrix(gs.best_estimator_, X_test, y_test)
plt.xticks(rotation = -45);

In [None]:
# L2 classification report
classification_report(y_test, gs.predict(X_test), zero_division=1, output_dict=True)

In [None]:
# plot gridsearch 
def plot_results(gridsearchcv):
    params = gridsearchcv.cv_results_["params"]
    ys = gridsearchcv.cv_results_["mean_test_score"]
    xs = ['|'.join(str(v) for v in param.values()) for param in params]
    yerr = gridsearchcv.cv_results_["std_test_score"]
    plt.errorbar(xs, ys, yerr / np.sqrt(gridsearchcv.cv), fmt='.k')
    plt.ylabel("f1")
    plt.xlabel("params")


In [None]:
# plot L2
plot_results(gs)

In [None]:
# L1 penalty
param_grid = {
    "logistic__solver" : ['saga', 'liblinear'],
    "logistic__penalty" : ['l1']
}

gs = GridSearchCV(pipe, param_grid, scoring='f1_micro', n_jobs = -1, cv = 7)

gs.fit(X_train, y_train)

plot_confusion_matrix(gs.best_estimator_, X_test, y_test)
plt.xticks(rotation = -45);

In [None]:
# L1 classification report
classification_report(y_test, gs.predict(X_test), zero_division=1, output_dict=True)

In [None]:
# Multi-class precision 
def multi_class_precision_macro(y_true, y_pred):
    """multi_class_precision_macro
    This function computes precision for multiclass problems
    
    How does it work?
    
    First, figure out the unique labels in your prediction problem
    Second, compute precision for each class in a one-vs-rest manner
    Third, take the (unweighted) average of all these precision scores
    
    This is inappropriate for imbalanced class settings. In those cases you would want to use a weighted average.
    """
    y_true = y_true.tolist()
    total_y = len(y_true)
    y_pred = y_pred.tolist()
    
    unique_labels = list(set(y_true))
    y_pred1 = list(set(y_pred))
    for item in y_pred1:
        unique_labels.append(item)
    unique_labels = list(set(unique_labels))

    precision = []
    
    for label in unique_labels:
        y_true_temp = [y == label for y in y_true]
        y_pred_temp = [y == label for y in y_pred]

        tn, fp, fn, tp = confusion_matrix(y_true_temp, y_pred_temp).ravel()
        
        if (tp+fp) == 0:
            precision.append(1)
        else:
            precision.append(tp/(tp+fp))
        
    return np.mean(precision)
    
multi_class_precision_macro(y_test, gs.predict(X_test))
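For comparison, scikit-learn's `precision_score` with `average="macro"` computes the same kind of unweighted one-vs-rest average. With `zero_division=1` it should match the hand-written function above, since both assign a precision of 1 to a class that is never predicted. The labels below are invented for illustration.

```python
from sklearn.metrics import precision_score

# Made-up string labels in the same style as int_score ('7.0', '8.0', ...).
y_true_demo = ["7.0", "8.0", "9.0", "9.0", "10.0", "8.0"]
y_pred_demo = ["8.0", "8.0", "9.0", "10.0", "10.0", "9.0"]

# Per-class precision: '8.0', '9.0' and '10.0' each get 1 of 2 predictions
# right (0.5); '7.0' is never predicted, so zero_division=1 scores it as 1.
macro_precision = precision_score(
    y_true_demo, y_pred_demo, average="macro", zero_division=1
)
print(macro_precision)  # (0.5 + 0.5 + 0.5 + 1.0) / 4 = 0.625
```

Passing `average="weighted"` instead weights each class by its support, which is the adjustment the docstring above suggests for imbalanced classes.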

In [None]:
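# Mean squared error
# int_score labels are strings such as '9.0', so cast them to float first
mean_squared_error(y_test.astype(float), gs.predict(X_test).astype(float))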

### Support Vector Machines
Another model we can use to try to predict reviewer scores for hotels is using support vector machines. In order to do this, we must split the data into training and testing and find the optimal hyperparameters. Furthermore, to use SVM to predict one score out of many others, we must use a one vs rest approach.

In [None]:
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

#creating X and y datasets
X = clean_hotel[[
 'Average_Score',
 'Review_Total_Negative_Word_Counts',
 'Review_Total_Positive_Word_Counts',
 'days_since_review',
 'View',
 'Single',
 'Double',
 'Group',
 # one-hot country columns created in the encoding cell above, each listed once
 'country_au', 'country_fr', 'country_it',
 'country_nl', 'country_sp', 'country_uk']]


# 1-D Series of labels, which is the shape scikit-learn expects for y
y = clean_hotel['int_score']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .25, random_state = 42)

Building SVM models with different hyperparameters:
This section constructs linear, radial basis function, sigmoid, and polynomial SVMs and searches for accuracies using different C values

In [None]:
def svm_best(C, svm_type):
    """Fit a one-vs-rest SVM for each value in C and report the most accurate one on the test set."""
    best_accuracy = 0
    best_c = None
    for c in C:
        model = SVC(kernel = svm_type, C=c, random_state = 0)
        o_vs_r2 = OneVsRestClassifier(model).fit(X_train, y_train)
        curr_accuracy = o_vs_r2.score(X_test,y_test)
        print('C label', c,': ', curr_accuracy)
        # keep the C value (not the unfitted model object) of the best run
        if best_accuracy < curr_accuracy:
            best_accuracy = curr_accuracy
            best_c = c
    print('best C is: ', best_c, 'with accuracy', best_accuracy)
    return best_c, best_accuracy

In [None]:
#tests a linear SVM model with different C values and prints the best accuracy score
lin_c = [1,5,10]
lin_type = 'linear'

svm_best(lin_c,lin_type)

In [None]:
#tests a radial basis function SVM model with different C values and prints the best accuracy score

rbf_c = [1,10,20,25,50]
rbf_type = 'rbf'

svm_best(rbf_c, rbf_type)

In [None]:
#create a sigmoid svm model and test on different C values
sig_c = [1E-4, 1E-2, 1,5,10,100, 1000] 
sig_type = 'sigmoid'

svm_best(sig_c, sig_type)

In [None]:
#create a polynomial SVM model and test different degrees
#GRIDSEARCH FOR POLY
D = [3,5,7,8,9,10,11] 

best_accuracy = 0
best_d = None
for d in D:
    # same loop as svm_best, but tracking the best-performing degree instead of C
    model = SVC(kernel ='poly', degree = d)
    o_vs_r2 = OneVsRestClassifier(model).fit(X_train, y_train)
    curr_accuracy = o_vs_r2.score(X_test,y_test)
    print('degree', d,': ', curr_accuracy)
    if best_accuracy < curr_accuracy:
        best_accuracy = curr_accuracy
        best_d = d
print('best degree is: ', best_d)
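The loops above pick hyperparameters by their accuracy on the test set, which lets the test data influence model choice. A common alternative is to select them by cross-validation on the training data and touch the test set only once at the end. The sketch below shows that pattern with `GridSearchCV`. It runs on synthetic data from `make_classification` (a stand-in for the hotel features), and the grid values are illustrative rather than the ones that produced our results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic 3-class problem standing in for the reviewer-score classes.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# Parameters of the wrapped SVC are addressed through the "estimator__" prefix.
param_grid = {"estimator__C": [1, 5, 10], "estimator__kernel": ["linear", "rbf"]}
search = GridSearchCV(
    OneVsRestClassifier(SVC(random_state=0)), param_grid, scoring="accuracy", cv=3
)
search.fit(X_tr, y_tr)  # model selection uses the training folds only

print("best params:", search.best_params_)
print("cross-validated accuracy:", search.best_score_)
print("held-out accuracy:", search.score(X_te, y_te))
```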

Finally, we will print out classification reports and mean squared error for the best models in order to get a better look at their performance.

In [None]:
from sklearn.metrics import classification_report

# int_score labels are strings such as '9.0', so they are cast to float before computing the MSE
svcl = SVC(kernel = 'linear', C=1,random_state = 0) 
ovr_l = OneVsRestClassifier(svcl)
ovr_l = ovr_l.fit(X_train, y_train)
print('linear best acc is: ', ovr_l.score(X_test,y_test))
y_pred_l = ovr_l.predict(X_test)
print('linear mse is: ',metrics.mean_squared_error(y_test.astype(float), y_pred_l.astype(float)))
print(classification_report(y_test, y_pred_l))

svcr = SVC(kernel = 'rbf', C=25,random_state = 0) 
ovr_rbf = OneVsRestClassifier(svcr)
ovr_rbf = ovr_rbf.fit(X_train, y_train)
print('rbf best acc is: ', ovr_rbf.score(X_test,y_test))
y_pred_rbf = ovr_rbf.predict(X_test)
print('rbf mse is: ',metrics.mean_squared_error(y_test.astype(float), y_pred_rbf.astype(float)))
print(classification_report(y_test, y_pred_rbf))


svcp = SVC(kernel = 'poly', degree = 9) 
ovr_p= OneVsRestClassifier(svcp)
ovr_p = ovr_p.fit(X_train, y_train)
print('poly best acc is: ', ovr_p.score(X_test,y_test))
y_pred_p = ovr_p.predict(X_test)
print('poly mse is: ',metrics.mean_squared_error(y_test.astype(float), y_pred_p.astype(float)))
print(classification_report(y_test, y_pred_p))

svcs = SVC(kernel = 'sigmoid', C = .01) 
ovr_s= OneVsRestClassifier(svcs)
ovr_s = ovr_s.fit(X_train, y_train)
print('sigmoid best acc is: ', ovr_s.score(X_test,y_test))
y_pred_s = ovr_s.predict(X_test)
print('sigmoid mse is: ',metrics.mean_squared_error(y_test.astype(float), y_pred_s.astype(float)))
print(classification_report(y_test, y_pred_s))

### Decision Tree


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score, RepeatedKFold, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler,LabelEncoder,MultiLabelBinarizer
from sklearn.pipeline import Pipeline
from sklearn import tree
from sklearn.model_selection import GridSearchCV,StratifiedKFold
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay,accuracy_score
from sklearn import set_config

from google.colab import drive
drive.mount('/content/drive')

#### Decision Tree Regression

In [None]:
set_config(display="diagram")
params = [
    {'tree__criterion':['squared_error', 'friedman_mse', 'poisson'],'tree__splitter':['best', 'random'], 'tree__max_depth':[i for i in range(1,30)]},
    {'tree__criterion':['squared_error', 'friedman_mse', 'poisson'],'tree__splitter':['best', 'random']}
]

pipe = Pipeline(steps=[('scaler',StandardScaler()), ('tree',tree.DecisionTreeRegressor())])
search = GridSearchCV(pipe, params,verbose=0)
# X_train / y_train must already hold the regression split (predictors -> Average_Score);
# they are not defined in this cell
search.fit(X_train,y_train)
print(search.best_params_)


In [None]:
sc = StandardScaler()
scy = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
y_train_scaled = scy.fit_transform(np.asarray(y_train).reshape(-1, 1))
y_test_scaled = scy.transform(np.asarray(y_test).reshape(-1, 1))
X_test_scaled = sc.transform(X_test)
dt = tree.DecisionTreeRegressor(criterion='squared_error',splitter='best').fit(X_train_scaled,y_train_scaled)
# score() of a regressor returns R^2 (coefficient of determination), not a classification accuracy
print(f'The R^2 score of the best decision tree regression is {dt.score(X_test_scaled,y_test_scaled)*100:.2f} %')
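To make explicit what the number printed above measures: for scikit-learn regressors, `score` returns the coefficient of determination R squared, the same quantity as `r2_score`. The sketch below checks this on synthetic data (the features and targets are made up for the demonstration).

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: the target depends on the first feature plus noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 0.5 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)

# First 150 rows for training, last 50 for testing.
reg = DecisionTreeRegressor(max_depth=3, random_state=0)
reg.fit(X_demo[:150], y_demo[:150])

score = reg.score(X_demo[150:], y_demo[150:])
r2 = r2_score(y_demo[150:], reg.predict(X_demo[150:]))
print(score, r2)  # the two values coincide
```

Because R squared is computed relative to the variance of the target, it is not directly comparable with the accuracy values reported for the classifiers.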

In [None]:
y_pred = scy.inverse_transform(dt.predict(X_test_scaled).reshape(-1, 1))
plt.figure(figsize=(8,8))
plt.scatter(x=y_pred,y=np.asarray(X_test)[:,0],label='Predicted Average Scores')
plt.scatter(x=y_test,y=np.asarray(X_test)[:,0],alpha=1,label='Actual Average Scores')
plt.xlabel("Predicted Average Score")
plt.ylabel('Total Negative Word Counts')
plt.legend()
plt.title('Predicted average score vs total negative words counts')

#### Decision Tree Classifier

In [None]:
X = hotel[['Country','Reviewer_Nationality']]
y = np.floor(hotel['Average_Score'])

encoder = LabelEncoder()
X = X.apply(encoder.fit_transform)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

In [None]:
params = [{'criterion':['gini', 'entropy'],'splitter':['best', 'random']}]

clf = tree.DecisionTreeClassifier()

search = GridSearchCV(estimator=clf,param_grid=params,verbose=0)
search.fit(X_train,y_train)
search.best_estimator_

### Multilayer Perceptron

In [None]:
#Data Cleaning Specific to MLP and importing
from sklearn.neural_network import MLPClassifier 

#firstly, we need to get rid of all of the null values
df1=clean_hotel.dropna()
df1=df1.drop(columns=["Country","Hotel_Name","Reviewer_Nationality","Formatted_Tags",'Hotel_Address',"Additional_Number_of_Scoring", 'Review_Date','Negative_Review','Positive_Review',"Tags","days_since_review","lat","lng"])
#int_score (if present) is derived from Reviewer_Score, so it must not be used as a feature
df1=df1.drop(columns=["int_score"], errors="ignore")
#next, some of the features have a wide range of possible values, which may give them a larger effect on the MLP, so we rescale them
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import MinMaxScaler
#the Min Max scaler seems appropriate because many features are one-hot encoded; other scalers could distort those features
y=df1["Reviewer_Score"]
X=df1.drop(columns=["Reviewer_Score"]) 
#note: min-max scaling after RobustScaler gives the same result as min-max scaling alone, since both are per-feature affine maps
scaler = RobustScaler() 
scaler1=MinMaxScaler(feature_range=(0,1))
X=scaler.fit_transform(X)
X=scaler1.fit_transform(X)

In [None]:
import tensorflow as tf 
from tensorflow.keras import initializers
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense 
from sklearn.model_selection import train_test_split
from tensorflow.keras import layers, models
from tensorflow.keras import datasets
from tensorflow.keras.utils import to_categorical 
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, test_size = 0.2) 

X_train= tf.convert_to_tensor(X_train.astype('float32')) 
X_test=tf.convert_to_tensor(X_test.astype('float32')) 
y_train=tf.convert_to_tensor(y_train.astype('float32')) 

In [None]:
n_features = X_train.shape[1]

inputs= layers.Input(shape=(n_features,)) 

x= layers.Dense(
  units=1,
  kernel_initializer=initializers.RandomNormal(mean=0.0, stddev=0.01),
  bias_initializer=initializers.Zeros()
  )  
# Gaussian distribution with stddev .01 and bias set to zero for the layer 
# NOTE: this layer is never called on a tensor, so it is not part of the model below

# three hidden layers (12, 10, 10 nodes), each fed by the output of the previous one
x=layers.Dense(12)(inputs) 
x=layers.Activation('relu')(x)  

x=layers.Dense(10)(x) 
x=layers.Activation('relu')(x) 

x=layers.Dense(10)(x) 
x=layers.Activation('relu')(x)

x=layers.Dense(1)(x) 
outputs = layers.Activation('relu')(x) 

model = models.Model(inputs=inputs, outputs=outputs)
model.summary()
# the target is a continuous score, so train with a regression loss and track mean absolute error
model.compile(loss='mean_squared_error', optimizer='sgd',metrics=['mae'])
model.fit(X_train,y_train,epochs=100, batch_size=32, validation_split=0.1, verbose=1) 

In [None]:
a=X_test
prediction= model.predict(a) 
from sklearn.metrics import mean_squared_error 

mse=mean_squared_error(y_test, prediction)
print(mse)

In [None]:
# fraction of predictions that fall within 0.5 points of the true score
# (prediction has shape (n, 1), so flatten it before subtracting the 1-D targets)
b = prediction.ravel() - np.asarray(y_test, dtype='float32')
m_results = np.mean(np.abs(b) <= .5)
print(m_results)

# Discussion

### Interpreting the result

__Linear Regression Analysis:__

We visualized the performance of our linear regression model using an actual versus predicted graph. Ideally, all the points would lie close to the red dashed diagonal line, where the predicted score equals the actual score. However, the points in our graph are quite dispersed and lie far away from the line. This is likely due to the high variance of our data. Though the graph shows that this is probably not the best model for our data, the mean squared error of 1.823 is better than that of the SVMs we tried. Thus, although this simple model performed quite decently, we decided to perform further tuning and model selection to achieve better predictions. 

__Logistic Regression Analysis:__

The goal of creating and running a grid search on a logistic regression model is to fit a model which can identify the reviewer score class based on a combination of factors in the hotel data. To determine the “best” model we used the f1-micro score for both l1 and l2 regularization penalties. Our pipeline used a standard scaler and a logistic regression model to predict the reviewer score category. We used a grid search over the solvers to find the best model and visualized the predictions via a confusion matrix for both l1 and l2 penalties. We also printed out the classification reports to see the precision, recall, and f1 scores for all categories. 

We found that regardless of using an L1 or L2 regularization penalty, the precision, recall and f1 scores for each category were comparatively similar, and the accuracy and weighted average for both penalties were the same. Accuracy was low, which can be attributed to either the fact that the data is extremely unbalanced when it comes to reviewer scores or that the combination of factors we chose to train our model on were not good or pertinent when predicting the reviewer score. We determined that with these results a logistic regression model would not be a good model to use when classifying reviewer scores.

Using a multi-class precision function with a one-vs-rest strategy, we attempted to determine the overall precision of our best model using the l2 penalty grid search and found that the precision was around 67%. The mean squared error for this model was 3.798, which is remarkably low considering our accuracy was 0.382. Since the accuracy is still so low for our model, we decided to go in a different direction.
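Accuracy and mean squared error can tell different stories when the classes are ordered scores: accuracy only counts exact matches, while MSE also rewards predictions that land close to the true score. The toy example below (invented labels, not our results) shows a low accuracy alongside a small MSE because every miss is off by just one point.

```python
# Made-up integer scores: four of the five predictions miss by exactly one point.
y_true = [9, 9, 8, 10, 7]
y_pred = [8, 10, 8, 9, 8]

n = len(y_true)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n

print(accuracy, mse)  # 1 exact match out of 5; four squared errors of 1
```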


__Support Vector Machine Analysis:__

The goal of an SVM is to find a decision boundary with margins that separates the data between classes. Because we have more than two classes, we use a one vs. rest approach, which compares one rating score against all of the other scores. We created the X data from the hotel average score, country, positive and negative word counts, days since review, views, and number of people. The y data is the reviewer scores rounded down to integers. There are many types of kernels that can be used in an SVM; for our project we assessed four of them: linear, radial basis function, sigmoid, and polynomial. For each kernel, we used a function to perform a simple manual search for the C value (or, for the polynomial kernel, the degree) with the highest accuracy, as given by model.score. After finding the best value for each kernel type, we produced a classification report and calculated the mean squared error, which we then used to compare performance across the kernels. 

Our analysis showed that the best model was the 9th degree polynomial SVM. However, given the accuracy scores of each model, we conclude that SVMs are not a great model for predicting the reviewer’s rating of a hotel. This could be because an SVM is natively a binary classifier, and even though we used a one vs. rest approach, our problem required 10 classification labels that are very close to each other and harder to distinguish than in simple binary datasets. For this reason, the SVM would not be the best model for our problem statement.

__Decision Tree Analysis:__

We approached the problem in two ways. The first uses a decision tree for regression: we regress the hotel's average score on the independent variables total negative words in reviews, total number of reviews, total positive words in reviews, and reviewer score. We employ GridSearch over the hyperparameters criterion, tree depth, and splitter to find the best model. With this approach, we get a score of 82.10% (the R squared reported by the regressor's `score` method, rather than a classification accuracy). 

The second uses a decision tree classifier. Our input variables are the hotel country and the reviewer's nationality, and we classify the average score rounded down to an integer. We use GridSearch over the hyperparameters criterion and splitter, and get an accuracy of 66.24%. 

__Multilayer Perceptron Analysis:__
Multiple MLPs were tested with different activation functions (sigmoid vs. relu), numbers of layers, numbers of nodes in each layer, and batch sizes, which led us to choose an MLP with three hidden layers (with 12, 10, and 10 nodes respectively). 

With the MLP, we saw fairly solid results, with the MSE of various tests ranging from 1.3 to 1.6. In general, the MLP was a solid model that can be used to predict (with some success) the reviewer's score. Beyond what we can see from the accuracy and loss functions, it is important to note that the predictions as a whole tended to be skewed higher than the true scores. This may be due to the natural imbalance in the data, with a more significant portion of the data being above a rating of 5 than below it. In addition, an MLP works better when the features are on similar scales. The scaling may have compressed some of the variation in the features and possibly prevented the MLP from becoming even more accurate.


### Limitations

One of the limitations of our data is that it is skewed towards higher review scores, and many of our models' predictions reflected that. On many review sites, from Yelp to TripAdvisor, even the worst experiences often receive 2-3 out of 5 stars, and a full score is much more common than a score of 1. Another limitation is that many of the features are not on a similar scale (some are, or need to be, one-hot encoded while others are in the 100s). This either skews our models heavily in favor of certain features over others or, once rescaled, limits the range of the data (which may have other unintended effects). In addition, some features such as Reviewer Nationality could not be used effectively because they have too many categories; with the data spread so thinly across them, we could not fully utilize this information.  
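One possible way to make a high-cardinality feature such as Reviewer Nationality usable, which we did not try in this project, is to lump rare categories into a single "Other" bucket before one-hot encoding. The sketch below uses a made-up series and a hypothetical threshold `min_count`; a real threshold would need tuning.

```python
import pandas as pd

# Made-up nationalities: two frequent values and two that appear only once.
nat = pd.Series(["UK", "UK", "UK", "USA", "USA", "Fiji", "Malta", "UK"])

min_count = 2  # hypothetical cutoff: categories seen fewer times become "Other"
counts = nat.value_counts()
frequent = counts[counts >= min_count].index

# Keep frequent categories as they are and replace the rest with "Other".
lumped = nat.where(nat.isin(frequent), "Other")

# One-hot encode the reduced set of categories.
encoded = pd.get_dummies(lumped, prefix="nat", dtype=float)
print(encoded.columns.tolist())
```

This trades some detail about rare nationalities for far fewer, better-populated columns.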

### Ethics & Privacy

Before beginning this project, our team has acknowledged potential ethics and privacy concerns that may arise from our data and implementation. Primarily, it is important to note that the dataset we plan on using is legally obtained as it is data from Booking.com that has been made publicly available. We discovered this specific collection of data from Kaggle.com, posted by Jiashen Liu who curated the dataset and made it available to the public domain to copy and modify as we intend to do. It is also important to acknowledge the potential biases in the dataset. Most notably, each data entry includes the nationality of the user which is one of the many various features provided by the dataset that we can use to predict the user’s score of the hotel. We take this into consideration as it can result in a potential bias in our predictions. Our team has also taken into consideration the privacy of the responders and has assured that the privacy of all reviewers is upheld as there are no identifying personal features other than the reviewer’s nationality. However, though there may be personal information in specific written reviews, all information was freely given by the individual. Thus, the anonymity of individuals is maintained. One way we can address these ethical and privacy concerns is through using an ethics checklist that addresses important ethical considerations in data collection, modeling, and analysis. A useful tool we can use to add such a checklist to our project is the command line tool, Deon.

### Conclusion

In conclusion, we have shown that it is possible, with some degree of success, to predict a person’s review score of a hotel based on data about the review itself. This is important because it can be applied by companies, not just hotels, to improve customer experience and drive advertising. Possible further research includes using the same review data to predict the overall rating of a hotel, not just an individual’s rating of the hotel. 

# Footnotes

1.^: TripAdvisor. (2019, July 16). Online reviews remain a trusted source of information when booking trips, reveals New Research. Retrieved April 24, 2022, from https://www.prnewswire.com/news-releases/online-reviews-remain-a-trusted-source-of-information-when-booking-trips-reveals-new-research-300885097.html
2.^: Sparks, B.A., & Browning, V. (2011). The impact of online reviews on hotel booking intentions and perception of trust. Tourism Management, 32, 1310-1323.
3.^: An example of one of these uses and how they approached sentiment analysis: https://towardsdatascience.com/customer-reviews-analysis-using-nlp-the-netflix-use-case-92b3645770e1
4.^: Wikimedia Foundation. (2022, March 31). Precision and recall. Wikipedia. Retrieved April 24, 2022, from https://en.wikipedia.org/wiki/Precision_and_recall


In [None]:
# Cleaned data set: https://www.dropbox.com/s/eyw03t7an6bsqek/Hotel_Reviews_clean.csv?dl=0