# NLP Project Anaïs Malet : Predict TripAdvisor Reviews Rating 

### Content

This dataset consists of about 20k reviews crawled from Tripadvisor.

### Goal

The goal of the project is to predict how many stars a hotel gets based on client reviews.

### Credits

Tripadvisor Hotel Review Dataset file, from the publication:

Alam, M. H., Ryu, W.-J., Lee, S., 2016. Joint multi-grain topic sentiment: modeling semantic aspects for online reviews. Information Sciences 339, 206–223.

## Notebook 2 : Cleaning and preprocessing the data

In [1]:
# Import libraries
import pandas as pd
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
import re
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from xgboost import XGBClassifier
import tensorflow.compat.v1 as tf
from imblearn.under_sampling import TomekLinks
from imblearn.over_sampling import SMOTE
from collections import Counter
from matplotlib import pyplot
pd.options.mode.chained_assignment = None



## Load and read data


In [2]:
df = pd.read_csv("tripadvisor_hotel_reviews.csv")
print(f"Number of rows : {df.shape[0]}\nNumber of columns : {df.shape[1]}")
df

Number of rows : 20491
Number of columns : 2


| | Review | Rating |
|---|---|---|
| 0 | nice hotel expensive parking got good deal sta... | 4 |
| 1 | ok nothing special charge diamond member hilto... | 2 |
| 2 | nice rooms not 4* experience hotel monaco seat... | 3 |
| 3 | unique, great stay, wonderful time hotel monac... | 5 |
| 4 | great stay great stay, went seahawk game aweso... | 5 |
| ... | ... | ... |
| 20486 | best kept secret 3rd time staying charm, not 5... | 5 |
| 20487 | great location price view hotel great quick pl... | 4 |
| 20488 | ok just looks nice modern outside, desk staff ... | 2 |
| 20489 | hotel theft ruined vacation hotel opened sept ... | 1 |


## ***Train/test split***

### Encoding labels

We want to encode the class labels and then create a train/test split.

In [3]:
# instantiate a label encoder # https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
label_encoder = LabelEncoder()

# fit and transform the encoder on labels
df['Review_enc'] = label_encoder.fit_transform(df['Rating'])
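To make the encoding step concrete, here is a minimal sketch of what `LabelEncoder` does to star ratings. The `ratings` array is a toy stand-in for `df['Rating']` (an assumption for illustration, not the real data): the sorted original labels are mapped to consecutive integers starting at 0, so a 1-star review becomes class 0 and a 5-star review becomes class 4.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy ratings standing in for df['Rating'] (illustrative values only)
ratings = np.array([4, 2, 3, 5, 5, 1])

encoder = LabelEncoder()
encoded = encoder.fit_transform(ratings)

# classes_ holds the sorted original labels; the label at position i is encoded as i
print(encoder.classes_)
print(encoded)

# inverse_transform maps encoded values (e.g. model predictions) back to star ratings
decoded = encoder.inverse_transform(encoded)
print(decoded)
```

Keeping the fitted encoder around is useful later on, since predictions come back as encoded classes and need `inverse_transform` to be read as stars.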

### Split arrays or matrices into random train and test subsets.

In [4]:
# Define class labels
class_labels = ["1 star","2 stars","3 stars","4 stars","5 stars"]
category_orders = {"Review": class_labels}

# Separate the review text (features) from the encoded ratings (target)
X = df['Review']
y = df['Review_enc']

In [5]:
# Print the first few rows of the text data to inspect the format
print(df['Review'].head())

# Check the data types in 'Review' column
print(df['Review'].dtype)

0    nice hotel expensive parking got good deal sta...
1    ok nothing special charge diamond member hilto...
2    nice rooms not 4* experience hotel monaco seat...
3    unique, great stay, wonderful time hotel monac...
4    great stay great stay, went seahawk game aweso...
Name: Review, dtype: object
object


## ***Model Class Definition***

We create a `Model` class in order to test different vectorizers and model architectures. 
This class defines a pipeline that takes raw reviews as input, preprocesses and vectorizes them, and then fits a classification model on the result.
The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters.

In [5]:
from Classes import Model
from Preprocessing_functions import preprocess_text
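The `Classes` module is not shown in this notebook, so the following is only a hypothetical sketch of what such a wrapper could look like, inferred from how `Model` is called below (`Model(X, y, model, vectorizer, random_seed, test_size)`, then `.fit()` and `.report(class_labels)`). The name `ModelSketch`, the step names and the toy data are all assumptions; the real class may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline


class ModelSketch:
    """Hypothetical stand-in for Classes.Model; the real implementation may differ."""

    def __init__(self, X, y, model, vectorizer, random_seed=42, test_size=0.2):
        # hold out a test set once, with a fixed seed for reproducibility
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
            X, y, test_size=test_size, random_state=random_seed
        )
        # chain vectorizer and classifier so they are always fitted together
        self.pipeline = Pipeline([("vectorizer", vectorizer), ("model", model)])

    def fit(self):
        self.pipeline.fit(self.X_train, self.y_train)
        return self

    def report(self, class_labels):
        y_pred = self.pipeline.predict(self.X_test)
        # assumes labels are encoded as 0..n-1, as LabelEncoder produces
        text = classification_report(
            self.y_test,
            y_pred,
            labels=list(range(len(class_labels))),
            target_names=class_labels,
            zero_division=0,
        )
        print(text)
        return text


# toy usage with made-up reviews and binary labels
texts = ["great stay", "awful room", "great staff", "awful noise"] * 5
labels = [1, 0, 1, 0] * 5
sketch = ModelSketch(texts, labels, LogisticRegression(), TfidfVectorizer())
sketch.fit()
report_text = sketch.report(["negative", "positive"])
```

Wrapping the split, the pipeline and the report in one object keeps every experiment below to three lines and guarantees that all models are evaluated on the same held-out reviews.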

### Choice for random_seed

By setting a random seed (random_seed=42), we'll guarantee that random operations in our code (like randomly splitting the data into training and testing sets) will give the same results each time they run, provided that the rest of the code is the same. This is useful for reproducibility of experiments.

### Choice for test-size

The test_size is used to specify the proportion of the data that will be allocated to the test set when splitting the data. 
We will set test_size=0.2, which means that 20% of the data will be used as the test set and the rest (80%) will be the training set.
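As a quick check of both choices, the sketch below splits a stand-in array of 20491 row indices (the dataset size printed above) twice with the same seed. Only the indices are used, so this says nothing about the reviews themselves; it simply shows the resulting set sizes and that the split is repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 20491 reviews: only the row indices matter here
indices = np.arange(20491)

train_a, test_a = train_test_split(indices, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(indices, test_size=0.2, random_state=42)

# scikit-learn rounds the test size up, so 20% of 20491 gives 4099 test rows
print(len(train_a), len(test_a))

# same seed, same data, same split
print(np.array_equal(test_a, test_b))
```

The 4099 test rows match the `support` total shown in the classification reports further down.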

### Choices for Model Architecture

Because of the class imbalance observed earlier when exploring the data, we want to give additional weight to the less represented classes during training. We will therefore first favor models that support class weighting: Logistic Regression and SVM.

We will then look at other models.
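To give an idea of what `class_weight='balanced'` does, the sketch below computes the balanced weights with scikit-learn's `compute_class_weight` and compares them with the formula `n_samples / (n_classes * count_per_class)`. The class counts are taken from the `support` column of the test-set classification reports in this notebook and are used purely as an illustration; the weights applied during training are computed on the training labels.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative counts per encoded class (0 = 1 star ... 4 = 5 stars),
# copied from the test-set supports reported in this notebook
counts = {0: 292, 1: 333, 2: 432, 3: 1252, 4: 1790}

# rebuild a label vector with those counts
y_toy = np.repeat(list(counts.keys()), list(counts.values()))
classes = np.array(sorted(counts))

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# the same weights written out by hand
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))

for label, weight in zip(classes, weights):
    print(f"class {label}: weight {weight:.2f}")
```

Rare classes get a weight above 1 and frequent ones a weight below 1, so a misclassified 1-star review costs more during training than a misclassified 5-star review.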

### Options for Vectorizer

We are going to test the commonly used NLP solutions for text vectorization below :

- CountVectorizer and TfidfVectorizer (scikit-learn):
CountVectorizer counts the number of occurrences of each word in the document.
TfidfVectorizer (Term Frequency-Inverse Document Frequency) takes into account the frequency of the term in a document and its rarity in the entire corpus.
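The difference between the two vectorizers can be seen on a tiny made-up corpus (the three sentences below are assumptions for illustration, not rows of the dataset). The word "hotel" appears in every document, while "nice" appears twice in the first document only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Tiny illustrative corpus
corpus = [
    "nice hotel nice staff",
    "hotel expensive parking",
    "hotel dirty room",
]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(corpus)

tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(corpus)

# CountVectorizer: raw occurrences in the first document
nice_count = counts[0, count_vec.vocabulary_["nice"]]
hotel_count = counts[0, count_vec.vocabulary_["hotel"]]
print("counts:", nice_count, hotel_count)

# TfidfVectorizer: "hotel" is down-weighted because it occurs in every document
nice_weight = tfidf[0, tfidf_vec.vocabulary_["nice"]]
hotel_weight = tfidf[0, tfidf_vec.vocabulary_["hotel"]]
print("tf-idf:", round(nice_weight, 3), round(hotel_weight, 3))
```

With raw counts, frequent but uninformative words such as "hotel" carry as much weight as any other word; TF-IDF reduces their influence relative to rarer terms.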

### Model with LogisticRegression and TF-IDF

To begin, we use a Logistic Regression model, first with its default settings and then with class_weight='balanced', together with a TF-IDF vectorizer.

In [7]:
# instantiate the Model class with text and labels (X and y), a logistic regression model and a tfidf vectorizer
model_clf = Model(X, y, LogisticRegression(), TfidfVectorizer(preprocessor=preprocess_text), random_seed=42, test_size=0.2)

# fit the model
model_clf.fit()

# predict and generate classification report
model_clf.report(class_labels)


lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression



              precision    recall  f1-score   support

      1 star       0.78      0.61      0.68       292
     2 stars       0.49      0.39      0.44       333
     3 stars       0.48      0.25      0.33       432
     4 stars       0.55      0.51      0.53      1252
     5 stars       0.69      0.86      0.77      1790

    accuracy                           0.63      4099
   macro avg       0.60      0.52      0.55      4099
weighted avg       0.62      0.63      0.62      4099
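The convergence warning above means that the default `lbfgs` solver hit its iteration limit before converging. One option the message itself suggests is to raise `max_iter`. The sketch below shows this on synthetic data; the dataset and the value 1000 are arbitrary assumptions, not settings tuned for the reviews.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class data, used only to illustrate the parameter
X_toy, y_toy = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=42
)

# allow more iterations than the default of 100
clf = LogisticRegression(max_iter=1000)
clf.fit(X_toy, y_toy)

# n_iter_ reports how many iterations the solver actually used
print(clf.n_iter_)
```

If `n_iter_` stays below `max_iter`, the solver stopped because it converged rather than because it ran out of iterations.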



In [7]:
# instantiate the Model class with text and labels (X and y), a logistic regression model with balanced class weights and a tfidf vectorizer
model_clf2 = Model(X, y, LogisticRegression(class_weight='balanced', solver='liblinear'), TfidfVectorizer(preprocessor=preprocess_text), random_seed=42, test_size=0.2)

# fit the model
model_clf2.fit()

# predict and generate classification report
model_clf2.report(class_labels)

              precision    recall  f1-score   support

      1 star       0.68      0.68      0.68       292
     2 stars       0.42      0.49      0.45       333
     3 stars       0.38      0.35      0.36       432
     4 stars       0.58      0.47      0.52      1252
     5 stars       0.73      0.83      0.78      1790

    accuracy                           0.63      4099
   macro avg       0.56      0.56      0.56      4099
weighted avg       0.62      0.63      0.62      4099



The results are not bad! The confusion matrix shows a fairly clear diagonal, and it is no surprise that many predictions fall in the 5-star class, given the class distribution seen in the histogram earlier.
Nevertheless, in proportion, the model makes:

- few mistakes when choosing between the 1-star and 2-star classes or between the 2-star and 3-star classes,
- a moderate number of mistakes when choosing between the 3-star and 4-star classes,
- many mistakes when choosing between the 4-star and 5-star classes.
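Reading mistakes "in proportion" is easier with a row-normalized confusion matrix, where each row shows how the reviews of one true class are distributed over the predicted classes. The labels below are toy values (assumptions for illustration, not the notebook's predictions).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy encoded labels (0 = 1 star ... 4 = 5 stars); not the real predictions
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4])
y_pred = np.array([0, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 4, 3])

# normalize="true" divides each row by the number of true samples in that class
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2, 3, 4], normalize="true")

# entry [i, j] is the share of class-i reviews predicted as class j
print(np.round(cm, 2))
```

In this toy example, two thirds of the true 4-star reviews are predicted as 5 stars, which is the kind of neighbouring-class confusion discussed above.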

## Model with LogisticRegression and CountVectorizer

Let's keep the Logistic Regression model with class_weight='balanced', but this time with CountVectorizer.

In [8]:
# instantiate the Model class with text and labels (X and y), a logistic regression model and a count vectorizer
model_clf2 = Model(X, y, LogisticRegression(class_weight='balanced', solver='liblinear'), CountVectorizer(preprocessor=preprocess_text), random_seed=42, test_size=0.2)

# fit the model
model_clf2.fit()

# predict and generate classification report
model_clf2.report(class_labels)

              precision    recall  f1-score   support

      1 star       0.68      0.61      0.64       292
     2 stars       0.37      0.36      0.37       333
     3 stars       0.32      0.32      0.32       432
     4 stars       0.50      0.43      0.46      1252
     5 stars       0.70      0.78      0.74      1790

    accuracy                           0.58      4099
   macro avg       0.51      0.50      0.51      4099
weighted avg       0.57      0.58      0.57      4099



It looks like CountVectorizer does not give better results, and it also looks worse at differentiating the 4-star and 5-star classes.

## Model with SVM and CountVectorizer

In [9]:
# instantiate the Model class with text and labels (X and y), a linear SVM model and a count vectorizer
from sklearn.svm import SVC

model_clf = Model(X, y, SVC(class_weight='balanced', kernel = 'linear', random_state = 0), CountVectorizer(preprocessor=preprocess_text), random_seed=42, test_size=0.2)

# fit the model
model_clf.fit()

# predict and generate classification report
model_clf.report(class_labels)

              precision    recall  f1-score   support

      1 star       0.63      0.64      0.64       292
     2 stars       0.39      0.46      0.42       333
     3 stars       0.31      0.36      0.33       432
     4 stars       0.48      0.43      0.45      1252
     5 stars       0.72      0.72      0.72      1790

    accuracy                           0.56      4099
   macro avg       0.50      0.52      0.51      4099
weighted avg       0.57      0.56      0.57      4099



## Model with SVM and TF-IDF

In [10]:
# instantiate the Model class with text and labels (X and y), a linear SVM model and a tfidf vectorizer
from sklearn.svm import SVC

model_clf = Model(X, y, SVC(class_weight='balanced', kernel = 'linear', random_state = 0), TfidfVectorizer(preprocessor=preprocess_text), random_seed=42, test_size=0.2)

# fit the model
model_clf.fit()

# predict and generate classification report
model_clf.report(class_labels)

              precision    recall  f1-score   support

      1 star       0.68      0.65      0.66       292
     2 stars       0.39      0.49      0.44       333
     3 stars       0.35      0.43      0.38       432
     4 stars       0.55      0.52      0.54      1252
     5 stars       0.79      0.74      0.76      1790

    accuracy                           0.61      4099
   macro avg       0.55      0.57      0.56      4099
weighted avg       0.63      0.61      0.62      4099



The **SVM** classifier took much longer to train than **Logistic Regression** and was computationally demanding. Its accuracy is not bad (0.56 with CountVectorizer and 0.61 with TF-IDF), but it stays below the 0.63 obtained with Logistic Regression. The last example does, however, seem to make fewer mistakes when differentiating the 4-star and 5-star classes.

My research leads me to think that SVMs are known for their effectiveness in high-dimensional spaces and their ability to handle non-linear decision boundaries (with non-linear kernels), but this flexibility comes at a cost in terms of computational complexity. Here, Logistic Regression provides satisfactory performance at a lower cost.
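If training time is the main concern, one possible alternative (a swapped-in technique, not what was run above) is `LinearSVC`, which the scikit-learn documentation describes as scaling better to large numbers of samples than `SVC(kernel='linear')`. It uses a different solver and a one-vs-rest scheme, so its scores are not guaranteed to match those of `SVC`. A minimal sketch on made-up reviews:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up reviews and binary labels, for illustration only
texts = [
    "great clean room",
    "lovely friendly staff",
    "dirty noisy room",
    "awful rude staff",
] * 3
labels = [1, 1, 0, 0] * 3

# same structure as the pipelines above: vectorizer followed by a classifier
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", LinearSVC(class_weight="balanced", random_state=0)),
])
pipe.fit(texts, labels)

pred = pipe.predict(["great clean", "dirty awful"])
print(pred)
```

Whether this trade-off is worthwhile for the hotel reviews would have to be checked by timing and scoring both variants on the same split.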

## Choose a model

To be more efficient, we will loop over every combination of the vectorizers and classifiers we find interesting, in order to choose the pipeline with the best accuracy score.

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [9]:
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# candidate classifiers; random_state is only passed to estimators that accept it
# (KNeighborsClassifier and MultinomialNB have no random_state parameter)
models = {
    'SVC': {'model': SVC(random_state=42)},
    'KNN': {'model': KNeighborsClassifier()},
    'XGBoost': {'model': XGBClassifier(random_state=42)},
    'LogisticRegression': {'model': LogisticRegression(random_state=42)},
    'RandomForest': {'model': RandomForestClassifier(random_state=42)},
    'NaiveBayes': {'model': MultinomialNB()},
    'GradientBoosting': {'model': GradientBoostingClassifier(random_state=42)},
    'DecisionTree': {'model': DecisionTreeClassifier(random_state=42)},
    # same default settings as 'KNN' as written; pass weights='distance' for distance weighting
    'KNN_Weighted': {'model': KNeighborsClassifier()},
    'SGDClassifier': {'model': SGDClassifier(random_state=42)},
    'MLPClassifier': {'model': MLPClassifier(random_state=42)},
    'AdaBoost': {'model': AdaBoostClassifier(random_state=42)}
}

vectorizers = {
    'BoW': {'vectorizer': CountVectorizer()},
    'TF-IDF': {'vectorizer': TfidfVectorizer()},
}

In [11]:
best_model = None
best_score = 0
best_vectorizer = None
best_pipeline = None

for model_name, model_data in models.items():
    for vectorizer_name, vectorizer_data in vectorizers.items():
        vectorizer = vectorizer_data['vectorizer']
        model = model_data['model']

        # chain the vectorizer and the classifier so they are fitted together
        pipeline = Pipeline([
            ('Vectorize', vectorizer),
            ('Model', model)
        ])

        print('Vectorizer: ', vectorizer_name)
        print('Model: ', model_name)

        pipeline.fit(X_train, y_train)

        # compute the test accuracy once and reuse it
        score = accuracy_score(y_test, pipeline.predict(X_test))
        print(f'accuracy_score: {score}\n')

        # keep track of the best combination seen so far
        if score > best_score:
            best_score = score
            best_model = model_name
            best_vectorizer = vectorizer_name
            best_pipeline = pipeline

print(f'Best model: {best_model}')
print(f'Best vectorizer: {best_vectorizer}')

# the vectorizer and model objects are shared between pipelines,
# so refit the best pipeline before evaluating it
best_pipeline.fit(X_train, y_train)

y_pred = best_pipeline.predict(X_test)
print(f'accuracy_score: {accuracy_score(y_test, y_pred)}\n')
print(f'Classification Report:\n{classification_report(y_test, y_pred)}')
confusion_matrix_f = confusion_matrix(y_test, y_pred)
# styling the confusion matrix
fig = px.imshow(
    confusion_matrix_f, 
    text_auto=True, 
    title="Confusion Matrix", width=1000, height=800,
    labels=dict(x="Predicted", y="True Label"),
    x=class_labels,
    y=class_labels,
    color_continuous_scale='Blues'  
    )
fig.show()

Vectorizer:  BoW
Model:  SVC
accuracy_score: 0.6262503049524274

Vectorizer:  TF-IDF
Model:  SVC
accuracy_score: 0.6408880214686509

Vectorizer:  BoW
Model:  KNN
accuracy_score: 0.4527933642351793

Vectorizer:  TF-IDF
Model:  KNN
accuracy_score: 0.49939009514515736

Vectorizer:  BoW
Model:  XGBoost
accuracy_score: 0.6028299585264699

Vectorizer:  TF-IDF
Model:  XGBoost
accuracy_score: 0.5999024152232252

Vectorizer:  BoW
Model:  LogisticRegression



lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression



accuracy_score: 0.5989265674554769

Vectorizer:  TF-IDF
Model:  LogisticRegression



lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression



accuracy_score: 0.6352768968040986

Vectorizer:  BoW
Model:  RandomForest
accuracy_score: 0.5113442303000731

Vectorizer:  TF-IDF
Model:  RandomForest
accuracy_score: 0.5262259087582337

Vectorizer:  BoW
Model:  NaiveBayes
accuracy_score: 0.5930714808489875

Vectorizer:  TF-IDF
Model:  NaiveBayes
accuracy_score: 0.44083922908026346

Vectorizer:  BoW
Model:  GradientBoosting
accuracy_score: 0.5808733837521347

Vectorizer:  TF-IDF
Model:  GradientBoosting
accuracy_score: 0.5838009270553793

Vectorizer:  BoW
Model:  DecisionTree
accuracy_score: 0.4632837277384728

Vectorizer:  TF-IDF
Model:  DecisionTree
accuracy_score: 0.45669675530617226

Vectorizer:  BoW
Model:  SVM_Poly
accuracy_score: 0.6262503049524274

Vectorizer:  TF-IDF
Model:  SVM_Poly
accuracy_score: 0.6408880214686509

Vectorizer:  BoW
Model:  KNN_Weighted
accuracy_score: 0.4527933642351793

Vectorizer:  TF-IDF
Model:  KNN_Weighted
accuracy_score: 0.49939009514515736

Vectorizer:  BoW
Model:  SGDClassifier
accuracy_score: 0.56

After testing each combination of the two vectorizers with the classifiers above, the best pipeline returned is the SVC model with the TF-IDF vectorizer.

- Vectorizer: TF-IDF
- Model: SVC
- accuracy_score: 0.6408880214686509

We could try SVC with a linear kernel and with a polynomial kernel, to cover both linear and non-linear problems. Our problem appears to be mostly linear, so we will stick to kernel='linear'.

Some of the models are known to handle imbalanced datasets well. In the next notebook, we will see how to implement class weights and search for the best pipeline model!