# Sentiment Analysis on Short-Term Rental Reviews

## Part 1: Loading Data into Python Dataframe

In [1]:
# To start product development, the first step is getting the data into a dataframe.
# This will allow me to pre-process the data with pandas' built-in functions [1].
# In order to do this, pandas must be imported into the notebook.
import pandas as pd

In [2]:
# next, I will read in the CSV into a pandas dataframe
file = "Hotel_Reviews.csv"
original_data = pd.read_csv(file)

In [3]:
#show some of the data to start to understand it better. 
original_data.head(3)

Unnamed: 0,Hotel_Address,Additional_Number_of_Scoring,Review_Date,Average_Score,Hotel_Name,Reviewer_Nationality,Negative_Review,Review_Total_Negative_Word_Counts,Total_Number_of_Reviews,Positive_Review,Review_Total_Positive_Word_Counts,Total_Number_of_Reviews_Reviewer_Has_Given,Reviewer_Score,Tags,days_since_review,lat,lng
0,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,8/3/2017,7.7,Hotel Arena,Russia,I am so angry that i made this post available...,397,1403,Only the park outside of the hotel was beauti...,11,7,2.9,"[' Leisure trip ', ' Couple ', ' Duplex Double...",0 days,52.360576,4.915968
1,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,8/3/2017,7.7,Hotel Arena,Ireland,No Negative,0,1403,No real complaints the hotel was great great ...,105,7,7.5,"[' Leisure trip ', ' Couple ', ' Duplex Double...",0 days,52.360576,4.915968
2,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,7/31/2017,7.7,Hotel Arena,Australia,Rooms are nice but for elderly a bit difficul...,42,1403,Location was good and staff were ok It is cut...,21,9,7.1,"[' Leisure trip ', ' Family with young childre...",3 days,52.360576,4.915968


In [4]:
# for sentiment analysis, we will want to combine the data into a new dataframe that will be used for pre-processing.
# the fields that are relevant are the positive review, negative review, and the reviewer score
# reviewer score can be used for labelling our sentiment as per the review 

original_data["Total_Review"] = original_data["Negative_Review"] + " " + original_data["Positive_Review"]

columns = ["Total_Review", "Reviewer_Score"]

review_data = original_data[columns]

review_data.head(3)

Unnamed: 0,Total_Review,Reviewer_Score
0,I am so angry that i made this post available...,2.9
1,No Negative No real complaints the hotel was ...,7.5
2,Rooms are nice but for elderly a bit difficul...,7.1
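
One thing the output above suggests: the dataset uses the placeholder `No Negative` when a reviewer left the negative field empty, and that placeholder is concatenated into `Total_Review`. After stopword removal this can leave the token "negative" in an otherwise positive review. Below is a minimal sketch of how the two halves could be joined while skipping placeholders. The function name `combine_review` is hypothetical, and the `No Positive` placeholder is an assumption by analogy with `No Negative`; it should be checked against the data before use.

```python
# Placeholder strings: "no negative" is visible in the sample output,
# "no positive" is an assumption that should be verified against the CSV.
PLACEHOLDERS = {"no negative", "no positive"}


def combine_review(negative: str, positive: str) -> str:
    """Join the negative and positive halves of a review, skipping placeholder text."""
    parts = []
    for text in (negative, positive):
        cleaned = text.strip()
        if cleaned and cleaned.lower() not in PLACEHOLDERS:
            parts.append(cleaned)
    return " ".join(parts)


print(combine_review("No Negative", " No real complaints the hotel was great "))
print(combine_review("Rooms are small", "No Positive"))
```

If this behaviour is wanted, the function could be applied row-wise, for example with `original_data.apply(lambda row: combine_review(row["Negative_Review"], row["Positive_Review"]), axis=1)`.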


## Part 3: Optimizing the Pre-Processing into one Function

This section develops the final preprocessing pipelines that will be used in my ML development. There are two pipelines: one that includes POS tagging and one that does not.

In [5]:
#import the relevant libraries and download the relevant nltk dependencies [3]
import nltk
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

#download the needed nltk toolkits
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')

# optimized all-encompassing function for preprocessing that can easily be manipulated for tests
def preprocess_pos(review):
    
    #preprocess remove special characters and ensure lowercase 
    def preprocess_remove_special(review):
        review_handled = re.sub(r'[^a-zA-Z\s]', '', review)
        review_lower = review_handled.lower()
        return review_lower

    #tokenize the words
    def preprocess_tokenize(review):
        bag_of_words = word_tokenize(review)
        return bag_of_words

    #remove the stopwords
    def preprocess_stopwords(review):
        words = stopwords.words('english')
        set_stop_words = set(words)
        removed_stopwords_words = []
        for word in review:
            if word not in set_stop_words:
                removed_stopwords_words.append(word)
        return removed_stopwords_words

    #lemmatize
    def preprocess_lemmatize(review):
        lemmatizer = WordNetLemmatizer()
        lemmatized_review = [lemmatizer.lemmatize(word) for word in review ]
        return lemmatized_review

    #POS Tagging
    def preprocess_tagger(review):
        tagged_review = nltk.pos_tag(review)
        return tagged_review


    #execute the individual part functions within this bigger function
    #work on a copy of the dataframe passed in as `review` rather than the global review_data
    review_data_copy = review.copy()
    review_data_copy['Total_Review_handled'] = review_data_copy["Total_Review"].apply(preprocess_remove_special)
    review_data_copy['Total_Review_handled'] = review_data_copy['Total_Review_handled'].apply(preprocess_tokenize)
    review_data_copy['Total_Review_handled'] = review_data_copy['Total_Review_handled'].apply(preprocess_stopwords)
    review_data_copy['Total_Review_handled'] = review_data_copy['Total_Review_handled'].apply(preprocess_lemmatize)
    review_data_copy['POS_Tag'] = review_data_copy['Total_Review_handled'].apply(preprocess_tagger)

    #create the final data
    review_data_final = review_data_copy[['POS_Tag', 'Reviewer_Score']]
    
    #return the data
    return review_data_final

[nltk_data] Downloading package punkt to /Users/coreyreid/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/coreyreid/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/coreyreid/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/coreyreid/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/coreyreid/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [6]:
# Now, I will create a similar function without POS tagging so that the two techniques
# (with and without POS tagging) can be compared

def preprocess_no_pos(review):
    
    # remove special characters and ensure it is lowercase
    def preprocess_remove_special(review):
        review_handled = re.sub(r'[^a-zA-Z\s]', '', review)
        review_lower = review_handled.lower()
        return review_lower

    #tokenize
    def preprocess_tokenize(review):
        bag_of_words = word_tokenize(review)
        return bag_of_words

    #remove stopwords
    def preprocess_stopwords(review):
        words = stopwords.words('english')
        set_stop_words = set(words)
        removed_stopwords_words = []
        for word in review:
            if word not in set_stop_words:
                removed_stopwords_words.append(word)
        return removed_stopwords_words

    #lemmatize
    def preprocess_lemmatize(review):
        lemmatizer = WordNetLemmatizer()
        lemmatized_review = [lemmatizer.lemmatize(word) for word in review ]
        return lemmatized_review

    #execute functions within the bigger function
    #work on a copy of the dataframe passed in as `review` rather than the global review_data
    review_data_copy = review.copy()
    review_data_copy['Total_Review_handled'] = review_data_copy["Total_Review"].apply(preprocess_remove_special)
    review_data_copy['Total_Review_handled'] = review_data_copy['Total_Review_handled'].apply(preprocess_tokenize)
    review_data_copy['Total_Review_handled'] = review_data_copy['Total_Review_handled'].apply(preprocess_stopwords)
    review_data_copy['Total_Review_handled'] = review_data_copy['Total_Review_handled'].apply(preprocess_lemmatize)

    #finalize data
    review_data_final = review_data_copy[['Total_Review_handled', 'Reviewer_Score']]

    #return the preprocessed data
    return review_data_final

In [7]:
# next, I will process the data with POS tagging to compare to the previous section
# the output should be the same, as this is just an optimized function
preprocessed_reviews_pos_tagging = preprocess_pos(review_data)
preprocessed_reviews_pos_tagging

Unnamed: 0,POS_Tag,Reviewer_Score
0,"[(angry, JJ), (made, VBD), (post, NN), (availa...",2.9
1,"[(negative, JJ), (real, JJ), (complaint, NN), ...",7.5
2,"[(room, NN), (nice, RB), (elderly, JJ), (bit, ...",7.1
3,"[(room, NN), (dirty, NN), (afraid, JJ), (walk,...",3.8
4,"[(booked, VBN), (company, NN), (line, NN), (sh...",6.7
...,...,...
515733,"[(trolly, RB), (staff, NN), (help, NN), (take,...",7.0
515734,"[(hotel, NN), (look, NN), (like, IN), (surely,...",5.8
515735,"[(ac, JJ), (useless, JJ), (hot, JJ), (week, NN...",2.5
515736,"[(negative, JJ), (room, NN), (enormous, JJ), (...",8.8


In [8]:
#last, I will process the review data without POS tagging; the processing will end at lemmatizing.
preprocessed_reviews_no_pos_tagging = preprocess_no_pos(review_data)
preprocessed_reviews_no_pos_tagging

Unnamed: 0,Total_Review_handled,Reviewer_Score
0,"[angry, made, post, available, via, possible, ...",2.9
1,"[negative, real, complaint, hotel, great, grea...",7.5
2,"[room, nice, elderly, bit, difficult, room, tw...",7.1
3,"[room, dirty, afraid, walk, barefoot, floor, l...",3.8
4,"[booked, company, line, showed, picture, room,...",6.7
...,...,...
515733,"[trolly, staff, help, take, luggage, room, loc...",7.0
515734,"[hotel, look, like, surely, breakfast, ok, got...",5.8
515735,"[ac, useless, hot, week, vienna, gave, hot, ai...",2.5
515736,"[negative, room, enormous, really, comfortable...",8.8


These two dataframes represent the two preprocessing methods I will consider in each of the models: one with POS tags and one without. This gives two cases for exploring the impact of POS tagging on machine learning outcomes.

Both dataframes show the data as expected, so we can now proceed to implementing different machine learning models. We will then compare the results and determine which machine learning model is best for sentiment analysis of the Short-Term Rental review data.
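
The two pipeline functions above share every step except the final tagging. As a minimal sketch of how that duplication could be avoided, the version below takes the tagger as an optional argument. The name `preprocess_review`, the tiny `DEMO_STOPWORDS` set, the whitespace tokenizer, and the `fake_tagger` are all stand-ins chosen so the sketch runs without NLTK downloads; they are assumptions, not the notebook's actual resources.

```python
import re
from typing import Callable, Iterable, List, Optional, Tuple, Union

Token = Union[str, Tuple[str, str]]

# Stand-in stopword list (assumption); the notebook uses stopwords.words('english').
DEMO_STOPWORDS = {"the", "was", "and", "a", "is", "i", "that", "this"}


def clean(text: str) -> str:
    """Drop everything except letters and whitespace, then lowercase."""
    return re.sub(r"[^a-zA-Z\s]", "", text).lower()


def preprocess_review(
    text: str,
    tokenize: Callable[[str], List[str]] = str.split,
    stop_words: Iterable[str] = DEMO_STOPWORDS,
    lemmatize: Callable[[str], str] = lambda word: word,
    tagger: Optional[Callable[[List[str]], List[Tuple[str, str]]]] = None,
) -> List[Token]:
    """Clean, tokenize, remove stopwords, lemmatize, and optionally POS-tag one review."""
    stop_set = set(stop_words)
    tokens = [lemmatize(word) for word in tokenize(clean(text)) if word not in stop_set]
    return tagger(tokens) if tagger is not None else tokens


def fake_tagger(tokens: List[str]) -> List[Tuple[str, str]]:
    """Hypothetical tagger that labels every token as a noun, for demonstration only."""
    return [(token, "NN") for token in tokens]


print(preprocess_review("The room was DIRTY!!"))
print(preprocess_review("The room was DIRTY!!", tagger=fake_tagger))
```

In the notebook, `word_tokenize`, `stopwords.words('english')`, `WordNetLemmatizer().lemmatize`, and `nltk.pos_tag` could be passed in for the corresponding arguments, and the function applied to the `Total_Review` column once per scenario.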

## Part 4: Classifying the Review Score Data for Machine Learning Labels

In [9]:
# first I will define the thresholds - our labels will be positive, negative, and neutral 
# I decided to use a small window for neutral, between 4.5 and 5.5 (exclusive) - this will ensure most of the review
# classifications are sensitive (positive or negative) - it can be updated later by changing the following threshold values

positive_threshold = 5.5
negative_threshold = 4.5

# next, I will classify by building a classification function

def classify_scores(score_value):
    if score_value >= positive_threshold:
        return 'positive'
    elif score_value <= negative_threshold:
        return 'negative'
    else:
        return 'neutral'
    
preprocessed_reviews_pos_tagging_final = preprocessed_reviews_pos_tagging.copy()
    
preprocessed_reviews_pos_tagging_final['Reviewer_Score'] = preprocessed_reviews_pos_tagging_final['Reviewer_Score'].apply(classify_scores)

preprocessed_reviews_pos_tagging_final

Unnamed: 0,POS_Tag,Reviewer_Score
0,"[(angry, JJ), (made, VBD), (post, NN), (availa...",negative
1,"[(negative, JJ), (real, JJ), (complaint, NN), ...",positive
2,"[(room, NN), (nice, RB), (elderly, JJ), (bit, ...",positive
3,"[(room, NN), (dirty, NN), (afraid, JJ), (walk,...",negative
4,"[(booked, VBN), (company, NN), (line, NN), (sh...",positive
...,...,...
515733,"[(trolly, RB), (staff, NN), (help, NN), (take,...",positive
515734,"[(hotel, NN), (look, NN), (like, IN), (surely,...",positive
515735,"[(ac, JJ), (useless, JJ), (hot, JJ), (week, NN...",negative
515736,"[(negative, JJ), (room, NN), (enormous, JJ), (...",positive
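
Because the thresholds may be tuned later, it can help to pass them in as arguments and to check the boundary behaviour explicitly. The sketch below mirrors the logic of `classify_scores`; the function name `classify_score` and the parameter names `negative_max` and `positive_min` are hypothetical.

```python
def classify_score(score: float, negative_max: float = 4.5, positive_min: float = 5.5) -> str:
    """Map a reviewer score to a sentiment label; both thresholds are inclusive."""
    if negative_max >= positive_min:
        raise ValueError("negative_max must be smaller than positive_min")
    if score >= positive_min:
        return "positive"
    if score <= negative_max:
        return "negative"
    return "neutral"


# Scores on the boundaries and from the first rows of the dataframe
for value in (2.9, 4.5, 5.0, 5.5, 7.5):
    print(value, classify_score(value))
```

With the default thresholds, only scores strictly between 4.5 and 5.5 are labelled neutral.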


In [10]:
# I will also do the same for the no POS tagging scenario

preprocessed_reviews_no_pos_tagging_final = preprocessed_reviews_no_pos_tagging.copy()
    
preprocessed_reviews_no_pos_tagging_final['Reviewer_Score'] = preprocessed_reviews_no_pos_tagging_final['Reviewer_Score'].apply(classify_scores)

preprocessed_reviews_no_pos_tagging_final

Unnamed: 0,Total_Review_handled,Reviewer_Score
0,"[angry, made, post, available, via, possible, ...",negative
1,"[negative, real, complaint, hotel, great, grea...",positive
2,"[room, nice, elderly, bit, difficult, room, tw...",positive
3,"[room, dirty, afraid, walk, barefoot, floor, l...",negative
4,"[booked, company, line, showed, picture, room,...",positive
...,...,...
515733,"[trolly, staff, help, take, luggage, room, loc...",positive
515734,"[hotel, look, like, surely, breakfast, ok, got...",positive
515735,"[ac, useless, hot, week, vienna, gave, hot, ai...",negative
515736,"[negative, room, enormous, really, comfortable...",positive


In [11]:
# determine the counts of each category
review_counts = preprocessed_reviews_no_pos_tagging_final['Reviewer_Score'].value_counts()
print(review_counts)

positive    475509
neutral      24188
negative     16041
Name: Reviewer_Score, dtype: int64


The data has now been preprocessed and categorized. As the counts show, the dataset is heavily imbalanced at these thresholds: about 92% of the 515,738 reviews are positive, with roughly 5% neutral and 3% negative. This may be problematic for some of the machine learning models we will be exploring. There are techniques for handling imbalanced datasets, so we may need to employ some of them to get better classification.
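
One common technique for imbalanced data is class weighting. As a rough sketch, the snippet below computes each class's share and a "balanced" weight from the counts printed above, using the formula `n_samples / (n_classes * class_count)`, which is the heuristic scikit-learn applies for `class_weight='balanced'`. Whether weighting actually helps on this dataset is not tested here.

```python
# Label counts copied from the value_counts() output above
counts = {"positive": 475509, "neutral": 24188, "negative": 16041}

total = sum(counts.values())
n_classes = len(counts)

# Fraction of the dataset in each class
shares = {label: n / total for label, n in counts.items()}

# Balanced weights: rarer classes get proportionally larger weights
weights = {label: total / (n_classes * n) for label, n in counts.items()}

for label in counts:
    print(f"{label:>8}: share={shares[label]:.3f}  weight={weights[label]:.2f}")
```

The weights could then be mapped to the integer class ids and supplied to a model during training, so that errors on negative and neutral reviews cost more than errors on positive ones.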

## Part 7: Exploring Recurrent Neural Networks (RNN) Outcomes for Sentiment Analysis

### Part 7.1: Libraries and Installations

In this section we will expand to more complex deep learning models to try and get a better performing machine learning model for our sentiment analysis tool. 

In [12]:
# confirm the dataframes are still in order
# preprocessed_reviews_pos_tagging_final
# preprocessed_reviews_no_pos_tagging_final
reviews_pos_tagging_RNN = preprocessed_reviews_pos_tagging_final.copy()
reviews_no_pos_tagging_RNN = preprocessed_reviews_no_pos_tagging_final.copy()

In [13]:
# for this type of model I will use TensorFlow and Keras
# install the deep learning framework TensorFlow and the Keras library [14][15]
!pip install tensorflow
!pip install keras



In [14]:
# In order to use a Recurrent Neural Network (RNN), we need to do a bit more processing on the data to format it as the model input.
# This includes separating the words from the POS tags, padding the sequences to a consistent length,
# creating embeddings for the data, and then splitting into test and train datasets for the model.
# First I will import all the libraries needed for this model - I will use a different tokenizer than previously
# to show capabilities with different technologies
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.preprocessing import LabelEncoder
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, GRU, Dense, SimpleRNN, Dropout #[16][17]
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

### Part 7.2: Prepare Data for RNN usage

First, we will prepare the POS tagged data to be used in further development tasks below. 

In [15]:
#with an RNN the scores can not be text - they need to be numbers - will encode these now
class LabelEncoderSentiment:
    def __init__(self):
        self.classes_ = ['negative', 'neutral', 'positive']

    def fit_transform(self, labels):
        return np.array([self.classes_.index(label) for label in labels])
    
    def inverse_transform_sentiment(self, encoded_labels):
        return [self.classes_[label] for label in encoded_labels]

# encode the sentiment labels as integers (0 = negative, 1 = neutral, 2 = positive)
label_encoder_RNN = LabelEncoderSentiment()
encoded_scores_RNN = label_encoder_RNN.fit_transform(reviews_pos_tagging_RNN["Reviewer_Score"])

# get the POS-tagged reviews
review_RNN = reviews_pos_tagging_RNN["POS_Tag"]

# prepare the text for RNN usage
# note: only the word from each (word, tag) pair is kept here, so the POS tags themselves are not passed to the model
texts_RNN = [" ".join([rev for rev, _ in seq]) for seq in review_RNN]
tokenizer_RNN = Tokenizer()
tokenizer_RNN.fit_on_texts(texts_RNN)
sequences_RNN = tokenizer_RNN.texts_to_sequences(texts_RNN)
# pad/truncate every review to 10 tokens (Keras defaults pad and truncate at the start of the sequence)
review_prepped_RNN = pad_sequences(sequences_RNN, maxlen=10)

# prepare the test and train datasets
X_train_RNN, X_test_RNN, y_train_RNN, y_test_RNN = train_test_split(review_prepped_RNN, encoded_scores_RNN, test_size=0.2, random_state=42)
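
To make the effect of `maxlen=10` concrete, here is a small pure-Python sketch that mimics the default behaviour of `pad_sequences` (`padding='pre'`, `truncating='pre'`). The helper `pad_pre` and the token ids are made up for illustration; this is not the Keras implementation itself.

```python
from typing import List


def pad_pre(sequence: List[int], maxlen: int, value: int = 0) -> List[int]:
    """Keep the last `maxlen` ids and left-pad with `value`, like the pad_sequences defaults."""
    truncated = sequence[-maxlen:]
    return [value] * (maxlen - len(truncated)) + truncated


short_review = [4, 8, 15]            # 3 hypothetical token ids
long_review = list(range(1, 15))     # 14 hypothetical token ids

print(pad_pre(short_review, maxlen=10))  # zeros are added at the front
print(pad_pre(long_review, maxlen=10))   # the first 4 ids are dropped
```

Since `Total_Review` places the negative text before the positive text, keeping only the last 10 tokens may favour the positive part of longer reviews. A larger `maxlen`, or `truncating='post'`, would be worth testing if the minority classes remain hard to detect.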

Next, we will prepare the baseline data for the no POS tag scenario. 

In [16]:
#with an RNN the scores can not be text - they need to be numbers - will encode these now
class LabelEncoderSentiment_noPOS:
    def __init__(self):
        self.classes_ = ['negative', 'neutral', 'positive']

    def fit_transform_noPOS(self, labels):
        return np.array([self.classes_.index(label) for label in labels])
    
    def inverse_transform_sentiment_noPOS(self, encoded_labels):
        return [self.classes_[label] for label in encoded_labels]

label_encoder_RNN_noPOS = LabelEncoderSentiment_noPOS()
encoded_scores_RNN_noPOS = label_encoder_RNN_noPOS.fit_transform_noPOS(reviews_no_pos_tagging_RNN["Reviewer_Score"])

# get the lemmatized tokens (no POS tags in this scenario)
review_RNN_noPOS = reviews_no_pos_tagging_RNN["Total_Review_handled"]

# prepare the text for RNN usage
texts_RNN_noPOS = [" ".join(seq) for seq in review_RNN_noPOS]
tokenizer_RNN_noPOS = Tokenizer()
tokenizer_RNN_noPOS.fit_on_texts(texts_RNN_noPOS)
sequences_RNN_noPOS = tokenizer_RNN_noPOS.texts_to_sequences(texts_RNN_noPOS)
review_prepped_RNN_noPOS = pad_sequences(sequences_RNN_noPOS, maxlen=10)

# prepare the test and train datasets
X_train_RNN_noPOS, X_test_RNN_noPOS, y_train_RNN_noPOS, y_test_RNN_noPOS = train_test_split(review_prepped_RNN_noPOS, encoded_scores_RNN_noPOS, test_size=0.2, random_state=42)
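
The two encoder classes above do the same job with different method names. A minimal sketch of a single shared mapping is shown below; the names `CLASSES`, `LABEL_TO_ID`, `encode_labels`, and `decode_labels` are hypothetical and only illustrate the idea.

```python
from typing import Iterable, List

# Fixed class order, matching the encoder classes in the notebook
CLASSES = ["negative", "neutral", "positive"]
LABEL_TO_ID = {label: index for index, label in enumerate(CLASSES)}


def encode_labels(labels: Iterable[str]) -> List[int]:
    """Turn sentiment strings into integer class ids."""
    return [LABEL_TO_ID[label] for label in labels]


def decode_labels(ids: Iterable[int]) -> List[str]:
    """Turn integer class ids back into sentiment strings."""
    return [CLASSES[index] for index in ids]


encoded = encode_labels(["negative", "positive", "positive", "neutral"])
print(encoded)
print(decode_labels(encoded))
```

A dictionary lookup also avoids the repeated `list.index` search that the class-based version performs for every label.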

### Part 7.3: Build the Baseline Model

POS Tagged Data and Simple Model for Baseline evaluation. 

In [17]:
# # first, I will create a baseline model for the RNN test scenario with POS tags included.
# # it will just be a simple model to begin with
# model_simpleRNN = Sequential()
# model_simpleRNN.add(Embedding(input_dim=len(tokenizer_RNN.word_index) + 1, output_dim=128, input_length=10))
# model_simpleRNN.add(SimpleRNN(units=50))
# model_simpleRNN.add(Dense(units=3, activation='softmax'))

# #compile the model
# model_simpleRNN.compile(
#     loss='sparse_categorical_crossentropy', 
#     optimizer='adam', 
#     metrics=['accuracy'],    
#     weighted_metrics=['accuracy']
# )

# #fit the model
# model_simpleRNN.fit(X_train_RNN, y_train_RNN, epochs=10, batch_size=64, validation_split=0.2)

# pred_simpleRNN = model_simpleRNN.predict(X_test_RNN)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [18]:
# #classification metrics didn't work directly because the model returns a probability for each class
# #to fix this, we will convert the probabilities to the predicted class using numpy argmax [19]
# pred_simpleRNN_converted = np.argmax(pred_simpleRNN, axis=1)

# #evaluate results from the RNN
# baseline_simpleRNN_report_POS = classification_report(y_test_RNN, pred_simpleRNN_converted)
# baseline_simpleRNN_confusion_matrix_POS = confusion_matrix(y_test_RNN, pred_simpleRNN_converted)
# baseline_simpleRNN_accuracy_POS = accuracy_score(y_test_RNN, pred_simpleRNN_converted)

# #print the results
# print("Classification Report:\n", baseline_simpleRNN_report_POS)
# print("Confusion Matrix:\n", baseline_simpleRNN_confusion_matrix_POS)
# print("Accuracy:", baseline_simpleRNN_accuracy_POS)

Classification Report:
               precision    recall  f1-score   support

           0       0.37      0.27      0.31      3198
           1       0.22      0.14      0.17      4981
           2       0.94      0.97      0.96     94969

    accuracy                           0.91    103148
   macro avg       0.51      0.46      0.48    103148
weighted avg       0.89      0.91      0.90    103148

Confusion Matrix:
 [[  865   545  1788]
 [  482   685  3814]
 [  994  1899 92076]]
Accuracy: 0.907686043355179


Data without POS tags and a simple model for baseline evaluation.

In [19]:
# # next, I will create a baseline model for the RNN test scenario with no POS tags included.
# # it will just be a simple model to begin with
# model_simpleRNN_noPOS = Sequential()
# model_simpleRNN_noPOS.add(Embedding(input_dim=len(tokenizer_RNN_noPOS.word_index) + 1, output_dim=128, input_length=10))
# model_simpleRNN_noPOS.add(SimpleRNN(units=50))
# model_simpleRNN_noPOS.add(Dense(units=3, activation='softmax'))

# #compile the model
# model_simpleRNN_noPOS.compile(
#     loss='sparse_categorical_crossentropy', 
#     optimizer='adam', 
#     metrics=['accuracy'],    
#     weighted_metrics=['accuracy']
# )

# #fit the model
# model_simpleRNN_noPOS.fit(X_train_RNN_noPOS, y_train_RNN_noPOS, epochs=10, batch_size=64, validation_split=0.2)

# pred_simpleRNN_noPOS = model_simpleRNN_noPOS.predict(X_test_RNN_noPOS)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [20]:
# #classification metrics didn't work directly because the model returns a probability for each class
# #to fix this, we will convert the probabilities to the predicted class using numpy argmax [19]
# pred_simpleRNN_noPOS_converted = np.argmax(pred_simpleRNN_noPOS, axis=1)

# #evaluate results from the RNN
# baseline_simpleRNN_noPOS_report_POS = classification_report(y_test_RNN_noPOS, pred_simpleRNN_noPOS_converted)
# baseline_simpleRNN_noPOS_confusion_matrix_POS = confusion_matrix(y_test_RNN_noPOS, pred_simpleRNN_noPOS_converted)
# baseline_simpleRNN_noPOS_accuracy_POS = accuracy_score(y_test_RNN_noPOS, pred_simpleRNN_noPOS_converted)

# #print the results
# print("Classification Report:\n", baseline_simpleRNN_noPOS_report_POS)
# print("Confusion Matrix:\n", baseline_simpleRNN_noPOS_confusion_matrix_POS)
# print("Accuracy:", baseline_simpleRNN_noPOS_accuracy_POS)

Classification Report:
               precision    recall  f1-score   support

           0       0.35      0.26      0.30      3198
           1       0.20      0.11      0.14      4981
           2       0.94      0.97      0.96     94969

    accuracy                           0.91    103148
   macro avg       0.50      0.45      0.46    103148
weighted avg       0.89      0.91      0.90    103148

Confusion Matrix:
 [[  832   442  1924]
 [  477   530  3974]
 [ 1039  1704 92226]]
Accuracy: 0.9073176406716562


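This concludes the baseline models. In both cases, the results seem to be overfitting to the positive (majority) class: precision, recall, and F1 are only reasonable for the positive class (label 2), while the negative and neutral classes (labels 0 and 1) stay below 0.40 on every metric. The overall accuracy of about 0.91 is therefore misleading, since always predicting positive would already score about 0.92 on this test set (94,969 of 103,148 reviews). This is likely due to the imbalanced dataset that we are working with.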
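
The per-class figures in the classification report can be re-derived from the confusion matrix, which makes the imbalance problem easier to see. The sketch below uses the matrix printed for the POS-tagged baseline and assumes the scikit-learn convention that rows are true classes and columns are predicted classes.

```python
# Confusion matrix copied from the POS-tagged baseline output above
# (rows = true class, columns = predicted class; 0 = negative, 1 = neutral, 2 = positive)
cm = [
    [865, 545, 1788],
    [482, 685, 3814],
    [994, 1899, 92076],
]

n = len(cm)
total = sum(sum(row) for row in cm)

# Recall: correct predictions divided by the number of true members of the class
recall = [cm[i][i] / sum(cm[i]) for i in range(n)]

# Precision: correct predictions divided by everything predicted as the class
precision = [cm[i][i] / sum(cm[r][i] for r in range(n)) for i in range(n)]

accuracy = sum(cm[i][i] for i in range(n)) / total
macro_recall = sum(recall) / n

# Accuracy of a trivial model that always predicts the largest class
majority_baseline = max(sum(row) for row in cm) / total

for i in range(n):
    print(f"class {i}: precision={precision[i]:.2f} recall={recall[i]:.2f}")
print(f"accuracy={accuracy:.4f} macro_recall={macro_recall:.2f} majority_baseline={majority_baseline:.4f}")
```

Macro-averaged metrics weight the three classes equally, so they are a more informative target than accuracy when comparing the imbalance fixes that follow.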

### Part 7.4: Testing Imbalanced Data Fixes to Determine Best Approach

#### Part 7.4.1: Using SMOTE to resample the data and try to improve the imbalance

In [17]:
# in order to try and improve the deep learning prediction model, I am going to implement 
# synthetic minority over-sampling technique (SMOTE) to try and produce more minority class data. 
# SMOTE generates synthetic samples for us to help balance the dataset. 

#first I will load my libraries in
!pip install -U imbalanced-learn
from imblearn.over_sampling import SMOTE



Starting with the POS Tagged Data

In [22]:
# #I will oversample the data as the minority class is not being recognized well
# # this was shown by training the model without any additional fine tuning and just the baseline preprocessing. 
# smote_RNN = SMOTE(sampling_strategy='auto', random_state=42)
# X_train_RNN_resampled, y_train_RNN_resampled = smote_RNN.fit_resample(X_train_RNN, y_train_RNN)

# model_simpleRNN_smote = Sequential()
# model_simpleRNN_smote.add(Embedding(input_dim=len(tokenizer_RNN.word_index) + 1, output_dim=128, input_length=10))
# model_simpleRNN_smote.add(SimpleRNN(units=50))
# model_simpleRNN_smote.add(Dense(units=3, activation='softmax'))

# #compile the model
# model_simpleRNN_smote.compile(
#     loss='sparse_categorical_crossentropy', 
#     optimizer='adam', 
#     metrics=['accuracy'],    
#     weighted_metrics=['accuracy']
# )

# #fit the model
# model_simpleRNN_smote.fit(X_train_RNN_resampled, y_train_RNN_resampled, epochs=10, batch_size=64, validation_split=0.2)

# pred_simpleRNN_smote = model_simpleRNN_smote.predict(X_test_RNN)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [23]:
# pred_simpleRNN_converted_resampled = np.argmax(pred_simpleRNN_smote, axis=1)

# #evaluate results from the RNN
# baseline_simpleRNN_report_POS = classification_report(y_test_RNN, pred_simpleRNN_converted_resampled)
# baseline_simpleRNN_confusion_matrix_POS = confusion_matrix(y_test_RNN, pred_simpleRNN_converted_resampled)
# baseline_simpleRNN_accuracy_POS = accuracy_score(y_test_RNN, pred_simpleRNN_converted_resampled)

# #print the results
# print("Classification Report:\n", baseline_simpleRNN_report_POS)
# print("Confusion Matrix:\n", baseline_simpleRNN_confusion_matrix_POS)
# print("Accuracy:", baseline_simpleRNN_accuracy_POS)

Classification Report:
               precision    recall  f1-score   support

           0       0.15      0.42      0.23      3198
           1       0.14      0.10      0.12      4981
           2       0.95      0.90      0.92     94969

    accuracy                           0.85    103148
   macro avg       0.41      0.48      0.42    103148
weighted avg       0.88      0.85      0.86    103148

Confusion Matrix:
 [[ 1349   347  1502]
 [ 1093   509  3379]
 [ 6317  2794 85858]]
Accuracy: 0.8503897312599372


Next, I will perform the same steps for the no POS tagging case.

In [24]:
# #I will oversample the data as the minority class is not being recognized well
# # this was shown by training the model without any additional fine tuning and just the baseline preprocessing. 

# smote_RNN_noPOS = SMOTE(sampling_strategy='not majority', random_state=42)
# X_train_RNN_noPOS_resampled, y_train_RNN_noPOS_resampled = smote_RNN_noPOS.fit_resample(X_train_RNN_noPOS, y_train_RNN_noPOS)

# model_simpleRNN_smote_noPOS = Sequential()
# model_simpleRNN_smote_noPOS.add(Embedding(input_dim=len(tokenizer_RNN_noPOS.word_index) + 1, output_dim=128, input_length=10))
# model_simpleRNN_smote_noPOS.add(SimpleRNN(units=50))
# model_simpleRNN_smote_noPOS.add(Dense(units=3, activation='softmax'))


# #compile the model
# model_simpleRNN_smote_noPOS.compile(
#     loss='sparse_categorical_crossentropy', 
#     optimizer='adam', 
#     metrics=['accuracy'],    
#     weighted_metrics=['accuracy']
# )

# #fit the model
# model_simpleRNN_smote_noPOS.fit(X_train_RNN_noPOS_resampled, y_train_RNN_noPOS_resampled, epochs=10, batch_size=64, validation_split=0.2)

# pred_simpleRNN_smote_noPOS = model_simpleRNN_smote_noPOS.predict(X_test_RNN_noPOS)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [25]:
# pred_simpleRNN_noPOS_converted_resampled = np.argmax(pred_simpleRNN_smote_noPOS, axis=1)

# #evaluate results from the RNN
# baseline_simpleRNN_report_noPOS = classification_report(y_test_RNN_noPOS, pred_simpleRNN_noPOS_converted_resampled)
# baseline_simpleRNN_confusion_matrix_noPOS = confusion_matrix(y_test_RNN_noPOS, pred_simpleRNN_noPOS_converted_resampled)
# baseline_simpleRNN_accuracy_noPOS = accuracy_score(y_test_RNN_noPOS, pred_simpleRNN_noPOS_converted_resampled)

# #print the results
# print("Classification Report:\n", baseline_simpleRNN_report_noPOS)
# print("Confusion Matrix:\n", baseline_simpleRNN_confusion_matrix_noPOS)
# print("Accuracy:", baseline_simpleRNN_accuracy_noPOS)

Classification Report:
               precision    recall  f1-score   support

           0       0.15      0.45      0.23      3198
           1       0.15      0.14      0.14      4981
           2       0.95      0.89      0.92     94969

    accuracy                           0.84    103148
   macro avg       0.42      0.49      0.43    103148
weighted avg       0.89      0.84      0.86    103148

Confusion Matrix:
 [[ 1434   437  1327]
 [ 1174   675  3132]
 [ 6701  3302 84966]]
Accuracy: 0.8441753596773568


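In this case, the noPOS model has a slightly higher macro F1 (0.43 versus 0.42) but a slightly lower accuracy (0.844 versus 0.850), and the data still seems very overfit (this was confirmed by testing on the review types below as well). The model now over-predicts the negative class: for the noPOS case, negative recall rose from 0.26 to 0.45 while negative precision fell from 0.35 to 0.15, so SMOTE has introduced some noise. One likely contributor is that SMOTE interpolates between padded token-ID vectors, which are categorical indices rather than continuous features, so the synthetic samples need not correspond to real reviews. Moving forward, while iterating through different techniques, I will only use the noPOS scenario, since throughout the report so far the difference has been very small, and the noPOS scenario has usually performed better. The small gap is also expected here because the RNN input in the POS scenario keeps only the words from each tagged pair. This will allow me to iterate through new methods more efficiently.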
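
To illustrate why SMOTE can add noise on this kind of input, the sketch below applies the core SMOTE step, `x_new = x + lambda * (neighbor - x)` with `lambda` drawn from [0, 1), to two padded token-id sequences. The sequences and the helper name `smote_like_sample` are made up, and the real algorithm also performs a nearest-neighbour search that is omitted here.

```python
import random
from typing import List


def smote_like_sample(x: List[float], neighbor: List[float], rng: random.Random) -> List[float]:
    """Interpolate between a sample and one of its neighbours, as SMOTE does."""
    lam = rng.random()  # interpolation factor in [0, 1)
    return [a + lam * (b - a) for a, b in zip(x, neighbor)]


# Two hypothetical padded token-id sequences (ids are invented for illustration)
review_a = [0, 0, 0, 12, 845, 7, 3021, 56, 9, 400]
review_b = [0, 0, 15, 98, 2, 7, 77, 1300, 9, 31]

synthetic = smote_like_sample(review_a, review_b, random.Random(42))
print([round(value, 1) for value in synthetic])
```

Positions where the two reviews share an id are unchanged, but every other position becomes a fractional value between two unrelated vocabulary indices. Once such values are used as word indices for the embedding layer, they point at words that neither source review contained.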

#### Part 7.4.2: Adding L2 Regularization to the SMOTE resampled data to try and lower the potential for overfitting

In [18]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.regularizers import l2
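
For reference, an L2 kernel regularizer adds `factor * sum(w ** 2)` over a layer's weights to the training loss. The sketch below computes that penalty for a small hypothetical kernel using the two factors that appear in this part of the notebook (0.0001 and 0.01); the helper `l2_penalty` is illustrative and not part of Keras.

```python
from typing import List


def l2_penalty(weights: List[List[float]], factor: float) -> float:
    """Return factor * sum of squared weights, the term an L2 regularizer adds to the loss."""
    return factor * sum(w * w for row in weights for w in row)


# Hypothetical 2x2 Dense kernel, for illustration only
kernel = [[0.5, -1.0], [2.0, 0.0]]

print(l2_penalty(kernel, 0.0001))
print(l2_penalty(kernel, 0.01))
```

A larger factor penalizes large weights more strongly, which trades some training accuracy for a smoother model.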

In [26]:
# model_simpleRNN_Normalized = Sequential()
# model_simpleRNN_Normalized.add(Embedding(input_dim=len(tokenizer_RNN_noPOS.word_index) + 1, output_dim=128, input_length=10))
# model_simpleRNN_Normalized.add(SimpleRNN(units=50))
# model_simpleRNN_Normalized.add(Dense(64, activation='relu', kernel_regularizer=l2(0.0001)))
# model_simpleRNN_Normalized.add(Dense(32, activation='relu', kernel_regularizer=l2(0.0001)))
# model_simpleRNN_Normalized.add(Dense(3, activation='softmax'))

# model_simpleRNN_Normalized.compile(
#     loss='sparse_categorical_crossentropy', 
#     optimizer='adam', 
#     metrics=['accuracy'],    
#     weighted_metrics=['accuracy']
# )

# model_simpleRNN_Normalized.fit(X_train_RNN_noPOS_resampled, y_train_RNN_noPOS_resampled, epochs=10, batch_size=64, validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x310e72e50>

In [27]:
# # test predictions and get evaluation metrics
# pred_simpleRNN_noPOS_normalized = model_simpleRNN_Normalized.predict(X_test_RNN_noPOS)

# #convert back
# pred_simpleRNN_noPOS_converted_normalized = np.argmax(pred_simpleRNN_noPOS_normalized, axis=1)

# #evaluate results from the RNN
# normalized_simpleRNN_report_noPOS = classification_report(y_test_RNN_noPOS, pred_simpleRNN_noPOS_converted_normalized)
# normalized_simpleRNN_confusion_matrix_noPOS = confusion_matrix(y_test_RNN_noPOS, pred_simpleRNN_noPOS_converted_normalized)
# normalized_simpleRNN_accuracy_noPOS = accuracy_score(y_test_RNN_noPOS, pred_simpleRNN_noPOS_converted_normalized)

# #print the results
# print("Classification Report:\n", normalized_simpleRNN_report_noPOS)
# print("Confusion Matrix:\n", normalized_simpleRNN_confusion_matrix_noPOS)
# print("Accuracy:", normalized_simpleRNN_accuracy_noPOS)

Classification Report:
               precision    recall  f1-score   support

           0       0.16      0.45      0.24      3198
           1       0.16      0.08      0.11      4981
           2       0.95      0.91      0.93     94969

    accuracy                           0.86    103148
   macro avg       0.42      0.48      0.43    103148
weighted avg       0.88      0.86      0.87    103148

Confusion Matrix:
 [[ 1434   268  1496]
 [ 1167   408  3406]
 [ 6184  1945 86840]]
Accuracy: 0.8597549152673828


#### Part 7.4.3: Undersampling the majority class and using L2 Regularization

In [19]:
#import RandomUnderSampler
from imblearn.under_sampling import RandomUnderSampler
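
As a rough illustration of what random undersampling of the majority class does, the pure-Python sketch below drops majority-class rows at random until that class is as small as the smallest class, leaving the other classes untouched. This reflects my understanding of `sampling_strategy='majority'`; the helper `undersample_majority` and the toy data are hypothetical.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple


def undersample_majority(X: Sequence, y: Sequence[int], seed: int = 42) -> Tuple[List, List[int]]:
    """Randomly drop majority-class rows until that class matches the smallest class."""
    counts = Counter(y)
    majority = max(counts, key=counts.get)
    target = min(counts.values())
    rng = random.Random(seed)

    majority_idx = [i for i, label in enumerate(y) if label == majority]
    keep = set(rng.sample(majority_idx, target))
    keep.update(i for i, label in enumerate(y) if label != majority)

    kept = sorted(keep)
    return [X[i] for i in kept], [y[i] for i in kept]


# Toy labels with the same ordering of class sizes as the notebook (2 = positive is the majority)
y_toy = [2] * 20 + [1] * 5 + [0] * 3
X_toy = list(range(len(y_toy)))

X_small, y_small = undersample_majority(X_toy, y_toy)
print(Counter(y_small))
```

Note that the training set shrinks considerably with this approach, and the former majority class can end up smaller than the neutral class.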

In [28]:
# under_sampler = RandomUnderSampler(sampling_strategy='majority', random_state=42)
# X_train_undersampled, y_train_undersampled = under_sampler.fit_resample(X_train_RNN_noPOS, y_train_RNN_noPOS)

# model_simpleRNN_normalized_undersampled = Sequential()
# model_simpleRNN_normalized_undersampled.add(Embedding(input_dim=len(tokenizer_RNN_noPOS.word_index) + 1, output_dim=128, input_length=10))
# model_simpleRNN_normalized_undersampled.add(SimpleRNN(units=50))
# model_simpleRNN_normalized_undersampled.add(Dense(64, activation='relu', kernel_regularizer=l2(0.01), input_dim=(10,1)))
# model_simpleRNN_normalized_undersampled.add(Dense(32, activation='relu', kernel_regularizer=l2(0.01)))
# model_simpleRNN_normalized_undersampled.add(Dense(3, activation='softmax'))

# model_simpleRNN_normalized_undersampled.compile(
#     loss='sparse_categorical_crossentropy', 
#     optimizer='adam', 
#     metrics=['accuracy'],    
#     weighted_metrics=['accuracy']
# )

# model_simpleRNN_normalized_undersampled.fit(X_train_undersampled, y_train_undersampled, epochs=10, batch_size=64, validation_split=0.2)

# # test predictions and get evaluation metrics
# pred_simpleRNN_noPOS_undersampled = model_simpleRNN_normalized_undersampled.predict(X_test_RNN_noPOS)

# #convert back
# pred_simpleRNN_noPOS_converted_undersampled = np.argmax(pred_simpleRNN_noPOS_undersampled, axis=1)

# #evaluate results from the RNN
# undersampled_normalized_simpleRNN_report_noPOS = classification_report(y_test_RNN_noPOS, pred_simpleRNN_noPOS_converted_undersampled)
# undersampled_normalized_simpleRNN_confusion_matrix_noPOS = confusion_matrix(y_test_RNN_noPOS, pred_simpleRNN_noPOS_converted_undersampled)
# undersampled_normalized_simpleRNN_accuracy_noPOS = accuracy_score(y_test_RNN_noPOS, pred_simpleRNN_noPOS_converted_undersampled)

# #print the results
# print("Classification Report:\n", undersampled_normalized_simpleRNN_report_noPOS)
# print("Confusion Matrix:\n", undersampled_normalized_simpleRNN_confusion_matrix_noPOS)
# print("Accuracy:", undersampled_normalized_simpleRNN_accuracy_noPOS)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Classification Report:
               precision    recall  f1-score   support

           0       0.09      0.46      0.15      3198
           1       0.07      0.61      0.12      4981
           2       0.99      0.42      0.59     94969

    accuracy                           0.43    103148
   macro avg       0.38      0.50      0.28    103148
weighted avg       0.91      0.43      0.55    103148

Confusion Matrix:
 [[ 1482  1578   138]
 [ 1504  3047   430]
 [14203 41132 39634]]
Accuracy: 0.4281517819056114
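
With `sampling_strategy='majority'`, only the largest class is resampled, and it is reduced to the size of the smallest class. The sketch below mimics that behaviour in plain Python, using the training-set class counts that this notebook prints in section 7.4.5. `undersample_majority` is a hypothetical helper written for illustration, not the imblearn implementation, and the label order (0 = negative, 1 = neutral, 2 = positive) is assumed.

```python
import random
from collections import Counter

def undersample_majority(labels, seed=42):
    """Return the indices to keep after shrinking the largest class
    to the size of the smallest class (other classes are left untouched)."""
    counts = Counter(labels)
    majority = max(counts, key=counts.get)
    target = min(counts.values())
    rng = random.Random(seed)
    majority_idx = [i for i, y in enumerate(labels) if y == majority]
    kept_majority = set(rng.sample(majority_idx, target))
    return [i for i, y in enumerate(labels) if y != majority or i in kept_majority]

# Training-split class counts (assumed order: 0 = negative, 1 = neutral, 2 = positive).
labels = [0] * 12843 + [1] * 19207 + [2] * 380540
kept = undersample_majority(labels)
print(Counter(labels[i] for i in kept))
print("rows kept:", len(kept), "of", len(labels))
```

Under these assumptions the model trains on roughly 45k rows instead of roughly 413k, and the positive class is no longer the largest one, which may be part of the reason class-2 recall drops so sharply in the report above.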


#### 7.4.4: No over- or undersampling, just L2 regularization

In [37]:
model_simpleRNN_normalized_only = Sequential()
model_simpleRNN_normalized_only.add(Embedding(input_dim=len(tokenizer_RNN_noPOS.word_index) + 1, output_dim=128, input_length=10))
model_simpleRNN_normalized_only.add(SimpleRNN(units=50))
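# input_dim removed: it only applies to the first layer, and this layer's input shape is inferred from the SimpleRNN output
model_simpleRNN_normalized_only.add(Dense(64, activation='relu', kernel_regularizer=l2(0.01)))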
model_simpleRNN_normalized_only.add(Dense(32, activation='relu', kernel_regularizer=l2(0.01)))
model_simpleRNN_normalized_only.add(Dense(3, activation='softmax'))

model_simpleRNN_normalized_only.compile(
    loss='sparse_categorical_crossentropy', 
    optimizer='adam', 
    metrics=['accuracy'],    
    weighted_metrics=['accuracy']
)

model_simpleRNN_normalized_only.fit(X_train_RNN_noPOS, y_train_RNN_noPOS, epochs=10, batch_size=64, validation_split=0.2)

# test predictions and get evaluation metrics
pred_simpleRNN_noPOS_normalized_only = model_simpleRNN_normalized_only.predict(X_test_RNN_noPOS)

#convert back
pred_simpleRNN_noPOS_normalized_only = np.argmax(pred_simpleRNN_noPOS_normalized_only, axis=1)

#evaluate results from the RNN
only_normalized_simpleRNN_report_noPOS = classification_report(y_test_RNN_noPOS, pred_simpleRNN_noPOS_normalized_only)
only_normalized_simpleRNN_confusion_matrix_noPOS = confusion_matrix(y_test_RNN_noPOS, pred_simpleRNN_noPOS_normalized_only)
only_normalized_simpleRNN_accuracy_noPOS = accuracy_score(y_test_RNN_noPOS, pred_simpleRNN_noPOS_normalized_only)

#print the results
print("Classification Report:\n", only_normalized_simpleRNN_report_noPOS)
print("Confusion Matrix:\n", only_normalized_simpleRNN_confusion_matrix_noPOS)
print("Accuracy:", only_normalized_simpleRNN_accuracy_noPOS)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Classification Report:
               precision    recall  f1-score   support

           0       0.46      0.27      0.34      3198
           1       0.21      0.15      0.18      4981
           2       0.94      0.97      0.96     94969

    accuracy                           0.91    103148
   macro avg       0.54      0.46      0.49    103148
weighted avg       0.89      0.91      0.90    103148

Confusion Matrix:
 [[  858   653  1687]
 [  409   751  3821]
 [  599  2159 92211]]
Accuracy: 0.909566836002637
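
Because class 2 makes up most of the test set, accuracy alone can be misleading. As a rough check, the sketch below recomputes the headline numbers from the confusion matrix above (rows are true classes, columns are predicted classes) and compares the accuracy with a trivial baseline that always predicts the largest class. It uses only the standard library; the variable names are my own.

```python
# Confusion matrix copied from the output above (rows = true class, columns = predicted class).
confusion = [
    [858, 653, 1687],
    [409, 751, 3821],
    [599, 2159, 92211],
]

n_classes = len(confusion)
total = sum(sum(row) for row in confusion)
support = [sum(row) for row in confusion]
predicted = [sum(confusion[r][c] for r in range(n_classes)) for c in range(n_classes)]

accuracy = sum(confusion[i][i] for i in range(n_classes)) / total
recall = [confusion[i][i] / support[i] for i in range(n_classes)]
precision = [confusion[i][i] / predicted[i] for i in range(n_classes)]
f1 = [2 * p * r / (p + r) for p, r in zip(precision, recall)]
macro_f1 = sum(f1) / n_classes

# Accuracy of a trivial model that always predicts the largest class.
majority_baseline = max(support) / total

print(f"accuracy          {accuracy:.4f}")
print(f"majority baseline {majority_baseline:.4f}")
print(f"macro F1          {macro_f1:.4f}")
```

Here the always-class-2 baseline (about 0.92) is slightly higher than the model's accuracy (about 0.91), so the macro-averaged scores are probably the more informative way to compare the experiments in section 7.4.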


#### 7.4.5: Multiclass SMOTE oversampling for both negative and neutral and LSTM Model with L2 regularization

In [30]:
# #get the totals for multiclass smote
# unique_values, counts = np.unique(y_train_RNN_noPOS, return_counts=True)
# #print the values to check 
# for label, count in zip(unique_values, counts):
#     print(f"Class '{label}': {count} occurrences")

# #update the sampling dict for the smote analysis
# sampling_dict = {0:190270, 1:190270, 2:380540}
# smote_multi_RNN_noPOS = SMOTE(sampling_strategy=sampling_dict, random_state=42)
# X_train_RNN_noPOS_multi, y_train_RNN_noPOS_multi = smote_multi_RNN_noPOS.fit_resample(X_train_RNN_noPOS, y_train_RNN_noPOS)

# unique_values_multi, counts_multi = np.unique(y_train_RNN_noPOS_multi, return_counts=True)

# for label, count in zip(unique_values_multi, counts_multi):
#     print(f"Class '{label}': {count} occurrences")

# model_simpleRNN_smote_multi_noPOS = Sequential()
# model_simpleRNN_smote_multi_noPOS.add(Embedding(input_dim=len(tokenizer_RNN_noPOS.word_index) + 1, output_dim=128, input_length=10))
# model_simpleRNN_smote_multi_noPOS.add(LSTM(units=50, kernel_regularizer=l2(0.001)))
# model_simpleRNN_smote_multi_noPOS.add(Dense(units=3, activation='softmax'))


# #compile the model
# model_simpleRNN_smote_multi_noPOS.compile(
#     loss='sparse_categorical_crossentropy', 
#     optimizer='adam', 
#     metrics=['accuracy'],    
#     weighted_metrics=['accuracy']
# )

# #fit the model
# model_simpleRNN_smote_multi_noPOS.fit(X_train_RNN_noPOS_multi, y_train_RNN_noPOS_multi, epochs=10, batch_size=64, validation_split=0.2)

# pred_simpleRNN_smote_multi_noPOS = model_simpleRNN_smote_multi_noPOS.predict(X_test_RNN_noPOS)

# pred_simpleRNN_noPOS_multi_resampled = np.argmax(pred_simpleRNN_smote_multi_noPOS, axis=1)

# #evaluate results from the RNN
# multi_simpleRNN_report_noPOS = classification_report(y_test_RNN_noPOS, pred_simpleRNN_noPOS_multi_resampled)
# multi_simpleRNN_confusion_matrix_noPOS = confusion_matrix(y_test_RNN_noPOS, pred_simpleRNN_noPOS_multi_resampled)
# multi_simpleRNN_accuracy_noPOS = accuracy_score(y_test_RNN_noPOS, pred_simpleRNN_noPOS_multi_resampled)

# #print the results
# print("Classification Report:\n", multi_simpleRNN_report_noPOS)
# print("Confusion Matrix:\n", multi_simpleRNN_confusion_matrix_noPOS)
# print("Accuracy:", multi_simpleRNN_accuracy_noPOS)

Class '0': 12843 occurrences
Class '1': 19207 occurrences
Class '2': 380540 occurrences
Class '0': 190270 occurrences
Class '1': 190270 occurrences
Class '2': 380540 occurrences
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Classification Report:
               precision    recall  f1-score   support

           0       0.16      0.36      0.22      3198
           1       0.26      0.14      0.19      4981
           2       0.95      0.93      0.94     94969

    accuracy                           0.88    103148
   macro avg       0.46      0.48      0.45    103148
weighted avg       0.89      0.88      0.88    103148

Confusion Matrix:
 [[ 1153   600  1445]
 [  908   722  3351]
 [ 5042  1465 88462]]
Accuracy: 0.8757998216155427
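
SMOTE builds each synthetic sample by interpolating between a minority-class sample and one of its nearest neighbours: `x_new = x + lam * (neighbour - x)` with `lam` drawn from [0, 1]. When the features are padded token IDs, as they are here, the interpolated values are the IDs of other words, so a synthetic sequence need not read like either of its parents. The sketch below illustrates a single interpolation step under stated assumptions: the vocabulary is hypothetical, `lam` is fixed instead of random, and the result is truncated back to integers so that it can be decoded.

```python
def smote_interpolate(x, neighbour, lam):
    """One SMOTE-style synthetic sample: x + lam * (neighbour - x), element-wise.
    The result is truncated to integers (an assumption made here so that
    the values can be decoded as token IDs)."""
    return [int(a + lam * (b - a)) for a, b in zip(x, neighbour)]

# Hypothetical vocabulary, for illustration only.
word_index = {"room": 1, "dirty": 2, "staff": 3, "rude": 4,
              "location": 5, "lovely": 6, "breakfast": 7, "cold": 8}
index_word = {v: k for k, v in word_index.items()}

x = [word_index["room"], word_index["dirty"]]              # "room dirty"
neighbour = [word_index["breakfast"], word_index["cold"]]  # "breakfast cold"

synthetic = smote_interpolate(x, neighbour, lam=0.5)
print(" ".join(index_word[t] for t in synthetic))  # prints "rude location"
```

In this toy case two negative phrases combine into a sequence made of words that neither parent contains, which is something to keep in mind when reading decoded SMOTE samples.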


In [31]:
# # The models tried so far all appear to overfit the training data.
# # Check whether the synthetic samples that SMOTE creates look reasonable.

# negative_indices = np.where(y_train_RNN_noPOS_multi == 0)[0]
# # build the id-to-word lookup once instead of on every iteration
# reverse_word_index = {v: k for k, v in tokenizer_RNN_noPOS.word_index.items()}

# for i in range(12483, 12943):
#     index = negative_indices[i]
#     decoded_sequence = [reverse_word_index.get(token, '') for token in X_train_RNN_noPOS_multi[index]]
#     print("Decoded Sequence:", ' '.join(decoded_sequence))

Decoded Sequence: customer badly bad experience breakfast great employee restaurant really nice
Decoded Sequence: investigation pleased manner hotel handled situation good location quiet room
Decoded Sequence:         everything location
Decoded Sequence: coffee shop work price per night totally unacceptable good location
Decoded Sequence: money tiny room air conditioning dated decor good central location
Decoded Sequence: work report morning evening cash machine order location helpful staff
Decoded Sequence: wifi attention request made booking fourth stay last restaurant nice
Decoded Sequence: help always proposing candy kid free apple water always available
Decoded Sequence: room resemble shown website rear garden view bedroom wad pleasant
Decoded Sequence: would change mattress doubt exactly two day sleep awful breakfast
Decoded Sequence: smoking neither room noisey sleep wink either night central location
Decoded Sequence:         nothing nothing
Decoded Sequence: neighbourhood nig

Decoded Sequence:      small noisy room near paddington
Decoded Sequence:    negative never truly uncomfortable huge waste money
Decoded Sequence: club hotel spa relaxing day sunday staff really helpful friendly
Decoded Sequence: location paid gbp night would liked pay half best positive
Decoded Sequence: brush tissue paper check guy wanted charge u friendly positive
Decoded Sequence: arrival furthermore room extremely small even open trolley space positive
Decoded Sequence: last moment competition provides free wifi lead market follow positive
Decoded Sequence: previous stay city enjoyed much better comfort lower price positive
Decoded Sequence: provided better service hotel staff location good worth money paid
Decoded Sequence: room complained said nothink else available think worth money pool
Decoded Sequence: area paris fridge really hard move inside tiny space nothing
Decoded Sequence: cheap cheap hotel holiday inn would better cheaper staff loby
Decoded Sequence: nothing friendly

Decoded Sequence:       sweage coming bathroom nothing
Decoded Sequence:  dislike indifference staff attitude towards costumer demand like location
Decoded Sequence: lead passenger resulting charged fee paying using debit card positive
Decoded Sequence: morning stay room ok size location good working near aldgate
Decoded Sequence:   clup room give bed room dirty room location
Decoded Sequence: week booking com check property promote nothing really bad experience
Decoded Sequence: staff never go back nice location every thing modern clean
Decoded Sequence:      rude staff bathroom filthy positive
Decoded Sequence: properly mattress horrible fridge bathroom seat broken miserable elevator positive
Decoded Sequence:       lousy hotel lousy hotel
Decoded Sequence: cheaper case much better experience absolutely staying neither friend positive
Decoded Sequence: family room make sofa bed bedding supplied ask time positive
Decoded Sequence: decent temp tap stiff breakfast good good choice locat

Decoded Sequence: drop time could bread negative negative none open place amazing
Decoded Sequence: immediately ok walkable access work door room enthusiastic even lyon
Decoded Sequence: london upgraded garden bit area lot central phone socket nothing
Decoded Sequence: privacy stay knocking enough pm central traveller spot comfortable nice
Decoded Sequence: nightmare tube staff able many different nothing gave friend well
Decoded Sequence: first open require detail small enough get location bed lovely
Decoded Sequence: stayed tiny look angle brother market large walking location bit
Decoded Sequence: nest unique credit see contacted ever floor enough could positive
Decoded Sequence: cleaning parking choice pudding room work night lovely light breakfast
Decoded Sequence: dark taking gratuitously need needed well late girl walk important
Decoded Sequence: everything lunch charged many travelling service comfortable staff back may
Decoded Sequence: bed served handy got could view perfect 

Compared with the processed reviews from the original data, nothing in these decoded sequences looks especially unusual, so the SMOTE samples seem acceptable as negative reviews. I will need to look into other techniques to improve the model's predictions. Next, I will adjust the class weights so that errors on the minority classes count for more than errors on the majority class.
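
To get a feel for how strong this re-weighting is, the sketch below applies the formula that scikit-learn documents for `class_weight='balanced'`, `n_samples / (n_classes * count_of_class)`, to the training-set class counts printed in section 7.4.5. `balanced_class_weights` is a hypothetical stand-in for `compute_class_weight`, written in plain Python for illustration.

```python
def balanced_class_weights(counts):
    """Weights following the 'balanced' heuristic:
    n_samples / (n_classes * count_of_class)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * n) for label, n in counts.items()}

# Training-set class counts printed in section 7.4.5, before and after SMOTE.
original_counts = {0: 12843, 1: 19207, 2: 380540}
smote_counts = {0: 190270, 1: 190270, 2: 380540}

for name, counts in [("original", original_counts), ("after SMOTE", smote_counts)]:
    weights = balanced_class_weights(counts)
    print(name, {label: round(w, 3) for label, w in weights.items()})
```

On the original labels a class-0 sample weighs roughly 30 times as much as a class-2 sample. On the SMOTE-resampled labels the ratio is only 2 to 1, because the resampling has already removed most of the imbalance.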

#### 7.4.6: Class weights adjusted and bidirectional LSTM tested with SMOTE

In [32]:
# #adjust the class weights
# class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(y_train_RNN_noPOS), y=y_train_RNN_noPOS_multi)
# class_weights_dict = {i: weight for i, weight in enumerate(class_weights)}
# sample_weights = np.array([class_weights_dict[y] for y in y_train_RNN_noPOS_multi])

# #will try a bi-directional LSTM model 
# model_RNN_bidirect_class_smote = Sequential()
# model_RNN_bidirect_class_smote.add(Embedding(input_dim=len(tokenizer_RNN_noPOS.word_index) + 1, output_dim=128, input_length=10))
# model_RNN_bidirect_class_smote.add(Bidirectional(LSTM(128)))
# model_RNN_bidirect_class_smote.add(Dense(3, activation='softmax'))

# # Compile the model with the updated class weights
# model_RNN_bidirect_class_smote.compile(
#     loss='sparse_categorical_crossentropy',
#     optimizer='adam',
#     metrics=['accuracy'],
#     weighted_metrics=['accuracy']
# )

# #fit the model
# model_RNN_bidirect_class_smote.fit(X_train_RNN_noPOS_multi, y_train_RNN_noPOS_multi, epochs=10, batch_size=128, validation_split=0.2, sample_weight=sample_weights)

# pred_RNN_bidirect_class_smote = model_RNN_bidirect_class_smote.predict(X_test_RNN_noPOS)

# pred_converted_RNN_bidirect_class_smote = np.argmax(pred_RNN_bidirect_class_smote, axis=1)

# #evaluate results from the RNN
# report_model_RNN_bidirect_class_smote = classification_report(y_test_RNN_noPOS, pred_converted_RNN_bidirect_class_smote)
# confusion_matrix_model_RNN_bidirect_class_smote = confusion_matrix(y_test_RNN_noPOS, pred_converted_RNN_bidirect_class_smote)
# accuracy_model_RNN_bidirect_class_smote = accuracy_score(y_test_RNN_noPOS, pred_converted_RNN_bidirect_class_smote)

# #print the results
# print("Classification Report:\n", report_model_RNN_bidirect_class_smote)
# print("Confusion Matrix:\n", confusion_matrix_model_RNN_bidirect_class_smote)
# print("Accuracy:", accuracy_model_RNN_bidirect_class_smote)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Classification Report:
               precision    recall  f1-score   support

           0       0.18      0.41      0.25      3198
           1       0.19      0.18      0.18      4981
           2       0.95      0.91      0.93     94969

    accuracy                           0.86    103148
   macro avg       0.44      0.50      0.45    103148
weighted avg       0.89      0.86      0.87    103148

Confusion Matrix:
 [[ 1307   616  1275]
 [ 1055   883  3043]
 [ 5020  3237 86712]]
Accuracy: 0.86188777290883


#### 7.4.7: Class weights only with LSTM and L2 regularization

In [33]:
# # next I will try a new model architecture with class weights adjusted and L2 norm 
# class_weights_lstm = compute_class_weight(class_weight='balanced', classes=np.unique(y_train_RNN_noPOS), y=y_train_RNN_noPOS)
# class_weights_dict = {i: weight for i, weight in enumerate(class_weights_lstm)}
# sample_weights = np.array([class_weights_dict[y] for y in y_train_RNN_noPOS])

# model_class_norm_lstm = Sequential()
# model_class_norm_lstm.add(Embedding(input_dim=len(tokenizer_RNN_noPOS.word_index) + 1, output_dim=128, input_length=10))
# model_class_norm_lstm.add(LSTM(units=50))
# model_class_norm_lstm.add(Dense(64, activation='relu', kernel_regularizer=l2(0.01), input_dim=(10,1)))
# model_class_norm_lstm.add(Dense(32, activation='relu', kernel_regularizer=l2(0.01)))
# model_class_norm_lstm.add(Dense(3, activation='softmax'))

# model_class_norm_lstm.compile(
#     loss='sparse_categorical_crossentropy', 
#     optimizer='adam', 
#     metrics=['accuracy'],    
#     weighted_metrics=['accuracy']
# )

# model_class_norm_lstm.fit(X_train_RNN_noPOS, y_train_RNN_noPOS, epochs=10, batch_size=64, validation_split=0.2, sample_weight=sample_weights)

# # test predictions and get evaluation metrics
# pred_noPOS_class_norm_lstm = model_class_norm_lstm.predict(X_test_RNN_noPOS)

# #convert back
# pred_noPOS_class_norm_lstm_converted = np.argmax(pred_noPOS_class_norm_lstm, axis=1)

# #evaluate results from the RNN
# class_norm_lstm_report_noPOS = classification_report(y_test_RNN_noPOS, pred_noPOS_class_norm_lstm_converted)
# class_norm_lstm_confusion_matrix_noPOS = confusion_matrix(y_test_RNN_noPOS, pred_noPOS_class_norm_lstm_converted)
# class_norm_lstm_accuracy_noPOS = accuracy_score(y_test_RNN_noPOS, pred_noPOS_class_norm_lstm_converted)

# #print the results
# print("Classification Report:\n", class_norm_lstm_report_noPOS)
# print("Confusion Matrix:\n", class_norm_lstm_confusion_matrix_noPOS)
# print("Accuracy:", class_norm_lstm_accuracy_noPOS)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Classification Report:
               precision    recall  f1-score   support

           0       0.21      0.48      0.29      3198
           1       0.12      0.48      0.19      4981
           2       0.98      0.77      0.86     94969

    accuracy                           0.75    103148
   macro avg       0.43      0.58      0.45    103148
weighted avg       0.91      0.75      0.81    103148

Confusion Matrix:
 [[ 1529  1180   489]
 [ 1210  2415  1356]
 [ 4656 16829 73484]]
Accuracy: 0.7506495520998953


#### 7.4.8: Same as above, but with the L2-regularized dense layers first, then the LSTM

In [34]:
# # next I will try a new model architecture with class weights adjusted and L2 norm and LSTM
# class_weights_lstm = compute_class_weight(class_weight='balanced', classes=np.unique(y_train_RNN_noPOS), y=y_train_RNN_noPOS)
# class_weights_dict = {i: weight for i, weight in enumerate(class_weights_lstm)}
# sample_weights = np.array([class_weights_dict[y] for y in y_train_RNN_noPOS])

# model_class_norm_lstm_2 = Sequential()
# model_class_norm_lstm_2.add(Embedding(input_dim=len(tokenizer_RNN_noPOS.word_index) + 1, output_dim=128, input_length=10))
# model_class_norm_lstm_2.add(Dense(64, activation='relu', kernel_regularizer=l2(0.01), input_dim=(10,1)))
# model_class_norm_lstm_2.add(Dense(32, activation='relu', kernel_regularizer=l2(0.01)))
# model_class_norm_lstm_2.add(LSTM(units=50))
# model_class_norm_lstm_2.add(Dense(3, activation='softmax'))

# model_class_norm_lstm_2.compile(
#     loss='sparse_categorical_crossentropy', 
#     optimizer='adam', 
#     metrics=['accuracy'],    
#     weighted_metrics=['accuracy']
# )

# model_class_norm_lstm_2.fit(X_train_RNN_noPOS, y_train_RNN_noPOS, epochs=10, batch_size=64, validation_split=0.2, sample_weight=sample_weights)

# # test predictions and get evaluation metrics
# pred_noPOS_class_norm_lstm_2 = model_class_norm_lstm_2.predict(X_test_RNN_noPOS)

# #convert back
# pred_noPOS_class_norm_lstm_converted_2 = np.argmax(pred_noPOS_class_norm_lstm_2, axis=1)

# #evaluate results from the RNN
# class_norm_lstm_report_noPOS_2 = classification_report(y_test_RNN_noPOS, pred_noPOS_class_norm_lstm_converted_2)
# class_norm_lstm_confusion_matrix_noPOS_2 = confusion_matrix(y_test_RNN_noPOS, pred_noPOS_class_norm_lstm_converted_2)
# class_norm_lstm_accuracy_noPOS_2 = accuracy_score(y_test_RNN_noPOS, pred_noPOS_class_norm_lstm_converted_2)

# #print the results
# print("Classification Report:\n", class_norm_lstm_report_noPOS_2)
# print("Confusion Matrix:\n", class_norm_lstm_confusion_matrix_noPOS_2)
# print("Accuracy:", class_norm_lstm_accuracy_noPOS_2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Classification Report:
               precision    recall  f1-score   support

           0       0.24      0.44      0.31      3198
           1       0.10      0.60      0.17      4981
           2       0.98      0.70      0.82     94969

    accuracy                           0.69    103148
   macro avg       0.44      0.58      0.44    103148
weighted avg       0.91      0.69      0.77    103148

Confusion Matrix:
 [[ 1407  1468   323]
 [  929  2974  1078]
 [ 3428 24647 66894]]
Accuracy: 0.6909974017916004


#### 7.4.9: Trying class weights with Bi-directional LSTM

In [19]:
# # next I will try a new model architecture with class weights adjusted and L2 norm and Bi-Directional LSTM
# class_weights_bi_lstm = compute_class_weight(class_weight='balanced', classes=np.unique(y_train_RNN_noPOS), y=y_train_RNN_noPOS)
# class_weights_dict = {i: weight for i, weight in enumerate(class_weights_bi_lstm)}
# sample_weights = np.array([class_weights_dict[y] for y in y_train_RNN_noPOS])

# model_class_norm_lstm_bi = Sequential()
# model_class_norm_lstm_bi.add(Embedding(input_dim=len(tokenizer_RNN_noPOS.word_index) + 1, output_dim=128, input_length=10))
# model_class_norm_lstm_bi.add(Bidirectional(LSTM(units=50, kernel_regularizer=l2(0.01))))
# model_class_norm_lstm_bi.add(Dense(3, activation='softmax'))

# model_class_norm_lstm_bi.compile(
#     loss='sparse_categorical_crossentropy', 
#     optimizer='adam', 
#     metrics=['accuracy'],    
#     weighted_metrics=['accuracy']
# )

# model_class_norm_lstm_bi.fit(X_train_RNN_noPOS, y_train_RNN_noPOS, epochs=10, batch_size=64, validation_split=0.2, sample_weight=sample_weights)

# # test predictions and get evaluation metrics
# pred_noPOS_class_norm_lstm_bi = model_class_norm_lstm_bi.predict(X_test_RNN_noPOS)

# #convert back
# pred_noPOS_class_norm_lstm_converted_bi = np.argmax(pred_noPOS_class_norm_lstm_bi, axis=1)

# #evaluate results from the RNN
# class_norm_lstm_report_noPOS_bi = classification_report(y_test_RNN_noPOS, pred_noPOS_class_norm_lstm_converted_bi)
# class_norm_lstm_confusion_matrix_noPOS_bi = confusion_matrix(y_test_RNN_noPOS, pred_noPOS_class_norm_lstm_converted_bi)
# class_norm_lstm_accuracy_noPOS_bi = accuracy_score(y_test_RNN_noPOS, pred_noPOS_class_norm_lstm_converted_bi)

# #print the results
# print("Classification Report:\n", class_norm_lstm_report_noPOS_bi)
# print("Confusion Matrix:\n", class_norm_lstm_confusion_matrix_noPOS_bi)
# print("Accuracy:", class_norm_lstm_accuracy_noPOS_bi)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Classification Report:
               precision    recall  f1-score   support

           0       0.19      0.53      0.28      3198
           1       0.11      0.46      0.17      4981
           2       0.98      0.75      0.85     94969

    accuracy                           0.73    103148
   macro avg       0.42      0.58      0.43    103148
weighted avg       0.91      0.73      0.80    103148

Confusion Matrix:
 [[ 1685  1095   418]
 [ 1437  2301  1243]
 [ 5808 17925 71236]]
Accuracy: 0.7292628068406561


#### 7.4.10: Same as above, but with a GRU instead

In [20]:
# # next I will try a new model architecture with class weights adjusted and L2 norm and GRU
# class_weights_gru = compute_class_weight(class_weight='balanced', classes=np.unique(y_train_RNN_noPOS), y=y_train_RNN_noPOS)
# class_weights_dict = {i: weight for i, weight in enumerate(class_weights_gru)}
# sample_weights = np.array([class_weights_dict[y] for y in y_train_RNN_noPOS])

# model_class_norm_gru = Sequential()
# model_class_norm_gru.add(Embedding(input_dim=len(tokenizer_RNN_noPOS.word_index) + 1, output_dim=128, input_length=10))
# model_class_norm_gru.add(GRU(units=50, kernel_regularizer=l2(0.01)))
# model_class_norm_gru.add(Dense(3, activation='softmax'))

# model_class_norm_gru.compile(
#     loss='sparse_categorical_crossentropy', 
#     optimizer='adam', 
#     metrics=['accuracy'],    
#     weighted_metrics=['accuracy']
# )

# model_class_norm_gru.fit(X_train_RNN_noPOS, y_train_RNN_noPOS, epochs=10, batch_size=64, validation_split=0.2, sample_weight=sample_weights)

# # test predictions and get evaluation metrics
# pred_noPOS_class_norm_gru = model_class_norm_gru.predict(X_test_RNN_noPOS)

# #convert back
# pred_noPOS_class_norm_converted_gru = np.argmax(pred_noPOS_class_norm_gru, axis=1)

# #evaluate results from the RNN
# class_norm_lstm_report_noPOS_gru = classification_report(y_test_RNN_noPOS, pred_noPOS_class_norm_converted_gru)
# class_norm_lstm_confusion_matrix_noPOS_gru = confusion_matrix(y_test_RNN_noPOS, pred_noPOS_class_norm_converted_gru)
# class_norm_lstm_accuracy_noPOS_gru = accuracy_score(y_test_RNN_noPOS, pred_noPOS_class_norm_converted_gru)

# #print the results
# print("Classification Report:\n", class_norm_lstm_report_noPOS_gru)
# print("Confusion Matrix:\n", class_norm_lstm_confusion_matrix_noPOS_gru)
# print("Accuracy:", class_norm_lstm_accuracy_noPOS_gru)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Classification Report:
               precision    recall  f1-score   support

           0       0.14      0.56      0.23      3198
           1       0.10      0.46      0.17      4981
           2       0.98      0.70      0.82     94969

    accuracy                           0.68    103148
   macro avg       0.41      0.57      0.40    103148
weighted avg       0.91      0.68      0.77    103148

Confusion Matrix:
 [[ 1784  1089   325]
 [ 1657  2313  1011]
 [ 9037 19593 66339]]
Accuracy: 0.6828634583317176


#### 7.4.11: SimpleRNN with balanced class weights and L2 regularization

In [22]:
# #adjust class weights, then train with SimpleRNN
# class_weights_simple = compute_class_weight(class_weight='balanced', classes=np.unique(y_train_RNN_noPOS), y=y_train_RNN_noPOS)
# class_weights_dict = {i: weight for i, weight in enumerate(class_weights_simple)}
# sample_weights = np.array([class_weights_dict[y] for y in y_train_RNN_noPOS])

# model_simpleRNN_normalized_class = Sequential()
# model_simpleRNN_normalized_class.add(Embedding(input_dim=len(tokenizer_RNN_noPOS.word_index) + 1, output_dim=128, input_length=10))
# model_simpleRNN_normalized_class.add(SimpleRNN(units=50))
# model_simpleRNN_normalized_class.add(Dense(64, activation='relu', kernel_regularizer=l2(0.01), input_dim=(10,1)))
# model_simpleRNN_normalized_class.add(Dense(32, activation='relu', kernel_regularizer=l2(0.01)))
# model_simpleRNN_normalized_class.add(Dense(3, activation='softmax'))

# model_simpleRNN_normalized_class.compile(
#     loss='sparse_categorical_crossentropy', 
#     optimizer='adam', 
#     metrics=['accuracy'],    
#     weighted_metrics=['accuracy']
# )

# model_simpleRNN_normalized_class.fit(X_train_RNN_noPOS, y_train_RNN_noPOS, epochs=10, batch_size=64, validation_split=0.2, sample_weight=sample_weights)

# # test predictions and get evaluation metrics
# pred_simpleRNN_noPOS_normalized_class = model_simpleRNN_normalized_class.predict(X_test_RNN_noPOS)

# #convert back
# pred_simpleRNN_noPOS_normalized_class_convert = np.argmax(pred_simpleRNN_noPOS_normalized_class, axis=1)

# #evaluate results from the RNN
# only_normalized_simpleRNN_class_report_noPOS = classification_report(y_test_RNN_noPOS, pred_simpleRNN_noPOS_normalized_class_convert)
# only_normalized_simpleRNN_confusion_matrix_class_noPOS = confusion_matrix(y_test_RNN_noPOS, pred_simpleRNN_noPOS_normalized_class_convert)
# only_normalized_simpleRNN_accuracy_class_noPOS = accuracy_score(y_test_RNN_noPOS, pred_simpleRNN_noPOS_normalized_class_convert)

# #print the results
# print("Classification Report:\n", only_normalized_simpleRNN_class_report_noPOS)
# print("Confusion Matrix:\n", only_normalized_simpleRNN_confusion_matrix_class_noPOS)
# print("Accuracy:", only_normalized_simpleRNN_accuracy_class_noPOS)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Classification Report:
               precision    recall  f1-score   support

           0       0.15      0.45      0.22      3198
           1       0.10      0.52      0.17      4981
           2       0.98      0.69      0.81     94969

    accuracy                           0.68    103148
   macro avg       0.41      0.56      0.40    103148
weighted avg       0.91      0.68      0.76    103148

Confusion Matrix:
 [[ 1453  1374   371]
 [ 1346  2579  1056]
 [ 7187 21874 65908]]
Accuracy: 0.6780548338310001


## Part 8: Updated Classification Data to only 2 classes - positive and negative 

In [20]:
#import TFIDF for testing using this data instead of current features
from sklearn.feature_extraction.text import TfidfVectorizer

In [21]:
# first I will define the thresholds - the labels are now only positive and negative (no neutral class)
# a reviewer score of 8.0 or higher is labelled positive and anything below 8.0 is labelled negative
# the split can be updated later if desired by changing the following threshold values

positive_threshold = 8.0
negative_threshold = 8.0

# next, I will classify by building a classification function

def classify_scores(score_value):
    if score_value >= positive_threshold:
        return 'positive'
    elif score_value < negative_threshold:
        return 'negative'
    
# apply the new two-class labels to a copy of the reviews without POS tagging

preprocessed_reviews_updated = preprocessed_reviews_no_pos_tagging.copy()
    
preprocessed_reviews_updated['Reviewer_Score'] = preprocessed_reviews_updated['Reviewer_Score'].apply(classify_scores)

preprocessed_reviews_updated

# determine the counts of each category
review_counts_updated = preprocessed_reviews_updated['Reviewer_Score'].value_counts()
print(review_counts_updated)

positive    335646
negative    180092
Name: Reviewer_Score, dtype: int64


In [46]:
# an RNN cannot take text labels - they need to be numbers - so I will encode them now
class LabelEncoderSentiment_updated:
    def __init__(self):
        self.classes_ = ['negative', 'positive']

    def fit_transform_updated(self, labels):
        return np.array([self.classes_.index(label) for label in labels])
    
    # staticmethod so the call works on both the class and an instance (the original signature had no self)
    @staticmethod
    def inverse_transform_sentiment_updated(encoded_labels):
        classes_ = ['negative', 'positive']
        return [classes_[label] for label in encoded_labels]

label_encoder_RNN_updated = LabelEncoderSentiment_updated()
encoded_scores_RNN_updated = label_encoder_RNN_updated.fit_transform_updated(preprocessed_reviews_updated["Reviewer_Score"])

# get the review text (no POS tags are used in this dataset)
review_RNN_updated = preprocessed_reviews_updated["Total_Review_handled"]

print(preprocessed_reviews_updated)

# # Define TF-IDF vectorizer - this didn't work well so I won't use it. 
# tfidf_vectorizer = TfidfVectorizer(max_features=500)
# preprocessed_reviews_tfidf = [" ".join(seq) for seq in review_RNN_updated]
# # Fit and transform the text data using TF-IDF
# tfidf_matrix = tfidf_vectorizer.fit_transform(preprocessed_reviews_tfidf)
# tfidf_array = tfidf_matrix.toarray()

# tokenize and pad the review text for RNN usage
tokenizer_RNN_updated = Tokenizer()
tokenizer_RNN_updated.fit_on_texts([" ".join(seq) for seq in review_RNN_updated])
sequences_RNN_updated = tokenizer_RNN_updated.texts_to_sequences([" ".join(seq) for seq in review_RNN_updated])
review_prepped_RNN_updated = pad_sequences(sequences_RNN_updated, maxlen=30)

# prepare the test and train datasets
X_train_RNN_updated, X_test_RNN_updated, y_train_RNN_updated, y_test_RNN_updated = train_test_split(review_prepped_RNN_updated, encoded_scores_RNN_updated, test_size=0.2, random_state=42)

                                     Total_Review_handled Reviewer_Score
0       [angry, made, post, available, via, possible, ...       negative
1       [negative, real, complaint, hotel, great, grea...       negative
2       [room, nice, elderly, bit, difficult, room, tw...       negative
3       [room, dirty, afraid, walk, barefoot, floor, l...       negative
4       [booked, company, line, showed, picture, room,...       negative
...                                                   ...            ...
515733  [trolly, staff, help, take, luggage, room, loc...       negative
515734  [hotel, look, like, surely, breakfast, ok, got...       negative
515735  [ac, useless, hot, week, vienna, gave, hot, ai...       negative
515736  [negative, room, enormous, really, comfortable...       positive
515737         [rd, floor, work, free, wife, staff, kind]       positive

[515738 rows x 2 columns]
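
To make the `pad_sequences(..., maxlen=30)` step concrete, here is a minimal pure-Python sketch of what I understand its default behaviour to be (`padding='pre'`, `truncating='pre'`): short reviews are left-padded with zeros and long reviews keep only their last 30 token ids. The helper name `pad_pre` and the toy sequences are made up for illustration; this is not the Keras implementation.

```python
def pad_pre(sequences, maxlen, value=0):
    """Hypothetical stand-in for pad_sequences with 'pre' padding and truncating."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                           # keep only the last maxlen ids
        padded.append([value] * (maxlen - len(seq)) + seq)  # zeros go in front
    return padded

toy_sequences = [[5, 12, 7], list(range(1, 41))]  # one review of 3 ids, one of 40 ids
toy_padded = pad_pre(toy_sequences, maxlen=30)

print(toy_padded[0][-5:])  # trailing part of the short review: zeros, then its ids
print(toy_padded[1][:3])   # the long review has lost its first 10 ids
```

If this reading of the defaults is right, any review longer than 30 tokens loses its opening words, which could matter when the strongest sentiment is stated first.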


In [23]:
# test batch normalization and reducing the learning rate when val_loss stops improving

from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.callbacks import ReduceLROnPlateau

#update class weights
class_weights_updated = compute_class_weight(class_weight='balanced', classes=np.unique(y_train_RNN_updated), y=y_train_RNN_updated)
class_weights_dict = {i: weight for i, weight in enumerate(class_weights_updated)}
sample_weights_updated = np.array([class_weights_dict[y] for y in y_train_RNN_updated])

#define the reduce learning rate parameters. 
reduce_lr = ReduceLROnPlateau(
    monitor='val_loss',  
    factor=0.5,           
    patience=3,           
    min_lr=1e-5,          
    verbose=1              
)
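
As a rough sketch of how these `ReduceLROnPlateau` settings play out, the snippet below lists the learning rates reached if the validation loss keeps plateauing. It assumes Adam starts at its default learning rate of 0.001 (the notebook does not set it explicitly), and the function name `lr_after_reductions` is only for illustration.

```python
def lr_after_reductions(initial_lr=1e-3, factor=0.5, min_lr=1e-5):
    """List each learning rate reached by repeatedly applying the reduction factor."""
    rates = [initial_lr]
    while rates[-1] > min_lr:
        # each plateau multiplies the rate by `factor`, never going below min_lr
        rates.append(max(rates[-1] * factor, min_lr))
    return rates

schedule = lr_after_reductions()
for step, lr in enumerate(schedule):
    print(step, lr)
```

Under these assumptions, seven reductions are enough to reach the `min_lr` floor, so with `patience=3` a long run spends most of its epochs training at 1e-5.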

In [24]:
#create the complex model
model_simpleRNN_updated = Sequential()
model_simpleRNN_updated.add(Embedding(input_dim=len(tokenizer_RNN_updated.word_index) + 1, output_dim=128, input_length=30))
model_simpleRNN_updated.add(BatchNormalization())
model_simpleRNN_updated.add(SimpleRNN(units=50))
model_simpleRNN_updated.add(BatchNormalization())
model_simpleRNN_updated.add(Dense(64, activation='relu', kernel_regularizer=l2(0.01)))
model_simpleRNN_updated.add(BatchNormalization())
model_simpleRNN_updated.add(Dense(32, activation='relu', kernel_regularizer=l2(0.01)))
model_simpleRNN_updated.add(BatchNormalization())
model_simpleRNN_updated.add(Dense(2, activation='softmax'))

model_simpleRNN_updated.compile(
    loss='sparse_categorical_crossentropy', 
    optimizer='adam', 
    metrics=['accuracy'],    
    weighted_metrics=['accuracy']
)

#train
model_simpleRNN_updated.fit(X_train_RNN_updated, y_train_RNN_updated, epochs=100, batch_size=64, validation_split=0.2, sample_weight=sample_weights_updated, callbacks=[reduce_lr])

# test predictions and get evaluation metrics
pred_simpleRNN_updated = model_simpleRNN_updated.predict(X_test_RNN_updated)

#convert back
pred_simpleRNN_updated_convert = np.argmax(pred_simpleRNN_updated, axis=1)

#evaluate results from the RNN
updated_simpleRNN_class_report = classification_report(y_test_RNN_updated, pred_simpleRNN_updated_convert)
updated_simpleRNN_confusion_matrix_class = confusion_matrix(y_test_RNN_updated, pred_simpleRNN_updated_convert)
updated_simpleRNN_accuracy_class = accuracy_score(y_test_RNN_updated, pred_simpleRNN_updated_convert)

#print the results
print("Classification Report:\n", updated_simpleRNN_class_report)
print("Confusion Matrix:\n", updated_simpleRNN_confusion_matrix_class)
print("Accuracy:", updated_simpleRNN_accuracy_class)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 10: ReduceLROnPlateau reducing learning rate to 0.0005000000237487257.
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 13: ReduceLROnPlateau reducing learning rate to 0.0002500000118743628.
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 16: ReduceLROnPlateau reducing learning rate to 0.0001250000059371814.
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 19: ReduceLROnPlateau reducing learning rate to 6.25000029685907e-05.
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 22: ReduceLROnPlateau reducing learning rate to 3.125000148429535e-05.
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 25: ReduceLROnPlateau reducing learning rate to 1.5625000742147677e-05.
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 28: ReduceLROnPlateau reducing learning rate to 1e-05.
Epoch 29/100
Epoch 30/100
Epoch 31/100


Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100


Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78/100
Epoch 79/100
Epoch 80/100
Epoch 81/100
Epoch 82/100
Epoch 83/100
Epoch 84/100
Epoch 85/100
Epoch 86/100
Epoch 87/100
Epoch 88/100
Epoch 89/100
Epoch 90/100
Epoch 91/100
Epoch 92/100
Epoch 93/100
Epoch 94/100
Epoch 95/100
Epoch 96/100
Epoch 97/100
Epoch 98/100
Epoch 99/100
Epoch 100/100
Classification Report:
               precision    recall  f1-score   support

           0       0.65      0.74      0.69     36165
           1       0.85      0.79      0.82     66983

    accuracy                           0.77    103148
   macro avg       0.75      0.76      0.76    103148
weighted avg       0.78      0.77      0.77    103148

Confusion Matrix:
 [[26847  9318]
 [14327 52656]]
Accuracy: 0.7707662775817272
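
To show where the figures in the classification report come from, this short sketch recomputes precision, recall, and accuracy directly from the confusion matrix printed above (rows are true labels, columns are predictions). The helper name `per_class_metrics` is my own and is not part of scikit-learn.

```python
# Confusion matrix printed above: rows are true labels, columns are predictions
cm = [[26847, 9318],
      [14327, 52656]]

def per_class_metrics(cm, k):
    """Precision and recall for class k from a square confusion matrix."""
    true_pos = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: everything predicted as k
    actual_k = sum(cm[k])                    # row sum: everything that really is k
    return true_pos / predicted_k, true_pos / actual_k

accuracy = (cm[0][0] + cm[1][1]) / sum(map(sum, cm))
neg_precision, neg_recall = per_class_metrics(cm, 0)
pos_precision, pos_recall = per_class_metrics(cm, 1)

print(f"negative: precision={neg_precision:.2f} recall={neg_recall:.2f}")
print(f"positive: precision={pos_precision:.2f} recall={pos_recall:.2f}")
print(f"accuracy={accuracy:.4f}")
```

The low negative-class precision comes from the 14327 positive reviews that were predicted as negative.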


In [40]:
#create the model
model_lstm_updated = Sequential()
model_lstm_updated.add(Embedding(input_dim=len(tokenizer_RNN_updated.word_index) + 1, output_dim=128, input_length=30))
model_lstm_updated.add(BatchNormalization())
model_lstm_updated.add(LSTM(units=50, kernel_regularizer=l2(0.1)))
model_lstm_updated.add(BatchNormalization())
model_lstm_updated.add(Dense(2, activation='softmax'))

model_lstm_updated.compile(
    loss='sparse_categorical_crossentropy', 
    optimizer='adam', 
    metrics=['accuracy'],    
    weighted_metrics=['accuracy']
)

#train
model_lstm_updated.fit(X_train_RNN_updated, y_train_RNN_updated, epochs=100, batch_size=64, validation_split=0.2, sample_weight=sample_weights_updated, callbacks=[reduce_lr])

# test predictions and get evaluation metrics
pred_lstm_updated = model_lstm_updated.predict(X_test_RNN_updated)

#convert back
pred_lstm_updated_convert = np.argmax(pred_lstm_updated, axis=1)

#evaluate results from the RNN
updated_lstm_class_report = classification_report(y_test_RNN_updated, pred_lstm_updated_convert)
updated_lstm_confusion_matrix_class = confusion_matrix(y_test_RNN_updated, pred_lstm_updated_convert)
updated_lstm_accuracy_class = accuracy_score(y_test_RNN_updated, pred_lstm_updated_convert)

#print the results
print("Classification Report:\n", updated_lstm_class_report)
print("Confusion Matrix:\n", updated_lstm_confusion_matrix_class)
print("Accuracy:", updated_lstm_accuracy_class)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 7: ReduceLROnPlateau reducing learning rate to 0.0005000000237487257.
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 10: ReduceLROnPlateau reducing learning rate to 0.0002500000118743628.
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 13: ReduceLROnPlateau reducing learning rate to 0.0001250000059371814.
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 16: ReduceLROnPlateau reducing learning rate to 6.25000029685907e-05.
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 19: ReduceLROnPlateau reducing learning rate to 3.125000148429535e-05.
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 22: ReduceLROnPlateau reducing learning rate to 1.5625000742147677e-05.
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 25: ReduceLROnPlateau reducing learning rate to 1e-05.
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 

Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78/100
Epoch 79/100
Epoch 80/100
Epoch 81/100
Epoch 82/100
Epoch 83/100
Epoch 84/100
Epoch 85/100
Epoch 86/100
Epoch 87/100
Epoch 88/100
Epoch 89/100
Epoch 90/100
Epoch 91/100
Epoch 92/100
Epoch 93/100
Epoch 94/100
Epoch 95/100
Epoch 96/100
Epoch 97/100
Epoch 98/100
Epoch 99/100
Epoch 100/100
Classification Report:
               precision    recall  f1-score   support

           0       0.64      0.74      0.69     36165
           1       0.85      0.78      0.81     66983

    accuracy                           0.76    103148
   macro avg       0.74      0.76      0.75    103148
weighted avg       0.77      0.76      0.77    103148

Confusion Matrix:
 [[26729  9436]
 [14988 51995]]
Accuracy: 0.7632140225695118


In [41]:
#create the bi-lstm model
model_bi_lstm_updated = Sequential()
model_bi_lstm_updated.add(Embedding(input_dim=len(tokenizer_RNN_updated.word_index) + 1, output_dim=128, input_length=30))
model_bi_lstm_updated.add(BatchNormalization())
model_bi_lstm_updated.add(Bidirectional(LSTM(units=50, kernel_regularizer=l2(0.1))))
model_bi_lstm_updated.add(BatchNormalization())
model_bi_lstm_updated.add(Dense(2, activation='softmax'))

model_bi_lstm_updated.compile(
    loss='sparse_categorical_crossentropy', 
    optimizer='adam', 
    metrics=['accuracy'],    
    weighted_metrics=['accuracy']
)

#train
model_bi_lstm_updated.fit(X_train_RNN_updated, y_train_RNN_updated, epochs=100, batch_size=64, validation_split=0.2, sample_weight=sample_weights_updated, callbacks=[reduce_lr])

# test predictions and get evaluation metrics
pred_bi_lstm_updated = model_bi_lstm_updated.predict(X_test_RNN_updated)

#convert back
pred_bi_lstm_updated_convert = np.argmax(pred_bi_lstm_updated, axis=1)

#evaluate results from the RNN
updated_bi_lstm_class_report = classification_report(y_test_RNN_updated, pred_bi_lstm_updated_convert)
updated_bi_lstm_confusion_matrix_class = confusion_matrix(y_test_RNN_updated, pred_bi_lstm_updated_convert)
updated_bi_lstm_accuracy_class = accuracy_score(y_test_RNN_updated, pred_bi_lstm_updated_convert)

#print the results
print("Classification Report:\n", updated_bi_lstm_class_report)
print("Confusion Matrix:\n", updated_bi_lstm_confusion_matrix_class)
print("Accuracy:", updated_bi_lstm_accuracy_class)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 7: ReduceLROnPlateau reducing learning rate to 0.0005000000237487257.
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 10: ReduceLROnPlateau reducing learning rate to 0.0002500000118743628.
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 13: ReduceLROnPlateau reducing learning rate to 0.0001250000059371814.
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 16: ReduceLROnPlateau reducing learning rate to 6.25000029685907e-05.
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 19: ReduceLROnPlateau reducing learning rate to 3.125000148429535e-05.
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 22: ReduceLROnPlateau reducing learning rate to 1.5625000742147677e-05.
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 25: ReduceLROnPlateau reducing learning rate to 1e-05.
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 

Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78/100
Epoch 79/100
Epoch 80/100
Epoch 81/100
Epoch 82/100
Epoch 83/100
Epoch 84/100
Epoch 85/100
Epoch 86/100
Epoch 87/100
Epoch 88/100
Epoch 89/100
Epoch 90/100
Epoch 91/100
Epoch 92/100
Epoch 93/100
Epoch 94/100
Epoch 95/100
Epoch 96/100
Epoch 97/100
Epoch 98/100
Epoch 99/100
Epoch 100/100
Classification Report:
               precision    recall  f1-score   support

           0       0.64      0.75      0.69     36165
           1       0.85      0.77      0.81     66983

    accuracy                           0.76    103148
   macro avg       0.74      0.76      0.75    103148
weighted avg       0.78      0.76      0.77    103148

Confusion Matrix:
 [[27226  8939]
 [15629 51354]]
Accuracy: 0.7618179702951099


In [24]:
# create the GRU model
model_gru_updated = Sequential()
model_gru_updated.add(Embedding(input_dim=len(tokenizer_RNN_updated.word_index) + 1, output_dim=128, input_length=30))
model_gru_updated.add(BatchNormalization())
model_gru_updated.add(GRU(units=100, kernel_regularizer=l2(0.01), return_sequences=True))
model_gru_updated.add(BatchNormalization())
model_gru_updated.add(GRU(units=50, kernel_regularizer=l2(0.01)))
model_gru_updated.add(BatchNormalization())
model_gru_updated.add(Dropout(0.5))
model_gru_updated.add(Dense(2, activation='softmax'))

model_gru_updated.compile(
    loss='sparse_categorical_crossentropy', 
    optimizer='adam', 
    metrics=['accuracy'],    
    weighted_metrics=['accuracy']
)

#train
model_gru_updated.fit(X_train_RNN_updated, y_train_RNN_updated, epochs=10, batch_size=64, validation_split=0.2, sample_weight=sample_weights_updated, callbacks=[reduce_lr])

# test predictions and get evaluation metrics
pred_gru_updated = model_gru_updated.predict(X_test_RNN_updated)

#convert back
pred_gru_updated_convert = np.argmax(pred_gru_updated, axis=1)

#evaluate results from the RNN
updated_gru_class_report = classification_report(y_test_RNN_updated, pred_gru_updated_convert)
updated_gru_confusion_matrix_class = confusion_matrix(y_test_RNN_updated, pred_gru_updated_convert)
updated_gru_accuracy_class = accuracy_score(y_test_RNN_updated, pred_gru_updated_convert)

#print the results
print("Classification Report:\n", updated_gru_class_report)
print("Confusion Matrix:\n", updated_gru_confusion_matrix_class)
print("Accuracy:", updated_gru_accuracy_class)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 6: ReduceLROnPlateau reducing learning rate to 0.0005000000237487257.
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 9: ReduceLROnPlateau reducing learning rate to 0.0002500000118743628.
Epoch 10/10
Classification Report:
               precision    recall  f1-score   support

           0       0.65      0.74      0.69     36165
           1       0.85      0.79      0.82     66983

    accuracy                           0.77    103148
   macro avg       0.75      0.76      0.76    103148
weighted avg       0.78      0.77      0.77    103148

Confusion Matrix:
 [[26786  9379]
 [14283 52700]]
Accuracy: 0.7706014658548881


Since the SimpleRNN gave the best results (accuracy 0.771, marginally ahead of the GRU), I will test a larger batch size to see if that helps. In a following model, I will also oversample the minority (negative) class to see whether that helps more.

In [25]:
#adjust the batch size
model_simpleRNN_updated_2 = Sequential()
model_simpleRNN_updated_2.add(Embedding(input_dim=len(tokenizer_RNN_updated.word_index) + 1, output_dim=128, input_length=30))
model_simpleRNN_updated_2.add(BatchNormalization())
model_simpleRNN_updated_2.add(SimpleRNN(units=50))
model_simpleRNN_updated_2.add(BatchNormalization())
model_simpleRNN_updated_2.add(Dense(64, activation='relu', kernel_regularizer=l2(0.01)))
model_simpleRNN_updated_2.add(BatchNormalization())
model_simpleRNN_updated_2.add(Dense(32, activation='relu', kernel_regularizer=l2(0.01)))
model_simpleRNN_updated_2.add(BatchNormalization())
model_simpleRNN_updated_2.add(Dense(2, activation='softmax'))

model_simpleRNN_updated_2.compile(
    loss='sparse_categorical_crossentropy', 
    optimizer='adam', 
    metrics=['accuracy'],    
    weighted_metrics=['accuracy']
)

model_simpleRNN_updated_2.fit(X_train_RNN_updated, y_train_RNN_updated, epochs=100, batch_size=128, validation_split=0.2, sample_weight=sample_weights_updated, callbacks=[reduce_lr])

# test predictions and get evaluation metrics
pred_simpleRNN_updated_2 = model_simpleRNN_updated_2.predict(X_test_RNN_updated)

#convert back
pred_simpleRNN_updated_convert_2 = np.argmax(pred_simpleRNN_updated_2, axis=1)

#evaluate results from the RNN
updated_simpleRNN_class_report_2 = classification_report(y_test_RNN_updated, pred_simpleRNN_updated_convert_2)
updated_simpleRNN_confusion_matrix_class_2 = confusion_matrix(y_test_RNN_updated, pred_simpleRNN_updated_convert_2)
updated_simpleRNN_accuracy_class_2 = accuracy_score(y_test_RNN_updated, pred_simpleRNN_updated_convert_2)

#print the results
print("Classification Report:\n", updated_simpleRNN_class_report_2)
print("Confusion Matrix:\n", updated_simpleRNN_confusion_matrix_class_2)
print("Accuracy:", updated_simpleRNN_accuracy_class_2)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 6: ReduceLROnPlateau reducing learning rate to 0.0005000000237487257.
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 9: ReduceLROnPlateau reducing learning rate to 0.0002500000118743628.
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 12: ReduceLROnPlateau reducing learning rate to 0.0001250000059371814.
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 15: ReduceLROnPlateau reducing learning rate to 6.25000029685907e-05.
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 18: ReduceLROnPlateau reducing learning rate to 3.125000148429535e-05.
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 21: ReduceLROnPlateau reducing learning rate to 1.5625000742147677e-05.
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 24: ReduceLROnPlateau reducing learning rate to 1e-05.
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100


Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100


Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78/100
Epoch 79/100
Epoch 80/100
Epoch 81/100
Epoch 82/100
Epoch 83/100
Epoch 84/100
Epoch 85/100
Epoch 86/100
Epoch 87/100
Epoch 88/100
Epoch 89/100
Epoch 90/100
Epoch 91/100
Epoch 92/100
Epoch 93/100
Epoch 94/100
Epoch 95/100
Epoch 96/100
Epoch 97/100
Epoch 98/100
Epoch 99/100
Epoch 100/100
Classification Report:
               precision    recall  f1-score   support

           0       0.64      0.73      0.68     36165
           1       0.84      0.78      0.81     66983

    accuracy                           0.76    103148
   macro avg       0.74      0.75      0.75    103148
weighted avg       0.77      0.76      0.77    103148

Confusion Matrix:
 [[26264  9901]
 [14632 52351]]
Accuracy: 0.7621572885562493


In [24]:
# the larger batch size made little difference, so I will now try SMOTE to oversample the minority class
smote_over = SMOTE(sampling_strategy='auto', random_state=42)
X_train_RNN_over, y_train_RNN_over = smote_over.fit_resample(X_train_RNN_updated, y_train_RNN_updated)

# recompute the class weights for the resampled training data
# (the classes are now balanced, so both weights should come out at 1.0)
class_weights_over = compute_class_weight(class_weight='balanced', classes=np.unique(y_train_RNN_over), y=y_train_RNN_over)
class_weights_dict_over = {i: weight for i, weight in enumerate(class_weights_over)}
sample_weights_over = np.array([class_weights_dict_over[y] for y in y_train_RNN_over])
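
For reference, the `'balanced'` heuristic weights each class by `n_samples / (n_classes * class_count)`. The sketch below applies that formula by hand, first to the whole-dataset counts printed earlier (the 80% training split will differ slightly) and then to a balanced toy case. The function name `balanced_weights` and the balanced counts are illustrative assumptions.

```python
def balanced_weights(counts):
    """Apply the 'balanced' heuristic: n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Whole-dataset counts printed earlier in the notebook
before = balanced_weights({"negative": 180092, "positive": 335646})

# After oversampling, both classes have the same count (illustrative number)
after = balanced_weights({"negative": 1000, "positive": 1000})

print(before)
print(after)
```

If SMOTE leaves both classes the same size, as I expect with `sampling_strategy='auto'`, every sample weight is 1.0 and the `sample_weight` argument has no effect on the oversampled run.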

In [25]:
#create the model for the SimpleRNN with smote and complex layers
model_simpleRNN_over = Sequential()
model_simpleRNN_over.add(Embedding(input_dim=len(tokenizer_RNN_updated.word_index) + 1, output_dim=128, input_length=30))
model_simpleRNN_over.add(BatchNormalization())
model_simpleRNN_over.add(SimpleRNN(units=50))
model_simpleRNN_over.add(BatchNormalization())
model_simpleRNN_over.add(Dense(64, activation='relu', kernel_regularizer=l2(0.01)))
model_simpleRNN_over.add(BatchNormalization())
model_simpleRNN_over.add(Dense(32, activation='relu', kernel_regularizer=l2(0.01)))
model_simpleRNN_over.add(BatchNormalization())
model_simpleRNN_over.add(Dense(2, activation='softmax'))

model_simpleRNN_over.compile(
    loss='sparse_categorical_crossentropy', 
    optimizer='adam', 
    metrics=['accuracy'],    
    weighted_metrics=['accuracy']
)

#train
model_simpleRNN_over.fit(X_train_RNN_over, y_train_RNN_over, epochs=100, batch_size=64, validation_split=0.2, sample_weight=sample_weights_over, callbacks=[reduce_lr])

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 9: ReduceLROnPlateau reducing learning rate to 0.0005000000237487257.
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 12: ReduceLROnPlateau reducing learning rate to 0.0002500000118743628.
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 15: ReduceLROnPlateau reducing learning rate to 0.0001250000059371814.
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 18: ReduceLROnPlateau reducing learning rate to 6.25000029685907e-05.
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 21: ReduceLROnPlateau reducing learning rate to 3.125000148429535e-05.
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 24: ReduceLROnPlateau reducing learning rate to 1.5625000742147677e-05.
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 27: ReduceLROnPlateau reducing learning rate to 1e-05.
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100


Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100


Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78/100
Epoch 79/100
Epoch 80/100
Epoch 81/100
Epoch 82/100
Epoch 83/100
Epoch 84/100
Epoch 85/100
Epoch 86/100
Epoch 87/100
Epoch 88/100
Epoch 89/100
Epoch 90/100
Epoch 91/100
Epoch 92/100
Epoch 93/100
Epoch 94/100
Epoch 95/100
Epoch 96/100
Epoch 97/100
Epoch 98/100
Epoch 99/100
Epoch 100/100


<keras.src.callbacks.History at 0x3301908d0>

In [26]:
# test predictions and get evaluation metrics
pred_simpleRNN_over = model_simpleRNN_over.predict(X_test_RNN_updated)

#convert back
pred_simpleRNN_over_convert = np.argmax(pred_simpleRNN_over, axis=1)

#evaluate results from the RNN
over_simpleRNN_class_report = classification_report(y_test_RNN_updated, pred_simpleRNN_over_convert)
over_simpleRNN_confusion_matrix_class = confusion_matrix(y_test_RNN_updated, pred_simpleRNN_over_convert)
over_simpleRNN_accuracy_class = accuracy_score(y_test_RNN_updated, pred_simpleRNN_over_convert)

#print the results
print("Classification Report:\n", over_simpleRNN_class_report)
print("Confusion Matrix:\n", over_simpleRNN_confusion_matrix_class)
print("Accuracy:", over_simpleRNN_accuracy_class)

Classification Report:
               precision    recall  f1-score   support

           0       0.70      0.66      0.68     36165
           1       0.82      0.84      0.83     66983

    accuracy                           0.78    103148
   macro avg       0.76      0.75      0.76    103148
weighted avg       0.78      0.78      0.78    103148

Confusion Matrix:
 [[23908 12257]
 [10397 56586]]
Accuracy: 0.780373831775701
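
One caveat worth checking for the oversampled run: as far as I know, `SMOTE.fit_resample` returns the original rows followed by the synthetic ones, and Keras takes `validation_split` from the last rows before any shuffling. If both hold, the 20% validation slice would consist largely of synthetic negative reviews. The toy sketch below uses made-up counts and plain lists to show the effect and one possible remedy (shuffling the row order first); it is not taken from the notebook.

```python
import random

# Toy stand-in for the resampled labels: 6 real negatives, 10 real positives,
# then 4 synthetic negatives appended at the end (assumed SMOTE ordering)
labels = [0] * 6 + [1] * 10 + [0] * 4

def last_fraction(items, fraction=0.2):
    """Return the tail slice that a 'take the last rows' validation split would use."""
    n_val = int(len(items) * fraction)
    return items[len(items) - n_val:]

val_unshuffled = last_fraction(labels)

# Shuffle the row order with a fixed seed before splitting
order = list(range(len(labels)))
random.Random(42).shuffle(order)
shuffled_labels = [labels[i] for i in order]
val_shuffled = last_fraction(shuffled_labels)

print(val_unshuffled)  # only class 0 in this toy case
print(val_shuffled)
```

With NumPy arrays the same idea would be `perm = np.random.default_rng(42).permutation(len(y_train_RNN_over))`, followed by indexing both `X_train_RNN_over` and `y_train_RNN_over` with `perm` before calling `fit`.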


## Part 8: Scraping Reviews to Get New Data From a Website

In [27]:
#ensure libraries are installed for scraping needs [20][21]
!pip install beautifulsoup4
!pip install --upgrade selenium



In [28]:
# import relevant libraries
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
import os

In [29]:
# next, I need to specify the driver path for the web driver for Selenium to work
#set up selenium webdriver [21]
service = Service('/Users/coreyreid/Documents/Final Project/Final Project - Implementation & Evaluation/chromedriver')
options = webdriver.ChromeOptions()
#options.add_argument('--headless') #to make it headless and not run the chrome instance
driver = webdriver.Chrome(service=service, options=options)

#example website to use for the scraping
url = "https://www.mainecottagekeepers.com/2-beautiful-water-view-properties-in-1-for-extended-families-orp5b4a2abx"
driver.get(url)

# delay as needed
time.sleep(10)  

# Switch to the iframe containing the reviews - need to switch the focus into the iframe
iframe = driver.find_element(By.CSS_SELECTOR, 'div.ownerrez-widget[data-widgetid="dc0c1ee92566447e81ba5e68a8a2ac1a"] iframe')
driver.switch_to.frame(iframe)

# delay as needed
time.sleep(10)  

# Extract the HTML content
html = driver.page_source

# now, I will scrape the website to expose the review content
soup = BeautifulSoup(html, 'html.parser')
review_elements = soup.find_all('div', class_='review-item')

In [30]:
#now that we have isolated the html element that matters, we will want to go through all the reviews
#and collect just the relevant text
#initialize
real_reviews = []

#loop through the review elements
for review in review_elements:
    # Scrape the review title element, guarding against a missing title span
    title_span = review.find("span", class_="review-item-title")
    review_title = title_span.strong if title_span is not None else None
    # Scrape the review body text, guarding against a missing body div
    body_div = review.find("div", class_="has-read-more")
    review_body = body_div.text.strip() if body_div is not None else ""
    #check that a review exists
    if review_title is not None and review_body:
        #get review title text 
        review_title_text = review_title.text.strip()
        #get review body text
        review_body_text = review_body
        # append only the review body text to the real_reviews list
        real_reviews.append(
            review_body_text
        )
        #print the details through the loop
        print("The title of the review is:", review_title_text)
        print("The review content is:", review_body_text)
        print("---------------------------------")
    else:
        #error handling
        print("There is no review title or text.")

The title of the review is: 
The review content is: A big thank you from us to be able to stay in this wonderful place in order to be with our family during the Lobstah fest.  We drove from Florida to see our Massachusetts family as we gathered in Maine to welcome our Grandson from the ship that anchored in Rockland for the Festival.  We loved being able to see him as he did us.  We felt at home in this Airbnb as we gathered at the fire pit and the deck table for meals. We did several meals at the Dip Net too.  Great food and very close to house.This house is very well equipped, clean, and is an old farmhouse with a lot of Character.We did host an Airbnb on the Cape at one time and know that Shawn-Elise deserves a 5 star rating to become a super host if she is not already.Would we come again?  You Bet.Sharyn
---------------------------------
The title of the review is: 
The review content is: Great stay. Just didn’t understand why they took 500 for a security deposit
------------------

In [31]:
# test the way the data is structured by looking at the output
real_reviews

['A big thank you from us to be able to stay in this wonderful place in order to be with our family during the Lobstah fest.\xa0 We drove from Florida to see our Massachusetts family as we gathered in Maine to welcome our Grandson from the ship that anchored in Rockland for the Festival.\xa0 We loved being able to see him as he did us.\xa0 We felt at home in this Airbnb as we gathered at the fire pit and the deck table for meals. We did several meals at the Dip Net too.\xa0 Great food and very close to house.This house is very well equipped, clean, and is an old farmhouse with a lot of Character.We did host an Airbnb on the Cape at one time and know that Shawn-Elise deserves a 5 star rating to become a super host if she is not already.Would we come again?\xa0 You Bet.Sharyn',
 'Great stay. Just didn’t understand why they took 500 for a security deposit',
 'The place is absolutely beautiful and there was plenty of room for all 10 f us to sleep. The houses are just a short drive from a b

## Part 9: Testing on Real Review Data

In [33]:
# in order to scale the preprocessing to be able to be used with new reviews, I need to generalize the 
# preprocessing function - I will do that now for the one with no POS tagging

def preprocess_no_pos(review):
    
    #remove special characters and ensure lowercase
    def preprocess_remove_special(review):
        review_handled = re.sub(r'[^a-zA-Z\s]', '', review)
        review_lower = review_handled.lower()
        return review_lower

    #tokenize
    def preprocess_tokenize(review):
        bag_of_words = word_tokenize(review)
        return bag_of_words

    #remove stopwords
    def preprocess_stopwords(review):
        words = stopwords.words('english')
        set_stop_words = set(words)
        removed_stopwords_words = []
        for word in review:
            if word not in set_stop_words:
                removed_stopwords_words.append(word)
        return removed_stopwords_words

    #lemmatize
    def preprocess_lemmatize(review):
        lemmatizer = WordNetLemmatizer()
        lemmatized_review = [lemmatizer.lemmatize(word) for word in review ]
        return lemmatized_review

    # Apply the preprocessing steps
    review = preprocess_remove_special(review)
    review = preprocess_tokenize(review)
    review = preprocess_stopwords(review)
    review = preprocess_lemmatize(review)

    # Join the tokens back into a single string
    preprocessed_review = ' '.join(review)

    #return the preprocessed reviews
    return preprocessed_review
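
To illustrate what the first preprocessing step does to a scraped review, this small sketch applies the same regex and lowercasing to one of the reviews collected above. It uses a plain whitespace split in place of `word_tokenize` so that it runs without NLTK; the name `remove_special` is shortened for the example.

```python
import re

def remove_special(review):
    """Same regex and lowercasing as preprocess_remove_special above."""
    return re.sub(r'[^a-zA-Z\s]', '', review).lower()

sample = "Great stay. Just didn’t understand why they took 500 for a security deposit"
cleaned = remove_special(sample)
tokens = cleaned.split()  # plain whitespace split, standing in for word_tokenize

print(tokens)
```

Digits such as the 500 deposit are dropped entirely, and a contraction like "didn’t" is collapsed to "didnt" because the apostrophe is removed before tokenizing.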

In [34]:
# the scraped reviews all turn out to be positive, as can be seen in the output above - this will happen
# from time to time. I will manually add a few negative reviews to the reviews data for further
# testing - this will help confirm the model is working properly

updated_real_reviews = real_reviews.copy()

updated_real_reviews.extend(['Awful', 'It was a bad rental unit and was very ugly. The ammenities were all broken.', 'I will never return to this or recommend it to anyone - it was a horrible rental and it ruined my holiday.'])

updated_real_reviews

['A big thank you from us to be able to stay in this wonderful place in order to be with our family during the Lobstah fest.\xa0 We drove from Florida to see our Massachusetts family as we gathered in Maine to welcome our Grandson from the ship that anchored in Rockland for the Festival.\xa0 We loved being able to see him as he did us.\xa0 We felt at home in this Airbnb as we gathered at the fire pit and the deck table for meals. We did several meals at the Dip Net too.\xa0 Great food and very close to house.This house is very well equipped, clean, and is an old farmhouse with a lot of Character.We did host an Airbnb on the Cape at one time and know that Shawn-Elise deserves a 5 star rating to become a super host if she is not already.Would we come again?\xa0 You Bet.Sharyn',
 'Great stay. Just didn’t understand why they took 500 for a security deposit',
 'The place is absolutely beautiful and there was plenty of room for all 10 f us to sleep. The houses are just a short drive from a b

In [35]:
# next, I will test this set with Multinomial Naive Bayes with no POS tags
# to start I will preprocess the data using this function and looping through all the reviews
updated_real_reviews_processed = [preprocess_no_pos(review) for review in updated_real_reviews]

In [38]:
# last, I will test the best 3-class RNN model with the real review data
# to do this, we will need to process the data a bit more to get it into its final model training state.
# we will use the no POS tagging version as it always seemed to perform a bit better, even if the difference was
# very small between no POS tags and POS tagged data

# prepare the review text data for RNN usage
# note: this fits a new tokenizer on the scraped reviews, so its word ids are not those seen in training
tokenizer_RNN_noPOS_test_3 = Tokenizer()
tokenizer_RNN_noPOS_test_3.fit_on_texts(updated_real_reviews_processed)
review_prepped_RNN_noPOS_test = tokenizer_RNN_noPOS_test_3.texts_to_sequences(updated_real_reviews_processed)

# pad (or truncate) each sequence to a fixed length of 30 tokens
review_prepped_RNN_noPOS_padded = pad_sequences(review_prepped_RNN_noPOS_test, maxlen=30)

real_pred_RNN_noPOS_test = model_simpleRNN_normalized_only.predict(review_prepped_RNN_noPOS_padded)

# convert back to positive, negative, or neutral
real_pred_RNN_noPOS_converted_back_test = label_encoder_RNN_noPOS.inverse_transform_sentiment_noPOS(real_pred_RNN_noPOS_test.argmax(axis=1))

# get the predictions
real_pred_RNN_noPOS_converted_back_test



['positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive']

As can be seen, the 3-class predictor labels all 13 reviews as positive, including the three negative reviews added manually, which suggests it is strongly biased toward the positive class.
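One caveat is worth checking before attributing this result to the model alone. The cell above fits a new `Tokenizer` on the real reviews, and the Keras `Tokenizer` assigns word indices by frequency rank within whatever texts it is fitted on. The sketch below is a plain-Python stand-in for that behaviour, not the Keras implementation; the helper names `fit_word_index` and `to_sequence` and the toy sentences are hypothetical. It illustrates how the same words could end up with different integer IDs than the ones the network saw during training.

```python
from collections import Counter


def fit_word_index(texts):
    """Assign 1-based indices by descending word frequency (ties keep first-seen order)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequence(text, word_index):
    """Map each known word to its integer index, skipping unknown words."""
    return [word_index[w] for w in text.lower().split() if w in word_index]


# Hypothetical stand-ins for the training corpus and the small real-review sample
training_texts = ["great stay great host", "awful stay dirty room", "great location"]
real_texts = ["awful awful rental", "great stay"]

train_index = fit_word_index(training_texts)   # vocabulary the model would have been trained on
refit_index = fit_word_index(real_texts)       # vocabulary from a freshly fitted tokenizer

print(to_sequence("great stay", train_index))  # IDs the model learned for these words
print(to_sequence("great stay", refit_index))  # IDs produced after refitting on new data
print(refit_index["awful"], train_index["great"])  # different words can share the same ID
```

If this applies to the notebook, reusing the tokenizer that was fitted on the training data (rather than fitting a new one) would keep the word IDs consistent with what the embedding layer learned.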

In [47]:
# additionally, I will test the sentiment analysis on the redefined-threshold 2-class model,
# which performed a bit better than the best 3-class results


# prepare the review text data for RNN usage
tokenizer_RNN_noPOS_test_2 = Tokenizer()
tokenizer_RNN_noPOS_test_2.fit_on_texts(updated_real_reviews_processed)
review_prepped_RNN_noPOS_test_2 = tokenizer_RNN_noPOS_test_2.texts_to_sequences(updated_real_reviews_processed)

# pad (or truncate) each sequence to a fixed length of 30 tokens
review_prepped_RNN_noPOS_padded_2 = pad_sequences(review_prepped_RNN_noPOS_test_2, maxlen=30)

real_pred_RNN_noPOS_test_2 = model_simpleRNN_over.predict(review_prepped_RNN_noPOS_padded_2)

encoded_labels = real_pred_RNN_noPOS_test_2.argmax(axis=1)

# convert back to positive or negative
real_pred_RNN_noPOS_converted_back_test_2 = LabelEncoderSentiment_updated.inverse_transform_sentiment_updated(encoded_labels)

# get the predictions
real_pred_RNN_noPOS_converted_back_test_2



['negative',
 'positive',
 'positive',
 'negative',
 'negative',
 'positive',
 'positive',
 'positive',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative']
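To compare the two outputs more concretely, the predictions can be tallied against an assumed ground truth. The sketch below assumes that the ten scraped reviews are all positive (as the earlier comment suggests) and that the three manually appended reviews are negative; the `expected` list and the `summarize` helper are hypothetical names introduced for illustration, while the prediction lists are copied from the two outputs above.

```python
from collections import Counter

# Assumed ground truth: ten scraped reviews read as positive,
# followed by the three negative reviews appended by hand.
expected = ["positive"] * 10 + ["negative"] * 3

# Predictions copied from the 3-class and 2-class outputs
pred_3_class = ["positive"] * 13
pred_2_class = ["negative", "positive", "positive", "negative", "negative",
                "positive", "positive", "positive", "negative", "negative",
                "negative", "negative", "negative"]


def summarize(expected_labels, predicted_labels):
    """Return label counts, accuracy, and how many true negatives were caught."""
    pairs = list(zip(expected_labels, predicted_labels))
    correct = sum(e == p for e, p in pairs)
    negatives_found = sum(e == p == "negative" for e, p in pairs)
    return {
        "counts": Counter(predicted_labels),
        "accuracy": correct / len(pairs),
        "negatives_found": negatives_found,
    }


print("3-class:", summarize(expected, pred_3_class))
print("2-class:", summarize(expected, pred_2_class))
```

Under these assumptions, the 3-class model agrees with 10 of 13 labels but finds none of the three negative reviews, while the 2-class model finds all three negatives but agrees with only 8 of 13 labels because it marks five of the positive reviews as negative. With only 13 reviews this is an illustration, not a reliable measurement.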

## Part 10: Product Development

I will start by building a product for the 3-class model.

In [48]:
#Import the necessary GUI libraries
#I will use message box from Tkinter to build the GUI and the Tkinter library in general [22]

import tkinter as tk
from tkinter import messagebox
import joblib
import string

In [57]:
#first I need to define a function that will run on the button click. 
# this will include the preprocessing, vectorization, and model execution for the sentiment analysis

# function to preprocess the reviews and predict the sentiment of each one
def determine_preprocess(review):
    # split the input on ";" so multiple reviews can be entered - the input needs to separate reviews with ";"
    reviews_list = review.split(";")
    #initialize the results list to be used to append the results
    reviews_list_processed = []
    
    #loop through the reviews and analyze individually
    for review in reviews_list:
        # Preprocess each review individually
        processed_review = preprocess_no_pos(review) 
        tokenizer_RNN_noPOS_prod = Tokenizer()
        # wrap the review in a list so the tokenizer treats it as one document instead of iterating over its pieces
        tokenizer_RNN_noPOS_prod.fit_on_texts([processed_review])
        review_prepped_RNN_noPOS_prod = tokenizer_RNN_noPOS_prod.texts_to_sequences([processed_review])
        # pad (or truncate) the sequence to a fixed length of 30 tokens
        review_prepped_RNN_noPOS_padded_prod = pad_sequences(review_prepped_RNN_noPOS_prod, maxlen=30)
        # predict the sentiment of the review
        reviews_list_predictions = model_simpleRNN_normalized_only.predict(review_prepped_RNN_noPOS_padded_prod)
        # convert this review's prediction (not the earlier test predictions) back to positive, negative, or neutral
        sentiment_labels = label_encoder_RNN_noPOS.inverse_transform_sentiment_noPOS(reviews_list_predictions.argmax(axis=1))
        #append the results
        reviews_list_processed.append((review, sentiment_labels[0])) 
    #return the results
    return reviews_list_processed

# create a function for the sentiment analysis
def get_sentiment():
    #get the entry from the form 
    user_reviews_list = entry.get()
    # if there are reviews, then analyze them 
    if user_reviews_list:
        #determine sentiment
        sentiment_out = determine_preprocess(user_reviews_list)
        #clear the previous review info to ensure each review is looked at individually
        review_text.delete(1.0, tk.END)
        
        # analyse the review
        for review, sentiment in sentiment_out:
            # in the text output that is part of the GUI, we will now get the values needed in our presentation of results
            # get the review
            review_text.insert(tk.END, "Review:\n{}\n".format(review))
            # get the sentiment
            review_text.insert(tk.END, "Sentiment: {}\n".format(sentiment))
            
            #logic for giving recommendations on the handling procedures for the review
            #positive
            if sentiment == 'positive':
                message = "Message: This is a positive review. You should consider reaching out to learn more about what was so great about the stay and thank them for leaving a good review."
            #negative
            elif sentiment == 'negative':
                message = "Message: This is a negative review. You should consider reaching out to learn more about their bad experience and offer them a 10% discount for their next stay. Also thank them for the feedback."
            #neutral or other
            else:
                message = "Message: Sentiment is uncertain. You may want to connect with them for more information."
            
            #add the message to the text output for presentation
            review_text.insert(tk.END, message + "\n\n")
    #if no reviews input and button clicked
    else:
        #give an error to the user
        messagebox.showerror("Error", "Please enter a review.")


# initialize the tkinter root window
root = tk.Tk()
#provide a title for the GUI
root.title("Short-Term Rental Review Sentiment Analysis")

# give the label for the GUI input
label = tk.Label(root, text="Enter your Short-Term Rental review (separate multiple reviews with a ';')")
#add the label to the window
label.pack()

#define the form entry size
entry = tk.Entry(root, width=50)
#update the entry
entry.pack()

#define the button and actions from the button click - in this case we will run the get_sentiment function
analyze_button = tk.Button(root, text="Determine Sentiment", command=get_sentiment)
#update the button
analyze_button.pack()

# Size the text widget which will be used to display the review analysis output - including the resolution message 
review_text = tk.Text(root, height=10, width=50)
# update the text details
review_text.pack()

#loop the root GUI to keep the window running
root.mainloop()



As can be seen when you run the code, a GUI is created that lets the user enter reviews (multiple reviews are separated with a ";") and get the predicted sentiment for each one, along with a recommendation on how to respond. This demonstrates that a sentiment analysis tool like this can be built, and it completes the minimum viable product.
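The input handling and the recommendation logic can also be separated from the GUI so they are easier to test on their own. The following is a minimal sketch, not the notebook's implementation: the helpers `split_reviews` and `recommendation_for` and the shortened message texts are hypothetical. It drops blank segments (for example, from a trailing ";") and uses a dictionary lookup with a fallback for any label other than positive or negative.

```python
def split_reviews(raw_text, separator=";"):
    """Split the entry text into individual reviews, dropping blank segments."""
    return [part.strip() for part in raw_text.split(separator) if part.strip()]


# Shortened, illustrative versions of the handling recommendations
RECOMMENDATIONS = {
    "positive": "Thank the guest and ask what made the stay great.",
    "negative": "Reach out about the bad experience and thank them for the feedback.",
}
UNCERTAIN = "Sentiment is uncertain. Consider asking the guest for more information."


def recommendation_for(sentiment):
    """Return the handling recommendation for a predicted sentiment label."""
    return RECOMMENDATIONS.get(sentiment, UNCERTAIN)


reviews = split_reviews("Great stay; ;Awful place;")
print(reviews)
print(recommendation_for("neutral"))
```

Keeping these pieces as plain functions means the same logic could serve both the 3-class and the 2-class GUI, with the fallback message covering any label the model returns that has no specific recommendation.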

Next, I will build a 2-class model GUI for demonstration purposes.

In [60]:
#first I need to define a function that will run on the button click. 
# this will include the preprocessing, vectorization, and model execution for the sentiment analysis

# function to preprocess the reviews and predict the sentiment of each one
def determine_preprocess_2(review):
    # split the input on ";" so multiple reviews can be entered - the input needs to separate reviews with ";"
    reviews_list_2 = review.split(";")
    #initialize the results list to be used to append the results
    reviews_list_processed_2 = []
    
    #loop through the reviews and analyze individually
    for review in reviews_list_2:
        # Preprocess each review individually
        processed_review_2 = preprocess_no_pos(review) 
        tokenizer_RNN_noPOS_prod_2 = Tokenizer()
        # wrap the review in a list so the tokenizer treats it as one document instead of iterating over its pieces
        tokenizer_RNN_noPOS_prod_2.fit_on_texts([processed_review_2])
        # use the tokenizer fitted on the line above (the original mixed it with the earlier test tokenizer)
        review_prepped_RNN_noPOS_prod_2 = tokenizer_RNN_noPOS_prod_2.texts_to_sequences([processed_review_2])
        # pad (or truncate) the sequence to a fixed length of 30 tokens
        review_prepped_RNN_noPOS_padded_prod_2 = pad_sequences(review_prepped_RNN_noPOS_prod_2, maxlen=30)
        # predict the sentiment of the review
        reviews_list_predictions_2 = model_simpleRNN_over.predict(review_prepped_RNN_noPOS_padded_prod_2)
        # convert this review's prediction back to positive or negative with the 2-class label encoder
        sentiment_labels_2 = LabelEncoderSentiment_updated.inverse_transform_sentiment_updated(reviews_list_predictions_2.argmax(axis=1))
        #append the results
        reviews_list_processed_2.append((review, sentiment_labels_2[0])) 
    #return the results
    return reviews_list_processed_2

# create a function for the sentiment analysis
def get_sentiment_2():
    #get the entry from the form 
    user_reviews_list_2 = entry_2.get()
    # if there are reviews, then analyze them 
    if user_reviews_list_2:
        #determine sentiment
        sentiment_out_2 = determine_preprocess_2(user_reviews_list_2)
        #clear the previous review info to ensure each review is looked at individually
        review_text.delete(1.0, tk.END)
        
        # analyse the review
        for review, sentiment in sentiment_out_2:
            # in the text output that is part of the GUI, we will now get the values needed in our presentation of results
            # get the review
            review_text.insert(tk.END, "Review:\n{}\n".format(review))
            # get the sentiment
            review_text.insert(tk.END, "Sentiment: {}\n".format(sentiment))
            
            #logic for giving recommendations on the handling procedures for the review
            #positive
            if sentiment == 'positive':
                message = "Message: This is a positive review. You should consider reaching out to learn more about what was so great about the stay and thank them for leaving a good review."
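            #negative
            elif sentiment == 'negative':
                message = "Message: This is a negative review. You should consider reaching out to learn more about their bad experience and offer them a 10% discount for their next stay. Also thank them for the feedback."
            #any other label - ensures message is always defined
            else:
                message = "Message: Sentiment is uncertain. You may want to connect with them for more information."
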
            #add the message to the text output for presentation
            review_text.insert(tk.END, message + "\n\n")
    #if no reviews input and button clicked
    else:
        #give an error to the user
        messagebox.showerror("Error", "Please enter a review.")


# initialize the tkinter root window
root_2 = tk.Tk()
#provide a title for the GUI
root_2.title("Short-Term Rental Review Sentiment Analysis")

# give the label for the GUI input
label_2 = tk.Label(root_2, text="Enter your Short-Term Rental review (separate multiple reviews with a ';')")
#add the label to the window
label_2.pack()

#define the form entry size
entry_2 = tk.Entry(root_2, width=50)
#update the entry
entry_2.pack()

#define the button and actions from the button click - in this case we will run the get_sentiment function
analyze_button_2 = tk.Button(root_2, text="Determine Sentiment", command=get_sentiment_2)
#update the button
analyze_button_2.pack()

# Size the text widget which will be used to display the review analysis output - including the resolution message 
review_text = tk.Text(root_2, height=10, width=50)
# update the text details
review_text.pack()

#loop the root GUI to keep the window running
root_2.mainloop()



While the 2-class predictor performed better, it was still not as accurate as the Random Forest model. 