# Sentiment Analysis on Short-Term Rental Reviews

## Part 1: Loading Data into Python Dataframe

In [1]:
# In starting out with the product developement, the first step will be getting the data into a dataframe.
# This will allow me to pre-process the data with Pandas built in functions [1].
# In order to do this, pandas must be added to the notebook. 
import pandas as pd

In [2]:
# next, I will read in the CSV into a pandas dataframe
file = "Hotel_Reviews.csv"
original_data = pd.read_csv(file)

In [3]:
#show some of the data to start to understand it better. 
original_data.head(3)

Unnamed: 0,Hotel_Address,Additional_Number_of_Scoring,Review_Date,Average_Score,Hotel_Name,Reviewer_Nationality,Negative_Review,Review_Total_Negative_Word_Counts,Total_Number_of_Reviews,Positive_Review,Review_Total_Positive_Word_Counts,Total_Number_of_Reviews_Reviewer_Has_Given,Reviewer_Score,Tags,days_since_review,lat,lng
0,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,8/3/2017,7.7,Hotel Arena,Russia,I am so angry that i made this post available...,397,1403,Only the park outside of the hotel was beauti...,11,7,2.9,"[' Leisure trip ', ' Couple ', ' Duplex Double...",0 days,52.360576,4.915968
1,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,8/3/2017,7.7,Hotel Arena,Ireland,No Negative,0,1403,No real complaints the hotel was great great ...,105,7,7.5,"[' Leisure trip ', ' Couple ', ' Duplex Double...",0 days,52.360576,4.915968
2,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,7/31/2017,7.7,Hotel Arena,Australia,Rooms are nice but for elderly a bit difficul...,42,1403,Location was good and staff were ok It is cut...,21,9,7.1,"[' Leisure trip ', ' Family with young childre...",3 days,52.360576,4.915968


In [4]:
# for sentiment analysis, we will want to combine the data into a new dataframe that will be used for pre-processing.
# the fields that are relevant are the positive review, negative review, and the reviewer score
# reviewer score can be used for labelling our sentiment as per the review 

original_data["Total_Review"] = original_data["Negative_Review"] + " " + original_data["Positive_Review"]

columns = ["Total_Review", "Reviewer_Score"]

review_data = original_data[columns]

review_data.head(3)

Unnamed: 0,Total_Review,Reviewer_Score
0,I am so angry that i made this post available...,2.9
1,No Negative No real complaints the hotel was ...,7.5
2,Rooms are nice but for elderly a bit difficul...,7.1


## Preprocessing The Review Data for Machine Learning

This section will be used to explore all the different preprocessing options and build functions that will work for my data cleaning and preprocessing pipeline. Once completed, I will optimize it into one function that can be used and adjusted more effectively for determining the best preprocessing techniques for the sentiment analysis. 

While there will be lots of work through this section that should be reviewed, the code will all be commented out and the optimized function will be the function I work with to allow me to change steps easier and get new results to see the impact of preprocessing steps. 

In [9]:
# Next, I will be using nltk to pre-process the reviews for machine learning applications. 
# I will also use regular expressions to remove special characters

# import nltk
# import re

In [50]:
# first I want to remove special characters
# I will define a function for this

# def preprocess_remove_special(review):
#     #to remove all special characters and ensure they are all alphabetic, I will use regular expression substitution
#     # i will just remove them in this case if there are any
#     review_handled = re.sub(r'[^a-zA-Z\s]', '', review)
#     #to extend the functionality I will also lowercase all the text
#     review_lower = review_handled.lower()
#     return review_lower

# #in order to compare I want to make a copy of the original data and then compare it to the processed later on
# #also, just to be safe so that we don't currupt data and have backups
# review_data_copy = review_data.copy()

# review_data_copy.loc[:, 'Total_Review_handled'] = review_data["Total_Review"].apply(preprocess_remove_special)

# review_data_copy

Unnamed: 0,Total_Review,Reviewer_Score,Total_Review_handled
0,I am so angry that i made this post available...,2.9,i am so angry that i made this post available...
1,No Negative No real complaints the hotel was ...,7.5,no negative no real complaints the hotel was ...
2,Rooms are nice but for elderly a bit difficul...,7.1,rooms are nice but for elderly a bit difficul...
3,My room was dirty and I was afraid to walk ba...,3.8,my room was dirty and i was afraid to walk ba...
4,You When I booked with your company on line y...,6.7,you when i booked with your company on line y...
...,...,...,...
515733,no trolly or staff to help you take the lugga...,7.0,no trolly or staff to help you take the lugga...
515734,The hotel looks like 3 but surely not 4 Bre...,5.8,the hotel looks like but surely not break...
515735,The ac was useless It was a hot week in vienn...,2.5,the ac was useless it was a hot week in vienn...
515736,No Negative The rooms are enormous and really...,8.8,no negative the rooms are enormous and really...


In [51]:
# next, I want to remove the "no positive" and "no negative" as this does not add any value to the text 
# it also may be confusing if it sees "positive" but it is actually not positive - since it adds little
# value I can remove it - ran into issues here - regex = true was needed 
# see source: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html [2]

# review_data_stripped = review_data_copy.copy()

# review_data_stripped['Total_Review_handled'] = review_data_stripped["Total_Review_handled"].replace(['no negative', 'no positive'], '', regex=True)

# review_data_stripped

Unnamed: 0,Total_Review,Reviewer_Score,Total_Review_handled
0,I am so angry that i made this post available...,2.9,i am so angry that i made this post available...
1,No Negative No real complaints the hotel was ...,7.5,no real complaints the hotel was great great...
2,Rooms are nice but for elderly a bit difficul...,7.1,rooms are nice but for elderly a bit difficul...
3,My room was dirty and I was afraid to walk ba...,3.8,my room was dirty and i was afraid to walk ba...
4,You When I booked with your company on line y...,6.7,you when i booked with your company on line y...
...,...,...,...
515733,no trolly or staff to help you take the lugga...,7.0,no trolly or staff to help you take the lugga...
515734,The hotel looks like 3 but surely not 4 Bre...,5.8,the hotel looks like but surely not break...
515735,The ac was useless It was a hot week in vienn...,2.5,the ac was useless it was a hot week in vienn...
515736,No Negative The rooms are enormous and really...,8.8,the rooms are enormous and really comfortabl...


In [52]:
# the next preprocessing techniques I am going to perform are tokenization, remove stopwords, and lemmatize
# in order to do this, I will still use nltk and import the relevant libraries
# I will use the WordNetLemmatizer first as I have most experience with it from the courses. 

# from nltk.tokenize import word_tokenize
# from nltk.corpus import stopwords
# from nltk.stem import WordNetLemmatizer

In [53]:
# first I will define a function for tokenizing the data
# I will use nltk first with the punkt sentance tokenizer - and may consider SpaCy in the future to compare
# coding refernce: nltk.org/_modules/nltk/tokenize/punkt.html

# download the NLTK Punkt resource
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/coreyreid/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [54]:
# Define the function for tokenizing the column "Total review handled"

# def preprocess_tokenize(review):
#     bag_of_words = word_tokenize(review)
#     return bag_of_words

# review_data_tokenized = review_data_stripped.copy()

# review_data_tokenized['Total_Review_handled'] = review_data_tokenized["Total_Review_handled"].apply(preprocess_tokenize)

# review_data_tokenized

Unnamed: 0,Total_Review,Reviewer_Score,Total_Review_handled
0,I am so angry that i made this post available...,2.9,"[i, am, so, angry, that, i, made, this, post, ..."
1,No Negative No real complaints the hotel was ...,7.5,"[no, real, complaints, the, hotel, was, great,..."
2,Rooms are nice but for elderly a bit difficul...,7.1,"[rooms, are, nice, but, for, elderly, a, bit, ..."
3,My room was dirty and I was afraid to walk ba...,3.8,"[my, room, was, dirty, and, i, was, afraid, to..."
4,You When I booked with your company on line y...,6.7,"[you, when, i, booked, with, your, company, on..."
...,...,...,...
515733,no trolly or staff to help you take the lugga...,7.0,"[no, trolly, or, staff, to, help, you, take, t..."
515734,The hotel looks like 3 but surely not 4 Bre...,5.8,"[the, hotel, looks, like, but, surely, not, br..."
515735,The ac was useless It was a hot week in vienn...,2.5,"[the, ac, was, useless, it, was, a, hot, week,..."
515736,No Negative The rooms are enormous and really...,8.8,"[the, rooms, are, enormous, and, really, comfo..."


In [55]:
# next, I will remove all stopwords from Total_Review_handled
# first, I need to download the stopwords from nltk

# nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/coreyreid/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [56]:
# now I will create a function to remove stopwords

# def preprocess_stopwords(review):
#     #define the language for the stopwords
#     words = stopwords.words('english')
#     #set the stopword data to check against
#     set_stop_words = set(words)
    
#     #create the loop to check all the words in the review against the stopword text
#     removed_stopwords_words = []
#     #loop through the review text and if it is not a stopword append it to a new list
#     for word in review:
#         if word not in set_stop_words:
#             removed_stopwords_words.append(word)
#     # return the new list with stopwords removed
#     return removed_stopwords_words

In [57]:
# now I will apply the stopword removal function to my dataframe column Total_Review_handled

# review_data_removed = review_data_tokenized.copy()

# review_data_removed['Total_Review_handled'] = review_data_removed["Total_Review_handled"].apply(preprocess_stopwords)

# review_data_removed

Unnamed: 0,Total_Review,Reviewer_Score,Total_Review_handled
0,I am so angry that i made this post available...,2.9,"[angry, made, post, available, via, possible, ..."
1,No Negative No real complaints the hotel was ...,7.5,"[real, complaints, hotel, great, great, locati..."
2,Rooms are nice but for elderly a bit difficul...,7.1,"[rooms, nice, elderly, bit, difficult, rooms, ..."
3,My room was dirty and I was afraid to walk ba...,3.8,"[room, dirty, afraid, walk, barefoot, floor, l..."
4,You When I booked with your company on line y...,6.7,"[booked, company, line, showed, pictures, room..."
...,...,...,...
515733,no trolly or staff to help you take the lugga...,7.0,"[trolly, staff, help, take, luggage, room, loc..."
515734,The hotel looks like 3 but surely not 4 Bre...,5.8,"[hotel, looks, like, surely, breakfast, ok, go..."
515735,The ac was useless It was a hot week in vienn...,2.5,"[ac, useless, hot, week, vienna, gave, hot, air]"
515736,No Negative The rooms are enormous and really...,8.8,"[rooms, enormous, really, comfortable, believe..."


In [58]:
# next, I want to lemmatize the column Total_Review_handled
# first I will download the needed nltk library, which will be wordnet in this application

# nltk.download('wordnet')

# #need to explicitly download this - found when exception was raised executing the lemmatizer code
# nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/coreyreid/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/coreyreid/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [59]:
# Now that I have the nltk library I will use, I will build a function for lemmatizing the column

# def preprocess_lemmatize(review):
    
#     #need to initiate the lemmatizer model
#     lemmatizer = WordNetLemmatizer()
    
#     #now I will lemmatize each review
#     lemmatized_review = [lemmatizer.lemmatize(word) for word in review ]
    
#     #return the lemmatized review
#     return lemmatized_review

In [60]:
# implement the function on the Total_Review_handled column

# review_data_lemmatized = review_data_removed.copy()

# review_data_lemmatized['Total_Review_handled'] = review_data_lemmatized["Total_Review_handled"].apply(preprocess_lemmatize)

# review_data_lemmatized

Unnamed: 0,Total_Review,Reviewer_Score,Total_Review_handled
0,I am so angry that i made this post available...,2.9,"[angry, made, post, available, via, possible, ..."
1,No Negative No real complaints the hotel was ...,7.5,"[real, complaint, hotel, great, great, locatio..."
2,Rooms are nice but for elderly a bit difficul...,7.1,"[room, nice, elderly, bit, difficult, room, tw..."
3,My room was dirty and I was afraid to walk ba...,3.8,"[room, dirty, afraid, walk, barefoot, floor, l..."
4,You When I booked with your company on line y...,6.7,"[booked, company, line, showed, picture, room,..."
...,...,...,...
515733,no trolly or staff to help you take the lugga...,7.0,"[trolly, staff, help, take, luggage, room, loc..."
515734,The hotel looks like 3 but surely not 4 Bre...,5.8,"[hotel, look, like, surely, breakfast, ok, got..."
515735,The ac was useless It was a hot week in vienn...,2.5,"[ac, useless, hot, week, vienna, gave, hot, air]"
515736,No Negative The rooms are enormous and really...,8.8,"[room, enormous, really, comfortable, believe,..."


In [62]:
# As a final step in my data preprocessing pipeline, I am going to implement POS tagging 

#first, I need to download the tagger I want to use

# nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/coreyreid/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [64]:
# now I will complete the POS tagger function

# def preprocess_tagger(review):
#     tagged_review = nltk.pos_tag(review)
#     return tagged_review

In [65]:
# finally, I will add a column that shows the POS tags

# review_data_tagged = review_data_lemmatized.copy()

# review_data_tagged['POS_Tag'] = review_data_tagged["Total_Review_handled"].apply(preprocess_tagger)

# review_data_tagged

Unnamed: 0,Total_Review,Reviewer_Score,Total_Review_handled,POS_Tag
0,I am so angry that i made this post available...,2.9,"[angry, made, post, available, via, possible, ...","[(angry, JJ), (made, VBD), (post, NN), (availa..."
1,No Negative No real complaints the hotel was ...,7.5,"[real, complaint, hotel, great, great, locatio...","[(real, JJ), (complaint, NN), (hotel, NN), (gr..."
2,Rooms are nice but for elderly a bit difficul...,7.1,"[room, nice, elderly, bit, difficult, room, tw...","[(room, NN), (nice, RB), (elderly, JJ), (bit, ..."
3,My room was dirty and I was afraid to walk ba...,3.8,"[room, dirty, afraid, walk, barefoot, floor, l...","[(room, NN), (dirty, NN), (afraid, JJ), (walk,..."
4,You When I booked with your company on line y...,6.7,"[booked, company, line, showed, picture, room,...","[(booked, VBN), (company, NN), (line, NN), (sh..."
...,...,...,...,...
515733,no trolly or staff to help you take the lugga...,7.0,"[trolly, staff, help, take, luggage, room, loc...","[(trolly, RB), (staff, NN), (help, NN), (take,..."
515734,The hotel looks like 3 but surely not 4 Bre...,5.8,"[hotel, look, like, surely, breakfast, ok, got...","[(hotel, NN), (look, NN), (like, IN), (surely,..."
515735,The ac was useless It was a hot week in vienn...,2.5,"[ac, useless, hot, week, vienna, gave, hot, air]","[(ac, JJ), (useless, JJ), (hot, JJ), (week, NN..."
515736,No Negative The rooms are enormous and really...,8.8,"[room, enormous, really, comfortable, believe,...","[(room, NN), (enormous, JJ), (really, RB), (co..."


In [69]:
# Now that I can confirm the cleaning and preprocessing was successful, I will reduce the data to two columns
# the POS_Tag column and the Reviewer_Score column
# first I will copy to have a backup
# review_data_final = review_data_tagged.copy()

# #next I will remove columns from the copy
# review_data_final = review_data_final.drop(columns=['Total_Review', 'Total_Review_handled'])

# review_data_final = review_data_final[['POS_Tag', 'Reviewer_Score']]

# review_data_final

Unnamed: 0,POS_Tag,Reviewer_Score
0,"[(angry, JJ), (made, VBD), (post, NN), (availa...",2.9
1,"[(real, JJ), (complaint, NN), (hotel, NN), (gr...",7.5
2,"[(room, NN), (nice, RB), (elderly, JJ), (bit, ...",7.1
3,"[(room, NN), (dirty, NN), (afraid, JJ), (walk,...",3.8
4,"[(booked, VBN), (company, NN), (line, NN), (sh...",6.7
...,...,...
515733,"[(trolly, RB), (staff, NN), (help, NN), (take,...",7.0
515734,"[(hotel, NN), (look, NN), (like, IN), (surely,...",5.8
515735,"[(ac, JJ), (useless, JJ), (hot, JJ), (week, NN...",2.5
515736,"[(room, NN), (enormous, JJ), (really, RB), (co...",8.8


## Part 3: Optimizing the Pre-Processing into one Function

This section will be used for developing final preprocessing pipelines that can be used in my ML development. There will be two because one will be with using POS tagging and the other will be without.

In [5]:
#import the relevant libraries and download the relevant nltk dependancies [3]
import nltk
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

#download the needed nltk toolkits
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')

# optimized all-encompassing function for preprocessing that can easily be manipulated for tests
def preprocess_pos(review):
    
    #preprocess remove special characters and ensure lowercase 
    def preprocess_remove_special(review):
        review_handled = re.sub(r'[^a-zA-Z\s]', '', review)
        review_lower = review_handled.lower()
        return review_lower

    #tokenize the words
    def preprocess_tokenize(review):
        bag_of_words = word_tokenize(review)
        return bag_of_words

    #remove the stopwords
    def preprocess_stopwords(review):
        words = stopwords.words('english')
        set_stop_words = set(words)
        removed_stopwords_words = []
        for word in review:
            if word not in set_stop_words:
                removed_stopwords_words.append(word)
        return removed_stopwords_words

    #lemmatize
    def preprocess_lemmatize(review):
        lemmatizer = WordNetLemmatizer()
        lemmatized_review = [lemmatizer.lemmatize(word) for word in review ]
        return lemmatized_review

    #POS Tagging
    def preprocess_tagger(review):
        tagged_review = nltk.pos_tag(review)
        return tagged_review


    #execute the individual part functions within this bigger function
    review_data_copy = review_data.copy()
    review_data_copy['Total_Review_handled'] = review_data_copy["Total_Review"].apply(preprocess_remove_special)
    review_data_copy['Total_Review_handled'] = review_data_copy['Total_Review_handled'].apply(preprocess_tokenize)
    review_data_copy['Total_Review_handled'] = review_data_copy['Total_Review_handled'].apply(preprocess_stopwords)
    review_data_copy['Total_Review_handled'] = review_data_copy['Total_Review_handled'].apply(preprocess_lemmatize)
    review_data_copy['POS_Tag'] = review_data_copy['Total_Review_handled'].apply(preprocess_tagger)

    #create the final data
    review_data_final = review_data_copy[['POS_Tag', 'Reviewer_Score']]
    
    #return the data
    return review_data_final

[nltk_data] Downloading package punkt to /Users/coreyreid/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/coreyreid/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/coreyreid/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/coreyreid/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/coreyreid/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [6]:
# Now, I will create a similar function without POS Tagging and we will compare two techniques
# with and without POS Tagging

def preprocess_no_pos(review):
    
    # remove special characters and ensure it is lowercase
    def preprocess_remove_special(review):
        review_handled = re.sub(r'[^a-zA-Z\s]', '', review)
        review_lower = review_handled.lower()
        return review_lower

    #tokenize
    def preprocess_tokenize(review):
        bag_of_words = word_tokenize(review)
        return bag_of_words

    #remove stopwords
    def preprocess_stopwords(review):
        words = stopwords.words('english')
        set_stop_words = set(words)
        removed_stopwords_words = []
        for word in review:
            if word not in set_stop_words:
                removed_stopwords_words.append(word)
        return removed_stopwords_words

    #lemmatize
    def preprocess_lemmatize(review):
        lemmatizer = WordNetLemmatizer()
        lemmatized_review = [lemmatizer.lemmatize(word) for word in review ]
        return lemmatized_review

    #execute functions within the bigger function
    review_data_copy = review_data.copy()
    review_data_copy['Total_Review_handled'] = review_data_copy["Total_Review"].apply(preprocess_remove_special)
    review_data_copy['Total_Review_handled'] = review_data_copy['Total_Review_handled'].apply(preprocess_tokenize)
    review_data_copy['Total_Review_handled'] = review_data_copy['Total_Review_handled'].apply(preprocess_stopwords)
    review_data_copy['Total_Review_handled'] = review_data_copy['Total_Review_handled'].apply(preprocess_lemmatize)

    #finalize data
    review_data_final = review_data_copy[['Total_Review_handled', 'Reviewer_Score']]

    #return the preprocessed data
    return review_data_final

In [7]:
#next I will process the data with POS Tagging to compare to the previous section
#these should be the same as it is just an optimized function
preprocessed_reviews_pos_tagging = preprocess_pos(review_data)
preprocessed_reviews_pos_tagging

Unnamed: 0,POS_Tag,Reviewer_Score
0,"[(angry, JJ), (made, VBD), (post, NN), (availa...",2.9
1,"[(negative, JJ), (real, JJ), (complaint, NN), ...",7.5
2,"[(room, NN), (nice, RB), (elderly, JJ), (bit, ...",7.1
3,"[(room, NN), (dirty, NN), (afraid, JJ), (walk,...",3.8
4,"[(booked, VBN), (company, NN), (line, NN), (sh...",6.7
...,...,...
515733,"[(trolly, RB), (staff, NN), (help, NN), (take,...",7.0
515734,"[(hotel, NN), (look, NN), (like, IN), (surely,...",5.8
515735,"[(ac, JJ), (useless, JJ), (hot, JJ), (week, NN...",2.5
515736,"[(negative, JJ), (room, NN), (enormous, JJ), (...",8.8


In [8]:
#last, I will process the review data without POS Tagging, the processing will end at lemmatizing. 
preprocessed_reviews_no_pos_tagging = preprocess_no_pos(review_data)
preprocessed_reviews_no_pos_tagging

Unnamed: 0,Total_Review_handled,Reviewer_Score
0,"[angry, made, post, available, via, possible, ...",2.9
1,"[negative, real, complaint, hotel, great, grea...",7.5
2,"[room, nice, elderly, bit, difficult, room, tw...",7.1
3,"[room, dirty, afraid, walk, barefoot, floor, l...",3.8
4,"[booked, company, line, showed, picture, room,...",6.7
...,...,...
515733,"[trolly, staff, help, take, luggage, room, loc...",7.0
515734,"[hotel, look, like, surely, breakfast, ok, got...",5.8
515735,"[ac, useless, hot, week, vienna, gave, hot, ai...",2.5
515736,"[negative, room, enormous, really, comfortable...",8.8


These two dataframes will be the two types of preprocessing methods I will consider in each of the models. This will show the difference between using POS tags and not using them. Additionally, we will now have two cases for exploring the impact on machine learning outcomes. 

The dataframes both are showing the data as expected so we can now proceed to implementing different machine learning models. We will then compare the results and determine which machine learning model is best for sentiment analysis of the Short-Term Rental review data. 

## Part 4: Classifying the Review Score Data for Machine Learning Labels

In [9]:
# first I will define the thresholds - our labels will be positive, negative, and neutral 
# I decided to use a small window for neutral, between 4.5 and 5.5 - this will ensure most the review classifications
# are sensative - it can be updated later if desired by changing the following threshold values

positive_threshold = 5.5
negative_threshold = 4.5

# next, I will classify by building a classification function

def classify_scores(score_value):
    if score_value >= positive_threshold:
        return 'positive'
    elif score_value <= negative_threshold:
        return 'negative'
    else:
        return 'neutral'
    
preprocessed_reviews_pos_tagging_final = preprocessed_reviews_pos_tagging.copy()
    
preprocessed_reviews_pos_tagging_final['Reviewer_Score'] = preprocessed_reviews_pos_tagging_final['Reviewer_Score'].apply(classify_scores)

preprocessed_reviews_pos_tagging_final

Unnamed: 0,POS_Tag,Reviewer_Score
0,"[(angry, JJ), (made, VBD), (post, NN), (availa...",negative
1,"[(negative, JJ), (real, JJ), (complaint, NN), ...",positive
2,"[(room, NN), (nice, RB), (elderly, JJ), (bit, ...",positive
3,"[(room, NN), (dirty, NN), (afraid, JJ), (walk,...",negative
4,"[(booked, VBN), (company, NN), (line, NN), (sh...",positive
...,...,...
515733,"[(trolly, RB), (staff, NN), (help, NN), (take,...",positive
515734,"[(hotel, NN), (look, NN), (like, IN), (surely,...",positive
515735,"[(ac, JJ), (useless, JJ), (hot, JJ), (week, NN...",negative
515736,"[(negative, JJ), (room, NN), (enormous, JJ), (...",positive


In [10]:
# I will also do the same for the no POS tagging scenario

preprocessed_reviews_no_pos_tagging_final = preprocessed_reviews_no_pos_tagging.copy()
    
preprocessed_reviews_no_pos_tagging_final['Reviewer_Score'] = preprocessed_reviews_no_pos_tagging_final['Reviewer_Score'].apply(classify_scores)

preprocessed_reviews_no_pos_tagging_final

Unnamed: 0,Total_Review_handled,Reviewer_Score
0,"[angry, made, post, available, via, possible, ...",negative
1,"[negative, real, complaint, hotel, great, grea...",positive
2,"[room, nice, elderly, bit, difficult, room, tw...",positive
3,"[room, dirty, afraid, walk, barefoot, floor, l...",negative
4,"[booked, company, line, showed, picture, room,...",positive
...,...,...
515733,"[trolly, staff, help, take, luggage, room, loc...",positive
515734,"[hotel, look, like, surely, breakfast, ok, got...",positive
515735,"[ac, useless, hot, week, vienna, gave, hot, ai...",negative
515736,"[negative, room, enormous, really, comfortable...",positive


In [11]:
# determine the counts of each category
review_counts = preprocessed_reviews_no_pos_tagging_final['Reviewer_Score'].value_counts()
print(review_counts)

positive    475509
neutral      24188
negative     16041
Name: Reviewer_Score, dtype: int64


The data has now been preprocessed and categorized. As can be seen, the dataset is very unbalanced as most of the data is positive reviews at this threshold. This may be problematic in some of our machine learning models that we will be exploring. There are techniques for unbalanced datasets, so we may need to employ some of those to get better classification. 

## Part 5: Exploring Multinomial Naive Bayes Outcomes for Sentiment Analysis

In [12]:
# The first machine learning algorithm I am going to explore is Multinomial Naive Bayes. 
# First, I will need to perform feature extraction this will be done using TF-IDF (Term Frequency-Inverse Document Frequency) 
# Also, I will need to split my data into training and testing data. 
# last, I will train the multinomial naive bayes algorithm and make predictions on the test data. 

#import the models I need for this machine learning algorithm - sklearn will be used [4]
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# First I will perform the training on the POS_tagged data
# prepare the data back into strings which will be needed for Multinomial Naive bayes 
reviews_MNB = preprocessed_reviews_pos_tagging_final.copy()
reviews_MNB['Review'] = reviews_MNB['POS_Tag'].apply(lambda tokens: ' '.join([f'{word}_{pos}' for word, pos in tokens]))

#initialize the tfidf vectorizer 
vectorizer = TfidfVectorizer()

# vectorize the combined POS and Text from the review text and get the score values for y
X_combined = vectorizer.fit_transform(reviews_MNB['Review'])
y = reviews_MNB['Reviewer_Score']

#split into test and training data
X_train, X_test, y_train, y_test = train_test_split(X_combined, y, test_size=0.2, random_state=42)

#now, I need to initialize the classifier and train it. 
multi_naive_bayes = MultinomialNB()
multi_naive_bayes.fit(X_train, y_train)

#using the trained model, we will make predictions on the test set for final evaluation of the effectiveness 
y_predicted = multi_naive_bayes.predict(X_test)

In [13]:
# Last, I will evaluate the data using SKLearn metrics [5]
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# run metric functions for the predicted data
MNB_postag_report = classification_report(y_test, y_predicted)
MNB_postag_confusion_matrix = confusion_matrix(y_test, y_predicted)
MNB_postag_accuracy = accuracy_score(y_test, y_predicted)

#print the results
print("Classification Report:\n", MNB_postag_report)
print("Confusion Matrix:\n", MNB_postag_confusion_matrix)
print("Accuracy:", MNB_postag_accuracy)


Classification Report:
               precision    recall  f1-score   support

    negative       0.00      0.00      0.00      3198
     neutral       0.08      0.00      0.00      4981
    positive       0.92      1.00      0.96     94969

    accuracy                           0.92    103148
   macro avg       0.33      0.33      0.32    103148
weighted avg       0.85      0.92      0.88    103148

Confusion Matrix:
 [[    0     2  3196]
 [    3     3  4975]
 [    4    33 94932]]
Accuracy: 0.92037654632179


In [14]:
# this did not work very well, so we need to try and oversample the negative classification due to imbalance
# to do this I will use SMOTE and try to improve the outcome with oversampling [6]
!pip install -U imbalanced-learn



In [15]:
#import the library needed - SMOTE [7]
from imblearn.over_sampling import SMOTE

# Initialize SMOTE
smote = SMOTE(sampling_strategy='auto', random_state=42)

# Fit and transform the data
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

#now, I need to initialize the classifier and train it. 
multi_naive_bayes_resampled = MultinomialNB()
multi_naive_bayes_resampled.fit(X_resampled, y_resampled)

#using the trained model, we will make predictions on the test set for final evaluation of the effectiveness 
y_predicted_resampled = multi_naive_bayes_resampled.predict(X_test)

#evaluate
MNB_postag_report_resampled = classification_report(y_test, y_predicted_resampled)
MNB_postag_confusion_matrix_resampled = confusion_matrix(y_test, y_predicted_resampled)
MNB_postag_accuracy_resampled = accuracy_score(y_test, y_predicted_resampled)

#print results
print("Classification Report:\n", MNB_postag_report_resampled)
print("Confusion Matrix:\n", MNB_postag_confusion_matrix_resampled)
print("Accuracy:", MNB_postag_accuracy_resampled)

Classification Report:
               precision    recall  f1-score   support

    negative       0.18      0.53      0.27      3198
     neutral       0.14      0.51      0.22      4981
    positive       0.98      0.78      0.87     94969

    accuracy                           0.76    103148
   macro avg       0.43      0.61      0.45    103148
weighted avg       0.92      0.76      0.82    103148

Confusion Matrix:
 [[ 1694  1207   297]
 [ 1483  2557   941]
 [ 6310 14255 74404]]
Accuracy: 0.7625450808546942


When resampling was added, it can be seen that the overall accuracy decreased. This can be misleading, however, because before it had no success finding any of the negatives at all. Therefore, because the data was so imbalanced, it could get lucky predicting positive but it pretty much predicted positive for every value. Here, after the resampling took place, the precision and recall for the negative and neutral classes improved, as was expected because the dataset became more balanced. Also, the number of True Negatives increased in the confusion matrix, meaning that it has become better at finding the actual negative reviews. While the results aren't great, the model itself may not be the best fit for the inbalance dataset we are using. One solution for this may be to increase the threshold of what is considered a positive and negative review. This may be a possible solution because we really only care about the really good reviews, and anything that is below a higher threshold like 7 means that excellence was not achieved. I may consider doing this if none of the models work well with the imbalanced dataset. 

In [16]:
# last, I am going to try to find a better smoothing value for this imbalanced data to see if I can improve this
# this will be done using Grid Search - first I will do it on the original data and not resampled to see if I can
# improve the outcome - if so, I will see if resampling and adjusting the alpha value helps more.[8]

from sklearn.model_selection import GridSearchCV

# Range of alpha values to explore
alpha_values = [0.1, 1, 5, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 40, 50]

# Define the paramater grid
param_grid = {'alpha': alpha_values}

# Initialize the MultinomialNB classifier
multi_naive_bayes_smoothing = MultinomialNB()

# Perform grid search with cross-validation
grid_search = GridSearchCV(multi_naive_bayes_smoothing, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# show the best alpha value
optimal_alpha = grid_search.best_params_['alpha']

#show for confirmation
print(optimal_alpha)

11


In [17]:
# we will now train with this alpha on the original data without oversampling
# this is done using Multinomial Naive Bayes [9]
multi_naive_bayes_optimized_smoothing = MultinomialNB(alpha=optimal_alpha)

# this is meant to use the initial train values, as we are first testing on the original data before oversampling
multi_naive_bayes_optimized_smoothing.fit(X_train, y_train)

#using the trained model, we will make predictions on the test set for final evaluation of the effectiveness 
y_predicted_optimized_smoothing = multi_naive_bayes_optimized_smoothing.predict(X_test)

#evaluate the results
MNB_postag_report_optimized_smoothing = classification_report(y_test, y_predicted_optimized_smoothing)
MNB_postag_confusion_matrix_optimized_smoothing = confusion_matrix(y_test, y_predicted_optimized_smoothing)
MNB_postag_accuracy_optimized_smoothing = accuracy_score(y_test, y_predicted_optimized_smoothing)

#print the results
print("Classification Report:\n", MNB_postag_report_optimized_smoothing)
print("Confusion Matrix:\n", MNB_postag_confusion_matrix_optimized_smoothing)
print("Accuracy:", MNB_postag_accuracy_optimized_smoothing)

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Classification Report:
               precision    recall  f1-score   support

    negative       0.00      0.00      0.00      3198
     neutral       1.00      0.00      0.00      4981
    positive       0.92      1.00      0.96     94969

    accuracy                           0.92    103148
   macro avg       0.64      0.33      0.32    103148
weighted avg       0.90      0.92      0.88    103148

Confusion Matrix:
 [[    0     0  3198]
 [    0     1  4980]
 [    0     0 94969]]
Accuracy: 0.9207158645829294


In [18]:
# this didn't impact the values of the original model with the imbalanced data - I will try it on the oversampled data
# to see if there is an improvement there
multi_naive_bayes_optimized_smoothing_oversampled = MultinomialNB(alpha=optimal_alpha)
multi_naive_bayes_optimized_smoothing_oversampled.fit(X_resampled, y_resampled)

#using the trained model, we will make predictions on the test set for final evaluation of the effectiveness 
y_predicted_optimized_smoothing_oversampled = multi_naive_bayes_optimized_smoothing_oversampled.predict(X_test)

#evaluate
MNB_postag_report_optimized_smoothing_oversampled = classification_report(y_test, y_predicted_optimized_smoothing_oversampled)
MNB_postag_confusion_matrix_optimized_smoothing_oversampled = confusion_matrix(y_test, y_predicted_optimized_smoothing_oversampled)
MNB_postag_accuracy_optimized_smoothing_oversampled = accuracy_score(y_test, y_predicted_optimized_smoothing_oversampled)

#print results
print("Classification Report:\n", MNB_postag_report_optimized_smoothing_oversampled)
print("Confusion Matrix:\n", MNB_postag_confusion_matrix_optimized_smoothing_oversampled)
print("Accuracy:", MNB_postag_accuracy_optimized_smoothing_oversampled)

Classification Report:
               precision    recall  f1-score   support

    negative       0.20      0.65      0.30      3198
     neutral       0.17      0.30      0.22      4981
    positive       0.97      0.86      0.91     94969

    accuracy                           0.83    103148
   macro avg       0.45      0.60      0.48    103148
weighted avg       0.91      0.83      0.86    103148

Confusion Matrix:
 [[ 2065   615   518]
 [ 1855  1476  1650]
 [ 6654  6452 81863]]
Accuracy: 0.8279753364098189


In [19]:
# this clearly improved the accuracy from just oversampling, and more True Negatives are now being found. 
# I will compare this now against the no POS Tagged data and see which performs better between POS and no POS
# prepare the data back into strings which will be needed for Multinomial Naive bayes 
reviews_MNB_noPOS = preprocessed_reviews_no_pos_tagging_final.copy()
reviews_MNB_noPOS['Total_Review_handled'] = reviews_MNB_noPOS['Total_Review_handled'].apply(lambda words: ' '.join(words))

#initialize the tfidf vectorizer 
vectorizer_noPOS = TfidfVectorizer()

# vectorize
X = vectorizer_noPOS.fit_transform(reviews_MNB_noPOS['Total_Review_handled'])
y = reviews_MNB['Reviewer_Score']

#split into test and training data
X_train_noPOS, X_test_noPOS, y_train_noPOS, y_test_noPOS = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize SMOTE
smote_noPOS = SMOTE(sampling_strategy='auto', random_state=42)
# Fit and transform the data
X_resampled_noPOS, y_resampled_noPOS = smote_noPOS.fit_resample(X_train_noPOS, y_train_noPOS)

# we will now train with the optimized alpha
multi_naive_bayes_optimized_smoothing_noPOS = MultinomialNB(alpha=11)
multi_naive_bayes_optimized_smoothing_noPOS.fit(X_resampled_noPOS, y_resampled_noPOS)

#using the trained model, we will make predictions on the test set for final evaluation of the effectiveness 
y_predicted_noPOS = multi_naive_bayes_optimized_smoothing_noPOS.predict(X_test_noPOS)

# evaluate
MNB_report_optimized_smoothing = classification_report(y_test_noPOS, y_predicted_noPOS)
MNB_confusion_matrix_optimized_smoothing = confusion_matrix(y_test_noPOS, y_predicted_noPOS)
MNB_accuracy_optimized_smoothing = accuracy_score(y_test_noPOS, y_predicted_noPOS)

#print the results
print("Classification Report:\n", MNB_report_optimized_smoothing)
print("Confusion Matrix:\n", MNB_confusion_matrix_optimized_smoothing)
print("Accuracy:", MNB_accuracy_optimized_smoothing)

Classification Report:
               precision    recall  f1-score   support

    negative       0.20      0.63      0.30      3198
     neutral       0.17      0.33      0.23      4981
    positive       0.98      0.86      0.91     94969

    accuracy                           0.82    103148
   macro avg       0.45      0.61      0.48    103148
weighted avg       0.91      0.82      0.86    103148

Confusion Matrix:
 [[ 2014   694   490]
 [ 1773  1654  1554]
 [ 6440  7275 81254]]
Accuracy: 0.8233024392135572


The results of the scenario when we didn't use POS tagging is slightly lower accuracy than when we did use POS tagging. However, the change is pretty much negligible. This means that the POS tagging has little impact on the results of the classification when using multinomial naive bayes. The biggest issue with this model is that the data is so imbalanced it is having trouble with finding the negative because it does not have enough negative examples to train with. One solution would be to skew the thresholds. In reality, only really good reviews (above say, 8) would be considered reviews that are worth reaching out to so you can highlight the review and get more info about what they loved. Additionally, when people choose lower than a 7, generally they aren't thrilled with their experience. 

To try and see if we can improve the accuracy of the classifier with this dataset, we could adjust the thresholds and only consider above 9 as a very positive review, and anything below 7 as a negative review worth reaching out to to collect data on about how to improve the guest experience. Doing so, with an simple Multinomial Naive Bayes implementation would yield the following results.

In [20]:
#adjust the thresholds to try and force a better balance in the data
positive_threshold = 9.0
negative_threshold = 7.0

# next, I will classify the scores by building a classification function
def classify_scores(score_value):
    if score_value >= positive_threshold:
        return 'positive'
    elif score_value <= negative_threshold:
        return 'negative'
    else:
        return 'neutral'

# I will use this function to categorize and provide labels to the Review_Score data
preprocessed_reviews_pos_tagging_skewed = preprocessed_reviews_pos_tagging.copy()   
preprocessed_reviews_pos_tagging_skewed['Reviewer_Score'] = preprocessed_reviews_pos_tagging_skewed['Reviewer_Score'].apply(classify_scores)
preprocessed_reviews_pos_tagging_skewed

Unnamed: 0,POS_Tag,Reviewer_Score
0,"[(angry, JJ), (made, VBD), (post, NN), (availa...",negative
1,"[(negative, JJ), (real, JJ), (complaint, NN), ...",neutral
2,"[(room, NN), (nice, RB), (elderly, JJ), (bit, ...",neutral
3,"[(room, NN), (dirty, NN), (afraid, JJ), (walk,...",negative
4,"[(booked, VBN), (company, NN), (line, NN), (sh...",negative
...,...,...
515733,"[(trolly, RB), (staff, NN), (help, NN), (take,...",negative
515734,"[(hotel, NN), (look, NN), (like, IN), (surely,...",negative
515735,"[(ac, JJ), (useless, JJ), (hot, JJ), (week, NN...",negative
515736,"[(negative, JJ), (room, NN), (enormous, JJ), (...",neutral


In [21]:
# determine the counts of each category to see balance of data
review_counts_skewed = preprocessed_reviews_pos_tagging_skewed['Reviewer_Score'].value_counts()
print(review_counts_skewed)

positive    247037
neutral     181439
negative     87262
Name: Reviewer_Score, dtype: int64


In [22]:
# when compared to the original thresholds, this is more balanced, but not much more 
# we will now explore the impact on the classification
reviews_MNB_skewed = preprocessed_reviews_pos_tagging_skewed.copy()
reviews_MNB_skewed['Review'] = reviews_MNB_skewed['POS_Tag'].apply(lambda tokens: ' '.join([f'{word}_{pos}' for word, pos in tokens]))

#initialize the tfidf vectorizer 
vectorizer_skewed = TfidfVectorizer()

# vectorize the combined POS and Text from the review text and get the score values for y
X_skewed = vectorizer_skewed.fit_transform(reviews_MNB_skewed['Review'])
y_skewed = reviews_MNB['Reviewer_Score']

#split into test and training data
X_train_skewed, X_test_skewed, y_train_skewed, y_test_skewed = train_test_split(X_skewed, y_skewed, test_size=0.2, random_state=42)

# Initialize SMOTE
smote_skewed = SMOTE(sampling_strategy='auto', random_state=42)
# Fit and transform the data
X_resampled_skewed, y_resampled_skewed = smote_skewed.fit_resample(X_train_skewed, y_train_skewed)

In [23]:
# Range of alpha values to explore
alpha_values_skewed = [0.01, 0.05, 0.1, 0.125, 0.15]

# Define the paramater grid
param_grid_skewed = {'alpha': alpha_values_skewed}

# Initialize the MultinomialNB classifier
multi_naive_bayes_skewed_smoothing = MultinomialNB()

# Perform grid search with cross-validation
grid_search_skewed = GridSearchCV(multi_naive_bayes_skewed_smoothing, param_grid_skewed, cv=5)
grid_search_skewed.fit(X_resampled_skewed, y_resampled_skewed)

# show the best alpha value
optimal_alpha_skewed = grid_search_skewed.best_params_['alpha']

#print for confirmation
print(optimal_alpha_skewed)

0.01


In [24]:
# we will now train 
multi_naive_bayes_skewed = MultinomialNB(alpha=optimal_alpha_skewed)
multi_naive_bayes_skewed.fit(X_resampled_skewed, y_resampled_skewed)

#using the trained model, we will make predictions on the test set for final evaluation of the effectiveness 
y_predicted_skewed = multi_naive_bayes_skewed.predict(X_test_skewed)

#evaluate results
MNB_report_skewed = classification_report(y_test_skewed, y_predicted_skewed)
MNB_confusion_matrix_skewed = confusion_matrix(y_test_skewed, y_predicted_skewed)
MNB_accuracy_skewed = accuracy_score(y_test_skewed, y_predicted_skewed)

#print the results
print("Classification Report:\n", MNB_report_skewed)
print("Confusion Matrix:\n", MNB_confusion_matrix_skewed)
print("Accuracy:", MNB_accuracy_skewed)

Classification Report:
               precision    recall  f1-score   support

    negative       0.17      0.46      0.25      3198
     neutral       0.13      0.46      0.20      4981
    positive       0.97      0.79      0.87     94969

    accuracy                           0.76    103148
   macro avg       0.43      0.57      0.44    103148
weighted avg       0.91      0.76      0.82    103148

Confusion Matrix:
 [[ 1465  1184   549]
 [ 1284  2312  1385]
 [ 5772 14203 74994]]
Accuracy: 0.7636696785201846


Even with the skewed data, the results are the same as the original with oversampling and optimized alpha, therefore, the conclusion is that in order to improve the results we may need to take further imbalance techniques or access more negative review data to further balance the data. When the data was skewed even further, the data became a bit more balanced, but it was using unreasonable thresholds and the results didn't improve all that much. This makes me think that we may want to explore better models for dealing with our data, find more data for training the model, or figure out better preprocessing techniques for this model to improve the results above 82.7%. 

## Part 6: Exploring Random Forest Outcomes for Sentiment Analysis

In [25]:
#To start, I need to import the classifier library for Random Forest [10]
from sklearn.ensemble import RandomForestClassifier

# confirm the dataframes are still in order
# preprocessed_reviews_pos_tagging_final
# preprocessed_reviews_no_pos_tagging_final

#Copy the data so we use a different then original version, but the same values
#This will ensure this section has it's own data
reviews_pos_tagging_RF = preprocessed_reviews_pos_tagging_final.copy()
reviews_no_pos_tagging_RF = preprocessed_reviews_no_pos_tagging_final.copy()

In [26]:
# first I will work on the pos tagged version - turn it into strings with the tags
reviews_pos_tagging_RF['Review'] = reviews_pos_tagging_RF['POS_Tag'].apply(lambda tokens: ' '.join([f'{word}_{pos}' for word, pos in tokens]))

# next, I will vectorize the data with TF-IDF
RF_vectorizer = TfidfVectorizer()  # Adjust max_features as needed
X_RF = RF_vectorizer.fit_transform(reviews_pos_tagging_RF['Review'])
y_RF = reviews_pos_tagging_RF['Reviewer_Score']

In [27]:
# now we need to split the data for Random Forest
X_RF_train, X_RF_test, y_RF_train, y_RF_test = train_test_split(X_RF, y_RF, test_size=0.2, random_state=42)

In [28]:
#initialize and train the Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_RF_train, y_RF_train)

# determine the predictions without any balancing techniques first to confirm if it performs well on original data
y_RF_pred = rf_model.predict(X_RF_test)

#evaluate results
RF_report = classification_report(y_RF_test, y_RF_pred)
RF_confusion_matrix = confusion_matrix(y_RF_test, y_RF_pred)
RF_accuracy = accuracy_score(y_RF_test, y_RF_pred)

#print the results
print("Classification Report:\n", RF_report)
print("Confusion Matrix:\n", RF_confusion_matrix)
print("Accuracy:", RF_accuracy)

Classification Report:
               precision    recall  f1-score   support

    negative       0.69      0.03      0.05      3198
     neutral       0.35      0.01      0.01      4981
    positive       0.92      1.00      0.96     94969

    accuracy                           0.92    103148
   macro avg       0.65      0.34      0.34    103148
weighted avg       0.89      0.92      0.88    103148

Confusion Matrix:
 [[   81     6  3111]
 [   10    28  4943]
 [   27    45 94897]]
Accuracy: 0.9210648776515299


Our results are skewed still with Random Forest due to the imbalanced dataset. I will again resample the data to see if I can improve the results for the negative and neutral cases. 

In [29]:
# to oversample, I am going to try another method called random oversampler [11][12]
from imblearn.over_sampling import RandomOverSampler

# initiate the oversampler
oversampler = RandomOverSampler(random_state=42)

#create the oversampled data
X_RF_resampled, y_RF_resampled = oversampler.fit_resample(X_RF_train, y_RF_train)

# now we need to split the data for Random Forest
X_RF_resampled, X_RF_test_resampled, y_RF_resampled, y_RF_test_resampled = train_test_split(X_RF_resampled, y_RF_resampled, test_size=0.2, random_state=42)

#initialize and train the Random Forest model
rf_model_resampled = RandomForestClassifier(n_estimators=80, max_depth=90, random_state=42)
rf_model_resampled.fit(X_RF_resampled, y_RF_resampled)

In [30]:
# determine the predictions without any balancing techniques first to confirm if it performs well on original data
y_RF_pred_resampled = rf_model_resampled.predict(X_RF_test_resampled)

#evaluate results
RF_report_resampled = classification_report(y_RF_test_resampled, y_RF_pred_resampled)
RF_confusion_matrix_resampled = confusion_matrix(y_RF_test_resampled, y_RF_pred_resampled)
RF_accuracy_resampled = accuracy_score(y_RF_test_resampled, y_RF_pred_resampled)

#print results
print("Classification Report:\n", RF_report_resampled)
print("Confusion Matrix:\n", RF_confusion_matrix_resampled)
print("Accuracy:", RF_accuracy_resampled)

Classification Report:
               precision    recall  f1-score   support

    negative       0.96      0.92      0.94     75933
     neutral       0.91      0.91      0.91     76009
    positive       0.88      0.92      0.90     76382

    accuracy                           0.92    228324
   macro avg       0.92      0.92      0.92    228324
weighted avg       0.92      0.92      0.92    228324

Confusion Matrix:
 [[69543  2213  4177]
 [ 1283 69318  5408]
 [ 1768  4465 70149]]
Accuracy: 0.9154096809796605


Next, we will look at the case with no POS tagging, so that we can see the impact on POS tagging for RF. 

In [31]:
# no POS tags version
#  next I will work on the no pos tagged version - turn it into strings with the tags
reviews_no_pos_tagging_RF['Total_Review_handled'] = reviews_no_pos_tagging_RF['Total_Review_handled'].apply(lambda x: ' '.join(x))

# vectorize the data with TF-IDF [13]
RF_vectorizer_noPOS = TfidfVectorizer() 
X_RF_noPOS = RF_vectorizer_noPOS.fit_transform(reviews_no_pos_tagging_RF['Total_Review_handled'])
y_RF_noPOS = reviews_no_pos_tagging_RF['Reviewer_Score']

In [32]:
# now we need to split the data for Random Forest
X_RF_noPOS_train, X_RF_noPOS_test, y_RF_noPOS_train, y_RF_noPOS_test = train_test_split(X_RF_noPOS, y_RF_noPOS, test_size=0.2, random_state=42)

In [33]:
#initialize and train the Random Forest model
rf_model_noPOS = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model_noPOS.fit(X_RF_noPOS_train, y_RF_noPOS_train)

# determine the predictions without any balancing techniques first to confirm if it performs well on original data
y_RF_noPOS_pred = rf_model_noPOS.predict(X_RF_noPOS_test)

#evaluate results
RF_report_noPOS = classification_report(y_RF_noPOS_test, y_RF_noPOS_pred)
RF_confusion_matrix_noPOS = confusion_matrix(y_RF_noPOS_test, y_RF_noPOS_pred)
RF_accuracy_noPOS = accuracy_score(y_RF_noPOS_test, y_RF_noPOS_pred)

#print results
print("Classification Report:\n", RF_report_noPOS)
print("Confusion Matrix:\n", RF_confusion_matrix_noPOS)
print("Accuracy:", RF_accuracy_noPOS)

Classification Report:
               precision    recall  f1-score   support

    negative       0.65      0.03      0.05      3198
     neutral       0.37      0.01      0.01      4981
    positive       0.92      1.00      0.96     94969

    accuracy                           0.92    103148
   macro avg       0.65      0.34      0.34    103148
weighted avg       0.89      0.92      0.89    103148

Confusion Matrix:
 [[   87     4  3107]
 [   15    29  4937]
 [   32    46 94891]]
Accuracy: 0.9210745724589909


The outcome of no POS tagging was similar to with POS tags. However, the imbalanced data set is not working very well on the RF model. I will try resampling this as well to see if I can improve the results. 

In [34]:
# initiate the new oversampler for no POS tags
oversampler_noPOS = RandomOverSampler(random_state=42)

#create the oversampled data
X_RF_resampled_noPOS, y_RF_resampled_noPOS = oversampler_noPOS.fit_resample(X_RF_noPOS_train, y_RF_noPOS_train)

# now we need to split the data for Random Forest
X_RF_resampled_noPOS, X_RF_test_resampled_noPOS, y_RF_resampled_noPOS, y_RF_test_resampled_noPOS = train_test_split(X_RF_resampled_noPOS, y_RF_resampled_noPOS, test_size=0.2, random_state=42)

#initialize and train the Random Forest model
rf_model_resampled_noPOS = RandomForestClassifier(n_estimators=80, max_depth=90, random_state=42)
rf_model_resampled_noPOS.fit(X_RF_resampled_noPOS, y_RF_resampled_noPOS)

In [35]:
# determine the predictions without any balancing techniques first to confirm if it performs well on original data
y_RF_pred_resampled_noPOS = rf_model_resampled_noPOS.predict(X_RF_test_resampled_noPOS)

#evaluate results
RF_report_resampled_noPOS = classification_report(y_RF_test_resampled_noPOS, y_RF_pred_resampled_noPOS)
RF_confusion_matrix_resampled_noPOS = confusion_matrix(y_RF_test_resampled_noPOS, y_RF_pred_resampled_noPOS)
RF_accuracy_resampled_noPOS = accuracy_score(y_RF_test_resampled_noPOS, y_RF_pred_resampled_noPOS)

#print results
print("Classification Report:\n", RF_report_resampled_noPOS)
print("Confusion Matrix:\n", RF_confusion_matrix_resampled_noPOS)
print("Accuracy:", RF_accuracy_resampled_noPOS)

Classification Report:
               precision    recall  f1-score   support

    negative       0.96      0.94      0.95     75933
     neutral       0.93      0.94      0.93     76009
    positive       0.92      0.93      0.92     76382

    accuracy                           0.94    228324
   macro avg       0.94      0.94      0.94    228324
weighted avg       0.94      0.94      0.94    228324

Confusion Matrix:
 [[71564  1467  2902]
 [ 1276 71081  3652]
 [ 1783  3641 70958]]
Accuracy: 0.9355258317128291


## Part 7: Exploring Recurrent Neural Networks (RNN) Outcomes for Sentiment Analysis

In [36]:
# confirm the dataframes are still in order
# preprocessed_reviews_pos_tagging_final
# preprocessed_reviews_no_pos_tagging_final

reviews_pos_tagging_RNN = preprocessed_reviews_pos_tagging_final.copy()
reviews_no_pos_tagging_RNN = preprocessed_reviews_no_pos_tagging_final.copy()

In [37]:
# for this type of model I will use TensorFlow and Keras 

#install the deep learning framework Tensor Flow and library Keras [14][15]
!pip install tensorflow
!pip install keras



In [38]:
#In order to use a Recurrent Neural Network (RNN), we need to do a bit more processing on the data to format it the way it is needed for input
# This includes, seperating the words and the POS tags, adding padding to get the length consistent, and
# creating embeddings for the data, then splitting into test and train datasets for the model. 

#first I will import all the libraries needed for this model - I will use different tokenizers than previously
#to show capabilities with different technologies
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.preprocessing import LabelEncoder
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, GRU, Dense #[16][17]
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
import numpy as np

#with an RNN the scores can not be text - they need to be numbers - will encode these now
label_encoder_RNN = LabelEncoder()
encoded_scores_RNN = label_encoder_RNN.fit_transform(reviews_pos_tagging_RNN["Reviewer_Score"])

# get the text and POS data
review_RNN = reviews_pos_tagging_RNN["POS_Tag"]

# prepare the text and POS data for RNN usage
tokenizer_RNN = Tokenizer()
tokenizer_RNN.fit_on_texts([" ".join([rev for rev, _ in seq]) for seq in review_RNN])
sequences_RNN = tokenizer_RNN.texts_to_sequences([" ".join([rev for rev, _ in seq]) for seq in review_RNN])
review_prepped_RNN = pad_sequences(sequences_RNN, maxlen=10)

In [39]:
# prepare the test and train datasets
X_train_RNN, X_test_RNN, y_train_RNN, y_test_RNN = train_test_split(review_prepped_RNN, encoded_scores_RNN, test_size=0.2, random_state=42)


# I will oversample the data as the minority class is not being recognized well
# this was shown by training the model without any additional fine tuning and just the regular preprocessing. 
# smote_RNN = SMOTE(random_state=42)
# X_train_RNN_resampled, y_train_RNN_resampled = smote_RNN.fit_resample(X_train_RNN, y_train_RNN)

#adjust the class weights - it ended up overfitting the data so I will not use this technique [18]
# Compute class weights for balancing the dataset - this will increase the weight for the minority classes
class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(y_train_RNN), y=y_train_RNN)
# Convert class weights to a dictionary for use in the model
class_weights_dict = {i: weight for i, weight in enumerate(class_weights)}
# next we will manually apply class weights to the training data
sample_weights = np.array([class_weights_dict[y] for y in y_train_RNN])


#now that I have prepared the data, I can initiate the model
model_RNN = Sequential()
model_RNN.add(Embedding(input_dim=len(tokenizer_RNN.word_index) + 1, output_dim=128, input_length=10))
model_RNN.add(Bidirectional(LSTM(128)))
model_RNN.add(Dense(3, activation='softmax'))

# Compile the model with the updated class weights
model_RNN.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy'],
    weighted_metrics=['accuracy']
)

In [40]:
#now I will train the model and make predictions
model_RNN.fit(X_train_RNN, y_train_RNN, epochs=5, batch_size=64, validation_split=0.2, sample_weight=sample_weights)
pred_RNN = model_RNN.predict(X_test_RNN)

# convert back to positive, negative, or neutral
pred_RNN_converted_back = label_encoder_RNN.inverse_transform(pred_RNN.argmax(axis=1))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [41]:
#classification metrics didnt work out as we are getting probabilities of each classification from the model
#to fix this, we will convert it to the predicted outcome using numpy [19]
pred_RNN_converted = np.argmax(pred_RNN, axis=1)

#evaluate results from the RNN
RNN_report_POS = classification_report(y_test_RNN, pred_RNN_converted)
RNN_confusion_matrix_POS = confusion_matrix(y_test_RNN, pred_RNN_converted)
RNN_accuracy_POS = accuracy_score(y_test_RNN, pred_RNN_converted)

#print the results
print("Classification Report:\n", RNN_report_POS)
print("Confusion Matrix:\n", RNN_confusion_matrix_POS)
print("Accuracy:", RNN_accuracy_POS)

Classification Report:
               precision    recall  f1-score   support

           0       0.19      0.53      0.28      3198
           1       0.13      0.43      0.20      4981
           2       0.98      0.80      0.88     94969

    accuracy                           0.77    103148
   macro avg       0.43      0.58      0.45    103148
weighted avg       0.91      0.77      0.83    103148

Confusion Matrix:
 [[ 1679  1056   463]
 [ 1455  2142  1384]
 [ 5721 13694 75554]]
Accuracy: 0.7695253422267033


Using POS tags in the preprocessed data, we get an accuracy of 91.9%. This is good, but the Random Forest model still outperformed the RNN with this dataset. 

In [42]:
# next, I will use the RNN model for the no_POS tag scenario. 
#with an RNN the scores can not be text - they need to be numbers - will encode these now
label_encoder_RNN_no_POS = LabelEncoder()
encoded_scores_RNN_no_POS = label_encoder_RNN_no_POS.fit_transform(reviews_no_pos_tagging_RNN["Reviewer_Score"])

# get the review text data
review_RNN_no_POS = reviews_no_pos_tagging_RNN["Total_Review_handled"]

# prepare the review text data for RNN usage
tokenizer_RNN_noPOS = Tokenizer()
tokenizer_RNN_noPOS.fit_on_texts(review_RNN_no_POS)
review_prepped_RNN_noPOS = tokenizer_RNN_noPOS.texts_to_sequences(review_RNN_no_POS)

# pad the text with a set length - used 10 previously so will use it again
review_prepped_RNN_noPOS_padded = pad_sequences(review_prepped_RNN_noPOS, maxlen=10)

#split the data into train and test sets
X_train_RNN_noPOS, X_test_RNN_noPOS, y_train_RNN_noPOS, y_test_RNN_noPOS = train_test_split(review_prepped_RNN_noPOS_padded, encoded_scores_RNN_no_POS, test_size=0.2, random_state=42)

# I will oversample the data as the minority class is not being recognized well
# this was shown by training the model without any additional fine tuning and just the regular preprocessing. 
# smote_RNN_noPOS = SMOTE(random_state=42)
# X_train_RNN_resampled_noPOS, y_train_RNN_resampled_noPOS = smote_RNN_noPOS.fit_resample(X_train_RNN_noPOS, y_train_RNN_noPOS)

#adjust the class weights
# Compute class weights for balancing the dataset - this will increase the weight for the minority classes
class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(y_train_RNN_noPOS), y=y_train_RNN_noPOS)
# Convert class weights to a dictionary for use in the model
class_weights_dict = {i: weight for i, weight in enumerate(class_weights)}
# next we will manually apply class weights to the training data
sample_weights = np.array([class_weights_dict[y] for y in y_train_RNN_noPOS])

In [43]:
# next I will train the new RNN model for noPOS
#initiate
model_RNN_noPOS = Sequential()
model_RNN_noPOS.add(Embedding(input_dim=len(tokenizer_RNN_noPOS.word_index) + 1, output_dim=128, input_length=10))
model_RNN_noPOS.add(Bidirectional(LSTM(128)))
model_RNN_noPOS.add(Dense(3, activation='softmax'))  # 3 classes for "positive," "negative," and "neutral"

# Compile the model with the updated class weights
model_RNN_noPOS.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy'],
    weighted_metrics=['accuracy']
)

In [44]:
#now I will train the model and make predictions
model_RNN_noPOS.fit(X_train_RNN_noPOS, y_train_RNN_noPOS, epochs=5, batch_size=64, validation_split=0.2, sample_weight=sample_weights)
pred_RNN_noPOS = model_RNN_noPOS.predict(X_test_RNN_noPOS)

# convert back to positive, negative, or neutral
pred_RNN_noPOS_converted_back = label_encoder_RNN_no_POS.inverse_transform(pred_RNN_noPOS.argmax(axis=1))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [45]:
#classification metrics didnt work out as we are getting probabilities of each classification from the model
#to fix this, we will convert it to the predicted outcome using numpy
pred_RNN_noPOS_converted = np.argmax(pred_RNN_noPOS, axis=1)

#evaluate results from the RNN
RNN_report_noPOS = classification_report(y_test_RNN_noPOS, pred_RNN_noPOS_converted)
RNN_confusion_matrix_noPOS = confusion_matrix(y_test_RNN_noPOS, pred_RNN_noPOS_converted)
RNN_accuracy_noPOS = accuracy_score(y_test_RNN_noPOS, pred_RNN_noPOS_converted)

#print the results
print("Classification Report:\n", RNN_report_noPOS)
print("Confusion Matrix:\n", RNN_confusion_matrix_noPOS)
print("Accuracy:", RNN_accuracy_noPOS)

Classification Report:
               precision    recall  f1-score   support

           0       0.18      0.53      0.27      3198
           1       0.13      0.41      0.20      4981
           2       0.98      0.81      0.88     94969

    accuracy                           0.78    103148
   macro avg       0.43      0.58      0.45    103148
weighted avg       0.91      0.78      0.83    103148

Confusion Matrix:
 [[ 1693  1036   469]
 [ 1556  2018  1407]
 [ 6234 11940 76795]]
Accuracy: 0.7804901694652344


While the POS tagged data performed better, the results were still slightly less than the Random Forest model. While they are very similar, it seems the Random Forest model with no POS tagging in the preprocessed data performed the best out of all three models tested (Multinomial Naive Bayes, Random Forest, and a Recurrent Neural Network (RNN). It is possible that with further fine tuning some of these models can be improved even further, however, I will proceed with using the best model found through this exploratory exercise for the development of the sentiment analysis tool for Short-Term Rental reviews. 

The summary of all the results is shown below:

In [46]:
print("Multinomial Naive Bayes Results:\n")

print("Classification Report Multinomial Naive Bayes - POS Tags:\n", MNB_postag_report_optimized_smoothing_oversampled)
print("Confusion Matrix Multinomial Naive Bayes - POS Tags:\n", MNB_postag_confusion_matrix_optimized_smoothing_oversampled)
print("Accuracy Multinomial Naive Bayes - POS Tags:\n", MNB_postag_accuracy_optimized_smoothing_oversampled)

print("\n\n")

print("Classification Report Multinomial Naive Bayes - no POS Tags:\n", MNB_report_optimized_smoothing)
print("Confusion Matrix Multinomial Naive Bayes - no POS Tags:\n", MNB_confusion_matrix_optimized_smoothing)
print("Accuracy Multinomial Naive Bayes - no POS Tags:\n", MNB_accuracy_optimized_smoothing)

print("\n\n")

print("Random Forest Results:\n")

print("Classification Report Random Forest - POS Tags:\n", RF_report_resampled)
print("Confusion Matrix Random Forest - POS Tags:\n", RF_confusion_matrix_resampled)
print("Accuracy Random Forest - POS Tags:\n", RF_accuracy_resampled)

print("\n\n")

print("Classification Report Random Forest - no POS Tags:\n", RF_report_resampled_noPOS)
print("Confusion Matrix Random Forest - no POS Tags:\n", RF_confusion_matrix_resampled_noPOS)
print("Accuracy Random Forest - no POS Tags:\n", RF_accuracy_resampled_noPOS)

print("\n\n")

print("Recurrent Neural Network RNN Results:\n")

print("Classification Report Recurrent Neural Network RNN - POS Tags:\n", RNN_report_POS)
print("Confusion Matrix Recurrent Neural Network RNN - POS Tags:\n", RNN_confusion_matrix_POS)
print("Accuracy Recurrent Neural Network RNN - POS Tags:\n", RNN_accuracy_POS)

print("\n\n")

print("Classification Report Recurrent Neural Network RNN - no POS Tags:\n", RNN_report_noPOS)
print("Confusion Matrix Recurrent Neural Network RNN - no POS Tags:\n", RNN_confusion_matrix_noPOS)
print("Accuracy Recurrent Neural Network RNN - no POS Tags:\n", RNN_accuracy_noPOS)

Multinomial Naive Bayes Results:

Classification Report Multinomial Naive Bayes - POS Tags:
               precision    recall  f1-score   support

    negative       0.20      0.65      0.30      3198
     neutral       0.17      0.30      0.22      4981
    positive       0.97      0.86      0.91     94969

    accuracy                           0.83    103148
   macro avg       0.45      0.60      0.48    103148
weighted avg       0.91      0.83      0.86    103148

Confusion Matrix Multinomial Naive Bayes - POS Tags:
 [[ 2065   615   518]
 [ 1855  1476  1650]
 [ 6654  6452 81863]]
Accuracy Multinomial Naive Bayes - POS Tags:
 0.8279753364098189



Classification Report Multinomial Naive Bayes - no POS Tags:
               precision    recall  f1-score   support

    negative       0.20      0.63      0.30      3198
     neutral       0.17      0.33      0.23      4981
    positive       0.98      0.86      0.91     94969

    accuracy                           0.82    103148
   mac

As can be seen in the summary, my above statement is confirmed: The best results were using Random Forest without the POS tags in the dataset. This will be the model I use for sentiment analysis with Short-Term Rental Reviews

## Part 8: Scraping Reviews to Get New Data From a Website

In [47]:
#ensure libraries are installed for scraping needs [20][21]
!pip install beautifulsoup4
!pip install --upgrade selenium



In [48]:
# import relevant libraries
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
import os

In [49]:
# next, I need to specify the driver path for the web driver for Selenium to work
#set up selenium webdriver [21]
service = Service('/Users/coreyreid/Documents/Final Project/Final Project - Implementation & Evaluation/chromedriver')
options = webdriver.ChromeOptions()
#options.add_argument('--headless') #to make it headless and not run the chrome instance
driver = webdriver.Chrome(service=service, options=options)

#example website to use for the scraping
url = "https://www.mainecottagekeepers.com/2-beautiful-water-view-properties-in-1-for-extended-families-orp5b4a2abx"
driver.get(url)

# delay as needed
time.sleep(10)  

# Switch to the iframe containing the reviews - need to switch the focus into the iframe
iframe = driver.find_element(By.CSS_SELECTOR, 'div.ownerrez-widget[data-widgetid="dc0c1ee92566447e81ba5e68a8a2ac1a"] iframe')
driver.switch_to.frame(iframe)

# delay as needed
time.sleep(10)  

# Extract the HTML content
html = driver.page_source

# now, I will scrape the website to expose the review content
soup = BeautifulSoup(html, 'html.parser')
review_elements = soup.find_all('div', class_='review-item')

In [50]:
#now that we have isolated the html element that matters, we will want to go through all the reviews
#and collect just the relevant text
#initialize
real_reviews = []

#loop through the review elements
for review in review_elements:
    # Scrape the review title text
    review_title = review.find("span", class_="review-item-title").strong
    # Scrape the review body text
    review_body = review.find("div", class_="has-read-more").text.strip()
    #check that a review exists
    if review_title and review_body:
        #get review title text 
        review_title_text = review_title.text.strip()
        #get review body text
        review_body_text = review_body
        # put the data in a dataframe
                # Append the scraped data as a dictionary to the 'data' list
        real_reviews.append(
            review_body_text
        )
        #print the details through the loop
        print("The title of the review is:", review_title_text)
        print("The review content is:", review_body_text)
        print("---------------------------------")
    else:
        #error handling
        print("There is no review title or text.")

The title of the review is: 
The review content is: A big thank you from us to be able to stay in this wonderful place in order to be with our family during the Lobstah fest.  We drove from Florida to see our Massachusetts family as we gathered in Maine to welcome our Grandson from the ship that anchored in Rockland for the Festival.  We loved being able to see him as he did us.  We felt at home in this Airbnb as we gathered at the fire pit and the deck table for meals. We did several meals at the Dip Net too.  Great food and very close to house.This house is very well equipped, clean, and is an old farmhouse with a lot of Character.We did host an Airbnb on the Cape at one time and know that Shawn-Elise deserves a 5 star rating to become a super host if she is not already.Would we come again?  You Bet.Sharyn
---------------------------------
The title of the review is: 
The review content is: Great stay. Just didn’t understand why they took 500 for a security deposit
------------------

In [51]:
# test the way the data is structured by looking at the output
real_reviews

['A big thank you from us to be able to stay in this wonderful place in order to be with our family during the Lobstah fest.\xa0 We drove from Florida to see our Massachusetts family as we gathered in Maine to welcome our Grandson from the ship that anchored in Rockland for the Festival.\xa0 We loved being able to see him as he did us.\xa0 We felt at home in this Airbnb as we gathered at the fire pit and the deck table for meals. We did several meals at the Dip Net too.\xa0 Great food and very close to house.This house is very well equipped, clean, and is an old farmhouse with a lot of Character.We did host an Airbnb on the Cape at one time and know that Shawn-Elise deserves a 5 star rating to become a super host if she is not already.Would we come again?\xa0 You Bet.Sharyn',
 'Great stay. Just didn’t understand why they took 500 for a security deposit',
 'The place is absolutely beautiful and there was plenty of room for all 10 f us to sleep. The houses are just a short drive from a b

## Part 9: Testing on Real Review Data

In [52]:
# in order to scale the preprocessing to be able to be used with new reviews, I need to generalize the 
# preprocessing function - I will do that now for the one with no POS tagging

def preprocess_no_pos(review):
    
    #remove special characters and ensure lowercase
    def preprocess_remove_special(review):
        review_handled = re.sub(r'[^a-zA-Z\s]', '', review)
        review_lower = review_handled.lower()
        return review_lower

    #tokenize
    def preprocess_tokenize(review):
        bag_of_words = word_tokenize(review)
        return bag_of_words

    #remove stopwords
    def preprocess_stopwords(review):
        words = stopwords.words('english')
        set_stop_words = set(words)
        removed_stopwords_words = []
        for word in review:
            if word not in set_stop_words:
                removed_stopwords_words.append(word)
        return removed_stopwords_words

    #lemmatize
    def preprocess_lemmatize(review):
        lemmatizer = WordNetLemmatizer()
        lemmatized_review = [lemmatizer.lemmatize(word) for word in review ]
        return lemmatized_review

    # Apply the preprocessing steps
    review = preprocess_remove_special(review)
    review = preprocess_tokenize(review)
    review = preprocess_stopwords(review)
    review = preprocess_lemmatize(review)

    # Join the tokens back into a single string
    preprocessed_review = ' '.join(review)

    #return the preprocessed reviews
    return preprocessed_review

In [53]:
# next, I will preprocess using this function and looping through all the reviews
real_reviews_processed = [preprocess_no_pos(review) for review in real_reviews]

In [54]:
# with this data, I now need to transform it to be used in the model
X_new_reviews = vectorizer_noPOS.transform(real_reviews_processed)

In [55]:
# predict the sentiment of the reviews
real_review_predictions = multi_naive_bayes_optimized_smoothing_noPOS.predict(X_new_reviews)

real_review_predictions

array(['positive', 'negative', 'positive', 'positive', 'positive',
       'positive', 'positive', 'positive', 'positive', 'positive'],
      dtype='<U8')

In [62]:
# turns out they are all positive, which when you look above you can see is the case - this will happen 
# from time to time. I will manually add data to the reviews data to put in a few negative reviews for further 
# testing - this will confirm the model is working properly

updated_real_reviews = real_reviews.copy()

updated_real_reviews.extend(['Awful', 'It was a bad rental unit and was very ugly. The ammenities were all broken.', 'I will never return to this or recommend it to anyone - it was a horrible rental and it ruined my holiday.'])

updated_real_reviews

['A big thank you from us to be able to stay in this wonderful place in order to be with our family during the Lobstah fest.\xa0 We drove from Florida to see our Massachusetts family as we gathered in Maine to welcome our Grandson from the ship that anchored in Rockland for the Festival.\xa0 We loved being able to see him as he did us.\xa0 We felt at home in this Airbnb as we gathered at the fire pit and the deck table for meals. We did several meals at the Dip Net too.\xa0 Great food and very close to house.This house is very well equipped, clean, and is an old farmhouse with a lot of Character.We did host an Airbnb on the Cape at one time and know that Shawn-Elise deserves a 5 star rating to become a super host if she is not already.Would we come again?\xa0 You Bet.Sharyn',
 'Great stay. Just didn’t understand why they took 500 for a security deposit',
 'The place is absolutely beautiful and there was plenty of room for all 10 f us to sleep. The houses are just a short drive from a b

In [63]:
# next, I will test this set with Multinomial Naive Bayes with no POS tags
# to start I will preprocess the data using this function and looping through all the reviews
updated_real_reviews_processed = [preprocess_no_pos(review) for review in updated_real_reviews]

# with this data, I now need to transform it to be used in the model
X_updated_real_reviews = vectorizer_noPOS.transform(updated_real_reviews_processed)

# predict the sentiment of the reviews - we will use the no POS tagging version as it performed slightly better
#than the POS tagging scenario
updated_real_reviews_predictions = multi_naive_bayes_optimized_smoothing_noPOS.predict(X_updated_real_reviews)

#show the results
updated_real_reviews_predictions

array(['positive', 'negative', 'positive', 'positive', 'positive',
       'positive', 'positive', 'positive', 'positive', 'positive',
       'negative', 'negative', 'negative'], dtype='<U8')

In [64]:
# for the RF model - it was seen that the no POS tagging scenario always performed a bit better than with tags.
# we will test the model on real data - the data is already processed above

# predict the sentiment of the reviews
updated_real_reviews_predictions = rf_model_resampled_noPOS.predict(X_updated_real_reviews)

#show the results
updated_real_reviews_predictions

array(['positive', 'positive', 'positive', 'positive', 'positive',
       'positive', 'positive', 'positive', 'positive', 'positive',
       'negative', 'negative', 'negative'], dtype=object)

In [65]:
#last I will test the RNN model with the real review data
# to do this, we will need to process the data a bit more to get it into it's final model training state.
# we will use the no POS tagging version as it always seemed to perform a bit better, even if the difference was
# very small between no POS tags and POS tagged data

# prepare the review text data for RNN usage
tokenizer_RNN_noPOS_test = Tokenizer()
tokenizer_RNN_noPOS_test.fit_on_texts(real_reviews_processed)
review_prepped_RNN_noPOS_test = tokenizer_RNN_noPOS_test.texts_to_sequences(real_reviews_processed)

# pad the text with a set length - used 10 previously so will use it again
review_prepped_RNN_noPOS_padded = pad_sequences(review_prepped_RNN_noPOS_test, maxlen=10)

real_pred_RNN_noPOS_test = model_RNN_noPOS.predict(review_prepped_RNN_noPOS_padded)

# convert back to positive, negative, or neutral
real_pred_RNN_noPOS_converted_back_test = label_encoder_RNN_no_POS.inverse_transform(real_pred_RNN_noPOS_test.argmax(axis=1))

# get the predictions
real_pred_RNN_noPOS_converted_back_test



array(['negative', 'positive', 'positive', 'positive', 'neutral',
       'negative', 'negative', 'positive', 'neutral', 'neutral'],
      dtype=object)

Upon reviewing the 3 different models on real life data, it was determined that the best results on real data come from the Multinomial Naive Bayes model. I will use this in the continuation of the minimum viable product development. 

## Part 10: Product Development 

In [66]:
#Import the necessary GUI libraries
#I will use message box from Tkinter to build the GUI and the Tkinter library in general [22]

import tkinter as tk
from tkinter import messagebox
import joblib
import string

In [67]:
#first I need to define a function that will run on the button click. 
# this will include the preprocessing, vectorization, and model execution for the sentiment analysis

# function to determine the preprocessed state for the reviews
def determine_preprocess(review):
    # Create a list of reviews, for any ";" will seperate the list so input needs to seperate reviews by ";"
    reviews_list = review.split(";")
    #initialize the results list to be used to append the results
    reviews_list_processed = []
    
    #loop through the reviews and analyze individually
    for review in reviews_list:
        # Preprocess each review individually
        processed_review = preprocess_no_pos(review)    
        #transform the review with our vectorizer
        reviews_list_vectorized = RF_vectorizer_noPOS.transform([processed_review])
        # predict the sentiment of the reviews
        reviews_list_predictions = rf_model_resampled_noPOS.predict(reviews_list_vectorized)
        #append the results
        reviews_list_processed.append((review, reviews_list_predictions[0])) 
    #return the results
    return reviews_list_processed

# create a function for the sentiment analysis
def get_sentiment():
    #get the entry from the form 
    user_reviews_list = entry.get()
    # if there are reviews, then analyze them 
    if user_reviews_list:
        #determine sentiment
        sentiment_out = determine_preprocess(user_reviews_list)
        #clear the previous review info to ensure each review is looked at individually
        review_text.delete(1.0, tk.END)
        
        # analyse the review
        for review, sentiment in sentiment_out:
            # in the text output that is part of the GUI, we will now get the values needed in our presentation of results
            # get the review
            review_text.insert(tk.END, "Review:\n{}\n".format(review))
            # get the sentiment
            review_text.insert(tk.END, "Sentiment: {}\n".format(sentiment))
            
            #logic for giving recommendations on the handling procedures for the review
            #positive
            if sentiment == 'positive':
                message = "Message: This is a positive review. You should consider reaching out to learn more about what was so great about the stay and thank them for leaving a good review."
            #negative
            elif sentiment == 'negative':
                message = "Message: This is a negative review. You should consider reaching out to learn more about their bad experience and offer them a 10% discount for their next stay. Also thank them for the feedback."
            #nuetral or other
            else:
                message = "Message: Sentiment is uncertain. You may want to connect with them for more information."
            
            #add the message to the text output for presentation
            review_text.insert(tk.END, message + "\n\n")
    #if no reviews input and button clicked
    else:
        #give an error to the user
        messagebox.showerror("Error", "Please enter a review.")


# initialize the tkinter model
root = tk.Tk()
#provide a title for the GUI
root.title("Short-Term Rental Review Sentiment Analysis")

# give the label for the GUI input
label = tk.Label(root, text="Enter your Short-Term Rental review (seperate multiple reviews with a ';')")
#update the label
label.pack()

#define the form entry size
entry = tk.Entry(root, width=50)
#update the entry
entry.pack()

#define the button and actions from the button click - in this case we will run the get_sentiment function
analyze_button = tk.Button(root, text="Determine Sentiment", command=get_sentiment)
#update the button
analyze_button.pack()

# Size the text widget which will be used to display the review analysis output - including the resolution message 
review_text = tk.Text(root, height=10, width=50)
# update the text details
review_text.pack()

#loop the root GUI to keep the window running
root.mainloop()

As can be seen when you run the code, a GUI is created that allows the user to input reviews - multiple are seperated with a ";" - and get outcomes from the sentiment analysis with recommendations on what to do. This proves that a sentiment analysis tool like this could be created, and the minimal viable product is completed. 