# DTSA 5510 - Unsupervised Algorithms in Machine Learning Final Project
Chris Murphy 07/23/2024

## Introduction 

For my final project, I wanted to build a sentiment analysis model. While brainstorming, I thought back to the BBC News Classification mini project from week 4 and wondered which other unsupervised machine learning techniques could be applied to the same dataset. In that project, the articles were classified into five distinct categories, but what if a model could be built to determine whether a document was a positive story or a negative one? This was an interesting idea that I wanted to learn more about. After some additional research, I found that there are many methodologies for answering this question within the broader field of sentiment analysis.

With my mind made up on undertaking an in-depth sentiment analysis project, I needed some data to work with. I found a few interesting datasets of scraped Twitter data on various topics, but nothing really resonated with me. After thinking some more about the data I could use, I realized I already had everything I needed to complete a sentiment analysis project. In my day job as a Data Analyst for an e-commerce company, I work with transactional Amazon data every day. Sure enough, there was Amazon review data that I could query. This data became the foundation for my final project.

The dataset contains 10,000 distinct reviews from users in the United States who purchased items in various categories. A sample of the data can be viewed below. Here are the columns and a basic overview of what they indicate:

- ID: The unique identifier of the review
- REVIEW_DATE: The date when the review was left
- IS_VERIFIED: Whether or not the review was verified. (I did not end up using this column)
- RATING: The rating the user gave for the product on a 1 to 5 scale
- REVIEW_TITLE: The title of the review 
- REVIEW_TEXT: The review the user left for the product 
- NAME: The country the user is in

## Imports

In [None]:
# standard imports
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# sklearn specific imports
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, accuracy_score
from sklearn.model_selection import train_test_split

from nltk.tokenize import RegexpTokenizer

from langdetect import detect

from textblob import TextBlob

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

## Data import

In [None]:
# data import
reviews = pd.read_csv('data/reviews.csv')

## Basic data cleaning

In [None]:
# here is an initial view of a sample of rows in the dataset
reviews.sample(3)

In [None]:
# From the sample dataframe above, the 'NAME' field is misleading, it will be changed from 'NAME' to 'COUNTRY'
reviews.rename(columns={'NAME': 'COUNTRY'}, inplace = True)

In [None]:
# there are 10,000 reviews in our initial dataset.
# The 'REVIEW_TITLE' column has one null entry which we will remove in a subsequent cell
reviews.info()

In [None]:
# Identifying the only row with a null value. Since we really care about the 'REVIEW_TEXT'
# this row could probably be left in but I will remove it just to be safe
reviews[reviews['REVIEW_TITLE'].isnull()]
reviews.dropna(subset = ['REVIEW_TITLE'], axis = 0, inplace = True)

In [None]:
# double checking that there are no more null values in the updated dataset
reviews.isnull().sum()

In [None]:
# There are two more adjustments that need to be made. The 'REVIEW_DATE' column should be a datetime
# also the 'ID' field should be an object
reviews['REVIEW_DATE'] = pd.to_datetime(reviews['REVIEW_DATE'])
reviews['ID'] = reviews['ID'].astype(object)

### Removing non-English reviews

There is another issue that will impact our results later on in the model testing phase. Even though these reviews are sourced from Amazon.com, some of them are written in other languages. A mask will be created to filter out all non-English reviews.

In [None]:
# The langdetect package can be used to detect the language of a body of text.
# the package can then be used in a basic function to return the particular language of the review.
def detect_language(text):
    try:
        lang = detect(text)
    except Exception:  # detection can fail on empty, numeric-only or non-string input
        lang = 'unknown'
    return lang

In [None]:
# Running the 'REVIEW_TEXT' column of our dataset through the detect_language function from above
reviews['LANGUAGE'] = reviews['REVIEW_TEXT'].apply(detect_language)

In [None]:
# Here are the results of the language detection. We can see that most reviews are English,
# but there are also other languages as determined by the langdetect package
reviews['LANGUAGE'].value_counts()

In [None]:
# Example of a Portuguese review in the dataset.
# This is the kind of review that will be removed
pd.set_option('display.max_colwidth', None)
reviews['REVIEW_TEXT'][reviews['LANGUAGE'] == 'pt'].head(1)

In [None]:
# It should be said that this package is not perfect.
# For example, this review was classified as 'Norwegian' but in reality it is English.
# Instead of going through each example, I decided to leave these edge cases alone
pd.set_option('display.max_colwidth', None)
reviews['REVIEW_TEXT'][reviews['LANGUAGE'] == 'no'].head(1)

In [None]:
# creating a mask to get only English reviews
english_mask = reviews['LANGUAGE'] == 'en'
# .copy() so that the columns added below do not raise a SettingWithCopyWarning
english_reviews = reviews[english_mask].copy()
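
The mask-and-filter step above can be illustrated on a toy frame. This is only a minimal sketch: the `toy` frame and its rows are made up for illustration, and the `.copy()` call is a common precaution rather than something the original cell requires.

```python
import pandas as pd

# Hypothetical stand-in for the reviews frame after language detection
toy = pd.DataFrame({
    "REVIEW_TEXT": ["great product", "muito bom", "works as expected"],
    "LANGUAGE": ["en", "pt", "en"],
})

# Boolean mask: True where the detected language is English
english_mask = toy["LANGUAGE"] == "en"

# .copy() returns an independent frame, so adding columns to it later
# does not trigger pandas' SettingWithCopyWarning
english_only = toy[english_mask].copy()
english_only["word_count"] = english_only["REVIEW_TEXT"].apply(lambda x: len(x.split()))

print(english_only)
```

The original `toy` frame is left untouched; only the filtered copy gains the `word_count` column.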

In [None]:
# calculating the word count of each review. This will be interesting to visualize later on in the notebook
english_reviews['word_count'] = english_reviews['REVIEW_TEXT'].apply(lambda x: len(x.split()))

In [None]:
# from the 'REVIEW_DATE' column, getting the month and year that the review was submitted.
# this will be interesting to visualize
english_reviews['YEAR'] = english_reviews['REVIEW_DATE'].dt.year
english_reviews['MONTH'] = english_reviews['REVIEW_DATE'].dt.month

In [None]:
print(f'The number of rows in the final dataset is: {len(english_reviews)}')

## Visualizations

In [None]:
# This first visualization looks at the count of reviews for each rating
plt.style.use('fivethirtyeight')
fig, ax = plt.subplots(figsize = (25, 10))

# sort by rating so that each bar height lines up with its own rating label
ratings = reviews['RATING'].value_counts().sort_index()
ind = ratings.index

plt.bar(ind, ratings, color = 'firebrick', align = 'center')

for p in ax.patches:
    ax.text(p.get_x() + p.get_width() / 2,
            p.get_height(),
            '{:.0f}'.format(p.get_height()),
            ha='center',
            va='bottom'
           )

plt.title('Count of Ratings')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.show()

In [None]:
# dataset is unbalanced, but that is ok in our context as we will not be exploring the relationship between variables, just the underlying text

In [None]:
reviews.head()

## Visualizations p2

In [None]:
plt.style.use('fivethirtyeight')
fig, ax = plt.subplots(figsize = (25, 10))

sns.histplot(data=english_reviews, x='word_count', hue='RATING', multiple='dodge', legend=True, kde = True, bins = 50)

plt.xlabel('Word count per review')
plt.ylabel('Count')
plt.title('Number of words by review histogram')

In [None]:
plt.style.use('fivethirtyeight')
fig, ax = plt.subplots(figsize = (25, 10))

# group word counts by rating, ordered from 1 to 5 so the boxes read left to right
rating_order = sorted(english_reviews['RATING'].unique())
wc_group = english_reviews.groupby('RATING')['word_count'].apply(list)
data = [wc_group[cat] for cat in rating_order]

bp = plt.boxplot(data, labels = rating_order, patch_artist=True)

colors = ['blue', 'black', 'green', 'yellow', 'purple']
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)

plt.title('Word count boxplot')
plt.xlabel('Review score')
plt.ylabel('Word count')

In [None]:
english_reviews.head()

In [None]:
monthly_avg_rating = english_reviews.groupby(['YEAR', 'MONTH'])['RATING'].mean().reset_index()
filtered_rating = monthly_avg_rating[monthly_avg_rating['YEAR'].isin([2021, 2022, 2023, 2024])]
pivot_df = filtered_rating.pivot(index = 'MONTH', columns = 'YEAR', values='RATING')
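
To make the reshaping above easier to follow, here is a minimal sketch of the same group-then-pivot pattern on a toy frame. The `toy` data is hypothetical; only the column names mirror the notebook.

```python
import pandas as pd

# Hypothetical ratings spread over two years and two months
toy = pd.DataFrame({
    "YEAR": [2023, 2023, 2023, 2024, 2024],
    "MONTH": [1, 1, 2, 1, 2],
    "RATING": [5, 3, 4, 2, 1],
})

# Step 1: one average rating per (year, month) pair, in long format
monthly = toy.groupby(["YEAR", "MONTH"])["RATING"].mean().reset_index()

# Step 2: pivot to wide format, with months as rows and one column per year
pivot = monthly.pivot(index="MONTH", columns="YEAR", values="RATING")

print(pivot)
```

With months on the index and one column per year, each year can then be drawn as its own line over a shared month axis.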

In [None]:
plt.style.use('fivethirtyeight')
fig, ax = plt.subplots(figsize = (25, 10))

for col in pivot_df.columns:
    plt.plot(pivot_df.index, pivot_df[col], marker = 'o', label = col)

plt.xlabel('month')
plt.ylabel('Avg. Rating')
plt.title('Avg. rating over time')
plt.legend()

In [None]:
pd.set_option('display.max_colwidth', None)
english_reviews['REVIEW_TEXT'].sample(1, random_state = 14)

## Count Vectorize

In [None]:
def remove_punctuation(text):
    final_text = []
    for row in text:
        clean_row = "".join(u for u in row if u not in ('?', '.', ';', ':', '!', '"', "'", '(', ')', '[', ']', '/', ',', '-', '*'))
        final_text.append(clean_row.lower())
    
    return final_text

In [None]:
# keep runs of letters and digits as tokens (raw string, no f-string needed)
token = RegexpTokenizer(r'[a-zA-Z0-9]+')

In [None]:
english_reviews['REVIEW_TEXT'] = remove_punctuation(english_reviews['REVIEW_TEXT'])

In [None]:
cv = CountVectorizer(stop_words = 'english', ngram_range = (1,2), tokenizer = token.tokenize)

In [None]:
text_counts = cv.fit_transform(english_reviews['REVIEW_TEXT'])

## Train Test Split

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(text_counts, english_reviews['RATING'], test_size = 0.2, random_state = 42, stratify=english_reviews['RATING'])
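
Because the ratings are unbalanced, the `stratify` argument matters here: it keeps the class proportions the same in the training and test sets. A minimal sketch on hypothetical labels (80 five-star and 20 one-star reviews):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical, unbalanced labels: 80% five-star, 20% one-star
X = list(range(100))
y = [5] * 80 + [1] * 20

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Both splits keep the original 80/20 class ratio
train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print(train_counts, test_counts)
```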

## Training the MNB model 

In [None]:
MNB = MultinomialNB()
MNB.fit(X_train, Y_train)

In [None]:
predicted = MNB.predict(X_test)

In [None]:
# scikit-learn's convention is accuracy_score(y_true, y_pred)
acc_score = accuracy_score(Y_test, predicted)

In [None]:
print(f'The accuracy score of the Multinomial Naive Bayes model is: {round(acc_score, 2)}')
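
Since the ratings are unbalanced, an accuracy figure is easier to interpret next to the score a trivial model would get by always predicting the most common rating. The sketch below computes that baseline; `majority_baseline_accuracy` and `toy_ratings` are hypothetical names and values used only for illustration, not results from this dataset.

```python
from collections import Counter

def majority_baseline_accuracy(y_true):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(y_true)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(y_true)

# Hypothetical ratings: six of the ten reviews are five-star
toy_ratings = [5, 5, 5, 5, 5, 5, 4, 3, 1, 1]
baseline = majority_baseline_accuracy(toy_ratings)

print(f"Majority-class baseline accuracy: {baseline}")
```

Applied to the real labels, this would be called as `majority_baseline_accuracy(list(Y_test))`.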

In [None]:
predicted[:5]

In [None]:
Y_test[:5]

In [None]:
cm = confusion_matrix(Y_test, predicted)

disp = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels = np.unique(Y_test))
disp.plot(cmap = plt.cm.Blues)
plt.title('Confusion Matrix')
plt.show()

### TextBlob

Polarity measures the sentiment of the text on a scale from -1 to 1: -1 denotes a highly negative sentiment and 1 denotes a highly positive sentiment.

Subjectivity measures whether a text reads as fact or personal opinion on a scale from 0 to 1: 0 denotes a fact and 1 denotes a personal opinion.

In [None]:
english_reviews['REVIEW_TEXT'].iloc[284]

In [None]:
p1 = TextBlob(english_reviews['REVIEW_TEXT'].iloc[284]).sentiment.polarity
s1 = TextBlob(english_reviews['REVIEW_TEXT'].iloc[284]).sentiment.subjectivity

In [None]:
print(f'The polarity of the test token is : {round(p1, 4)} and the subjectivity of the test token is: {round(s1, 4)}')

In [None]:
# double quotes outside so the nested single quotes parse on Python versions before 3.12
print(f"The actual rating was: {english_reviews['RATING'].iloc[284]}")

In [None]:
def calculate_polarity(text):
    blob = TextBlob(text)
    return blob.sentiment.polarity

def calculate_subjectivity(text):
    blob = TextBlob(text)
    return blob.sentiment.subjectivity

In [None]:
english_reviews['TEXT_BLOB_POLARITY_RAW'] = english_reviews['REVIEW_TEXT'].apply(calculate_polarity)
english_reviews['TEXT_BLOB_SUBJECTIVITY_RAW'] = english_reviews['REVIEW_TEXT'].apply(calculate_subjectivity)

In [None]:
def map_rating(value):
    mapped_value = (value + 1) * 2.5 + 0.5
    mapped_rating = round(mapped_value)
    mapped_rating = max(1, min(5, mapped_rating))
    return mapped_rating
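
To see what this mapping does in practice, the sketch below repeats the function so that it runs on its own and applies it to a few hand-picked polarity values. The values are hypothetical and not taken from the dataset.

```python
def map_rating(value):
    # Stretch polarity from [-1, 1] to [0.5, 5.5], round, then clamp to 1..5
    mapped_value = (value + 1) * 2.5 + 0.5
    mapped_rating = round(mapped_value)
    return max(1, min(5, mapped_rating))

# Hand-picked polarities covering the whole range
polarities = [-1.0, -0.7, -0.5, 0.0, 0.3, 0.5, 0.7, 1.0]
stars = [map_rating(p) for p in polarities]

for p, s in zip(polarities, stars):
    print(f"polarity {p:+.1f} -> {s} star(s)")
```

In effect, the polarity range is split into five bins of width roughly 0.4 each, so a polarity near 0 lands on 3 stars, and the final clamp keeps the two endpoints inside the 1 to 5 scale.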

In [None]:
english_reviews['TEXT_BLOB_POLARITY_MAPPED'] = english_reviews['TEXT_BLOB_POLARITY_RAW'].apply(map_rating)

In [None]:
english_reviews.sample(1)

In [None]:
english_reviews['TEXT_BLOB_POLARITY_MAPPED'].values

In [None]:
english_reviews['RATING'].values

In [None]:
# scikit-learn's convention is accuracy_score(y_true, y_pred)
acc_score_2 = accuracy_score(english_reviews['RATING'].values, english_reviews['TEXT_BLOB_POLARITY_MAPPED'].values)

In [None]:
print(f'The accuracy score of the \'Text Blob\' model is: {round(acc_score_2, 4)}')

In [None]:
cm = confusion_matrix(english_reviews['RATING'].values, english_reviews['TEXT_BLOB_POLARITY_MAPPED'].values)

disp = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels = np.unique(Y_test))
disp.plot(cmap = plt.cm.Blues)
plt.title('Confusion Matrix')
plt.show()
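
One detail worth noting for matrices like the one above: if the mapped predictions never produce a given rating, `confusion_matrix` only builds rows and columns for the labels it actually sees. Passing an explicit `labels` list is one way to keep the matrix at 5x5 and aligned with the display labels. A minimal sketch on hypothetical values:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted ratings; the rating 2 never appears
y_true = [1, 5, 5, 3]
y_pred = [1, 5, 4, 5]

all_ratings = [1, 2, 3, 4, 5]

# Without labels: one row/column per label that occurs in either list
cm_seen = confusion_matrix(y_true, y_pred)

# With labels: always one row/column per rating, in a fixed order
cm_full = confusion_matrix(y_true, y_pred, labels=all_ratings)

print(cm_seen.shape, cm_full.shape)
```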

### VADER

In [None]:
sentiment = SentimentIntensityAnalyzer()

In [None]:
t_1 = english_reviews['REVIEW_TEXT'].iloc[1336]
t_1

In [None]:
sent_1 = sentiment.polarity_scores(t_1)
print(f'The VADER sentiment of review 1337 is: {sent_1}')

In [None]:
# double quotes outside so the nested single quotes parse on Python versions before 3.12
print(f"The actual rating for review 1337 is: {english_reviews['RATING'].iloc[1336]}")

In [None]:
def calculate_sentiment(text):
    vader = sentiment.polarity_scores(text)
    return vader

In [None]:
english_reviews['VADER_SENTIMENT_RAW'] = english_reviews['REVIEW_TEXT'].apply(calculate_sentiment)

In [None]:
english_reviews.iloc[72]

In [None]:
def map_sentiment(sentiment):
    pos_score = sentiment['pos']
    neg_score = sentiment['neg']

    weighted_score = (pos_score - neg_score) * 2 + 3
    mapped_score = round(max(1, min(5, weighted_score)))
    return mapped_score
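
The sketch below walks through this mapping on three hand-written score dictionaries. The dictionaries are hypothetical, shaped like the output of `polarity_scores` (keys `neg`, `neu`, `pos`, `compound`). For comparison only, it also includes `map_compound`, a hypothetical alternative that rescales VADER's `compound` score (which ranges from -1 to 1) in the same way; that variant is not the approach used in this notebook.

```python
def map_sentiment(sentiment):
    # Same mapping as above: the pos/neg difference in [-1, 1] is rescaled to [1, 5]
    weighted_score = (sentiment['pos'] - sentiment['neg']) * 2 + 3
    return round(max(1, min(5, weighted_score)))

def map_compound(sentiment):
    # Hypothetical alternative: rescale the compound score from [-1, 1] to [1, 5]
    weighted_score = (sentiment['compound'] + 1) * 2 + 1
    return round(max(1, min(5, weighted_score)))

# Made-up scores: clearly positive, clearly negative, and mixed
examples = [
    {'neg': 0.0, 'neu': 0.5, 'pos': 0.5, 'compound': 0.8},
    {'neg': 0.6, 'neu': 0.4, 'pos': 0.0, 'compound': -0.7},
    {'neg': 0.1, 'neu': 0.8, 'pos': 0.1, 'compound': 0.0},
]

from_pos_neg = [map_sentiment(s) for s in examples]
from_compound = [map_compound(s) for s in examples]

print(from_pos_neg, from_compound)
```

Because `pos`, `neu` and `neg` are proportions of the text, a review made up of mostly neutral words has a small `pos - neg` difference and tends to land near 3 under the first mapping.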

In [None]:
print(f'The mapped sentiment for review 1337 is: {map_sentiment(sent_1)}')

In [None]:
english_reviews['VADER_SENTIMENT_MAPPED'] = english_reviews['VADER_SENTIMENT_RAW'].apply(map_sentiment)

In [None]:
english_reviews.iloc[72]

In [None]:
# scikit-learn's convention is accuracy_score(y_true, y_pred)
acc_score_3 = accuracy_score(english_reviews['RATING'].values, english_reviews['VADER_SENTIMENT_MAPPED'].values)

In [None]:
print(f'The accuracy score of the \'VADER\' model is: {round(acc_score_3, 4)}')

In [None]:
cm = confusion_matrix(english_reviews['RATING'].values, english_reviews['VADER_SENTIMENT_MAPPED'].values)

disp = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels = np.unique(Y_test))
disp.plot(cmap = plt.cm.Blues)
plt.title('Confusion Matrix')
plt.show()

## Conclusion