# Amazon Reviews Classification

Submitted By: 

Anushka Deshpande 

USC ID: 5914802345

Sentiment analysis is widely used to study customer behavior through
reviews, survey responses, online and social media, and healthcare
materials, with applications in marketing and customer service.

Here we predict the class of a review (1, 2, or 3) from its text. The classes are derived from the star ratings: ratings 1 and 2 form class 1, rating 3 forms class 2, and ratings 4 and 5 form class 3.

Please note: each of the following steps takes up to 7 minutes to run on Google Colaboratory.

*   Lemmatization step
*   TF_IDF vectorization step



Start by installing and importing the necessary libraries.

In [1]:
!pip install contractions

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting contractions
  Downloading contractions-0.1.73-py2.py3-none-any.whl (8.7 kB)
Collecting textsearch>=0.0.21
  Downloading textsearch-0.0.24-py2.py3-none-any.whl (7.6 kB)
Collecting anyascii
  Downloading anyascii-0.3.1-py3-none-any.whl (287 kB)
Collecting pyahocorasick
  Downloading pyahocorasick-2.0.0-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl (104 kB)
Installing collected packages: pyahocorasick, anyascii, textsearch, contractions
Successfully installed anyascii-0.3.1 contractions-0.1.73 pyahocorasick-2.0.0 textsearch-0.0.24


In [2]:
!pip install scikit-learn

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
import re
import nltk
import pandas as pd
import numpy as np
import contractions

from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, classification_report

#pd.set_option('display.max_colwidth', None)

In [4]:
nltk.download('averaged_perceptron_tagger')
nltk.download('omw-1.4')
nltk.download('wordnet')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

## Read Main Dataset

In [5]:
url = "https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Beauty_v1_00.tsv.gz"
main_dataset = pd.read_csv(url, sep="\t", on_bad_lines='skip')
main_dataset.head() # shows first 5 entries



Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
0,US,1797882,R3I2DHQBR577SS,B001ANOOOE,2102612,The Naked Bee Vitmin C Moisturizing Sunscreen ...,Beauty,5,0.0,0.0,N,Y,Five Stars,"Love this, excellent sun block!!",2015-08-31
1,US,18381298,R1QNE9NQFJC2Y4,B0016J22EQ,106393691,"Alba Botanica Sunless Tanning Lotion, 4 Ounce",Beauty,5,0.0,0.0,N,Y,Thank you Alba Bontanica!,The great thing about this cream is that it do...,2015-08-31
2,US,19242472,R3LIDG2Q4LJBAO,B00HU6UQAG,375449471,"Elysee Infusion Skin Therapy Elixir, 2oz.",Beauty,5,0.0,0.0,N,Y,Five Stars,"Great Product, I'm 65 years old and this is al...",2015-08-31
3,US,19551372,R3KSZHPAEVPEAL,B002HWS7RM,255651889,"Diane D722 Color, Perm And Conditioner Process...",Beauty,5,0.0,0.0,N,Y,GOOD DEAL!,I use them as shower caps & conditioning caps....,2015-08-31
4,US,14802407,RAI2OIG50KZ43,B00SM99KWU,116158747,Biore UV Aqua Rich Watery Essence SPF50+/PA+++...,Beauty,5,0.0,0.0,N,Y,this soaks in quick and provides a nice base f...,This is my go-to daily sunblock. It leaves no ...,2015-08-31


## Basic Code

In this section, all the steps are carried out as mentioned in the assignment. The results obtained are printed at the end of this section. 

Following steps are carried out:

*   Data Cleaning:
    - convert all reviews into lowercase
    - remove the HTML and URLs from the reviews
    - remove non-alphabetical characters
    - remove extra spaces
    - expand contractions in the reviews, e.g., won’t → will not
*   Data Preprocessing:
    - remove the stop words
    - perform lemmatization
*   Feature Extraction: using TF-IDF Vectorization 
*   Data Split: Data is split 80-20 into training and testing sets
*   Model Building and Evaluation: Four models are built and fitted to the dataset:
    - Single Layer Perceptron
    - Support Vector Machine (SVM) classifier
    - Logistic Regression 
    - Multinomial Naive Bayes Classifier
*   Evaluation: Performed based on standard metrics and printed for the 3 distinct classes:
    - Precision
    - Recall
    - F1 Score









### Keep Necessary Columns

In [6]:
new_df = main_dataset[['star_rating', 'review_body']]
new_df.head()

Unnamed: 0,star_rating,review_body
0,5,"Love this, excellent sun block!!"
1,5,The great thing about this cream is that it do...
2,5,"Great Product, I'm 65 years old and this is al..."
3,5,I use them as shower caps & conditioning caps....
4,5,This is my go-to daily sunblock. It leaves no ...


Now we form 3 classes and keep 20k reviews in each class.

In [7]:
# Class 1: dataset containing 20k reviews which have rating 1 or 2 
df_1 = new_df.query("star_rating == 1 | star_rating == 2").sample(n=20000)
df_1['star_rating'] = 1

# Class 2: dataset containing 20k reviews which have rating 3
df_2 = new_df.query("star_rating == 3").sample(n=20000)
df_2['star_rating'] = 2

# Class 3: dataset containing 20k reviews which have rating 4 or 5
df_3 = new_df.query("star_rating == 4 | star_rating == 5").sample(n=20000)
df_3['star_rating'] = 3

#concatenate the 3 dataframes
dfs = [df_1, df_2, df_3]
final_df = pd.concat(dfs)

#shuffle the order of entries in the dataframe
from sklearn.utils import shuffle
df = shuffle(final_df)

#print the first 5 entries in the final dataframe
df.head()

Unnamed: 0,star_rating,review_body
3622600,2,"Got double package on time, but unfortunately,..."
3951970,2,I received the Venus & Olay Razor in Sugarberr...
3632459,3,I've been using this hand cream for the last 6...
2349108,3,Good quality. Great price! The only thing bad ...
4394793,3,I purchase the China Glaze because of the colo...
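
The class construction above can also be written as an explicit mapping with a fixed seed, which makes the sample repeatable between runs. The following is a minimal standard-library sketch on toy data, not the notebook's pandas code; `rating_to_class` and `balanced_sample` are hypothetical helper names introduced here. In the notebook itself, passing `random_state` to `DataFrame.sample` and to `sklearn.utils.shuffle` should have a similar effect.

```python
import random
from collections import defaultdict


def rating_to_class(star_rating):
    """Map a 1-5 star rating to the three classes used in this notebook."""
    if star_rating in (1, 2):
        return 1
    if star_rating == 3:
        return 2
    if star_rating in (4, 5):
        return 3
    raise ValueError(f"unexpected star rating: {star_rating!r}")


def balanced_sample(rows, per_class, seed=0):
    """Draw `per_class` rows from each class; `rows` holds (rating, text) pairs."""
    by_class = defaultdict(list)
    for rating, text in rows:
        label = rating_to_class(rating)
        by_class[label].append((label, text))
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    sample = []
    for label in sorted(by_class):
        sample.extend(rng.sample(by_class[label], per_class))
    rng.shuffle(sample)  # mix the classes, as shuffle() does in the notebook
    return sample


# Toy data standing in for the (star_rating, review_body) columns.
rows = [(r, f"review {i}") for i, r in enumerate([1, 2, 2, 3, 3, 3, 4, 5, 5, 1])]
sample = balanced_sample(rows, per_class=2, seed=25)
print(sample)
```

Calling `balanced_sample` twice with the same seed returns the same rows in the same order, which is what makes later metric comparisons between runs meaningful.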


### Data Cleaning

This section performs the following steps:
- convert all reviews into lowercase
- remove the HTML and URLs from the reviews
- remove non-alphabetical characters
- remove extra spaces
- expand contractions in the reviews, e.g., won’t → will not
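
The steps listed above can be sketched as a single function. Note that the cells below remove non-alphabetic characters before expanding contractions, so apostrophes are already gone when the `contractions` package runs; it still recognises some apostrophe-less forms (for example 'ive' becomes 'i have' in the output below), but the notebook does not check that this holds for every form. The sketch expands contractions first, as one possible alternative ordering. It uses only the standard library; `CONTRACTIONS` is a deliberately tiny, hypothetical table and `clean_review` is a name introduced here for illustration.

```python
import re

# Hypothetical, deliberately small contraction table for illustration only;
# the notebook itself relies on the `contractions` package.
CONTRACTIONS = {
    "won't": "will not",
    "can't": "cannot",
    "don't": "do not",
    "i've": "i have",
    "it's": "it is",
}


def clean_review(text):
    """Lowercase, strip URLs/HTML, expand contractions, keep letters only."""
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", "", text)  # remove URLs
    text = re.sub(r"<.*?>", "", text)             # remove HTML tags
    text = text.replace("’", "'")                 # normalise curly apostrophes
    for short, full in CONTRACTIONS.items():
        # Expand while the apostrophe is still present.
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    text = re.sub(r"[^a-z\s]+", "", text)         # drop non-alphabetic characters
    return " ".join(text.split())                 # collapse extra whitespace


print(clean_review("I WON'T buy this again!! <br />See http://example.com  now"))
```

With this ordering the negation survives as the separate token "not", which matters later when stopwords are filtered.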

In [8]:
#store the average length of 'review_body' column in a variable
lenBeforeCleaning = df['review_body'].str.len().mean()

In [9]:
#convert all entries in the 'star_rating' column to integer.
df['star_rating'] = df['star_rating'].astype('int')

#print the datatypes of all columns in the dataframe
df.dtypes

star_rating     int64
review_body    object
dtype: object

In [10]:
#find null values in both the columns and fill them with an empty string

for column in ['star_rating', 'review_body']:
  print(column + " - "  + str(df[column].isnull().sum()))

df = df.fillna("")

star_rating - 0
review_body - 4


In [11]:
#convert the text into lowercase
df['review_body'] = df['review_body'].apply(str.lower)
df.head()

Unnamed: 0,star_rating,review_body
3622600,2,"got double package on time, but unfortunately,..."
3951970,2,i received the venus & olay razor in sugarberr...
3632459,3,i've been using this hand cream for the last 6...
2349108,3,good quality. great price! the only thing bad ...
4394793,3,i purchase the china glaze because of the colo...


In [12]:
#remove html and urls from reviews
def urls(x):
  x = re.sub(r'http\S+|www\.\S+', '', x)
  x = re.sub('<.*?>','',x)
  return x

df['review_body'] = [urls(x) for x in df['review_body']]
df.head()

Unnamed: 0,star_rating,review_body
3622600,2,"got double package on time, but unfortunately,..."
3951970,2,i received the venus & olay razor in sugarberr...
3632459,3,i've been using this hand cream for the last 6...
2349108,3,good quality. great price! the only thing bad ...
4394793,3,i purchase the china glaze because of the colo...


In [13]:
#remove non-alphabetic characters from reviews
def alpha(x):
  return re.sub(r'[^a-zA-Z\s]+', '', str(x))

df['review_body'] = [alpha(x) for x in df['review_body']]
df.head()

Unnamed: 0,star_rating,review_body
3622600,2,got double package on time but unfortunately w...
3951970,2,i received the venus olay razor in sugarberry...
3632459,3,ive been using this hand cream for the last m...
2349108,3,good quality great price the only thing bad is...
4394793,3,i purchase the china glaze because of the colo...


In [14]:
#remove extra blank spaces from reviews
def spaces(x):
  result = " ".join(x.split())
  return result

df['review_body'] = [spaces(x) for x in df['review_body']]
df.head()

Unnamed: 0,star_rating,review_body
3622600,2,got double package on time but unfortunately w...
3951970,2,i received the venus olay razor in sugarberry ...
3632459,3,ive been using this hand cream for the last mo...
2349108,3,good quality great price the only thing bad is...
4394793,3,i purchase the china glaze because of the colo...


In [15]:
#expand contractions in reviews
def con(x):
  return contractions.fix(x)

df['review_body'] = [con(x) for x in df['review_body']]
df.head()

Unnamed: 0,star_rating,review_body
3622600,2,got double package on time but unfortunately w...
3951970,2,i received the venus olay razor in sugarberry ...
3632459,3,i have been using this hand cream for the last...
2349108,3,good quality great price the only thing bad is...
4394793,3,i purchase the china glaze because of the colo...


In [16]:
#calculate the average length of reviews after data cleaning
lenAfterCleaning = df['review_body'].str.len().mean()

print(str(lenBeforeCleaning) + ", " + str(lenAfterCleaning))

290.51300086672444, 278.1562


### Data Pre-processing

This section does the following steps:
- Remove Stopwords
- Perform lemmatization
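
The lemmatization cell below tags and lemmatizes every word on its own, so the result for a given word never depends on its neighbours. Under that assumption, the result can be cached per distinct word, which may shorten the long runtime noted at the top of the notebook (this was not measured here). The sketch below shows the idea with the standard library only; `slow_lemma` is a crude, hypothetical stand-in for `lemmatizer.lemmatize(word, get_wordnet_pos(word))`, not the NLTK call itself.

```python
from functools import lru_cache


def slow_lemma(word):
    """Stand-in for the expensive per-word POS tagging and lemmatization.

    The rule here (strip a trailing 's') is a toy, not WordNet behaviour.
    """
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


@lru_cache(maxsize=None)
def cached_lemma(word):
    # Each distinct word reaches slow_lemma only once; repeats hit the cache.
    return slow_lemma(word)


def lemmatize_text(text):
    return " ".join(cached_lemma(w) for w in text.split())


reviews = ["great products great prices", "products arrived late"]
lemmatized = [lemmatize_text(r) for r in reviews]
print(lemmatized)
print(cached_lemma.cache_info())
```

Because review vocabulary is highly repetitive, the number of cache misses is bounded by the number of distinct words rather than the total word count.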

In [17]:
#calculate length before preprocessing
lenBeforePreprocessing = df['review_body'].str.len().mean()

In [18]:
#remove stopwords
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

def sw_removal(x):
  filtered_sentence = [w for w in x.split() if not w.lower() in stop_words]
  return " ".join(filtered_sentence)

df['review_body'] = [sw_removal(x) for x in df['review_body']]
df.head()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Unnamed: 0,star_rating,review_body
3622600,2,got double package time unfortunately simply p...
3951970,2,received venus olay razor sugarberry free prov...
3632459,3,using hand cream last months noticed considera...
2349108,3,good quality great price thing bad smell brown...
4394793,3,purchase china glaze color love nailsi would r...


In [19]:
#perform lemmatization
from nltk.corpus import wordnet
def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

In [20]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

def lemm(x):
  return " ".join([lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in x.split()])

df['review_body'] = [lemm(x) for x in df['review_body']]
df.head()

Unnamed: 0,star_rating,review_body
3622600,2,get double package time unfortunately simply p...
3951970,2,receive venus olay razor sugarberry free provi...
3632459,3,use hand cream last month notice considerable ...
2349108,3,good quality great price thing bad smell brown...
4394793,3,purchase china glaze color love nailsi would r...


In [21]:
#calculate the average length of reviews after preprocessing
lenAfterPreprocessing = df['review_body'].str.len().mean()

print(str(lenBeforePreprocessing) + ", " + str(lenAfterPreprocessing))

278.1562, 163.39178333333334


### TF-IDF Vectorization

In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(1,2))
tfidf_vectors = vectorizer.fit_transform(df['review_body'])

tfidf_vectors

<60000x695072 sparse matrix of type '<class 'numpy.float64'>'
	with 2746665 stored elements in Compressed Sparse Row format>
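
To make the weighting concrete, here is a small standard-library sketch of what TF-IDF over unigrams and bigrams computes. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is scikit-learn's documented default; it splits on whitespace rather than using the vectorizer's own tokenizer, so it is an approximation for illustration. The function names `ngrams` and `tfidf` are introduced here.

```python
import math
from collections import Counter


def ngrams(tokens, n_min, n_max):
    """Return all word n-grams with n_min <= n <= n_max, joined by spaces."""
    out = []
    for n in range(n_min, n_max + 1):
        out.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out


def tfidf(docs, ngram_range=(1, 2)):
    """Return one {term: weight} dict per document, L2-normalised."""
    term_lists = [ngrams(d.split(), *ngram_range) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for terms in term_lists for t in set(terms))
    vectors = []
    for terms in term_lists:
        counts = Counter(terms)
        weights = {
            t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
            for t, c in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


docs = ["good quality great price", "bad smell", "great price"]
vectors = tfidf(docs)
print(vectors[0])
```

In the first toy document, "good" receives a higher weight than "great" because "great" also occurs in another document and is therefore less distinctive.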

### Train-Test Split

Split the dataset now into training and testing (80-20 split).

In [23]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(tfidf_vectors, df['star_rating'], test_size=0.2, random_state = 25)
print("Train: ",X_train.shape,y_train.shape,"Test: ",(X_test.shape,y_test.shape))

Train:  (48000, 695072) (48000,) Test:  ((12000, 695072), (12000,))
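
The split above is not stratified, so the per-class counts in the test set can drift from an exact 4,000 each (the classification reports in this section show supports of 3922, 3990, and 4088). If equal class proportions are wanted, `train_test_split` accepts a `stratify` argument. The idea behind stratification can be sketched with the standard library as follows; `stratified_split` is a hypothetical helper, not the scikit-learn implementation.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=25):
    """Split indices so each class keeps the same test proportion."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label][:]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # per-class test count
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = [1] * 10 + [2] * 10 + [3] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Whether stratification changes the reported metrics noticeably was not tested in this notebook; with 60,000 balanced reviews the drift is small.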


### Training and Evaluation

#### Single Layer Perceptron

Apply the single layer perceptron model to train the data. Calculate the precision, recall and F1 score for each class.


In [24]:
from sklearn.linear_model import Perceptron

perceptron = Perceptron(tol=1e-3, random_state=0)
perceptron.fit(X_train, y_train)

Perceptron()

In [25]:
perceptron_predictions = perceptron.predict(X_test)

print("Average precision: ", precision_score(y_test, perceptron_predictions, average = 'weighted'))
print("Average recall: ", recall_score(y_test, perceptron_predictions, average = 'weighted'))
print("Average f1 score: ", f1_score(y_test, perceptron_predictions, average = 'weighted'))

Average precision:  0.6367053692234758
Average recall:  0.6385
Average f1 score:  0.6349957430265137


In [26]:
print("Classification report: \n", classification_report(y_test, perceptron_predictions))

Classification report: 
               precision    recall  f1-score   support

           1       0.61      0.71      0.66      3922
           2       0.57      0.48      0.52      3990
           3       0.73      0.72      0.72      4088

    accuracy                           0.64     12000
   macro avg       0.64      0.64      0.63     12000
weighted avg       0.64      0.64      0.63     12000
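
For reference, the per-class figures in a report like the one above can be reproduced from true and predicted labels with a few counters. The sketch below uses toy labels and the standard library only; `per_class_metrics` and `weighted_average` are names introduced here, not scikit-learn functions. It also shows why the support-weighted recall printed earlier (0.6385) matches the accuracy row (0.64): weighting each class's recall by its support sums the correct predictions over all samples.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {label: {precision, recall, f1, support}} from two label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    report = {}
    for c in sorted(set(y_true) | set(y_pred)):
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[c] = {"precision": precision, "recall": recall,
                     "f1": f1, "support": tp[c] + fn[c]}
    return report


def weighted_average(report, key):
    """Average a metric over classes, weighted by each class's support."""
    total = sum(r["support"] for r in report.values())
    return sum(r[key] * r["support"] for r in report.values()) / total


y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 2, 2, 2, 3, 1]
report = per_class_metrics(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(report)
print(weighted_average(report, "recall"), accuracy)
```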



#### Support Vector Machine (SVM)

Apply the SVM model to train the data. Calculate the precision, recall and F1 score for each class.


In [27]:
from sklearn.svm import LinearSVC

svm_model = LinearSVC(random_state=0)
svm_model.fit(X_train, y_train)

LinearSVC(random_state=0)

In [28]:
svm_predictions = svm_model.predict(X_test)

print("Average precision: ", precision_score(y_test, svm_predictions, average = 'weighted'))
print("Average recall: ", recall_score(y_test, svm_predictions, average = 'weighted'))
print("Average f1 score: ", f1_score(y_test, svm_predictions, average = 'weighted'))

Average precision:  0.6803843386198024
Average recall:  0.68225
Average f1 score:  0.6810986000813194


In [29]:
print("Classification report: \n", classification_report(y_test, svm_predictions))

Classification report: 
               precision    recall  f1-score   support

           1       0.68      0.70      0.69      3922
           2       0.60      0.57      0.59      3990
           3       0.75      0.77      0.76      4088

    accuracy                           0.68     12000
   macro avg       0.68      0.68      0.68     12000
weighted avg       0.68      0.68      0.68     12000



#### Logistic Regression

Apply the Logistic Regression model to train the data. Calculate the precision, recall and F1 score for each class.


In [30]:
from sklearn.linear_model import LogisticRegression

logistic_model = LogisticRegression(max_iter=1000,solver='saga')

logistic_model.fit(X_train,y_train)

LogisticRegression(max_iter=1000, solver='saga')

In [31]:
logistic_predictions = logistic_model.predict(X_test)

print("Average precision: ", precision_score(y_test, logistic_predictions, average = 'weighted'))
print("Average recall: ", recall_score(y_test,  logistic_predictions, average = 'weighted'))
print("Average f1 score: ", f1_score(y_test,  logistic_predictions, average = 'weighted'))

Average precision:  0.6915747678958981
Average recall:  0.6910833333333334
Average f1 score:  0.691246338688441


In [32]:
print("Classification report: \n", classification_report(y_test, logistic_predictions))

Classification report: 
               precision    recall  f1-score   support

           1       0.69      0.71      0.70      3922
           2       0.61      0.61      0.61      3990
           3       0.77      0.75      0.76      4088

    accuracy                           0.69     12000
   macro avg       0.69      0.69      0.69     12000
weighted avg       0.69      0.69      0.69     12000



#### Multinomial Naive Bayes

Apply the Multinomial Naive Bayes model to train the data. Calculate the precision, recall and F1 score for each class.

In [33]:
from sklearn.naive_bayes import MultinomialNB

nb_model = MultinomialNB()

nb_model.fit(X_train, y_train)

MultinomialNB()

In [34]:
nb_predictions = nb_model.predict(X_test)

print("Average precision: ", precision_score(y_test, nb_predictions , average = 'weighted'))
print("Average recall: ", recall_score(y_test,  nb_predictions , average = 'weighted'))
print("Average f1 score: ", f1_score(y_test,  nb_predictions , average = 'weighted'))

Average precision:  0.6830130656592149
Average recall:  0.679
Average f1 score:  0.6805321425192911


In [35]:
print("Classification report: \n", classification_report(y_test, nb_predictions))

Classification report: 
               precision    recall  f1-score   support

           1       0.69      0.68      0.69      3922
           2       0.58      0.63      0.61      3990
           3       0.77      0.73      0.75      4088

    accuracy                           0.68     12000
   macro avg       0.68      0.68      0.68     12000
weighted avg       0.68      0.68      0.68     12000



## Modified Code

To improve the evaluation metrics, several modifications were tried; the following two proved the most effective.

*   Include column 'review_headline': 

    > Including this column in combination with 'review_body' returned considerably higher accuracy (+4% at minimum).

*   Partial removal of StopWords:

    > The NLTK stopword list includes negation words and other words that convey the reviewer's positive or negative opinion. A curated subset of the NLTK stopwords was therefore used, which removes uninformative words but retains those that express an opinion.
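
The effect of keeping opinion-bearing words can be seen on a single sentence. The sketch below is a toy illustration with the standard library: `ALL_STOPWORDS` is a small hand-picked subset standing in for the NLTK list, and `OPINION_WORDS` is an assumed set of words to retain; neither is the list used in the notebook.

```python
# Small illustrative subset standing in for the full NLTK stopword list.
ALL_STOPWORDS = {"i", "this", "is", "the", "it", "do", "not", "no", "very", "too"}

# Assumed examples of words that carry sentiment and should be kept.
OPINION_WORDS = {"not", "no", "very", "too"}

# Curated list: drop filler words, keep negations and intensifiers.
CURATED_STOPWORDS = ALL_STOPWORDS - OPINION_WORDS


def remove_stopwords(text, stopwords):
    """Remove every whitespace-separated token found in `stopwords`."""
    return " ".join(w for w in text.split() if w not in stopwords)


review = "i do not like this it is not the same"
full = remove_stopwords(review, ALL_STOPWORDS)
curated = remove_stopwords(review, CURATED_STOPWORDS)
print(full)     # like same
print(curated)  # not like not same
```

With the full list the negative review collapses to "like same", which reads as positive; the curated list preserves "not like". Storing the stopwords in a `set` also keeps each membership test constant-time.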



### Keep Necessary Columns

This is where we add the 'review_headline' column to our dataframe.

In [36]:
m_new_df = main_dataset[['star_rating', 'review_body', 'review_headline']]
m_new_df.head()

Unnamed: 0,star_rating,review_body,review_headline
0,5,"Love this, excellent sun block!!",Five Stars
1,5,The great thing about this cream is that it do...,Thank you Alba Bontanica!
2,5,"Great Product, I'm 65 years old and this is al...",Five Stars
3,5,I use them as shower caps & conditioning caps....,GOOD DEAL!
4,5,This is my go-to daily sunblock. It leaves no ...,this soaks in quick and provides a nice base f...


We form three classes and select 20000 reviews randomly from each class.



In [37]:
m_df_1 = m_new_df.query("star_rating == 1 | star_rating == 2").sample(n=20000)
m_df_1['star_rating'] = 1

m_df_2 = m_new_df.query("star_rating == 3").sample(n=20000)
m_df_2['star_rating'] = 2

m_df_3 = m_new_df.query("star_rating == 4 | star_rating == 5").sample(n=20000)
m_df_3['star_rating'] = 3

m_dfs = [m_df_1, m_df_2, m_df_3]
m_final_df = pd.concat(m_dfs)

from sklearn.utils import shuffle
m_df = shuffle(m_final_df)
m_df.head()

Unnamed: 0,star_rating,review_body,review_headline
469223,1,I do not feel that this product help set my ma...,It wore the same as it did before using it and...
224447,1,the darker blue is 'spoty' ;blotchy' for the f...,the darker blue is 'spoty'; blotchy' for the f...
1258949,1,I don't like the vital bath gel its not the same,Two Stars
4467681,1,I bought this product despite the reviews beca...,Not for dark hair
2399955,2,"So-so, they are good for one time use only if ...",they are good for one time use only if you don...


### Add 'review' Column

We add a 'review' column, which concatenates the two columns, 'review_body' and 'review_headline'.

In [38]:
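#fill missing values first so that NaN does not become the literal text "nan"
m_df['review'] = m_df['review_headline'].fillna("").astype(str) + " " + m_df['review_body'].fillna("").astype(str)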
m_df.head()

Unnamed: 0,star_rating,review_body,review_headline,review
469223,1,I do not feel that this product help set my ma...,It wore the same as it did before using it and...,It wore the same as it did before using it and...
224447,1,the darker blue is 'spoty' ;blotchy' for the f...,the darker blue is 'spoty'; blotchy' for the f...,the darker blue is 'spoty'; blotchy' for the f...
1258949,1,I don't like the vital bath gel its not the same,Two Stars,Two Stars I don't like the vital bath gel its ...
4467681,1,I bought this product despite the reviews beca...,Not for dark hair,Not for dark hair I bought this product despit...
2399955,2,"So-so, they are good for one time use only if ...",they are good for one time use only if you don...,they are good for one time use only if you don...


### Data Cleaning



In [39]:
#avg length of reviews before data cleaning
m_lenBeforeCleaning = m_df['review'].str.len().mean()

In [40]:
#convert all entries in the 'star_rating' column to integer
m_df['star_rating'] = m_df['star_rating'].astype('int')

#print the datatypes of all columns in the dataframe
m_df.dtypes

star_rating         int64
review_body        object
review_headline    object
review             object
dtype: object

In [41]:
#calculate null values in the dataframe
for column in ['star_rating', 'review_body', 'review_headline', 'review']:
  print(column + " - "  + str(m_df[column].isnull().sum()))

star_rating - 0
review_body - 4
review_headline - 1
review - 0


In [42]:
#replace null values with blank strings
m_df = m_df.fillna("")

#convert all text to lowercase
m_df['review'] = m_df['review'].apply(str.lower)
m_df.head()

Unnamed: 0,star_rating,review_body,review_headline,review
469223,1,I do not feel that this product help set my ma...,It wore the same as it did before using it and...,it wore the same as it did before using it and...
224447,1,the darker blue is 'spoty' ;blotchy' for the f...,the darker blue is 'spoty'; blotchy' for the f...,the darker blue is 'spoty'; blotchy' for the f...
1258949,1,I don't like the vital bath gel its not the same,Two Stars,two stars i don't like the vital bath gel its ...
4467681,1,I bought this product despite the reviews beca...,Not for dark hair,not for dark hair i bought this product despit...
2399955,2,"So-so, they are good for one time use only if ...",they are good for one time use only if you don...,they are good for one time use only if you don...


In [43]:
#remove html and urls from reviews
def m_urls(x):
  x = re.sub(r'http\S+|www\.\S+', '', x)
  x = re.sub('<.*?>','',x)
  return x

m_df['review'] = [m_urls(x) for x in m_df['review']]
m_df.head()

Unnamed: 0,star_rating,review_body,review_headline,review
469223,1,I do not feel that this product help set my ma...,It wore the same as it did before using it and...,it wore the same as it did before using it and...
224447,1,the darker blue is 'spoty' ;blotchy' for the f...,the darker blue is 'spoty'; blotchy' for the f...,the darker blue is 'spoty'; blotchy' for the f...
1258949,1,I don't like the vital bath gel its not the same,Two Stars,two stars i don't like the vital bath gel its ...
4467681,1,I bought this product despite the reviews beca...,Not for dark hair,not for dark hair i bought this product despit...
2399955,2,"So-so, they are good for one time use only if ...",they are good for one time use only if you don...,they are good for one time use only if you don...


In [44]:
#remove non-alphabetic characters from reviews
import re

def m_alpha(x):
  return re.sub(r'[^a-zA-Z\s]+', '', str(x))

m_df['review'] = [m_alpha(x) for x in m_df['review']]
m_df.head()

Unnamed: 0,star_rating,review_body,review_headline,review
469223,1,I do not feel that this product help set my ma...,It wore the same as it did before using it and...,it wore the same as it did before using it and...
224447,1,the darker blue is 'spoty' ;blotchy' for the f...,the darker blue is 'spoty'; blotchy' for the f...,the darker blue is spoty blotchy for the first...
1258949,1,I don't like the vital bath gel its not the same,Two Stars,two stars i dont like the vital bath gel its n...
4467681,1,I bought this product despite the reviews beca...,Not for dark hair,not for dark hair i bought this product despit...
2399955,2,"So-so, they are good for one time use only if ...",they are good for one time use only if you don...,they are good for one time use only if you don...


In [45]:
#remove extra spaces from reviews
def m_spaces(x):
  result = " ".join(x.split())
  return result

m_df['review'] = [m_spaces(x) for x in m_df['review']]
m_df.head()

Unnamed: 0,star_rating,review_body,review_headline,review
469223,1,I do not feel that this product help set my ma...,It wore the same as it did before using it and...,it wore the same as it did before using it and...
224447,1,the darker blue is 'spoty' ;blotchy' for the f...,the darker blue is 'spoty'; blotchy' for the f...,the darker blue is spoty blotchy for the first...
1258949,1,I don't like the vital bath gel its not the same,Two Stars,two stars i dont like the vital bath gel its n...
4467681,1,I bought this product despite the reviews beca...,Not for dark hair,not for dark hair i bought this product despit...
2399955,2,"So-so, they are good for one time use only if ...",they are good for one time use only if you don...,they are good for one time use only if you don...


In [46]:
#expand contractions in the reviews
import contractions
def m_con(x):
  return contractions.fix(x)

m_df['review'] = [m_con(x) for x in m_df['review']]
m_df.head()

Unnamed: 0,star_rating,review_body,review_headline,review
469223,1,I do not feel that this product help set my ma...,It wore the same as it did before using it and...,it wore the same as it did before using it and...
224447,1,the darker blue is 'spoty' ;blotchy' for the f...,the darker blue is 'spoty'; blotchy' for the f...,the darker blue is spoty blotchy for the first...
1258949,1,I don't like the vital bath gel its not the same,Two Stars,two stars i do not like the vital bath gel its...
4467681,1,I bought this product despite the reviews beca...,Not for dark hair,not for dark hair i bought this product despit...
2399955,2,"So-so, they are good for one time use only if ...",they are good for one time use only if you don...,they are good for one time use only if you do ...


In [47]:
#avg length of reviews after data cleaning
m_lenAfterCleaning = m_df['review'].str.len().mean()

print(str(m_lenAfterCleaning) + ", " + str(m_lenBeforeCleaning))

297.66911666666664, 310.6356166666667


### Pre-processing

In this section we do the following:

*   remove unnecessary stopwords
*   perform lemmatization



In [48]:
#avg length of reviews before data preprocessing
m_lenBeforePreprocessing = m_df['review'].str.len().mean()

In [49]:
#declare and store unnecessary stopwords
m_stopwords = ['i','me','my','myself','we','our','ours','ourselves','you','your','yours','yourself','yourselves','he',
             'him','his','himself','she','her','hers','herself','it','its','itself','they','them','their','theirs',
             'themselves','what','which','who','whom','this','that','these','those','am','is','are','was','were','be',
             'been','being','have','has','had','having','do','does','did','doing','a','an','the','and','at','by',
             'for','with','further','then','here','there','when','where','why','how','other','own','so','can','will']

In [50]:
#remove unnecessary stopwords
def m_sw(x):
    filtered_sentence = [w for w in x.split() if not w.lower() in m_stopwords]
    return " ".join(filtered_sentence)

m_df['review'] = [m_sw(x) for x in m_df['review']]
m_df.head()

Unnamed: 0,star_rating,review_body,review_headline,review
469223,1,I do not feel that this product help set my ma...,It wore the same as it did before using it and...,wore same as before using would not recommend ...
224447,1,the darker blue is 'spoty' ;blotchy' for the f...,the darker blue is 'spoty'; blotchy' for the f...,darker blue spoty blotchy first darker blue sp...
1258949,1,I don't like the vital bath gel its not the same,Two Stars,two stars not like vital bath gel not same
4467681,1,I bought this product despite the reviews beca...,Not for dark hair,not dark hair bought product despite reviews b...
2399955,2,"So-so, they are good for one time use only if ...",they are good for one time use only if you don...,good one time use only if not sweat soso good ...


In [51]:
#perform lemmatization
from nltk.corpus import wordnet
def m_get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

In [52]:
from nltk.stem import WordNetLemmatizer

m_lemmatizer = WordNetLemmatizer()

def m_lemm(x):
  return " ".join([m_lemmatizer.lemmatize(word, m_get_wordnet_pos(word)) for word in x.split()])

m_df['review'] = [m_lemm(x) for x in m_df['review']]
m_df.head()

Unnamed: 0,star_rating,review_body,review_headline,review
469223,1,I do not feel that this product help set my ma...,It wore the same as it did before using it and...,wore same a before use would not recommend to ...
224447,1,the darker blue is 'spoty' ;blotchy' for the f...,the darker blue is 'spoty'; blotchy' for the f...,darker blue spoty blotchy first darker blue sp...
1258949,1,I don't like the vital bath gel its not the same,Two Stars,two star not like vital bath gel not same
4467681,1,I bought this product despite the reviews beca...,Not for dark hair,not dark hair bought product despite review be...
2399955,2,"So-so, they are good for one time use only if ...",they are good for one time use only if you don...,good one time use only if not sweat soso good ...


In [53]:
#avg length of reviews after data preprocessing
m_lenAfterPreprocessing = m_df['review'].str.len().mean()

print(str(m_lenAfterPreprocessing) + ", " + str(m_lenBeforePreprocessing))

212.73753333333335, 297.66911666666664


### TF-IDF Feature Extraction

In [54]:
from sklearn.feature_extraction.text import TfidfVectorizer

m_vectorizer = TfidfVectorizer(ngram_range=(1,5))
m_vectors = m_vectorizer.fit_transform(m_df['review'])

print(m_vectors.shape)

(60000, 6081446)
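
With `ngram_range=(1,5)` the vocabulary reaches about 6.08 million features, compared with 695,072 for `ngram_range=(1,2)` in the basic pipeline (the two are not directly comparable, since the text here also includes the headline). The sketch below illustrates how the number of distinct n-grams grows with the upper bound, using two preprocessed reviews from the output above. It uses only the standard library, and `count_distinct_ngrams` is a hypothetical helper. If memory becomes a concern, `TfidfVectorizer` offers `min_df` and `max_features` to cap the vocabulary; those options were not tried in this notebook.

```python
def count_distinct_ngrams(docs, n_min, n_max):
    """Count distinct word n-grams (n_min <= n <= n_max) across all docs."""
    seen = set()
    for doc in docs:
        tokens = doc.split()
        for n in range(n_min, n_max + 1):
            for i in range(len(tokens) - n + 1):
                seen.add(tuple(tokens[i:i + n]))
    return len(seen)


docs = [
    "not dark hair bought product",
    "two star not like vital bath gel not same",
]
for upper in (1, 2, 5):
    print((1, upper), count_distinct_ngrams(docs, 1, upper))
```

Most long n-grams occur in only one review, so they enlarge the matrix without necessarily adding features that generalise; that trade-off is worth keeping in mind when reading the results that follow.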


In [55]:
m_df.head()

Unnamed: 0,star_rating,review_body,review_headline,review
469223,1,I do not feel that this product help set my ma...,It wore the same as it did before using it and...,wore same a before use would not recommend to ...
224447,1,the darker blue is 'spoty' ;blotchy' for the f...,the darker blue is 'spoty'; blotchy' for the f...,darker blue spoty blotchy first darker blue sp...
1258949,1,I don't like the vital bath gel its not the same,Two Stars,two star not like vital bath gel not same
4467681,1,I bought this product despite the reviews beca...,Not for dark hair,not dark hair bought product despite review be...
2399955,2,"So-so, they are good for one time use only if ...",they are good for one time use only if you don...,good one time use only if not sweat soso good ...


### Train-Test Split

In [56]:
from sklearn.model_selection import train_test_split

Xtrain, Xtest, ytrain, ytest = train_test_split(m_vectors, m_df['star_rating'], test_size=0.2, random_state = 25)

print("Train: ",Xtrain.shape,ytrain.shape,"Test: ",(Xtest.shape,ytest.shape))

Train:  (48000, 6081446) (48000,) Test:  ((12000, 6081446), (12000,))
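
The split above is purely random, so the class proportions in the test set can drift slightly from those in the full data. If identical proportions in both partitions are wanted, `train_test_split` accepts a `stratify` argument. A minimal sketch on toy data follows; the names `X_toy` and `y_toy` and the 10-samples-per-class layout are assumptions for illustration only.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data: 30 samples, 10 per class (illustrative assumption).
X_toy = [[i] for i in range(30)]
y_toy = [1] * 10 + [2] * 10 + [3] * 10

# stratify=y_toy keeps the class ratio the same in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=25, stratify=y_toy
)

print("train:", Counter(y_tr))
print("test: ", Counter(y_te))
```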


### Training and Evaluation:

#### Single Layer Perceptron:

Fit a single-layer perceptron on the TF-IDF features of the preprocessed `review` column, then compute evaluation metrics on the model's test-set predictions.

In [57]:
from sklearn.linear_model import Perceptron

m_perceptron = Perceptron(tol=1e-3, random_state=0)
m_perceptron.fit(Xtrain, ytrain)

Perceptron()

In [58]:
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report

m_perceptron_predictions = m_perceptron.predict(Xtest)

print("precision: ", precision_score(ytest, m_perceptron_predictions, average = 'weighted'))
print("recall: ", recall_score(ytest, m_perceptron_predictions, average = 'weighted'))
print("f1 score: ", f1_score(ytest, m_perceptron_predictions, average = 'weighted'))

precision:  0.7782947054456152
recall:  0.7800833333333334
f1 score:  0.7788280303398466


In [59]:
print("classification report: \n", classification_report(ytest, m_perceptron_predictions))

classification report: 
               precision    recall  f1-score   support

           1       0.78      0.78      0.78      3941
           2       0.72      0.69      0.71      4066
           3       0.83      0.87      0.85      3993

    accuracy                           0.78     12000
   macro avg       0.78      0.78      0.78     12000
weighted avg       0.78      0.78      0.78     12000
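
The `average = 'weighted'` setting used for the scores above averages the per-class values, weighting each class by its support (the number of true instances of that class). A minimal pure-Python sketch of that calculation for precision follows; the function name `weighted_precision` and the toy labels are illustrative assumptions.

```python
from collections import Counter


def weighted_precision(y_true, y_pred):
    """Support-weighted mean of per-class precision."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        predicted = sum(1 for p in y_pred if p == label)
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == p == label)
        # A class that is never predicted contributes a precision of 0.
        precision = correct / predicted if predicted else 0.0
        total += precision * n_true
    return total / len(y_true)


# Hypothetical labels, not taken from the review data.
y_true_toy = [1, 1, 1, 2, 2, 3]
y_pred_toy = [1, 1, 2, 2, 3, 3]
print(weighted_precision(y_true_toy, y_pred_toy))
```

When the classes have nearly equal support, as in this test set, the weighted and macro averages end up very close to each other.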



#### Support Vector Machine (SVM):

Fit a linear SVM (`LinearSVC`) on the TF-IDF features of the preprocessed `review` column, then compute evaluation metrics on the model's test-set predictions.

In [60]:
from sklearn.svm import LinearSVC

m_svm_model = LinearSVC(random_state=0)

m_svm_model.fit(Xtrain, ytrain)

m_svm_predictions = m_svm_model.predict(Xtest)

print("precision: ", precision_score(ytest, m_svm_predictions, average = 'weighted'))
print("recall: ", recall_score(ytest, m_svm_predictions, average = 'weighted'))
print("f1 score: ", f1_score(ytest, m_svm_predictions, average = 'weighted'))

precision:  0.7917433868031925
recall:  0.793
f1 score:  0.7920405196811229


In [61]:
print("classification report: \n", classification_report(ytest, m_svm_predictions))

classification report: 
               precision    recall  f1-score   support

           1       0.78      0.81      0.80      3941
           2       0.74      0.70      0.72      4066
           3       0.85      0.87      0.86      3993

    accuracy                           0.79     12000
   macro avg       0.79      0.79      0.79     12000
weighted avg       0.79      0.79      0.79     12000



#### Logistic Regression

Fit a logistic regression model on the TF-IDF features of the preprocessed `review` column, then compute evaluation metrics on the model's test-set predictions.

In [62]:
from sklearn.linear_model import LogisticRegression

m_logistic_model = LogisticRegression(max_iter=1000,solver='saga')

m_logistic_model.fit(Xtrain,ytrain)

m_logistic_predictions = m_logistic_model.predict(Xtest)


print("precision: ", precision_score(ytest, m_logistic_predictions, average = 'weighted'))
print("recall: ", recall_score(ytest,  m_logistic_predictions, average = 'weighted'))
print("f1 score: ", f1_score(ytest,  m_logistic_predictions, average = 'weighted'))

precision:  0.7844740969430881
recall:  0.7845833333333333
f1 score:  0.7844533311254656


In [63]:
print("classification report: \n", classification_report(ytest, m_logistic_predictions))

classification report: 
               precision    recall  f1-score   support

           1       0.78      0.80      0.79      3941
           2       0.72      0.71      0.72      4066
           3       0.85      0.84      0.85      3993

    accuracy                           0.78     12000
   macro avg       0.78      0.79      0.78     12000
weighted avg       0.78      0.78      0.78     12000



#### Naive Bayes

Fit a multinomial Naive Bayes model on the TF-IDF features of the preprocessed `review` column, then compute evaluation metrics on the model's test-set predictions.

In [64]:
from sklearn.naive_bayes import MultinomialNB

m_nb_model = MultinomialNB()

m_nb_model.fit(Xtrain, ytrain)

m_nb_predictions = m_nb_model.predict(Xtest)

print("precision: ", precision_score(ytest, m_nb_predictions , average = 'weighted'))
print("recall: ", recall_score(ytest,  m_nb_predictions , average = 'weighted'))
print("f1 score: ", f1_score(ytest,  m_nb_predictions , average = 'weighted'))

precision:  0.7858596191223747
recall:  0.77575
f1 score:  0.778196889880948


In [65]:
print("classification report: \n", classification_report(ytest, m_nb_predictions ))

classification report: 
               precision    recall  f1-score   support

           1       0.76      0.80      0.78      3941
           2       0.69      0.75      0.72      4066
           3       0.91      0.77      0.84      3993

    accuracy                           0.78     12000
   macro avg       0.79      0.78      0.78     12000
weighted avg       0.79      0.78      0.78     12000



## Final Evaluation Metrics

Average Review Lengths Before and After Data Cleaning

In [66]:
print(str(m_lenBeforeCleaning) + ", " + str(m_lenAfterCleaning))

310.6356166666667, 297.66911666666664


Average Review Lengths Before and After Data Preprocessing

In [67]:
print(str(m_lenBeforePreprocessing) + ", " + str(m_lenAfterPreprocessing))

297.66911666666664, 212.73753333333335
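
The two pairs of averages are easier to compare as relative reductions. A small sketch using the values printed above follows; the helper `pct_reduction` is an illustrative name, not part of the notebook.

```python
def pct_reduction(before, after):
    """Relative drop from `before` to `after`, in percent."""
    return (before - after) / before * 100


# Average review lengths (characters) copied from the printed outputs.
cleaning = pct_reduction(310.6356166666667, 297.66911666666664)
preprocessing = pct_reduction(297.66911666666664, 212.73753333333335)

print(f"cleaning:      {cleaning:.1f}% shorter")
print(f"preprocessing: {preprocessing:.1f}% shorter")
```

Cleaning trims only about 4% of the characters on average, while preprocessing accounts for most of the reduction at roughly 28.5%.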


Classification Report: Single Layer Perceptron

In [68]:
print(classification_report(ytest, m_perceptron_predictions))

              precision    recall  f1-score   support

           1       0.78      0.78      0.78      3941
           2       0.72      0.69      0.71      4066
           3       0.83      0.87      0.85      3993

    accuracy                           0.78     12000
   macro avg       0.78      0.78      0.78     12000
weighted avg       0.78      0.78      0.78     12000



Classification Report: SVM

In [69]:
print(classification_report(ytest, m_svm_predictions))

              precision    recall  f1-score   support

           1       0.78      0.81      0.80      3941
           2       0.74      0.70      0.72      4066
           3       0.85      0.87      0.86      3993

    accuracy                           0.79     12000
   macro avg       0.79      0.79      0.79     12000
weighted avg       0.79      0.79      0.79     12000



Classification Report: Logistic Regression

In [70]:
print(classification_report(ytest, m_logistic_predictions))

              precision    recall  f1-score   support

           1       0.78      0.80      0.79      3941
           2       0.72      0.71      0.72      4066
           3       0.85      0.84      0.85      3993

    accuracy                           0.78     12000
   macro avg       0.78      0.79      0.78     12000
weighted avg       0.78      0.78      0.78     12000



Classification Report: Multinomial Naive Bayes

In [71]:
print(classification_report(ytest, m_nb_predictions))

              precision    recall  f1-score   support

           1       0.76      0.80      0.78      3941
           2       0.69      0.75      0.72      4066
           3       0.91      0.77      0.84      3993

    accuracy                           0.78     12000
   macro avg       0.79      0.78      0.78     12000
weighted avg       0.79      0.78      0.78     12000
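
For a side-by-side view, the weighted scores printed in the Training and Evaluation section can be collected and ranked by F1. The sketch below copies those printed values into a plain dictionary; the `results` structure and the display labels are illustrative choices.

```python
# Weighted test-set metrics copied from the printed outputs above.
results = {
    "Perceptron": {
        "precision": 0.7782947054456152,
        "recall": 0.7800833333333334,
        "f1": 0.7788280303398466,
    },
    "LinearSVC": {
        "precision": 0.7917433868031925,
        "recall": 0.793,
        "f1": 0.7920405196811229,
    },
    "LogisticRegression": {
        "precision": 0.7844740969430881,
        "recall": 0.7845833333333333,
        "f1": 0.7844533311254656,
    },
    "MultinomialNB": {
        "precision": 0.7858596191223747,
        "recall": 0.77575,
        "f1": 0.778196889880948,
    },
}

# Rank models from highest to lowest weighted F1.
ranked = sorted(results.items(), key=lambda kv: kv[1]["f1"], reverse=True)

print(f"{'model':<20}{'precision':>10}{'recall':>10}{'f1':>10}")
for name, m in ranked:
    print(f"{name:<20}{m['precision']:>10.4f}{m['recall']:>10.4f}{m['f1']:>10.4f}")
```

On this split the linear SVM has the highest weighted F1, and all four models fall within about 1.5 percentage points of one another.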

