# Welcome to the Amazon Reviews Sentiment Analysis project
In this project we take reviews of random products on Amazon and analyze the sentiment of each review based on the reviewer's word choice.
Reviews are considered positive if they are given 4 or more stars (out of 5).
Reviews are considered negative if they are given 2 or fewer stars (out of 5).
Reviews with 3 stars are considered neutral, so we disregard all 3-star reviews.


By: Awsam Agbarya and Ahmad Kabha

______________________________________________________________________________

To implement this project we use several NLP-oriented libraries along with sklearn and TensorFlow to try out different models and pick the best one.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import re
import nltk
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from langdetect import detect
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, f1_score, precision_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
import pickle

We import our dataset, which contains all the reviews, each with a title and a score (the number of stars out of 5).

In [2]:
df = pd.read_csv('./train.csv')
labels =['score', 'title','review']
df.head()

Unnamed: 0,score,title,review
0,5,Inspiring,I hope a lot of people hear this cd. We need m...
1,5,The best soundtrack ever to anything.,I'm reading a lot of reviews saying that this ...
2,4,Chrono Cross OST,The music of Yasunori Misuda is without questi...
3,5,Too good to be true,Probably the greatest soundtrack in history! U...
4,5,There's a reason for the price,"There's a reason this CD is so expensive, even..."


# Matching the dataset to our project
We need to convert the dataset into a format that is easier to work with and better suited to a classification problem.
To do so, we prepend the title to the review text to form a single feature, and we map the score from the 1-5 star scale to a binary 0/1 label (as explained earlier: 1-2 stars become 0, 4-5 stars become 1, and 3-star reviews are dropped).

In [3]:
dfCombined = pd.DataFrame(index=range(39998), columns=['review', 'score'])
# Prepend the title to the review text to form a single feature
dfCombined['review'] = df['title'] + ' ' + df['review']
dfCombined['score'] = df['score']
# 1-2 stars -> 0 (negative), 4-5 stars -> 1 (positive)
dfCombined.loc[dfCombined.score == 2, 'score'] = 0
dfCombined.loc[dfCombined.score == 1, 'score'] = 0
dfCombined.loc[dfCombined.score == 5, 'score'] = 1
dfCombined.loc[dfCombined.score == 4, 'score'] = 1
# 3 stars -> NaN, so the neutral reviews are removed by dropna()
dfCombined.loc[dfCombined.score == 3, 'score'] = np.nan
dfCombined = dfCombined.dropna()
dfCombined

Unnamed: 0,review,score
0,Inspiring I hope a lot of people hear this cd....,1.0
1,The best soundtrack ever to anything. I'm read...,1.0
2,Chrono Cross OST The music of Yasunori Misuda ...,1.0
3,Too good to be true Probably the greatest soun...,1.0
4,There's a reason for the price There's a reaso...,1.0
...,...,...
39993,A mom of three We bought this tent for my daug...,1.0
39994,we don't wish to be disturbed I bought this to...,1.0
39995,Pacific play tent - Lots of fun & adventure pl...,1.0
39996,A nice hideaway... Our one year old really enj...,1.0
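
The star-to-label rule can also be written as a single lookup. This is a minimal sketch on toy data, not the notebook's own cell; `STAR_TO_LABEL` and `label_reviews` are hypothetical names introduced here for illustration.

```python
import pandas as pd

# Hypothetical lookup table: stars -> binary label (3 stars is intentionally absent).
STAR_TO_LABEL = {1: 0, 2: 0, 4: 1, 5: 1}

def label_reviews(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with binary 'score' labels and neutral rows dropped."""
    out = frame.copy()
    # Stars missing from the lookup table (here: 3) are mapped to NaN.
    out["score"] = out["score"].map(STAR_TO_LABEL)
    return out.dropna(subset=["score"])

toy = pd.DataFrame({"review": ["great", "meh", "awful"], "score": [5, 3, 1]})
labeled = label_reviews(toy)
print(labeled)
```

Because the mapping happens in one step, the result does not depend on the order in which the individual star values are rewritten.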


Our reviews include entries in other languages, such as Spanish.
While we want our model to be as inclusive as possible, we have very few Spanish reviews compared to English ones, so the model would not have enough data to learn to recognize sentiment in Spanish.
We therefore remove every review that is not detected as English, to get better results from the model.

In [4]:
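dfEnglish = dfCombined
counter = 0
# Iterate over (index, review text) pairs
for key, value in dfEnglish['review'].items():
    # Entry 22000 is skipped and therefore always kept
    if key == 22000:
        continue
    lang = detect(value)
    # Drop every review that is not detected as English
    if lang != 'en':
        counter += 1
        dfEnglish = dfEnglish.drop(key)
dfEnglish = dfEnglish.dropna()
print('Amount of spanish entries removed:', counter)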

Amount of spanish entries removed: 65
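
Dropping rows one at a time inside a loop copies the frame on every removal. A possible alternative is to build a boolean mask first and filter once. The sketch below assumes a detector callable is passed in (in this project that would be `langdetect.detect`); `english_mask` and `fake_detector` are hypothetical names, and treating undetectable text as non-English is an assumption rather than the notebook's behavior.

```python
import pandas as pd

def english_mask(reviews: pd.Series, detector) -> pd.Series:
    """Return a boolean Series that is True where `detector` reports 'en'."""
    def is_english(text) -> bool:
        try:
            return detector(str(text)) == "en"
        except Exception:
            # Assumption: text the detector cannot handle counts as non-English.
            return False
    return reviews.map(is_english)

# Stand-in detector so the sketch runs without extra dependencies.
def fake_detector(text: str) -> str:
    if not text.strip():
        raise ValueError("no features in text")
    return "es" if "muy" in text else "en"

toy = pd.DataFrame({"review": ["very good cd", "muy bueno", ""], "score": [1, 1, 0]})
mask = english_mask(toy["review"], fake_detector)
english_only = toy[mask].reset_index(drop=True)  # filter once, then reindex
removed = int((~mask).sum())
print("Entries removed:", removed)
```

Wrapping the call in `try`/`except` matters because a detector may raise on text with nothing to analyze, such as an empty string.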


We started with 39997 entries, but we removed some unnecessary entries from the dataset, so we reset the index.

We also do a quick check of the class balance after removing those entries. Imbalanced data can cause problems that we would need to deal with, so it is good to check the balance before we begin processing.

In [5]:
dfEnglish= dfEnglish.reset_index(drop=True)


# Count the negative (0) and positive (1) reviews
bad = dfEnglish.loc[dfEnglish['score'] == 0]['review'].count()
good = dfEnglish.loc[dfEnglish['score'] == 1]['review'].count()
# Share of positive reviews, in percent
percentage = 100 * good / (bad + good)
print('%', percentage)
if percentage > 99 or percentage < 1: print("Extremely imbalanced data")
elif percentage > 80 or percentage < 20: print("Moderately imbalanced data")
elif percentage > 60 or percentage < 40: print("Mildly imbalanced data")
else: print("balanced data")
dfEnglish

% 50.74161549362305
balanced data


Unnamed: 0,review,score
0,Inspiring I hope a lot of people hear this cd....,1.0
1,The best soundtrack ever to anything. I'm read...,1.0
2,Chrono Cross OST The music of Yasunori Misuda ...,1.0
3,Too good to be true Probably the greatest soun...,1.0
4,There's a reason for the price There's a reaso...,1.0
...,...,...
31750,A mom of three We bought this tent for my daug...,1.0
31751,we don't wish to be disturbed I bought this to...,1.0
31752,Pacific play tent - Lots of fun & adventure pl...,1.0
31753,A nice hideaway... Our one year old really enj...,1.0
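
The same balance check can be expressed with `value_counts(normalize=True)`. This is a small sketch on toy labels that reuses the thresholds from the cell above; `positive_share` and `balance_label` are hypothetical helper names.

```python
import pandas as pd

def positive_share(scores: pd.Series) -> float:
    """Percentage of rows labeled 1 (positive)."""
    return 100 * scores.value_counts(normalize=True).get(1, 0.0)

def balance_label(pct: float) -> str:
    """Classify the balance using the same thresholds as the notebook cell."""
    if pct > 99 or pct < 1:
        return "extremely imbalanced"
    if pct > 80 or pct < 20:
        return "moderately imbalanced"
    if pct > 60 or pct < 40:
        return "mildly imbalanced"
    return "balanced"

toy_scores = pd.Series([1, 1, 1, 0])  # three positive reviews, one negative
pct = positive_share(toy_scores)
label = balance_label(pct)
print(pct, label)
```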


# PreProcessing of our data
To make learning easier for our model, we eliminate as many features as possible that do not contribute to the sentiment of a review:
1. We turn all letters to lower case, so that the same word with and without a capital letter is treated as one word.
2. Reviews sometimes contain links to websites and other references. The links themselves carry no meaning, so we remove them.
3. While punctuation is important for meaning, it is rarely a major contributor to sentiment, so we remove it.
4. We split the sentences into words with a tokenizer so we can process each word individually.
5. We remove every token that is not purely alphabetic (for example, numbers).
6. We remove all English stop words. Stop words are common English words and connectors that serve context and grammar but do not contribute to sentiment, and they are too frequent in English for their use to signal a negative or positive tone (examples of stop words are printed below). We keep "not", since it can reverse the meaning of the words around it.
7. We stem every word to its root, so that words with the same meaning but different suffixes are treated as one word (for example, close and closely, go and going). This limits the number of entries in our vocabulary that carry the same sentiment but are spelled differently.
8. We join the words of each review back into one sentence.
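
As a small illustration of step 7, the sketch below stems a few words with NLTK's `SnowballStemmer`, the same stemmer used in this notebook. The word list is chosen here for illustration only.

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer(language="english")

# Illustrative words: two forms of the same verb plus two review words.
words = ["going", "go", "people", "inspiring"]
stems = {word: stemmer.stem(word) for word in words}
print(stems)
```

Here "going" and "go" share the stem "go", so the model sees them as one vocabulary entry. A stem does not have to be a dictionary word: "people" becomes "peopl".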

In [6]:
# Build the URL pattern, stop-word set and stemmer once instead of once per review
URL_PATTERN = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
STOP_WORDS = set(stopwords.words("english"))
STOP_WORDS.discard("not")  # keep "not" so negations survive
STEMMER = SnowballStemmer(language='english')

def clean_text(df):
    all_reviews = list()
    lines = df["review"].values.astype(str).tolist()
    for text in lines:
        # 1. lower-case everything
        text = text.lower()
        # 2. remove links
        text = URL_PATTERN.sub('', text)
        # 3. remove punctuation
        text = re.sub(r"[,.\"!@#$%^&*(){}?/;`~:<>+=-]", "", text)
        # 4. split into tokens
        tokens = word_tokenize(text)
        # 5. keep purely alphabetic tokens
        words = [word for word in tokens if word.isalpha()]
        # 6. remove stop words
        words = [w for w in words if w not in STOP_WORDS]
        # 7. stem each word to its root
        words = [STEMMER.stem(w) for w in words]
        # 8. join the words back into one sentence
        all_reviews.append(' '.join(words))

    return all_reviews
print(stopwords.words("english"))
all_reviews = clean_text(dfEnglish)
all_reviews[:5]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

['inspir hope lot peopl hear cd need strong posit vibe like great vocal fresh tune crosscultur happi blue gut pop sound catchi matur',
 'best soundtrack ever anyth read lot review say best soundtrack figur write review disagre bit opinino yasunori mitsuda ultim masterpiec music timeless listen year beauti simpli refus fadeth price tag pretti stagger must say go buy cd much money one feel would worth everi penni',
 'chrono cross ost music yasunori misuda without question close second great nobuo uematsuchrono cross ost wonder creation fill rich orchestra synthes sound ambianc one music major factor yet time uplift vigor favourit track includ scar left time girl stole star anoth world',
 'good true probabl greatest soundtrack histori usual better play game first enjoy anyway work hard get soundtrack spend money get realli worth everi penni get ost amaz first track danc around delight especi scar left time buy',
 'reason price reason cd expens even version not importsom best music ever co
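
Step 3 deletes punctuation outright, so two words separated only by punctuation are merged into one token (the cleaned output contains tokens such as "fadeth" and "importsom"). The sketch below compares deletion with replacing punctuation by a space. It uses the same character class as step 3; the function names and the sample sentence are hypothetical, and whether the space variant improves the model is not tested here.

```python
import re

# Same character class as step 3 of clean_text.
PUNCT = r"[,.\"!@#$%^&*(){}?/;`~:<>+=-]"

def strip_punct_delete(text: str) -> str:
    """Delete punctuation, as step 3 does."""
    return re.sub(PUNCT, "", text)

def strip_punct_space(text: str) -> str:
    """Replace punctuation with a space, then collapse repeated whitespace."""
    return " ".join(re.sub(PUNCT, " ", text).split())

sample = "it simply refuses to fade.the price tag is staggering"
deleted = strip_punct_delete(sample)
spaced = strip_punct_space(sample)
print(deleted)
print(spaced)
```

With deletion, "fade.the" collapses into the single token "fadethe". With the space variant, "fade" and "the" stay separate and can be handled normally by the stop-word and stemming steps.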

# Term frequency and word relevance
We cleaned the reviews of words and symbols that carry no sentiment, but even so, not all remaining words contribute to a positive or negative meaning. We therefore give each word a "value" so that our model knows how important or relevant that word is to the sentiment.


We have one method that implements this:
1. TfidfVectorizer: TF-IDF is a statistical measure. It consists of two parts, TF (Term Frequency) multiplied by IDF (Inverse Document Frequency). The intuition is that a word that appears frequently in one document but rarely in the other documents likely provides extra insight into that one document, and this additional information can help our model learn. In short, words that are common across documents are penalized. The resulting weights are relative frequencies stored as floating-point numbers.
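
To make the penalty on common words concrete, here is a toy sketch with three made-up documents and scikit-learn's default settings, where the smoothed IDF of a term is ln((1 + n) / (1 + df)) + 1 for n documents and document frequency df. The documents are illustrative assumptions, not project data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: "product" appears in every document, "good" in only one.
docs = ["good product", "bad product", "product arrived late"]

tv = TfidfVectorizer()
weights = tv.fit_transform(docs)

# Map each term to its learned IDF value.
idf = {term: tv.idf_[index] for term, index in tv.vocabulary_.items()}
print(idf)

# TF-IDF weights of the first document, "good product".
first_doc = weights[0].toarray()[0]
good_weight = first_doc[tv.vocabulary_["good"]]
product_weight = first_doc[tv.vocabulary_["product"]]
print(good_weight, product_weight)
```

"product" occurs in all three documents, so it receives the lowest possible IDF, while "good" occurs in one document and gets a higher weight in that document.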

We test both models to see which gives us better results.


In [7]:
# Define the TfidfVectorizer (ignoring terms that appear in fewer than 3 reviews),
# then build the feature matrix X from the reviews and the label vector y
TV = TfidfVectorizer(min_df=3)
X = TV.fit_transform(all_reviews).toarray()
y = dfEnglish['score'].to_numpy()

# Split the data: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# Define our first model, SGD, then train it and test the results
model = SGDClassifier(loss='hinge')
model.fit(X_train,y_train)

y_pred = model.predict(X_test)
print('SGD with TfidfVectorizer results:')

print('Training accuracy:', model.score(X_train,y_train))
print('Test accuracy:', model.score(X_test,y_test))
print('Precision:',precision_score(y_test, y_pred))

SGD with TfidfVectorizer results:
Training accuracy: 0.9064320579436309
Test accuracy: 0.8669500866005353
Precision: 0.8611449451887941
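
Accuracy and precision answer different questions, and the notebook also imports `f1_score`. The sketch below computes the metrics on a tiny hand-made example (the labels are illustrative, not project results) so the definitions can be checked by hand: 2 true positives, 1 false positive, 2 false negatives, 1 true negative.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hand-made labels: TP=2, FN=2, FP=1, TN=1.
y_true = [1, 1, 1, 1, 0, 0]
y_hat = [1, 1, 0, 0, 1, 0]

metrics = {
    "accuracy": accuracy_score(y_true, y_hat),    # (TP + TN) / all = 3/6
    "precision": precision_score(y_true, y_hat),  # TP / (TP + FP) = 2/3
    "recall": recall_score(y_true, y_hat),        # TP / (TP + FN) = 2/4
    "f1": f1_score(y_true, y_hat),                # harmonic mean of the two = 4/7
}
print(metrics)
```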


In [8]:
# Save the trained model and the fitted vectorizer to disk
filename = 'finalized_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(model, f)
TFfilename = 'finalized_TFIDF.sav'
with open(TFfilename, 'wb') as f:
    pickle.dump(TV, f)

In [9]:
# Reload the saved model and confirm that it reproduces the same results
filename = 'finalized_model.sav'
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)
y_pred = loaded_model.predict(X_test)
print('SGD with TfidfVectorizer results:')

print('Training accuracy:', loaded_model.score(X_train,y_train))
print('Test accuracy:', loaded_model.score(X_test,y_test))
print('Precision:',precision_score(y_test, y_pred))

SGD with TfidfVectorizer results:
Training accuracy: 0.9064320579436309
Test accuracy: 0.8669500866005353
Precision: 0.8611449451887941
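
A saved classifier is only usable together with the vectorizer that produced its features. One way to keep the two in sync is to bundle them in a scikit-learn `Pipeline` and pickle that single object. The sketch below is an assumption about how this could be done, trained on four made-up, already-cleaned reviews; it is not the notebook's saved model.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Toy, already-cleaned reviews with their labels (illustrative only).
texts = ["love great cd", "great music love", "terribl wast money", "bad wast time"]
labels = [1, 1, 0, 0]

# Vectorizer and classifier travel together as one object.
pipeline = make_pipeline(TfidfVectorizer(), SGDClassifier(loss="hinge", random_state=0))
pipeline.fit(texts, labels)

# Round-trip through pickle, as the notebook does with files on disk.
restored = pickle.loads(pickle.dumps(pipeline))

new_reviews = ["great cd", "wast money"]
original_pred = pipeline.predict(new_reviews)
restored_pred = restored.predict(new_reviews)
print(restored_pred)
```

In the real project, new reviews would first need to pass through `clean_text` so that they match the vocabulary the vectorizer was fitted on.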
