# Random Forest on Amazon Food Reviews

Dataset Source: https://www.kaggle.com/snap/amazon-fine-food-reviews

The Amazon Fine Food Reviews dataset consists of reviews of the fine food from Amazon.

Timespan: Oct 1999 - Oct 2012
Total Number of Reviews: 568,454
Total Number of Users: 256,059
Total Number of Products: 74,258
Total Number of Profile Name: 218,418

Number of attributes/columns: 10

Attributes/Columns:

1. Id: Row Id
2. ProductId: Unique identifier for the product
3. UserId: Unqiue identifier for the user
4. ProfileName: Profile name of the user
5. HelpfulnessNumerator: Number of users who found the review helpful
6. HelpfulnessDenominator: Number of users who indicated whether they found the review helpful or not
7. Score: Rating between 1 and 5
8. Time: Timestamp for the review
9. Summary: Brief summary of the review
10. Text: Text of the review

**Aim: Convert all the reviews into a vector using two techniques:**
1. TFIDF
2. Word2Vec

**Then perform following tasks under each technique:**
1. Split dataset in train and test data in ratio of 80:20.
2. Perform GridSearch Cross Validation to find optimal value of number of base models in Random Forest.
3. Apply Random Forest and check accuracy.

# Loading the data

The dataset is available in two forms on kaggle:

1. .csv file

2. SQLite Database

To load data, I have used SQLite dataset as it is easier to query and visualise the data.Here I have to classifiy the sentiment into positive and negative, so I will ignore all the reviews with Scores equal to 3.If the Score is greater than 3 then it is positive otherwise it is negative.

In [1]:
import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
conn=sqlite3.connect('CleanedAmazonFoodReviewDataset.sqlite')

In [3]:
dataset=pd.read_sql_query("SELECT * FROM REVIEWS",conn)

In [4]:
dataset.head()

Unnamed: 0,index,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,ProcessedText
0,0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,Positive,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...,have bought sever the vital can dog food produ...
1,1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,Negative,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...,product arriv label jumbo salt peanutsth peanu...
2,2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,Positive,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...,this confect that has been around few centuri ...
3,4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,Positive,1350777600,Great taffy,Great taffy at a great price. There was a wid...,great taffi great price there was wide assort ...
4,5,6,B006K2ZZ7K,ADT0SRK1MGOEU,Twoapennything,0,0,Positive,1342051200,Nice Taffy,I got a wild hair for taffy and ordered this f...,got wild hair for taffi and order this five po...


In [5]:
dataset.shape

(364171, 12)

In [6]:
dataset["Score"].value_counts()

Positive    307061
Negative     57110
Name: Score, dtype: int64

In [7]:
def changeScore(score):
    if(score=="Positive"):
        return 1
    else:
        return 0

In [8]:
scores=list(dataset["Score"])

In [9]:
convertedScore=list(map(changeScore,scores))

In [10]:
convertedScore[:10]

[1, 0, 1, 1, 1, 1, 1, 1, 1, 1]

In [11]:
dataset["Score"]=convertedScore

In [12]:
dataset.head()

Unnamed: 0,index,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,ProcessedText
0,0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,1,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...,have bought sever the vital can dog food produ...
1,1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,0,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...,product arriv label jumbo salt peanutsth peanu...
2,2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,1,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...,this confect that has been around few centuri ...
3,4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,1,1350777600,Great taffy,Great taffy at a great price. There was a wid...,great taffi great price there was wide assort ...
4,5,6,B006K2ZZ7K,ADT0SRK1MGOEU,Twoapennything,0,0,1,1342051200,Nice Taffy,I got a wild hair for taffy and ordered this f...,got wild hair for taffi and order this five po...


In [13]:
data_10000 = dataset.sample(n = 10000)

In [14]:
data_10000.shape

(10000, 12)

In [15]:
data_10000.head()

Unnamed: 0,index,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,ProcessedText
336469,486088,525620,B000HD3YMQ,A2UKQDFLZPI3KT,"Jaden P. Brulotte ""Jaden""",0,0,1,1317772800,Yummy!,What can I say that is bad about Crunch N' Mun...,what can say that bad about crunch munch asid ...
351329,508169,549496,B001RQOY94,ATBVMV9W2ZLSO,P. Stegner,7,9,1,1251504000,quick service good product,The casings are very good and the product was ...,the case are veri good and the product was shi...
281654,407245,440376,B000JM646S,A2LQ5AAGB3OFBR,fops,0,0,1,1332979200,Great Stuff!,Great Stuff tastes great and has given me back...,great stuff tast great and has given back favo...
39995,57620,62487,B000FBL8FU,ADXXF989KXDPV,"Mary Mckercher ""Mary McK""",2,23,0,1158537600,Empty Calories,I am really disappointed in this beand as they...,realli disappoint this beand they are market t...
320987,461836,499415,B000QSQTRY,A1DN8OM1DDSFUC,"C Bismuth ""CatB; CroquetCreative/HypFoods""",0,0,1,1342656000,Top Choice for Cats with DIGESTIVE ALLERGIES,"Okay, I'll be blunt: Diarrhea. We bought 2 Sib...",okay ill diarrhea bought siberian pure bred li...


In [16]:
finalSortedDataFrame10000=data_10000.sort_values('Time',inplace=False,axis=0,ascending=True)

In [17]:
finalSortedDataFrame10000.shape

(10000, 12)

In [18]:
finalSortedDataFrame10000.head()

Unnamed: 0,index,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,ProcessedText
85385,121041,131217,B00004RAMX,A5NQLNC6QPGSI,Kim Nason,7,8,1,965001600,End your Gopher Problems,I have just recently purchased the Woodstream ...,have just recent purchas the woodstream corp g...
98445,139503,151400,B0000DHZY1,AJ6ZYWQ5C2RSX,Sue Thomas,0,0,0,1068508800,This is not pure,I have baked with this organic vanilla in the ...,have bake with this organ vanilla the past and...
329644,475065,513757,B0001YNGMO,A1L4N3V2L2X1EM,Audra Simpkins,2,2,1,1081296000,The best tasting GREEN SuperFood I've ever tri...,I love this stuff. In the first week of takin...,love this stuff the first week take immedi not...
109836,155436,168584,B0001ES9F8,AB0I8KCAEG9ZE,David J. Cuccia,3,6,0,1083110400,Get the dark roast instead!,The Senseo machine shipped with two 18-pod sam...,the senseo machin ship with two sampl pack lig...
50727,71574,77922,B0001XXARQ,A3G16WD9AXJNRV,Daniel Chait,1,1,1,1084147200,Beautiful Mother's Day Gift,I bought this for my mother-in-law for Mother'...,bought this for for mother day and was absolut...


In [19]:
finalSortedDataFrameScore10000=data_10000["Score"]

In [20]:
finalSortedDataFrameScore10000.shape

(10000,)

In [21]:
finalSortedDataFrameScore10000.head()

336469    1
351329    1
281654    1
39995     0
320987    1
Name: Score, dtype: int64

In [22]:
final_data=finalSortedDataFrame10000

In [23]:
final_data.shape

(10000, 12)

In [24]:
final_data_labels=finalSortedDataFrameScore10000

In [25]:
final_data_labels.shape

(10000,)

In [26]:
from wordcloud import WordCloud

In [27]:
def PlotWordCloud(frequency):
    wordCloudPlot=WordCloud(background_color="white",width=150,height=80)
    wordCloudPlot.generate_from_frequencies(frequencies=frequency)
    plt.figure(figsize=(15,10))
    plt.imshow(wordCloudPlot,interpolation="bilinear")
    plt.axis("off")
    plt.show()

# 1. TFIDF

In [28]:
final_data.shape

(10000, 12)

In [29]:
final_data_labels.shape

(10000,)

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [31]:
tfidf_vector=TfidfVectorizer(ngram_range=(1,2))

In [32]:
final_data_tfidf=tfidf_vector.fit_transform(final_data['ProcessedText'].values)

In [33]:
final_data_tfidf.shape

(10000, 235976)

**Splitting into train and test data in ratio of 80:20**

In [34]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(final_data_tfidf,final_data_labels,test_size=0.2,random_state=42)

In [35]:
X_train.shape,X_test.shape,y_train.shape,y_test.shape

((8000, 235976), (2000, 235976), (8000,), (2000,))

In [36]:
from sklearn.ensemble import RandomForestClassifier

In [37]:
rf=RandomForestClassifier(n_jobs=-1)

In [38]:
rf.fit(X_train,y_train)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
                       oob_score=False, random_state=None, verbose=0,
                       warm_start=False)

In [39]:
y_pred=rf.predict(X_test)

In [40]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

In [41]:
accuracy = accuracy_score(y_test, y_pred)
accuracy

0.822

In [42]:
recall=recall_score(y_test, y_pred)
recall

0.9945520581113801

In [43]:
precision=precision_score(y_test, y_pred)
precision

0.8256281407035176

In [44]:
f1score=f1_score(y_test, y_pred)
f1score

0.9022515101592532

In [45]:
cm=confusion_matrix(y_test, y_pred)
cm

array([[   1,  347],
       [   9, 1643]], dtype=int64)

**Perform GridSearch Cross Validation to find optimal value of number of base models in Random Forest.**

In [49]:
from sklearn.model_selection import GridSearchCV

In [50]:
values = []
for i in range(1, 31, 2):
    values.append(i)

clf = RandomForestClassifier(n_jobs = -1)

hyper_parameters = {'n_estimators': values}
grid_search = GridSearchCV(clf, hyper_parameters, scoring = "accuracy", cv = 3)

In [51]:
grid_search.fit(X_train,y_train)

GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators='warn', n_jobs=-1,
                                              oob_score=False,
                                              random_state=None, verbose=0,
                                              warm_start=False),
             iid='

In [52]:
print(grid_search.best_estimator_)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=23, n_jobs=-1,
                       oob_score=False, random_state=None, verbose=0,
                       warm_start=False)


In [53]:
best_parameter=grid_search.best_params_
best_param=best_parameter['n_estimators']
best_param

23

**Apply Random Forest and check accuracy.**

In [54]:
clf = RandomForestClassifier(n_estimators=best_param,n_jobs = -1)

In [55]:
clf.fit(X_train,y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=23, n_jobs=-1,
                       oob_score=False, random_state=None, verbose=0,
                       warm_start=False)

In [57]:
y_pred=rf.predict(X_test)

In [58]:
accuracy = accuracy_score(y_test, y_pred)
accuracy

0.822

In [59]:
recall=recall_score(y_test, y_pred)
recall

0.9945520581113801

In [60]:
precision=precision_score(y_test, y_pred)
precision

0.8256281407035176

In [61]:
f1score=f1_score(y_test, y_pred)
f1score

0.9022515101592532

# 2. Word2Vec

In [51]:
final_data.head()

Unnamed: 0,index,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,ProcessedText
238718,346116,374422,B00004CI84,A1048CYU0OV4O8,Judy L. Eans,2,2,1,947376000,GREAT,THIS IS ONE MOVIE THAT SHOULD BE IN YOUR MOVIE...,this one movi that should your movi collect fi...
772,1146,1245,B00002Z754,A29Z5PI9BW2PU3,Robbie,7,7,1,961718400,Great Product,This was a really good idea and the final prod...,this was realli good idea and the final produc...
238716,346114,374420,B00004CI84,A1ZH086GZYL5MZ,Doug DeBolt,2,2,1,1013385600,"A little gross, a lot of fun",Michael Keaton was already on his way to being...,michael keaton was alreadi his way be major st...
238713,346111,374417,B00004CI84,A23QOAXJSWIBS6,"Daniel S. Russell ""syzygy121""",2,2,1,1066435200,Underrated showcase of scary-zany Burtonland,This is definitely a guilty pleasure. As ofte...,this definit guilti pleasur often think not su...
19460,28087,30630,B00008RCMI,A284C7M23F0APC,A. Mendoza,0,0,1,1067040000,Best sugarless gum ever!,I love this stuff. It is sugar-free so it does...,love this stuff doesnt rot your gum and tast g...


In [52]:
final_data.shape

(10000, 12)

In [53]:
final_data_labels.head()

243990    1
83670     1
355739    1
126801    1
146314    1
Name: Score, dtype: int64

In [54]:
final_data_labels.shape

(10000,)

In [55]:
listOfSentences=[]
for sentence in final_data["ProcessedText"].values:
    s=[]
    for word in sentence.split():
        s.append(word)
    listOfSentences.append(s);

In [56]:
listOfSentences[0:1]

[['this',
  'one',
  'movi',
  'that',
  'should',
  'your',
  'movi',
  'collect',
  'fill',
  'with',
  'comedi',
  'action',
  'and',
  'whatev',
  'els',
  'you',
  'want',
  'call']]

In [57]:
final_data["ProcessedText"].values[0]

'this one movi that should your movi collect fill with comedi action and whatev els you want call'

In [58]:
from gensim.models import Word2Vec

In [59]:
w2v=Word2Vec(listOfSentences,size=350,min_count=5)

In [60]:
type(w2v)

gensim.models.word2vec.Word2Vec

In [61]:
w2v.wv[word][0]

0.122383885

In [62]:
w2v.vector_size

350

In [63]:
sentenceW2V=[]
for sentence in listOfSentences:
    sentVector=np.zeros(350)
    for word in sentence:
        try:
            vector=w2v.wv[word]
            sentVector+=vector
        except:
            pass
    sentenceW2V.append(sentVector)

In [64]:
len(sentenceW2V)

10000

In [65]:
len(sentenceW2V[0])

350

**Split into train and test data in ratio of 80:20**

In [66]:
from sklearn.model_selection import train_test_split

In [67]:
X_train_W2V,X_test_W2V,y_train_W2V,y_test_W2V=train_test_split(sentenceW2V,final_data_labels,test_size=0.2,random_state=42)

In [68]:
type(X_train_W2V)

list

In [69]:
X_train_W2V_n=np.array(X_train_W2V)
X_test_W2V_n=np.array(X_test_W2V)
y_train_W2V_n=np.array(y_train_W2V)
y_test_W2V_n=np.array(y_test_W2V)

In [70]:
X_train_W2V_n.shape,X_test_W2V_n.shape,y_train_W2V_n.shape,y_test_W2V_n.shape

((8000, 350), (2000, 350), (8000,), (2000,))

In [71]:
rfc=RandomForestClassifier(n_jobs=-1)

In [72]:
rf.fit(X_train_W2V_n,y_train_W2V_n)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
                       oob_score=False, random_state=None, verbose=0,
                       warm_start=False)

In [73]:
y_pred_w2v=rf.predict(X_test_W2V_n)

In [74]:
accuracy = accuracy_score(y_test_W2V_n, y_pred_w2v)
accuracy

0.8165

In [75]:
recall=recall_score(y_test_W2V_n, y_pred_w2v)
recall

0.9530240751614797

In [76]:
precision=precision_score(y_test_W2V_n, y_pred_w2v)
precision

0.849738219895288

In [77]:
f1score=f1_score(y_test_W2V_n, y_pred_w2v)
f1score

0.8984223636866869

In [78]:
cm=confusion_matrix(y_test_W2V_n, y_pred_w2v)
cm

array([[  10,  287],
       [  80, 1623]], dtype=int64)

**Perform GridSearch Cross Validation to find optimal value of number of base models in Random Forest.**

In [79]:
from sklearn.model_selection import GridSearchCV

In [80]:
values = []
for i in range(1, 31, 2):
    values.append(i)

clf = RandomForestClassifier(n_jobs = -1)

hyper_parameters = {'n_estimators': values}
grid_search = GridSearchCV(clf, hyper_parameters, scoring = "accuracy", cv = 3)

In [81]:
grid_search.fit(X_train_W2V_n,y_train_W2V_n)

GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators='warn', n_jobs=-1,
                                              oob_score=False,
                                              random_state=None, verbose=0,
                                              warm_start=False),
             iid='

In [82]:
print(grid_search.best_estimator_)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=27, n_jobs=-1,
                       oob_score=False, random_state=None, verbose=0,
                       warm_start=False)


In [83]:
best_parameter=grid_search.best_params_
best_param=best_parameter['n_estimators']
best_param

27

**Apply Random Forest and report accuracy.**

In [84]:
clf = RandomForestClassifier(n_estimators=best_param,n_jobs = -1)

In [85]:
clf.fit(X_train_W2V_n,y_train_W2V_n)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=27, n_jobs=-1,
                       oob_score=False, random_state=None, verbose=0,
                       warm_start=False)

In [86]:
y_pred_w2v_b=rf.predict(X_test_W2V_n)

In [87]:
accuracy = accuracy_score(y_test_W2V_n,y_pred_w2v_b)
accuracy

0.8165

In [88]:
recall=recall_score(y_test_W2V_n, y_pred_w2v_b)
recall

0.9530240751614797

In [89]:
precision=precision_score(y_test_W2V_n, y_pred_w2v_b)
precision

0.849738219895288

In [90]:
f1score=f1_score(y_test_W2V_n, y_pred_w2v_b)
f1score

0.8984223636866869

In [91]:
cm=confusion_matrix(y_test_W2V_n, y_pred_w2v_b)
cm

array([[  10,  287],
       [  80, 1623]], dtype=int64)

**TFIDF:**

Without performing GridSearchCV --  Accuracy = 82.2%

After performing GridSearchCV -- Accuracy = 82.2%


**Word2Vec:**

Without performing GridSearchCV --  Accuracy = 81.65%

After performing GridSearchCV -- Accuracy = 81.65%