In [221]:
import os
os.chdir('/Users/ernestng/Desktop/projects/foodreview/amazon-fine-food-reviews/')
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix, recall_score, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing
from imblearn.over_sampling import SMOTE, ADASYN
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [189]:
fooddf = pd.read_csv("./Reviews.csv")
fooddf = fooddf[['Score', 'Text']]

Check the number of reviews for each rating (1-5)

In [23]:
fooddf.groupby('Score').count()

Score,Text
1,52268
2,29769
3,42640
4,80655
5,363122


There are far more 5-star reviews than reviews with any other rating, so I use random undersampling to balance the data.

Random undersampling balances the class distribution by randomly discarding majority-class examples until the majority and minority classes are the same size. Here, I reduce the higher class (ratings 4 and 5) to match the size of the lower class (ratings 1 and 2).

I drop the 3-star reviews because I want a binary classifier and a rating of 3 sits on the fence, although rating 3 could be useful for a multiclass classifier.

Count of ratings 1-2: 82037
Count of ratings 4-5: 443777 (I randomly pick 82037 data points from here)
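
The undersampling step described above can be sketched on a toy table. This is only an illustration: the helper name `undersample_high` and the toy scores are made up, and the sketch assumes the low-rating group is the smaller of the two.

```python
import pandas as pd

def undersample_high(df, random_state=12):
    """Hypothetical helper: drop 3-star rows and shrink the 4-5 star group
    to the size of the 1-2 star group (assumes the latter is smaller)."""
    low = df[df["Score"] < 3]
    high = df[df["Score"] > 3]
    # Randomly keep only as many high-rating rows as there are low-rating rows
    high_sampled = high.sample(n=len(low), random_state=random_state)
    return pd.concat([low, high_sampled])

# Toy data: 2 low ratings, 1 neutral rating, 5 high ratings
toy = pd.DataFrame({"Score": [1, 2, 3, 4, 5, 5, 5, 4],
                    "Text": list("abcdefgh")})
balanced = undersample_high(toy)
n_low = int((balanced["Score"] < 3).sum())
n_high = int((balanced["Score"] > 3).sum())
print(n_low, n_high)
```

Taking the sample size from `len(low)` instead of a hard-coded number keeps the two groups equal if the input data changes.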

In [25]:
food_low = fooddf[fooddf['Score']<3]
food_low.head()

Unnamed: 0,Score,Text
1,1,Product arrived labeled as Jumbo Salted Peanut...
3,2,If you are looking for the secret ingredient i...
12,1,My cats have been happily eating Felidae Plati...
16,2,I love eating them and they are good for watch...
26,1,"The candy is just red , No flavor . Just plan..."


In [50]:
food_high = fooddf[fooddf['Score']>3]
# Randomly undersample the 4-5 star reviews to match the 82037 low-rating reviews
food_high_resampled = food_high.sample(n = 82037, random_state=12)

In [49]:
new_food = pd.concat([food_low,food_high_resampled])
def sentiment(x):
    if x < 3:
        return 0
    else:
        return 1
new_food["sentiment"] = new_food.Score.map(sentiment)
new_food.sentiment.value_counts()

1    82037
0    82037
Name: sentiment, dtype: int64

In [51]:
new_food

Unnamed: 0,Score,Text,sentiment
1,1,Product arrived labeled as Jumbo Salted Peanut...,0
3,2,If you are looking for the secret ingredient i...,0
12,1,My cats have been happily eating Felidae Plati...,0
16,2,I love eating them and they are good for watch...,0
26,1,"The candy is just red , No flavor . Just plan...",0
...,...,...,...
410093,5,Loved this for breakfast as a child. My child...,1
217854,5,"After a lengthy trek in the Himalayan Range, t...",1
426527,5,These wafers are wonderfully light and taste j...,1
395540,5,I'm a huge fan of Earnest Eats with my favorit...,1


X input : Text
y output : sentiment

In [192]:
X_train, X_test, y_train, y_test = train_test_split(new_food.Text, new_food.sentiment, test_size=0.1, random_state=42)


After splitting the data, I use CountVectorizer() to convert the text documents to a matrix of token counts.

With its default settings, it converts the strings to lower case, tokenizes them into words, and builds a vocabulary from the tokens seen in the training set.
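
To make the token-count matrix concrete, here is a minimal sketch on two made-up sentences (not taken from the dataset). It relies only on the default CountVectorizer settings.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Good food, good price", "Not good"]  # toy documents, not from the dataset
vect = CountVectorizer().fit(docs)

# Vocabulary terms in column order (lowercased, punctuation dropped)
terms = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
# One row per document, one column per vocabulary term
counts = vect.transform(docs).toarray().tolist()
print(terms)
print(counts)
```

Each row counts how often each vocabulary term occurs in one document, so "good" is counted twice in the first sentence.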

In [164]:
xvect = CountVectorizer().fit(X_train)
xtrain_vect = xvect.transform(X_train)

Next I use TfidfVectorizer instead, which gives less weight to words that appear in most documents and more weight to words that are distinctive.
Raw counts only measure how often each word occurs in a document, so very frequent words such as 'the' and 'if' would outweigh more informative words such as 'buy' and 'product', which give us the context.
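
The down-weighting can be seen in the inverse document frequency (IDF) values that TfidfVectorizer learns. A minimal sketch on three made-up sentences, using the default settings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the product is good",
        "the product is bad",
        "the service is slow"]  # toy documents, not from the dataset
vect = TfidfVectorizer().fit(docs)

# Map each term to its learned inverse document frequency
idf = {term: float(vect.idf_[col]) for term, col in vect.vocabulary_.items()}
print(idf["the"], idf["product"], idf["good"])
```

'the' appears in every document and gets the lowest IDF, 'product' appears in two of three and sits in the middle, and 'good' appears in only one and gets the highest.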

In [68]:
xvect = TfidfVectorizer().fit(X_train)
xtrain_vect = xvect.transform(X_train)
len(xvect.vocabulary_)  # vocabulary size; get_feature_names() was removed in scikit-learn 1.2

58973

I further improve the model by adding n-grams. For example, this helps to differentiate between 'good' and 'not good', because word pairs are counted as features of their own. It also gives the model many more features to work with. I set ngram_range to (1, 3), so features are extracted for single words, word pairs, and word triples.
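
A small sketch of what ngram_range=(1, 3), the setting used in the next cell, extracts from a short phrase. The phrase is just an example, and build_analyzer() is used here only to inspect the features without fitting anything.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# build_analyzer() returns the function the vectorizer uses to turn text into features
analyze = TfidfVectorizer(ngram_range=(1, 3)).build_analyzer()
features = analyze("not very good")
print(features)
```

The features include 'not very good' as a single unit, which is how the model can tell it apart from a plain 'good'.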

In [235]:
xvect = TfidfVectorizer(ngram_range = (1,3)).fit(X_train)
xtrain_vect = xvect.transform(X_train)
len(xvect.vocabulary_)  # vocabulary size; get_feature_names() was removed in scikit-learn 1.2

6002749

I fit two models on these features, Logistic Regression and Multinomial Naive Bayes, and compare them.

Multinomial NB

In [71]:
food_model = MultinomialNB()
food_model.fit(xtrain_vect, y_train)


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [74]:
predictions = food_model.predict(xvect.transform(X_test))
roc_auc_score(y_test, predictions)

0.9249126449932409
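
A note on the metric: roc_auc_score is usually given scores or probabilities. When it is given hard 0/1 predictions, as in the cell above, the ROC curve has a single threshold and the result equals the average of the recall on each class. A minimal sketch with made-up labels and scores shows the difference:

```python
from sklearn.metrics import roc_auc_score

# Toy labels and scores, made up for illustration
y_true = [0, 0, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.45, 0.4, 0.7, 0.9]
hard = [1 if s >= 0.5 else 0 for s in scores]  # thresholded predictions

auc_from_scores = roc_auc_score(y_true, scores)  # uses the full ranking
auc_from_labels = roc_auc_score(y_true, hard)    # uses only the 0/1 decisions
print(auc_from_scores, auc_from_labels)
```

For the models in this notebook, scores could be obtained with something like food_model.predict_proba(xvect.transform(X_test))[:, 1]. An AUC computed that way would generally differ from the values reported here.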

Logistic Regression

Parameters I changed:
1) multi_class='ovr' (a binary problem is fit for each label)
2) n_jobs=1 (number of CPU cores used when parallelizing over classes if multi_class='ovr')
3) solver='liblinear' (algorithm to use when optimizing over a small dataset such as this)

In [236]:
food_LRmodel = LogisticRegression(n_jobs=1,multi_class='ovr',solver='liblinear')
food_LRmodel.fit(xtrain_vect, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
                   solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [237]:
predictionsLR = food_LRmodel.predict(xvect.transform(X_test))
roc_auc_score(y_test, predictionsLR)

0.9358109760522249

Logistic Regression (LR) gives a higher AUC score than Multinomial NB (MNB): 0.9358 versus 0.9249.
I will therefore use the LR model.

Function to predict the sentiment of review text

In [238]:
def predictreview(x):
    rev = food_LRmodel.predict(xvect.transform(x))
    rev_list = np.where(rev==1,"Positive","Negative").tolist()
    return dict(zip(x, rev_list))

Test cases

In [269]:
test = ["the ayam penyet i had today was awesome!",
        "good plate of fried rice, but expensive",
        "portion too small",
        "great value for money!",
        "cheap and good food",
        "cheap food",
        "waste of money",
        "terrible service and the food is badly done",
        "edible but i will not visit again",
        "wonderful service, i will definitely visit again",
        "not very good"]

test_chinese = ["hen haochi", 
                "hen bu haochi", 
                "shiwu bu hao chi",
                "wo bu xi huan shiwu",
                "wo xi huan shiwu"]
predictreview(test)

{'the ayam penyet i had today was awesome!': 'Positive',
 'good plate of fried rice, but expensive': 'Positive',
 'portion too small': 'Negative',
 'great value for money!': 'Positive',
 'cheap and good food': 'Positive',
 'cheap food': 'Negative',
 'waste of money': 'Negative',
 'terrible service and the food is badly done': 'Negative',
 'edible but i will not visit again': 'Negative',
 'wonderful service, i will definitely visit again': 'Positive',
 'not very good': 'Negative'}

I included some Hanyu Pinyin to test whether my model can handle romanized Chinese words

In [270]:
predictreview(test_chinese)

{'hen haochi': 'Positive',
 'hen bu haochi': 'Negative',
 'shiwu bu hao chi': 'Negative',
 'wo bu xi huan shiwu': 'Negative',
 'wo xi huan shiwu': 'Positive'}

The confusion matrix looks good: the numbers of true negatives (7743) and true positives (7612) are close

In [244]:
confusion_matrix(y_test, predictionsLR)

array([[7743,  479],
       [ 574, 7612]])

recall_score measures the ability of the classifier to find all the positive samples: true positives divided by the sum of true positives and false negatives.

In [245]:
recall_score(y_test, predictionsLR)

0.9298802834107012
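
The metrics can be checked by hand from the confusion matrix. The sketch below copies the four counts from the output above (rows are the true class, columns the predicted class); the variable names are my own.

```python
# Counts copied from the confusion matrix above (rows: true class, columns: predicted)
tn, fp = 7743, 479
fn, tp = 574, 7612

recall = tp / (tp + fn)            # share of positive reviews that were found
precision = tp / (tp + fp)         # share of predicted positives that are correct
specificity = tn / (tn + fp)       # recall on the negative class
accuracy = (tp + tn) / (tp + tn + fp + fn)
balanced_accuracy = (recall + specificity) / 2

print(round(recall, 4), round(precision, 4),
      round(accuracy, 4), round(balanced_accuracy, 4))
```

The recall matches the recall_score output, and the balanced accuracy matches the 0.9358 that roc_auc_score returned for the LR predictions, since that score was computed from hard 0/1 predictions.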

Conclusion: This model is trained on the Amazon Fine Food Reviews dataset, using 164074 reviews after dropping 3-star reviews and balancing the classes, with 10% held out for testing. On the 16408 test reviews it achieves an accuracy of about 93.6% (15355 correct predictions in the confusion matrix), which is quite good. I believe that training this model on a larger dataset could improve the accuracy further.
