<a href="https://colab.research.google.com/github/amritnaruto/Natural-Language-Processing/blob/master/Amazon_fine_food_reviews.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Getting our data

Amazon Fine Food Reviews

Copy $\text{kaggle.json}$ into your Google Drive<br>
or just upload it into the Colab file space,<br>
whichever way you prefer.

If you do not have this file,<br>
go to your kaggle account<br>
and $\text{create new API token}$

Follow along with the commands<br>
and you will see what is going on..

In [1]:
!pip install -q kaggle
!mkdir -p /root/.kaggle/

In [2]:
!cp /content/drive/My\ Drive/Datasets/kaggle.json /root/.kaggle/

# !cp < kaggle.json file path> /root/.kaggle
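
The two shell cells above can also be done from Python, which makes it easy to restrict the token's permissions at the same time (the Kaggle CLI prints a warning when the file is readable by other users). This is only a minimal sketch: `install_kaggle_token` is a hypothetical helper name, not part of the kaggle package, and the default directory is the one used in this notebook.

```python
import os
import shutil
from pathlib import Path


def install_kaggle_token(src, dest_dir="/root/.kaggle"):
    """Copy kaggle.json into dest_dir and make it readable by the owner only.

    Hypothetical helper for illustration; the paths are assumptions.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)  # like `mkdir -p`
    dest = dest_dir / "kaggle.json"
    shutil.copy(src, dest)
    os.chmod(dest, 0o600)  # same effect as `chmod 600`
    return dest


# Example call with the Drive path from the cell above (needs Drive mounted):
# install_kaggle_token("/content/drive/My Drive/Datasets/kaggle.json")
```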

In the dataset page of Amazon Fine Food Reviews,<br>
beside the 'New Notebook' button<br>
click the options (three dots) and<br>
select 'Copy API Command'<br>

In case Kaggle changes the layout, the option for the API command should still be there somewhere.

In [3]:
!kaggle datasets download -d snap/amazon-fine-food-reviews

Downloading amazon-fine-food-reviews.zip to /content
100% 242M/242M [00:02<00:00, 163MB/s]
100% 242M/242M [00:02<00:00, 112MB/s]


In [4]:
!unzip amazon-fine-food-reviews.zip

Archive:  amazon-fine-food-reviews.zip
  inflating: Reviews.csv             
  inflating: database.sqlite         
  inflating: hashes.txt              


And we have our data

In [5]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score, log_loss

from gensim.parsing.preprocessing import remove_stopwords
from gensim.utils import simple_preprocess
from gensim.parsing.porter import PorterStemmer
from gensim.models import Word2Vec

# Data preparation

We will simply classify each review as positive or negative based on its score.

So we will only need the review text column and the score column.

## Let's take a look at the entire dataset

In [6]:
reviews_log = pd.read_csv('Reviews.csv')
reviews_log.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


Let's look at a review..

In [7]:
reviews_log.iloc[0,-1]

'I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.'

What does the Score column look like?..

In [8]:
reviews_log['Score']

0         5
1         1
2         4
3         2
4         5
         ..
568449    5
568450    2
568451    5
568452    5
568453    5
Name: Score, Length: 568454, dtype: int64

## Taking out Reviews and Score into a separate dataframe

Since this is a binary classification (positive vs. negative),<br>
scores above 3 will be considered positive and scores below 3 negative.

First drop the rows with a score of 3.. they are neutral, so we won't need them

In [9]:
reviews_log.drop( reviews_log[reviews_log['Score'] == 3].index, inplace=True )

Converting score to 1s (for positive) and 0s (for negative)

In [10]:
reviews_log['Score'] = reviews_log['Score'].apply(lambda x: 1 if x > 3 else 0)
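
To see the labelling rule in isolation, here is a small sketch on made-up scores. The `toy` frame is purely illustrative and not part of the dataset; it also prints the class proportions, which are worth checking before reading too much into an accuracy number later on.

```python
import pandas as pd

# Toy scores standing in for reviews_log['Score'] (made-up values)
toy = pd.DataFrame({"Score": [5, 1, 4, 3, 2]})

# Drop the neutral rows, then map scores above 3 to 1 and below 3 to 0
toy = toy[toy["Score"] != 3].copy()
toy["sentiment"] = (toy["Score"] > 3).astype(int)

print(toy)
print(toy["sentiment"].value_counts(normalize=True))
```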

Now we create the dataset on which we will work

In [11]:
# Keep only the review text and the binary label, under clearer column names
dataset = reviews_log[['Text', 'Score']].rename(
    columns={'Text': 'review', 'Score': 'sentiment'})

dataset.head()

Unnamed: 0,review,sentiment
0,I have bought several of the Vitality canned d...,1
1,Product arrived labeled as Jumbo Salted Peanut...,0
2,This is a confection that has been around a fe...,1
3,If you are looking for the secret ingredient i...,0
4,Great taffy at a great price. There was a wid...,1


## Converting text to numbers

More specifically convert words to vectors.

But before that, we need to clean our text

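gensim provides all the necessary functions. We start with remove_stopwords(). Its matching was case-sensitive in this run, so capitalised stopwords such as 'This' and 'If' survive in the output below.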

In [12]:
dataset['review'] = dataset['review'].apply(remove_stopwords)

dataset.head()

Unnamed: 0,review,sentiment
0,I bought Vitality canned dog food products goo...,1
1,Product arrived labeled Jumbo Salted Peanuts.....,0
2,"This confection centuries. It light, pillowy c...",1
3,If looking secret ingredient Robitussin I beli...,0
4,Great taffy great price. There wide assortment...,1


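gensim's simple_preprocess() lowercases each review and tokenizes it into a list of words<br>

It also removes punctuation marks and, by default, drops tokens shorter than 2 or longer than 15 characters (which is why 'I' disappears)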

In [13]:
dataset['review'] = dataset['review'].apply(simple_preprocess)

dataset.head()

Unnamed: 0,review,sentiment
0,"[bought, vitality, canned, dog, food, products...",1
1,"[product, arrived, labeled, jumbo, salted, pea...",0
2,"[this, confection, centuries, it, light, pillo...",1
3,"[if, looking, secret, ingredient, robitussin, ...",0
4,"[great, taffy, great, price, there, wide, asso...",1
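

What simple_preprocess() does to a sentence can be approximated in a few lines of plain Python. This is a rough stand-in for illustration, not gensim's actual code: `rough_simple_preprocess` is a hypothetical name, and the regular expression is an assumption that keeps runs of letters (and underscores) while dropping digits and punctuation.

```python
import re


def rough_simple_preprocess(text, min_len=2, max_len=15):
    """Rough stand-in for gensim's simple_preprocess (not its actual code)."""
    # Lowercase, then keep runs of word characters that are not digits
    tokens = re.findall(r"[^\W\d]+", text.lower())
    # Drop very short and very long tokens, mirroring gensim's defaults
    return [t for t in tokens if min_len <= len(t) <= max_len]


print(rough_simple_preprocess("I bought 2 cans, and it's GREAT!"))
# ['bought', 'cans', 'and', 'it', 'great']
```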


Stemming

In [14]:
stemmer = PorterStemmer()  # create the stemmer once instead of once per token
dataset['review'] = dataset['review'].apply(lambda x: [stemmer.stem(w) for w in x] )

dataset.head()

Unnamed: 0,review,sentiment
0,"[bought, vital, can, dog, food, product, good,...",1
1,"[product, arriv, label, jumbo, salt, peanut, t...",0
2,"[thi, confect, centuri, it, light, pillowi, ci...",1
3,"[if, look, secret, ingredi, robitussin, believ...",0
4,"[great, taffi, great, price, there, wide, asso...",1


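As you can observe, PorterStemmer is fairly aggressive..

It converted 'this' to 'thi',<br>
'arrived' to 'arriv', etc. A stemmer only chops off suffixes, so its output need not be a real word.

If you want dictionary words, use a lemmatizer instead, e.g. nltk's WordNetLemmatizer (gensim's own lemmatize utility was removed in gensim 4.0).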

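Before we apply Word2Vec,<br>
we need to split the dataset into train and test sets<br>
so that nothing from the unseen test data leaks into the word vectors.

train_test_split holds out 25% of the rows by default, and stratify keeps the positive/negative ratio the same in both parts.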

In [15]:
train_text, test_text, y_train, y_test = train_test_split(dataset['review'],
                                                    dataset['sentiment'],
                                                    stratify = dataset['sentiment'],
                                                    random_state=0)

print(train_text.shape)
print(test_text.shape)

(394360,)
(131454,)


Now we apply...

**min_count** sets the minimum number of occurrences a word needs in order to be kept. If a word occurs fewer than min_count times, it will not be vectorized. We will set that to 1, so every word in the training set gets a vector.

**size** is the dimensionality of the vector.

**window** sets the size of context.

**sg**: 0 for using CBOW, and 1 for Skip-gram
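
To get a feel for what **window** means, the sketch below lists the (center, context) pairs that a skip-gram model would see for one tokenized review. It is a simplification under stated assumptions: `skipgram_pairs` is a hypothetical helper with a fixed symmetric window, whereas gensim's real training loop also shrinks the window randomly and does the actual learning. With CBOW (sg=0) the same context words are combined to predict the center word instead.

```python
def skipgram_pairs(tokens, window=3):
    """Return (center, context) pairs using a fixed symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                 # leftmost context position
        hi = min(len(tokens), i + window + 1)   # one past the rightmost
        for j in range(lo, hi):
            if j != i:                          # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = ["great", "taffi", "great", "price"]
for pair in skipgram_pairs(tokens, window=1):
    print(pair)
```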

In [16]:
# Note: gensim >= 4.0 renamed `size` to `vector_size`
w2v_model = Word2Vec(train_text, min_count=1,
               size = 100, workers=3,
               window=3, sg=1)

w2v_model

<gensim.models.word2vec.Word2Vec at 0x7f00b9fbae80>

## Final prep

Our word2vec model is trained, so now we convert each review into a single vector by averaging the vectors of its tokens.

In [17]:
# Look vectors up through .wv (indexing the model directly is deprecated)
temp = train_text.apply(lambda x: np.mean([ w2v_model.wv[token] for token in x ],
                                          axis=0))

temp


323249    [0.25243348, 0.21213992, 0.032568213, -0.23740...
342801    [0.30880803, 0.024606204, -0.121111214, -0.008...
164441    [0.3022396, 0.085057326, -0.13120477, -0.08813...
423976    [0.18207915, 0.24881898, -0.15377955, -0.05449...
229980    [0.33945492, -0.14384131, 0.01150136, -0.05900...
                                ...                        
484657    [0.32042098, 0.008284028, -0.14173505, -0.1953...
154976    [0.23416626, -0.033135507, -0.106669135, -0.12...
363974    [0.2624597, -0.09991959, 0.030970657, 0.036036...
357675    [0.27868813, 0.0098297, -0.012811714, -0.15416...
443959    [0.3356177, 0.063156694, -0.2243333, -0.067290...
Name: review, Length: 394360, dtype: object

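If you observe, this is a Series..

A Series in which each element is an array of 100 values..

We don't want this..<br>
rather we want a DataFrame with one column per dimension.

First let's deal with any NaN values. They come from reviews whose token list is empty after preprocessing, because the mean of an empty list is NaN.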

In [18]:
temp[temp.isna()]

299605    NaN
378643    NaN
544869    NaN
233938    NaN
324249    NaN
388831    NaN
487863    NaN
188001    NaN
Name: review, dtype: object

In [19]:
# Replace each NaN with the mean of the training document vectors
v = np.mean(temp)
for i in temp[temp.isna()].index:
    temp.loc[i] = v

In [20]:
temp[temp.isna()]

Series([], Name: review, dtype: object)

Now we convert the Series into our preferred DataFrame

In [21]:
X_train = pd.DataFrame.from_dict( dict(zip(temp.index, temp.values)),
                                 orient='index')
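
An equivalent way to do this conversion is to stack the arrays into one 2-D array and reuse the Series index as row labels. The sketch below shows the idea on a made-up two-row Series (`toy` and its values are illustrative only, with 3 dimensions instead of 100).

```python
import numpy as np
import pandas as pd

# Toy Series of equal-length vectors, standing in for `temp` (made-up values)
toy = pd.Series(
    [np.array([0.1, 0.2, 0.3]), np.array([0.4, 0.5, 0.6])],
    index=[323249, 342801],
)

# Stack the vectors row by row and keep the original index as row labels
X_toy = pd.DataFrame(np.vstack(toy.tolist()), index=toy.index)
print(X_toy.shape)  # (2, 3)
```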

In [22]:
X_train.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99
323249,0.252433,0.21214,0.032568,-0.237404,-0.148126,-0.136992,-0.121911,-0.161791,-0.53723,0.185514,0.479198,-0.049389,-0.245795,0.076609,-0.085898,-0.102754,-0.06809,0.181401,-0.170236,-0.068251,0.016463,-0.173762,0.071989,0.200722,-0.052284,0.491653,0.067551,-0.261352,0.062048,0.023287,-0.050551,0.063143,-0.240281,-0.090183,-0.339972,-0.333721,-0.012428,-0.265852,-0.100961,0.107584,...,0.160234,0.102618,0.274741,-0.103355,-0.274502,-0.271442,-0.583706,0.052091,-0.042066,-0.037004,0.164125,0.445603,0.215982,-0.237372,-0.129968,0.162855,-0.021956,-0.19458,0.087771,0.295209,0.176602,0.318725,-0.032127,0.262596,-0.009972,-0.055356,0.265231,-0.01334,0.407962,0.126266,0.253608,0.054547,0.375947,-0.107142,0.324681,0.070237,0.123281,0.270396,-0.196323,0.47335
342801,0.308808,0.024606,-0.121111,-0.008237,-0.14402,-0.025024,0.071191,-0.334631,-0.321876,0.100054,0.329729,-0.16835,-0.061783,-0.049374,0.118907,-0.08727,-0.25939,0.146931,-0.094052,0.332739,0.129923,-0.233296,0.187707,0.108035,-0.075546,0.439267,0.247168,-0.117663,0.148518,0.16122,-0.127474,0.145068,-0.172556,-0.0389,-0.296869,-0.304696,-0.138908,-0.309037,-0.174421,0.131234,...,0.094397,0.053523,-0.04982,-0.050218,-0.155248,-0.241372,-0.398131,0.074371,0.152284,0.075669,0.151866,0.418107,0.169767,-0.100227,0.061269,0.005876,-0.194562,-0.12787,0.342907,0.275554,0.024734,0.195368,0.178669,0.200804,-0.257308,0.193324,0.121643,-0.048252,0.431312,0.069342,0.168324,0.167365,0.032077,-0.229165,0.410703,0.18256,0.232444,0.381155,0.037238,0.285413
164441,0.30224,0.085057,-0.131205,-0.088136,-0.173133,-0.085174,0.206814,-0.324954,-0.322299,-0.019898,0.343496,-0.279253,-0.136407,-0.081176,0.109006,-0.219649,-0.235751,0.154892,-0.141298,0.169248,0.037082,-0.244258,0.238271,0.138626,-0.133109,0.446502,0.237258,-0.099168,0.20116,0.16733,-0.17183,0.045361,-0.156199,-0.014034,-0.336798,-0.18134,-0.111324,-0.365484,-0.067612,0.113932,...,0.121788,0.074314,-0.180642,-0.036687,-0.220254,-0.389361,-0.193532,0.144187,0.041006,0.086013,0.10881,0.443895,0.020758,-0.102047,0.02744,0.178896,-0.162796,-0.053964,0.240465,0.238298,0.085796,0.287646,0.105647,0.252919,-0.243448,0.123115,0.247049,-0.134708,0.407281,0.041078,0.146423,0.209838,0.104267,-0.149227,0.387239,0.190012,0.24694,0.251289,-0.051736,0.282075
423976,0.182079,0.248819,-0.15378,-0.054491,-0.264087,-0.340695,-0.057152,-0.399522,-0.334112,-0.040897,0.40601,-0.335896,-0.111445,-0.290381,0.051144,-0.02601,-0.197804,0.226103,-0.192778,0.091181,-0.03295,-0.229786,0.369223,0.129695,-0.071986,0.404963,0.45666,-0.340429,0.260616,0.315885,-0.226785,-0.070739,-0.019855,-0.053925,-0.193267,-0.298893,-0.187209,-0.061662,-0.076553,0.118591,...,0.011202,0.105276,-0.204127,0.027832,-0.070651,-0.548815,-0.196778,0.166535,0.097607,0.094007,-0.006582,0.284871,-0.006597,-0.238071,-0.017583,0.280364,-0.090666,-0.035711,0.313027,0.302798,0.214245,0.312013,0.28155,0.293477,-0.097448,0.073384,0.410453,-0.164593,0.177724,-0.029533,-0.086769,0.432034,0.254311,-0.036286,0.315286,-0.008311,0.233776,0.25821,0.07327,0.317053
229980,0.339455,-0.143841,0.011501,-0.059003,-0.165186,0.039881,0.045913,-0.246844,-0.149052,-0.036479,0.372923,-0.261831,0.028815,0.066781,0.116302,-0.156279,-0.183813,0.203778,-0.045273,0.252633,0.123726,-0.361893,0.188957,0.129807,-0.047814,0.390435,0.167822,-0.130743,0.107405,0.168218,-0.076218,0.116501,-0.065357,-0.020653,-0.30808,-0.329655,-0.045098,-0.342716,-0.179305,0.398697,...,0.125375,0.161579,-0.08302,-0.021796,-0.191142,-0.268307,-0.239278,0.165276,0.175487,-0.006518,0.029621,0.399537,0.113384,-0.121505,0.008987,0.05763,-0.111309,-0.111618,0.323647,0.212133,0.119831,0.321174,0.054711,0.115511,-0.271028,0.132794,0.262815,-0.042991,0.359088,-0.107947,0.067196,0.24379,0.025918,-0.169469,0.318569,0.093413,0.180139,0.328849,0.140244,0.275698


Yep.. this is what we want..

Now in the test set, there may be words which did not occur in the train set<br>

Such words are not in the word2vec vocabulary<br>
and therefore cannot be vectorized.

we will convert such words into a zero vector.
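
Here is the idea on a toy scale before we apply it to the real model. Everything in this sketch is an assumption for illustration: `toy_vectors` is a made-up lookup table standing in for the trained vocabulary, `doc_vector` is a hypothetical helper, and the vectors have 4 dimensions instead of 100.

```python
import numpy as np

DIM = 4  # toy dimensionality; the notebook uses 100

# Made-up embedding table standing in for the trained Word2Vec vocabulary
toy_vectors = {
    "great": np.array([1.0, 0.0, 0.0, 0.0]),
    "taffi": np.array([0.0, 1.0, 0.0, 0.0]),
}


def doc_vector(tokens, vectors, dim=DIM):
    """Average the token vectors; unknown tokens contribute a zero vector."""
    if not tokens:
        # An empty review has nothing to average, so mark it as missing
        return np.full(dim, np.nan)
    rows = [vectors.get(t, np.zeros(dim)) for t in tokens]
    return np.mean(rows, axis=0)


print(doc_vector(["great", "taffi", "unseenword"], toy_vectors))
```

Note that every zero vector pulls the average towards the origin. Skipping unknown tokens altogether is a possible alternative, at the cost of more reviews ending up with no vectors at all.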

In [23]:
def w2v_func(x):
    # Words missing from the Word2Vec vocabulary raise KeyError
    try:
        return w2v_model.wv[x]
    except KeyError:
        return np.zeros(100,)

temp = test_text.apply(lambda x: np.mean( [w2v_func(token) for token in x], 
                                         axis=0) )

for i in temp[temp.isna()].index:
    temp.loc[i] = v

X_test = pd.DataFrame.from_dict( dict(zip(temp.index, temp.values)), 
                                orient='index')


In [24]:
X_test.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99
208568,0.294,-0.068598,-0.010935,0.012504,-0.151976,-0.008306,-0.237612,-0.309721,-0.269854,0.073029,0.459732,-0.351365,-0.117844,-0.101529,0.230719,-0.005994,-0.12554,0.194221,-0.131212,0.086875,0.002288,-0.344901,0.250786,0.138686,-0.137821,0.392865,0.082621,-0.102847,-0.074401,0.19767,-0.123191,0.038767,0.063131,-0.039194,-0.248121,-0.341854,-0.113912,-0.231794,-0.07569,0.267323,...,0.098183,0.135395,-0.035856,0.061143,-0.116041,-0.496005,-0.299474,0.023949,0.162272,-0.084995,-0.100098,0.211273,0.134238,-0.129615,0.113928,0.168231,-0.190219,-0.025537,0.40327,0.274388,0.206388,0.289022,0.066736,0.187616,-0.282719,0.143898,0.180387,-0.053885,0.440821,-0.32341,0.105959,0.395341,0.015426,-0.081143,0.444694,0.254494,0.199947,0.283238,0.077414,0.27794
326959,0.176762,0.057403,-0.187187,-0.04262,-0.230203,-0.141246,-0.065864,-0.275595,-0.364767,-0.020605,0.333426,-0.323465,-0.016116,-0.104303,0.136378,-0.163902,-0.265134,0.224896,-0.24721,0.323927,-0.014867,-0.309821,0.241263,0.043485,-0.151614,0.355285,0.218123,-0.155152,0.160314,0.242147,-0.093906,-0.032858,-0.06608,-0.060644,-0.215631,-0.402417,-0.049859,-0.206686,-0.191616,0.190606,...,0.023554,0.142939,-0.095457,0.018947,0.005588,-0.375345,-0.299792,0.041586,0.0828,0.007549,-0.092914,0.336724,0.105661,-0.15549,0.04163,0.183465,-0.224967,-0.155598,0.314403,0.263891,0.187141,0.226873,0.172743,0.269579,-0.196931,0.054821,0.247641,-0.118991,0.330974,-0.086853,-0.08061,0.35658,0.04505,-0.075183,0.324655,0.194116,0.2366,0.21957,0.043543,0.298795
292064,0.192817,-0.001591,-0.112183,-0.057475,-0.165857,-0.123477,0.083526,-0.254859,-0.395052,0.140516,0.441834,-0.155495,-0.223818,0.000567,-0.025255,-0.084663,-0.160085,0.274446,-0.191406,0.042795,0.03293,-0.311123,0.238399,0.166961,-0.075368,0.4613,0.179413,-0.225638,0.150521,0.119108,-0.081082,0.057445,0.026054,0.018466,-0.226195,-0.243417,-0.001161,-0.197971,-0.038239,0.23991,...,0.094142,0.193733,0.044387,-0.004768,-0.088984,-0.223009,-0.275863,-0.041844,-0.079369,0.010079,-0.070787,0.464929,0.34713,-0.010274,-0.145332,0.151197,-0.135199,-0.091008,0.11212,0.339533,0.185669,0.315048,0.166237,0.216483,-0.124235,0.076727,0.235166,-0.047999,0.357511,0.141112,0.124243,0.245302,0.096513,-0.185765,0.353029,0.173698,0.172993,0.296124,-0.092636,0.322293
85577,0.184148,0.080792,-0.179845,0.012167,-0.064042,-0.048932,0.019091,-0.365839,-0.234425,-0.134919,0.269273,-0.305107,0.047456,0.000991,0.11302,0.022207,-0.21429,0.139816,-0.154448,0.015033,0.055423,-0.216445,0.225629,0.145958,-0.058737,0.442551,0.27988,-0.262989,0.078579,0.225602,-0.202575,0.061272,-0.069509,0.022825,-0.416728,-0.43918,0.000137,-0.049983,-0.209522,0.297469,...,0.047691,0.120784,-0.00533,0.070589,-0.021254,-0.416986,-0.38067,0.01276,0.183034,0.025214,0.116842,0.364985,0.105362,-0.153549,-0.082926,0.222998,-0.048693,-0.071802,0.197704,0.365397,0.22064,0.392912,0.193886,0.358354,-0.148871,0.041759,0.315767,-0.218229,0.280108,-0.023471,0.060808,0.394134,0.124328,-0.267365,0.365408,0.0932,0.19673,0.285905,-0.013367,0.226287
164229,0.308732,0.044122,-0.077944,-0.122425,-0.131907,-0.050913,0.104627,-0.290712,-0.287747,0.190735,0.387199,-0.125616,-0.01036,-0.122454,0.205217,-0.161019,-0.137536,0.219214,-0.082063,0.279173,0.046942,-0.222632,0.269959,0.079076,-0.087607,0.469959,0.179964,-0.084298,0.181634,0.125343,-0.045368,0.030069,0.025873,-0.098591,-0.320751,-0.389305,-0.181913,-0.297829,-0.19015,0.23682,...,0.100436,0.121525,0.02541,-0.107121,-0.054281,-0.298962,-0.321711,0.040099,-0.015827,0.017738,2.3e-05,0.398908,0.214232,-0.099377,-0.11431,0.115488,-0.275936,-0.092454,0.325106,0.246393,0.174885,0.220582,0.073413,0.139817,-0.240976,0.106088,0.234622,-0.11341,0.406965,0.005672,0.127439,0.197885,0.212236,-0.173546,0.337989,0.192327,0.245536,0.256984,0.218902,0.351517


# Applying model

We simply use a Logistic Regression to see how it works

In [25]:
lr_cv = LogisticRegressionCV(solver='sag', n_jobs=-1, random_state=0)

lr_cv.fit(X_train, y_train)

LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False,
                     fit_intercept=True, intercept_scaling=1.0, l1_ratios=None,
                     max_iter=100, multi_class='auto', n_jobs=-1, penalty='l2',
                     random_state=0, refit=True, scoring=None, solver='sag',
                     tol=0.0001, verbose=0)

The candidate values of C (the inverse of the regularisation strength) tried during cross-validation

In [26]:
lr_cv.Cs_

array([1.00000000e-04, 7.74263683e-04, 5.99484250e-03, 4.64158883e-02,
       3.59381366e-01, 2.78255940e+00, 2.15443469e+01, 1.66810054e+02,
       1.29154967e+03, 1.00000000e+04])
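
These ten values are simply a log-spaced grid from 1e-4 to 1e4, which is what scikit-learn builds when Cs is left at its default of 10. A quick sketch that reproduces the same numbers with NumPy:

```python
import numpy as np

# Ten values evenly spaced on a log scale from 1e-4 to 1e4
grid = np.logspace(-4, 4, 10)
print(grid)
```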

The weight vector obtained, with one coefficient per embedding dimension.

In [27]:
lr_cv.coef_

array([[ 2.67722468,  4.33575377, -2.44593323,  1.05756641,  0.81989912,
         2.12842061, -6.31270017, -2.46614167, -2.05674866, -0.49966624,
        -0.30189814, -3.13809181, -1.23617525, -5.81015876,  3.33279935,
        -2.76042927,  0.65482648,  0.97269735,  1.54153071,  2.75857937,
        -2.39286875,  3.7123978 ,  6.35282296,  1.12890002, -2.95380791,
        -1.78083075, -0.43175794,  0.21194503,  0.30432438, -2.89556041,
         0.47894346, -0.34042328,  2.38409915,  2.77873428,  1.48143093,
         4.01556727, -2.35166644,  2.97706741,  4.28078537, -6.6557546 ,
         0.77195948, -2.34034622,  1.3317324 , -2.8601829 , -2.24085788,
        -1.35986042,  1.50596747, -2.17347994,  1.180456  ,  3.23394088,
        -2.42485462,  0.85102209, -3.64011884,  1.88894153, -1.15360687,
        -2.32806724,  0.20211164,  0.6532244 , -1.92585417, -3.14224459,
        -2.99210761, -4.74444027,  1.29856326, -0.05937195,  0.15642827,
        -0.65446181,  0.21329185,  1.47103112,  0.6

In [28]:
y_pred = lr_cv.predict(X_test)
y_pred_proba = lr_cv.predict_proba(X_test)

print(accuracy_score(y_test, y_pred))
print(log_loss(y_test, y_pred_proba))

0.9083329529721423
0.22638783146485286


Hmmmm.. not bad for a casual run..!!
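
For reference, both metrics are easy to compute by hand. The sketch below does so on made-up labels and probabilities; `accuracy` and `binary_log_loss` are hypothetical helper names, and scikit-learn's own functions handle more cases than this. In the notebook, the probability of the positive class would be the second column, `y_pred_proba[:, 1]`.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


def binary_log_loss(y_true, p_pos, eps=1e-15):
    """Mean negative log-likelihood of the true labels (binary case)."""
    y = np.asarray(y_true, dtype=float)
    # Clip so that log(0) never occurs
    p = np.clip(np.asarray(p_pos, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


y_true = [1, 0, 1, 1]
p_pos = [0.9, 0.2, 0.6, 0.4]              # made-up P(positive) per review
y_pred = [int(p >= 0.5) for p in p_pos]   # threshold at 0.5

print(accuracy(y_true, y_pred))
print(binary_log_loss(y_true, p_pos))
```

Accuracy only looks at the thresholded labels, while log loss also penalises confident wrong probabilities, which is why it is useful to report both.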