## Exploratory Data Analysis for Amazon Fine Food Reviews

### Purpose

Build a model to predict the helpfulness of Amazon Fine Food Reviews. This will improve Amazon's selection of helpful reviews at the top of the review section and improve customer's purchasing decisions. It could also help other reviewers as a guide to writing helpful reviews.

This dataset comes from over 568,0454 Amazon Fine Food Reviews. 

In [1]:
#data dictionary

## Load the Data

In [2]:
#load data and and score helpfulness

In [3]:
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from sklearn.cross_validation import train_test_split

# this allows plots to appear directly in the notebook
%matplotlib inline

In [4]:
# read data into a DataFrame
data = pd.read_csv('Reviews.csv', index_col=0)
data.head(2)

Unnamed: 0_level_0,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...


In [5]:
#make a copy of columns I need from raw data
df1 = data.iloc[:, [3,4,5,8]]
df1.head()

Unnamed: 0_level_0,HelpfulnessNumerator,HelpfulnessDenominator,Score,Text
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,1,1,5,I have bought several of the Vitality canned d...
2,0,0,1,Product arrived labeled as Jumbo Salted Peanut...
3,1,1,4,This is a confection that has been around a fe...
4,3,3,2,If you are looking for the secret ingredient i...
5,0,0,5,Great taffy at a great price. There was a wid...


In [6]:
#change data type of non-Text features from string to integer
df1.iloc[:, 0:3] = df1.iloc[:, 0:3].apply(pd.to_numeric)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


In [7]:
#create new dataframe from reviews that have helpfulness data
df2 = df1[(df1.HelpfulnessDenominator != 0)]

In [8]:
print df1.shape
print df2.shape

(568454, 4)
(298402, 4)


#### Notes

number of rows now that have helpful data. half the size of dataset.

## Clean the Data

In [9]:
print df2.isnull().sum()

HelpfulnessNumerator      0
HelpfulnessDenominator    0
Score                     0
Text                      0
dtype: int64


In [10]:
# convert text to lowercase
df2.loc[:, 'Text'] = df2['Text'].str.lower()
df2["Text"][11]

"i don't know if it's the cactus or the tequila or just the unique combination of ingredients, but the flavour of this hot sauce makes it one of a kind!  we picked up a bottle once on a trip we were on and brought it back home with us and were totally blown away!  when we realized that we simply couldn't find it anywhere in our city we were bummed.<br /><br />now, because of the magic of the internet, we have a case of the sauce and are ecstatic because of it.<br /><br />if you love hot sauce..i mean really love hot sauce, but don't want a sauce that tastelessly burns your throat, grab a bottle of tequila picante gourmet de inclan.  just realize that once you taste it, you will never want to use any other sauce.<br /><br />thank you for the personal, incredible service!"

In [11]:
#remove text punctuation
import string
from string import maketrans

intab = string.punctuation
outtab = "                                "
trantab = maketrans(intab, outtab)
df2.loc[:, 'Text'] = df2["Text"].str.translate(trantab)
df2["Text"][11]

'i don t know if it s the cactus or the tequila or just the unique combination of ingredients  but the flavour of this hot sauce makes it one of a kind   we picked up a bottle once on a trip we were on and brought it back home with us and were totally blown away   when we realized that we simply couldn t find it anywhere in our city we were bummed  br    br   now  because of the magic of the internet  we have a case of the sauce and are ecstatic because of it  br    br   if you love hot sauce  i mean really love hot sauce  but don t want a sauce that tastelessly burns your throat  grab a bottle of tequila picante gourmet de inclan   just realize that once you taste it  you will never want to use any other sauce  br    br   thank you for the personal  incredible service '

In [12]:
#remove stop words
from nltk.corpus import stopwords
cachedStopWords = stopwords.words("english")
df2["Text"] = ' '.join([word for word in df1["Text"].str.split() if word not in stopwords.words("english")]))

#stopset = set(stopwords.words('english'))rez
#tokens = nltk.word_tokenize(df1["Text"])
#cleanup = [token for token in tokens if token.lower() not in stopset]
# http://stackoverflow.com/questions/5486337/how-to-remove-stop-words-using-nltk-or-python

SyntaxError: invalid syntax (<ipython-input-12-076ab074b966>, line 4)

#### Notes

used stop words, lower case, X. Didn't use porter stem. did stop words now so stop words won't be in n-gram frequency distributions.

## Exploratory Data Analysis

### Create a binary variable "Helpfulness"

In [13]:
#transform Helpfulness into a binary variable with 0.50 ratio
df2.loc[:, 'Helpful'] = np.where(df2.loc[:, 'HelpfulnessNumerator'] / df2.loc[:, 'HelpfulnessDenominator'] >=0.50, 1, 0)
df2.head(3)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)


Unnamed: 0_level_0,HelpfulnessNumerator,HelpfulnessDenominator,Score,Text,Helpful
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,1,1,5,i have bought several of the vitality canned d...,1
3,1,1,4,this is a confection that has been around a fe...,1
4,3,3,2,if you are looking for the secret ingredient i...,1


In [14]:
df2.groupby('Helpful').count()

Unnamed: 0_level_0,HelpfulnessNumerator,HelpfulnessDenominator,Score,Text
Helpful,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,50113,50113,50113,50113
1,248289,248289,248289,248289


In [15]:
#transform Helpfulness into a binary variable with 0.75 ratio
df2.loc[:, 'Helpful2'] = np.where(df2.loc[:, 'HelpfulnessNumerator'] / df2.loc[:, 'HelpfulnessDenominator'] >=0.75, 1, 0)
df2.head(3)

Unnamed: 0_level_0,HelpfulnessNumerator,HelpfulnessDenominator,Score,Text,Helpful,Helpful2
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1,1,1,5,i have bought several of the vitality canned d...,1,1
3,1,1,4,this is a confection that has been around a fe...,1,1
4,3,3,2,if you are looking for the secret ingredient i...,1,1


In [16]:
df2.groupby('Helpful2').count()

Unnamed: 0_level_0,HelpfulnessNumerator,HelpfulnessDenominator,Score,Text,Helpful
Helpful2,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
0,89202,89202,89202,89202,89202
1,209200,209200,209200,209200,209200


#### Notes

### Frequency distributions for review text

compare for helpful ratio 0.5 vs 0.75
1) most frequent words for helpful reviews
2) most frequent for unhelpful reviews
3) most frequent top words overall
(compare)

In [17]:
df3 = df2.iloc[:, 0:6]

In [18]:
print type(df3)
print df3.columns
print df3["Text"].head(2)

<class 'pandas.core.frame.DataFrame'>
Index([u'HelpfulnessNumerator', u'HelpfulnessDenominator', u'Score', u'Text',
       u'Helpful', u'Helpful2'],
      dtype='object')
Id
1    i have bought several of the vitality canned d...
3    this is a confection that has been around a fe...
Name: Text, dtype: object


In [19]:
type(df3)
text = df3["Text"]

In [None]:
type(text)

In [None]:
import nltk
from nltk.collocations import *
#df3["unigrams"] = df3["Text"].apply(nltk.word_tokenize)

In [None]:
#df3["unigrams"].head(3)

In [None]:
#my_bigrams = nltk.bigrams(df3.unigrams)
#my_trigrams = nltk.trigrams(df3.unigrams)

In [None]:
#fdist = nltk.FreqDist(df3["unigrams"])

In [None]:
fdist = nltk.FreqDist(my_bigrams)
#http://www.ling.helsinki.fi/kit/2009s/clt231/NLTK/book/ch01-LanguageProcessingAndPython.html#frequency-distributions

In [None]:
# http://stackoverflow.com/questions/33098040/how-to-use-word-tokenize-in-data-frame
#how to use tokenizer in dataframe
# https://www.strehle.de/tim/weblog/archives/2015/09/03/1569
#from nltk.util import ngrams
#ngrams(, 3))

### 

#### Notes

bi-grams http://rstudio-pubs-static.s3.amazonaws.com/163569_f06e862a8f444e4c9cb8cca323b77f1a.html
https://www.kaggle.com/gpayen/d/snap/amazon-fine-food-reviews/building-a-prediction-model

http://stackoverflow.com/questions/24347029/python-nltk-bigrams-trigrams-fourgrams

http://stackoverflow.com/questions/14364762/counting-n-gram-frequency-in-python-nltk
NLTK Bigram generator

### Clustering of Reviews

In [76]:
df4 = df2["Text"]

In [77]:
#Apply TfidfVectorizer to review text
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

vectorizer = TfidfVectorizer(stop_words='english', min_df=5)
X = vectorizer.fit_transform(df4)

In [80]:
#Fit review text to KMeans
true_k = 3
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)

KMeans(copy_x=True, init='k-means++', max_iter=100, n_clusters=3, n_init=1,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=0)

In [81]:
#Find the top terms per cluster
print("Top terms per cluster:")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(true_k):
    print "Cluster %d:" % i,
    for ind in order_centroids[i, :10]:
        print ' %s' % terms[ind],
    print

Top terms per cluster:
Cluster 0:  like  great  tea  good  product  taste  love  just  food  flavor
Cluster 1:  coffee  cup  br  like  cups  strong  flavor  good  taste  roast
Cluster 2:  br  like  food  tea  good  product  just  taste  flavor  amazon


In [84]:
len(vectorizer.vocabulary_)

33562

In [86]:
#review cluster labels
model.labels_

array([0, 0, 0, ..., 0, 0, 0], dtype=int32)

In [91]:
# save the cluster labels and sort by cluster
df2['cluster'] = model.labels_
df2.groupby('cluster').count()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app


Unnamed: 0_level_0,HelpfulnessNumerator,HelpfulnessDenominator,Score,Text,Helpful,Helpful2
cluster,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,224785,224785,224785,224785,224785,224785
1,23768,23768,23768,23768,23768,23768
2,49849,49849,49849,49849,49849,49849


### Logistic Regression of Helpful/Not Helpful Reviews

In [92]:
# cluster labels + score to predict helpful or not helpful

In [97]:
from sklearn.linear_model import LogisticRegression
from sklearn import grid_search, cross_validation

In [98]:
df2.columns

Index([u'HelpfulnessNumerator', u'HelpfulnessDenominator', u'Score', u'Text',
       u'Helpful', u'Helpful2', u'cluster'],
      dtype='object')

In [94]:
feature_set = df2[['Score', 'cluster']]
gs = grid_search.GridSearchCV(
    estimator=LogisticRegression(),
    param_grid={'C': [10**-i for i in range(-5, 5)], 'class_weight': [None, 'auto']},
    cv=cross_validation.KFold(n=len(df2), n_folds=10),
    scoring='roc_auc'
)


gs.fit(feature_set, df2.Helpful)
gs.grid_scores_
#print gs.best_estimator_