## Exploratory Data Analysis for Amazon Fine Food Reviews

### Purpose

The goal is to build a model that predicts the helpfulness of Amazon Fine Food Reviews. Such a model could improve Amazon's selection of helpful reviews at the top of the review section and, in turn, customers' purchasing decisions. It could also guide other reviewers in writing helpful reviews.

The dataset contains 568,454 Amazon Fine Food Reviews.

In [1]:
#data dictionary

## Load the Data

In [4]:
#load data and score helpfulness

In [6]:
# imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn import metrics

# this allows plots to appear directly in the notebook
%matplotlib inline

In [None]:
# read data into a DataFrame
data = pd.read_csv('Reviews.csv', index_col=0)
data.head(2)

In [None]:
#copy the columns I need from the raw data:
#HelpfulnessNumerator, HelpfulnessDenominator, Score, Text
df1 = data.iloc[:, [3,4,5,8]].copy()
df1.head()

In [None]:
#change data type of non-Text features from string to integer
df1.iloc[:, 0:3] = df1.iloc[:, 0:3].apply(pd.to_numeric)

In [None]:
#keep only reviews with more than 10 helpfulness votes
df1 = df1[df1.HelpfulnessDenominator > 10]
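
As a minimal sketch of the column conversion and vote filter above, the following runs the same two steps on a tiny made-up frame. The rows are invented for illustration; only the column names follow the dataset.

```python
import pandas as pd

# Toy rows invented for illustration; only the column names follow Reviews.csv.
toy = pd.DataFrame({
    'HelpfulnessNumerator': ['1', '12', '3', '20'],
    'HelpfulnessDenominator': ['1', '15', '11', '40'],
    'Score': ['5', '4', '1', '3'],
    'Text': ['good', 'great tea', 'stale', 'okay'],
})

# Convert the three non-text columns from strings to integers.
num_cols = ['HelpfulnessNumerator', 'HelpfulnessDenominator', 'Score']
toy[num_cols] = toy[num_cols].apply(pd.to_numeric)

# Keep only reviews with more than 10 helpfulness votes.
voted = toy[toy['HelpfulnessDenominator'] > 10]
print(voted)
```

The first toy review has a single vote, so it is dropped; a threshold like this trades dataset size for more reliable helpfulness ratios.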

#### Notes

The number of rows that now have helpfulness data is about half the size of the dataset.

## Clean the Data

In [None]:
#count missing values per column
print(df1.isnull().sum())

In [None]:
# convert text to lowercase
df1.loc[:, 'Text'] = df1['Text'].str.lower()
df1["Text"].head(10)

In [None]:
#strip html tags (Python 3)
from html import unescape
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    """Collects the text content of an HTML string and drops the tags."""
    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; inside text
        HTMLParser.__init__(self, convert_charrefs=True)
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    # decode HTML entities first, then keep only the text nodes
    s = MLStripper()
    s.feed(unescape(html))
    s.close()
    return s.get_data()

In [None]:
df1.loc[:, "Text"] = df1["Text"].apply(strip_tags)
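
To illustrate what tag stripping does to a single review, here is a small standalone sketch using Python 3's `html` and `html.parser` modules. The names `TagStripper` and `strip_html` and the sample string are made up for this example.

```python
from html import unescape
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collect only the text nodes of an HTML fragment (hypothetical helper)."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(text):
    stripper = TagStripper()
    stripper.feed(unescape(text))  # decode entities such as &amp; first
    stripper.close()               # flush any buffered text
    return ''.join(stripper.parts)

# Sample string invented for illustration.
sample = 'great chips!<br />my kids &amp; i <b>love</b> them'
print(strip_html(sample))
```

Tags are removed without inserting a space, so words on either side of a `<br />` end up joined. Replacing tags with a space instead is a possible variation if that matters for tokenization.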

In [None]:
#v = TfidfVectorizer(decode_error='replace', encoding='utf-8')
#df1.loc[:, "Text"] = v.fit_transform(df1['Text'].values.astype('U'))

In [None]:
#replace text punctuation with spaces (Python 3)
import string

trantab = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
df1.loc[:, 'Text'] = df1["Text"].str.translate(trantab)
df1["Text"].head(10)
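
The punctuation step can be sketched on a single string as follows, assuming Python 3's `str.maketrans`. The sample sentence is invented.

```python
import string

# Map every ASCII punctuation character to a space.
table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

sample = "don't buy this... it's stale, over-priced!"
cleaned = sample.translate(table)
print(cleaned)

# Collapsing runs of whitespace afterwards is an optional extra step.
print(' '.join(cleaned.split()))
```

Mapping to spaces rather than deleting keeps hyphenated words apart ("over priced"), at the cost of splitting contractions ("don t").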

#### Notes

Used stop words, lowercasing, X. Did not use Porter stemming. Stop words are handled at this stage so that they do not appear in the n-gram frequency distributions.

## Exploratory Data Analysis

### Create a binary variable "Helpful"

In [None]:
#create binary Helpful: 1 if more than 50% of votes found the review helpful, else 0
df1.loc[:, 'Helpful'] = np.where(df1.loc[:, 'HelpfulnessNumerator'] / df1.loc[:, 'HelpfulnessDenominator'] > 0.50, 1, 0)
df1.head(3)

In [None]:
df1.groupby('Helpful').count()

#### Notes

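A review is labeled helpful when its helpfulness ratio (HelpfulnessNumerator / HelpfulnessDenominator) is greater than 0.5. `Helpful` is the outcome variable.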
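
A small worked example of the labeling rule, on made-up vote counts, shows how the strict inequality treats a review with exactly half of its votes helpful:

```python
import pandas as pd

# Vote counts invented for illustration.
votes = pd.DataFrame({
    'HelpfulnessNumerator': [12, 5, 20, 6],
    'HelpfulnessDenominator': [15, 11, 40, 12],
})

ratio = votes['HelpfulnessNumerator'] / votes['HelpfulnessDenominator']
# Strictly greater than 0.50, so a ratio of exactly 0.50 is labeled 0.
votes['Helpful'] = (ratio > 0.50).astype(int)
print(votes)
```

Only the first row (12 of 15) is labeled helpful here; the two rows at exactly 50% fall on the unhelpful side.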

In [None]:
df1['Text'].shape

### Frequency distributions for review text

In [None]:
#df3 = df2.iloc[:, 0:6]

In [None]:
#print type(df3)
#print df3.columns
#print df3["Text"].head(2)

In [None]:
#type(df3)
#text = df3["Text"]

In [None]:
#type(text)

In [None]:
#import nltk
#from nltk.collocations import *
#df3["unigrams"] = df3["Text"].apply(nltk.word_tokenize)

In [None]:
#df3["unigrams"].head(3)

In [None]:
#my_bigrams = nltk.bigrams(df3.unigrams)
#my_trigrams = nltk.trigrams(df3.unigrams)

In [None]:
#fdist = nltk.FreqDist(df3["unigrams"])

In [None]:
#fdist = nltk.FreqDist(my_bigrams)
#http://www.ling.helsinki.fi/kit/2009s/clt231/NLTK/book/ch01-LanguageProcessingAndPython.html#frequency-distributions

In [None]:
# http://stackoverflow.com/questions/33098040/how-to-use-word-tokenize-in-data-frame
#how to use tokenizer in dataframe
# https://www.strehle.de/tim/weblog/archives/2015/09/03/1569
#from nltk.util import ngrams
#ngrams(, 3))

#### Notes

bi-grams http://rstudio-pubs-static.s3.amazonaws.com/163569_f06e862a8f444e4c9cb8cca323b77f1a.html
https://www.kaggle.com/gpayen/d/snap/amazon-fine-food-reviews/building-a-prediction-model

http://stackoverflow.com/questions/24347029/python-nltk-bigrams-trigrams-fourgrams

http://stackoverflow.com/questions/14364762/counting-n-gram-frequency-in-python-nltk
NLTK Bigram generator
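
As a rough sketch of the n-gram frequency idea in this section, the following counts bigrams with only the standard library instead of NLTK. The `ngrams` helper and the sample reviews are made up for illustration, and plain whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Sample reviews invented for illustration.
reviews = [
    'this tea is great',
    'this tea is too sweet',
    'great tea',
]

bigram_counts = Counter()
for review in reviews:
    tokens = review.split()  # whitespace tokenization, a simplification
    bigram_counts.update(ngrams(tokens, 2))

print(bigram_counts.most_common(3))
```

Counting per review, rather than over one concatenated string, avoids creating bigrams that span two different reviews.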

## TF-IDF Vectorizer

In [None]:
#Apply TfidfVectorizer to review text

In [None]:
vectorizer = TfidfVectorizer(min_df = 0.1, max_df=0.9,
                             ngram_range=(1, 4), 
                             stop_words='english')
vectorizer.fit(df1['Text'])

In [None]:
X_train = vectorizer.transform(df1['Text'])
# feature names in column order (get_feature_names_out requires scikit-learn 1.0+)
vocab = vectorizer.get_feature_names_out()

In [None]:
vocab

In [None]:
X_train
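
To make the effect of `min_df`, `max_df`, and `ngram_range` concrete, here is a minimal sketch on four invented documents. The thresholds differ from the ones used above so that the filtering is visible on such a small corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Documents invented for illustration.
docs = [
    'great tea great price',
    'great coffee',
    'stale coffee bad price',
    'great tea',
]

# A float min_df is a proportion: 0.5 keeps terms found in at least half
# of the documents. max_df=0.9 drops terms found in more than 90% of them.
vec = TfidfVectorizer(min_df=0.5, max_df=0.9, ngram_range=(1, 2))
X = vec.fit_transform(docs)

print(sorted(vec.vocabulary_))
print(X.shape)
```

Terms that occur in only one document, such as "stale" or the bigram "great coffee", are dropped. With `min_df=0.1` and n-grams up to length 4 on the full review set, the same mechanism is what keeps the vocabulary small.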

In [None]:
df1.Helpful.value_counts()

## Logistic Regression to Predict Review Helpfulness when Helpful threshold is > 50%

In [None]:
#check the shape of the TF-IDF matrix (rows = reviews, columns = n-gram features)
print(X_train.shape)

In [None]:
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.linear_model import LogisticRegression
feature_set = X_train
#search over regularization strength C (1e5 down to 1e-4) and class weighting
gs = GridSearchCV(
    estimator=LogisticRegression(),
    param_grid={'C': [10**-i for i in range(-5, 5)], 'class_weight': [None, 'balanced']},
    cv=StratifiedKFold(n_splits=10),
    scoring='roc_auc'
)


gs.fit(X_train, df1.Helpful)
gs.cv_results_['mean_test_score']
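
For readers who want to try the grid search in isolation, the following sketch runs the same kind of search on synthetic data using `sklearn.model_selection` (available in scikit-learn 0.18 and later). The data, the smaller parameter grid, and the 5-fold split are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic data invented for illustration: 200 rows, 5 features,
# with the label driven mostly by the first feature.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

search = GridSearchCV(
    estimator=LogisticRegression(),
    param_grid={'C': [0.01, 1, 100], 'class_weight': [None, 'balanced']},
    cv=StratifiedKFold(n_splits=5),  # folds preserve the class balance
    scoring='roc_auc',
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

`best_score_` is the mean cross-validated ROC AUC of the best parameter combination, which is a fairer estimate than any score computed on the data the model was fit on.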

In [None]:
print(gs.best_estimator_)

In [None]:
y_pred = gs.predict(feature_set)

In [None]:
# Each coefficient is the change in the log-odds of Helpful = 1 per unit increase in that feature
print(gs.best_estimator_.coef_)
print(gs.best_estimator_.intercept_)
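
A short worked example may help with reading log-odds coefficients. The intercept, coefficient, and feature value below are hypothetical, not taken from the fitted model.

```python
import math

def sigmoid(z):
    """Map log-odds z to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical values, not taken from the fitted model.
intercept = -0.4
coef = 1.2    # coefficient for one TF-IDF feature
tfidf = 0.5   # that feature's value in one review

log_odds = intercept + coef * tfidf
odds_ratio = math.exp(coef)  # multiplicative change in odds per unit of the feature

print(round(odds_ratio, 3))
print(round(sigmoid(log_odds), 3))  # predicted probability of Helpful = 1
```

Because TF-IDF values lie between 0 and 1, a full unit increase is the largest change a single feature can make, so the odds ratio is an upper bound on that feature's effect for one review.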

In [None]:
len(gs.best_estimator_.coef_[0])

In [None]:
# mean accuracy on the training data (not a held-out estimate)
print(gs.best_estimator_.score(feature_set, df1.Helpful))

In [None]:
from sklearn.metrics import roc_auc_score, roc_curve
# predicted class probabilities; column 1 is P(Helpful = 1)
probas = gs.predict_proba(feature_set)
fpr, tpr, thresholds = roc_curve(df1['Helpful'], probas[:, 1])
plt.plot(fpr, tpr)

In [None]:
y_score = probas

In [None]:
# true labels as a 1-D array (no hard-coded row count)
y_true = df1['Helpful'].values

In [None]:
roc_auc_score(y_true, y_score[:, 1])
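
One way to build intuition for the ROC AUC value is to compute it directly as the fraction of (positive, negative) pairs in which the positive example receives the higher score. The helper `auc_by_pairs` and the toy labels and scores are made up for this sketch; it is quadratic in the number of rows, so it is only meant for small inputs.

```python
def auc_by_pairs(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties in score count as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy labels and scores invented for illustration.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc_by_pairs(labels, scores))  # 3 of 4 pairs are ordered correctly: 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to every helpful review being scored above every unhelpful one.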

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
def plot_confusion_matrix(cm, title='Confusion matrix', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    plt.tight_layout()
    plt.ylabel('True Helpfulness')
    plt.xlabel('Predicted Helpfulness')


# Compute confusion matrix
cm = confusion_matrix(y_true, y_pred)
np.set_printoptions(precision=2)
print('Confusion matrix, without normalization')
print(cm)
plt.figure()
plot_confusion_matrix(cm)

# Normalize the confusion matrix by row (i.e. by the number of samples
# in each class)
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print('Normalized confusion matrix')
print(cm_normalized)
plt.figure()
plot_confusion_matrix(cm_normalized, title='Normalized confusion matrix')

plt.show()
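
The confusion matrix can also be summarized numerically. The sketch below uses hypothetical counts laid out the way `confusion_matrix` returns them (rows are true classes, columns are predicted classes) and derives accuracy, precision, and recall.

```python
import numpy as np

# Hypothetical counts: rows are true classes, columns are predicted classes.
cm = np.array([[50, 10],
               [20, 120]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)  # of reviews predicted helpful, share truly helpful
recall = tp / (tp + fn)     # of truly helpful reviews, share found

# Row normalization: each row sums to 1, and the diagonal holds per-class recall.
row_normalized = cm / cm.sum(axis=1, keepdims=True)

print(accuracy, precision, recall)
print(row_normalized)
```

With imbalanced classes, the row-normalized view is usually the more informative of the two plots, since raw counts are dominated by the majority class.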

In [None]:
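# pair each feature name with its coefficient (vocab is in column order), sorted from most negative to most positive
sorted(zip(vocab, gs.best_estimator_.coef_[0]), key=lambda x: x[1])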
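
When pairing names with coefficients, the names must be in the same order as the columns of the feature matrix. `vectorizer.vocabulary_` is a dict mapping each term to its column index, so iterating over its keys does not follow column order. A minimal sketch with hypothetical names and values:

```python
# Hypothetical feature names and coefficients, in matching column order.
feature_names = ['bad', 'great', 'price', 'stale', 'tea']
coefs = [-0.9, 1.4, 0.1, -1.7, 0.3]

ranked = sorted(zip(feature_names, coefs), key=lambda pair: pair[1])

print(ranked[:2])   # features most associated with unhelpful reviews
print(ranked[-2:])  # features most associated with helpful reviews
```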