## Machine Learning with Keras on Amazon's fine food review Part 1
## Creating baseline models

To analyze how effective neural network models are for this dataset, we will be creating a logistic regression model and a SVM model as baselines to compare accuracies.  For these, the dataset will be encoded using the straightforward bag of words method after removing stopwords using the list provided by nltk library.  There will also be some hyperparameter tuning to try to find the best possible fit of these models.  Afterwards, a bootstrap test will be done to assess a 99% confidence interval for the accuracy for these models.  By creating a confidence interval, we will be able to statistically prove whether the accuracy achieved by the neural network is significantly better than the baseline models or not.


   The notebook will be organized as below:
1. Create a bag of words representation to encode the text into numerical vectors.
2. Use the numerical vectors and fit them onto a logistic regressor and SVM.
3. Hyperparameter tuning on the baseline models will be done using sklearn's gridsearch method.
4. Once the hyperparameters are tuned, the models will be trained and tested for accuracy multiple times. By doing this, a confidence interval of accuracies will be developed which will be used to test if the accuracy improvement seen in neural networks are significant

In [1]:
#data importing/wrangling
import pandas as pd

#data split
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.model_selection import KFold

#vizualization
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
plt.style.use('ggplot')

from collections import defaultdict
import pickle

import nltk
#only required first time
#nltk.download()
from nltk.corpus import stopwords
import re
from sklearn.feature_extraction.text import CountVectorizer

from sklearn import linear_model
from sklearn import svm
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import BaggingClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import SGDClassifier
from scipy import stats
from sklearn.feature_selection import chi2

import scipy.sparse

input_location = '/Users/momori/data/reviews_processed.csv'

In [2]:
data = pd.read_csv(input_location)

In [3]:
data.columns

Index([u'Unnamed: 0', u'Id', u'ProductId', u'UserId', u'ProfileName',
       u'HelpfulnessNumerator', u'HelpfulnessDenominator', u'Score', u'Time',
       u'Summary', u'Text', u'HelpfulnessRatio', u'avg_score',
       u'normalized_score', u'positive_review'],
      dtype='object')

In [4]:
#create documents and labels to train the model later
docs = data['Text']
labels = data['positive_review']

## Create bag of words

A bag of words representation if one way to encode language into numerical vectors.  This happens by first creating a list of all vocabularies used by the entire dataset.  Then for each data point, the above list is initialized with all zeros, then the corresponding index's entry is incremented for each word in the data point.  <br>

For example, consider the sentence 'I eat an apple' with a vocabulary list of 'I', 'eat', 'an', 'apple', 'orange'.  The cardinality of the vocabulary list is five, so there will be a 1x5 vector representation of each datapoint.  In the above example's case, the representation will be [1,1,1,1,0].  Similarly, for the sentence 'I eat eat an orange,' the representaiton would be [1,2,1,0,1].<br> 

The benefit of this representation is that document similarities can be calculated via cosine similarity defined as below:

$$similarity = \frac{dot\_product(d_1, d_2)}{||d_1||*||d_2||}$$

where d_1, d_2 are the encoded vectors. Intuitively, this measures the closeness of the two vectors in n-dimensional space, where n is the number of vocabularies used to encode the text.  However, the main issue with this encoding is that the order of words are not kept, so by this standard, the sentences 'cat eat rat' and 'rat eat cat' are identical.  Regardless, it is a common encoding used in NLP and will be used to create the initial baseline models.

Another issue with this representation which requires some preprocessing of the data.  First off, the existence of stopwords heavily bias the resulting vectors.  For example, most sentences will have very common words such as 'the', 'a', 'an', punctuations and the likes.  These words are removed from the original data source so the models can only look at significant terms, and the stopwords are provided by the NLTK library.  Lastly, the vectorizer class used to create the bag of words representation will differentiate between terms of different cases, such as 'apple' and 'Apple.' Hence, before we start the vectorization process, the datasource will be turned into all lowercase.


In [5]:
#functions

def remove_stopwords(text):
    text = keep_letters(text)
    text = to_lower(text)
    words = [w for w in text if not w in cached_stop_words]
    return(" ".join(words))

def to_lower(text):
    return [w.lower() for w in text.split()]

def keep_letters(text):
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    return letters_only

def calculate_accuracy(prediction, actual):
    zipped = zip(prediction, actual)
    acc = [i for i in zipped if i[0]==i[1]]
    return len(acc)/float(len(zipped))

In [6]:
#change the text to lower case and remove stopwords
cached_stop_words = set(stopwords.words("english"))

reduced_text = docs.apply(remove_stopwords)

reduced_text.head()

0    bought several vitality canned dog food produc...
1    product arrived labeled jumbo salted peanuts p...
2    confection around centuries light pillowy citr...
3    looking secret ingredient robitussin believe f...
4    great taffy great price wide assortment yummy ...
Name: Text, dtype: object

Now we have a processed dataset without punctuations, all lowerccase for easy comparison and without stopwords.  Here we will create a bag of words representation with this text using the CountVectorizer.  <br>

The below CountVectorizer is part of sklearn.feature_extraction library and provides a simple way to create a sparse matrix of token counts from the original text. As seen below, the vectorizer created a matrix with the dimension (568454, 110979), where there are equivalent number of rows as the data points in the source, and recognized 110979 different vocabularies in the dataset

In [7]:
vectorizer = CountVectorizer(analyzer = "word",   
                             tokenizer = None,    
                             preprocessor = None, 
                             stop_words = None) 

train_data_features = vectorizer.fit_transform(reduced_text)

In [8]:
train_data_features.shape

(568454, 110979)

## Creating the baseline models


Now that we have the bag of words representation for the text summaries, the goal of the below cells will be to fit a logistic regression and SVM model to measure the accuracy, to set the baseline.  The processed data will be split into a train/test split, with 70% of the data as the training set.  Then the models will be fit with the training data, and accuracy will be measured on the testing set.  Hyperparameter tuning will be done using gridsearch also. Once the optimal hyperparamers are found, a confidence interval of accuracy will be measured to statistically prove the accuracies measured in the neural network in the next notebook is significantly better or not.  To speed up the process of training multiple logistic regressions, multiprocessing module was used.


## Training a Logistic Regression Model

In [42]:
%%time
#split the data
test_size = 0.3

x_train_base, x_test_base, y_train_base, y_test_base = train_test_split(
    train_data_features, labels, test_size=test_size)

CPU times: user 1.22 s, sys: 1.23 s, total: 2.45 s
Wall time: 19.8 s


In [26]:
lr_parameters = [{'penalty':['l1','l2'], 'max_iter':[10,100]}]

grid = GridSearchCV(linear_model.LogisticRegression(), lr_parameters)
grid.fit(x_train_base, y_train_base)

lr_df = pd.DataFrame(zip(grid.cv_results_['params'], grid.cv_results_['mean_test_score']), columns=['params','accuracy'])
lr_df['params']=lr_df['params'].astype(str)

print 'Accuracies for chosen parameters'
lr_df

Accuracies for chosen parameters


Unnamed: 0,params,accuracy
0,"{'penalty': 'l1', 'max_iter': 10}",0.81381
1,"{'penalty': 'l2', 'max_iter': 10}",0.825714
2,"{'penalty': 'l1', 'max_iter': 100}",0.814286
3,"{'penalty': 'l2', 'max_iter': 100}",0.825714


In [73]:
#split the data. We will use a smaller set for the repeated tests here as the 
#training time will increase too drastically. We notice that the accuracies achieved
#are still very similar to the model when trained with the whole dataset. 
test_size = 0.3

x_train_sample, x_test_sample, y_train_sample, y_test_sample = train_test_split(
    train_data_features[:5000], labels[:5000], test_size=test_size)

num_trials = 1000
inputs = range(num_trials)
acc = np.empty(num_trials)
lr = linear_model.LogisticRegression(penalty='l2', max_iter=10,solver='sag',n_jobs=-1)
results=[]
def do_trial(i):
    print i,
    lr.fit(x_train_sample, y_train_sample)
    #acc[i] = lr.score(x_test_base, y_test_base)
    x = lr.score(x_test_sample, y_test_sample)
    return x
    

In [74]:
%%time
import warnings
warnings.filterwarnings("ignore")
import multiprocessing as mp
import tqdm
results = []
pool = mp.Pool(processes=mp.cpu_count())
results = pool.map(do_trial, inputs)




CPU times: user 2.59 s, sys: 2.44 s, total: 5.02 s
Wall time: 2min 8s


In [75]:
np.percentile(results, [0.5, 99.5])

array([ 0.83733333,  0.84466667])

## Training a SVM Model


In [76]:
svm_parameters = [{'penalty':['l1','l2'], 'loss':['hinge', 'log', 'squared_hinge']}]

svm_grid = GridSearchCV(SGDClassifier(), svm_parameters)
svm_grid.fit(x_train_base, y_train_base)
pd.set_option('display.max_colwidth', -1)

svm_df = pd.DataFrame(zip(svm_grid.cv_results_['params'],svm_grid.cv_results_['mean_test_score']),columns=['params','accuracy'])
svm_df.params = svm_df.params.astype(str)
svm_df

Unnamed: 0,params,accuracy
0,"{'penalty': 'l1', 'loss': 'hinge'}",0.790714
1,"{'penalty': 'l2', 'loss': 'hinge'}",0.805
2,"{'penalty': 'l1', 'loss': 'log'}",0.799286
3,"{'penalty': 'l2', 'loss': 'log'}",0.794286
4,"{'penalty': 'l1', 'loss': 'squared_hinge'}",0.783571
5,"{'penalty': 'l2', 'loss': 'squared_hinge'}",0.802857


In [77]:
num_trials = 1000
acc = np.empty(num_trials)
clf = SGDClassifier(penalty='l2', loss='squared_hinge',n_jobs=-1)
def do_trial_sv(i):
    print i,
    clf.fit(x_train_sample, y_train_sample)
    #acc[i] = lr.score(x_test_base, y_test_base)
    x = clf.score(x_test_sample, y_test_sample)
    return x
    

In [78]:
%%time
results_sv = []
pool = mp.Pool(processes=mp.cpu_count())
results_sv = pool.map(do_trial, inputs)




CPU times: user 2.51 s, sys: 2.52 s, total: 5.03 s
Wall time: 2min


In [79]:
np.percentile(results_sv, [0.5, 99.5])

array([ 0.83666667,  0.84466667])

In the above section, we created a baseline model using the following steps <br>
1) Use the bag of words algorithm to encode the texts into numerical vectors <br>
2) Use logistic regression and support vector machines to tune the hyperparameters of the models.<br>
3) fit the models on the dataset numerous times to obtain a distribution of accuracies achieved.   
<br>

Logistic regression and SVM both achieve a decent result of around 83%. 
The interesting fact to note here is that both models seems to converge rather quickly and to a similar accuracy for this dataset.  This is good bases to believe that further tuning of these models may not improve the accuracy much further.  

The interesting analysis to take note here is that logistic regressions fit the dataset relatively well, which can denote that the documents and the labels fit a pretty linear relationship.  This may mean that majority of the text can be taken for what it is without sarcasm.  (If there were a lot of sarcasm and dependency on the tone of the review, then the relationship may not have been linear).  <br>

We will now approach the same dataset using a different encoding algorithm and combining that with neural networks in the next notebook to measure the increase in accuracy in NeuralNets.ipynb

In [None]:
#buffer