## Machine Learning with Keras on Amazon's fine food review Part 1
## Creating baseline models

To analyze how effective neural network models are for this dataset, we will be creating a logistic regression model and a SVM model as baselines to compare accuracies.  For these, the dataset will be encoded using the popular and straightforward bag of words method after removing stopwords using the list provided by nltk library.  There will also be some hyperparameter tuning to attempt at a best fit baseline models.


   The notebook will be organized as below:
1. Create a bag of words representation to encode the text into numerical vectors.
2. Use the numerical vectors and fit them onto a logistic regressor and SVM.
3. Hyperparameter tuning on the baseline models will be done using sklearn's gridsearch method.

In [2]:
#data importing/wrangling
import pandas as pd

#text processing 
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

#CNN modeling
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers.embeddings import Embedding
from keras.models import load_model
from keras.layers import Conv2D, Conv1D, MaxPooling1D, MaxPooling2D



#data split
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

#vizualization
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
plt.style.use('ggplot')

from collections import defaultdict
import pickle

import nltk
#only required first time
#nltk.download()
from nltk.corpus import stopwords
import re
from sklearn.feature_extraction.text import CountVectorizer

from sklearn import linear_model
from sklearn import svm
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import BaggingClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import SGDClassifier

import scipy.sparse

input_location = '/Users/momori/data/reviews_processed.csv'

Using TensorFlow backend.


In [3]:
data = pd.read_csv(input_location)

In [4]:
data.columns

Index([u'Unnamed: 0', u'Id', u'ProductId', u'UserId', u'ProfileName',
       u'HelpfulnessNumerator', u'HelpfulnessDenominator', u'Score', u'Time',
       u'Summary', u'Text', u'HelpfulnessRatio', u'avg_score',
       u'normalized_score', u'positive_review'],
      dtype='object')

In [6]:
#create documents and labels to train the model later
docs = data['Text']
labels = data['positive_review']

## Create bag of words

A bag of words representation if one way to encode language into numerical vectors.  This happens by first creating a list of all vocabularies used by the entire dataset.  Then for each data point, the above list is initialized with all zeros, then the corresponding index's entry is incremented for each word in the data point.  <br>

For example, consider the sentence 'I eat an apple' with a vocabulary list of 'I', 'eat', 'an', 'apple', 'orange'.  The cardinality of the vocabulary list is five, so there will be a 1x5 vector representation of each datapoint.  In the above example's case, the representation will be [1,1,1,1,0].  Similarly, for the sentence 'I eat an orange,' the representaiton would be [1,1,1,0,1].<br>

There are a few issues with this representation which requires some preprocessing of the data.  First off, the existence of stopwords heavily bias the resulting vectors.  For example, most sentences will have very common words such as 'the', 'a', 'an', punctuations and the likes.  These words are removed from the original data source so the models can only look at significant terms, and the stopwords are provided by the NLTK library.  Another point to note is that the vectorizer class used to create the bag of words representation will differentiate between terms of different cases, such as 'apple' and 'Apple.' Hence, before we start the vectorization process, the datasource will be turned into all lowercase.


In [7]:
#functions

def remove_stopwords(text):
    text = keep_letters(text)
    text = to_lower(text)
    words = [w for w in text if not w in cached_stop_words]
    return(" ".join(words))

def to_lower(text):
    return [w.lower() for w in text.split()]

def keep_letters(text):
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    return letters_only

def calculate_accuracy(prediction, actual):
    zipped = zip(prediction, actual)
    acc = [i for i in zipped if i[0]==i[1]]
    return len(acc)/float(len(zipped))

In [8]:
#change the text to lower case and remove stopwords
cached_stop_words = set(stopwords.words("english"))

reduced_text = docs.apply(remove_stopwords)

reduced_text.head()

0    bought several vitality canned dog food produc...
1    product arrived labeled jumbo salted peanuts p...
2    confection around centuries light pillowy citr...
3    looking secret ingredient robitussin believe f...
4    great taffy great price wide assortment yummy ...
Name: Text, dtype: object

Now we have a processed dataset without punctuations, all lowerccase for easy comparison and without stopwords.  Now we will create a bag of words representation with this text using the CountVectorizer.  <br>

The below CountVectorizer is part of sklearn.feature_extraction library and provides a simple way to create a sparse matrix of token counts from the original text. As seen below, the vectorizer created a matrix with the dimension (568454, 110979), where there are equivalent number of rows as the data points in the source, and recognized 110979 different vocabularies in the dataset

In [9]:
vectorizer = CountVectorizer(analyzer = "word",   \
                             tokenizer = None,    \
                             preprocessor = None, \
                             stop_words = None) 

train_data_features = vectorizer.fit_transform(reduced_text)

In [10]:
train_data_features.shape

(568454, 110979)

## Creating the baseline models


Now that we have the bag of words representation for the text summaries, the goal of the below cells will be to fit a logistic regression and SVM model to measure the accuracy, to set the baseline.  The processed data will be split into a train/test split, with 70% of the data as the training set.  Then the models will be fit with the training data, and accuracy will be measured on the testing set.  Hyperparameter tuning will be done using gridsearch also.


## Training a Logistic Regression Model

In [12]:
#split the data
train_size = 0.7

x_train_base, x_test_base, y_train_base, y_test_base = train_test_split(
    train_data_features, labels, test_size=train_size, random_state=42)

In [29]:
lr_parameters = [{'penalty':['l1','l2'], 'max_iter':[10,100, 200]}]

grid = GridSearchCV(linear_model.LogisticRegression(), lr_parameters)
grid.fit(x_train_base, y_train_base)


GridSearchCV(cv=None, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid=[{'penalty': ['l1', 'l2'], 'max_iter': [10, 100, 200]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [31]:
means = grid.cv_results_['mean_test_score']
stds = grid.cv_results_['std_test_score']

for mean, params in zip(means,grid.cv_results_['params']):
    print mean, params

0.889724163813 {'penalty': 'l1', 'max_iter': 10}
0.887859454895 {'penalty': 'l2', 'max_iter': 10}
0.8895775672 {'penalty': 'l1', 'max_iter': 100}
0.89036918891 {'penalty': 'l2', 'max_iter': 100}
0.889595158793 {'penalty': 'l1', 'max_iter': 200}
0.89036918891 {'penalty': 'l2', 'max_iter': 200}


In [58]:
lr_df = pd.DataFrame(zip(grid.cv_results_['params'], means), columns=['params','accuracy'])
lr_df['params']=lr_df['params'].astype(str)
lr_df

Unnamed: 0,params,accuracy
0,"{'penalty': 'l1', 'max_iter': 10}",0.889724
1,"{'penalty': 'l2', 'max_iter': 10}",0.887859
2,"{'penalty': 'l1', 'max_iter': 100}",0.889578
3,"{'penalty': 'l2', 'max_iter': 100}",0.890369
4,"{'penalty': 'l1', 'max_iter': 200}",0.889595
5,"{'penalty': 'l2', 'max_iter': 200}",0.890369


## Training a SVM Model


In [63]:
svm_parameters = [{'penalty':['l1','l2'], 'loss':['hinge', 'log', 'squared_hinge']}]

svm_grid = GridSearchCV(SGDClassifier(), svm_parameters)
svm_grid.fit(x_train_base, y_train_base)

GridSearchCV(cv=None, error_score='raise',
       estimator=SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=5, n_iter=None,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=None,
       shuffle=True, tol=None, verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid=[{'penalty': ['l1', 'l2'], 'loss': ['hinge', 'log', 'squared_hinge']}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [65]:
pd.set_option('display.max_colwidth', -1)

svm_df = pd.DataFrame(zip(svm_grid.cv_results_['params'],means),columns=['params','accuracy'])
svm_df.params = svm_df.params.astype(str)
svm_df

Unnamed: 0,params,accuracy
0,"{'penalty': 'l1', 'loss': 'hinge'}",0.889724
1,"{'penalty': 'l2', 'loss': 'hinge'}",0.887859
2,"{'penalty': 'l1', 'loss': 'log'}",0.889578
3,"{'penalty': 'l2', 'loss': 'log'}",0.890369
4,"{'penalty': 'l1', 'loss': 'squared_hinge'}",0.889595
5,"{'penalty': 'l2', 'loss': 'squared_hinge'}",0.890369


In the above section, we created a baseline model using the following steps <br>
1) Use the bag of words algorithm to encode the texts into numerical vectors <br>
2) Use logistic regression and support vector machines to create a baseline model.<br>
<br>

Interestingly enough, logistic regression and SVM both achieve a very respectable accuracy of ~89%. With further parameter tuning, the accuracy may improve slightly.  The interesting analysis to take note here is that logistic regressions fit the dataset relatively well, which can denote that the documents and the labels fit a pretty linear relationship.  This may mean that majority of the text can be taken for what it is without sarcasm.  (If there were a lot of sarcasm and dependency on the tone of the review, then the relationship may not have been linear).  <br>

We will now approach the same dataset using a different encoding algorithm and combining that with neural networks in the next notebook to measure the increase in accuracy in NeuralNets.ipynb

In [None]:
#buffer