### Machine Learning: Summer 2020
### Project 2

# Sentiment Analysis of Amazon Product Reviews

### Anshul Dabas and Anndi Russell

In [25]:
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
import pickle
import os
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDClassifier
import re

## Experiment Objective

Our original dataset consisted of approximately 34,000 product reviews (text) and ratings (out of 5 stars) for a variety of products on Amazon. It was downloaded from Kaggle and is a subset of a larger dataset available through Datafiniti, a web scraping service that offers data for download. Ethically, we are able to use this data since Amazon reviews are publicly available without a login.

We will predict, based on the text of a review, whether the sentiment of the review was positive or negative. We determined the sentiment from the rating the reviewer gave: ratings of 1 or 2 were considered negative, and ratings of 4 or 5 were considered positive. We dropped reviews with a rating of 3, as they can be considered neutral and thus are not meaningful for our binary sentiment analysis.

We chose this data because the star ratings can be converted into a positive or negative sentiment, which gives us labeled data for training a sentiment analysis model.

With this research, we could predict whether a customer felt positively or negatively about a product based on the text of their review.

## Data Collection

Import data from CSV (downloaded from the archive here: https://www.kaggle.com/datafiniti/consumer-reviews-of-amazon-products/data#Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products.csv):

In [26]:
reviews_df = pd.read_csv('amazon_reviews.csv')



Select only relevant columns:

In [27]:
reviews_df = reviews_df[['reviews.rating','reviews.text']]
reviews_df.columns=['Rating','Review']
reviews_df.head()

Unnamed: 0,Rating,Review
0,5.0,This product so far has not disappointed. My c...
1,5.0,great for beginner or experienced person. Boug...
2,5.0,Inexpensive tablet for him to use and learn on...
3,4.0,I've had my Fire HD 8 two weeks now and I love...
4,5.0,I bought this for my grand daughter when she c...


We now have a complete dataframe ready for processing.

## Data Preprocessing

In [28]:
reviews_df.isnull().sum()

Rating    33
Review     1
dtype: int64

Drop null values:

In [29]:
reviews_df.dropna(inplace=True)
reviews_df.isnull().sum()

Rating    0
Review    0
dtype: int64

A rating of 4 or 5 is considered positive and will be a '1'. A rating of 1 or 2 is considered negative and will be a '0'. Ratings of 3 will be dropped since they are neither positive nor negative.
Mapping:

In [30]:
ratingmap = {4:1,5:1,1:0,2:0}
reviews_df['Sentiment'] = reviews_df['Rating'].map(ratingmap)
reviews_df.drop(['Rating'], axis=1, inplace=True) #drop original rating column
reviews_df.dropna(subset=['Sentiment'], inplace=True) #drop nulls-- these were 3s in the original column
reviews_df['Sentiment']=reviews_df['Sentiment'].astype(int) #cast to int
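
To make the mapping step concrete, here is a minimal sketch on a hypothetical five-row frame (the `toy` data is invented for illustration and is not taken from the dataset). It shows that a rating of 3, which has no entry in the dictionary, becomes NaN under `map` and is then dropped.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'Rating': [5.0, 4.0, 3.0, 2.0, 1.0],
    'Review': ['great', 'good', 'okay', 'poor', 'awful'],
})

ratingmap = {4: 1, 5: 1, 1: 0, 2: 0}

# Ratings missing from the dictionary (here, 3) map to NaN
toy['Sentiment'] = toy['Rating'].map(ratingmap)
n_neutral = int(toy['Sentiment'].isnull().sum())

# Drop the neutral rows, then cast the labels to int
toy = toy.dropna(subset=['Sentiment']).astype({'Sentiment': int})

sentiments = toy['Sentiment'].tolist()
print(n_neutral, sentiments)
```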

In [31]:
reviews_df.head()

Unnamed: 0,Review,Sentiment
0,This product so far has not disappointed. My c...,1
1,great for beginner or experienced person. Boug...,1
2,Inexpensive tablet for him to use and learn on...,1
3,I've had my Fire HD 8 two weeks now and I love...,1
4,I bought this for my grand daughter when she c...,1


In [32]:
reviews_df['Sentiment'].value_counts()

1    32315
0      812
Name: Sentiment, dtype: int64

In [33]:
reviews_df.shape

(33127, 2)

We have 812 negative reviews and 32,315 positive reviews. This is very imbalanced, so to correct this we will downsample the majority class.

Downsample majority:
(Citation: https://elitedatascience.com/imbalanced-classes)

In [34]:
from sklearn.utils import resample

# Separate majority and minority classes
df_majority = reviews_df[reviews_df.Sentiment==1]
df_minority = reviews_df[reviews_df.Sentiment==0]
 
# Downsample majority class
df_majority_downsampled = resample(df_majority, 
                                 replace=False,
                                 n_samples=1500,
                                 random_state=123)
 
# Combine minority class with downsampled majority class
df_downsampled = pd.concat([df_majority_downsampled, df_minority])
 
# Display new class counts
print(df_downsampled.Sentiment.value_counts())

reviews_df=df_downsampled
reviews_df.reset_index(inplace=True)
reviews_df.drop(['index'], axis=1, inplace=True)
reviews_df.shape

1    1500
0     812
Name: Sentiment, dtype: int64


(2312, 2)

Now we have 1500 positive reviews and 812 negative reviews, which is much more balanced than we had initially. Another option would have been to upsample the minority class, but because our minority class was so much smaller it would have resulted in many duplicate rows of our minority class and likely led to overfitting and lack of generalizability. We decided downsampling was a better approach, even though it resulted in quite a lot of lost data.
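
The trade-off described above can be put in numbers. The short calculation below uses only the class counts reported in this notebook; the variable names are ours. It suggests the negative share rises from roughly 2.5% to roughly 35%, while only about 7% of the labeled rows are retained.

```python
# Class counts reported in the notebook
neg = 812
pos_before = 32315
pos_after = 1500

# Share of negative reviews before and after downsampling
share_before = neg / (neg + pos_before)
share_after = neg / (neg + pos_after)

# Fraction of the labeled rows that survive downsampling
kept = (neg + pos_after) / (neg + pos_before)

print(f'Negative share before: {share_before:.1%}')
print(f'Negative share after:  {share_after:.1%}')
print(f'Rows retained:         {kept:.1%}')
```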

In [35]:
reviews_df.head()

Unnamed: 0,Review,Sentiment
0,seems to just be a novelty. which has worn off...,1
1,Christmas gift.,1
2,This Amazon Fire is perfect for what I need - ...,1
3,This was perfect for entertaining my 20-month ...,1
4,It is very easy to set up and use. It has grea...,1


In [36]:
reviews_df.isnull().sum()

Review       0
Sentiment    0
dtype: int64

Shuffle dataset so sentiments are not grouped:

In [37]:
reviews_df = reviews_df.sample(frac=1, random_state=0).reset_index(drop=True)

In [38]:
reviews_df.head()

Unnamed: 0,Review,Sentiment
0,Bestbuy came through before the holiday better...,0
1,Alexa is a key component for easy smart home s...,1
2,we're early in the stages of home smartness bu...,1
3,Got this over the google home because I felt i...,1
4,I paired this with the logitech elite universa...,1


In [39]:
reviews_df.head(20)

Unnamed: 0,Review,Sentiment
0,Bestbuy came through before the holiday better...,0
1,Alexa is a key component for easy smart home s...,1
2,we're early in the stages of home smartness bu...,1
3,Got this over the google home because I felt i...,1
4,I paired this with the logitech elite universa...,1
5,I spent more time trying get it functioning th...,0
6,Easy for a 5 year old to learn.. the apps on t...,1
7,Sure I can use my iPhone for internet browsing...,1
8,You have to pay for every thing -any thing you...,0
9,I bought it as a gift for my husband and he lo...,1


Defining our tokenizers: Clean the text to remove punctuation (except emoticons), put everything in lower case, and tokenize. These 2 tokenizers will be used in our grid search, and 1 will be chosen for our final model:

In [40]:
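from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()

def tokenizer(text):
    # Collect emoticons such as :) ;-( =) before punctuation is stripped
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    # Replace runs of non-word characters with a space, then append the emoticons without the '-' nose
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    return text.split()

def tokenizerporter(text):
    # Same cleaning as tokenizer(), followed by Porter stemming of each token
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    return [porter.stem(word) for word in text.split()]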
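
As a quick illustration of what the simple tokenizer produces, the sketch below repeats its logic (with raw-string patterns) and runs it on a few invented sentences. It also surfaces two quirks that follow from the code as written: the emoticons are appended without a separating space, so when the cleaned text ends in a word the first emoticon is glued to it, and because the text is lowercased before matching, `:D` and `:P` are never captured.

```python
import re

def tokenizer(text):
    # Same logic as the notebook's tokenizer, written with raw-string patterns
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    return text.split()

# Text ending in punctuation: emoticons come out as separate tokens
tokens = tokenizer('This :) is a test :-(!')

# Text ending in a word: the first emoticon is attached to the last word
glued = tokenizer('hello :) world')

# Lowercasing turns ':D' into ':d', which the pattern does not match
lost = tokenizer('So happy :D')

print(tokens)
print(glued)
print(lost)
```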


Create vectorizer to use in gridsearch below:

In [41]:
tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

tfidf.get_params().keys()

dict_keys(['analyzer', 'binary', 'decode_error', 'dtype', 'encoding', 'input', 'lowercase', 'max_df', 'max_features', 'min_df', 'ngram_range', 'norm', 'preprocessor', 'smooth_idf', 'stop_words', 'strip_accents', 'sublinear_tf', 'token_pattern', 'tokenizer', 'use_idf', 'vocabulary'])

## Model Optimization and Serialization

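Train/test split. Because .loc slicing is label-based and includes both endpoints, rows 0 through 812 (813 reviews) are used for training and rows 812 through 2311 (1,500 reviews) for testing. This is roughly a 35/65 split rather than 50/50, and row 812 appears in both sets: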

In [42]:
X_train = reviews_df.loc[:812, 'Review'].values
y_train = reviews_df.loc[:812, 'Sentiment'].values
X_test = reviews_df.loc[812:, 'Review'].values
y_test = reviews_df.loc[812:, 'Sentiment'].values
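
The sizes of the two slices can be checked on a stand-in frame. The sketch below assumes only that the frame has 2,312 rows and a default integer index, as the shuffled data does; the `demo` contents are placeholders. It also shows one possible non-overlapping alternative using position-based `.iloc`, which is our suggestion and not what the notebook ran.

```python
import pandas as pd

# Hypothetical stand-in with the same number of rows as the downsampled data
n_rows = 2312
demo = pd.DataFrame({'Review': ['text'] * n_rows, 'Sentiment': [0] * n_rows})

# Label-based slicing with .loc includes BOTH endpoints
train = demo.loc[:812]
test = demo.loc[812:]
overlap = list(train.index.intersection(test.index))
print(len(train), len(test), overlap)

# One non-overlapping alternative: position-based slicing with .iloc
split = 812
train_alt = demo.iloc[:split]
test_alt = demo.iloc[split:]
overlap_alt = list(train_alt.index.intersection(test_alt.index))
print(len(train_alt), len(test_alt), overlap_alt)
```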

Grid search to find optimal hyperparameters:

In [44]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [52]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV


stop = stopwords.words('english')


param_grid = [{'vect__ngram_range': [(1, 1), (1, 2)],  # unigrams only, or unigrams plus bigrams
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizerporter],
               'clf__penalty': ['l1', 'l2'],
               'clf__n_iter_no_change': [3,5,7,9],
               'clf__alpha': [1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3]},
              {'vect__ngram_range': [(1, 1), (1, 2)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizerporter],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'clf__penalty': ['l1', 'l2'],
               'clf__n_iter_no_change': [3,5,7,9],
               'clf__alpha': [1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3]},
              ]


lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', SGDClassifier(loss='log', random_state=1))])

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=2,
                           n_jobs=-1)
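
As a sanity check on the size of this search, the number of candidates can be counted by hand before fitting. The sketch below mirrors the two sub-grids with placeholder strings in place of the real tokenizer functions and stop-word list, since only the number of options per parameter matters; `count_candidates` is a hypothetical helper. The result should agree with the candidate and fit counts scikit-learn reports when the search is fitted.

```python
from itertools import product

# Placeholder values stand in for the real objects; only the option counts matter
shared = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'vect__stop_words': ['stop', None],
    'vect__tokenizer': ['tokenizer', 'tokenizerporter'],
    'clf__penalty': ['l1', 'l2'],
    'clf__n_iter_no_change': [3, 5, 7, 9],
    'clf__alpha': [1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3],
}

# The second sub-grid adds two single-valued options (raw term frequencies)
raw_tf = dict(shared, vect__use_idf=[False], vect__norm=[None])

def count_candidates(grid):
    # Number of parameter combinations in one sub-grid
    return sum(1 for _ in product(*grid.values()))

n_candidates = count_candidates(shared) + count_candidates(raw_tf)
n_fits = n_candidates * 5  # cv=5 folds per candidate

print(n_candidates, n_fits)
```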

In [53]:
gs_lr_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 1024 candidates, totalling 5120 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=-1)]: Done  29 tasks      | elapsed:    4.7s
[Parallel(n_jobs=-1)]: Done 150 tasks      | elapsed:   19.9s
[Parallel(n_jobs=-1)]: Done 353 tasks      | elapsed:   42.6s
[Parallel(n_jobs=-1)]: Done 636 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 1001 tasks      | elapsed:  2.5min
[Parallel(n_jobs=-1)]: Done 1446 tasks      | elapsed:  3.5min
[Parallel(n_jobs=-1)]: Done 1973 tasks      | elapsed:  4.8min
[Parallel(n_jobs=-1)]: Done 2580 tasks      | elapsed:  6.4min
[Parallel(n_jobs=-1)]: Done 3269 tasks      | elapsed:  8.1min
[Parallel(n_jobs=-1)]: Done 4038 tasks      | elapsed: 11.0min
[Parallel(n_jobs=-1)]: Done 4889 tasks      | elapsed: 13.0min
[Parallel(n_jobs=-1)]: Done 5120 out of 5120 | elapsed: 13.6min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=False,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        n

In [54]:
print('Best parameter set: %s ' % gs_lr_tfidf.best_params_)
print('CV Accuracy: %.3f' % gs_lr_tfidf.best_score_)

Best parameter set: {'clf__alpha': 0.0001, 'clf__n_iter_no_change': 5, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1), 'vect__stop_words': None, 'vect__tokenizer': <function tokenizer at 0x7fa9803ad680>} 
CV Accuracy: 0.872


In [55]:
clf = gs_lr_tfidf.best_estimator_
print('Test Accuracy: %.3f' % clf.score(X_test, y_test))

Test Accuracy: 0.851


##### Best parameters are: 
<br>'clf__alpha': 0.0001
<br>'clf__penalty': 'l2'
<br>'clf__n_iter_no_change': 5
<br>'vect__ngram_range': (1, 1)
<br>'vect__stop_words': None
<br>'vect__tokenizer': tokenizer

We included ngram_range as a hyperparameter because with (1,2) the vectorizer looks at pairs of adjacent words in addition to single words; we thought this might add value if the way words appear next to one another carries some meaning for sentiment. But our grid search showed (1,1) was better than (1,2), so we are just looking at single words. The book example used just (1,1) as the value, so adding (1,2) as an option for ours did not offer any improvement.
    <br>We changed the 'C' parameter the book used to 'alpha' since we are fitting with a different algorithm. For our alpha value, we included a greater range of options than the book did with their 'C' value. The grid search chose an alpha of 0.0001, which is the default.
    <br>We also added n_iter_no_change, which sets how many iterations without improvement are allowed before training stops early (the default is 5, so that is what the book example's grid was using). After the grid search, the best n_iter_no_change was found to be 5, the same as the default, so adding this to the grid did not change the model performance.
    <br>In conclusion, we would have gotten the same optimal hyperparameters by using the same grid as the book (except changing C to alpha).
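
To illustrate what the two ngram_range settings mean, here is a minimal pure-Python sketch. The `word_ngrams` helper is hypothetical and only mimics the idea; it is not scikit-learn's implementation. With (1, 1) only single words are produced, while (1, 2) yields the single words plus every pair of adjacent words.

```python
def word_ngrams(tokens, ngram_range):
    """Hypothetical helper: return all word n-grams for n in [min_n, max_n]."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the token list
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams

tokens = ['not', 'worth', 'the', 'money']
unigrams = word_ngrams(tokens, (1, 1))
uni_and_bigrams = word_ngrams(tokens, (1, 2))

print(unigrams)
print(uni_and_bigrams)
```

A pair such as "not worth" is the kind of feature we hoped might help, although the grid search did not favor it on this data.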

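Define a HashingVectorizer with the tokenizer, stop-word, and n-gram settings from the grid search. HashingVectorizer is stateless and applies no IDF weighting, so its features are not identical to the TfidfVectorizer features used during the search: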

In [56]:
vect = HashingVectorizer(decode_error='ignore',
                         n_features=2**21,
                         preprocessor=None,
                         ngram_range=(1,1),
                         stop_words=None,
                         tokenizer=tokenizer)
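
The idea behind a hashing vectorizer can be sketched in a few lines: each token is mapped straight to a column index by a hash function, so no vocabulary has to be stored or fitted. The helpers below are hypothetical and use MD5 purely for illustration; scikit-learn's HashingVectorizer uses a different hash function and also applies sign alternation and normalization by default, so the indices and values here will not match it.

```python
import hashlib

def hashed_index(token, n_features=2**21):
    """Hypothetical helper: map a token to a column index with a stable hash."""
    digest = hashlib.md5(token.encode('utf-8')).hexdigest()
    return int(digest, 16) % n_features

def hash_counts(tokens, n_features=2**21):
    """Hypothetical helper: sparse term counts keyed by hashed column index."""
    counts = {}
    for tok in tokens:
        idx = hashed_index(tok, n_features)
        counts[idx] = counts.get(idx, 0) + 1
    return counts

vec = hash_counts(['great', 'tablet', 'great'])
print(vec)
```

Because the mapping needs no fitting, new reviews can be transformed one minibatch at a time, which is what makes it suitable for the streaming model that follows.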

Fit streaming model with chosen parameters:

In [57]:
# Generator that yields one (review text, sentiment label) pair per row
def stream_docs(df):
    for index, row in df.iterrows():
        text, label = row['Review'], row['Sentiment']
        yield text, label
        
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y

clf = SGDClassifier(loss='log', random_state=1, penalty='l2', alpha=0.0001)

doc_stream = stream_docs(reviews_df)
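
The behavior of `get_minibatch` at the end of the stream is worth seeing on a small example. The sketch below repeats the helper and feeds it a hypothetical five-document stream. When fewer than `size` documents remain, the helper as written returns `(None, None)` and the leftover documents are discarded.

```python
def get_minibatch(doc_stream, size):
    # Same logic as the notebook's helper
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        # Stream ran out mid-batch: the partial batch is dropped
        return None, None
    return docs, y

# Hypothetical five-document stream
stream = iter([('a', 1), ('b', 0), ('c', 1), ('d', 0), ('e', 1)])

first = get_minibatch(stream, size=2)
second = get_minibatch(stream, size=2)
third = get_minibatch(stream, size=2)  # only one document was left

print(first, second, third)
```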

In [58]:
import pyprind
pbar = pyprind.ProgBar(45)

classes = np.array([0, 1])
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=20)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)
    pbar.update()

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:01


In [59]:
X_test, y_test = get_minibatch(doc_stream, size=20)
X_test = vect.transform(X_test)
print('Accuracy: %.3f' % clf.score(X_test, y_test))

Accuracy: 0.850
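
Since this accuracy comes from a single 20-document minibatch, it is a coarse estimate. The arithmetic below uses only numbers from this notebook and, as an assumption on our part, treats the 20 predictions as independent draws to get a rough binomial standard error.

```python
import math

n_total = 2312     # rows in the downsampled data
n_train = 45 * 20  # documents consumed by the partial_fit loop
n_test = 20        # documents in the single held-out minibatch
accuracy = 0.850   # value printed by the notebook

# Number of correct predictions implied by the reported accuracy
n_correct = round(accuracy * n_test)

# Rough binomial standard error of the accuracy estimate
std_err = math.sqrt(accuracy * (1 - accuracy) / n_test)

# Documents the streaming loop never reads
n_unused = n_total - n_train - n_test

print(n_correct, round(std_err, 3), n_unused)
```

A standard error of about 0.08 means the streaming figure is consistent with the grid search test accuracy but cannot distinguish small differences in model quality.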


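Final model: update the classifier with the held-out minibatch (already vectorized above) so that it has seen these reviews as well before it is saved: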

In [60]:
clf = clf.partial_fit(X_test, y_test)

Save the model and stop words to pickle files:

In [61]:
import pickle
import os

dest = os.path.join('website', 'pkl_objects')
if not os.path.exists(dest):
    os.makedirs(dest)

pickle.dump(stop, open(os.path.join(dest, 'stopwords.pkl'), 'wb'), protocol=4)   
pickle.dump(clf, open(os.path.join(dest, 'amazon_classifier.pkl'), 'wb'), protocol=4)
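
A variant of the same save and load pattern uses `with` blocks so that each file handle is closed explicitly. The sketch below is a round trip on a hypothetical placeholder object in a temporary directory; the file name `example.pkl` and the `payload` contents are ours, not part of the project.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the objects being serialized
payload = {'stop': ['the', 'a'], 'note': 'placeholder, not the real classifier'}

with tempfile.TemporaryDirectory() as tmp:
    dest = os.path.join(tmp, 'pkl_objects')
    os.makedirs(dest, exist_ok=True)
    path = os.path.join(dest, 'example.pkl')

    # Write with the same pickle protocol the notebook uses
    with open(path, 'wb') as f:
        pickle.dump(payload, f, protocol=4)

    # Read it back to confirm the round trip
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == payload)
```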


Write tokenizer and hashing vectorizer to .py file called amazonreview_vectorizer.py (we included both tokenizers and the stop words in case future hyperparameter tuning would require it, though currently we are only using the simple tokenizer and no stop words):

In [62]:
%%writefile website/amazonreview_vectorizer.py 
from sklearn.feature_extraction.text import HashingVectorizer
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import pickle
import re
import os

porter=PorterStemmer()

cur_dir = os.path.dirname(__file__)
stop = pickle.load(open(
                os.path.join(cur_dir, 
                'pkl_objects', 
                'stopwords.pkl'), 'rb'))


def tokenizer(text):
    # Collect emoticons before punctuation is stripped, then append them without the '-' nose
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    return text.split()

def tokenizerporter(text):
    # Same cleaning as tokenizer(), followed by Porter stemming of each token
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    return [porter.stem(word) for word in text.split()]

vect = HashingVectorizer(decode_error='ignore',
                         n_features=2**21,
                         preprocessor=None,
                         ngram_range=(1,1),
                         stop_words=None,
                         tokenizer=tokenizer)

Overwriting website/amazonreview_vectorizer.py


In [63]:
from sklearn.feature_extraction.text import HashingVectorizer

Load pickle file:

In [64]:
import os
os.chdir('website')

In [65]:
import pickle
import re
from amazonreview_vectorizer import vect

model = pickle.load(open(os.path.join('pkl_objects', 'amazon_classifier.pkl'), 'rb'))

In [66]:
model

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='log', max_iter=1000,
              n_iter_no_change=5, n_jobs=None, penalty='l2', power_t=0.5,
              random_state=1, shuffle=True, tol=0.001, validation_fraction=0.1,
              verbose=0, warm_start=False)

In [67]:
import re

Below are a few test samples of new reviews and predictions:

In [68]:
import numpy as np
label = {0:'negative', 1:'positive'}

example = ["I love this product. It's amazing."]
X = vect.transform(example)
print('Prediction: %s\nProbability: %.2f%%' %\
      (label[model.predict(X)[0]], 
       np.max(model.predict_proba(X))*100))

Prediction: positive
Probability: 96.89%


In [69]:
label = {0:'negative', 1:'positive'}

example = ["bad bad bad terrible awful bad don't hate loser awful regret"]
X = vect.transform(example)
print('Prediction: %s\nProbability: %.2f%%' %\
      (label[model.predict(X)[0]], 
       np.max(model.predict_proba(X))*100))

Prediction: negative
Probability: 91.95%


In [70]:
label = {0:'negative', 1:'positive'}

example = ["amazing i love it so much and great happy again love yes yay like smile"]
X = vect.transform(example)
print('Prediction: %s\nProbability: %.2f%%' %\
      (label[model.predict(X)[0]], 
       np.max(model.predict_proba(X))*100))

Prediction: positive
Probability: 99.88%


In [71]:
label = {0:'negative', 1:'positive'}

example = ["too slow for me you tube don t work sometimes"]
X = vect.transform(example)
print('Prediction: %s\nProbability: %.2f%%' %\
      (label[model.predict(X)[0]], 
       np.max(model.predict_proba(X))*100))

Prediction: negative
Probability: 96.91%


## Website Creation and Publishing

Setting up SQLite:

In [73]:
import os
os.getcwd()

'/home/jovyan/projects/project02/website'

In [74]:
import sqlite3
import os

conn = sqlite3.connect('reviews.sqlite')
c = conn.cursor()

c.execute('DROP TABLE IF EXISTS review_db')
c.execute('CREATE TABLE review_db (review TEXT, sentiment INTEGER, date TEXT)')


conn.commit()
conn.close()
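
To show how a table with this schema might be used, the sketch below creates it in an in-memory database, stores one invented review with its predicted label, and reads it back. The example row is hypothetical, and this is an illustration of the schema, not the website's actual code.

```python
import sqlite3

# In-memory database, so nothing is written to disk
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('CREATE TABLE review_db (review TEXT, sentiment INTEGER, date TEXT)')

# Hypothetical entry; the ? placeholders let sqlite3 handle quoting safely
example = 'I love this product'
c.execute(
    "INSERT INTO review_db (review, sentiment, date) VALUES (?, ?, DATETIME('now'))",
    (example, 1),
)
conn.commit()

c.execute('SELECT review, sentiment FROM review_db')
rows = c.fetchall()
conn.close()

print(rows)
```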

### Link to Python Anywhere website:
http://anndirussell.pythonanywhere.com/?