In [1]:
import matplotlib
import seaborn as sns
import os

sns.set()
matplotlib.rcParams['figure.dpi'] = 81

## Introduction

The objective of this miniproject is to gain experience with natural language processing and with using text data to train a machine learning model to make predictions. We will work with product review text from Amazon, restricted to products in the "Electronics" category. The goal is to train a model that predicts each review's rating, which ranges from 1 to 5 stars.

## Downloading and loading the data

The data set is available on Amazon S3 as a gzip-compressed file in which each line is a JSON object. To load it, we use the `gzip` library to open the file and decode each line into a Python dictionary. The result is a list of dictionaries, where each dictionary represents one observation.
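
The loading step can be sketched on a tiny stand-in file. This is a minimal illustration, not the notebook's actual data: the two records and the file name `toy_reviews.json.gz` are made up, and the standard-library `json` module is used in place of `ujson`.

```python
import gzip
import json
import os
import tempfile

# Hypothetical two-record data set standing in for the real reviews.
records = [
    {"reviewText": "Works great", "overall": 5.0},
    {"reviewText": "Stopped working", "overall": 1.0},
]

# Write one JSON object per line into a gzip-compressed file.
path = os.path.join(tempfile.mkdtemp(), "toy_reviews.json.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read it back: text mode ("rt") yields one str line per JSON object.
with gzip.open(path, "rt", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(len(loaded), loaded[0]["reviewText"])
```

Each element of `loaded` is a plain dictionary, so fields such as `reviewText` and `overall` can be read by key.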

In [4]:
%%bash
# Download the data set into ./nlp-data/ (wget must be installed first)
mkdir -p nlp-data
wget "http://dataincubator-wqu.s3.amazonaws.com/mldata/amazon_electronics_reviews_training.json.gz" -nc -P ./nlp-data/

In [2]:
import gzip
import ujson as json
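os.chdir('/Users/bmr225/Documents/WorldQuantUniversity')  # author's local project directory; adjust as needed
# Each line of the compressed file is one JSON object
with gzip.open("./nlp-data/amazon_electronics_reviews_training.json.gz", "r") as f:
    data = [json.loads(line) for line in f]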

In [9]:
# Sample observation
data[0]['reviewText']

"I bought this mouse to use with my laptop because I don't like those little touchpads.  I could not be happier.Since it's USB, I can plug it in with the computer already on and expect it to work automatically.  Since it's optical (the new kind, not to be confused with the old Sun optical mice that required a special checkered mouse pad) it works on most surfaces, including my pant legs, my couch, and random tables that I put my laptop down on.  It's also light and durable, features that help with portability.The wheel is surprisingly useful.  In addition to scrolling, it controls zoom and pan in programs like Autocad and 3D Studio Max.  I can no longer bear using either of these programs without it.One complaint - the software included with the Internet navigation features is useless.  Don't bother installing it if you have a newer Windows version that automatically supports wheel mice.  Just plug it in and use it - it's that easy."

In [3]:
# Custom transformer that extracts the value stored under a given key (e.g. reviewText)
from sklearn.base import BaseEstimator, TransformerMixin
class ExtractKey(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        # Nothing to learn; the transformer is stateless
        return self

    def transform(self, X):
        # X is a list of dictionaries; return the selected field of each one
        return [row[self.key] for row in X]
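
As a quick check of how such a key-extracting transformer behaves inside a pipeline, here is a small self-contained sketch. The class `KeyExtractor` is a hypothetical stand-in written only for this illustration, and the two toy reviews are invented.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline


class KeyExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: pulls one field out of each dictionary."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [row[self.key] for row in X]


# Invented observations in the same shape as the review dictionaries.
toy_reviews = [
    {"reviewText": "great mouse", "overall": 5.0},
    {"reviewText": "bad cable", "overall": 1.0},
]

# On its own, the transformer returns a list of strings.
texts = KeyExtractor("reviewText").fit_transform(toy_reviews)

# Chained with a vectorizer, the pipeline accepts the raw dictionaries.
pipe = Pipeline([
    ("extract", KeyExtractor("reviewText")),
    ("vectorizer", CountVectorizer()),
])
counts = pipe.fit_transform(toy_reviews)

print(texts)
print(counts.shape)
```

Putting the extraction step inside the pipeline means the same list of dictionaries can be passed to `fit`, `predict`, and `score` without any manual preprocessing.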

In [4]:
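# spaCy must be installed first
import spacy
# Load the English text processing pipeline
# (spaCy 3 and later require a full model name such as 'en_core_web_sm' instead of the 'en' shortcut)
nlp = spacy.load('en')

def lemmatizer(text):
    """Return the lemma of each token in the text."""
    return [word.lemma_ for word in nlp(text)]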

In [5]:
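# Extending the stop words with domain-specific terms
# Note: the vectorizer lowercases tokens before filtering, so 'USB' and the
# multi-word entry 'Floppy Disk Drive' never match a token as written
from spacy.lang.en import STOP_WORDS
STOP_WORDS_rev = STOP_WORDS.union({'mouse','software','computer','plug','USB','pad','touchpad','cable','Floppy Disk Drive','install'})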

In [6]:
ratings = [review['overall'] for review in data]

### Comparison between CountVectorizer and TfidfVectorizer

In [8]:
# I. CountVectorizer
from spacy.lang.en import STOP_WORDS
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import SGDRegressor
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


from sklearn.linear_model import Ridge
Extract_reviewtext = ExtractKey('reviewText')
countvectorizer = CountVectorizer()
ridge_regressor = Ridge(alpha=1, solver='sag')


bag_of_words_model = Pipeline([('extract_reviews', Extract_reviewtext), ('vectorizer',countvectorizer), ('regressor', ridge_regressor)])
bag_of_words_model.fit(data, ratings)

# R^2 is computed on the same data used for fitting, so it is a training score
print("Model r2: {}".format(bag_of_words_model.score(data, ratings)))

Model r2: 0.42811605136161623


In [15]:
# II. Normalized model: TfidfVectorizer

from sklearn.linear_model import Ridge
Extract_reviewtext = ExtractKey('reviewText')
tdidf = TfidfVectorizer(stop_words=STOP_WORDS_rev,tokenizer = None)
countvectorizer = CountVectorizer()
ridge_regressor = Ridge(alpha=1, solver ='sag')


normalized_model = Pipeline([('extract_reviews', Extract_reviewtext), ('vectorizer',tdidf), ('regressor', ridge_regressor)])
normalized_model.fit(data, ratings)

print("Model r2: {}".format(normalized_model.score(data, ratings)))


Model r2: 0.5895969413827675


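The model improved with the normalized counts (TF-IDF together with the custom stop words): the training R^2 rose from about 0.43 to about 0.59.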

Model performance may improve further when bigram counts are added as features, so we include bigrams in the model. With more features, however, the risk of overfitting increases.
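
To make the growth in features concrete, the following sketch compares the vocabulary built from unigrams only with the one built from unigrams plus bigrams. The two toy documents are invented for illustration; only `ngram_range` differs between the two vectorizers.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy corpus.
docs = ["not good at all", "very good sound"]

# ngram_range=(1, 1) counts single words; (1, 2) adds adjacent word pairs.
unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
uni_and_bigrams = CountVectorizer(ngram_range=(1, 2)).fit(docs)

print(sorted(unigrams.vocabulary_))
print(sorted(uni_and_bigrams.vocabulary_))
```

A bigram such as "not good" carries sentiment that the separate words "not" and "good" lose, which is the motivation for adding bigrams. The price is a larger feature space, even on two short documents.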

## Hyperparameter tuning for Bigrams and Ridge Regressor

In [11]:
# To hide all the warnings in Python
import warnings
warnings.filterwarnings('ignore')


import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from tempfile import mkdtemp
from shutil import rmtree

# To check for overfitting
X_train, X_test, y_train, y_test = train_test_split(data,ratings, test_size = 0.2, random_state = 0)

Extract_reviewtext = ExtractKey('reviewText')
tdidf = TfidfVectorizer(stop_words=STOP_WORDS_rev)
countvectorizer = CountVectorizer()
ridge_regressor = Ridge(solver ='sag')

cachedir = mkdtemp()
pipe = Pipeline([('extract_reviews', Extract_reviewtext), ('vectorizer',tdidf), ('regressor', ridge_regressor)])
#pipe.fit(data, ratings)


param_grid = {'vectorizer__ngram_range': [(1, 1), (1, 2), (2, 2)],
              'vectorizer__tokenizer': [None, lemmatizer],
              'regressor__alpha': np.logspace(-8, 1)}  # 50 log-spaced values (the NumPy default)


grid_search = GridSearchCV(pipe, param_grid, cv = 5, verbose = 1)
grid_search.fit(X_train[:3000],y_train[:3000]) #Using part of the data for tuning to speed up the process


print('Training R^2:{}'.format(grid_search.score(X_train,y_train)))
print('CV R^2:{}'.format(grid_search.best_score_)) 
print('Test R^2:{}'.format(grid_search.score(X_test,y_test)))

Fitting 5 folds for each of 300 candidates, totalling 1500 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1500 out of 1500 | elapsed: 1167.5min finished


Training R^2:0.32678109133473887
CV R^2:0.2974161548844446
Test R^2:0.2908774906523537


In [24]:
grid_search.best_params_

{'regressor__alpha': 0.22229964825261955,
 'vectorizer__ngram_range': (1, 2),
 'vectorizer__tokenizer': None}

### The final model: stop words, bigrams, TF-IDF

In [23]:
# Building the model with the tuned hyperparameters
from sklearn.linear_model import Ridge
Extract_reviewtext = ExtractKey('reviewText')
tdidf = TfidfVectorizer(stop_words=STOP_WORDS_rev,ngram_range=(1,2),tokenizer = None)
ridge_regressor = Ridge(alpha=0.2222996482, solver = 'sag')


bigrams_model = Pipeline([('extract_reviews', Extract_reviewtext), ('vectorizer',tdidf), ('regressor', ridge_regressor)])
bigrams_model.fit(X_train, y_train)


Pipeline(memory=None,
         steps=[('extract_reviews', ExtractKey(key='reviewText')),
                ('vectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 2), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words={"'d...
                                             'also', 'although', 'always', 'am',
                                             'among', 'amongst', 'amount', 'an',
                                             'and', ...},
                                 strip_accents=None, sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\

In [24]:
print("Training r2: {}".format(bigrams_model.score(X_train, y_train)))
print("Test r2: {}".format(bigrams_model.score(X_test, y_test)))

Training r2: 0.9690501453968942
Test r2: 0.4406043971082288


In [None]:
# The large gap between training r2 (0.97) and test r2 (0.44) indicates overfitting.

## Investigation II

## Polarity analysis
Let's derive some insight from our analysis. We want to determine the most polarizing words in the corpus of reviews. In other words, we want to identify words that strongly signal whether a review is positive or negative. For example, we expect a word like "terrible" to appear mostly in negative rather than positive reviews. The naive Bayes model calculates probabilities such as $P(\text{terrible } | \text{ negative})$, the probability that the word "terrible" appears in the text given that the review is negative. Using these probabilities, we can derive a polarity score for each counted word,


$$
\text{polarity} =  \log\left(\frac{P(\text{word } | \text{ positive})}{P(\text{word } | \text{ negative})}\right).
$$ 

 
The polarity analysis is an example where a simpler model is easier to interpret than a more complicated one. For this question, you are asked to determine the twenty-five words with the largest positive polarity and the twenty-five words with the largest negative polarity, for a total of fifty words.
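
The polarity score can be worked through on a toy corpus. This is a minimal sketch under stated assumptions: the four reviews and their labels are invented, raw counts are used instead of TF-IDF, and the classifier uses the default Laplace smoothing (`alpha=1`). `MultinomialNB` stores $\log P(\text{word} \mid \text{class})$ in `feature_log_prob_`, with rows ordered as in `classes_`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented toy reviews: label 1 = negative (1 star), label 5 = positive (5 stars).
docs = [
    "terrible waste of money",
    "terrible product returned it",
    "excellent sound highly recommend",
    "excellent value works great",
]
labels = [1, 1, 5, 5]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
model = MultinomialNB().fit(X, labels)

# Rows follow model.classes_ (sorted), so row 0 is negative and row 1 is positive.
log_p_negative, log_p_positive = model.feature_log_prob_

# polarity = log(P(word | positive) / P(word | negative))
polarity = log_p_positive - log_p_negative

# Column order of the features, recovered from the word -> index mapping.
words = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

most_positive = words[int(np.argmax(polarity))]
most_negative = words[int(np.argmin(polarity))]
print(most_positive, most_negative)
```

In this corpus "excellent" appears twice in positive reviews and never in negative ones, so after smoothing its polarity is $\log\left(\frac{3/22}{1/22}\right) = \log 3$, and "terrible" gets $-\log 3$ by symmetry.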

In [27]:
del data, ratings  # free the memory used by the first data set


In [28]:
import gzip
import ujson as json
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# create data set and labels
with gzip.open('./nlp-data/amazon_one_and_five_star_reviews.json.gz', "r") as f:
    data_polarity = [json.loads(line) for line in f]

In [29]:
ratings = [review['overall'] for review in data_polarity]
reviews = [review['reviewText'] for review in data_polarity]

In [30]:
# TfidfVectorizer is applied directly with fit_transform rather than inside a Pipeline,
# so the vectorizer is fit on all reviews before the train-test split
from spacy.lang.en import STOP_WORDS
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import SGDRegressor
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


tfidf = TfidfVectorizer(stop_words=STOP_WORDS)
# A lemmatizing tokenizer was also tried; it produced many punctuation tokens and currently fails:
# tfidf = TfidfVectorizer(stop_words=STOP_WORDS, tokenizer=lemmatizer, ngram_range=(1, 1))
reviews_tfdif = tfidf.fit_transform(reviews)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(reviews_tfdif, ratings, test_size=0.2, random_state=0)

# Train the classifier
class_model = MultinomialNB()
class_model.fit(X_train, y_train)

y_train = np.array(y_train)

# Column 0 of predict_proba corresponds to class_model.classes_[0], the 1-star class
y_proba = class_model.predict_proba(X_train)

# Hard labels: predict 1 star when P(1 star) > 0.5, otherwise 5 stars
predicted_indices = [1 if p > 0.5 else 5 for p in y_proba[:, 0]]

print("Training accuracy: {}".format(class_model.score(X_train, y_train)))


Training accuracy: 0.9085


In [31]:
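from sklearn import metrics
print("Training accuracy: {}".format(class_model.score(X_train, y_train)))
print("accuracy: {}".format(metrics.accuracy_score(y_train, predicted_indices)))  # same value, computed from the hard labels

# ROC AUC on the training set.
# Caution: the scores passed here are hard labels (1 or 5) while pos_label=1, so larger scores
# point to the 5-star class and the ranking is inverted. This is why the AUC falls far below 0.5.
# Passing y_proba[:, 0], the predicted probability of the 1-star class, as the score avoids this.
fpr, tpr, thresholds = metrics.roc_curve(y_train, predicted_indices, pos_label=1)
print("Multinomial naive bayes AUC: {0}".format(metrics.auc(fpr, tpr)))
print("Test accuracy: {}".format(class_model.score(X_test, y_test)))  # lower than the training accuracy, suggesting some overfitting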

Training accuracy: 0.9085
accuracy: 0.9085
Multinomial naive bayes AUC: 0.0916058960783564
Test accuracy: 0.849
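
An AUC far below 0.5 usually means the scores rank the classes in the wrong direction. The sketch below illustrates this on invented numbers, not on the review data: `p_one_star` is a hypothetical vector of predicted probabilities for the 1-star class. Scoring with hard labels (1 or 5) while declaring `pos_label=1` inverts the ranking, whereas scoring with the probability of the positive class does not.

```python
import numpy as np
from sklearn import metrics

# Invented true labels and hypothetical predicted probabilities of the 1-star class.
y_true = np.array([1, 1, 5, 5, 1, 5])
p_one_star = np.array([0.9, 0.8, 0.3, 0.6, 0.55, 0.1])

# Hard labels derived from the probabilities, as in a 0.5 threshold rule.
hard_labels = np.where(p_one_star > 0.5, 1, 5)

# Inverted: a larger score (5) points away from the declared positive class (1).
fpr, tpr, _ = metrics.roc_curve(y_true, hard_labels, pos_label=1)
auc_from_labels = metrics.auc(fpr, tpr)

# Consistent: a larger score means the positive class is more likely.
fpr, tpr, _ = metrics.roc_curve(y_true, p_one_star, pos_label=1)
auc_from_probabilities = metrics.auc(fpr, tpr)

print(auc_from_labels, auc_from_probabilities)
```

Using probabilities also keeps the full ranking information, which a pair of hard labels throws away.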


In [32]:
# Now fitting the model on the whole data set
class_model = MultinomialNB()
class_model.fit(reviews_tfdif,ratings)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [33]:
# Deriving polarity
# Rows of feature_log_prob_ follow class_model.classes_: row 0 is the 1-star class, row 1 the 5-star class
negative_prob = class_model.feature_log_prob_[0]
positive_prob = class_model.feature_log_prob_[1]

# A difference of log probabilities is the log of the probability ratio
post_pos_ratio = positive_prob - negative_prob
post_neg_ratio = negative_prob - positive_prob

# argsort returns indices in ascending order, so the largest values are at the end
sorted_post_pos_ratio_ind = post_pos_ratio.argsort()  # positive polarity
sorted_post_neg_ratio_ind = post_neg_ratio.argsort()  # negative polarity

# List of features (in scikit-learn 1.2 and later, use get_feature_names_out())
features_l = list(tfidf.get_feature_names())


In [34]:
print([features_l[ind] for ind in sorted_post_pos_ratio_ind[:-26:-1]])

['highly', 'beat', 'protects', 'perfect', 'monopod', 'portrait', 'amazing', 'sturdy', 'macro', 'incredible', 'excellent', 'bokeh', 'pleased', '200mm', 'charm', 'handy', 'awesome', 'portraits', 'dslr', 'crisp', 'photography', 'telephoto', 'buck', 'fantastic', 'regrets']


In [35]:
print([features_l[ind] for ind in sorted_post_neg_ratio_ind[:-26:-1]])

['refund', 'waste', 'return', 'returning', 'worst', 'junk', 'terrible', 'returned', 'garbage', 'useless', 'worthless', 'worse', 'trash', 'defective', 'beware', 'poor', 'unacceptable', 'awful', 'horrible', 'unreliable', 'stopped', 'randomly', 'disappointing', 'threw', 'refused']


### The top 50 words: 25 top positive and 25 top negative words

In [36]:
top_pos_25 = [features_l[ind] for ind in sorted_post_pos_ratio_ind[:-26:-1]]
top_neg_25 = [features_l[ind] for ind in sorted_post_neg_ratio_ind[:-26:-1]]
top_50 = top_pos_25 + top_neg_25 

## Topic modeling 


Topic modeling is the task of identifying the key topics or themes in a corpus. In machine learning terms, it is an unsupervised technique. One way to uncover the main topics in a corpus is non-negative matrix factorization (NMF). For this question, use NMF to determine the top ten words for the first twenty topics.
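
The idea can be sketched on a toy corpus before applying it to the reviews. This is an illustration under assumptions: the four documents are invented, and `init="nndsvd"` and `max_iter=500` are choices made here for a deterministic, converged fit. Each row of `components_` holds one topic's weight for every word, and the slice `[:-n_top_words - 1:-1]` on the ascending `argsort` result picks the largest weights, highest first.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy corpus with two rough themes (cameras and audio).
docs = [
    "camera lens zoom focus",
    "lens camera sharp focus",
    "speaker sound bass volume",
    "sound bass speaker loud",
]

n_topics = 2
n_top_words = 3

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

nmf = NMF(n_components=n_topics, init="nndsvd", random_state=0, max_iter=500)
nmf.fit(X)

# Column order of the features, recovered from the word -> index mapping.
feature_names = sorted(tfidf.vocabulary_, key=tfidf.vocabulary_.get)

top_words_per_topic = []
for topic in nmf.components_:
    # argsort is ascending; reverse-slice the last n_top_words entries.
    indices = topic.argsort()[:-n_top_words - 1:-1]
    top_words_per_topic.append([feature_names[i] for i in indices])

for i, words in enumerate(top_words_per_topic):
    print("Topic {}: {}".format(i, " ".join(words)))
```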

In [37]:
# NLP example: code adapted from course resources
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline


data = reviews

n_topics = 20
n_top_words = 10

tfidf = TfidfVectorizer(stop_words='english')
nmf = NMF(n_components=n_topics, random_state=0)
pipe = Pipeline([('vectorizer', tfidf), ('dim-red', nmf)])
pipe.fit(data)

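# In scikit-learn 1.2 and later, use get_feature_names_out()
feature_names = tfidf.get_feature_names()

for i, topic in enumerate(nmf.components_):
    print("Topic: {}".format(i))
    # Caution: this slice returns the words ranked 2 to 11, in ascending order of weight, so each
    # topic's highest-weighted word is skipped; topic.argsort()[:-n_top_words - 1:-1] gives the top ten.
    indices = topic.argsort()[-n_top_words-1:-1]
    top_words = [feature_names[ind] for ind in indices]
    print(" ".join(top_words), "\n")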

Topic: 0
working got amazon months money don buy bought unit time 

Topic: 1
zoom wide nikon sharp light hood 50mm focus lenses canon 

Topic: 2
volume music headphone comfortable head bass pair ears ear sound 

Topic: 3
video needed length extension modem belkin signal connect monitor tv 

Topic: 4
picture remote flash battery batteries canon cameras use digital pictures 

Topic: 5
players dvds tv unit discs disc sony play cd player 

Topic: 6
bought perfect use value need worked highly easy recommend price 

Topic: 7
button hand mice used buttons microsoft trackball use logitech keyboard 

Topic: 8
excellent products tech advertised service purchase does support amazon recommend 

Topic: 9
better volume set bose surround sub bass wire speaker sound 

Topic: 10
sd drivers software quot sandisk cf install memory windows cards 

Topic: 11
tech link firmware internet support connection network netgear linksys wireless 

Topic: 12
expensive protection quality lenses protect hoya tiffen gl