In [14]:
%logstop
%logstart -rtq ~/.logs/nlp.py append
%matplotlib inline
import matplotlib
import seaborn as sns
sns.set()
matplotlib.rcParams['figure.dpi'] = 144

In [15]:
from static_grader import grader

# NLP Miniproject

## Introduction

The objective of this miniproject is to gain experience with natural language processing and with using text data to train a machine learning model that makes predictions. For the miniproject, we will be working with product review text from Amazon. The reviews cover only products in the "Electronics" category. The goal is to train a model to predict the rating, which ranges from 1 to 5 stars.

## Scoring

For most of the questions, you will be asked to submit the `predict` method of your trained model to the grader. The grader will use the passed `predict` method to evaluate how your model performs on a test set with respect to a reference model. The grader uses the [R<sup>2</sup>-score](https://scikit-learn.org/stable/modules/model_evaluation.html#r2-score) for model evaluation. If your model performs better than the reference solution, then you can score higher than 1.0. For the last question, you will submit the result of an analysis and your passed answer will be compared directly to the reference solution.
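As a reminder of what the R<sup>2</sup>-score measures, here is a minimal sketch in plain Python. The helper name `r2_score_sketch` and the toy ratings are illustrative assumptions; the grader's comparison against the reference model is not reproduced here.

```python
def r2_score_sketch(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy ratings (hypothetical values, not taken from the data set).
ratings_true = [5.0, 1.0, 4.0, 5.0, 3.0]

# A model that predicts every rating exactly.
perfect = list(ratings_true)

# A model that always predicts the mean rating.
mean_rating = sum(ratings_true) / len(ratings_true)
baseline = [mean_rating] * len(ratings_true)

print(r2_score_sketch(ratings_true, perfect))
print(r2_score_sketch(ratings_true, baseline))
```

A perfect model scores 1, while a model that always predicts the mean rating scores 0, so the R<sup>2</sup> value says how much better a model is than that constant baseline.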

## Downloading and loading the data

The data set is available on Amazon S3 and comes as a compressed file where each line is a JSON object. To load the data set, we will need to use the `gzip` library to open the file and decode each JSON into a Python dictionary. In the end, we have a list of dictionaries, where each dictionary represents an observation.

In [16]:
%%bash
mkdir data
wget http://dataincubator-wqu.s3.amazonaws.com/mldata/amazon_electronics_reviews_training.json.gz -nc -P ./data

mkdir: cannot create directory ‘data’: File exists
File ‘./data/amazon_electronics_reviews_training.json.gz’ already there; not retrieving.



In [17]:
import gzip
import ujson as json

with gzip.open("data/amazon_electronics_reviews_training.json.gz", "r") as f:                                  
    data = [json.loads(line) for line in f]

In [14]:
data[0]
# 'reviewText' is the key corresponding to the actual document

{'reviewerID': 'A238V1XTSK9NFE',
 'asin': 'B00004VX3T',
 'reviewerName': 'Andrew Lynn',
 'helpful': [2, 2],
 'reviewText': "I bought this mouse to use with my laptop because I don't like those little touchpads.  I could not be happier.Since it's USB, I can plug it in with the computer already on and expect it to work automatically.  Since it's optical (the new kind, not to be confused with the old Sun optical mice that required a special checkered mouse pad) it works on most surfaces, including my pant legs, my couch, and random tables that I put my laptop down on.  It's also light and durable, features that help with portability.The wheel is surprisingly useful.  In addition to scrolling, it controls zoom and pan in programs like Autocad and 3D Studio Max.  I can no longer bear using either of these programs without it.One complaint - the software included with the Internet navigation features is useless.  Don't bother installing it if you have a newer Windows version that automatical

The ratings are stored under the key `"overall"`. You should create an array of the ratings for each review, preferably using a list comprehension.

In [18]:
ratings = [x['overall'] for x in data]

In [13]:
ratings[:10]

[5.0, 1.0, 4.0, 5.0, 3.0, 5.0, 3.0, 5.0, 4.0, 5.0]

**Note**, the test set used by the grader is in the same format as `data`, a list of dictionaries. Your trained model needs to accept data in the same format. Thus, you should use `Pipeline` when constructing your model so that all necessary transformations are encapsulated in a single estimator object.

## Question 1: Bag of words model

Construct a machine learning model trained on word counts using the bag of words algorithm. Remember, the bag of words is implemented with `CountVectorizer`. Some things you should consider:

* The reference solution uses a linear model and you should as well; use either `Ridge` or `SGDRegressor`.
* The text review is stored in the key `"reviewText"`. You will need to construct a custom transformer to extract the value of this key. It will be the first step in your pipeline.
* Consider what hyperparameters you will need to tune for your model.
* Subsampling the training data will shorten training times, which will be helpful when determining the best hyperparameters to use. Note, your final model will perform best if it is trained on the full data set.
* Passing a stop word list to the vectorizer, so that stop words are removed, may help with performance.
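One way to subsample while keeping each review paired with its rating is sketched below. The helper `subsample`, the fraction, and the toy records are assumptions for illustration; only the record shape (a dictionary with `"reviewText"` and `"overall"`) follows the data set.

```python
import random


def subsample(data, targets, fraction, seed=0):
    """Return a random subset of (data, targets), keeping pairs aligned."""
    rng = random.Random(seed)
    n_keep = max(1, int(len(data) * fraction))
    keep = rng.sample(range(len(data)), n_keep)
    return [data[i] for i in keep], [targets[i] for i in keep]


# Toy records in the same shape as the review dictionaries.
toy_data = [
    {"reviewText": f"review {i}", "overall": float(i % 5 + 1)}
    for i in range(100)
]
toy_ratings = [row["overall"] for row in toy_data]

small_data, small_ratings = subsample(toy_data, toy_ratings, fraction=0.1)
print(len(small_data), len(small_ratings))
```

Tuning on the small subset and then refitting once on the full data with the chosen hyperparameters keeps the search cheap without giving up the final model quality.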

In [19]:
from sklearn.base import BaseEstimator, TransformerMixin

In [20]:
# create custom transformer to extract 'reviewText' key's value

class KeySelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key
        
    def fit(self, X, y=None):
        return self  # nothing to learn; the transformer is stateless
    
    def transform(self, X):
        return [row[self.key] for row in X]

In [21]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from spacy.lang.en import STOP_WORDS
import numpy as np

In [37]:
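# Note: newer scikit-learn releases validate `stop_words` and expect a list;
# there, pass list(STOP_WORDS) instead of the set.
bag_of_words_model = Pipeline([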
    ('get_text', KeySelector('reviewText')),
    ('vectorizer', CountVectorizer(stop_words=STOP_WORDS)),
    ('ridge', GridSearchCV(Ridge(),
                           param_grid={'alpha': np.logspace(2, 3, 4)},
                           cv=5, n_jobs=2, verbose=1))
])

# NLP models take long to train because the feature matrices are huge

In [38]:
bag_of_words_model.fit(data, ratings)

# 2.4 mins

Fitting 5 folds for each of 4 candidates, totalling 20 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  20 out of  20 | elapsed:  2.4min finished


Pipeline(memory=None,
         steps=[('get_text', KeySelector(key='reviewText')),
                ('vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words={'a', 'about', 'above', 'across',
                                             'after',...
                 GridSearchCV(cv=5, error_score='raise-deprecating',
                              estimator=Ridge(alpha=1.0, copy_X=True,
                                              fit_intercept=True, max_iter=None,
                                              normalize=False,
                                              random_sta

In [39]:
bag_of_words_model.score(data, ratings)

# low R^2 (calculated on train data) but high grader score - does not overfit!

0.4115630407078624

In [45]:
bag_of_words_model.named_steps['ridge'].best_params_

{'alpha': 215.44346900318845}

In [46]:
np.logspace(2, 3, 4)

# best alpha is not at an edge of the grid, so there is no need to extend the range
# (a more granular grid is an option on faster hardware)

array([ 100.        ,  215.443469  ,  464.15888336, 1000.        ])
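The edge check used above can be written down explicitly. The sketch below rebuilds the same log-spaced grid in plain Python and tests whether the selected `alpha` sits on a boundary; the helper names `logspace` and `on_grid_edge` are hypothetical, not part of NumPy or scikit-learn.

```python
import math


def logspace(start, stop, num):
    """Return `num` values evenly spaced on a log10 scale from 10**start to 10**stop."""
    if num == 1:
        return [10.0 ** start]
    step = (stop - start) / (num - 1)
    return [10.0 ** (start + i * step) for i in range(num)]


def on_grid_edge(best, grid, rel_tol=1e-9):
    """True if `best` equals the smallest or largest value in the grid."""
    return (math.isclose(best, min(grid), rel_tol=rel_tol)
            or math.isclose(best, max(grid), rel_tol=rel_tol))


grid = logspace(2, 3, 4)
best_alpha = 215.44346900318845  # value reported by best_params_ above

print(grid)
print(on_grid_edge(best_alpha, grid))
```

If the best value did land on an edge, the usual next step would be to shift or widen the exponent range and search again.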

In [41]:
cv = bag_of_words_model.named_steps['vectorizer']
cv.vocabulary_
# keyword: column index

{'bought': 13848,
 'mouse': 45127,
 'use': 71977,
 'laptop': 40081,
 'don': 24103,
 'like': 41013,
 'little': 41272,
 'touchpads': 69241,
 'happier': 33475,
 'usb': 71935,
 'plug': 51879,
 'computer': 18659,
 'expect': 27768,
 'work': 75296,
 'automatically': 11105,
 'optical': 48484,
 'new': 46574,
 'kind': 39432,
 'confused': 18956,
 'old': 48057,
 'sun': 65568,
 'mice': 43803,
 'required': 57296,
 'special': 63371,
 'checkered': 16771,
 'pad': 49568,
 'works': 75361,
 'surfaces': 65866,
 'including': 36405,
 'pant': 49805,
 'legs': 40465,
 'couch': 19963,
 'random': 55292,
 'tables': 66617,
 'light': 40910,
 'durable': 24891,
 'features': 28848,
 'help': 34258,
 'portability': 52362,
 'wheel': 74420,
 'surprisingly': 65929,
 'useful': 72014,
 'addition': 7662,
 'scrolling': 59913,
 'controls': 19519,
 'zoom': 76572,
 'pan': 49724,
 'programs': 53688,
 'autocad': 11058,
 '3d': 3574,
 'studio': 65099,
 'max': 43015,
 'longer': 41550,
 'bear': 12367,
 'complaint': 18482,
 'software': 6
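
To make the token-to-column mapping concrete, here is a small plain-Python sketch of the bag of words idea. It is an illustration under simplifying assumptions (no stop word removal, dense lists instead of a sparse matrix), and the function names are hypothetical; the regular expression mirrors the default token pattern of `CountVectorizer`, which keeps tokens of two or more word characters.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def build_vocabulary(docs):
    """Map each distinct token to a column index, in alphabetical order."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def count_vector(doc, vocabulary):
    """Return the row of word counts for one document."""
    vec = [0] * len(vocabulary)
    for tok, n in Counter(tokenize(doc)).items():
        if tok in vocabulary:  # unseen tokens are ignored
            vec[vocabulary[tok]] = n
    return vec


docs = ["The mouse works. The wheel works too.", "The wheel broke."]
vocabulary = build_vocabulary(docs)

print(vocabulary)
print([count_vector(doc, vocabulary) for doc in docs])
```

Each document becomes one row and each vocabulary entry one column, which is why the real feature matrix has tens of thousands of columns.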

In [40]:
grader.score.nlp__bag_of_words_model(bag_of_words_model.predict)

Your score:  1.0833511525800792


## Question 2: Normalized model

Ridge is a linear model that benefits from normalized inputs.

Raw counts are less effective than normalized counts, since a long review produces larger counts than a short one regardless of its content. There are several ways to normalize raw counts: the `HashingVectorizer` class has the keyword `norm`, and the `TfidfTransformer` and `TfidfVectorizer` classes apply tf-idf weighting to the counts. Apply normalization to your model to improve performance.
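
The two normalizations mentioned above can be sketched in plain Python. This is a minimal illustration on a toy count matrix, not the scikit-learn implementation; the function names are hypothetical, and the idf formula `ln((1 + n) / (1 + df)) + 1` is the smoothed variant that scikit-learn documents for `smooth_idf=True`.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def smoothed_idf(count_rows):
    """Inverse document frequency per term: ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    return [math.log((1 + n_docs) / (1 + d)) + 1.0 for d in df]


def tfidf(count_rows):
    """Weight counts by idf, then L2-normalize each document row."""
    idf = smoothed_idf(count_rows)
    return [l2_normalize([c * w for c, w in zip(row, idf)]) for row in count_rows]


# Toy counts: rows are documents, columns are terms.
counts = [
    [3, 0, 1],
    [6, 0, 2],  # same word proportions as the first row, twice as long
    [0, 2, 1],
]

print(l2_normalize(counts[0]))
print(l2_normalize(counts[1]))
print(smoothed_idf(counts))
print(tfidf(counts))
```

After L2 normalization the first two rows coincide, so document length no longer dominates, and the idf weights shrink the influence of the term that appears in every document.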

In [22]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [13]:
normalized_model = Pipeline([
    ('get_text', KeySelector('reviewText')),
    ('vectorizer', HashingVectorizer(stop_words=STOP_WORDS, norm='l2')), # default is norm=l2
    ('ridge', GridSearchCV(Ridge(),
                           param_grid={'alpha': np.logspace(-2,1,5)},
                           cv=5, n_jobs=-2, verbose=1))
])

In [65]:
normalized_model.fit(data, ratings)

# l2 takes 7.0 mins; l1 takes 6.5 mins but same score

Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=-2)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=-2)]: Done  25 out of  25 | elapsed:  6.5min finished


Pipeline(memory=None,
         steps=[('get_text', KeySelector(key='reviewText')),
                ('vectorizer',
                 HashingVectorizer(alternate_sign=True, analyzer='word',
                                   binary=False, decode_error='strict',
                                   dtype=<class 'numpy.float64'>,
                                   encoding='utf-8', input='content',
                                   lowercase=True, n_features=1048576,
                                   ngram_range=(1, 1), norm='l2',
                                   preprocessor=None,
                                   stop_words={'a', 'about', 'above', 'a...
                 GridSearchCV(cv=5, error_score='raise-deprecating',
                              estimator=Ridge(alpha=1.0, copy_X=True,
                                              fit_intercept=True, max_iter=None,
                                              normalize=False,
                                              random_st

In [None]:
normalized_model.score(data, ratings)

In [69]:
normalized_model.named_steps['ridge'].best_params_ # interior of the grid, not at an edge

{'alpha': 1.7782794100389228}

In [70]:
grader.score.nlp__normalized_model(normalized_model.predict)

Your score:  0.9996650827055537


## Question 3: Bigrams model

The model performance may increase when including additional features generated by counting bigrams. Add bigrams to your model. With more features, the risk of overfitting increases, so make sure you minimize overfitting as much as possible.
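
As an illustration of what `ngram_range=(1, 2)` adds, the sketch below generates unigrams and bigrams from a short text in plain Python. The function `word_ngrams` is a hypothetical helper, and the tokenization is a simplified stand-in for what the vectorizers do.

```python
import re


def word_ngrams(text, ngram_range=(1, 2)):
    """Return all word n-grams with n between the two bounds, inclusive."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    lowest, highest = ngram_range
    grams = []
    for n in range(lowest, highest + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


print(word_ngrams("Not good at all"))
```

A bigram such as "not good" carries a sentiment that neither "not" nor "good" conveys alone, which is why bigrams can help, at the cost of many more features.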

In [11]:
bigrams_model = Pipeline([
    ('get_text', KeySelector('reviewText')),
    ('vectorizer', HashingVectorizer(stop_words=STOP_WORDS, ngram_range=(1,2))), # single AND pairs of words
    ('ridge', Ridge(alpha=0.5))  # Ridge has no `verbose` parameter, so it is not passed
])

In [12]:
bigrams_model.fit(data, ratings)

Pipeline(memory=None,
         steps=[('get_text', KeySelector(key='reviewText')),
                ('vectorizer',
                 HashingVectorizer(alternate_sign=True, analyzer='word',
                                   binary=False, decode_error='strict',
                                   dtype=<class 'numpy.float64'>,
                                   encoding='utf-8', input='content',
                                   lowercase=True, n_features=1048576,
                                   ngram_range=(1, 2), norm='l2',
                                   preprocessor=None,
                                   stop_words={'a', 'about', 'above', 'a...
                                               'also', 'although', 'always',
                                               'am', 'among', 'amongst',
                                               'amount', 'an', 'and', 'another',
                                               'any', 'anyhow', 'anyone',
                                 

In [13]:
bigrams_model.score(data, ratings)

0.8365572366513826

In [14]:
grader.score.nlp__bigrams_model(bigrams_model.predict)

Your score:  1.0129273558329073


## Question 4: Polarity analysis

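This question differs from Questions 1-3: instead of predicting the star rating from the review text, we use a naive Bayes classifier (which assumes the words are conditionally independent given the class) to find the most polar words.

$$P(\text{hypothesis } | \text{ data}) = \frac{P(\text{data } | \text{ hypothesis}) \, P(\text{hypothesis})}{P(\text{data})} \quad \text{(Bayes' theorem)}$$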
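
A short numeric example may help fix the idea. The probabilities below are made up for illustration and do not come from the review data; the sketch applies Bayes' theorem to get the probability that a review is negative given that it contains a particular word.

```python
def posterior(prior, likelihood, evidence):
    """Bayes' theorem: P(H | D) = P(D | H) * P(H) / P(D)."""
    return likelihood * prior / evidence


# Hypothetical numbers, chosen only for illustration.
p_negative = 0.2                # P(negative)
p_word_given_negative = 0.05    # P(word | negative)
p_word_given_positive = 0.001   # P(word | positive)

# Law of total probability gives P(word).
p_word = (p_word_given_negative * p_negative
          + p_word_given_positive * (1 - p_negative))

p_negative_given_word = posterior(p_negative, p_word_given_negative, p_word)
print(p_negative_given_word)
```

Even though only a fifth of the hypothetical reviews are negative, seeing the word moves the probability of "negative" above 0.9, because the word is far more likely under the negative class.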

Let's derive some insight from our analysis. We want to determine the most polarizing words in the corpus of reviews. In other words, we want to identify words that strongly signal that a review is either positive or negative. For example, we understand a word like "terrible" will mostly appear in negative rather than positive reviews. The naive Bayes model calculates probabilities such as $P(\text{negative } | \text{ 'terrible'})$, the probability the review is negative given the word "terrible" appears in the text. Using these probabilities, we can derive a polarity score for each counted word,

$$
\text{polarity} =  \log\left(\frac{P(\text{word } | \text{ positive})}{P(\text{word } | \text{ negative})}\right).
$$ 

The polarity analysis is an example where a simpler model offers more interpretability than a more complicated model. For this question, you are asked to determine the twenty-five words with the largest positive polarity **and** the twenty-five words with the largest negative polarity, for a total of fifty words. For this analysis, you should:

1. Use the naive Bayes model, `MultinomialNB`.
1. Use tf-idf weighting.
1. Remove stop words.

A trained naive Bayes model stores the log of the probabilities in the attribute `feature_log_prob_`. It is a NumPy array of shape (number of classes, number of features). You will need the mapping from feature index to word. For this problem, you will use a different data set; it has been processed to include only reviews with one and five stars. You can download it below.
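
The computation described above can be sketched on a toy example in plain Python. The vocabulary and probabilities are invented for illustration, and the row order (negative class first, positive class second) is an assumption that mirrors how sorted class labels 1.0 and 5.0 would be laid out.

```python
import math

# Hypothetical vocabulary and per-class word probabilities (each row sums to 1).
vocab = ["great", "terrible", "cable", "refund"]
p_word_given_negative = [0.05, 0.40, 0.30, 0.25]
p_word_given_positive = [0.50, 0.02, 0.38, 0.10]

# Same layout as a (number of classes, number of features) array of log probabilities.
log_probs = [
    [math.log(p) for p in p_word_given_negative],  # row 0: negative class
    [math.log(p) for p in p_word_given_positive],  # row 1: positive class
]

# polarity = log(P(word | positive) / P(word | negative))
#          = log P(word | positive) - log P(word | negative)
polarity = [pos - neg for neg, pos in zip(log_probs[0], log_probs[1])]

# Sort feature indices from most negative to most positive polarity.
order = sorted(range(len(vocab)), key=lambda i: polarity[i])

k = 1  # number of words to keep at each extreme
most_negative = [vocab[i] for i in order[:k]]
most_positive = [vocab[i] for i in order[-k:]]

print(most_negative, most_positive)
```

With the real model, the same steps apply to the trained array and the vectorizer's index-to-word mapping, with `k` set to 25.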

In [23]:
%%bash
wget http://dataincubator-wqu.s3.amazonaws.com/mldata/amazon_one_and_five_star_reviews.json.gz -nc -P ./data

# includes only reviews with 1 or 5 star ratings - only concerned with most polar words!

File ‘./data/amazon_one_and_five_star_reviews.json.gz’ already there; not retrieving.



To avoid memory issues, we can delete the older data.

In [24]:
del data, ratings

In [25]:
import numpy as np
from sklearn.naive_bayes import MultinomialNB

with gzip.open("data/amazon_one_and_five_star_reviews.json.gz", "r") as f:
    data_polarity = [json.loads(line) for line in f] # subset of 'data' with only 1 or 5 stars

ratings = [x['overall'] for x in data_polarity]

In [26]:
data_polarity[:2]
# 'overall' = rating
# 'reviewText' = document

[{'reviewerID': 'A31RSJTGLVV3TR',
  'asin': 'B000093UDP',
  'reviewerName': 'T. Wayne',
  'helpful': [11, 12],
  'reviewText': "This worked perfectly for about 8 rewinds.  But once it eats one of your precious tapes you'll probably be as dissapointed as I was.  Buyer beware... these rewinders are mostly the same look; but marketed by different companies.  They all eat tape.",
  'overall': 1.0,
  'summary': "It's a super tape-eater",
  'unixReviewTime': 1210550400,
  'reviewTime': '05 12, 2008'},
 {'reviewerID': 'A2Y739CRM15WDL',
  'asin': 'B00005MNSR',
  'reviewerName': 'Sires "I like mysteries (particularly British...',
  'helpful': [0, 0],
  'reviewText': "This was a choice in my gold box or I might not have bought it. I would have been missing out on a very good monitor if I hadn't.I just bought a new 2.53 gigahertz multi media computer running XP home edition and wanted a monitor with a small footprint but still able to handle graphics intense functions.  Set up was a breeze.  I ju

In [32]:
pipe = Pipeline([
    ('get_text', KeySelector('reviewText')),
    ('vectorizer', TfidfVectorizer(stop_words=STOP_WORDS)),
    ('estimator', MultinomialNB())
])

In [33]:
pipe.fit(data_polarity, ratings) 

Pipeline(memory=None,
         steps=[('get_text', KeySelector(key='reviewText')),
                ('vectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words={'a', 'abou...
                                             'also', 'although', 'always', 'am',
                                             'among', 'amongst', 'amount', 'an',
                                             'and', 'another', 'any', 'anyhow',
                                             'anyone', 'anything', 'anyway',
                                 

In [34]:
nb = pipe.named_steps['estimator']

In [39]:
# log of the probabilities
log_probs = nb.feature_log_prob_
log_probs

array([[ -8.64595801,  -9.6369024 , -10.86439331, ..., -10.19829036,
        -10.71077731, -10.62308687],
       [ -9.35207212,  -9.30288304, -10.57273656, ..., -10.81783738,
        -10.81783738, -10.81783738]])

In [38]:
index_to_token = pipe.named_steps['vectorizer'].get_feature_names()
# scikit-learn >= 1.0 provides get_feature_names_out(); get_feature_names() was removed in 1.2
index_to_token[2000:2010]
# maps index to word

['acoustic',
 'acoustical',
 'acoustically',
 'acoustics',
 'acoustimass',
 'acquainted',
 'acquire',
 'acquired',
 'acquiring',
 'acquisition']

In [41]:
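log_probs.shape # (no. classes, no. features)
# rows follow nb.classes_, which is sorted: [1.0, 5.0]
# row 0 is negative (1 star)
# row 1 is positive (5 stars)

(2, 25422)

In [42]:
# log of ratio is same as subtraction of logs
# note: row 0 - row 1 is log(P(word | negative) / P(word | positive)),
# the negative of the polarity defined above; the sign reverses the ordering
# but leaves the set of 50 most extreme words unchanged
polarity = log_probs[0,:] - log_probs[1,:]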

In [45]:
index = np.argsort(polarity) # order
index
# index 11288 gives word of lowest value; 18497 is index of highest

array([11288,  3610, 17646, ..., 19021, 24531, 18497])

In [67]:
extreme_index = np.hstack((index[:25], index[-25:]))
extreme_index
# lowest and highest 25 indices

array([11288,  3610, 17646, 16559, 14790, 17123,  2511, 21610, 13921,
       11980,  8880,  4036, 16917,   524,  5010, 10947,  3303, 17124,
        7961,  6376, 16718, 22331,  4353,  9267, 18541, 18509, 22636,
        7418, 18051, 21425, 23709, 11459,  3304, 23418, 17082,  3776,
        6878, 23073, 25048, 25054, 23902, 10275, 19023, 22407, 12809,
       25052, 19024, 19021, 24531, 18497])

In [68]:
top_50 = [index_to_token[i] for i in extreme_index]

In [69]:
grader.score.nlp__most_polar(top_50)

Your score:  1.0


## Question 5: Topic modeling [optional]; unsupervised!


### Ungraded:  Here's some text - what are people talking about in this text?

Topic modeling is the task of determining the key topics or themes in a corpus. With respect to machine learning, topic modeling is an unsupervised technique. One way to uncover the main topics in a corpus is to use [non-negative matrix factorization](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html). For this question, use non-negative matrix factorization to determine the **top ten words for the first twenty topics.** You should submit your answer as a list of lists. What topics exist in the reviews?

In [70]:
from sklearn.decomposition import NMF
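
To show the idea behind non-negative matrix factorization without any dependencies, here is a plain-Python sketch using multiplicative updates on a tiny document-term matrix. It is an illustration under stated assumptions, not scikit-learn's `NMF` implementation: the helper names, the toy vocabulary, the toy matrix, and the iteration count are all hypothetical.

```python
import random


def transpose(A):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    cols_B = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols_B] for row in A]


def nmf(V, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factor V (docs x terms) into W (docs x topics) and H (topics x terms), all >= 0."""
    rng = random.Random(seed)
    n_docs, n_terms = len(V), len(V[0])
    W = [[rng.random() for _ in range(n_topics)] for _ in range(n_docs)]
    H = [[rng.random() for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * n / (d + eps) for h, n, d in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * n / (d + eps) for w, n, d in zip(wr, nr, dr)]
             for wr, nr, dr in zip(W, num, den)]
    return W, H


def top_words(H, vocab, n_words):
    """For each topic (row of H), list the words with the largest weights."""
    return [
        [vocab[j] for j in sorted(range(len(vocab)), key=lambda j: -row[j])[:n_words]]
        for row in H
    ]


# Toy document-term matrix: rows are documents, columns follow `vocab`.
vocab = ["battery", "charge", "screen", "pixel"]
V = [
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 2.0],
    [0.0, 0.0, 2.0, 3.0],
]

W, H = nmf(V, n_topics=2)
topics = top_words(H, vocab, n_words=2)
print(topics)
```

The result is a list of lists, one list of words per topic, which is the shape the question asks for. On the real corpus, the document-term matrix would come from a vectorizer and the factorization from `NMF`, with twenty topics and ten words each.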
 