# Using Sentiment Analysis to Predict the Number of Stars in Amazon Reviews
If you want to know how your customers feel about your products, you could read thousands of individual reviews or surveys, but how would you summarize what you learned? Alternatively, you could train a sentiment analysis model to score the data and use its output to summarize how your customers feel.

While sentiment analysis typically makes a binary prediction (positive vs. negative) using a classifier such as logistic regression, Amazon reviews provide a finer-grained perspective along a range of scores from 1 star (the most negative) to 5 stars (the most positive). This allows us to use a regression model to predict where along the range of sentiment a particular review falls.

This experiment will also demonstrate how to use `sklearn.model_selection.GridSearchCV` to tune a preprocessing pipeline.

Data set credit: *John Blitzer, Mark Dredze, Fernando Pereira. Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. Association of Computational Linguistics (ACL), 2007.*

In [1]:
from collections import defaultdict
from nltk import download, pos_tag
from nltk.tokenize import wordpunct_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords, wordnet
import string
import re
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
from bs4 import BeautifulSoup
import random

## Preprocessing Parameterization
Just as `GridSearchCV` and `RandomizedSearchCV` can search for optimal model hyperparameters, they can search for optimal preprocessing hyperparameters. Any data transformation class included in an optimizing pipeline must implement the following methods:
+ `fit()`
+ `transform()`
+ `fit_transform()`
+ `get_params()`
+ `set_params()`

By declaring `sklearn.base.BaseEstimator` and `sklearn.base.TransformerMixin` as the base classes for a transformer, you get default implementations of `fit_transform()`, `get_params()`, and `set_params()` for free. The inherited `get_params()` and `set_params()` read parameter names from the `__init__()` signature, so they work without an override as long as each constructor argument is stored on an attribute of the same name; the parameterized transformer in this notebook overrides them explicitly anyway. The default `fit_transform()` should work for all custom transformers. Since most of these transformers only need to implement `transform()`, you can write a `TransformerBase` class that declares the correct base classes and a no-op `fit()` method, leaving `transform()` as the only remaining coding task.
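
As a minimal sketch of what the two base classes provide, the hypothetical `SuffixStripper` below (it is not part of this notebook's pipeline) defines only `__init__()`, `fit()`, and `transform()`. `BaseEstimator` reads parameter names from the `__init__()` signature, so the inherited `get_params()` and `set_params()` already work here without an override.

```python
from sklearn.base import BaseEstimator, TransformerMixin


class SuffixStripper(BaseEstimator, TransformerMixin):
    """Hypothetical transformer, for illustration only."""

    def __init__(self, suffix='!'):
        # Store the argument under the same name so get_params() can find it
        self.suffix = suffix

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        # Return a new list instead of modifying X in place
        return [doc[:-len(self.suffix)]
                if self.suffix and doc.endswith(self.suffix) else doc
                for doc in X]


stripper = SuffixStripper()
print(stripper.get_params())                    # inherited from BaseEstimator
stripper.set_params(suffix='?')                 # inherited from BaseEstimator
print(stripper.fit_transform(['why?', 'ok']))   # inherited from TransformerMixin
```

Because `suffix` is discoverable through `get_params()`, a search object could address it as `<step name>__suffix` once the transformer is placed in a `Pipeline`.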

In [2]:
class TransformerBase(BaseEstimator, TransformerMixin):
    '''
    Provides a no-op fit() method for transformers that only need
    a transform() method
    '''
    
    def __init__(self):
        pass
    
    def fit(self, X, y=None, **fit_params):
        return self

class LowerCaser(TransformerBase):
    
    def transform(self, X, **fit_params):
        for i in range(len(X)):
            X[i] = X[i].lower()
        return X    

class Tokenizer(TransformerBase):

    def transform(self, X, **fit_params):
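        for i in range(len(X)):
            # Drop multi-character tokens that end in a period (e.g. '...').
            # Build a new list rather than removing while iterating, which
            # would skip the token that follows each removal.
            X[i] = [tok for tok in wordpunct_tokenize(X[i])
                    if not (tok.endswith('.') and len(tok) > 1)]
        return X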
    
def remove_listed_tokens(X, removal_list):
    '''
    If you immediately remove a token as you iterate forward through a list,
    you skip over the next token. This function instead builds a list of tokens
    to be removed, then removes them at the end.
    
    Parameters
      X - list of lists
      removal_list - string of tokens (e.g., punctuation), or list of strings (e.g., stopwords)
    '''
    for doc in X:
        removals = []
        for tok in doc:
            if tok in removal_list:
                removals.append(tok)
        for p in removals:
            doc.remove(p)
    return X
        
class StopWordRemover(TransformerBase):
    
    def __init__(self):
        download("stopwords")
        
    def transform(self, X, **fit_params):
        return remove_listed_tokens(X, stopwords.words('english'))
    
class Stringizer(TransformerBase):
    def transform(self, X, **fit_params):
        for i in range(len(X)):
            X[i] = ' '.join(X[i])
        return X
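
The pitfall described in the `remove_listed_tokens` docstring is easy to reproduce. The sketch below uses two hypothetical helper functions and a toy token list (neither comes from the notebook) to contrast removing during iteration with building a filtered list.

```python
def remove_while_iterating(tokens, unwanted):
    """Buggy pattern: mutating the list that the for-loop is walking."""
    tokens = list(tokens)  # work on a copy so the caller's list is untouched
    for tok in tokens:
        if tok in unwanted:
            tokens.remove(tok)  # shifts later items left; the next one is skipped
    return tokens


def remove_by_filtering(tokens, unwanted):
    """Safe pattern: build a new list of the tokens to keep."""
    return [tok for tok in tokens if tok not in unwanted]


tokens = ['wow', '!', '?', 'great']
print(remove_while_iterating(tokens, {'!', '?'}))  # ['wow', '?', 'great']
print(remove_by_filtering(tokens, {'!', '?'}))     # ['wow', 'great']
```

The first function leaves `'?'` behind because it directly follows a removed token.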


Once you have tokenized data, you can part-of-speech (POS) tag the tokens and *lemmatize* them. Lemmatizing substitutes a root word for a token; for example, *"running"* becomes *"run"*. It works best when you provide POS tags, but the Penn Treebank-based `nltk.pos_tag` function uses a different set of tags than the `WordNetLemmatizer`. Thus you must define a map between the two tag sets. Not every Treebank tag can be mapped; when the first letter of a token's Treebank tag is not among the map keys, the token cannot be usefully lemmatized and is kept unchanged.
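
As a rough sketch of that mapping, the snippet below uses only the standard library. The single-letter values are the WordNet POS constants exposed by `nltk.corpus.wordnet` (`NOUN`, `VERB`, `ADJ`, `ADV`); the names `TREEBANK_TO_WORDNET` and `to_wordnet_pos` are hypothetical and do not appear in the notebook.

```python
# WordNet POS constants as defined in nltk.corpus.wordnet:
# wordnet.NOUN == 'n', wordnet.VERB == 'v', wordnet.ADJ == 'a', wordnet.ADV == 'r'
TREEBANK_TO_WORDNET = {'N': 'n', 'V': 'v', 'J': 'a', 'R': 'r'}


def to_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to a WordNet POS by its first letter.

    Returns '' when there is no WordNet equivalent.
    """
    return TREEBANK_TO_WORDNET.get(treebank_tag[:1], '')


for tag in ['NNS', 'VBG', 'JJR', 'RB', 'DT']:
    print(tag, '->', repr(to_wordnet_pos(tag)))
```

Tags such as `DT` (determiner) map to the empty string, which is the signal to leave the token as it is.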

In [3]:
class Lemmatizer(TransformerBase):
    
    def __init__(self):
        self.treenet_map = defaultdict(str)
        self.treenet_map['N'] = wordnet.NOUN
        self.treenet_map['R'] = wordnet.ADV
        self.treenet_map['V'] = wordnet.VERB
        self.treenet_map['J'] = wordnet.ADJ
        
    def transform(self, X, **fit_params):
        lemmatizer = WordNetLemmatizer()
        for i in range(len(X)):
            doc = X[i].copy()
            X[i] = [] # a list of lemmatized tokens
            for tok, pos in pos_tag(doc):
                wordnet_pos = self.treenet_map[pos[0]]
                if not wordnet_pos:
                    X[i].append(tok) # use tok without any lemmatizing if not a recognized POS
                else:
                    X[i].append(lemmatizer.lemmatize(tok, wordnet_pos))
        return X

### A Parameterized Preprocessing Class
In general, you would expect punctuation such as periods and quotation marks not to contribute much to sentiment analysis, so you would remove them from text data. However, it is *possible* that an exclamation mark or a question mark might indicate an emotion. How would you know? You run an experiment! The `exceptions` parameter is a string containing any punctuation marks that should be retained (i.e., excepted from the removal process). The class below overrides `get_params()` and `set_params()` explicitly so that the search method (`sklearn.model_selection.GridSearchCV` or `sklearn.model_selection.RandomizedSearchCV`) can run experiments by setting the `exceptions` parameter.
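
To make the effect of `exceptions` concrete, here is a small standard-library sketch. The helper `punctuation_to_remove` is hypothetical, and it filters characters directly rather than using a regular expression.

```python
import string


def punctuation_to_remove(exceptions=''):
    """Return the punctuation characters left over after excepting some."""
    return ''.join(ch for ch in string.punctuation if ch not in exceptions)


removable = punctuation_to_remove('?!')
print('!' in removable, '?' in removable, '.' in removable)  # False False True
```

With `exceptions='?!'`, periods are still removed while question marks and exclamation marks are left in the token stream.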

In [4]:
class PunctuationRemover(TransformerBase):
    
    def __init__(self, exceptions = ''):
        self.exceptions = exceptions
        
    def transform(self, X, **fit_params):
        if not self.exceptions:
            punc = string.punctuation
        else:
            # don't remove these chars; they may convey emotion
            # (re.escape keeps characters such as ']' or '^' from breaking the character class)
            retained_punc = re.compile('[' + re.escape(self.exceptions) + ']')
            punc = retained_punc.sub('', string.punctuation)
        return remove_listed_tokens(X, punc)
    
    def get_params(self, deep=True):
        return {'exceptions': self.exceptions}
    
    def set_params(self, **parameters):
        for parm, value in parameters.items():
            setattr(self, parm, value)
        return self
    


## Establishing a Baseline Model
We first establish a baseline predictor; then we can optimize the preprocessing pipeline and, finally, perform model selection.

In [None]:
base_pipeline = Pipeline([('lower', LowerCaser()),
                          ('tokenize', Tokenizer()),
                          ('lemmatize', Lemmatizer()),
                          ('stopwords', StopWordRemover()),
                          ('punc', PunctuationRemover()),
                          ('stringize', Stringizer()),
                          ('vec', CountVectorizer()),
                          ('model', Lasso())])

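def get_reviews_and_ratings(path):
    # Close the file promptly and name the parser explicitly to avoid
    # BeautifulSoup's "no parser was explicitly specified" warning
    with open(path) as f:
        observations = BeautifulSoup(f.read(), 'html.parser')
    # find_all() is the current name for the older findAll() alias
    reviews = [node.text for node in observations.find_all('review_text')]
    ratings = [float(node.text) for node in observations.find_all('rating')]
    return reviews, ratings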



In [None]:
# load train and test data
positive_path = '/data/electronics/positive.review'
positive_reviews, positive_ratings = get_reviews_and_ratings(positive_path)
negative_path = '/data/electronics/negative.review'
negative_reviews, negative_ratings = get_reviews_and_ratings(negative_path)
test_path = '/data/electronics/unlabeled.review'
test_reviews, test_ratings = get_reviews_and_ratings(test_path)

X = positive_reviews + negative_reviews
Y = positive_ratings + negative_ratings
shuffle_index = list(range(len(X)))
random.shuffle(shuffle_index)
X = list(map(X.__getitem__, shuffle_index))
Y = list(map(Y.__getitem__, shuffle_index))

Using our baseline pipeline, let's establish a baseline model. Let's tune a linear regression model with Lasso (*L1*) regularization so we can get an informative model that's easy to compute.

In [None]:
param_grid = {'model__alpha': [0.03, 0.07, 0.1, 0.2, 0.5, 1.0]}
gs = GridSearchCV(base_pipeline, param_grid, cv = 5)
gs.fit(X, Y)

Let's see what the model tells us:

In [20]:
import numpy as np
best_estimator = gs.best_estimator_ # a Pipeline object
model = best_estimator.named_steps['model'] # a Lasso model
vec = best_estimator.named_steps['vec'] # a CountVectorizer

# Which model hyperparameter got the best cross-validation results?
print("Optimized hyperparameters:" + '=' * 20)
best_parameters = best_estimator.get_params()
for param_name in sorted(param_grid.keys()):
    print("%s: %r" % (param_name, best_parameters[param_name]))
print()

# Linear model coefficients
feature_map = {} # index -> word
for k in vec.vocabulary_: 
    feature_map[vec.vocabulary_[k]] = k
# significant words have non-zero coefficients
significant_word_index = np.where(model.coef_ != 0.)[0]
words_and_coefs = []
for i in significant_word_index[:]:
    words_and_coefs.append((feature_map[i], model.coef_[i]))
sorted_words = sorted(words_and_coefs, key = lambda tup: tup[1])
print("Predictive words and coefficients" + '=' * 20)
for tup in sorted_words:
    print('{:<18}'.format(tup[0]), tup[1])
print()
print("Y intercept:", model.intercept_)

model__alpha: 0.03

return             -0.68033278428
bad                -0.403256279195
waste              -0.403132601845
try                -0.228285563944
send               -0.172963391161
work               -0.170962957798
back               -0.162752200524
poor               -0.152085747623
item               -0.13231479252
even               -0.102432704422
customer           -0.0989887349194
support            -0.094425899632
software           -0.0883674461114
get                -0.0852166447041
buy                -0.0694080124812
would              -0.0527671633476
month              -0.0515337508775
warranty           -0.049498778835
tell               -0.0404522328405
money              -0.0315761809083
card               -0.0297028487524
company            -0.0291649560937
unit               -0.0277561885853
product            -0.0232889177456
another            -0.022519067416
ipod               -0.0155416315244
keyboard           -0.00496139226585
purchase           -0.

In [14]:
# Model accuracy on test data
predictions = gs.predict(test_reviews)
predictions[predictions < 1] = 1.0
predictions[predictions > 5.0] = 5.0
print("mean absolute error of model: %0.3f" % metrics.mean_absolute_error(test_ratings, predictions))

mean absolute error: 1.329


Let's compare the baseline model with predicting the mean of the training set.

In [16]:
mean_prediction = np.mean(Y)
mean_predictions = np.array([mean_prediction for _ in range(len(test_ratings))])
print("mean absolute error of predicting the training mean: %0.3f" % metrics.mean_absolute_error(test_ratings, mean_predictions))

mean absolute error of predicting the training mean: 1.679


The baseline model does not seem like a Kaggle competition winner, but its mean absolute error of 1.329 stars *is* noticeably better than the 1.679 stars of a mildly informed baseline prediction.

### Important Features
According to the model, the most informative **negative** words in Amazon electronics reviews are as follows:

| Lemma | Coefficient |
| --- | --- |
| return | -0.68033278428 |
| bad  | -0.403256279195 |
| waste | -0.403132601845 |
| try | -0.228285563944 |
| send | -0.172963391161 |
| work | -0.170962957798 |
| back | -0.162752200524 |
| poor | -0.152085747623 |

And the most informative **positive** words are:

| Lemma | Coefficient |
| --- | --- |
| highly | 0.362050510071 |
| great | 0.336320124108 |
| excellent | 0.332493163955 |
| price | 0.321441055098 |
| easy |  0.282700532024 |
| perfect | 0.192540350365 |
| well  | 0.149796510378 |
| best |  0.149366061196 |

Most of these are intuitive: adjectives/adverbs like *bad, back,* and *poor* would surely express disappointment, and *highly, great, excellent, easy, perfect, well,* and *best* would express satisfaction. If you are talking about *try*ing your purchase, it probably didn't work, so that would be negative. Ditto for *return, waste* (who wants to waste money?), and *send* (presumably because you had to return it). *Price* is a curious positive word, as not every price is good. It seems happy reviewers are more likely to talk about *price* and dissatisfied reviewers are more likely to talk about *returns*.

Worth noting if you are a product manager: *ipod* has negative connotations, but *speakers* seem to be popular.

### Optimizing Engineered Features
The baseline pipeline removed all punctuation...but did that remove useful information? Would a question mark, an exclamation mark, or a dollar sign indicate sentiment? Let's see if including some combination of them helps to improve the predictive model.
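
The candidate values searched in this experiment are every subset of the three marks. A minimal sketch of generating such a list, where `exception_grid` is a hypothetical helper that does not appear in the notebook:

```python
from itertools import combinations


def exception_grid(marks):
    """Every subset of `marks`, each joined into a single string."""
    return [''.join(combo)
            for size in range(len(marks) + 1)
            for combo in combinations(marks, size)]


print(exception_grid('?$!'))
```

For three marks this yields eight candidates, from the empty string (remove everything) to `'?$!'` (retain all three).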

*Note: if you are re-running this notebook, you must re-load the data. The first data pipeline already munged the data, so you must start over with freshly loaded data.*
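
The reason is that the custom transformers overwrite the elements of the list they are given. The sketch below reproduces the pattern on toy data with a hypothetical `lower_in_place` function and shows that keeping a copy is one way to preserve the raw text.

```python
import copy


def lower_in_place(docs):
    """Mirrors the notebook's style: overwrite each element of the input list."""
    for i in range(len(docs)):
        docs[i] = docs[i].lower()
    return docs


raw = ['GREAT Product', 'Bad CABLE']
snapshot = copy.deepcopy(raw)   # keep a pristine copy before any pipeline sees it

lower_in_place(raw)
print(raw)        # the caller's list has changed
print(snapshot)   # the copy still holds the original text
```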

In [None]:
# Note that we are fixing the alpha parameter of the Lasso model
pipeline = Pipeline([('lower', LowerCaser()),
                     ('tokenize', Tokenizer()),
                     ('lemmatize', Lemmatizer()),
                     ('stopwords', StopWordRemover()),
                     ('punc', PunctuationRemover()),
                     ('stringize', Stringizer()),
                     ('vec', CountVectorizer()),
                     ('model', Lasso(alpha = 0.03))])
param_grid = {'punc__exceptions': ['', '?', '$', '!', '?$', '?!', '$!', '?$!']}
gs = GridSearchCV(pipeline, param_grid, cv = 5)
gs.fit(X, Y)

In [9]:
# Which preprocessing hyperparameter got the best cross-validation results?
best_estimator = gs.best_estimator_ # a Pipeline object
print("Optimized hyperparameters:" + '=' * 20)
best_parameters = best_estimator.get_params()
for param_name in sorted(param_grid.keys()):
    print("%s: %r" % (param_name, best_parameters[param_name]))
print()


punc__exceptions: ''



## The Punctuation Hypothesis Was Wrong!
The data show that adding the most salient punctuation marks to the bag of words does not help to predict Amazon ratings. That conclusion surprises me, but when you work as a data scientist, every day harbors a new surprise.
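
One caveat may be worth checking before accepting this result. `CountVectorizer`'s default `token_pattern` keeps only tokens of two or more word characters, so a standalone `!` or `?` that survives `PunctuationRemover` may still be dropped before it reaches the model. The sketch below, on toy sentences that are not from the data set, shows the default behavior next to an assumed alternative pattern; whether that alternative would change the grid search outcome has not been tested here.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['great product !', 'why ? bad product']

# Default token_pattern is r"(?u)\b\w\w+\b": two or more word characters
default_vec = CountVectorizer().fit(docs)
print(sorted(default_vec.vocabulary_))

# Assumed alternative: also accept the three marks as standalone tokens
punct_vec = CountVectorizer(token_pattern=r"(?u)\b\w\w+\b|[!?$]").fit(docs)
print(sorted(punct_vec.vocabulary_))
```

Only the second vocabulary contains the punctuation marks, so a pattern along these lines would be needed for the retained marks to become features.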

Now that the best configuration of the data pipeline is known, we could go through the process of model selection and tuning in order to reduce the prediction error. However, that is a topic for another Python notebook. I hope you have enjoyed learning along with me!