# Predicting sentiment from product reviews

In [1]:
import graphlab
import math
import string

# Data preparation

We will use a dataset consisting of baby product reviews on Amazon.com.

In [2]:
products = graphlab.SFrame('amazon_baby.gl/')


Now, let us see a preview of what the dataset looks like.

In [3]:
products.head(3)

name,review,rating
Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3.0
Planetwise Wipe Pouch,it came early and was not disappointed. i love ...,5.0
Annas Dream Full Quilt with 2 Shams ...,Very soft and comfortable and warmer than it ...,5.0


## Build the word count vector for each review

We first remove punctuation from every review and then count how often each word occurs.


In [4]:
# A function to remove punctuation from the reviews
# (Python 2 form of str.translate: no translation table, delete the listed characters)
def remove_punctuation(text):
    return text.translate(None, string.punctuation)

# Apply it to all the reviews in the dataset
review_without_punctuation = products['review'].apply(remove_punctuation)

# Count the words in each review
products['word_count'] = graphlab.text_analytics.count_words(review_without_punctuation)

Each entry in the **word_count** column is a dictionary where the key is the word and the value is a count of the number of times the word occurs.
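To make the dictionary format concrete, here is a minimal sketch in plain Python 3 that does not use GraphLab Create. The helper name `count_words_plain` is hypothetical, and lowercasing the text is an assumption of this sketch rather than something the cell above states.

```python
import string
from collections import Counter

def count_words_plain(text):
    """Strip ASCII punctuation, lowercase, and count whitespace-separated words."""
    # In Python 3, the third argument of str.maketrans lists characters to delete.
    table = str.maketrans('', '', string.punctuation)
    cleaned = text.translate(table).lower()
    return dict(Counter(cleaned.split()))

example_counts = count_words_plain("Love it, love it. Not disappointed!")
print(example_counts)
```

Each key is a word and each value is its count, which is the same shape as the entries in the **word_count** column.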

## Extract sentiments

We will **ignore** all reviews with *rating = 3*, since they tend to have a neutral sentiment.

In [5]:
products = products[products['rating'] != 3]

Now, we will assign reviews with a rating of 4 or higher to be *positive* reviews, while the ones with rating of 2 or lower are *negative*. For the sentiment column, we use +1 for the positive class label and -1 for the negative class label.

In [6]:
products['sentiment'] = products['rating'].apply(lambda rating : +1 if rating > 3 else -1)
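The two steps above (drop the neutral reviews, then map ratings to labels) can be sketched in plain Python 3 as follows. The helper name `rating_to_sentiment` and the sample ratings are made up for illustration.

```python
def rating_to_sentiment(rating):
    """Return +1 for ratings above 3, -1 for ratings below 3, None for a neutral 3."""
    if rating == 3:
        return None
    return 1 if rating > 3 else -1

ratings = [3.0, 5.0, 5.0, 2.0, 1.0]   # made-up sample
# Neutral reviews are filtered out before labeling, as in the notebook.
labels = [rating_to_sentiment(r) for r in ratings if r != 3]
print(labels)
```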

## Split data into training and test sets

Let's perform a train/test split with 80% of the data in the training set and 20% of the data in the test set.

In [7]:
train_data, test_data = products.random_split(.8, seed=1)
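The role of the seed can be illustrated with a small plain Python 3 sketch. This is not GraphLab Create's actual splitting algorithm; the helper `random_split_plain` is hypothetical and only shows why fixing the seed makes an approximately 80/20 split reproducible.

```python
import random

def random_split_plain(rows, fraction, seed):
    """Send each row to the first split with probability `fraction`."""
    rng = random.Random(seed)          # a private generator, so the seed fully determines the split
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second

rows = list(range(1000))               # stand-in for the review rows
train_rows, test_rows = random_split_plain(rows, 0.8, seed=1)
print(len(train_rows), len(test_rows))
```

Because every row is assigned independently, the split sizes are only approximately 80% and 20%, but running the function again with the same seed returns exactly the same rows.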

# Train a sentiment classifier with logistic regression

We will now use logistic regression to create a sentiment classifier on the training data. This model will use the column **word_count** as a feature and the column **sentiment** as the target. We will use `validation_set=None` to obtain the same results as everyone else.

In [8]:
sentiment_model = graphlab.logistic_classifier.create(train_data,
                                                      target = 'sentiment',
                                                      features=['word_count'],
                                                      validation_set=None,verbose=False)

Now that we have fitted the model, we can extract the weights (coefficients) as an SFrame as follows:

In [9]:
weights = sentiment_model.coefficients

There are a total of `121713` coefficients in the model. Recall from the lecture that positive weights $w_j$ correspond to weights that cause positive sentiment, while negative weights correspond to negative sentiment. 

Fill in the following block of code to calculate how many *weights* are non-negative (>= 0); this exercise counts those as positive. (**Hint**: The `'value'` column in SFrame *weights* must be >= 0.)

In [10]:
num_positive_weights = len(weights[weights['value']>=0])
num_negative_weights = len(weights)-num_positive_weights
print "Number of positive weights: %s " % num_positive_weights
print "Number of negative weights: %s " % num_negative_weights

Number of positive weights: 68419 
Number of negative weights: 53294 


## Making predictions with logistic regression

Now that a model is trained, we can make predictions on the **test data**. In this section, we will explore this in the context of 3 examples in the test dataset.  We refer to this set of 3 examples as the **sample_test_data**.

In [11]:
sample_test_data = test_data[10:13]
print sample_test_data['rating']

[5.0, 2.0, 1.0]


Let's dig deeper into the first row of the **sample_test_data**. Here's the full review:

In [12]:
sample_test_data[0]['review']

'Absolutely love it and all of the Scripture in it.  I purchased the Baby Boy version for my grandson when he was born and my daughter-in-law was thrilled to receive the same book again.'

That review seems pretty positive.

Now, let's see what the next row of the **sample_test_data** looks like. As we could guess from the sentiment (-1), the review is quite negative.

In [13]:
sample_test_data[1]['review']

'Would not purchase again or recommend. The decals were thick almost plastic like and were coming off the wall as I was applying them! The would NOT stick! Literally stayed stuck for about 5 minutes then started peeling off.'

We will now make a **class** prediction for the **sample_test_data**. The `sentiment_model` should predict **+1** if the sentiment is positive and **-1** if the sentiment is negative. Recall from the lecture that the **score** (sometimes called **margin**) for the logistic regression model  is defined as:

$$
\mbox{score}_i = \mathbf{w}^T h(\mathbf{x}_i)
$$ 

where $h(\mathbf{x}_i)$ represents the features for example $i$. We will write some code to obtain the **scores** using GraphLab Create. For each row, the **score** (or margin) is a real number in the range $(-\infty, \infty)$.

In [14]:
scores = sentiment_model.predict(sample_test_data, output_type='margin')
print scores

[6.734619727058961, -5.73413099676015, -14.668460404468345]
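For a bag-of-words model, the score $\mathbf{w}^T h(\mathbf{x}_i)$ is just a weighted sum of word counts. The plain Python 3 sketch below assumes the intercept is the weight on a constant feature equal to 1; the function name `score` and all weight values are made up and are not the coefficients of `sentiment_model`.

```python
def score(weights, intercept, word_count):
    """Compute w^T h(x) for a bag-of-words feature dictionary."""
    # Words without a learned weight contribute nothing to the score.
    return intercept + sum(weights.get(word, 0.0) * count
                           for word, count in word_count.items())

toy_weights = {'love': 1.5, 'disappointed': -2.0}   # made-up values
toy_intercept = 0.5                                  # made-up value

positive_score = score(toy_weights, toy_intercept, {'love': 2, 'it': 1})
negative_score = score(toy_weights, toy_intercept, {'disappointed': 1, 'not': 1})
print(positive_score, negative_score)
```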


### Predicting sentiment

These scores can be used to make class predictions as follows:

$$
\hat{y} = 
\left\{
\begin{array}{ll}
      +1 & \mathbf{w}^T h(\mathbf{x}_i) > 0 \\
      -1 & \mathbf{w}^T h(\mathbf{x}_i) \leq 0 \\
\end{array} 
\right.
$$

Using scores, write code to calculate $\hat{y}$, the class predictions:

In [17]:
print "Class predictions according to GraphLab Create:" 
y=scores.apply(lambda x: +1 if x>0 else -1)
print(y)

Class predictions according to GraphLab Create:
[1L, -1L, -1L]
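The same thresholding rule can be checked without GraphLab Create. This plain Python 3 sketch reuses the three scores printed earlier in this notebook; the helper name `predict_class` is hypothetical.

```python
# Scores copied from the margin output above.
scores_from_notebook = [6.734619727058961, -5.73413099676015, -14.668460404468345]

def predict_class(score):
    """Predict +1 for a strictly positive score, -1 otherwise (a score of 0 maps to -1)."""
    return 1 if score > 0 else -1

class_predictions = [predict_class(s) for s in scores_from_notebook]
print(class_predictions)
```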


### Probability predictions

We can calculate the probability predictions from the scores using:
$$
P(y_i = +1 | \mathbf{x}_i,\mathbf{w}) = \frac{1}{1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))}.
$$

Using the variable **scores** calculated previously, write code to calculate the probability that a sentiment is positive using the above formula. For each row, the probabilities should be a number in the range **[0, 1]**.

In [19]:
import numpy as np
prob_prediction=1/(1+np.exp(-scores))
print(prob_prediction)

[  9.98812385e-01   3.22326818e-03   4.26155800e-07]


In [20]:
print "Probability predictions according to GraphLab Create:"
print sentiment_model.predict(sample_test_data, output_type='probability')

Probability predictions according to GraphLab Create:
[0.9988123848377194, 0.003223268181801026, 4.2615579966561597e-07]
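As a cross-check that needs neither NumPy nor GraphLab Create, the sigmoid can be written in plain Python 3. The sketch below uses a common overflow-safe formulation, which is a choice made here and not something the notebook itself does; the name `sigmoid` is illustrative.

```python
import math

def sigmoid(score):
    """Logistic function 1 / (1 + exp(-score)), arranged to avoid overflow."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    # For negative scores, exp(score) is at most 1, so this form cannot overflow.
    exp_score = math.exp(score)
    return exp_score / (1.0 + exp_score)

# Scores copied from the margin output earlier in this notebook.
scores_from_notebook = [6.734619727058961, -5.73413099676015, -14.668460404468345]
probabilities = [sigmoid(s) for s in scores_from_notebook]
print(probabilities)
```

The two branches are algebraically identical; the split only matters for very negative scores, where `math.exp(-score)` in the direct formula would raise an `OverflowError`.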


# Find the most positive (and negative) review

We now turn to examining the full test dataset, **test_data**, and use GraphLab Create to form predictions on all of the test data points for faster performance.

Using the `sentiment_model`, find the 20 reviews in the entire **test_data** with the **highest probability** of being classified as a **positive review**. We refer to these as the "most positive reviews."

In [23]:
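test_prob=sentiment_model.predict(test_data,output_type='probability')
# topk_index flags the rows that belong to the top k; with reverse=True it
# flags the 20 LOWEST probabilities, so this first list holds the most negative reviews
t=test_prob.topk_index(20,reverse=True)
tf=graphlab.SFrame(t)
tf['index']=np.arange(0,len(tf))

# Keep only the flagged rows and recover their positions in test_data
ind=(tf[tf['X1']>0])
ind1=ind['index'][:]

print(ind1[0])
find=ind1.astype(int)
print(test_data['name'][find[0]])

for i in range(len(ind1)):
    print(test_data['name'][find[i]])

print('Negative')
# The 20 lowest-probability reviews again, now sorted from the lowest probability upwards
d=graphlab.SFrame(test_prob)
d['index']=np.arange(0,len(test_prob))
sort_d=d.sort('X1',ascending=True)[0:20]
find_negative=sort_d['index'].astype(int)
for i in range(len(find_negative)):
    print(test_data['name'][find_negative[i]])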

1469
Fisher-Price Royal Potty
Fisher-Price Royal Potty
Fisher-Price Ocean Wonders Aquarium Bouncer
Evenflo Take Me Too Premiere Tandem Stroller - Castlebay
Chicco Cortina KeyFit 30 Travel System in Adventure
Adiri BPA Free Natural Nurser Ultimate Bottle Stage 1 White, Slow Flow (0-3 months)
Safety 1st High-Def Digital Monitor
The First Years 3 Pack Breastflow Bottle, 9 Ounce
Prince Lionheart Warmies Wipes Warmer
Safety 1st Lift Lock and Swing Gate
Peg-Perego Tatamia High Chair, White Latte
Safety 1st Exchangeable Tip 3 in 1 Thermometer
Cloth Diaper Sprayer--styles may vary
Nuby Natural Touch Silicone Travel Infa Feeder, Colors May Vary, 3 Ounce
Snuza Portable Baby Movement Monitor
The First Years True Choice P400 Premium Digital Monitor, 2 Parent Unit
Valco Baby Tri-mode Twin Stroller EX- Hot Chocolate
Levana Safe N'See Digital Video Baby Monitor with Talk-to-Baby Intercom and Lullaby Control (LV-TW501)
Jolly Jumper Arctic Sneak A Peek Infant Car Seat Cover Black
Munchkin Nursery Proje

In [22]:
print('Positive')
# Sort by predicted probability, highest first, and keep the top 20
d3=graphlab.SFrame(test_prob)
d3['index']=np.arange(0,len(test_prob))
sort_p=d3.sort('X1',ascending=False)
sort_p=sort_p[0:20]
find_p=sort_p['index'].astype(int)
for i in range(len(find_p)):
    print(test_data['name'][find_p[i]])

Positive
KidCo Magnet Lock Starter Set
Fisher-Price Rainforest Melodies and Lights Deluxe Gym
Baby Jogger City Mini GT Single Stroller, Shadow/Orange
Eddie Bauer Trailmaker Travel System - Sinclair
Baby Jogger City Mini GT Double Stroller, Shadow/Orange
Fisher-Price Grow with Me High Chair, Bunny
Maxi-Cosi Pria 70 with Tiny Fit Convertible Car Seat
Orbit Baby Stroller Travel System G2, Mocha
Regalo Easy Step Walk Thru Gate, White
Baby BeeHinds Magic Alls Minkee All In One Pocket Diaper Snaps - Marine Green Small
Graco SnugRide Click Connect 35 - Lyric
Ikea 36 Pcs Kalas Kids Plastic BPA Free Flatware, Bowl, Plate, Tumbler Set, Colorful
Barin Toys Teething Pendant, Tiny Bird
Cozy Infant Remote Wireless Video Baby Monitor with 3.5-Inch Color LCD Screen, Infrared Optic Night Vision and Remote Camera Pan, Tilt and Zoom, 3.5 Inch
Safety 1st Tot-Lok Four Lock Assembly
Summer Infant Complete Nursery Care Kit
Gerber First Essential Clear View BPA Free Plastic Nurser With Latex Nipple, Assorted 
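Selecting the most positive and most negative reviews amounts to a top-k query over the predicted probabilities. Here is a minimal plain Python 3 sketch using the standard `heapq` module; the helper `top_k_indices`, the product names, and the probabilities are all made up for illustration.

```python
import heapq

def top_k_indices(values, k, largest=True):
    """Return the indices of the k largest (or smallest) values, best first."""
    pick = heapq.nlargest if largest else heapq.nsmallest
    return pick(k, range(len(values)), key=values.__getitem__)

toy_probabilities = [0.91, 0.02, 0.99, 0.50, 0.07]   # made-up predictions
toy_names = ['A', 'B', 'C', 'D', 'E']                # made-up product names

most_positive = [toy_names[i] for i in top_k_indices(toy_probabilities, 2)]
most_negative = [toy_names[i] for i in top_k_indices(toy_probabilities, 2, largest=False)]
print(most_positive, most_negative)
```

Working with indices instead of the values themselves keeps the link back to the original rows, which is what the `index` column does in the cells above.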

## Compute accuracy of the classifier

We will now evaluate the accuracy of the trained classifier. Accuracy is given by


$$
\mbox{accuracy} = \frac{\mbox{# correctly classified examples}}{\mbox{# total examples}}
$$

In [37]:
def get_classification_accuracy(model, data, true_labels):
    # First get the predictions
    pred=model.predict(data)
    # Compute the number of correctly classified examples
    right_count=0
    for i in range(len(pred)):
        if(pred[i]==true_labels[i]):
            right_count+=1
    # Then compute accuracy by dividing num_correct by total number of examples
    accuracy=right_count/float(len(data))
    return accuracy

Now, let's compute the classification accuracy of the **sentiment_model** on the **test_data**.

In [38]:
print "The accuracy of the sentiment model is:",'%.4f' %get_classification_accuracy(sentiment_model, test_data, test_data['sentiment'])

The accuracy of the sentiment model is: 0.9145
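The accuracy formula itself is independent of GraphLab Create. A minimal plain Python 3 sketch, with a hypothetical helper name `accuracy` and made-up labels, looks like this:

```python
def accuracy(predictions, true_labels):
    """Fraction of predictions that equal the corresponding true label."""
    if len(predictions) != len(true_labels):
        raise ValueError("predictions and true_labels must have the same length")
    correct = sum(1 for p, y in zip(predictions, true_labels) if p == y)
    return correct / float(len(true_labels))

toy_accuracy = accuracy([1, -1, -1, 1], [1, -1, 1, 1])   # 3 of 4 agree
print(toy_accuracy)
```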


## Learn another classifier with fewer words

In [40]:
significant_words = ['love', 'great', 'easy', 'old', 'little', 'perfect', 'loves', 
      'well', 'able', 'car', 'broke', 'less', 'even', 'waste', 'disappointed', 
      'work', 'product', 'money', 'would', 'return']

In [43]:
# Remove count for all the words not in the dictionary for significant words
train_data['word_count_subset'] = train_data['word_count'].dict_trim_by_keys(significant_words, exclude=False)
test_data['word_count_subset'] = test_data['word_count'].dict_trim_by_keys(significant_words, exclude=False)

In [44]:
train_data[0]['review'] # What does the data look like now?

'it came early and was not disappointed. i love planet wise bags and now my wipe holder. it keps my osocozy wipes moist and does not leak. highly recommend it.'

The **word_count** column that we had been working with before looks like the following:

In [45]:
print train_data[0]['word_count']

{'and': 3L, 'love': 1L, 'it': 3L, 'highly': 1L, 'osocozy': 1L, 'bags': 1L, 'leak': 1L, 'moist': 1L, 'does': 1L, 'recommend': 1L, 'was': 1L, 'wipes': 1L, 'disappointed': 1L, 'early': 1L, 'not': 2L, 'now': 1L, 'holder': 1L, 'wipe': 1L, 'keps': 1L, 'wise': 1L, 'i': 1L, 'planet': 1L, 'my': 2L, 'came': 1L}


Since we are only working with a subset of these words, the column **word_count_subset** is a subset of the above dictionary. In this example, only 2 `significant words` are present in this review.

In [47]:
print train_data[0]['word_count_subset']

{'love': 1L, 'disappointed': 1L}
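The trimming step is a dictionary filter. The plain Python 3 sketch below reproduces it for the review shown above; `trim_by_keys` is a hypothetical stand-in written for this illustration, not GraphLab Create's `dict_trim_by_keys`.

```python
significant_words = ['love', 'great', 'easy', 'old', 'little', 'perfect', 'loves',
                     'well', 'able', 'car', 'broke', 'less', 'even', 'waste',
                     'disappointed', 'work', 'product', 'money', 'would', 'return']

def trim_by_keys(word_count, keep):
    """Keep only the entries whose key is in `keep`."""
    keep = set(keep)                       # set lookup is O(1) per word
    return {word: count for word, count in word_count.items() if word in keep}

# Word counts copied from the first training review printed above.
word_count = {'and': 3, 'love': 1, 'it': 3, 'highly': 1, 'osocozy': 1, 'bags': 1,
              'leak': 1, 'moist': 1, 'does': 1, 'recommend': 1, 'was': 1, 'wipes': 1,
              'disappointed': 1, 'early': 1, 'not': 2, 'now': 1, 'holder': 1,
              'wipe': 1, 'keps': 1, 'wise': 1, 'i': 1, 'planet': 1, 'my': 2, 'came': 1}

word_count_subset = trim_by_keys(word_count, significant_words)
print(word_count_subset)
```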


## Train a logistic regression model on a subset of data

We will now build a classifier with **word_count_subset** as the feature and **sentiment** as the target. 

In [48]:
simple_model = graphlab.logistic_classifier.create(train_data,
                                                   target = 'sentiment',
                                                   features=['word_count_subset'],
                                                   validation_set=None)
simple_model

Class                          : LogisticClassifier

Schema
------
Number of coefficients         : 21
Number of examples             : 133416
Number of classes              : 2
Number of feature columns      : 1
Number of unpacked features    : 20

Hyperparameters
---------------
L1 penalty                     : 0.0
L2 penalty                     : 0.01

Training Summary
----------------
Solver                         : newton
Solver iterations              : 6
Solver status                  : SUCCESS: Optimal solution found.
Training time (sec)            : 0.8371

Settings
--------
Log-likelihood                 : 44323.7254

Highest Positive Coefficients
-----------------------------
word_count_subset[loves]       : 1.6773
word_count_subset[perfect]     : 1.5145
word_count_subset[love]        : 1.3654
(intercept)                    : 1.2995
word_count_subset[easy]        : 1.1937

Lowest Negative Coefficients
----------------------------
word_count_subset[disappointed] : -2.3551
wo

We can compute the classification accuracy using the `get_classification_accuracy` function you implemented earlier.

In [50]:
print "The accuracy of the selected words model is:", get_classification_accuracy(simple_model, test_data, test_data['sentiment'])

The accuracy of the selected words model is: 0.869300455964


Now, we will inspect the weights (coefficients) of the **simple_model**:

In [51]:
simple_model.coefficients

name,index,class,value,stderr
(intercept),,1,1.2995449552,0.0120888541331
word_count_subset,disappointed,1,-2.35509250061,0.0504149888557
word_count_subset,love,1,1.36543549368,0.0303546295109
word_count_subset,well,1,0.504256746398,0.021381300631
word_count_subset,product,1,-0.320555492996,0.0154311321362
word_count_subset,loves,1,1.67727145556,0.0482328275384
word_count_subset,little,1,0.520628636025,0.0214691475665
word_count_subset,work,1,-0.621700012425,0.0230330597946
word_count_subset,easy,1,1.19366189833,0.029288869202
word_count_subset,great,1,0.94469126948,0.0209509926591


Let's sort the coefficients (in descending order) by the **value** to obtain the coefficients with the most positive effect on the sentiment.

In [52]:
simple_model.coefficients.sort('value', ascending=False).print_rows(num_rows=21)

+-------------------+--------------+-------+-----------------+-----------------+
|        name       |    index     | class |      value      |      stderr     |
+-------------------+--------------+-------+-----------------+-----------------+
| word_count_subset |    loves     |   1   |  1.67727145556  | 0.0482328275384 |
| word_count_subset |   perfect    |   1   |  1.51448626703  |  0.049861952294 |
| word_count_subset |     love     |   1   |  1.36543549368  | 0.0303546295109 |
|    (intercept)    |     None     |   1   |   1.2995449552  | 0.0120888541331 |
| word_count_subset |     easy     |   1   |  1.19366189833  |  0.029288869202 |
| word_count_subset |    great     |   1   |  0.94469126948  | 0.0209509926591 |
| word_count_subset |    little    |   1   |  0.520628636025 | 0.0214691475665 |
| word_count_subset |     well     |   1   |  0.504256746398 |  0.021381300631 |
| word_count_subset |     able     |   1   |  0.191438302295 | 0.0337581955697 |
| word_count_subset |     ol

In [55]:
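# How many words have a non-negative coefficient? (The intercept is excluded.)
sorted_coefficients = simple_model.coefficients.sort('value', ascending=False)
word_coefficients = sorted_coefficients[sorted_coefficients['name'] != '(intercept)']
non_negative_words = word_coefficients['index'][word_coefficients['value'] >= 0]
len(non_negative_words)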

10

# Comparing models

We will now compare the accuracy of the **sentiment_model** and the **simple_model** using the `get_classification_accuracy` method you implemented above.

We compute the classification accuracy of both models on the **train_data** and on the **test_data**:

In [57]:
print "The accuracy of the sentiment model (all words) on training data is:",get_classification_accuracy(sentiment_model,train_data,train_data['sentiment'])
print "The accuracy of the simple model (selected words) on training data is:",get_classification_accuracy(simple_model,train_data,train_data['sentiment'])
print "The accuracy of the sentiment model (all words) on test data is:",get_classification_accuracy(sentiment_model,test_data,test_data['sentiment'])
print "The accuracy of the simple model (selected words) on test data is:",get_classification_accuracy(simple_model,test_data,test_data['sentiment'])

The accuracy of the sentiment model (all words) on training data is: 0.979440247047
The accuracy of the simple model (selected words) on training data is: 0.866815074654
The accuracy of the sentiment model (all words) on test data is: 0.914536837053
The accuracy of the simple model (selected words) on test data is: 0.869300455964


## Baseline: Majority class prediction

It is quite common to use the **majority class classifier** as a baseline (or reference) model for comparison with your classifier model. The majority class classifier predicts the majority class for all data points. At the very least, your model should comfortably beat the majority class classifier; otherwise, the model is (usually) pointless.

What is the majority class in the **train_data**?

In [61]:
num_positive = (test_data['sentiment'] == +1).sum()
num_negative = (test_data['sentiment'] == -1).sum()
print "The accuracy of a majority class classifier will be:",num_positive/float(len(test_data))

The accuracy of a majority class classifier will be: 0.842782577394
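The baseline can be written out explicitly: pick the most frequent label in the training labels, then measure how often that single label is right on the test labels. This plain Python 3 sketch uses a hypothetical helper `majority_class_accuracy` and made-up labels.

```python
from collections import Counter

def majority_class_accuracy(train_labels, test_labels):
    """Return the majority training label and its accuracy on the test labels."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for y in test_labels if y == majority_label)
    return majority_label, hits / float(len(test_labels))

toy_train = [1, 1, 1, -1]        # made-up labels
toy_test = [1, -1, 1, 1, -1]     # made-up labels
baseline_label, baseline_accuracy = majority_class_accuracy(toy_train, toy_test)
print(baseline_label, baseline_accuracy)
```

Any trained classifier should be compared against this number on the same test set.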


## Thank You!