# Week 1: Predicting sentiment from product reviews

The goal of this first notebook is to explore logistic regression and feature engineering using SFrame and scikit-learn.

In this notebook you will use product review data from Amazon.com to predict whether the sentiment expressed in a product review is positive or negative. You will:

* Use SFrames to do some feature engineering.
* Train a logistic regression model to predict the sentiment of product reviews.
* Inspect the weights (coefficients) of a trained logistic regression model.
* Make a prediction (both class and probability) of sentiment for a new product review.
* Given the logistic regression weights, predictors and ground truth labels, write a function to compute the **accuracy** of the model.
* Inspect the coefficients of the logistic regression model and interpret their meanings.
* Compare multiple logistic regression models.

In [1]:
import sframe
import numpy as np  # used below for np.exp, np.array and np.sum
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import linear_model

[INFO] SFrame v1.8.3 started. Logging /tmp/sframe_server_1457531319.log


# Load the Amazon dataset

In [2]:
products = sframe.SFrame('amazon_baby.gl/')

# Perform text cleaning

In [3]:
def remove_punctuation(text):
    import string
    return text.translate(None, string.punctuation)

products['review_clean'] = products['review'].apply(remove_punctuation)
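
The two-argument `text.translate(None, string.punctuation)` call above is the Python 2 `str` API. As a minimal sketch, assuming the notebook were run under Python 3, the same cleaning step could be written with `str.maketrans`; the helper name `remove_punctuation_py3` is hypothetical and not part of the original notebook.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punctuation_py3(text):
    """Hypothetical Python 3 counterpart of remove_punctuation."""
    return text.translate(_PUNCT_TABLE)

cleaned = remove_punctuation_py3("Would NOT stick! Very, very frustrating...")
```

Only punctuation is removed; case and whitespace are left untouched.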

# Extract sentiments

In [4]:
products = products[products['rating'] != 3]
products['sentiment'] = products['rating'].apply(lambda rating : +1 if rating > 3 else -1)
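
The filtering and labeling rule above can be illustrated on a plain list. This is only a sketch with made-up ratings; `rating_to_sentiment` is a hypothetical name for the lambda used in the cell.

```python
def rating_to_sentiment(rating):
    """Map a star rating to +1 (4 or 5 stars) or -1 (1 or 2 stars)."""
    return +1 if rating > 3 else -1

ratings = [5.0, 3.0, 2.0, 4.0, 1.0]  # illustrative values, not from the dataset

# Drop neutral 3-star reviews first, as in the cell above.
kept = [r for r in ratings if r != 3]
sentiments = [rating_to_sentiment(r) for r in kept]
```

Dropping the 3-star rows before labeling matters: otherwise `rating > 3` would silently assign them to the negative class.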

# Split into training and test sets

In [5]:
train_data, test_data = products.random_split(.8, seed=1)

# Build the word count vector

In [6]:
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')

train_matrix = vectorizer.fit_transform(train_data['review_clean'])
test_matrix = vectorizer.transform(test_data['review_clean'])
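
To make the word count representation concrete, here is a rough pure-Python sketch of what one row of the matrix encodes. It assumes the `CountVectorizer` defaults (notably lowercasing) together with the `token_pattern` passed above; `word_counts` is a hypothetical helper, and the real vectorizer returns a sparse matrix rather than a `Counter`.

```python
import re
from collections import Counter

# Same token pattern as in the cell above: any run of word characters,
# including single-character tokens such as "a" or "I".
TOKEN_PATTERN = re.compile(r'\b\w+\b')

def word_counts(text):
    # CountVectorizer lowercases by default, so do the same here.
    return Counter(TOKEN_PATTERN.findall(text.lower()))

counts = word_counts("I love it love it a lot")
```

Note that the vocabulary is learned from the training data only (`fit_transform`), while the test data is mapped onto that same vocabulary (`transform`); words that appear only in test reviews are ignored.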

# Train classifier with logistic regression

In [7]:
clf = linear_model.LogisticRegression()

In [8]:
clf.fit(train_matrix, train_data['sentiment'])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [9]:
print clf.intercept_

[ 1.3780432]


In [10]:
weights = clf.coef_

i = 0
for weight in weights[0]:
    if weight > 0.:
        i = i+1
print i

85752


In [11]:
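# i counts the strictly positive coefficients; the + 1 appears to add the intercept, which is also positive (see In [9]).
print '# of positive weights: ', i + 1
# Everything else in coef_: coefficients that are negative or exactly zero.
print '# of negative weights: ', len(weights[0]) - i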

# of positive weights:  85753
# of negative weights:  35960
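
The loop above counts strictly positive coefficients, and `len(weights[0]) - i` lumps negative and zero coefficients together. A small sketch that reports the three cases separately is shown below; `count_signs` is a hypothetical helper and the weights are illustrative, not the model's.

```python
def count_signs(weights):
    """Return (positive, negative, zero) counts for a sequence of weights."""
    positive = sum(1 for w in weights if w > 0)
    negative = sum(1 for w in weights if w < 0)
    zero = len(weights) - positive - negative
    return positive, negative, zero

example_weights = [1.58, -2.19, 0.0, 0.06, -0.28]  # made-up values
pos, neg, zero = count_signs(example_weights)
```

Applied to the real model, this would be called as `count_signs(clf.coef_[0])`.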


# Sentiment predictions

In [12]:
sample_test_data = test_data[10:13]
print sample_test_data['rating']

[5.0, 2.0, 1.0]


In [13]:
sample_test_data[0]['review']

'Absolutely love it and all of the Scripture in it.  I purchased the Baby Boy version for my grandson when he was born and my daughter-in-law was thrilled to receive the same book again.'

In [14]:
sample_test_data[1]['review']

'Would not purchase again or recommend. The decals were thick almost plastic like and were coming off the wall as I was applying them! The would NOT stick! Literally stayed stuck for about 5 minutes then started peeling off.'

In [15]:
sample_test_data[2]['review']

"Was so excited to get this product for my baby girls bedroom!  When I got it the back is NOT STICKY at all!  Every time I walked into the bedroom I was picking up pieces off of the floor!  Very very frustrating!  Ended up having to super glue it to the wall...very disappointing.  I wouldn't waste the time or money on it."

In [16]:
sample_test_matrix = vectorizer.transform(sample_test_data['review_clean'])

In [17]:
scores = clf.decision_function(sample_test_matrix)
print scores

[  5.59095054  -3.12647284 -10.42233483]


In [18]:
clf.predict(sample_test_matrix)

array([ 1, -1, -1])
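
The class predictions follow directly from the sign of the scores: for a binary logistic regression, a positive score maps to +1 and a non-positive score to -1. A minimal sketch using the three scores printed above:

```python
# Scores copied from the decision_function output above.
scores = [5.59095054, -3.12647284, -10.42233483]

# Threshold the score at zero to obtain the class label.
predicted = [+1 if s > 0 else -1 for s in scores]
```

This reproduces the labels returned by `clf.predict` for the three sample reviews.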

# Probability predictions

In [19]:
def calculate_probability(score):
    p = 1./(1+np.exp(-score))
    return p

In [20]:
for s in scores:
    p = calculate_probability(s)
    print p

0.99628239291
0.0420283884688
2.97594272228e-05


In [21]:
clf.predict_proba(sample_test_matrix)
# Columns follow clf.classes_ ([-1, 1]): the first column is P(sentiment = -1), the second is P(sentiment = +1).

array([[  3.71760709e-03,   9.96282393e-01],
       [  9.57971612e-01,   4.20283885e-02],
       [  9.99970241e-01,   2.97594272e-05]])
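
The two columns of `predict_proba` can be recovered from the scores alone: the second column is the sigmoid of the score, and the first is its complement. The sketch below uses only the standard library; `sigmoid` is a hypothetical stand-in for `calculate_probability`.

```python
import math

def sigmoid(score):
    """Logistic link: maps a real-valued score to P(sentiment = +1)."""
    return 1.0 / (1.0 + math.exp(-score))

# Scores copied from the decision_function output above.
scores = [5.59095054, -3.12647284, -10.42233483]

# Each row is (P(sentiment = -1), P(sentiment = +1)).
rows = [(1.0 - sigmoid(s), sigmoid(s)) for s in scores]
```

Up to rounding, `rows` matches the array printed by `clf.predict_proba(sample_test_matrix)`.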

# Find the most positive (and negative) reviews

In [22]:
p = clf.predict_proba(test_matrix)
p[:, 1]

array([ 0.78237037,  0.99999926,  0.93446093, ...,  0.99999444,
        0.9999974 ,  0.98103463])

In [23]:
test_data['probability'] = p[:, 1]

In [24]:
top20_positive = test_data.topk('probability', k=20, reverse=False)
top20_negative = test_data.topk('probability', k=20, reverse=True)
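
For readers without SFrame, the same selection can be sketched with the standard library's `heapq`. The rows below are invented for illustration, and the product names are placeholders.

```python
import heapq

# Hypothetical rows: each has a product name and a predicted P(sentiment = +1).
reviews = [
    {'name': 'Product A', 'probability': 0.98},
    {'name': 'Product B', 'probability': 0.02},
    {'name': 'Product C', 'probability': 0.75},
    {'name': 'Product D', 'probability': 0.001},
    {'name': 'Product E', 'probability': 0.9999},
]

# Highest probabilities first (most positive reviews).
top2_positive = heapq.nlargest(2, reviews, key=lambda r: r['probability'])
# Lowest probabilities first (most negative reviews).
top2_negative = heapq.nsmallest(2, reviews, key=lambda r: r['probability'])
```

In the SFrame call, `reverse=True` plays the role of `nsmallest`: it returns the rows with the lowest values in the chosen column.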

In [72]:
for i in top20_negative['name']:
    print i

Fisher-Price Ocean Wonders Aquarium Bouncer
Levana Safe N'See Digital Video Baby Monitor with Talk-to-Baby Intercom and Lullaby Control (LV-TW501)
Safety 1st Exchangeable Tip 3 in 1 Thermometer
Adiri BPA Free Natural Nurser Ultimate Bottle Stage 1 White, Slow Flow (0-3 months)
VTech Communications Safe &amp; Sounds Full Color Video and Audio Monitor
The First Years True Choice P400 Premium Digital Monitor, 2 Parent Unit
Safety 1st High-Def Digital Monitor
Cloth Diaper Sprayer--styles may vary
Motorola Digital Video Baby Monitor with Room Temperature Thermometer
Philips AVENT Newborn Starter Set
Cosco Alpha Omega Elite Convertible Car Seat
Ellaroo Mei Tai Baby Carrier - Hershey
Belkin WeMo Wi-Fi Baby Monitor for Apple iPhone, iPad, and iPod Touch (Firmware Update)
Peg-Perego Tatamia High Chair, White Latte
Chicco Cortina KeyFit 30 Travel System in Adventure
NUK Cook-n-Blend Baby Food Maker
VTech Communications Safe &amp; Sound Digital Audio Monitor with two Parent Units
Safety 1st Delux

In [73]:
for i in top20_positive['name']:
    print i

Britax 2012 B-Agile Stroller, Red
P'Kolino Silly Soft Seating in Tias, Green
Evenflo X Sport Plus Convenience Stroller - Christina
Evenflo 6 Pack Classic Glass Bottle, 4-Ounce
Simple Wishes Hands-Free Breastpump Bra, Pink, XS-L
Baby Einstein Around The World Discovery Center
Freemie Hands-Free Concealable Breast Pump Collection System
Infantino Wrap and Tie Baby Carrier, Black Blueberries
Fisher-Price Cradle 'N Swing,  My Little Snugabunny
Roan Rocco Classic Pram Stroller 2-in-1 with Bassinet and Seat Unit - Coffee
Mamas &amp; Papas 2014 Urbo2 Stroller - Black
Graco Pack 'n Play Element Playard - Flint
Diono RadianRXT Convertible Car Seat, Plum
Baby Jogger City Mini GT Single Stroller, Shadow/Orange
Graco FastAction Fold Jogger Click Connect Stroller, Grapeade
Buttons Cloth Diaper Cover - One Size - 8 Color Options
Ikea 36 Pcs Kalas Kids Plastic BPA Free Flatware, Bowl, Plate, Tumbler Set, Colorful
Britax Decathlon Convertible Car Seat, Tiffany
Baby Jogger City Mini GT Double Stroller,

# Compute accuracy

In [27]:
def calculate_accuracy(model, data, true_labels):
    predicted_labels = model.predict(data)
    
    correct = (np.array(true_labels) == predicted_labels).sum()
    
    accuracy = float(correct) / len(true_labels)
    
    return accuracy

In [28]:
calculate_accuracy(clf, test_matrix, test_data['sentiment'])

0.9324454043676506
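
Accuracy is simply the fraction of predictions that match the true labels. As a small worked example on made-up labels (the function name `accuracy` is hypothetical and takes predictions directly instead of a model):

```python
def accuracy(predicted_labels, true_labels):
    """Fraction of positions where the predicted label equals the true label."""
    correct = sum(1 for p, t in zip(predicted_labels, true_labels) if p == t)
    return float(correct) / len(true_labels)

# Three of the four predictions agree with the ground truth.
acc = accuracy([1, -1, -1, 1], [1, -1, 1, 1])
```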

# Learn another classifier with fewer words

In [29]:
significant_words = ['love', 'great', 'easy', 'old', 'little', 'perfect', 'loves', 
      'well', 'able', 'car', 'broke', 'less', 'even', 'waste', 'disappointed', 
      'work', 'product', 'money', 'would', 'return']

In [30]:
vectorizer_word_subset = CountVectorizer(vocabulary=significant_words) # limit to 20 words
train_matrix_word_subset = vectorizer_word_subset.fit_transform(train_data['review_clean'])
test_matrix_word_subset = vectorizer_word_subset.transform(test_data['review_clean'])
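
Passing `vocabulary=` fixes the columns of the matrix to the given words, in the given order, and every other word is ignored. A rough pure-Python sketch of one such row follows; `count_subset` is a hypothetical helper, and the regular expression mimics the default `CountVectorizer` token pattern (tokens of two or more word characters), which is an assumption about the defaults in use.

```python
import re

def count_subset(text, vocabulary):
    """Count occurrences of each vocabulary word, in vocabulary order."""
    tokens = re.findall(r'\b\w\w+\b', text.lower())
    return [tokens.count(word) for word in vocabulary]

vocab = ['love', 'great', 'waste', 'money']  # a few of the significant words
row = count_subset("Great stroller, great price. Love it!", vocab)
```

Because the vocabulary is fixed up front, `fit_transform` learns nothing from the training data here; it only counts.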

# Train the simple classifier

In [31]:
simple_clf = linear_model.LogisticRegression()
simple_clf.fit(train_matrix_word_subset, train_data['sentiment'])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [36]:
simple_clf_coef_table = sframe.SFrame({'word': significant_words, 'coefficient': simple_clf.coef_.flatten()})

In [45]:
print '# of positive coefficients: ', np.sum(np.array(simple_clf_coef_table['coefficient']) > 0)

# of positive coefficients:  10


In [69]:
for word in significant_words:
    index = vectorizer.vocabulary_[word]
    print word, clf.coef_[0][index]

love 1.57985183716
great 1.23132511927
easy 1.36313632807
old 0.0572106753799
little 0.63985180834
perfect 1.85970752258
loves 1.51735651768
well 0.541186559629
able 0.391000301035
car 0.11939705687
broke -1.39264233854
less -0.276714722711
even -0.465956269691
waste -1.99410264917
disappointed -2.19363827737
work -0.461059082457
product -0.191033905674
money -0.786784668211
would -0.283277287649
return -1.65510283807


In [71]:
simple_clf_coef_table.print_rows(20, 2)

+-----------------+--------------+
|   coefficient   |     word     |
+-----------------+--------------+
|  1.36368975931  |     love     |
|  0.943999590572 |    great     |
|  1.19253827349  |     easy     |
| 0.0855127794632 |     old      |
|  0.520185762718 |    little    |
|  1.50981247669  |   perfect    |
|  1.67307389259  |    loves     |
|  0.503760457768 |     well     |
|  0.190908572065 |     able     |
| 0.0588546711526 |     car      |
|  -1.65157634496 |    broke     |
| -0.209562864534 |     less     |
| -0.511379631799 |     even     |
|  -2.03369861394 |    waste     |
|  -2.3482982195  | disappointed |
| -0.621168773641 |     work     |
| -0.320556236734 |   product    |
| -0.898030737714 |    money     |
| -0.362166742274 |    would     |
|  -2.10933109032 |    return    |
+-----------------+--------------+
[20 rows x 2 columns]



# Comparing models

* on training data

In [46]:
calculate_accuracy(clf, train_matrix, train_data['sentiment'])

0.9677100197877316

In [48]:
calculate_accuracy(simple_clf, train_matrix_word_subset, train_data['sentiment'])

0.8668225700065959

* on testing data

In [50]:
calculate_accuracy(clf, test_matrix, test_data['sentiment'])

0.9324454043676506

Note: the quiz marks the answers 0.86 and 0.87 as wrong.

In [75]:
calculate_accuracy(simple_clf, test_matrix_word_subset, test_data['sentiment'])

0.8693604511639069
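
One way to read the four numbers above is to look at the gap between training and test accuracy for each model. The sketch below only does that arithmetic on the recorded results; the dictionary layout is an illustrative choice.

```python
# Accuracies copied from the outputs above.
results = {
    'clf': {'train': 0.9677100197877316, 'test': 0.9324454043676506},
    'simple_clf': {'train': 0.8668225700065959, 'test': 0.8693604511639069},
}

# Positive gap: the model does better on the data it was trained on.
gaps = {name: acc['train'] - acc['test'] for name, acc in results.items()}
```

The full model shows a gap of roughly 0.035, while the 20-word model shows essentially none. That suggests the full model fits the training data more closely, although it is still the more accurate of the two on the test set.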