# Predicting sentiment from product reviews

The goal of this project is to explore logistic regression and feature engineering with existing GraphLab functions.
Using product review data from Amazon.com, the sentiments about a product, whether positive or negative, is predicted.

In [1]:
import sys

In [2]:
sys.path.append("C:\Users\Bishusunita\Anaconda2\lib\site-packages")

In [3]:
from __future__ import division

import graphlab
import math
import string

In [4]:
products = graphlab.SFrame('amazon_baby.gl/')

2016-03-12 12:50:29,155 [INFO] graphlab.cython.cy_server, 176: GraphLab Create v1.8.4 started. Logging: C:\Users\BISHUS~1\AppData\Local\Temp\graphlab_server_1457805028.log.0


This non-commercial license of GraphLab Create is assigned to chowdhury.biswanath@gmail.com and will expire on October 24, 2016. For commercial licensing options, visit https://dato.com/buy/.


In [5]:
products

name,review,rating
Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3.0
Planetwise Wipe Pouch,it came early and was not disappointed. i love ...,5.0
Annas Dream Full Quilt with 2 Shams ...,Very soft and comfortable and warmer than it ...,5.0
Stop Pacifier Sucking without tears with ...,This is a product well worth the purchase. I ...,5.0
Stop Pacifier Sucking without tears with ...,All of my kids have cried non-stop when I tried to ...,5.0
Stop Pacifier Sucking without tears with ...,"When the Binky Fairy came to our house, we didn't ...",5.0
A Tale of Baby's Days with Peter Rabbit ...,"Lovely book, it's bound tightly so you may no ...",4.0
"Baby Tracker&reg; - Daily Childcare Journal, ...",Perfect for new parents. We were able to keep ...,5.0
"Baby Tracker&reg; - Daily Childcare Journal, ...",A friend of mine pinned this product on Pinte ...,5.0
"Baby Tracker&reg; - Daily Childcare Journal, ...",This has been an easy way for my nanny to record ...,4.0


In [6]:
products[269]

{'name': 'The First Years Massaging Action Teether',
 'rating': 5.0,
 'review': 'A favorite in our house!'}

In [7]:
def remove_punctuation(text):
    import string
    return text.translate(None, string.punctuation) 

review_without_puctuation = products['review'].apply(remove_punctuation)
products['word_count'] = graphlab.text_analytics.count_words(review_without_puctuation)

In [9]:
products[269]['word_count']

{'a': 1L, 'favorite': 1L, 'house': 1L, 'in': 1L, 'our': 1L}

In [12]:
products = products[products['rating'] != 3]
len(products)

166752

In [13]:
products['sentiment'] = products['rating'].apply(lambda rating : +1 if rating > 3 else -1)
products

name,review,rating,word_count,sentiment
Planetwise Wipe Pouch,it came early and was not disappointed. i love ...,5.0,"{'and': 3L, 'love': 1L, 'it': 3L, 'highly': 1L, ...",1
Annas Dream Full Quilt with 2 Shams ...,Very soft and comfortable and warmer than it ...,5.0,"{'and': 2L, 'quilt': 1L, 'it': 1L, 'comfortable': ...",1
Stop Pacifier Sucking without tears with ...,This is a product well worth the purchase. I ...,5.0,"{'and': 3L, 'ingenious': 1L, 'love': 2L, 'is': ...",1
Stop Pacifier Sucking without tears with ...,All of my kids have cried non-stop when I tried to ...,5.0,"{'and': 2L, 'all': 2L, 'help': 1L, 'cried': 1L, ...",1
Stop Pacifier Sucking without tears with ...,"When the Binky Fairy came to our house, we didn't ...",5.0,"{'and': 2L, 'cute': 1L, 'help': 2L, 'habit': 1L, ...",1
A Tale of Baby's Days with Peter Rabbit ...,"Lovely book, it's bound tightly so you may no ...",4.0,"{'shop': 1L, 'be': 1L, 'is': 1L, 'bound': 1L, ...",1
"Baby Tracker&reg; - Daily Childcare Journal, ...",Perfect for new parents. We were able to keep ...,5.0,"{'and': 2L, 'all': 1L, 'right': 1L, 'able': 1L, ...",1
"Baby Tracker&reg; - Daily Childcare Journal, ...",A friend of mine pinned this product on Pinte ...,5.0,"{'and': 1L, 'fantastic': 1L, 'help': 1L, 'give': ...",1
"Baby Tracker&reg; - Daily Childcare Journal, ...",This has been an easy way for my nanny to record ...,4.0,"{'all': 1L, 'standarad': 1L, 'another': 1L, ...",1
"Baby Tracker&reg; - Daily Childcare Journal, ...",I love this journal and our nanny uses it ...,4.0,"{'all': 2L, 'nannys': 1L, 'just': 1L, 'sleep': 2L, ...",1


In [14]:
train_data, test_data = products.random_split(.8, seed=1)
print len(train_data)
print len(test_data)

133416
33336


In [15]:
sentiment_model = graphlab.logistic_classifier.create(train_data,
                                                      target = 'sentiment',
                                                      features=['word_count'],
                                                      validation_set=None)

In [16]:
sentiment_model

Class                         : LogisticClassifier

Schema
------
Number of coefficients        : 121713
Number of examples            : 133416
Number of classes             : 2
Number of feature columns     : 1
Number of unpacked features   : 121712

Hyperparameters
---------------
L1 penalty                    : 0.0
L2 penalty                    : 0.01

Training Summary
----------------
Solver                        : lbfgs
Solver iterations             : 6
Solver status                 : TERMINATED: Terminated due to numerical difficulties.
Training time (sec)           : 9.1811

Settings
--------
Log-likelihood                : inf

Highest Positive Coefficients
-----------------------------
word_count[mobileupdate]      : 41.9847
word_count[placeid]           : 41.7354
word_count[labelbox]          : 41.151
word_count[httpwwwamazoncomreviewrhgg6qp7tdnhbrefcmcrprcmtieutf8asinb00318cla0nodeid]: 40.0454
word_count[knobskeeping]      : 36.2091

Lowest Negative Coefficients
-----------

In [17]:
weights = sentiment_model.coefficients
weights.column_names()

['name', 'index', 'class', 'value', 'stderr']

In [20]:
weights[weights['value'] >= 0]

name,index,class,value,stderr
(intercept),,1,1.30337080544,
word_count,recommend,1,0.303815600015,
word_count,highly,1,1.49183015276,
word_count,love,1,1.43301685439,
word_count,it,1,0.00986646490307,
word_count,and,1,0.048449573172,
word_count,bags,1,0.165541436615,
word_count,early,1,0.488413478808,
word_count,came,1,0.131378480765,
word_count,i,1,0.0182528116279,


In [21]:
num_positive_weights = len(weights[weights['value'] >= 0])
num_negative_weights = len(weights[weights['value'] < 0])

print "Number of positive weights: %s " % num_positive_weights
print "Number of negative weights: %s " % num_negative_weights

Number of positive weights: 68419 
Number of negative weights: 53294 


In [22]:
sample_test_data = test_data[10:13]
print sample_test_data['rating']
sample_test_data

[5.0, 2.0, 1.0]


name,review,rating,word_count,sentiment
Our Baby Girl Memory Book,Absolutely love it and all of the Scripture in ...,5.0,"{'and': 2L, 'all': 1L, 'love': 1L, ...",1
Wall Decor Removable Decal Sticker - Colorful ...,Would not purchase again or recommend. The decals ...,2.0,"{'and': 1L, 'wall': 1L, 'them': 1L, 'decals': ...",-1
New Style Trailing Cherry Blossom Tree Decal ...,Was so excited to get this product for my baby ...,1.0,"{'all': 1L, 'money': 1L, 'into': 1L, 'it': 3L, ...",-1


In [23]:
sample_test_data[0]['review']

'Absolutely love it and all of the Scripture in it.  I purchased the Baby Boy version for my grandson when he was born and my daughter-in-law was thrilled to receive the same book again.'

In [24]:
sample_test_data[1]['review']

'Would not purchase again or recommend. The decals were thick almost plastic like and were coming off the wall as I was applying them! The would NOT stick! Literally stayed stuck for about 5 minutes then started peeling off.'

In [25]:
scores = sentiment_model.predict(sample_test_data, output_type='margin')
print scores

[6.734619727060312, -5.734130996760897, -14.668460404469606]


In [27]:
def getClass(score):
    if score > 0:
        pclass = 1
    else:
        pclass = -1
    return pclass

In [28]:
for score in scores:
    print getClass(score)

1
-1
-1


In [29]:
print "Class predictions according to GraphLab Create:" 
print sentiment_model.predict(sample_test_data)

Class predictions according to GraphLab Create:
[1L, -1L, -1L]


In [32]:
import numpy as np
def getP(score):
    prob = 1./(1. + np.exp(-score))
    return prob

In [33]:
for score in scores:
    print getP(score)

0.998812384838
0.0032232681818
4.26155799665e-07


In [37]:
test_data['prob'] = sentiment_model.predict(test_data, output_type='probability')

In [77]:
print test_data.topk('prob',k=20).print_rows(num_rows=20)

+-------------------------------+-------------------------------+--------+
|              name             |             review            | rating |
+-------------------------------+-------------------------------+--------+
| Britax Decathlon Convertib... | I researched a few differe... |  4.0   |
| bumGenius One-Size Cloth D... | I love, love, love these d... |  5.0   |
| Moby Wrap Original 100% Co... | Let me just say that I DO ... |  4.0   |
| Moby Wrap Original 100% Co... | My Moby is an absolute nec... |  5.0   |
| Ameda Purely Yours Breast ... | As with many new moms, I c... |  4.0   |
| Traveling Toddler Car Seat... | I am sure this product wor... |  2.0   |
| Cloud b Sound Machine Soot... | First off, I love plush sh... |  5.0   |
| JP Lizzy Chocolate Ice Cla... | I got this bag as a presen... |  4.0   |
| Lilly Gold Sit 'n' Stroll ... | I just completed a two-mon... |  5.0   |
|  Fisher-Price Deluxe Jumperoo | I had already decided that... |  5.0   |
|   Munchkin Mozart Magic

In [78]:
test_data.topk('prob',k=20,reverse=True).print_rows(num_rows=20)

+-------------------------------+-------------------------------+--------+
|              name             |             review            | rating |
+-------------------------------+-------------------------------+--------+
| Jolly Jumper Arctic Sneak ... | I am a "research-aholic" i... |  5.0   |
| Levana Safe N'See Digital ... | This is the first review I... |  1.0   |
| Snuza Portable Baby Moveme... | I would have given the pro... |  1.0   |
| Fisher-Price Ocean Wonders... | We have not had ANY luck w... |  2.0   |
| VTech Communications Safe ... | This is my second video mo... |  1.0   |
| Safety 1st High-Def Digita... | We bought this baby monito... |  1.0   |
| Chicco Cortina KeyFit 30 T... | My wife and I have used th... |  1.0   |
| Prince Lionheart Warmies W... | *****IMPORTANT UPDATE*****... |  1.0   |
| Valco Baby Tri-mode Twin S... | I give one star to the dim... |  1.0   |
| Adiri BPA Free Natural Nur... | I will try to write an obj... |  2.0   |
| Munchkin Nursery Projec

In [50]:
def get_classification_accuracy(model, data, true_labels):
    # First get the predictions
    data['margin'] = sentiment_model.predict(data, output_type='margin')
    #products['sentiment'] = products['rating'].apply(lambda rating : +1 if rating > 3 else -1)
    data['pclass'] = data['margin'].apply(lambda margin : +1 if margin > 0 else -1)
    
    # Compute the number of correctly classified examples
    print data[data['pclass'] == true_labels]
    ncorrect = len(data[data['pclass'] == true_labels])

    # Then compute accuracy by dividing num_correct by total number of examples
    accuracy = np.round(float(ncorrect)/float(len(data)),2)
    
    return accuracy

In [51]:
print get_classification_accuracy(sentiment_model, test_data, test_data['sentiment'])

+-------------------------------+-------------------------------+--------+
|              name             |             review            | rating |
+-------------------------------+-------------------------------+--------+
| Baby Tracker&reg; - Daily ... | This has been an easy way ... |  4.0   |
| Baby Tracker&reg; - Daily ... | I love this journal and ou... |  4.0   |
| Nature's Lullabies Second ... | I had a hard time finding ... |  5.0   |
|  Lamaze Peekaboo, I Love You  | One of baby's first and fa... |  4.0   |
|  Lamaze Peekaboo, I Love You  | My son loved this book as ... |  5.0   |
|  Lamaze Peekaboo, I Love You  | Our baby loves this book &... |  5.0   |
| SoftPlay Peek-A-Boo Where'... | I bought two for recent ba... |  5.0   |
| Baby's First Year Undated ... | I searched high and low fo... |  5.0   |
|   Our Baby Girl Memory Book   | Absolutely love it and all... |  5.0   |
| Wall Decor Removable Decal... | Would not purchase again o... |  2.0   |
+------------------------

In [52]:
significant_words = ['love', 'great', 'easy', 'old', 'little', 'perfect', 'loves', 
      'well', 'able', 'car', 'broke', 'less', 'even', 'waste', 'disappointed', 
      'work', 'product', 'money', 'would', 'return']

In [53]:
len(significant_words)

20

In [54]:
train_data['word_count_subset'] = train_data['word_count'].dict_trim_by_keys(significant_words, exclude=False)
test_data['word_count_subset'] = test_data['word_count'].dict_trim_by_keys(significant_words, exclude=False)

In [55]:
train_data[0]['review']

'it came early and was not disappointed. i love planet wise bags and now my wipe holder. it keps my osocozy wipes moist and does not leak. highly recommend it.'

In [56]:
print train_data[0]['word_count']

{'and': 3L, 'love': 1L, 'it': 3L, 'highly': 1L, 'osocozy': 1L, 'bags': 1L, 'leak': 1L, 'moist': 1L, 'does': 1L, 'recommend': 1L, 'was': 1L, 'wipes': 1L, 'disappointed': 1L, 'early': 1L, 'not': 2L, 'now': 1L, 'holder': 1L, 'wipe': 1L, 'keps': 1L, 'wise': 1L, 'i': 1L, 'planet': 1L, 'my': 2L, 'came': 1L}


In [57]:
print train_data[0]['word_count_subset']

{'love': 1L, 'disappointed': 1L}


In [58]:
simple_model = graphlab.logistic_classifier.create(train_data,
                                                   target = 'sentiment',
                                                   features=['word_count_subset'],
                                                   validation_set=None)
simple_model

Class                         : LogisticClassifier

Schema
------
Number of coefficients        : 21
Number of examples            : 133416
Number of classes             : 2
Number of feature columns     : 1
Number of unpacked features   : 20

Hyperparameters
---------------
L1 penalty                    : 0.0
L2 penalty                    : 0.01

Training Summary
----------------
Solver                        : newton
Solver iterations             : 6
Solver status                 : SUCCESS: Optimal solution found.
Training time (sec)           : 0.7385

Settings
--------
Log-likelihood                : 44323.7254

Highest Positive Coefficients
-----------------------------
word_count_subset[loves]      : 1.6773
word_count_subset[perfect]    : 1.5145
word_count_subset[love]       : 1.3654
(intercept)                   : 1.2995
word_count_subset[easy]       : 1.1937

Lowest Negative Coefficients
----------------------------
word_count_subset[disappointed]: -2.3551
word_count_subset[ret

In [59]:
print get_classification_accuracy(simple_model, test_data, test_data['sentiment'])

+-------------------------------+-------------------------------+--------+
|              name             |             review            | rating |
+-------------------------------+-------------------------------+--------+
| Baby Tracker&reg; - Daily ... | This has been an easy way ... |  4.0   |
| Baby Tracker&reg; - Daily ... | I love this journal and ou... |  4.0   |
| Nature's Lullabies Second ... | I had a hard time finding ... |  5.0   |
|  Lamaze Peekaboo, I Love You  | One of baby's first and fa... |  4.0   |
|  Lamaze Peekaboo, I Love You  | My son loved this book as ... |  5.0   |
|  Lamaze Peekaboo, I Love You  | Our baby loves this book &... |  5.0   |
| SoftPlay Peek-A-Boo Where'... | I bought two for recent ba... |  5.0   |
| Baby's First Year Undated ... | I searched high and low fo... |  5.0   |
|   Our Baby Girl Memory Book   | Absolutely love it and all... |  5.0   |
| Wall Decor Removable Decal... | Would not purchase again o... |  2.0   |
+------------------------

In [60]:
simple_model.coefficients

name,index,class,value,stderr
(intercept),,1,1.2995449552,0.0120888541331
word_count_subset,disappointed,1,-2.35509250061,0.0504149888557
word_count_subset,love,1,1.36543549368,0.0303546295109
word_count_subset,well,1,0.504256746398,0.021381300631
word_count_subset,product,1,-0.320555492996,0.0154311321362
word_count_subset,loves,1,1.67727145556,0.0482328275384
word_count_subset,little,1,0.520628636025,0.0214691475665
word_count_subset,work,1,-0.621700012425,0.0230330597946
word_count_subset,easy,1,1.19366189833,0.029288869202
word_count_subset,great,1,0.94469126948,0.0209509926591


In [86]:
for coeff in simple_model.coefficients:
    if coeff['value'] > 0:
        scoeff = sentiment_model.coefficients[sentiment_model.coefficients['index']==coeff['index']]
        print coeff['index'],coeff['value'],scoeff['value']

None 1.2995449552 [1.3033708054362867, ... ]
love 1.36543549368 [1.4330168543928075, ... ]
well 0.504256746398 [0.6279648775668165, ... ]
loves 1.67727145556 [1.5664851756956497, ... ]
little 0.520628636025 [0.6741624574994501, ... ]
easy 1.19366189833 [1.2134693782160606, ... ]
great 0.94469126948 [1.3145924503860063, ... ]
able 0.191438302295 [0.17433127255187922, ... ]
perfect 1.51448626703 [1.7519011439200898, ... ]
old 0.0853961886678 [0.009122301136675403, ... ]
car 0.058834990068 [0.1952636706177577, ... ]


In [61]:
simple_model.coefficients.sort('value', ascending=False).print_rows(num_rows=21)

+-------------------+--------------+-------+-----------------+-----------------+
|        name       |    index     | class |      value      |      stderr     |
+-------------------+--------------+-------+-----------------+-----------------+
| word_count_subset |    loves     |   1   |  1.67727145556  | 0.0482328275384 |
| word_count_subset |   perfect    |   1   |  1.51448626703  |  0.049861952294 |
| word_count_subset |     love     |   1   |  1.36543549368  | 0.0303546295109 |
|    (intercept)    |     None     |   1   |   1.2995449552  | 0.0120888541331 |
| word_count_subset |     easy     |   1   |  1.19366189833  |  0.029288869202 |
| word_count_subset |    great     |   1   |  0.94469126948  | 0.0209509926591 |
| word_count_subset |    little    |   1   |  0.520628636025 | 0.0214691475665 |
| word_count_subset |     well     |   1   |  0.504256746398 |  0.021381300631 |
| word_count_subset |     able     |   1   |  0.191438302295 | 0.0337581955697 |
| word_count_subset |     ol

In [64]:
coeff = simple_model.coefficients.sort('value', ascending=False)[0:21]

In [67]:
print len(coeff[coeff['value']>0])

11


# Comparing models

In [68]:
print get_classification_accuracy(sentiment_model, train_data, train_data['sentiment'])

+-------------------------------+-------------------------------+--------+
|              name             |             review            | rating |
+-------------------------------+-------------------------------+--------+
|     Planetwise Wipe Pouch     | it came early and was not ... |  5.0   |
| Annas Dream Full Quilt wit... | Very soft and comfortable ... |  5.0   |
| Stop Pacifier Sucking with... | This is a product well wor... |  5.0   |
| Stop Pacifier Sucking with... | All of my kids have cried ... |  5.0   |
| Stop Pacifier Sucking with... | When the Binky Fairy came ... |  5.0   |
| A Tale of Baby's Days with... | Lovely book, it's bound ti... |  4.0   |
| Baby Tracker&reg; - Daily ... | Perfect for new parents. W... |  5.0   |
| Baby Tracker&reg; - Daily ... | A friend of mine pinned th... |  5.0   |
| Baby Tracker&reg; - Daily ... | This book is perfect!  I'm... |  5.0   |
| Baby Tracker&reg; - Daily ... | I originally just gave the... |  4.0   |
+------------------------

In [69]:
print get_classification_accuracy(simple_model, train_data, train_data['sentiment'])

+-------------------------------+-------------------------------+--------+
|              name             |             review            | rating |
+-------------------------------+-------------------------------+--------+
|     Planetwise Wipe Pouch     | it came early and was not ... |  5.0   |
| Annas Dream Full Quilt wit... | Very soft and comfortable ... |  5.0   |
| Stop Pacifier Sucking with... | This is a product well wor... |  5.0   |
| Stop Pacifier Sucking with... | All of my kids have cried ... |  5.0   |
| Stop Pacifier Sucking with... | When the Binky Fairy came ... |  5.0   |
| A Tale of Baby's Days with... | Lovely book, it's bound ti... |  4.0   |
| Baby Tracker&reg; - Daily ... | Perfect for new parents. W... |  5.0   |
| Baby Tracker&reg; - Daily ... | A friend of mine pinned th... |  5.0   |
| Baby Tracker&reg; - Daily ... | This book is perfect!  I'm... |  5.0   |
| Baby Tracker&reg; - Daily ... | I originally just gave the... |  4.0   |
+------------------------

In [70]:
print get_classification_accuracy(sentiment_model, test_data, test_data['sentiment'])

+-------------------------------+-------------------------------+--------+
|              name             |             review            | rating |
+-------------------------------+-------------------------------+--------+
| Baby Tracker&reg; - Daily ... | This has been an easy way ... |  4.0   |
| Baby Tracker&reg; - Daily ... | I love this journal and ou... |  4.0   |
| Nature's Lullabies Second ... | I had a hard time finding ... |  5.0   |
|  Lamaze Peekaboo, I Love You  | One of baby's first and fa... |  4.0   |
|  Lamaze Peekaboo, I Love You  | My son loved this book as ... |  5.0   |
|  Lamaze Peekaboo, I Love You  | Our baby loves this book &... |  5.0   |
| SoftPlay Peek-A-Boo Where'... | I bought two for recent ba... |  5.0   |
| Baby's First Year Undated ... | I searched high and low fo... |  5.0   |
|   Our Baby Girl Memory Book   | Absolutely love it and all... |  5.0   |
| Wall Decor Removable Decal... | Would not purchase again o... |  2.0   |
+------------------------

In [71]:
print get_classification_accuracy(simple_model, test_data, test_data['sentiment'])

+-------------------------------+-------------------------------+--------+
|              name             |             review            | rating |
+-------------------------------+-------------------------------+--------+
| Baby Tracker&reg; - Daily ... | This has been an easy way ... |  4.0   |
| Baby Tracker&reg; - Daily ... | I love this journal and ou... |  4.0   |
| Nature's Lullabies Second ... | I had a hard time finding ... |  5.0   |
|  Lamaze Peekaboo, I Love You  | One of baby's first and fa... |  4.0   |
|  Lamaze Peekaboo, I Love You  | My son loved this book as ... |  5.0   |
|  Lamaze Peekaboo, I Love You  | Our baby loves this book &... |  5.0   |
| SoftPlay Peek-A-Boo Where'... | I bought two for recent ba... |  5.0   |
| Baby's First Year Undated ... | I searched high and low fo... |  5.0   |
|   Our Baby Girl Memory Book   | Absolutely love it and all... |  5.0   |
| Wall Decor Removable Decal... | Would not purchase again o... |  2.0   |
+------------------------

In [72]:
num_positive  = (train_data['sentiment'] == +1).sum()
num_negative = (train_data['sentiment'] == -1).sum()
print num_positive
print num_negative

112164
21252


In [75]:
num_positive  = (test_data['sentiment'] == +1).sum()
num_negative = (test_data['sentiment'] == -1).sum()
print num_positive
print num_negative
print np.round(float(num_positive)/float(len(test_data)),2)

28095
5241
0.84
