# Predicting sentiment from product reviews
In this assignment, we explore sentiment analysis further: we train a model using a small set of key polarizing words, inspect the weight learned for each of these words, and compare this simpler classifier with the one that uses all of the words. These techniques will be a core component of your capstone project.

In [93]:
import graphlab

### Load product review data
Loading reviews for a set of baby products. 

In [94]:
products = graphlab.SFrame('Data/amazon_baby.gl/')
products.save('Data/amazon_baby.csv', format='csv')

In [95]:
# only use the defined words for evaluation
selected_words = ['awesome', 'great', 'fantastic', 'amazing', 'love', 'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']

#### Data exploration
The data includes the product name, the review text, and the rating. We add a word_count column that holds, for each review, a dictionary mapping every word in the review to the number of times it occurs.

In [96]:
products['word_count'] = graphlab.text_analytics.count_words(products['review'])

In [97]:
products.head(5)

name,review,rating,word_count
Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3.0,"{'and': 5L, 'stink': 1L, 'because': 1L, 'order ..."
Planetwise Wipe Pouch,it came early and was not disappointed. i love ...,5.0,"{'and': 3L, 'love': 1L, 'it': 2L, 'highly': 1L, ..."
Annas Dream Full Quilt with 2 Shams ...,Very soft and comfortable and warmer than it ...,5.0,"{'and': 2L, 'quilt': 1L, 'it': 1L, 'comfortable': ..."
Stop Pacifier Sucking without tears with ...,This is a product well worth the purchase. I ...,5.0,"{'ingenious': 1L, 'and': 3L, 'love': 2L, ..."
Stop Pacifier Sucking without tears with ...,All of my kids have cried non-stop when I tried to ...,5.0,"{'and': 2L, 'parents!!': 1L, 'all': 2L, 'puppe ..."
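
As a rough illustration of what the word_count column holds, here is a plain-Python sketch. It assumes the text is lowercased and split on whitespace only, which is inferred from keys such as 'parents!!' in the output above; the function name `count_words_sketch` is hypothetical, and the real `graphlab.text_analytics.count_words` may differ in its details.

```python
from collections import Counter

def count_words_sketch(text):
    """Lowercase the text, split on whitespace, and count each token.

    Assumption: punctuation is NOT stripped, so 'product!' and 'product'
    are counted as different words.
    """
    return dict(Counter(text.lower().split()))

counts = count_words_sketch("I LOVE LOVE LOVE this product! Love it.")
print(counts)
```

Under these assumptions, a selected word only matches when it appears as a standalone token, so "love!" would not be counted as "love".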


In [98]:
graphlab.canvas.set_target('ipynb')

In [99]:
products['rating'].show(view='Categorical')

In [100]:
## define what is positive (rating >= 4) and negative (rating <= 2)

# ignore all products whose rating is 3
products = products[products['rating'] != 3]

# create sentiment column
products['sentiment'] = products['rating'] >= 4


In [101]:
products.head(5)

name,review,rating,word_count,sentiment
Planetwise Wipe Pouch,it came early and was not disappointed. i love ...,5.0,"{'and': 3L, 'love': 1L, 'it': 2L, 'highly': 1L, ...",1
Annas Dream Full Quilt with 2 Shams ...,Very soft and comfortable and warmer than it ...,5.0,"{'and': 2L, 'quilt': 1L, 'it': 1L, 'comfortable': ...",1
Stop Pacifier Sucking without tears with ...,This is a product well worth the purchase. I ...,5.0,"{'ingenious': 1L, 'and': 3L, 'love': 2L, ...",1
Stop Pacifier Sucking without tears with ...,All of my kids have cried non-stop when I tried to ...,5.0,"{'and': 2L, 'parents!!': 1L, 'all': 2L, 'puppe ...",1
Stop Pacifier Sucking without tears with ...,"When the Binky Fairy came to our house, we didn't ...",5.0,"{'and': 2L, 'cute': 1L, 'help': 2L, 'doll': 1L, ...",1


### 1. Build new features
Build new features with the counts for each of the selected words

In [102]:
# count the occurrences of each selected word in every review

for word in selected_words:
    # a word that is missing from the review's dictionary counts as 0
    products[word] = products['word_count'].apply(lambda word_count_dic: word_count_dic.get(word, 0))
    print('The total occurrence of the word %s - is : %s' % (word, products[word].sum()))

The total occurrence of the word awesome - is : 2002
The total occurrence of the word great - is : 42420
The total occurrence of the word fantastic - is : 873
The total occurrence of the word amazing - is : 1305
The total occurrence of the word love - is : 40277
The total occurrence of the word horrible - is : 659
The total occurrence of the word bad - is : 3197
The total occurrence of the word terrible - is : 673
The total occurrence of the word awful - is : 345
The total occurrence of the word wow - is : 131
The total occurrence of the word hate - is : 1057


In [103]:
# explore the data set
products.head(5)

name,review,rating,word_count,sentiment,awesome
Planetwise Wipe Pouch,it came early and was not disappointed. i love ...,5.0,"{'and': 3L, 'love': 1L, 'it': 2L, 'highly': 1L, ...",1,0
Annas Dream Full Quilt with 2 Shams ...,Very soft and comfortable and warmer than it ...,5.0,"{'and': 2L, 'quilt': 1L, 'it': 1L, 'comfortable': ...",1,0
Stop Pacifier Sucking without tears with ...,This is a product well worth the purchase. I ...,5.0,"{'ingenious': 1L, 'and': 3L, 'love': 2L, ...",1,0
Stop Pacifier Sucking without tears with ...,All of my kids have cried non-stop when I tried to ...,5.0,"{'and': 2L, 'parents!!': 1L, 'all': 2L, 'puppe ...",1,0
Stop Pacifier Sucking without tears with ...,"When the Binky Fairy came to our house, we didn't ...",5.0,"{'and': 2L, 'cute': 1L, 'help': 2L, 'doll': 1L, ...",1,0

great,fantastic,amazing,love,horrible,bad,terrible,awful,wow,hate
0,0,0,1,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0
0,0,0,2,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0
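
The new columns amount to a fixed-length feature vector per review. A minimal sketch of that mapping for a single review, using a word-count dictionary like the ones shown in the word_count column (the helper name `selected_word_features` is hypothetical):

```python
selected_words = ['awesome', 'great', 'fantastic', 'amazing', 'love',
                  'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']

def selected_word_features(word_count, words=selected_words):
    """Return the count of each selected word, in list order (0 if absent)."""
    return [word_count.get(w, 0) for w in words]

features = selected_word_features({'and': 3, 'love': 1, 'it': 2, 'highly': 1})
print(features)
```

Every word outside selected_words is simply dropped, which is why two very different reviews can end up with identical feature vectors.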


### 2. Create a new sentiment analysis model
Split the data into training and test sets, then train a classifier on the selected-word counts built above.

In [104]:
# Split the dataset into training and test data
train_data,test_data = products.random_split(.8, seed=0)

print('# of rows in test data set %s' % test_data.num_rows())
print('# of rows in train data set %s' % train_data.num_rows())

# of rows in test data set 33304
# of rows in train data set 133448
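
The training set holds roughly, but not exactly, 80% of the rows. That is consistent with each row being assigned to one side independently at random, which is an assumption about how `random_split` works rather than something this notebook confirms. A plain-Python sketch of that idea, with the hypothetical name `random_split_sketch`:

```python
import random

def random_split_sketch(rows, fraction, seed=0):
    """Send each row to the first list with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second

train, test = random_split_sketch(list(range(10000)), 0.8, seed=0)
print(len(train), len(test))
```

Fixing the seed matters here because both models are later compared on the same test_data.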


#### Train a logistic regression classifier using just the selected_words. 

In [105]:
# train the sentiment classifier
selected_words_model = graphlab.logistic_classifier.create(train_data,
                                                     target='sentiment',
                                                     features=selected_words,
                                                     validation_set = test_data)

In [106]:
selected_words_model['coefficients'].sort('value', ascending = False).print_rows(12)

+-------------+-------+-------+------------------+------------------+
|     name    | index | class |      value       |      stderr      |
+-------------+-------+-------+------------------+------------------+
|     love    |  None |   1   |  1.39989834302   | 0.0287147460124  |
| (intercept) |  None |   1   |  1.36728315229   | 0.00861805467824 |
|   awesome   |  None |   1   |  1.05800888878   |  0.110865296265  |
|   amazing   |  None |   1   |  0.892802422508  |  0.127989503231  |
|  fantastic  |  None |   1   |  0.891303090304  |  0.154532343591  |
|    great    |  None |   1   |  0.883937894898  | 0.0217379527921  |
|     wow     |  None |   1   | -0.0541450123333 |  0.275616449416  |
|     bad     |  None |   1   | -0.985827369929  | 0.0433603009142  |
|     hate    |  None |   1   |  -1.40916406276  | 0.0771983993506  |
|    awful    |  None |   1   |  -1.76469955631  |  0.134679803365  |
|   horrible  |  None |   1   |  -1.99651800559  | 0.0973584169028  |
|   terrible  |  Non

### 3. Comparing the accuracy of different sentiment analysis models

In [107]:
selected_words_model.evaluate(test_data, metric='roc_curve')

{'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 
 Rows: 100001
 
 Data:
 +-----------+-----+-----+-------+------+
 | threshold | fpr | tpr |   p   |  n   |
 +-----------+-----+-----+-------+------+
 |    0.0    | 1.0 | 1.0 | 27976 | 5328 |
 |   1e-05   | 1.0 | 1.0 | 27976 | 5328 |
 |   2e-05   | 1.0 | 1.0 | 27976 | 5328 |
 |   3e-05   | 1.0 | 1.0 | 27976 | 5328 |
 |   4e-05   | 1.0 | 1.0 | 27976 | 5328 |
 |   5e-05   | 1.0 | 1.0 | 27976 | 5328 |
 |   6e-05   | 1.0 | 1.0 | 27976 | 5328 |
 |   7e-05   | 1.0 | 1.0 | 27976 | 5328 |
 |   8e-05   | 1.0 | 1.0 | 27976 | 5328 |
 |   9e-05   | 1.0 | 1.0 | 27976 | 5328 |
 +-----------+-----+-----+-------+------+
 [100001 rows x 5 columns]
 Note: Only the head of the SFrame is printed.
 You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.}
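
Each row of the roc_curve output gives the false positive rate (fpr) and true positive rate (tpr) at one probability threshold, with p and n the number of positive and negative test reviews. The sketch below computes one such point on toy data. It assumes a review is predicted positive when its score is greater than or equal to the threshold; the function name `roc_point` and the toy labels and scores are illustrative only.

```python
def roc_point(labels, scores, threshold):
    """Return (fpr, tpr) when predicting positive for score >= threshold."""
    tp = fp = p = n = 0
    for label, score in zip(labels, scores):
        predicted_positive = score >= threshold
        if label == 1:
            p += 1
            tp += predicted_positive
        else:
            n += 1
            fp += predicted_positive
    return fp / float(n), tp / float(p)

labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.1]
print(roc_point(labels, scores, 0.0))  # every review is predicted positive
print(roc_point(labels, scores, 0.5))

# Class balance of the test set, from the p and n columns above.
positive_share = 27976 / float(27976 + 5328)
print(positive_share)
```

At thresholds near zero every review is predicted positive, which is why the first rows of the table show fpr = tpr = 1.0. The class balance is also worth keeping in mind: about 84% of the test reviews are positive, so a classifier that always predicts positive would already reach that accuracy.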

In [108]:
selected_words_model.show(view='Evaluation')

In [109]:
# products = selected_words_model.predict(diaperchamp_reviews[0:1], output_type='probability')

### 4. Interpreting the difference in performance between the models
To understand why the model with all word counts performs better than the one with only the selected_words, we will now examine the reviews for a particular product.

#### Baby Trend Diaper Champ - Analysis

In [110]:
diaperchamp_reviews = products[products['name'] == 'Baby Trend Diaper Champ']

In [111]:
len(diaperchamp_reviews)

298

In [112]:
diaperchamp_reviews['rating'].show(view='Categorical')

#### Define what's a positive and a negative sentiment

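We ignore all reviews with rating = 3, since they tend to have a neutral sentiment. Reviews with a rating of 4 or higher are considered positive, while those with a rating of 2 or lower are considered negative. This filtering and labeling was already applied to products in the data exploration step, so the next three cells are left commented out.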

In [113]:
#ignore all 3* reviews
# products = products[products['rating'] != 3]

In [114]:
#positive sentiment = 4* or 5* reviews
# products['sentiment'] = products['rating'] >=4

In [115]:
# products.head(5)

#### Train the sentiment classifier

In [116]:
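# train the sentiment classifier on all word counts
# (despite its name, this model is trained on the full training set, not only on Diaper Champ reviews)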
diaperchamp_model = graphlab.logistic_classifier.create(train_data,
                                                     target='sentiment',
                                                     features=['word_count'],
                                                     validation_set = test_data)

#### Evaluate the sentiment model

In [117]:
diaperchamp_model['coefficients'].sort('value', ascending = False)

name,index,class,value,stderr
word_count,pinkjeep,1,13.5701456406,
word_count,(http://www.amazon.com/re view/rhgg6qp7tdnhb/re ...,1,12.3088000996,
word_count,label/box.,1,11.1774213484,
word_count,product.***,1,11.0639586141,
word_count,direct-pumping,1,11.0531278722,
word_count,taped.,1,10.0956313358,
word_count,17lb.,1,9.79947043521,
word_count,flow).,1,9.51388499429,
word_count,win.of,1,9.49609098985,
word_count,leacho!,1,9.49609098985,


In [118]:
diaperchamp_model.evaluate(test_data, metric='roc_curve')

{'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 
 Rows: 100001
 
 Data:
 +-----------+----------------+----------------+-------+------+
 | threshold |      fpr       |      tpr       |   p   |  n   |
 +-----------+----------------+----------------+-------+------+
 |    0.0    |      1.0       |      1.0       | 27976 | 5328 |
 |   1e-05   | 0.909346846847 | 0.998856162425 | 27976 | 5328 |
 |   2e-05   | 0.896021021021 | 0.998748927652 | 27976 | 5328 |
 |   3e-05   | 0.886448948949 | 0.998462968259 | 27976 | 5328 |
 |   4e-05   | 0.879692192192 | 0.998284243637 | 27976 | 5328 |
 |   5e-05   | 0.875187687688 | 0.998212753789 | 27976 | 5328 |
 |   6e-05   | 0.872184684685 | 0.998177008865 | 27976 | 5328 |
 |   7e-05   | 0.868618618619 | 0.998034029168 | 27976 | 5328 |
 |   8e-05   | 0.864677177177 | 0.997998284244 | 27976 | 5328 |
 |   9e-05   | 0.860735735736 | 0.997962539319 | 27976 | 5328 |
 +-----------+----------------+----------------+-------+------

In [119]:
diaperchamp_model.show(view='Evaluation')

In [120]:
# predicted probability of positive sentiment for the first Diaper Champ review (selected-words model)
selected_words_model.predict(diaperchamp_reviews[0:1], output_type='probability')

dtype: float
Rows: 1
[0.796940851290673]

#### Applying the model to understand sentiment for the Diaper Champ

In [121]:
diaperchamp_reviews['predicted_sentiment_sw'] = selected_words_model.predict(diaperchamp_reviews, output_type='probability')
diaperchamp_reviews = diaperchamp_reviews.sort('rating', ascending=False)

print selected_words_model.predict(diaperchamp_reviews[0:1], output_type='probability')

[0.9408763934283927]


#### Sort the reviews based on the predicted sentiment and explore

In [122]:
diaperchamp_reviews['predicted_sentiment_dc'] = diaperchamp_model.predict(diaperchamp_reviews, output_type='probability')
diaperchamp_reviews = diaperchamp_reviews.sort('rating', ascending=False)

print diaperchamp_model.predict(diaperchamp_reviews[0:1], output_type='probability')


[0.998693566706023]


In [123]:
diaperchamp_reviews.sort('predicted_sentiment_sw', ascending = False)

name,review,rating,word_count,sentiment,awesome
Baby Trend Diaper Champ,I LOVE LOVE LOVE this product! It is SO much ...,4.0,"{'rating': 1L, 'contacted': 1L, 'over': ...",1,0
Baby Trend Diaper Champ,I received my Diaper Champ at my baby shower ...,5.0,"{'bags.': 1L, ""don't"": 1L, 'son.': 1L, 'of,': ...",1,0
Baby Trend Diaper Champ,"Love it, love it, love it! This lives up to ...",5.0,"{'all': 1L, 'already': 1L, 'love': 3L, 'have': ...",1,0
Baby Trend Diaper Champ,I love this diaper pale and wouldn't dream of ...,5.0,"{'and': 3L, 'love': 1L, 'use.': 1L, 'is': 2L, ...",1,0
Baby Trend Diaper Champ,Works great - no smells. LOVE that it uses reg ...,5.0,"{'and': 2L, 'love': 1L, 'garbage': 1L, ...",1,0
Baby Trend Diaper Champ,Love the Diaper Champ. I had planned to get the ...,4.0,"{'reviews,': 1L, 'all': 1L, 'bags.': 1L, 'just': ...",1,0
Baby Trend Diaper Champ,I've worked with kids more than half my life. ...,5.0,"{'and': 4L, 'genies': 1L, 'now': 1L, 'because': ...",1,0
Baby Trend Diaper Champ,I love this diaper pail. It keeps the diapers ...,4.0,"{'and': 1L, 'bags.': 1L, 'extra': 1L, 'is': 1L, ...",1,0
Baby Trend Diaper Champ,We had 2 diaper Genie's both given to us as a ...,4.0,"{'hand.': 1L, '(required': 1L, ...",1,0
Baby Trend Diaper Champ,I have a two-year-old son and I love the Diaper ...,5.0,"{'and': 6L, 'two-year- old': 1L, ""toddler's"": ...",1,0

great,fantastic,amazing,love,horrible,bad,terrible,awful,wow,hate,predicted_sentiment_sw
1,0,0,3,0,0,0,0,0,0,0.998423414594
0,0,0,3,0,0,0,0,0,0,0.996192539732
0,0,0,3,0,0,0,0,0,0,0.996192539732
2,0,0,1,0,0,0,0,0,0,0.989387539605
2,0,0,1,0,0,0,0,0,0,0.989387539605
0,0,0,2,0,0,0,0,0,0,0.984739056527
0,0,0,2,0,0,0,0,0,0,0.984739056527
0,0,0,2,0,0,0,0,0,0,0.984739056527
0,0,0,2,0,0,0,0,0,0,0.984739056527
0,0,0,2,0,0,0,0,0,0,0.984739056527

predicted_sentiment_dc
0.999993652036
0.999301330286
0.985732101571
0.983086548255
0.998904798032
0.998471561712
0.999879939529
0.998240768806
0.99267406035
0.833101136873


In [124]:
diaperchamp_reviews.tail(5)

name,review,rating,word_count,sentiment,awesome
Baby Trend Diaper Champ,After 2 and half years I still can't get the s ...,1.0,"{'and': 1L, 'concept': 1L, 'love': 1L, 'do': ...",0,0
Baby Trend Diaper Champ,It stinks - we lysol ours 2-3 times a week. Dia ...,1.0,"{'and': 2L, 'lysol': 1L, 'garbage': 2L, 'is': 1L, ...",0,0
Baby Trend Diaper Champ,This is the worst diaper pail ever! It was great ...,1.0,"{'and': 4L, 'this': 2L, 'old': 1L, 'less': 1L, ...",0,0
Baby Trend Diaper Champ,Bad construction is my main issue. My husband ...,1.0,"{'control': 1L, 'and': 1L, 'everyone': 2L, ...",0,0
Baby Trend Diaper Champ,We loved this pail at first. The mechanism ...,1.0,"{'bags.': 1L, 'retire': 1L, 'isolated': 1L, ...",0,0

great,fantastic,amazing,love,horrible,bad,terrible,awful,wow,hate,predicted_sentiment_sw
0,0,0,1,0,0,0,0,0,0,0.940876393428
0,0,0,0,0,0,0,0,0,0,0.796940851291
1,0,0,0,0,0,0,0,0,0,0.904755808093
0,0,0,0,0,1,0,0,0,0,0.5942241719
0,0,0,0,0,0,0,0,0,0,0.796940851291

predicted_sentiment_dc
0.0272966332051
0.00875021314945
5.58405606877e-07
0.725517068639
0.167362480219
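
The predicted_sentiment_sw values can be reproduced by hand from the coefficient table printed for selected_words_model earlier in this notebook, assuming the usual logistic regression form: probability = sigmoid(intercept + sum of coefficient times count). The sketch below copies the coefficients from that table; 'terrible' is omitted because its row is cut off in the printed output, and the function name `predict_probability` is hypothetical.

```python
import math

# Coefficients copied from the selected_words_model coefficient table.
# 'terrible' is left out because its row is truncated in the output.
intercept = 1.36728315229
coefficients = {
    'love': 1.39989834302,
    'awesome': 1.05800888878,
    'amazing': 0.892802422508,
    'fantastic': 0.891303090304,
    'great': 0.883937894898,
    'wow': -0.0541450123333,
    'bad': -0.985827369929,
    'hate': -1.40916406276,
    'awful': -1.76469955631,
    'horrible': -1.99651800559,
}

def predict_probability(counts):
    """Probability of positive sentiment given selected-word counts."""
    score = intercept + sum(coefficients[w] * c for w, c in counts.items())
    return 1.0 / (1.0 + math.exp(-score))

print(predict_probability({}))            # no selected words at all
print(predict_probability({'great': 1}))
print(predict_probability({'bad': 1}))
```

These three calls correspond to the values 0.796940851291, 0.904755808093 and 0.5942241719 in the table above. A review that contains none of the selected words is scored from the intercept alone, so the 1-star reviews with all-zero features still get a probability of about 0.80. This is a large part of why the model trained on all word counts separates these reviews better.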


#### Most positive reviews for the Diaper Champ

In [125]:
diaperchamp_reviews[0]['review']

'this works really well.  I found it easier than diaper gienie and you can drop in the diaper and dispose with one hand.  To keep odor down you need to change before it gets too full, couple times a week  or more often if have a couple poopy diapersYou also can use any bags.  I have one on each floor of my home.Also keeps the dogs from getting the diapers.  occasionally a diaper will get stuck but that is usually when it is getting too full.  Only prob I have found is when I push on the lever to open to change the bag it is kind of difficult to unlatch and have broken a nail on it a few times- no biggie, kind of annoying.'

In [126]:
diaperchamp_reviews[1]['review']

"Let me just say, I LOVE THIS PRODUCT!!  I used the diaper genie from the time my daughter was born until the time she was 16 months.  That was all I could take.  Constantly buying expensive refills, emptying it every couple of days, juggling a wiggly baby while trying to open, lift, push, spin, and close the genie was just too much.  Then I was shopping at Babies R Us and in the STORE's changing room is the Diaper Champ.  It was easy, didn't smell, and used regular trash bags.  I was sold.After using the Diaper Champ for 2 months now, I am confident I made the right choice.  Yes, when it gets too full, you have to change the bag or the weight will get stuck (duh!).  Yes, if you don't wrap up the poopy wipies in the dirty diaper, you will have to clean poop from the chute (just wrap it up).  Yes, poop does smell (not like roses), but my daughter's room doesn't smell like poop because the Diaper Champ does a great job of containing odor.  You do need to disenfect it when you change the 

#### Most negative reviews for the Diaper Champ

In [127]:
diaperchamp_reviews[-1]['review']

'We loved this pail at first. The mechanism seemed ingenius, and we appreciated that it took regular bags. But once our daughter started to stand up, that big white handle was irresistable to her, and before we even realized it, she started flipping it. That makes the heavy center portion slide back and forth FAST, and sure enough she caught her finger and started yowling.This was *not* an isolated incident -- if you play with the pail a little bit, you\'ll quickly see that the mechanism is a total finger trap for toddlers. Worse yet, they can slam a finger then hurt it seriously in their attempts to get it free by pushing the handle the wrong way. It\'s really pretty scary.You\'ll notice that the positive reviews here generally say "I\'ve been using this for two months, and it\'s great!" But once you have older babies, I\'d retire it pronto.'

In [128]:
diaperchamp_reviews[-2]['review']

"Bad construction is my main issue. My husband assembled it and when changing the bag, you're supposed to open up the top head part which tilts back.  There's a little plastic tab that's suppused to slide in as your opening the top, but instead it get's cought so it allows the top to open up only partially. It may just be one defective item that somehow passed the quality control inspection, but I've given up on diaper pails in general.  Talking to 1st time parents, I found out that almost everyone gets one, but almost everyone stops using them very quickly."