## Week 3 Quiz: Analyzing Product Sentiment

In [1]:
import graphlab

In [2]:
products = graphlab.SFrame('amazon_baby.gl/')
selected_words = ['awesome', 'great', 'fantastic', 'amazing', 'love', 
                  'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']

### 1. Use .apply() to build a new feature with the counts for each of the selected_words

In [3]:
# create word counts for words in review
products['word_count'] = graphlab.text_analytics.count_words(products['review'])

# if word present in dictionary return its value, else zero
def word_count(dictionary, word):
    if word in dictionary:
        return dictionary[word]
    else:
        return 0

# for each selected word, add a column holding that word's count in every review;
# apply() calls word_count once per row, on that row's word_count dictionary
for word in selected_words:
    products[word] = products['word_count'].apply(lambda x: word_count(x, word))
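
The same per-word lookup can be sketched without GraphLab. This is a minimal illustration, not the library's implementation: the `count_words` helper below is a hypothetical stand-in that lowercases and splits on whitespace, and the review text is made up.

```python
from collections import Counter

def count_words(text):
    # Hypothetical stand-in for graphlab.text_analytics.count_words:
    # lowercase the text and split on whitespace (an assumption about tokenizing).
    return Counter(text.lower().split())

selected_words = ['awesome', 'great', 'love', 'hate']
review = "Great pail and my kids love it love it"  # made-up review text
counts = count_words(review)

# dict.get(word, 0) returns the count, or 0 when the word is absent,
# which is what the word_count() helper above does with if/else.
features = {word: counts.get(word, 0) for word in selected_words}
print(features)
```

Using `dict.get` with a default collapses the explicit `if word in dictionary` branch into a single expression.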

In [4]:
# which word is most/least used?
for word in selected_words:
    print word, products[word].sum()

awesome 2090
great 45206
fantastic 932
amazing 1363
love 42065
horrible 734
bad 3724
terrible 748
awful 383
wow 144
hate 1220
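
To answer the most/least used question directly rather than by scanning the printout, the totals can be placed in a dictionary and compared. A small sketch using the sums printed above; the variable names are illustrative only.

```python
# Totals copied from the output of the cell above.
totals = {
    'awesome': 2090, 'great': 45206, 'fantastic': 932, 'amazing': 1363,
    'love': 42065, 'horrible': 734, 'bad': 3724, 'terrible': 748,
    'awful': 383, 'wow': 144, 'hate': 1220,
}

# max/min over the keys, ranked by their totals.
most_used = max(totals, key=totals.get)
least_used = min(totals, key=totals.get)

print("most used: %s (%d)" % (most_used, totals[most_used]))
print("least used: %s (%d)" % (least_used, totals[least_used]))
```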


### 2. Create a new sentiment analysis model using only the selected_words as features

In [5]:
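# drop 3-star reviews, which are treated as neutral
products = products[products['rating'] != 3]
# positive sentiment = 4-star or 5-star review
products['sentiment'] = products['rating'] >= 4.0
# 80/20 train/test split with a fixed seed for reproducibility
train_data, test_data = products.random_split(0.8, seed=0)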

selected_words_model = graphlab.logistic_classifier.create(train_data,
                                                          target='sentiment',
                                                          features=selected_words,
                                                          validation_set=test_data)
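
The labeling rule in this cell (drop 3-star reviews, then mark ratings of 4 or more as positive) can be checked on a few values in plain Python. The ratings below are hypothetical, chosen only to exercise each branch.

```python
ratings = [5, 3, 1, 4, 2, 3]  # hypothetical ratings, not taken from the dataset

# Step 1: remove the neutral 3-star reviews.
kept = [r for r in ratings if r != 3]

# Step 2: label 4 and 5 stars as positive (1), 1 and 2 stars as negative (0).
labels = [1 if r >= 4 else 0 for r in kept]

for rating, label in zip(kept, labels):
    print("rating %d -> sentiment %d" % (rating, label))
```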

In [17]:
selected_words_model['coefficients'].sort(['value']).print_rows(num_rows=12)

+-------------+-------+-------+------------------+------------------+
|     name    | index | class |      value       |      stderr      |
+-------------+-------+-------+------------------+------------------+
|   terrible  |  None |   1   |  -2.09049998487  | 0.0967241912229  |
|   horrible  |  None |   1   |  -1.99651800559  | 0.0973584169028  |
|    awful    |  None |   1   |  -1.76469955631  |  0.134679803365  |
|     hate    |  None |   1   |  -1.40916406276  | 0.0771983993506  |
|     bad     |  None |   1   | -0.985827369929  | 0.0433603009142  |
|     wow     |  None |   1   | -0.0541450123333 |  0.275616449416  |
|    great    |  None |   1   |  0.883937894898  | 0.0217379527921  |
|  fantastic  |  None |   1   |  0.891303090304  |  0.154532343591  |
|   amazing   |  None |   1   |  0.892802422508  |  0.127989503231  |
|   awesome   |  None |   1   |  1.05800888878   |  0.110865296265  |
| (intercept) |  None |   1   |  1.36728315229   | 0.00861805467824 |
|     love    |  Non
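
One way to read these coefficients, assuming the model is a standard binary logistic regression in which class 1 is positive sentiment: the score is the intercept plus each coefficient times that word's count, and the sigmoid of the score is the predicted probability. The sketch below uses the values printed in the table; `love` is left out because its row is cut off in this export, and the example counts are hypothetical.

```python
import math

# Values copied from the coefficient table above ('love' omitted: row truncated).
intercept = 1.36728315229
coefficients = {
    'terrible': -2.09049998487, 'horrible': -1.99651800559,
    'awful': -1.76469955631, 'hate': -1.40916406276,
    'bad': -0.985827369929, 'wow': -0.0541450123333,
    'great': 0.883937894898, 'fantastic': 0.891303090304,
    'amazing': 0.892802422508, 'awesome': 1.05800888878,
}

def predict_probability(word_counts):
    # Linear score: intercept + sum(coefficient * count), then the sigmoid.
    score = intercept + sum(coefficients[w] * c for w, c in word_counts.items())
    return 1.0 / (1.0 + math.exp(-score))

p_none = predict_probability({})                       # no selected words present
p_mixed = predict_probability({'great': 2, 'bad': 1})  # hypothetical counts
print("no selected words: %.4f" % p_none)
print("great x2, bad x1:  %.4f" % p_mixed)
```

Because the intercept is positive, a review containing none of the selected words is still scored as more likely positive than negative under this reading.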

### 3. Comparing the accuracy of different sentiment analysis models

In [7]:
selected_words_model.evaluate(test_data, metric='roc_curve')

{'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 
 Rows: 100001
 
 Data:
 +-----------+-----+-----+-------+------+
 | threshold | fpr | tpr |   p   |  n   |
 +-----------+-----+-----+-------+------+
 |    0.0    | 1.0 | 1.0 | 27976 | 5328 |
 |   1e-05   | 1.0 | 1.0 | 27976 | 5328 |
 |   2e-05   | 1.0 | 1.0 | 27976 | 5328 |
 |   3e-05   | 1.0 | 1.0 | 27976 | 5328 |
 |   4e-05   | 1.0 | 1.0 | 27976 | 5328 |
 |   5e-05   | 1.0 | 1.0 | 27976 | 5328 |
 |   6e-05   | 1.0 | 1.0 | 27976 | 5328 |
 |   7e-05   | 1.0 | 1.0 | 27976 | 5328 |
 |   8e-05   | 1.0 | 1.0 | 27976 | 5328 |
 |   9e-05   | 1.0 | 1.0 | 27976 | 5328 |
 +-----------+-----+-----+-------+------+
 [100001 rows x 5 columns]
 Note: Only the head of the SFrame is printed.
 You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.}
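
Each row of this table is one point on the ROC curve: for a given threshold, `tpr` is the share of the `p` positive reviews predicted positive and `fpr` is the share of the `n` negative reviews predicted positive. A minimal sketch of that computation follows; the scores and labels are hypothetical, and treating a score at or above the threshold as a positive prediction is an assumption about the tie rule, not something the output confirms.

```python
def roc_point(scores, labels, threshold):
    # Count true and false positives when predicting positive for
    # every score at or above the threshold (assumed tie rule).
    tp = fp = 0
    for score, label in zip(scores, labels):
        if score >= threshold:
            if label == 1:
                tp += 1
            else:
                fp += 1
    p = sum(labels)       # number of positive examples
    n = len(labels) - p   # number of negative examples
    return fp / float(n), tp / float(p)

scores = [0.95, 0.80, 0.60, 0.40, 0.20]  # hypothetical predicted probabilities
labels = [1, 1, 0, 1, 0]                 # hypothetical true sentiments

fpr_low, tpr_low = roc_point(scores, labels, 0.0)
fpr_mid, tpr_mid = roc_point(scores, labels, 0.5)
print("threshold 0.0: fpr=%.2f tpr=%.2f" % (fpr_low, tpr_low))
print("threshold 0.5: fpr=%.2f tpr=%.2f" % (fpr_mid, tpr_mid))
```

At a threshold of 0 every example is predicted positive, which is why the first rows of the table show `fpr` and `tpr` both equal to 1.0.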

In [15]:
selected_words_model.show(view='Evaluation')

In [9]:
test_data['sentiment'].sum()

27976

In [10]:
test_data.shape

(33304, 16)

In [11]:
27976.0/33304.0

0.8400192169108815
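
The ratio above is the accuracy of a majority-class baseline: a classifier that predicts "positive" for every review is right exactly as often as positive reviews occur in the test set. A short sketch using the counts from the cells above; the variable names are illustrative.

```python
positives = 27976             # test_data['sentiment'].sum()
total = 33304                 # number of rows in test_data
negatives = total - positives

# Always predicting the more common class is correct on that class only.
majority_accuracy = max(positives, negatives) / float(total)

print("negatives: %d" % negatives)
print("majority-class accuracy: %.4f" % majority_accuracy)
```

The derived negative count agrees with the `n` column of the ROC table, and any sentiment model worth using should beat this roughly 84% baseline.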

### 4. Interpreting the difference in performance between the models

In [12]:
# all reviews of a single product
diaper_champ_reviews = products[products['name'] == 'Baby Trend Diaper Champ']
# comparison model trained on the full word_count dictionary instead of the selected words
sentiment_model = graphlab.logistic_classifier.create(train_data,
                                                     target='sentiment',
                                                     features=['word_count'],
                                                     validation_set=test_data)
sentiment_model.evaluate(test_data, metric='roc_curve')
sentiment_model.show(view='Evaluation')
# score each Diaper Champ review and list the most positive first
diaper_champ_reviews['predicted_sentiment'] = sentiment_model.predict(diaper_champ_reviews, output_type='probability')
diaper_champ_reviews.sort('predicted_sentiment', ascending=False)

name,review,rating,word_count,awesome,great,fantastic
Baby Trend Diaper Champ,Baby Luke can turn a clean diaper to a dirty ...,5.0,"{'all': 1, 'less': 1, ""friend's"": 1, '(which': ...",0,0,0
Baby Trend Diaper Champ,I LOOOVE this diaper pail! Its the easies ...,5.0,"{'just': 1, 'over': 1, 'rweek': 1, 'sooo': 1, ...",0,0,0
Baby Trend Diaper Champ,We researched all of the different types of di ...,4.0,"{'all': 2, 'just': 4, ""don't"": 2, 'one,': 1, ...",0,0,0
Baby Trend Diaper Champ,My baby is now 8 months and the can has been ...,5.0,"{""don't"": 1, 'able': 2, 'over': 1, 'soon': 1, ...",0,2,0
Baby Trend Diaper Champ,"This is absolutely, by far, the best diaper ...",5.0,"{'just': 3, 'money': 1, 'still': 3, 'fine': 1, ...",0,0,0
Baby Trend Diaper Champ,Diaper Champ or Diaper Genie? That was my ...,5.0,"{'son': 2, 'all': 1, 'bags.': 1, 'son,': 1, ...",0,0,0
Baby Trend Diaper Champ,Wow! This is fabulous. It was a toss-up between ...,5.0,"{'and': 4, 'this': 3, 'stink': 1, 'garbage' ...",0,0,0
Baby Trend Diaper Champ,I originally put this item on my baby registry ...,5.0,"{'lysol': 1, 'all': 2, 'bags.': 1, 'feedback': ...",0,0,0
Baby Trend Diaper Champ,Two girlfriends and two family members put me ...,5.0,"{'just': 1, '-': 3, 'both': 1, 'results': 1, ...",0,0,0
Baby Trend Diaper Champ,I am one of those super- critical shoppers who ...,5.0,"{'all': 1, 'humid': 1, 'just': 1, 'less': 1, ...",0,0,0

amazing,love,horrible,bad,terrible,awful,wow,hate,sentiment,predicted_sentiment
0,0,0,0,0,0,0,0,1,0.999999937267
0,1,0,0,0,0,0,0,1,0.999999917406
0,0,0,1,0,0,0,0,1,0.999999899509
0,0,0,1,0,0,0,0,1,0.999999836182
0,2,0,0,0,0,0,0,1,0.999999824745
0,0,0,0,0,0,0,0,1,0.999999759315
0,0,0,0,0,0,0,0,1,0.999999692111
0,0,0,0,0,0,0,0,1,0.999999642488
0,0,1,0,0,0,0,0,1,0.999999604504
0,1,0,0,0,0,0,0,1,0.999999486804


In [13]:
diaper_champ_reviews['predicted_sentiment'] = selected_words_model.predict(diaper_champ_reviews, output_type='probability')
diaper_champ_reviews.sort('predicted_sentiment', ascending=False)

name,review,rating,word_count,awesome,great,fantastic
Baby Trend Diaper Champ,I LOVE LOVE LOVE this product! It is SO much ...,4.0,"{'rating': 1, 'contacted': 1, 'over': ...",0,1,0
Baby Trend Diaper Champ,I received my Diaper Champ at my baby shower ...,5.0,"{'bags.': 1, ""don't"": 1, 'son.': 1, 'of,': 1, ...",0,0,0
Baby Trend Diaper Champ,"Love it, love it, love it! This lives up to ...",5.0,"{'instead': 1, 'all': 1, 'already': 1, 'love': 3, ...",0,0,0
Baby Trend Diaper Champ,Works great - no smells. LOVE that it uses reg ...,5.0,"{'and': 2, 'bags.': 1, 'garbage': 1, 'wastef ...",0,2,0
Baby Trend Diaper Champ,I love this diaper pale and wouldn't dream of ...,5.0,"{'and': 3, 'love': 1, 'use.': 1, 'is': 2, ' ...",0,2,0
Baby Trend Diaper Champ,I've worked with kids more than half my life. ...,5.0,"{'and': 4, 'genies': 1, 'all': 1, 'because': 1, ...",0,0,0
Baby Trend Diaper Champ,I love this diaper pail. It keeps the diapers ...,4.0,"{'and': 1, 'old': 1, 'extra': 1, 'is': 1, ...",0,0,0
Baby Trend Diaper Champ,"This is absolutely, by far, the best diaper ...",5.0,"{'just': 3, 'money': 1, 'still': 3, 'fine': 1, ...",0,0,0
Baby Trend Diaper Champ,Love the Diaper Champ. I had planned to get the ...,4.0,"{'reviews,': 1, 'infant': 1, 'bags.': 1, 'just' ...",0,0,0
Baby Trend Diaper Champ,We had 2 diaper Genie's both given to us as a ...,4.0,"{'hand.': 1, 'both': 1, '(required': 1, 'befo ...",0,0,0

amazing,love,horrible,bad,terrible,awful,wow,hate,sentiment,predicted_sentiment
0,3,0,0,0,0,0,0,1,0.998423414594
0,3,0,0,0,0,0,0,1,0.996192539732
0,3,0,0,0,0,0,0,1,0.996192539732
0,1,0,0,0,0,0,0,1,0.989387539605
0,1,0,0,0,0,0,0,1,0.989387539605
0,2,0,0,0,0,0,0,1,0.984739056527
0,2,0,0,0,0,0,0,1,0.984739056527
0,2,0,0,0,0,0,0,1,0.984739056527
0,2,0,0,0,0,0,0,1,0.984739056527
0,2,0,0,0,0,0,0,1,0.984739056527


In [14]:
selected_words_model.predict(diaper_champ_reviews[0:1], output_type='probability')

dtype: float
Rows: 1
[0.796940851290673]
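
This probability can be turned back into a score to see what produced it. Assuming the model applies a sigmoid to a linear score, the log-odds of the prediction should equal that score. In the sketch below, the log-odds come out essentially equal to the `(intercept)` value from the coefficient table, which would be the case if this review contains none of the selected words.

```python
import math

p = 0.796940851290673      # prediction printed above
intercept = 1.36728315229  # (intercept) row of the coefficient table

# Inverse of the sigmoid: log-odds = log(p / (1 - p)).
log_odds = math.log(p / (1.0 - p))

print("log-odds of prediction: %.6f" % log_odds)
print("model intercept:        %.6f" % intercept)
```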