# Week 3: Analyzing product sentiment

In this module, we focused on classifiers: applying them to product sentiment analysis and understanding the types of errors a classifier makes. We also built an IPython notebook for analyzing the sentiment of real product reviews.

In this assignment, we explore this application further: we train a sentiment analysis model using a set of key polarizing words, inspect the weight learned for each of these words, and compare the results of this simpler classifier with those of the one using all of the words. These techniques will be a core component of your capstone project.

Learning outcomes

- Execute sentiment analysis code with the IPython notebook
- Load and transform real text data
- Use the .apply() function to create new columns (features) for our model
- Compare results of two models, one using all words and the other using a subset of the words
- Compare learned models with majority class prediction
- Examine the predictions of a sentiment model
- Build a sentiment analysis model using a classifier

In [46]:
import graphlab

## Load and explore the data

In [105]:
products = graphlab.SFrame("amazon_baby.gl/")

In [106]:
products['word_count'] = graphlab.text_analytics.count_words(products['review'])

In [107]:
graphlab.canvas.set_target("ipynb")

In [108]:
selected_words = ['awesome', 'great', 'fantastic', 'amazing', 'love', 
                  'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']

for word in selected_words:
    # Bind `word` as a default argument so each column counts its own word.
    def feature_count(dic_word_count, word=word):
        # Number of times `word` appears in the review; 0 if it is absent.
        return dic_word_count.get(word, 0)
    products[word] = products['word_count'].apply(feature_count)
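
To see what the loop above computes for a single review, here is a minimal plain-Python sketch that does not need graphlab. The `count_words` helper is only an approximation (lowercase, split on whitespace) and the review text is made up; both are assumptions for illustration, not the library's implementation.

```python
selected_words = ['awesome', 'great', 'fantastic', 'amazing', 'love',
                  'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']

def count_words(text):
    """Approximate a word-count dict: lowercase the text, split on whitespace."""
    counts = {}
    for token in text.lower().split():
        counts[token] = counts.get(token, 0) + 1
    return counts

def selected_word_features(word_count):
    """One count per selected word; 0 when the word does not occur."""
    return dict((word, word_count.get(word, 0)) for word in selected_words)

review = "Love it, love the price. Great!"  # hypothetical review text
features = selected_word_features(count_words(review))

print(features['love'])   # 'love' occurs twice after lowercasing
print(features['great'])  # the token is 'great!', so 'great' is not counted
```

Because punctuation stays attached to tokens in this sketch, "Great!" does not count toward `great`. The word-count dictionaries printed later in this notebook contain keys such as `'champ,'` and `'champ.'`, which suggests the real counts behave similarly.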

In [109]:
for i in selected_words:
    print i, products[i].sum()
#products['awesome']

awesome 2090
great 45206
fantastic 932
amazing 1363
love 42065
horrible 734
bad 3724
terrible 748
awful 383
wow 144
hate 1220


In [114]:
products = products[products['rating'] != 3]
products['sentiment'] = products['rating'] >= 4

In [144]:
print len(products)
print products['sentiment'].sum()

166752
140259
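
These two counts give the majority class baseline directly. The sketch below only does the arithmetic on the numbers printed above; the variable names are ours.

```python
total_reviews = 166752      # len(products) after dropping 3-star reviews
positive_reviews = 140259   # products['sentiment'].sum()

negative_reviews = total_reviews - positive_reviews

# A classifier that always predicts the larger class is right this often.
majority_accuracy = max(positive_reviews, negative_reviews) / float(total_reviews)

print(negative_reviews)
print(round(majority_accuracy, 4))
```

Any learned model should be judged against this figure: with roughly 84% of the reviews positive, an accuracy near 0.84 is no better than always answering "positive".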

## Create the logistic model

In [115]:
train_data, test_data = products.random_split(.8, seed = 0)
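
As a rough mental model of a seeded random split, the sketch below assigns each row to the first part with probability `fraction`. This is an assumption made for illustration, not graphlab's actual implementation, and the `random_split` function here is our own plain-Python helper.

```python
import random

def random_split(rows, fraction, seed):
    """Send each row to the first list with probability `fraction`."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second

rows = list(range(1000))  # stand-in for the rows of an SFrame
train, test = random_split(rows, 0.8, seed=0)

print(len(train))
print(len(test))
```

Under this model the split is only approximately 80/20, which is consistent with the training log in the next cell reporting 133448 examples out of 166752 rather than exactly 80%.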

In [116]:
selected_words_model = graphlab.logistic_classifier.create(train_data, 
                                              target='sentiment', 
                                              features=selected_words, 
                                              validation_set=test_data)

PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 133448
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 11
PROGRESS: Number of unpacked features : 11
PROGRESS: Number of coefficients    : 12
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 2        | 0.180066     | 0.844299          | 0.842842            |
PROGRESS: | 2         | 3        | 0.293074     | 0.844186          | 0.842842            |
PROGRESS: | 3         | 4        | 0.405487     | 0.844276          | 0.843142            |
PROGRESS: | 4         | 5        |

In [121]:
coefficients = selected_words_model['coefficients'].sort('value', ascending=False)

In [140]:
len(coefficients)
coefficients.print_rows(num_rows=12)

+-------------+-------+-------+------------------+------------------+
|     name    | index | class |      value       |      stderr      |
+-------------+-------+-------+------------------+------------------+
|     love    |  None |   1   |  1.39989834302   | 0.0287147460124  |
| (intercept) |  None |   1   |  1.36728315229   | 0.00861805467824 |
|   awesome   |  None |   1   |  1.05800888878   |  0.110865296265  |
|   amazing   |  None |   1   |  0.892802422508  |  0.127989503231  |
|  fantastic  |  None |   1   |  0.891303090304  |  0.154532343591  |
|    great    |  None |   1   |  0.883937894898  | 0.0217379527921  |
|     wow     |  None |   1   | -0.0541450123333 |  0.275616449416  |
|     bad     |  None |   1   | -0.985827369929  | 0.0433603009142  |
|     hate    |  None |   1   |  -1.40916406276  | 0.0771983993506  |
|    awful    |  None |   1   |  -1.76469955631  |  0.134679803365  |
|   horrible  |  None |   1   |  -1.99651800559  | 0.0973584169028  |
|   terrible  |  Non

## Evaluate the model

In [123]:
selected_words_model.evaluate(test_data, metric='roc_curve')

{'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 
 Rows: 100001
 
 Data:
 +-----------+-----+-----+-------+------+
 | threshold | fpr | tpr |   p   |  n   |
 +-----------+-----+-----+-------+------+
 |    0.0    | 1.0 | 1.0 | 27976 | 5328 |
 |   1e-05   | 1.0 | 1.0 | 27976 | 5328 |
 |   2e-05   | 1.0 | 1.0 | 27976 | 5328 |
 |   3e-05   | 1.0 | 1.0 | 27976 | 5328 |
 |   4e-05   | 1.0 | 1.0 | 27976 | 5328 |
 |   5e-05   | 1.0 | 1.0 | 27976 | 5328 |
 |   6e-05   | 1.0 | 1.0 | 27976 | 5328 |
 |   7e-05   | 1.0 | 1.0 | 27976 | 5328 |
 |   8e-05   | 1.0 | 1.0 | 27976 | 5328 |
 |   9e-05   | 1.0 | 1.0 | 27976 | 5328 |
 +-----------+-----+-----+-------+------+
 [100001 rows x 5 columns]
 Note: Only the head of the SFrame is printed.
 You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.}
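
Each row of this table is one point on the ROC curve. The sketch below shows how such a point could be computed, assuming a review is predicted positive when its score is at or above the threshold (the library's exact tie-breaking rule is not shown in the output). The labels and scores are toy values, not taken from the dataset.

```python
def roc_point(labels, scores, threshold):
    """Return (fpr, tpr) when predicting positive for score >= threshold."""
    p = sum(1 for y in labels if y == 1)   # number of actual positives
    n = len(labels) - p                    # number of actual negatives
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return fp / float(n), tp / float(p)

labels = [1, 1, 1, 0, 0]                   # hypothetical true sentiments
scores = [0.95, 0.80, 0.40, 0.60, 0.10]    # hypothetical predicted probabilities

print(roc_point(labels, scores, 0.0))  # every review is predicted positive
print(roc_point(labels, scores, 0.5))
```

At a threshold of 0.0 everything is predicted positive, so both rates equal 1.0, as in the first row above. The `p` and `n` columns are the counts of positive and negative reviews in the test set and do not change with the threshold.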

In [124]:
selected_words_model.show(view='Evaluation')

## Create the 'diaper_champ_reviews' model

In [125]:
diaper_champ_reviews = products[products['name'] == 'Baby Trend Diaper Champ']

In [126]:
diaper_champ_reviews['rating'].show(view='Categorical')

In [132]:
# sentiment_model
sentiment_model = graphlab.logistic_classifier.create(train_data, 
                                              target='sentiment', 
                                              features=['word_count'], 
                                              validation_set=test_data)
sentiment_model.evaluate(test_data, metric='roc_curve')
sentiment_model.show(view='Evaluation')

PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 133448
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 1
PROGRESS: Number of unpacked features : 219217
PROGRESS: Number of coefficients    : 219218
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 5        | 0.000002  | 1.448898     | 0.841481          | 0.839989            |
PROGRESS: | 2         | 9        | 3.000000  | 2.866022     | 0.947425          | 0.894877            |
PROGRESS: | 3         | 10       | 3.000000  | 3.427268     | 0.92

In [134]:
# sentiment_model, continued
diaper_champ_reviews['old_predicted_sentiment'] = sentiment_model.predict(diaper_champ_reviews, output_type='probability')
diaper_champ_reviews = diaper_champ_reviews.sort('old_predicted_sentiment', ascending=False)
diaper_champ_reviews.head()

name,review,rating,word_count,awesome,great,fantastic
Baby Trend Diaper Champ,Baby Luke can turn a clean diaper to a dirty ...,5.0,"{'all': 1, 'less': 1, ""friend's"": 1, '(which': ...",0,0,0
Baby Trend Diaper Champ,I LOOOVE this diaper pail! Its the easies ...,5.0,"{'just': 1, 'over': 1, 'rweek': 1, 'sooo': 1, ...",0,0,0
Baby Trend Diaper Champ,We researched all of the different types of di ...,4.0,"{'all': 2, 'just': 4, ""don't"": 2, 'one,': 1, ...",0,0,0
Baby Trend Diaper Champ,My baby is now 8 months and the can has been ...,5.0,"{""don't"": 1, 'able': 2, 'over': 1, 'soon': 1, ...",0,2,0
Baby Trend Diaper Champ,"This is absolutely, by far, the best diaper ...",5.0,"{'just': 3, 'money': 1, 'still': 3, 'fine': 1, ...",0,0,0
Baby Trend Diaper Champ,Diaper Champ or Diaper Genie? That was my ...,5.0,"{'son': 2, 'all': 1, 'bags.': 1, 'son,': 1, ...",0,0,0
Baby Trend Diaper Champ,Wow! This is fabulous. It was a toss-up between ...,5.0,"{'and': 4, 'this': 3, 'stink': 1, 'garbage' ...",0,0,0
Baby Trend Diaper Champ,I originally put this item on my baby registry ...,5.0,"{'lysol': 1, 'all': 2, 'bags.': 1, 'feedback': ...",0,0,0
Baby Trend Diaper Champ,Two girlfriends and two family members put me ...,5.0,"{'just': 1, '-': 3, 'both': 1, 'results': 1, ...",0,0,0
Baby Trend Diaper Champ,I am one of those super- critical shoppers who ...,5.0,"{'all': 1, 'humid': 1, 'just': 1, 'less': 1, ...",0,0,0

amazing,love,horrible,bad,terrible,awful,wow,hate,sentiment,predicted_sentiment,old_predicted_sentiment
0,0,0,0,0,0,0,0,1,0.796940851291,0.999999937267
0,1,0,0,0,0,0,0,1,0.940876393428,0.999999917406
0,0,0,1,0,0,0,0,1,0.5942241719,0.999999899509
0,0,0,1,0,0,0,0,1,0.895606298305,0.999999836182
0,2,0,0,0,0,0,0,1,0.984739056527,0.999999824745
0,0,0,0,0,0,0,0,1,0.796940851291,0.999999759315
0,0,0,0,0,0,0,0,1,0.796940851291,0.999999692111
0,0,0,0,0,0,0,0,1,0.796940851291,0.999999642488
0,0,1,0,0,0,0,0,1,0.347684052736,0.999999604504
0,1,0,0,0,0,0,0,1,0.940876393428,0.999999486804


In [135]:
selected_words_model.predict(diaper_champ_reviews[0:1], output_type='probability')

dtype: float
Rows: 1
[0.7969408512906712]
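
This probability can be reproduced by hand from the coefficient table. The sketch below assumes the usual logistic form, a sigmoid applied to the intercept plus the weighted word counts, and copies the values from the table above. The weight for 'terrible' is cut off in that output, so it is left out here rather than guessed.

```python
import math

# Values copied from the coefficients table; 'terrible' is omitted because
# its value is truncated in the printed output.
coefficients = {
    '(intercept)': 1.36728315229,
    'love': 1.39989834302,
    'awesome': 1.05800888878,
    'amazing': 0.892802422508,
    'fantastic': 0.891303090304,
    'great': 0.883937894898,
    'wow': -0.0541450123333,
    'bad': -0.985827369929,
    'hate': -1.40916406276,
    'awful': -1.76469955631,
    'horrible': -1.99651800559,
}

def predict_probability(word_counts):
    """Sigmoid of intercept + sum(weight * count) over the selected words."""
    score = coefficients['(intercept)']
    for word, count in word_counts.items():
        score += coefficients[word] * count
    return 1.0 / (1.0 + math.exp(-score))

print(predict_probability({}))           # no selected words: about 0.7969
print(predict_probability({'love': 1}))  # one 'love': about 0.9409
```

A review containing none of the selected words gets a score equal to the intercept alone, which is why the top Diaper Champ review (all selected-word counts are 0) receives 0.7969 from this model even though the full model rates it near 1.0.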

THE END!

In [137]:
diaper_champ_reviews[0]['review']
diaper_champ_reviews[0]['word_count']

{'"what': 1,
 '(which': 1,
 '3': 1,
 'a': 6,
 'absolutly': 2,
 'added': 1,
 'all': 1,
 'and': 6,
 'any': 1,
 'are': 1,
 'around': 1,
 'at': 1,
 'baby': 3,
 'bag': 1,
 'bag,': 1,
 'bags': 1,
 'bassinet': 1,
 'because': 1,
 'best': 2,
 'bjorn,': 1,
 'bulk': 1,
 'can': 1,
 'champ': 1,
 'champ,': 2,
 'champ.': 1,
 'changing': 1,
 'chanp': 1,
 'clean': 1,
 'comparison,': 1,
 'deffinite': 1,
 'diaper': 7,
 'difficult': 1,
 'dirty': 1,
 'easy': 2,
 'economical,': 1,
 'edge': 1,
 'effective,': 1,
 'eminating': 1,
 'fabulous.updatei': 1,
 'flat.': 1,
 'fluerville': 1,
 'for': 2,
 'found': 1,
 'free,': 1,
 "friend's": 1,
 'from': 1,
 'garbage': 1,
 'genie': 2,
 'genieplus': 1,
 'graco': 1,
 'handed': 1,
 'have': 1,
 'hesitated': 1,
 'house': 1,
 'i': 3,
 'if': 1,
 'in': 2,
 'integrated': 1,
 'into': 2,
 'is': 4,
 "isn't": 1,
 'knew': 1,
 'less': 1,
 'little': 1,
 'loved': 1,
 'luke': 1,
 'made.': 1,
 'needed': 1,
 'no': 1,
 'nursery.': 1,
 'odor': 1,
 'of': 2,
 'on': 1,
 'one': 3,
 'pack': 1,
 '

## Create the 'giraffe_reviews' model

In [21]:
giraffe_reviews = products[products['name'] == "Vulli Sophie the Giraffe Teether"]

In [22]:
len(giraffe_reviews)

723

In [23]:
giraffe_reviews["rating"].show(view = "Categorical")

In [24]:
products["rating"].show(view = 'Categorical')

In [25]:
products = products[products['rating'] != 3]

In [30]:
products = products[products['rating'] != 3]
products['sentiment'] = products['rating'] >= 4

In [36]:
#products.head()

In [37]:
train_data, test_data = products.random_split(.8, seed = 0)

In [39]:
sentiment_model = graphlab.logistic_classifier.create(train_data, 
                                              target='sentiment', 
                                              features=['word count'], 
                                              validation_set=test_data)

PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 133448
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 1
PROGRESS: Number of unpacked features : 219217
PROGRESS: Number of coefficients    : 219218
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 5        | 0.000002  | 1.460195     | 0.841481          | 0.839989            |
PROGRESS: | 2         | 9        | 3.000000  | 2.869801     | 0.947425          | 0.894877            |
PROGRESS: | 3         | 10       | 3.000000  | 3.434164     | 0.92

In [40]:
# sentiment_model
sentiment_model = graphlab.logistic_classifier.create(train_data, 
                                              target='sentiment', 
                                              features=['word count'], 
                                              validation_set=test_data)
sentiment_model.evaluate(test_data, metric='roc_curve')
sentiment_model.show(view='Evaluation')


diaper_champ_reviews['predicted_sentiment'] = sentiment_model.predict(diaper_champ_reviews, output_type='probability')

diaper_champ_reviews = diaper_champ_reviews.sort('predicted_sentiment', ascending=False)

{'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 
 Rows: 100001
 
 Data:
 +-----------+----------------+----------------+-------+------+
 | threshold |      fpr       |      tpr       |   p   |  n   |
 +-----------+----------------+----------------+-------+------+
 |    0.0    |      1.0       |      1.0       | 27976 | 5328 |
 |   1e-05   | 0.909346846847 | 0.998856162425 | 27976 | 5328 |
 |   2e-05   | 0.896021021021 | 0.998748927652 | 27976 | 5328 |
 |   3e-05   | 0.886448948949 | 0.998462968259 | 27976 | 5328 |
 |   4e-05   | 0.879692192192 | 0.998284243637 | 27976 | 5328 |
 |   5e-05   | 0.875187687688 | 0.998212753789 | 27976 | 5328 |
 |   6e-05   | 0.872184684685 | 0.998177008865 | 27976 | 5328 |
 |   7e-05   | 0.868618618619 | 0.998034029168 | 27976 | 5328 |
 |   8e-05   | 0.864677177177 | 0.997998284244 | 27976 | 5328 |
 |   9e-05   | 0.860735735736 | 0.997962539319 | 27976 | 5328 |
 +-----------+----------------+----------------+-------+------

In [41]:
sentiment_model.show(view='Evaluation')

In [42]:
giraffe_reviews['predicted_sentiment'] = sentiment_model.predict(giraffe_reviews, output_type='probability')

In [43]:
giraffe_reviews.head()

name,review,rating,word count,predicted_sentiment
Vulli Sophie the Giraffe Teether ...,He likes chewing on all the parts especially the ...,5.0,"{'and': 1, 'all': 1, 'because': 1, 'it': 1, ...",0.999513023521
Vulli Sophie the Giraffe Teether ...,My son loves this toy and fits great in the diaper ...,5.0,"{'and': 1, 'right': 1, 'help': 1, 'just': 1, ...",0.999320678306
Vulli Sophie the Giraffe Teether ...,There really should be a large warning on the ...,1.0,"{'and': 2, 'all': 1, 'would': 1, 'latex.': 1, ...",0.013558811687
Vulli Sophie the Giraffe Teether ...,All the moms in my moms' group got Sophie for ...,5.0,"{'and': 2, 'one!': 1, 'all': 1, 'love': 1, ...",0.995769474148
Vulli Sophie the Giraffe Teether ...,I was a little skeptical on whether Sophie was ...,5.0,"{'and': 3, 'all': 1, 'months': 1, 'old': 1, ...",0.662374415673
Vulli Sophie the Giraffe Teether ...,I have been reading about Sophie and was going ...,5.0,"{'and': 6, 'seven': 1, 'already': 1, 'love': 1, ...",0.999997148186
Vulli Sophie the Giraffe Teether ...,My neice loves her sophie and has spent hours ...,5.0,"{'and': 4, 'drooling,': 1, 'love': 1, ...",0.989190989536
Vulli Sophie the Giraffe Teether ...,What a friendly face! And those mesmerizing ...,5.0,"{'and': 3, 'chew': 1, 'be': 1, 'is': 1, ...",0.999563518413
Vulli Sophie the Giraffe Teether ...,We got this just for my son to chew on instea ...,5.0,"{'chew': 2, 'seemed': 1, 'because': 1, 'about.': ...",0.970160542725
Vulli Sophie the Giraffe Teether ...,This product is without a doubt the best on the ...,5.0,"{'and': 4, ':)': 1, 'just': 2, 'give': 1, ...",0.999999795012


In [44]:
giraffe_reviews = giraffe_reviews.sort('predicted_sentiment', ascending=False)

In [45]:
giraffe_reviews.head()

name,review,rating,word count,predicted_sentiment
Vulli Sophie the Giraffe Teether ...,"Sophie, oh Sophie, your time has come. My ...",5.0,"{'giggles': 1, 'all': 1, ""violet's"": 2, 'bring': ...",1.0
Vulli Sophie the Giraffe Teether ...,I'm not sure why Sophie is such a hit with the ...,4.0,"{'adoring': 1, 'find': 1, 'month': 1, 'bright': 1, ...",0.999999999703
Vulli Sophie the Giraffe Teether ...,I'll be honest...I bought this toy because all the ...,4.0,"{'all': 2, 'discovered': 1, 'existence.': 1, ...",0.999999999392
Vulli Sophie the Giraffe Teether ...,We got this little giraffe as a gift from a ...,5.0,"{'all': 2, ""don't"": 1, '(literally).so': 1, ...",0.99999999919
Vulli Sophie the Giraffe Teether ...,As a mother of 16month old twins; I bought ...,5.0,"{'cute': 1, 'all': 1, 'reviews.': 2, 'just' ...",0.999999998657
Vulli Sophie the Giraffe Teether ...,Sophie the Giraffe is the perfect teething toy. ...,5.0,"{'just': 2, 'both': 1, 'month': 1, 'ears,': 1, ...",0.999999997108
Vulli Sophie the Giraffe Teether ...,Sophie la giraffe is absolutely the best toy ...,5.0,"{'and': 5, 'the': 1, 'all': 1, 'that': 2, ...",0.999999995589
Vulli Sophie the Giraffe Teether ...,My 5-mos old son took to this immediately. The ...,5.0,"{'just': 1, 'shape': 2, 'mutt': 1, '""dog': 1, ...",0.999999995573
Vulli Sophie the Giraffe Teether ...,My nephews and my four kids all had Sophie in ...,5.0,"{'and': 4, 'chew': 1, 'all': 1, 'perfect;': 1, ...",0.999999989527
Vulli Sophie the Giraffe Teether ...,Never thought I'd see my son French kissing a ...,5.0,"{'giggles': 1, 'all': 1, 'out,': 1, 'over': 1, ...",0.999999985069
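
Sorting by predicted probability puts the most positive review first and the most negative last. Here is a small plain-Python sketch of the same idea, using three rating and prediction pairs taken from the unsorted `giraffe_reviews.head()` output above; the list-of-dicts layout is our own stand-in for an SFrame.

```python
# Three rows from the unsorted giraffe_reviews output (rating, prediction).
reviews = [
    {'rating': 5.0, 'predicted_sentiment': 0.999513023521},
    {'rating': 1.0, 'predicted_sentiment': 0.013558811687},
    {'rating': 5.0, 'predicted_sentiment': 0.662374415673},
]

# Highest predicted probability first, like sort(..., ascending=False).
ranked = sorted(reviews, key=lambda r: r['predicted_sentiment'], reverse=True)

most_positive = ranked[0]
most_negative = ranked[-1]

print(most_positive['rating'])
print(most_negative['rating'])
```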
