#Predicting sentiment from product reviews

#Fire up GraphLab Create

In [1]:
import graphlab

#Read some product review data

Loading reviews for a set of baby products. 

In [2]:
products = graphlab.SFrame('amazon_baby.gl/')

IOError: C:\Users\antoi\Documents\Programming\Teaching\Bruno\WashingtonU\amazon_baby.gl not found.

#Let's explore this data together

Data includes the product name, the review text and the rating of the review. 

In [None]:
products.head()

#Build the word count vector for each review

In [None]:
products['word_count'] = graphlab.text_analytics.count_words(products['review'])
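
To make the word-count representation concrete, here is a minimal plain-Python sketch of what a per-review word count looks like. It is only an approximation: it assumes lowercasing and whitespace splitting, and the helper name `count_words_sketch` is hypothetical. GraphLab's own tokenization may differ in its details.

```python
from collections import Counter

def count_words_sketch(text):
    """Return a dict mapping each lowercased, whitespace-separated token to its count."""
    return dict(Counter(text.lower().split()))

# A made-up review, used only for illustration.
example_review = "Love it love it. My baby loves it"
word_count = count_words_sketch(example_review)
```

With this simple splitting, punctuation stays attached to the token, so 'it' and 'it.' are counted separately. Keys such as 'over.' and 'bags.' in the word_count output further down suggest the real column behaves similarly.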

In [None]:
products.head()

In [None]:
graphlab.canvas.set_target('ipynb')

In [None]:
products['name'].show()

#Examining the reviews for the most-sold product: 'Vulli Sophie the Giraffe Teether'

In [None]:
giraffe_reviews = products[products['name'] == 'Vulli Sophie the Giraffe Teether']

In [None]:
len(giraffe_reviews)

In [None]:
giraffe_reviews['rating'].show(view='Categorical')

#Build a sentiment classifier

In [None]:
products['rating'].show(view='Categorical')

##Define what's a positive and a negative sentiment

We will ignore all reviews with rating = 3, since they tend to have a neutral sentiment.  Reviews with a rating of 4 or higher will be considered positive, while the ones with rating of 2 or lower will have a negative sentiment.   

In [None]:
#ignore all 3* reviews
products = products[products['rating'] != 3]

In [None]:
#positive sentiment = 4* or 5* reviews
products['sentiment'] = products['rating'] >=4

In [None]:
products.head()
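
The labeling rule can also be written as an ordinary function on a single rating. This is a plain-Python sketch that does not need GraphLab; the name `label_sentiment` and the sample ratings are hypothetical.

```python
def label_sentiment(rating):
    """Return 1 for a positive review (4 stars or more), 0 for a negative one
    (2 stars or fewer), and None for a neutral 3-star review."""
    if rating == 3:
        return None
    return int(rating >= 4)

# Made-up ratings, used only for illustration.
ratings = [5, 3, 1, 4, 2]

# Drop the neutral reviews first, then label the rest.
labels = [label_sentiment(r) for r in ratings if r != 3]
```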

##Let's train the sentiment classifier

In [None]:
train_data,test_data = products.random_split(.8, seed=0)

In [None]:
sentiment_model = graphlab.logistic_classifier.create(train_data,
                                                     target='sentiment',
                                                     features=['word_count'],
                                                     validation_set=test_data)

#Evaluate the sentiment model

In [None]:
sentiment_model.evaluate(test_data, metric='roc_curve')

In [None]:
sentiment_model.show(view='Evaluation')

#Applying the learned model to understand sentiment for Giraffe

In [None]:
giraffe_reviews['predicted_sentiment'] = sentiment_model.predict(giraffe_reviews, output_type='probability')

In [None]:
giraffe_reviews.head()

##Sort the reviews based on the predicted sentiment and explore

In [None]:
giraffe_reviews = giraffe_reviews.sort('predicted_sentiment', ascending=False)

In [None]:
giraffe_reviews.head()

##Most positive reviews for the giraffe

In [None]:
giraffe_reviews[0]['review']

In [None]:
giraffe_reviews[1]['review']

##Show most negative reviews for giraffe

In [None]:
giraffe_reviews[-1]['review']

In [None]:
giraffe_reviews[-2]['review']

# Exercise

In [None]:
selected_words = ['awesome', 'great', 'fantastic', 'amazing', 'love', 'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']

1. **Use .apply() to build a new feature with the counts for each of the selected_words:** In the notebook above, we created a column ‘word_count’ with the word counts for each review. Our first task is to create a new column in the products SFrame for each of the selected_words above, holding the count of that word. In the process, we will see how the .apply() method, combined with a Python function, can be used to create new columns in our data (our features), which is an extremely useful concept to grasp!


Our first goal is to create a column products[‘awesome’] where each row contains the number of times the word ‘awesome’ showed up in the review for the corresponding product, and 0 if the word didn’t show up. One way to do this is to look at the ‘word_count’ column of each row and follow this logic:

 - If ‘awesome’ shows up in the word counts for a particular product (row of the products SFrame), then we know how often ‘awesome’ appeared in the review,
 - if ‘awesome’ doesn’t appear in the word counts, then it didn’t appear in the review, and we should set the count for ‘awesome’ to 0 in this review.

In [None]:
def awesome_count(sa):
    # sa is a single row of the products SFrame, passed in as a dict
    if 'awesome' in sa['word_count']:
        return sa['word_count']['awesome']
    else:
        return 0L

# An equivalent inline version, applied to the word_count column directly:
# products['awesome'] = products['word_count'].apply(lambda x: x['awesome'] if 'awesome' in x else 0L)
# or, for an arbitrary word:
# word = 'awesome'
# products[word] = products['word_count'].apply(lambda x: x[word] if word in x else 0L)

In [None]:
products['awesome'] = products.apply(awesome_count)

In [None]:
products.head()

In [None]:
for word in selected_words:
    products[word] = products['word_count'].apply(lambda x: x[word] if word in x else 0L)
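
The same per-word lookup can be sketched without GraphLab, using plain dicts in place of the word_count column. Here `dict.get` replaces the if/else, and a small factory function binds each word explicitly when the counting function is created, a defensive pattern for lambdas defined inside a loop. The names `rows`, `selected`, and `make_counter` are hypothetical, and the data is made up.

```python
# Each dict stands in for one row of the word_count column.
rows = [
    {'love': 2, 'it': 1},
    {'great': 1, 'not': 1},
    {'the': 3},
]
selected = ['love', 'great', 'hate']

def make_counter(word):
    """Return a function that looks up `word` in a word-count dict, defaulting to 0."""
    return lambda word_count: word_count.get(word, 0)

# One list of counts per selected word, in row order.
columns = {word: [make_counter(word)(wc) for wc in rows] for word in selected}
```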

Using the .sum() method on each of the new columns you created, answer the following questions: Out of the selected_words, which one is most used in the dataset? Which one is least used? Save these results to answer the quiz at the end.


In [None]:
print 'Word count value:'

for word in selected_words:
    print '{0}: {1}'.format(word, products[word].sum())
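
Once the totals are collected in a dict, the most and least used words can be read off with max and min. The sketch below uses made-up totals, not results from the dataset; in the notebook, the dict could be built from products[word].sum() for each word in selected_words.

```python
# Hypothetical totals, used only to illustrate the lookup.
totals = {'great': 120, 'love': 95, 'wow': 4}

# max/min over the keys, ranked by their totals.
most_used = max(totals, key=totals.get)
least_used = min(totals, key=totals.get)
```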

2. **Create a new sentiment analysis model using only the selected_words as features**: In the IPython Notebook above, we used word counts for all words as features for our sentiment classifier. Now, you are just going to use the selected_words:

Use the same train/test split as in the IPython Notebook from lecture:

In [None]:
train_data,test_data = products.random_split(.8, seed=0)

Train a logistic regression classifier (use graphlab.logistic_classifier.create) using just the selected_words. Hint: pass features=selected_words in the .create() call so that the features used are exactly the new columns you just created:


In [None]:
selected_words_model = graphlab.logistic_classifier.create(train_data,
                                                     target='sentiment',
                                                     features=selected_words,
                                                     validation_set=test_data)

In [None]:
selected_words_model['coefficients']

In [None]:
selected_words_model['coefficients'].sort(['value'], ascending = False)

3. **Comparing the accuracy of different sentiment analysis models**: Using the method .evaluate(test_data), answer the following questions:



 - What is the accuracy of the selected_words_model on the test_data? 

In [None]:
selected_words_model.evaluate(test_data)

 - What was the accuracy of the sentiment_model that we learned using all the word counts in the IPython Notebook above from the lectures? 

In [None]:
sentiment_model.evaluate(test_data)

 - What is the accuracy of the majority class classifier on this task? 

 - How do you compare the different learned models with the baseline approach where we are just predicting the majority class? Save these results to answer the quiz at the end.

In [None]:
selected_words_model.show(view='Evaluation')

Hint: we discussed the majority class classifier in lecture, which simply predicts that every data point is from the most common class. This baseline is something we definitely want to beat with models we learn from data.
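
The baseline described in the hint can be computed directly from the labels: its accuracy is the share of the most common class. The sketch below is plain Python with made-up labels; the name `majority_class_accuracy` is hypothetical, and applying it to the sentiment column of test_data (converted to a list) is an assumption about how you would use it in the notebook.

```python
from collections import Counter

def majority_class_accuracy(labels):
    """Return (majority_label, accuracy) for a classifier that always predicts
    the most common label in `labels`."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / float(len(labels))

# Made-up labels: six positive (1) and two negative (0) reviews.
toy_labels = [1, 1, 1, 0, 1, 0, 1, 1]
label, accuracy = majority_class_accuracy(toy_labels)
```

A learned model is only useful if its accuracy on the same data is clearly above this number.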

4. **Interpreting the difference in performance between the models:** To understand why the model with all word counts performs better than the one with only the selected_words, we will now examine the reviews for a particular product.

We will investigate a product named ‘Baby Trend Diaper Champ’. (This is a trash can for soiled baby diapers, which keeps the smell contained.)

Just like we did for the reviews for the giraffe toy in the IPython Notebook in the lecture video, before we start our analysis you should select all reviews where the product name is ‘Baby Trend Diaper Champ’. Let’s call this table diaper_champ_reviews.


In [77]:
diaper_champ_reviews = products[products['name'] == 'Baby Trend Diaper Champ']

Again, just as in the video, use the sentiment_model to predict the sentiment of each review in diaper_champ_reviews and sort the results according to their ‘predicted_sentiment’.

In [78]:
diaper_champ_reviews.head()

name,review,rating,word_count,sentiment,awesome
Baby Trend Diaper Champ,Ok - newsflash. Diapers are just smelly. We've ...,4.0,"{'just': 2L, 'less': 1L, '-': 3L, 'smell- ...",1,0
Baby Trend Diaper Champ,"My husband and I selected the Diaper ""Champ"" ma ...",1.0,"{'just': 1L, 'less': 1L, 'when': 3L, 'over': 1L, ...",0,0
Baby Trend Diaper Champ,Excellent diaper disposal unit. I used it in ...,5.0,"{'control': 1L, 'am': 1L, 'it': 1L, 'used': 1L, ...",1,0
Baby Trend Diaper Champ,We love our diaper champ. It is very easy to use ...,5.0,"{'and': 3L, 'over.': 1L, 'all': 1L, 'love': 1L, ...",1,0
Baby Trend Diaper Champ,Two girlfriends and two family members put me ...,5.0,"{'just': 1L, 'when': 1L, 'both': 1L, 'results': ...",1,0
Baby Trend Diaper Champ,I waited to review this until I saw how it ...,4.0,"{'lysol': 1L, 'all': 1L, 'mom.': 1L, 'busy': 1L, ...",1,0
Baby Trend Diaper Champ,I have had a diaper genie for almost 4 years since ...,1.0,"{'all': 1L, 'bags.': 1L, 'just': 1L, ""don't"": 2L, ...",0,0
Baby Trend Diaper Champ,I originally put this item on my baby registry ...,5.0,"{'lysol': 1L, 'all': 2L, 'bags.': 1L, 'feedback': ...",1,0
Baby Trend Diaper Champ,I am so glad I got the Diaper Champ instead of ...,5.0,"{'and': 2L, 'all': 1L, 'just': 1L, 'is': 2L, ...",1,0
Baby Trend Diaper Champ,We had 2 diaper Genie's both given to us as a ...,4.0,"{'hand.': 1L, '(required': 1L, ...",1,0

great,fantastic,amazing,love,horrible,bad,terrible,awful,wow,hate
0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0
0,0,0,1,0,0,0,0,0,0
0,0,0,0,1,0,0,0,0,0
0,0,0,0,0,1,0,0,0,0
0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0
0,0,0,2,0,0,0,0,0,0


In [80]:
diaper_champ_reviews['predicted_sentiment'] = sentiment_model.predict(diaper_champ_reviews, output_type='probability')

In [81]:
diaper_champ_reviews.head()

name,review,rating,word_count,sentiment,awesome
Baby Trend Diaper Champ,Ok - newsflash. Diapers are just smelly. We've ...,4.0,"{'just': 2L, 'less': 1L, '-': 3L, 'smell- ...",1,0
Baby Trend Diaper Champ,"My husband and I selected the Diaper ""Champ"" ma ...",1.0,"{'just': 1L, 'less': 1L, 'when': 3L, 'over': 1L, ...",0,0
Baby Trend Diaper Champ,Excellent diaper disposal unit. I used it in ...,5.0,"{'control': 1L, 'am': 1L, 'it': 1L, 'used': 1L, ...",1,0
Baby Trend Diaper Champ,We love our diaper champ. It is very easy to use ...,5.0,"{'and': 3L, 'over.': 1L, 'all': 1L, 'love': 1L, ...",1,0
Baby Trend Diaper Champ,Two girlfriends and two family members put me ...,5.0,"{'just': 1L, 'when': 1L, 'both': 1L, 'results': ...",1,0
Baby Trend Diaper Champ,I waited to review this until I saw how it ...,4.0,"{'lysol': 1L, 'all': 1L, 'mom.': 1L, 'busy': 1L, ...",1,0
Baby Trend Diaper Champ,I have had a diaper genie for almost 4 years since ...,1.0,"{'all': 1L, 'bags.': 1L, 'just': 1L, ""don't"": 2L, ...",0,0
Baby Trend Diaper Champ,I originally put this item on my baby registry ...,5.0,"{'lysol': 1L, 'all': 2L, 'bags.': 1L, 'feedback': ...",1,0
Baby Trend Diaper Champ,I am so glad I got the Diaper Champ instead of ...,5.0,"{'and': 2L, 'all': 1L, 'just': 1L, 'is': 2L, ...",1,0
Baby Trend Diaper Champ,We had 2 diaper Genie's both given to us as a ...,4.0,"{'hand.': 1L, '(required': 1L, ...",1,0

great,fantastic,amazing,love,horrible,bad,terrible,awful,wow,hate,predicted_sentiment
0,0,0,0,0,0,0,0,0,0,0.958443580893
0,0,0,0,0,0,0,0,0,0,2.47155884995e-12
0,0,0,0,0,0,0,0,0,0,0.999994864775
0,0,0,1,0,0,0,0,0,0,0.998779072633
0,0,0,0,1,0,0,0,0,0,0.999999604504
0,0,0,0,0,1,0,0,0,0,0.999952233179
0,0,0,0,0,0,0,0,0,0,0.972560724165
0,0,0,0,0,0,0,0,0,0,0.999999642488
0,0,0,0,0,0,0,0,0,0,0.97415225478
0,0,0,2,0,0,0,0,0,0,0.99267406035


In [83]:
diaper_champ_reviews = diaper_champ_reviews.sort('predicted_sentiment', ascending=False)
diaper_champ_reviews

name,review,rating,word_count,sentiment,awesome
Baby Trend Diaper Champ,Baby Luke can turn a clean diaper to a dirty ...,5.0,"{'all': 1L, 'less': 1L, ""friend's"": 1L, '(whi ...",1,0
Baby Trend Diaper Champ,I LOOOVE this diaper pail! Its the easies ...,5.0,"{'just': 1L, 'over': 1L, 'rweek': 1L, 'sooo': 1L, ...",1,0
Baby Trend Diaper Champ,We researched all of the different types of di ...,4.0,"{'all': 2L, 'just': 4L, ""don't"": 2L, 'one,': 1L, ...",1,0
Baby Trend Diaper Champ,My baby is now 8 months and the can has been ...,5.0,"{""don't"": 1L, 'when': 1L, 'over': 1L, 'soon': 1L, ...",1,0
Baby Trend Diaper Champ,"This is absolutely, by far, the best diaper ...",5.0,"{'just': 3L, 'money': 1L, 'not': 2L, 'mechanism': ...",1,0
Baby Trend Diaper Champ,Diaper Champ or Diaper Genie? That was my ...,5.0,"{'all': 1L, 'bags.': 1L, 'son,': 1L, '(i': 1L, ...",1,0
Baby Trend Diaper Champ,Wow! This is fabulous. It was a toss-up between ...,5.0,"{'and': 4L, '""genie"".': 1L, 'since': 1L, ...",1,0
Baby Trend Diaper Champ,I originally put this item on my baby registry ...,5.0,"{'lysol': 1L, 'all': 2L, 'bags.': 1L, 'feedback': ...",1,0
Baby Trend Diaper Champ,Two girlfriends and two family members put me ...,5.0,"{'just': 1L, 'when': 1L, 'both': 1L, 'results': ...",1,0
Baby Trend Diaper Champ,I am one of those super- critical shoppers who ...,5.0,"{'taller': 1L, 'bags.': 1L, 'just': 1L, ""don't"": ...",1,0

great,fantastic,amazing,love,horrible,bad,terrible,awful,wow,hate,predicted_sentiment
0,0,0,0,0,0,0,0,0,0,0.999999937267
0,0,0,1,0,0,0,0,0,0,0.999999917406
0,0,0,0,0,1,0,0,0,0,0.999999899509
2,0,0,0,0,1,0,0,0,0,0.999999836182
0,0,0,2,0,0,0,0,0,0,0.999999824745
0,0,0,0,0,0,0,0,0,0,0.999999759315
0,0,0,0,0,0,0,0,0,0,0.999999692111
0,0,0,0,0,0,0,0,0,0,0.999999642488
0,0,0,0,1,0,0,0,0,0,0.999999604504
0,0,0,1,0,0,0,0,0,0,0.999999486804


Now use the selected_words_model you learned using just the selected_words to predict the sentiment of the most positive review you found above. Hint: if you sorted the diaper_champ_reviews in descending order (from most positive to most negative), this command will be helpful to make the prediction you need:


In [None]:
selected_words_model.predict(diaper_champ_reviews[0:1], output_type='probability')
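
When interpreting the result, it may help to recall how a logistic regression model turns counts into a probability. In the sorted output above, the top review has a count of 0 for every selected word, so a model restricted to those words can only fall back on its intercept for that review. The sketch below illustrates this with hypothetical coefficients and an intercept that are NOT taken from selected_words_model; the function names are also made up.

```python
import math

def sigmoid(score):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

def selected_words_probability(word_count, coefficients, intercept):
    """Logistic-regression probability that uses only the words in `coefficients`."""
    score = intercept + sum(weight * word_count.get(word, 0)
                            for word, weight in coefficients.items())
    return sigmoid(score)

# Hypothetical values, not read from selected_words_model['coefficients'].
toy_coefficients = {'love': 1.4, 'great': 0.9, 'hate': -1.4}
toy_intercept = 1.3

# A review that uses none of the selected words, however enthusiastic it is.
review_counts = {'diaper': 2, 'clean': 1, 'best': 1}
p_no_selected_words = selected_words_probability(
    review_counts, toy_coefficients, toy_intercept)
```

In this sketch, every review without any selected word receives the same probability, sigmoid(intercept), regardless of what else it says.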