# Predicting sentiment from product reviews


The goal of this first notebook is to explore logistic regression and feature engineering.

Let's get started!
    


In [1]:
from __future__ import division
import turicreate
import math
import string

# Data preparation

We will use a dataset consisting of baby product reviews on Amazon.com.

In [2]:
products = turicreate.SFrame('m_bfaa91c17752f745.frame_idx')

Now, let us see a preview of what the dataset looks like.

In [3]:
products

name,review,rating
Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3.0
Planetwise Wipe Pouch,it came early and was not disappointed. i love ...,5.0
Annas Dream Full Quilt with 2 Shams ...,Very soft and comfortable and warmer than it ...,5.0
Stop Pacifier Sucking without tears with ...,This is a product well worth the purchase. I ...,5.0
Stop Pacifier Sucking without tears with ...,All of my kids have cried non-stop when I tried to ...,5.0
Stop Pacifier Sucking without tears with ...,"When the Binky Fairy came to our house, we didn't ...",5.0
A Tale of Baby's Days with Peter Rabbit ...,"Lovely book, it's bound tightly so you may no ...",4.0
"Baby Tracker&reg; - Daily Childcare Journal, ...",Perfect for new parents. We were able to keep ...,5.0
"Baby Tracker&reg; - Daily Childcare Journal, ...",A friend of mine pinned this product on Pinte ...,5.0
"Baby Tracker&reg; - Daily Childcare Journal, ...",This has been an easy way for my nanny to record ...,4.0


## Build the word count vector for each review

Let us explore a specific example of a baby product.


In [5]:
products[269]

{'name': 'The First Years Massaging Action Teether',
 'review': 'A favorite in our house!',
 'rating': 5.0}

In [6]:
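import string

def remove_punctuation(text):
    """Return `text` with all ASCII punctuation characters removed."""
    try:  # Python 2.x: str.translate(table, deletechars)
        text = text.translate(None, string.punctuation)
    except TypeError:  # Python 3.x: str.translate accepts only a mapping table
        translator = text.maketrans('', '', string.punctuation)
        text = text.translate(translator)
    return text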

review_without_punctuation = products['review'].apply(remove_punctuation)
products['word_count'] = turicreate.text_analytics.count_words(review_without_punctuation)

In [7]:
products[269]['word_count']

{'our': 1.0, 'in': 1.0, 'favorite': 1.0, 'house': 1.0, 'a': 1.0}
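To make the bag-of-words step concrete, here is a minimal pure-Python sketch of what the two calls above accomplish for a single review. It assumes that the counting step lowercases the text and splits on whitespace, which is consistent with the output shown for this review but is not a description of how `turicreate.text_analytics.count_words` is implemented. The helper name `count_words_sketch` is hypothetical.

```python
import string
from collections import Counter

def count_words_sketch(text):
    """Lowercase, strip ASCII punctuation, split on whitespace, and count tokens."""
    translator = str.maketrans('', '', string.punctuation)
    tokens = text.lower().translate(translator).split()
    return dict(Counter(tokens))

# The review shown above for products[269]
example_counts = count_words_sketch('A favorite in our house!')
print(example_counts)
```

Each distinct word becomes one feature, which is why the full model later ends up with more than a hundred thousand coefficients.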

## Extract sentiments

We will **ignore** all reviews with *rating = 3*, since they tend to have a neutral sentiment.

In [8]:
products = products[products['rating'] != 3]
len(products)

166752

Now, we will assign reviews with a rating of 4 or higher to be *positive* reviews, while the ones with a rating of 2 or lower are *negative*. For the sentiment column, we use +1 for the positive class label and -1 for the negative class label.

In [9]:
products['sentiment'] = products['rating'].apply(lambda rating : +1 if rating > 3 else -1)
products

name,review,rating,word_count,sentiment
Planetwise Wipe Pouch,it came early and was not disappointed. i love ...,5.0,"{'recommend': 1.0, 'highly': 1.0, ...",1
Annas Dream Full Quilt with 2 Shams ...,Very soft and comfortable and warmer than it ...,5.0,"{'quilt': 1.0, 'this': 1.0, 'for': 1.0, ...",1
Stop Pacifier Sucking without tears with ...,This is a product well worth the purchase. I ...,5.0,"{'tool': 1.0, 'clever': 1.0, 'approach': 2.0, ...",1
Stop Pacifier Sucking without tears with ...,All of my kids have cried non-stop when I tried to ...,5.0,"{'rock': 1.0, 'headachesthanks': 1.0, ...",1
Stop Pacifier Sucking without tears with ...,"When the Binky Fairy came to our house, we didn't ...",5.0,"{'thumb': 1.0, 'or': 1.0, 'break': 1.0, 'trying': ...",1
A Tale of Baby's Days with Peter Rabbit ...,"Lovely book, it's bound tightly so you may no ...",4.0,"{'2995': 1.0, 'for': 1.0, 'barnes': 1.0, 'at': ...",1
"Baby Tracker&reg; - Daily Childcare Journal, ...",Perfect for new parents. We were able to keep ...,5.0,"{'right': 1.0, 'because': 1.0, 'questions': 1.0, ...",1
"Baby Tracker&reg; - Daily Childcare Journal, ...",A friend of mine pinned this product on Pinte ...,5.0,"{'like': 1.0, 'and': 1.0, 'changes': 1.0, 'the': ...",1
"Baby Tracker&reg; - Daily Childcare Journal, ...",This has been an easy way for my nanny to record ...,4.0,"{'in': 1.0, 'pages': 1.0, 'out': 1.0, 'run': 1.0, ...",1
"Baby Tracker&reg; - Daily Childcare Journal, ...",I love this journal and our nanny uses it ...,4.0,"{'tracker': 1.0, 'now': 1.0, 'postits': 1.0, ...",1


Now, we can see that the dataset contains an extra column called **sentiment** which is either positive (+1) or negative (-1).

## Split data into training and test sets

We perform an 80/20 random split of the data and use `seed=1` so that everyone gets the same result.

In [10]:
train_data, test_data = products.random_split(.8, seed=1)
print(len(train_data))
print(len(test_data))

133416
33336
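The split sizes above are close to, but not exactly, 80% and 20% of the data. One way this can happen is when each row is assigned to a side independently with the given probability. The sketch below illustrates that idea with the standard library only. It is an assumption for illustration, not Turi Create's actual algorithm, and `random_split_sketch` is a hypothetical helper.

```python
import random

def random_split_sketch(rows, fraction, seed):
    """Send each row to the first list with probability `fraction`, else to the second."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second

rows = list(range(1000))  # stand-in for the rows of the dataset
train_rows, test_rows = random_split_sketch(rows, 0.8, seed=1)
print(len(train_rows), len(test_rows))
```

Running the function twice with the same seed returns the same two lists, which is the property the notebook relies on.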


# Train a sentiment classifier with logistic regression



In [22]:
sentiment_model = turicreate.logistic_classifier.create(train_data,
                                                        target = 'sentiment',
                                                        features=['word_count'],
                                                        validation_set=None)

In [23]:
sentiment_model

Class                          : LogisticClassifier

Schema
------
Number of coefficients         : 121713
Number of examples             : 133416
Number of classes              : 2
Number of feature columns      : 1
Number of unpacked features    : 121712

Hyperparameters
---------------
L1 penalty                     : 0.0
L2 penalty                     : 0.01

Training Summary
----------------
Solver                         : lbfgs
Solver iterations              : 5
Solver status                  : TERMINATED: Terminated due to numerical difficulties.
Training time (sec)            : 1.4165

Settings
--------
Log-likelihood                 : inf

Highest Positive Coefficients
-----------------------------
word_count[offsi]              : 21.7657
word_count[feedthrough]        : 21.6818
word_count[conclusions]        : 21.6818
word_count[easycheap]          : 21.6818
word_count[torsional]          : 21.6818

Lowest Negative Coefficients
----------------------------
word_count[wahwah]

Now that we have fitted the model, we can extract the weights (coefficients) as an SFrame as follows:

In [13]:
weights = sentiment_model.coefficients
weights.column_names()

['name', 'index', 'class', 'value', 'stderr']

In [14]:
weights

name,index,class,value,stderr
(intercept),,1,0.779736281462833,
word_count,recommend,1,0.3052457737582838,
word_count,highly,1,0.9060636221519458,
word_count,disappointed,1,-2.42679249447826,
word_count,love,1,0.8405057320615064,
word_count,it,1,0.0104794460442007,
word_count,planet,1,-0.384672991123842,
word_count,and,1,0.0363143727865874,
word_count,bags,1,0.1305759809887296,
word_count,wipes,1,-0.0020751803190877,




#### Fill in the following block of code to calculate how many *weights* are positive (>= 0). (**Hint**: Count the entries of the `'value'` column of the SFrame *weights* that are >= 0.)

In [21]:
num_positive_weights = (weights['value'] >= 0).sum()
num_negative_weights = (weights['value'] < 0).sum()

print("Number of positive weights: %s " % num_positive_weights)
print("Number of negative weights: %s " % num_negative_weights)

Number of positive weights: 91073 
Number of negative weights: 30640 


**Quiz Question:** How many weights are >= 0?

## Making predictions with logistic regression


In [24]:
sample_test_data = test_data[10:13]
print(sample_test_data['rating'])
sample_test_data

[5.0, 2.0, 1.0]


name,review,rating,word_count,sentiment
Our Baby Girl Memory Book,Absolutely love it and all of the Scripture in ...,5.0,"{'again': 1.0, 'book': 1.0, 'same': 1.0, ...",1
Wall Decor Removable Decal Sticker - Colorful ...,Would not purchase again or recommend. The decals ...,2.0,"{'peeling': 1.0, '5': 1.0, 'about': 1.0, 'f ...",-1
New Style Trailing Cherry Blossom Tree Decal ...,Was so excited to get this product for my baby ...,1.0,"{'on': 1.0, 'waste': 1.0, 'wouldnt': 1.0, ...",-1


Let's dig deeper into the first row of the **sample_test_data**. Here's the full review:

In [25]:
sample_test_data[0]['review']

'Absolutely love it and all of the Scripture in it.  I purchased the Baby Boy version for my grandson when he was born and my daughter-in-law was thrilled to receive the same book again.'

In [26]:
sample_test_data[1]['review']

'Would not purchase again or recommend. The decals were thick almost plastic like and were coming off the wall as I was applying them! The would NOT stick! Literally stayed stuck for about 5 minutes then started peeling off.'



$$
\mbox{score}_i = \mathbf{w}^T h(\mathbf{x}_i)
$$ 



In [27]:
scores = sentiment_model.predict(sample_test_data, output_type='margin')
print(scores)

[4.788907309214016, -3.0007822224624583, -8.188501360762793]
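As a rough illustration of how a margin score is formed, the sketch below adds the intercept to the sum of coefficient times count over the words of a review. It uses a handful of rounded coefficients copied from the weights table earlier and treats every other word as having weight 0, so the result is illustrative only and will not match the model's actual margins. The names `linear_score` and `toy_review` are hypothetical.

```python
def linear_score(word_count, coefficients, intercept=0.0):
    """Return intercept + sum(coefficient * count) over the words in the review."""
    return intercept + sum(
        coefficients.get(word, 0.0) * count  # unknown words contribute nothing here
        for word, count in word_count.items()
    )

# Rounded values from the weights table; all other words are assumed to be 0
partial_coefficients = {
    'love': 0.8405,
    'highly': 0.9061,
    'recommend': 0.3052,
    'disappointed': -2.4268,
}
intercept = 0.7797

toy_review = {'love': 1.0, 'highly': 1.0, 'recommend': 1.0}
toy_score = linear_score(toy_review, partial_coefficients, intercept)
print(toy_score)
```

A positive score pushes the prediction toward +1, and a word such as `disappointed` with a large negative coefficient pulls it toward -1.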


### Predicting sentiment

These scores can be used to make class predictions as follows:

$$
\hat{y} = 
\left\{
\begin{array}{ll}
      +1 & \mathbf{w}^T h(\mathbf{x}_i) > 0 \\
      -1 & \mathbf{w}^T h(\mathbf{x}_i) \leq 0 \\
\end{array} 
\right.
$$

Using scores, write code to calculate $\hat{y}$, the class predictions:

In [35]:
def class_predictions(scores):
    """ make class predictions
    """
    preds = []
    for score in scores:
        if score > 0:
            pred = 1
        else:
            pred = -1
        preds.append(pred)
    return preds
class_predictions(scores)

[1, -1, -1]

In [30]:
print("Class predictions according to Turi Create:")
print(sentiment_model.predict(sample_test_data))

Class predictions according to Turi Create:
[1, -1, -1]




### Probability predictions

Recall from the lectures that we can also calculate the probability predictions from the scores using:
$$
P(y_i = +1 | \mathbf{x}_i,\mathbf{w}) = \frac{1}{1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))}.
$$


In [32]:
def calculate_proba(scores):
    proba_predc = []
    for i in scores:
        proba_pred =  1 / (1 + math.exp(-i))
        proba_predc.append(proba_pred)
    return proba_predc

calculate_proba(scores)

[0.9917471313286885, 0.0473905474871242, 0.00027775277121725437]

**Checkpoint**: Make sure your probability predictions match the ones obtained from Turi Create.

In [33]:
print("Probability predictions according to Turi Create:")
print(sentiment_model.predict(sample_test_data, output_type='probability'))

Probability predictions according to Turi Create:
[0.9917471313286885, 0.047390547487124186, 0.0002777527712172544]
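One practical caveat with the formula above: `math.exp(-score)` raises `OverflowError` when the score is a large negative number. The sketch below is one common way to avoid that, by choosing between two algebraically equivalent forms depending on the sign of the score. The function name `stable_sigmoid` is hypothetical, and this is an optional variation rather than what the notebook or Turi Create does.

```python
import math

def stable_sigmoid(score):
    """Compute 1 / (1 + exp(-score)) without overflowing for large |score|."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    z = math.exp(score)  # score < 0, so z lies in [0, 1) and cannot overflow
    return z / (1.0 + z)

scores = [4.788907309214016, -3.0007822224624583, -8.188501360762793]
probabilities = [stable_sigmoid(s) for s in scores]
print(probabilities)

# The naive form would raise OverflowError here; the stable form underflows to 0.0
print(stable_sigmoid(-1000.0))
```

For the three sample scores, the result agrees with the probabilities shown above up to floating-point rounding.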


# Find the most positive (and negative) review

We now turn to examining the full test dataset, **test_data**, and use Turi Create to form predictions on all of the test data points for faster performance.

Using the `sentiment_model`, find the 20 reviews in the entire **test_data** with the **highest probability** of being classified as a **positive review**. We refer to these as the "most positive reviews."


In [36]:
test_data['proba_pred'] = sentiment_model.predict(test_data, output_type='probability')
test_data

name,review,rating,word_count,sentiment,proba_pred
"Baby Tracker&reg; - Daily Childcare Journal, ...",This has been an easy way for my nanny to record ...,4.0,"{'in': 1.0, 'pages': 1.0, 'out': 1.0, 'run': 1.0, ...",1,0.8837450374340837
"Baby Tracker&reg; - Daily Childcare Journal, ...",I love this journal and our nanny uses it ...,4.0,"{'tracker': 1.0, 'now': 1.0, 'postits': 1.0, ...",1,0.999999979596926
Nature's Lullabies First Year Sticker Calendar ...,"I love this little calender, you can keep ...",5.0,"{'too': 1.0, 'stickers': 1.0, 'illustrations': ...",1,0.7214046198524092
Nature's Lullabies Second Year Sticker Calendar ...,"I had a hard time finding a second year calendar, ...",5.0,"{'reference': 1.0, 'have': 1.0, 'out': 1.0, ...",1,0.9999702405005484
"Lamaze Peekaboo, I Love You ...","One of baby's first and favorite books, and i ...",4.0,"{'typical': 1.0, 'your': 1.0, 'the': 1.0, ...",1,0.9698462672312208
"Lamaze Peekaboo, I Love You ...",My son loved this book as an infant. It was ...,5.0,"{'farm': 1.0, 'out': 1.0, 'say': 1.0, 'again': ...",1,0.9999953253994284
"Lamaze Peekaboo, I Love You ...",Our baby loves this book & has loved it for a ...,5.0,"{'own': 1.0, 'his': 1.0, 'on': 1.0, 'a': 1.0, ...",1,0.9970314926060642
"SoftPlay Giggle Jiggle Funbook, Happy Bear ...",This bear is absolutely adorable and I would ...,2.0,"{'kenzie': 1.0, 'my': 1.0, 'down': 1.0, 'gi ...",-1,0.9721442926353544
SoftPlay Peek-A-Boo Where's Elmo A Childr ...,I bought two for recent baby showers! The book ...,5.0,"{'book': 1.0, 'elmo': 1.0, 'love': 1.0, 'for': ...",1,0.9969355527911852
Baby's First Year Undated Wall Calendar with ...,I searched high and low for a first year cale ...,5.0,"{'pictures': 1.0, 'personalization': 1.0, ...",1,0.9973467019063472


**Quiz Question**: Which of the following products are represented in the 20 most positive reviews? [multiple choice]




In [37]:
test_data['name','proba_pred'].topk('proba_pred', k=20).print_rows(20)

+-------------------------------+------------+
|              name             | proba_pred |
+-------------------------------+------------+
| Fisher-Price Cradle 'N Swi... |    1.0     |
| The Original CJ's BuTTer (... |    1.0     |
| Baby Jogger City Mini GT D... |    1.0     |
| Diono RadianRXT Convertibl... |    1.0     |
| Diono RadianRXT Convertibl... |    1.0     |
| Graco Pack 'n Play Element... |    1.0     |
| Maxi-Cosi Pria 70 with Tin... |    1.0     |
| Britax 2012 B-Agile Stroll... |    1.0     |
| Quinny 2012 Buzz Stroller,... |    1.0     |
| Roan Rocco Classic Pram St... |    1.0     |
| Britax Decathlon Convertib... |    1.0     |
| bumGenius One-Size Snap Cl... |    1.0     |
| Infantino Wrap and Tie Bab... |    1.0     |
| Baby Einstein Around The W... |    1.0     |
| Britax Frontier Booster Ca... |    1.0     |
| Evenflo X Sport Plus Conve... |    1.0     |
| P'Kolino Silly Soft Seatin... |    1.0     |
| Peg Perego Aria Light Weig... |    1.0     |
| Fisher-Pric

**Quiz Question**: Which of the following products are represented in the 20 most negative reviews?  [multiple choice]

In [38]:
test_data['name','proba_pred'].topk('proba_pred', k=20, reverse=True).print_rows(20)

+-------------------------------+------------------------+
|              name             |       proba_pred       |
+-------------------------------+------------------------+
| Luna Lullaby Bosom Baby Nu... | 3.229790842407196e-63  |
| The First Years True Choic... | 1.6322823186481659e-24 |
| Jolly Jumper Arctic Sneak ... | 8.110311382105151e-20  |
| Motorola MBP36 Remote Wire... | 7.797281605696141e-16  |
| VTech Communications Safe ... | 1.841161479862229e-14  |
| Fisher-Price Ocean Wonders... |  6.34509494188667e-14  |
| Levana Safe N'See Digital ... | 6.578528513080714e-14  |
| Safety 1st High-Def Digita... | 1.5408410011895798e-13 |
| Snuza Portable Baby Moveme... | 6.301835289177339e-13  |
| Adiri BPA Free Natural Nur... | 9.314560905796485e-13  |
| Samsung SEW-3037W Wireless... | 5.9206029701455675e-12 |
| Motorola Digital Video Bab... | 5.986200850005558e-12  |
| Cloth Diaper Sprayer--styl... | 1.0091402072241272e-11 |
| Munchkin Nursery Projector... | 3.455183274564688e-11 
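Selecting the k rows with the highest or lowest predicted probability, as the `topk` calls above do, can be sketched in plain Python with `heapq`. The rows and product names below are placeholders invented for illustration, and `top_k` is a hypothetical helper that only mimics the `reverse` flag used above.

```python
import heapq

# Hypothetical rows; names and probabilities are placeholders, not dataset values
rows = [
    {'name': 'product A', 'proba_pred': 0.88},
    {'name': 'product B', 'proba_pred': 0.9999},
    {'name': 'product C', 'proba_pred': 0.0003},
    {'name': 'product D', 'proba_pred': 0.72},
    {'name': 'product E', 'proba_pred': 0.05},
]

def top_k(rows, column, k, reverse=False):
    """Return the k rows with the largest `column` values (smallest if reverse=True)."""
    pick = heapq.nsmallest if reverse else heapq.nlargest
    return pick(k, rows, key=lambda row: row[column])

most_positive = top_k(rows, 'proba_pred', k=2)
most_negative = top_k(rows, 'proba_pred', k=2, reverse=True)
print([row['name'] for row in most_positive])
print([row['name'] for row in most_negative])
```

Note that many of the most positive reviews are displayed with a probability of 1.0, so which of the tied rows appear in the top 20 may depend on how ties are broken.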

## Compute accuracy of the classifier

We will now evaluate the accuracy of the trained classifier. Recall that the accuracy is given by


$$
\mbox{accuracy} = \frac{\mbox{# correctly classified examples}}{\mbox{# total examples}}
$$



In [40]:
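def get_classification_accuracy(model, data, true_labels):
    """Return the fraction of rows in `data` whose predicted class equals `true_labels`."""
    # Class predictions (+1 or -1) from the model
    predictions = model.predict(data)
    # The element-wise comparison yields 1 for a match and 0 otherwise
    num_correct = sum(predictions == true_labels)
    accuracy = num_correct / len(data)
    return accuracy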

Now, let's compute the classification accuracy of the **sentiment_model** on the **test_data**.

In [41]:
get_classification_accuracy(sentiment_model, test_data, test_data['sentiment'])

0.9221862251019919
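The accuracy formula can also be checked by hand on a tiny example. The sketch below works on plain Python lists rather than a model and an SFrame; the function name `classification_accuracy` and the toy labels are hypothetical.

```python
def classification_accuracy(predictions, true_labels):
    """Fraction of positions where the predicted label equals the true label."""
    if len(predictions) != len(true_labels):
        raise ValueError("predictions and true_labels must have the same length")
    num_correct = sum(1 for p, y in zip(predictions, true_labels) if p == y)
    return num_correct / len(true_labels)

toy_predictions = [1, -1, -1, 1]
toy_labels = [1, -1, 1, 1]
toy_accuracy = classification_accuracy(toy_predictions, toy_labels)
print(toy_accuracy)  # 3 of the 4 predictions match
```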

## Learn another classifier with fewer words

There were a lot of words in the model we trained above. We will now train a simpler logistic regression model using only a subset of the words that occur in the reviews. For this assignment, we selected 20 words to work with. These are:

In [43]:
significant_words = ['love', 'great', 'easy', 'old', 'little', 'perfect', 'loves', 
      'well', 'able', 'car', 'broke', 'less', 'even', 'waste', 'disappointed', 
      'work', 'product', 'money', 'would', 'return']

In [44]:
len(significant_words)

20

In [45]:
train_data['word_count_subset'] = train_data['word_count'].dict_trim_by_keys(significant_words, exclude=False)
test_data['word_count_subset'] = test_data['word_count'].dict_trim_by_keys(significant_words, exclude=False)

Let's see what the first example of the dataset looks like:

In [46]:
train_data[0]['review']

'it came early and was not disappointed. i love planet wise bags and now my wipe holder. it keps my osocozy wipes moist and does not leak. highly recommend it.'

The **word_count** column that we had been working with before looks like the following:

In [47]:
print(train_data[0]['word_count'])

{'recommend': 1.0, 'highly': 1.0, 'disappointed': 1.0, 'love': 1.0, 'it': 3.0, 'planet': 1.0, 'and': 3.0, 'bags': 1.0, 'wipes': 1.0, 'not': 2.0, 'early': 1.0, 'came': 1.0, 'i': 1.0, 'does': 1.0, 'wise': 1.0, 'my': 2.0, 'was': 1.0, 'now': 1.0, 'wipe': 1.0, 'holder': 1.0, 'leak': 1.0, 'keps': 1.0, 'osocozy': 1.0, 'moist': 1.0}


Since we are only working with a subset of these words, the column **word_count_subset** is a subset of the above dictionary. In this example, only 2 `significant words` are present in this review.

In [48]:
print(train_data[0]['word_count_subset'])

{'disappointed': 1.0, 'love': 1.0}
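For a single review, keeping only the significant words amounts to filtering a dictionary by key. The sketch below reproduces the result above with a dictionary comprehension; `trim_by_keys` is a hypothetical helper and is not how `dict_trim_by_keys` is implemented.

```python
significant_words = ['love', 'great', 'easy', 'old', 'little', 'perfect', 'loves',
                     'well', 'able', 'car', 'broke', 'less', 'even', 'waste',
                     'disappointed', 'work', 'product', 'money', 'would', 'return']

# Word counts of the first training review, copied from the output above
word_count = {'recommend': 1.0, 'highly': 1.0, 'disappointed': 1.0, 'love': 1.0,
              'it': 3.0, 'planet': 1.0, 'and': 3.0, 'bags': 1.0, 'wipes': 1.0,
              'not': 2.0, 'early': 1.0, 'came': 1.0, 'i': 1.0, 'does': 1.0,
              'wise': 1.0, 'my': 2.0, 'was': 1.0, 'now': 1.0, 'wipe': 1.0,
              'holder': 1.0, 'leak': 1.0, 'keps': 1.0, 'osocozy': 1.0, 'moist': 1.0}

def trim_by_keys(counts, keys):
    """Keep only the entries of `counts` whose key appears in `keys`."""
    keep = set(keys)  # set lookup keeps the filter fast for long key lists
    return {word: count for word, count in counts.items() if word in keep}

word_count_subset = trim_by_keys(word_count, significant_words)
print(word_count_subset)
```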


## Train a logistic regression model on a subset of words

In [49]:
simple_model = turicreate.logistic_classifier.create(train_data,
                                                     target = 'sentiment',
                                                     features=['word_count_subset'],
                                                     validation_set=None)
simple_model

Class                          : LogisticClassifier

Schema
------
Number of coefficients         : 21
Number of examples             : 133416
Number of classes              : 2
Number of feature columns      : 1
Number of unpacked features    : 20

Hyperparameters
---------------
L1 penalty                     : 0.0
L2 penalty                     : 0.01

Training Summary
----------------
Solver                         : newton
Solver iterations              : 6
Solver status                  : SUCCESS: Optimal solution found.
Training time (sec)            : 0.3676

Settings
--------
Log-likelihood                 : 44323.7254

Highest Positive Coefficients
-----------------------------
word_count_subset[loves]       : 1.6773
word_count_subset[perfect]     : 1.5145
word_count_subset[love]        : 1.3654
(intercept)                    : 1.2995
word_count_subset[easy]        : 1.1937

Lowest Negative Coefficients
----------------------------
word_count_subset[disappointed] : -2.3551
wo

We can compute the classification accuracy using the `get_classification_accuracy` function you implemented earlier.

In [50]:
get_classification_accuracy(simple_model, test_data, test_data['sentiment'])

0.8693004559635229

Now, we will inspect the weights (coefficients) of the **simple_model**:

In [51]:
simple_model.coefficients

name,index,class,value,stderr
(intercept),,1,1.2995449552027043,0.0120888541330532
word_count_subset,disappointed,1,-2.3550925006107253,0.0504149888556979
word_count_subset,love,1,1.3654354936790372,0.0303546295109051
word_count_subset,well,1,0.5042567463979284,0.02138130063099
word_count_subset,product,1,-0.320555492995575,0.0154311321362016
word_count_subset,loves,1,1.6772714555592918,0.0482328275383501
word_count_subset,little,1,0.5206286360250184,0.0214691475664903
word_count_subset,work,1,-0.6217000124253143,0.0230330597945848
word_count_subset,easy,1,1.1936618983284648,0.0292888692020295
word_count_subset,great,1,0.9446912694798444,0.02095099265905


Let's sort the coefficients (in descending order) by the **value** to obtain the coefficients with the most positive effect on the sentiment.

In [52]:
simple_model.coefficients.sort('value', ascending=False).print_rows(num_rows=21)

+-------------------+--------------+-------+----------------------+
|        name       |    index     | class |        value         |
+-------------------+--------------+-------+----------------------+
| word_count_subset |    loves     |   1   |  1.6772714555592918  |
| word_count_subset |   perfect    |   1   |  1.5144862670271348  |
| word_count_subset |     love     |   1   |  1.3654354936790372  |
|    (intercept)    |     None     |   1   |  1.2995449552027043  |
| word_count_subset |     easy     |   1   |  1.1936618983284648  |
| word_count_subset |    great     |   1   |  0.9446912694798443  |
| word_count_subset |    little    |   1   |  0.5206286360250184  |
| word_count_subset |     well     |   1   |  0.5042567463979284  |
| word_count_subset |     able     |   1   |  0.1914383022947509  |
| word_count_subset |     old      |   1   |  0.0853961886678159  |
| word_count_subset |     car      |   1   | 0.05883499006802042  |
| word_count_subset |     less     |   1   | -0.

In [53]:
simple_weights = simple_model.coefficients
positive_significant_words = simple_weights[(simple_weights['value'] > 0) & (simple_weights['name'] == "word_count_subset")]['index']
print(len(positive_significant_words))
print(positive_significant_words)

10
['love', 'well', 'loves', 'little', 'easy', 'great', 'able', 'perfect', 'old', 'car']


In [54]:
weights.filter_by(positive_significant_words, 'index')

name,index,class,value,stderr
word_count,love,1,0.8405057320615064,
word_count,well,1,0.4010755749233184,
word_count,loves,1,0.974982312514265,
word_count,little,1,0.4099316272571706,
word_count,easy,1,0.7349826255674924,
word_count,great,1,0.7789532883805084,
word_count,able,1,0.1075280219142423,
word_count,perfect,1,1.0447994204048685,
word_count,old,1,0.0796749090098758,
word_count,car,1,0.11965787650766,


# Comparing models

In [55]:
get_classification_accuracy(sentiment_model, train_data, train_data['sentiment'])

0.976494573364514

In [56]:
get_classification_accuracy(simple_model, train_data, train_data['sentiment'])

0.8668150746537147

In [57]:
get_classification_accuracy(sentiment_model, test_data, test_data['sentiment'])

0.9221862251019919

In [58]:
get_classification_accuracy(simple_model, test_data, test_data['sentiment'])

0.8693004559635229

## Baseline: Majority class prediction



What is the majority class in the **train_data**?

In [59]:
num_positive  = (train_data['sentiment'] == +1).sum()
num_negative = (train_data['sentiment'] == -1).sum()
print(num_positive)
print(num_negative)

112164
21252


In [63]:
print((test_data['sentiment'] == +1).sum())
print((test_data['sentiment'] == -1).sum())
print((test_data['sentiment'] == +1).sum()/len(test_data['sentiment']))

28095
5241
0.8427825773938085
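The counts above are enough to work out the accuracy of a majority class baseline, that is, a classifier that always predicts the most common label in the training data. The sketch below redoes that arithmetic with the numbers printed in this notebook and compares the result with the two test accuracies reported earlier. The variable names are hypothetical.

```python
# Counts printed above for train_data and test_data
num_positive_train, num_negative_train = 112164, 21252
num_positive_test, num_negative_test = 28095, 5241

# The baseline always predicts whichever label is more common in the training data
majority_label = +1 if num_positive_train >= num_negative_train else -1

num_test = num_positive_test + num_negative_test
num_correct = num_positive_test if majority_label == +1 else num_negative_test
baseline_accuracy = num_correct / num_test
print(majority_label, baseline_accuracy)

# Test accuracies reported earlier in this notebook
sentiment_model_accuracy = 0.9221862251019919
simple_model_accuracy = 0.8693004559635229
print(sentiment_model_accuracy > baseline_accuracy)
print(simple_model_accuracy > baseline_accuracy)
```

Because roughly 84% of the test reviews are positive, a model's accuracy is only informative when it is read against this baseline.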
