In [11]:
import pandas as pd
import numpy as np


Load Amazon dataset

In [12]:
products = pd.read_csv('/content/amazon_baby.csv')

Perform text cleaning

In [15]:
#We start by removing punctuation, so that words "cake." and "cake!" are counted as the same word.
products = products.fillna({'review':''})  # fill in N/A's in the review column

In [21]:
def remove_punctuation(text):
    import string
    return text.translate(string.punctuation) 
products['review_clean'] = products['review'].apply(remove_punctuation)

In [22]:
products.head(5)

Unnamed: 0,name,review,rating,review_clean
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5,it came early and was not disappointed. i love...
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5,Very soft and comfortable and warmer than it l...
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5,This is a product well worth the purchase. I ...
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5,All of my kids have cried non-stop when I trie...
5,Stop Pacifier Sucking without tears with Thumb...,"When the Binky Fairy came to our house, we did...",5,"When the Binky Fairy came to our house, we did..."


Extract Sentiments

In [20]:
#We will ignore all reviews with rating = 3, since they tend to have a neutral sentiment.
products = products[products['rating'] != 3]

Now, we will assign reviews with a rating of 4 or higher to be positive reviews, while the ones with rating of 2 or lower are negative. For the sentiment column, we use +1 for the positive class label and -1 for the negative class label. A good way is to create an anonymous function that converts a rating into a class label and then apply that function to every element in the rating column. In SFrame, you would use apply():

In [23]:
products['sentiment'] = products['rating'].apply(lambda rating : +1 if rating > 3 else -1)

In [24]:
products.head(3)

Unnamed: 0,name,review,rating,review_clean,sentiment
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5,it came early and was not disappointed. i love...,1
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5,Very soft and comfortable and warmer than it l...,1
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5,This is a product well worth the purchase. I ...,1


Split into training and test sets


In [27]:
#Let's perform a train/test split with 80% of the data in the training set and 20% of the data in the test set.
import json
with open('/content/module-2-assignment-test-idx.json') as test_data_file:    
    test_data_idx = json.load(test_data_file)
with open('/content/module-2-assignment-train-idx.json') as train_data_file:    
    train_data_idx = json.load(train_data_file)

print (train_data_idx[:3])
print (test_data_idx[:3])

[0, 1, 2]
[8, 9, 14]


In [28]:
train_data = products.iloc[train_data_idx]
train_data.head(2)

Unnamed: 0,name,review,rating,review_clean,sentiment
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5,it came early and was not disappointed. i love...,1
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5,Very soft and comfortable and warmer than it l...,1


In [29]:
test_data = products.iloc[test_data_idx]
test_data.head(2)

Unnamed: 0,name,review,rating,review_clean,sentiment
9,"Baby Tracker&reg; - Daily Childcare Journal, S...",This has been an easy way for my nanny to reco...,4,This has been an easy way for my nanny to reco...,1
10,"Baby Tracker&reg; - Daily Childcare Journal, S...",I love this journal and our nanny uses it ever...,4,I love this journal and our nanny uses it ever...,1


Build the word count vector for each review

We will now compute the word count for each word that appears in the reviews. A vector consisting of word counts is often referred to as bag-of-word features. Since most words occur in only a few reviews, word count vectors are sparse. For this reason, scikit-learn and many other tools use sparse matrices to store a collection of word count vectors. Refer to appropriate manuals to produce sparse word count vectors. General steps for extracting word count vectors are as follows:

Learn a vocabulary (set of all words) from the training data. Only the words that show up in the training data will be considered for feature extraction.

Compute the occurrences of the words in each review and collect them into a row vector.

Build a sparse matrix where each row is the word count vector for the corresponding review. Call this matrix train_matrix.

Using the same mapping between words and columns, convert the test data into a sparse matrix test_matrix.

In [30]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
     # Use this token pattern to keep single-letter words
# First, learn vocabulary from the training data and assign columns to words
# Then convert the training data into a sparse matrix
train_matrix = vectorizer.fit_transform(train_data['review_clean'])
# Second, convert the test data into a sparse matrix, using the same word-column mapping
test_matrix = vectorizer.transform(test_data['review_clean'])
#print vectorizer.vocabulary_

Train a sentiment classifier with logistic regression

Learn a logistic regression classifier using the training data. If you are using scikit-learn, you should create an instance of the LogisticRegression class and then call the method fit() to train the classifier. This model should use the sparse word count matrix (train_matrix) as features and the column sentiment of train_data as the target. Use the default values for other parameters. Call this model sentiment_model.

In [31]:
from sklearn.linear_model import LogisticRegression
sentiment_model = LogisticRegression()
sentiment_model.fit(train_matrix, train_data['sentiment'])

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [32]:
np.sum(sentiment_model.coef_ >= 0)

40075

Making predictions with logistic regression

In [34]:
sample_test_data = test_data.iloc[10:13]
print (sample_test_data)

                                                 name  ... sentiment
59                          Our Baby Girl Memory Book  ...         1
71  Wall Decor Removable Decal Sticker - Colorful ...  ...        -1
91  New Style Trailing Cherry Blossom Tree Decal R...  ...        -1

[3 rows x 5 columns]


In [35]:
sample_test_data.iloc[0]['review']

'Absolutely love it and all of the Scripture in it.  I purchased the Baby Boy version for my grandson when he was born and my daughter-in-law was thrilled to receive the same book again.'

In [36]:
sample_test_data.iloc[1]['review']

'Would not purchase again or recommend. The decals were thick almost plastic like and were coming off the wall as I was applying them! The would NOT stick! Literally stayed stuck for about 5 minutes then started peeling off.'

In [38]:
sample_test_matrix = vectorizer.transform(sample_test_data['review_clean'])
scores = sentiment_model.decision_function(sample_test_matrix)
print (scores)

print (sentiment_model.predict(sample_test_matrix))

[  5.23042889  -3.0298181  -11.01522803]
[ 1 -1 -1]


Probability Predictions

In [40]:
print ([1./(1+np.exp(-x)) for x in scores])

[0.9946772535022802, 0.04609682504841702, 1.6449022940204537e-05]


In [41]:
print (sentiment_model.classes_)
print (sentiment_model.predict_proba(sample_test_matrix))

[-1  1]
[[5.32274650e-03 9.94677254e-01]
 [9.53903175e-01 4.60968250e-02]
 [9.99983551e-01 1.64490229e-05]]


In [43]:
#20 most positive reviews
test_scores = sentiment_model.decision_function(test_matrix)
positive_idx = np.argsort(-test_scores)[:20]
print (positive_idx)
print (test_scores[positive_idx[0]])
test_data.iloc[positive_idx]

[30634 18112 15732 25554  9555 21531 17558 11923 24899 24286 33060 30297
 21203 20743  2570 26830 16502 30076  9125 27048]
47.01454533421485


Unnamed: 0,name,review,rating,review_clean,sentiment
168697,Graco FastAction Fold Jogger Click Connect Str...,Graco's FastAction Jogging Stroller definitely...,5,Graco's FastAction Jogging Stroller definitely...,1
100166,"Infantino Wrap and Tie Baby Carrier, Black Blu...",I bought this carrier when my daughter was abo...,5,I bought this carrier when my daughter was abo...,1
87017,Baby Einstein Around The World Discovery Center,I am so HAPPY I brought this item for my 7 mon...,5,I am so HAPPY I brought this item for my 7 mon...,1
140816,"Diono RadianRXT Convertible Car Seat, Plum",I bought this seat for my tall (38in) and thin...,5,I bought this seat for my tall (38in) and thin...,1
52631,Evenflo X Sport Plus Convenience Stroller - Ch...,After seeing this in Parent's Magazine and rea...,5,After seeing this in Parent's Magazine and rea...,1
119182,Roan Rocco Classic Pram Stroller 2-in-1 with B...,Great Pram Rocco!!!!!!I bought this pram from ...,5,Great Pram Rocco!!!!!!I bought this pram from ...,1
97325,Freemie Hands-Free Concealable Breast Pump Col...,I absolutely love this product. I work as a C...,5,I absolutely love this product. I work as a C...,1
66059,"Evenflo 6 Pack Classic Glass Bottle, 4-Ounce",It's always fun to write a review on those pro...,5,It's always fun to write a review on those pro...,1
137034,Graco Pack 'n Play Element Playard - Flint,My husband and I assembled this Pack n' Play l...,4,My husband and I assembled this Pack n' Play l...,1
133651,"Britax 2012 B-Agile Stroller, Red",[I got this stroller for my daughter prior to ...,4,[I got this stroller for my daughter prior to ...,1


In [44]:
#20 most negative reviews
negative_idx = np.argsort(test_scores)[:20]
print (negative_idx)
print (test_scores[negative_idx[0]])
test_data.iloc[negative_idx]

[17069 28184  2931 21700 17122 30373  8818 17222 28120  1810 31928   205
 13680 10814  9655 13751 14711 13939 17428 17034]
-37.779838766263815


Unnamed: 0,name,review,rating,review_clean,sentiment
94560,The First Years True Choice P400 Premium Digit...,Note: we never installed batteries in these un...,1,Note: we never installed batteries in these un...,-1
155287,VTech Communications Safe &amp; Sounds Full Co...,"This is my second video monitoring system, the...",1,"This is my second video monitoring system, the...",-1
16042,Fisher-Price Ocean Wonders Aquarium Bouncer,We have not had ANY luck with Fisher-Price pro...,2,We have not had ANY luck with Fisher-Price pro...,-1
120209,Levana Safe N'See Digital Video Baby Monitor w...,This is the first review I have ever written o...,1,This is the first review I have ever written o...,-1
94815,Stork Craft Heather Dressing Table with Drawer...,"The construction is simply terrible, and not f...",2,"The construction is simply terrible, and not f...",-1
167249,Samsung SEW-3037W Wireless Pan Tilt Video Baby...,Reviewers. You failed me!This thing worked for...,1,Reviewers. You failed me!This thing worked for...,-1
48694,Adiri BPA Free Natural Nurser Ultimate Bottle ...,I will try to write an objective review of the...,2,I will try to write an objective review of the...,-1
95420,One Step Ahead Hide-Away Extra Long Bed Rail,"I bought a brand new 56"" hide-away bed safety ...",1,"I bought a brand new 56"" hide-away bed safety ...",-1
154878,VTech Communications Safe &amp; Sound Digital ...,"First, the distance on these are no more than ...",1,"First, the distance on these are no more than ...",-1
9915,Cosco Alpha Omega Elite Convertible Car Seat,I bought this car seat after both seeing the ...,1,I bought this car seat after both seeing the ...,-1


Accuracy of the classifier

In [46]:
predicted_y = sentiment_model.predict(test_matrix)
correct_num = np.sum(predicted_y == test_data['sentiment'])
total_num = len(test_data['sentiment'])
print ("correct_num: {}, total_num: {}".format(correct_num, total_num))
accuracy = correct_num * 1./ total_num
print (accuracy)

correct_num: 31082, total_num: 33336
0.9323854091672666


Learn another classifier with fewer words

In [47]:
significant_words = ['love', 'great', 'easy', 'old', 'little', 'perfect', 'loves', 
      'well', 'able', 'car', 'broke', 'less', 'even', 'waste', 'disappointed', 
      'work', 'product', 'money', 'would', 'return']

In [48]:
vectorizer_word_subset = CountVectorizer(vocabulary=significant_words) # limit to 20 words
train_matrix_word_subset = vectorizer_word_subset.fit_transform(train_data['review_clean'])
test_matrix_word_subset = vectorizer_word_subset.transform(test_data['review_clean'])

Train a logistic regression model on a subset of data

In [49]:
simple_model = LogisticRegression()
simple_model.fit(train_matrix_word_subset, train_data['sentiment'])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [50]:
simple_model_coef_table = pd.DataFrame({'word':significant_words,
                                         'coefficient':simple_model.coef_.flatten()})
#simple_model_coef_table
simple_model_coef_table.sort_values(['coefficient'], ascending=False)

Unnamed: 0,word,coefficient
6,loves,1.677511
5,perfect,1.506779
0,love,1.356381
2,easy,1.183212
1,great,0.943456
7,well,0.530037
4,little,0.514496
8,able,0.193097
3,old,0.082757
9,car,0.056485


In [51]:
len(simple_model_coef_table[simple_model_coef_table['coefficient']>0])

10

In [52]:
model_coef_table = pd.DataFrame({'word':significant_words,
                                         'coefficient':simple_model.coef_.flatten()})
#simple_model_coef_table
simple_model_coef_table.sort_values(['coefficient'], ascending=False)

Unnamed: 0,word,coefficient
6,loves,1.677511
5,perfect,1.506779
0,love,1.356381
2,easy,1.183212
1,great,0.943456
7,well,0.530037
4,little,0.514496
8,able,0.193097
3,old,0.082757
9,car,0.056485


In [53]:
vectorizer_word_subset.get_feature_names()

['love',
 'great',
 'easy',
 'old',
 'little',
 'perfect',
 'loves',
 'well',
 'able',
 'car',
 'broke',
 'less',
 'even',
 'waste',
 'disappointed',
 'work',
 'product',
 'money',
 'would',
 'return']

Comparing models

In [54]:
train_predicted_y = sentiment_model.predict(train_matrix)
correct_num = np.sum(train_predicted_y == train_data['sentiment'])
total_num = len(train_data['sentiment'])
print ("correct_num: {}, total_num: {}".format(correct_num, total_num))
train_accuracy = correct_num * 1./ total_num
print ("sentiment_model training accuracy: {}".format(train_accuracy))

train_predicted_y = simple_model.predict(train_matrix_word_subset)
correct_num = np.sum(train_predicted_y == train_data['sentiment'])
total_num = len(train_data['sentiment'])
print ("correct_num: {}, total_num: {}".format(correct_num, total_num))
train_accuracy = correct_num * 1./ total_num
print ("simple_model training accuracy: {}".format(train_accuracy))

correct_num: 126305, total_num: 133416
sentiment_model training accuracy: 0.9467005456616897
correct_num: 115702, total_num: 133416
simple_model training accuracy: 0.8672273190621814


In [55]:
test_predicted_y = sentiment_model.predict(test_matrix)
correct_num = np.sum(test_predicted_y == test_data['sentiment'])
total_num = len(test_data['sentiment'])
print ("correct_num: {}, total_num: {}".format(correct_num, total_num))
test_accuracy = correct_num * 1./ total_num
print ("sentiment_model test accuracy: {}".format(test_accuracy))

test_predicted_y = simple_model.predict(test_matrix_word_subset)
correct_num = np.sum(test_predicted_y == test_data['sentiment'])
total_num = len(test_data['sentiment'])
print ("correct_num: {}, total_num: {}".format(correct_num, total_num))
test_accuracy = correct_num * 1./ total_num
print ("simple_model test accuracy: {}".format(test_accuracy))

correct_num: 31082, total_num: 33336
sentiment_model test accuracy: 0.9323854091672666
correct_num: 29009, total_num: 33336
simple_model test accuracy: 0.8702003839692825


In [57]:
positive_label = len(test_data[test_data['sentiment']>0])
negative_label = len(test_data[test_data['sentiment']<0])
print ("positive_label is {}, negative_label is {}".format(positive_label, negative_label))

positive_label is 28095, negative_label is 5241


In [59]:
baseline_accuracy = positive_label*1./(positive_label+negative_label)
print ("baseline_accuracy is {}".format(baseline_accuracy))

baseline_accuracy is 0.8427825773938085
