# Predicting sentiment from product reviews

In [8]:
import pandas as pd
import numpy as np
import string

# Read the data

In [2]:
products = pd.read_csv('amazon_baby.csv')

In [3]:
products

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5
5,Stop Pacifier Sucking without tears with Thumb...,"When the Binky Fairy came to our house, we did...",5
6,A Tale of Baby's Days with Peter Rabbit,"Lovely book, it's bound tightly so you may not...",4
7,"Baby Tracker&reg; - Daily Childcare Journal, S...",Perfect for new parents. We were able to keep ...,5
8,"Baby Tracker&reg; - Daily Childcare Journal, S...",A friend of mine pinned this product on Pinter...,5
9,"Baby Tracker&reg; - Daily Childcare Journal, S...",This has been an easy way for my nanny to reco...,4


# Perform text cleaning

In [4]:
# Work around pandas' automatic type inference: force every review to be a string
products['review'] = products['review'].astype(str)

In [9]:
def remove_punctuation(text):
    # Drop every character listed in string.punctuation (runs on Python 2 and 3)
    return ''.join(ch for ch in text if ch not in string.punctuation)

products['review_clean'] = products['review'].apply(remove_punctuation)

In [10]:
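# Fill missing values in the review column with an empty string.
# Caveat: astype(str) above typically turns missing values into the string 'nan',
# so this call only has an effect if it runs before that conversion.
products = products.fillna({'review': ''})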
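
In Python 3, `str.translate` takes only a translation table, so one way to strip punctuation there is to build the table with `str.maketrans`. The following is a minimal sketch under that assumption; the helper name `remove_punctuation_py3` and the sample sentence are illustrative and not part of the assignment.

```python
import string

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punctuation_py3(text):
    """Hypothetical Python 3 variant of remove_punctuation."""
    return text.translate(PUNCT_TABLE)

sample = "it came early and was not disappointed. i love it!"
cleaned = remove_punctuation_py3(sample)
print(cleaned)
```

Note that hyphenated words are joined by this cleaning step, which is why "non-stop" appears as "nonstop" in the `review_clean` column.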

# Extract Sentiments

We will ignore all reviews with rating = 3, since they tend to have a neutral sentiment.

In [11]:
products = products[products['rating'] != 3]

Now we assign reviews with a rating of 4 or higher to the positive class and reviews with a rating of 2 or lower to the negative class. In the sentiment column, we use +1 for the positive class label and -1 for the negative class label. One convenient way is to write an anonymous function that converts a rating into a class label and then apply that function to every element in the rating column.

In [12]:
products['sentiment'] = products['rating'].apply(lambda rating : +1 if rating > 3 else -1)
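
As a quick sanity check of the labeling rule, here is a small sketch on hand-made ratings. The list `ratings` and the helper name `rating_to_sentiment` are made up for illustration and do not come from the dataset.

```python
def rating_to_sentiment(rating):
    """Map a star rating to +1 or -1; assumes neutral ratings (3) were already dropped."""
    return +1 if rating > 3 else -1

ratings = [5, 4, 2, 1, 3]

# Mirror the two notebook steps: drop neutral reviews, then label the rest
kept = [r for r in ratings if r != 3]
labels = [rating_to_sentiment(r) for r in kept]
print(list(zip(kept, labels)))
```

The order of the two steps matters: if the neutral reviews were not removed first, a rating of 3 would silently be labeled -1 by this rule.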

In [13]:
products.iloc[:3]

Unnamed: 0,name,review,rating,review_clean,sentiment
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5,it came early and was not disappointed i love ...,1
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5,Very soft and comfortable and warmer than it l...,1
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5,This is a product well worth the purchase I h...,1


# Split into training and test sets

In [14]:
import json

In [15]:
with open('module-2-assignment-train-idx.json') as json_data:
    train_data_idx = json.load(json_data)

In [16]:
with open('module-2-assignment-test-idx.json') as json_data:
    test_data_idx = json.load(json_data)

In [17]:
train_data = products.iloc[train_data_idx,:]

In [18]:
train_data.iloc[3,:]

name            Stop Pacifier Sucking without tears with Thumb...
review          All of my kids have cried non-stop when I trie...
rating                                                          5
review_clean    All of my kids have cried nonstop when I tried...
sentiment                                                       1
Name: 4, dtype: object

In [19]:
test_data = products.iloc[test_data_idx,:]

In [20]:
print("%d %d" % (len(train_data_idx), len(train_data)))

133416 133416


# Build the word count vector for each review

We will now compute the count of each word that appears in the reviews. 
A vector of word counts is often referred to as bag-of-words features. 
Since most words occur in only a few reviews, word count vectors are sparse. 
For this reason, scikit-learn and many other tools store a collection of word count vectors as a sparse matrix. 
Refer to the documentation of your tool (here, scikit-learn's `CountVectorizer`) to produce sparse word count vectors. 
General steps for extracting word count vectors are as follows:

* Learn a vocabulary (set of all words) from the training data. 
  Only the words that show up in the training data will be considered for feature extraction.
* Compute the occurrences of the words in each review and collect them into a row vector.
* Build a sparse matrix where each row is the word count vector for the corresponding review. Call this matrix **train_matrix**.
* Using the same mapping between words and columns, convert the test data into a sparse matrix **test_matrix**.

In [21]:
from sklearn.feature_extraction.text import CountVectorizer

# Use this token pattern to keep single-letter words
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
# First, learn the vocabulary from the training data, assign columns to words,
# and convert the training data into a sparse matrix
train_matrix = vectorizer.fit_transform(train_data['review_clean'])
# Second, convert the test data into a sparse matrix, using the same word-column mapping
test_matrix = vectorizer.transform(test_data['review_clean'])
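
To make the steps above concrete, here is a plain-Python sketch of the same idea on a toy corpus. It is a simplification and not how `CountVectorizer` is implemented: it splits on whitespace, does no lowercasing, and stores each sparse row as a dictionary. The review strings and the names `column_of` and `count_vector` are invented for illustration.

```python
from collections import Counter

train_reviews = ["great product love it", "bad product broke fast"]
test_reviews = ["love this great product"]

# Step 1: learn the vocabulary from the training reviews only
vocabulary = sorted({word for review in train_reviews for word in review.split()})
column_of = {word: j for j, word in enumerate(vocabulary)}

def count_vector(review):
    """Return a sparse row as {column index: count}, ignoring words outside the vocabulary."""
    counts = Counter(word for word in review.split() if word in column_of)
    return {column_of[word]: n for word, n in counts.items()}

# Steps 2 and 3: one sparse row per review, using the same word-to-column mapping
train_rows = [count_vector(r) for r in train_reviews]
test_rows = [count_vector(r) for r in test_reviews]

print(vocabulary)
print(test_rows[0])
```

The word "this" never appears in the toy training reviews, so it has no column and is dropped from the test row. The same thing happens to unseen words when `vectorizer.transform` is applied to the test data.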

# Train a sentiment classifier with logistic regression

This model uses the sparse word count matrix (**train_matrix**) as features and the column **sentiment** of **train_data** as the target. Use the default values for other parameters.

In [22]:
from sklearn import linear_model
sentiment_model = linear_model.LogisticRegression()
sentiment_model.fit(train_matrix,train_data['sentiment'])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [23]:
num_positive_weights = len(sentiment_model.coef_[sentiment_model.coef_>=0])
num_negative_weights = len(sentiment_model.coef_[sentiment_model.coef_<0])

print("Number of positive weights: %s " % num_positive_weights)
print("Number of negative weights: %s " % num_negative_weights)
print("Total weights %s" % (num_positive_weights + num_negative_weights))

Number of positive weights: 86772 
Number of negative weights: 34941 
Total weights 121713


## Making predictions with logistic regression

Now that a model is trained, we can make predictions on the **test data**. In this section, we will explore this in the context of 3 examples in the test dataset.  We refer to this set of 3 examples as the **sample_test_data**.

In [24]:
sample_test_data = test_data.iloc[10:13]  # rows 10, 11 and 12 by position
print(sample_test_data)

                                                 name  \
59                          Our Baby Girl Memory Book   
71  Wall Decor Removable Decal Sticker - Colorful ...   
91  New Style Trailing Cherry Blossom Tree Decal R...   

                                               review  rating  \
59  Absolutely love it and all of the Scripture in...       5   
71  Would not purchase again or recommend. The dec...       2   
91  Was so excited to get this product for my baby...       1   

                                         review_clean  sentiment  
59  Absolutely love it and all of the Scripture in...          1  
71  Would not purchase again or recommend The deca...         -1  
91  Was so excited to get this product for my baby...         -1  


We will now make a **class** prediction for the **sample_test_data**. The `sentiment_model` should predict **+1** if the sentiment is positive and **-1** if the sentiment is negative. Recall from the lecture that the **score** (sometimes called **margin**) for the logistic regression model  is defined as:

$$
\mbox{score}_i = \mathbf{w}^T h(\mathbf{x}_i)
$$ 

where $h(\mathbf{x}_i)$ represents the features for example $i$. For each row, the **score** (or margin) is a real number in the range **(-inf, inf)**.

In [25]:
sample_test_matrix = vectorizer.transform(sample_test_data['review_clean'])
scores = sentiment_model.decision_function(sample_test_matrix)
print(scores)

[  5.58798096  -3.18610889 -10.42939887]
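
To see what the score formula computes, here is a small hand-worked sketch. The weights below are hypothetical values chosen for illustration, not the coefficients of the trained model. Note also that scikit-learn's `decision_function` adds the fitted intercept to the dot product; the sketch sets the intercept to zero so that it matches the formula above.

```python
# Hypothetical weights for a three-word vocabulary (not the trained model's values)
weights = {"love": 1.5, "great": 1.0, "broke": -2.0}
intercept = 0.0  # assumed zero so that score = w^T h(x)

def score(word_counts):
    """Compute w^T h(x) for a review given as {word: count}."""
    return intercept + sum(weights.get(word, 0.0) * count
                           for word, count in word_counts.items())

positive_review = {"love": 2, "great": 1}   # 2 * 1.5 + 1 * 1.0
negative_review = {"broke": 1, "great": 1}  # 1 * (-2.0) + 1 * 1.0

print(score(positive_review))
print(score(negative_review))
```

Words with positive weights push the score up and words with negative weights pull it down, each in proportion to how often the word occurs in the review.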


### Predicting sentiment

These scores can be used to make class predictions as follows:

$$
\hat{y} = 
\left\{
\begin{array}{ll}
      +1 & \mathbf{w}^T h(\mathbf{x}_i) > 0 \\
      -1 & \mathbf{w}^T h(\mathbf{x}_i) \leq 0 \\
\end{array} 
\right.
$$

Using scores, write code to calculate $\hat{y}$, the class predictions:

In [26]:
# +1 where the score is strictly positive, -1 otherwise
predictions = np.where(scores > 0, +1, -1)
predictions

array([ 1, -1, -1])

In [27]:
print("Class predictions according to Scikit:")
print(sentiment_model.predict(sample_test_matrix))

Class predictions according to Scikit:
[ 1 -1 -1]


### Probability predictions

We can also calculate the probability predictions from the scores using:
$$
P(y_i = +1 | \mathbf{x}_i,\mathbf{w}) = \frac{1}{1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))}.
$$

These probabilities can be used to make class predictions as follows:

$$
\hat{y} = 
\left\{
\begin{array}{ll}
      +1 & P(y_i = +1 | \mathbf{x}_i,\mathbf{w}) > 0.5 \\
      -1 & P(y_i = +1 | \mathbf{x}_i,\mathbf{w}) \leq 0.5 \\
\end{array} 
\right.
$$

Using the variable **scores** calculated previously, write code to calculate the probability that a sentiment is positive using the above formula. For each row, the probabilities should be a number in the range **[0, 1]**.

In [31]:
# Sigmoid of the scores: P(y = +1 | x, w)
probabilities = 1 / (1 + np.exp(-scores))
# Threshold the probabilities at 0.5 to get class predictions
prob_predictions = np.where(probabilities > 0.5, +1, -1)
print(prob_predictions)

[ 1 -1 -1]
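
The two decision rules agree because the sigmoid equals exactly 0.5 at a score of 0 and increases with the score. Here is a short standard-library sketch that checks this on the three scores printed earlier; the names `example_scores` and `sigmoid` are introduced only for this illustration.

```python
import math

# The three scores printed by decision_function above
example_scores = [5.58798096, -3.18610889, -10.42939887]

def sigmoid(score):
    """Convert a score into P(y = +1 | x, w)."""
    return 1.0 / (1.0 + math.exp(-score))

probabilities = [sigmoid(s) for s in example_scores]

# Class predictions from the probability rule and from the score rule
from_probability = [+1 if p > 0.5 else -1 for p in probabilities]
from_score = [+1 if s > 0 else -1 for s in example_scores]

print(probabilities)
print(from_probability == from_score)
```

The probabilities carry more information than the class labels: the third review has a far more negative score than the second, so its probability of being positive is much closer to 0.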


## Compute accuracy of the classifier

We will now evaluate the accuracy of the trained classifier. Recall that the accuracy is given by


$$
\mbox{accuracy} = \frac{\mbox{# correctly classified examples}}{\mbox{# total examples}}
$$

This can be computed as follows:

* **Step 1:** Use the trained model to compute class predictions (**Hint:** Use the `predict` method)
* **Step 2:** Count the number of data points where the predicted class labels match the ground truth labels 
* **Step 3:** Divide the total number of correct predictions by the total number of data points in the dataset.

Complete the function below to compute the classification accuracy:

In [36]:
from sklearn.metrics import accuracy_score
predictions = sentiment_model.predict(test_matrix)
accuracy_score(test_data['sentiment'], predictions)

0.93244540436765055

In [41]:
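def calc_accuracy(model, data, target):
    # Step 1: class predictions from the trained model
    predictions = model.predict(data)
    # Step 2: keep only the labels that were predicted correctly
    correct_predictions = target[predictions == target]
    # Step 3: fraction of correct predictions (float() avoids integer division)
    accuracy = len(correct_predictions) / float(len(target))
    return accuracy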

In [42]:
calc_accuracy(sentiment_model,test_matrix,test_data['sentiment'])

0.9324454043676506
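
For intuition, the same three steps can be checked by hand on a toy example. This sketch uses made-up labels rather than the review data, and the function name `accuracy` and both lists are illustrative.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / float(len(actual))

toy_actual = [+1, -1, +1, +1]
toy_predicted = [+1, -1, -1, +1]  # the third prediction is wrong

print(accuracy(toy_predicted, toy_actual))
```

Three of the four toy predictions match, so the function returns 3/4.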