# Predicting sentiment from product reviews


The goal of this notebook is to explore logistic regression and feature engineering.

In this notebook we will use product review data from Amazon.com to predict whether the sentiment expressed in a product review is positive or negative. We will:

* Use pandas DataFrames to do some feature engineering.
* Train a logistic regression model to predict the sentiment of product reviews.
* Inspect the weights (coefficients) of a trained logistic regression model.
* Make a prediction (both class and probability) of sentiment for a new product review.
* Given the logistic regression weights, predictors and ground truth labels, write a function to compute the **accuracy** of the model.
* Inspect the coefficients of the logistic regression model and interpret their meanings.
* Compare multiple logistic regression models.

Let's get started!

In [1]:
import pandas as pd

# Data preparation

We will use a dataset consisting of baby product reviews on Amazon.com.

In [2]:
products = pd.read_csv("amazon_baby.csv")

In [3]:
products = products[0:10000]

In [4]:
products

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5
5,Stop Pacifier Sucking without tears with Thumb...,"When the Binky Fairy came to our house, we did...",5
6,A Tale of Baby's Days with Peter Rabbit,"Lovely book, it's bound tightly so you may not...",4
7,"Baby Tracker&reg; - Daily Childcare Journal, S...",Perfect for new parents. We were able to keep ...,5
8,"Baby Tracker&reg; - Daily Childcare Journal, S...",A friend of mine pinned this product on Pinter...,5
9,"Baby Tracker&reg; - Daily Childcare Journal, S...",This has been an easy way for my nanny to reco...,4


## Build the word count vector for each review

Let us explore a specific example of a baby product.

In [5]:
products.iloc[9]

name      Baby Tracker&reg; - Daily Childcare Journal, S...
review    This has been an easy way for my nanny to reco...
rating                                                    4
Name: 9, dtype: object

Now, we will perform 2 simple data transformations:

1. Remove punctuation using [Python's built-in](https://docs.python.org/2/library/string.html) string functionality.
2. Transform the reviews into word-counts.

**Aside**. In this notebook, we remove all punctuation for the sake of simplicity. A smarter approach would preserve contractions such as "I'd", "would've", "hadn't" and so forth. See [this page](https://www.cis.upenn.edu/~treebank/tokenization.html) for an example of smart handling of punctuation.
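To make the aside concrete, here is a minimal sketch of one way to drop punctuation while keeping contractions intact. The helper name `tokenize_keep_contractions` and the regular expression are illustrative assumptions; this is not the Treebank tokenizer from the linked page and it is not used elsewhere in this notebook.

```python
import re

# Hypothetical pattern: a run of letters/digits, optionally followed by an
# apostrophe and more letters, so "I'd" and "hadn't" stay single tokens.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?")

def tokenize_keep_contractions(text):
    """Return word tokens, dropping punctuation but keeping contractions."""
    return TOKEN_PATTERN.findall(text)

tokens = tokenize_keep_contractions("I'd buy it again, but it hadn't arrived!")
```

A pattern like this still splits hyphenated words such as "ice-cream" into two tokens, which may or may not be what you want.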

In [6]:
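def remove_punctuation(text):
    import string
    # Drop every punctuation character. Unlike text.translate(None, ...),
    # which only exists on Python 2 byte strings, this also runs on Python 3.
    return ''.join(ch for ch in text if ch not in string.punctuation)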

review_without_puctuation = products['review'].apply(str).apply(remove_punctuation)

In [7]:
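my_words = set()

def my_split(text):
    """Tokenize text and add every token to the global vocabulary my_words."""
    import nltk  # imported here so the cell still runs when nltk is not installed
    for w in nltk.word_tokenize(text):
        my_words.add(w)

# Example of nltk tokenization (uncomment to try):
# import nltk
# sentence = "I am a big boy. I'd love to eat ice-cream right now."
# tokens = nltk.word_tokenize(sentence)
# print(tokens)
# print(type(tokens))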

In [8]:
significant_words = ['love', 'great', 'easy', 'old', 'little', 'perfect', 'loves', 
      'well', 'able', 'car', 'broke', 'less', 'even', 'waste', 'disappointed', 
      'work', 'product', 'money', 'would', 'return']
        
def count_number_of_significant_words(row):
    """Return a Series holding the count of each significant word in row['review']."""
    # Start every significant word at zero so absent words still get a column.
    word_dict = dict.fromkeys(significant_words, 0)
    for word in row['review'].split():
        if word in word_dict:
            word_dict[word] += 1
    significant_words_counts = [word_dict[word] for word in significant_words]
    return pd.Series(significant_words_counts, index=significant_words)

newcols = pd.DataFrame(review_without_puctuation).apply(count_number_of_significant_words, axis=1)
newcols.columns = significant_words

products_with_words = products.join(newcols)

Now, let us explore what the sample example above looks like after these 2 transformations. Here, each word in `significant_words` has its own column, and the value in that column is the number of times the word occurs in the review.

In [9]:
products_with_words.iloc[9]['love']

0

## Extract sentiments

We will **ignore** all reviews with *rating = 3*, since they tend to have a neutral sentiment.

In [10]:
products_with_words_ignore_3 = products_with_words[products_with_words['rating'] != 3]
len(products_with_words_ignore_3)

9146

Now, we will assign reviews with a rating of 4 or higher to be positive reviews, while the ones with rating of 2 or lower are negative. For the sentiment column, we use +1 for the positive class label and -1 for the negative class label.

In [11]:
sentiments = products_with_words_ignore_3['rating'].apply(lambda rating : +1 if rating > 3 else -1)

sentimentsdf = pd.DataFrame(sentiments)
sentimentsdf.columns = ['sentiment']

products_prepared = products_with_words_ignore_3.join(sentimentsdf)

Now, we can see that the dataset contains an extra column called sentiment which is either positive (+1) or negative (-1).

## Split data into training and test sets

Let's perform a train/test split with 80% of the data in the training set and 20% of the data in the test set. We use `random_state=0` so that everyone gets the same result.
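The idea behind a seeded split can be sketched in plain Python. This is only an illustration under stated assumptions: `split_indices` is a hypothetical helper, and it is not the algorithm scikit-learn uses internally, so it will not reproduce the same rows as the cell below.

```python
import math
import random

def split_indices(n_rows, test_fraction=0.2, seed=0):
    """Shuffle row indices with a fixed seed and split them into train/test."""
    indices = list(range(n_rows))
    # A private Random instance makes the shuffle repeatable for a given seed.
    random.Random(seed).shuffle(indices)
    n_test = int(math.ceil(n_rows * test_fraction))
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(10)
```

Calling `split_indices(10)` again returns exactly the same two lists, which is the property the fixed seed is meant to provide.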

In [12]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead.
from sklearn.model_selection import train_test_split

# X = products_prepared[['rating']]
X = products_prepared[significant_words]
y = products_prepared['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=0)

# Train a sentiment classifier with logistic regression

We will now use logistic regression to create a sentiment classifier on the training data. This model uses the counts of the words in `significant_words` as features and the column **sentiment** as the target. We set `C=1e5`; in scikit-learn, `C` is the inverse of the regularization strength, so the L2 penalty is very weak here.

**Note:** This line may take 1-2 minutes.

In [13]:
from sklearn import linear_model
logreg = linear_model.LogisticRegression(C=1e5)
model = logreg.fit(X_train, y_train)

# Evaluate the trained model

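We will now use the test set to evaluate our model. Since the labels are +1 and -1, each misclassified review contributes $|\hat{y}_i - y_i| / 2 = 1$ to the error sum, so the average error is the fraction of misclassified test reviews.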

In [14]:
print(model)
print(model.get_params())

# Predict all test rows in one call instead of indexing X_test row by row.
predictions = model.predict(X_test)
test_size = len(predictions)

print('Prediction sample {}'.format(predictions[0]))

# Counting mismatches gives the same total as summing |prediction - label| / 2.
error_sum = int((predictions != y_test).sum())

print('error sum is: {}'.format(error_sum))
average_error = float(error_sum) / test_size

print('average_error is: {}'.format(average_error))

LogisticRegression(C=100000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, penalty='l2',
          random_state=None, tol=0.0001)
{'C': 100000.0, 'intercept_scaling': 1, 'fit_intercept': True, 'penalty': 'l2', 'random_state': None, 'dual': False, 'tol': 0.0001, 'class_weight': None}
Prediction sample 1
error sum is: 328
average_error is: 0.179234972678


### Predicting sentiment

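For each data point $\mathbf{x}_i$, the model computes a score $\mathbf{w}^T h(\mathbf{x}_i)$, where $\mathbf{w}$ is the vector of learned weights and $h(\mathbf{x}_i)$ is the feature vector. These scores can be used to make class predictions as follows:

$$
\hat{y}_i = 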
\left\{
\begin{array}{ll}
      +1 & \mathbf{w}^T h(\mathbf{x}_i) > 0 \\
      -1 & \mathbf{w}^T h(\mathbf{x}_i) \leq 0 \\
\end{array} 
\right.
$$

Using scores, write code to calculate $\hat{y}$, the class predictions:
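As a rough sketch of the thresholding rule, the snippet below computes scores as plain dot products and maps them to class labels. The weights and feature vectors are made-up numbers for illustration only; they are not the coefficients of the model trained above, and the helper names are hypothetical.

```python
def score(weights, features):
    """Compute w^T h(x) as a plain dot product."""
    return sum(w * h for w, h in zip(weights, features))

def predict_class(weights, features):
    """Return +1 if the score is strictly positive, otherwise -1."""
    return 1 if score(weights, features) > 0 else -1

# Made-up weights for three word-count features (illustration only).
weights = [1.5, -2.0, 0.5]
reviews = [[2, 0, 1], [0, 1, 0], [0, 0, 0]]
predictions = [predict_class(weights, h) for h in reviews]
```

Note that a score of exactly zero falls into the negative class, matching the $\leq 0$ case of the rule.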

# Find the most positive (and negative) review

We now turn to examining the full test set, `X_test`, and form predictions on all of the test data points at once for faster performance.

Using the trained `model`, find the 20 reviews in the entire test set with the **highest probability** of being classified as a **positive review**. We refer to these as the "most positive reviews."

To calculate these top-20 reviews, use the following steps:
1.  Make probability predictions on `X_test` using `model`. (**Hint:** Call `.predict_proba` rather than `.predict` to output class probabilities instead of just the most likely class; the columns follow the order of `model.classes_`.)
2.  Sort the data according to those predictions and pick the top 20. (**Hint:** You can use the `.nlargest` method on a pandas Series or DataFrame to find the top k rows sorted according to the value of a specified column.)
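The two steps can be sketched without any library. In logistic regression the probability of the positive class is the sigmoid of the score, so ranking by probability only needs the scores. The scores below are made-up values and `top_k_positive` is a hypothetical helper, shown here with k = 2 instead of 20.

```python
import heapq
import math

def sigmoid(score):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

def top_k_positive(scores, k):
    """Return (index, probability) pairs for the k highest probabilities."""
    probabilities = [sigmoid(s) for s in scores]
    return heapq.nlargest(k, enumerate(probabilities), key=lambda pair: pair[1])

# Made-up scores for four reviews (illustration only).
scores = [0.3, 4.0, -2.5, 1.2]
top_two = top_k_positive(scores, 2)
```

Each returned pair carries the position of the review in the input, which is what you would use to look the review text up again.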

## Compute accuracy of the classifier

We will now evaluate the accuracy of the trained classifier. Recall that the accuracy is given by


$$
\mbox{accuracy} = \frac{\mbox{number of correctly classified examples}}{\mbox{total number of examples}}
$$

This can be computed as follows:

* **Step 1:** Use the trained model to compute class predictions (**Hint:** Use the `predict` method)
* **Step 2:** Count the number of data points where the predicted class labels match the ground truth labels (called `true_labels` below).
* **Step 3:** Divide the total number of correct predictions by the total number of data points in the dataset.
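Steps 2 and 3 can be sketched on toy lists, assuming the class predictions from Step 1 are already available. The function name `classification_accuracy` and the labels below are hypothetical and only illustrate the counting and the division.

```python
def classification_accuracy(predictions, true_labels):
    """Return the fraction of predictions that match the true labels."""
    if len(predictions) != len(true_labels):
        raise ValueError("predictions and true_labels must have the same length")
    # Step 2: count the positions where prediction and label agree.
    num_correct = sum(1 for p, t in zip(predictions, true_labels) if p == t)
    # Step 3: divide by the total; float() avoids integer division on Python 2.
    return num_correct / float(len(true_labels))

# Toy example: three of the four predictions are correct.
accuracy = classification_accuracy([1, -1, 1, 1], [1, -1, -1, 1])
```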

Complete the function below to compute the classification accuracy:

In [None]:
def get_classification_accuracy(model, data, true_labels):
    # First get the predictions
    ## YOUR CODE HERE
    ...
    
    # Compute the number of correctly classified examples
    ## YOUR CODE HERE
    ...

    # Then compute accuracy by dividing num_correct by total number of examples
    ## YOUR CODE HERE
    ...
    
    return accuracy

Now, let's compute the classification accuracy of the trained `model` on the test set.

In [None]:
get_classification_accuracy(model, X_test, y_test)

Quiz Question: What is the accuracy of `model` on the test set? Round your answer to 2 decimal places (e.g. 0.76).

Quiz Question: Does a higher accuracy value on the training data always imply that the classifier is better?