# Logistic Regression Model: Predicting sentiment from product reviews


The goal of this first notebook is to explore logistic regression and feature engineering with a model created from scratch.

In this notebook we will use product review data from Amazon.com to predict whether the sentiments about a product (from its reviews) are positive or negative.
## Fire up [SFrame](https://github.com/dato-code/SFrame)

In [1]:
import sframe

## Loading data

In [4]:
products = sframe.SFrame('Amazon_Instant_Video_5.csv')

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,list,float,str,str,str,str,str,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


# Exploring Data 1/4
Let us quickly explore this dataset.
1. Count the total number of reviews.
2. List the first rows of the dataset.

In [5]:
len(products)

37121

In [6]:
products.print_rows(num_rows=2, num_columns=10)

+------------+-------------------------------+---------+
|    asin    |            helpful            | overall |
+------------+-------------------------------+---------+
| B000H00VBQ | [None, 0, None, None, None... |   2.0   |
| B000H00VBQ | [None, 0, None, None, None... |   5.0   |
+------------+-------------------------------+---------+
+-------------------------------+------------+----------------+--------------+
|           reviewText          | reviewTime |   reviewerID   | reviewerName |
+-------------------------------+------------+----------------+--------------+
| I had big expectations bec... | 05 3, 2014 | A11N155CW1UV02 |   AdrianaM   |
| I highly recommend this se... | 09 3, 2012 | A3BC8O2KCL29V2 |   Carol T    |
+-------------------------------+------------+----------------+--------------+
+----------------------------+----------------+
|          summary           | unixReviewTime |
+----------------------------+----------------+
| A little bit boring for me |   1399075

# Data Engineering: defining which reviews have positive or negative sentiment


Here, data engineering simply means defining which reviews count as positive and which as negative. The choice is somewhat arbitrary:
1. Reviews with 4 or 5 stars are things that people liked, so they are positive.
2. Reviews with 1 or 2 stars are negative.
3. Reviews with 3 stars are ignored.

We then add a new column called `sentiment` to the table to hold this label.

We will **ignore** all reviews with *rating = 3*, since they tend to have a neutral sentiment.

In [7]:
products = products[products['overall'] != 3]
len(products)

32934

Now, we will assign reviews with a rating of 4 or higher to be *positive* reviews, while the ones with rating of 2 or lower are *negative*. For the sentiment column, we use +1 for the positive class label and -1 for the negative class label.

In [8]:
products['sentiment'] = products['overall'].apply(lambda rating : +1 if rating > 3 else -1)
products.print_rows(num_rows=2, num_columns=10)

+------------+-------------------------------+---------+
|    asin    |            helpful            | overall |
+------------+-------------------------------+---------+
| B000H00VBQ | [None, 0, None, None, None... |   2.0   |
| B000H00VBQ | [None, 0, None, None, None... |   5.0   |
+------------+-------------------------------+---------+
+-------------------------------+------------+----------------+--------------+
|           reviewText          | reviewTime |   reviewerID   | reviewerName |
+-------------------------------+------------+----------------+--------------+
| I had big expectations bec... | 05 3, 2014 | A11N155CW1UV02 |   AdrianaM   |
| I highly recommend this se... | 09 3, 2012 | A3BC8O2KCL29V2 |   Carol T    |
+-------------------------------+------------+----------------+--------------+
+----------------------------+----------------+-----------+
|          summary           | unixReviewTime | sentiment |
+----------------------------+----------------+-----------+
| A 

# Exploring Data 2/4
Let us quickly explore more of this dataset.
3. We count the number of positive and negative reviews.

**TODO**: Modify the subset to contain similar numbers of positive and negative reviews, as the original dataset consisted primarily of positive reviews.

In [9]:
print '# of positive reviews =', len(products[products['sentiment']==1])
print '# of negative reviews =', len(products[products['sentiment']==-1])

# of positive reviews = 29331
# of negative reviews = 3603


# Extraction Phase: data preparation

# TODO: obtain balanced data!

In [10]:
#TODO
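One possible way to address this TODO is random undersampling of the majority class. The sketch below is written in plain Python 3 on a list of dictionaries rather than on an SFrame; the helper name `undersample` and the toy rows are hypothetical and not part of the notebook.

```python
import random

def undersample(rows, label_key='sentiment', seed=1):
    """Return a class-balanced list by downsampling the larger class (assumed approach)."""
    positives = [r for r in rows if r[label_key] == +1]
    negatives = [r for r in rows if r[label_key] == -1]
    n = min(len(positives), len(negatives))
    rng = random.Random(seed)          # fixed seed so the sample is reproducible
    balanced = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(balanced)              # avoid having all positives first
    return balanced

# Hypothetical toy data: 8 positive and 2 negative reviews.
toy_rows = [{'id': i, 'sentiment': +1} for i in range(8)] + \
           [{'id': i, 'sentiment': -1} for i in range(8, 10)]
balanced_rows = undersample(toy_rows)
print(len(balanced_rows))  # 2 positives + 2 negatives
```

Undersampling discards most of the positive reviews; reweighting the classes or oversampling the negatives are alternatives with different trade-offs.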

## TODO: apply text cleaning
Now, we will perform one simple data transformation:
1. Remove punctuation using [Python's built-in](https://docs.python.org/2/library/string.html) string functionality.

**Aside**. In this notebook, we remove all punctuation for the sake of simplicity. A smarter approach would preserve contractions such as "I'd", "would've", "hadn't" and so forth. See [this page](https://www.cis.upenn.edu/~treebank/tokenization.html) for an example of smart handling of punctuation.

In [11]:
def remove_punctuation(text):
    import string
    return text.translate(None, string.punctuation)
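The two-argument `translate(None, ...)` call above only works on Python 2 byte strings. As a hedged sketch, an equivalent for Python 3 could look as follows; the name `remove_punctuation_py3` is hypothetical.

```python
import string

def remove_punctuation_py3(text):
    # str.maketrans('', '', chars) builds a table that deletes every character in chars
    return text.translate(str.maketrans('', '', string.punctuation))

cleaned = remove_punctuation_py3("This one is a real snoozer. Don't believe it!")
print(cleaned)  # This one is a real snoozer Dont believe it
```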

In [69]:
products

asin,helpful,overall,reviewText,reviewTime,reviewerID
B000H00VBQ,"[None, 0, None, None, None, 0, None] ...",2.0,I had big expectations because I love English ...,"05 3, 2014",A11N155CW1UV02
B000H00VBQ,"[None, 0, None, None, None, 0, None] ...",5.0,I highly recommend this series. It is a must for ...,"09 3, 2012",A3BC8O2KCL29V2
B000H00VBQ,"[None, 0, None, None, None, 1, None] ...",1.0,This one is a real snoozer. Don't believe ...,"10 16, 2013",A60D5HQFOTSOM
B000H00VBQ,"[None, 0, None, None, None, 0, None] ...",4.0,Mysteries are interesting. The ten ...,"10 30, 2013",A1RJPIGRSNX4PW
B000H00VBQ,"[None, 1, None, None, None, 1, None] ...",5.0,"This show always is excellent, as far as ...","02 11, 2009",A16XRPF40679KG
B000H00VBQ,"[None, 1, 2, None, None, None, 1, 2, None] ...",5.0,I discovered this series quite by accident. Ha ...,"10 11, 2011",A1POFVVXUZR3IQ
B000H0X79O,"[None, 0, None, None, None, 0, None] ...",5.0,This is the best of the best comedy Stand-up. ...,"02 26, 2014",A3RXD7Z44T9DHW
B000H0X79O,"[None, 0, None, None, None, 0, None] ...",4.0,"Funny, interesting, a great way to pass tim ...","02 7, 2014",AXM3GQLD0CHIL
B000H29TXU,"[None, 0, None, None, None, 0, None] ...",4.0,I love the variety of comics. Great for di ...,"02 6, 2014",A398QSASJOIKA6
B000H29TXU,"[None, 0, None, None, None, 0, None] ...",5.0,Watched it for Kevin Hart and only Kevin Hart! He ...,"04 29, 2014",A39F2EW27YYUDM

reviewerName,summary,unixReviewTime,sentiment,review_clean,baby
AdrianaM,A little bit boring for me ...,1399075200,-1,I had big expectations because I love Englis ...,0
Carol T,Excellent Grown Up TV,1346630400,1,I highly recommend this series It is a must for ...,0
"Daniel Cooper ""dancoopermedia"" ...",Way too boring for me,1381881600,-1,This one is a real snoozer Dont believe ...,0
"J. Kaplan ""JJ""",Robson Green is mesmerizing ...,1383091200,1,Mysteries are interesting The tension between ...,0
Michael Dobey,Robson green and great writing ...,1234310400,1,This show always is excellent as far as ...,0
Z Hayes,I purchased the series via streaming and loved ...,1318291200,1,I discovered this series quite by accident Having ...,0
Kansas,kansas001,1393372800,1,This is the best of the best comedy Standup The ...,0
Ray Shiva,Worth watching!,1391731200,1,Funny interesting a great way to pass time I ...,0
Amazon Customer,comedy club quality without leaving your ...,1391644800,1,I love the variety of comics Great for dinner ...,0
Emily Booth,Loved it!,1398729600,1,Watched it for Kevin Hart and only Kevin Hart He ...,0

one,great,love,use,would,like,easy,little,seat,old,well,get,also,really,son,time,bought
0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,2,0,0,0,0,3,2,0,0,0,0,0
2,0,0,0,0,0,0,1,0,2,2,1,1,1,0,1,0
1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

product,good,daughter,much,loves,stroller,put,months,car,still,back,...
0,0,0,0,0,0,0,0,0,0,0,...
0,0,0,0,0,0,0,0,0,0,0,...
0,0,0,0,0,0,0,0,0,0,0,...
0,1,0,0,0,0,0,0,0,0,0,...
0,0,0,1,0,0,0,0,0,1,0,...
0,1,0,0,0,0,0,0,0,0,1,...
0,0,0,0,0,0,0,0,0,0,0,...
0,1,0,0,0,0,0,0,0,0,0,...
0,0,0,0,0,0,0,0,0,0,0,...
0,0,0,0,0,0,0,0,0,0,0,...


In [12]:
products['review_clean'] = products['reviewText'].apply(remove_punctuation)

## TODO: building features
**Note:** There are several ways of doing this. We use the built-in *count* function for Python lists. Each cleaned review string (with punctuation removed) is first split into individual words, and the number of occurrences of a given word is counted.
1. Transform the reviews into word counts, restricted to the words in **important_words**.
2. For each word in **important_words**, count the number of times the word occurs in the review and store this count in a separate column. The result of this feature processing is one column per word in **important_words**.

Now, we will load these words from this JSON file:

In [13]:
import json
with open('important_words.json', 'r') as f: # Reads the list of words
    important_words = json.load(f)
important_words = [str(s) for s in important_words]

In [14]:
print important_words

['baby', 'one', 'great', 'love', 'use', 'would', 'like', 'easy', 'little', 'seat', 'old', 'well', 'get', 'also', 'really', 'son', 'time', 'bought', 'product', 'good', 'daughter', 'much', 'loves', 'stroller', 'put', 'months', 'car', 'still', 'back', 'used', 'recommend', 'first', 'even', 'perfect', 'nice', 'bag', 'two', 'using', 'got', 'fit', 'around', 'diaper', 'enough', 'month', 'price', 'go', 'could', 'soft', 'since', 'buy', 'room', 'works', 'made', 'child', 'keep', 'size', 'small', 'need', 'year', 'big', 'make', 'take', 'easily', 'think', 'crib', 'clean', 'way', 'quality', 'thing', 'better', 'without', 'set', 'new', 'every', 'cute', 'best', 'bottles', 'work', 'purchased', 'right', 'lot', 'side', 'happy', 'comfortable', 'toy', 'able', 'kids', 'bit', 'night', 'long', 'fits', 'see', 'us', 'another', 'play', 'day', 'money', 'monitor', 'tried', 'thought', 'never', 'item', 'hard', 'plastic', 'however', 'disappointed', 'reviews', 'something', 'going', 'pump', 'bottle', 'cup', 'waste', 'retu

In [15]:
for word in important_words:
    products[word] = products['review_clean'].apply(lambda s : s.split().count(word))
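To make the counting step concrete, here is a small stand-alone sketch in Python 3 that applies the same `split().count(word)` idea to a single string. The function name `count_words`, the toy vocabulary and the toy review are illustrative assumptions.

```python
def count_words(review_clean, vocabulary):
    # Split on whitespace once, then count exact (case-sensitive) matches per word
    tokens = review_clean.split()
    return {word: tokens.count(word) for word in vocabulary}

toy_vocabulary = ['great', 'love', 'waste']
toy_counts = count_words('great show great cast love it', toy_vocabulary)
print(toy_counts)  # {'great': 2, 'love': 1, 'waste': 0}
```

Because the match is exact, a capitalized token such as `Great` would not be counted as `great` unless the text is lowercased first.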

# Exploring Data 3/4

The SFrame **products** now contains one column for each of the **important_words**. As an example, the column **perfect** contains a count of the number of times the word **perfect** occurs in each of the reviews.

In [16]:
products['perfect']

dtype: int
Rows: 32934
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... ]

Now, write some code to compute the number of product reviews that contain the word **perfect**.
* First create a column called `contains_perfect` which is set to 1 if the count of the word **perfect** (stored in column **perfect**) is >= 1.
* Sum the number of 1s in the column `contains_perfect`.

In [17]:
products['contains_perfect'] = products['perfect'].apply(lambda s : +1 if s >= 1 else 0)

In [18]:
products['contains_perfect'].sum()

753

# Implementing logistic regression from scratch

## Link function (estimating conditional probability)

Recall from lecture that the link function is given by:
$$
P(y_i = +1 | \mathbf{x}_i,\mathbf{w}) = \frac{1}{1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))},
$$

where the feature vector $h(\mathbf{x}_i)$ represents the word counts of **important_words** in the review  $\mathbf{x}_i$.

In [19]:
import numpy as np

def predict_probability(feature_matrix, coefficients):
    '''
    Produces a probabilistic estimate for P(y_i = +1 | x_i, w).
    The estimate ranges between 0 and 1.
    '''
    # Take dot product of feature_matrix and coefficients
    scores = np.dot(feature_matrix, coefficients)

    # Compute P(y_i = +1 | x_i, w) using the link function
    predictions = 1.0 / (1.0 + np.exp(-scores))

    return predictions
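A quick numerical illustration of the link function for single scores, in plain Python 3. The helper name `sigmoid` is an assumption; the formula is the one given above.

```python
import math

def sigmoid(score):
    # Link function: maps any real score to a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-score))

probabilities = [sigmoid(s) for s in (-2.0, 0.0, 2.0)]
print(probabilities)
```

A score of 0 gives a probability of exactly 0.5, and the probabilities for scores `s` and `-s` add up to 1.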

How does the link function work with matrix algebra?

Since the word counts are stored as columns in **feature_matrix**, each $i$-th row of the matrix corresponds to the feature vector $h(\mathbf{x}_i)$:
$$
[\text{feature_matrix}] =
\left[
\begin{array}{c}
h(\mathbf{x}_1)^T \\
h(\mathbf{x}_2)^T \\
\vdots \\
h(\mathbf{x}_N)^T
\end{array}
\right] =
\left[
\begin{array}{cccc}
h_0(\mathbf{x}_1) & h_1(\mathbf{x}_1) & \cdots & h_D(\mathbf{x}_1) \\
h_0(\mathbf{x}_2) & h_1(\mathbf{x}_2) & \cdots & h_D(\mathbf{x}_2) \\
\vdots & \vdots & \ddots & \vdots \\
h_0(\mathbf{x}_N) & h_1(\mathbf{x}_N) & \cdots & h_D(\mathbf{x}_N)
\end{array}
\right]
$$

By the rules of matrix multiplication, the score vector containing elements $\mathbf{w}^T h(\mathbf{x}_i)$ is obtained by multiplying **feature_matrix** and the coefficient vector $\mathbf{w}$.
$$
[\text{score}] =
[\text{feature_matrix}]\mathbf{w} =
\left[
\begin{array}{c}
h(\mathbf{x}_1)^T \\
h(\mathbf{x}_2)^T \\
\vdots \\
h(\mathbf{x}_N)^T
\end{array}
\right]
\mathbf{w}
= \left[
\begin{array}{c}
h(\mathbf{x}_1)^T\mathbf{w} \\
h(\mathbf{x}_2)^T\mathbf{w} \\
\vdots \\
h(\mathbf{x}_N)^T\mathbf{w}
\end{array}
\right]
= \left[
\begin{array}{c}
\mathbf{w}^T h(\mathbf{x}_1) \\
\mathbf{w}^T h(\mathbf{x}_2) \\
\vdots \\
\mathbf{w}^T h(\mathbf{x}_N)
\end{array}
\right]
$$
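The identity above can be checked on a small example. This is a minimal sketch with made-up numbers (the toy matrix and coefficient vector are assumptions); it compares one matrix-vector product with the row-by-row dot products.

```python
import numpy as np

# Three data points, each with an intercept feature and two word counts (toy values)
toy_feature_matrix = np.array([[1.0, 2.0, 0.0],
                               [1.0, 0.0, 1.0],
                               [1.0, 1.0, 3.0]])
toy_w = np.array([0.5, 1.0, -1.0])

# One matrix-vector product for all data points ...
scores_matrix = np.dot(toy_feature_matrix, toy_w)
# ... equals w^T h(x_i) computed separately for every row
scores_rowwise = np.array([np.dot(toy_w, row) for row in toy_feature_matrix])

print(scores_matrix)   # values 2.5, -0.5, -1.5
```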

## Compute derivative of log likelihood with respect to a single coefficient

Recall:
$$
\frac{\partial\ell}{\partial w_j} = \sum_{i=1}^N h_j(\mathbf{x}_i)\left(\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})\right)
$$

Function that computes the derivative of log likelihood with respect to a single coefficient $w_j$. The function accepts two arguments:
* `errors` vector containing $\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})$ for all $i$.
* `feature` vector containing $h_j(\mathbf{x}_i)$  for all $i$. 

In [20]:
def feature_derivative(errors, feature):     
    # Compute the dot product of errors and feature
    derivative = np.dot(errors, feature)
    
    # Return the derivative
    return derivative
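As a worked example of the derivative formula, the sketch below evaluates the sum term by term in plain Python 3. The function name `feature_derivative_plain` and the toy vectors are assumptions made for illustration.

```python
def feature_derivative_plain(errors, feature):
    # Sum over data points of h_j(x_i) * (indicator_i - P(y_i = +1 | x_i, w))
    return sum(e * h for e, h in zip(errors, feature))

toy_errors = [0.5, -0.5, 0.25]   # indicator minus predicted probability
toy_feature = [1, 2, 0]          # counts of word j in each review
toy_derivative = feature_derivative_plain(toy_errors, toy_feature)
print(toy_derivative)  # 0.5*1 + (-0.5)*2 + 0.25*0 = -0.5
```

Reviews in which the word does not occur (count 0) contribute nothing to the derivative of that word's coefficient.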

The log likelihood is a transformation of the likelihood that simplifies the derivation of the gradient and is more numerically stable. Because of this numerical stability, we will use the log likelihood instead of the likelihood to assess the algorithm.

The log likelihood is computed using the following formula (see the advanced optional video if you are curious about the derivation of this equation):

$$\ell\ell(\mathbf{w}) = \sum_{i=1}^N \Big( (\mathbf{1}[y_i = +1] - 1)\mathbf{w}^T h(\mathbf{x}_i) - \ln\left(1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))\right) \Big) $$

Function to compute the log likelihood for the entire dataset.

In [21]:
def compute_log_likelihood(feature_matrix, sentiment, coefficients):
    indicator = (sentiment==+1)
    scores = np.dot(feature_matrix, coefficients)
    logexp = np.log(1. + np.exp(-scores))
    
    # Simple check to prevent overflow
    mask = np.isinf(logexp)
    logexp[mask] = -scores[mask]
    
    lp = np.sum((indicator-1)*scores - logexp)
    return lp
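An alternative to the explicit overflow check is to let NumPy evaluate $\ln(1 + \exp(-s))$ in a stable way with `np.logaddexp`. The following is a sketch under that assumption, not the notebook's implementation; the name `log_likelihood_stable` and the toy data are hypothetical.

```python
import numpy as np

def log_likelihood_stable(feature_matrix, sentiment, coefficients):
    indicator = (sentiment == +1)
    scores = np.dot(feature_matrix, coefficients)
    # logaddexp(0, -s) = log(exp(0) + exp(-s)) = log(1 + exp(-s)), without overflow
    return np.sum((indicator - 1) * scores - np.logaddexp(0.0, -scores))

toy_X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
toy_y = np.array([+1, -1, +1])
toy_w = np.array([0.1, -0.3])

toy_scores = np.dot(toy_X, toy_w)
naive = np.sum(((toy_y == +1) - 1) * toy_scores - np.log(1.0 + np.exp(-toy_scores)))
stable = log_likelihood_stable(toy_X, toy_y, toy_w)

# A very negative score for a negative review: exp(1000) would overflow in the naive form
extreme = log_likelihood_stable(np.array([[-1000.0]]), np.array([-1]), np.array([1.0]))
print(naive, stable, extreme)
```

For moderate scores both versions agree; for the extreme score the stable version returns a finite value close to 0, which matches the intuition that a confidently correct prediction contributes almost nothing to the log likelihood.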

## Taking gradient steps
Now we are ready to implement our own logistic regression. 

Function to solve the logistic regression model using gradient ascent:

In [22]:
import numpy as np

def logistic_regression(feature_matrix, sentiment, initial_coefficients, step_size, max_iter):
    coefficients = np.array(initial_coefficients) # make sure it's a numpy array
    for itr in xrange(max_iter):

        # Predict P(y_i = +1|x_i,w) using your predict_probability() function
        predictions = predict_probability(feature_matrix, coefficients)
        
        # Compute indicator value for (y_i = +1)
        indicator = (sentiment==+1)
        
        # Compute the errors as indicator - predictions
        errors = indicator - predictions
        for j in xrange(len(coefficients)): # loop over each coefficient
            
            # Recall that feature_matrix[:,j] is the feature column associated with coefficients[j].
            # Compute the derivative for coefficients[j]. Save it in a variable called derivative
            derivative = np.dot(errors,feature_matrix[:,j])
            
            # add the step size times the derivative to the current coefficient
            coefficients[j] = coefficients[j] + derivative*step_size
        
        # Checking whether log likelihood is increasing
        if itr <= 15 or (itr <= 100 and itr % 10 == 0) or (itr <= 1000 and itr % 100 == 0) \
        or (itr <= 10000 and itr % 1000 == 0) or itr % 10000 == 0:
            lp = compute_log_likelihood(feature_matrix, sentiment, coefficients)
            print 'iteration %*d: log likelihood of observed labels = %.8f' % \
                (int(np.ceil(np.log10(max_iter))), itr, lp)
    return coefficients
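Since `errors` is computed once per iteration, the inner loop over `j` updates all coefficients from the same error vector. The whole update can therefore be written as a single matrix product. The sketch below (Python 3, hypothetical name `logistic_regression_vectorized`, toy data, no progress printing) illustrates this under that assumption.

```python
import numpy as np

def logistic_regression_vectorized(feature_matrix, sentiment, initial_coefficients,
                                   step_size, max_iter):
    coefficients = np.array(initial_coefficients, dtype=float)
    indicator = (sentiment == +1).astype(float)
    for _ in range(max_iter):
        predictions = 1.0 / (1.0 + np.exp(-np.dot(feature_matrix, coefficients)))
        errors = indicator - predictions
        # Entry j of feature_matrix.T . errors is the derivative for coefficients[j]
        coefficients += step_size * np.dot(feature_matrix.T, errors)
    return coefficients

# Toy data: the single feature is positive for +1 reviews and negative for -1 reviews
toy_X = np.array([[1.0, 2.0], [1.0, 1.0], [1.0, -1.0], [1.0, -2.0]])
toy_y = np.array([+1, +1, -1, -1])
toy_coefficients = logistic_regression_vectorized(toy_X, toy_y, np.zeros(2), 0.1, 100)
print(toy_coefficients)
```

On the toy data the learned feature coefficient is positive, so the sign of the score agrees with every label.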

# Training a sentiment classifier with logistic regression

## Split data into training and test sets
Let's perform a train/test split with 80% of the data in the training set and 20% of the data in the test set. We use `seed=1` so that everyone gets the same result.

In [23]:
print '# of total reviews =', len(products)
print '# of positive reviews on all data =', len(products[products['sentiment']==1])
print '# of negative reviews on all data =', len(products[products['sentiment']==-1])

# of total reviews = 32934
# of positive reviews on all data = 29331
# of negative reviews on all data = 3603


In [24]:
train_data, test_data = products.random_split(.8, seed=1)
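For readers without SFrame, the idea of a seeded random split can be sketched in plain Python 3. This is only an illustration of the concept; it is not claimed to reproduce how `random_split` is implemented or which rows it selects. The name `random_split_rows` is hypothetical.

```python
import random

def random_split_rows(rows, fraction, seed=1):
    # Each row goes to the first part with probability `fraction`
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second

toy_rows = list(range(1000))
toy_train, toy_test = random_split_rows(toy_rows, 0.8, seed=1)
print(len(toy_train), len(toy_test))
```

With this scheme the split sizes are only approximately 80% and 20%, and the same seed always yields the same split.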

In [25]:
print '# of train_data reviews =', len(train_data)
print '# of positive reviews on train data =', len(train_data[train_data['sentiment']==1])
print '# of negative reviews on train data =', len(train_data[train_data['sentiment']==-1])

# of train_data reviews = 26313
# of positive reviews on train data = 23415
# of negative reviews on train data = 2898


In [26]:
print '# of test_data reviews =', len(test_data)
print '# of positive reviews on test data =', len(test_data[test_data['sentiment']==1])
print '# of negative reviews on test data =', len(test_data[test_data['sentiment']==-1])

# of test_data reviews = 6621
# of positive reviews on test data = 5916
# of negative reviews on test data = 705


## SFrame to NumPy array
NumPy is a powerful library for doing matrix manipulation. Let us convert our data to matrices and then implement our algorithms with matrices.

Function that extracts columns from an SFrame and converts them into a NumPy array. Two arrays are returned: one representing features and another representing class labels. The feature matrix includes an additional column 'intercept' to take account of the intercept term.

In [55]:
import numpy as np
def get_numpy_data(data_sframe, features, label):
    data_sframe['intercept'] = 1
    features = ['intercept'] + features
    features_sframe = data_sframe[features]
    feature_matrix = features_sframe.to_numpy()
    label_sarray = data_sframe[label]
    label_array = label_sarray.to_numpy()
    return(feature_matrix, label_array)

def get_numpy_feature_matrix(data_sframe, features):
    data_sframe['intercept'] = 1
    features = ['intercept'] + features
    features_sframe = data_sframe[features]
    feature_matrix = features_sframe.to_numpy()
    return(feature_matrix)
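The effect of the `'intercept'` column can be shown directly on a NumPy array. The sketch below uses a hypothetical helper `add_intercept_column` and toy counts; it prepends a column of ones so that the first coefficient acts as the intercept term.

```python
import numpy as np

def add_intercept_column(feature_matrix):
    # Prepend a constant feature h_0(x) = 1 to every row
    ones = np.ones((feature_matrix.shape[0], 1))
    return np.hstack([ones, feature_matrix])

toy_counts_matrix = np.array([[0, 2], [1, 0], [3, 1]])
toy_with_intercept = add_intercept_column(toy_counts_matrix)
print(toy_with_intercept.shape)  # (3, 3): one intercept column plus two count columns
```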

# Training Set

Let us convert the train_data into NumPy arrays.

In [58]:
# Warning: This may take a few minutes...
train_feature_matrix, train_sentiment = get_numpy_data(train_data, important_words, 'sentiment') 

In [59]:
train_feature_matrix.shape

(26313, 194)

## Creating the sentiment classifier on the training data

In [60]:
sentiment_model_coefficients = logistic_regression(train_feature_matrix, train_sentiment, initial_coefficients=np.zeros(194),
                                   step_size=1e-7, max_iter=301)

iteration   0: log likelihood of observed labels = -18218.94676804
iteration   1: log likelihood of observed labels = -18199.18500621
iteration   2: log likelihood of observed labels = -18179.49616583
iteration   3: log likelihood of observed labels = -18159.87993599
iteration   4: log likelihood of observed labels = -18140.33600580
iteration   5: log likelihood of observed labels = -18120.86406429
iteration   6: log likelihood of observed labels = -18101.46380056
iteration   7: log likelihood of observed labels = -18082.13490372
iteration   8: log likelihood of observed labels = -18062.87706300
iteration   9: log likelihood of observed labels = -18043.68996772
iteration  10: log likelihood of observed labels = -18024.57330739
iteration  11: log likelihood of observed labels = -18005.52677168
iteration  12: log likelihood of observed labels = -17986.55005049
iteration  13: log likelihood of observed labels = -17967.64283398
iteration  14: log likelihood of observed labels = -17948.8048

## Class predictions from scores

Class predictions for a data point $\mathbf{x}_i$ can be computed from the coefficients $\mathbf{w}$ using the following formula:
$$
\hat{y}_i = 
\left\{
\begin{array}{ll}
      +1 & \mathbf{w}^T h(\mathbf{x}_i) > 0 \\
      -1 & \mathbf{w}^T h(\mathbf{x}_i) \leq 0 \\
\end{array} 
\right.
$$

Now, we will write some code to compute class predictions. We will do this in two steps:
* **Step 1**: First compute the **scores** using **feature_matrix** and **coefficients** using a dot product.
* **Step 2**: Using the formula above, compute the class predictions from the scores.

Step 1 can be implemented as follows:

In [31]:
# Step 1: Compute the scores as a dot product between feature_matrix and coefficients.
scores = np.dot(train_feature_matrix, sentiment_model_coefficients)

In [32]:
# Step 2: compute the class predictions using the **scores** obtained above:
train_sentiment_predictions = map((lambda score: +1 if score > 0 else -1), scores)
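The thresholding step can also be written without `map`, which returns a lazy iterator rather than a list in Python 3. A minimal sketch with toy scores (assumed values), using `np.where`:

```python
import numpy as np

toy_scores = np.array([2.5, 0.0, -1.5])
# +1 where the score is strictly positive, -1 otherwise (a score of exactly 0 maps to -1)
toy_predictions = np.where(toy_scores > 0, +1, -1)
print(toy_predictions)
```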

In [33]:
positive_train_sentiment_predictions = sum(map((lambda score: +1 if score > 0 else 0), scores))
positive_train_sentiment_predictions

26313

In [47]:
print '# of true positive reviews =', len(train_data[train_data['sentiment']==1])

# of true positive reviews = 23415


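# TODO: what is going on? Unbalanced data!

The model predicts +1 for all 26313 training reviews, although only 23415 of them are actually positive.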

## Measuring accuracy of the model

We will now measure the classification accuracy of the model. Recall from the lecture that the classification accuracy can be computed as follows:

$$
\mbox{accuracy} = \frac{\mbox{# correctly classified data points}}{\mbox{# total data points}}
$$

Complete the following code block to compute the accuracy of the model.

In [35]:
num_mistakes = (train_sentiment != train_sentiment_predictions).sum()
accuracy = 1.0 * (len(train_data) - num_mistakes) / len(train_data)
print "-----------------------------------------------------"
print '# Reviews   correctly classified =', len(train_data) - num_mistakes
print '# Reviews incorrectly classified =', num_mistakes
print '# Reviews total                  =', len(train_data)
print "-----------------------------------------------------"
print 'Accuracy = %.2f' % accuracy

-----------------------------------------------------
# Reviews   correctly classified = 23415
# Reviews incorrectly classified = 2898
# Reviews total                  = 26313
-----------------------------------------------------
Accuracy = 0.89
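The counts in the output above (23415 correct, 2898 incorrect) equal the numbers of positive and negative training reviews, which is exactly what a classifier that always answers +1 would obtain. The sketch below, in plain Python 3 with hypothetical toy labels, shows why accuracy alone can look acceptable on unbalanced data even when no negative review is ever detected.

```python
def accuracy(labels, predictions):
    correct = sum(1 for y, y_hat in zip(labels, predictions) if y == y_hat)
    return correct / float(len(labels))

toy_labels = [+1, +1, +1, +1, -1]           # 80% positive, like a skewed dataset
always_positive = [+1] * len(toy_labels)    # majority-class "classifier"

baseline_accuracy = accuracy(toy_labels, always_positive)
negatives_found = sum(1 for y, y_hat in zip(toy_labels, always_positive)
                      if y == -1 and y_hat == -1)
print(baseline_accuracy, negatives_found)  # 0.8 0
```

Comparing a model against this majority-class baseline, or reporting per-class counts, gives a clearer picture than accuracy alone.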


# Exploring Data 4/4
## Which words contribute most to positive & negative sentiments?

We can now compute the "**most positive words**". These are the words that correspond most strongly with positive reviews. In order to do this, we will first do the following:
* Treat each coefficient as a tuple, i.e. (**word**, **coefficient_value**).
* Sort all the (**word**, **coefficient_value**) tuples by **coefficient_value** in descending order.

In [36]:
sentiment_model_coefficients_without_intercept = list(sentiment_model_coefficients[1:]) # exclude intercept
word_coefficient_tuples = [(word, coefficient) for word, coefficient in zip(important_words, sentiment_model_coefficients_without_intercept)]
word_coefficient_tuples = sorted(word_coefficient_tuples, key=lambda x:x[1], reverse=True)

Now, **word_coefficient_tuples** contains a sorted list of (**word**, **coefficient_value**) tuples. The first 10 elements in this list correspond to the words that are most positive.

### Ten "most positive" words

Now, we compute the 10 words that have the most positive coefficient values. These words are associated with positive sentiment.

In [37]:
word_coefficient_tuples[0:10]

[('good', 0.067031116541016539),
 ('great', 0.066049394812606743),
 ('one', 0.061888478016965398),
 ('like', 0.061385675336285792),
 ('love', 0.058368926174629),
 ('well', 0.045921224641511972),
 ('see', 0.044901436075544768),
 ('really', 0.042376632493612537),
 ('would', 0.031851895935752425),
 ('get', 0.030400837371417237)]

### Ten "most negative" words

Next, we repeat this exercise for the 10 most negative words. That is, we list the 10 words with the lowest coefficient values. Words with negative coefficients are associated with negative sentiment.

In [38]:
word_coefficient_tuples[-10:]

[('toy', 6.3700634942759005e-05),
 ('gate', 4.3598905923211703e-05),
 ('tub', 1.5476477044616865e-05),
 ('stroller', 0.0),
 ('crib', 0.0),
 ('bag', -4.5246508359879137e-06),
 ('diaper', -4.5983221948640832e-06),
 ('pump', -4.959959433548705e-06),
 ('cheap', -0.00039779689700774392),
 ('waste', -0.0030231005817950365)]

# Test Set. Making predictions with logistic regression
Now that a model is trained, we can make predictions on the **test data**.

In [39]:
# We need to convert test_data into a NumPy feature matrix first.

In [61]:
test_feature_matrix = get_numpy_feature_matrix(test_data, important_words)

In [41]:
test_feature_matrix.shape

(6621, 194)

In [62]:
# Step 1: Compute the scores as a dot product between feature_matrix and coefficients.
scores = np.dot(test_feature_matrix, sentiment_model_coefficients)

In [64]:
# Step 2: compute the class predictions using the **scores** obtained above:
test_predictions = map((lambda score: +1 if score > 0 else -1), scores)

In [65]:
positive_test_predictions = sum(map((lambda score: +1 if score > 0 else 0), scores))
positive_test_predictions

6621

In [66]:
print '# of true positive reviews =', len(test_data[test_data['sentiment']==1])

# of true positive reviews = 5916


# Export to CSV

In [68]:
test_data['score'] = scores
test_data['pred_sentiment'] = test_predictions
test_data.export_csv('Amazon_Instant_Video_5_preds.csv')