In [1]:
import pandas as pd
import math
import numpy as np
import matplotlib.pyplot as plt
import json
%matplotlib inline

products = pd.read_csv('amazon_baby_subset.csv',dtype={'name': str, 'review': str, 'rating': float})
products.tail(10)
products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53072 entries, 0 to 53071
Data columns (total 4 columns):
name         52982 non-null object
review       52831 non-null object
rating       53072 non-null float64
sentiment    53072 non-null int64
dtypes: float64(1), int64(1), object(2)
memory usage: 1.6+ MB


In [2]:
products['sentiment'].value_counts()

 1    26579
-1    26493
Name: sentiment, dtype: int64

In [3]:
import json
with open('important_words.json', 'r') as f: # Reads the list of most frequent words
    important_words = json.load(f)
important_words = [str(s) for s in important_words]

## 4. Let us perform 2 simple data transformations:

- Remove punctuation
- Compute word counts (only for important_words)


In [4]:
products = products.fillna({'review': ""}) # fill in N/A's in the review column
products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53072 entries, 0 to 53071
Data columns (total 4 columns):
name         52982 non-null object
review       53072 non-null object
rating       53072 non-null float64
sentiment    53072 non-null int64
dtypes: float64(1), int64(1), object(2)
memory usage: 1.6+ MB


In [5]:
def remove_punctuation(text):
    import string
    # Map no characters to anything; the third argument lists characters to delete
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

products['review_clean'] = products['review'].apply(remove_punctuation)
# products['review_clean'][1]

## 5. Now we proceed with the second item. For each word in important_words, we compute a count for the number of times the word occurs in the review. We will store this count in a separate column (one for each word). The result of this feature processing is a single column for each word in important_words which keeps a count of the number of times the respective word occurs in the review text.

Note: There are several ways of doing this. One way is to create an anonymous function that counts the occurrence of a particular word and apply it to every element in the review_clean column. Repeat this step for every word in important_words. Your code should be analogous to the following:

In [6]:
for word in important_words:
    products[word] = products['review_clean'].apply(lambda s: s.split().count(word))
    
products.head()   

| | name | review | rating | sentiment | review_clean | baby | one | great | love | use | ... | seems | picture | completely | wish | buying | babies | won | tub | almost | either |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Stop Pacifier Sucking without tears with Thumb... | All of my kids have cried non-stop when I trie... | 5.0 | 1 | All of my kids have cried nonstop when I tried... | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Nature's Lullabies Second Year Sticker Calendar | We wanted to get something to keep track of ou... | 5.0 | 1 | We wanted to get something to keep track of ou... | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Nature's Lullabies Second Year Sticker Calendar | My daughter had her 1st baby over a year ago. ... | 5.0 | 1 | My daughter had her 1st baby over a year ago S... | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Lamaze Peekaboo, I Love You | One of baby's first and favorite books, and it... | 4.0 | 1 | One of babys first and favorite books and it i... | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | SoftPlay Peek-A-Boo Where's Elmo A Children's ... | Very cute interactive book! My son loves this ... | 5.0 | 1 | Very cute interactive book My son loves this b... | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
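
As the note says, there are several ways to build these count columns. The following is a minimal sketch of one alternative, assuming a toy frame and a toy word list (`toy` and `toy_words` are hypothetical stand-ins, not the real data): each review is tokenized once into a `collections.Counter`, and every word is then looked up in that counter instead of re-splitting the text per word.

```python
from collections import Counter

import pandas as pd

# Toy stand-ins for the real data (hypothetical values, not from the dataset).
toy_words = ['great', 'love', 'baby']
toy = pd.DataFrame({'review_clean': ['great great toy', 'my baby will love it', '']})

# Tokenize each review once and count its tokens.
token_counts = toy['review_clean'].apply(lambda s: Counter(s.split()))

for word in toy_words:
    # A Counter returns 0 for missing keys, so absent words need no special case.
    toy[word] = token_counts.apply(lambda c: c[word])

print(toy[toy_words])
```

The resulting columns should match what the loop above produces for the same words; the difference is only in how often each review is split.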


## 6. After #4 and #5, the data frame products should contain one column for each of the 193 important_words. As an example, the column perfect contains a count of the number of times the word perfect occurs in each of the reviews.

In [8]:
products['perfect'].head()
    

0    0
1    0
2    0
3    1
4    0
Name: perfect, dtype: int64

## 7. Now, write some code to compute the number of product reviews that contain the word perfect.

Hint:

- First create a column called contains_perfect which is set to 1 if the count of the word perfect (stored in column perfect) is >= 1.
- Sum the number of 1s in the column contains_perfect.

<font color= red><b>Quiz Question.</b> How many reviews contain the word perfect?</font>


In [16]:
# Flag reviews that mention 'perfect' at least once, then count the flags
products['contains_perfect'] = products['perfect'].apply(lambda x: 1 if x >= 1 else 0)
products['contains_perfect'].sum()

2955
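
The same count can be obtained without a Python-level lambda. This is a small sketch on made-up counts (the `toy` frame is hypothetical), comparing the whole column against 1 in a single vectorized step:

```python
import pandas as pd

# Hypothetical per-review counts of the word 'perfect'.
toy = pd.DataFrame({'perfect': [0, 2, 1, 0, 3]})

# The comparison yields booleans; casting to int gives the 0/1 indicator column.
toy['contains_perfect'] = (toy['perfect'] >= 1).astype(int)

num_reviews = int(toy['contains_perfect'].sum())
print(num_reviews)
```

Three of the five toy reviews have a count of at least 1, so the sum counts reviews rather than total occurrences of the word.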

# Convert data frame to multi-dimensional array
## 8. It is now time to convert our data frame to a multi-dimensional array. Look for a package that provides highly optimized matrix operations. In the case of Python, NumPy is a good choice.

Write a function that extracts columns from a data frame and converts them into a multi-dimensional array. We plan to use them throughout the course, so make sure to get this function right.

#### The function should accept three parameters:
- dataframe: a data frame to be converted
- features: a list of strings, containing the names of the columns that are used as features.
- label: a string, containing the name of the single column that is used as class labels.

#### The function should return two values:
- one 2D array for features
- one 1D array for class labels

#### The function should do the following:
- Prepend a new column constant to dataframe and fill it with 1's. This column accounts for the intercept term. Make sure that the constant column appears first in the data frame.
- Prepend a string 'constant' to the list features. Make sure the string 'constant' appears first in the list.
- Extract columns in dataframe whose names appear in the list features.
- Convert the extracted columns into a 2D array using a function in the data frame library. If you are using Pandas, you would use the to_numpy() function (as_matrix() in older versions, which was removed in pandas 1.0).
- Extract the single column in dataframe whose name corresponds to the string label.
- Convert the column into a 1D array.
- Return the 2D array and the 1D array.

In [17]:
def get_numpy_data(dataframe, features, label):
    # Add the intercept column (note: this modifies the frame that is passed in)
    dataframe['constant'] = 1
    features = ['constant'] + features
    features_frame = dataframe[features]
    # to_numpy() replaces as_matrix(), which was removed in pandas 1.0
    feature_matrix = features_frame.to_numpy()
    label_sarray = dataframe[label]
    label_array = label_sarray.to_numpy()
    return(feature_matrix, label_array)
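
To check the shape and column order on something small, here is a hedged sketch of a variant that works on a copy, so the caller's frame does not gain a constant column. The name `get_numpy_data_copy` and the toy frame are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

def get_numpy_data_copy(dataframe, features, label):
    # Hypothetical variant: operate on a copy so the input frame is untouched.
    frame = dataframe.copy()
    frame['constant'] = 1
    # 'constant' goes first so that column 0 lines up with the intercept.
    feature_matrix = frame[['constant'] + list(features)].to_numpy()
    label_array = frame[label].to_numpy()
    return feature_matrix, label_array

# Toy data (made up): two word-count columns and a sentiment label.
toy = pd.DataFrame({'great': [1, 0], 'love': [0, 2], 'sentiment': [1, -1]})
X, y = get_numpy_data_copy(toy, ['great', 'love'], 'sentiment')
print(X.shape, y.shape)
```

With two features the matrix has three columns, which mirrors the relationship between the number of important words and the width of feature_matrix.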

## 9. Using the function written in #8, extract two arrays feature_matrix and sentiment. The 2D array feature_matrix would contain the content of the columns given by the list important_words. The 1D array sentiment would contain the content of the column sentiment.

<font color = red>Quiz Question: How many features are there in the feature_matrix?
<br>
Quiz Question: Assuming that the intercept is present, how does the number of features in feature_matrix relate to the number of features in the logistic regression model?</font>

In [18]:
feature_matrix, sentiment = get_numpy_data(products, important_words, 'sentiment')

In [59]:
feature_matrix.shape

(53072, 194)

# Estimating conditional probability with link function
## 10. Recall from lecture that the link function is given by

$P(y_i = +1 | \mathbf{x}_i, \mathbf{w}) = \dfrac{1}{1 + \exp{(-\mathbf{w}^\intercal h(\mathbf{x}_i))}}$

where the feature vector $h(\mathbf{x}_i)$ represents the word counts of important_words in the review $\mathbf{x}_i$

Write a function named predict_probability that implements the link function.

- Take two parameters: feature_matrix and coefficients.
- First compute the dot product of feature_matrix and coefficients.
- Then compute the link function $P(y = +1 | \mathbf{x}, \mathbf{w})$.
- Return the predictions given by the link function.

Your code should be analogous to the following Python function:

In [32]:
# Produces a probabilistic estimate for P(y_i = +1 | x_i, w).
# The estimate ranges between 0 and 1.
def predict_probability(feature_matrix, coefficients):
    # Take dot product of feature_matrix and coefficients  
    score = np.dot(feature_matrix, coefficients)
    
    # Compute P(y_i = +1 | x_i, w) using the link function
    predictions = 1 / (1 + np.exp(-score))
    
    # return predictions
    return predictions

In [33]:
dummy_feature_matrix = np.array([[1.,2.,3.], [1.,-1.,-1]])
dummy_coefficients = np.array([1., 3., -1.])

correct_scores      = np.array( [ 1.*1. + 2.*3. + 3.*(-1.),          1.*1. + (-1.)*3. + (-1.)*(-1.) ] )
correct_predictions = np.array( [ 1./(1+np.exp(-correct_scores[0])), 1./(1+np.exp(-correct_scores[1])) ] )

print ('The following outputs must match ')
print ('------------------------------------------------')
print ('correct_predictions           =', correct_predictions)
print ('output of predict_probability =', predict_probability(dummy_feature_matrix, dummy_coefficients))

The following outputs must match 
------------------------------------------------
correct_predictions           = [ 0.98201379  0.26894142]
output of predict_probability = [ 0.98201379  0.26894142]
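
For scores that are very negative, `np.exp(-score)` can overflow and emit a runtime warning. The sketch below is one possible workaround, not part of the assignment: `stable_sigmoid` is a hypothetical helper that only ever exponentiates a non-positive number and then picks the algebraically equivalent form for each sign.

```python
import numpy as np

def stable_sigmoid(score):
    # Hypothetical helper: same link function, arranged to avoid overflow.
    score = np.asarray(score, dtype=float)
    z = np.exp(-np.abs(score))  # always in [0, 1], so it cannot overflow
    # For score >= 0: 1 / (1 + e^{-s});  for score < 0: e^{s} / (1 + e^{s}).
    return np.where(score >= 0, 1.0 / (1.0 + z), z / (1.0 + z))

# The first two scores are the ones from the check above; the others are extreme.
scores = np.array([4.0, -1.0, -1000.0, 1000.0])
probs = stable_sigmoid(scores)
print(probs)
```

On moderate scores this agrees with the direct formula; the rearrangement only matters at the extremes.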


## 11. Recall from lecture:

$\displaystyle \frac{\partial \ell}{\partial w_j} = \sum_{i=1}^N h_j(\mathbf{x}_i) (\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})) $

We will now write a function feature_derivative that computes the derivative of log likelihood with respect to a single coefficient $w_j$. The function accepts two arguments:

- errors: vector whose i-th value contains
$\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})$

- feature: vector whose i-th value contains
$h_j(\mathbf{x}_i)$
This corresponds to the j-th column of feature_matrix.

The function should do the following:

- Take two parameters errors and feature.
- Compute the dot product of errors and feature.
- Return the dot product. This is the derivative with respect to a single coefficient $w_j$.


In [34]:
def feature_derivative(errors, feature):     
    # Compute the dot product of errors and feature
    derivative = np.dot(errors,feature)
    # Return the derivative
    return derivative
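
Because each derivative is a dot product of the same errors vector with one column of feature_matrix, all of them can be computed at once as a matrix-vector product. A short sketch on made-up numbers (the arrays below are illustrative, not from the dataset):

```python
import numpy as np

# Toy feature matrix (first column is the constant) and toy errors.
feature_matrix = np.array([[1., 2., 0.],
                           [1., 0., 3.],
                           [1., 1., 1.]])
errors = np.array([0.5, -0.25, 0.1])

# One dot product per column, as in feature_derivative.
per_feature = np.array([np.dot(errors, feature_matrix[:, j])
                        for j in range(feature_matrix.shape[1])])

# All derivatives in a single step: entry j is sum_i h_j(x_i) * errors_i.
gradient = feature_matrix.T @ errors

print(per_feature)
print(gradient)
```

Both approaches give the same vector; the matrix form simply moves the loop over coefficients into NumPy.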

## 12. In the main lecture, our focus was on the likelihood. In the advanced optional video, however, we introduced a transformation of this likelihood---called the log-likelihood---that simplifies the derivation of the gradient and is more numerically stable. Due to its numerical stability, we will use the log-likelihood instead of the likelihood to assess the algorithm.

The log-likelihood is computed using the following formula (see the advanced optional video if you are curious about the derivation of this equation):

$\displaystyle \ell \ell (\mathbf{w}) = \sum_{i=1}^N \Big( (\mathbf{1}[y_i = +1] - 1) \mathbf{w}^\intercal h(\mathbf{x}_i) - \ln{\big(1 + \exp{(-\mathbf{w}^\intercal h(\mathbf{x}_i) )} \big)} \Big)$

Write a function compute_log_likelihood that implements the equation. The function would be analogous to the following Python function:

In [23]:
def compute_log_likelihood(feature_matrix, sentiment, coefficients):
    indicator = (sentiment==+1)
    scores = np.dot(feature_matrix, coefficients)
    lp = np.sum((indicator-1)*scores - np.log(1. + np.exp(-scores)))
    return lp
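
As a sanity check on the formula, the sketch below compares it against the direct definition, the sum of $\ln P(y_i | \mathbf{x}_i, \mathbf{w})$ over the data. It uses toy arrays and a hypothetical helper `log_likelihood_stable`, which swaps `np.log(1. + np.exp(-scores))` for `np.logaddexp(0, -scores)`; the two are mathematically equal, and the latter avoids overflow for very negative scores.

```python
import numpy as np

def log_likelihood_stable(feature_matrix, sentiment, coefficients):
    # Hypothetical variant of compute_log_likelihood using np.logaddexp.
    indicator = (sentiment == +1)
    scores = feature_matrix @ coefficients
    # np.logaddexp(0, -s) evaluates ln(exp(0) + exp(-s)) = ln(1 + exp(-s)).
    return np.sum((indicator - 1) * scores - np.logaddexp(0, -scores))

# Toy data (made up for illustration).
X = np.array([[1., 2.], [1., -1.], [1., 0.5]])
y = np.array([1, -1, 1])
w = np.array([0.3, -0.7])

# Direct definition: log of P(y=+1) for positive labels, log of 1 - P(y=+1) otherwise.
p = 1.0 / (1.0 + np.exp(-(X @ w)))
direct = np.sum(np.log(np.where(y == 1, p, 1.0 - p)))

formula = log_likelihood_stable(X, y, w)
print(direct, formula)
```

With all coefficients at zero every prediction is 0.5, so the log-likelihood is $-N \ln 2$, which is a handy reference point when reading the training log.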

# Taking gradient steps
## 13. Now we are ready to implement our own logistic regression. All we have to do is to write a gradient ascent function that takes gradient steps towards the optimum.

Write a function logistic_regression to fit a logistic regression model using gradient ascent.

The function accepts the following parameters:

- feature_matrix: 2D array of features
- sentiment: 1D array of class labels
- initial_coefficients: 1D array containing initial values of coefficients
- step_size: a parameter controlling the size of the gradient steps
- max_iter: number of iterations to run gradient ascent

The function returns the last set of coefficients after performing gradient ascent.

The function carries out the following steps:

1. Initialize vector coefficients to initial_coefficients.
2. Predict the class probability $P(y_i = +1 | \mathbf{x}_i,\mathbf{w})$ using your predict_probability function and save it to variable predictions.
3. Compute indicator value for $(y_i = +1)$ by comparing sentiment against +1. Save it to variable indicator.
4. Compute the errors as difference between indicator and predictions. Save the errors to variable errors.
5. For each j-th coefficient, compute the per-coefficient derivative by calling feature_derivative with the j-th column of feature_matrix. Then increment the j-th coefficient by (step_size*derivative).
6. Once in a while, insert code to print out the log likelihood.
7. Repeat steps 2-6 for max_iter times.

In the end, your code should be analogous to the following Python function (with blanks filled in):

In [35]:
def logistic_regression(feature_matrix, sentiment, initial_coefficients, step_size, max_iter):
    coefficients = np.array(initial_coefficients) # make sure it's a numpy array
    for itr in range(max_iter):
        # Predict P(y_i = +1|x_i,w) using your predict_probability() function
        
        predictions = predict_probability(feature_matrix, coefficients)

        # Compute indicator value for (y_i = +1)
        indicator = (sentiment==+1)

        # Compute the errors as indicator - predictions
        errors = indicator - predictions

        for j in range(len(coefficients)): # loop over each coefficient
            # Recall that feature_matrix[:,j] is the feature column associated with coefficients[j]
            # compute the derivative for coefficients[j]. Save it in a variable called derivative
            derivative = feature_derivative(errors, feature_matrix[:,j])

            # add the step size times the derivative to the current coefficient
            coefficients[j] += step_size * derivative

        # Checking whether log likelihood is increasing
        if itr <= 15 or (itr <= 100 and itr % 10 == 0) or (itr <= 1000 and itr % 100 == 0) \
        or (itr <= 10000 and itr % 1000 == 0) or itr % 10000 == 0:
            lp = compute_log_likelihood(feature_matrix, sentiment, coefficients)
            print('iteration %*d: log likelihood of observed labels = %.8f' % \
                (int(np.ceil(np.log10(max_iter))), itr, lp))
    return coefficients
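
The inner loop over coefficients uses the same errors vector for every j, so the whole update can be written as one matrix-vector product. Below is a hedged sketch of that variant on synthetic data. The function name `logistic_regression_vectorized`, the random toy data, and the step size are all assumptions chosen for illustration; the function also records the log-likelihood after every step instead of printing it.

```python
import numpy as np

def logistic_regression_vectorized(feature_matrix, sentiment, initial_coefficients,
                                   step_size, max_iter):
    # Hypothetical vectorized variant of the gradient ascent loop.
    coefficients = np.array(initial_coefficients, dtype=float)
    indicator = (sentiment == +1)
    history = []
    for _ in range(max_iter):
        predictions = 1.0 / (1.0 + np.exp(-(feature_matrix @ coefficients)))
        errors = indicator - predictions
        # Update every coefficient at once: gradient = X^T (indicator - predictions).
        coefficients += step_size * (feature_matrix.T @ errors)
        scores = feature_matrix @ coefficients
        history.append(np.sum((indicator - 1) * scores - np.logaddexp(0, -scores)))
    return coefficients, history

# Synthetic data: a constant column plus two random features (not the review data).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
noise = rng.normal(size=200)
y = np.where(X @ np.array([0.5, 2.0, -1.0]) + noise > 0, 1, -1)

coefs, history = logistic_regression_vectorized(X, y, np.zeros(3), 1e-3, 50)
print(history[0], history[-1])
```

With a sufficiently small step size the recorded log-likelihood should rise from one iteration to the next, which is the behaviour the periodic printout in the loop above is meant to confirm.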

## 14. Now, let us run the logistic regression solver with the parameters below:

- feature_matrix = feature_matrix extracted in #9
- sentiment = sentiment extracted in #9
- initial_coefficients = a 194-dimensional vector filled with zeros
- step_size = 1e-7
- max_iter = 301

Save the returned coefficients to variable coefficients.

### Quiz question: As each iteration of gradient ascent passes, does the log likelihood increase or decrease?



In [36]:
initial_coefficients = np.zeros(194)
step_size = 1e-7
max_iter = 301
coefficients = logistic_regression(feature_matrix, sentiment, initial_coefficients, step_size, max_iter)

iteration   0: log likelihood of observed labels = -36780.91768478
iteration   1: log likelihood of observed labels = -36775.13434712
iteration   2: log likelihood of observed labels = -36769.35713564
iteration   3: log likelihood of observed labels = -36763.58603240
iteration   4: log likelihood of observed labels = -36757.82101962
iteration   5: log likelihood of observed labels = -36752.06207964
iteration   6: log likelihood of observed labels = -36746.30919497
iteration   7: log likelihood of observed labels = -36740.56234821
iteration   8: log likelihood of observed labels = -36734.82152213
iteration   9: log likelihood of observed labels = -36729.08669961
iteration  10: log likelihood of observed labels = -36723.35786366
iteration  11: log likelihood of observed labels = -36717.63499744
iteration  12: log likelihood of observed labels = -36711.91808422
iteration  13: log likelihood of observed labels = -36706.20710739
iteration  14: log likelihood of observed labels = -36700.5020

# Predicting sentiments
## 15. Recall from lecture that class predictions for a data point x can be computed from the coefficients w using the following formula:

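$\hat{y}_i = \begin{cases} +1 & \text{if } \mathbf{x}_i^\intercal \mathbf{w} > 0 \\ -1 & \text{if } \mathbf{x}_i^\intercal \mathbf{w} \leq 0 \end{cases}$

Now, we write some code to compute class predictions. We do this in two steps: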

- First compute the scores using feature_matrix and coefficients using a dot product.
- Then apply threshold 0 on the scores to compute the class predictions. Refer to the formula above.

### Quiz question: How many reviews were predicted to have positive sentiment?



In [45]:
scores = np.dot(feature_matrix,coefficients)

# scores
pred_func = np.vectorize(lambda x: 1 if x > 0 else -1)
class_prediction = pred_func(scores)
unique, count  = np.unique(class_prediction, return_counts = True)
print(unique)
print(count)

[-1  1]
[27946 25126]
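
The thresholding step can also be expressed with `np.where`, which applies the rule to the whole array without wrapping a lambda in `np.vectorize`. A minimal sketch on made-up scores:

```python
import numpy as np

# Hypothetical scores, including the boundary case of exactly 0.
scores = np.array([2.5, 0.0, -0.3, 1e-9])

# +1 where the score is strictly positive, -1 otherwise (so 0 maps to -1).
class_prediction = np.where(scores > 0, 1, -1)

num_positive = int(np.sum(class_prediction == 1))
print(class_prediction, num_positive)
```

A score of exactly 0 falls in the -1 class, matching the formula above.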


# Measuring accuracy
## 16. We will now measure the classification accuracy of the model. Recall from the lecture that the classification accuracy can be computed as follows:

$\text{accuracy} = \dfrac{\#\text{ correctly classified data points}}{\#\text{ total data points}}$

### Quiz question: What is the accuracy of the model on predictions made above? (round to 2 digits of accuracy)


In [48]:
num_correct = (class_prediction == sentiment).sum()
accuracy = num_correct / len(class_prediction)
print(accuracy)

0.751865390413
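
Since accuracy is the fraction of matching entries, it is also the mean of the boolean comparison. A tiny worked example with made-up predictions and labels:

```python
import numpy as np

# Hypothetical predictions and true labels for four data points.
toy_predictions = np.array([1, -1, 1, 1])
toy_labels = np.array([1, -1, -1, 1])

# True counts as 1 and False as 0, so the mean is (# correct) / (# total).
toy_accuracy = np.mean(toy_predictions == toy_labels)
print(toy_accuracy)
```

Three of the four toy predictions match, giving 0.75.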



# Which words contribute most to positive & negative sentiments
## 17. Recall that in the earlier assignment, we were able to compute the "most positive words". These are words that correspond most strongly with positive reviews. In order to do this, we will first do the following:

- Treat each coefficient as a tuple, i.e. (word, coefficient_value). The intercept has no corresponding word, so throw it out.
- Sort all the (word, coefficient_value) tuples by coefficient_value in descending order. Save the sorted list of tuples to word_coefficient_tuples.

Your code should be analogous to the following:

In [49]:
coefficients = list(coefficients[1:]) # exclude intercept
word_coefficient_tuples = [(word, coefficient) for word, coefficient in zip(important_words, coefficients)]
word_coefficient_tuples = sorted(word_coefficient_tuples, key=lambda x:x[1], reverse=True)

In [57]:
word_coefficient_tuples[0:10]

[('great', 0.066546084170457695),
 ('love', 0.065890762922123244),
 ('easy', 0.064794586802578394),
 ('little', 0.045435626308421372),
 ('loves', 0.044976401394906038),
 ('well', 0.03013500109210707),
 ('perfect', 0.029739937104968459),
 ('old', 0.020077541034775381),
 ('nice', 0.018408707995268992),
 ('daughter', 0.01770319990570169)]

## 18. Compute the 10 words that have the most positive coefficient values. These words are associated with positive sentiment.

### Quiz question: Which word is not present in the top 10 "most positive" words?

## 19. Next, we repeat this exercise on the 10 most negative words. That is, we compute the 10 words that have the most negative coefficient values. These words are associated with negative sentiment.

### Quiz question: Which word is not present in the top 10 "most negative" words?

In [56]:
import operator
sorted(word_coefficient_tuples, key =operator.itemgetter(1), reverse=False )[0:10]

[('would', -0.053860148445203128),
 ('product', -0.041511033392108897),
 ('money', -0.038982037286487116),
 ('work', -0.03306951529475273),
 ('even', -0.030051249236035804),
 ('disappointed', -0.028978976142317068),
 ('get', -0.028711552980192581),
 ('back', -0.027742697230661327),
 ('return', -0.026592778462247283),
 ('monitor', -0.02448210054589172)]
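
Both rankings can be read off a single sort of the coefficient array. The sketch below uses `np.argsort` on a made-up word list and made-up coefficients (intercept already excluded); the names and values are illustrative only.

```python
import numpy as np

# Hypothetical words and coefficients, aligned by position.
words = ['great', 'return', 'love', 'money']
coefs = np.array([0.07, -0.03, 0.06, -0.04])

# argsort gives indices in ascending order of coefficient value.
order = np.argsort(coefs)

k = 2
most_negative = [(words[i], coefs[i]) for i in order[:k]]
most_positive = [(words[i], coefs[i]) for i in order[::-1][:k]]

print(most_positive)
print(most_negative)
```

Reading the sorted indices from the front gives the most negative words, and reading them from the back gives the most positive ones.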