# Implementing logistic regression from scratch

The goal of this notebook is to implement your own logistic regression classifier. You will:

 * Extract features from Amazon product reviews.
 * Implement the link function for logistic regression.
 * Write a function to compute the derivative of the log likelihood function with respect to a single coefficient.
 * Implement gradient descent/ascent.
 * Given a set of coefficients, predict whether it has high rating.
 * Compute classification accuracy for the logistic regression model.
 
Let's get started! 

*This file is adapted from course material by Carlos Guestrin and Emily Fox. The data is subsetted from the Amazon review data available at http://jmcauley.ucsd.edu/data/amazon/. *
    
## Import packages needed

In [1]:
## please make sure that the packages are updated to the newest version. 

import pandas as pd
import numpy as np

## Load the dataset

For this assignment, we will use a subset of the UCI *Weight Lifting Exercises monitored with Inertial Measurement Units Data Set* contributed by *Wallace Ugulino* and *Eduardo Velloso*. 

In [254]:
products = pd.read_csv('amazon_baby_small.csv')

One column of this dataset is 'rating', ranging from 1 to 5.

Let us quickly explore more of this dataset.  The 'name' column indicates the name of the product.  Here we list the first 10 products in the dataset.  We then count the number of positive and negative reviews.

In [255]:
products.head(10)['name']

0        Britax Marathon Convertible Car Seat, Granite
1                          PRIMO EuroBath, Pearl White
2              Jeep Shopping Cart and High Chair Cover
3                  Pearhead Wood Bank, Memorybox White
4             The Juppy Baby Walker (Pink-Full Lining)
5                    The First Years Car Rear Sunshade
6    Vulli Products - Sophie The Giraffe Teething R...
7                Cuisinart CS-6 Baby Bottle Sterilizer
8          Graco Sarah Classic Convertible Crib, White
9    Cosco - Scenera Convertible Car Seat, Realtree...
Name: name, dtype: object

In [256]:
print '# of high_rating (5-star) reviews =', len(products[products['rating']>=5])
print '# of low_rating (not 5-star) reviews =', len(products[products['rating']<5])

# of high_rating (5-star) reviews = 25693
# of low_rating (not 5-star) reviews = 24473


We re-label the reviews by whether it is a 5-star review.

In [257]:
products['high_rating'] = (products['rating'] > 4)

## Apply text cleaning on the review data

In this section, we will perform some simple feature cleaning using Pandas and Numpy. Here we compiled a list of 193 most frequent words (important words) into a JSON file. 

Now, we will load these words from this JSON file:

In [258]:
import json
with open('important_words.json', 'r') as f: # Reads the list of most frequent words
    important_words = json.load(f)
important_words = [str(s) for s in important_words]

Now, we will perform 2 simple data transformations:

1. Remove punctuation using [Python's built-in](https://docs.python.org/2/library/string.html) string functionality.
2. Compute word counts (only for **important_words**)

We start with *Step 1* which can be done as follows:

In [259]:
products = products.fillna({'review':''})  # fill in N/A's in the review column

def remove_punctuation(text):
    import string
    return text.replace(string.punctuation, ' ') 

products['review_clean'] = products['review'].apply(remove_punctuation)

Now we proceed with *Step 2*. For each word in **important_words**, we compute a count for the number of times the word occurs in the review. We will store this count in a separate column (one for each word). The result of this feature processing is a single column for each word in **important_words** which keeps a count of the number of times the respective word occurs in the review text.


**Note:** There are several ways of doing this. In this assignment, we use the built-in *count* function for Python lists. Each review string is first split into individual words and the number of occurances of a given word is counted.

In [260]:
for word in important_words:
    products[word] = products['review_clean'].apply(lambda s : s.split().count(word))

The DataFrame **products** now contains one column for each of the 193 **important_words**. As an example, the column **perfect** contains a count of the number of times the word **perfect** occurs in each of the reviews.

Now, write some code to compute the number of product reviews that contain the word **perfect**.

**Hint**: 
* First create a column called `contains_perfect` which is set to 1 if the count of the word **perfect** (stored in column **perfect**) is >= 1.
* Sum the number of 1s in the column `contains_perfect`.

In [261]:
import numpy as np
products['contains_perfect'] = products['perfect'].apply(lambda x: 1 if x>=1 else 0)
count = np.sum(list(products['contains_perfect']))
print count

2808


Finally, we drop some raw columns to save memory. 

In [262]:
## please update pandas to the newest version in order to execute the following line

products = products.drop(columns = ['name', 'review', 'review_clean', 'rating']) 

## Convert DataFrame to NumPy array

As you have seen, NumPy is a powerful library for doing matrix manipulation. Let us convert our data to matrices and then implement our algorithms with matrices.

We now provide you with a function that extracts columns from an DataFrame and converts them into a NumPy array. Two arrays are returned: one representing features and another representing class labels. Note that the feature matrix includes an additional column 'intercept' to take account of the intercept term.

In [263]:
def get_numpy_data(dataframe, features, label):
    dataframe['intercept'] = 1
    features = ['intercept'] + features
    feature_matrix = np.array(dataframe[features])
    label_array = np.array(dataframe[label])
    return(feature_matrix, label_array)

Let us convert the data into NumPy arrays.

In [264]:
# Warning: This may take a few minutes...
feature_matrix, high_rating = get_numpy_data(products, important_words, 'high_rating') 

Now, let us see what the **high_rating** column looks like:

## Estimating conditional probability with logistic sigmoid function

Recall from lecture that the predicted conditional probablity (of a positive label) is given by:
$$
P(y^{(i)} = 1 | \mathbf{x}^{(i)},\mathbf{w}) = \sigma( \mathbf{w}^T \mathbf{x}^{(i)} ) = \frac{1}{1 + \exp(-\mathbf{w}^T \mathbf{x}^{(i)})},
$$

where the feature vector $\mathbf{x}^{(i)}$.

In [265]:
'''
produces probablistic estimate for P(y_i = +1 | x^(i), w).
estimate ranges between 0 and 1.
'''
def predict_probability(feature_matrix, coefficients):
    # Take dot product of feature_matrix and coefficients  
    # YOUR CODE HERE
    ...
    
    # Compute the conditional probability
    # YOUR CODE HERE
    ...

    # return predictions
    return predictions

**Aside**. How the sigmoid function works with matrix algebra

Since the word counts are stored as columns in **feature_matrix**, each $i$-th row of the matrix corresponds to the feature vector $\mathbf{x}^{(i)}$:
$$
[\text{feature_matrix}] =
\left[
\begin{array}{c}
(\mathbf{x}^{(1)})^T \\
(\mathbf{x}^{(2)})^T \\
\vdots \\
(\mathbf{x}^{(N)})^T
\end{array}
\right] =
\left[
\begin{array}{cccc}
\mathbf{x}^{(1)}_1 & \mathbf{x}^{(1)}_2 & \cdots & \mathbf{x}^{(1)}_D \\
\mathbf{x}^{(2)}_1 & \mathbf{x}^{(2)}_2 & \cdots & \mathbf{x}^{(2)}_D \\
\vdots & \vdots & \ddots & \vdots \\
\mathbf{x}^{(N)}_1 & \mathbf{x}^{(N)}_2 & \cdots & \mathbf{x}^{(N)}_D
\end{array}
\right]
$$

By the rules of matrix multiplication, the score vector containing elements $\mathbf{w}^T h(\mathbf{x}_i)$ is obtained by multiplying **feature_matrix** and the coefficient vector $\mathbf{w}$.
$$
[\text{score}] =
[\text{feature_matrix}]\mathbf{w} =
\left[
\begin{array}{c}
(\mathbf{x}^{(1)})^T \\
(\mathbf{x}^{(2)})^T \\
\vdots \\
(\mathbf{x}^{(N)})^T
\end{array}
\right]
\mathbf{w}
= \left[
\begin{array}{c}
(\mathbf{x}^{(1)})^T \mathbf{w} \\
(\mathbf{x}^{(2)})^T \mathbf{w} \\
\vdots \\
(\mathbf{x}^{(N)})^T \mathbf{w}
\end{array}
\right]
= \left[
\begin{array}{c}
\mathbf{w}^T \mathbf{x}^{(1)} \\
\mathbf{w}^T \mathbf{x}^{(1)} \\
\vdots \\
\mathbf{w}^T \mathbf{x}^{(1)}
\end{array}
\right]
$$

**Checkpoint**

Just to make sure you are on the right track, we have provided a few examples. If your `predict_probability` function is implemented correctly, then the outputs will match:

In [266]:
correct_scores      = np.array( [ 1.*1. + 2.*3. + 3.*(-1.),          1.*1. + (-1.)*3. + (-1.)*(-1.) ] )
correct_predictions = np.array( [ 1./(1+np.exp(-correct_scores[0])), 1./(1+np.exp(-correct_scores[1])) ] )
print correct_scores[0]

4.0


In [267]:
dummy_feature_matrix = np.array([[1.,2.,3.], [1.,-1.,-1]])
dummy_coefficients = np.array([1., 3., -1.])

correct_scores      = np.array( [ 1.*1. + 2.*3. + 3.*(-1.),          1.*1. + (-1.)*3. + (-1.)*(-1.) ] )
correct_predictions = np.array( [ 1./(1+np.exp(-correct_scores[0])), 1./(1+np.exp(-correct_scores[1])) ] )

print 'The following outputs must match '
print '------------------------------------------------'
print 'correct_predictions           =', correct_predictions
print 'output of predict_probability =', predict_probability(dummy_feature_matrix, dummy_coefficients)

The following outputs must match 
------------------------------------------------
correct_predictions           = [0.98201379 0.26894142]
output of predict_probability = [0.9820137900379085, 0.2689414213699951]


## Compute derivative of log likelihood (negation of the cross entropy loss) with respect to a single coefficient

In part (b), we have shown:
$$\frac{\partial \ell\ell(\mathbf{w})}{\partial \mathbf{w}_j} = \sum_{i=1}^N \mathbf{x}^{(i)}_j \left( y^{(i)} - \sigma( \mathbf{w}^\top \mathbf{x}^{(i)} ) \right)
$$
where $\ell\ell$ stands for log-likelihood.
We will now write a function that computes the derivative of log likelihood with respect to a single coefficient $\mathbf{w}_j$. The function accepts two arguments:
* `errors` vector containing $\left( y^{(i)} - \sigma( \mathbf{w}^\top \mathbf{x}^{(i)} ) \right)$ for all $i$.
* `feature` vector containing $\mathbf{x}^{(i)}$  for all $i$. 

Complete the following code block:

In [268]:
def feature_derivative(errors, feature):     
    # Compute the dot product of errors and feature
    derivative = np.dot(errors, feature)
    
    # Return the derivative
    return derivative

Due to its numerical stability, we will use the log likelihood instead of the likelihood to assess the algorithm.

Recall: the log likelihood is computed using the following formula:

$$\ell\ell(\mathbf{w}) = \sum_{i=1}^N \Big( y^{(i)} \ln\left( \sigma(\mathbf{w}^T \mathbf{x}^{(i)} )\right) + \left(1 -y^{(i)} \right) \ln\left( 1 - \sigma(\mathbf{w}^T \mathbf{x}^{(i)} )\right) \Big) $$

We provide a function to compute the log likelihood for the entire dataset. 

In [269]:
def compute_log_likelihood(feature_matrix, high_rating, coefficients):
        
    scores = np.dot(feature_matrix, coefficients)
    sig = 1./(1. + np.exp(-scores) )
    
    ## YOUR CODE HERE, do check overflow/underflow problem.
    ...
    
    return lp

**Checkpoint**

Just to make sure we are on the same page, run the following code block and check that the outputs match.

In [270]:
# here uses another way to compute log-likelihood. Don't worry about it too much.

dummy_feature_matrix = np.array([[1.,2.,3.], [1.,0.,0]])
dummy_coefficients = np.array([1., 3., -1])
dummy_high_rating = np.array([0, 1])

correct_indicators  = np.array( [ 0==+1,                                       1==+1 ] )
correct_scores      = np.array( [ 1.*1. + 2.*3. + 3.*(-1.),                     1.*1. + (0.)*3. + (0.)*(-1.) ] )
correct_first_term  = np.array( [ (correct_indicators[0]-1)*correct_scores[0],  (correct_indicators[1]-1)*correct_scores[1] ] )
correct_second_term = np.array( [ np.log(1. + np.exp(-correct_scores[0])),      np.log(1. + np.exp(-correct_scores[1])) ] )

correct_ll          =      sum( [ correct_first_term[0]-correct_second_term[0], correct_first_term[1]-correct_second_term[1] ] ) 

print 'The following outputs must match '
print '------------------------------------------------'
print 'correct_log_likelihood           =', correct_ll
print 'output of compute_log_likelihood =', compute_log_likelihood(dummy_feature_matrix, dummy_high_rating, dummy_coefficients)

The following outputs must match 
------------------------------------------------
correct_log_likelihood           = -4.331411615436032
output of compute_log_likelihood = -4.331411615436033


## Taking gradient steps

Now we are ready to implement our own logistic regression. All we have to do is to write a gradient ascent function that takes gradient steps towards the optimum. 

Complete the following function to solve the logistic regression model using gradient ascent:

In [271]:
from math import sqrt

def logistic_regression(feature_matrix, high_rating, initial_coefficients, step_size, max_iter):
    coefficients = np.array(initial_coefficients) # make sure it's a numpy array
    for itr in xrange(max_iter):

        # Predict P(y^(i) = +1|x^(i),w) using your predict_probability() function
        # YOUR CODE HERE
        ...
                
        # Compute the errors as y - predictions
        errors = high_rating - predictions
        for j in xrange(len(coefficients)): # loop over each coefficient
            
            # Recall that feature_matrix[:,j] is the feature column associated with coefficients[j].
            # Compute the derivative for coefficients[j]. Save it in a variable called derivative
            # YOUR CODE HERE
            ...
            
            # add the step size times the derivative to the current coefficient
            ## YOUR CODE HERE
            ...
        
        # Checking whether log likelihood is increasing
        if itr <= 15 or (itr <= 100 and itr % 10 == 0) or (itr <= 1000 and itr % 100 == 0) \
        or (itr <= 10000 and itr % 1000 == 0) or itr % 10000 == 0:
            lp = compute_log_likelihood(feature_matrix, high_rating, coefficients)
            print 'iteration %*d: log likelihood of observed labels = %.8f' % \
                (int(np.ceil(np.log10(max_iter))), itr, lp)
    return coefficients

Now, let us run the logistic regression solver.

In [272]:
coefficients = logistic_regression(feature_matrix, high_rating, initial_coefficients=np.zeros(194),
                                   step_size=1e-7, max_iter=301)

iteration   0: log likelihood of observed labels = -34770.14654431
iteration   1: log likelihood of observed labels = -34767.87459771
iteration   2: log likelihood of observed labels = -34765.60560727
iteration   3: log likelihood of observed labels = -34763.33956017
iteration   4: log likelihood of observed labels = -34761.07644369
iteration   5: log likelihood of observed labels = -34758.81624520
iteration   6: log likelihood of observed labels = -34756.55895218
iteration   7: log likelihood of observed labels = -34754.30455219
iteration   8: log likelihood of observed labels = -34752.05303288
iteration   9: log likelihood of observed labels = -34749.80438201
iteration  10: log likelihood of observed labels = -34747.55858740
iteration  11: log likelihood of observed labels = -34745.31563700
iteration  12: log likelihood of observed labels = -34743.07551881
iteration  13: log likelihood of observed labels = -34740.83822096
iteration  14: log likelihood of observed labels = -34738.6037

## Predicting high_rating

Recall from lecture that class predictions for a data point $\mathbf{x}^{(i)}$ can be computed from the coefficients $\mathbf{w}$ using the following formula:
$$
\hat{y}^{(i)} = 
\left\{
\begin{array}{ll}
      1 & \mathbf{w}^T \mathbf{x}^{(i)} > 0 \\
      0 & \mathbf{w}^T \mathbf{x}^{(i)} \leq 0 \\
\end{array} 
\right.
$$

Now, we will write some code to compute class predictions. We will do this in two steps:
* **Step 1**: First compute the **scores** using **feature_matrix** and **coefficients** using a dot product.
* **Step 2**: Using the formula above, compute the class predictions from the scores.

Step 1 can be implemented as follows:

In [273]:
# Compute the scores as a dot product between feature_matrix and coefficients.
scores = np.dot(feature_matrix, coefficients)

Now, complete the following code block for **Step 2** to compute the class predictions using the **scores** obtained above:

In [274]:
class_pred = scores>0
print list(class_pred).count(True)

26998


** Question** i: How many reviews were predicted to have high rating?

## Measuring accuracy

We will now measure the classification accuracy of the model. Recall from the lecture that the classification accuracy can be computed as follows:

$$
\mbox{accuracy} = \frac{\mbox{# correctly classified data points}}{\mbox{# total data points}}
$$

Complete the following code block to compute the accuracy of the model.

In [276]:
... # YOUR CODE HERE
... # YOUR CODE HERE
print "-----------------------------------------------------"
print '# Reviews   correctly classified =', len(products) - num_mistakes
print '# Reviews incorrectly classified =', num_mistakes
print '# Reviews total                  =', len(products)
print "-----------------------------------------------------"
print 'Accuracy = %.2f' % accuracy

-----------------------------------------------------
# Reviews   correctly classified = 33089
# Reviews incorrectly classified = 17077
# Reviews total                  = 50166
-----------------------------------------------------
Accuracy = 0.66


**Question** ii: What is the accuracy of the model on predictions made above? (round to 2 digits of accuracy)

## Which words contribute most to high ratings?

In order to do inspect the importance of the words, we will first do the following:
* Treat each coefficient as a tuple, i.e. (**word**, **coefficient_value**).
* Sort all the (**word**, **coefficient_value**) tuples by **coefficient_value** in descending order.

In [277]:
coefficients = list(coefficients[1:]) # exclude intercept
word_coefficient_tuples = [(word, coefficient) for word, coefficient in zip(important_words, coefficients)]
word_coefficient_tuples = sorted(word_coefficient_tuples, key=lambda x:x[1], reverse=True)

Now, **word_coefficient_tuples** contains a sorted list of (**word**, **coefficient_value**) tuples. The first 10 elements in this list correspond to the words that are most positive.

### Ten "most positive" words

Now, we compute the 10 words that have the most positive coefficient values. These words are associated with high ratings.

In [278]:
word_coefficient_tuples

[('love', 0.05417175110031363),
 ('loves', 0.03341634129158078),
 ('easy', 0.033389137480016776),
 ('great', 0.025580515307137473),
 ('perfect', 0.018573255618985464),
 ('recommend', 0.018198089912211276),
 ('baby', 0.015214316424304965),
 ('best', 0.012004878798543237),
 ('daughter', 0.009724230930764757),
 ('old', 0.009607388050573889),
 ('fits', 0.00917736061196277),
 ('also', 0.009096143871771316),
 ('soft', 0.008943944922918834),
 ('happy', 0.00719076203015758),
 ('every', 0.006683151200866261),
 ('well', 0.00624658635328378),
 ('without', 0.0059343223690781904),
 ('comfortable', 0.005554796613533446),
 ('son', 0.00487337540548808),
 ('play', 0.004711611981100413),
 ('room', 0.004500995877516566),
 ('diaper', 0.004107433778957112),
 ('worth', 0.003944765282505371),
 ('many', 0.003736816435735063),
 ('months', 0.0037237525514986027),
 ('car', 0.003720454971433458),
 ('kids', 0.003277811584301806),
 ('clean', 0.003201195805194925),
 ('little', 0.0031799699580543875),
 ('night', 0.00

** Question** iii: What are the top 3 most positively weighted words (according to our model)?