# Implementing logistic regression from scratch

This notebook implements a logistic regression classifier using a gradient ascent algorithm.
    
## Imports

In [2]:
from __future__ import division  # __future__ imports must precede all other imports
import string  # used in remove_punctuation()

import graphlab
import numpy as np
import matplotlib.pyplot as plt



## Load review data set

Use a subset of the Amazon product review data set, chosen to contain similar numbers of positive and negative reviews.

In [3]:
products = graphlab.SFrame('amazon_baby_subset.gl/')

One column of this dataset is 'sentiment', corresponding to the class label with 1 indicating a review with positive sentiment and -1 indicating one with negative sentiment.

In [4]:
products['sentiment']

dtype: int
Rows: 53072
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... ]

Explore more of this data set.  The 'name' column is the name of the product.  List the first 10.

In [5]:
products.head(10)['name']

dtype: str
Rows: 10
["Stop Pacifier Sucking without tears with Thumbuddy To Love's Binky Fairy Puppet and Adorable Book", "Nature's Lullabies Second Year Sticker Calendar", "Nature's Lullabies Second Year Sticker Calendar", 'Lamaze Peekaboo, I Love You', "SoftPlay Peek-A-Boo Where's Elmo A Children's Book", 'Our Baby Girl Memory Book', 'Hunnt&reg; Falling Flowers and Birds Kids Nursery Home Decor Vinyl Mural Art Wall Paper Stickers', 'Blessed By Pope Benedict XVI Divine Mercy Full Color Medal', 'Cloth Diaper Pins Stainless Steel Traditional Safety Pin (Black)', 'Cloth Diaper Pins Stainless Steel Traditional Safety Pin (Black)']

In [6]:
print '# of positive reviews =', len(products[products['sentiment'] == 1])
print '# of negative reviews =', len(products[products['sentiment'] == -1])

# of positive reviews = 26579
# of negative reviews = 26493


## Apply text cleaning on the review data

Perform feature cleaning using **SFrames**. The previous notebook used bag-of-words features, but here the features are limited to the counts of 193 words for simplicity. These words are listed in important_words.json.

In [7]:
import json
with open('important_words.json', 'r') as f: 
    important_words = json.load(f)
important_words = [str(s) for s in important_words]

In [8]:
print important_words

['baby', 'one', 'great', 'love', 'use', 'would', 'like', 'easy', 'little', 'seat', 'old', 'well', 'get', 'also', 'really', 'son', 'time', 'bought', 'product', 'good', 'daughter', 'much', 'loves', 'stroller', 'put', 'months', 'car', 'still', 'back', 'used', 'recommend', 'first', 'even', 'perfect', 'nice', 'bag', 'two', 'using', 'got', 'fit', 'around', 'diaper', 'enough', 'month', 'price', 'go', 'could', 'soft', 'since', 'buy', 'room', 'works', 'made', 'child', 'keep', 'size', 'small', 'need', 'year', 'big', 'make', 'take', 'easily', 'think', 'crib', 'clean', 'way', 'quality', 'thing', 'better', 'without', 'set', 'new', 'every', 'cute', 'best', 'bottles', 'work', 'purchased', 'right', 'lot', 'side', 'happy', 'comfortable', 'toy', 'able', 'kids', 'bit', 'night', 'long', 'fits', 'see', 'us', 'another', 'play', 'day', 'money', 'monitor', 'tried', 'thought', 'never', 'item', 'hard', 'plastic', 'however', 'disappointed', 'reviews', 'something', 'going', 'pump', 'bottle', 'cup', 'waste', 'retu

As in the previous notebook, perform two simple data transformations:

1. Remove punctuation
2. Compute word counts (only for **important_words**)

In [9]:
def remove_punctuation(text):
    return text.translate(None, string.punctuation) 

products['review_clean'] = products['review'].apply(remove_punctuation)

In [10]:
# Compute word counts
for word in important_words:
    products[word] = products['review_clean'].apply(lambda s : s.split().count(word))
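The two transformations above can be tried on a plain string without GraphLab. The sketch below is only an illustration: the helper names `remove_punctuation_py3` and `count_important_words` are hypothetical, and the code assumes Python 3, where `str.translate` takes a single translation table (the two-argument form `text.translate(None, string.punctuation)` used above works only on Python 2 byte strings).

```python
import string


def remove_punctuation_py3(text):
    # Python 3 only: build a table that deletes every punctuation character
    return text.translate(str.maketrans('', '', string.punctuation))


def count_important_words(text, words):
    # Split on whitespace after cleaning, then count exact (case-sensitive) matches
    tokens = remove_punctuation_py3(text).split()
    return {word: tokens.count(word) for word in words}


counts = count_important_words("Perfect fit, perfect price; great!",
                               ['perfect', 'great', 'baby'])
print(counts)
```

Note that the match is case-sensitive, as it is in the SFrame version: the capitalized "Perfect" is not counted as an occurrence of **perfect**.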

The SFrame **products** now contains one column for each of the **important_words**. As an example, the column **perfect** contains a count of the number of times the word **perfect** occurs in each of the reviews.

In [11]:
products['perfect']
print sum(products['perfect'])

3207


Compute the number of product reviews that contain the word **perfect**.

In [12]:
products['contains_perfect'] = products['perfect'].apply(lambda n: 1 if n >= 1 else 0)
print sum(products['contains_perfect'])

2955


In [13]:
print "No. with 'perfect':", sum(products['contains_perfect'])
print "N:", len(products['perfect'])

No. with 'perfect': 2955
N: 53072


## Convert SFrame to NumPy array

In [14]:
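def get_numpy_data(data_sframe, features, label):
    # Add a constant column for the intercept term (this modifies data_sframe in place)
    data_sframe['intercept'] = 1
    features = ['intercept'] + features
    # Select the feature columns and convert them to a 2D NumPy array
    features_sframe = data_sframe[features]
    feature_matrix = features_sframe.to_numpy()
    # Convert the label column to a 1D NumPy array
    label_sarray = data_sframe[label]
    label_array = label_sarray.to_numpy()
    return (feature_matrix, label_array)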

In [15]:
feature_matrix, sentiment = get_numpy_data(products, important_words, 'sentiment') 

In [16]:
feature_matrix.shape # (records, features)

(53072, 194)
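The 194 columns are the 193 word counts plus the constant intercept column. As a rough NumPy-only sketch of the same idea (the helper name `add_intercept_column` and the toy counts are made up for illustration, and are not part of the notebook):

```python
import numpy as np


def add_intercept_column(word_counts):
    # Prepend a column of ones so that coefficients[0] acts as the intercept
    word_counts = np.asarray(word_counts, dtype=float)
    intercept = np.ones((word_counts.shape[0], 1))
    return np.hstack([intercept, word_counts])


# Two hypothetical reviews with counts for three words
toy_counts = [[2, 0, 1],
              [0, 3, 0]]
toy_matrix = add_intercept_column(toy_counts)
print(toy_matrix.shape)
```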

Here is what the **sentiment** array looks like:

In [17]:
sentiment

array([ 1,  1,  1, ..., -1, -1, -1])

## Estimating conditional probability with logistic link function

Recall from lecture that the link function is given by:
$$
P(y_i = +1 | \mathbf{x}_i,\mathbf{w}) = \frac{1}{1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))},
$$

where the feature vector $h(\mathbf{x}_i)$ holds the word counts of **important_words** in the review $\mathbf{x}_i$, preceded by the constant 1 for the intercept. The following function implements the link function:

In [18]:
def predict_probability(feature_matrix, coefficients):
    '''
    Produces a probabilistic estimate for P(y_i = +1 | x_i, w).
    The estimate ranges between 0 and 1.
    '''
    dot_prod = np.dot(feature_matrix, coefficients)
    
    # Compute P(y_i = +1 | x_i, w) using the link function
    predictions = 1 / (1 + np.exp(-dot_prod))
    
    # return predictions
    return predictions
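A quick sanity check of the link function on hand-picked values can catch sign errors early. The sketch below restates the formula in a standalone function (named `link_probability` here only to keep the example self-contained) and uses toy features and coefficients chosen so the expected probabilities are easy to work out by hand: a score of 0 gives 0.5, and a score of $\ln 3$ gives $1/(1 + 1/3) = 0.75$.

```python
import numpy as np


def link_probability(feature_matrix, coefficients):
    # Score each row, then squash the score through the logistic function
    scores = np.dot(feature_matrix, coefficients)
    return 1.0 / (1.0 + np.exp(-scores))


# Toy data: first column is the intercept, second is a single word count
toy_features = np.array([[1.0, 0.0],
                         [1.0, 1.0],
                         [1.0, -1.0]])
toy_coefficients = np.array([0.0, np.log(3.0)])

toy_probabilities = link_probability(toy_features, toy_coefficients)
print(toy_probabilities)
```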

## Compute derivative of log likelihood with respect to a single coefficient

$$
\frac{\partial\ell}{\partial w_j} = \sum_{i=1}^N h_j(\mathbf{x}_i)\left(\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})\right)
$$

The following function computes the derivative of the log likelihood with respect to a single coefficient $w_j$. It accepts two arguments:
* `errors` vector containing $\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})$ for all $i$.
* `feature` vector containing $h_j(\mathbf{x}_i)$  for all $i$. 

In [27]:
def feature_derivative(errors, feature):     
    # Compute the dot product of errors and feature
    derivative = np.dot(errors, feature)
    
    # Return the derivative
    return derivative
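Since each partial derivative is a dot product of `errors` with one feature column, all of them can also be obtained at once as a matrix-vector product. The following is a minimal sketch of that vectorized form, assuming a dense NumPy feature matrix; the function name `gradient_all_coefficients` and the toy values are illustrative only.

```python
import numpy as np


def gradient_all_coefficients(feature_matrix, errors):
    # Entry j of the result is sum_i h_j(x_i) * errors[i]
    return np.dot(feature_matrix.T, errors)


toy_features = np.array([[1.0, 2.0],
                         [1.0, 0.0],
                         [1.0, 1.0]])
toy_errors = np.array([0.5, -0.5, 0.25])

toy_gradient = gradient_all_coefficients(toy_features, toy_errors)

# The same values computed one coefficient at a time
per_coefficient = [np.dot(toy_errors, toy_features[:, j])
                   for j in range(toy_features.shape[1])]
print(toy_gradient)
```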

Instead of the likelihood itself, use the log likelihood, which simplifies the derivation of the gradient and is more numerically stable.

$$\ell(\mathbf{w}) = \sum_{i=1}^N \Big( (\mathbf{1}[y_i = +1] - 1)\mathbf{w}^T h(\mathbf{x}_i) - \ln\left(1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))\right) \Big) $$

In [20]:
def compute_log_likelihood(feature_matrix, sentiment, coefficients):
    indicator = (sentiment == 1)
    scores = np.dot(feature_matrix, coefficients)
    logexp = np.log(1. + np.exp(-scores))
    
    # Simple check to prevent overflow
    mask = np.isinf(logexp)
    logexp[mask] = -scores[mask]
    
    L = np.sum((indicator - 1) * scores - logexp)
    return L
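The overflow check above patches `inf` values after they occur. One possible alternative, shown here as a sketch rather than as the notebook's method, is `np.logaddexp`, which evaluates $\ln(e^{a} + e^{b})$ without overflow, so that $\ln(1 + e^{-s})$ can be written as `np.logaddexp(0, -s)`. The function name `stable_log_likelihood` and the toy inputs are assumptions made for the example.

```python
import numpy as np


def stable_log_likelihood(feature_matrix, sentiment, coefficients):
    indicator = (sentiment == 1)
    scores = np.dot(feature_matrix, coefficients)
    # log(1 + exp(-score)) computed as log(exp(0) + exp(-score)), overflow-free
    logexp = np.logaddexp(0.0, -scores)
    return np.sum((indicator - 1) * scores - logexp)


# With all-zero coefficients every prediction is 0.5, so ll = N * ln(0.5)
ll_zero = stable_log_likelihood(np.array([[1.0, 0.0], [1.0, 0.0]]),
                                np.array([1, -1]),
                                np.zeros(2))

# A very negative score for a negative review: P(y = -1) is ~1, so ll is ~0
ll_extreme = stable_log_likelihood(np.array([[1.0]]),
                                   np.array([-1]),
                                   np.array([-1000.0]))
print(ll_zero)
print(ll_extreme)
```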

## Taking gradient steps

Implement logistic regression with a gradient ascent function that repeatedly steps the coefficients in the direction of the gradient of the log likelihood, moving towards the optimum.

In [28]:
def logistic_regression(feature_matrix, sentiment, initial_coefficients, step_size, max_iter, verbose = False):
    coefficients = np.array(initial_coefficients) # make sure it's a numpy array
    for itr in xrange(max_iter):
        # Predict P(y_i = +1|x_i,w) using your predict_probability() function
        predictions = predict_probability(feature_matrix, coefficients)
        
        # Compute indicator value for (y_i = +1)
        indicator = (sentiment == 1)
        
        # Compute the errors as indicator - predictions
        errors = indicator - predictions
        for j in xrange(len(coefficients)): # loop over each coefficient
            # Compute the derivative for coefficients[j]
            derivative = feature_derivative(errors, feature_matrix[:, j])
            
            # add the step size times the derivative to the current coefficient
            coefficients[j] += step_size * derivative
        
        # Output log likelihood if verbose
        if (verbose):
            if (itr <= 15 or 
                (itr <= 100 and itr % 10 == 0) or 
                (itr <= 1000 and itr % 100 == 0) or 
                (itr <= 10000 and itr % 1000 == 0) or 
                itr % 10000 == 0):
                L = compute_log_likelihood(feature_matrix, sentiment, coefficients)
                print ('iteration %*d: log likelihood of observed labels = %.8f' 
                       %(int(np.ceil(np.log10(max_iter))), itr, L))
    return coefficients

Run the logistic regression solver.

In [29]:
coefficients = logistic_regression(
    feature_matrix, sentiment, initial_coefficients = np.zeros(194), step_size = 1e-7, max_iter = 301, verbose = True)

iteration   0: log likelihood of observed labels = -36780.91768478
iteration   1: log likelihood of observed labels = -36775.13434712
iteration   2: log likelihood of observed labels = -36769.35713564
iteration   3: log likelihood of observed labels = -36763.58603240
iteration   4: log likelihood of observed labels = -36757.82101962
iteration   5: log likelihood of observed labels = -36752.06207964
iteration   6: log likelihood of observed labels = -36746.30919497
iteration   7: log likelihood of observed labels = -36740.56234821
iteration   8: log likelihood of observed labels = -36734.82152213
iteration   9: log likelihood of observed labels = -36729.08669961
iteration  10: log likelihood of observed labels = -36723.35786366
iteration  11: log likelihood of observed labels = -36717.63499744
iteration  12: log likelihood of observed labels = -36711.91808422
iteration  13: log likelihood of observed labels = -36706.20710739
iteration  14: log likelihood of observed labels = -36700.5020
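Because `errors` is computed once per iteration, the inner loop over coefficients is equivalent to a single vectorized update. The sketch below shows that form on a tiny, hand-made dataset; the function name `gradient_ascent_vectorized`, the toy data, and the step size are assumptions chosen for illustration and are unrelated to the review data.

```python
import numpy as np


def gradient_ascent_vectorized(feature_matrix, sentiment, step_size, max_iter):
    coefficients = np.zeros(feature_matrix.shape[1])
    indicator = (sentiment == 1).astype(float)
    for _ in range(max_iter):
        # P(y_i = +1 | x_i, w) for every row
        scores = np.dot(feature_matrix, coefficients)
        predictions = 1.0 / (1.0 + np.exp(-scores))
        errors = indicator - predictions
        # Step all coefficients at once along the gradient of the log likelihood
        coefficients = coefficients + step_size * np.dot(feature_matrix.T, errors)
    return coefficients


# Toy data: intercept column plus one feature that separates the classes
toy_features = np.array([[1.0, 2.0],
                         [1.0, 1.0],
                         [1.0, -1.0],
                         [1.0, -2.0]])
toy_sentiment = np.array([1, 1, -1, -1])

toy_coefficients = gradient_ascent_vectorized(
    toy_features, toy_sentiment, step_size=0.1, max_iter=50)
toy_scores = np.dot(toy_features, toy_coefficients)
print(toy_coefficients)
```

On this symmetric toy set the intercept stays at essentially zero while the feature coefficient grows positive, so every toy point ends up on the correct side of the decision boundary.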

## Predicting sentiments

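Class predictions for a data point $\mathbf{x}_i$ can be computed from the coefficients $\mathbf{w}$ using the following formula, where $\mathbf{x}_i$ stands for the feature vector $h(\mathbf{x}_i)$ (the corresponding row of **feature_matrix**):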
$$
\hat{y}_i = 
\left\{
\begin{array}{ll}
      +1 & \mathbf{x}_i^T\mathbf{w} > 0 \\
      -1 & \mathbf{x}_i^T\mathbf{w} \leq 0 \\
\end{array} 
\right.
$$

Compute class predictions:
* **Step 1**: First compute the **scores** as the dot product of **feature_matrix** and **coefficients**.
* **Step 2**: Using the formula above, compute the class predictions from the scores.

Step 1:

In [30]:
# Compute the scores as a dot product between feature_matrix and coefficients.
scores = np.dot(feature_matrix, coefficients)
scores

array([ 0.05104571, -0.02936473,  0.02411584, ..., -0.40986295,
        0.01411436, -0.06755923])

Step 2:

In [31]:
class_predictions = 1 * (scores > 0)
class_predictions[class_predictions == 0] = -1
class_predictions

array([ 1, -1,  1, ..., -1,  1, -1])
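The thresholding in Step 2 can also be written in one expression with `np.where`. This is a small sketch on made-up scores (the names `toy_scores` and `toy_class_predictions` are illustrative); note that a score of exactly 0 maps to -1, matching the formula above.

```python
import numpy as np

# Hypothetical scores, including the boundary case of exactly zero
toy_scores = np.array([0.05, -0.03, 0.0, -0.41])

# +1 where the score is strictly positive, -1 otherwise
toy_class_predictions = np.where(toy_scores > 0, 1, -1)
print(toy_class_predictions)
```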

Number predicted to have positive sentiment:

In [32]:
sum(class_predictions == 1)

25126

## Measuring accuracy

$$
\mbox{accuracy} = \frac{\mbox{number of correctly classified data points}}{\mbox{total number of data points}}
$$

In [33]:
print products['sentiment'][-10:]
print class_predictions[-10:]
correct_preds = [1 * (p == a) for (p, a) in zip(class_predictions, products['sentiment'])]
num_correct = sum(correct_preds)

num_mistakes = len(correct_preds) - num_correct

accuracy = num_correct / len(correct_preds)
print "-----------------------------------------------------"
print '# Reviews   correctly classified =', len(products) - num_mistakes
print '# Reviews incorrectly classified =', num_mistakes
print '# Reviews total                  =', len(products)
print "-----------------------------------------------------"
print 'Accuracy = %.2f' % accuracy

[-1, -1, -1, -1, -1, -1, -1, -1, -1, -1]
[-1  1 -1 -1 -1 -1 -1 -1  1 -1]
-----------------------------------------------------
# Reviews   correctly classified = 39903
# Reviews incorrectly classified = 13169
# Reviews total                  = 53072
-----------------------------------------------------
Accuracy = 0.75
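With NumPy arrays, the same accuracy can be computed without building an intermediate Python list. The sketch below applies that idea to the last ten labels and predictions printed above; the helper name `classification_accuracy` is hypothetical.

```python
import numpy as np


def classification_accuracy(predicted, actual):
    # Fraction of positions where the predicted and actual labels agree
    predicted = np.asarray(predicted)
    actual = np.asarray(actual)
    return float(np.mean(predicted == actual))


# The last ten labels and predictions shown in the output above
last_ten_labels = [-1] * 10
last_ten_predictions = [-1, 1, -1, -1, -1, -1, -1, -1, 1, -1]

toy_accuracy = classification_accuracy(last_ten_predictions, last_ten_labels)
print('Accuracy on the last ten reviews = %.2f' % toy_accuracy)
```

Two of these ten reviews are misclassified, so the accuracy on this small slice is 0.80, compared with 0.75 over all 53072 reviews.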


## Which words contribute most to positive & negative sentiments?

In [35]:
coefficients = list(coefficients[1:]) # exclude intercept
word_coefficient_tuples = [(word, coefficient) for word, coefficient in zip(important_words, coefficients)]
word_coefficient_tuples = sorted(word_coefficient_tuples, key=lambda x:x[1], reverse = True)

Now, **word_coefficient_tuples** contains a sorted list of (**word**, **coefficient_value**) tuples. The first 10 elements in this list correspond to the words that are most positive.

### Ten "most positive" words

In [36]:
word_coefficient_tuples[:10]

[('one', 0.066546084170457695),
 ('great', 0.065890762922123244),
 ('like', 0.064794586802578394),
 ('easy', 0.045435626308421372),
 ('much', 0.044976401394906038),
 ('old', 0.03013500109210707),
 ('even', 0.029739937104968459),
 ('seat', 0.020077541034775381),
 ('perfect', 0.018408707995268992),
 ('good', 0.01770319990570169)]

### Ten "most negative" words

In [37]:
word_coefficient_tuples[-10:]

[('money', -0.02448210054589172),
 ('waste', -0.026592778462247283),
 ('still', -0.027742697230661327),
 ('well', -0.028711552980192581),
 ('however', -0.028978976142317068),
 ('first', -0.030051249236035804),
 ('bottles', -0.03306951529475273),
 ('day', -0.038982037286487116),
 ('bought', -0.041511033392108897),
 ('use', -0.053860148445203128)]
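The same ranking can be produced with `np.argsort` when the coefficients are kept as an array. This sketch uses four of the words listed above with their coefficients rounded to four decimals; the variable names are illustrative.

```python
import numpy as np

# A few (word, coefficient) pairs taken from the lists above, rounded
toy_words = ['waste', 'one', 'use', 'great']
toy_coefficients = np.array([-0.0266, 0.0665, -0.0539, 0.0659])

# argsort sorts ascending, so reverse it to rank from most positive to most negative
order = np.argsort(toy_coefficients)[::-1]
ranked_words = [toy_words[i] for i in order]
print(ranked_words)
```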