# Implementing logistic regression from scratch
##### Eric Andrés Jardón Chao

**Goal**: to implement our own Logistic Regression Classifier.

For this, we:
 * Extract features from Amazon product reviews data into a Numpy matrix.
 * Implement the **link function** for logistic regression.
 * Write a function to compute the derivative of the **log likelihood** function with respect to a single coefficient.
 * Implement gradient ascent.
 * Predict sentiments, given a set of coefficients.
 * Compute classification accuracy for the logistic regression model.

## Load review dataset

In [1]:
import turicreate

For this assignment, we will use a subset of the Amazon product review dataset. This subset contains roughly equal numbers of positive and negative reviews (the original dataset consisted primarily of positive ones), which avoids _class imbalance_.

In [2]:
products = turicreate.SFrame('../data/amazon_baby_subset.sframe/')

In [3]:
products

| name | review | rating | sentiment |
|---|---|---|---|
| Stop Pacifier Sucking without tears with ... | All of my kids have cried non-stop when I tried to ... | 5.0 | 1 |
| Nature's Lullabies Second Year Sticker Calendar ... | We wanted to get something to keep track ... | 5.0 | 1 |
| Nature's Lullabies Second Year Sticker Calendar ... | My daughter had her 1st baby over a year ago. ... | 5.0 | 1 |
| Lamaze Peekaboo, I Love You ... | One of baby's first and favorite books, and i ... | 4.0 | 1 |
| SoftPlay Peek-A-Boo Where's Elmo A Childr ... | Very cute interactive book! My son loves this ... | 5.0 | 1 |
| Our Baby Girl Memory Book | Beautiful book, I love it to record cherished t ... | 5.0 | 1 |
| Hunnt&reg; Falling Flowers and Birds Kids ... | Try this out for a spring project !Easy ,fun and ... | 5.0 | 1 |
| Blessed By Pope Benedict XVI Divine Mercy Full ... | very nice Divine Mercy Pendant of Jesus now on ... | 5.0 | 1 |
| Cloth Diaper Pins Stainless Steel ... | We bought the pins as my 6 year old Autistic son ... | 4.0 | 1 |
| Cloth Diaper Pins Stainless Steel ... | It has been many years since we needed diaper ... | 5.0 | 1 |


In [4]:
# Count negative and positive reviews in the dataset

print('# of positive reviews =', len(products[products['sentiment']==1]))
print('# of negative reviews =', len(products[products['sentiment']==-1]))

# of positive reviews = 26579
# of negative reviews = 26493


### Extract word counts

For this notebook we will limit ourselves to 193 words (for simplicity). The list of the 193 most frequent words is stored in a JSON file.

In [5]:
# Load the JSON list of words

import json 

with open('./important_words.json', 'r') as f: # Reads the list of most frequent words
    important_words = json.load(f)
    
important_words = [str(s) for s in important_words]

Next we do the following transformations on the review data:

1. Remove punctuation using [Python's built-in](https://docs.python.org/3/library/string.html) string functionality.
2. Compute word counts (only for **important_words**).


In [7]:
# Remove punctuation from the 'review' text column
import string 
def remove_punctuation(text):
    
    translator = text.maketrans('', '', string.punctuation)
    text = text.translate(translator)
    
    return text

# Create a new column 'review_clean' containing the transformed text
products['review_clean'] = products['review'].apply(remove_punctuation)

For each word in **important_words**, we add a column containing the count of appearances of that word in the review text. 

The result is 193 new columns, one for each of the `important_words`.

In [8]:
# Create word count columns for the 193 words

for word in important_words:
    products[word] = products['review_clean'].apply(lambda s : s.split().count(word))
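
As a minimal sketch of the two transformations above, the following standalone example strips punctuation from a single string and counts a few words. The helper `count_important_words`, the example sentence, and the three-word vocabulary are made up for illustration and are not part of the notebook's pipeline.

```python
import string

def count_important_words(text, vocabulary):
    """Return {word: count} for each vocabulary word in the punctuation-stripped text."""
    # Same punctuation removal as remove_punctuation above
    cleaned = text.translate(str.maketrans('', '', string.punctuation))
    tokens = cleaned.split()
    # Same per-word counting as the lambda used for the SFrame columns
    return {word: tokens.count(word) for word in vocabulary}

# Hypothetical review text and vocabulary
example_counts = count_important_words(
    "Great toy, great price! My baby loves it.",
    ["great", "loves", "perfect"],
)
print(example_counts)  # {'great': 1, 'loves': 1, 'perfect': 0}
```

The count for `great` is 1 rather than 2 because the comparison is case-sensitive and `Great` is a different token. The notebook's word counts behave the same way, since the text is never lowercased.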

Now, write some code to compute the number of product reviews that contain the word **perfect**.

**Hint**: 
* First create a column called `contains_perfect` which is set to 1 if the count of the word **perfect** (stored in column **perfect**) is >= 1.
* Sum the number of 1s in the column `contains_perfect`.

In [16]:
(products['perfect'].apply(lambda x : x>=1)).sum()

2955

**Quiz Question**. How many reviews contain the word **perfect**?
`A = 2955`

### Convert SFrame to NumPy array: `get_numpy_data`

The NumPy library lets us perform matrix operations efficiently. To implement our algorithms we need to convert our data into NumPy arrays.

In [9]:
import numpy as np

In [10]:
def get_numpy_data(data_sframe, features, label):
    """Receives a complete dataframe, a list of feature names, and the target label name.
    Returns a 2-dimensional NumPy array and a NumPy array of the target values."""
    data_sframe['intercept'] = 1
    features = ['intercept'] + features
    
    features_sframe = data_sframe[features]
    feature_matrix = features_sframe.to_numpy()
    
    label_sarray = data_sframe[label]
    label_array = label_sarray.to_numpy()
    
    # Return a numpy matrix and an array with class labels
    return(feature_matrix, label_array)

In [11]:
# Convert our dataset into numpy arrays

feature_matrix, sentiment = get_numpy_data(products, important_words, 'sentiment')

In [12]:
feature_matrix.shape

(53072, 194)

**Quiz Question:** How many features are there in the **feature_matrix**? `A=194`

**Quiz Question:** Assuming that the intercept is present, how does the number of features in **feature_matrix** relate to the number of features in the logistic regression model? `A = they are equal: each of the 194 columns (intercept included) has exactly one coefficient`
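
To illustrate how the intercept column affects the shapes, here is a small sketch in plain NumPy, without turicreate. The word counts are made up and the names `word_counts`, `toy_feature_matrix` and `toy_coefficients` are hypothetical.

```python
import numpy as np

# Hypothetical word counts for 3 reviews and 2 words
word_counts = np.array([[2, 0],
                        [0, 1],
                        [1, 1]])

# Prepend a constant column of ones, playing the role of the 'intercept' feature
intercept = np.ones((word_counts.shape[0], 1))
toy_feature_matrix = np.hstack([intercept, word_counts])

# One coefficient per column of the feature matrix, intercept included
toy_coefficients = np.zeros(toy_feature_matrix.shape[1])

print(toy_feature_matrix.shape)  # (3, 3)
print(toy_coefficients.shape)    # (3,)
```

With 2 words the matrix has 3 columns, just as 193 words give the 194 columns reported above.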

### Estimating Conditional Probability  with `predict_probability`

In our Logistic Regression Model we must compute the conditional probability for every row, given by the logistic function:
$$
P(y_i = +1 | \mathbf{x}_i,\mathbf{w}) = \frac{1}{1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))},
$$

where the feature vector $h(\mathbf{x}_i)$ represents the word counts of **important_words** in the review $\mathbf{x}_i$. The following function implements the link function:

In [25]:
from math import e

def predict_probability(feature_matrix, coefficients):
    '''
    Produces a probabilistic estimate of P(y_i = +1 | x_i, w).
    The estimate ranges between 0 and 1.
    Returns an array of the N predicted probabilities.
    '''
    # Compute dot products of feature_matrix and coefficients  (vectorized)
    score = np.dot(feature_matrix, coefficients)
    
    # Compute P(y_i = +1 | x_i, w) via the link function  (vectorized)
    predictions = 1 / (1+e**(-score))
    
    # return Predicted Probabilities array
    return predictions

**Aside**. How the link function works with matrix algebra

Since the word counts are stored as columns in **feature_matrix**, each $i$-th row of the matrix corresponds to the feature vector $h(\mathbf{x}_i)$:
$$
[\text{feature_matrix}] =
\left[
\begin{array}{c}
h(\mathbf{x}_1)^T \\
h(\mathbf{x}_2)^T \\
\vdots \\
h(\mathbf{x}_N)^T
\end{array}
\right] =
\left[
\begin{array}{cccc}
h_0(\mathbf{x}_1) & h_1(\mathbf{x}_1) & \cdots & h_D(\mathbf{x}_1) \\
h_0(\mathbf{x}_2) & h_1(\mathbf{x}_2) & \cdots & h_D(\mathbf{x}_2) \\
\vdots & \vdots & \ddots & \vdots \\
h_0(\mathbf{x}_N) & h_1(\mathbf{x}_N) & \cdots & h_D(\mathbf{x}_N)
\end{array}
\right]
$$

By the rules of matrix multiplication, the score vector containing elements $\mathbf{w}^T h(\mathbf{x}_i)$ is obtained by multiplying **feature_matrix** and the coefficient vector $\mathbf{w}$.
$$
[\text{score}] =
[\text{feature_matrix}]\mathbf{w} =
\left[
\begin{array}{c}
h(\mathbf{x}_1)^T \\
h(\mathbf{x}_2)^T \\
\vdots \\
h(\mathbf{x}_N)^T
\end{array}
\right]
\mathbf{w}
= \left[
\begin{array}{c}
h(\mathbf{x}_1)^T\mathbf{w} \\
h(\mathbf{x}_2)^T\mathbf{w} \\
\vdots \\
h(\mathbf{x}_N)^T\mathbf{w}
\end{array}
\right]
= \left[
\begin{array}{c}
\mathbf{w}^T h(\mathbf{x}_1) \\
\mathbf{w}^T h(\mathbf{x}_2) \\
\vdots \\
\mathbf{w}^T h(\mathbf{x}_N)
\end{array}
\right]
$$
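
The identity above can be checked numerically. The sketch below uses a made-up 2-by-3 feature matrix and coefficient vector (both hypothetical) and compares the single matrix product against row-by-row dot products.

```python
import numpy as np

# Hypothetical feature matrix (first column is the intercept) and coefficients
toy_features = np.array([[1.0, 2.0, 0.0],
                         [1.0, 0.0, 1.0]])
toy_w = np.array([0.5, -1.0, 2.0])

# One matrix-vector product gives every score at once
scores_matrix = np.dot(toy_features, toy_w)

# The same scores computed one row at a time as w^T h(x_i)
scores_rowwise = np.array([np.dot(toy_w, row) for row in toy_features])

# Apply the link function element-wise
probabilities = 1.0 / (1.0 + np.exp(-scores_matrix))

print(scores_matrix)   # [-1.5  2.5]
print(np.allclose(scores_matrix, scores_rowwise))  # True
```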

## Compute partial derivative of log likelihood with `feature_derivative`

From lecture, the derivative of the **log likelihood** with respect to a single coefficient $w_j$ is given by:
$$
\frac{\partial\ell}{\partial w_j} = \sum_{i=1}^N h_j(\mathbf{x}_i)\left(\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})\right)
$$

This can be rewritten as the dot product between the vector of values $h_j(\mathbf{x}_i)$ (over all _i_) and the vector of _errors_: the difference between the indicator function and the predicted probability (over all _i_).

In [29]:
def feature_derivative(errors, feature):
    """Computes the partial derivative with respect to the j-th coefficient.
    Receives:
    *  an errors vector (difference between indicator and predicted probability)
    *  a features vector which contains the j-th feature's values over all observations
    """
    # Compute the dot product of errors and feature vectors
    derivative = np.dot(errors, feature)
    
    # Return the derivative
    return derivative
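
Because each partial derivative is a dot product with one column of the feature matrix, all of them can be obtained at once as a single matrix product. The following sketch checks this on made-up data; `toy_X`, `toy_y` and `toy_w` are hypothetical, and this vectorized form is an alternative to the per-feature function above rather than the notebook's own method.

```python
import numpy as np

# Hypothetical data: 3 observations, intercept plus one feature
toy_X = np.array([[1.0, 2.0],
                  [1.0, 0.0],
                  [1.0, 1.0]])
toy_y = np.array([+1, -1, +1])
toy_w = np.zeros(2)

# With all-zero coefficients every predicted probability is 0.5
predictions = 1.0 / (1.0 + np.exp(-toy_X @ toy_w))
errors = (toy_y == +1) - predictions          # [0.5, -0.5, 0.5]

# One dot product per feature column, as in feature_derivative
per_feature = np.array([np.dot(errors, toy_X[:, j]) for j in range(toy_X.shape[1])])

# The same gradient as a single matrix-vector product
gradient = toy_X.T @ errors

print(gradient)  # [0.5 1.5]
```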

The **log likelihood** simplifies the derivation of the gradient and is more numerically stable, because a product of many small probabilities becomes a sum of logarithms. For this reason, we will use the log likelihood instead of the likelihood to assess the algorithm.

The log likelihood is computed using the following formula:

$$\ell\ell(\mathbf{w}) = \sum_{i=1}^N \Big( (\mathbf{1}[y_i = +1] - 1)\mathbf{w}^T h(\mathbf{x}_i) - \ln\left(1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))\right) \Big) $$

We will write a function to compute the log likelihood for the entire dataset. 

In [30]:
def log_likelihood(feature_matrix, sentiment, coefficients):
    """Computes the log likelihood for the given dataset.
    Receives:
    * feature matrix (a 2d numpy array of feature values)
    * sentiment array (an array of class labels)
    * coefficients (a numpy array of D feature weights)
    """
    indicator = (sentiment==+1)  # indicator function returns 1 if true, 0 otherwise
    
    scores = np.dot(feature_matrix, coefficients)  # dot product of features and weights
    
    logexp = np.log(1. + np.exp(-scores))  # take the natural log of 1 + e**(-scores)
    
    # Guard against overflow: when exp(-score) overflows, ln(1 + exp(-score)) is approximately -score
    mask = np.isinf(logexp)    # True where logexp overflowed to infinity
    logexp[mask] = -scores[mask]   # replace those entries with -score
    
    
    lp = np.sum((indicator-1)*scores - logexp)
    
    return lp
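
As an alternative to the explicit overflow mask, the term $\ln(1 + \exp(-s))$ can be written as `np.logaddexp(0, -s)`, which NumPy evaluates without overflowing. The sketch below is one possible variant under that substitution; the name `stable_log_likelihood` is hypothetical and the function is not the one used in the rest of this notebook.

```python
import numpy as np

def stable_log_likelihood(feature_matrix, sentiment, coefficients):
    """Log likelihood using logaddexp for ln(1 + exp(-score))."""
    indicator = (sentiment == +1)
    scores = np.dot(feature_matrix, coefficients)
    # logaddexp(0, -s) = ln(exp(0) + exp(-s)) = ln(1 + exp(-s))
    return np.sum((indicator - 1) * scores - np.logaddexp(0, -scores))

# With all-zero coefficients each observation contributes -ln(2)
ll_zero = stable_log_likelihood(np.ones((3, 1)), np.array([1, -1, 1]), np.zeros(1))

# An extreme score of -1000 on a negative example stays finite
ll_extreme = stable_log_likelihood(np.array([[1000.0]]), np.array([-1]), np.array([-1.0]))
print(ll_zero, ll_extreme)
```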

## Gradient Ascent: the `logistic_regression` function

In [58]:
from math import sqrt

def logistic_regression(feature_matrix, sentiment, initial_coefficients, step_size, max_iter):
    # Ensure coefficients is a numpy array
    coefficients = np.array(initial_coefficients) 
    
    # We update coefficients vector max_iter times
    for itr in range(max_iter):

        # Compute array of P(y_i = +1|x_i,w) 
        predictions = predict_probability(feature_matrix, coefficients)
        
        # Compute indicator value I(y_i = +1)
        indicator = (sentiment==+1)
        
        # Compute the errors as indicator - predictions
        errors = indicator - predictions
        
        for j in range(len(coefficients)): # loop over each feature weight (coefficients)
            
            # Compute partial derivative with respect to the j-th weight
            feature_col = feature_matrix[:,j]
            
            derivative = feature_derivative(errors, feature_col)
            
            # add the step size times the derivative to the current coefficient
            coefficients[j] += step_size * derivative
        
        # At selected iterations, print the log likelihood to check that it is improving
        if itr <= 15 or (itr <= 100 and itr % 10 == 0) or (itr <= 1000 and itr % 100 == 0) \
        or (itr <= 10000 and itr % 1000 == 0) or itr % 10000 == 0:
            
            lp = log_likelihood(feature_matrix, sentiment, coefficients)
            
            print('iteration %*d: log likelihood of observed labels = %.8f' % \
                (int(np.ceil(np.log10(max_iter))), itr, lp))
            
    return coefficients

Now, let us run the logistic regression solver.

In [59]:
coefficients = logistic_regression(feature_matrix, sentiment, initial_coefficients=np.zeros(194),
                                   step_size=1e-7, max_iter=301)

iteration   0: log likelihood of observed labels = -36780.91768478
iteration   1: log likelihood of observed labels = -36775.13434712
iteration   2: log likelihood of observed labels = -36769.35713564
iteration   3: log likelihood of observed labels = -36763.58603240
iteration   4: log likelihood of observed labels = -36757.82101962
iteration   5: log likelihood of observed labels = -36752.06207964
iteration   6: log likelihood of observed labels = -36746.30919497
iteration   7: log likelihood of observed labels = -36740.56234821
iteration   8: log likelihood of observed labels = -36734.82152213
iteration   9: log likelihood of observed labels = -36729.08669961
iteration  10: log likelihood of observed labels = -36723.35786366
iteration  11: log likelihood of observed labels = -36717.63499744
iteration  12: log likelihood of observed labels = -36711.91808422
iteration  13: log likelihood of observed labels = -36706.20710739
iteration  14: log likelihood of observed labels = -36700.5020

**Quiz Question:** As each iteration of gradient ascent passes, does the log likelihood increase or decrease? `A = increases`
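
The same behavior can be reproduced on a small synthetic problem. The sketch below is a compact, vectorized gradient ascent loop on made-up data; the data, the coefficients `assumed_w` used to generate labels, the step size and the iteration count are all assumptions chosen for illustration, not values from the review dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (made up): intercept column plus two Gaussian features
toy_X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
assumed_w = np.array([0.5, 2.0, -1.0])
prob_positive = 1.0 / (1.0 + np.exp(-toy_X @ assumed_w))
toy_y = np.where(rng.random(200) < prob_positive, 1, -1)

def toy_log_likelihood(X, y, w):
    scores = X @ w
    return np.sum(((y == 1) - 1) * scores - np.logaddexp(0, -scores))

w = np.zeros(3)
history = []
for _ in range(50):
    predictions = 1.0 / (1.0 + np.exp(-toy_X @ w))
    errors = (toy_y == 1) - predictions
    w = w + 1e-3 * (toy_X.T @ errors)   # one gradient ascent step
    history.append(toy_log_likelihood(toy_X, toy_y, w))

# Check that the log likelihood never decreases between iterations
is_increasing = all(b >= a for a, b in zip(history, history[1:]))
print(is_increasing)
```

With a sufficiently small step size, as here, each step moves uphill on the concave log likelihood. A step size that is too large can overshoot and make the value oscillate or decrease.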

## Predicting sentiments

Now that we have fitted a set of coefficients for our model, we can compute the class prediction for a data point $\mathbf{x}_i$ from the coefficients $\mathbf{w}$ using the following formula:
$$
\hat{y}_i = 
\left\{
\begin{array}{ll}
      +1 & \mathbf{w}^T h(\mathbf{x}_i) > 0 \\
      -1 & \mathbf{w}^T h(\mathbf{x}_i) \leq 0 \\
\end{array} 
\right.
$$

We will write some code to compute class predictions, in two steps:
* **Step 1**: First compute the **scores** as the dot product of **feature_matrix** and **coefficients**.
* **Step 2**: Using the formula above, compute the class predictions from the scores.

In [34]:
# STEP 1: Compute the scores as a dot product between feature_matrix and our estimated coefficients.
scores = np.dot(feature_matrix, coefficients)

In [36]:
# STEP 2: Compute class predictions based on scores
class_predictions = []

for s in scores:
    c = 1 if s>0 else -1
    
    class_predictions.append(c)

class_predictions = np.array(class_predictions)
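
The loop in Step 2 can also be written as a single vectorized call to `np.where`. This is a sketch on made-up scores (`toy_scores` is hypothetical); note that a score of exactly 0 maps to -1, matching the formula above.

```python
import numpy as np

# Hypothetical scores, including one exactly at the decision boundary
toy_scores = np.array([2.3, -0.7, 0.0, 1.1])

# +1 where the score is strictly positive, -1 otherwise
toy_class_predictions = np.where(toy_scores > 0, 1, -1)

print(toy_class_predictions)  # [ 1 -1 -1  1]
```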

**Quiz Question:** How many reviews were predicted to have positive sentiment?
`A = 25126`

In [38]:
num_pred_positive = (class_predictions == 1).sum()
print(f"Predicted positive reviews: {num_pred_positive}") # actual number of positive reviews is 26579

Predicted positive reviews: 25126


## Measuring accuracy

We will now measure the classification accuracy of the model. 
The classification accuracy can be computed as follows:

$$
\text{accuracy} = \frac{\text{number of correctly classified data points}}{\text{total number of data points}}
$$


In [39]:
num_mistakes = (class_predictions != products['sentiment']).sum()
accuracy = (len(products) - num_mistakes) / len(products)
print("-----------------------------------------------------")
print('# Reviews   correctly classified =', len(products) - num_mistakes)
print('# Reviews incorrectly classified =', num_mistakes)
print('# Reviews total                  =', len(products))
print("-----------------------------------------------------")
print('Accuracy = %.2f' % accuracy)

-----------------------------------------------------
# Reviews   correctly classified = 39903
# Reviews incorrectly classified = 13169
# Reviews total                  = 53072
-----------------------------------------------------
Accuracy = 0.75


**Quiz Question**: What is the accuracy of the model on predictions made above? (round to 2 digits of accuracy)
`A = 0.75`
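
For reference, accuracy is simply the mean of an element-wise equality test. The sketch below uses made-up predictions and labels (`toy_predictions` and `toy_labels` are hypothetical), and then computes the majority-class baseline from the class counts printed earlier in this notebook.

```python
import numpy as np

# Hypothetical predictions and true labels for 5 data points
toy_predictions = np.array([1, -1, 1, 1, -1])
toy_labels = np.array([1, -1, -1, 1, 1])

# Fraction of positions where prediction and label agree: 3 of 5
toy_accuracy = np.mean(toy_predictions == toy_labels)
print(toy_accuracy)  # 0.6

# Baseline: always predict the majority class (26579 positives out of 53072 reviews)
majority_baseline = 26579 / 53072
print(round(majority_baseline, 2))  # 0.5
```

Since the classes are balanced, always predicting "positive" would score about 0.50, so an accuracy of 0.75 is a clear improvement over that baseline.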

## Which words contribute most to positive & negative sentiments?

We want to know which words correspond most strongly with positive reviews. 

To find them, we follow these steps:
1. Pair every coefficient with its word, i.e. (**word**, **coefficient_value**).
2. Sort all the (**word**, **coefficient_value**) tuples by **coefficient_value** in descending order.

In [60]:
coefficients = list(coefficients[1:]) # exclude intercept (w_0)

# Given k iterables, zip() builds tuples of k elements.
# We have two sequences, so we can create element-wise (word, coefficient) pairs.
word_coefficient_pairs = [(word, coefficient) for word, coefficient in zip(important_words, coefficients)]

# Sort by coefficient value from largest to smallest
word_coefficient_pairs = sorted(word_coefficient_pairs, key=lambda x: x[1], reverse=True)

**`word_coefficient_pairs`** contains a sorted list of (**word**, **coefficient_value**) tuples. The first 10 elements in this list correspond to the words that are most positive.
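
An equivalent ranking can be sketched with `np.argsort`, which returns the indices that would sort an array. The four words and weights below are made up for illustration and are not the fitted coefficients.

```python
import numpy as np

# Hypothetical words and coefficient values
toy_words = ['great', 'would', 'love', 'money']
toy_weights = np.array([0.07, -0.05, 0.06, -0.04])

# argsort is ascending, so reverse it for largest-to-smallest order
order = np.argsort(toy_weights)[::-1]
ranked = [(toy_words[i], float(toy_weights[i])) for i in order]

most_positive = ranked[:2]        # head of the list
most_negative = ranked[-2:][::-1] # tail of the list, most negative first

print(most_positive)  # [('great', 0.07), ('love', 0.06)]
print(most_negative)  # [('would', -0.05), ('money', -0.04)]
```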

### Ten "most positive" words

Now, we compute the 10 words that have the most positive coefficient values. These words are associated with positive sentiment.

In [61]:
word_coefficient_pairs[:10]

[('great', 0.0665460841704577),
 ('love', 0.06589076292212324),
 ('easy', 0.0647945868025784),
 ('little', 0.04543562630842137),
 ('loves', 0.04497640139490604),
 ('well', 0.030135001092107084),
 ('perfect', 0.029739937104968462),
 ('old', 0.02007754103477538),
 ('nice', 0.018408707995268996),
 ('daughter', 0.017703199905701694)]

**Quiz Question:** Which word is **not** present in the top 10 "most positive" words?

- love 
- easy 
- great 
- perfect
- cheap  `not present`

### Ten "most negative" words

Next, we repeat this exercise on the 10 most negative words.  That is, we compute the 10 words that have the most negative coefficient values. These words are associated with negative sentiment.

In [62]:
# the 10 most negative words, ordered from least to most negative
last = len(word_coefficient_pairs)
word_coefficient_pairs[last-10: last]

[('monitor', -0.024482100545891717),
 ('return', -0.02659277846224728),
 ('back', -0.02774269723066133),
 ('get', -0.02871155298019258),
 ('disappointed', -0.028978976142317068),
 ('even', -0.03005124923603581),
 ('work', -0.03306951529475272),
 ('money', -0.038982037286487116),
 ('product', -0.0415110333921089),
 ('would', -0.053860148445203135)]

**Quiz Question:** Which word is **not** present in the top 10 "most negative" words?

- need `not present`
- work 
- disappointed 
- even 
- return 