# Multi-class Logistic Regression

Now that we've finished our logistic regression example, its main limitation is clear: it only handles binary classification.

In this example, we'll extend our solution to handle multi-class classification. This covers the first part of the exercise and prepares us for the next topic: neural networks.

The task is to use logistic regression to recognize handwritten digits (0-9). We'll start by loading the dataset. The Coursera course provides it as a MATLAB file, *ex3data1.mat*, which pandas can't read directly, so we'll use SciPy's `loadmat` to load it.

In [19]:
import os
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
from scipy.io import loadmat
%matplotlib inline

In [20]:
pwd = os.getcwd()
data = loadmat(pwd + '/asn3/data/ex3data1.mat')
data

{'X': array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
        [ 0.,  0.,  0., ...,  0.,  0.,  0.],
        [ 0.,  0.,  0., ...,  0.,  0.,  0.],
        ..., 
        [ 0.,  0.,  0., ...,  0.,  0.,  0.],
        [ 0.,  0.,  0., ...,  0.,  0.,  0.],
        [ 0.,  0.,  0., ...,  0.,  0.,  0.]]),
 '__globals__': [],
 '__header__': 'MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Sun Oct 16 13:09:09 2011',
 '__version__': '1.0',
 'y': array([[10],
        [10],
        [10],
        ..., 
        [ 9],
        [ 9],
        [ 9]], dtype=uint8)}

In [21]:
# review the shapes of the arrays we just loaded into memory
data['X'].shape, data['y'].shape

((5000, 400), (5000, 1))

Remember what we have here: our training data X holds 5000 examples of 20x20 pixel images of handwritten digits. Each 20x20 image is unrolled into a single vector of 400 pixels, where each value is a grayscale pixel intensity (not RGB). That's how we end up with a (5000, 400) array for X.

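The class labels in y give the digit that each row of X represents. Note that the digit 0 is stored as label 10, because the data was prepared for MATLAB, which has no zero index.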
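
As a minimal sketch of the unrolling described above, the snippet below flattens a made-up 20x20 array into a 400-element vector and reshapes it back. The `image` array is synthetic, not taken from *ex3data1.mat*, and the column-major layout (`order='F'`) is an assumption based on the data originating in MATLAB; if a digit looks transposed when plotted, the default row-major order may be the right one instead.

```python
import numpy as np

# synthetic 20x20 "image" (hypothetical data, for illustration only)
image = np.arange(400, dtype=float).reshape(20, 20)

# unroll into a single vector of 400 pixel intensities, column by column
row = image.ravel(order='F')

# reshape the vector back into a 20x20 image using the same ordering
restored = row.reshape((20, 20), order='F')

print(row.shape)                        # (400,)
print(np.array_equal(image, restored))  # True
```

Whichever ordering is used, it has to be the same in both directions, otherwise the restored image comes back transposed.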

# Cost Function

We can use the same cost function as before because we vectorized it, meaning it isn't hardcoded to a particular number of features. We'll reuse the one from the previous example unchanged. Note that we're jumping straight to the regularized version here too.
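
For reference, the regularized cost that the `cost` function computes can be written as follows. The notation is our own shorthand for the code: $m$ is the number of examples, $n$ the number of features, $h_\theta(x) = \sigma(\theta^T x)$ is the sigmoid hypothesis, and $\lambda$ corresponds to `reg_lambda`.

$$
J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left[-y^{(i)}\log h_\theta(x^{(i)}) - (1 - y^{(i)})\log\left(1 - h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2
$$

The second sum starts at $j = 1$, so the intercept $\theta_0$ is not penalized.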

In [22]:
def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))  

def cost(theta, X, y, reg_lambda):
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    h = sigmoid(X * theta.T)
    m = len(X)
    
    # don't regularize the first theta term
    reg = (reg_lambda / (2 * m)) * np.sum(np.power(theta[:, 1:], 2))
    
    return ((1.0 / m) * np.sum(np.multiply(-y, np.log(h)) - np.multiply((1 - y), np.log(1 - h)))) \
        + reg

# Gradient Descent

We'll need the gradient of the cost function here as well. It was defined in the previous example with a for loop; here we'll implement a vectorized version without one.

In [23]:
def gradient(theta, X, y, reg_lambda):
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    m = len(X)

    h = sigmoid(X * theta.T)

    # regularization term for every parameter; the intercept is fixed up below
    reg = (reg_lambda / m) * theta

    # NOTE 1: np.multiply is element-wise multiplication. On np.matrix
    # objects, the * operator is matrix multiplication (the dot product).

    # NOTE 2: NumPy broadcasts arrays of different shapes when adding,
    # which can silently change the shape of the result. To avoid surprises,
    # make sure both operands have the same shape before adding.
    grad = ((1.0 / m) * (X.T * (h - y))).T + reg

    # overwrite the first term so the intercept is not regularized
    grad[0, 0] = (1.0 / m) * np.sum(np.multiply(h - y, X[:, 0]))

    # flatten to a 1-D array so the result can be passed to an optimizer
    return np.array(grad).ravel()

def gradient_original(theta, X, y, reg_lambda):
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)

    error = sigmoid(X * theta.T) - y

    grad = ((X.T * error) / len(X)).T + ((reg_lambda / len(X)) * theta)

    # intercept gradient is not regularized
    grad[0, 0] = np.sum(np.multiply(error, X[:, 0])) / len(X)

    # flatten to a 1-D array, the shape the optimizer expects for jac
    return np.array(grad).ravel()
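
One way to gain confidence in a hand-written gradient is to compare it against a finite-difference approximation of the cost. The sketch below does this on a small random dataset. It re-implements the cost and gradient with plain NumPy arrays (`cost_arr`, `gradient_arr`, and `numerical_gradient` are hypothetical names introduced here, not part of the original exercise), so it is a self-contained illustration rather than a test of the exact functions above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_arr(theta, X, y, reg_lambda):
    # regularized logistic cost; theta[0] (the intercept) is not penalized
    m = X.shape[0]
    h = sigmoid(X.dot(theta))
    data_term = np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))
    reg_term = (reg_lambda / (2.0 * m)) * np.sum(theta[1:] ** 2)
    return data_term + reg_term

def gradient_arr(theta, X, y, reg_lambda):
    # analytic gradient of cost_arr with respect to theta
    m = X.shape[0]
    error = sigmoid(X.dot(theta)) - y
    grad = X.T.dot(error) / m
    grad[1:] += (float(reg_lambda) / m) * theta[1:]
    return grad

def numerical_gradient(f, theta, eps=1e-5):
    # central finite differences, one parameter at a time
    approx = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        approx[j] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return approx

# small synthetic problem: 20 examples, intercept column plus 3 features
rng = np.random.RandomState(0)
X_demo = np.hstack([np.ones((20, 1)), rng.randn(20, 3)])
y_demo = (rng.rand(20) > 0.5).astype(float)
theta_demo = rng.randn(4) * 0.1

analytic = gradient_arr(theta_demo, X_demo, y_demo, 1.0)
numeric = numerical_gradient(lambda t: cost_arr(t, X_demo, y_demo, 1.0), theta_demo)
max_diff = np.max(np.abs(analytic - numeric))
print(max_diff < 1e-6)  # True when the two gradients agree
```

If the two gradients disagree by more than a tiny amount, the usual suspects are a sign error, a missing `1/m` factor, or regularizing the intercept by mistake.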

Now that we've defined our cost and gradient functions, we'll build our classifier.

For this task, we have 10 possible classes. Since logistic regression can only distinguish between 2 classes at a time, we'll use a *one-vs-all classification* approach: for each class k, we train a binary classifier that separates *class k* from *not class k*.

We'll wrap the classifier training in one function that computes the final weights for each of the 10 classifiers and returns the weights as a *k x (n + 1)* array, where k is the number of classes and n is the number of features.
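
The relabeling step at the heart of one-vs-all can be sketched on a tiny, made-up label vector. The values in `y_toy` are hypothetical, and the comparison `y_toy == k` is just one vectorized way to build the binary targets for class k.

```python
import numpy as np

# hypothetical labels for six examples, shaped (6, 1) like data['y']
y_toy = np.array([[10], [1], [2], [1], [10], [3]])

# binary targets for class k = 1: 1 where the label is 1, else 0
k = 1
y_k = (y_toy == k).astype(int)

print(y_k.ravel())  # [0 1 0 1 0 0]
print(y_k.shape)    # (6, 1)
```

Repeating this for every k from 1 to 10 gives ten binary problems, each of which the ordinary logistic regression cost and gradient can handle.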

In [24]:
from scipy.optimize import minimize

# one-vs-all: for each class k, relabel the targets as 1 ("is class k") or
# 0 ("is not class k") and train a separate binary classifier. The result
# is a k x (n + 1) array with one row of weights per class; here, (10, 401).
def one_vs_all(X, y, num_labels, reg_lambda):
    num_examples = X.shape[0] # 5000
    num_features = X.shape[1] # 400
    
    # k x (n + 1) array for the parameters of each of the k classifiers
    # (10, 401) <- 401 due to the 'ones' column we'll be adding next
    all_theta = np.zeros((num_labels, num_features + 1))
    
    # insert a column of ones at the beginning for the intercept term
    # (5000, 401)
    X = np.insert(X, 0, values=np.ones(num_examples), axis=1)
    
    # labels are 1-indexed instead of 0-indexed
    for i in range(1, num_labels + 1):
        theta = np.zeros(num_features + 1) # (401,)
        y_i = np.array([1 if label == i else 0 for label in y]) # (5000,)
        y_i = np.reshape(y_i, (num_examples, 1))
        
        # minimize the cost function for each classifier
        fmin = minimize(fun=cost, x0=theta, args=(X, y_i, reg_lambda), method='TNC', jac=gradient_original)
        all_theta[i-1, :] = fmin.x
        
    print(X.shape, y_i.shape, theta.shape, all_theta.shape)
        
    return all_theta

In [28]:
print("Training...")
all_theta = one_vs_all(data['X'], data['y'], 10, 1)
print(all_theta)

Training...
(5000, 401) (5000, 1) (401,) (10, 401)
[[ -4.82623104e+00   0.00000000e+00   0.00000000e+00 ...,   9.14496656e-03
    2.88549512e-07   0.00000000e+00]
 [ -5.81888929e+00   0.00000000e+00   0.00000000e+00 ...,   5.54209611e-02
   -6.07966728e-03   0.00000000e+00]
 [ -8.81931096e+00   0.00000000e+00   0.00000000e+00 ...,  -2.34701067e-04
   -1.08823612e-06   0.00000000e+00]
 ..., 
 [ -1.31276815e+01   0.00000000e+00   0.00000000e+00 ...,  -5.62757888e+00
    6.49936790e-01   0.00000000e+00]
 [ -8.73271923e+00   0.00000000e+00   0.00000000e+00 ...,  -2.88830420e-01
    1.99467688e-02   0.00000000e+00]
 [ -1.31953052e+01   0.00000000e+00   0.00000000e+00 ...,   2.68975933e-04
    4.22526177e-05   0.00000000e+00]]


Now we can predict our classes. We'll compute the probability of each class for each training example (using vectorization) and assign each example the class label with the highest probability.

In [26]:
def predict_all(X, all_theta):
    num_rows = X.shape[0]
    num_features = X.shape[1]
    num_labels = all_theta.shape[0]
    
    # same as before, insert ones
    X = np.insert(X, 0, values=np.ones(num_rows), axis=1)
    
    # convert to matrices
    X = np.matrix(X)
    all_theta = np.matrix(all_theta)
    
    # compute the class probability for each class on each training example
    h = sigmoid(X * all_theta.T)
    
    # create array of the index with maximum probability
    h_argmax = np.argmax(h, axis=1)
    
    # because our array was zero-indexed, we need to add one for the true
    # label prediction
    h_argmax = h_argmax + 1
    
    return h_argmax
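
To see the argmax step in isolation, here is a small sketch with a hypothetical score matrix for three examples and four classes. Because the sigmoid is monotonically increasing, taking the argmax of the raw scores picks the same class as taking the argmax of the probabilities.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

# hypothetical scores: one row per example, one column per class
scores = np.array([[ 2.0, -1.0,  0.5, -3.0],
                   [-0.5,  0.1,  4.0,  1.0],
                   [-2.0, -1.5, -4.0, -0.2]])

probs = sigmoid(scores)

# index of the most probable class per row, shifted to 1-based labels
labels_from_probs = np.argmax(probs, axis=1) + 1
labels_from_scores = np.argmax(scores, axis=1) + 1

print(labels_from_probs)                                      # [1 3 4]
print(np.array_equal(labels_from_probs, labels_from_scores))  # True
```

Note that the ten probabilities in a row come from independent classifiers, so they do not have to sum to 1; only their ranking matters for the prediction.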

Now we can use this function to generate class predictions for each example and see how well our classifier works. Keep in mind that this measures accuracy on the same data the classifiers were trained on.

In [27]:
y_pred = predict_all(data['X'], all_theta)
correct = [1 if a == b else 0 for (a, b) in zip(y_pred, data['y'])]
accuracy = (sum(map(int, correct))/ float(len(correct)))

print('Accuracy = {0}%'.format(accuracy * 100))

Accuracy = 97.48%
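
The accuracy calculation can also be written without the list comprehension. The sketch below uses made-up predictions and labels (`y_pred_toy` and `y_true_toy` are hypothetical) and flattens both before comparing, so that an (m, 1) array and an (m,) array are not accidentally broadcast against each other.

```python
import numpy as np

y_true_toy = np.array([[10], [1], [2], [3]])  # shaped like data['y']
y_pred_toy = np.array([10, 1, 2, 4])          # flat predictions, one per example

# flatten both to shape (4,) and compare element-wise
matches = np.asarray(y_pred_toy).ravel() == np.asarray(y_true_toy).ravel()
accuracy = matches.mean()

print('Accuracy = {0}%'.format(accuracy * 100))  # Accuracy = 75.0%
```

Without the flattening, comparing a (4, 1) array with a (4,) array would broadcast to a (4, 4) result and give a misleading accuracy.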
