## HW4 Part A: Logistic Regression for Digits Data Set

Blanca Miller
<br>
CS 791
<br>
03/02/2018

__Objective:__ Fit a logistic regression model that estimates class probabilities for the K digit classes. In the binary case, the probability of the dependent/response variable is p(x, beta) = exp(beta^T * x)/(1 + exp(beta^T * x)), where x is an observation and beta is the vector of model parameters. We pick the parameters by maximum likelihood: start from a random set of parameters and iterate to increase the likelihood.  

The probability is bounded to the interval [0, 1]. The output p(x, beta) is the estimated probability that the observation x belongs to a particular class (the positive class in the binary case).  
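
As a minimal sketch of the two quantities described above, the following computes p(x, beta) and the Bernoulli log-likelihood that maximum likelihood tries to increase. The names `sigmoid`, `log_likelihood`, `X_toy`, and `y_toy` are illustrative and do not come from the assignment; the toy data is made up, and labels are assumed to be coded 0/1.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: 1/(1 + exp(-z)), equal to exp(z)/(1 + exp(z))."""
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(beta, X, y):
    """Bernoulli log-likelihood of 0/1 labels y under p(x, beta)."""
    p = sigmoid(X @ beta)
    return np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# Toy data (hypothetical): a column of ones for the intercept plus one feature
X_toy = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y_toy = np.array([0.0, 0.0, 1.0, 1.0])

beta_zero = np.zeros(2)           # every probability equals 0.5
beta_good = np.array([0.0, 2.0])  # positive slope agrees with the labels

p_zero = sigmoid(X_toy @ beta_zero)
ll_zero = log_likelihood(beta_zero, X_toy, y_toy)
ll_good = log_likelihood(beta_good, X_toy, y_toy)
```

Parameters that agree with the labels (`beta_good`) give a higher log-likelihood than the all-zero vector, which is the sense in which iterating "maximizes the likelihood".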

__Digits Data Set:__ https://web.stanford.edu/~hastie/ElemStatLearn/

## STEPS
1. Import libraries
2. Import data sets: train & test sets
3. Convert data frame into numpy array
4. Parse the data into two matrices:
    - X: design matrix
    - y: response/prediction vector
5. Standardize to make the magnitude of your inputs roughly equal to the magnitude of your weights
6. Estimate the weights
7. Calculate gradient
8. Graph gradient
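
Steps 6 through 8 can be sketched as plain gradient ascent on the log-likelihood. This is one possible reading of those steps, not necessarily the method the assignment intends: the function name `fit_logistic`, the step size, the iteration count, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def fit_logistic(X, y, step=0.5, n_iter=200, seed=0):
    """Gradient ascent for binary logistic regression with 0/1 labels."""
    rng = np.random.default_rng(seed)
    # start from a small random set of parameters
    beta = rng.normal(scale=0.01, size=X.shape[1])
    grad_norms = []
    for _ in range(n_iter):
        # predicted probabilities under the current parameters
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        # gradient of the log-likelihood with respect to beta
        grad = X.T @ (y - p)
        # record the gradient magnitude so it can be graphed later
        grad_norms.append(np.linalg.norm(grad))
        # move uphill; dividing by the sample count keeps the step size stable
        beta = beta + step * grad / X.shape[0]
    return beta, grad_norms

# Toy data (hypothetical): intercept column plus one feature, classes overlap
X_toy = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, -0.5],
                  [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]])
y_toy = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

beta_hat, grad_norms = fit_logistic(X_toy, y_toy)
```

Passing `grad_norms` to `plt.plot` would give the gradient graph from step 8; the curve should fall toward zero as the parameters approach the maximum.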

## FUNCTIONS
- Count the number of instances of two chosen classes using the y (response/prediction) vector
- Train the logistic model based on two chosen classes
- Test the model based on the two chosen classes
- Notation: beta (weights), X (feature/predictor matrix)

## DATA SET
- 7291 training observations
- 2007 testing observations
- 16 x 16 grayscale images of digits
- Each row consists of the digit id (0-9) followed by the 256 grayscale values
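
The row layout above can be illustrated with a made-up row. The sketch below splits one row into its label and pixel values and reshapes the pixels into an image; the row contents are hypothetical, and the row-major pixel ordering is an assumption rather than something stated here.

```python
import numpy as np

# Hypothetical row in the layout described above: digit id, then 256 values
row = np.concatenate(([7.0], np.linspace(-1.0, 1.0, 256)))

digit_id = int(row[0])          # first entry is the digit label
pixels = row[1:257]             # next 256 entries are the grayscale values
image = pixels.reshape(16, 16)  # 16 x 16 grid (row-major order assumed)
```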

## Import Libraries

In [1]:
import sklearn
import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt

from sklearn import metrics
from sklearn import preprocessing
from sklearn.preprocessing import scale
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

from pandas import Series, DataFrame
from scipy.stats  import spearmanr
from pylab import rcParams
#import seaborn as sb

## Import Data Sets

In [2]:
train = pd.read_csv('digits_data.train', delimiter=' ', header=None)
test = pd.read_csv('digits_data.test', delimiter=' ', header=None)

In [3]:
# print size of data sets
print("Training Set: {}".format(train.shape))
print("Testing Set: {}".format(test.shape))

Training Set: (7291, 258)
Testing Set: (2007, 257)


In [4]:
# print observations for training set
train.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,248,249,250,251,252,253,254,255,256,257
0,6.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-0.631,0.862,...,0.823,1.0,0.482,-0.474,-0.991,-1.0,-1.0,-1.0,-1.0,
1,5.0,-1.0,-1.0,-1.0,-0.813,-0.671,-0.809,-0.887,-0.671,-0.853,...,-0.671,-0.033,0.761,0.762,0.126,-0.095,-0.671,-0.828,-1.0,
2,4.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,-1.0,-1.0,-0.109,1.0,-0.179,-1.0,-1.0,-1.0,-1.0,
3,7.0,-1.0,-1.0,-1.0,-1.0,-1.0,-0.273,0.684,0.96,0.45,...,1.0,0.536,-0.987,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,
4,3.0,-1.0,-1.0,-1.0,-1.0,-1.0,-0.928,-0.204,0.751,0.466,...,0.639,1.0,1.0,0.791,0.439,-0.199,-0.883,-1.0,-1.0,


In [5]:
# print observations for testing data set
test.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,247,248,249,250,251,252,253,254,255,256
0,9,-1.0,-1.0,-1.0,-1.0,-1.0,-0.948,-0.561,0.148,0.384,...,-1.0,-0.908,0.43,0.622,-0.973,-1.0,-1.0,-1.0,-1.0,-1.0
1,6,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
2,3,-1.0,-1.0,-1.0,-0.593,0.7,1.0,1.0,1.0,1.0,...,1.0,0.717,0.333,0.162,-0.393,-1.0,-1.0,-1.0,-1.0,-1.0
3,6,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
4,6,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-0.858,-0.106,...,0.901,0.901,0.901,0.29,-0.369,-0.867,-1.0,-1.0,-1.0,-1.0


## Convert Data Frame into Numpy Array

In [6]:
# as_matrix() was removed in pandas 1.0; .values returns the same numpy array
train_set = train.values
test_set = test.values

In [7]:
# all rows, start at the 1st column until the end of the matrix
train_set[:,1:]

array([[-1.   , -1.   , -1.   , ..., -1.   , -1.   ,    nan],
       [-1.   , -1.   , -1.   , ..., -0.828, -1.   ,    nan],
       [-1.   , -1.   , -1.   , ..., -1.   , -1.   ,    nan],
       ..., 
       [-1.   , -1.   , -1.   , ..., -1.   , -1.   ,    nan],
       [-1.   , -1.   , -1.   , ..., -1.   , -1.   ,    nan],
       [-1.   , -1.   , -1.   , ..., -1.   , -1.   ,    nan]])

In [8]:
train_set[:,0]

array([ 6.,  5.,  4., ...,  3.,  0.,  1.])

## Parse the Data from the Targets/Labels 

In [9]:
# for all rows, take columns 1 through 256 (the pixel values)
X_train = train_set[:,1:257] # stop at 257 to drop the trailing all-NaN column already present in the data frame

# for all rows, get only the 0th element (the digit label)
y_train = train_set[:,0]

# for all rows, start at the 1st column and go until the last column
X_test = test_set[:,1:]

# for all rows, get only the 0th element (the digit label)
y_test = test_set[:,0]

print("Training Data: {}".format(X_train.shape))
print("Training Labels: {}".format(y_train.shape))
print("Testing Data: {}".format(X_test.shape))
print("Testing Labels: {}".format(y_test.shape))

Training Data: (7291, 256)
Training Labels: (7291,)
Testing Data: (2007, 256)
Testing Labels: (2007,)
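
The hard-coded slice works because the position of the NaN column is known. A more defensive variant, sketched below on a made-up array, detects all-NaN columns before splitting labels from features. The array `raw` is a hypothetical stand-in for `train_set`, not the real data.

```python
import numpy as np

# Hypothetical stand-in for train_set: label, two pixel columns, all-NaN column
raw = np.array([[6.0, -1.0, 0.5, np.nan],
                [5.0, 0.2, -1.0, np.nan]])

# True for every column in which all entries are NaN
all_nan_cols = np.isnan(raw).all(axis=0)

# Keep the remaining columns, then split labels from features
clean = raw[:, ~all_nan_cols]
y_demo = clean[:, 0]
X_demo = clean[:, 1:]
```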


## Standardize for Similar Input & Weight Magnitude

In [10]:
# Set axis to 1 to standardize per sample/vector, rather than standardize each feature
X_train = preprocessing.scale(X_train, axis=1)
X_test = preprocessing.scale(X_test, axis=1)
print(X_train)

[[-0.80693359 -0.80693359 -0.80693359 ..., -0.80693359 -0.80693359
  -0.80693359]
 [-1.0229121  -1.0229121  -1.0229121  ..., -0.64403944 -0.82483886
  -1.0229121 ]
 [-0.62434198 -0.62434198 -0.62434198 ..., -0.62434198 -0.62434198
  -0.62434198]
 ..., 
 [-0.88870472 -0.88870472 -0.88870472 ..., -0.88870472 -0.88870472
  -0.88870472]
 [-1.34819665 -1.34819665 -1.34819665 ..., -1.34819665 -1.34819665
  -1.34819665]
 [-0.66492726 -0.66492726 -0.66492726 ..., -0.66492726 -0.66492726
  -0.66492726]]
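
To make the effect of the `axis` argument concrete, the sketch below standardizes a tiny made-up matrix both ways using plain numpy. It mirrors what `preprocessing.scale` computes (mean 0 and population standard deviation 1 along the chosen axis), but the matrix `X_demo` is illustrative only.

```python
import numpy as np

X_demo = np.array([[1.0, 2.0, 3.0],
                   [10.0, 20.0, 60.0]])

# axis=1: each row (sample) is shifted and scaled to mean 0, std 1
row_scaled = (X_demo - X_demo.mean(axis=1, keepdims=True)) / X_demo.std(axis=1, keepdims=True)

# axis=0: each column (feature) is shifted and scaled to mean 0, std 1
col_scaled = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
```

With `axis=1`, every image is normalized by its own brightness and contrast, so constant background pixels take a different value in each row, as in the output above.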


## Designate Two Classes

In [70]:
a = 0.0
b = 1.0

## Count the Number of Instances of the Two Chosen Classes

In [71]:
# count how many observations account for class a & b
def ab_count(X, y, a, b):
    
    # initialize count 
    n = 0
    
    for i in range(X.shape[0]):
        
        # identify class a or b in label vector y
        if (y[i] == a or y[i] == b):
        
            # increment the count
            n += 1
        
    # return the count
    return n

In [73]:
ab_samples = ab_count(X_train, y_train, a, b)
ab_samples

2199
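
The same count can be obtained without an explicit loop by summing a boolean mask. The helper name `ab_count_vectorized` and the label vector `y_demo` below are hypothetical and only illustrate the idea.

```python
import numpy as np

def ab_count_vectorized(y, a, b):
    """Count the entries of y that equal a or b."""
    # the mask is True where the label is a or b; summing counts the Trues
    return int(np.sum((y == a) | (y == b)))

y_demo = np.array([0.0, 1.0, 2.0, 1.0, 9.0, 0.0])
n_ab = ab_count_vectorized(y_demo, 0.0, 1.0)
```

As a sanity check on the real data, 2199 equals 1194 + 1005, the support values that the classification report lists for classes 0.0 and 1.0.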

## Fill a Matrix with the Number Count of the Two Chosen Classes

In [None]:
# allocate one row per observation in classes a & b, one column per feature
training_data = np.zeros((ab_samples, X_train.shape[1]))

## Train the Logistic Model with the Training Data Set

In [19]:
logistic = LogisticRegression()
logistic.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [21]:
print("LogisticRegression score: {}".format(logistic.score(X_train,y_train)))

LogisticRegression score: 0.982169798382


__Interpretation:__ A score of 0.98 means the model predicts the correct label for about 98% of the training digits. Because the score is computed on the same data used for fitting, it measures how well the model fits the training set rather than how well it generalizes. A score of 1.0 would mean every training label was predicted correctly.  

## Evaluate Model's Precision 

In [23]:
y_pred = logistic.predict(X_train)
from sklearn.metrics import classification_report

print(classification_report(y_train, y_pred))

             precision    recall  f1-score   support

        0.0       0.99      1.00      0.99      1194
        1.0       1.00      1.00      1.00      1005
        2.0       0.99      0.97      0.98       731
        3.0       0.99      0.98      0.98       658
        4.0       0.96      0.97      0.97       652
        5.0       0.97      0.97      0.97       556
        6.0       0.99      0.99      0.99       664
        7.0       0.98      0.99      0.99       645
        8.0       0.96      0.95      0.95       542
        9.0       0.98      0.98      0.98       644

avg / total       0.98      0.98      0.98      7291



__Interpretation:__ The model's average precision across all 10 classes is 0.98 on the training set. Per-class precision ranges from 0.96 (digits 4 and 8) to 1.00 (digit 1).   
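
For reference, the per-class precision and recall columns in the report can be reproduced from simple counts. The sketch below does this for one class on made-up labels; the helper `precision_recall` and the demo arrays are hypothetical, and the function assumes the class is predicted at least once and occurs at least once (otherwise a denominator is zero).

```python
import numpy as np

def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class, computed from raw counts."""
    tp = np.sum((y_pred == label) & (y_true == label))  # true positives
    fp = np.sum((y_pred == label) & (y_true != label))  # false positives
    fn = np.sum((y_pred != label) & (y_true == label))  # false negatives
    return tp / (tp + fp), tp / (tp + fn)

y_true_demo = np.array([0, 0, 1, 1, 1, 2])
y_pred_demo = np.array([0, 1, 1, 1, 2, 2])
prec_1, rec_1 = precision_recall(y_true_demo, y_pred_demo, 1)
```

Precision answers "of the digits predicted as this class, how many were right", while recall answers "of the digits truly in this class, how many were found".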

## Iterate through 

In [None]:
# Number of training samples (rows)
n_trains = X_train.shape[0]

# Number of features (columns)
n_features = X_train.shape[1]

# Number of classes
n_classes = 10
classes = np.asarray([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])

# Chosen Classes
a = 0.0
b = 1.0

In [62]:
# Train the logistic regression model according to the two chosen classes
def train_model(X, y, a, b, ab_size):
    
    # allocate one row per observation in classes a & b
    training_data = np.zeros((ab_size, X.shape[1]))
    training_labels = np.zeros(ab_size)
    
    # index of the next free row in training_data
    row = 0
    
    # loop through the training rows/observations
    for i in range(X.shape[0]):
        
        # identify label a or label b in y vector
        if y[i] == a or y[i] == b:
            
            # copy the identified observation and its label into the training set
            training_data[row] = X[i]
            training_labels[row] = y[i]
            row += 1
    
    # run log reg for class a & b once, after the subset is complete
    logistic = LogisticRegression()
    logistic.fit(training_data, training_labels)
    
    # return the fitted two-class model
    return logistic

In [57]:
logistic.predict?

## Test the Model 

In [None]:
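# logistic.predict does what the test model function does
logistic.predict(X_test)

# the last step
# count how many of the predicted labels match y_test


# model is the weight vector beta; y is assumed to be coded 0/1 (true for a = 0.0, b = 1.0)
def test_model(X, y, model):
    
    # beta * x: this product computes the entire dot product for each row (no loop!)
    betax = np.dot(X, model)
        
    # compute y_hat using the logistic function, the sigmoid function 
    y_hat = 1.0 / (1.0 + np.exp(-betax))
    
    # round y_hat to a hard 0/1 prediction
    y_pred = np.round(y_hat)
    
    # return the fraction of predictions that match the labels
    return np.mean(y_pred == y)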
    


In [36]:
x = np.asarray([-1, 2, 3, 4])

In [37]:
# component 
y = 1.0/(1.0+np.exp(-x))
y

array([ 0.26894142,  0.88079708,  0.95257413,  0.98201379])

In [38]:
np.round(y)

array([ 0.,  1.,  1.,  1.])

In [43]:
A = np.matrix([[-1, 2, 3, 4],[5, 6, 7, 8]])
A

matrix([[-1,  2,  3,  4],
        [ 5,  6,  7,  8]])

In [49]:
y_hat_A = 1.0/(1.0+np.exp(-A))
y_hat_A

matrix([[ 0.26894142,  0.88079708,  0.95257413,  0.98201379],
        [ 0.99330715,  0.99752738,  0.99908895,  0.99966465]])

In [50]:
np.round(y_hat_A)

matrix([[ 0.,  1.,  1.,  1.],
        [ 1.,  1.,  1.,  1.]])

In [None]:
score = logistic.fit(X_train, y_train).score(X_test, y_test)