# HW4 Part A: Logistic Regression for Digits Data Set

Blanca Miller
<br>
CS 791
<br>
03/01/2018

__Objective:__ Fit a logistic regression model that estimates class probabilities. The digits data has K = 10 classes (digits 0-9), but each model here is binary: it separates one pair of digits at a time. We estimate the probability of the dependent/response variable using p(x, beta) = exp(beta^T * x)/(1 + exp(beta^T * x)), where x represents our data and beta represents the model parameters. The parameters are chosen by maximum likelihood: start from an initial (for example, random) set of parameters and iterate to maximize the likelihood.

The probability lies in the interval [0, 1] (strictly between 0 and 1 for any finite beta^T * x). The output p(x, beta) is the estimated probability that observation x belongs to a particular class (the positive class in the binary case).
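
As a rough illustration of the formula and of "iterate to maximize the likelihood", here is a minimal sketch using plain gradient ascent on a made-up toy problem. The names `sigmoid`, `fit_logistic`, `lr`, and `n_iter` are assumptions introduced for this example; scikit-learn's `LogisticRegression`, used later in this notebook, relies on its own regularized solvers rather than this loop.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: exp(z) / (1 + exp(z)), written in an equivalent form."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=2000, seed=0):
    """Maximize the mean log-likelihood by gradient ascent (illustrative only)."""
    rng = np.random.default_rng(seed)
    beta = rng.normal(scale=0.01, size=X.shape[1])  # small random starting parameters
    for _ in range(n_iter):
        p = sigmoid(X @ beta)                # current p(x, beta) for every row
        gradient = X.T @ (y - p) / len(y)    # gradient of the mean log-likelihood
        beta += lr * gradient                # step uphill
    return beta

# Toy 1-D data with an intercept column; the label is 1 when x > 0.
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
X_toy = np.column_stack([np.ones_like(x), x])
y_toy = (x > 0).astype(float)

beta_hat = fit_logistic(X_toy, y_toy)
probs = sigmoid(X_toy @ beta_hat)
print(np.round(probs, 3))
```

Thresholding `probs` at 0.5 turns the estimated probabilities into class predictions for the binary case.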

__Digits Data Set:__ https://web.stanford.edu/~hastie/ElemStatLearn/

## STEPS

Setup:
1. Import libraries
2. Import data sets: train & test sets
3. Convert data frame into numpy array
4. Parse the data into two matrices:
    - X: design matrix
    - y: targets/labels/response vector
5. Standardize to make the magnitude of your inputs roughly equal to the magnitude of your weights

Training:
1. Set the pairs of classes to compare
2. Count the number of training samples for the pair of classes 
3. Initialize training matrices with corresponding count for each pair of chosen classes
4. Fill training matrices with corresponding data for each pair of chosen classes

Testing:
1. Count the number of testing samples for the pair of classes 
2. Initialize testing matrices with corresponding count for each pair of chosen classes
3. Fill testing matrices with corresponding data for each pair of chosen classes
4. Make predictions about new data according to the chosen pair of classes
5. Compare the Predicted & True Test Labels for Accuracy
6. Evaluate Model's Precision for Each Pair of Classes

## FUNCTIONS
- Count the number of instances of two chosen classes using the y (response/label) vector
- Train the logistic model based on two chosen classes
- Test the model based on the two chosen classes
- Notation: beta (weights), X (feature/predictor matrix)

## DATA SET
- 7291 training observations
- 2007 testing observations
- 16 x 16 grayscale images of digits
- Each row consists of the digit id (0-9) followed by the 256 grayscale values
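
To make the row layout concrete, the sketch below splits one row into its label and a 16 x 16 image. The row here is synthetic, and the row-major pixel order is an assumption rather than something stated above.

```python
import numpy as np

# Hypothetical row in the documented layout: digit id, then 256 grayscale values.
row = np.concatenate(([6.0], np.linspace(-1.0, 1.0, 256)))

label = int(row[0])               # first entry is the digit id
image = row[1:].reshape(16, 16)   # remaining 256 values; assumes row-major pixel order

print(label, image.shape)
```

An array shaped like `image` could be passed to `plt.imshow(image, cmap='gray')` to view a digit.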

## Import Libraries

In [1]:
import sklearn
import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt

from sklearn import metrics
from sklearn import preprocessing
from sklearn.preprocessing import scale
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

from pandas import Series, DataFrame
from scipy.stats  import spearmanr
from pylab import rcParams

## Import Data Sets

In [2]:
train = pd.read_csv('digits_data.train', delimiter=' ', header=None)
test = pd.read_csv('digits_data.test', delimiter=' ', header=None)

In [3]:
# print size of data sets
print("Training Set: {}".format(train.shape))
print("Testing Set: {}".format(test.shape))

Training Set: (7291, 258)
Testing Set: (2007, 257)


In [4]:
# print observations for training set
train.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,248,249,250,251,252,253,254,255,256,257
0,6.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-0.631,0.862,...,0.823,1.0,0.482,-0.474,-0.991,-1.0,-1.0,-1.0,-1.0,
1,5.0,-1.0,-1.0,-1.0,-0.813,-0.671,-0.809,-0.887,-0.671,-0.853,...,-0.671,-0.033,0.761,0.762,0.126,-0.095,-0.671,-0.828,-1.0,
2,4.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,-1.0,-1.0,-0.109,1.0,-0.179,-1.0,-1.0,-1.0,-1.0,
3,7.0,-1.0,-1.0,-1.0,-1.0,-1.0,-0.273,0.684,0.96,0.45,...,1.0,0.536,-0.987,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,
4,3.0,-1.0,-1.0,-1.0,-1.0,-1.0,-0.928,-0.204,0.751,0.466,...,0.639,1.0,1.0,0.791,0.439,-0.199,-0.883,-1.0,-1.0,


In [5]:
# print observations for testing data set
test.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,247,248,249,250,251,252,253,254,255,256
0,9,-1.0,-1.0,-1.0,-1.0,-1.0,-0.948,-0.561,0.148,0.384,...,-1.0,-0.908,0.43,0.622,-0.973,-1.0,-1.0,-1.0,-1.0,-1.0
1,6,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
2,3,-1.0,-1.0,-1.0,-0.593,0.7,1.0,1.0,1.0,1.0,...,1.0,0.717,0.333,0.162,-0.393,-1.0,-1.0,-1.0,-1.0,-1.0
3,6,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
4,6,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-0.858,-0.106,...,0.901,0.901,0.901,0.29,-0.369,-0.867,-1.0,-1.0,-1.0,-1.0


## Convert Data Frame into Numpy Array

In [6]:
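# .values returns the underlying numpy array; as_matrix() is deprecated and absent from newer pandas releases
train_set = train.values
test_set = test.values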

In [7]:
# print all rows, start at the 1st column until the end of the matrix
train_set[:,1:]

array([[-1.   , -1.   , -1.   , ..., -1.   , -1.   ,    nan],
       [-1.   , -1.   , -1.   , ..., -0.828, -1.   ,    nan],
       [-1.   , -1.   , -1.   , ..., -1.   , -1.   ,    nan],
       ..., 
       [-1.   , -1.   , -1.   , ..., -1.   , -1.   ,    nan],
       [-1.   , -1.   , -1.   , ..., -1.   , -1.   ,    nan],
       [-1.   , -1.   , -1.   , ..., -1.   , -1.   ,    nan]])
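
The `nan` column at the end of the training array is consistent with each line of the training file ending in a trailing space, which `read_csv` with `delimiter=' '` parses as an extra empty field. That cause is an assumption about the file; the sketch below reproduces the effect on two made-up rows and shows one way to drop such a column.

```python
import io
import pandas as pd

# Two made-up rows that end with a trailing space, mimicking the suspected file layout.
raw = "6.0 -1.0 0.5 \n5.0 -0.8 0.2 \n"

frame = pd.read_csv(io.StringIO(raw), delimiter=' ', header=None)
print(frame.shape)     # includes one extra, entirely empty column

cleaned = frame.dropna(axis=1, how='all')  # drop columns that are entirely NaN
print(cleaned.shape)
```

Dropping the all-NaN column at load time would leave both data sets with the same number of columns, so later slicing would not need a hard-coded end index.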

In [8]:
# print all rows in 0th column
train_set[:,0]

array([ 6.,  5.,  4., ...,  3.,  0.,  1.])

## Parse the Data from the Targets/Labels 

In [9]:
# for all rows, take columns 1 through 256 (the 256 pixel values);
# stopping at 257 drops the trailing all-NaN column seen in the training array
# (the column comes from parsing the file, not from numpy)
X_train = train_set[:,1:257]

# for all rows, get only the 0th element (the digit label)
y_train = train_set[:,0]

# for all rows, start at the 1st column and go until the last column
X_test = test_set[:,1:]

# for all rows, get only the 0th element (the digit label)
y_test = test_set[:,0]

# Number of training samples (rows)
n_trains = X_train.shape[0]

# Number of features (columns)
n_features = X_train.shape[1]

print("Training Data: {}".format(X_train.shape))
print("Training Labels: {}".format(y_train.shape))
print("Testing Data: {}".format(X_test.shape))
print("Testing Labels: {}".format(y_test.shape))
print("Number of Training Samples: {}".format(n_trains))
print("Number of Data Features: {}".format(n_features))

Training Data: (7291, 256)
Training Labels: (7291,)
Testing Data: (2007, 256)
Testing Labels: (2007,)
Number of Training Samples: 7291
Number of Data Features: 256


## Standardize for Similar Input & Weight Magnitude

In [10]:
# Set axis to 1 to standardize per sample/vector, rather than standardize each feature
X_train = preprocessing.scale(X_train, axis=1)
X_test = preprocessing.scale(X_test, axis=1)
print(X_train)

[[-0.80693359 -0.80693359 -0.80693359 ..., -0.80693359 -0.80693359
  -0.80693359]
 [-1.0229121  -1.0229121  -1.0229121  ..., -0.64403944 -0.82483886
  -1.0229121 ]
 [-0.62434198 -0.62434198 -0.62434198 ..., -0.62434198 -0.62434198
  -0.62434198]
 ..., 
 [-0.88870472 -0.88870472 -0.88870472 ..., -0.88870472 -0.88870472
  -0.88870472]
 [-1.34819665 -1.34819665 -1.34819665 ..., -1.34819665 -1.34819665
  -1.34819665]
 [-0.66492726 -0.66492726 -0.66492726 ..., -0.66492726 -0.66492726
  -0.66492726]]


## Function: Count the Number of Samples for a Pair of Classes

In [11]:
# count how many observations account for class a & b
def class_count(X, y, a, b):
    
    # initialize count 
    n = 0
    for i in range(X.shape[0]):
        
        # identify class a or b in label vector y
        if (y[i] == a or y[i] == b):
        
            # increment the count
            n += 1
        
    # return the count
    return n

## Function: Fill a Matrix with the Samples for a Pair of Classes

In [12]:
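def class_fill(X, y, a, b, size, abX_train, aby_train):
    # copy every sample of class a or b into the preallocated output arrays (modified in place);
    # size is accepted but not used
    # initialize count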
    n = 0
    for s in range(X.shape[0]):
        
        # identify class a or b in label vector
        if (y[s] == a or y[s] == b):
            
            # fill matrix with data
            abX_train[n] = X[s]
            aby_train[n] = y[s]
            
            # increment count
            n += 1
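
The count-then-fill pattern above can also be written with a boolean mask. The following is a sketch of that alternative, not the approach used in this notebook; `select_pair` is a hypothetical helper and the demo arrays are made up.

```python
import numpy as np

def select_pair(X, y, a, b):
    """Return the rows of X, and the matching labels, whose label is a or b."""
    mask = np.isin(y, [a, b])   # True where the label is one of the two classes
    return X[mask], y[mask]

X_demo = np.arange(10.0).reshape(5, 2)
y_demo = np.array([0.0, 2.0, 1.0, 0.0, 3.0])

X_pair, y_pair = select_pair(X_demo, y_demo, 0.0, 1.0)
print(X_pair.shape, y_pair)
```

The labels returned this way are 1-D, and `mask.sum()` gives the same count that `class_count` computes.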

# TRAINING

## Designate Class Variables

In [13]:
a = 0.0
b = 1.0
c = 2.0
d = 3.0
e = 4.0 
f = 5.0

## Count the Number of Training Instances for Each Pair of Chosen Classes

In [14]:
ab_samples = class_count(X_train, y_train, a, b)
cd_samples = class_count(X_train, y_train, c, d)
ef_samples = class_count(X_train, y_train, e, f)

print("Number of Training Samples for Classes A & B: {}".format(ab_samples))
print("Number of Training Samples for Classes C & D: {}".format(cd_samples))
print("Number of Training Samples for Classes E & F: {}".format(ef_samples))

Number of Training Samples for Classes A & B: 2199
Number of Training Samples for Classes C & D: 1389
Number of Training Samples for Classes E & F: 1208


## Initialize Training Matrices with Corresponding Count for Each Pair of Chosen Classes

In [15]:
# Initialize the training data set matrix 
abX_train = np.zeros((ab_samples, X_train.shape[1]))
cdX_train = np.zeros((cd_samples, X_train.shape[1]))
efX_train = np.zeros((ef_samples, X_train.shape[1]))

# Initialize the targets/labels matrix
aby_train = np.zeros((ab_samples, 1))
cdy_train = np.zeros((cd_samples, 1))
efy_train = np.zeros((ef_samples, 1))

## Fill Training Matrices with Corresponding Data for each Pair of Chosen Classes

In [16]:
class_fill(X_train, y_train, a, b, ab_samples, abX_train, aby_train)
class_fill(X_train, y_train, c, d, cd_samples, cdX_train, cdy_train)
class_fill(X_train, y_train, e, f, ef_samples, efX_train, efy_train)

print("Size of Training Matrix for Classes A & B: {}".format(abX_train.shape))
print("Size of Training Matrix for Classes C & D: {}".format(cdX_train.shape))
print("Size of Training Matrix for Classes E & F: {}".format(efX_train.shape))

Size of Training Matrix for Classes A & B: (2199, 256)
Size of Training Matrix for Classes C & D: (1389, 256)
Size of Training Matrix for Classes E & F: (1208, 256)


## Train the Logistic Model with the Training Data Set

In [17]:
# ravel() flattens the (n, 1) label matrices to 1-D, the shape scikit-learn expects
ab_logistic = LogisticRegression()
ab_logistic.fit(abX_train, aby_train.ravel())

cd_logistic = LogisticRegression()
cd_logistic.fit(cdX_train, cdy_train.ravel())

ef_logistic = LogisticRegression()
ef_logistic.fit(efX_train, efy_train.ravel())

print("LogisticRegression Score: {}".format(ab_logistic.score(abX_train, aby_train.ravel())))
print("LogisticRegression Score: {}".format(cd_logistic.score(cdX_train, cdy_train.ravel())))
print("LogisticRegression Score: {}".format(ef_logistic.score(efX_train, efy_train.ravel())))


LogisticRegression Score: 1.0
LogisticRegression Score: 1.0
LogisticRegression Score: 1.0


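__Interpretation:__ A score of 1.0 means that each model classifies every sample in its own training subset correctly. Because this is accuracy on the same data the model was fit to, it shows that each pair of digits is separable in the training set, but it does not measure how well the model generalizes. Noisier or overlapping data would typically give a training score below 1.0. The testing section that follows evaluates the models on unseen data.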

# TESTING 

## Count the Number of Testing Instances for Each Pair of Chosen Classes

In [18]:
ab_test_samples = class_count(X_test, y_test, a, b)
cd_test_samples = class_count(X_test, y_test, c, d)
ef_test_samples = class_count(X_test, y_test, e, f)

print("Number of Testing Samples for Classes A & B: {}".format(ab_test_samples))
print("Number of Testing Samples for Classes C & D: {}".format(cd_test_samples))
print("Number of Testing Samples for Classes E & F: {}".format(ef_test_samples))

Number of Testing Samples for Classes A & B: 623
Number of Testing Samples for Classes C & D: 364
Number of Testing Samples for Classes E & F: 360


## Initialize Testing Matrices with Corresponding Count for Each Pair of Chosen Classes

In [19]:
# Initialize the testing data set matrix
abX_test = np.zeros((ab_test_samples, X_test.shape[1]))
cdX_test = np.zeros((cd_test_samples, X_test.shape[1]))
efX_test = np.zeros((ef_test_samples, X_test.shape[1]))

# Initialize the targets/labels matrix
aby_test = np.zeros((ab_test_samples, 1))
cdy_test = np.zeros((cd_test_samples, 1))
efy_test = np.zeros((ef_test_samples, 1))

## Fill Testing Matrices with Corresponding Data for each Pair of Chosen Classes

In [20]:
class_fill(X_test, y_test, a, b, ab_test_samples, abX_test, aby_test)
class_fill(X_test, y_test, c, d, cd_test_samples, cdX_test, cdy_test)
class_fill(X_test, y_test, e, f, ef_test_samples, efX_test, efy_test)

print("Size of Testing Matrix for Classes A & B: {}".format(abX_test.shape))
print("Size of Testing Matrix for Classes C & D: {}".format(cdX_test.shape))
print("Size of Testing Matrix for Classes E & F: {}".format(efX_test.shape))

Size of Testing Matrix for Classes A & B: (623, 256)
Size of Testing Matrix for Classes C & D: (364, 256)
Size of Testing Matrix for Classes E & F: (360, 256)


## Make Predictions about New Data According to the Chosen Pair of Classes 

In [21]:
ab_y_pred = ab_logistic.predict(abX_test)
cd_y_pred = cd_logistic.predict(cdX_test)
ef_y_pred = ef_logistic.predict(efX_test)

ab_pred_size = ab_y_pred.shape[0]
cd_pred_size = cd_y_pred.shape[0]
ef_pred_size = ef_y_pred.shape[0]

print("Predictions for Classes A & B: \n{}".format(ab_y_pred))
print("Predictions for Classes C & D: \n{}".format(cd_y_pred))
print("Predictions for Classes E & F: \n{}".format(ef_y_pred))

Predictions for Classes A & B: 
[ 0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  1.  0.  1.  0.  0.  1.  0.
  0.  0.  0.  1.  1.  0.  0.  1.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  1.  0.  1.  0.  0.  0.  1.  0.  1.  0.  0.  0.  0.  0.  0.  0.
  0.  1.  0.  0.  0.  1.  1.  0.  1.  0.  1.  0.  0.  1.  0.  0.  1.  0.
  0.  0.  0.  0.  1.  0.  1.  1.  1.  1.  0.  1.  0.  0.  0.  0.  1.  0.
  0.  1.  1.  0.  1.  1.  1.  0.  1.  0.  0.  0.  1.  1.  1.  1.  0.  1.
  0.  0.  0.  1.  1.  0.  0.  1.  1.  0.  0.  0.  1.  0.  0.  1.  1.  0.
  0.  0.  1.  1.  0.  1.  0.  1.  0.  1.  0.  1.  1.  0.  0.  0.  1.  1.
  0.  1.  1.  0.  0.  1.  0.  0.  1.  1.  0.  0.  1.  0.  0.  1.  0.  1.
  0.  0.  0.  1.  0.  0.  1.  0.  1.  0.  0.  0.  1.  1.  0.  0.  0.  0.
  0.  0.  0.  1.  0.  0.  0.  0.  1.  0.  0.  1.  1.  1.  0.  0.  1.  0.
  0.  1.  1.  0.  1.  0.  0.  0.  0.  0.  1.  0.  0.  0.  1.  0.  1.  1.
  1.  1.  0.  0.  0.  1.  1.  0.  0.  1.  1.  1.  0.  0.  1.  0.  0.  0.
  1.  0.  0.  0.  0

## Compare the Predicted & True Test Labels for Accuracy

In [22]:
a_count = 0
# use i as the loop index so the class variable a is not overwritten
for i in range(ab_pred_size):
    if ab_y_pred[i] == aby_test[i][0]:
        a_count += 1
        
print("Correctly Classified Labels: {}, {}".format(a_count, ab_test_samples))

Correctly Classified Labels: 618, 623


In [23]:
c_count = 0
# use i as the loop index so the class variable c is not overwritten
for i in range(cd_pred_size):
    if cd_y_pred[i] == cdy_test[i][0]:
        c_count += 1
        
print("Correctly Classified Labels: {}, {}".format(c_count, cd_test_samples))

Correctly Classified Labels: 346, 364


In [24]:
e_count = 0
# use i as the loop index so the class variable e is not overwritten
for i in range(ef_pred_size):
    if ef_y_pred[i] == efy_test[i][0]:
        e_count += 1
        
print("Correctly Classified Labels: {}, {}".format(e_count, ef_test_samples))

Correctly Classified Labels: 355, 360
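
The counting loops above can be condensed into a vectorized comparison. This is a sketch on made-up arrays; the final loop turns the three reported counts into accuracies.

```python
import numpy as np

y_pred = np.array([0.0, 1.0, 1.0, 0.0])
y_true = np.array([[0.0], [1.0], [0.0], [0.0]])  # column vector, shaped like aby_test

# Flatten the labels first: comparing shape (4,) with shape (4, 1) would
# broadcast to a 4 x 4 matrix instead of an element-wise comparison.
matches = y_pred == y_true.ravel()
n_correct = int(matches.sum())
accuracy = matches.mean()
print(n_correct, len(y_pred), accuracy)

# Accuracy implied by the counts reported above (correct, total).
for correct, total in [(618, 623), (346, 364), (355, 360)]:
    print(round(correct / total, 3))
```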


## Evaluate Model's Precision for Each Pair of Classes

In [25]:
print(classification_report(aby_test, ab_y_pred))
print(classification_report(cdy_test, cd_y_pred))
print(classification_report(efy_test, ef_y_pred))

             precision    recall  f1-score   support

        0.0       0.99      1.00      0.99       359
        1.0       1.00      0.98      0.99       264

avg / total       0.99      0.99      0.99       623

             precision    recall  f1-score   support

        2.0       0.96      0.95      0.95       198
        3.0       0.94      0.95      0.95       166

avg / total       0.95      0.95      0.95       364

             precision    recall  f1-score   support

        4.0       0.99      0.98      0.99       200
        5.0       0.98      0.99      0.98       160

avg / total       0.99      0.99      0.99       360
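
For reference, the precision, recall, and f1-score columns can be derived from counts of true positives, false positives, and false negatives. The sketch below computes them by hand on made-up labels; `precision_recall_f1` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np

def precision_recall_f1(y_true, y_pred, positive):
    """Metrics for one class treated as positive (no guard against empty classes)."""
    tp = np.sum((y_pred == positive) & (y_true == positive))  # true positives
    fp = np.sum((y_pred == positive) & (y_true != positive))  # false positives
    fn = np.sum((y_pred != positive) & (y_true == positive))  # false negatives
    precision = tp / (tp + fp)   # share of positive predictions that are right
    recall = tp / (tp + fn)      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

y_true = np.array([2.0, 2.0, 2.0, 3.0, 3.0, 3.0])
y_pred = np.array([2.0, 2.0, 3.0, 3.0, 3.0, 2.0])

p, r, f1 = precision_recall_f1(y_true, y_pred, positive=2.0)
print(p, r, f1)
```

Each row of a classification report corresponds to treating that row's class as the positive one, and the support column is the number of true samples of that class.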

