<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Lab-6" data-toc-modified-id="Lab-6-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Lab 6</a></span><ul class="toc-item"><li><span><a href="#The-Questions" data-toc-modified-id="The-Questions-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>The Questions</a></span></li><li><span><a href="#The-Code" data-toc-modified-id="The-Code-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>The Code</a></span><ul class="toc-item"><li><span><a href="#MNIST" data-toc-modified-id="MNIST-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>MNIST</a></span><ul class="toc-item"><li><span><a href="#kNN" data-toc-modified-id="kNN-1.2.1.1"><span class="toc-item-num">1.2.1.1&nbsp;&nbsp;</span>kNN</a></span></li><li><span><a href="#Logistic-Regression" data-toc-modified-id="Logistic-Regression-1.2.1.2"><span class="toc-item-num">1.2.1.2&nbsp;&nbsp;</span>Logistic Regression</a></span></li></ul></li></ul></li></ul></li></ul></div>

# Lab 6
This lab should be done with MNIST.

## The Questions

__1. Write a paragraph on how k nearest neighbors could be applied to MNIST.__

Approaching the problem of categorizing the MNIST handwriting set using the kNN algorithm presents several interesting challenges and can seem overwhelming until they are broken apart into simple actionabe steps that form a data pipeline for transforming the data.


When we look at the data we find that each instance is a 28x28 image. Flattening an image into a 1-dimensional array yields a 1x784 array, which means 784 features per instance. These features are represented by a matrix $X$. The class/value of each image (the digits 0-9) comprise a vector $Y$.


In order to train the model we need to select optimal hyperparameters. In particular we need to select the optimal value k, which is the distance around each instance to search for nearest neighbors OR in other terms, the number of neighbors to consult. In addition to the number of neighbors to consult, with sci-kit learn we have the option to specify how we weight our parameters: in a nutshell we can choose to consider all weights equally or to base our weights on the distance between a oint and its neighbors. The best way to select these hyperparameters is to try a variety of hyperparameter values using cross-validation, keeping in mind that more neightbors (high k-value) means greater chances of over-fitting. In order to help prevent this we can use cross-validation in doing our hyperparameter search.


All that is left to do once we've found optimal hyperparameters is to finish training and testing the model.

__2. Write a paragraph on how logistic regression could be applied to MNIST (you can use one vs all)__

In order to use Logistic Regression to classify the MNIST handwritten digit dataset we have to use a multi-class (as opposed to binary) classification model. In a nutshell, multi-class logistic regression comes down to linear algebra. Each class is represented by a hyer-plane (which is a higher-simensional analog to the line that we optimize in basic linear regression) onto which datapoints are projected. The hyperplane parameters are tuned in order to minimize the error between each hyperplane and training instances in their respective classes. Optimizing these parameters requires gradient descent (be it stochasitc, mini-batch, or batch) and the selection of an error/loss function that can be written in convex form. Our hyperparameters will be associated with this gradient descent, in particular the learning rate.

__3. Explain how you would find the best k for KNN. You can use Weka (instance-based classifiers) or KNeighborsClassifier() in sklearn (P.100). In particular, explain nested cross-validation, so that you are not testing the accuracy on the data you used to choose k. Perform the hyper parameter tuning.__

Nested cross-validation splits the data into N-folds, then iterates in a round-robin fashion leaving one fold out at each step to be used for testing.

I chose to perform several trials hyperparameter tuning on small samples of the dataset (5 trials on 5% of the data each) and found that each trial almost unanymously returned the best `k` as being equal to `4`, and they our beights should be distance-based. 

__4. You should use regularization with logistic regression, so that means you will choose the regularization parameters based on nested cross-validation. Perform the hyper parameter tuning.__

__5. Repeat steps 3 and 4 for multiple datasets.__

## The Code

### MNIST

#### kNN

One of the very real problems that this dataset presents is that the dataset is very large and complex, relative to say your average spreadsheet. With 42000 entries in the dataset the complexity and computation time of the problem can blow up very quickly for our purposes. In order to reduce compute time for cross-validation we can run several trials on samples of the train set, then measure the central tendency of those sampled trials and choose our hyperparameters that way.

In [5]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder as ohe
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold

In [8]:
mnist = pd.read_csv("data/mnist/train.csv").values # Load the data

In [9]:
mnist_x, mnist_y = mnist[:, 1:], mnist[:,0] # separate target from pixels
print(mnist_x.shape, mnist_y.shape)
# Outputs: (42000, 784) (42000,)

(42000, 784) (42000,)


In [11]:
# define an encoder to encode each class of digit as a numeric integer value
ohencoder = ohe(sparse=False, categories='auto')

# reshape training values
mnist_y = ohencoder.fit_transform(mnist_y.reshape(-1,1))

42000


array([[0., 1., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

In [31]:
# perform nested cross validation on multiple subsamples of the data
# in order to reduce compute time and optimize hyperparams

N_TRIALS = 5 # number of random trials
n_samples = int(len(mnist_x)*0.05) # specify percent of dataset to sample each trial
sample_indices = np.arange(n_samples)
scores = [] # variable to contain scored from each trial

for i in range(N_TRIALS):
    # select a random sample from the data
    rng = np.random.RandomState(42+i)  # reproducible results with a fixed seed
    rng.shuffle(sample_indices)
    x_shuff = mnist_x[sample_indices]
    y_shuff = mnist_y[sample_indices]
    
    # Define the range of hyperparameters to test/optimize
    grid_params = {"n_neighbors": range(3, 9),
                   "weights": ["uniform", "distance"]}

    # Use the kNN classifier
    knn = KNeighborsClassifier()

    # Provides train/test indices to split data in train/test sets. 
    # Split dataset into k consecutive folds
    # Each fold is then used once as a validation.
    # The k - 1 remaining folds form the training set.
    cv = KFold(n_splits=5, shuffle=True)

    # Nested cross-validation with parameter optimization and max core utilization
    # note that `n_jobs=-1` employs all available cores
    grid_search = GridSearchCV(knn, param_grid=grid_params, cv=cv, n_jobs=-1)
    grid_search.fit(X=x_shuff, y=y_shuff)
    nested_score = cross_val_score(grid_search, X=x_shuff, y=y_shuff, cv=cv)
    
    # store best params and their respective scores
    nested_scores.append([ grid_search.best_params_ , nested_score ])

In [32]:
# test scores almost unanymously agree that we want n_neighbors=4 and weights='distance'
nested_scores

[[array([0.85119048, 0.79761905, 0.81547619, 0.85714286, 0.85119048])],
 [{'n_neighbors': 4, 'weights': 'distance'},
  array([0.83333333, 0.81547619, 0.8452381 , 0.82738095, 0.83333333])],
 [{'n_neighbors': 3, 'weights': 'uniform'},
  array([0.83333333, 0.82142857, 0.80952381, 0.82142857, 0.78571429])],
 [{'n_neighbors': 4, 'weights': 'distance'},
  array([0.86309524, 0.7797619 , 0.80357143, 0.85714286, 0.81547619])],
 [{'n_neighbors': 4, 'weights': 'distance'},
  array([0.90238095, 0.91904762, 0.88571429, 0.88333333, 0.86904762])],
 [{'n_neighbors': 4, 'weights': 'distance'},
  array([0.87857143, 0.88333333, 0.88333333, 0.9       , 0.88571429])],
 [{'n_neighbors': 4, 'weights': 'distance'},
  array([0.91666667, 0.86428571, 0.89285714, 0.88095238, 0.89285714])],
 [{'n_neighbors': 4, 'weights': 'distance'},
  array([0.86190476, 0.90714286, 0.8952381 , 0.86190476, 0.88809524])],
 [{'n_neighbors': 4, 'weights': 'distance'},
  array([0.88333333, 0.87619048, 0.92142857, 0.86904762, 0.871428

In [None]:
# Load the test-set
test = pd.read_csv("data/mnist/train.csv").values
test_x, test_y = test[:, 1:], test[:,0]

# predict the values
predictions = grid_search.predict(test_x)

# encode test_y so we can compare predictions to test labels
test_y_encoded = ohencoder.fit_transform(test_y.reshape(-1,1))
test_y_encoded

In [75]:
correct = 0
incorrect = 0

for i, val in enumerate(predictions):
    if False in (val == test_y_encoded[i]): incorrect+=1
    else: correct+=1

In [78]:
# sanity check, make sure we have 42000 results
correct+incorrect

42000

In [79]:
# Nearly 91% accuracy. Pretty good.
correct/(42000)

0.9079285714285714

#### Logistic Regression

In [23]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold
import pandas as pd
import numpy as np

In [24]:
mnist = pd.read_csv("data/mnist/train.csv").values # Load the data

In [25]:
mnist_x, mnist_y = mnist[:, 1:], mnist[:,0] # separate target from pixels
print(mnist_x.shape, mnist_y.shape)

(42000, 784) (42000,)


In [26]:
logReg = LogisticRegression(solver='lbfgs', multi_class='multinomial', n_jobs=-1)

In [None]:
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000] }
clf = GridSearchCV(logReg, param_grid, n_jobs=-1)

clf = clf.fit(mnist_x, mnist_y)

In [9]:
logReg.fit(mnist_x, mnist_y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='multinomial',
          n_jobs=-1, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)

In [10]:
predictions = logReg.predict(mnist_x)

In [11]:
# Load the test-set
test = pd.read_csv("data/mnist/train.csv").values
test_x, test_y = test[:, 1:], test[:,0]

In [12]:
# Nearly 94% accuracy and wayyyyy lower computation time!
score = logReg.score(test_x, test_y)
print(score)

0.9380714285714286
