# MNIST Classification
In this exercise we will explore the performance of several classification techniques on handwritten digits. The canonical dataset used is the MNIST dataset (https://en.wikipedia.org/wiki/MNIST_database). We will evaluate classifier performance using the train/validation/test sets provided to you.

This exercise will use TensorFlow (https://en.wikipedia.org/wiki/TensorFlow). <b>Ignore all warnings produced by the tensorflow library (or others that you are required to use in this project, e.g. scikit-learn).</b>

In [None]:
# imports
from utils import get_data_extract
from matplotlib import pyplot as plt
import numpy as np
import tensorflow as tf
import time
from sklearn.metrics import accuracy_score

# For reproducibility 
tf.random.set_seed(0)
np.random.seed(0)

Run the following line to retrieve the data and generate the training, validation, and test sets using the function <b>get_data_extract()</b> provided in <b>utils.py</b>. <br>
<b>We want you to use the training, validation and test sets provided by this function, exclusively.</b>

In [None]:
X_train, Y_train, X_val, Y_val, X_test, Y_test = get_data_extract()

# 0. Understanding the data

All images have a resolution of $28\times 28$. The feature vector of an image is a 1D array of length $28^2=784$, the row-major flattened version of the image (i.e. the rows are concatenated top-down). All values in the array are in the range $[0,1]$ and represent the grayscale value at that point. Note that to recover the original pixel value, you would have to multiply these values by $255$.
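To make the row-major layout concrete, here is a small sketch on a synthetic image. The array `image` is made up for illustration and is not taken from the dataset. It flattens a $28\times 28$ array into a 784-length vector, reshapes it back, and rescales the values to the 0-255 pixel range.

```python
import numpy as np

# Synthetic 28x28 "image" with values in [0, 1); a stand-in, not MNIST data.
rng = np.random.default_rng(0)
image = rng.random((28, 28))

# Row-major (C-order) flattening concatenates the rows top-down.
flat = image.reshape(-1)

# Element (r, c) of the image sits at index r * 28 + c of the flat vector.
r, c = 3, 5
same_pixel = flat[r * 28 + c] == image[r, c]

# Reshaping the flat vector recovers the original image exactly.
restored = flat.reshape(28, 28)

# Multiply by 255 to get back to the usual 8-bit grayscale range.
pixels = np.rint(restored * 255).astype(np.uint8)
print(flat.shape, restored.shape, pixels.min(), pixels.max())
```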

a) What are the dimensions of $X_{train}, X_{val}, X_{test}$?

In [None]:
# (1 pt) Calculate dimensions here

b) Display the first two images of $X_{train}$. (You may find the <b>np.reshape()</b> and <b>plt.imshow()</b> methods useful.)

In [None]:
# (1 pt) Display first two training images here

c) Print out the first two labels in $Y_{train}$. 

In [None]:
# (1 pt) Print first two training labels here

We will now use the MNIST data extracts above to compare various classification algorithms. <b>ONLY use the data extracts provided above. </b> We are primarily interested in two metrics: 
1. the test set accuracy (0.0 being all incorrect and 1.0 being all correct)
2. the time it takes to train the model and to produce classifications for the <b>test set</b> <br>

$X_{train}, Y_{train}$ will be used for training while $X_{test}, Y_{test}$ will be used for evaluating performance. Some models will use $X_{val}, Y_{val}$ for parameter tuning. 
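Both metrics can be measured with a few lines of code. The sketch below is one possible pattern, not a required one: `timed`, `accuracy`, and `predict_always_zero` are hypothetical helper names introduced here, and the toy arrays stand in for real data. For 1D label arrays, `accuracy` computes the same fraction of correct predictions that `accuracy_score` reports.

```python
import time
import numpy as np

def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def accuracy(y_true, y_pred):
    """Fraction of matching labels: 0.0 is all incorrect, 1.0 is all correct."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

# Toy stand-in for a model's predict step (hypothetical, for illustration only).
def predict_always_zero(X):
    return np.zeros(len(X), dtype=int)

X_toy = np.zeros((4, 784))
y_toy = np.array([0, 1, 0, 0])

y_pred, seconds = timed(predict_always_zero, X_toy)
print(f"accuracy: {accuracy(y_toy, y_pred):.3f}, prediction time: {seconds:.6f} s")
```

Wrapping the call rather than scattering `time.time()` calls keeps the training and prediction timings comparable across models.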

# 1. Logistic Regression

Repeat the exercise above for sklearn's <b>LogisticRegression()</b> algorithm. You might find the following arguments helpful when initializing the model: <b>penalty='l1', C=1.0, tol=0.01, solver='liblinear'</b>. These parameters will help speed up the algorithm significantly. Produce the training time, test set prediction accuracy score and time taken to produce predictions. Format your answers.

In [None]:
# (5 pts) Train Logistic Regression and report time
# (5 pts) Test Logistic Regression and report accuracy, time

<b>****** The following models take a long time to run, so we suggest reading all parts of the exercise first before implementing code. ******</b>

# 2. Logistic Regression with Polynomial Features

We now add polynomial features to our logistic regression. For that purpose, extend $X_{train}$ with second-moment information. Specifically, add the squared value of each pixel and the empirical covariance matrix of each image, where the columns of each 28x28 image are treated as different measurements of the corresponding rows (hint: you can view each row of X_train as a 28x28 image and then use np.cov). Now, re-run the logistic regression procedure from the previous exercise. You might find the following arguments helpful when initializing the model: <b>penalty='l1', C=8.0, tol=0.01, solver='liblinear'</b>. These parameters will help speed up the algorithm significantly, and yield good performance. Produce the training time, test set prediction accuracy score and time taken to produce predictions. Format your answers.
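The sketch below shows one possible reading of this feature construction, run on random data rather than the assignment's sets. The function name `second_moment_features` is hypothetical. It relies on the default behaviour of `np.cov`, which treats each row of its input as a variable and each column as an observation, so a 28x28 image yields a 28x28 covariance matrix.

```python
import numpy as np

def second_moment_features(X):
    """Append squared pixels and per-image covariance entries to X.

    X has shape (n_samples, 784); the result has shape (n_samples, 3 * 784).
    """
    images = X.reshape(-1, 28, 28)
    squared = X ** 2
    # np.cov treats each row as a variable and each column as an observation,
    # so every image yields a 28x28 covariance matrix between its rows.
    covs = np.stack([np.cov(img) for img in images])
    return np.hstack([X, squared, covs.reshape(len(X), -1)])

# Random stand-in data, not the assignment's training set.
rng = np.random.default_rng(0)
X_toy = rng.random((5, 784))
X_ext = second_moment_features(X_toy)
print(X_ext.shape)
```

Whatever construction is used, the same transformation has to be applied to every set the model is later evaluated on. The covariance matrix is symmetric, so keeping all 784 entries is redundant; keeping only the upper triangle would be an alternative.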

In [None]:
# (5 pts) Train Logistic Regression and report time
# (5 pts) Test Logistic Regression and report accuracy, time

Q (4 pts): Do polynomial features improve the performance of the logistic regression model? Why is it necessary to use a larger regularization coefficient in this case?

# 3. k-Nearest Neighbors

a) Repeat the classification exercises above for sklearn's kNN algorithm <b>KNeighborsClassifier()</b> using <b>k=1</b>. You might find the following arguments helpful when initializing the model: <b>algorithm='kd_tree', metric='minkowski', p=2</b>. These parameters will help speed up the algorithm significantly. However, the predictions will still be very slow. Produce the training time, test set prediction accuracy score and time taken to produce predictions. Format your answers.

In [None]:
# (4 pts) Train kNN for k=1 and report time
# (4 pts) Test kNN and report accuracy, time

b) For the kNN model trained above, what is the prediction accuracy on the <b>training set</b> $X_{train}, Y_{train}$ (this will take a while to compute)? Compare this to the prediction accuracy on the training set for the <b>LogisticRegression model</b> (without polynomial features).

In [None]:
# (2 pts) Report kNN training accuracy
# (2 pts) Report Logistic Regression training accuracy 

c) (3 pts) Does anything surprise you about the training set accuracies above? Why or why not?

d) (3 pts) Can you explain why there is such a large difference between the kNN algorithm's training and prediction times?

e) So far, we have only tried <b>k=1</b> (choose the closest neighbor) with the Euclidean distance metric <b>p=2</b>. Repeat the exercise above for the following combinations of parameters: <b> k $\in$ {1, 3}, p $\in$ {2, 3} </b>. Note that p=2 is the 2-norm (Euclidean distance) and p=3 is the 3-norm. Train each of the models on the training set and evaluate the accuracy on the <b>validation set</b> $X_{val}, Y_{val}$. <b>NOTE: this might take a while!</b> Report the validation set accuracy for each model. Format your answers.
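For reference, the parameter p selects the order of the Minkowski distance, $d_p(a, b) = \left(\sum_i |a_i - b_i|^p\right)^{1/p}$. The short sketch below evaluates it for p=2 and p=3 on a pair of made-up 2D points. The helper `minkowski` is a hypothetical name used only for illustration and is not part of sklearn.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between vectors a and b."""
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

# Made-up points for illustration.
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

d2 = minkowski(a, b, p=2)  # Euclidean distance (2-norm)
d3 = minkowski(a, b, p=3)  # 3-norm distance
print(d2, d3)
```

Because the two norms weight large coordinate differences differently, they can rank neighbors differently, which is why p is worth tuning on the validation set.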

<b>Note:</b> We recommend following the code skeleton below.

In [None]:
# Train kNN for all combinations of parameters above and report validation accuracy for each
def runKNN(X_train,Y_train, X_val, Y_val, k, p):
    # (6pts) your code below
    
    knn, val_score, train_time = None, None, None # TODO: compute
    return (knn, val_score, train_time)

best_score = 0.0
best_k, best_p, best_knn = None, None, None
best_train_time = np.inf

for k in [1, 3]:
    for p in [2, 3]:
        (knn_model, val_score, train_time) = runKNN(X_train,Y_train, X_val, Y_val, k, p)
        
        # (each (k,p) combination 1 pt) your code here

f) (1 pt) Based on the scores on the validation set, which parameters give the best model? Report the time taken to train the best model.

g) Using the best model, evaluate performance on the <b>test set</b>. Produce the prediction accuracy score and time taken to produce predictions. Please format your answers.

In [None]:
# (4 pts) Test best kNN and report accuracy, time

# 4. Simple Neural Network

We will now complete the tasks above using a simple Neural Network model.
This portal comes with a tensorflow installation (see the imports at the very top of this file).
Your task is the following:
1. Adapt the tensorflow tutorial here (https://www.tensorflow.org/tutorials/quickstart/beginner) to train a simple Neural Network model
2. You are required to use <b>ONLY</b> the $X_{train}, Y_{train}, X_{val}, Y_{val}, X_{test}, Y_{test}$ subsets provided earlier in this assignment. <b>DO NOT use any other subsets of the mnist dataset: you will have to adapt the tutorial to use the data provided in this assignment.</b>
3. You will need to write a custom one-hot encoder for the labels.
   Hint: you might find the following import useful: <b>from keras.utils import to_categorical</b>
4. Note that you will not need to use any validation in this part. Train using $X_{train}, Y_{train}$.
5. Finally, produce the following: Time taken to train the model; Time taken to produce Test set predictions; Test set accuracy
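As an illustration of the one-hot encoding mentioned in step 3, here is a minimal NumPy sketch. It assumes the labels are integer digits from 0 to 9, which is an assumption about what <b>get_data_extract()</b> returns, and the name `one_hot` is hypothetical.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Return a (n_samples, num_classes) array with a single 1.0 per row."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, num_classes), dtype=np.float32)
    # Set the column matching each label to 1.0, one row per sample.
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

# Made-up labels for illustration (assumed to be integer digits 0-9).
y_toy = np.array([3, 0, 9])
Y_toy = one_hot(y_toy)
print(Y_toy.shape)
print(Y_toy.argmax(axis=1))  # argmax recovers the original integer labels
```

With one-hot labels, a categorical cross-entropy loss is the usual match. If the labels are left as integers, the sparse variant of that loss is the usual alternative.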

<b>Note:</b> We recommend following the code skeleton below.

In [None]:
# Implement simpleNN
# Train simpleNN and report time
# Test simpleNN and report accuracy, time

def simple_nn(X, Y, X_val, Y_val, X_test, Y_test):
    
    model = tf.keras.models.Sequential([
     # (6 pts) your model here
    ])
    
    # (1 pt) one-hot encode the labels
    
    # (2 pts) compile model
    
    process = model.fit(
        # (4 pts) train model for 20 epochs and time it
        # make sure to achieve 99% training accuracy
    )
    
    # (4 pts) evaluate model and time it
    
    return process
    
simple_process = simple_nn(X_train, Y_train, X_val, Y_val, X_test, Y_test)

(2 pts) Plot the training and validation accuracy curves in the same plot.

In [None]:
# hint: use simple_process

# 5. Convolutional Neural Network

We will now complete the tasks above using a more complicated model: a Convolutional Neural Network.

Your task is the following:
1. Adapt the CNN tensorflow tutorial here (https://machinelearningmastery.com/how-to-develop-a-convolutional-neural-network-from-scratch-for-mnist-handwritten-digit-classification/) to train a Convolutional Neural Network model
2. You are required to use <b>ONLY</b> the $X_{train}, Y_{train}, X_{val}, Y_{val}, X_{test}, Y_{test}$ subsets provided earlier in this assignment. <b>DO NOT use any other subsets of the mnist dataset: you will have to adapt the tutorial to use the data provided in this assignment.</b>
3. Use the same one-hot encoder as the previous part
4. Note that you will not need to use any validation in this part. Train using $X_{train}, Y_{train}$.
5. Finally, produce the following: Time taken to train the model; Time taken to produce Test set predictions; Test set accuracy

<b>Note:</b> We recommend following the code skeleton below.

In [None]:
# Implement CNN
# Train CNN and report time
# Test CNN and report accuracy, time

from keras.utils import to_categorical
from keras import layers
from keras.optimizers import SGD

def conv_nn(X, Y, X_val, Y_val, X_test, Y_test):
    # your code below
    
    # (1 pt) reshape dataset to have a single channel (see tutorial)

    # (1 pt) one-hot encode the labels
    
    model = tf.keras.models.Sequential([
     # (8 pts) your model here
    ])
    
    # (2 pts) compile model
    
    process = model.fit(
        # (4 pts) train model for 20 epochs and time it
        # make sure to achieve 100% training accuracy
    )
    
    # (4 pts) evaluate model and time it
    
    return process
    
conv_process = conv_nn(X_train, Y_train, X_val, Y_val, X_test, Y_test)

(2 pts) Plot the training and validation accuracy curves in the same plot.

In [None]:
# hint: use conv_process

(3 pts) What do you observe in the plots showing training vs. validation accuracies? Why?

(6 pts) Comment on the features of the various algorithms you have used in this assignment and the tradeoffs between computational efficiency and accuracy.