# Graded Lab Assignment 2: Evaluate classifiers (10 points)
 
In this assignment you will optimize and compare the performance of a parametric (logistic regression) and a non-parametric (k-nearest neighbours) classifier on the MNIST dataset.

Publish your notebook (ipynb file) to your Machine Learning repository on GitHub ON TIME. We will check the last commit on the day of the deadline.

### Deadline: Friday, November 17, 23:59

This notebook consists of three parts: design, implementation, results & analysis. 
We provide you with the design of the experiment and you have to implement it and analyse the results.

### Criteria used for grading
* Explain and analyse all results.
* Make your notebook easy to read. When you are finished take your time to review it!
* Do not repeat the same chunks of code multiple times. If you need to do so, write a function.
* The implementation part of this assignment needs careful design before you start coding. You could start by writing pseudocode.
* In this exercise the insights are important. Do not hide them somewhere in the comments in the implementation, but put them in the Analysis part.
* Take care that all the figures and tables are well labeled and numbered so that you can easily refer to them.
* A plot should have a title and axes labels.
* You may find that not everything is 100% specified in this assignment. That is correct! As in real life, you will probably have to make some choices. Motivate your choices.


### Grading points distribution

* Implementation 5 points
* Results and analysis 5 points

## Design of the experiment

You do not have to keep the order of this design and are allowed to alter it if you are confident.
* Import all necessary modules. Try to use as many of the available functions as possible.
* Use the provided train and test set of the MNIST dataset.
* Pre-process the data, e.g. normalize/standardize, reformat, etc.
  Do whatever you think is necessary and motivate your choices.
* (1) Train logistic regression and k-nn using default settings.
* Use 10-fold cross validation for each classifier to optimize the performance for one parameter:
    * consult the documentation on how cross validation works in sklearn (important functions: cross_val_score(), GridSearchCV()),
    * optimize k for k-nn,
    * for logistic regression focus on the regularization parameter.
* (2) Train logistic regression and k-nn using optimized parameters.
* Show performance on the cross-validation set for (1) and (2) for both classifiers: 
    * report the average cross validation error rates (alternatively, the average accuracies - it's up to you) and their standard deviations,
    * plot the average cross validation errors (or accuracies) for different values of the parameter that you tuned.
* Compare performance on the test set for the two classifiers:
    * produce the classification report for both classifiers, consisting of precision, recall, f1-score. Explain and analyse the results.
    * print the confusion matrix for both classifiers and compare whether they misclassify the same classes. Explain and analyse the results.
* Discuss your results.
* BONUS: only continue with this part if you are confident that your implementation is complete:
    * tune more parameters of logistic regression,
    * add additional classifiers (NN, Naive Bayes, decision tree),
    * analyse an additional dataset (e.g. the Iris dataset).
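
The cross-validation step in the design above could be sketched as follows. This is a minimal sketch, not a reference solution: the helper `tune`, the candidate values for `n_neighbors` and `C`, the `max_iter=1000` setting, and the choice to wrap the scaler in a pipeline are all assumptions made for illustration. It uses the same `load_digits()` data and 1500-sample training split as the implementation cell in this notebook.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same training split as the implementation cell: the first 1500 digits.
digits = load_digits()
X_train, y_train = digits.data[:1500], digits.target[:1500]


def tune(estimator, param_grid, X, y, folds=10):
    """Grid search with k-fold cross validation (hypothetical helper).

    The scaler sits inside the pipeline, so it is refit on the training
    part of every fold and never sees the held-out part.
    """
    pipeline = make_pipeline(StandardScaler(), estimator)
    search = GridSearchCV(pipeline, param_grid, cv=folds, scoring="accuracy")
    search.fit(X, y)
    return search


# Candidate values are illustrative assumptions, not prescribed by the assignment.
knn_search = tune(KNeighborsClassifier(),
                  {"kneighborsclassifier__n_neighbors": [1, 3, 5, 7, 9]},
                  X_train, y_train)
logreg_search = tune(LogisticRegression(max_iter=1000),
                     {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},
                     X_train, y_train)

# Report mean accuracy and standard deviation over the folds per candidate.
for name, search in [("k-NN", knn_search), ("logistic regression", logreg_search)]:
    results = search.cv_results_
    print(name, "best parameters:", search.best_params_)
    for params, mean, std in zip(results["params"],
                                 results["mean_test_score"],
                                 results["std_test_score"]):
        print(f"  {params}: accuracy {mean:.3f} +/- {std:.3f}")
```

The `mean_test_score` and `std_test_score` arrays in `cv_results_` are also what you would plot against the tuned parameter to obtain the requested cross-validation curves.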

## Implementation of the experiment

In [2]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [21]:
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
import scipy.stats as stats

# Load the digits dataset and split it into a train and a test set.
# Note: load_digits() returns scikit-learn's 8x8 digits dataset (1797 samples of 64 pixels).
digits = load_digits()

X_train_mnist = digits.images[:1500].reshape(-1, 64)
X_test_mnist = digits.images[1500:].reshape(-1, 64)
y_train_mnist = digits.target[:1500]
y_test_mnist = digits.target[1500:]


# Pre-processing: standardize every feature to zero mean and unit variance.
# k-NN distances and regularized logistic regression are both sensitive to feature scale.

stats.normaltest(X_train_mnist) # Per-feature normality test: it turns out the data is not normally distributed.
                                # Standardizing rescales the features; it does not make them Gaussian.
scaler = StandardScaler()
scaler.fit(X_train_mnist)       # Fit on the training set only, then apply the same scaling to both sets.
X_train_mnist = scaler.transform(X_train_mnist)
X_test_mnist = scaler.transform(X_test_mnist)


# Instantiate logistic regression and k-NN with default settings
logreg = LogisticRegression()
knnreg = KNeighborsClassifier()

# Train logistic regression and k-NN
logreg.fit(X_train_mnist, y_train_mnist)
knnreg.fit(X_train_mnist, y_train_mnist)

# your implementation here

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
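
For the test-set comparison asked for in the design (classification report and confusion matrix for both classifiers), a minimal sketch could look like the one below. The helper `evaluate`, the variable names, and `max_iter=1000` are assumptions for illustration; both models use otherwise default settings, so the tuned parameters from cross validation would still have to be plugged in.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Same split as the implementation cell: 1500 training and 297 test digits.
digits = load_digits()
X_train, y_train = digits.data[:1500], digits.target[:1500]
X_test, y_test = digits.data[1500:], digits.target[1500:]

# Fit the scaler on the training set only and reuse it for the test set.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

labels = list(range(10))  # digit classes 0..9


def evaluate(model, name):
    """Fit on the training set, then print the report and confusion matrix
    for the test set (hypothetical helper)."""
    model.fit(X_train_std, y_train)
    y_pred = model.predict(X_test_std)
    print(name)
    print(classification_report(y_test, y_pred, labels=labels))
    matrix = confusion_matrix(y_test, y_pred, labels=labels)
    print(matrix)
    return matrix


cm_logreg = evaluate(LogisticRegression(max_iter=1000), "Logistic regression")
cm_knn = evaluate(KNeighborsClassifier(), "k-NN")
```

In each matrix, rows are the true digits and columns the predicted digits, so comparing the off-diagonal entries of `cm_logreg` and `cm_knn` shows whether the two classifiers confuse the same pairs of classes.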

## Results and analysis of the experiment

In [14]:
# discuss the results

##Testing

NormaltestResult(statistic=array([  1.00390819e+00,   1.23876220e+03,   2.12277796e+02,
         2.58245810e+02,   2.39676670e+02,   1.27746847e+03,
         8.41743480e+02,   2.24208619e+03,   3.09145770e+03,
         4.50409917e+02,   2.74446212e+02,   1.39346795e+02,
         2.03444861e+02,   6.09626778e+00,   6.64528583e+02,
         2.25671962e+03,   3.55598342e+03,   3.07284555e+02,
         5.60494644e+02,   8.87065342e+04,   1.43314789e+01,
         1.10651821e+00,   6.64921589e+02,   2.41804166e+03,
         3.82169286e+03,   2.21653648e+02,   2.53951140e+01,
         7.07593108e+00,   1.70282614e+03,   2.37567038e+00,
         3.82753159e+02,   3.28498534e+03,   1.00390819e+00,
         3.27710120e+02,   1.18070634e+00,   1.58130564e+01,
         5.69079779e+02,   2.19301384e+01,   1.86311935e+02,
         1.00390819e+00,   3.37096319e+03,   7.29741469e+02,
         1.31283623e+01,   9.76358899e+00,   2.84808996e+00,
         9.21782537e+00,   1.75280536e+02,   2.80206374e+0