# Graded Lab Assignment 2: Evaluate classifiers (10 points)
 
In this assignment you will optimize and compare the performance of a parametric (logistic regression) and a non-parametric (k-nearest neighbours) classifier on the MNIST dataset.

Publish your notebook (ipynb file) to your Machine Learning repository on GitHub ON TIME. We will check the last commit on the day of the deadline.

### Deadline: Friday, November 17, 23:59

This notebook consists of three parts: design, implementation, results & analysis. 
We provide you with the design of the experiment and you have to implement it and analyse the results.

### Criteria used for grading
* Explain and analyse all results.
* Make your notebook easy to read. When you are finished take your time to review it!
* You do not want to repeat the same chunks of code multiple times. If you need to do so, write a function.
* The implementation part of this assignment needs careful design before you start coding. You could start by writing pseudocode.
* In this exercise the insights are important. Do not hide them somewhere in the comments in the implementation, but put them in the Analysis part.
* Take care that all the figures and tables are well labeled and numbered so that you can easily refer to them.
* A plot should have a title and axes labels.
* You may find that not everything is 100% specified in this assignment. That is correct! Like in real life you probably have to make some choices. Motivate your choices.


### Grading points distribution

* Implementation 5 points
* Results and analysis 5 points

## Design of the experiment

You do not have to keep the order of this design and are allowed to alter it if you are confident.
* Import all necessary modules. Try to use as many of the available functions as possible.
* Use the provided train and test set of MNIST dataset.
* Pre-process the data, e.g. normalize/standardize, reformat, etc. Do whatever you think is necessary and motivate your choices.
* (1) Train logistic regression and k-nn using default settings.
* Use 10-fold cross validation for each classifier to optimize the performance for one parameter: 
    * consult the documentation on how cross validation works in sklearn (important functions: cross_val_score(), GridSearchCV()),
    * optimize k for k-nn,
    * for logistic regression, focus on the regularization parameter.
* (2) Train logistic regression and k-nn using optimized parameters.
* Show performance on the cross-validation set for (1) and (2) for both classifiers: 
    * report the average cross validation error rates (alternatively, the average accuracies - it's up to you) and standard deviation,
    * plot the average cross validation errors (or accuracies) for different values of the parameter that you tuned.
* Compare performance on the test set for the two classifiers:
    * produce the classification report for both classifiers, consisting of precision, recall, f1-score. Explain and analyse the results.
    * print the confusion matrix for both classifiers and compare whether they misclassify the same classes. Explain and analyse the results.
* Discuss your results.
* BONUS: only continue with this part if you are confident that your implementation is complete:
    * tune more parameters of logistic regression,
    * add additional classifiers (NN, Naive Bayes, decision tree),
    * analyse an additional dataset (e.g. the Iris dataset).
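
The cross-validation steps in the design above could be sketched roughly as follows. This is only one possible approach, not the required solution: the `Pipeline` wrapper, the variable names, and the short list of candidate values for k are assumptions made for illustration. It shows how `GridSearchCV` exposes the per-candidate mean and standard deviation of the fold scores.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same split as used in this notebook: the first 1500 samples are the training set.
digits = load_digits()
X_train, y_train = digits.data[:1500], digits.target[:1500]

# With the scaler inside the pipeline, it is re-fitted on the training part
# of every fold, so the validation part does not influence the scaling.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Candidate values are an arbitrary choice for illustration.
candidate_k = [1, 3, 5, 7, 9]
param_grid = {"knn__n_neighbors": candidate_k}

search = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=10)
search.fit(X_train, y_train)

# cv_results_ holds the mean and standard deviation over the 10 folds
# for every candidate, in the same order as the grid.
for k, mean, std in zip(candidate_k,
                        search.cv_results_["mean_test_score"],
                        search.cv_results_["std_test_score"]):
    print(f"k={k}: accuracy {mean:.3f} +/- {std:.3f}")

best_k = search.best_params_["knn__n_neighbors"]
print("best k:", best_k)
```

The same pattern should work for logistic regression by swapping the estimator and using a grid over `C`.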

## Implementation of the experiment

In [3]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [39]:
from sklearn.datasets import load_digits
import numpy as np
from sklearn import datasets # to load the dataset
from sklearn.model_selection import train_test_split #to split in train and test set
from sklearn.model_selection import cross_val_score, GridSearchCV #BONUS
from sklearn.metrics import classification_report, accuracy_score # for reporting
from scipy.spatial import distance #to calculate the Euclidean distance
from collections import Counter #to count unique occurrences of items in array, for majority voting
from sklearn.preprocessing import StandardScaler # to normalize data (NN is very sensitive to this!)
from sklearn.neighbors import KNeighborsClassifier # k nearest neighbour classifier
from sklearn.neural_network import MLPClassifier # neural network classifier
from sklearn.linear_model import LogisticRegression #logistic regression classifier

# load the dataset and split in train and test set
# (load_digits is sklearn's 8x8 digits set: 1797 samples, 64 features each)
digits = datasets.load_digits()
X_train_mnist = np.reshape(digits.images[:1500],(1500,64))
X_test_mnist = np.reshape(digits.images[1500:],(297,64))
y_train_mnist = digits.target[:1500]
y_test_mnist = digits.target[1500:]

# your implementation here
# DRAFT

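#preprocess: fit the scaler on the training set only, then apply it to both sets
#(transform returns a new array, so the result has to be assigned)
scaler = StandardScaler()
X_train_mnist = scaler.fit_transform(X_train_mnist)
X_test_mnist = scaler.transform(X_test_mnist)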


#train using default params
log_regr = LogisticRegression()
k_nn_classifier = KNeighborsClassifier()

log_regr.fit(X_train_mnist, y_train_mnist)
k_nn_classifier.fit(X_train_mnist, y_train_mnist)

#10-fold cross-validation (scoring="accuracy", so these hold accuracies; error rate = 1 - accuracy)
log_val_acc = cross_val_score(log_regr, X_train_mnist, y_train_mnist, scoring="accuracy", cv=10)
k_nn_val_acc = cross_val_score(k_nn_classifier, X_train_mnist, y_train_mnist, scoring="accuracy", cv=10)

#tune k
lst_of_k = np.arange(1, 50)
score_list_k = []
for k in lst_of_k:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train_mnist, y_train_mnist, scoring="accuracy", cv=10)
    score_list_k.append((scores.mean(), k))
sorted_list_k = sorted(score_list_k)
best_k = sorted_list_k[-1][1]

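#tune the regularization strength with GridSearchCV
#(C is the inverse of the regularization strength lambda: smaller C = stronger regularization)
lst_of_c = np.arange(1, 10)
lambda_param = {'C': lst_of_c}
#GridSearchCV runs the 10-fold cross-validation for every candidate C itself, so no loop is needed
grid_search = GridSearchCV(LogisticRegression(), lambda_param, scoring="accuracy", cv=10)
grid_search.fit(X_train_mnist, y_train_mnist)
#mean cross-validation accuracy for every candidate C
score_list_c = list(zip(grid_search.cv_results_['mean_test_score'], lst_of_c))
sorted_list_c = sorted(score_list_c)
best_c = grid_search.best_params_['C']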

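#train using customised params
log_regr_customised = LogisticRegression(C=best_c)
knn_customised = KNeighborsClassifier(n_neighbors=best_k)
log_regr_customised.fit(X_train_mnist, y_train_mnist)
knn_customised.fit(X_train_mnist, y_train_mnist)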

#cross val average accuracies and sd with different parameters

#plot use plt.plot

#show classification report
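
The plotting step left open in the draft above might look something like the sketch below. It is a hedged illustration, not the final figure: the subset of k values, the figure title, and the use of `errorbar` (rather than plain `plt.plot`) to show the standard deviation over folds are all assumptions.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Same split as used in this notebook.
digits = load_digits()
X_train, y_train = digits.data[:1500], digits.target[:1500]

# Illustrative subset of k values; the draft above scans 1..49.
k_values = [1, 3, 5, 7, 9, 11]
means, stds = [], []
for k in k_values:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X_train, y_train, scoring="accuracy", cv=10)
    means.append(scores.mean())
    stds.append(scores.std())

# Error bars show one standard deviation over the 10 folds.
fig, ax = plt.subplots()
ax.errorbar(k_values, means, yerr=stds, marker="o", capsize=3)
ax.set_title("Figure 1: 10-fold CV accuracy of k-nn for different k")
ax.set_xlabel("k (number of neighbours)")
ax.set_ylabel("Mean cross-validation accuracy")
```

Numbering the figure in its title makes it easier to refer to it from the analysis part.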


In [37]:
lst_of_k = np.arange(1, 50)
score_list_k = []
for k in lst_of_k:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train_mnist, y_train_mnist, scoring="accuracy", cv=10)
    score_list_k.append((scores.mean(), k))
sorted_list_k = sorted(score_list_k)
sorted_list_k

[(0.93350257174067397, 49),
 (0.93420023669404295, 41),
 (0.93420446414480873, 42),
 (0.93483614215319177, 44),
 (0.93484491408301618, 48),
 (0.9348539230920252, 46),
 (0.93486212180246642, 43),
 (0.93487990274129995, 45),
 (0.93551605502261359, 47),
 (0.93754729038850004, 39),
 (0.93821395705516686, 40),
 (0.93822296606417588, 38),
 (0.9382317379940005, 37),
 (0.93824074700300941, 35),
 (0.93890287893359781, 36),
 (0.94021470689058706, 34),
 (0.9421889038573491, 33),
 (0.94360147033842345, 32),
 (0.94422389251563266, 31),
 (0.94489503345523007, 29),
 (0.94555740246500286, 30),
 (0.95025128251771329, 28),
 (0.95160311777041395, 27),
 (0.95229098252811217, 25),
 (0.95230023835928601, 26),
 (0.95293614381843494, 23),
 (0.95296212346770959, 22),
 (0.95297591403496162, 24),
 (0.95364705497455904, 21),
 (0.95501691798825772, 19),
 (0.95566979408767239, 20),
 (0.95568805892785491, 18),
 (0.95629792427458149, 15),
 (0.95630618344817042, 13),
 (0.95635472559452173, 14),
 (0.95635472559452173, 

## Results and analysis of the experiment

In [None]:
# discuss the results
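
As a starting point for the discussion, the test-set comparison asked for in the design could be sketched as below. The default-parameter models, `max_iter=1000` for logistic regression, and the `off_diagonal` helper dictionary are assumptions for illustration; the tuned models from the implementation part would normally take their place.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Same split as used in this notebook.
digits = load_digits()
X_train, y_train = digits.data[:1500], digits.target[:1500]
X_test, y_test = digits.data[1500:], digits.target[1500:]

# Fit the scaler on the training set only and reuse it for the test set.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Assumption: default settings, with a higher max_iter to help convergence.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nn": KNeighborsClassifier(),
}

labels = np.arange(10)
off_diagonal = {}  # hypothetical helper: confusion matrices with the diagonal zeroed
for name, model in models.items():
    y_pred = model.fit(X_train_s, y_train).predict(X_test_s)
    print(name)
    print(classification_report(y_test, y_pred))
    cm = confusion_matrix(y_test, y_pred, labels=labels)
    print(cm)
    errors = cm.copy()
    np.fill_diagonal(errors, 0)  # keep only the misclassifications
    off_diagonal[name] = errors

# (true, predicted) pairs that both classifiers get wrong at least once.
shared = np.argwhere((off_diagonal["logistic regression"] > 0)
                     & (off_diagonal["k-nn"] > 0))
print("Confusions made by both classifiers (true, predicted):", shared.tolist())
```

Rows of the confusion matrix are the true classes and columns the predicted classes, so an off-diagonal entry at row i, column j counts digits i that were predicted as j.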