# Graded Lab Assignment 2: Evaluate classifiers (10 points)
 
In this assignment you will optimize and compare the performance of a parametric classifier (logistic regression) and a non-parametric classifier (k-nearest neighbours) on the MNIST dataset.

Publish your notebook (ipynb file) to your Machine Learning repository on GitHub ON TIME. We will check the last commit on the day of the deadline.

### Deadline Friday, November 17, 23:59.

This notebook consists of three parts: design, implementation, results & analysis. 
We provide you with the design of the experiment and you have to implement it and analyse the results.

### Criteria used for grading
* Explain and analyse all results.
* Make your notebook easy to read. When you are finished take your time to review it!
* Do not repeat the same chunks of code multiple times. If you need to do so, write a function.
* The implementation part of this assignment needs careful design before you start coding. You could start by writing pseudocode.
* In this exercise the insights are important. Do not hide them in comments in the implementation; put them in the Analysis part.
* Take care that all the figures and tables are well labeled and numbered so that you can easily refer to them.
* A plot should have a title and axes labels.
* You may find that not everything is 100% specified in this assignment. That is intentional! As in real life, you will probably have to make some choices. Motivate your choices.


### Grading points distribution

* Implementation 5 points
* Results and analysis 5 points

## Design of the experiment

You do not have to keep the order of this design and are allowed to alter it if you are confident.
* Import all necessary modules. Try to use as many of the available functions as possible.
* Use the provided train and test set of the MNIST dataset.
* Pre-process the data, e.g. normalize/standardize, reformat, etc.
  Do whatever you think is necessary and motivate your choices.
* (1) Train logistic regression and k-nn using default settings.
* Use 10-fold cross validation for each classifier to optimize the performance for one parameter:
    * consult the documentation on how cross validation works in sklearn (important functions: cross_val_score(), GridSearchCV()),
    * optimize k for k-nn,
    * for logistic regression, focus on the regularization parameter.
* (2) Train logistic regression and k-nn using optimized parameters.
* Show performance on the cross-validation set for (1) and (2) for both classifiers: 
    * report the average cross validation error rates (alternatively, the average accuracies - it's up to you) and standard deviation,
    * plot the average cross validation errors (or accuracies) for different values of the parameter that you tuned.
* Compare performance on the test set for the two classifiers:
    * produce the classification report for both classifiers, consisting of precision, recall, f1-score. Explain and analyse the results.
    * print the confusion matrix for both classifiers and compare whether they misclassify the same classes. Explain and analyse the results.
* Discuss your results.
* BONUS: only continue with this part if you are confident that your implementation is complete
    * tune more parameters of logistic regression,
    * add additional classifiers (NN, Naive Bayes, decision tree),
    * analyse an additional dataset (e.g. the Iris dataset).
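The relationship between `cross_val_score()` and `GridSearchCV()` mentioned in the design can be illustrated with a minimal sketch. It assumes scikit-learn's `load_digits` data and an arbitrary candidate list `candidate_ks`; neither is prescribed by the assignment. For each candidate value, `GridSearchCV` runs the same kind of 10-fold cross validation that `cross_val_score` runs for a single setting, so in recent scikit-learn versions the two means should agree.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)

# Hypothetical candidate values for k, chosen only for illustration.
candidate_ks = [1, 3, 5]

# Grid search: one 10-fold cross validation per candidate value.
grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": candidate_ks}, cv=10)
grid.fit(X, y)

# The same 10-fold cross validation done by hand for a single value (k=3).
single_scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=10)
grid_mean_k3 = grid.cv_results_["mean_test_score"][candidate_ks.index(3)]

print("cross_val_score mean for k=3:", single_scores.mean())
print("GridSearchCV mean for k=3:   ", grid_mean_k3)
print("best k according to the grid:", grid.best_params_["n_neighbors"])
```

`GridSearchCV` is the more convenient choice when several values have to be compared, because it also stores the per-candidate standard deviations in `cv_results_["std_test_score"]`.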

## Implementation of the experiment

In [2]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [48]:
from sklearn.datasets import load_digits
# Load the digits dataset (scikit-learn's 8x8 digits, 1797 samples) and split it into a train and a test set.
digits = load_digits()
X_train_mnist = reshape(digits.images[:1500],(1500,64))
X_test_mnist = reshape(digits.images[1500:],(297,64))
y_train_mnist = digits.target[:1500]
y_test_mnist = digits.target[1500:]

# your implementation here
from sklearn.model_selection import train_test_split #to split in train and cv
from sklearn.model_selection import cross_val_score #for cross validation
from sklearn.model_selection import GridSearchCV #for optimization using cv
from sklearn.preprocessing import StandardScaler #for scaling
from sklearn.linear_model import LogisticRegression #logistic regression classifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score # for reporting
from scipy.spatial import distance #to calculate the Euclidean distance
from collections import Counter #to do majority voting

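# Standardize the features: fit the scaler on the training set only,
# then apply the same transformation to both the training and the test set.
scaler = StandardScaler()
scaler.fit(X_train_mnist)
X_train_mnist = scaler.transform(X_train_mnist)
X_test_mnist = scaler.transform(X_test_mnist)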


In [46]:
# Using the default parameters
crossval_accuracy_lr = cross_val_score(LogisticRegression(), X_train_mnist, y_train_mnist, cv=10)
print(crossval_accuracy_lr.mean())

crossval_accuracy_nn = cross_val_score(KNeighborsClassifier(), X_train_mnist, y_train_mnist, cv=10)
print(crossval_accuracy_nn.mean())

0.943464257824
0.961582938171
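The design asks for the standard deviation as well as the mean, and the grading criteria ask for a function instead of repeated code. A minimal sketch of such a helper follows; the name `summarize_cv` is hypothetical. Two choices here are assumptions rather than requirements: the scaler is wrapped in a pipeline so that it is refit inside every fold, and `max_iter=1000` is set for logistic regression only to avoid convergence warnings (so it is not strictly the default setting).

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def summarize_cv(estimator, X, y, folds=10):
    """Return (mean, std) of the cross-validated accuracy of an estimator."""
    scores = cross_val_score(estimator, X, y, cv=folds)
    return scores.mean(), scores.std()


X, y = load_digits(return_X_y=True)
X_train, y_train = X[:1500], y[:1500]  # same train split as in the notebook

models = {
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "k-nn": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

# One call per classifier instead of copy-pasted cross-validation code.
summary = {name: summarize_cv(model, X_train, y_train) for name, model in models.items()}
for name, (mean, std) in summary.items():
    print(f"{name}: accuracy {mean:.3f} +/- {std:.3f}, error rate {1 - mean:.3f}")
```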


In [49]:
# Optimize the parameters

param_grid_regularization = [{'C': [0.1, 0.5, 1, 5, 10, 20, 50, 100]}]

GS_LR = GridSearchCV(LogisticRegression(), param_grid_regularization, cv=10)
# The search must be fitted before best_params_ and cv_results_ are available
GS_LR.fit(X_train_mnist, y_train_mnist)
OptimizedRegularization = GS_LR.best_params_
print(GS_LR.cv_results_)


param_grid_k = [{'n_neighbors': [1, 2, 4, 8, 16, 30, 50, 100]}]

# Use the grid over k here, not the grid over C
GS_NN = GridSearchCV(KNeighborsClassifier(), param_grid_k, cv=10)
GS_NN.fit(X_train_mnist, y_train_mnist)
OptimizedK = GS_NN.best_params_
print(GS_NN.cv_results_)

#scores_logreg = cross_val_score(LogisticRegression(C=50.0), X_train_mnist, y_train_mnist, scoring='accuracy', cv=10)
#print(scores_logreg.mean())

#scores_kneigh = cross_val_score(KNeighborsClassifier(n_neighbors=1000), X_train_mnist, y_train_mnist, scoring='accuracy', cv=10)
#print(scores_kneigh.mean())

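The design also asks for a plot of the average cross validation accuracy against the tuned parameter. A possible sketch is given below, assuming a fitted `GridSearchCV` object; the helper name `plot_cv_curve`, the figure title and the shortened grid of k values are illustrative choices. The non-interactive `Agg` backend is selected only so that the sketch runs outside a notebook; inside a notebook that line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for running outside a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier


def plot_cv_curve(grid, param_name, title):
    """Plot mean CV accuracy (with std as error bars) for each tried parameter value."""
    values = [params[param_name] for params in grid.cv_results_["params"]]
    means = grid.cv_results_["mean_test_score"]
    stds = grid.cv_results_["std_test_score"]
    fig, ax = plt.subplots()
    ax.errorbar(values, means, yerr=stds, marker="o", capsize=3)
    ax.set_xlabel(param_name)
    ax.set_ylabel("mean 10-fold CV accuracy")
    ax.set_title(title)
    return ax


X, y = load_digits(return_X_y=True)
grid_knn = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 2, 4, 8, 16]}, cv=10)
grid_knn.fit(X[:1500], y[:1500])

ax = plot_cv_curve(grid_knn, "n_neighbors", "Figure 1: k-nn accuracy vs. number of neighbours")
```

The same helper can be reused for logistic regression by passing the fitted search over `C` and `"C"` as the parameter name; a logarithmic x-axis may be more readable for that grid.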

In [None]:
# Train using the optimized parameters
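One way this step could look is sketched below for k-nn; the same pattern would apply to logistic regression. The value `best_k = 3` is only a placeholder and not a result of this notebook; it should be replaced by the value found by the grid search. The scaler is fitted on the training data and reused on the test data.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test = X[:1500], X[1500:]  # same split as in the notebook
y_train, y_test = y[:1500], y[1500:]

# Fit the scaler on the training data only, then transform both sets.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

best_k = 3  # placeholder; use the k selected by your own grid search
knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_train_s, y_train)
y_pred = knn.predict(X_test_s)

# Per-class precision, recall and f1-score on the held-out test set.
report = classification_report(y_test, y_pred)
# Rows are true digits, columns are predicted digits.
cm = confusion_matrix(y_test, y_pred, labels=list(range(10)))

print(report)
print(cm)
```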

## Results and analysis of the experiment

In [11]:
# discuss the results
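To support the comparison of which classes both classifiers misclassify, the off-diagonal entries of the two confusion matrices can be ranked and intersected. The sketch below uses two small made-up 3x3 matrices purely for illustration (they are not results from this experiment), and the helper name `top_confusions` is hypothetical.

```python
import numpy as np


def top_confusions(cm, n=3):
    """Return the n largest off-diagonal entries as (true, predicted, count) tuples."""
    off = np.array(cm, dtype=int, copy=True)
    np.fill_diagonal(off, 0)  # ignore correct predictions
    order = np.argsort(off, axis=None)[::-1][:n]  # flat indices, largest first
    rows, cols = np.unravel_index(order, off.shape)
    return [(int(r), int(c), int(off[r, c])) for r, c in zip(rows, cols) if off[r, c] > 0]


# Made-up confusion matrices (rows: true class, columns: predicted class).
cm_a = [[5, 1, 0],
        [0, 4, 2],
        [3, 0, 6]]
cm_b = [[6, 0, 0],
        [0, 3, 3],
        [2, 1, 6]]

top_a = top_confusions(cm_a, n=2)
top_b = top_confusions(cm_b, n=2)

# (true, predicted) pairs that appear among the top confusions of both classifiers.
shared = {(t, p) for t, p, _ in top_a} & {(t, p) for t, p, _ in top_b}

print("classifier A:", top_a)
print("classifier B:", top_b)
print("confused by both:", sorted(shared))
```

Replacing `cm_a` and `cm_b` with the confusion matrices of the two trained classifiers gives a compact list of digit pairs to discuss in the analysis.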