# Graded Lab Assignment 2: Evaluate classifiers (10 points)
 
In this assignment you will optimize and compare the perfomance of a parametric (logistic regression) and non-parametric (k-nearest neighbours) classifier on the MNIST dataset.

Publish your notebook (ipynb file) to your Machine Learning repository on Github ON TIME. We will check the last commit on the day of the deadline.  

### Deadline Friday, November 17, 23:59.

This notebook consists of three parts: design, implementation, results & analysis. 
We provide you with the design of the experiment and you have to implement it and analyse the results.

### Criteria used for grading
* Explain and analyse all results.
* Make your notebook easy to read. When you are finished take your time to review it!
* You do not want to repeat the same chunks of code multiply times. If your need to do so, write a function. 
* The implementation part of this assignment needs careful design before you start coding. You could start by writing pseudocode.
* In this exercise the insights are important. Do not hide them somewhere in the comments in the implementation, but put them in the Analysis part
* Take care that all the figures and tables are well labeled and numbered so that you can easily refer to them.
* A plot should have a title and axes labels.
* You may find that not everything is 100% specified in this assignment. That is correct! Like in real life you probably have to make some choices. Motivate your choices.


### Grading points distribution

* Implementation 5 points
* Results and analysis 5 points

## Design of the experiment

You do not have to keep the order of this design and are allowed to alter it if you are confident.
* Import all necessary modules. Try to use as much of the available functions as possible. 
* Use the provided train and test set of MNIST dataset.
* Pre-process data eg. normalize/standardize, reformat, etc.           
  Do whatever you think is necessary and motivate your choices.
* (1) Train logistic regression and k-nn using default settings.
* Use 10-fold cross validation for each classifier to optimize the performance for one parameter: 
    * consult the documentation on how cross validation works in sklearn (important functions:             cross_val_score(), GridSearchCV()).
    * Optimize k for k-nn,
    * for logistic regression focus on the regularization parameter,
* (2) Train logistic regression and k-nn using optimized parameters.
* Show performance on the cross-validation set for (1) and (2) for both classifiers: 
    * report the average cross validation error rates (alternatively, the average accuracies - it's up to you) and standard deviation,
    * plot the average cross valildation errors (or accuracies) for different values of the parameter that you tuned. 
* Compare performance on the test set for two classifiers:
    * produce the classification report for both classifiers, consisting of precision, recall, f1-score. Explain and analyse the results.
    * print confusion matrix for both classifiers and compare whether they missclassify the same  classes. Explain and analyse the results.
* Discuss your results.
* BONUS: only continue with this part if you are confident that your implemention is complete 
    * tune more parameters of logistic regression
    * add additional classifiers (NN, Naive Bayes, decision tree), 
    * analyse additional dataset (ex. Iris dataset)

## Implementation of the experiment

In [335]:
%pylab
from sklearn import datasets
from sklearn.model_selection import train_test_split 
from sklearn.model_selection import cross_val_score 
from sklearn.metrics import classification_report, accuracy_score 
from sklearn.neighbors import KNeighborsClassifier
from sklearn import linear_model, datasets
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support, confusion_matrix

Using matplotlib backend: Qt5Agg
Populating the interactive namespace from numpy and matplotlib


In [337]:
# import some data to play with
digits = datasets.load_digits()
X_train_mnist = reshape(digits.images[:1500],(1500,64))
X_test_mnist = reshape(digits.images[1500:],(297,64))
y_train_mnist = digits.target[:1500]
y_test_mnist = digits.target[1500:]

#scaling the train and test arrays
scaler = StandardScaler()
scaler.fit(X_train_mnist)
scaler.fit(X_test_mnist)
X_train_transform = scaler.transform(X_train_mnist)
X_test_transform = scaler.transform(X_test_mnist)


## logistic regression

##### Default model

In [338]:
model_lr = LogisticRegression() #creating a default model where C = 5
model_lr.fit(X_train_transform, y_train_mnist)
#Use 10-fold crossvalidation to calculate the average accuracy
cv_score_lr = cross_val_score(model_lr, X_train_transform, y_train_mnist, cv=10, scoring='accuracy')
avgscore_lr = sum(cv_score_lr)/10
print('average accuracy default model: ', avgscore_lr)

average accuracy defaualt model:  0.941446300269


#### Optimizing C

##### Average cross validation errors plot LR

In [344]:
Cplt = [c/8 for c in range(1,21)] #from 0.125 to 2.5 in steps of 0.125 

scoreplt = []
for c in Cplt:
    lr = LogisticRegression(C = c)
    scores_lr = cross_val_score(lr, X_train_transform, y_train_mnist, cv=10, scoring='accuracy')
    scoreplt.append(sum(scores_lr)/10)
    
plt.plot(Cplt, scoreplt)
plt.show()

Our best C seems to lie around 0.5. In order to see find the best C we iterate from around 0.375 to 0.625 in smaller chunks in order to find the value where cv_score is highest

In [349]:
C = [i/50 for i in list(range(18, 33))] # steps of 0.02 from 0.36 to 0.64

cv_scores_C = dict()

for c in C: 
    # For each value in C create a LR model with C = c and caalculate its accuracy through a 10-fold cross validation
    lr = LogisticRegression(C = c)
    scores_lr = cross_val_score(lr, X_train_transform, y_train_mnist, cv=10, scoring='accuracy')
    cv_scores_C[c] = sum(scores_lr)/10

In [350]:
# 0.36 to 1.0
lists = sorted(cv_scores_C.items()) # sorted by key, return a list of tuples
x, y = zip(*lists) # unpack a list of pairs into two tuples

plt.plot(x,y)
plt.show()

Best score when $0.48 < C < 0.54$ with an average accuracy of 0.94540
We take C = $0.5$ for simplicities sake

## K-Nearest Neighbours

##### Default model

In [154]:
model_knn = KNeighborsClassifier()
model_knn.fit(X_train_transform, y_train_mnist)
cv_score_knn = cross_val_score(model_knn, X_train_transform, y_train_mnist, cv=10, scoring='accuracy')
avgscore_knn = sum(cv_score_knn)/10
print('prediction probability: ', avgscore_knn)

prediction probability:  0.948266646885


#### Optimizing K

In [342]:
# creating odd list of K for KNN
myList = list(range(1,15))

# subsetting just the odd ones to not deal with ties
neighbors = list(filter(lambda x: x % 2 != 0, myList))

# empty dict that will hold {k: cv score}
cv_scores = dict()

for k in neighbors:
    # For each value in neigbours create a nn model with n_neighbors = k and score through a 10-fold cross validation
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train_transform, y_train_mnist, cv=10, scoring='accuracy')
    cv_scores[k] = sum(scores)/10
    
best_k_cv = max(cv_scores.values()) # find best score
for key, value in cv_scores.items():
    # find k of best score
    if value == best_k_cv:
        print('With cross validation, the best k = ', key, "with accuracy: ", value)
        break


With cross validation, the best k =  1 with accuracy:  0.959534823035


##### Average cross validation errors plot knn

In [352]:
# creating odd list of K for KNN
myList = list(range(1,11))

# subsetting just the odd ones
neighbors = list(filter(lambda x: x % 2 != 0, myList))

# empty list that will hold cv scores
cv_scores = []

for k in range(1,11):
    # For each value in neigbours create a nn model with n_neighbors = k and score through a 10-fold cross validation
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train_transform, y_train_mnist, cv=10, scoring='accuracy')
    cv_scores.append(sum(scores)/10)
plt.plot(myList, cv_scores)
plt.show()

Eventhough k = 3 is close and would carry more information, since k = 1 is the best fit we will continue with that

### Training Logistic regression with optimized C (C = 0.5)

In [353]:
model_lr_opt = LogisticRegression(C = 0.5)
model_lr_opt.fit(X_train_transform, y_train_mnist)
cv_score_lr_opt = cross_val_score(model_lr_opt, X_train_transform, y_train_mnist, cv=10, scoring='accuracy')
avgscore_lr_opt = sum(cv_score_lr_opt)/10 # Average accuracy
standarddev_lr_opt = std(cv_score_lr_opt) #standard deviation
print('average cv accuracy: ', avgscore_lr_opt)
print('standard deviation: ', standarddev_lr_opt)

average cv accuracy:  0.945400039237
standard deviation:  0.0214456571311


In [354]:
predtrain = model_lr_opt.predict(X_train_transform)
print("accuracy on trainset: ", accuracy_score(predtrain, y_train_mnist))
predtest = model_lr_opt.predict(X_test_transform)
print("accuracy on testset: ", accuracy_score(predtest, y_test_mnist))
print('---')
print('precision, recall, fscore and support of trainset with C = 0.5: ')
print(precision_recall_fscore_support(y_train_mnist, predtrain, average='weighted'))
print('precision, recall, fscore and support of testset with C = 0.5: ')
print(precision_recall_fscore_support(y_test_mnist, predtest, average='weighted'))

accuracy on trainset:  0.984666666667
accuracy on testset:  0.888888888889
---
precision, recall, fscore and support of trainset with C = 0.5: 
(0.98467686336527849, 0.98466666666666669, 0.98465223025653814, None)
precision, recall, fscore and support of testset with C = 0.5: 
(0.89591581095807415, 0.88888888888888884, 0.88674841370887081, None)


### Training 1-Nearest Neighbour

In [355]:
model_1nn = KNeighborsClassifier(n_neighbors=1)
model_1nn.fit(X_train_transform, y_train_mnist)
cv_score_1nn = cross_val_score(model_1nn, X_train_transform, y_train_mnist, cv=10, scoring='accuracy')
avgscore_1nn = sum(cv_score_1nn)/10
standarddev_1nn = std(cv_score_1nn)
print('average cv accuracy: ', avgscore_1nn)
print('standard deviation: ', standarddev_1nn)

average cv accuracy:  0.959534823035
standard deviation:  0.0271404395113


In [321]:
predtrain_1nn = model_1nn.predict(X_train_transform)
print("accuracy on trainset: ", accuracy_score(predtrain_1nn, y_train_mnist))
predtest_1nn = model_1nn.predict(X_test_transform)
print("accuracy on testset: ", accuracy_score(predtest_1nn, y_test_mnist))
print('---')
print('precision, recall, fscore and support of trainset with k-nn k=1: ')
print(precision_recall_fscore_support(y_train_mnist, predtrain_1nn, average='weighted'))
print('precision, recall, fscore and support of testset with k-nn where k=1: ')
print(precision_recall_fscore_support(y_test_mnist, predtest_1nn, average='weighted'))


accuracy on trainset:  1.0
accuracy on testset:  0.929292929293
---
precision, recall, fscore and support of trainset with k-nn k=1: 
(1.0, 1.0, 1.0, None)
precision, recall, fscore and support of testset with k-nn where k=1: 
(0.93309180822044191, 0.92929292929292928, 0.92883458069178859, None)


### Confusion Matrices

In [359]:
confmat_lr_train= confusion_matrix(y_train_mnist, predtrain)
print('Logistic Regression Train Set:')
print(confmat_lr_train)
print('---')
print('1-nn Train Set')
confmat_1nn_train= confusion_matrix(y_train_mnist, predtrain_1nn)
print(confmat_1nn_train)

Logistic Regression Train Set:
[[151   0   0   0   0   0   0   0   0   0]
 [  0 147   0   0   0   0   1   0   2   1]
 [  0   0 150   0   0   0   0   0   0   0]
 [  0   0   0 151   0   1   0   0   1   0]
 [  0   0   0   0 146   0   0   1   0   1]
 [  0   0   0   0   0 149   1   0   0   2]
 [  0   1   0   0   1   0 149   0   0   0]
 [  0   0   0   0   0   0   0 148   0   1]
 [  0   2   1   3   0   1   0   0 139   0]
 [  0   0   0   0   0   0   0   0   2 147]]
---
1-nn Train Set
[[151   0   0   0   0   0   0   0   0   0]
 [  0 151   0   0   0   0   0   0   0   0]
 [  0   0 150   0   0   0   0   0   0   0]
 [  0   0   0 153   0   0   0   0   0   0]
 [  0   0   0   0 148   0   0   0   0   0]
 [  0   0   0   0   0 152   0   0   0   0]
 [  0   0   0   0   0   0 151   0   0   0]
 [  0   0   0   0   0   0   0 149   0   0]
 [  0   0   0   0   0   0   0   0 146   0]
 [  0   0   0   0   0   0   0   0   0 149]]


In [358]:
confmat_lr_test= confusion_matrix(y_test_mnist, predtest)
print('Logistic Regression Test Set:')
print(confmat_lr_test)
print('---')
confmat_1nn_test = confusion_matrix(y_test_mnist, predtest_1nn)
print('1-nn Test Set:')
print(confmat_1nn_test)

Logistic Regression Test Set:
[[25  0  0  0  1  0  1  0  0  0]
 [ 0 28  0  1  0  0  0  0  2  0]
 [ 0  0 27  0  0  0  0  0  0  0]
 [ 0  2  0 18  0  3  0  3  4  0]
 [ 0  0  0  0 30  0  0  0  1  2]
 [ 0  0  0  0  0 30  0  0  0  0]
 [ 0  0  0  0  0  0 30  0  0  0]
 [ 0  0  0  0  2  0  0 26  2  0]
 [ 0  2  0  0  0  0  0  0 26  0]
 [ 1  2  0  1  0  0  0  1  2 24]]
---
1-nn Test Set:
[[27  0  0  0  0  0  0  0  0  0]
 [ 0 31  0  0  0  0  0  0  0  0]
 [ 0  0 24  0  0  0  3  0  0  0]
 [ 0  0  0 27  0  1  1  1  0  0]
 [ 0  0  0  0 29  0  0  1  0  3]
 [ 0  1  0  0  1 28  0  0  0  0]
 [ 0  0  0  0  0  0 30  0  0  0]
 [ 0  0  0  0  0  0  0 30  0  0]
 [ 0  3  0  1  0  0  0  0 23  1]
 [ 0  0  0  1  0  2  0  1  0 27]]


## Results and analysis of the experiment

##### Logistic Regression
To create a Logistic regression model that fits our data, we use Scikit's build in *LinearRegression()* function.
We fit this standard model on our data, and calculate the accuracy.
Where the *average accuracy default model* = **0.94145**

In order to improve this we C by plotting the (cross validated) accuracy of a multiple of different Cs and finding the maximum.
In our data set it appeared to be when $0.48 < C < 0.54$ where the accuracy of our model was **0.94540**. Eventhough it is a small improvement, it is the best we can do.

##### K-Nearest Neighbours
To create a K-nearest neighbour model that fits our data, we use Scikit's build in *KNeighborsClassifier()* function.
We fit this standard model (k = 5) on our data, and calculate the accuracy.
Where the *average accuracy default model* = **0.94827**

While already better than the Logistic regression, we have to look whether $k = 5$ is the best model we can create.
To figure this out we plot for different Ks and find that $k = 1$ gives us an acuracy of **0.95953**

##### Training the models using our optimized parameters
After training our two models we can compare their performance  using scikit's: *precision_recall_fscore_support(yfound, ytrue, average = weighted)*. This formula returns the following values:

| set | precision | recall | f1 score |
|-----|-----------|--------|--------|
|LR Train| 0.984677 | 0.984667 | 0.98465 |
|1-NN Train | 1.0 | 1.0 | 1.0|
|---|---|---|---|
|LR Test | 0.895916 | 0.888889 | 0.88675 |
|1-NN Test | 0.933092 | 0.929293 | 0.928835|

Note: Since 1-NN even takes the outliers into account during training 1-NN's training f1 score is perfect.

As seen during the training 1-Nearest Neighbour is more acurate than logistic regression when C = 0.5

When looking at the Confusion matrices, those als seem to differ drastically. This is most likely because LR is most likely misclassifies by underfitting. While 1-NN most likely misclassifies due to an outlier in the training set influencing the classification of the test set (as may be the case where 2 gets misclassified *three* times as a 6, same for 8 and 1).
