## Exercise 1

Try to achieve 97% accuracy on the MNIST test set. *Hint: Experiment with KNeighborsClassifier*

In [1]:
import numpy as np

# Load and pre-process the data
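# fetch_mldata was removed from scikit-learn; fetch_openml serves the same 70,000-image dataset.
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
# OpenML returns the labels as strings; cast them to float to match the outputs below.
X, y = mnist['data'], mnist['target'].astype(np.float64)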

X_train, y_train = X[:60000], y[:60000]
X_test, y_test = X[60000:], y[60000:]
shuffle_idx = np.random.permutation(60000)
X_train, y_train = X_train[shuffle_idx], y_train[shuffle_idx]

# With data in hand, time to structure the classifier and perform some grid searching.

### KNeighbors Classifier

Documentation for the classifier is [here](http://scikit-learn.org/stable/modules/neighbors.html#classification) and [here](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html).

It is an *instance-based* classifier that does not build an internal model. Instead, it stores instances of the training data. It implements a k-nearest-neighbors vote: each query point is assigned the class that wins a majority vote among its nearest neighbors.
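The vote described above can be sketched in a few lines of plain Python. This is only an illustration on a toy 2-D dataset, not how scikit-learn implements it; the function name `knn_predict` and the sample points are made up for the example.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every stored training instance.
    distances = [math.dist(point, query) for point in train_points]
    # Indices of the k training instances with the smallest distances.
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    # Count the labels of those neighbors and return the most common one.
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: three points of class "a" near the origin, two of class "b" near (1, 1).
train_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1)]
train_labels = ["a", "a", "a", "b", "b"]

print(knn_predict(train_points, train_labels, (0.15, 0.15), k=3))
print(knn_predict(train_points, train_labels, (0.95, 1.0), k=3))
```

Note that there is no training step beyond storing the data; all of the work happens at prediction time, which is why predicting on MNIST is slow.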

Hyperparameters
1. n_neighbors
2. weights - 'uniform' (every neighbor's vote counts equally) or 'distance' (closer neighbors count more, weighted by inverse distance). You can also pass in a custom function
3. algorithm - used to compute the nearest neighbors: ball_tree, kd_tree, brute, or auto.
4. leaf_size - passed into ball_tree/kd_tree. Default=30. Affects the speed of the query and memory required
5. p - power parameter for the Minkowski metric. 1 = manhattan, 2 = euclidean, etc
6. n_jobs - parallelization param.
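To make `p` and `weights` more concrete, here is a rough sketch of what they control. The helper names `minkowski` and `weighted_vote` are hypothetical and written for this note, and the sketch assumes all neighbor distances are non-zero.

```python
def minkowski(a, b, p=2):
    """Minkowski distance between two points: p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def weighted_vote(neighbor_labels, neighbor_distances, weights="uniform"):
    """Tally neighbor votes; 'distance' weights each vote by 1 / distance."""
    tally = {}
    for label, dist in zip(neighbor_labels, neighbor_distances):
        # Assumes dist > 0; a zero distance would need special handling.
        weight = 1.0 if weights == "uniform" else 1.0 / dist
        tally[label] = tally.get(label, 0.0) + weight
    return max(tally, key=tally.get)


manhattan = minkowski((0, 0), (3, 4), p=1)
euclidean = minkowski((0, 0), (3, 4), p=2)
print(manhattan, euclidean)

# One very close neighbor of class "a" and two farther neighbors of class "b".
labels = ["a", "b", "b"]
dists = [0.1, 1.0, 2.0]
print(weighted_vote(labels, dists, "uniform"))
print(weighted_vote(labels, dists, "distance"))
```

With uniform weights the two "b" neighbors outvote the single "a"; with distance weights the very close "a" neighbor dominates, so the two settings can disagree on the same neighborhood.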

In [2]:
# To start, let's build and train a single classifier to see how it works with hyperparameter defaults.
from sklearn.neighbors import KNeighborsClassifier
from random import randint

knn_default_clf = KNeighborsClassifier()
knn_default_clf.fit(X_train, y_train)

# randint includes both endpoints, so subtract 1 to stay within the valid index range.
rand_idx = randint(0, len(X_train) - 1)
print("Predicted: " + str(knn_default_clf.predict([X_train[rand_idx]])))
print("Label: " + str(y_train[rand_idx]))

Predicted: [8.]
Label: 8.0


In [3]:
# Ok, not bad. Perform a grid search, but limit the size of training data as it takes too long
# on my machine with all 60000 data points
from sklearn.model_selection import GridSearchCV

X_train_small = X_train[:20000]
y_train_small = y_train[:20000]

knn_clf = KNeighborsClassifier()
param_grid = [
    # Early testing used {'n_neighbors': [3,4,5,6,7], 'weights': ['uniform', 'distance']}, but
    # for this write-up the grid is limited to the following:
    {'n_neighbors': [3,4,5], 'weights': ['distance']}
]

# Make it verbose to get periodic updates from the grid search process and parallelize into 3 jobs.
grid_search = GridSearchCV(knn_clf, param_grid, cv=3, scoring='accuracy', verbose=10, n_jobs=3)
grid_search.fit(X_train_small, y_train_small)

knn_best_clf = grid_search.best_estimator_
print("Grid search complete. Best estimator:\n " + str(knn_best_clf))

Fitting 3 folds for each of 3 candidates, totalling 9 fits
[CV] n_neighbors=3, weights=distance .................................
[CV] n_neighbors=3, weights=distance .................................
[CV] n_neighbors=3, weights=distance .................................
[CV]  n_neighbors=3, weights=distance, score=0.9575648523016944, total= 1.8min
[CV] n_neighbors=4, weights=distance .................................
[CV]  n_neighbors=3, weights=distance, score=0.9558757316524088, total= 1.7min
[CV] n_neighbors=4, weights=distance .................................


[Parallel(n_jobs=3)]: Done   2 tasks      | elapsed:  5.2min


[CV]  n_neighbors=3, weights=distance, score=0.9532093581283744, total= 1.8min
[CV] n_neighbors=4, weights=distance .................................
[CV]  n_neighbors=4, weights=distance, score=0.9581646423751687, total= 1.7min
[CV] n_neighbors=5, weights=distance .................................
[CV]  n_neighbors=4, weights=distance, score=0.9563587282543491, total= 1.7min
[CV] n_neighbors=5, weights=distance .................................
[CV]  n_neighbors=4, weights=distance, score=0.9575266396518085, total= 1.7min


[Parallel(n_jobs=3)]: Done   5 out of   9 | elapsed: 10.4min remaining:  8.3min
[Parallel(n_jobs=3)]: Done   6 out of   9 | elapsed: 10.4min remaining:  5.2min


[CV] n_neighbors=5, weights=distance .................................
[CV]  n_neighbors=5, weights=distance, score=0.9544159544159544, total= 1.8min


[Parallel(n_jobs=3)]: Done   7 out of   9 | elapsed: 15.7min remaining:  4.5min


[CV]  n_neighbors=5, weights=distance, score=0.9539592081583683, total= 1.8min
[CV]  n_neighbors=5, weights=distance, score=0.9552754014708089, total= 1.8min


[Parallel(n_jobs=3)]: Done   9 out of   9 | elapsed: 15.7min remaining:    0.0s
[Parallel(n_jobs=3)]: Done   9 out of   9 | elapsed: 15.7min finished


Grid search complete. Best estimator:
 KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=4, p=2,
           weights='distance')
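
The choice of `n_neighbors=4` can be checked by hand from the fold scores in the log above. The sketch below averages the three per-fold accuracies for each candidate; it uses a plain unweighted mean, which is an assumption about how the candidates are ranked and may differ slightly from the exact value GridSearchCV stores.

```python
from statistics import mean

# Per-fold accuracy scores copied from the [CV] lines of the grid search log.
fold_scores = {
    3: [0.9575648523016944, 0.9558757316524088, 0.9532093581283744],
    4: [0.9581646423751687, 0.9563587282543491, 0.9575266396518085],
    5: [0.9544159544159544, 0.9539592081583683, 0.9552754014708089],
}

# Unweighted mean accuracy across the three folds for each n_neighbors value.
mean_scores = {k: mean(scores) for k, scores in fold_scores.items()}
best_k = max(mean_scores, key=mean_scores.get)

for k, score in sorted(mean_scores.items()):
    print(f"n_neighbors={k}: mean CV accuracy = {score:.4f}")
print("best n_neighbors:", best_k)
```

The candidate with 4 neighbors comes out on top, matching the best estimator reported by the grid search, although the three candidates are within about 0.3 percentage points of each other.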


In [4]:
# From GridSearch, the best estimator uses 4 neighbors and the euclidean distance metric (L2)
# Check against the test data.
y_pred = knn_best_clf.predict(X_test)

In [8]:
from sklearn.metrics import accuracy_score

knn_best_acc = accuracy_score(y_test, y_pred)
print(knn_best_acc)

0.9621


In [6]:
# Check accuracy of the default:
y_pred_default = knn_default_clf.predict(X_test)

In [9]:
knn_default_acc = accuracy_score(y_test, y_pred_default)
print(knn_default_acc)

0.9688


From this experiment, the grid-searched classifier reached 96.2% accuracy, just shy of the exercise goal. This is most likely because it was trained on only one third of the training data (20,000 of 60,000 samples). The hunch is supported by the default KNN classifier, which was trained on the full training set and scored higher (0.9688 vs. 0.9621).