CAP6618 Spring 2018
Assignment 3
Christian Garbin

This notebook uses scikit-learn's `KNeighborsClassifier` to classify digits from the MNIST dataset. `KNeighborsClassifier` was chosen based on the experiments detailed in the accompanying report.

**This notebook assumes Python 3**.

# Setup

Libraries and Jupyter global configuration options.

In [2]:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

%matplotlib inline

Seed NumPy's random number generator to make the notebook reproducible.

In [3]:
np.random.seed(42)

# MNIST dataset

These functions load and split the MNIST dataset (lightly adapted from A. Géron's "Hands-On Machine Learning with Scikit-Learn and TensorFlow").

In [4]:
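def sort_by_target(mnist):
    """Sort the train (first 60,000) and test (last 10,000) samples by label, in place."""
    # Sorting (target, index) pairs orders samples by label, then by original position.
    reorder_train = np.array(sorted([(target, i) for i, target in enumerate(mnist.target[:60000])]))[:, 1]
    reorder_test = np.array(sorted([(target, i) for i, target in enumerate(mnist.target[60000:])]))[:, 1]
    mnist.data[:60000] = mnist.data[reorder_train]
    mnist.target[:60000] = mnist.target[reorder_train]
    # Test indices are relative to position 60000, hence the offset.
    mnist.data[60000:] = mnist.data[reorder_test + 60000]
    mnist.target[60000:] = mnist.target[reorder_test + 60000]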
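
The ordering rule used by `sort_by_target` can be checked on a toy array. The sketch below is illustrative only: `toy_target` and `toy_data` are made-up stand-ins for `mnist.target` and `mnist.data`. It also shows that a stable `np.argsort` on the labels should give the same order as sorting `(target, index)` pairs.

```python
import numpy as np

# Toy stand-ins for mnist.target / mnist.data (made-up values).
toy_target = np.array([2, 0, 1, 0, 2, 1])
toy_data = np.array([[20], [0], [10], [1], [21], [11]])

# Same rule as sort_by_target: sort (target, index) pairs, keep the indices.
reorder = np.array(sorted((t, i) for i, t in enumerate(toy_target)))[:, 1]

# A stable argsort on the labels yields the same order without building tuples.
reorder_argsort = np.argsort(toy_target, kind="stable")

sorted_target = toy_target[reorder]
sorted_data = toy_data[reorder]
print(reorder, sorted_target, sorted_data.ravel())
```

Samples with the same label keep their original relative order, and each data row moves together with its label.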

In [5]:
try:
    from sklearn.datasets import fetch_openml
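    # Note: recent scikit-learn versions return pandas objects here by default;
    # passing as_frame=False keeps the NumPy arrays the code below expects.
    mnist = fetch_openml('mnist_784', version=1, cache=True)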
    mnist.target = mnist.target.astype(np.int8) # fetch_openml() returns targets as strings
    sort_by_target(mnist) # fetch_openml() returns an unsorted dataset
except ImportError:
    from sklearn.datasets import fetch_mldata
    mnist = fetch_mldata('MNIST original')

In [6]:
X, y = mnist["data"], mnist["target"]

The MNIST dataset is already split into train and test sets. The first 60,000 samples are the train set and the last 10,000 are the test set. It has to be split this way because of how the dataset was created, so do **not** shuffle it before splitting. Refer to [LeCun's page](http://yann.lecun.com/exdb/mnist/) for details.

In [7]:
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

Shuffle the train set because some classifiers work better with shuffled data. Shuffling mostly matters for classifiers that can get stuck in a local minimum ([reference](https://stats.stackexchange.com/a/311318)), which is not the case with the current classifier. The step is kept here in case this notebook is updated to use another classifier.

In [8]:
shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]
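
Indexing the samples and the labels with the same permutation keeps each sample paired with its label. A minimal sketch on made-up data (`toy_X` and `toy_y` are hypothetical stand-ins for `X_train` and `y_train`):

```python
import numpy as np

rng = np.random.RandomState(42)  # local generator, so the global seed is untouched

# Toy stand-ins for X_train / y_train: row i holds the value 10 * label.
toy_X = np.array([[0], [10], [20], [30], [40]])
toy_y = np.array([0, 1, 2, 3, 4])

idx = rng.permutation(len(toy_y))
shuffled_X, shuffled_y = toy_X[idx], toy_y[idx]

# Every sample still carries its own label after the shuffle.
still_aligned = bool(np.all(shuffled_X[:, 0] == 10 * shuffled_y))
print(idx, still_aligned)
```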

## Visualizing the dataset

A visualization of the MNIST dataset,

TBD........

1. Are all classes equally represented?
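
One possible way to answer the class balance question is to count the samples per digit. The sketch below is an assumption about how this could be done, not part of the original analysis: `class_distribution` is a hypothetical helper, and `toy_labels` is made-up data that would be replaced by `y_train` in this notebook.

```python
import numpy as np

def class_distribution(labels, n_classes=10):
    """Return per-class counts and fractions for integer labels in [0, n_classes)."""
    counts = np.bincount(np.asarray(labels, dtype=np.int64), minlength=n_classes)
    return counts, counts / counts.sum()

# Made-up labels for illustration only; in the notebook, pass y_train instead.
toy_labels = np.array([0, 1, 1, 2, 2, 2, 9])
counts, fractions = class_distribution(toy_labels)
for digit, (count, fraction) in enumerate(zip(counts, fractions)):
    print(f"{digit}: {count:3d} samples ({fraction:.1%})")
```

Roughly equal fractions across the ten digits would indicate a balanced dataset, in which case plain accuracy is a reasonable metric.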

# Training and evaluating

In this section the classifier is trained and evaluated. The following sections tune the classifier.

## Training

`KNeighborsClassifier` was chosen for its combination of test time, training time, and accuracy.

To keep the test time manageable, we set `n_jobs=-1`, which parallelizes the work across all available CPU cores.

Also note that the time spent in `.fit(...)` is misleading for this classifier. A k-nearest-neighbors classifier has no real training phase: fitting mostly stores the training data (and may build a search index), while nearly all of the work is done at prediction time (e.g. in `cross_val_score`).
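
To illustrate why fitting is cheap and prediction is expensive, here is a minimal brute-force k-nearest-neighbors sketch in NumPy. It is an assumption-laden illustration, not scikit-learn's implementation: `BruteForceKNN` is a hypothetical class, the data is made up, and scikit-learn may use a tree-based index instead of computing every distance.

```python
import numpy as np

class BruteForceKNN:
    """Illustrative k-NN classifier (hypothetical helper, not scikit-learn's code)."""

    def __init__(self, n_neighbors=3):
        self.n_neighbors = n_neighbors

    def fit(self, X, y):
        # "Training" only stores the data, so it is nearly free.
        self.X_ = np.asarray(X, dtype=np.float64)
        self.y_ = np.asarray(y, dtype=np.int64)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=np.float64)
        # Squared Euclidean distance from every query row to every stored row.
        diff = X[:, None, :] - self.X_[None, :, :]
        dists = (diff ** 2).sum(axis=2)
        # Indices of the k closest stored samples for each query.
        nearest = np.argsort(dists, axis=1, kind="stable")[:, :self.n_neighbors]
        # Majority vote; bincount(...).argmax() picks the smallest label on ties.
        return np.array([np.bincount(self.y_[row]).argmax() for row in nearest])

# Made-up 2-D points forming two clusters.
train_X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
train_y = np.array([0, 0, 0, 1, 1, 1])
knn = BruteForceKNN(n_neighbors=3).fit(train_X, train_y)
predictions = knn.predict(np.array([[0.2, 0.3], [5.5, 5.5]]))
print(predictions)
```

In this sketch every prediction compares the query against all stored samples, which is why the cost grows with the size of the train set rather than being paid once in `fit`.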

In [10]:
%%time

from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=3, n_jobs=-1)

fit_results = clf.fit(X_train, y_train)
print(fit_results) # need to explicitly print results because of %%time

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=-1, n_neighbors=3, p=2,
           weights='uniform')
CPU times: user 18.7 s, sys: 80.2 ms, total: 18.8 s
Wall time: 18.8 s


## Evaluating

First perform a (relatively) quick evaluation with class prediction, to have a general idea of performance.

In [None]:
%%time
from sklearn.metrics import accuracy_score

y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred)) # need to explicitly print results because of %%time

Now perform a thorough evaluation with cross-validation. This will take a while to run (about 50 minutes on a MacBook i7 2.9 GHz).

In [None]:
%%time
from sklearn.model_selection import cross_val_score

scores = cross_val_score(clf, X_train, y_train, cv=3, scoring="accuracy")
print(scores)

Go into details of performance with a confusion matrix.
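
As a reminder of what a confusion matrix contains, the sketch below builds one by hand with NumPy on made-up labels. `confusion_counts` is a hypothetical helper written for illustration; scikit-learn's `sklearn.metrics.confusion_matrix` uses the same convention (rows are true classes, columns are predicted classes) and would normally be used instead.

```python
import numpy as np

def confusion_counts(y_true, y_pred, n_classes):
    """Return an n_classes x n_classes matrix: rows = true class, columns = predicted."""
    matrix = np.zeros((n_classes, n_classes), dtype=np.int64)
    # np.add.at accumulates correctly even when the same (row, col) pair repeats.
    np.add.at(matrix, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return matrix

# Made-up labels and predictions for three classes, for illustration only.
toy_true = np.array([0, 0, 1, 1, 2, 2, 2])
toy_pred = np.array([0, 1, 1, 1, 2, 0, 2])
cm = confusion_counts(toy_true, toy_pred, n_classes=3)
print(cm)

# Per-class recall: correct predictions divided by the number of true samples.
recall = cm.diagonal() / cm.sum(axis=1)
print(recall)
```

Off-diagonal cells show which digits are mistaken for which, something a single accuracy number hides.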