# Exercise 1

Try to build a classifier for the MNIST dataset that achieves over 97% accuracy on the test set.

---

In [26]:
# Pydata stack
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

plt.rcParams["figure.figsize"] = (20, 10)

import time

from sklearn import set_config  # Displays HTML representation of composite estimators
from sklearn.datasets import fetch_openml  # For open source ML sets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

set_config(display="diagram")

In [21]:
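# useful decorator: prints how long the wrapped function took
def computation_time(func):
    def inner(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        end = time.perf_counter()
        print("Computation time: {:.1f} seconds".format(end - start))
        return result  # pass the wrapped function's return value through

    return inner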

## Get The Data

In [2]:
mnist = fetch_openml("mnist_784", version=1, as_frame=True)

# load data into dataframes
X, y = mnist["data"], mnist["target"]

# use predefined train/test split
X_train, X_test, y_train, y_test = (
    X.iloc[:60000],
    X.iloc[60000:],
    y.iloc[:60000],
    y.iloc[60000:],
)

Is the dataset skewed?

In [4]:
class_counts = y_train.value_counts().sort_index()
num_samples = len(y_train)  # total number of training samples
class_pcts = 100 * class_counts / num_samples
class_pcts.round(1)

0     9.9
1    11.2
2     9.9
3    10.2
4     9.7
5     9.0
6     9.9
7    10.4
8     9.8
9     9.9
Name: class, dtype: float64

The classes are roughly balanced: each digit makes up between 9.0% and 11.2% of the training set.

## Select and Train Model

Use a K-Nearest Neighbors classifier, tuning `n_neighbors` and `weights` with a 5-fold cross-validated grid search.

**Warning:** This cell takes around 35 minutes to run!

In [10]:
# initialise
knc_clf = KNeighborsClassifier()

# hyperparams to try
param_grid = [{"n_neighbors": [3, 4, 5], "weights": ["uniform", "distance"]}]

grid_search = GridSearchCV(
    knc_clf,
    param_grid,
    scoring="accuracy",
    cv=5,
    return_train_score=True,
    refit=True,
)


@computation_time
def knc_grid_search(X, y):
    grid_search.fit(X, y)

    cv_results = grid_search.cv_results_

    cv_summary = pd.DataFrame(cv_results["params"])
    cv_summary["mean_train_score"] = cv_results["mean_train_score"].round(3)
    cv_summary["mean_test_score"] = cv_results["mean_test_score"].round(3)
    cv_summary = cv_summary.sort_values(by="mean_test_score", ascending=False)

    return cv_results, cv_summary


knc_grid_search(X_train, y_train)

Mean Accuracy: 0.94
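
Since the full search takes around 35 minutes, it can be worth dry-running the same grid on a small dataset first, to check that the code runs and the summary table looks right. The following is a sketch under assumptions that are not in the original notebook: it swaps in scikit-learn's bundled `load_digits` dataset (8x8 digit images, about 1,800 samples) as a stand-in for MNIST, uses 3 folds instead of 5, and sets `n_jobs=-1` to spread the fits across all CPU cores. The names `X_small`, `y_small`, `dry_run` and `dry_summary` are made up for illustration.

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): the small digits set bundled with scikit-learn
X_small, y_small = load_digits(return_X_y=True)

# Same hyperparameter grid as the full search
param_grid = [{"n_neighbors": [3, 4, 5], "weights": ["uniform", "distance"]}]

dry_run = GridSearchCV(
    KNeighborsClassifier(),
    param_grid,
    scoring="accuracy",
    cv=3,
    n_jobs=-1,  # run the cross-validation fits in parallel
)
dry_run.fit(X_small, y_small)

# Same summary layout as in the full search, minus the training scores
dry_summary = pd.DataFrame(dry_run.cv_results_["params"])
dry_summary["mean_test_score"] = dry_run.cv_results_["mean_test_score"].round(3)
dry_summary = dry_summary.sort_values(by="mean_test_score", ascending=False)
print(dry_summary)
```

Passing `n_jobs=-1` to the full `GridSearchCV` is also likely to shorten the run on a multi-core machine, though the actual speed-up depends on the hardware and the available memory.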


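Training accuracy isn't a useful metric here. With `weights="distance"`, neighbors are weighted by the inverse of their distance, and every training point is its own nearest neighbor at distance zero. The classifier therefore returns that point's own label, and the training score is essentially perfect whatever the value of `n_neighbors`. Even with `weights="uniform"`, the point itself is always one of the voters, which inflates the training score. I think it's better to look only at the cross-validated score (`mean_test_score`).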
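
The effect is easy to reproduce on a small synthetic dataset. The following is a minimal sketch, not part of the original notebook: the random data, the seed and the variable names are assumptions chosen for illustration. The labels are pure noise, so any training accuracy above chance comes from each point voting for itself.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: 200 random points with random labels (no real signal)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = rng.integers(0, 3, size=200)

train_acc = {}
for weights in ["uniform", "distance"]:
    clf = KNeighborsClassifier(n_neighbors=3, weights=weights).fit(X_toy, y_toy)
    # score() on the data the model was fitted on gives the training accuracy
    train_acc[weights] = clf.score(X_toy, y_toy)

print(train_acc)
```

With distance weighting, the training accuracy comes out as exactly 1.0 even though the labels are random, which is why only the held-out folds say anything about generalization.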

In [22]:
knc_clf = grid_search.best_estimator_
knc_clf

## Evaluate on Test Set

In [24]:
y_test_pred = knc_clf.predict(X_test)
acc = accuracy_score(y_test, y_test_pred)

print("Test Accuracy: {:.1f}%".format(100 * acc))

Test Accuracy: 97.1%
