# Escape Summer School - Classification, Performance Evaluation & Hyper-parameter Optimization

### Classification

- The task for which the `digits` dataset was originally devised is, of course, identifying which handwritten digit is in each image.
- This is a useful task in the real world, where machines read postcodes on letters to sort them, and digits on cheques to validate them automatically.
- We have labels in the training dataset, so this is a **supervised classification** task, one of the most common tasks in modern ML.

### Support Vector Classifiers

- Today, large-scale, high-dimensional classification tasks on complex data are usually solved with deep neural networks.
- However, simpler algorithms are still used for smaller and lower-dimensional datasets. One of these is the Support Vector Classifier (SVC).

Before we proceed, it's time to introduce the split between training set and test set.
- Having a separate test set, made of data kept out of training, lets us check whether our model *generalises properly* to data it has not seen before.
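
As a rough illustration of the idea above, here is a minimal sketch of a train/test split on synthetic data. The arrays `X_demo` and `y_demo`, the 25% test fraction and the use of `stratify` are assumptions made for the example, not choices imposed by this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative only): 100 samples, 4 features, binary labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = rng.integers(0, 2, size=100)

# Hold out 25% of the samples for testing.
# stratify keeps the class proportions similar in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

print("Training set shape :", X_tr.shape)
print("Test set shape     :", X_te.shape)
```

The model is fitted on the training subset only; the test subset is used once, at the end, to estimate how well it generalises.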

In [None]:
import numpy as np
from numpy.random import randint
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()

### Load the dataset and plot some images along with their labels

In [None]:
from sklearn.datasets import load_digits

np.random.seed(0)
digits = load_digits()

images = digits['images'] # The digit images, shape (N, 8, 8)
X = digits['data'] # The flattened images, shape (N, 64)
y = digits['target'] # The digit shown in each image (0 to 9)
N = len(images)
print('Number of digits :', N)

plt.figure(figsize=(20, 4))
for i, r in enumerate(randint(0, N, 5)):
    plt.subplot(1, 5, i + 1)
    plt.imshow(images[r], cmap="gray")
    plt.title('Label : ' + str(y[r]))

## Classification without cross-validation

### Split in train and test datasets

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = #TODO

print('Training set shape :', X_train.shape)
print('Test set shape     :', X_test.shape)

### Train classifier
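
The fit / predict / score pattern used in this section can be sketched on another bundled dataset (iris, not `digits`). The names `X_iris`, `clf`, `y_hat` and the hyperparameter values below are assumptions chosen for illustration; they are not the values expected in the exercise.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Illustrative dataset: 150 flowers, 4 features, 3 classes.
X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=0, stratify=y_iris
)

# Assumed hyperparameters, for the sketch only.
clf = SVC(kernel="rbf", gamma="scale", C=1.0)
clf.fit(X_tr, y_tr)

# Predict on the held-out data and summarise the result.
y_hat = clf.predict(X_te)
acc = accuracy_score(y_te, y_hat)
cm = confusion_matrix(y_te, y_hat)  # rows: true class, columns: predicted class

print("%0.5f accuracy" % acc)
print(cm)
```

The diagonal of the confusion matrix counts correct predictions, so the accuracy equals its trace divided by the number of test samples.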

In [None]:
from sklearn.svm import SVC

kernel = #TODO
gamma = #TODO
svc = #TODO

# training
svc.fit(X_train, y_train)

# looking at prediction on test set
y_pred = #TODO

# accuracy of the model
accuracy = sum(y_pred == y_test) / len(y_pred)
print("%0.5f accuracy" % (accuracy))

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay. #TODO

## Classification with cross-validation
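
A minimal sketch of k-fold cross-validation, again on the iris dataset rather than `digits`. The choice of 5 folds, the linear kernel and the names `cv_demo`, `fold_scores` and `held_out` are assumptions for illustration. Passing `return_estimator=True` keeps the model fitted on each fold, so each one can also be scored on a separate test set.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=0, stratify=y_iris
)

# 5-fold cross-validation on the training set only.
cv_demo = cross_validate(
    SVC(kernel="linear", C=1.0), X_tr, y_tr, cv=5, return_estimator=True
)

# One validation score per fold.
fold_scores = cv_demo["test_score"]
print("%0.5f accuracy with a standard deviation of %0.5f"
      % (fold_scores.mean(), fold_scores.std()))

# Score each fold's fitted model on the untouched test set.
held_out = [est.score(X_te, y_te) for est in cv_demo["estimator"]]
print("%0.5f accuracy with a standard deviation of %0.5f"
      % (np.mean(held_out), np.std(held_out)))
```

The spread of the fold scores gives a sense of how much the result depends on which samples happened to land in the training folds.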

In [None]:
from sklearn.model_selection import cross_validate

scores = #TODO

print("%0.5f accuracy with a standard deviation of %0.5f" % (scores['test_score'].mean(), scores['test_score'].std()))

In [None]:
scores_test = []
for estimator in #TODO:
    y_pred = #TODO
    accuracy = sum(y_pred == y_test) / len(y_pred)
    scores_test.append(accuracy)
    
print("%0.5f accuracy with a standard deviation of %0.5f" % (np.mean(scores_test), np.std(scores_test)))

## Hyperparameter optimization
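
A grid search simply cross-validates every combination of hyperparameters in a grid and keeps the best one. The sketch below shows this on the iris dataset; the candidate values for `C_demo` and `gamma_demo` are assumptions picked to keep the example small, not recommended values for `digits`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=0, stratify=y_iris
)

# Assumed candidate values, for illustration only.
C_demo = [0.1, 1, 10]
gamma_demo = [0.01, 0.1]

# Two sub-grids: gamma only matters for the RBF kernel.
param_grid_demo = [
    {"kernel": ["rbf"], "gamma": gamma_demo, "C": C_demo},  # 3 x 2 = 6 candidates
    {"kernel": ["linear"], "C": C_demo},                    # 3 candidates
]

# Each candidate is scored with 5-fold cross-validation on the training set.
grid_demo = GridSearchCV(SVC(), param_grid_demo, cv=5)
grid_demo.fit(X_tr, y_tr)

print("Best parameters set :", grid_demo.best_params_)
# By default the best candidate is refitted on the whole training set,
# so the search object can be scored directly on the test set.
print("Test accuracy       : %0.5f" % grid_demo.score(X_te, y_te))
```

Because the test set plays no role in the search, the final score remains an honest estimate of performance on unseen data.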

In [None]:
from sklearn.model_selection import GridSearchCV

C = #TODO
gamma = #TODO

param_grid = [
    {"kernel": ["rbf"], "gamma": gamma, "C": C},
    {"kernel": ["linear"], "C": C},
]

grid_search = #TODO
grid_search.fit(X_train, y_train)

print("Best parameters set :", grid_search.best_params_)

In [None]:
means = grid_search.cv_results_["mean_test_score"]
stds = grid_search.cv_results_["std_test_score"]

for mean, std, params in zip(means, stds, grid_search.cv_results_["params"]):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
print()

## ROC curves

The objective is to compute the ROC curves of a classifier trained to distinguish between even and odd digits, for different numbers of principal components kept after dimensionality reduction with PCA.
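
To recall what a ROC curve measures, here is a small worked sketch on four hand-made samples. The labels and scores are made up for illustration. The curve is obtained by sweeping a decision threshold over continuous scores (for an SVC, such scores could come from `decision_function`) and recording the false positive rate and true positive rate at each threshold.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hand-made example: true binary labels and classifier scores.
y_true_demo = np.array([0, 0, 1, 1])
scores_demo = np.array([0.1, 0.4, 0.35, 0.8])

# One (FPR, TPR) point per threshold, from the strictest to the loosest.
fpr, tpr, thresholds = roc_curve(y_true_demo, scores_demo)
for f, t in zip(fpr, tpr):
    print("FPR = %0.2f, TPR = %0.2f" % (f, t))

# The area under the curve is the fraction of (positive, negative) pairs
# ranked in the right order: 3 out of 4 here, so AUC = 0.75.
auc_demo = roc_auc_score(y_true_demo, scores_demo)
print("AUC = %0.2f" % auc_demo)
```

A classifier that ranks samples at random follows the diagonal (AUC close to 0.5), which is why a chance line is usually drawn on ROC plots.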

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(X_train)

y_train = y_train % 2 == 0 # True for even digits, False for odd ones
y_test = y_test % 2 == 0 # True for even digits, False for odd ones

X_train_pca = #TODO
X_test_pca = #TODO

plt.scatter(
    X_train_pca[y_train==True, 0], 
    X_train_pca[y_train==True, 1], 
    c='b', 
    s=100,
    label='Even',
    alpha=0.3, 
    edgecolors='none'
)

plt.scatter(
    X_train_pca[y_train==False, 0], 
    X_train_pca[y_train==False, 1], 
    c='r', 
    s=100,
    label='Odd',
    alpha=0.3, 
    edgecolors='none'
)

plt.legend()

In [None]:
from sklearn.metrics import RocCurveDisplay

svc = #TODO # Best classifier according to grid search

fig, ax = plt.subplots()
for n in range(1, 10):
    pca = #TODO
    pca.fit(X_train)
    
    X_train_pca = #TODO
    X_test_pca = #TODO
    
    svc.fit(X_train_pca, y_train)
    
    y_pred = #TODO
    
    # accuracy of the model
    accuracy = sum(y_pred == y_test) / len(y_pred)
    print("%0.5f accuracy for %d PC(s)" % (accuracy, n))
    
    viz = RocCurveDisplay.#TODO

ax.plot([0, 1], [0, 1], linestyle="--", lw=2, color="r", label="Chance", alpha=0.8)
ax.legend(loc="lower right") # redraw the legend so that the chance line is listed too