## Chapter 3 - Classification


In [3]:
import pickle

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

from sklearn.base import BaseEstimator
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             confusion_matrix, classification_report,                              
                             precision_recall_curve, roc_curve, roc_auc_score)

def load(fname):
    """Load MNIST from a local pickle cache, downloading it on the first call."""
    try:
        with open(fname, 'rb') as f:
            return pickle.load(f)
    except FileNotFoundError:
        from sklearn.datasets import fetch_openml
        mnist = fetch_openml('mnist_784', version=1, cache=True)
        with open(fname, 'wb') as f:
            # pickle.dump returns None, so its result must not be assigned to mnist
            pickle.dump(mnist, f)
        return mnist

In [4]:
# Ingest
mnist_data = load('mnist.data.pkl')
X, y = mnist_data['data'], mnist_data['target']
# Labelling
y_5 = (y=='5')

### Resampling Methods

Resampling methods are indispensable in modern statistics. They involve repeatedly drawing samples from a training set and refitting a model on each sample to obtain additional information about the fitted model.

There are two common resampling methods: cross-validation and bootstrapping. Cross-validation can be used to estimate the test error associated with a given statistical learning method, either to evaluate its performance or to select the appropriate level of flexibility. Bootstrapping is most often used to measure the accuracy of a parameter estimate or of a statistical learning method.

### Validation Set

In cases where we don't have a large dataset, we split the observations into a training set and a test set. The model is fit on the training set and evaluated on the test set. Its performance is summarised by a metric computed on the test set (e.g. MSE or cross-entropy loss).

In `sklearn` we use `train_test_split` to achieve this split.

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=0)

In [6]:
# Train
sgd_clf = SGDClassifier(random_state=0)
sgd_clf.fit(X_train, y_train)

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='hinge',
              max_iter=1000, n_iter_no_change=5, n_jobs=None, penalty='l2',
              power_t=0.5, random_state=0, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False)

In [17]:
# Predict
y_pred = sgd_clf.predict(X_test)
print("Predict: " + str(y_pred[98:115]))
print("Actual : " + str(y_test[98:115]))

Predict: ['7' '2' '3' '1' '3' '7' '0' '1' '0' '8' '2' '9' '1' '9' '3' '6' '6']
Actual : ['7' '8' '7' '1' '3' '7' '0' '8' '0' '8' '2' '9' '1' '9' '3' '6' '6']
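To turn a comparison like this into a single performance measure, we can count the mismatches. The sketch below hard-codes only the 17 predictions and labels printed above (not the full test set), so the resulting numbers describe this slice alone; the names `pred_slice` and `actual_slice` are made up for illustration.

```python
import numpy as np

# The 17 predictions and true labels copied from the printed slice above
pred_slice = np.array(['7', '2', '3', '1', '3', '7', '0', '1', '0',
                       '8', '2', '9', '1', '9', '3', '6', '6'])
actual_slice = np.array(['7', '8', '7', '1', '3', '7', '0', '8', '0',
                         '8', '2', '9', '1', '9', '3', '6', '6'])

# Element-wise comparison gives a boolean array; its sum counts the errors
n_wrong = int(np.sum(pred_slice != actual_slice))
accuracy = float(np.mean(pred_slice == actual_slice))
error_rate = 1.0 - accuracy

print(f"wrong: {n_wrong} of {len(pred_slice)}")
print(f"accuracy: {accuracy:.4f}, error rate: {error_rate:.4f}")
```

On the full test set the same quantity is available as `accuracy_score(y_test, y_pred)`, which is already imported at the top of the notebook.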


Here we randomly drew 15% of the observations to form the test set. This approach has two limitations:
1. The validation estimate of the test error rate can be highly variable, depending on which observations end up in the validation set.

2. Only a subset of the observations is used to fit the model. Because models tend to perform worse when trained on fewer observations, the validation error may overestimate the test error rate of the model fit on the entire dataset.
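The first limitation is easy to observe empirically. The following is a minimal sketch, assuming a small synthetic dataset from `make_classification` and a `LogisticRegression` model as quick stand-ins for MNIST and the `SGDClassifier` (both are substitutions made for speed, not part of the analysis above). It repeats the 85/15 split with ten different seeds and records the validation error each time.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, chosen so the sketch runs in seconds)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

val_errors = []
for seed in range(10):
    # Same data, same model, only the random split changes
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_demo, y_demo, test_size=0.15, random_state=seed)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    val_errors.append(1.0 - clf.score(X_val, y_val))

val_errors = np.array(val_errors)
print("error rate per split:", val_errors.round(3))
print("min / max / std:", val_errors.min(), val_errors.max(), val_errors.std())
```

The spread between the smallest and largest value gives a rough sense of how much a single validation split can mislead.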

### Cross-Validation

$k$-fold cross-validation involves splitting the observations into $k$ groups (folds). In the first iteration of model evaluation, the first fold is treated as the validation set and the model is trained on the remaining $k-1$ folds. This is repeated $k$ times, with each successive fold held out as the validation set. The $k$-fold CV estimate is the average of the $k$ resulting validation errors.

In `sklearn`, we can perform $k$-fold cross-validation using `cross_val_predict`.
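The procedure just described can be written out as an explicit loop. This is a hand-rolled sketch, not how any particular library implements it: `KFold` is used only to generate the index splits, and the synthetic `X_demo`, `y_demo` from `make_classification` are assumed stand-ins for the MNIST training data.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import KFold

# Synthetic stand-in data (an assumption, to keep the loop fast)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

k = 3
model = SGDClassifier(random_state=0)
splitter = KFold(n_splits=k, shuffle=True, random_state=0)

fold_errors = []
for train_idx, val_idx in splitter.split(X_demo):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    y_hat = fold_model.predict(X_demo[val_idx])
    # Misclassification rate on the held-out fold
    fold_errors.append(np.mean(y_hat != y_demo[val_idx]))

cv_estimate = float(np.mean(fold_errors))
print("per-fold error:", np.round(fold_errors, 3))
print("k-fold CV estimate:", round(cv_estimate, 3))
```

Note that when an integer `cv` is passed together with a classifier, the `sklearn` helpers use stratified folds, so results from this plain `KFold` loop need not match theirs exactly.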

In [18]:
kf = 3

In [20]:
# cross_val_predict gives the prediction result of each sample when it is in the test set group
sgd_ypred = cross_val_predict(sgd_clf, X_train, y_train, cv=kf)

The $k$-fold CV error rate is simply the mean of the error rates of the individual folds:

$$\text{CV}_{(k)} = \frac{1}{k}\sum_{i=1}^{k} \text{Err}_i$$

where $\text{Err}_i$ is the misclassification rate on fold $i$, i.e. the average of $I(\hat{y}_j \neq y_j)$ over the observations $j$ in that fold. The indicator $I(\hat{y}_j \neq y_j)$ is $1$ if the prediction is incorrect and $0$ if it is correct.
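As a small worked instance of this average of fold error rates, the sketch below uses nine made-up labels and predictions (purely illustrative, not taken from MNIST) and assumes three contiguous folds of equal size.

```python
import numpy as np

# Hypothetical true labels and out-of-fold predictions for 9 samples
y_true = np.array(['5', '3', '5', '1', '5', '0', '5', '5', '2'])
y_hat = np.array(['5', '3', '3', '1', '5', '0', '1', '5', '5'])

k = 3
# Assume the folds are contiguous, equally sized blocks of indices
folds = np.array_split(np.arange(len(y_true)), k)

# Err_i: fraction of misclassified samples within fold i
fold_errors = np.array([np.mean(y_hat[idx] != y_true[idx]) for idx in folds])
cv_error = float(fold_errors.mean())

# With equal-sized folds this matches the overall misclassification rate
overall_error = float(np.mean(y_hat != y_true))

print("fold errors:", fold_errors.round(3))
print("CV error:", round(cv_error, 3), "overall error:", round(overall_error, 3))
```

The folds contain 1, 0 and 2 errors out of 3 samples each, so the CV error is $(1/3 + 0 + 2/3)/3 = 1/3$, the same as 3 errors out of 9 overall. With unequal fold sizes the two quantities can differ slightly.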

We can also get scoring values for each fold (say, accuracy) using `cross_val_score`. 

In [23]:
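# Binary target for the 5-detector: True where the training label is '5'
y_train_5 = (y_train == '5')
cvs2 = cross_val_score(estimator=sgd_clf, X=X_train, y=y_train_5, cv=kf, scoring='accuracy')
print(cvs2, cvs2.mean(), cvs2.std())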

[0.9325905  0.96470529 0.96435234] 0.9538827107934393 0.015056555299559625


$k$-fold CV often gives more accurate estimates of the test error rate than leave-one-out CV. This is due to the bias-variance tradeoff.

Leave-one-out CV (LOOCV) is the special case $k=n$, so the model is trained $n$ times to compute the test error estimate. Comparing LOOCV, $k$-fold CV and the regular validation set approach, LOOCV gives the least biased estimate because each model is trained on all observations but one. $k$-fold CV has higher bias than LOOCV because each training set contains a smaller proportion of the data, namely $(k-1)n/k$ observations.

But LOOCV also has higher variance than $k$-fold CV. It averages the outputs of $n$ fitted models, each trained on an almost identical set of observations, so these outputs are highly correlated with each other. $k$-fold CV averages the outputs of models that are somewhat less correlated, because the overlap between training sets is smaller. Since the mean of highly correlated quantities has higher variance than the mean of less correlated ones, this lower correlation leads to lower variance.
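The overlap argument can be made concrete by measuring how similar the training sets are under each scheme. The sketch below is an illustration under stated assumptions: a toy dataset of $n=20$ indices, unshuffled 5-fold splits, and a hypothetical helper `mean_pairwise_overlap` that reports the average Jaccard overlap (shared observations divided by the union) between pairs of training sets.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

n = 20
X_demo = np.arange(n).reshape(-1, 1)  # toy data; only the indices matter here

def mean_pairwise_overlap(splitter):
    """Return (number of splits, mean Jaccard overlap between training sets)."""
    train_sets = [set(train_idx) for train_idx, _ in splitter.split(X_demo)]
    overlaps = []
    for i in range(len(train_sets)):
        for j in range(i + 1, len(train_sets)):
            a, b = train_sets[i], train_sets[j]
            overlaps.append(len(a & b) / len(a | b))
    return len(train_sets), float(np.mean(overlaps))

loo_count, loo_overlap = mean_pairwise_overlap(LeaveOneOut())
kf_count, kf_overlap = mean_pairwise_overlap(KFold(n_splits=5))

print(f"LOOCV : {loo_count} models, mean overlap {loo_overlap:.2f}")
print(f"5-fold: {kf_count} models, mean overlap {kf_overlap:.2f}")
```

With $n=20$, any two LOOCV training sets share 18 of the 20 observations (overlap 0.9), while any two 5-fold training sets share 12 of 20 (overlap 0.6). The gap widens as $n$ grows, since the LOOCV overlap approaches 1.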

### Bootstrapping

Bootstrapping is used to quantify the uncertainty associated with a given estimator or statistical learning method. It works by repeatedly sampling observations, with replacement, from the original dataset.
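As a minimal sketch of the idea, the code below estimates the standard error of a sample median. The data are made up (200 draws from a normal distribution), and the number of resamples `n_boot = 1000` and the 95% percentile interval are illustrative choices rather than recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=200)  # made-up observations

n_boot = 1000
boot_medians = np.empty(n_boot)
for b in range(n_boot):
    # Draw a resample of the same size as the data, with replacement
    resample = rng.choice(sample, size=sample.size, replace=True)
    boot_medians[b] = np.median(resample)

# The spread of the bootstrap statistics estimates the estimator's uncertainty
se_median = float(boot_medians.std(ddof=1))
ci_low, ci_high = np.percentile(boot_medians, [2.5, 97.5])

print(f"sample median: {np.median(sample):.3f}")
print(f"bootstrap SE : {se_median:.3f}")
print(f"95% percentile interval: [{ci_low:.3f}, {ci_high:.3f}]")
```

The same loop works for any statistic: replace `np.median` with the estimator of interest, or refit a model on each resample to study the variability of its parameters.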