# 6.3 Estimating Test Metrics

In the previous lesson, we learned several scores (accuracy, precision, recall, F1) for evaluating classification models. We calculated these scores on the training data, that is, the same data that was used to fit the model. In Chapter 5, we saw that evaluating machine learning models on the training data is problematic because a model can achieve a good training score by _overfitting_ to the training data. We argued that the goal of a machine learning model should be to achieve a good score on test data (Chapter 5.4). However, ground truth labels are often not available for the test data. Nevertheless, we can use cross-validation on the training data to estimate the test scores (Chapter 5.5). These so-called _validation scores_ can be used to select between models and tune hyperparameters (Chapter 5.6).

Although Chapter 5 was about regression models, the exact same program can be carried out for classification models. Instead of calculating the *training* accuracy, precision, etc., we estimate the *test* accuracy, precision, etc. using cross-validation. This lesson demonstrates how to carry out this program, but the concepts (and even the code) are essentially identical to Chapter 5.

First, we define a classifier that we want to evaluate.

In [0]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

data_dir = "http://dlsun.github.io/pods/data/"
df_breast = pd.read_csv(data_dir + "breast-cancer.csv")

X_train = df_breast[["Clump Thickness", "Uniformity of Cell Size", "Uniformity of Cell Shape",
                     "Marginal Adhesion", "Single Epithelial Cell Size", "Bare Nuclei",
                     "Bland Chromatin", "Normal Nucleoli", "Mitoses"]]
y_train = df_breast["Class"]

pipeline = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=10)
)

pipeline.fit(X=X_train, y=y_train)

To estimate test scores using $k$-fold cross-validation, we use the `cross_val_score` function in scikit-learn. For example, to estimate the test accuracy, we do the following:

In [0]:
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(pipeline, X_train, y_train, 
                            cv=10, scoring="accuracy")
cv_scores

We get 10 accuracy scores, one from each of the $k=10$ folds. It is customary to average these accuracy scores to obtain one overall estimate of the test accuracy.

In [0]:
cv_scores.mean()
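
To make the procedure concrete, the sketch below repeats the 10-fold loop by hand and compares it with `cross_val_score`. It uses synthetic data from `make_classification` as a stand-in (an assumption, so the example runs without downloading the breast cancer file), and the names `X_demo`, `y_demo`, and `demo_pipeline` are hypothetical. It also assumes scikit-learn's documented default that an integer `cv` with a classifier means stratified folds without shuffling.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: NOT the breast cancer data set).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=9, n_informative=5, random_state=0
)

demo_pipeline = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=10)
)

# Split the data into 10 stratified folds, in order (no shuffling).
folds = StratifiedKFold(n_splits=10)

manual_scores = []
for train_idx, val_idx in folds.split(X_demo, y_demo):
    # Fit a fresh copy of the pipeline on the 9 training folds...
    model = clone(demo_pipeline)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    # ...and score it on the 1 held-out fold.
    y_pred = model.predict(X_demo[val_idx])
    manual_scores.append(accuracy_score(y_demo[val_idx], y_pred))
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(demo_pipeline, X_demo, y_demo,
                                 cv=10, scoring="accuracy")

# Check whether the hand-written loop agrees with cross_val_score.
print(np.allclose(manual_scores, library_scores))
```

Note that the scaler is refit inside every fold, so the held-out fold never influences the standardization. This is one reason to pass the whole pipeline, rather than pre-scaled data, to `cross_val_score`.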

This validation accuracy is high, but lower than the 97.2% training accuracy that we obtained in the previous lesson. This makes sense because it is generally harder for a model to predict on data it has not seen than on data it was fit to. Recall that Wenger's neural network model that won the Google Science Fair had an accuracy of 97.4%. We have come close to achieving that using a simple $10$-nearest neighbors classifier.
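
The gap between training and validation accuracy is easiest to see in an extreme case. The sketch below is an illustration on synthetic data (an assumption; `X_demo`, `y_demo`, and `one_nn` are hypothetical names), with 10% of the labels flipped at random. A $1$-nearest neighbor classifier memorizes the training data, because each training point is its own nearest neighbor, but that memorization does not carry over to held-out folds.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with some label noise (assumption, for illustration).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=9, n_informative=5,
    flip_y=0.1, random_state=0
)

one_nn = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=1)
)

# Training accuracy: fit and score on the same data.
one_nn.fit(X_demo, y_demo)
train_accuracy = one_nn.score(X_demo, y_demo)

# Validation accuracy: 10-fold cross-validation.
validation_accuracy = cross_val_score(one_nn, X_demo, y_demo,
                                      cv=10, scoring="accuracy").mean()

print(train_accuracy, validation_accuracy)
```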

scikit-learn can also estimate the precision and recall of a class $c$. With the scoring strings `"precision"` and `"recall"`, the class labeled $1$ (or `True`) is treated as the positive class, so the labels need to be converted to a binary label that is $1$ (or `True`) if the observation is in class $c$ and $0$ (or `False`) otherwise. For example, to calculate the precision for benign tumors (class 0), we define the new label `is_benign`.

In [0]:
is_benign = (y_train == 0)

cross_val_score(pipeline, X_train, is_benign, 
                cv=10, scoring="precision").mean()

To calculate the validation _recall_ for benign tumors, we simply change the scoring method:

In [0]:
cross_val_score(pipeline, X_train, is_benign, 
                cv=10, scoring="recall").mean()
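
An alternative to relabeling, sketched below on synthetic data (an assumption; all `*_demo` and `*_class0` names are hypothetical), is to keep the original labels and build a scorer that names the class of interest through the `pos_label` argument of `precision_score` and `recall_score`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, precision_score, recall_score
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with labels 0 and 1 (assumption, for illustration).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=9, n_informative=5, random_state=0
)

demo_pipeline = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=10)
)

# Scorers that treat class 0 as the "positive" class.
precision_class0 = make_scorer(precision_score, pos_label=0)
recall_class0 = make_scorer(recall_score, pos_label=0)

precision_0 = cross_val_score(demo_pipeline, X_demo, y_demo,
                              cv=10, scoring=precision_class0).mean()
recall_0 = cross_val_score(demo_pipeline, X_demo, y_demo,
                           cv=10, scoring=recall_class0).mean()

print(precision_0, recall_0)
```

Here the model is fit on the original labels, so the numbers can differ slightly from the relabeling approach when a vote among an even number of neighbors is tied, since tie-breaking can depend on the order of the class labels.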

Likewise, the validation precision and recall for malignant tumors are:

In [0]:
is_malignant = (y_train == 1)

precision = cross_val_score(pipeline, X_train, is_malignant, 
                            cv=10, scoring="precision").mean()
recall = cross_val_score(pipeline, X_train, is_malignant, 
                         cv=10, scoring="recall").mean()

precision, recall
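
Calling `cross_val_score` twice fits the model on every fold twice. As a sketch of one way to avoid this, scikit-learn's `cross_validate` accepts a list of scoring names and returns all of them from a single pass. The data below are synthetic and the names `X_demo`, `y_demo`, `is_positive`, and `demo_pipeline` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption, for illustration).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=9, n_informative=5, random_state=0
)
is_positive = (y_demo == 1)

demo_pipeline = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=10)
)

# One pass over the 10 folds, two metrics per fold.
results = cross_validate(demo_pipeline, X_demo, is_positive,
                         cv=10, scoring=["precision", "recall"])

# Each metric is stored under the key "test_<name>".
mean_precision = results["test_precision"].mean()
mean_recall = results["test_recall"].mean()

print(mean_precision, mean_recall)
```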

Another term for recall is _sensitivity_. Wenger's model was 99.1% sensitive to malignancy; our model is quite a bit worse, achieving a sensitivity (i.e., recall) of only 93.7%.

## Hyperparameter Tuning

Could we do better with a different value of $k$ (here, the number of neighbors, not the number of folds)? We can use cross-validation on a grid of $k$ values and pick the one that maximizes some score. Since the F1 score summarizes both precision and recall, we use F1 as the score. There are two F1 scores, one for the benign masses and one for the malignant masses; the `_macro` suffix in `"f1_macro"` specifies that we take their unweighted average.
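
To see what the macro average means, here is a small worked example on made-up labels (the arrays `y_true` and `y_pred` are invented for illustration). Class 0 has precision and recall of $5/6$, so its F1 score is $5/6$; class 1 has precision and recall of $3/4$, so its F1 score is $3/4$. The macro F1 score is the plain mean of the two.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up labels: six observations in class 0, four in class 1.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

# One F1 score per class, in the order of the sorted labels (0, then 1).
per_class_f1 = f1_score(y_true, y_pred, average=None)

# The macro F1 score is the unweighted mean of the per-class scores.
macro_f1 = f1_score(y_true, y_pred, average="macro")

print(per_class_f1, macro_f1)
```

Because the mean is unweighted, the smaller class counts as much as the larger one, which seems a reasonable choice when both benign and malignant predictions matter.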

In [0]:
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(
    pipeline,
    param_grid={"kneighborsclassifier__n_neighbors": range(1, 50)},
    scoring="f1_macro",
    cv=10
)

grid_search.fit(X_train, y_train)
grid_search.best_params_
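
Two details of the grid search may be worth spelling out. First, the prefix in `kneighborsclassifier__n_neighbors` is the step name that `make_pipeline` generates from the lowercased class name. Second, the fitted search object keeps the mean validation score for every candidate in `cv_results_`, not just the best one. The sketch below illustrates both on synthetic data with a smaller grid and fewer folds (assumptions made to keep it fast; `demo_search` and the other `demo` names are hypothetical).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption, for illustration).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=9, n_informative=5, random_state=0
)

demo_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Step names are the lowercased class names.
print(list(demo_pipeline.named_steps))

demo_search = GridSearchCV(
    demo_pipeline,
    param_grid={"kneighborsclassifier__n_neighbors": range(1, 20)},
    scoring="f1_macro",
    cv=5
)
demo_search.fit(X_demo, y_demo)

# One row per candidate value of k, with its mean validation score.
results = pd.DataFrame(demo_search.cv_results_)[
    ["param_kneighborsclassifier__n_neighbors",
     "mean_test_score", "std_test_score"]
]
print(results.sort_values("mean_test_score", ascending=False).head())

# The best score is the largest mean validation score in the table.
print(demo_search.best_params_, demo_search.best_score_)
```

Looking at the full table, rather than only `best_params_`, shows whether the best $k$ is a clear winner or whether several neighboring values score about the same.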

Is this value of $k$ better? It certainly has a higher average F1 score. What about its precision and recall for malignant masses?

In [0]:
new_precision = cross_val_score(
    grid_search.best_estimator_,
    X_train, is_malignant,
    scoring="precision",
    cv=10).mean()

new_recall = cross_val_score(
    grid_search.best_estimator_,
    X_train, is_malignant,
    scoring="recall",
    cv=10).mean()

precision, new_precision, recall, new_recall

We see that the new model has a higher precision _and_ a higher recall for malignancy. This suggests that the new model is unambiguously better. (If only recall had been higher, then it could be argued that we were simply trading off precision for recall.)

## Exercises

Exercises 1-2 ask you to use the Titanic data set (`https://dlsun.github.io/pods/data/titanic.csv`).

1\. Train a 5-nearest neighbors model to predict whether or not a passenger on the Titanic survived, using their age, sex, and class as features. Estimate the test accuracy, precision, and recall of this model for the survivors and the deceased.

2\. You want to build a $k$-nearest neighbors model to predict whether or not a passenger on the Titanic survived, using their age, sex, and class. 

- What value of $k$ optimizes overall accuracy?
- What value of $k$ optimizes the F1 score for the deceased?

Does the same value of $k$ optimize accuracy and the F1 score?