# Lecture 5.3: Evaluation

In this lecture, we are going to evaluate sklearn models.

**Learning goals:**
- split train, validation, and test set with sklearn
- run end-to-end machine learning experiments
- compare model quality
- tune a hyperparameter

## 1. Introduction

An estimated [\$70 million](https://en.wikipedia.org/wiki/Counterfeit_United_States_currency) in counterfeit bills are currently in circulation in the USA. That's quite a hustle 😎💰. The Federal Reserve doesn't like it however, and wants our help detecting fake banknotes. This can be a hard task: tiny defects are tough to spot, and counterfeiters constantly change their techniques. 

Machine learning models can help, because a well-trained model can perform well on unseen data. The [banknote authentication dataset](https://archive.ics.uci.edu/ml/datasets/banknote+authentication) frames this challenge as a binary classification task. Let's evaluate and compare ML models trained on this fake/genuine banknote dataset.

Let's follow the checklist from the lecture slides.

## 2. 🤔 define ML task

As defined above, we are trying to solve a _binary classification task_: fake vs. genuine banknotes.

## 3. 🔍 assess model feasibility

Detecting fake banknotes is a pretty hard problem, but it can be done by human experts. ML is also particularly good at detecting low-level patterns in images. We also know that this is a solved problem, and we have a dataset available. This task is therefore feasible!

## 4. 📂 find data you want to do well on

The banknote authentication dataset is a good representation of the bills we might encounter in the "wild". We can load it into a `DataFrame`:


In [1]:
import pandas as pd

df = pd.read_csv('banknote.csv')
df.head()



| | feature_1 | feature_2 | feature_3 | feature_4 | is_fake |
|---|---|---|---|---|---|
| 0 | 1.121806 | 1.149455 | -0.975970 | 0.354561 | 0 |
| 1 | 1.447066 | 1.064453 | -0.895036 | -0.128767 | 0 |
| 2 | 1.207810 | -0.777352 | 0.122218 | 0.618073 | 0 |
| 3 | 1.063742 | 1.295478 | -1.255397 | -1.144029 | 0 |
| 4 | -0.036772 | -1.087038 | 0.736730 | 0.096587 | 0 |


In [6]:
df.shape

(1372, 5)

In [2]:
df.describe()

| | feature_1 | feature_2 | feature_3 | feature_4 | is_fake |
|---|---|---|---|---|---|
| count | 1372.0 | 1372.0 | 1372.0 | 1372.0 | 1372.0 |
| mean | 0.0 | 4.143106e-17 | 1.035777e-17 | -5.567299e-17 | 0.444606 |
| std | 1.000365 | 1.000365 | 1.000365 | 1.000365 | 0.497103 |
| min | -2.630737 | -2.675252 | -1.551303 | -3.502703 | 0.0 |
| 25% | -0.776547 | -0.6188189 | -0.6899455 | -0.5817379 | 0.0 |
| 50% | 0.021974 | 0.06771828 | -0.1812706 | 0.2880644 | 0.0 |
| 75% | 0.840243 | 0.8338757 | 0.4135174 | 0.7553713 | 1.0 |
| max | 2.249008 | 1.879908 | 3.836586 | 1.73368 | 1.0 |


Each row is a banknote example, and there are four numerical features, `feature_i`, and one binary label, `is_fake`. It might be surprising that these examples aren't images. Instead, they are features extracted from [wavelet transforms](https://en.wikipedia.org/wiki/Wavelet_transform) of the banknote pictures.

All four features have $mean \approx 0$ and $std \approx 1$: they have already been standardized. The `count` row of the summary statistics table shows that there are no missing values. This means no further data preprocessing is necessary, and we can directly train our classifiers. 🏃‍♂️
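The reported `std` of 1.000365 rather than exactly 1 has a plausible explanation: pandas computes the sample standard deviation (`ddof=1`) by default, and $\sqrt{1372/1371} \approx 1.000365$. The sketch below illustrates this on synthetic data, assuming the features were standardized with the population standard deviation (`ddof=0`). The column names and random values are made up for illustration and are not the banknote data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the banknote features (hypothetical data, not the real dataset).
rng = np.random.default_rng(0)
raw = pd.DataFrame(rng.normal(loc=5.0, scale=3.0, size=(1372, 2)),
                   columns=['feature_a', 'feature_b'])

# Standardize with the population standard deviation (ddof=0).
standardized = (raw - raw.mean()) / raw.std(ddof=0)

# No missing values: the total count of NaN entries is zero.
n_missing = standardized.isna().sum().sum()

# pandas uses the sample standard deviation (ddof=1) by default,
# which is larger than 1 by a factor sqrt(n / (n - 1)).
sample_std = standardized.std()
expected_sample_std = np.sqrt(len(raw) / (len(raw) - 1))

print('missing values:', n_missing)
print('sample std per column:', sample_std.round(6).tolist())
print('sqrt(n / (n - 1)):', round(float(expected_sample_std), 6))
```

The same two checks (`isna().sum()` and `std()`) can be run on the real `df` to confirm what the `describe()` table suggests.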

## 6. ✂️ split a test set and set it aside

We usually jump straight into converting this `DataFrame` to features, which we then use to `.fit()` our model. This time however, we first split a test set.

sklearn makes this easy with the `train_test_split` function. The [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) mentions that it can split many different inputs:

> Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.

We choose to split our `DataFrame` 80%/20%:

In [7]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, test_size=0.20, random_state=777)
print(f'total size: {len(df)}, train set size: {len(train_df)}, test set size: {len(test_df)}')

total size: 1372, train set size: 1097, test set size: 275
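To make the behaviour of `train_test_split` more concrete, here is a minimal sketch that splits row indices instead of the real `DataFrame`. It checks that the two parts are disjoint, that together they cover every row, and that reusing the same `random_state` reproduces the same split. The `rows` array is a stand-in introduced for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 1372 banknote rows: just their row indices.
rows = np.arange(1372)

# Same arguments as above: 20% of the rows go to the test set.
train_rows, test_rows = train_test_split(rows, test_size=0.20, random_state=777)

# The two parts are disjoint and together cover every row exactly once.
no_overlap = len(np.intersect1d(train_rows, test_rows)) == 0
covers_all = len(train_rows) + len(test_rows) == len(rows)

# Re-running with the same random_state reproduces the exact same split.
train_again, test_again = train_test_split(rows, test_size=0.20, random_state=777)
same_split = np.array_equal(test_rows, test_again)

print(len(train_rows), len(test_rows), no_overlap, covers_all, same_split)
```

Fixing `random_state` is what makes the test set stable across notebook runs, which matters because the test set must stay untouched during development.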


We choose to "set aside" the test set for later use. This prevents us from accidentally using data from the test set during development:

In [8]:
train_df.to_csv('banknote_train.csv', index=False)
test_df.to_csv('banknote_test.csv', index=False)

## 7. ✂️ split train & validation sets

We choose to split the validation set _lazily_, meaning we won't save it to disk like the test set. This is fine, because validation sets _can_ be reused.  
That is, our results won't be statistically compromised if the split isn't the same for each round of experiments.

In [9]:
df = pd.read_csv('banknote_train.csv')
train_df, val_df = train_test_split(df, test_size=0.20, random_state=4242)
print(f'train set size: {len(train_df)}, validation set size: {len(val_df)}')

train set size: 877, validation set size: 220


## 8. 🎯 define single number metric

We are dealing with a roughly balanced binary classification task (about 44% of the examples are fake), and therefore choose accuracy as our single number metric. This sole number will define our model quality.
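Accuracy is easier to interpret next to a trivial baseline. The sketch below uses sklearn's `DummyClassifier` to compute the accuracy of always predicting the majority class. The labels are hypothetical: 610 fakes out of 1372 is an assumption chosen to match the `is_fake` mean of about 0.4446 in the summary table, not a count read from the dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels with roughly the same class ratio as the dataset:
# 610 fakes (1) and 762 genuine notes (0), so mean(is_fake) is about 0.4446.
y = np.array([1] * 610 + [0] * 762)
X = np.zeros((len(y), 1))  # the baseline ignores the features

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy='most_frequent').fit(X, y)

# Scored on the same labels: this only measures the class ratio.
baseline_accuracy = baseline.score(X, y)

print(f'majority-class baseline accuracy: {baseline_accuracy:.3f}')
```

Any model worth keeping should beat this number by a clear margin. On a heavily imbalanced dataset the baseline would already be high, which is why accuracy is a poor choice there.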

## 9. 🔁 train + validate until happy with losses and metric(s)

We are now ready to experiment! Let's first create features and labels. We could use all four features, but it turns out that the classification task is then too easy, and it wouldn't be interesting to compare training and validation metrics 😑.

So instead, we'll pick features 2 & 4 to spice up the task difficulty 🌶️

In [10]:
def to_features(df):
    X = df[['feature_2', 'feature_4']].values
    y = df['is_fake'].values
    return X, y

X_train, y_train = to_features(train_df)
X_val, y_val = to_features(val_df)

For our first round of experiments, we'd like to know which type of model best solves our task. We'll use three different classifiers:
- logistic regression
- random forest
- SVM with RBF kernel

ℹ️ Don't worry if you haven't heard of the last two models before! There is a whole cornucopia of ML models out there, with new ones published every day. But a good place to start is with all the [sklearn models](https://scikit-learn.org/stable/supervised_learning.html). Not only is their documentation a great place to learn about how they work and how to use them, but sklearn also has many additional resources, like this [classifier comparison chart](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html). And they mostly follow the sklearn `.fit()` and `.predict()` API, so you can try them out straight away! The best way to learn about models is to explore which are commonly used when you encounter a new ML task.

We fit these models on the training data:

In [11]:
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rf_clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
svm_clf = SVC(kernel='rbf', C=1000, random_state=0).fit(X_train, y_train)
lr_clf = LogisticRegression(random_state=0).fit(X_train, y_train)

We now want to calculate the _accuracy_ of our models. sklearn provides many metric functions in the [`sklearn.metrics`](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics) module, including `accuracy_score()`. It compares labels and predictions, so we can use the `.predict()` method of the model API. For example, for our logistic regression model:

In [12]:
from sklearn.metrics import accuracy_score

# predict labels
y_predict = lr_clf.predict(X_val)
# compare them to true labels
accuracy_score(y_val, y_predict)

0.6863636363636364

69% accuracy, not bad!

🧠 What does `accuracy` represent? How does one calculate it?
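One way to approach this question is to compute accuracy by hand and compare it with `accuracy_score()`. The sketch below does so on a handful of made-up labels and predictions; accuracy is the number of correct predictions divided by the total number of predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy labels and predictions (made up for illustration).
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1])

# Accuracy = number of correct predictions / total number of predictions.
manual_accuracy = np.mean(y_true == y_pred)

print(manual_accuracy, accuracy_score(y_true, y_pred))
```

Six of the eight predictions match the labels, so both values are 0.75.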

Since it is such a common use case, sklearn makes evaluation even easier by assigning default metrics to popular tasks and model types. For _classifiers_, the default metric is already accuracy, so we can use the `.score()` method from the model API directly. sklearn will predict labels and compare them to the true labels for us:

In [13]:
lr_clf.score(X_val, y_val)

0.6863636363636364
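The default metric behind `.score()` depends on the model type. As a small sketch on synthetic data (not the banknote dataset), the block below checks that `.score()` matches `accuracy_score()` for a classifier and `r2_score()` for a regressor. The data and variable names are invented for illustration, and the models are scored on their own training data only to compare the two routes, not to evaluate them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

# Classification target: label depends on the sign of a noisy linear score.
y_class = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)
# Regression target: noisy linear function of the features.
y_reg = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=200)

clf = LogisticRegression().fit(X, y_class)
reg = LinearRegression().fit(X, y_reg)

# Classifier: .score() is accuracy. Regressor: .score() is R^2.
clf_matches = np.isclose(clf.score(X, y_class), accuracy_score(y_class, clf.predict(X)))
reg_matches = np.isclose(reg.score(X, y_reg), r2_score(y_reg, reg.predict(X)))

print(clf_matches, reg_matches)
```

When the default is not the metric you care about, call the function from `sklearn.metrics` explicitly instead of relying on `.score()`.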

Now that we know how to evaluate sklearn models, let's compare all of our banknote classifiers:

In [14]:
clfs = [rf_clf, svm_clf, lr_clf]

for clf in clfs:
    accuracy = clf.score(X_val, y_val)
    print(f'classifier: {type(clf).__name__}, validation accuracy: {accuracy}')

classifier: RandomForestClassifier, validation accuracy: 0.9
classifier: SVC, validation accuracy: 0.9181818181818182
classifier: LogisticRegression, validation accuracy: 0.6863636363636364


Wow, these models are pretty good! 🤩

Let's carry out a second round of experiments to determine the optimal SVM hyperparameter. We're particularly interested in `C`, which controls regularization: in sklearn's `SVC`, smaller values of `C` mean stronger regularization.

💪 Train 5 SVMs, then compare their training & validation accuracy.
- use the `C` values listed below
- store the training accuracies in a list called `train_accuracies`
- store the validation accuracies in a list called `val_accuracies`
- use the unit test to debug and verify your code

In [22]:
# SVC was imported in an earlier cell (from sklearn.svm import SVC)
c_values = [0.1, 1, 10, 100, 1000]
clfs, train_accuracies, val_accuracies = [], [], []

for c in c_values:
    # train one RBF-kernel SVM per value of C
    clf = SVC(kernel='rbf', C=c, random_state=0).fit(X_train, y_train)
    clfs.append(clf)
    # record accuracy on the training and validation sets
    train_accuracies.append(clf.score(X_train, y_train))
    val_accuracies.append(clf.score(X_val, y_val))


In [23]:
import math

def print_results(c_values, train_accuracies, val_accuracies):
    for c, train_acc, val_acc in zip(c_values, train_accuracies, val_accuracies):
        print(f'C: {c}, train acc: {train_acc}, val acc: {val_acc}')
        
        
def test_svm_C_tuning():
    assert train_accuracies, "Can't find train_accuracies. Did you use the correct variable name?"
    assert val_accuracies, "Can't find val_accuracies. Did you use the correct variable name?"
    assert len(train_accuracies) == 5, f"Expected 5 training accuracies, got {len(train_accuracies)}"
    assert len(val_accuracies) == 5, f"Expected 5 validation accuracies, got {len(val_accuracies)}"
    print_results(c_values, train_accuracies, val_accuracies)
    assert math.isclose(4.221208, sum(train_accuracies), rel_tol=1e-5), "Something is wrong with your training accuracy values"
    assert math.isclose(4.431818, sum(val_accuracies), rel_tol=1e-5), "Something is wrong with your validation accuracy values"
    print('Success! 🎉')
    
test_svm_C_tuning()

C: 0.1, train acc: 0.7913340935005702, val acc: 0.8181818181818182
C: 1, train acc: 0.8255416191562144, val acc: 0.8545454545454545
C: 10, train acc: 0.855188141391106, val acc: 0.9181818181818182
C: 100, train acc: 0.8665906499429875, val acc: 0.9227272727272727
C: 1000, train acc: 0.8825541619156214, val acc: 0.9181818181818182
Success! 🎉


🧠 What is the best value for the hyperparameter `C`?

🧠 For which value of `C` does the SVM seem to start overfitting?

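The best value for the hyperparameter `C` is 100, which gives the highest validation accuracy (about 0.923). Validation accuracy dips slightly at `C = 1000` while training accuracy keeps rising, which could hint at the onset of overfitting. However, the dip corresponds to a single validation example, and training accuracy stays below validation accuracy for every value of `C`, so there is no clear evidence of overfitting for any of the values tried.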
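The selection step can also be done programmatically. The sketch below copies the accuracies printed above, picks the `C` with the highest validation accuracy, and computes the gap between training and validation accuracy for each value. Treating a large positive gap as a sign of overfitting is a rule of thumb here, not a formal test.

```python
import numpy as np

# Values copied from the results printed above.
c_values = [0.1, 1, 10, 100, 1000]
train_accuracies = [0.7913340935005702, 0.8255416191562144, 0.855188141391106,
                    0.8665906499429875, 0.8825541619156214]
val_accuracies = [0.8181818181818182, 0.8545454545454545, 0.9181818181818182,
                  0.9227272727272727, 0.9181818181818182]

# Model selection uses the validation metric, not the training metric.
best_index = int(np.argmax(val_accuracies))
best_c = c_values[best_index]

# Gap between training and validation accuracy; a large positive gap
# (training much higher than validation) would suggest overfitting.
gaps = [t - v for t, v in zip(train_accuracies, val_accuracies)]

print(f'best C: {best_c}, validation accuracy: {val_accuracies[best_index]:.4f}')
print('train - val gaps:', [round(g, 4) for g in gaps])
```

Here every gap is negative, meaning the models score higher on the validation set than on the training set for all five values of `C`.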

## 10. 📏 evaluate model on test set to get final metric

The SVM is our fake banknote detection model of choice. The Federal Reserve would like guarantees about how well this model is going to perform in production. To estimate the expected accuracy on unseen examples, we use our _test set_ to measure the metric.

In [15]:
test_df = pd.read_csv('banknote_test.csv')
X_test, y_test = to_features(test_df)
svm_clf.score(X_test, y_test)

0.8945454545454545

🧠🧠 The test accuracy is slightly lower than the validation accuracy.
- What does test accuracy < validation accuracy usually indicate?
- Is the difference significant in this case? 
- How would you verify this?
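One rough way to judge whether the difference matters is to put an approximate confidence interval around each accuracy. The sketch below uses the normal approximation to the binomial distribution. The helper `accuracy_confidence_interval` is a hypothetical function written for this illustration, and the approach is a back-of-the-envelope check rather than a formal significance test.

```python
import math

def accuracy_confidence_interval(accuracy, n_examples, z=1.96):
    """Approximate 95% interval for an accuracy measured on n_examples.

    Uses the normal approximation to the binomial distribution.
    """
    standard_error = math.sqrt(accuracy * (1 - accuracy) / n_examples)
    return accuracy - z * standard_error, accuracy + z * standard_error

# Numbers reported above: 275 test examples, 220 validation examples.
test_low, test_high = accuracy_confidence_interval(0.8945454545454545, 275)
val_low, val_high = accuracy_confidence_interval(0.9181818181818182, 220)

print(f'test accuracy interval:       [{test_low:.3f}, {test_high:.3f}]')
print(f'validation accuracy interval: [{val_low:.3f}, {val_high:.3f}]')
```

With only a few hundred examples per set, each interval is several percentage points wide and contains the other set's accuracy, so a gap of about two points is within what sampling noise alone could produce. Repeating the experiment with different split seeds would be another way to check.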


## 4. Summary

Today, we learned about **evaluation methods**. First, we noted that training loss makes for a bad model quality metric, since it cannot detect **overfitting**. We introduced the idea of a held-out **test set** to better estimate generalization properties on unseen examples. We highlighted that test sets only work if they are of the **same distribution** as the data we will encounter at prediction time, and if they are **large enough**. We then described how an independent test set can still be prone to overfitting if used as part of **model development**. Since machine learning development is **experimental** & **iterative** in nature, the data scientist introduces an **information leak** between the test set and the model hyperparameters. We introduced the **validation set** as a solution. We split the responsibilities of **comparing** models and **assessing** models, which allows engineers to both develop and measure the quality of machine learning solutions. We then showed that losses weren't always interpretable values, and introduced new **metrics**, like classification accuracy or regression MSE. We underlined the importance of choosing a **single number metric** to define model quality and speed up model development. We then synthesized all these new workflows into an **ML development checklist**, which captures the steps of typical ML engineering experiments. Finally, we applied this checklist and built a viable ML solution from scratch for banknote authentication in sklearn.


# Resources

## Core Resources

- [Machine learning yearning](https://www.deeplearning.ai/machine-learning-yearning/)  
The Andrew Ng reference for ML engineering, including terse and practical sections about validation and test sets
- [sklearn on evaluation](https://scikit-learn.org/stable/modules/model_evaluation.html)  
Verbose official documentation on sklearn evaluation methods and APIs

## Additional Resources

- [Google ML crash course - accuracy](https://developers.google.com/machine-learning/crash-course/classification/accuracy)  
Intuitive explanation of the accuracy metric and its equation