# Lecture 4.2: Evaluation Pt.1

[**Lecture Slides**](https://docs.google.com/presentation/d/1Z2kUep8v6dKPlUrJh7sFBawHO0h7MpvaMbY8xrVH_9I/edit?usp=sharing)

In this lecture, we are going to evaluate sklearn, keras, and pytorch models.

**Learning goals:**
- split train, validation, and test set with sklearn
- calculate validation metrics in keras
- split an image dataset
- add validation scores to a pytorch training loop
- run end to end machine learning experiments
- compare model quality
- tune a hyperparameter
- analyze train & val loss curves for neural network optimization

## 1. sklearn

Let's revisit the banknote authentication dataset. We have trained many models on this data in chapter 3, but never _evaluated_ them. So this time, let's follow the checklist from the lecture slides.

#### 1.1 🤔 define ML task

We have already defined this in lecture 3.9. We are trying to solve a binary classification task: fake vs genuine banknotes.

#### 1.2 🔍 assess model feasibility

Detecting fake banknotes is a pretty hard problem, but it can be done by human experts. ML is also particularly good at detecting low-level patterns in images. We also know that this is a solved problem, and we have a dataset available. This task is therefore feasible!

#### 1.3 📂 find data you want to do well on

The banknote authentication dataset is a good representation of the bills we might encounter in the "wild". Let's download it from amazon S3:


In [None]:
!wget --quiet https://introduction-to-machine-learning-ilia-university.s3.eu-west-2.amazonaws.com/banknote.csv

We can then load it into a `DataFrame`:

In [None]:
import pandas as pd

df = pd.read_csv('banknote.csv')
df.head()

#### 1.4 ✂️ split a test set and set it aside

We usually jump straight into converting this `DataFrame` to features, which we then use to `.fit()` our model. This time however, we first split a test set.

sklearn makes this easy with the `train_test_split` function. The [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) mentions that it can split many different inputs:

> Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.

We choose to split our `DataFrame` 80%/20%:

In [None]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, test_size=0.20, random_state=777)
print(f'total size: {len(df)}, train set size: {len(train_df)}, test set size: {len(test_df)}')

We choose to "set aside" the test set for later use. This prevents us from accidentally using data from the test set during development:

In [None]:
train_df.to_csv('banknote_train.csv', index=False)
test_df.to_csv('banknote_test.csv', index=False)

#### 1.5 ✂️ split train & validation sets

We choose to split the validation set _lazily_, meaning we won't save it to disk like the test set. This is fine, because validation sets _can_ be reused.  
i.e. our results won't be statistically compromised if the split isn't the same for each round of experiments.

In [None]:
df = pd.read_csv('banknote_train.csv')
train_df, val_df = train_test_split(df, test_size=0.20, random_state=4242)
print(f'train set size: {len(train_df)}, validation set size: {len(val_df)}')

#### 1.6 🎯 define single number metric

We are dealing with a balanced binary classification task, and therefore choose accuracy as our single number metric. This sole number will define our model quality.
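
A quick way to check the "balanced" assumption is to look at the class shares. The sketch below uses a small made-up `DataFrame`: the `toy_df` values are purely illustrative, and only the `is_fake` column name comes from this notebook. If the majority class share is close to 50%, a model that always predicts that class only reaches about 50% accuracy, which makes accuracy an informative metric.

```python
import pandas as pd

# made-up labels for illustration only, not the real banknote data
toy_df = pd.DataFrame({'is_fake': [0, 1, 0, 1, 1, 0, 0, 1]})

# share of each class in the dataset
class_shares = toy_df['is_fake'].value_counts(normalize=True)
# accuracy of a "model" that always predicts the most frequent class
majority_baseline = class_shares.max()
print(class_shares)
print(f'majority class baseline accuracy: {majority_baseline}')
```

The same two lines should work on `train_df` to check the real class balance.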

#### 1.7 🔁 train + validate until happy with losses and metric(s)

We are now ready to experiment! Let's first create features and labels. We could use all four features, but it turns out that the classification task is then too easy, and it wouldn't be interesting to compare training and validation metrics 😑.

So instead, we'll pick features 2 & 4 to spice up the task difficulty 🌶️

In [None]:
def to_features(df):
    X = df[['feature_2', 'feature_4']].values
    y = df['is_fake'].values
    return X, y

X_train, y_train = to_features(train_df)
X_val, y_val = to_features(val_df)

For our first round of experiments, we'd like to know which type of model best solves our task. We'll use three different classifiers:
- logistic regression
- random forest
- SVM with RBF kernel

We fit these models on the training data:

In [None]:
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rf_clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
svm_clf = SVC(kernel='rbf', C=1000, random_state=0).fit(X_train, y_train)
lr_clf = LogisticRegression(random_state=0).fit(X_train, y_train)

We now want to calculate the _accuracy_ of our models. sklearn provides many metric functions in the [`sklearn.metrics`](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics) module, including `accuracy_score()`. It compares labels and predictions, so we can use the `.predict()` method of the model api. For example, for our logistic regression model:

In [None]:
from sklearn.metrics import accuracy_score

# predict labels
y_predict = lr_clf.predict(X_val)
# compare them to true labels
accuracy_score(y_val, y_predict)

69% accuracy, not bad!

🧠 What does `accuracy` represent? How does one calculate it?

Since it is such a common use case, sklearn makes evaluation even easier by assigning default metrics to popular tasks and model types. For _classifiers_, the default metric is already accuracy, so we can use the `.score()` method from the model api directly. sklearn will predict labels and compare them to the true labels for us:

In [None]:
lr_clf.score(X_val, y_val)
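
As a sanity check, here is a minimal sketch showing that accuracy is simply the fraction of predictions that match the true labels. The label and prediction arrays are made up for illustration, so the numbers say nothing about the banknote models.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# hypothetical labels and predictions, for illustration only
y_true_toy = np.array([0, 1, 1, 0, 1])
y_pred_toy = np.array([0, 1, 0, 0, 1])

# fraction of predictions that match the true labels
manual_accuracy = np.mean(y_pred_toy == y_true_toy)
# the same quantity, computed by sklearn
sklearn_accuracy = accuracy_score(y_true_toy, y_pred_toy)
print(manual_accuracy, sklearn_accuracy)
```

Both values agree, which is also what `.score()` returns for a classifier once it has predicted the labels itself.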

Now that we know how to evaluate sklearn models, let's compare all of our banknote classifiers:

In [None]:
clfs = [rf_clf, svm_clf, lr_clf]

for clf in clfs:
    accuracy = clf.score(X_val, y_val)
    print(f'classifier: {type(clf).__name__}, validation accuracy: {accuracy}')

Wow, these models are pretty good! 🤩

🧠 Why do you think the logistic regression model is considerably less accurate than the other two?

Let's carry out a second round of experiments to determine the optimal SVM hyperparameter. We're particularly interested in `C`, which controls regularization.

💪 Train 5 SVMs, one per `C` value, then compare their training & validation accuracy.
- use the `C` values listed below
- store the training accuracies in a list called `train_accuracies`
- store the validation accuracies in a list called `val_accuracies`
- use the unit test to debug and verify your code

In [None]:
c_values = [0.1, 1, 10, 100, 1000]

# INSERT YOUR CODE HERE

In [None]:
import math

def print_results(c_values, train_accuracies, val_accuracies):
    for c, train_acc, val_acc in zip(c_values, train_accuracies, val_accuracies):
        print(f'C: {c}, train acc: {train_acc}, val acc: {val_acc}')
        
        
def test_svm_C_tuning():
    assert train_accuracies, "Can't find train_accuracies. Did you use the correct variable name?"
    assert val_accuracies, "Can't find val_accuracies. Did you use the correct variable name?"
    assert len(train_accuracies) == 5, f"Expected 5 training accuracies, got {len(train_accuracies)}"
    assert len(val_accuracies) == 5, f"Expected 5 validation accuracies, got {len(val_accuracies)}"
    print_results(c_values, train_accuracies, val_accuracies)
    assert math.isclose(4.221208, sum(train_accuracies), rel_tol=1e-5), "Something is wrong with your training accuracy values"
    assert math.isclose(4.431818, sum(val_accuracies), rel_tol=1e-5), "Something is wrong with your validation accuracy values"
    print('Success! 🎉')
    
test_svm_C_tuning()

🧠 What is the best value for the hyperparameter `C`?

🧠 For which value of `C` does the SVM seem to start overfitting?

#### 1.8 📏 evaluate model on test set to get final metric

The SVM is our fake banknote detection model of choice. The International Monetary Fund would like guarantees about how well this model is going to perform in production. To estimate the expected accuracy on unseen examples, we measure the metric on our _test set_.

In [None]:
test_df = pd.read_csv('banknote_test.csv')
X_test, y_test = to_features(test_df)
svm_clf.score(X_test, y_test)

🧠🧠 The test accuracy is slightly lower than the validation accuracy.
- What does test accuracy < validation accuracy usually indicate?
- Is the difference significant in this case? 
- How would you verify this?
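
One rough way to reason about such gaps is to treat each test prediction as an independent coin flip and use the normal approximation to the binomial distribution. The sketch below is an illustration under that assumption, not part of the lecture checklist: `accuracy_interval` is a hypothetical helper, and the accuracy and test set size in the example are made up.

```python
import math

def accuracy_interval(accuracy, n_examples, z=1.96):
    """Approximate 95% interval for an accuracy measured on n_examples.

    Assumes independent examples and the normal approximation
    to the binomial distribution (reasonable for large n_examples).
    """
    standard_error = math.sqrt(accuracy * (1 - accuracy) / n_examples)
    return accuracy - z * standard_error, accuracy + z * standard_error

# made-up numbers for illustration: 90% accuracy measured on 400 examples
low, high = accuracy_interval(0.90, 400)
print(f'approximate 95% interval: [{low:.3f}, {high:.3f}]')
```

If the validation accuracy falls inside this interval, the gap may plausibly be sampling noise. The interval narrows as the test set grows, which is one reason test sets need to be large enough.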


## 2. Keras

We want to try a keras neural network on the exact same task and dataset as section 1, with the same single number metric. We can therefore skip to the 7th checklist entry:

#### 2.7 🔁 train + validate until happy with losses and metric(s)

Let's create a neural network. We'll use the exact same neural architecture as lecture 3.12:

In [None]:
from keras.models import Sequential
from keras.layers import Dense

nn_clf = Sequential([
    Dense(6, activation='relu', input_dim=2),
    Dense(6, activation='relu'),
    Dense(1, activation='sigmoid')
])

Before we train this neural network, we must add one extra argument to the model compilation stage. The model api provides a convenient [`.evaluate()`](https://keras.io/api/models/model_training_apis/#evaluate-method) method to calculate metrics. However, these metrics have to be specified during model compilation, and are added like this:

In [None]:
nn_clf.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

We are now ready to train and evaluate our model:

In [None]:
import numpy as np
import tensorflow as tf

np.random.seed(1337)
tf.random.set_seed(666)

history = nn_clf.fit(X_train, y_train, batch_size=32, epochs=200)
val_loss, val_accuracy = nn_clf.evaluate(X_val, y_val)
print(f'validation accuracy: {val_accuracy}')

That's a lot of info! 👀
- adding `accuracy` as model metric told keras to calculate it for each epoch on the _train_ set
- note how `accuracy` is correlated to the loss, but is more interpretable
- the _validation_ accuracy after 200 epochs is ~ 90%

The logging from keras highlights a key difference between neural networks and other models. With logistic regression, SVMs, or random forests, the optimization procedure is _stable_. This allows automated stopping methods to determine when to stop training. For example, the sklearn [`LogisticRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) model uses [L-BFGS](https://en.wikipedia.org/wiki/Limited-memory_BFGS) to find the minimum of the convex loss surface.

Neural networks, on the other hand, have highly non-convex loss surfaces, and the optimization is _unstable_. They rely on stochasticity and advanced optimization methods to reach good minima (a global minimum is not guaranteed), typically over many epochs. As a result, it can be unclear when to stop training based on the training loss alone. It would be really helpful to keep track of the validation loss & accuracy throughout the epochs. We could then monitor overfitting, and also pick the checkpoint with the best validation metric (also called [early stopping](https://en.wikipedia.org/wiki/Early_stopping)).

This is done in keras by either supplying a validation dataset to the `.fit()` method (the `validation_data` argument), or simply by providing a train/val split ratio (e.g. 0.2), and letting keras do the heavy lifting. We'll opt for the latter option.

To do so, we need to give the train + val data to keras... which we have available, since we decided to split the dataset lazily!

In [None]:
X, y = to_features(df)

We can repeat the previous steps with the added `validation_split=0.2` argument to the `.fit()` method. We'll also use `verbose=0` to keep our output cell tidy:

In [None]:
nn_clf = Sequential([
    Dense(6, activation='relu', input_dim=2),
    Dense(6, activation='relu'),
    Dense(1, activation='sigmoid')
])
nn_clf.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
np.random.seed(1337)
tf.random.set_seed(666)
history = nn_clf.fit(X, y, batch_size=32, epochs=200, validation_split=0.2, verbose=0)
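
It helps to know which examples end up in the validation set. According to the keras documentation, `validation_split` holds out the _last_ fraction of the samples passed to `.fit()`, before any shuffling. The numpy sketch below mimics that behaviour on made-up arrays. `tail_split` is a hypothetical helper written for illustration, not keras code.

```python
import numpy as np

def tail_split(X, y, validation_split=0.2):
    """Hold out the last `validation_split` fraction of the samples."""
    split_at = int(len(X) * (1.0 - validation_split))
    return (X[:split_at], y[:split_at]), (X[split_at:], y[split_at:])

# made-up data: 10 examples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

(X_tr, y_tr), (X_va, y_va) = tail_split(X_toy, y_toy)
print(f'train size: {len(X_tr)}, validation size: {len(X_va)}')
print(f'validation labels: {y_va}')
```

Because the tail of the data is used, this only gives a representative validation set if the rows are already in random order. That holds here, since `banknote_train.csv` was written after a shuffled `train_test_split`.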

We can now plot far more graphs than the last time we trained a neural network 📈
- training loss
- validation loss
- training accuracy
- validation accuracy

Usually, train & val metrics are overlaid on the same graph. This makes it easy to compare the values and detect overfitting.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

fig = plt.figure(figsize=(12,4))

ax1 = fig.add_subplot(121)
ax1.plot(history.history['loss'], label='train')
ax1.plot(history.history['val_loss'], label='val')
ax1.legend()
ax1.set_title('Loss Curve')

ax2 = fig.add_subplot(122)
ax2.plot(history.history['accuracy'], label='train')
ax2.plot(history.history['val_accuracy'], label='val')
ax2.legend()
ax2.set_title('Accuracy Curve');

😯 That's some funky accuracy curve! This "jump" at 125 epochs is likely to correspond to the model suddenly learning some important feature. Remember that neural network training is rather unstable!

Notice how this "jump" is _not_ reflected in the loss curves. This is why we made the effort of picking a single number metric which reflected model quality. It's now much clearer which epoch corresponds to the best model, and we have a better overview of the optimization process.
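
Finding the best epoch can also be done programmatically. The sketch below assumes a dictionary of per-epoch lists shaped like `history.history`. The `toy_history` values are made up for illustration.

```python
import numpy as np

# made-up per-epoch validation metrics, shaped like history.history
toy_history = {
    'val_loss': [0.69, 0.62, 0.60, 0.55, 0.56],
    'val_accuracy': [0.61, 0.70, 0.68, 0.83, 0.81],
}

# index of the epoch with the highest validation accuracy (0-based)
best_epoch = int(np.argmax(toy_history['val_accuracy']))
best_val_accuracy = toy_history['val_accuracy'][best_epoch]
print(f'best epoch: {best_epoch}, validation accuracy: {best_val_accuracy}')
```

Note that this only identifies the epoch. Recovering the corresponding weights requires having saved checkpoints during training.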

🧠 What do these curves suggest about overfitting?

#### 2.8 📏 evaluate model on test set to get final metric

Now that we have chosen our final model candidate, we can evaluate it on the test set to estimate its expected production accuracy:

In [None]:
test_df = pd.read_csv('banknote_test.csv')
X_test, y_test = to_features(test_df)
nn_clf.evaluate(X_test, y_test, verbose=0)

👍 nice score!

🧠 How come the test accuracy can be larger than the validation accuracy?



#### 2.9 Note

Typically, one wouldn't run sklearn experiments followed by keras experiments, but would combine and compare all of them in the validation step. This was done sequentially in this notebook for demonstration purposes.

## 3. pytorch

Let's revisit the emoji classification dataset. The CNN we trained in lecture 3.15 had a good training loss curve, but we never _evaluated_ it, so it might have been inaccurate or overfit.

#### 3.1 🤔 define ML task

This is a multi-class image classification task: face 😌 vs flag 🇬🇪 vs animal 🦅 emojis.


#### 3.2 🔍 assess model feasibility

Emoji image classification is a fairly easy task for humans, and image classification methods are very performant. We also already have access to a dataset, so this task is feasible!

#### 3.3 📂 find data you want to do well on

The Unicode Consortium wishes the emoji classifier to work for _all_ colour schemes, so despite being in grayscale, the emoji image dataset is a good representation of the examples we might encounter in the "wild". Let's download it from amazon S3 and unarchive it:


In [None]:
!wget --quiet https://introduction-to-machine-learning-ilia-university.s3.eu-west-2.amazonaws.com/emojis_4.2.tar.gz
!tar -xf emojis_4.2.tar.gz    

We can now load the images with pillow:

In [None]:
import glob
from PIL import Image

data_dir = 'emojis4.2'
paths = glob.glob(data_dir + '/*/*.png')
# reproducible on different os
paths.sort()
Image.open(paths[0])

#### 3.4 ✂️ split a test set and set it aside
#### 3.5 ✂️ split train & validation sets

When dealing with images, it is often easier to eagerly split validation sets, i.e. to create a dedicated `val` directory on disk. This is because images are often large files, and can't all be loaded in memory. We'll split both test and validation sets at once by reorganizing the data folders. We will place 20% of the data in a directory called `test`, and 20% of the remaining data in `val`.

Since sklearn's `train_test_split()` function also accepts plain lists, we can use it to split our image paths:

In [None]:
train_paths, test_paths = train_test_split(paths, test_size=0.20, random_state=1337)
train_paths, val_paths = train_test_split(train_paths, test_size=0.20, random_state=1337)
print(f'total size: {len(paths)}, train size: {len(train_paths)}, val size: {len(val_paths)}, test size: {len(test_paths)}')
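
Chaining two 80%/20% splits means the final proportions are roughly 64% train, 16% validation and 20% test. The sketch below works through that arithmetic for a made-up dataset size. `split_sizes` is a hypothetical helper, and it assumes the held-out share is rounded up, which is my understanding of how sklearn handles fractional `test_size` values.

```python
import math

def split_sizes(n_total, test_size=0.20, val_size=0.20):
    """Sizes of train/val/test sets after two chained splits."""
    n_test = math.ceil(test_size * n_total)       # first split: test set
    n_remaining = n_total - n_test
    n_val = math.ceil(val_size * n_remaining)     # second split: validation set
    n_train = n_remaining - n_val
    return n_train, n_val, n_test

# made-up dataset size for illustration
n_train, n_val, n_test = split_sizes(1000)
print(f'train: {n_train}, val: {n_val}, test: {n_test}')
```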

We then create the `train`, `val`, and `test` directories. We take care to maintain their sub-directory structure, since that's our labels! (this cell can take a few minutes to copy files)

In [None]:
import os
import shutil

def create_image_folder(name, paths):
    # recreate the label sub-directories inside the new split directory
    for label in ['flags', 'faces', 'animals']:
        os.makedirs(os.path.join(name, label), exist_ok=True)
    for path in paths:
        # swap the top-level data directory for the split directory
        dirs = path.split('/')
        dirs[0] = name
        new_path = '/'.join(dirs)
        shutil.copy(path, new_path)

train_dir = 'emojis/train'
val_dir = 'emojis/val'
test_dir = 'emojis/test'

create_image_folder(train_dir, train_paths)
create_image_folder(val_dir, val_paths)
create_image_folder(test_dir, test_paths)

#### 3.6 🎯 define single number metric

Since this is a classification task, we'll use accuracy as our single number metric to measure model quality.

#### 3.7 🔁 train + validate until happy with losses and metric(s)

We are ready to launch some experiments! We'll use the same `ImageFolder` dataset as lecture 3.16 to load our images. This time we'll use three preprocessing transformations:
- [`Grayscale`](https://pytorch.org/docs/stable/torchvision/transforms.html#torchvision.transforms.Grayscale) to convert our images from 3 channels to one, since `ImageFolder` loads in RGB by default
- `ToTensor` to convert the loaded PIL images to `Tensor`s, with pixel values scaled between 0 and 1
- `Normalize` to feature scale the pixels

I already calculated the pixel value mean (0.7571) and standard deviation (0.2738), as described in this [thread](https://discuss.pytorch.org/t/computing-the-mean-and-std-of-dataset/34949/2).
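
For intuition, the mean and standard deviation are taken over all pixels of all images, and `Normalize` then applies `(x - mean) / std` per channel. The numpy sketch below shows the idea on randomly generated "images". The array is made up and is not the emoji data, so the resulting statistics differ from the values quoted above.

```python
import numpy as np

# made-up stand-in for a dataset: 10 grayscale 64x64 images with pixels in [0, 1]
rng = np.random.default_rng(0)
toy_images = rng.random((10, 1, 64, 64))

# statistics over every pixel of every image
pixel_mean = toy_images.mean()
pixel_std = toy_images.std()

# feature scaling: the same formula Normalize applies to each channel
normalized = (toy_images - pixel_mean) / pixel_std
print(f'mean: {pixel_mean:.4f}, std: {pixel_std:.4f}')
print(f'after scaling, mean: {normalized.mean():.4f}, std: {normalized.std():.4f}')
```

For a dataset too large to fit in memory, the same statistics would have to be accumulated batch by batch, as the linked thread describes.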

In [None]:
from torchvision import transforms
from torchvision import datasets

preprocess = transforms.Compose([
    transforms.Grayscale(),
    transforms.ToTensor(),
    transforms.Normalize([0.7571], [0.2738])
    ])

train_ds = datasets.ImageFolder(train_dir, preprocess)
val_ds = datasets.ImageFolder(val_dir, preprocess)
train_ds

The preprocessors have turned our images into $1\times64\times64$ tensors:

In [None]:
next(iter(train_ds))[0].shape

Which means we're ready to load our `Dataset`s into `DataLoader`s:

In [None]:
from torch.utils.data import DataLoader
batch_size = 32
train_loader = DataLoader(dataset=train_ds, 
                          batch_size=batch_size, 
                          shuffle=True)
val_loader = DataLoader(dataset=val_ds, 
                          batch_size=batch_size, 
                          shuffle=True)

Note that this time, we built two data loaders, one for the train set, one for the validation set.

We'll define the exact same convolutional neural network as lecture 3.15:

In [None]:
import torch
import torch.nn.functional as F


class ConvNet(torch.nn.Module):

    def __init__(self, verbose=False):
        super(ConvNet, self).__init__()
  
        self.verbose = verbose
        # 1x64x64 => 8x64x64
        self.conv_1 = torch.nn.Conv2d(in_channels=1,
                                      out_channels=8,
                                      kernel_size=(3, 3),
                                      stride=(1, 1),
                                      padding=1)
        # 8x64x64 => 8x32x32
        self.pool_1 = torch.nn.MaxPool2d(kernel_size=(2, 2),
                                         stride=(2, 2),
                                         padding=0)
        # 8x32x32 => 16x32x32
        self.conv_2 = torch.nn.Conv2d(in_channels=8,
                                      out_channels=16,
                                      kernel_size=(3, 3),
                                      stride=(1, 1),
                                      padding=1)
        # 16x32x32 => 16x16x16                             
        self.pool_2 = torch.nn.MaxPool2d(kernel_size=(2, 2),
                                         stride=(2, 2),
                                         padding=0)
        
        # 16x16x16 => 32x16x16
        self.conv_3 = torch.nn.Conv2d(in_channels=16,
                                      out_channels=32,
                                      kernel_size=(3, 3),
                                      stride=(1, 1),
                                      padding=1)
        
        # 32x16x16 => 32x8x8
        self.pool_3 = torch.nn.MaxPool2d(kernel_size=(2, 2),
                                         stride=(2, 2),
                                         padding=0)
        
        # 2048 => 64
        self.linear_1 = torch.nn.Linear(8*8*32, 64)
        # 64 => 3
        self.linear_2 = torch.nn.Linear(64, 3)

        
        
    def forward(self, x):
      x = F.relu(self.conv_1(x))
      x = self.pool_1(x)

      x = F.relu(self.conv_2(x))
      x = self.pool_2(x)

      x = F.relu(self.conv_3(x))
      x = self.pool_3(x)
      
      # flatten
      x = x.view(-1, 8*8*32)

      x = F.relu(self.linear_1(x))
      
      logits = self.linear_2(x)
      return logits
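
The shape comments in the network can be verified with the standard output size formula for convolutions and pooling without dilation: `(size + 2 * padding - kernel_size) // stride + 1`. The sketch below traces the spatial size through the three conv + pool stages. `conv_output_size` is a hypothetical helper written for this check.

```python
def conv_output_size(size, kernel_size, stride, padding):
    """Spatial output size of a conv or pooling layer (no dilation)."""
    return (size + 2 * padding - kernel_size) // stride + 1

size = 64  # input images are 1x64x64
for out_channels in (8, 16, 32):
    # 3x3 convolution, stride 1, padding 1: spatial size is unchanged
    size = conv_output_size(size, kernel_size=3, stride=1, padding=1)
    # 2x2 max pooling, stride 2: spatial size is halved
    size = conv_output_size(size, kernel_size=2, stride=2, padding=0)
    print(f'after stage with {out_channels} channels: {out_channels}x{size}x{size}')

# number of inputs expected by the first linear layer
flattened_size = 32 * size * size
print(f'flattened size: {flattened_size}')
```

The final value matches the `8*8*32` used in `linear_1` and in the flatten step of `forward()`.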


Last preparation touch, we initialize the torch `device`, send the model weights to it, and create our `optimizer` and `criterion`:

In [None]:
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

net = ConvNet()
net = net.to(device)

optimizer = torch.optim.Adam(net.parameters(), lr=0.001)
criterion = torch.nn.CrossEntropyLoss()

Which means we are ready for training! This is similar to our previous pytorch training loops, with two additions:
- the accuracy metric is computed as the ratio of correct predictions, which are accumulated in `running_corrects`
- the validation loss and metrics are measured after each training epoch in the `# VALIDATION` section

In [None]:
import time 
    
torch.manual_seed(1337)
np.random.seed(666)

start_time = time.time()    

losses = []
val_losses = []
accuracies = []
val_accuracies = []

for epoch in range(30):

    # TRAINING
    net = net.train()
    running_losses = []
    running_corrects = 0
    for data in train_loader:
        
        # prediction
        features, labels = data
        features = features.float().to(device)
        labels = labels.long().to(device)
        logits = net(features)
        
        # loss
        loss = criterion(logits, labels)
        running_losses.append(loss.item())
        # accuracy
        _, preds = torch.max(logits.data, 1)
        running_corrects += (preds == labels).sum().item()
                
        # gradient descent
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    
    # training metrics
    epoch_loss = np.array(running_losses).mean()
    losses.append(epoch_loss)
    epoch_accuracy = running_corrects/len(train_loader.dataset)
    accuracies.append(epoch_accuracy)
    
    # VALIDATION
    net = net.eval()
    running_losses = []
    running_corrects = 0
    with torch.no_grad():
        for data in val_loader:
            
            # prediction
            features, labels = data
            features = features.float().to(device)
            labels = labels.long().to(device)
            logits = net(features)

            # loss
            loss = criterion(logits, labels)
            running_losses.append(loss.item())
            # accuracy
            _, preds = torch.max(logits.data, 1)
            running_corrects += (preds == labels).sum().item()
            
    # validation metrics
    val_epoch_loss = np.array(running_losses).mean()
    val_losses.append(val_epoch_loss)
    val_epoch_accuracy = running_corrects/len(val_loader.dataset)
    val_accuracies.append(val_epoch_accuracy)
    
    print(f'epoch: {epoch}, loss: {epoch_loss:.6f}, val loss: {val_epoch_loss:.6f}, acc: {epoch_accuracy:.4f}, val acc: {val_epoch_accuracy:.4f}')
            

stop_time = time.time()
print(f'total training time: {stop_time - start_time}')
          

That's a lot of code! Take your time to read through it and understand each part.

🧠 What's the use of `net.train()` and `net.eval()`?

🧠 What's the use of `torch.no_grad()`?

🧠🧠 lines 31 & 62, notice that the predictions are calculated directly on the `logits` and not on the class probabilities. Why does this work? 

ℹ️ Notice that we're loading the validation set in batches, although we don't strictly have to. For training, batches affect the stochasticity of gradient descent step size, but for evaluation it's just cutting up calculations in smaller chunks. This is typically done with deep learning, because the datasets can be very large, and batching better leverages GPU parallelization.
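
To illustrate, the numpy sketch below evaluates made-up per-example results in batches. Accumulating correct predictions across batches gives exactly the full-dataset accuracy. One subtlety: since `CrossEntropyLoss` returns the batch _mean_ by default, averaging batch losses is only exact when every batch has the same size. Weighting each batch by its size removes the small bias from a shorter last batch. All numbers here are randomly generated for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_examples, batch_size = 70, 32                  # hypothetical: batches of 32, 32 and 6
example_losses = rng.random(n_examples)          # made-up per-example losses
example_correct = rng.random(n_examples) > 0.3   # made-up correctness flags

running_corrects = 0
batch_mean_losses = []
weighted_loss_sum = 0.0
for start in range(0, n_examples, batch_size):
    batch_losses = example_losses[start:start + batch_size]
    batch_correct = example_correct[start:start + batch_size]
    running_corrects += batch_correct.sum()
    batch_mean_losses.append(batch_losses.mean())
    weighted_loss_sum += batch_losses.mean() * len(batch_losses)

batched_accuracy = running_corrects / n_examples
mean_of_batch_means = np.mean(batch_mean_losses)   # what the loop above computes
weighted_mean_loss = weighted_loss_sum / n_examples
print(f'accuracy: {batched_accuracy:.4f} vs full dataset: {example_correct.mean():.4f}')
print(f'loss: {mean_of_batch_means:.4f} (batch means), {weighted_mean_loss:.4f} (weighted)')
```

In practice the difference between the two loss averages is usually tiny, so the simpler mean of batch losses is a common shortcut.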

The training loop above is long and contains a lot of duplicate code. We chose to keep the operations sequential and explicit in this notebook for demonstration purposes. But usually, ML engineers will encapsulate similar operations and use design patterns to clean the training loops. They either do it themselves for their particular use cases, or use third party wrappers like pytorch [ignite](https://github.com/pytorch/ignite).
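
As a rough idea of what such encapsulation might look like, the sketch below folds the training and validation passes into one function. `run_epoch` is a hypothetical helper, not part of pytorch or ignite, and the tiny linear model and random data only exist to make the example self-contained.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def run_epoch(net, loader, criterion, device, optimizer=None):
    """One pass over `loader`: trains if an optimizer is given, else evaluates."""
    training = optimizer is not None
    net.train(training)  # equivalent to net.train() or net.eval()
    total_loss, corrects, n_seen = 0.0, 0, 0
    # gradients are only tracked during training
    with torch.set_grad_enabled(training):
        for features, labels in loader:
            features = features.float().to(device)
            labels = labels.long().to(device)
            logits = net(features)
            loss = criterion(logits, labels)
            if training:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            # weight the batch mean loss by the batch size
            total_loss += loss.item() * labels.size(0)
            corrects += (logits.argmax(dim=1) == labels).sum().item()
            n_seen += labels.size(0)
    return total_loss / n_seen, corrects / n_seen

# made-up data and model, for illustration only
torch.manual_seed(0)
X_toy = torch.randn(64, 4)
y_toy = (X_toy[:, 0] > 0).long()
toy_loader = DataLoader(TensorDataset(X_toy, y_toy), batch_size=16)
toy_net = torch.nn.Linear(4, 2)
toy_optimizer = torch.optim.Adam(toy_net.parameters(), lr=0.01)
toy_criterion = torch.nn.CrossEntropyLoss()
toy_device = torch.device('cpu')

train_loss, train_acc = run_epoch(toy_net, toy_loader, toy_criterion, toy_device, toy_optimizer)
val_loss, val_acc = run_epoch(toy_net, toy_loader, toy_criterion, toy_device)
print(f'loss: {train_loss:.4f}, val loss: {val_loss:.4f}, acc: {train_acc:.4f}, val acc: {val_acc:.4f}')
```

With a helper like this, each epoch of the loop above would reduce to two calls, one with the optimizer and one without. The toy example reuses the same loader for both calls purely for brevity.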

Now that we've added validation to our training loop, we can plot loss & accuracy curves to get insights on this CNN's optimization:

In [None]:
fig = plt.figure(figsize=(12,4))

ax1 = fig.add_subplot(121)
ax1.plot(losses, label='train')
ax1.plot(val_losses, label='val')
ax1.legend()
ax1.set_ylabel('loss')
ax1.set_xlabel('epoch')
ax1.set_title('Loss Curve')

ax2 = fig.add_subplot(122)
ax2.plot(accuracies, label='train')
ax2.plot(val_accuracies, label='val')
ax2.legend()
ax2.set_ylabel('accuracy')
ax2.set_xlabel('epoch')
ax2.set_title('Accuracy Curve');

The model doesn't seem to overfit, since the validation loss & accuracy are close to the training values, and the loss curve doesn't exhibit the usual overfit divergence pattern. It's also clear how long the model takes to fully converge to the best accuracy.

🧠🧠 Why are the validation curves so "quantised"? How could we fix this?  

#### 3.8 📏 evaluate model on test set to get final metric

Now that we've chosen this model as our best candidate, we'd like to know its expected production performance. Let's test it! First we must load the test set with the same preprocessing as training (minus any data augmentation):

In [None]:
preprocess = transforms.Compose([
    transforms.Grayscale(),
    transforms.ToTensor(),
    transforms.Normalize([0.7571], [0.2738])
    ])
test_ds = datasets.ImageFolder(test_dir, preprocess)
test_loader = DataLoader(dataset=test_ds, 
                          batch_size=batch_size,
                          shuffle=False)

Next we use the last model checkpoint by simply reusing the `net` variable. Let's not forget to use `net.eval()` and `torch.no_grad()` before calculating the accuracy:

In [None]:
net = net.eval()
running_corrects = 0
with torch.no_grad():
    for data in test_loader:
        features, labels = data
        features = features.float().to(device)
        labels = labels.long().to(device)
        logits = net(features)

        _, preds = torch.max(logits.data, 1)
        running_corrects += (preds == labels).sum().item()
            
test_accuracy = running_corrects / len(test_loader.dataset)
print(f'test accuracy: {test_accuracy}')

That's an amazing result! The Unicode Consortium will be delighted 😊.

🧠🧠 Consider the significance of this score:
- what does it say about the expected generalization on unseen examples?
- does this really mean this is an almost perfect model?
- what factors affect the confidence of this score?

## 4. Summary

Today, we learned about **evaluation methods**. First, we noted that training loss makes for a bad model quality metric, since it cannot detect **overfitting**. We introduced the idea of a held-out **test set** to better estimate generalization properties on unseen examples. We highlighted how test sets work if they are of the **same distribution** as the data we will encounter at prediction time, and if they are **large enough**. We then described how an independent test set can still be prone to overfitting if used as part of **model development**. Since machine learning development is **experimental** & **iterative** in nature, the data scientist introduces an **information leak** between the test set and the model hyperparameters. We introduced the **validation set** as a solution. We split the responsibilities of **comparing** models, and **assessing** models, which allows engineers to both develop and measure the quality of machine learning solutions. We then showed that losses weren't always interpretable values, and introduced new **metrics**, like classification accuracy or regression MSE. We underlined the importance of choosing a **single number metric** to define model quality, and speed up model development. We then synthesized all these new workflows into a **ML development checklist**, which captures the steps of typical ML engineering experiments. Finally, we applied this checklist to three ML frameworks: sklearn, keras, and pytorch. In doing so, we built viable ML solutions from scratch for banknote authentication and emoji classification tasks.


# Resources

## Core Resources

- [**Slides**](https://docs.google.com/presentation/d/1Z2kUep8v6dKPlUrJh7sFBawHO0h7MpvaMbY8xrVH_9I/edit?usp=sharing)  
- [Machine learning yearning](https://www.deeplearning.ai/machine-learning-yearning/)  
The Andrew Ng reference for ML engineering, including terse and practical sections about validation and test sets
- [sklearn on evaluation](https://scikit-learn.org/stable/modules/model_evaluation.html)  
Verbose official documentation on sklearn evaluation methods and apis 
- [Train and evaluation with keras](https://www.tensorflow.org/guide/keras/train_and_evaluate)  
Comprehensive official guide to validation and testing in keras

## Additional Resources

- [Google ML crash course - accuracy](https://developers.google.com/machine-learning/crash-course/classification/accuracy)  
Intuitive explanation of the accuracy metric and its equation
- [ignite](https://github.com/pytorch/ignite#why-ignite)  
pytorch ecosystem library with many apis to reduce evaluation boilerplate code