# Evaluation

Colab link [here](https://colab.research.google.com/drive/1-jRsXmrcgku8POGf5Dp_-J_CjjkANX17?usp=sharing)

Now that you have a model, you should test how well it was trained. This is where `scikit-learn`, a popular Python machine learning library, comes into play.



***
# Quick Aside

In previous lessons, you may have seen me use `model.train()` or `model.eval()` and wondered what these meant. Here's where I finally explain them.

`.train()` puts the model into training mode. Call it before starting the training loop. In training mode, dropout layers randomly zero activations, and batch normalization layers normalize with the current batch's statistics while updating their running estimates.

`.eval()` puts the model into evaluation mode. Call it before any validation or inference. It turns dropout layers off and makes batch normalization use the running statistics learned during training instead of the current batch's statistics. It does not stop gradient tracking, which is why you normally use `.eval()` in tandem with `torch.no_grad()`.

`torch.no_grad()` disables gradient tracking. This makes calculations quicker and reduces memory use, so it's the more efficient choice for validation and inference. It is used as a context manager with the `with` keyword, like this: `with torch.no_grad():`
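
To make the difference between the two modes concrete, here is a minimal sketch. The toy `nn.Dropout` and `nn.Linear` layers, the tensor sizes, and the seed are assumptions made for illustration; they are not part of the lesson's model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # make the random dropout mask repeatable

# Toy dropout layer (hypothetical, for illustration only)
dropout = nn.Dropout(p=0.5)
x = torch.ones(1000)

dropout.train()          # training mode: dropout is active
train_out = dropout(x)   # roughly half the values become 0, survivors are scaled to 2.0

dropout.eval()           # evaluation mode: dropout does nothing
eval_out = dropout(x)    # identical to the input

# .eval() alone does not stop gradient tracking; torch.no_grad() does
layer = nn.Linear(4, 1)
layer.eval()
inputs = torch.randn(2, 4)
tracked = layer(inputs)
with torch.no_grad():
    untracked = layer(inputs)

print("zeros in train mode:", int((train_out == 0).sum()))
print("zeros in eval mode:", int((eval_out == 0).sum()))
print("requires_grad with/without no_grad:", untracked.requires_grad, tracked.requires_grad)
```

The last line is the reason the two are paired: `.eval()` changes how layers behave, while `torch.no_grad()` changes whether gradients are recorded.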

***
# Validation

The first type of evaluation happens during training itself: validation. When we train a model, we want to make sure that it doesn't memorize (overfit to) the training data. We want to avoid overfitting because an overfit model won't be able to generalize what it has learned to the problems it sees when deployed.

<br>

Validation is normally implemented by splitting a dataset into three partitions: train, validation, and test. Common choices are an 80/10/10 or a 70/20/10 split.
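
As a rough sketch of what such a split looks like, the snippet below shuffles sample indices and slices them 80/10/10. The helper name `split_indices`, the dataset size of 1000, and the seed are assumptions for illustration.

```python
import random

def split_indices(n_samples, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle the indices 0..n_samples-1 and slice them into train/val/test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle so each partition is a random sample

    n_train = int(n_samples * train_frac)
    n_val = int(n_samples * val_frac)

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]      # whatever is left goes to the test set
    return train, val, test

train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 800 100 100
```

In a PyTorch project, `torch.utils.data.random_split` performs a similar split directly on a `Dataset` object.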

<br>

During training, we spend an epoch on the training dataset and calculate the loss. At the end of the same epoch, the model is also evaluated on the validation data to see how well it is generalizing.

<br>

It is important to note that the model will never train on the validation data. It is purely meant to see how well the model is learning. If training loss continually decreases but validation loss increases, it means your model is memorizing the training data and not learning to generalize to unseen data. 
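
The warning sign described above (training loss falling while validation loss rises) can be checked automatically. This is a minimal sketch: the function name, the `patience` parameter, and the loss values are all made up for illustration.

```python
def first_overfitting_epoch(train_losses, val_losses, patience=2):
    """Return the 1-based epoch at which validation loss has risen for
    `patience` epochs in a row while training loss kept falling, else None."""
    streak = 0
    for i in range(1, len(val_losses)):
        train_falling = train_losses[i] < train_losses[i - 1]
        val_rising = val_losses[i] > val_losses[i - 1]
        streak = streak + 1 if (train_falling and val_rising) else 0
        if streak >= patience:
            return i + 1
    return None

# Hypothetical per-epoch average losses
train_losses = [0.90, 0.60, 0.45, 0.35, 0.28, 0.22]
val_losses = [0.95, 0.70, 0.62, 0.60, 0.66, 0.73]

print(first_overfitting_epoch(train_losses, val_losses))  # 6
```

Here validation loss bottoms out at epoch 4 and rises in epochs 5 and 6, so the check fires at epoch 6.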

<br>

Let's check out a training loop with a validation implementation.



In [None]:
# import necessary modules
import torch
import torch.optim as optim
import torch.nn as nn

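# fakeModel, train_dataloader, and validation_dataloader are placeholders:
# substitute your own nn.Module and DataLoaders
model = fakeModel()
optimizer = optim.Adam(model.parameters(), lr=0.001)
loss = nn.MSELoss()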

num_epochs = 2

for epoch in range(num_epochs):
  train_loss = 0
  validation_loss = 0

  model.train()
  for batch in train_dataloader:
    inputs, targets = batch

    # forward pass
    outputs = model(inputs)

    # compute loss
    batch_loss = loss(outputs, targets)

    # backward pass
    optimizer.zero_grad()
    batch_loss.backward()

    # update parameters
    optimizer.step()

    # sum up loss for this epoch
    train_loss += batch_loss.item()

  # switch model to evaluation mode
  model.eval()
  # we don't need gradients since we're not training
  with torch.no_grad():
    for batch in validation_dataloader:
      inputs, targets = batch

      outputs = model(inputs)

      batch_loss = loss(outputs, targets)

      validation_loss += batch_loss.item()

      # no backward pass or updating parameters needed

  # get average loss for an epoch
  avg_train_loss = train_loss / len(train_dataloader)
  avg_validation_loss = validation_loss / len(validation_dataloader)

  print(f'Epoch {epoch + 1} / {num_epochs} | Training Loss: {avg_train_loss:.4f} | Validation Loss: {avg_validation_loss:.4f}')

***
# Types of ML Problems

There are two main types of problems in machine learning: classification and regression.

<br>

In a classification problem, the model sorts each input into one of a fixed set of buckets (classes). An example is the MNIST dataset, where images of handwritten digits are sorted into the digits 0 through 9.

In a regression problem, the model predicts a continuous value instead of a discrete class. An example of this could be predicting the number of calories someone would burn during a workout.

***
# Classification Metrics

Classification and regression problems have different metrics. Some common classification metrics include:
1. Accuracy
2. Precision
3. Recall
4. F1 Score

<br>

Accuracy is the number of correct predictions divided by the total number of samples. This can be misleading when we have unequal class sizes (e.g. one class has 10x more samples than another class). Higher is better.

The formula is `(TP + TN) / TOTAL` where TP is true positives and TN is true negatives.
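
To see why accuracy can mislead with unequal class sizes, consider a made-up dataset with 90 negative and 10 positive samples, and a model that always predicts the negative class. The data here is invented purely for illustration.

```python
# Hypothetical imbalanced dataset: 90 negatives (0) and 10 positives (1)
y_true = [0] * 90 + [1] * 10
# A "model" that ignores its input and always predicts the majority class
y_pred = [0] * 100

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# How many of the actual positives were found?
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
positives_found = true_positives / sum(y_true)

print(f"Accuracy: {accuracy:.2f}")                           # 0.90
print(f"Fraction of positives found: {positives_found:.2f}")  # 0.00
```

The model scores 90% accuracy without ever identifying a single positive sample, which is why accuracy should be read alongside the other metrics.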

<br>

Precision is the number of true positives divided by the sum of true and false positives. It measures how many of the samples the model labeled positive really are positive. Higher is better.

The formula is `TP / (TP + FP)` where FP is false positives.

<br>

Recall takes all the actual positive samples and calculates what fraction the model correctly classified. This is also known as the true positive rate. Higher is better.

The formula is `TP / (TP + FN)` where FN is false negatives.

<br>

F1 Score is the harmonic mean of precision and recall. Higher is better.

The formula is `2 * (Precision * Recall) / (Precision + Recall)`
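
The four metrics can be worked out by hand from the counts of true/false positives and negatives. The sketch below does this in plain Python on a small made-up set of labels; the helper `confusion_counts` is a hypothetical name, not a library function.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Made-up labels and predictions
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)  # 3, 1, 2, 2

accuracy = (tp + tn) / (tp + fp + fn + tn)          # 5 / 8
precision = tp / (tp + fp)                          # 3 / 4
recall = tp / (tp + fn)                             # 3 / 5
f1 = 2 * (precision * recall) / (precision + recall)

print(f"Accuracy: {accuracy:.2f} | Precision: {precision:.2f} | "
      f"Recall: {recall:.2f} | F1: {f1:.2f}")
```

For these labels the result is accuracy 0.62 (0.625 before rounding), precision 0.75, recall 0.60, and F1 0.67.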

<br>

## Summary

Never rely on just one statistic. Combining several allows you to see a better picture of your model's performance. Here are some general trends to look out for.

<br>

High precision, low recall -> The model is cautious in its predictions: it makes few false positives but misses many true positives.

Low precision, high recall -> The model finds most of the true positives but also predicts many false positives.
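
One way to see both trends is to take a model that outputs a score per sample and vary the cutoff above which a sample is called positive. The scores, labels, and thresholds below are invented for illustration, and the thresholding setup itself is an assumption rather than something covered in this lesson.

```python
def precision_recall_at(threshold, scores, labels):
    """Predict positive when score >= threshold, then compute precision and recall."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and t == 1 for p, t in zip(preds, labels))
    fp = sum(p == 1 and t == 0 for p, t in zip(preds, labels))
    fn = sum(p == 0 and t == 1 for p, t in zip(preds, labels))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical model scores and true labels
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 1, 0, 1, 0]

cautious = precision_recall_at(0.85, scores, labels)  # only very confident positives
eager = precision_recall_at(0.25, scores, labels)     # almost everything is positive

print(f"Threshold 0.85 -> precision {cautious[0]:.2f}, recall {cautious[1]:.2f}")
print(f"Threshold 0.25 -> precision {eager[0]:.2f}, recall {eager[1]:.2f}")
```

With the high cutoff, every positive prediction is right but most positives are missed (precision 1.00, recall 0.40). With the low cutoff, all positives are found at the cost of false positives (precision 0.71, recall 1.00).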

<br>

Let's see some implementation examples below.

In [None]:
# import necessary modules
from sklearn.metrics import precision_score, recall_score, f1_score

# example predictions
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 0]

precision = precision_score(y_true, y_pred)
print(f'Precision: {precision:.2f}')

recall = recall_score(y_true, y_pred)
print(f'Recall: {recall:.2f}')

f1 = f1_score(y_true, y_pred)
print(f'F1 Score: {f1:.2f}')
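
Accuracy and the underlying TP/FP/FN/TN counts can also be obtained from `scikit-learn`. This is a small sketch using `accuracy_score` and `confusion_matrix` from `sklearn.metrics` on a set of made-up binary labels.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up binary labels and predictions
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)
print(f'Accuracy: {accuracy:.2f}')

# For binary 0/1 labels, the flattened matrix is ordered TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f'TP: {tp} | FP: {fp} | FN: {fn} | TN: {tn}')
```

Looking at the raw counts is a quick way to sanity check the precision and recall values reported by the library.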

***
# Regression Metrics

Regression metrics are a bit easier to understand initially. Let's tackle some basic ones.

1. Mean Squared Error
2. Mean Absolute Error

<br>

Mean Squared Error is the average of the squared differences between the actual and predicted values. It punishes large errors (outliers) heavily because each difference is squared. Lower is better.

$$
\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2
$$

<br>

Mean Absolute Error is the average of the absolute differences between the actual and predicted values. It doesn't punish outliers as harshly as MSE. Lower is better.

$$
\frac{1}{n} \sum_{i=1}^n \left| y_i - \hat{y}_i \right|
$$
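
To illustrate how differently the two metrics react to an outlier, the sketch below implements both formulas in plain Python and compares predictions with and without one large miss. The helper names and the numbers are made up for illustration.

```python
def mse(y_true, y_pred):
    """Mean of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean of absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 2.0, 7.0]
y_pred_clean = [2.0, 6.0, 3.0, 6.0]     # every prediction is off by exactly 1
y_pred_outlier = [2.0, 6.0, 3.0, 17.0]  # last prediction is off by 10

print(f"Clean   -> MSE: {mse(y_true, y_pred_clean):.2f} | MAE: {mae(y_true, y_pred_clean):.2f}")
print(f"Outlier -> MSE: {mse(y_true, y_pred_outlier):.2f} | MAE: {mae(y_true, y_pred_outlier):.2f}")
```

A single error of 10 raises MAE from 1.00 to 3.25, but raises MSE from 1.00 to 25.75, because that one error contributes 100 to the sum of squares.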

<br>

Let's see some implementation examples below.

In [None]:
# import necessary modules
from sklearn.metrics import mean_squared_error, mean_absolute_error

# example predictions
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.1, 7.8]

mse = mean_squared_error(y_true, y_pred)
print(f'MSE: {mse:.2f}')

mae = mean_absolute_error(y_true, y_pred)
print(f'MAE: {mae:.2f}')