# Topic 8 - Metrics for Performance Evaluation

## Aims of the Session

* Learn different metrics used to evaluate classification frameworks

* Understand some alternatives to design proper tests

## Resources for the Lecture

### Websites

* https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229
* https://en.wikipedia.org/wiki/Precision_and_recall
* https://en.wikipedia.org/wiki/Sensitivity_and_specificity
* https://en.wikipedia.org/wiki/Confusion_matrix
* https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
* https://towardsdatascience.com/multi-class-metrics-made-simple-part-i-precision-and-recall-9250280bddc2
* https://machinelearningmastery.com/k-fold-cross-validation/
* https://pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-detection/
* https://medium.com/mlearning-ai/understanding-evaluation-metrics-in-medical-image-segmentation-d289a373a3f

### Online Courses

* [Deep Learning Specialization by Andrew NG (Coursera)](https://es.coursera.org/specializations/deep-learning)

## Some important concepts

![Fig. 1. Typical Data Split](https://www.dropbox.com/s/oze1q3wj7d71pa1/traintestvalid.jpg?raw=1)

* `Generalisation`: The ability to correctly classify new examples different from those used for training a model

![Fig. 2. Sample data of a binary dataset](https://www.dropbox.com/s/iiorih73voblfb9/data.png?raw=1)

* `Overfitting`: The trained classifier fits the training data almost perfectly (e.g. $100\%$ accuracy) but performs much worse on unseen data (e.g. only $50\%$ on the testing data).
    * Also known as `high variance`.

![Fig. 2a. Sample data of a binary dataset with an overfitted model](https://www.dropbox.com/s/gtyc6o096si85ii/overfitting.png?raw=1)

* `Underfitting`: The learned classifier is so simplistic that it does not capture the structure of the data.
    * This translates into poor performance on both the training and the validation data
    * Also known as `high bias`

![Fig. 2b. Sample data of a binary dataset with an underfitted model](https://www.dropbox.com/s/sc6t7pocg90xrak/underfitting.png?raw=1)

* What do we expect?

![Fig. 2c. Sample data of a binary dataset with "just right" classification](https://www.dropbox.com/s/8qmcr98jeghw7fe/justright.png?raw=1)

### The bias-variance trade-off

* As you can see, a model can suffer from high bias or from high variance, and reducing one typically increases the other

* The main objective of machine learning is to find a function $h(x)$ that maps features $x$ to a class/target $y$ while minimising the total error, which is made up of:
    * bias error
    * variance error
    * irreducible error (noise in the data, which no model can remove)

Typically[$^1$](https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229), the **Error** of a learner/classifier is modelled using the following equation:

$Err(x)=Bias^2+Variance+Irreducible\:Error$

**Why $Bias^2$?**
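One way to approach this question is to expand the expected squared error of a learner. The sketch below assumes that the data follow $y = f(x) + \epsilon$, where $\epsilon$ is zero-mean noise with variance $\sigma^2$, and writes $\hat{h}(x)$ for the model learned from a random training set (these symbols are introduced here for illustration and are not part of the original equation):

$$
\begin{aligned}
E\left[(y - \hat{h}(x))^2\right] &= \left(E[\hat{h}(x)] - f(x)\right)^2 + E\left[\left(\hat{h}(x) - E[\hat{h}(x)]\right)^2\right] + \sigma^2 \\
&= Bias^2 + Variance + Irreducible\:Error
\end{aligned}
$$

Under these assumptions the bias is the signed difference $E[\hat{h}(x)] - f(x)$, which can be positive or negative. Because the error being decomposed is a *squared* error, the bias enters squared, which puts it in the same units as the variance and the noise term.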

## Performance Measures

* Assume that we are evaluating the classification success of a **binary** dataset

* `True Positives` (TP): This is what many people understand as *accuracy* (but is not!)
    * Samples from the *positive class* that are classified correctly

* `True Negatives` (TN): How many samples from the negative class are **NOT** classified as being from the positive one

* `False Positives` (FP): How many samples from the negative class are classified as being from the positive class
    * Also known in statistics as **False alarms** or **Type I Error**

* `False Negatives` (FN): How many samples from the positive class are classified as being from the negative class
    * Also known in statistics as **Type II Error**
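The four counts above can be obtained by comparing each prediction with its true label. The following is a minimal sketch in plain Python; the function name `confusion_counts` and the toy labels are hypothetical, chosen only for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN for a binary problem."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # positive sample classified as positive
            else:
                fp += 1  # negative sample classified as positive (Type I)
        else:
            if truth == positive:
                fn += 1  # positive sample classified as negative (Type II)
            else:
                tn += 1  # negative sample classified as negative
    return tp, tn, fp, fn


# Toy labels (hypothetical): 1 = positive class, 0 = negative class
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)  # 3 3 1 1
```

Every metric in the rest of this section can be derived from these four numbers.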

### Accuracy

* $Accuracy = \frac{TP+TN}{TP+TN+FP+FN}$

* The value of the accuracy must be **between $0$ and $1$**

* Recall that we said that this is **not** a good measure for imbalanced datasets

* **WHY?**
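A small numerical experiment may help with this question. The sketch below assumes a hypothetical dataset with 95 negative and 5 positive samples and a "classifier" that always predicts the negative class; none of these numbers come from a real dataset.

```python
# Hypothetical imbalanced dataset: 95 negatives (0) and 5 positives (1)
y_true = [0] * 95 + [1] * 5

# A "classifier" that ignores its input and always predicts the negative class
y_pred = [0] * 100

# Accuracy = correctly classified samples / all samples
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)

# Fraction of the positive samples that were actually found
positives_found = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
positive_hit_rate = positives_found / sum(y_true)

print(accuracy)           # 0.95
print(positive_hit_rate)  # 0.0
```

The accuracy looks excellent even though the model never detects a single positive sample, which is why accuracy alone can be misleading when the classes are imbalanced.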

### Error Rate

* $Error\:Rate = \frac{FP+FN}{TP+TN+FP+FN} = 1 - Accuracy$

* Also must be **between $0$ and $1$**

* **Do you think this one is good for imbalanced datasets?**

### Precision and Recall

* Assume that we have the following **binary** classification scenario

![Fig. 3. Binary Classification Scenario Example](https://www.dropbox.com/s/kojs26i99ksxwuj/Precisionrecall.png?raw=1)

#### Precision

* $Precision = \frac{TP}{TP+FP}$

![Fig. 3a. Precision Illustrated in the Binary Classification Scenario Example](https://www.dropbox.com/s/1y2z9grr3tle83n/Precision.png?raw=1)

* How much of what I **have** I **need**?

#### Recall

* $Recall = \frac{TP}{TP+FN}$

![Fig. 3b. Recall illustrated in the binary Classification scenario example](https://www.dropbox.com/s/hpm2rck19vxgnqy/Recall.png?raw=1)

* How much of what I **need** I **have**?

* The difference is in what you divide the `TP` by

* Most systems are known to have a precision/recall trade-off

* **Which is better?**

#### F1-score (or F1-measure)

* Harmonic mean between precision and recall

* $F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall} = \frac{2 \times TP}{(2 \times TP) + FP + FN}$
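As a quick check of the three formulas, here is a minimal sketch that computes them from raw counts. The function name and the counts are hypothetical, and returning `0.0` when a denominator is zero is just one possible convention.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts."""
    # Assumption: an undefined ratio (zero denominator) is reported as 0.0
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 6 true positives, 2 false positives, 4 false negatives
tp, fp, fn = 6, 2, 4
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

print(precision)     # 0.75
print(recall)        # 0.6
print(round(f1, 4))  # 0.6667

# The closed form in terms of counts gives the same value
f1_from_counts = (2 * tp) / (2 * tp + fp + fn)
```

Because F1 is a harmonic mean, it stays close to the lower of the two values, so a model cannot obtain a high F1 by excelling at only one of precision or recall.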

### Sensitivity and Specificity

* Similar to precision and recall, but used more in the health sciences domain

#### Sensitivity

* Just another name for **recall**

![Fig. 3c. Sensitivity illustrated in the binary classification scenario example](https://www.dropbox.com/s/5anc9xoualeij76/sensitivity.png?raw=1)

#### Specificity

* The **recall** for the negative class (also known as the *true negative rate*): $Specificity = \frac{TN}{TN+FP}$

![Fig. 3d. Specificity illustrated in the binary classification scenario example](https://www.dropbox.com/s/o3w055swet69c9k/specificity.png?raw=1)
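Both quantities can be computed directly from the four counts. The sketch below uses a hypothetical screening test applied to 100 sick and 900 healthy patients; the numbers are invented for illustration.

```python
def sensitivity_specificity(tp, tn, fp, fn):
    """Sensitivity (recall of the positive class) and specificity
    (recall of the negative class) from raw counts."""
    sensitivity = tp / (tp + fn)  # fraction of actual positives detected
    specificity = tn / (tn + fp)  # fraction of actual negatives detected
    return sensitivity, specificity


# Hypothetical screening test: 100 sick patients, 900 healthy patients
tp, fn = 90, 10    # sick patients flagged / missed
tn, fp = 855, 45   # healthy patients cleared / wrongly flagged

sensitivity, specificity = sensitivity_specificity(tp, tn, fp, fn)
print(sensitivity)  # 0.9
print(specificity)  # 0.95
```

Note that neither value depends on how many sick versus healthy patients are in the sample, since each one is computed within a single true class.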

**Is there any "F-measure" for these two?**

### The Confusion Matrix

* Also known as *error matrix*

* Table that allows you to visualise the performance of a supervised learning algorithm

#### Example

* A classifier has been trained to distinguish cats from dogs

* Assuming a sample of 13 animals (8 cats and 5 dogs), you get the following confusion matrix

![Fig 4. Confusion matrix example](https://www.dropbox.com/s/ii6nitc5fxpgb8d/confmat.png?raw=1)

* This table can also be interpreted with respect to the previously seen terms

![Fig 4a. Confusion matrix with previously seen terms](https://www.dropbox.com/s/dwkpg1epk46b5cm/confmat2.png?raw=1)
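To make the idea concrete, here is a minimal sketch that builds a confusion matrix with rows as actual classes and columns as predicted classes. The sample sizes match the example (8 cats and 5 dogs), but the predictions are hypothetical and are not taken from Fig. 4.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(actual, predicted)] for predicted in labels]
            for actual in labels]


labels = ["cat", "dog"]

# 8 cats followed by 5 dogs
y_true = ["cat"] * 8 + ["dog"] * 5
# Hypothetical predictions: 2 cats mistaken for dogs, 1 dog mistaken for a cat
y_pred = ["cat"] * 6 + ["dog"] * 2 + ["dog"] * 4 + ["cat"] * 1

matrix = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(label, row)
# cat [6, 2]
# dog [1, 4]
```

If "cat" is taken as the positive class, the first row holds TP and FN, and the second row holds FP and TN.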

### Area under the Receiver Operating Characteristic (ROC) Curve

* Suitable to compare classification rates in a more visual way and at **different threshold settings**

In practice, most classifiers output a score or probability rather than a hard label: they do not directly tell you whether a data point is class 0 or class 1, but rather the probability of it being class 0 or class 1 (both probabilities add up to 100%). A sample is assigned to class 1 when its probability of being class 1 is at least a threshold $t$. The ROC curve plots the FPR against the TPR as this threshold varies from $t=0$ to $t=1$. Usually the threshold is set to $t=0.5$, but each threshold gives a different (FPR, TPR) pair, and together these points form the curve. If $t=0$, every sample is classified as class 1, so both TPR and FPR are 1. If $t=1$, essentially no sample is classified as class 1, so both TPR and FPR are 0.

* It is a probability curve that tells you how much your model is able to distinguish between classes

* The higher the AUC, the better the model is at distinguishing between the classes

* The curve plots **False Positive Rate** (x-axis) vs **True Positive Rate** (y-axis)
    * $FPR = 1 - Specificity = \frac{FP}{FP+TN}$
    * $TPR = Recall = \frac{TP}{TP+FN}$ (also known as Sensitivity)

![Fig. 6. Example of ROC AUC](https://www.dropbox.com/s/orarrocrue4lvzs/ROCAUC.png?raw=1)
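The construction of the curve can be sketched in a few lines of plain Python. The code assumes that a sample is predicted as positive when its score is greater than or equal to the threshold; the function names and the four toy scores are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by sweeping the decision threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Start above the highest score (nothing is predicted positive),
    # then lower the threshold to each distinct score in turn
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for t in thresholds:
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy example: true labels and predicted probabilities of class 1
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
auc = area_under_curve(points)
print(points)  # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(auc)     # 0.75
```

The curve always starts at (0, 0), where nothing is predicted positive, and ends at (1, 1), where everything is. An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation.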

### Runtime

* It's not a bad idea to report this, particularly in large image datasets

* Not very "accepted" in the academic world, but extremely useful in the industrial one!

* You can import the `time` module in Python and use the `perf_counter()` function to calculate the time of processes running
    * Just be very careful where in your code you calculate the time!

```python
import time

t = time.perf_counter()  # start the timer
# do stuff
x = 0
for i in range(1000):
    x = x + i
# stuff has finished
print('Elapsed time: ', time.perf_counter() - t)
```

```
Elapsed time:  0.00035020000007079943
```


## What about multi-class classification?

* So far, we have only spoken of metrics in the context of binary datasets

* However, in most cases you will deal with multi-class datasets

* There are many ways to adapt the aforementioned metrics to these scenarios, the most common one being the **One vs All** approach
    * Comparing a metric of one class against the rest as if these were a single class

* Considering that you can still calculate precision, recall and F1-score for each class (against the rest), another commonly used approach is **macro/weighted/micro** metrics:

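* `Macro` is the arithmetic mean of the per-class metrics, so every class counts equally regardless of its size

* `Weighted` is the mean of the per-class metrics where each one is weighted by the number of samples of its class

* `Micro` pools the TP, FP and FN of all classes before computing the metric; in single-label multi-class problems, micro precision, recall and F1 are all equal to the system's accuracy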

* To see an example of this, I recommend visiting [this site](https://towardsdatascience.com/multi-class-metrics-made-simple-part-i-precision-and-recall-9250280bddc2)
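The three averaging schemes can be compared on a small example. The sketch below computes macro, weighted and micro precision for a hypothetical three-class problem; the labels, predictions and function names are invented for illustration.

```python
def per_class_counts(y_true, y_pred, label):
    """One-vs-all TP, FP and FN for a single class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    return tp, fp, fn


def averaged_precision(y_true, y_pred):
    """Return (macro, weighted, micro) precision."""
    labels = sorted(set(y_true))
    counts = {c: per_class_counts(y_true, y_pred, c) for c in labels}
    precision = {c: (tp / (tp + fp) if tp + fp else 0.0)
                 for c, (tp, fp, fn) in counts.items()}
    support = {c: y_true.count(c) for c in labels}

    # Macro: plain average over classes
    macro = sum(precision.values()) / len(labels)
    # Weighted: average over classes, weighted by class size
    weighted = sum(precision[c] * support[c] for c in labels) / len(y_true)
    # Micro: pool the counts of all classes first
    total_tp = sum(tp for tp, fp, fn in counts.values())
    total_fp = sum(fp for tp, fp, fn in counts.values())
    micro = total_tp / (total_tp + total_fp)
    return macro, weighted, micro


# Hypothetical labels: 5 samples of "a", 2 of "b", 3 of "c"
y_true = ["a"] * 5 + ["b"] * 2 + ["c"] * 3
y_pred = ["a", "a", "a", "a", "b", "b", "b", "c", "a", "a"]

macro, weighted, micro = averaged_precision(y_true, y_pred)
accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

print(round(macro, 4))     # 0.7778
print(round(weighted, 4))  # 0.7667
print(micro, accuracy)     # 0.7 0.7
```

In this single-label setting the micro average coincides with the accuracy, while macro and weighted differ because the classes have different sizes.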

## Validation Frameworks

![Fig. 7. Typical Data Split](https://www.dropbox.com/s/oze1q3wj7d71pa1/traintestvalid.jpg?raw=1)

* Technically this is not the only way to split the data!

* Even when you split the data randomly using the train/val/test approach, the result depends on which samples happen to land in each subset: some samples may be easier or harder to test on than others!

* To address this issue, there are some iterative validation frameworks which let you split data in different ways and perform multiple tests of the same model

### Cross validation

* Simple to understand

* Reduces "bias"
    * i.e. over-optimistic results that may be caused by chance

* Based on a single parameter $k$, which defines the number of groups (*folds*) that the dataset is split into

#### How it works

1) Shuffle the dataset

2) Split the dataset into $k$ groups

3) For each group
    * Take that group as the test data
    * Take the remaining groups as the training data
    * Fit the model
    * Retain the score and discard the model

4) Once you are done, average/summarise all results
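The steps above can be sketched in plain Python. The code below is only an illustration: the helper names are hypothetical, and the "model" is a trivial majority-class predictor that stands in for whatever classifier you are evaluating.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # 1) shuffle
    folds = [indices[i::k] for i in range(k)]     # 2) split into k groups
    for i in range(k):                            # 3) each group is the test set once
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i
                     for idx in fold]
        yield train_idx, test_idx


def cross_validate(fit, score, X, y, k=5, seed=0):
    """Fit on k-1 folds, score on the held-out fold, keep only the score."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k, seed):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(score(model,
                            [X[i] for i in test_idx],
                            [y[i] for i in test_idx]))
    return sum(scores) / len(scores), scores      # 4) summarise


# Hypothetical stand-in model: always predict the most common training label
def fit_majority(X_train, y_train):
    return Counter(y_train).most_common(1)[0][0]


def accuracy(model, X_test, y_test):
    return sum(1 for label in y_test if label == model) / len(y_test)


X = list(range(20))        # dummy features
y = [0] * 14 + [1] * 6     # dummy labels
mean_score, fold_scores = cross_validate(fit_majority, accuracy, X, y, k=5)
print(mean_score, fold_scores)
```

Passing `fit` and `score` as functions keeps the splitting logic independent of the model, so the same loop can be reused with any classifier and any of the metrics seen earlier.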

#### Which $k$ to choose?

* It must be representative: each fold should be large enough to be statistically representative of the whole dataset!

* $k=5$ and $k=10$ are the usual standard, but it depends on how many samples you have!

* If you set $k=n$ ($n$ being the number of samples in the dataset), then every sample is used once as the test set, with all the remaining samples as the training set

* This is also known as the **Leave-One-Out** approach

* Some datasets (like the one you will use in the bonus part of the lab) are already partitioned into $k$ folds

## Metrics used in Computer Vision

### IoU (A.K.A. Jaccard Index)

![Fig 8. Intersection over Union](https://www.dropbox.com/scl/fi/ass5opbcb3lmohtr22a7v/iou.png?rlkey=gj5g79mz05j3rsnoenwu3btc1&raw=1)

![Fig 9. Intersection over Union Formula](https://www.dropbox.com/scl/fi/tvu4tpdq7w1mns9uukcse/iouformula.png?rlkey=1n20rpcxk9hy2univxqetn1ca&raw=1)

* Normally, $IoU \geq 0.5$ is considered good, while $1$ is perfect!

![Fig 10. Intersection over Union Examples](https://www.dropbox.com/scl/fi/kol5jba0azx6qmrbg2ono/iouexamples.png?rlkey=g8lanp2tu1oqa5sgiyx61y55e&raw=1)
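For bounding boxes, the IoU can be computed from the box corners. The sketch below assumes axis-aligned boxes given as `(x1, y1, x2, y2)` with continuous coordinates (some implementations add 1 to widths and heights to count pixels instead); the function name and the two boxes are hypothetical.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Width and height are clamped at 0 when the boxes do not overlap
    inter_w = max(0.0, ix2 - ix1)
    inter_h = max(0.0, iy2 - iy1)
    intersection = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


ground_truth = (0, 0, 10, 10)
prediction = (5, 5, 15, 15)

iou = box_iou(ground_truth, prediction)
print(round(iou, 4))  # 0.1429
```

Here the boxes share a 5 by 5 region out of a combined area of 175, so the prediction would fall well below the usual $0.5$ threshold.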

### Dice Coefficient

* The "F1-Score" of computer vision metrics

* More widely used for segmentation

![Fig 11. Dice Coefficient](https://www.dropbox.com/scl/fi/40ne3x72458zpwp9c6fif/dice.png?rlkey=11aaqn1q1wn49tpp6oqs1oywd&raw=1)

**What is the difference between IoU and Dice?**

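* Both measure the overlap between the prediction and the ground truth, and both penalise missed object pixels (false negatives) as well as detections that **overestimate** where the object is (false positives)

* IoU penalises each mistake more heavily, so $IoU \leq Dice$ for the same prediction; the two are linked by $Dice = \frac{2 \times IoU}{1 + IoU}$

* Dice gives double weight to the correctly detected pixels, which makes it more forgiving for small objects; this is one reason it is popular for highly imbalanced segmentation datasets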
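To compare the two metrics on the same prediction, here is a minimal sketch that computes both from binary masks. The masks are flattened to lists of 0/1 for simplicity; the function name and the toy masks are hypothetical, and returning 1.0 when both masks are empty is just one possible convention.

```python
def dice_and_iou(mask_true, mask_pred):
    """Dice and IoU for two binary masks given as flat lists of 0/1."""
    tp = sum(1 for t, p in zip(mask_true, mask_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(mask_true, mask_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(mask_true, mask_pred) if t == 1 and p == 0)

    # Assumption: two empty masks are treated as a perfect match
    if tp + fp + fn == 0:
        return 1.0, 1.0

    dice = 2 * tp / (2 * tp + fp + fn)  # same form as the F1-score
    iou = tp / (tp + fp + fn)
    return dice, iou


# Toy one-dimensional "masks": the prediction is shifted by one pixel
mask_true = [0, 1, 1, 1, 1, 0, 0, 0]
mask_pred = [0, 0, 1, 1, 1, 1, 0, 0]

dice, iou = dice_and_iou(mask_true, mask_pred)
print(dice)  # 0.75
print(iou)   # 0.6
```

The Dice value is the higher of the two, and it can be recovered from the IoU as $\frac{2 \times 0.6}{1 + 0.6} = 0.75$, so both metrics rank predictions in the same order.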

# LAB: PERFORMANCE MEASURES FOR BINARY DATASETS