# ML Model Evaluation Tools & Metrics

[Source Part 1](https://towardsdatascience.com/20-popular-machine-learning-metrics-part-1-classification-regression-evaluation-metrics-1ca3e282a2ce)

[Source Part 2](https://towardsdatascience.com/20-popular-machine-learning-metrics-part-2-ranking-statistical-metrics-22c3e5a937b6)

## Classification Metrics

### Confusion Matrix

A confusion matrix is a table of the model's predictions against the ground-truth labels. It gives a quick view of the model's precision and recall.

For example, suppose we have 1000 non-cat images and 100 cat images. We feed them into a classification model and receive the following result.

![Confusion Matrix](./assets/confusion.png)

We will use this example and its confusion matrix to derive the evaluation metrics.
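The counts used in the rest of this section can be reproduced from label lists. The following is a minimal sketch in plain Python. The helper `confusion_counts` and the label lists are illustrative assumptions, built to match the counts that appear in the formulas below (90 true positives, 60 false positives, 940 true negatives, 10 false negatives), with `1` standing for cat and `0` for non-cat.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) for a binary classification problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # predicted cat, really a cat
            else:
                fp += 1  # predicted cat, really a non-cat
        else:
            if truth == positive:
                fn += 1  # predicted non-cat, really a cat
            else:
                tn += 1  # predicted non-cat, really a non-cat
    return tp, fp, tn, fn


# Illustrative labels: 100 cat images (1) followed by 1000 non-cat images (0).
y_true = [1] * 100 + [0] * 1000
# 90 cats found and 10 missed, then 60 false alarms and 940 correct rejections.
y_pred = [1] * 90 + [0] * 10 + [1] * 60 + [0] * 940

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
```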

### Accuracy

Accuracy is the number of correct predictions divided by the total number of predictions.

$$
\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} = \frac{90 + 940}{1100} = 0.936
$$
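As a quick check of the arithmetic, here is a small sketch that computes accuracy from the four counts. The `accuracy` helper is an illustrative name, and the second call uses a hypothetical baseline that labels every image as non-cat, to show how little accuracy alone says on this imbalanced dataset.

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)


# Counts from the cat example.
model_acc = accuracy(tp=90, fp=60, tn=940, fn=10)

# Hypothetical baseline that labels every image as non-cat:
# all 1000 non-cats are correct (TN) and all 100 cats are missed (FN).
baseline_acc = accuracy(tp=0, fp=0, tn=1000, fn=100)

print(f"model:    {model_acc:.3f}")
print(f"baseline: {baseline_acc:.3f}")
```

The baseline never finds a single cat, yet its accuracy is only a few points below the model's.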

### Precision

If the class distribution is imbalanced (i.e. one class appears much more frequently than the others), accuracy is not a reliable metric. A model that predicts the most frequent class for every sample achieves high accuracy, but that figure is deceiving. Precision, the fraction of samples predicted as a class that actually belong to it, gives a clearer picture of the model's performance.

$$
\text{Precision}_\text{positive} = \frac{TP}{TP + FP} = \frac{90}{90+60} = 0.60
$$

$$
\text{Precision}_\text{negative} = \frac{TN}{TN + FN} = \frac{940}{940+10} = 0.989
$$
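The two precision values can be computed with one helper by swapping which counts play the role of correct and incorrect predictions. This is a sketch with illustrative names. Returning `0.0` when a class is never predicted is an assumption made here for convenience, since precision is undefined in that case.

```python
def precision(true_hits, false_hits):
    """Of everything predicted as a class, the fraction that truly belongs to it."""
    predicted = true_hits + false_hits
    # Assumption: report 0.0 when the class is never predicted (precision is undefined).
    return true_hits / predicted if predicted else 0.0


tp, fp, tn, fn = 90, 60, 940, 10  # counts from the cat example

precision_cat = precision(tp, fp)      # predicted cat: 90 right, 60 wrong
precision_non_cat = precision(tn, fn)  # predicted non-cat: 940 right, 10 wrong

print(f"precision (cat):     {precision_cat:.3f}")
print(f"precision (non-cat): {precision_non_cat:.3f}")
```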

We can see that the model is not performing well at detecting cats: only 60% of its "cat" predictions are correct. As we optimize for precision, the model may become more "conservative" about what it considers a "cat", which causes recall to drop (see next section).

### Recall

Recall is the fraction of samples of a class that the model predicts correctly. Of all the cat images, how many does the model label as cat? Of all the non-cat images, how many does it label as non-cat?

$$
\text{Recall}_\text{positive} = \frac{TP}{TP + FN} = \frac{90}{90+10} = 0.90
$$

$$
\text{Recall}_\text{negative} = \frac{TN}{TN + FP} = \frac{940}{940+60} = 0.94
$$
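Recall uses the same counts as precision but divides by the number of samples that truly belong to the class rather than the number predicted as that class. A minimal sketch, with an illustrative `recall` helper and the same `0.0` fallback assumption for an empty class:

```python
def recall(true_hits, misses):
    """Of everything that truly belongs to a class, the fraction the model found."""
    actual = true_hits + misses
    # Assumption: report 0.0 when the class has no samples (recall is undefined).
    return true_hits / actual if actual else 0.0


tp, fp, tn, fn = 90, 60, 940, 10  # counts from the cat example

recall_cat = recall(tp, fn)      # 100 real cats: 90 found, 10 missed
recall_non_cat = recall(tn, fp)  # 1000 real non-cats: 940 found, 60 missed

print(f"recall (cat):     {recall_cat:.2f}")
print(f"recall (non-cat): {recall_non_cat:.2f}")
```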

Optimizing for high recall generally means minimizing false negatives by predicting positive more often, even if some of those predictions turn out to be false positives. This causes precision to drop.

> If the cost of a FP is low relative to a FN, e.g. when screening a patient for cancer, then we should
> optimize for recall, because a missed case (FN) is the costly error in this scenario.

### F1 Score

Depending on the application, either recall or precision may deserve higher priority. But in many applications both are important, so it is natural to look for a way to combine them into a single score.

**F1** is the harmonic mean of precision and recall.

$$
\text{F1} = \frac{2 * \text{Precision} * \text{Recall}}{\text{Precision} + \text{Recall}}
$$
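Applied to the cat class from the running example, the formula can be sketched as follows. The function name `f1_score` is illustrative, and returning `0.0` when both inputs are zero is an assumption made to avoid a division by zero.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumption: define F1 as 0 when both inputs are 0
    return 2 * precision * recall / (precision + recall)


precision_cat = 90 / (90 + 60)  # TP / (TP + FP)
recall_cat = 90 / (90 + 10)     # TP / (TP + FN)

f1_cat = f1_score(precision_cat, recall_cat)
arithmetic_mean = (precision_cat + recall_cat) / 2

print(f"F1 (cat):        {f1_cat:.2f}")
print(f"arithmetic mean: {arithmetic_mean:.2f}")
```

The harmonic mean sits below the arithmetic mean because it is pulled toward the weaker of the two values, so a model cannot earn a high F1 by excelling at only one of them.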

There is usually a trade-off between a model's precision and recall. If you push precision higher, you should expect recall to drop, and vice versa.

### Sensitivity and Specificity

Sensitivity and specificity are simply the recall of the positive class and the recall of the negative class, respectively.

$$
\text{Sensitivity} = \text{True Positive Rate} = \frac{TP}{TP + FN}
$$ 

$$
\text{Specificity} = \text{True Negative Rate} = \frac{TN}{TN + FP}
$$
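With the counts from the cat example, both rates follow directly. The sketch below also computes the false positive rate, which is the complement of specificity. Variable names are illustrative.

```python
tp, fp, tn, fn = 90, 60, 940, 10  # counts from the cat example

sensitivity = tp / (tp + fn)  # true positive rate: cats correctly found
specificity = tn / (tn + fp)  # true negative rate: non-cats correctly rejected

# The false positive rate is the share of non-cats wrongly flagged as cats.
false_positive_rate = fp / (fp + tn)

print(f"sensitivity:         {sensitivity:.2f}")
print(f"specificity:         {specificity:.2f}")
print(f"false positive rate: {false_positive_rate:.2f}")
```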

### ROC Curve

The **receiver operating characteristic** curve is a plot that shows the performance of a binary classifier as a function of its cut-off threshold. It plots the true positive rate against the false positive rate for various threshold values.

Classification models typically produce a probability for each sample as their prediction. The model compares each output probability with a cut-off threshold to decide whether the output is positive or negative. For example, a model may predict `[0.45, 0.60, 0.70, 0.30]` for 4 sample images.

- If `cut-off=0.5` then predicted labels are `[0, 1, 1, 0]`
- If `cut-off=0.2` then predicted labels are `[1, 1, 1, 1]`
- If `cut-off=0.8` then predicted labels are `[0, 0, 0, 0]`
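The three cases above can be reproduced with a one-line thresholding function. This is a sketch: `apply_cutoff` is an illustrative name, and treating a probability equal to the cut-off as positive (`>=`) is an assumption, since the example contains no ties.

```python
def apply_cutoff(probabilities, cutoff):
    """Turn predicted probabilities into 0/1 labels using a cut-off threshold."""
    # Assumption: a probability exactly equal to the cut-off counts as positive.
    return [1 if p >= cutoff else 0 for p in probabilities]


probabilities = [0.45, 0.60, 0.70, 0.30]

for cutoff in (0.5, 0.2, 0.8):
    print(f"cut-off={cutoff}: {apply_cutoff(probabilities, cutoff)}")
```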

The cut-off threshold directly affects the true positive rate and the false positive rate, and with them the precision and recall. Plotting the two rates across thresholds gives a graph like the following.

![ROC Curve](https://upload.wikimedia.org/wikipedia/commons/3/36/Roc-draft-xkcd-style.svg)

The ROC curve is a useful tool for picking a cut-off threshold for the model.
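To make the construction concrete, the sketch below computes one ROC point per cut-off for the four example probabilities. The ground-truth labels in `y_true` are hypothetical (the example above does not give them), the helper name `roc_point` is illustrative, and the code assumes both classes are present so that neither rate divides by zero.

```python
def roc_point(y_true, scores, cutoff):
    """Return (false positive rate, true positive rate) at a single cut-off."""
    tp = fp = tn = fn = 0
    for truth, score in zip(y_true, scores):
        pred = 1 if score >= cutoff else 0
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 1:
            fn += 1
        else:
            tn += 1
    # Assumes at least one positive and one negative sample.
    return fp / (fp + tn), tp / (tp + fn)


scores = [0.45, 0.60, 0.70, 0.30]
y_true = [1, 0, 1, 0]  # hypothetical labels, chosen only for illustration

# Sweep the cut-off from strict to lenient; each value gives one point on the curve.
cutoffs = (0.8, 0.65, 0.5, 0.4, 0.2)
roc_points = [roc_point(y_true, scores, c) for c in cutoffs]

for cutoff, (fpr, tpr) in zip(cutoffs, roc_points):
    print(f"cut-off={cutoff}: FPR={fpr:.1f}, TPR={tpr:.1f}")
```

Lowering the cut-off moves the point up and to the right: more positives are caught, at the price of more false alarms.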

### AUC

The **area under the curve** (AUC) is an aggregate measure of a binary classifier's performance across all possible threshold values, so it does not depend on any single threshold. It is the integral of the ROC curve, i.e. the area under it. One way of interpreting AUC is as _the probability that the model ranks a random positive example more highly than a random negative example_. A model whose rankings are 100% wrong has an AUC of 0.0, one whose rankings are 100% correct has an AUC of 1.0, and a model that ranks at random scores about 0.5.

![Area Under the Curve](https://developers.google.com/machine-learning/crash-course/images/AUC.svg)
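The ranking interpretation of AUC translates directly into code: compare every positive example with every negative example and count how often the positive one receives the higher score. The sketch below does this by brute force. The labels are hypothetical, the function name `auc_by_ranking` is illustrative, and counting a tie as half a win is a common convention adopted here as an assumption.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]

    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0  # positive ranked above negative
            elif pos == neg:
                wins += 0.5  # assumption: a tie counts as half a win

    # Assumes at least one positive and one negative sample.
    return wins / (len(positives) * len(negatives))


scores = [0.45, 0.60, 0.70, 0.30]
y_true = [1, 0, 1, 0]  # hypothetical labels, chosen only for illustration

auc = auc_by_ranking(y_true, scores)
print(f"AUC = {auc:.2f}")
```

Of the four positive/negative pairs, only one is ranked the wrong way round (the positive at 0.45 scores below the negative at 0.60). This pairwise loop is quadratic in the number of samples, so it is meant for illustration rather than for large datasets.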