# Classification metrics
Author: Geraldine Klarenberg

Based on the Google Machine Learning Crash Course

## Thresholds
In previous lessons, we have talked about using regression models to predict values. But sometimes we are interested in **classifying** things: "spam" vs "not spam", "barking" vs "not barking", etc. 

Logistic regression is a great tool to use in ML classification models. Its output is a probability, which we turn into a class by defining a **classification threshold**. For instance, with a threshold of 0.8, an email for which the model estimates a spam probability of 0.8 or higher (based on some characteristics) is classified as "spam". If the probability estimate is less than 0.8, the model classifies it as "not spam". The threshold allows us to map a logistic regression value to a binary category (the prediction).
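
As a minimal sketch of this mapping, the snippet below applies a threshold to a handful of made-up spam probabilities. The function name `classify` and the probabilities are hypothetical, and the sketch assumes that a probability exactly equal to the threshold counts as positive.

```python
def classify(probabilities, threshold=0.5):
    """Map predicted probabilities to binary labels (1 = spam, 0 = not spam)."""
    # Assumption: a probability equal to the threshold is classified as positive.
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical spam probabilities, standing in for logistic regression output
spam_probabilities = [0.95, 0.80, 0.65, 0.30, 0.05]

labels_at_080 = classify(spam_probabilities, threshold=0.8)
labels_at_050 = classify(spam_probabilities, threshold=0.5)

print(labels_at_080)  # [1, 1, 0, 0, 0]
print(labels_at_050)  # [1, 1, 1, 0, 0]
```

The same probabilities lead to different predictions depending on the threshold: the email with probability 0.65 is "not spam" at a threshold of 0.8, but "spam" at 0.5.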

Thresholds are problem-dependent, so they will have to be tuned for the specific problem you are dealing with.

In this lesson we will look at metrics you can use to evaluate a classification model's predictions, and what changing the threshold does to your model and predictions.

## True, false, positive, negative...

Now, we could simply look at "accuracy": the ratio of all correct predictions to all predictions. This is simple, intuitive and straightforward. 

But there are some problems with this approach:
* It does not work well when there is (class) imbalance, that is, in situations where positive or negative outcomes are rare; 
* and, most importantly: different kinds of mistakes can have different costs...
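
To illustrate the imbalance problem, here is a small sketch with made-up numbers: a data set of 1000 emails of which only 10 are spam, and a "model" that always predicts "not spam". The helper `accuracy` and the counts are hypothetical.

```python
def accuracy(tp, fp, fn, tn):
    """Ratio of all correct predictions to all predictions."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical imbalanced data set: 1000 emails, only 10 of them spam.
# A model that always predicts "not spam" never produces a positive:
# it misses all 10 spam emails (FN) and gets the 990 others right (TN).
always_negative_accuracy = accuracy(tp=0, fp=0, fn=10, tn=990)

print(always_negative_accuracy)  # 0.99
```

An accuracy of 99% looks impressive, yet this model does not catch a single spam email.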

### The boy who cried wolf...

We all know the story!

![Illustration of the boy who cried wolf](../nb-images/wolfpic.jpg)

For this example, we define "there actually is a wolf" as a positive class, and "there is no wolf" as a negative class. The predictions that a model makes can be true or false for both classes, generating 4 outcomes:

![A table showing a confusion matrix based on the story of the boy who cried wolf](../nb-images/confusionmatrix_wolf.png)

This table is also called a *confusion matrix*.

There are 2 metrics we can derive from these outcomes: precision and recall.

## Precision
Precision asks the question: what proportion of the positive predictions was actually correct?

To calculate the precision of your model, take all true positives divided by *all* positive predictions:
$$\text{Precision} = \frac{TP}{TP+FP}$$

Basically: **when the model cried 'wolf', how often was there really a wolf?**

**NB** If your model produces no false positives, the value of the precision is 1.0. The more false positives there are relative to true positives, the closer the precision gets to 0. Precision always lies between 0 and 1.
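
The formula can be sketched as a small function. The name `precision` and the counts below are made up for illustration, and returning NaN when the model makes no positive predictions at all is an assumption of this sketch (precision is undefined in that case).

```python
def precision(tp, fp):
    """Fraction of positive predictions that were actually correct."""
    if tp + fp == 0:
        # Assumption: with no positive predictions, precision is undefined.
        return float("nan")
    return tp / (tp + fp)


# Hypothetical counts
precision_some_fp = precision(tp=30, fp=10)
precision_no_fp = precision(tp=30, fp=0)

print(precision_some_fp)  # 0.75
print(precision_no_fp)    # 1.0
```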

### Exercise
Calculate the precision of a model with the following outcomes

true positives (TP): 1 | false positives (FP): 1 
-------|--------
**false negatives (FN): 8** | **true negatives (TN): 90** 

## Recall
Recall tries to answer the question: what proportion of actual positives was identified correctly?

To calculate recall, divide all true positives by the true positives plus the false negatives:
$$\text{Recall} = \frac{TP}{TP+FN}$$

Basically: **of all the wolves that tried to get into the village, how many did the model actually detect?**

**NB** If the model produces no false negatives, recall equals 1.0.
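
A matching sketch for recall, again with a hypothetical function name and made-up counts. Returning NaN when there are no actual positives is an assumption of this sketch.

```python
def recall(tp, fn):
    """Fraction of actual positives that the model identified."""
    if tp + fn == 0:
        # Assumption: with no actual positives, recall is undefined.
        return float("nan")
    return tp / (tp + fn)


# Hypothetical counts
recall_some_fn = recall(tp=30, fn=20)
recall_no_fn = recall(tp=30, fn=0)

print(recall_some_fn)  # 0.6
print(recall_no_fn)    # 1.0
```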

### Exercise
For the same confusion matrix as above, calculate the recall.

## Balancing precision and recall
To evaluate your model, you should look at **both** precision and recall. They are often in tension though: improving one tends to reduce the other.
Lowering the classification threshold improves recall (your model will cry wolf at every little sound it hears) but will negatively affect precision (it will cry wolf too often).
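
The tension can be made concrete with a small sketch that evaluates the same scores at three thresholds. The labels, scores, thresholds and the helper `confusion_counts` are all made up for illustration; scores at or above the threshold are assumed to be predicted positive.

```python
def confusion_counts(labels, scores, threshold):
    """Count TP, FP, FN, TN when scores >= threshold are predicted positive."""
    tp = fp = fn = tn = 0
    for label, score in zip(labels, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Made-up labels (1 = wolf, 0 = no wolf) and model scores
labels = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.65, 0.7, 0.8, 0.9]

results = {}
for threshold in (0.25, 0.55, 0.675):
    tp, fp, fn, tn = confusion_counts(labels, scores, threshold)
    results[threshold] = (tp / (tp + fp), tp / (tp + fn))
    print(f"threshold={threshold}: "
          f"precision={results[threshold][0]:.2f}, "
          f"recall={results[threshold][1]:.2f}")
```

For these made-up numbers, precision rises and recall falls as the threshold goes up.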

### Exercise
#### Part 1
Look at the outputs of a model that classifies incoming emails as "spam" or "not spam".

![Image of outcomes of a spam/not spam classification model](../nb-images/PrecisionVsRecallBase.svg)

The confusion matrix looks as follows

true positives (TP): 8 | false positives (FP): 2 
-------|--------
**false negatives (FN): 3** | **true negatives (TN): 17** 

Calculate the precision and recall for this model.

#### Part 2
Now see what happens to the outcomes (below) if we increase the threshold:

![Image of outcomes of a spam/not spam classification model](../nb-images/PrecisionVsRecallRaiseThreshold.svg)

The confusion matrix looks as follows

true positives (TP): 7 | false positives (FP): 1 
-------|--------
**false negatives (FN): 4** | **true negatives (TN): 18** 

Calculate the precision and recall again.

**Compare the precision and recall from the first and second model. What do you notice?** 

## Evaluate model performance
We can evaluate the performance of a classification model across all classification thresholds. For each threshold, calculate the *true positive rate* (TPR) and the *false positive rate* (FPR). The true positive rate is synonymous with recall (and sometimes called *sensitivity*) and is thus calculated as

$$\text{TPR} = \frac{TP}{TP+FN}$$

The false positive rate (which equals 1 minus the *specificity*) is:

$$\text{FPR} = \frac{FP}{FP+TN}$$

When you plot the TPR (y-axis) against the FPR (x-axis) for all the different thresholds, you get a receiver operating characteristic (ROC) curve. Below is a typical ROC curve.

![Image of an ROC curve](../nb-images/ROCCurve.svg)
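
The points on such a curve can be computed with a short sketch like the one below. The function `roc_points`, the labels and the scores are made up for illustration. The sketch assumes that both classes are present in the data and that scores at or above the threshold are predicted positive.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs, from the strictest to the most lenient threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    # Start above the highest score (nothing predicted positive),
    # then use each distinct score as a threshold.
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for threshold in thresholds:
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Made-up labels (1 = positive, 0 = negative) and model scores
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(labels, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.1f}, TPR={tpr:.1f}")
```

The curve always starts at (0, 0), where nothing is predicted positive, and ends at (1, 1), where everything is.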

To evaluate the model, we look at the area under the curve (AUC). The AUC has a probabilistic interpretation: it represents the probability that the model scores a random positive (green) example higher than a random negative (red) example, so that the positive example is positioned to the right of the negative one in the figure below.

![Image with predictions ranked according to logistic regression score](../nb-images/AUCPredictionsRanked.svg)
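
This interpretation can be checked directly by comparing every positive example with every negative example. The sketch below does that for made-up labels and scores; the function name `pairwise_auc` is hypothetical, and counting ties as half a correct ranking is a common convention that is assumed here.

```python
def pairwise_auc(labels, scores):
    """Fraction of positive-negative pairs in which the positive scores higher."""
    positive_scores = [s for y, s in zip(labels, scores) if y == 1]
    negative_scores = [s for y, s in zip(labels, scores) if y == 0]
    correct = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5  # Assumption: ties count as half
    return correct / (len(positive_scores) * len(negative_scores))


# Made-up labels (1 = positive, 0 = negative) and model scores
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = pairwise_auc(labels, scores)
print(auc)  # 0.75
```

Of the four positive-negative pairs, three are ranked correctly. Only the positive with score 0.35 ends up below the negative with score 0.4.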

So if the AUC is 0.9, the model ranks the positive example above the negative one in 90% of such pairs. Below are a few visualizations of AUC results. For each case, the first image shows the distributions of the model's scores for the negative and positive classes, and the second image shows the corresponding ROC curve.

![Image with distributions of positive and negative classes - perfect](../nb-images/TowardsDataScienceAUC_perfect.png) 
![Image with AUC - perfect](../nb-images/TowardsDataScienceAUC_perfect2.png)
**This AUC suggests a perfect model** (which is suspicious!)


![Image with distributions of positive and negative classes - normal](../nb-images/TowardsDataScienceAUC_normal.png)
![Image with AUC - normal](../nb-images/TowardsDataScienceAUC_normal2.png)
**This is what most AUCs look like**. In this case, AUC = 0.7 means that there is a 70% chance that the model will be able to distinguish between the positive and negative classes.

![Image with distributions of positive and negative classes - worst](../nb-images/TowardsDataScienceAUC_worst.png)
![Image with AUC - worst](../nb-images/TowardsDataScienceAUC_worst2.png)
**This is actually the worst-case scenario.** This model has no discrimination capacity at all. A model like that has an AUC of about 0.5, which is no better than random guessing.

## Prediction bias
Logistic regression should be unbiased, meaning that the average of the predictions should be more or less equal to the average of the observations. **Prediction bias** is the difference between the average of the predictions and the average of the labels in a data set.

This check is not perfect: a model that almost always predicts the average label will show little bias, even though it is not a useful model. However, if there **is** bias ("significant nonzero bias"), that means there is something going on that needs to be checked, specifically that the model is wrong about the frequency of positive labels.
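
As a minimal sketch, prediction bias can be computed as the difference between two averages. The function name `prediction_bias` and the data are made up for illustration.

```python
def prediction_bias(predictions, labels):
    """Average of the predictions minus average of the observed labels."""
    mean_prediction = sum(predictions) / len(predictions)
    mean_label = sum(labels) / len(labels)
    return mean_prediction - mean_label


# Made-up predicted probabilities and observed labels (1 = positive)
predictions = [0.9, 0.8, 0.3, 0.2, 0.3]
labels = [1, 1, 0, 0, 0]

bias = prediction_bias(predictions, labels)
print(round(bias, 3))
```

Here the model predicts an average of about 0.5 while 40% of the labels are positive, so it overestimates the frequency of positive labels by roughly 0.1.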

Possible root causes of prediction bias are:
* Incomplete feature set
* Noisy data set
* Buggy pipeline
* Biased training sample
* Overly strong regularization

### Buckets and prediction bias
For logistic regression, this process is a bit more involved, as the label assigned to an example is either 0 or 1, while the prediction is a probability. So you cannot accurately determine the prediction bias based on one example. You need to group data in "buckets" and examine the prediction bias on that. Prediction bias for logistic regression only makes sense when grouping enough examples together to be able to compare a predicted value (for example, 0.392) to observed values (for example, 0.394). 

You can create buckets by linearly breaking up the target predictions, or by creating quantiles. 

The plot below is a calibration plot. Each dot represents a bucket with 1000 values. On the x-axis we have the average value of the predictions for that bucket and on the y-axis the average of the actual observations. Note that the axes are on logarithmic scales.

![Image of a calibration plot with buckets](../nb-images/BucketingBias.svg)
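
The points of such a plot can be sketched as follows, using the quantile approach: sort the examples by prediction and cut them into buckets of equal size. The function `calibration_buckets` and the data are made up, and the bucket size of 4 is only chosen to keep the example small.

```python
def calibration_buckets(predictions, labels, bucket_size):
    """Return (mean prediction, mean label) for each equal-size bucket."""
    # Sort examples by predicted value so each bucket covers a range of predictions
    pairs = sorted(zip(predictions, labels))
    buckets = []
    for start in range(0, len(pairs), bucket_size):
        chunk = pairs[start:start + bucket_size]
        mean_prediction = sum(p for p, _ in chunk) / len(chunk)
        mean_label = sum(y for _, y in chunk) / len(chunk)
        buckets.append((mean_prediction, mean_label))
    return buckets


# Made-up predicted probabilities and observed labels (1 = positive)
predictions = [0.1, 0.2, 0.15, 0.25, 0.7, 0.8, 0.75, 0.85]
labels = [0, 0, 0, 1, 1, 1, 0, 1]

buckets = calibration_buckets(predictions, labels, bucket_size=4)
for mean_prediction, mean_label in buckets:
    print(f"mean prediction={mean_prediction:.3f}, mean label={mean_label:.3f}")
```

Each pair corresponds to one dot in a calibration plot. The closer the mean prediction is to the mean label, the smaller the prediction bias in that bucket.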

## Exercise 