<div align="center">
<h1>Stage 4: Modelling (Metrics)</h1>
by Hongnan Gao
<br>
</div>

## Define Metrics

**Disclaimer: For a more detailed understanding of different metrics, do navigate to my self-made notes on metrics [here](https://ghnreigns.github.io/reighns-ml-website/metrics/classification_metrics/classification_metrics/).**

---

Choosing a metric to measure the classifier's (hypothesis) performance is important, as choosing the wrong one can lead to disastrous interpretations. One prime example is using accuracy on imbalanced datasets. Consider 1 million data points, of which $99\%$ are benign and $1\%$ are malignant. Even a ZeroR baseline model, which predicts the majority class regardless of the input, achieves $99\%$ accuracy while missing every positive sample, which unfortunately is the class we are usually most interested in.

---

<div class="alert alert-block alert-danger">
<b>Say No to Accuracy:</b> Consider an imbalanced set, where the training data set has 100 patients (data points), and the ground truth is 90 patients are of class = 0, which means that these patients do not have cancer, whereas the remaining 10 patients are in class 1, where they do have cancer. This is an example of class imbalance where the ratio of class 1 to class 0 is 1:9.
</div>   
    
Consider **a baseline (almost trivial) classifier**:

```python
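def zeroR(patient_data):
    # ZeroR ignores the features and always predicts the majority class.
    benign = 0  # class 0: the patient does not have cancer
    return [benign] * len(patient_data)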
```
        

Here we predict every patient's class as the most frequent class in the training set. The most frequent class in this example is class = 0 (patients without cancer), so we simply assign this class to everyone. By doing this, we inevitably achieve an **in-sample** accuracy of $\frac{90}{100} = 90\%$. Unfortunately, this seemingly high accuracy is completely useless, because the classifier did not label a single cancer patient correctly.

The consequence can be serious. Assume the test set has the same distribution as our training set: in a test set of 1000 patients, 900 are negative and 100 are positive. Our model predicts every one of them as benign, yielding a $90\%$ **out-of-sample** accuracy.

What do we conclude? For one, an `accuracy` of 90% looks good to the layperson, but the model failed to identify the most important group of people. Misclassifying true cancer patients as healthy is very bad!
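
The numbers above can be reproduced with a small sketch. It assumes scikit-learn's `DummyClassifier` with `strategy="most_frequent"` as a stand-in for ZeroR, and a placeholder feature matrix of zeros (ZeroR never looks at the features); neither comes from the original example.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# 90 benign (class 0) and 10 malignant (class 1) patients, as in the example above.
y_train = np.array([0] * 90 + [1] * 10)
X_train = np.zeros((len(y_train), 1))  # placeholder features; ZeroR ignores them

# Assumption: DummyClassifier(strategy="most_frequent") plays the role of ZeroR.
zero_r = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
y_pred = zero_r.predict(X_train)

accuracy = accuracy_score(y_train, y_pred)
recall = recall_score(y_train, y_pred)  # fraction of cancer patients detected
print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

The accuracy comes out at 0.90 while the recall on the cancer class is 0.00, which is exactly the failure mode described above.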

---

For the reasons mentioned above, we will use metrics that help us reduce False Negatives and, at the same time, reward meaningful probabilistic predictions. To achieve both, we will use the area under the **Receiver operating characteristic (ROC)** curve as the primary metric for the model to maximize (this is our $\mathcal{M}$), and the **Brier Score**, a [proper scoring rule](https://en.wikipedia.org/wiki/Scoring_rule), to measure the quality of our probabilistic predictions. We will go into some detail in the following sections to justify our choice.

## Proper Scoring Rule

The math behind the idea of Proper Scoring Rule is non-trivial. Here, we try to understand why a proper scoring rule is desired in the context of binary classification.

---

<div class="alert alert-success" role="alert">
<li> <b>Strictly Proper Scoring Rule:</b> Brier Score Loss, for example, tells us that the best possible score, 0 (lowest loss), is obtained if and only if the probability we predict for a sample is the true probability itself. In other words, if a selected sample is of class 1, we must predict class 1 with 100% probability in order to get a loss of 0.
    
<li> <b>Proper Scoring Rule:</b> Read [here](https://stats.stackexchange.com/questions/339919/what-does-it-mean-that-auc-is-a-semi-proper-scoring-rule) for this.
    
<li> <b>Semi Proper Scoring Rule:</b> AUROC, as mentioned, does not tell us whether a classifier's predictions are close to the true probabilities. In the example in the ROC section below, we even obtain a full score of 1 although the predicted probabilities all lie between 0.51 and 0.52.

<li> <b>Improper Scoring Rule:</b> Accuracy is a prime example: the accuracy score tells us nothing about how close our predicted probabilities are to the true probability distribution of our samples.
</div>
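
To make "strictly proper" a little more concrete, here is a minimal sketch. It assumes a single event whose true probability of being positive is `p_true = 0.3` (an illustrative value, not taken from these notes) and evaluates the expected Brier loss of every candidate forecast `q` on a grid. Under that assumption, the expected loss should be smallest when the forecast equals the true probability.

```python
import numpy as np

p_true = 0.3  # assumed true probability of the positive class (illustrative)
q_grid = np.linspace(0.0, 1.0, 101)  # candidate forecasts 0.00, 0.01, ..., 1.00

# Expected Brier loss of forecasting q when the outcome is 1 with probability p_true:
# E[(q - y)^2] = p_true * (q - 1)^2 + (1 - p_true) * q^2
expected_brier = p_true * (q_grid - 1.0) ** 2 + (1.0 - p_true) * q_grid ** 2

best_q = q_grid[np.argmin(expected_brier)]
print(f"forecast minimizing the expected Brier loss: {best_q:.2f}")
```

Any forecast other than `p_true` gives a strictly larger expected loss, which is what distinguishes a strictly proper rule from metrics that only look at rankings or hard labels.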

## Receiver operating characteristic (ROC)

<div class="alert alert-success" role="alert">
    <b>Definition:</b> In its basic (non-probabilistic) interpretation, the ROC curve is a graph that plots the True Positive Rate on the y-axis against the False Positive Rate on the x-axis, parametrized by a threshold vector $\vec{t}$. We then look at the area under the ROC curve (AUROC) to get an overall performance measure.
</div>

---

The choice of ROC over other metrics such as Accuracy was motivated in the opening section. **We also established that we want to reduce False Negatives (FN), since misclassifying a positive patient as benign is far more costly than the other way round.** One could choose to maximize **Recall** in order to reduce FN, but this is less than ideal during training because recall is a thresholded metric: it is computed at one fixed threshold and does not tell us how the model behaves at other thresholds. This leads us to choose ROC for the following two main reasons:

### Threshold Invariant

By definition, the ROC curve is traced out by the pairs $(FPR, TPR)$ computed over all thresholds $t$. Consequently, the AUROC is threshold invariant, allowing us to look at the model's performance over all thresholds. We note that ROC may not be that reliable for very imbalanced datasets where the majority is in the negative class: $FPR = \dfrac{FP}{FP+TN}$ may seem deceptively low because the denominator is made large by the sheer number of TN. In this case, we may also look at the Precision-Recall curve.
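
As a rough illustration of how the curve is built from thresholds, the sketch below uses scikit-learn's `roc_curve` and `roc_auc_score` on eight made-up patients. The labels and probabilities are hypothetical values chosen for this example only.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical labels and predicted probabilities for eight patients.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9])

# Each threshold t yields one (FPR, TPR) point: predict positive when y_prob >= t.
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
for f, t, thr in zip(fpr, tpr, thresholds):
    print(f"threshold={thr:.2f}  FPR={f:.2f}  TPR={t:.2f}")

# AUROC summarizes all of these points in a single number.
auroc = roc_auc_score(y_true, y_prob)
print(f"AUROC={auroc:.4f}")
```

Here 15 of the 16 positive-negative pairs are ranked correctly, so the AUROC is 0.9375. For heavily imbalanced data, `sklearn.metrics.precision_recall_curve` can be inspected in the same way.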

### Scale Invariant

Technically, this is not a property we want: it means that AUROC is not a proper scoring rule, as it can take in uncalibrated scores and still report a high value. A classic example I always use is the following:

```python
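from sklearn.metrics import roc_auc_score

y1 = [1, 0, 1, 0]              # ground truth
y2 = [0.52, 0.51, 0.52, 0.51]  # unconfident probabilities
y3 = [52, 51, 52, 51]          # same ranking, different scale
# AUROC depends only on the ranking of the scores, not on their scale.
print(roc_auc_score(y1, y2), roc_auc_score(y1, y3))  # 1.0 1.0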
```

The example tells us two things. First, as long as the ranking of the predictions is preserved, the final AUROC score is the same, regardless of scale. Second, even though the model gives very unconfident predictions, the AUROC score is 1, which can be misleadingly over-optimistic. With that, we introduce the Brier Score.

### Common Pitfalls

<div class="alert alert-block alert-danger">
<b>Careful when using ROC function!</b>   
    
When passing arguments to scikit-learn's <code>roc_auc_score</code> function, we should be careful not to pass <code>y_score=model.predict(X)</code>, because <code>y_score</code> expects <b>non-thresholded</b> scores such as predicted probabilities. If you pass the predicted labels (all 0s and 1s), the curve is effectively evaluated at a single threshold, which is not what AUROC is meant to measure.
</div>
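
A small sketch of this pitfall follows. The two arrays are hypothetical stand-ins: `y_prob` plays the role of `model.predict_proba(X)[:, 1]` and `y_label` plays the role of `model.predict(X)` with a 0.5 cut-off; no real model is fitted here.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
# Stand-in for model.predict_proba(X)[:, 1]: probability of the positive class.
y_prob = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9])
# Stand-in for model.predict(X): hard labels after thresholding at 0.5.
y_label = (y_prob >= 0.5).astype(int)

auc_from_probabilities = roc_auc_score(y_true, y_prob)
auc_from_labels = roc_auc_score(y_true, y_label)
print(f"AUROC from probabilities: {auc_from_probabilities:.4f}")
print(f"AUROC from hard labels:   {auc_from_labels:.4f}")
```

On this toy data the probabilities give 0.9375 while the hard labels give only 0.75, because thresholding throws away the ranking information within each predicted class.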

## Brier Score

<div class="alert alert-success" role="alert">
    <b>Definition:</b> The Brier Score is the mean squared difference between the predicted probabilities and the actual outcomes.
</div>

---

[Brier Score](https://en.wikipedia.org/wiki/Brier_score) is a strictly proper scoring rule while AUROC is [not](https://www.fharrell.com/post/class-damage/); in general, the lower the Brier Score, the better the probabilistic predictions. We can first compute the AUROC of the model, and then compute the Brier Score to gauge how well calibrated the predicted probabilities are.
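
To see the two metrics side by side, here is a minimal sketch using scikit-learn's `brier_score_loss` and `roc_auc_score`. It reuses the unconfident scores from the scale-invariance example and adds a second, confident set of probabilities that is made up purely for contrast.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

y_true = np.array([1, 0, 1, 0])
unconfident = np.array([0.52, 0.51, 0.52, 0.51])  # from the AUROC example
confident = np.array([0.99, 0.01, 0.99, 0.01])    # hypothetical, for contrast

# Both sets of scores rank the samples identically, so AUROC cannot tell them apart.
auroc_unconfident = roc_auc_score(y_true, unconfident)
auroc_confident = roc_auc_score(y_true, confident)

# The Brier score is the mean squared error between probabilities and outcomes.
brier_unconfident = brier_score_loss(y_true, unconfident)
brier_confident = brier_score_loss(y_true, confident)
manual_brier = np.mean((unconfident - y_true) ** 2)  # same quantity by hand

print(f"AUROC: {auroc_unconfident:.2f} vs {auroc_confident:.2f}")
print(f"Brier: {brier_unconfident:.5f} vs {brier_confident:.5f}")
```

Both AUROC values are 1.0, while the Brier scores are 0.24525 and 0.00010 respectively, so only the Brier score separates the two sets of predictions.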

### Well Calibrated

An intuitive way of understanding well calibrated probabilities is as follows, extracted from [Cambridge Spark's probability calibration post](https://blog.cambridgespark.com/probability-calibration-c7252ac123f):

> In very simple terms, these are probabilities which can be interpreted as a confidence interval. Furthermore, a classifier is said to produce well calibrated probabilities if for the instances (data points) receiving probability 0.5, 50% of those instances belongs to the positive class.

---

In my own words: suppose we predict a binary target with a well calibrated classifier, and 100 samples in our test set receive a predicted probability of around 0.1. Then about 10% of these 100 samples (roughly 10 of them) actually belong to the positive class.

The generic steps to construct a calibration plot are as follows:

1. Sort all the samples by the classifier's predicted probabilities, in either ascending or descending order.
2. Bin the sorted predictions into N bins (usually 10) along the X-axis. Note that this does not necessarily mean the bins are the fixed intervals 0-0.1, 0.1-0.2, ..., 0.9-1.
3. What step 2 means is this: say you have 100 predictions and 10 bins. Since the predictions are ***sorted***, we can easily divide the 100 predictions into 10 consecutive groups of 10. For illustration, assume the 100 predictions, sorted in ascending order, contain ten copies of 0.1, ten copies of 0.2, and so on up to ten copies of 1.
    ```python
    # 10 copies each of 0.1, 0.2, ..., 1.0, already sorted in ascending order
    y_pred = [round(0.1 * k, 1) for k in range(1, 11) for _ in range(10)]
    ```
4. Since we can divide the above into 10 bins, bin 1 holds the 10 samples with prediction 0.1, bin 2 holds the 10 samples with prediction 0.2, and so on. We then take the mean of the **predictions in each bin**: for the first bin, $\dfrac{1}{10}\sum_{i=1}^{10}0.1 = 0.1$, and for the second bin, $\dfrac{1}{10}\sum_{i=1}^{10}0.2 = 0.2$. In reality the numbers will rarely be this neat; I made this example for ease of illustration!
5. Step 4 gives us the X-axis: we turned 10 bins into 10 numbers, 0.1, 0.2, 0.3, ..., 1. We then need the corresponding y-value for each of them. For 0.1, the y-value is just the **fraction of positives**: out of the 10 samples in the first bin, how many were actually positive? We do this for all 10 bins (points) and plot a line graph, as seen in scikit-learn's calibration plots.
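
The steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the data are simulated so that each label is drawn with exactly its predicted probability (making the toy classifier well calibrated by construction), and the names `n_samples`, `n_bins`, `mean_predicted`, and `fraction_positive` are chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_bins = 100_000, 10

# Simulated predictions; labels are drawn so that P(y = 1) equals the prediction,
# which makes this toy "classifier" well calibrated by construction.
y_prob = rng.uniform(0.0, 1.0, size=n_samples)
y_true = rng.binomial(1, y_prob)

# Step 1: sort the samples by predicted probability.
order = np.argsort(y_prob)
# Steps 2-3: split the sorted samples into n_bins groups of (nearly) equal size.
bins = np.array_split(order, n_bins)
# Step 4: x-axis, the mean predicted probability in each bin.
mean_predicted = np.array([y_prob[idx].mean() for idx in bins])
# Step 5: y-axis, the fraction of actual positives in each bin.
fraction_positive = np.array([y_true[idx].mean() for idx in bins])

for x, y in zip(mean_predicted, fraction_positive):
    print(f"mean predicted={x:.3f}  fraction of positives={y:.3f}")
```

scikit-learn's `calibration_curve` in `sklearn.calibration` computes the same two arrays; its `strategy` parameter selects between fixed-width bins (`"uniform"`) and equal-count bins (`"quantile"`), the latter being closer to the sorted grouping described in the steps.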

---

It should now be apparent that a well calibrated model lies close to the $y = x$ line. That is, if the mean predicted probability is 0.1, then the y-value should also be 0.1, meaning that out of all the samples that were predicted as 0.1, only about 10% should actually be positive. The same logic applies to the rest!

---

### Brier Score Loss

Brier Score Loss is a handy metric to measure whether a classifier is well calibrated, as quoted from [scikit-learn](https://scikit-learn.org/stable/modules/calibration.html):

> Brier Score Loss may be used to assess how well a classifier is calibrated. However, this metric should be used with care because a lower Brier score does not always mean a better calibrated model. This is because the Brier score metric is a combination of calibration loss and refinement loss. Calibration loss is defined as the mean squared deviation from empirical probabilities derived from the slope of ROC segments. Refinement loss can be defined as the expected optimal loss as measured by the area under the optimal cost curve. As refinement loss can change independently from calibration loss, a lower Brier score does not necessarily mean a better calibrated model.

### Common Pitfalls

<div class="alert alert-block alert-danger">
<b>Class Imbalance:</b> The good ol' class imbalance issue pops up almost anywhere and everywhere. Intuitively, if the positive (or negative) class is very rare and the model is very confident in its predictions for the majority class but not so confident on the rare class, the overall Brier Score Loss may not be sensitive enough to reveal the classifier's inability to correctly classify the minority class.
</div>
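
A minimal sketch of this pitfall, using made-up data with a 1% positive class and a hypothetical model that assigns every patient a 1% probability of being positive:

```python
import numpy as np
from sklearn.metrics import brier_score_loss

# Hypothetical data: 990 negatives and 10 positives (1% positive class).
y_true = np.array([0] * 990 + [1] * 10)
# A model that assigns everyone a 1% chance of being positive.
y_prob = np.full(y_true.shape, 0.01)

overall_brier = brier_score_loss(y_true, y_prob)
# Brier score restricted to the rare positive class only.
positive_mask = y_true == 1
minority_brier = np.mean((y_prob[positive_mask] - y_true[positive_mask]) ** 2)
print(f"overall={overall_brier:.4f}, positives only={minority_brier:.4f}")
```

The overall Brier Score Loss is a tiny 0.0099, yet on the positives alone it is 0.9801: the overall number hides the fact that this model never singles out a single cancer patient.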