# Classification Metrics Review

#### Plan for the video
* Accuracy
* Logarithmic loss
* Area under ROC Curve (`AUC`)
* (Quadratic weighted) Kappa

### Notation

![classification-metrics-notation](../img/classification-metrics-notation.png)

* `Hard label` is usually a function of soft labels
  * usually `argmax` for multiclass tasks
  * for binary classification it can be thought of as a **thresholding function**
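
A minimal sketch of both conversions in plain Python. The function names and the sample numbers are illustrative assumptions, not part of the course material.

```python
def hard_label_multiclass(soft):
    """Return the index of the largest soft prediction (argmax)."""
    return max(range(len(soft)), key=lambda k: soft[k])


def hard_label_binary(p, threshold=0.5):
    """Return 1 if the predicted probability of class 1 exceeds the threshold."""
    return int(p > threshold)


soft_multiclass = [0.2, 0.7, 0.1]  # hypothetical soft labels for 3 classes
print(hard_label_multiclass(soft_multiclass))   # index of the largest score
print(hard_label_binary(0.64))                  # default threshold 0.5
print(hard_label_binary(0.64, threshold=0.7))   # stricter threshold
```

The same soft prediction can map to different hard labels depending on the threshold, which is why the threshold matters for accuracy later on.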

### Accuracy score

$$Accuracy = \frac{1}{N}\sum^N_{i=1}[\hat{y_i} = y_i]$$
* How frequently our class prediction is correct ($[\cdot]$ equals 1 when the condition holds and 0 otherwise)

$$Accuracy = \frac{1}{N}\sum^N_{i=1}[\alpha = y_i]$$
* **Best constant**:
  - predict the most frequent class
  
#### Caveat in interpreting the value of the accuracy score
* Dataset
  - 10 cats
  - 90 dogs
* **Predict always dog**:
  - Accuracy = `.9`
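
The caveat can be checked directly. This is a small sketch using the cats and dogs counts from above; the helper name `accuracy` is an assumption made for illustration.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of objects whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = ["cat"] * 10 + ["dog"] * 90

# Best constant for accuracy: the most frequent class in the targets.
best_constant = Counter(y_true).most_common(1)[0][0]
y_pred = [best_constant] * len(y_true)

print(best_constant, accuracy(y_true, y_pred))
```

A model that scores 0.9 on this dataset has therefore learned nothing beyond the class frequencies.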
  
#### Additional note for `accuracy score`
* Simple, clean, intuitive.
* Hard to optimize
* Does not care **how confident the classifier is in the prediction** and **what soft predictions are**
  - It cares only about `argmax` hard predictions, not soft predictions.




### Logarithmic loss (`logloss`)

Log loss is defined slightly differently for binary and multiclass tasks.

* **Binary**
  * Assume that $\hat{y_i}$ is a number in the range $[0, 1]$
  * It is the predicted probability that the object belongs to class 1
$$LogLoss = -\frac{1}{N}\sum^N_{i=1}\left(y_{i}log(\hat{y_i})+(1-y_i)log(1-\hat{y_i})\right)$$

$$y_{i}\in\mathbb{R}$$
$$\hat{y_i}\in\mathbb{R}$$


* **Multiclass**
  * $\hat{y_i}$ - a vector of size L with 1 as its sum
    * The elements are the probabilities to belong to each of the classes

$$LogLoss = -\frac{1}{N}\sum^N_{i=1}\sum^L_{l=1}y_{il}log(\hat{y_{il}})$$
$$y_{i}\in\mathbb{R}^L$$
$$\hat{y_i}\in\mathbb{R}^L$$

* **In practice**
  * Predictions are clipped so that they are never exactly 0 or 1
  * They are clipped to the range from some small positive number to 1 minus that number

$$LogLoss = -\frac{1}{N}\sum^N_{i=1}\sum^L_{l=1}y_{il}log(min(max(\hat{y_{il}}, 10^{-15}), 1-10^{-15}))$$
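
A plain Python sketch of the binary and multiclass formulas with clipping. The function names are assumptions for illustration, and the multiclass version expects one-hot encoded targets.

```python
import math

EPS = 1e-15  # clipping constant used in the formula above


def clip(p, eps=EPS):
    """Keep a probability inside [eps, 1 - eps] so that log() stays finite."""
    return min(max(p, eps), 1 - eps)


def binary_logloss(y_true, y_prob):
    """y_true holds 0/1 targets, y_prob the predicted probability of class 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = clip(p)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


def multiclass_logloss(y_onehot, y_prob):
    """Each row of y_onehot is a one-hot target, each row of y_prob sums to 1."""
    total = 0.0
    for y_row, p_row in zip(y_onehot, y_prob):
        for y, p in zip(y_row, p_row):
            total += y * math.log(clip(p))
    return -total / len(y_onehot)


print(binary_logloss([1, 0], [0.5, 0.5]))               # equals log(2)
print(binary_logloss([1], [0.0]))                       # large but finite
print(multiclass_logloss([[1, 0, 0]], [[0.5, 0.25, 0.25]]))
```

Without the clipping step, the second call would attempt `log(0)` and fail.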


### LogLoss strongly penalizes completely wrong answers!
* Prefers making **`a lot of small mistakes`** over **`a few severe mistakes`**
![logloss-analysis](../img/logloss-analysis.png)
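
To make the penalty concrete, here is a small comparison on made-up numbers (both prediction sets are hypothetical). Each value is the probability the model assigned to the true class.

```python
import math


def mean_logloss(true_class_probs):
    """Mean of -log(p), where p is the probability given to the true class."""
    return -sum(math.log(p) for p in true_class_probs) / len(true_class_probs)


many_small = [0.6] * 10            # ten mildly unsure predictions
few_severe = [0.99] * 9 + [0.001]  # nine confident hits, one confident miss

print(mean_logloss(many_small))
print(mean_logloss(few_severe))
```

The single confident miss costs more than ten hesitant predictions combined, so the second set ends up with the higher loss.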


#### Best constant

$$LogLoss = -\frac{1}{N}\sum^N_{i=1}\left(y_{i}log(\alpha)+(1-y_i)log(1-\alpha)\right)$$

- **set $\alpha_{i}$ to frequency of i-th class.**

----

Dataset:
- 10 cats
- 90 dogs
- $\alpha = [0.1, 0.9]$
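
A quick numerical check of the best constant, written as a sketch. It treats dog as class 1 (an assumption made here) and searches a coarse grid instead of deriving the minimum analytically.

```python
import math

n_cats, n_dogs = 10, 90  # dataset from the example above; dog is class 1
n = n_cats + n_dogs


def constant_logloss(alpha):
    """Binary log loss when every object gets the same prediction alpha."""
    return -(n_dogs * math.log(alpha) + n_cats * math.log(1 - alpha)) / n


grid = [k / 100 for k in range(1, 100)]  # 0.01, 0.02, ..., 0.99
best_alpha = min(grid, key=constant_logloss)

print(best_alpha, n_dogs / n)
```

The grid minimum coincides with the frequency of class 1, matching the rule stated above.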

### Area Under Curve (AUC ROC)

![auc-roc-ex1](../img/auc-roc-ex1.png)
![auc-roc-ex2](../img/auc-roc-ex2.png)

Recall that ...
* to compute accuracy score for a binary task, we usually take soft predictions from our model and apply threshold
  * threshold `.5` ---> $Accuracy(|\hat{y}>0.5|) = \frac{6}{7}$
  * threshold `.7` ---> $Accuracy(|\hat{y}>0.7|) = 1$
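
The effect of the threshold can be reproduced on a toy example. The seven scores below are hypothetical (they are not read from the figure) and were chosen only so that they give the two accuracies quoted above.

```python
# Hypothetical soft predictions for 7 objects, sorted by score.
y_true = [0, 0, 0, 0, 1, 1, 1]
y_prob = [0.1, 0.3, 0.4, 0.6, 0.8, 0.9, 0.95]


def accuracy_at(threshold):
    """Accuracy after turning soft predictions into hard ones at a threshold."""
    y_pred = [int(p > threshold) for p in y_prob]
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


for threshold in (0.5, 0.7):
    print(threshold, accuracy_at(threshold))
```

At threshold 0.5 the negative object scored 0.6 is misclassified; at 0.7 the two classes are split perfectly.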
  
#### This metric (AUC) tries all possible thresholds and aggregates those scores
* **`ONLY FOR BINARY TASKS`**
* **Depends only on ordering of the predictions**, not on absolute values

#### Several explanations
1. Area Under Curve
2. Pairs Ordering

### Explanation 1 on AUC

Positive side = `RED` / negative side = `GREEN`
![ex1-on-auc](../img/ex1-on-auc.png)

* We go from left to right, jumping from one object to the next
* Paint `RED` if right (positive) and if not, paint `GREEN`
* The curve we've just built is called the `Receiver Operating Characteristic` (ROC) curve

![ex1-on-auc2](../img/ex1-on-auc2.png)

* `AreaSize = 7`
* We need to normalize it by the total area of the square
  * **AUC = 7/9**
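
The area computation can be sketched in a few lines. This assumes the standard ROC construction (objects sorted by decreasing score, one step up per positive, one step right per negative); the label sequence is hypothetical, picked so that the area is 7 cells inside a 3 x 3 square.

```python
# Hypothetical labels sorted by decreasing score (1 = positive, 0 = negative).
labels_by_score = [1, 0, 1, 1, 0, 0]


def roc_area_in_cells(labels):
    """Walk the sorted labels: step up on a positive, step right on a negative.

    Each step to the right adds a column whose height is the number of
    positives seen so far.
    """
    height, area = 0, 0
    for label in labels:
        if label == 1:
            height += 1
        else:
            area += height
    return area


n_pos = sum(labels_by_score)
n_neg = len(labels_by_score) - n_pos
area = roc_area_in_cells(labels_by_score)
auc = area / (n_pos * n_neg)  # normalize by the area of the whole square

print(area, n_pos * n_neg, auc)
```

Only the order of the labels enters the computation, never the scores themselves.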
  
  
#### What will AUC be for data that can be separated with a threshold? (like our initial example)
* AUC will be `1` - the maximum value of AUC
* **It does not need a threshold to be specified and it doesn't depend on absolute values**

#### If you build such a curve for a huge dataset with a real classifier ...
![](../img/auc-huge-data.png)
* The curve usually lies above the dashed line (`y = x`)
  * the dashed line shows what the curve would look like if we made predictions **at random** = `baseline`!




### Explanation 2 on AUC

Consider all pairs of objects such that
* one object is from red class and another one is from green
  * **`AUC` is the probability that the score for the green one will be higher than the score for the red one**
  
#### In other words, `AUC` is a fraction of correctly ordered pairs

$$AUC = \frac{\text{num of correctly ordered pairs}}{\text{total num of pairs}}\\=1-\frac{\text{num of incorrectly ordered pairs}}{\text{total num of pairs}}$$

![ex2-auc-img](../img/ex2-auc-img.png)
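
The pair-counting definition translates almost literally into code. This is a sketch on hypothetical scores: a pair is counted as correctly ordered when the positive object scores above the negative one, and ties count as half a pair (a common convention, assumed here).

```python
from itertools import product


def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    correct = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return correct / (len(pos) * len(neg))


y_true = [1, 0, 1, 1, 0, 0]               # hypothetical labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]  # hypothetical scores

print(pairwise_auc(y_true, y_score))
# A monotonic transform changes the values but not the ordering.
print(pairwise_auc(y_true, [s ** 3 for s in y_score]))
```

Both calls return the same value, which illustrates that AUC depends only on the ordering of the predictions.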

### AUC best constant

* Best constant
  - All constants give the same score
  
* Random predictions lead to AUC = `0.5` (`baseline`)
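
Both statements can be checked with a small simulation. This is a sketch under assumptions: AUC is computed by pair counting with ties counted as half a pair, and the class sizes and seed are arbitrary.

```python
import random
from itertools import product


def pairwise_auc(y_true, y_score):
    """Pair-counting AUC; tied pairs contribute 0.5 (assumed convention)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    correct = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return correct / (len(pos) * len(neg))


random.seed(0)
y_true = [1] * 300 + [0] * 700  # arbitrary class sizes

constant_auc = pairwise_auc(y_true, [0.3] * len(y_true))
random_auc = pairwise_auc(y_true, [random.random() for _ in y_true])

print(constant_auc)  # every pair is tied
print(random_auc)    # close to 0.5, up to sampling noise
```

With a constant prediction every pair is a tie, so the result does not depend on which constant is used.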

### Cohen's Kappa motivation

Dataset:
* 10 cats
* 90 dogs
* `Baseline accuracy = 0.9`

----

$$\text{my_score} = 1 - \frac{1 - \text{accuracy}}{1 - \text{baseline}}$$

#### We can introduce a new metric such that ...
* accuracy = 1 ------> my_score = 1
* accuracy = 0.9 ------> my_score = 0
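
The rescaling is a one-line function. The name `my_score` follows the formula above; the extra value 0.95 is an added illustration.

```python
def my_score(accuracy, baseline):
    """Rescale accuracy so that the baseline maps to 0 and a perfect score to 1."""
    return 1 - (1 - accuracy) / (1 - baseline)


baseline = 0.9  # accuracy of always predicting dog
for acc in (1.0, 0.95, 0.9):
    print(acc, my_score(acc, baseline))
```

An accuracy of 0.95 lands roughly halfway between the baseline and a perfect score.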

### In Cohen's Kappa, we take another value as the baseline (other than `50:50`)

Dataset:
* 10 cats
* 90 dogs
* Predict 20 cats and 80 dogs at random 
  * accuracy ~ 0.74
  * `0.2*0.1 + 0.8*0.9 = 0.74`
  
-----

$$\text{Cohen's Kappa} = 1 - \frac{1 - \text{accuracy}}{1 - p_{e}}$$

$p_{e}$ - `what accuracy would be on average, if we randomly permute our predictions`

$$p_{e} = \frac{1}{N^2}\sum_{k}n_{k1}n_{k2}$$
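
A sketch of the two formulas above in plain Python. Here $n_{k1}$ and $n_{k2}$ are taken to be the number of objects of class $k$ among the true labels and among the predictions. The particular arrangement of the 20 cat predictions is hypothetical.

```python
from collections import Counter


def expected_accuracy(y_true, y_pred):
    """p_e = (1 / N^2) * sum over classes of n_k1 * n_k2."""
    n = len(y_true)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    return sum(true_counts[k] * pred_counts[k] for k in true_counts) / n ** 2


def cohen_kappa(y_true, y_pred):
    """Kappa = 1 - (1 - accuracy) / (1 - p_e)."""
    n = len(y_true)
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / n
    return 1 - (1 - acc) / (1 - expected_accuracy(y_true, y_pred))


y_true = ["cat"] * 10 + ["dog"] * 90
# Hypothetical model: finds all 10 cats but also calls 10 dogs "cat".
y_pred = ["cat"] * 20 + ["dog"] * 80

print(expected_accuracy(y_true, y_pred))  # the 0.74 baseline from above
print(cohen_kappa(y_true, y_pred))
```

The baseline depends only on how many times each class is predicted, not on which objects receive which prediction.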


### We can also recall that error is equal to 1 minus accuracy

Dataset:
* 10 cats
* 90 dogs
* Predict 20 cats and 80 dogs at random 
  * accuracy ~ 0.74
  * error ~ 0.26
  
-----
$$\text{Cohen's Kappa} = 1 - \frac{\text{error}}{\text{baseline error}}$$

### Weighted error

Dataset:
* 10 cats / 90 dogs / 20 tigers


### Say we are more or less okay if we predict dog instead of cat, but it's undesirable to predict cat or dog if it's really a tiger.

#### Error weight matrix
* Each cell contains the weight for the mistake we might make

pred\true|cat|dog|tiger
---|---|---|---
**cat**|0|1|10
**dog**|1|0|10
**tiger**|1|1|0

### Weighted error and weighted Kappa

![confusion-weighted-matrix](../img/confusion-weighted-matrix.png)

1. **Multiply these two matrices element-wise**
2. **Sum their results**
  * This formula needs a proper normalization to make sure the quantity is between 0 and 1
  * But it does not matter for our purpose, as the normalization constant will cancel anyway
$$\text{weighted error} = \frac{1}{\text{const}}\sum_{i,j}C_{ij}W_{ij}$$

<br>
$$\text{Weighted Kappa} = 1 - \frac{\text{weighted error}}{\text{weighted baseline error}}$$
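
A sketch of weighted kappa computed as 1 minus the ratio of the weighted error to the weighted baseline error. Assumptions made here: rows are predictions and columns are true classes (as in the weight table above), the baseline confusion matrix is the one expected when predictions are randomly permuted, and the confusion counts are hypothetical.

```python
def weighted_error(confusion, weights):
    """Sum of element-wise products of the two matrices (unnormalized)."""
    return sum(
        c * w
        for c_row, w_row in zip(confusion, weights)
        for c, w in zip(c_row, w_row)
    )


def expected_confusion(confusion):
    """Confusion matrix expected if predictions were randomly permuted."""
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(col) for col in zip(*confusion)]
    n = sum(row_sums)
    return [[r * c / n for c in col_sums] for r in row_sums]


def weighted_kappa(confusion, weights):
    baseline = weighted_error(expected_confusion(confusion), weights)
    return 1 - weighted_error(confusion, weights) / baseline


# Weight matrix from the table above (order: cat, dog, tiger).
weights = [[0, 1, 10],
           [1, 0, 10],
           [1, 1, 0]]

# Hypothetical confusion counts; columns sum to 10 cats, 90 dogs, 20 tigers.
confusion = [[4, 2, 3],
             [2, 88, 4],
             [4, 0, 13]]

print(weighted_error(confusion, weights))
print(weighted_kappa(confusion, weights))
```

With 0/1 weights (0 on the diagonal, 1 elsewhere) the same function reduces to the ordinary Cohen's Kappa, because the normalization constant cancels in the ratio.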

### Quadratic and Linear Weighted Kappa
* The more distant the prediction is from the real label, the more the model gets penalized

![quad-lin-weighted-kappa](../img/quad-lin-weighted-kappa.png)
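
For ordered labels, the weight matrices are usually built from the distance between class indices. The sketch below assumes the common definitions, $|i-j|$ for linear and $(i-j)^2$ for quadratic weights; any constant scaling is omitted because it cancels in the kappa ratio.

```python
def linear_weights(n_classes):
    """Weight grows linearly with the distance between predicted and true class."""
    return [[abs(i - j) for j in range(n_classes)] for i in range(n_classes)]


def quadratic_weights(n_classes):
    """Weight grows with the squared distance between predicted and true class."""
    return [[(i - j) ** 2 for j in range(n_classes)] for i in range(n_classes)]


for row in linear_weights(4):
    print(row)
for row in quadratic_weights(4):
    print(row)
```

Quadratic weights punish distant mistakes much more heavily than linear ones: being off by three classes costs 9 instead of 3.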


### Conclusion

* Accuracy
* Logloss
* AUC (ROC)
* (Quadratic weighted) Kappa