# Evaluation Metrics (WIP)

- Bowen Li
- 2017/11/17

## Introduction

How do we measure a machine learning model's performance? Different problems and purposes call for different metrics if we are to make fair and effective decisions. This technical report introduces useful metrics for evaluating machine learning models.

### Metrics Overview

- R-Square: To measure model fitness for continuous response
- Mean Squared Errors (MSE)
- Mean Absolute Errors (MAE)
- Akaike Information Criterion (AIC)
- Bayesian Information Criterion (BIC)
- Variance Inflation Factor (VIF): To measure multi-collinearity among features
- Cook's Distance: To detect influential instance points
- Confusion Matrix
- Accuracy
- Precision / Recall
- Specificity
- False Positive Rate (FPR)
- Area under Curve (AUC)
- Gain / Lift Charts
- Kolmogorov-Smirnov Chart
- Gini Coefficient
- Kendall's tau
- $F_{\beta}$-Measures
- Macro / Micro / Weighted Precision and Recall
- Precision@K
- Average Precision (AvP)
- Mean Average Precision (MAP)
- Discounted Cumulative Gain (DCG)
- Normalized DCG (nDCG)

## R-Square

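**R-square = Explained Variation / Total Variation.** It measures how well a regression model fits a continuous response $y$.

$$
R^2 = 1 - \frac{\sum_{i=1}^n (\hat y_i - y_i)^2}{\sum_{i=1}^n (y_i - \bar y)^2}
$$

where $\hat y_i$ is the predicted value of $y_i$ and $\bar y$ is the mean of the observed responses. The fraction is the unexplained share of the total variation, so $R^2$ is the explained share.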

## Mean Squared Errors (MSE)

MSE is a common measure of model fit for regression problems.

$$
MSE = \frac{1}{n} \sum_{i=1}^n (\hat y_i - y_i)^2
$$

where $\hat y$ is the predicted value of $y$.

## Mean Absolute Errors (MAE)

Similarly, for regression problems we can use absolute errors instead of squared errors to measure model fit.

$$
MAE = \frac{1}{n} \sum_{i=1}^n |\hat y_i - y_i|
$$
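The three regression metrics so far can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `regression_metrics` and the toy data are made up, $R^2$ is computed as one minus the ratio of the residual to the total sum of squares, and MSE and MAE are taken as averages over the $n$ observations.

```python
def regression_metrics(y_true, y_pred):
    """Return (R^2, MSE, MAE) for paired observed and predicted values."""
    n = len(y_true)
    y_bar = sum(y_true) / n
    # Residual and total sums of squares
    ss_res = sum((yh - y) ** 2 for y, yh in zip(y_true, y_pred))
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)
    r2 = 1.0 - ss_res / ss_tot
    mse = ss_res / n
    mae = sum(abs(yh - y) for y, yh in zip(y_true, y_pred)) / n
    return r2, mse, mae


# Toy data (hypothetical)
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 8.0]
r2, mse, mae = regression_metrics(y_true, y_pred)
print(r2, mse, mae)
```

Because MSE squares each error, the single error of 1.0 in the last observation contributes more to MSE than to MAE.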

## Akaike Information Criterion (AIC)

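For model selection with likelihood inference, we would like to penalize a model that attains a higher likelihood merely by using more features.

$$
AIC = -2 \times \text{log-likelihood} + 2k
$$

where $k$ is the number of estimated parameters. Among candidate models, the one with the lower AIC is preferred.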

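## Bayesian Information Criterion (BIC)

For model selection with likelihood inference in a Bayesian framework.

$$
BIC = -2 \times \text{log-likelihood} + \log(n)k
$$

where $n$ is the number of observations and $k$ is the number of estimated parameters. Since $\log(n) > 2$ once $n \geq 8$, BIC penalizes extra parameters more heavily than AIC for all but very small samples.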
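As a small sketch of how the two criteria are used together, the snippet below compares two candidate models. The helper names `aic` and `bic` and the log-likelihood values are hypothetical, chosen only to illustrate the arithmetic.

```python
import math


def aic(log_likelihood, k):
    """AIC = -2 * log-likelihood + 2k."""
    return -2.0 * log_likelihood + 2.0 * k


def bic(log_likelihood, k, n):
    """BIC = -2 * log-likelihood + log(n) * k."""
    return -2.0 * log_likelihood + math.log(n) * k


# Hypothetical fits on the same n = 100 observations
n = 100
small = {"loglik": -120.0, "k": 3}
large = {"loglik": -118.5, "k": 6}

aic_small = aic(small["loglik"], small["k"])
aic_large = aic(large["loglik"], large["k"])
bic_small = bic(small["loglik"], small["k"], n)
bic_large = bic(large["loglik"], large["k"], n)
print(aic_small, aic_large, bic_small, bic_large)
```

In this made-up case the larger model has a slightly higher likelihood, but both criteria still prefer the smaller model, and BIC does so by a wider margin.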

Now we shift gears a little to discuss metrics for measuring multi-collinearity among features and for detecting influential instances.

## Variance Inflation Factor (VIF)

To measure multi-collinearity among multiple features.

- $R^2_j$: R-square from regressing $X_j$ on all other features $X_k$, $k \neq j$

$$
VIF_j = 1 / (1 - R^2_j)
$$
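A possible way to compute the VIF of every feature is to run one auxiliary least-squares regression per column. The sketch below assumes NumPy is available; the function name `vif` and the synthetic data are illustrative only, and the auxiliary regressions include an intercept (an assumption the text does not state).

```python
import numpy as np


def vif(X):
    """Return the VIF of each column of X (n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    result = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress X_j on the remaining features plus an intercept
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        ss_res = float(resid @ resid)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        r2_j = 1.0 - ss_res / ss_tot
        # Undefined (division by zero) under perfect collinearity
        result.append(1.0 / (1.0 - r2_j))
    return result


# Synthetic data: x3 is nearly x1 + x2, x4 is unrelated
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + x2 + 0.1 * rng.normal(size=200)
x4 = rng.normal(size=200)
vifs = vif(np.column_stack([x1, x2, x3, x4]))
print(vifs)
```

The first three features should show large VIFs because each is almost a linear combination of the other two, while the unrelated fourth feature should stay close to 1.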

## Cook's Distance

To detect influential instances:

- $\hat \beta$: regression weights
- $\hat \beta_{(-i)}$: regression weights fitted without instance $i$
- $p$: number of features
- $\hat \sigma^2$: estimated error variance

$$
D_i = \frac{(\hat \beta - \hat \beta_{(-i)})^T X^T X (\hat \beta - \hat \beta_{(-i)})}{(p + 1) \hat \sigma^2}
$$
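For intuition, here is a brute-force sketch for simple linear regression (one feature plus an intercept, so $p = 1$). It refits the line without each instance in turn. The function names and the toy data are hypothetical, and the error variance is estimated as the residual sum of squares divided by $n - p - 1$, which is an assumption. The quadratic form in the numerator is evaluated as the summed squared change in fitted values, which is algebraically the same quantity.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def cooks_distances(xs, ys):
    """Cook's distance of every instance, by leave-one-out refitting."""
    n, p = len(xs), 1  # p counts features, excluding the intercept
    b0, b1 = fit_line(xs, ys)
    fitted = [b0 + b1 * x for x in xs]
    rss = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    sigma2 = rss / (n - p - 1)
    distances = []
    for i in range(n):
        c0, c1 = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        # Squared shift of all fitted values when instance i is dropped
        shift = sum((f - (c0 + c1 * x)) ** 2 for f, x in zip(fitted, xs))
        distances.append(shift / ((p + 1) * sigma2))
    return distances


# Toy data: the last point has an extreme x and breaks the trend
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 2.0]
distances = cooks_distances(xs, ys)
print(distances)
```

The last instance should receive by far the largest distance, since removing it changes the fitted line the most.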

From here on, we introduce metrics for classification problems with a categorical response; for example, a binary response $y \in \{0, 1\}$.

## Confusion Matrix

| Actual       | Predicted Positive | Predicted Negative |
|--------------|--------------------|--------------------|
| Positive (P) | TP                 | FN                 |
| Negative (N) | FP                 | TN                 |

where $n = Total = TP + FP + FN + TN$.

## Accuracy

The proportion of all cases, positive and negative, that are correctly identified.

$$
Accuracy = (TP + TN) / n
$$

## Precision

- The proportion of predicted positive cases that are truly positive.
- Also called **Positive Predictive Value (PPV).**
- **Precision-Recall Curve's y-axis** (for PR Curve see later)

$$
Precision = TP / (\text{Predicted Positive}) = TP / (TP + FP)
$$

## False Discovery Rate (FDR)

$$
FDR =  1 - Precision = FP / (\text{Predicted Positive}) = FP / (TP + FP)
$$

## Recall

- The proportion of actual positive cases that are correctly identified as predicted positive
- Also called **Sensitivity,** or **True Positive Rate (TPR)**
- **ROC's y-axis** (for ROC Curve see later)
- **Precision-Recall Curve's x-axis**

$$
Recall = Sensitivity = TP / Positive = TP / P = TP / (TP + FN)
$$

## Specificity

- The proportion of actual negative cases that are correctly identified as predicted negative
- Also called **True Negative Rate (TNR)**
- 1 - Specificity: **ROC Curve's x-axis**

$$
Specificity = TN / N = TN / (FP + TN)
$$

## False Positive Rate (FPR)

- The proportion of actual negative cases that are incorrectly identified as predicted positive
- **ROC Curve's x-axis**

$$
FPR = 1 - Specificity = FP / N = FP / (FP + TN)
$$
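The confusion-matrix metrics above can be collected in one small helper. This is a sketch with hypothetical names (`confusion_counts`, `binary_metrics`) and toy labels; it assumes labels are coded as 1 for positive and 0 for negative, and that no denominator is zero.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FN, FP, TN) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fn, fp, tn


def binary_metrics(tp, fn, fp, tn):
    """Metrics derived from the four confusion-matrix cells."""
    n = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / n,
        "precision": tp / (tp + fp),
        "fdr": fp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "fpr": fp / (fp + tn),
    }


# Toy labels (hypothetical)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
counts = confusion_counts(y_true, y_pred)
metrics = binary_metrics(*counts)
print(counts, metrics)
```

Note that precision and FDR sum to 1, as do specificity and FPR, which follows directly from the formulas.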

## Area under Curve (AUC)

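- AUC summarizes a whole curve (ROC or PR) as a single number, AUC-ROC or AUC-PR, for judging a model.
- AUC-ROC is largely independent of changes in the proportion of responders, because TPR and FPR are each computed within one actual class.
- The PR curve is more sensitive to the response rate, since precision mixes both classes, but both curves depend on it less than the lift chart does.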

### AUC under Receiver Operating Characteristic (ROC) Curve

AUC-ROC: 

- y-axis: Recall = Sensitivity = TP / P = TP / (TP + FN)
- x-axis: FPR = 1 - Specificity = FP / N = FP / (FP + TN)

**Performance guideline:**

- 0.90-1.00: excellent
- 0.80-0.90: good
- 0.70-0.80: fair
- 0.60-0.70: poor
- 0.50-0.60: fail
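To make the construction concrete, the sketch below builds the ROC points by sweeping the decision threshold from the highest score downward and integrates them with the trapezoid rule. The function names and the toy scores are illustrative, and the code assumes both classes are present and labels are coded 0/1.

```python
def roc_points(y_true, scores):
    """Return ROC points (FPR, TPR), sweeping the threshold high to low."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    order = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for idx, (score, label) in enumerate(order):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last item of a group of tied scores
        if idx + 1 == len(order) or order[idx + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy example (hypothetical scores)
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(y_true, scores)
auc = auc_trapezoid(points)
print(auc)
```

In this toy example 8 of the 9 positive-negative pairs are ranked correctly, so the area comes out as 8/9, which would fall in the "good" band of the guideline above.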

### AUC under Precision-Recall (PR) Curve

AUC-PR:

- y-axis: Precision = TP / (Predicted Positive) = TP / (TP + FP)
- x-axis: Recall = Sensitivity = TP / (Actual Positive) = TP / P = TP / (TP + FN)

## Gain / Lift Charts

To check the rank ordering of the probabilities:

- Step 1: Calculate the predicted probability for each observation.
- Step 2: Rank these probabilities in decreasing order.
- Step 3: Build deciles, each group having almost 10% of the observations; group i lies between the (i - 1)th and ith deciles.
- Step 4: Calculate the response rate at each decile for Good (positive), Bad (negative) and total.

Table:

- Decile ID: 1,...,10
- Count of actual negative cases
- Count of actual positive cases
- Grand total count for one decile
- %Right
- %Wrong
- %Population
- Cum %Right
- Cum %Population
- Lift @decile
- Total lift

### Cumulative Gain Chart

- y-axis: Cum %Right
- x-axis: Cum %Population

**Lift:** Cum %Right / Cum %Population.

For example, the first decile with 10% of the population has 14% of positive cases. This means we have a 14% / 10% = 140% lift at first decile.

How good is this result? We can compare it with the **maximum possible Cum %Right at the first decile.**

- The total number of positive cases is 3850.
- The first decile contains 543 observations.
- So even if every observation in the first decile were positive, it could capture at most 543/3850 ~ 14.1% of the positive cases, i.e. a maximum lift of about 141%.
- Hence, having captured 14%, this model is quite close to perfect at the first decile.

### Lift Charts

Total lift chart:

- y-axis: Total lift
- x-axis: Cum %Population (decile) 

Decile-wise lift chart:

- y-axis: Lift @decile
- x-axis: Cum %Population (decile) 

What is the purpose of the decile-wise lift chart? For example,

- Our model does well until the 7th decile, after which every decile is skewed towards negative cases.
- As a rule of thumb, a model whose lift @decile stays above 100% until at least the 3rd decile and at most the 7th decile is a good model.
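The steps and table columns of this section can be sketched as follows. The function name `gain_lift_table` and the toy data are hypothetical. The code assumes %Right is the share of all positive cases falling in a group, total lift is Cum %Right / Cum %Population, and lift @decile is %Right / %Population of that group alone.

```python
def gain_lift_table(y_true, scores, n_groups=10):
    """One row per group, ranked by decreasing score."""
    ranked = [label for _, label in
              sorted(zip(scores, y_true), key=lambda pair: -pair[0])]
    n = len(ranked)
    total_pos = sum(ranked)
    rows = []
    cum_pos = 0
    for g in range(1, n_groups + 1):
        start = (g - 1) * n // n_groups
        end = g * n // n_groups
        pos_in_group = sum(ranked[start:end])
        cum_pos += pos_in_group
        pct_pop = (end - start) / n
        pct_right = pos_in_group / total_pos
        cum_pop = end / n
        cum_right = cum_pos / total_pos
        rows.append({
            "decile": g,
            "cum_pop": cum_pop,
            "cum_right": cum_right,
            "lift_at_decile": pct_right / pct_pop,
            "total_lift": cum_right / cum_pop,
        })
    return rows


# Toy data: 20 observations already listed by decreasing score
y_true = [1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
scores = [(20 - i) / 20 for i in range(20)]
rows = gain_lift_table(y_true, scores)
for row in rows:
    print(row)
```

Plotting `cum_right` against `cum_pop` gives the cumulative gain chart, while `total_lift` and `lift_at_decile` against the decile give the two lift charts. The total lift always ends at 1 (100%) in the last decile.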


### Lift based on conditional probability

$$
\begin{aligned}
\text{Lift} &= p(\text{action} \mid \text{feature}) / p(\text{action}) \\
&= p(\text{action}, \text{feature}) / [p(\text{action}) \, p(\text{feature})]
\end{aligned}
$$

**Issue for lift:** Lift is dependent on total response rate of the population. 

## Kolmogorov-Smirnov Chart

Table:

- Decile ID: 1,...,10
- Count of actual negative cases
- Count of actual positive cases
- Grand total count for one decile
- %Right
- %Wrong
- %Population
- Cum %Right
- Cum %Wrong
- Cum %Population
- K-S: Cum %Right - Cum %Wrong

### Kolmogorov-Smirnov Chart

Two curves for Cum %Right & Cum %Wrong

- y-axis: percent
- x-axis: decile

**K-S Statistics:** Maximum separation between Cum %Right and Cum %Wrong.
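A minimal sketch of the decile-based K-S statistic follows. The function name `ks_statistic` and the toy data are made up, and the code assumes Cum %Right and Cum %Wrong are the cumulative shares of all positive and all negative cases captured up to each group.

```python
def ks_statistic(y_true, scores, n_groups=10):
    """Return (K-S statistic, group where the maximum occurs)."""
    ranked = [label for _, label in
              sorted(zip(scores, y_true), key=lambda pair: -pair[0])]
    n = len(ranked)
    total_pos = sum(ranked)
    total_neg = n - total_pos
    cum_pos = cum_neg = 0
    best, best_group = 0.0, 0
    for g in range(1, n_groups + 1):
        start = (g - 1) * n // n_groups
        end = g * n // n_groups
        pos = sum(ranked[start:end])
        cum_pos += pos
        cum_neg += (end - start) - pos
        # Separation between the two cumulative curves at this group
        ks = cum_pos / total_pos - cum_neg / total_neg
        if ks > best:
            best, best_group = ks, g
    return best, best_group


# Toy data: 20 observations already listed by decreasing score
y_true = [1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
scores = [(20 - i) / 20 for i in range(20)]
ks, group = ks_statistic(y_true, scores)
print(ks, group)
```

Here the two cumulative curves are furthest apart at the 4th group, where 6 of 8 positives but only 2 of 12 negatives have been captured.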

## Gini Coefficient

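$$
\begin{aligned}
\text{Gini Coefficient} &= (\text{Area between the ROC curve and the diagonal line}) / (\text{Area of the above triangle}) \\
&= (\text{Area between the ROC curve and the diagonal line}) / 0.5 \\
&= 2 \times (\text{Area between the ROC curve and the diagonal line}) \\
&= 1 - 2 \times (\text{Area between the ROC curve and the y-axis})
\end{aligned}
$$

Equivalently, Gini = 2 × AUC-ROC - 1, since the area between the ROC curve and the diagonal line is AUC-ROC - 0.5.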

Gini above 60% is good.
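As a quick numerical check, the Gini coefficient can be obtained from the AUC-ROC, because the area between the ROC curve and the diagonal is AUC - 0.5. The sketch below computes the AUC by counting correctly ordered positive-negative pairs (ties count one half), which equals the area under the ROC curve. Function names and data are illustrative.

```python
def auc_by_pairs(y_true, scores):
    """AUC-ROC as the share of positive-negative pairs ranked correctly."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def gini(y_true, scores):
    """Gini = 2 * (AUC - 0.5) = 2 * AUC - 1."""
    return 2.0 * auc_by_pairs(y_true, scores) - 1.0


# Toy example (hypothetical scores)
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
g = gini(y_true, scores)
print(g)
```

A random ranking has an AUC around 0.5 and therefore a Gini around 0, while a perfect ranking gives a Gini of 1.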

## Kendall's tau

Kendall's tau: Probability of concordant pairs - Probability of discordant pairs.

$$
\tau
= {\frac {({\text{number of concordant pairs}}) - ({\text{number of discordant pairs}})}{n(n-1)/2}}
$$
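A direct, quadratic-time sketch of this formula is shown below. The function name `kendall_tau` and the toy rankings are hypothetical. Tied pairs are counted as neither concordant nor discordant, which is one of several possible conventions.

```python
from itertools import combinations


def kendall_tau(x, y):
    """(concordant - discordant) / (n choose 2) over all pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)


# Two rankings of the same five items (hypothetical)
x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 5, 4]
tau = kendall_tau(x, y)
print(tau)
```

Of the 10 pairs, 8 are concordant and 2 are discordant, giving a tau of 0.6.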

The following are variations of Precision or Recall or their combinations.

## $F_{\beta}$ Measures

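The $F_{\beta}$ measure is a **weighted harmonic mean of Precision & Recall:**

$$
\begin{aligned}
F_{\beta}
&= \frac{1}{\frac{1}{\beta^2 + 1} \times \frac{1}{Precision} + \frac{\beta^2}{\beta^2 + 1} \times \frac{1}{Recall}} \\
&= \frac{(\beta^2 + 1) (Precision \times Recall)}{\beta^2 \times Precision + Recall}
\end{aligned}
$$

With $\beta > 1$ Recall is weighted more heavily, with $\beta < 1$ Precision is, and $\beta = 1$ gives the usual $F_1$ score.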
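The effect of $\beta$ is easy to see numerically. The following sketch uses a hypothetical helper `f_beta` and made-up precision and recall values, and applies the closed form $(\beta^2 + 1) \, Precision \times Recall / (\beta^2 \times Precision + Recall)$.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta from precision and recall (both assumed to be > 0)."""
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)


# A classifier with low precision and high recall (hypothetical)
precision, recall = 0.5, 0.8
f_half = f_beta(precision, recall, beta=0.5)
f_one = f_beta(precision, recall, beta=1.0)
f_two = f_beta(precision, recall, beta=2.0)
print(f_half, f_one, f_two)
```

Because recall is the stronger of the two values here, the score rises as $\beta$ grows and more weight shifts towards recall.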

## Macro / Micro / Weighted Precision and Recall

**Macro:** Mean of the binary metrics, giving equal weight to each class. 

- In problems where infrequent classes are nonetheless important, macro-averaging may be a means of highlighting their performance. 
- On the other hand, the assumption that all classes are equally important is often untrue, such that macro-averaging will over-emphasize the typically low performance on an infrequent class.

**Micro:** Give each sample-class pair an equal contribution to the overall metric (except as a result of sample-weight). 

- Rather than summing the metric per class, this sums the dividends and divisors that make up the per-class metrics to calculate an overall quotient.
- Micro-averaging may be preferred in multilabel settings, including multiclass classification where a majority class is to be ignored.
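- In a single-label multi-class problem where every class is included, micro Precision and micro Recall have the same value, because each misclassified sample counts as one false positive and one false negative.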

**Weighted:** accounts for class imbalance by computing the average of binary metrics in which each class’s score is weighted by its presence in the true data sample.

### Example: Micro / Macro-Precision / Recall

For a set of data:

- True positive (TP1) = 12
- False positive (FP1) = 9
- False negative (FN1) = 3
- Precision (P1) = TP1 / (TP1 + FP1) = 57.14% 
- Recall (R1) = TP1 / (TP1 + FN1) = 80%

For a different set of data:

- True positive (TP2) = 50
- False positive (FP2) = 23
- False negative (FN2) = 9
- Precision (P2) = TP2 / (TP2 + FP2) = 68.49%
- Recall (R2) = TP2 / (TP2 + FN2) = 84.75%

Thus,

- Micro-Average Precision = (TP1+TP2) / (TP1+TP2+FP1+FP2) = (12+50) / (12+50+9+23) = 65.96%.
- Micro-Average Recall = (TP1+TP2) / (TP1+TP2+FN1+FN2) = (12+50) / (12+50+3+9) = 83.78%.
- Macro-Average Precision = (P1+P2) / 2 = (57.14+68.49) / 2 = 62.82%.
- Macro-Average Recall = (R1+R2) / 2 = (80+84.75) / 2 = 82.37%.
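The same arithmetic can be reproduced in code, which makes the difference between the two averaging schemes explicit: macro averages the per-set ratios, while micro pools the counts before dividing. Variable and function names here are illustrative; the counts are the ones from the example.

```python
def precision(tp, fp):
    return tp / (tp + fp)


def recall(tp, fn):
    return tp / (tp + fn)


# Counts from the two sets of data in the example
sets = [
    {"tp": 12, "fp": 9, "fn": 3},
    {"tp": 50, "fp": 23, "fn": 9},
]

# Macro: compute the metric per set, then average the results
macro_precision = sum(precision(s["tp"], s["fp"]) for s in sets) / len(sets)
macro_recall = sum(recall(s["tp"], s["fn"]) for s in sets) / len(sets)

# Micro: pool the counts first, then compute the metric once
tp = sum(s["tp"] for s in sets)
fp = sum(s["fp"] for s in sets)
fn = sum(s["fn"] for s in sets)
micro_precision = precision(tp, fp)
micro_recall = recall(tp, fn)

print(macro_precision, macro_recall, micro_precision, micro_recall)
```

The micro averages sit closer to the second set's values because that set contributes more counts to the pooled totals.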

## Precision@K

**Precision@K:** The proportion of relevant results among the first k results.

- It does not take into account the positions of the relevant documents within the top k.
- It is often more useful than Recall@k, since few users are interested in reading every relevant article.
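A minimal sketch, assuming each ranked list is given as 0/1 relevance flags in ranked order and that Precision@K is the fraction of the top k results that are relevant (the function name `precision_at_k` and the data are hypothetical):

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k results that are relevant."""
    return sum(relevance[:k]) / k


# Relevance flags of one ranked result list (hypothetical)
ranking = [1, 0, 1, 1, 0, 0, 1]
p_at_3 = precision_at_k(ranking, 3)
p_at_5 = precision_at_k(ranking, 5)

# Two rankings with the same relevant count in the top 3
early_hits = precision_at_k([1, 1, 0], 3)
late_hits = precision_at_k([0, 1, 1], 3)
print(p_at_3, p_at_5, early_hits, late_hits)
```

The last two values are identical even though one ranking places its relevant results earlier, which illustrates the position-insensitivity noted above.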

## Average Precision

**Average Precision (AvP):** The 11-point interpolated average precision averages the precision over a set of evenly spaced recall levels {0, 0.1, 0.2,..., 1.0}:

$$
AvP
= \frac{1}{11} \sum_{r \in \{0, 0.1, \ldots, 1.0\}} p_{\text{interp}}(r)
$$

where $p_{\text{interp}}(r)$ is the interpolated precision, which takes the maximum precision over all recall levels greater than or equal to $r$:

$$
p_{\text{interp}}(r) = \max_{\tilde r: \tilde {r} \geq r} p(\tilde r)
$$
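The interpolation step is easier to follow in code. The sketch below is illustrative only: the function name is made up, the input is a list of 0/1 relevance flags in ranked order, and it assumes by default that every relevant document appears somewhere in the list. A recall level that is never reached contributes a precision of 0.

```python
def interpolated_average_precision(relevance, n_relevant=None):
    """11-point interpolated average precision for one ranked list."""
    if n_relevant is None:
        n_relevant = sum(relevance)
    # (recall, precision) after each rank position
    curve = []
    hits = 0
    for rank, rel in enumerate(relevance, start=1):
        hits += rel
        curve.append((hits / n_relevant, hits / rank))
    total = 0.0
    for step in range(11):
        level = step / 10
        # Maximum precision at any recall at or beyond this level
        candidates = [p for r, p in curve if r >= level]
        total += max(candidates) if candidates else 0.0
    return total / 11


# Relevant documents at ranks 1, 3 and 6 (hypothetical)
relevance = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
avp = interpolated_average_precision(relevance)
print(avp)
```

In this example the interpolated precision is 1 for recall levels up to 0.3, 2/3 for levels 0.4 to 0.6, and 1/2 from 0.7 onward, so the average is 8/11.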

## Mean Average Precision

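**Mean Average Precision** for a set of queries: Mean of the average precision scores for each query.

$$
\mbox{MAP}(Q) = \frac{1}{\vert Q\vert} \sum_{j=1}^{\vert Q\vert} \frac{1}{m_j}
\sum_{k=1}^{m_j} \mbox{precision}(recall_{jk})
$$

where $Q$ is the set of queries, $\vert Q \vert$ is the number of queries, $m_j$ is the number of relevant documents for query $j$, and $\mbox{precision}(recall_{jk})$ is the precision at the point where the $k$-th relevant document of query $j$ is retrieved.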
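A small sketch of this computation, under the assumption that each query's results are given as 0/1 relevance flags in ranked order and that all relevant documents for the query appear in the list (so $m_j$ is the number of ones). The function names and data are hypothetical.

```python
def average_precision(relevance):
    """Mean of the precision values at each relevant document's rank."""
    m = sum(relevance)  # assumes at least one relevant document
    hits = 0
    total = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / m


def mean_average_precision(queries):
    """Mean of the per-query average precision scores."""
    return sum(average_precision(q) for q in queries) / len(queries)


# Ranked results for two queries (hypothetical)
queries = [
    [1, 0, 1, 0],      # relevant at ranks 1 and 3
    [0, 1, 0, 0, 1],   # relevant at ranks 2 and 5
]
ap_first = average_precision(queries[0])
ap_second = average_precision(queries[1])
map_score = mean_average_precision(queries)
print(ap_first, ap_second, map_score)
```

Each query contributes equally to the mean, regardless of how many relevant documents it has.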

## Discounted Cumulative Gain (DCG)

**DCG:** Allows us to consider relative degrees of relevance, not just binary results.

$$
DCG_{k} = \sum_{r=1}^{k} \frac{rel_{r}}{\log_{2}(r + 1)}
$$

where $r$ is the rank, $rel_r$ is the graded relevance of the document at rank $r$, and $k$ is the number of documents considered.

## Normalized DCG (nDCG)

Since DCG values may vary significantly across queries or systems, the normalized version divides DCG by an ideal DCG ($IDCG_{k}$), which is the DCG at rank $k$ obtained by sorting documents by decreasing relevance, so that performances can be compared fairly:

$$
nDCG_{k} = \frac{DCG_{k}}{IDCG_{k}}. 
$$

Note that in a perfect ranking system, the $DCG_{k}$ will be the same as the $IDCG_{k}$, which results in $nDCG = 1.0$. All $nDCG$ values from different queries or systems are then relative values ranging from 0.0 to 1.0 and are comparable.
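To illustrate, the sketch below computes DCG with the discount $\log_2(r + 1)$ summed from rank 1 (so the top result is undiscounted), and builds the ideal ordering by sorting the same list of relevance grades. Both choices are assumptions of this sketch, as are the function names and the toy grades.

```python
import math


def dcg_at_k(relevances, k):
    """Sum of rel_r / log2(r + 1) over ranks r = 1..k."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Graded relevance of six ranked documents (hypothetical)
relevances = [3, 2, 3, 0, 1, 2]
dcg = dcg_at_k(relevances, 6)
ndcg = ndcg_at_k(relevances, 6)
perfect = ndcg_at_k(sorted(relevances, reverse=True), 6)
print(dcg, ndcg, perfect)
```

The already-sorted list scores exactly 1.0, while the original ordering scores slightly less because a highly relevant document sits at rank 3 instead of rank 2.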