# Model Evaluation

## 1. Performance estimation techniques

Always evaluate models as if they are predicting future data.
<p>
We do not have access to future data, so we pretend that some data is hidden.
<p>

Simplest way: `holdout`. Randomly split the data (and corresponding labels) into a training set and a test set (e.g. 75%-25%).

- Use `holdout` for very large datasets (e.g. more than 1,000,000 examples), or when learners don't always converge (e.g. deep learning).
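
A minimal sketch of a 75%-25% holdout split. The synthetic dataset and the choice of `LogisticRegression` are assumptions made only for illustration; any estimator could take its place.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Random 75%-25% split of the features and the corresponding labels.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit on the training part only, then score on the hidden test part.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"holdout accuracy: {test_accuracy:.3f}")
```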

### 1.1. K-fold Cross-validation


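- Data splitting: randomly split the dataset into k equally sized subsets (folds)
- Training and evaluation: for each subset, 
    - Use the current subset as the test set and merge the remaining k-1 subsets into the training set.
    - Train the model on the training set.
    - Evaluate the model on the test set, usually with a metric such as accuracy or F1-score.
- Aggregation: collect the metric values from the k evaluations and compute their mean and standard deviation.
- Model selection: compare the mean scores of different models and choose the one that performs best under k-fold validation.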


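```python
from sklearn.model_selection import KFold

# X, y: NumPy arrays of features and labels (assumed to be defined already).
# shuffle=True makes the split random; without it the folds are contiguous blocks.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
```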
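
The train, evaluate, and aggregate steps described above can also be run in one call with `cross_val_score`. This is a sketch under assumptions: the synthetic data and the `LogisticRegression` model are placeholders, not part of the original notes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 5 shuffled folds; each fold serves as the test set exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy"
)

# Summarize the k scores by their mean and standard deviation.
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Comparing the mean (and spread) of `scores` across candidate models is one way to carry out the model-selection step.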


**Remark:** k = 3 or k = 5 is the usual default; k = 10 is often recommended.


### 1.2. Stratified K-Fold cross-validation

If the data is imbalanced, some classes have only a few samples. Stratification: the **proportions** between classes are conserved in each fold.

- Always use stratification for classification (sklearn's `cross_val_score` does this by default for classifiers when `cv` is an integer)
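
A small sketch of what stratification does on imbalanced labels. The 90/10 toy labels are made up for illustration; the point is that every test fold receives the same share of the minority class.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy labels: 90 negatives, 10 positives (illustrative data).
y = np.array([0] * 90 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Count the positives that land in each test fold.
positives_per_fold = [int(y[test_idx].sum()) for _, test_idx in skf.split(X, y)]
fold_sizes = [len(test_idx) for _, test_idx in skf.split(X, y)]
print(positives_per_fold, fold_sizes)
```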


### 1.3. LOOCV

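- LOOCV (leave-one-out cross-validation)

<p>

- Hold out a single data point from the available dataset and train the model on the rest. Repeat this for every data point: with n data points, cross-validation runs n times (e.g. 10 data points give 10 rounds). `test_error` = the mean of all n errors.


- **Advantages:**

    - Well suited to small datasets
    - Almost all data points are used for training in every round, so the bias is low

- **Disadvantages:**

    - Repeating the procedure n times leads to long run times.
    - The estimate of model quality has high variance. Because each test uses a single data point, the estimate depends heavily on that point; if it turns out to be an outlier, the variation is even larger.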
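
A sketch of LOOCV on ten points, matching the ten-round example above. The noisy-line data and the `LinearRegression` model are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Ten illustrative points on a noisy line (made-up data).
rng = np.random.default_rng(0)
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + rng.normal(scale=0.5, size=10)

loo = LeaveOneOut()
# One score per held-out point; sklearn reports negated MSE, so flip the sign.
neg_mse = cross_val_score(
    LinearRegression(), X, y, cv=loo, scoring="neg_mean_squared_error"
)
test_error = -neg_mse.mean()
print(f"{loo.get_n_splits(X)} rounds, mean squared error: {test_error:.3f}")
```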


### 1.4. Shuffle-Split cross-validation

Shuffles the data and randomly samples `train_size` points as the training set. Never use it if the data is ordered (e.g. time series).

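```python
from sklearn.model_selection import ShuffleSplit

# X: NumPy array of features (assumed to be defined already).
# Each of the 5 iterations draws a fresh random 80%/20% split.
ss = ShuffleSplit(n_splits=5, train_size=0.8, test_size=0.2, random_state=0)
for train_index, test_index in ss.split(X):
    X_train, X_test = X[train_index], X[test_index]
```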

### 1.5. The Bootstrap (自助法)

Sample with replacement from the `n` (dataset size) data points and use the result as the training set (the bootstrap sample, `the bootstrap`).

<p>

On average, a bootstrap contains about **63.2%** ($1 - 1/e$) of the distinct data points (the rest of the sample consists of duplicates).
<p>

Use the **unsampled (out-of-bootstrap)** samples as the test set
<p>

Repeat $k$ times to obtain $k$ scores. Similar to Shuffle-Split with `train_size=0.63`, `test_size=0.37`, but with duplicates in the training set.
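
A minimal NumPy sketch of one bootstrap round, assuming a dataset of `n = 10_000` points represented only by their indices. It draws the bootstrap sample, derives the out-of-bootstrap test set, and measures the fraction of distinct points empirically so it can be compared with the figure quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
indices = np.arange(n)

# Draw n indices with replacement: the bootstrap sample (training set).
train_idx = rng.choice(indices, size=n, replace=True)
# Points that were never drawn form the out-of-bootstrap test set.
test_idx = np.setdiff1d(indices, train_idx)

unique_fraction = np.unique(train_idx).size / n
oob_fraction = test_idx.size / n
print(f"distinct points in the bootstrap: {unique_fraction:.3f}")
print(f"out-of-bootstrap points: {oob_fraction:.3f}")
```

Repeating this with different seeds and scoring a model on each `test_idx` would give the $k$ scores.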


### 1.6. Repeated cross-validation

A single cross-validation run still depends on the initial split, which can be made in many ways.
<p>
Repeated, or n-times-k-fold cross-validation:
<p>
Shuffle the data randomly, then do k-fold cross-validation
<p>
Repeat n times, which yields n times k scores
<p>
Less dependent on any single split and very robust, but n times more expensive
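
A sketch of 3-times-5-fold cross-validation with `RepeatedStratifiedKFold`. The synthetic data and the `LogisticRegression` model are placeholders for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# k = 5 folds, repeated n = 3 times with a different shuffle each time.
rkf = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=rkf)

# n * k = 15 scores in total.
print(f"{len(scores)} scores, mean {scores.mean():.3f}, std {scores.std():.3f}")
```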

### 1.7. Cross-validation with groups

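Sometimes the data contains inherent groups: multiple samples from the same patient, images of the same person, ... <p>

Data from the same person may end up in both the training set and the test set. Since we want to measure how well the model generalizes to other people, each person's data must appear only in the training set or only in the test set. This is called grouping.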

```python
from sklearn.model_selection import GroupKFold

gkf = GroupKFold(n_splits=5)
for train_index, test_index in gkf.split(X, y, groups=groups):
    X_train, X_test = X[train_index], X[test_index]
```
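
A self-contained sketch that checks the grouping guarantee. The four patient identifiers (`p1` to `p4`) and the toy arrays are hypothetical; the check confirms that no patient appears on both sides of any split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# 12 samples from 4 hypothetical patients, 3 samples each.
X = np.arange(24).reshape(12, 2)
y = np.array([0, 1] * 6)
groups = np.repeat(["p1", "p2", "p3", "p4"], 3)

gkf = GroupKFold(n_splits=4)
overlaps = []
for train_idx, test_idx in gkf.split(X, y, groups=groups):
    # Patients that occur in both the training and the test part of this split.
    shared = set(groups[train_idx].tolist()) & set(groups[test_idx].tolist())
    overlaps.append(shared)
    print("test patients:", np.unique(groups[test_idx]).tolist())
```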

### 1.8. Time series

When the data is ordered, random test sets are not a good idea.<p>

**Test-then-train (prequential evaluation)**: Every new sample is evaluated only once, then added to the **training set**. Can also be done in batches (of n samples at a time).

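- `TimeSeriesSplit`: in the k-th split, the first k folds form the training set and fold k+1 is the test set, so the training set grows as k increases. For a time series with data points 1 to 10, `TimeSeriesSplit` may produce the following splits (this is the result for `n_splits=7`):

    - 1-3 train | 4 test
    - 1-4 train | 5 test
    - 1-5 train | 6 test
    - 1-6 train | 7 test
    - 1-7 train | 8 test
    - 1-8 train | 9 test
    - 1-9 train | 10 test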


```python
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_index, test_index in tscv.split(X):
    X_train, X_test = X[train_index], X[test_index]
```
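
The ten-point example above can be reproduced with `n_splits=7`. This sketch assumes the data points are simply the integers 1 to 10.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Data points 1..10 in temporal order.
X = np.arange(1, 11).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=7)
splits = []
for train_idx, test_idx in tscv.split(X):
    train_points = X[train_idx].ravel().tolist()
    test_points = X[test_idx].ravel().tolist()
    splits.append((train_points, test_points))
    # The training set always ends right before the test point.
    print(f"train {train_points[0]}-{train_points[-1]} | test {test_points}")
```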

## 2. Evaluation Metrics for Classification

### 2.1 Binary Classification

Two different kinds of errors:

- **False Positive (type I error)**: model predicts positive while true label is negative

- **False Negative (type II error)**: model predicts negative while true label is positive

- We can represent all predictions (correct and incorrect) in a **confusion matrix**
    - n by n array (n is the number of classes)
    - Rows correspond to true classes, columns to predicted classes
    - Count how often samples belonging to a class C are classified as C or any other class.
    - For binary classification, we label these true negative (TN), true positive (TP), false negative (FN), false positive (FP)

| | Predicted Neg | Predicted Pos |
|-|-|-|
| Actual Neg | TN | FP |
| Actual Pos | FN | TP |
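
A sketch of how this table can be obtained with `confusion_matrix`, whose rows are true classes and whose columns are predicted classes. The two label vectors are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative labels (made up): 1 = positive, 0 = negative.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 1])

# Layout for binary labels {0, 1}: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```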



#### Predictive accuracy

- Accuracy can be computed based on the confusion matrix
- Not useful if the dataset is very imbalanced
    - E.g. credit card fraud: is 99.99% accuracy good enough? 

\begin{equation}
\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}
\end{equation}




#### Precision
- Use when the goal is to limit FPs
    - Clinical trials: you only want to test drugs that really work
    - Search engines: you want to avoid bad search results

\begin{equation}
\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}
\end{equation}

#### Recall
- Use when the goal is to limit FNs
    - Cancer diagnosis: you don't want to miss a serious disease
    - Search engines: You don't want to omit important hits
- Also known as sensitivity, hit rate, true positive rate (TPR)

\begin{equation}
\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}
\end{equation}

#### F1-score
- Trades off precision and recall:

\begin{equation}
\text{F1} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\end{equation}
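
The four formulas above can be checked by hand against scikit-learn. The label vectors are made up for illustration; the counts are taken directly from the definitions of TP, TN, FP, and FN.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Illustrative labels (made up): 1 = positive, 0 = negative.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 1])

# Count the four outcomes directly from their definitions.
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# The hand-computed values should agree with scikit-learn.
sk_values = (
    accuracy_score(y_true, y_pred),
    precision_score(y_true, y_pred),
    recall_score(y_true, y_pred),
    f1_score(y_true, y_pred),
)
print(accuracy, precision, recall, f1)
print(sk_values)
```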

### 2.2. Multi-class classification

- Compute scores _per class_ (one-vs-rest): one class is viewed as positive, the other(s) as negative, then average
    - micro-averaging: count total TP, FP, TN, FN (every sample equally important)
        - for single-label problems, micro-precision, micro-recall, micro-F1 and accuracy are all the same
        $$\text{Precision: } \frac{\sum_{c=1}^C\text{TP}_c}{\sum_{c=1}^C\text{TP}_c + \sum_{c=1}^C\text{FP}_c} \xrightarrow{C=2} \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}$$
        
    - macro-averaging: average of scores $R(y_c,\hat{y}_c)$ obtained on each class
        - Preferable for imbalanced classes (if all classes are equally important)
        - macro-averaged recall is also called _balanced accuracy_ 
         $$\frac{1}{C} \sum_{c=1}^C R(y_c,\hat{y}_c)$$
    - weighted averaging ($w_c$: ratio of examples of class $c$, aka support): $\sum_{c=1}^C w_c R(y_c,\hat{y}_c)$
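
A sketch of the three averaging modes using the `average` parameter of scikit-learn's metric functions. The three-class labels are made up; the example checks two identities stated above: micro-F1 equals accuracy, and macro-averaged recall equals balanced accuracy.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    f1_score,
    recall_score,
)

# Illustrative three-class labels (made up); class 0 is the majority class.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 0, 0, 0, 1, 2, 1, 1, 0, 2])

micro_f1 = f1_score(y_true, y_pred, average="micro")
macro_recall = recall_score(y_true, y_pred, average="macro")
weighted_recall = recall_score(y_true, y_pred, average="weighted")
accuracy = accuracy_score(y_true, y_pred)
balanced_acc = balanced_accuracy_score(y_true, y_pred)

print(f"micro-F1 {micro_f1:.3f} vs accuracy {accuracy:.3f}")
print(f"macro recall {macro_recall:.3f} vs balanced accuracy {balanced_acc:.3f}")
print(f"weighted recall {weighted_recall:.3f}")
```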


### 2.3. Other useful classification metrics
- Cohen's Kappa
    - Measures 'agreement' between different models (aka inter-rater agreement)
    - To evaluate a single model, compare it against a model that does random guessing
        - Similar to accuracy, but taking into account the possibility of predicting the right class by chance
    - Can be weighted: different misclassifications given different weights
    - 1: perfect prediction, 0: random prediction, negative: worse than random
    - With $p_o$ = accuracy, and $p_e$ = accuracy of random classifier:
        $$\kappa = \frac{p_o - p_e}{1 - p_e}$$
- Matthews correlation coefficient
    - Corrects for imbalanced data, alternative for balanced accuracy or AUROC
    - 1: perfect prediction, 0: random prediction, -1: inverse prediction
        $$MCC = \frac{tp \times tn - fp \times fn}{\sqrt{(tp + fp)(tp + fn)(tn + fp)(tn + fn)}}$$
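
Both formulas can be evaluated by hand and compared with `cohen_kappa_score` and `matthews_corrcoef`. In this sketch the labels are made up, and $p_e$ is taken to be the chance agreement implied by the marginal class frequencies of the two label vectors, which is an assumption about how the "random classifier" is defined.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix, matthews_corrcoef

# Illustrative labels (made up): 1 = positive, 0 = negative.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 1])

tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, y_pred).ravel())
n = tn + fp + fn + tp

# Observed agreement (accuracy).
p_o = (tp + tn) / n
# Chance agreement: P(true=1) * P(pred=1) + P(true=0) * P(pred=0).
p_e = ((tp + fn) / n) * ((tp + fp) / n) + ((tn + fp) / n) * ((tn + fn) / n)
kappa = (p_o - p_e) / (1 - p_e)

mcc = (tp * tn - fp * fn) / np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(f"kappa: {kappa:.3f} (sklearn: {cohen_kappa_score(y_true, y_pred):.3f})")
print(f"MCC:   {mcc:.3f} (sklearn: {matthews_corrcoef(y_true, y_pred):.3f})")
```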

### 2.4. ROC curve and AUC

- Trade off _true positive rate_ $\textit{TPR}= \frac{TP}{TP + FN}$ with _false positive rate_ $\textit{FPR} = \frac{FP}{FP + TN}$
- Plotting TPR against FPR _for all possible thresholds_ yields a _Receiver Operating Characteristics curve_
    - Change the threshold until you find a sweet spot in the TPR-FPR trade-off
    - Lower thresholds yield higher TPR (recall), higher FPR, and vice versa

- AUC (area under the curve): the closer the value is to 1, the better the model; 0.5 corresponds to random guessing
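
A sketch of computing the ROC curve and its AUC from predicted scores. The four labels and scores are made-up toy values; `roc_curve` sweeps the thresholds and returns one (FPR, TPR) pair per threshold.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Toy labels and predicted scores for the positive class (made up).
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# One (FPR, TPR) point per candidate threshold, from strictest to loosest.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the curve, computed directly and from the curve points.
auc_value = roc_auc_score(y_true, y_score)
auc_from_curve = auc(fpr, tpr)

for f, t in zip(fpr, tpr):
    print(f"FPR={f:.2f} TPR={t:.2f}")
print(f"AUC: {auc_value:.2f}")
```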

### 2.5. Cost-sensitive classification (dealing with imbalance)

In the real world, different kinds of misclassification can have different costs

- Misclassifying certain classes can be more costly than others

- Misclassifying certain samples can be more costly than others

Cost-sensitive resampling: **resample (or reweight)** the data to represent real-world expectations

- oversample minority classes (or undersample majority) to ‘correct’ imbalance

- increase weight of misclassified samples (e.g. in boosting)

- decrease weight of misclassified (noisy) samples (e.g. in model compression)
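
One way to reweight instead of resampling is to give each class a weight inversely proportional to its frequency. The sketch below uses `compute_class_weight` with the `"balanced"` heuristic on made-up 90/10 labels; whether these weights match real-world costs is an assumption that has to be checked per application.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Imbalanced toy labels: 90 majority-class and 10 minority-class samples.
y = np.array([0] * 90 + [1] * 10)
classes = np.unique(y)

# "balanced" weights: n_samples / (n_classes * count of each class).
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight)
```

Many scikit-learn classifiers (e.g. `LogisticRegression`) accept such a dictionary, or the string `"balanced"`, through their `class_weight` parameter.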

## 3. Bias-Variance Trade-off

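**Motivation:** To characterize the learning behavior from a statistical perspective

**Generalization ability**: whether a model's predictions are consistent across different sample sets drawn from the same true distribution, i.e. whether the model generalizes.

- See [Basic machine learning concepts: bias, variance, error, and noise](https://zhuanlan.zhihu.com/p/23090568344)
- See [DDA3020-LectureNotes](../DDA3020/L11_Bias_Variance_Decomposition.pdf)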

## Conclusion
Understanding bias and variance is essential for improving model performance.

## Reference

- [A summary of cross-validation methods](https://blog.csdn.net/WHYbeHERE/article/details/108192957)
- [Cross-Validation explained](https://zhuanlan.zhihu.com/p/24825503)