# Chapter 5: Resampling Methods
## Cross-Validation
### Validation Set Approach
Idea: randomly split the available observations into a *training set* and a *validation set*. The model is fit on the training set and used to predict the responses in the validation set; the validation set error rate (typically MSE for a quantitative response) provides an estimate of the test error rate.

Two drawbacks:
1) The validation estimate of the test error rate can be highly variable, depending on which observations land in the training set and which in the validation set.
2) Only the observations in the training set are used to fit the model, hence the validation set error rate tends to overestimate the test error rate for the model fit on the entire data set.
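
The first drawback can be seen in a small experiment. The sketch below assumes synthetic data and a simple linear regression; the helper names `fit_simple_ols` and `validation_mse` are made up for illustration. It repeats the random split ten times and shows that the validation MSE changes from split to split.

```python
import random

def fit_simple_ols(xs, ys):
    """Least squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def validation_mse(xs, ys, rng, train_fraction=0.5):
    """Fit on a random training subset, return the MSE on the held-out rest."""
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    n_train = int(train_fraction * len(idx))
    train, valid = idx[:n_train], idx[n_train:]
    b0, b1 = fit_simple_ols([xs[i] for i in train], [ys[i] for i in train])
    return sum((ys[i] - (b0 + b1 * xs[i])) ** 2 for i in valid) / len(valid)

# Synthetic data (an assumption for illustration): y = 2 + 3x + noise
data_rng = random.Random(0)
xs = [data_rng.uniform(0, 10) for _ in range(100)]
ys = [2 + 3 * x + data_rng.gauss(0, 2) for x in xs]

# Ten different random splits give ten different estimates of the test MSE
estimates = [validation_mse(xs, ys, random.Random(seed)) for seed in range(10)]
print("smallest estimate:", min(estimates))
print("largest estimate: ", max(estimates))
```

The spread between the smallest and largest estimate is the variability that the first drawback refers to.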

### Leave-One-Out Cross-Validation
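Idea: hold out a single observation as the validation set and fit the model on the remaining n − 1 observations; repeat n times so that each observation is held out exactly once. Writing $MSE_i$ for the squared prediction error on the $i$th held-out observation, we have the **LOOCV** estimate for the test MSE: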
$$
CV_n=\frac{1}{n}\sum_{i=1}^n MSE_i
$$

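Two advantages over the validation set approach:
1) Far less bias: each fit uses n − 1 observations, so LOOCV tends not to overestimate the test error.
2) No randomness in the results: there is only one way to leave each observation out, so repeated runs give the same estimate.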

Note: the **leverage statistic** in (3.37), for simple linear regression, is $h_i=\frac{1}{n}+\frac{(x_i-\bar{x})^2}{\sum_{i'=1}^n(x_{i'}-\bar{x})^2}$.
The leverage statistic $h_i$ is always between $1/n$ and $1$, and the average leverage for all the observations is always equal to $(p + 1)/n$.
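
Both properties of the leverage are easy to check numerically for the simple linear regression case ($p = 1$). The sketch below uses synthetic predictor values, and the function name `leverages` is hypothetical.

```python
import random

def leverages(xs):
    """Leverage h_i for simple linear regression (one predictor plus intercept)."""
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return [1 / n + (x - x_bar) ** 2 / sxx for x in xs]

rng = random.Random(1)
xs = [rng.gauss(0, 1) for _ in range(50)]  # synthetic predictor values
h = leverages(xs)

n, p = len(xs), 1
print("all h_i within [1/n, 1]:", all(1 / n <= hi <= 1 for hi in h))
print("average leverage:", sum(h) / n, "vs (p + 1)/n =", (p + 1) / n)
```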

**With least squares linear or polynomial regression**, a shortcut for LOOCV MSE is:
$$
CV_n=\frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i-\hat{y}_i}{1-h_i}\right)^2
$$
where $\hat{y}_i$ is the $i$th fitted value from the original least squares fit and $h_i$ is the leverage of the $i$th observation.

* This formula allows us to fit the model only once (using all the data) and still estimate the prediction error for each leave-one-out observation.
* The shortcut does not hold in general.



### k-Fold Cross-Validation
Idea: randomly divide the set of observations into k groups (folds) of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining k − 1 folds; $MSE_1$ is then computed on the observations in the held-out fold. Repeat k times, holding out a different fold each time, and average:
$$
CV_k=\frac{1}{k}\sum_{i=1}^k MSE_i
$$
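
The procedure can be sketched in a few lines. The example below assumes synthetic data and a simple linear regression as the method being evaluated; `k_fold_cv` and `fit_simple_ols` are illustrative names, not part of any library.

```python
import random

def fit_simple_ols(xs, ys):
    """Least squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def k_fold_cv(xs, ys, k, rng):
    """Average of the k held-out MSEs, one per fold."""
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    folds = [idx[j::k] for j in range(k)]  # k groups of near-equal size
    fold_mses = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        b0, b1 = fit_simple_ols([xs[i] for i in train], [ys[i] for i in train])
        mse = sum((ys[i] - (b0 + b1 * xs[i])) ** 2 for i in held_out) / len(held_out)
        fold_mses.append(mse)
    return sum(fold_mses) / k

# Synthetic data (an assumption for illustration): y = 2 + 3x + noise
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [2 + 3 * x + rng.gauss(0, 2) for x in xs]

cv_5 = k_fold_cv(xs, ys, k=5, rng=random.Random(1))
cv_10 = k_fold_cv(xs, ys, k=10, rng=random.Random(1))
print(cv_5, cv_10)
```

Setting `k` equal to the number of observations makes every fold a single observation, which recovers LOOCV as a special case.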

#### Bias-Variance Trade-Off for k-Fold Cross-Validation

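Bias: each k-fold fit (with k < n) trains on roughly (k − 1)n/k observations, fewer than the n − 1 used by LOOCV → k-fold CV is more biased than LOOCV (though less biased than the validation set approach)

Variance: each LOOCV model is trained on an almost identical set of observations → outputs are highly (positively) correlated with each other → the test error estimate resulting from LOOCV tends to have higher variance than the one from k-fold CV (see details in Notes)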

### Cross-Validation on Classification Problems
Idea: use **the number of misclassified observations** to quantify the test error.

LOOCV error rate: $CV_n=\frac{1}{n}\sum_{i=1}^n Err_i$, where $Err_i = I(y_i\ne \hat{y}_i)$.
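
As a minimal sketch of the error-rate version, the example below runs LOOCV with a 1-nearest-neighbour classifier on synthetic one-dimensional data. The classifier is an assumption chosen only because it needs no fitting code; any classifier could be substituted.

```python
import random

def one_nn_predict(train_x, train_y, x0):
    """Label of the training point closest to x0 (1-nearest-neighbour)."""
    nearest = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x0))
    return train_y[nearest]

def loocv_error_rate(xs, ys):
    """Fraction of observations misclassified when each is held out in turn."""
    n = len(xs)
    errors = 0
    for i in range(n):
        pred = one_nn_predict(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], xs[i])
        errors += int(pred != ys[i])  # Err_i = I(y_i != y_hat_i)
    return errors / n

# Synthetic data (an assumption for illustration): two overlapping classes
rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(50)] + [rng.gauss(2, 1) for _ in range(50)]
ys = [0] * 50 + [1] * 50

rate = loocv_error_rate(xs, ys)
print("LOOCV error rate:", rate)
```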

## Bootstrap
Idea: quantify the uncertainty associated with a given estimator or statistical learning method

Steps:
1) start from a data set $Z$ with n observations; randomly select n observations **with replacement** from it to produce a bootstrap data set, call it $Z^{*1}$
2) use $Z^{*1}$ to produce a bootstrap estimate for $\alpha$, call it $\hat{\alpha}^{*1}$
3) repeat the procedure $B$ times (for some large $B$), giving $Z^{*1}, \dots, Z^{*B}$ and $\hat{\alpha}^{*1}, \dots, \hat{\alpha}^{*B}$

Then compute the **standard error of these bootstrap estimates**:
$$
\mathrm{SE}_B(\hat{\alpha}) = \sqrt{ \frac{1}{B - 1} \sum_{r = 1}^B \left( \hat{\alpha}^{*r} - \frac{1}{B} \sum_{r' = 1}^B \hat{\alpha}^{*r'} \right)^2 }
$$
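
The steps and the standard error formula can be sketched as below. Because the notes do not specify the estimator $\hat{\alpha}$ here, the example assumes the sample mean as a stand-in, which has the advantage that the bootstrap result can be compared with the textbook formula $s/\sqrt{n}$. The function name `bootstrap_se` and the synthetic data are illustrative assumptions.

```python
import math
import random
import statistics

def bootstrap_se(data, statistic, B, rng):
    """Standard deviation of the statistic over B bootstrap resamples."""
    n = len(data)
    estimates = []
    for _ in range(B):
        resample = rng.choices(data, k=n)  # n draws with replacement
        estimates.append(statistic(resample))
    # statistics.stdev uses the 1 / (B - 1) divisor, matching the formula above
    return statistics.stdev(estimates)

# Synthetic sample (an assumption for illustration)
rng = random.Random(0)
data = [rng.gauss(10, 3) for _ in range(100)]

se_boot = bootstrap_se(data, statistics.mean, B=1000, rng=random.Random(1))
se_formula = statistics.stdev(data) / math.sqrt(len(data))
print(se_boot, se_formula)  # the two should be close, not identical
```

Passing a different function as `statistic` gives the bootstrap standard error of that estimator instead, which is the main use of the bootstrap when no closed-form standard error is available.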


# Notes
## Derivation of Leave-One-Out Cross-Validation (LOOCV) MSE

Consider the linear model:
$$
y = X \beta + \varepsilon, \quad \text{with } \hat{\beta} = (X^T X)^{-1} X^T y
$$

with
$$
\hat{y} = X \hat{\beta}, \quad e = y - \hat{y}
$$

The predicted value for the $i$th observation:
$$
\hat{y}_i = x_i^T \hat{\beta}
$$

The leverage statistic $h_i$ is the $i$th diagonal element of the hat matrix $H = X (X^T X)^{-1} X^T$, where $x_i^T$ denotes the $i$th row of $X$:
$$
h_i = x_i^T (X^T X)^{-1} x_i
$$

---

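Leave-one-out cross-validation, where $\hat{y}_i^{(-i)}$ denotes the prediction for observation $i$ from the model fit with observation $i$ removed: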
$$
CV_n = \frac{1}{n} \sum_{i=1}^n \left( y_i - \hat{y}_i^{(-i)} \right)^2
$$

---

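By the Sherman-Morrison formula, applied to $(X^T X - x_i x_i^T)^{-1}$ (the inverse needed for the fit with observation $i$ removed), we have:

$$
\hat{y}_i^{(-i)} = \hat{y}_i - \frac{h_i e_i}{1 - h_i}
$$

Then the leave-one-out residual is

$$
y_i - \hat{y}_i^{(-i)} = e_i + \frac{h_i e_i}{1 - h_i} = \frac{e_i}{1 - h_i}
$$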
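
The Sherman-Morrison step can be unpacked as follows. This is a sketch of the standard argument; $X_{(-i)}$ and $y_{(-i)}$ (the data with observation $i$ removed) and the shorthand $A = X^T X$ are notation introduced here for illustration. Removing row $i$ gives

$$
X_{(-i)}^T X_{(-i)} = A - x_i x_i^T, \quad X_{(-i)}^T y_{(-i)} = X^T y - x_i y_i
$$

and the Sherman-Morrison formula gives

$$
(A - x_i x_i^T)^{-1} = A^{-1} + \frac{A^{-1} x_i x_i^T A^{-1}}{1 - h_i}
$$

Multiplying the two and collecting terms, using $x_i^T A^{-1} X^T y = \hat{y}_i$ and $x_i^T A^{-1} x_i = h_i$:

$$
\hat{\beta}^{(-i)} = \hat{\beta} - \frac{A^{-1} x_i \, e_i}{1 - h_i}
$$

Premultiplying by $x_i^T$ yields the leave-one-out prediction for observation $i$, and subtracting it from $y_i$ gives the residual $e_i / (1 - h_i)$.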

---

Substituting this into the definition of the LOOCV MSE, we have

$$
CV_n = \frac{1}{n} \sum_{i=1}^n \left( \frac{y_i - \hat{y}_i}{1 - h_i} \right)^2
= \frac{1}{n} \sum_{i=1}^n \left( \frac{e_i}{1 - h_i} \right)^2
$$



## Why Highly Correlated Variables Increase the Variance of the Mean?

Consider random variables $ X_1, X_2, \dots, X_n $, and define their mean:

$$
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i
$$

The variance of $ \bar{X} $ is:

$$
\text{Var}(\bar{X}) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \text{Cov}(X_i, X_j)
$$

This shows that the variance of the average depends on **both** the individual variances and the covariances between variables.

---

### Case 1: Independent Variables

If $ X_i $ are i.i.d. with variance $ \sigma^2 $, then:

$$
\text{Cov}(X_i, X_j) = 0 \quad \text{for } i \ne j
$$

$$
\text{Var}(\bar{X}) = \frac{1}{n^2} \cdot n \cdot \sigma^2 = \frac{\sigma^2}{n}
$$

→ Averaging reduces variance.

---

### Case 2: Perfectly Correlated Variables

If $ X_1 = X_2 = \cdots = X_n $, then:

$$
\text{Cov}(X_i, X_j) = \sigma^2 \quad \text{for all } i, j
$$

$$
\text{Var}(\bar{X}) = \frac{1}{n^2} \cdot n^2 \cdot \sigma^2 = \sigma^2
$$

→ Averaging **does not** reduce variance at all.

---

### Conclusion

> The more correlated the variables, the less averaging helps reduce variance.

This is why the average of many **highly correlated** quantities can have **higher variance** than the average of many **uncorrelated** ones.
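
The two cases above are the endpoints of a range. As a hedged illustration, the sketch below assumes an equicorrelated setting (every pair of variables has the same correlation $\rho$, an assumption not made in the notes) and evaluates the double-sum formula for $\text{Var}(\bar{X})$ directly. The function names are hypothetical.

```python
def variance_of_mean(cov):
    """Var(X-bar) = (1 / n^2) * sum of all entries of the covariance matrix."""
    n = len(cov)
    return sum(sum(row) for row in cov) / n ** 2

def equicorrelated_cov(n, sigma2, rho):
    """Covariance matrix with sigma2 on the diagonal and rho * sigma2 elsewhere."""
    return [[sigma2 if i == j else rho * sigma2 for j in range(n)]
            for i in range(n)]

n, sigma2 = 10, 4.0
results = {}
for rho in (0.0, 0.5, 0.9, 1.0):
    results[rho] = variance_of_mean(equicorrelated_cov(n, sigma2, rho))
    print("rho =", rho, "Var(mean) =", results[rho])
```

With $\rho = 0$ the result equals $\sigma^2 / n$ (Case 1), with $\rho = 1$ it equals $\sigma^2$ (Case 2), and intermediate correlations fall in between, increasing with $\rho$.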

