# Chapter 7 Notes

## 7.2 Bias, Variance and Model Complexity

For a given loss function $L(Y, \hat{f}(X))$, where the prediction function $\hat{f}$ has been estimated from a training set $\mathcal{T}$, the *test error* (or *generalisation error*) is the prediction error over an independent test sample:

\begin{equation}
    \text{Err}_{\mathcal{T}} = \text{E}_{X, Y}\left[ L(Y, \hat{f}(X)) \mid \mathcal{T} \right].
\end{equation}

Taking the expectation over the training set $\mathcal{T}$ gives the *expected test error* or *expected prediction error*:

\begin{equation}
    \text{Err} = \text{E}_{\mathcal{T}}\left[\text{Err}_{\mathcal{T}} \right].
\end{equation}
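
The difference between the two quantities can be seen in a small simulation. The sketch below is only illustrative: the data-generating model ($y = 2x + \varepsilon$ with standard Gaussian noise), the least-squares line, the sample sizes and all function names are assumptions made for this example, not taken from the book. Each entry of `err_T` is a Monte Carlo estimate of $\text{Err}_{\mathcal{T}}$ for one training set, and their average approximates $\text{Err}$.

```python
import random
import statistics

def draw_sample(n, rng):
    """Draw n pairs (x, y) from the assumed model y = 2x + Gaussian noise."""
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    return [(x, 2.0 * x + rng.gauss(0.0, 1.0)) for x in xs]

def fit_line(train):
    """Least-squares fit of y = a + b*x; returns the prediction function."""
    mx = statistics.fmean(x for x, _ in train)
    my = statistics.fmean(y for _, y in train)
    b = sum((x - mx) * (y - my) for x, y in train) / sum((x - mx) ** 2 for x, _ in train)
    a = my - b * mx
    return lambda x: a + b * x

def conditional_test_error(train, rng, n_test=2000):
    """Monte Carlo estimate of Err_T: squared error on fresh data, training set held fixed."""
    f_hat = fit_line(train)
    test = draw_sample(n_test, rng)
    return statistics.fmean((y - f_hat(x)) ** 2 for x, y in test)

rng = random.Random(0)
# One Err_T per training set of size 20; averaging over training sets approximates Err.
err_T = [conditional_test_error(draw_sample(20, rng), rng) for _ in range(200)]
err = statistics.fmean(err_T)
print(f"Err_T ranges from {min(err_T):.3f} to {max(err_T):.3f}")
print(f"Err (average over training sets) is about {err:.3f}")
```

The spread of `err_T` shows how much the conditional error can vary from one training set to another, even though $\text{Err}$ itself is a single number.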

## 7.10 Cross-Validation

$K$-fold cross-validation estimates $\text{Err}$ (*not* $\text{Err}_{\mathcal{T}}$).
Leave-one-out cross-validation ($K=N$) provides an approximately unbiased estimate but can have high variance.
For $K=5$ or $K=10$, the estimator has lower variance but can be biased upwards (because each training set is smaller than the full training set).
The magnitude of this bias depends on where you are on the learning curve.
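
As a rough illustration of the procedure, here is a minimal sketch of $K$-fold cross-validation with squared-error loss. The synthetic data, the least-squares line and the names `fit_line` and `k_fold_cv` are assumptions for this example. The standard error is computed from the $K$ fold errors, which treats them as independent; that is only an approximation.

```python
import random
import statistics

def fit_line(train):
    """Least-squares fit of y = a + b*x; returns the prediction function."""
    mx = statistics.fmean(x for x, _ in train)
    my = statistics.fmean(y for _, y in train)
    b = sum((x - mx) * (y - my) for x, y in train) / sum((x - mx) ** 2 for x, _ in train)
    a = my - b * mx
    return lambda x: a + b * x

def k_fold_cv(data, fit, k, rng):
    """Return the K-fold CV estimate of squared-error loss and its standard error."""
    idx = list(range(len(data)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # K parts of (nearly) equal size
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        # Fit on everything except the held-out fold, then score on that fold.
        f_hat = fit([data[i] for i in idx if i not in held])
        fold_errors.append(
            statistics.fmean((data[i][1] - f_hat(data[i][0])) ** 2 for i in held_out)
        )
    cv = statistics.fmean(fold_errors)
    se = statistics.stdev(fold_errors) / k ** 0.5
    return cv, se

rng = random.Random(0)
xs = [rng.uniform(0.0, 1.0) for _ in range(100)]
data = [(x, 2.0 * x + rng.gauss(0.0, 1.0)) for x in xs]  # synthetic data
cv_results = {k: k_fold_cv(data, fit_line, k, rng) for k in (5, 10, len(data))}
for k, (cv, se) in cv_results.items():
    print(f"K={k:3d}: CV estimate {cv:.3f} (standard error {se:.3f})")
```

Setting `k` to `len(data)` gives leave-one-out cross-validation. Averaging the fold means equals the average over all observations only when the folds have equal size, as they do here.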

When tuning hyper-parameters, a 'one-standard error' rule is often used.
This involves choosing the most parsimonious model whose error is no more than one standard error (*not* deviation) above the error of the best model.
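
For a single hyper-parameter with an obvious complexity ordering, the rule can be sketched as follows. The function name and the tuple layout are hypothetical, and the numbers are made up purely for illustration.

```python
def one_standard_error_rule(candidates):
    """Pick the least complex model whose CV error is within one SE of the best.

    candidates: list of (complexity, cv_error, standard_error) tuples.
    """
    best = min(candidates, key=lambda c: c[1])
    threshold = best[1] + best[2]  # best error plus its standard error
    eligible = [c for c in candidates if c[1] <= threshold]
    return min(eligible, key=lambda c: c[0])

# Made-up CV results: (complexity, cv_error, standard_error).
results = [(1, 0.50, 0.04), (2, 0.36, 0.03), (3, 0.31, 0.03), (4, 0.30, 0.03), (5, 0.32, 0.03)]
chosen = one_standard_error_rule(results)
print(chosen)  # (3, 0.31, 0.03)
```

The lowest error belongs to complexity 4, but complexity 3 is within one standard error of it and is simpler, so it is the one selected.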

**Question:** How does the 'one-standard error' rule work when you have multiple hyper-parameters, so it isn't clear which model is most parsimonious?

## 7.11 Bootstrap Methods

Given a training set $\mathbb{Z}$, we draw $B$ bootstrap samples $\mathbb{Z}^{*b}$ ($1\leq b\leq B$), each sampled with replacement from $\mathbb{Z}$ and of the same size as $\mathbb{Z}$.
If $S(\mathbb{Z})$ is a quantity computed from $\mathbb{Z}$, we can estimate any aspect of the distribution of $S(\mathbb{Z})$ using the $S(\mathbb{Z}^{*b})$.
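
For instance, the standard error of a statistic can be estimated by the standard deviation of its bootstrap replicates. The sketch below assumes synthetic Gaussian data and takes $S$ to be the sample median; `bootstrap_replicates` is a name chosen for this example.

```python
import random
import statistics

def bootstrap_replicates(z, statistic, n_boot, rng):
    """Compute S(Z*b) for n_boot bootstrap samples drawn with replacement from z."""
    n = len(z)
    return [statistic(rng.choices(z, k=n)) for _ in range(n_boot)]

rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(100)]  # synthetic data
reps = bootstrap_replicates(z, statistics.median, 1000, rng)
se_hat = statistics.stdev(reps)  # bootstrap estimate of the standard error of the median
print(f"median = {statistics.median(z):.3f}, bootstrap SE = {se_hat:.3f}")
```

Other aspects of the distribution, such as quantiles for a confidence interval, could be read off `reps` in the same way.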

We can estimate $\text{Err}$ for a model by fitting it to each bootstrap sample and averaging the error of the fitted models over the original training set, giving $\hat{\text{Err}}_{\text{boot}}$.
This tends to under-estimate $\text{Err}$ (it is too optimistic) because each bootstrap sample overlaps with the original training set, which is acting as the test sample: on average a fraction $1-e^{-1} \approx 0.632$ of the original observations appear in a given bootstrap sample.
We can improve our estimate by evaluating each bootstrapped model only on the observations not in its bootstrap sample, giving the average error $\hat{\text{Err}}^{(1)}_{\text{boot}}$.
This in turn is biased to over-estimate $\text{Err}$ because each bootstrap sample contains only about $0.632 N$ distinct observations, where $N$ is the size of the full training set.
The book gives an improved correction that is a weighted average of $\hat{\text{Err}}^{(1)}_{\text{boot}}$ and the training error.
It depends on the *no-information error rate*: the error rate of our prediction rule if the inputs and outputs were independent.
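
The figure $0.632$ comes from the chance that a given observation is drawn at least once in $N$ draws with replacement, $1 - (1 - 1/N)^N \to 1 - e^{-1}$. The sketch below computes the training error, $\hat{\text{Err}}_{\text{boot}}$, $\hat{\text{Err}}^{(1)}_{\text{boot}}$ and the simpler fixed-weight ".632" combination $0.368 \cdot \text{training error} + 0.632 \cdot \hat{\text{Err}}^{(1)}_{\text{boot}}$. The synthetic data, the least-squares line and the function names are assumptions for this example, and the no-information-rate version of the correction is not implemented here.

```python
import random
import statistics

def fit_line(train):
    """Least-squares fit of y = a + b*x; returns the prediction function."""
    mx = statistics.fmean(x for x, _ in train)
    my = statistics.fmean(y for _, y in train)
    b = sum((x - mx) * (y - my) for x, y in train) / sum((x - mx) ** 2 for x, _ in train)
    a = my - b * mx
    return lambda x: a + b * x

def sq_err(f_hat, point):
    x, y = point
    return (y - f_hat(x)) ** 2

def bootstrap_errors(data, fit, n_boot, rng):
    """Return (training error, Err_boot, leave-one-out bootstrap error, .632 estimate)."""
    n = len(data)
    f_full = fit(data)
    train_err = statistics.fmean(sq_err(f_full, p) for p in data)
    boot_losses = []                     # every bootstrap model on every original point
    oob_losses = [[] for _ in range(n)]  # only points left out of the bootstrap sample
    for _ in range(n_boot):
        sample_idx = [rng.randrange(n) for _ in range(n)]
        in_sample = set(sample_idx)
        f_hat = fit([data[i] for i in sample_idx])
        for i, p in enumerate(data):
            loss = sq_err(f_hat, p)
            boot_losses.append(loss)
            if i not in in_sample:
                oob_losses[i].append(loss)
    err_boot = statistics.fmean(boot_losses)
    # Average per observation first, skipping any observation that was never left out.
    err_1 = statistics.fmean(statistics.fmean(l) for l in oob_losses if l)
    err_632 = 0.368 * train_err + 0.632 * err_1
    return train_err, err_boot, err_1, err_632

rng = random.Random(0)
xs = [rng.uniform(0.0, 1.0) for _ in range(50)]
data = [(x, 2.0 * x + rng.gauss(0.0, 1.0)) for x in xs]  # synthetic data
train_err, err_boot, err_1, err_632 = bootstrap_errors(data, fit_line, 200, rng)
print(f"training error {train_err:.3f}, Err_boot {err_boot:.3f}, "
      f"Err^(1) {err_1:.3f}, .632 estimate {err_632:.3f}")
```

A two-parameter linear fit overfits very little, so the gaps between these estimates are expected to be small here; they would typically be wider for a more flexible model.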

**Question:** The bootstrapped training sets will contain duplicates - what is the effect of these? Why not just remove these and use a random subset of proportion 0.632?

### 7.11.1 Example (Continued)

Minimising the cross-validation, bootstrap, or AIC estimate over the possible hyper-parameter values yields, in each case, a model fairly close to the best available.
In practice, AIC is often not available because estimating the effective number of parameters is difficult.

For the purpose of model selection, it doesn't matter if our estimate of test error is biased as long as it doesn't affect the relative performance of different models.
However, for the models tried in the book (linear and KNN), the bootstrap and CV provide better estimates of test error than AIC.
The book states that for trees, CV and the bootstrap can under-estimate the true error by 10% because the search for the best tree is strongly affected by the validation set.

**Question:** What does the last sentence mean?

## 7.12 Conditional or Expected Test Error?

Both 10-fold CV and leave-one-out CV estimate $\text{Err}$ rather than $\text{Err}_{\mathcal{T}}$, with 10-fold CV giving the better estimate.
Similarly, the bootstrap estimates $\text{Err}$ rather than $\text{Err}_{\mathcal{T}}$.
In general, estimating $\text{Err}_{\mathcal{T}}$ for a specific training set is a difficult problem.