# $\S$ 7.11. Bootstrap Methods

The bootstrap is a general tool for assessing statistical accuracy.

First we describe the bootstrap in general, and then show how it can be used to estimate extra-sample prediction error.

As with cross-validation, the bootstrap seeks to estimate the conditional error $\text{Err}_{\mathcal{T}}$, but typically estimates well only the expected prediction error $\text{Err}$.

### Basic idea
Suppose we have a model fit to a set of training data. We denote the training set by $\mathbf{Z} = (z_1, z_2, \ldots, z_N)$ where $z_i = (x_i, y_i)$.

> The basic idea is to randomly draw datasets with replacement from the training data, each sample the same size as the original training set.

This is done $B$ times (say, $B = 100$), producing $B$ bootstrap datasets, as shown in FIGURE 7.12. We then refit the model to each of the bootstrap datasets and examine the behavior of the fits over the $B$ replications.

![FIGURE 7.12](fig7-12.jpg)

In the figure, $S(\mathbf{Z})$ is any quantity computed from the data $\mathbf{Z}$, for example, the prediction at some input point.

### Estimating the distribution of $S$
From the bootstrap sampling we can estimate any aspect of the distribution of $S(\mathbf{Z})$, for example, its variance,

\begin{align}
\widehat{\text{Var}} ( S(\mathbf{Z}) ) &= \frac{1}{B-1} \sum_{b=1}^B \left( S(\mathbf{Z}^{*b}) - \bar{S}^* \right)^2, \\
\text{where } \bar{S}^* &= \frac{1}{B} \sum_{b=1}^B S(\mathbf{Z}^{*b}).
\end{align}

Note that $\widehat{\text{Var}}$ can be thought of as a Monte Carlo estimate of the variance of $S(\mathbf{Z})$ under sampling from the empirical distribution function $\hat{F}$ for the data $(z_1, z_2, \ldots, z_N)$.
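As a minimal sketch of this Monte Carlo estimate, the Python below draws $B$ bootstrap datasets and takes the sample variance of the replicated statistic. The function name `bootstrap_variance`, the seeds, the simulated Gaussian data, and the choice of the sample mean as $S$ are illustrative assumptions, not part of the original text.

```python
import random
import statistics

def bootstrap_variance(data, stat, B=200, seed=0):
    """Monte Carlo estimate of Var(S(Z)) under sampling from the empirical distribution."""
    rng = random.Random(seed)
    n = len(data)
    # Each bootstrap dataset Z*b is drawn with replacement and has the original size N.
    reps = [stat(rng.choices(data, k=n)) for _ in range(B)]
    # statistics.variance divides by B - 1, matching the formula above.
    return statistics.variance(reps)

# Hypothetical data: 50 draws from a standard normal distribution.
data_rng = random.Random(1)
z = [data_rng.gauss(0.0, 1.0) for _ in range(50)]

var_mean = bootstrap_variance(z, statistics.mean)
print(var_mean)
```

When $S$ is the sample mean, the result should be close to the plug-in value $\hat\sigma^2 / N$, which gives a quick sanity check on the sketch.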

### Estimating prediction error
How can we apply the bootstrap to estimate prediction error?

One approach would be to fit the model in question on a set of bootstrap samples, and then keep track of how well it predicts the original training set. If $\hat{f}^{*b}(x_i)$ is the predicted value at $x_i$, from the model fitted to the $b$th bootstrap dataset, our estimate is

\begin{equation}
\widehat{\text{Err}}_{\text{boot}} = \frac{1}{BN} \sum_{b=1}^B \sum_{i=1}^N L \left( y_i, \hat{f}^{*b}(x_i) \right).
\end{equation}

#### It's not good
However, it is easy to see that $\widehat{\text{Err}}_{\text{boot}}$ does not provide a good estimate in general. The reason is that the bootstrap datasets are acting as the training samples, while the original training set is acting as the test sample, and these two samples have observations in common. This overlap can make overfit predictions look unrealistically good, and is the reason that cross-validation explicitly uses non-overlapping data for the training and test samples.

Consider, for example, a 1NN classifier applied to a two-class classification problem with the same number of observations in each class, in which the predictors and class labels are in fact independent. Then the true error rate is 0.5. But the contribution of observation $i$ to the bootstrap estimate $\widehat{\text{Err}}_{\text{boot}}$ is zero whenever $i$ appears in bootstrap sample $b$, because its nearest neighbor in that sample is then itself. Only when observation $i$ is absent from sample $b$ does its contribution have the correct expectation 0.5. Now

\begin{align}
\text{Pr}\{ \text{observation } i \in \text{bootstrap sample } b \} &= 1-\left( 1-\frac{1}{N} \right)^N \\
&\approx 1 - e^{-1} \\
&= 0.632.
\end{align}

Hence the expectation of $\widehat{\text{Err}}_{\text{boot}}$ is about $0.5 \times (1-0.632) = 0.5 \times 0.368 = 0.184$, far below the correct error rate 0.5.
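The argument above can be checked with a small simulation. The sketch below is an illustration under stated assumptions: scalar inputs drawn uniformly at random (so ties have probability zero), $N = 40$, $B = 100$, fixed seeds, and the helper names `one_nn_predict` and `err_boot` are all hypothetical choices made here.

```python
import random

def one_nn_predict(train, x):
    """Label of the training point closest to x (1NN on a scalar input)."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def err_boot(data, B, rng):
    """Average 0-1 loss on the original data of 1NN rules fit to B bootstrap samples."""
    n = len(data)
    total = 0
    for _ in range(B):
        sample = rng.choices(data, k=n)  # bootstrap dataset Z*b
        total += sum(one_nn_predict(sample, x) != y for x, y in data)
    return total / (B * n)

rng = random.Random(0)
N = 40
# Two equal-size classes; the inputs are generated independently of the labels.
labels = [0] * (N // 2) + [1] * (N // 2)
data = [(rng.random(), y) for y in labels]

p_in = 1 - (1 - 1 / N) ** N  # Pr{observation i in bootstrap sample b}
estimate = err_boot(data, B=100, rng=rng)
print(p_in, estimate)
```

For a single simulated dataset the estimate fluctuates, but it should land far below the true error rate of 0.5, in line with the expectation of roughly 0.184.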

#### A better way: the leave-one-out bootstrap
By mimicking cross-validation, a better bootstrap estimate can be obtained. For each observation, we only keep track of predictions from bootstrap samples not containing that observation. The leave-one-out bootstrap estimate of prediction error is defined by

\begin{equation}
\widehat{\text{Err}}^{(1)} = \frac{1}{N} \sum_{i=1}^N \frac{1}{|C^{-i}|} \sum_{b\in C^{-i}} L\left( y_i, \hat{f}^{*b}(x_i) \right).
\end{equation}

Here $C^{-i}$ is the set of indices of the bootstrap samples $b$ that do _not_ contain observation $i$, and $|C^{-i}|$ is the number of such samples.

In computing $\widehat{\text{Err}}^{(1)}$, we either have to choose $B$ large enough to ensure that all of the $|C^{-i}|$ are greater than zero, or we can just leave out the terms in $\widehat{\text{Err}}^{(1)}$ corresponding to $|C^{-i}|$'s that are zero.
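A minimal sketch of $\widehat{\text{Err}}^{(1)}$ follows. It takes the second option described above and skips observations with $|C^{-i}| = 0$, averaging over the remaining observations (that normalization is an assumption of this sketch). The interface, in which `fit` takes a sample and returns a prediction function, and the names `loo_bootstrap_error`, `fit_1nn`, and `zero_one` are hypothetical.

```python
import random

def loo_bootstrap_error(data, fit, loss, B=100, seed=0):
    """Leave-one-out bootstrap estimate of prediction error."""
    rng = random.Random(seed)
    n = len(data)
    loss_sum = [0.0] * n  # summed loss for observation i over samples in C^{-i}
    count = [0] * n       # |C^{-i}|
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]  # indices of bootstrap sample b
        in_sample = set(idx)
        predict = fit([data[j] for j in idx])
        for i in range(n):
            if i not in in_sample:  # sample b belongs to C^{-i}
                x, y = data[i]
                loss_sum[i] += loss(y, predict(x))
                count[i] += 1
    # Skip observations that appeared in every bootstrap sample.
    terms = [s / c for s, c in zip(loss_sum, count) if c > 0]
    if not terms:
        raise ValueError("B is too small: every observation is in every sample")
    return sum(terms) / len(terms)

def fit_1nn(sample):
    """1NN on a scalar input: returns a function mapping x to a predicted label."""
    return lambda x: min(sample, key=lambda p: abs(p[0] - x))[1]

def zero_one(y, pred):
    return float(y != pred)

# Hypothetical data: two equal-size classes with inputs independent of the labels.
rng = random.Random(1)
data = [(rng.random(), y) for y in [0] * 20 + [1] * 20]
err1 = loo_bootstrap_error(data, fit_1nn, zero_one)
print(err1)
```

Because each observation is scored only by models that never saw it, the 1NN rule no longer gets credit for matching a point to itself.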

#### Training-set-size bias
The leave-one-out bootstrap solves the overfitting problem suffered by $\widehat{\text{Err}}_{\text{boot}}$, but has the training-set-size bias mentioned in the discussion of cross-validation. The average number of distinct observations in each bootstrap sample is about $0.632 \cdot N$, so its bias will roughly behave like that of twofold cross-validation. Thus if the learning curve has considerable slope at sample size $N/2$, the leave-one-out bootstrap will be biased upward as an estimate of the true error.

#### .632 estimator
The ".632 estimator" is designed to alleviate this bias.

It is defined by

\begin{equation}
\widehat{\text{Err}}^{(.632)} = 0.368 \cdot \overline{\text{err}} + 0.632 \cdot \widehat{\text{Err}}^{(1)}.
\end{equation}

The derivation of the .632 estimator is complex; intuitively, it pulls the leave-one-out bootstrap estimate down toward the training error rate $\overline{\text{err}}$, and hence reduces its upward bias. The constant 0.632 is the probability $\text{Pr}\{ \text{observation } i \in \text{bootstrap sample } b \} \approx 0.632$ computed above.
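The definition translates directly into code. In the sketch below, the function name `err_632` and the input values 0.10 and 0.30 are hypothetical and chosen only to illustrate the weighting.

```python
def err_632(err_bar, err1):
    """.632 estimator: weighted average of training error and leave-one-out bootstrap error."""
    return 0.368 * err_bar + 0.632 * err1

# Hypothetical inputs: training error 0.10, leave-one-out bootstrap error 0.30.
example = err_632(0.10, 0.30)
print(example)
```

Since the weights are positive and sum to one, the result always lies between $\overline{\text{err}}$ and $\widehat{\text{Err}}^{(1)}$.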

#### Example of the .632 estimator
The .632 estimator works well in "light fitting" situations, but can break down in overfit ones.

Here is an example due to Breiman et al. (1984). Suppose we have two equal-size classes, with the inputs independent of the class labels, and we apply a 1NN rule. Then

\begin{align}
\overline{\text{err}} &= 0, \\
\widehat{\text{Err}}^{(1)} &= 0.5, \quad \text{so} \\
\widehat{\text{Err}}^{(.632)} &= 0.368 \times 0 + 0.632 \times 0.5 = 0.316.
\end{align}

However, the true error rate is 0.5.

#### Improvement of the .632 estimator
One can improve the .632 estimator by taking into account the amount of overfitting.

First we define $\gamma$ to be the _no-information error rate_: the error rate of our prediction rule if the inputs and class labels were independent. An estimate of $\gamma$ is obtained by evaluating the prediction rule on all possible combinations of targets $y_i$ and predictors $x_{i'}$:

\begin{equation}
\hat\gamma = \frac{1}{N^2} \sum_{i=1}^N \sum_{i'=1}^N L \left( y_i, \hat{f}(x_{i'}) \right).
\end{equation}

For example, consider the dichotomous classification problem: Let
* $\hat{p}_1$ be the observed proportion of responses $y_i$ equaling 1, and
* $\hat{q}_1$ be the observed proportion of predictions $\hat{f}(x_{i'})$ equaling 1.

Then

\begin{equation}
\hat\gamma = \hat{p}_1 (1-\hat{q}_1) + (1-\hat{p}_1) \hat{q}_1.
\end{equation}

With a rule like 1NN for which $\hat{q}_1 = \hat{p}_1$ the value of $\hat\gamma$ is $2\hat{p}_1(1-\hat{p}_1)$. The multi-category generalization is

\begin{equation}
\hat\gamma = \sum_l \hat{p}_l (1-\hat{q}_l).
\end{equation}
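As a small check that the double sum and the two-class closed form agree under 0-1 loss, consider the sketch below. The function names `gamma_hat` and `gamma_hat_binary` and the toy label and prediction vectors are hypothetical.

```python
def gamma_hat(y, preds, loss):
    """No-information error rate: average loss over all N^2 pairs (y_i, f(x_i'))."""
    n = len(y)
    return sum(loss(yi, pj) for yi in y for pj in preds) / (n * n)

def gamma_hat_binary(y, preds):
    """Closed form for two classes: p1 * (1 - q1) + (1 - p1) * q1."""
    p1 = sum(1 for v in y if v == 1) / len(y)          # proportion of responses equal to 1
    q1 = sum(1 for v in preds if v == 1) / len(preds)  # proportion of predictions equal to 1
    return p1 * (1 - q1) + (1 - p1) * q1

def zero_one(a, b):
    return float(a != b)

# Toy data: 3 of 8 responses and 3 of 8 predictions equal 1, so p1 = q1 = 0.375.
y = [1, 1, 1, 0, 0, 0, 0, 0]
preds = [1, 0, 1, 1, 0, 0, 0, 0]

g_sum = gamma_hat(y, preds, zero_one)
g_closed = gamma_hat_binary(y, preds)
print(g_sum, g_closed)
```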

Using $\hat\gamma$, the _relative overfitting rate_ is defined to be

\begin{equation}
\hat{R} = \frac{\widehat{\text{Err}}^{(1)} - \overline{\text{err}}}{\hat\gamma - \overline{\text{err}}},
\end{equation}

a quantity that ranges from 0 if there is no overfitting (i.e., $\widehat{\text{Err}}^{(1)} = \overline{\text{err}}$) to 1 if the overfitting equals the no-information value $\hat\gamma - \overline{\text{err}}$.

Finally, we define the ".632+" estimator by

\begin{align}
\widehat{\text{Err}}^{(.632+)} &= (1- \hat{w}) \cdot \overline{\text{err}} + \hat{w} \cdot \widehat{\text{Err}}^{(1)} \\
\text{with } \hat{w} &= \frac{.632}{1-.368\hat{R}}.
\end{align}

The weight $\hat{w}$ ranges from .632 if $\hat{R} = 0$ to 1 if $\hat{R}=1$, so $\widehat{\text{Err}}^{(.632+)}$ ranges from $\widehat{\text{Err}}^{(.632)}$ to $\widehat{\text{Err}}^{(1)}$.

Again, the derivation of $\widehat{\text{Err}}^{(.632+)}$ is complicated; roughly speaking, it produces a compromise between the leave-one-out bootstrap and the training error rate that depends on the amount of overfitting.

For the 1NN problem with class labels independent of the inputs, $\hat{w} = \hat{R} = 1$, so $\widehat{\text{Err}}^{(.632+)} = \widehat{\text{Err}}^{(1)}$, which has the correct expectation of 0.5.

In other problems with less overfitting, $\widehat{\text{Err}}^{(.632+)}$ will lie somewhere between $\overline{\text{err}}$ and $\widehat{\text{Err}}^{(1)}$.
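The pieces above can be combined in a short sketch. The function name `err_632_plus` is hypothetical, and the two guards (clamping $\hat{R}$ to $[0, 1]$ and setting $\hat{R} = 0$ when $\hat\gamma \le \overline{\text{err}}$) are assumptions added here to keep the arithmetic well defined; they are not part of the formulas in this section.

```python
def err_632_plus(err_bar, err1, gamma):
    """Return (.632+ estimate, weight w, relative overfitting rate R)."""
    if gamma > err_bar:
        r = (err1 - err_bar) / (gamma - err_bar)
        r = min(max(r, 0.0), 1.0)  # guard added for this sketch: keep R in [0, 1]
    else:
        r = 0.0                    # guard added for this sketch: avoid division by zero
    w = 0.632 / (1 - 0.368 * r)
    return (1 - w) * err_bar + w * err1, w, r

# 1NN with labels independent of the inputs: err_bar = 0, Err1 = 0.5, gamma = 0.5.
est, w, r = err_632_plus(err_bar=0.0, err1=0.5, gamma=0.5)
print(est, w, r)

# No overfitting: Err1 equals err_bar, so R = 0 and w falls back to 0.632.
est0, w0, r0 = err_632_plus(err_bar=0.2, err1=0.2, gamma=0.5)
print(est0, w0, r0)
```

In the first call the weight reaches 1 and the estimate coincides with $\widehat{\text{Err}}^{(1)}$; in the second the weight is .632, so the estimate coincides with the .632 estimator.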