## $\S$ 5.5.2. The Bias-Variance Tradeoff

FIGURE 5.9 shows the effect of the choice of $\text{df}_\lambda$ when using a smoothing spline on a simple example:

\begin{align}
Y &= f(X) + \epsilon \\
f(X) &= \frac{\sin(12(X+0.2))}{X+0.2},
\end{align}

with

* $X\sim U[0,1]$,
* $\epsilon\sim N(0,1)$.

Our training sample consists of $N=100$ pairs $(x_i, y_i)$, drawn independently from this model.
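The sampling scheme above can be sketched in a few lines of NumPy. The helper names `f_true` and `simulate` and the seed are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def f_true(x):
    """True regression function f(X) = sin(12 (X + 0.2)) / (X + 0.2)."""
    return np.sin(12.0 * (x + 0.2)) / (x + 0.2)

def simulate(n=100, seed=0):
    """Draw n pairs (x_i, y_i) with X ~ U[0, 1] and eps ~ N(0, 1)."""
    rng = np.random.default_rng(seed)            # seed is an arbitrary choice
    x = np.sort(rng.uniform(0.0, 1.0, size=n))   # sorted only for convenience
    y = f_true(x) + rng.normal(0.0, 1.0, size=n)
    return x, y

x, y = simulate()
```

Sorting the $x_i$ does not change the model; it only makes later spline computations and plots easier.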

In [2]:
"""FIGURE 5.9. CV results and fitted splines for three different df's.
"""
print('Under construction ...')

Under construction ...


### Computing bias and variance

The yellow shaded region in the figure represents the pointwise standard error of $\hat{f}_\lambda$, i.e., we have shaded the region between

\begin{equation}
\hat{f}_\lambda(x) \pm 2 \cdot \text{se}(\hat{f}_\lambda(x)).
\end{equation}

Since $\hat{\mathbf{f}} = \mathbf{S}_\lambda \mathbf{y}$ and $\text{Cov}(\mathbf{y}) = \mathbf{I}$ (the errors are independent with unit variance),

\begin{align}
\text{Cov}(\hat{\mathbf{f}}) &= \mathbf{S}_\lambda \text{Cov}(\mathbf{y}) \mathbf{S}_\lambda^T \\
&= \mathbf{S}_\lambda \mathbf{S}_\lambda^T.
\end{align}

The diagonal contains the pointwise variances at the training $x_i$. The bias is given by

\begin{align}
\text{Bias}(\hat{\mathbf{f}}) &= \mathbf{f} - \text{E}(\hat{\mathbf{f}}) \\
&= \mathbf{f} - \mathbf{S}_\lambda \mathbf{f},
\end{align}

where $\mathbf{f}$ is the (unknown) vector of evaluations of the true $f$ at the training $X$'s.

The expectations and variances are taken with respect to repeated draws of samples of size $N=100$ from the model above. In a similar fashion $\text{Var}(\hat{f}_\lambda(x_0))$ and $\text{Bias}(\hat{f}_\lambda(x_0))$ can be computed at any point $x_0$ (Exercise 5.10).
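A minimal NumPy sketch of the covariance and bias formulas at the training points follows. It assumes the smoother matrix in the Reinsch form $\mathbf{S}_\lambda = (\mathbf{I} + \lambda\mathbf{K})^{-1}$, with the natural cubic spline penalty matrix $\mathbf{K} = \mathbf{Q}\mathbf{R}^{-1}\mathbf{Q}^T$ built from the knot spacings. The helper names (`penalty_matrix`, `smoother_for_df`), the seed, and the bisection range for $\log_{10}\lambda$ are illustrative choices. Only the $x_i$ and the true $f$ are needed here; no responses are drawn.

```python
import numpy as np

def f_true(x):
    """True regression function from the model above."""
    return np.sin(12.0 * (x + 0.2)) / (x + 0.2)

def penalty_matrix(x):
    """K = Q R^{-1} Q^T for a natural cubic spline with knots at sorted, distinct x."""
    n = len(x)
    h = np.diff(x)                       # knot spacings
    Q = np.zeros((n, n - 2))
    R = np.zeros((n - 2, n - 2))
    for j in range(n - 2):
        Q[j, j] = 1.0 / h[j]
        Q[j + 1, j] = -1.0 / h[j] - 1.0 / h[j + 1]
        Q[j + 2, j] = 1.0 / h[j + 1]
        R[j, j] = (h[j] + h[j + 1]) / 3.0
        if j < n - 3:
            R[j, j + 1] = R[j + 1, j] = h[j + 1] / 6.0
    K = Q @ np.linalg.solve(R, Q.T)
    return (K + K.T) / 2.0               # symmetrize against round-off

def smoother_for_df(K, target_df, n_iter=100):
    """Return (S, lam) with S = (I + lam K)^{-1} and trace(S) = target_df."""
    d, U = np.linalg.eigh(K)
    d = np.clip(d, 0.0, None)            # drop tiny negative round-off eigenvalues
    lo, hi = -15.0, 4.0                  # assumed bracket for log10(lambda)
    for _ in range(n_iter):              # bisection: df decreases as lambda grows
        mid = 0.5 * (lo + hi)
        df = np.sum(1.0 / (1.0 + 10.0 ** mid * d))
        if df > target_df:
            lo = mid
        else:
            hi = mid
    lam = 10.0 ** (0.5 * (lo + hi))
    S = (U / (1.0 + lam * d)) @ U.T      # U diag(1 / (1 + lam d)) U^T
    return S, lam

rng = np.random.default_rng(0)           # arbitrary seed
x = np.sort(rng.uniform(0.0, 1.0, size=100))
f = f_true(x)

S, lam = smoother_for_df(penalty_matrix(x), target_df=9.0)
var_f = np.diag(S @ S.T)                 # pointwise variances, using Cov(y) = I
se_f = np.sqrt(var_f)                    # pointwise standard errors
bias_f = f - S @ f                       # bias at the training points
```

The band $\hat{f}_\lambda(x_i) \pm 2\,\text{se}(\hat{f}_\lambda(x_i))$ would then be drawn from `se_f` around a fit `S @ y` for one realization of `y`. Closely spaced $x_i$ make $\mathbf{K}$ badly scaled, so this dense construction is meant for illustration rather than production use.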

### Visual interpretation of bias-variance tradeoff

The three fits displayed in FIGURE 5.9 give a visual demonstration of the bias-variance tradeoff associated with selecting the smoothing parameter.

* $\text{df}_\lambda = 5$: The spline underfits, and clearly _trims down the hills and fills in the valleys_. This leads to a bias that is most dramatic in regions of high curvature. The standard error band is very narrow, so we estimate a badly biased version of the true function with great reliability!
* $\text{df}_\lambda = 9$: Here the fitted function is close to the true function, although a slight amount of bias seems evident. The variance has not increased appreciably.
* $\text{df}_\lambda = 15$: The fitted function is somewhat wiggly, but close to the true function. The wiggliness also accounts for the increased width of the standard error bands -- the curve is starting to follow some individual points too closely.

Note that in these figures we are seeing a single realization of data and hence fitted spline $\hat{f}$ in each case, while the bias involves an expectation $\text{E}(\hat{f})$.

The middle curve seems "just right", in that it has achieved a good compromise between bias and variance.

The integrated squared prediction error ($\text{EPE}$) combines both bias and variance in a single summary:

\begin{align}
\text{EPE}(\hat{f}_\lambda) &= \text{E}\left( Y - \hat{f}_\lambda(X) \right)^2 \\
&= \text{Var}(Y) + \text{E}\left( \text{Bias}^2(\hat{f}_\lambda(X)) + \text{Var}(\hat{f}_\lambda(X)) \right) \\
&= \sigma^2 + \text{MSE}(\hat{f}_\lambda).
\end{align}

Note that this is averaged both over the training sample (giving rise to $\hat{f}_\lambda$), and the values of the (independently chosen) prediction points $(X,Y)$.
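The first step of this decomposition can be made explicit. Write $Y = f(X) + \epsilon$, where the new error $\epsilon$ has mean zero and variance $\sigma^2$ (here $\sigma^2 = 1$) and is independent of both $X$ and the training sample. Then

\begin{align}
\text{E}\left( Y - \hat{f}_\lambda(X) \right)^2 &= \text{E}\left( \epsilon + f(X) - \hat{f}_\lambda(X) \right)^2 \\
&= \text{E}(\epsilon^2) + 2\,\text{E}\left( \epsilon \left( f(X) - \hat{f}_\lambda(X) \right) \right) + \text{E}\left( f(X) - \hat{f}_\lambda(X) \right)^2 \\
&= \sigma^2 + \text{E}\left( f(X) - \hat{f}_\lambda(X) \right)^2.
\end{align}

The cross term vanishes because $\epsilon$ has mean zero and is independent of $f(X) - \hat{f}_\lambda(X)$. Conditioning on $X = x$ and averaging over training samples then splits $\text{E}\left( f(x) - \hat{f}_\lambda(x) \right)^2$ into $\text{Bias}^2(\hat{f}_\lambda(x)) + \text{Var}(\hat{f}_\lambda(x))$.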

$\text{EPE}$ is a natural quantity of interest, and does create a tradeoff between bias and variance. The test error rate (blue points) in the top left panel of FIGURE 5.9 suggests that $\text{df}_\lambda = 9$ is spot on!

### Estimation of EPE

Since we don't know the true function, we do not have access to $\text{EPE}$, and need an estimate. This topic is discussed in some detail in Chapter 7, and techniques such as $K$-fold cross-validation, $\text{GCV}$ and $C_p$ are all in common use. In FIGURE 5.9 we include the $N$-fold (leave-one-out) cross-validation curve:

\begin{align}
\text{CV}(\hat{f}_\lambda) &= \frac{1}{N} \sum_{i=1}^N \left( y_i - \hat{f}_\lambda^{(-i)}(x_i)\right)^2 \\
&= \frac{1}{N} \sum_{i=1}^N \left( \frac{y_i - \hat{f}_\lambda(x_i)}{1 - S_\lambda(i,i)} \right)^2,
\end{align}

which can (remarkably) be computed for each value of $\lambda$ from the original fitted values and the diagonal elements $S_\lambda(i,i)$ of $\mathbf{S}_\lambda$ (Exercise 5.13).
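The shortcut formula can be sketched as follows: one eigendecomposition of the penalty matrix gives the fitted values and the diagonal of $\mathbf{S}_\lambda$ for every $\lambda$ on a grid, with no refitting. This is a sketch under assumptions, not the notebook's own code: it uses the Reinsch form $\mathbf{S}_\lambda = (\mathbf{I} + \lambda\mathbf{K})^{-1}$, and the helper names, the seed, and the $\lambda$ grid are illustrative. It relies on the identity above rather than checking it against a brute-force leave-one-out loop.

```python
import numpy as np

def f_true(x):
    """True regression function from the model above."""
    return np.sin(12.0 * (x + 0.2)) / (x + 0.2)

def penalty_matrix(x):
    """K = Q R^{-1} Q^T for a natural cubic spline with knots at sorted, distinct x."""
    n = len(x)
    h = np.diff(x)
    Q = np.zeros((n, n - 2))
    R = np.zeros((n - 2, n - 2))
    for j in range(n - 2):
        Q[j, j] = 1.0 / h[j]
        Q[j + 1, j] = -1.0 / h[j] - 1.0 / h[j + 1]
        Q[j + 2, j] = 1.0 / h[j + 1]
        R[j, j] = (h[j] + h[j + 1]) / 3.0
        if j < n - 3:
            R[j, j + 1] = R[j + 1, j] = h[j + 1] / 6.0
    K = Q @ np.linalg.solve(R, Q.T)
    return (K + K.T) / 2.0               # symmetrize against round-off

rng = np.random.default_rng(0)           # arbitrary seed
x = np.sort(rng.uniform(0.0, 1.0, size=100))
y = f_true(x) + rng.normal(0.0, 1.0, size=100)

# Eigendecomposition K = U diag(d) U^T, computed once for all lambda.
d, U = np.linalg.eigh(penalty_matrix(x))
d = np.clip(d, 0.0, None)                # drop tiny negative round-off eigenvalues

def loocv(lam):
    """Return (CV, df) for one lambda using only full-sample quantities."""
    w = 1.0 / (1.0 + lam * d)            # eigenvalues of S_lambda
    fitted = U @ (w * (U.T @ y))         # S_lambda y
    diag_S = (U ** 2) @ w                # S_lambda(i, i)
    cv_value = np.mean(((y - fitted) / (1.0 - diag_S)) ** 2)
    return cv_value, w.sum()             # df_lambda = trace(S_lambda)

lams = np.logspace(-6, -2, 40)           # assumed grid of smoothing parameters
results = np.array([loocv(lam) for lam in lams])
cv, df = results[:, 0], results[:, 1]
best = int(np.argmin(cv))                # grid index minimizing the CV curve
```

Plotting `cv` against `df` gives a curve of the kind described for the top left panel; `df[best]` is the CV-selected degrees of freedom for this one realization and will vary with the seed and the grid.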

The $\text{EPE}$ and $\text{CV}$ curves have a similar shape, but the entire $\text{CV}$ curve is above the $\text{EPE}$ curve. For some realizations this is reversed, and overall the $\text{CV}$ curve is approximately unbiased as an estimate of the $\text{EPE}$ curve.