# $\S$ 7.3. The Bias-Variance Decomposition

As in Chapter 2, if we assume that

\begin{equation}
Y = f(X) + \epsilon,
\end{equation}

where $\text{E}(\epsilon) = 0$ and $\text{Var}(\epsilon) = \sigma_\epsilon^2$, we can derive an expression for the expected prediction error of a regression fit $\hat{f}(X)$ at an input point $X=x_0$, using squared-error loss:

\begin{align}
\text{Err}(x_0) &= \text{E}\left[ \left( Y - \hat{f}(x_0) \right)^2 \mid X=x_0 \right] \\
&= \sigma_\epsilon^2 + \left( \text{E}\hat{f}(x_0) - f(x_0) \right)^2 + \text{E}\left[ \hat{f}(x_0) - \text{E}\hat{f}(x_0) \right]^2\\
&= \sigma_\epsilon^2 + \text{Bias}^2\left(\hat{f}(x_0)\right) + \text{Var}\left(\hat{f}(x_0)\right) \\
&= \text{Irreducible Error} + \text{Bias}^2 + \text{Variance}.
\end{align}

* The first term is the variance of the target around its true mean $f(x_0)$, and cannot be avoided no matter how well we estimate $f(x_0)$, unless $\sigma_\epsilon^2 = 0$.
* The second term is the squared bias, the amount by which the average of our estimate differs from the true mean.
* The last term is the variance, the expected squared deviation of $\hat{f}(x_0)$ around its mean.
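
The step from the first to the second line of the decomposition can be made explicit. The following is a sketch, assuming (as the model implicitly does) that the new noise $\epsilon$ at $x_0$ is independent of the training sample that produced $\hat{f}$. Split the prediction error into three pieces:

\begin{equation}
Y - \hat{f}(x_0) = \epsilon + \left( f(x_0) - \text{E}\hat{f}(x_0) \right) + \left( \text{E}\hat{f}(x_0) - \hat{f}(x_0) \right).
\end{equation}

Squaring and taking expectations, all three cross terms vanish: $\epsilon$ has mean zero and is independent of $\hat{f}(x_0)$, the middle piece is a constant, and the last piece has mean zero. What remains is

\begin{equation}
\text{E}\left[ \left( Y - \hat{f}(x_0) \right)^2 \mid X=x_0 \right] = \text{E}\,\epsilon^2 + \left( f(x_0) - \text{E}\hat{f}(x_0) \right)^2 + \text{E}\left[ \hat{f}(x_0) - \text{E}\hat{f}(x_0) \right]^2,
\end{equation}

with $\text{E}\,\epsilon^2 = \sigma_\epsilon^2$.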

Typically the more complex we make the model $\hat{f}$, the lower the (squared) bias but the higher the variance.

### Case 1: kNN fit

For the kNN regression fit, these expressions have the simple form

\begin{align}
\text{Err}(x_0) &= \text{E}\left[ \left( Y - \hat{f}_k(x_0) \right)^2 \mid X=x_0 \right] \\
&= \sigma_\epsilon^2 + \left( f(x_0) - \frac1k \sum_{l=1}^k f(x_{(l)}) \right)^2 + \frac{\sigma_\epsilon^2}k.
\end{align}

Here we assume for simplicity that training inputs $x_i$ are fixed, and the randomness arises from the $y_i$.

The number of neighbors $k$ is inversely related to the model complexity: for small $k$, the estimate $\hat{f}_k(x)$ can potentially adapt itself better to the underlying $f(x)$. As we increase $k$, the squared bias (the squared difference between $f(x_0)$ and the average of $f(x)$ at the $k$ nearest neighbors) will typically increase, while the variance decreases.
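
The kNN expressions can be checked numerically. The sketch below is illustrative only: the sine target `f`, the sample size, the noise level, and the query point `x0` are all assumptions chosen for the example, not part of the text. It evaluates the closed-form squared bias and variance with the inputs held fixed, and compares them with a Monte Carlo estimate obtained by redrawing only the responses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup (not from the text): a 1-D sine target with fixed inputs.
def f(x):
    return np.sin(2 * np.pi * x)

N, sigma, x0 = 50, 0.5, 0.3
x = np.sort(rng.uniform(0, 1, N))  # training inputs, held fixed throughout

def knn_bias2_var(k):
    """Closed-form squared bias and variance of the kNN fit at x0."""
    nearest = np.argsort(np.abs(x - x0))[:k]
    bias2 = (f(x0) - f(x[nearest]).mean()) ** 2
    var = sigma ** 2 / k
    return bias2, var

def knn_monte_carlo(k, reps=20000):
    """Redraw the neighbors' responses many times and estimate both terms."""
    nearest = np.argsort(np.abs(x - x0))[:k]
    y = f(x[nearest]) + sigma * rng.normal(size=(reps, k))
    fits = y.mean(axis=1)  # kNN prediction at x0 for each simulated sample
    return (fits.mean() - f(x0)) ** 2, fits.var()

for k in (1, 5, 20):
    b2, v = knn_bias2_var(k)
    mb2, mv = knn_monte_carlo(k)
    print(f"k={k:2d}  bias^2={b2:.4f} (MC {mb2:.4f})  var={v:.4f} (MC {mv:.4f})")
```

Only the responses of the $k$ nearest neighbors are simulated, because with fixed inputs the other training responses do not enter the prediction at `x0`.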

### Case 2: linear model fit

For a linear model fit

\begin{equation}
\hat{f}_p(x) = x^T \hat\beta,
\end{equation}

where the parameter vector $\beta$ with $p$ components is fit by least squares, we have

\begin{align}
\text{Err}(x_0) &= \text{E}\left[ \left( Y - \hat{f}_p(x_0) \right)^2 \mid X=x_0 \right] \\
&= \sigma_\epsilon^2 + \left( f(x_0) - \text{E}\hat{f}_p(x_0) \right)^2 + \|\mathbf{h}(x_0)\|^2 \sigma_\epsilon^2.
\end{align}

Here $\mathbf{h}(x_0) = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} x_0$, the $N$-vector of linear weights that produce the fit

\begin{equation}
\hat{f}_p(x_0) = \mathbf{h}(x_0)^T \mathbf{y} = x_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y},
\end{equation}

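so that $\text{Var}\left(\hat{f}_p(x_0)\right) = \|\mathbf{h}(x_0)\|^2 \sigma_\epsilon^2$. While this variance changes with $x_0$, its average over the training inputs $x_i$ is $(p/N)\sigma_\epsilon^2$, and hence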

\begin{equation}
\frac1N \sum_{i=1}^N \text{Err}(x_i) = \sigma_\epsilon^2 + \frac1N \sum_{i=1}^N \left( f(x_i) - \text{E}\hat{f}(x_i) \right)^2 + \frac{p}N \sigma_\epsilon^2,
\end{equation}

the _in-sample_ error. Here model complexity is directly related to the number of parameters $p$.
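
The $(p/N)\sigma_\epsilon^2$ variance term follows from the fact that the weights $\mathbf{h}(x_i)$ at the training inputs are the columns of the hat matrix $\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$, whose trace is $p$. A minimal numerical check, assuming a randomly generated full-rank design matrix (the sizes and the noise level are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

N, p, sigma = 40, 5, 1.0           # hypothetical sizes and noise level
X = rng.normal(size=(N, p))        # fixed design matrix, assumed full column rank

# H = X (X^T X)^{-1} X^T is symmetric, so row i of H is h(x_i)^T.
H = X @ np.linalg.solve(X.T @ X, X.T)

# Variance of the fit at each training input: ||h(x_i)||^2 * sigma^2.
var_at_xi = (H ** 2).sum(axis=1) * sigma ** 2

avg_var = var_at_xi.mean()
print(avg_var, p / N * sigma ** 2)  # the two values should agree
```

The individual variances in `var_at_xi` differ from point to point; only their average is tied to $p/N$.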

### Case 2-1: Ridge regression fit

The test error for a ridge regression fit $\hat{f}_\alpha(x)$ has the same form as that of the least squares fit,

\begin{align}
\text{Err}(x_0) &= \text{E}\left[ \left( Y - \hat{f}_\alpha(x_0) \right)^2 \mid X=x_0 \right] \\
&= \sigma_\epsilon^2 + \left( f(x_0) - \text{E}\hat{f}_\alpha(x_0) \right)^2 + \|\mathbf{h}(x_0)\|^2 \sigma_\epsilon^2,
\end{align}

except the linear weights in the variance term are different:

\begin{equation}
\mathbf{h}(x_0) = \mathbf{X}(\mathbf{X}^T\mathbf{X} + \alpha\mathbf{I})^{-1} x_0.
\end{equation}

The bias term will also be different.
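
To see how the ridge weights change the variance term, the sketch below computes $\|\mathbf{h}(x_0)\|^2$ for several values of $\alpha$. The design matrix, the responses `y`, the query point, and the grid of penalties are all hypothetical values chosen for illustration; $\alpha = 0$ recovers the least squares weights.

```python
import numpy as np

rng = np.random.default_rng(2)

N, p = 40, 5                       # hypothetical problem size
X = rng.normal(size=(N, p))        # fixed design matrix
y = rng.normal(size=N)             # arbitrary responses, used only to check the fit
x0 = rng.normal(size=p)            # query point

def weights(x0, alpha):
    """h(x0) = X (X^T X + alpha I)^{-1} x0; alpha = 0 gives least squares."""
    return X @ np.linalg.solve(X.T @ X + alpha * np.eye(p), x0)

alphas = [0.0, 1.0, 10.0, 100.0]
norms = [float(weights(x0, a) @ weights(x0, a)) for a in alphas]

for a, n2 in zip(alphas, norms):
    print(f"alpha={a:6.1f}  ||h(x0)||^2={n2:.4f}")
```

In this setup the squared norm shrinks as $\alpha$ grows, so the variance term $\|\mathbf{h}(x_0)\|^2 \sigma_\epsilon^2$ shrinks with it.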

For a linear model family such as ridge regression, we can break down the bias more finely.

Let $\beta_*$ denote the parameters of the best-fitting linear approximation to $f$:

\begin{equation}
\beta_* = \arg\min_\beta \text{E}\left[ f(X) - X^T\beta \right]^2,
\end{equation}

where the expectation is taken w.r.t. the distribution of the input variable $X$. Then we can write the average squared bias as

\begin{align}
\text{E}_{x_0} \left[ f(x_0) - \text{E}\hat{f}_\alpha(x_0) \right]^2 &= \text{E}_{x_0} \left[ f(x_0) - x_0^T\beta_* \right]^2 + \text{E}_{x_0} \left[ x_0^T\beta_* - \text{E} x_0^T\hat\beta_\alpha \right]^2 \\
&= \text{Ave}\left[\text{Model Bias}\right]^2 + \text{Ave}\left[\text{Estimation Bias}\right]^2.
\end{align}

* The first term is the average squared _model bias_, the error between the best-fitting linear approximation and the true function.
* The second term is the average squared _estimation bias_, the error between the average estimate $\text{E}(x_0^T\hat\beta_\alpha)$ and the best-fitting linear approximation.

### Case 2: Linear model fit, more on bias

* For linear models fit by ordinary least squares, the estimation bias is zero.
* For restricted fits, such as ridge regression, it is positive, and we trade it off against the benefit of a reduced variance.
* The model bias can only be reduced by enlarging the class of linear models to a richer collection of models, by including interactions and transformations of the variables in the model.
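
The split into model bias and estimation bias can be illustrated numerically. The sketch below makes two simplifying assumptions that are not in the text: the expectation over $x_0$ is taken over the empirical distribution of the training inputs, and the nonlinear function `f` is a made-up example. With fixed inputs, $\text{E}\hat\beta_\alpha = (\mathbf{X}^T\mathbf{X} + \alpha\mathbf{I})^{-1}\mathbf{X}^T\mathbf{f}$, where $\mathbf{f}$ is the vector of true means at the training inputs.

```python
import numpy as np

rng = np.random.default_rng(3)

N, p = 200, 3
X = rng.normal(size=(N, p))
# Hypothetical nonlinear truth, so that the model bias is nonzero.
f = X[:, 0] + 0.5 * X[:, 1] ** 2 + np.sin(X[:, 2])

def expected_coef(alpha):
    """E[beta_hat_alpha] for fixed X, using E[y] = f."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ f)

# Best linear approximation under the empirical input distribution.
beta_star = expected_coef(0.0)

def bias_terms(alpha):
    """Average squared total, model, and estimation bias over the inputs."""
    mean_fit = X @ expected_coef(alpha)
    total = np.mean((f - mean_fit) ** 2)
    model = np.mean((f - X @ beta_star) ** 2)
    estimation = np.mean((X @ beta_star - mean_fit) ** 2)
    return total, model, estimation

for alpha in (0.0, 50.0):
    total, model, estimation = bias_terms(alpha)
    print(f"alpha={alpha:5.1f}  total={total:.4f}  model={model:.4f}  estimation={estimation:.4f}")
```

In this setup the estimation bias is zero at $\alpha = 0$ and positive for $\alpha > 0$, while the model bias does not depend on $\alpha$ at all.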

### Review with schematic figure

Figure 7.2 shows the bias-variance tradeoff schematically.

![The model space is the set of all possible predictions from the model, with the "closest" fit labeled with a black dot. The model bias from the truth is shown, along with the  variance, indicated by the large yellow circle centered at the black dot labeled "closest fit in population". A shrunken or regularized fit is also shown, having additional estimation bias, but smaller prediction error due to its decreased variance.](./fig7-2.jpg)

In the case of linear models,

* the model space is the set of all linear predictions from $p$ inputs, and
* the black dot labeled "closest fit" is $x^T\beta_*$.

The blue-shaded region indicates the error $\sigma_\epsilon$ with which we see the truth in the training sample.

Also shown is the variance of the least squares fit, indicated by the large yellow circle centered at the black dot labeled "closest fit in population".

Now if we were to fit a model with fewer predictors, or regularize the coefficients by shrinking them toward zero (say), we would get the "shrunken fit" shown in the figure. This fit has an additional estimation bias, due to the fact that it is not the closest fit in the model space. On the other hand, it has smaller variance.

If the decrease in variance exceeds the increase in (squared) bias, then this is worthwhile.
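
A small numerical illustration of this tradeoff, using ridge regression as the shrunken fit. Everything in the setup is an assumption made for the example: a truly linear $f$ with weak coefficients, a random fixed design, unit noise variance, and errors averaged over the training inputs (the in-sample setting).

```python
import numpy as np

rng = np.random.default_rng(4)

N, p, sigma = 30, 10, 1.0          # hypothetical sizes and noise level
X = rng.normal(size=(N, p))        # fixed design matrix
beta = np.full(p, 0.1)             # hypothetical weak linear signal
f = X @ beta                       # true means at the training inputs

def in_sample_terms(alpha):
    """Average squared bias and variance of the ridge fit over the inputs."""
    # S maps y to fitted values: S = X (X^T X + alpha I)^{-1} X^T.
    S = X @ np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T)
    bias2 = np.mean((f - S @ f) ** 2)
    var = sigma ** 2 * np.sum(S ** 2) / N
    return bias2, var

for alpha in (0.0, 1.0, 10.0, 100.0):
    b2, v = in_sample_terms(alpha)
    print(f"alpha={alpha:6.1f}  bias^2={b2:.4f}  var={v:.4f}  bias^2+var={b2 + v:.4f}")
```

In this setup, moving from $\alpha = 0$ to $\alpha = 10$ raises the squared bias slightly but lowers the variance by more, so the reducible part of the error goes down. With a strong signal or a very large penalty the balance can tip the other way.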