# $\S$ 7.2. Bias, Variance, and Model Complexity

FIGURE 7.1 illustrates the important issue in assessing the ability of a learning method to generalize.

In [2]:
"""FIGURE 7.1. Behavior of test sample and training sample error
as the model complexity is varied.

The light blue curves show the training error, while
the light red curves show the conditional test error
for 100 training sets of size 50 each, as the model complexity is increased.

The solid curves show the expected test error and the expected training error.

The lasso was used to produce the sequence of fits."""
print('Under construction ...')

Under construction ...


### Basics
Consider first the case of a quantitative or interval scale response. We have
* a target variable $Y$,
* a vector of inputs $X$, and
* a prediction model $\hat{f}(X)$ that has been estimated from a training set $\mathcal{T}$.

The loss function for measuring errors between $Y$ and $\hat{f}(X)$ is denoted by $L(Y, \hat{f}(X))$. Typical choices are

\begin{equation}
L(Y,\hat{f}(X)) = \begin{cases}
(Y-\hat{f}(X))^2 & \text{squared error,} \\
|Y-\hat{f}(X)| & \text{absolute error.}
\end{cases}
\end{equation}
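As a small illustration of the two losses, here is a minimal sketch in Python. The function names and the target/prediction pairs are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def squared_error(y, f_hat):
    """Squared-error loss (y - f_hat)^2."""
    return (y - f_hat) ** 2


def absolute_error(y, f_hat):
    """Absolute-error loss |y - f_hat|."""
    return abs(y - f_hat)


# Hypothetical (target, prediction) pairs, for illustration only.
pairs = [(3.0, 2.5), (1.0, 1.0), (0.0, 2.0)]

sq_losses = [squared_error(y, f) for y, f in pairs]
abs_losses = [absolute_error(y, f) for y, f in pairs]

print(sq_losses)   # [0.25, 0.0, 4.0]
print(abs_losses)  # [0.5, 0.0, 2.0]
```

Note how an error of 2 costs 4 under squared error but only 2 under absolute error: squared error penalizes large deviations more heavily.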

_Test error_, also referred to as _generalization error_, is the prediction error over an independent test sample

\begin{equation}
\text{Err}_\mathcal{T} = \text{E}\left[ L(Y,\hat{f}(X)) \mid \mathcal{T} \right]
\end{equation}

where both $X$ and $Y$ are drawn randomly from their joint distribution (population). Here the training set $\mathcal{T}$ is fixed, and test error refers to the error for this specific training set. A related quantity is the expected prediction error (or expected test error)

\begin{equation}
\text{Err} = \text{E}\left[ L(Y,\hat{f}(X)) \right] = \text{E}\left[ \text{Err}_\mathcal{T} \right]
\end{equation}

Note that this expectation averages over everything that is random, including the randomness in the training set that produces $\hat{f}$.

FIGURE 7.1 shows the prediction error (light red curves) $\text{Err}_\mathcal{T}$ for 100 simulated training sets each of size 50. The lasso ($\S$ 3.4.2) was used to produce the sequence of fits. The solid red curve is the average, and hence an estimate of $\text{Err}$.
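The relation between $\text{Err}_\mathcal{T}$ and $\text{Err}$ can be checked by simulation. The following is a minimal sketch, not the code behind FIGURE 7.1: it assumes a hypothetical linear truth with Gaussian noise and fits a one-variable least-squares line instead of the lasso. The names `true_f`, `draw_sample`, `fit_line`, and `conditional_test_error` are illustrative.

```python
import random
import statistics


def true_f(x):
    """Assumed true regression function (hypothetical)."""
    return 2.0 * x


def draw_sample(n, rng, noise_sd=1.0):
    """Draw n pairs (x, y) from the assumed joint distribution."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x; returns the fitted function."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def conditional_test_error(f_hat, rng, n_test=2000):
    """Monte Carlo estimate of Err_T: squared-error loss of a fixed f_hat
    averaged over a large independent test sample."""
    xs, ys = draw_sample(n_test, rng)
    return statistics.fmean((y - f_hat(x)) ** 2 for x, y in zip(xs, ys))


rng = random.Random(0)

# One conditional test error per simulated training set of size 50.
err_T = []
for _ in range(100):
    train_xs, train_ys = draw_sample(50, rng)
    err_T.append(conditional_test_error(fit_line(train_xs, train_ys), rng))

# Averaging over training sets gives an estimate of Err.
err = statistics.fmean(err_T)
print(f"Err_T ranges from {min(err_T):.3f} to {max(err_T):.3f}")
print(f"estimate of Err: {err:.3f}")
```

Each entry of `err_T` plays the role of one light red curve evaluated at a single complexity level, and `err` plays the role of the solid red curve at that level.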

### Possible to estimate $\text{Err}$?

Estimation of $\text{Err}_\mathcal{T}$ will be our goal, although we will see that $\text{Err}$ is more amenable to statistical analysis, and most methods effectively estimate the expected error. It does not seem possible to estimate conditional error effectively, given only the information in the same training set. Some discussion of this point is given in $\S$ 7.12.

### Model complexity

_Training error_ is the average loss over the training sample

\begin{equation}
\overline{\text{err}} = \frac1N \sum_{i=1}^N L(y_i,\hat{f}(x_i)).
\end{equation}

We would like to know the expected test error of our estimated model $\hat{f}$. As the model becomes more and more complex, it uses the training data more and is able to adapt to more complicated underlying structures. Hence there is a decrease in bias but an increase in variance.

There is some intermediate model complexity that gives minimum expected test error.

Unfortunately training error is not a good estimate of the test error, as seen in FIGURE 7.1. Training error consistently decreases with model complexity. However, a model with zero training error is overfit to the training data and will typically generalize poorly.
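The gap between training and test error can be illustrated with a small simulation. This is a sketch under assumptions that are not in the text above: it uses $k$-nearest-neighbor regression in one dimension rather than the lasso, with complexity increasing as $k$ decreases, and a hypothetical sinusoidal truth with Gaussian noise.

```python
import math
import random
import statistics


def true_f(x):
    """Assumed true regression function (hypothetical)."""
    return math.sin(4.0 * x)


def draw_sample(n, rng, noise_sd=0.5):
    """Draw n pairs (x, y) from the assumed joint distribution."""
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def knn_predict(x0, train_xs, train_ys, k):
    """Average the responses of the k training points closest to x0."""
    nearest = sorted(range(len(train_xs)), key=lambda i: abs(train_xs[i] - x0))[:k]
    return statistics.fmean(train_ys[i] for i in nearest)


def mean_squared_error(xs, ys, train_xs, train_ys, k):
    """Average squared-error loss of the k-NN fit on the points (xs, ys)."""
    return statistics.fmean(
        (y - knn_predict(x, train_xs, train_ys, k)) ** 2 for x, y in zip(xs, ys)
    )


rng = random.Random(1)
train_xs, train_ys = draw_sample(50, rng)
test_xs, test_ys = draw_sample(1000, rng)

# Smaller k means a more flexible (more complex) fit.
results = {}
for k in (50, 25, 10, 5, 2, 1):
    train_err = mean_squared_error(train_xs, train_ys, train_xs, train_ys, k)
    test_err = mean_squared_error(test_xs, test_ys, train_xs, train_ys, k)
    results[k] = (train_err, test_err)
    print(f"k={k:2d}  training error={train_err:.3f}  test error={test_err:.3f}")
```

At $k=1$ each training point is its own nearest neighbor, so the training error is exactly zero while the test error is not.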

### A similar story for categorical responses

Let $G$ be a categorical response taking one of $K$ values in a set $\mathcal{G} = \{1,2,\cdots,K\}$.

Typically we model the probabilities

\begin{equation}
p_k(X) = \text{Pr}(G=k|X),
\end{equation}

or some monotone transformation $f_k(X)$, and then

\begin{equation}
\hat{G}(X) = \arg\max_k \hat{p}_k(X).
\end{equation}

In some cases, such as 1NN classification (Chapters 2 and 13), we produce $\hat{G}(X)$ directly.

#### Loss
Typical loss functions are

\begin{align}
L(G, \hat{G}(X)) &= I(G \neq \hat{G}(X)) &\text{ (0-1 loss)}, \\
L(G, \hat{p}(X)) &= -2\sum_{k=1}^K I(G=k) \log \hat{p}_k(X) \\
&= -2\log \hat{p}_G(X) &\text{ (}-2 \times \text{log-likelihood)}.
\end{align}

The quantity $-2\times\text{log-likelihood}$ is sometimes referred to as the _deviance_.
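Both losses are easy to compute for a single observation. The sketch below assumes estimated class probabilities stored in a dictionary; the function names and the probability values are hypothetical.

```python
import math


def zero_one_loss(g, g_hat):
    """0-1 loss: 1 if the predicted class differs from the true class."""
    return int(g != g_hat)


def deviance_loss(g, p_hat):
    """-2 * log of the estimated probability of the true class g."""
    return -2.0 * math.log(p_hat[g])


# Hypothetical estimated class probabilities for one input, K = 3 classes.
p_hat = {1: 0.7, 2: 0.2, 3: 0.1}

# Classify to the class with the largest estimated probability.
g_hat = max(p_hat, key=p_hat.get)

for g in (1, 2, 3):
    print(f"true class {g}: 0-1 loss = {zero_one_loss(g, g_hat)}, "
          f"deviance = {deviance_loss(g, p_hat):.3f}")
```

The 0-1 loss only records whether the prediction was right, while the deviance also grows as the probability assigned to the true class shrinks.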

#### Test error
Again, test error here is

\begin{equation}
\text{Err}_{\mathcal{T}} = \text{E}\left[ L(G, \hat{G}(X)) \mid \mathcal{T} \right],
\end{equation}

the population misclassification error of the classifier trained on $\mathcal{T}$, and $\text{Err}$ is the expected misclassification error.

#### Training error
Training error is the sample analogue, e.g.,

\begin{equation}
\overline{\text{err}} = -\frac2N \sum_{i=1}^N \log \hat{p}_{g_i}(x_i),
\end{equation}

that is, $-2/N$ times the sample log-likelihood for the model.

#### Log-likelihood as a loss
The log-likelihood can be used as a loss function for general response densities, such as the Poisson, gamma, exponential, log-normal and others. If $\text{Pr}_{\theta(X)}(Y)$ is the density of $Y$, indexed by a parameter $\theta(X)$ that depends on the predictor $X$, then

\begin{equation}
L(Y,\theta(X)) = -2 \log \text{Pr}_{\theta(X)}(Y).
\end{equation}

The "-2" in the definition makes the log-likelihood loss for the Gaussian distribution match squared-error loss.
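To see why, here is a short derivation under the assumption (not stated above) that $Y$ given $X$ is Gaussian with mean $\theta(X)$ and known variance $\sigma^2$:

\begin{equation}
-2 \log \text{Pr}_{\theta(X)}(Y) = -2 \log \left[ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(Y-\theta(X))^2}{2\sigma^2} \right) \right] = \frac{(Y-\theta(X))^2}{\sigma^2} + \log(2\pi\sigma^2).
\end{equation}

With $\sigma^2 = 1$ this is the squared error $(Y-\theta(X))^2$ plus a constant that does not depend on $\theta(X)$, so minimizing either loss gives the same fit.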

### Simplified assumptions & notations

> For ease of exposition, for the remainder of this chapter we will use $Y$ and $f(X)$ to represent all of the above situations, since we focus mainly on the quantitative response (squared-error loss) setting. For the other situations, the appropriate translations are obvious.

In this chapter we describe a number of methods for estimating the expected test error for a model. Typically our model will have a tuning parameter or parameters $\alpha$, and so we can write our predictions as $\hat{f}_\alpha(x)$. The tuning parameter varies the complexity of our model, and we wish to find the value of $\alpha$ that minimizes error, i.e., produces the minimum of the average test error curve in FIGURE 7.1. Having said this, for brevity we will often suppress the dependence of $\hat{f}(x)$ on $\alpha$.

### Model selection and model assessment

It is important to note that there are in fact two separate goals that we might have in mind:

* __Model selection__: estimating the performance of different models in order to choose the best one.
* __Model assessment__: having chosen a final model, estimating its prediction error (generalization error) on new data.

### Data set for training, validation, and testing

If we are in a data-rich situation, the best approach for both problems is to randomly divide the dataset into three parts: a training set, a validation set, and a test set.
* The training set is used to fit the models;
* the validation set is used to estimate prediction error for model selection;
* the test set is used for assessment of the generalization error of the final chosen model.

Ideally, the test set should be kept in a "vault", and be brought out only at the end of the data analysis. Suppose instead that we use the test set repeatedly, choosing the model with smallest test-set error. Then the test-set error of the final chosen model will underestimate the true test error, sometimes substantially.

It is difficult to give a general rule on how to choose the number of observations in each of the three parts, as this depends on the signal-to-noise ratio (SNR) in the data and the training sample size. A typical split might be 50% for training, and 25% each for validation and testing.
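A random 50/25/25 split can be sketched as follows. The helper `three_way_split` and its parameter names are hypothetical, and the fractions are the ones mentioned above rather than a recommendation.

```python
import random


def three_way_split(items, rng, train_frac=0.5, val_frac=0.25):
    """Randomly partition items into training, validation, and test parts."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = int(train_frac * len(shuffled))
    n_val = int(val_frac * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


rng = random.Random(0)

# Indices 0..199 stand in for 200 observations.
train, val, test = three_way_split(range(200), rng)
print(len(train), len(val), len(test))  # 100 50 50
```

Splitting indices rather than the data itself keeps the three parts disjoint and makes it easy to hold the test indices aside until the end of the analysis.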

### The major topic from now on: What if data is insufficient?

The methods in this chapter are designed for situations where there is insufficient data to split into three parts. Again it is too difficult to give a general rule on how much training data is enough; among other things, this depends on the SNR of the underlying function, and the complexity of the models being fit to the data.

The methods of this chapter approximate the validation step either
* analytically (AIC, BIC, MDL, SRM) or
* by efficient sample re-use (cross validation and the bootstrap).

Besides their use in model selection, we also examine to what extent each method provides a reliable estimate of test error of the final chosen model.

Before jumping into these topics, we first explore in more detail the nature of test error and the bias-variance tradeoff.