# $\S$ 7.5. Estimates of In-Sample Prediction Error

The general form of the in-sample estimates is

\begin{equation}
\widehat{\text{Err}}_\text{in} = \overline{\text{err}} + \hat\omega,
\end{equation}

where $\hat\omega$ is an estimate of the average optimism.

### $C_p$

The expression (a simplification that assumes a linear fit under an additive error model)

\begin{equation}
\text{E}_\mathbf{y}(\text{Err}_\text{in}) = \text{E}_\mathbf{y}(\overline{\text{err}}) + 2\cdot\frac{d}{N} \sigma_\epsilon^2,
\end{equation}

applicable when $d$ parameters are fit under squared error loss, leads to a version of the so-called $C_p$ statistic,

\begin{equation}
C_p = \overline{\text{err}} + 2\cdot \frac{d}{N} \hat\sigma_\epsilon^2.
\end{equation}

Here $\hat\sigma_\epsilon^2$ is an estimate of the noise variance, obtained from the mean-squared error of a low-bias model.

Using this criterion we adjust the training error by adding a penalty proportional to the number of basis functions used.
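As a minimal sketch of the $C_p$ formula above, the following computes $C_p$ for an ordinary least-squares fit on simulated data. The function name `cp_statistic`, the simulated design, and the choice to estimate $\hat\sigma_\epsilon^2$ from the full model with an $N - d$ denominator are all assumptions made for illustration, not something prescribed by the text.

```python
import numpy as np

def cp_statistic(X, y, sigma2_hat):
    """C_p for a least-squares fit of y on the N x d matrix X (hypothetical helper)."""
    N, d = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    err_bar = np.mean((y - X @ beta) ** 2)       # training error
    return err_bar + 2.0 * d / N * sigma2_hat    # training error plus optimism estimate

# Simulated data (assumption): intercept plus 10 inputs, only the first 3 inputs matter.
rng = np.random.default_rng(0)
N, p = 200, 10
X_full = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = X_full[:, :4] @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=N)

# Noise variance from a low-bias model: here the full model (assumption).
beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
rss_full = np.sum((y - X_full @ beta_full) ** 2)
sigma2_hat = rss_full / (N - X_full.shape[1])

cp_small = cp_statistic(X_full[:, :1], y, sigma2_hat)  # intercept only, d = 1
cp_true = cp_statistic(X_full[:, :4], y, sigma2_hat)   # intercept + 3 inputs, d = 4
cp_full = cp_statistic(X_full, y, sigma2_hat)          # all columns, d = 11
print(cp_small, cp_true, cp_full)
```

Note that the same $\hat\sigma_\epsilon^2$ is reused for every candidate model, so only $\overline{\text{err}}$ and $d$ change from one model to the next.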

### $\text{AIC}$

The _Akaike information criterion_ is a similar but more generally applicable estimate of $\text{Err}_\text{in}$ when a log-likelihood loss function is used. It relies on a relationship that holds asymptotically as $N \rightarrow \infty$:

\begin{equation}
-2\cdot \text{E}\left[ \log \text{Pr}_{\hat\theta}(Y) \right] \approx -\frac{2}{N} \cdot \text{E}\left[\text{loglik}\right] + 2\cdot \frac{d}{N}.
\end{equation}

Here $\text{Pr}_\theta(Y)$ is a family of densities for $Y$ (containing the "true" density), $\hat\theta$ is the maximum-likelihood estimate of $\theta$, and "loglik" is the maximized log-likelihood:

\begin{equation}
\text{loglik} = \sum_{i=1}^N \log\text{Pr}_{\hat\theta} (y_i).
\end{equation}

For example, for the logistic regression model, using the binomial log-likelihood, we have

\begin{equation}
\text{AIC} = -\frac{2}{N}\cdot\text{loglik} + 2\cdot\frac{d}{N}.
\end{equation}
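The logistic regression case can be sketched numerically as follows. This is only an illustration under stated assumptions: the helpers `fit_logistic` and `logistic_aic` are hypothetical names, the maximum-likelihood fit uses a plain Newton-Raphson iteration with a fixed number of steps, and the data are simulated so that only the first input carries signal.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Maximum-likelihood logistic regression via Newton-Raphson (hypothetical helper)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-X @ theta))
        grad = X.T @ (y - prob)                            # score vector
        hess = X.T @ (X * (prob * (1.0 - prob))[:, None])  # observed information
        theta = theta + np.linalg.solve(hess, grad)
    return theta

def logistic_aic(X, y):
    """AIC = -(2/N) loglik + 2 d/N with the binomial log-likelihood."""
    N, d = X.shape
    theta = fit_logistic(X, y)
    prob = 1.0 / (1.0 + np.exp(-X @ theta))
    loglik = np.sum(y * np.log(prob) + (1.0 - y) * np.log(1.0 - prob))
    return -2.0 / N * loglik + 2.0 * d / N

# Simulated data (assumption): intercept plus 3 inputs, only the first input matters.
rng = np.random.default_rng(4)
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=(N, 3))])
true_prob = 1.0 / (1.0 + np.exp(-1.5 * X[:, 1]))
y = (rng.uniform(size=N) < true_prob).astype(float)

aic_by_d = {d: logistic_aic(X[:, :d], y) for d in range(1, 5)}
print(aic_by_d)
```

The dictionary `aic_by_d` maps the number of fitted parameters $d$ to the corresponding $\text{AIC}$ value, so candidate models can be compared directly.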

For the Gaussian model (with variance $\sigma_\epsilon^2=\hat\sigma_\epsilon^2$ assumed known), the $\text{AIC}$ statistic is equivalent to $C_p$, and so we refer to them collectively as $\text{AIC}$.
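The equivalence can be checked with a short derivation. This is a sketch: $\hat f(x_i)$ denotes the fitted values, a symbol introduced here for illustration. With Gaussian errors of known variance $\hat\sigma_\epsilon^2$, the maximized log-likelihood is

\begin{equation}
\text{loglik} = -\frac{N}{2}\log\left(2\pi\hat\sigma_\epsilon^2\right) - \frac{1}{2\hat\sigma_\epsilon^2}\sum_{i=1}^N \left(y_i - \hat f(x_i)\right)^2 = -\frac{N}{2}\log\left(2\pi\hat\sigma_\epsilon^2\right) - \frac{N}{2\hat\sigma_\epsilon^2}\,\overline{\text{err}},
\end{equation}

so that

\begin{equation}
-\frac{2}{N}\cdot\text{loglik} + 2\cdot\frac{d}{N} = \log\left(2\pi\hat\sigma_\epsilon^2\right) + \frac{1}{\hat\sigma_\epsilon^2}\left(\overline{\text{err}} + 2\cdot\frac{d}{N}\hat\sigma_\epsilon^2\right) = \log\left(2\pi\hat\sigma_\epsilon^2\right) + \frac{C_p}{\hat\sigma_\epsilon^2}.
\end{equation}

The additive constant and the positive scale factor $1/\hat\sigma_\epsilon^2$ do not depend on the model, so the two criteria rank any set of candidate models identically.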

To use $\text{AIC}$ for model selection, we simply choose the model giving smallest $\text{AIC}$ over the set of models considered. For nonlinear and other complex models, we need to replace $d$ by some measure of model complexity. We discuss this in $\S$ 7.6.

### $\text{AIC}$ for more general models

Given a set of models $f_\alpha(x)$ indexed by a tuning parameter $\alpha$, denote
* the training error by $\overline{\text{err}}(\alpha)$ and
* the number of parameters by $d(\alpha)$

for each model. Then for this set of models we define

\begin{equation}
\text{AIC}(\alpha) = \overline{\text{err}}(\alpha) + 2\cdot\frac{d(\alpha)}{N}\hat\sigma_\epsilon^2.
\end{equation}

The function $\text{AIC}(\alpha)$ provides an estimate of the test error curve, and we find the tuning parameter $\hat\alpha$ that minimizes it. Our final chosen model is $f_{\hat\alpha}(x)$.
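A minimal sketch of this selection procedure, assuming a family of polynomial fits where the tuning parameter $\alpha$ is the polynomial degree and $d(\alpha) = \alpha + 1$. The helper `aic_curve`, the simulated data, and the use of the most flexible model to estimate $\hat\sigma_\epsilon^2$ are illustrative assumptions.

```python
import numpy as np

def aic_curve(x, y, degrees, sigma2_hat):
    """AIC(alpha) for polynomial fits of each degree in `degrees` (hypothetical helper)."""
    N = len(y)
    values = []
    for deg in degrees:
        B = np.vander(x, deg + 1, increasing=True)   # basis 1, x, ..., x^deg
        theta, *_ = np.linalg.lstsq(B, y, rcond=None)
        err_bar = np.mean((y - B @ theta) ** 2)      # training error err(alpha)
        values.append(err_bar + 2.0 * (deg + 1) / N * sigma2_hat)
    return np.array(values)

# Simulated data (assumption): a smooth signal plus Gaussian noise.
rng = np.random.default_rng(1)
N = 100
x = rng.uniform(-1.0, 1.0, size=N)
y = np.sin(3.0 * x) + 0.3 * rng.normal(size=N)
degrees = np.arange(0, 11)

# Noise variance from the most flexible (low-bias) model in the family.
B_big = np.vander(x, degrees[-1] + 1, increasing=True)
theta_big, *_ = np.linalg.lstsq(B_big, y, rcond=None)
sigma2_hat = np.sum((y - B_big @ theta_big) ** 2) / (N - B_big.shape[1])

aic = aic_curve(x, y, degrees, sigma2_hat)
alpha_hat = degrees[np.argmin(aic)]   # tuning parameter minimizing AIC(alpha)
print(alpha_hat)
```

The array `aic` plays the role of the estimated test error curve, and `alpha_hat` indexes the chosen model $f_{\hat\alpha}(x)$.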

Note that if the basis functions are chosen adaptively, the following identity no longer holds:

\begin{equation}
\sum_{i=1}^N \text{Cov}(\hat{y}_i, y_i) = d\sigma_\epsilon^2
\end{equation}

For example, if we have a total of $p$ inputs, and we choose the best-fitting linear model with $d<p$ inputs, the optimism will exceed $(2d/N)\sigma_\epsilon^2$. Put another way, by choosing the best-fitting model with $d$ inputs, the _effective number of parameters_ fit is more than $d$.
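The effect of adaptive selection can be illustrated with a small Monte Carlo experiment. This is a sketch under simplifying assumptions that are not in the text: the true regression function is zero (pure noise), the design has $p = 10$ orthonormal columns, no intercept is fit, and we compare always using the first input ($d = 1$, fixed) with choosing the single best-fitting input ($d = 1$, adaptive). The quantity estimated is $\sum_i \text{Cov}(\hat{y}_i, y_i)/\sigma_\epsilon^2$.

```python
import numpy as np

rng = np.random.default_rng(2)
N, p, sigma, n_rep = 50, 10, 1.0, 2000

# Fixed design with orthonormal columns (assumption, keeps the algebra simple).
X, _ = np.linalg.qr(rng.normal(size=(N, p)))

# Each row of Y is one simulated response vector; the true signal is zero.
Y = sigma * rng.normal(size=(n_rep, N))

# With unit-norm columns, the single-input least-squares coefficient is x_j^T y.
coefs = Y @ X                                   # shape (n_rep, p)

# Fixed choice: always regress on input 0.
Y_hat_fixed = coefs[:, [0]] * X[:, 0]

# Adaptive choice: pick the input with the smallest residual sum of squares,
# which for unit-norm columns is the largest squared coefficient.
best = np.argmax(coefs ** 2, axis=1)
Y_hat_best = coefs[np.arange(n_rep), best][:, None] * X[:, best].T

def effective_df(Y_hat, Y, sigma):
    """Monte Carlo estimate of sum_i Cov(yhat_i, y_i) / sigma^2."""
    cov_i = np.mean((Y_hat - Y_hat.mean(axis=0)) * (Y - Y.mean(axis=0)), axis=0)
    return cov_i.sum() / sigma ** 2

df_fixed = effective_df(Y_hat_fixed, Y, sigma)
df_best = effective_df(Y_hat_best, Y, sigma)
print(df_fixed, df_best)
```

In this setup `df_fixed` should come out close to $d = 1$, while `df_best` should be noticeably larger, which is the sense in which the effective number of parameters exceeds $d$ after selection.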

FIGURE 7.4 shows $\text{AIC}$ in action for the phoneme recognition example of $\S$ 5.2.3 on page 148. Briefly speaking,

* The input vector is the log-periodogram of the spoken vowel, quantized to 256 uniformly spaced frequencies.
* A linear logistic regression model is used to predict the phoneme class, with coefficient function

  \begin{equation}
  \beta(f) = \sum_{m=1}^M h_m(f)\theta_m,
  \end{equation}

  an expansion in $M$ spline basis functions.
* For any given $M$, a basis of natural cubic splines is used for the $h_m$, with knots chosen uniformly over the range of frequencies (so $d(\alpha) = d(M) = M$).

Using $\text{AIC}$ to select the number of basis functions will approximately minimize $\text{Err}(M)$ for both entropy and 0-1 loss.

In [1]:
"""FIGURE 7.4."""
print('Under construction ...')

Under construction ...


The simple formula

\begin{equation}
\frac{2}{N}\sum_{i=1}^N \text{Cov}(\hat{y}_i, y_i) = \frac{2d}{N}\sigma_\epsilon^2
\end{equation}

holds exactly for linear models with additive errors and squared error loss, and approximately for linear models and log-likelihoods.
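For a linear fit $\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$ with uncorrelated errors of variance $\sigma_\epsilon^2$, we have $\text{Cov}(\hat{y}_i, y_i) = \sigma_\epsilon^2 H_{ii}$, so the sum of covariances equals $\sigma_\epsilon^2\,\text{trace}(\mathbf{H})$. The sketch below checks numerically that the trace equals $d$ for a least-squares projection. The random design and the values of `N`, `d`, and `sigma2` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
N, d, sigma2 = 40, 6, 2.0

# Arbitrary full-rank design matrix (assumption).
X = rng.normal(size=(N, d))

# Hat matrix H = X (X^T X)^{-1} X^T of the least-squares fit.
H = X @ np.linalg.solve(X.T @ X, X.T)

# Sum of Cov(yhat_i, y_i) when Cov(y) = sigma2 * I.
cov_sum = sigma2 * np.trace(H)
print(np.trace(H), cov_sum, d * sigma2)
```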

In particular, the formula does not hold in general for 0-1 loss (Efron, 1986), although many authors nevertheless use it in that context (right panel of FIGURE 7.4).