# $\S$ 7.6. The Effective Number of Parameters

The concept of "number of parameters" can be generalized, especially to models where regularization is used in the fitting.

Suppose we stack the outcomes $y_1,y_2,\ldots,y_N$ into a vector $\mathbf{y}$, and similarly for the predictions $\hat{\mathbf{y}}$. Then a linear fitting method is one for which we can write

\begin{equation}
\hat{\mathbf{y}} = \mathbf{Sy},
\end{equation}

where $\mathbf{S}$ is an $N\times N$ matrix depending on the input vectors $x_i$, but not on the $y_i$.

Linear fitting methods include
* linear regression on the original features or on a derived basis set, and
* smoothing methods that use quadratic shrinkage, such as ridge regression and cubic smoothing splines.

Then the _effective number of parameters_ (a.k.a. the _effective degrees of freedom_) is defined as

\begin{equation}
\text{df}(\mathbf{S}) = \text{trace}(\mathbf{S}).
\end{equation}

Note that if $\mathbf{S}$ is an orthogonal-projection matrix onto a basis set spanned by $M$ features, then

\begin{equation}
\text{trace}(\mathbf{S}) = M.
\end{equation}

It turns out that $\text{trace}(\mathbf{S})$ is exactly the correct quantity to replace $d$ as the number of parameters in the $C_p$ statistic.
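
As a quick numerical check of the two cases above, the sketch below builds $\mathbf{S}$ for ordinary least squares and for ridge regression and compares their traces. The data are simulated, and the names `X`, `lam`, `S_ls`, and `S_ridge` are illustrative assumptions rather than anything fixed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 30, 4                      # N observations, M features (arbitrary choices)
X = rng.normal(size=(N, M))       # simulated inputs, full column rank with probability 1

# Least squares: S = X (X^T X)^{-1} X^T is the orthogonal projection
# onto the column space of X, so its trace should equal M.
S_ls = X @ np.linalg.solve(X.T @ X, X.T)
df_ls = np.trace(S_ls)

# Ridge regression: S = X (X^T X + lam I)^{-1} X^T shrinks the fit,
# so its trace should fall strictly between 0 and M.
lam = 5.0                         # hypothetical penalty strength
S_ridge = X @ np.linalg.solve(X.T @ X + lam * np.eye(M), X.T)
df_ridge = np.trace(S_ridge)

print(f"df (least squares) = {df_ls:.4f}")
print(f"df (ridge, lam={lam}) = {df_ridge:.4f}")
```

Increasing `lam` should drive `df_ridge` toward zero, while `lam = 0` recovers the projection case.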

### For additive-error models

If $\mathbf{y}$ arises from an additive-error model

\begin{equation}
Y = f(X) + \epsilon
\end{equation}

with $\text{Var}(\epsilon) = \sigma_\epsilon^2$, then one can show that

\begin{equation}
\sum_{i=1}^N \text{Cov}(\hat{y}_i,y_i) = \text{trace}(\mathbf{S})\sigma_\epsilon^2,
\end{equation}

which motivates the more general definition (Exercises 7.4 and 7.5)

\begin{equation}
\text{df}(\hat{\mathbf{y}}) = \frac{\sum_{i=1}^N \text{Cov}(\hat{y}_i,y_i)}{\sigma_\epsilon^2}.
\end{equation}

$\S$ 5.4.1 on page 153 gives some more intuition for the definition $\text{df} = \text{trace}(\mathbf{S})$ in the context of smoothing splines.
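
The covariance identity can also be checked by simulation. The following is a minimal sketch, assuming a fixed design, Gaussian noise, a ridge smoother as the linear fitting method, and an arbitrary true function; all of these choices (and the names `f`, `sigma`, `reps`) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 40, 6
sigma, lam = 0.5, 2.0                     # noise s.d. and ridge penalty (hypothetical values)
X = rng.normal(size=(N, M))               # fixed inputs
f = np.sin(X[:, 0])                       # hypothetical true mean f(x_i)

# Ridge smoother matrix; depends on X only, not on y.
S = X @ np.linalg.solve(X.T @ X + lam * np.eye(M), X.T)

# Draw many response vectors y = f + eps and fit each one.
reps = 20000
Y = f + sigma * rng.normal(size=(reps, N))    # each row is one draw of y
Y_hat = Y @ S.T                               # each row is S y for that draw

# Sample estimate of sum_i Cov(y_hat_i, y_i), then divide by sigma^2.
cov_sum = np.sum(np.mean((Y - Y.mean(axis=0)) * (Y_hat - Y_hat.mean(axis=0)), axis=0))
df_mc = cov_sum / sigma**2
df_trace = np.trace(S)

print(f"Monte Carlo df = {df_mc:.3f}, trace(S) = {df_trace:.3f}")
```

The two numbers should agree up to Monte Carlo error. The covariance form is useful because it remains defined for fitting methods that cannot be written as $\hat{\mathbf{y}} = \mathbf{Sy}$.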

### For models like neural networks

In such models we minimize an error function $R(\omega)$ with a weight-decay penalty (regularization) $\alpha \sum_m \omega_m^2$, and the effective number of parameters has the form

\begin{equation}
\text{df}(\alpha) = \sum_{m=1}^M \frac{\theta_m}{\theta_m+\alpha},
\end{equation}

where the $\theta_m$ are the eigenvalues of the Hessian matrix $\frac{\partial^2 R(\omega)}{\partial\omega\partial\omega^T}$.

Expression $(7.34)$ for $\text{df}(\alpha)$ follows from the trace definition $(7.32)$, $\text{df}(\mathbf{S}) = \text{trace}(\mathbf{S})$, if we make a quadratic approximation to the error function at the solution (Bishop, 1995).
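
To illustrate, the sketch below evaluates $\text{df}(\alpha)$ from the Hessian eigenvalues in a case where the error function is exactly quadratic, so the result can be compared with $\text{trace}(\mathbf{S})$ directly. It assumes $R(\omega) = \tfrac{1}{2}\|\mathbf{y} - \mathbf{X}\omega\|^2$ and a penalty scaled as $\tfrac{\alpha}{2}\sum_m \omega_m^2$; conventions for the factor of 2 differ between references, and this scaling is an assumption chosen so that the formula holds as written. The helper `df_alpha` is a hypothetical name.

```python
import numpy as np

def df_alpha(hessian, alpha):
    """Effective number of parameters from the eigenvalues of a symmetric Hessian."""
    theta = np.linalg.eigvalsh(hessian)       # eigenvalues theta_m
    return float(np.sum(theta / (theta + alpha)))

rng = np.random.default_rng(0)
N, M = 50, 5
X = rng.normal(size=(N, M))

# Assumed error function R(w) = 0.5 * ||y - X w||^2, whose Hessian is X^T X.
H = X.T @ X
alpha = 3.0                                   # hypothetical weight-decay strength

# With penalty (alpha / 2) * ||w||^2 the minimizer is w = (H + alpha I)^{-1} X^T y,
# so the fit is linear in y with smoother matrix S below.
S = X @ np.linalg.solve(H + alpha * np.eye(M), X.T)

df_eig = df_alpha(H, alpha)
df_trace = float(np.trace(S))
print(f"eigenvalue formula: {df_eig:.6f}, trace(S): {df_trace:.6f}")
```

Each term $\theta_m/(\theta_m+\alpha)$ lies between 0 and 1: directions with $\theta_m \gg \alpha$ count as nearly a full parameter, while directions with $\theta_m \ll \alpha$ contribute almost nothing.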