# 3 Linear Models for Regression

The focus so far in this book has been on unsupervised learning, including topics such as density estimation and data clustering. We turn now to a discussion of supervised learning, starting with regression. The goal of regression is to predict the value of one or more continuous target variables $t$ given the value of a D-dimensional vector $\mathbf{x}$ of input variables. We have already encountered an example of a regression problem when we considered polynomial curve fitting in Chapter 1. The polynomial is a specific example of a broad class of functions called linear regression models, which share the property of being linear functions of the adjustable parameters, and which will form the focus of this chapter. The simplest linear regression models are also linear functions of the input variables. However, we can obtain a much more useful class of functions by taking linear combinations of a fixed set of nonlinear functions of the input variables, known as basis functions. Such models are linear functions of the parameters, which gives them simple analytical properties, and yet can be nonlinear with respect to the input variables.

Given a training data set comprising $N$ observations $\{\mathbf{x}_n\}$, where $n=1,\dots,N$, together with corresponding target values $\{t_n\}$, the goal is to predict the value of $t$ for a new value of $\mathbf{x}$. In the simplest approach, this can be done by directly constructing an appropriate function $y\left(\mathbf{x}\right)$ whose values for new inputs $\mathbf{x}$ constitute the predictions for the corresponding values of $t$. More generally, from a probabilistic perspective, we aim to model the predictive distribution $p\left(t|\mathbf{x}\right)$ because this expresses our uncertainty about the value of $t$ for each value of $\mathbf{x}$. From this conditional distribution we can make predictions of $t$, for any new value of $\mathbf{x}$, in such a way as to minimize the expected value of a suitably chosen loss function. A common choice of loss function for real-valued variables is the squared loss, for which the optimal solution is given by the conditional expectation of $t$.

Although linear models have significant limitations as practical techniques for pattern recognition, particularly for problems involving input spaces of high dimensionality, they have nice analytical properties and form the foundation for more sophisticated models to be discussed in later chapters.

## 3.1 Linear Basis Function Models

The simplest linear model for regression is one that involves a linear combination of the input variables
$$y\left(\mathbf{x},\mathbf{w}\right)=w_0+w_1x_1+\dots+w_Dx_D$$
where $\mathbf{x}=\left(x_1,\dots,x_D\right)^\top$. This is often simply known as linear regression. The key property of this model is that it is a linear function of the parameters $w_0,\dots,w_D$. It is also, however, a linear function of the input variables $x_i$, and this imposes significant limitations on the model.

We therefore extend the class of models by considering linear combinations of fixed nonlinear functions of the input variables, of the form
$$y\left(\mathbf{x},\mathbf{w}\right)=w_0+\sum_{j=1}^{M-1}w_j\phi_j\left(\mathbf{x}\right)$$
where $\phi_j\left(\mathbf{x}\right)$ are known as basis functions. By denoting the maximum value of the index $j$ by $M-1$, the total number of parameters in this model will be $M$.

The parameter $w_0$ allows for any fixed offset in the data and is sometimes called a bias parameter. It is often convenient to define an additional dummy 'basis function' $\phi_0\left(\mathbf{x}\right)=1$ so that
$$y\left(\mathbf{x},\mathbf{w}\right)=\sum_{j=0}^{M-1}w_j\phi_j\left(\mathbf{x}\right)=\mathbf{w}^\top\boldsymbol{\phi}\left(\mathbf{x}\right)$$
where $\mathbf{w}=\left(w_0,\dots,w_{M-1}\right)^\top$ and $\boldsymbol{\phi}=\left(\phi_0,\dots,\phi_{M-1}\right)^\top$. In many practical applications of pattern recognition, we will apply some form of fixed pre-processing, or feature extraction, to the original data variables. If the original variables comprise the vector $\mathbf{x}$, then the features can be expressed in terms of the basis functions $\{\phi_j\left(\mathbf{x}\right)\}$. 

By using nonlinear basis functions, we allow the function $y\left(\mathbf{x},\mathbf{w}\right)$ to be a nonlinear function of the input vector $\mathbf{x}$. Functions of this form are called linear models, however, because they are linear in $\mathbf{w}$. It is this linearity in the parameters that will greatly simplify the analysis of this class of models. However, it also leads to some significant limitations.

Polynomial regression is a particular example of this model in which there is a single input variable $x$, and the basis functions take the form of powers of $x$ so that $\phi_j\left(x\right)=x^j$. One limitation of polynomial basis functions is that they are global functions of the input variable, so that changes in one region of input space affect all other regions. This can be resolved by dividing the input space up into regions and fitting a different polynomial in each region, leading to spline functions.

There are many other possible choices for the basis functions, for example
$$\phi_j\left(x\right)=\exp\left\{-\frac{\left(x-\mu_j\right)^2}{2s^2}\right\}$$
where the $\mu_j$ govern the locations of the basis functions in input space, and the parameter $s$ governs their spatial scale. These are usually referred to as 'Gaussian' basis functions, although it should be noted that they are not required to have a probabilistic interpretation, and in particular the normalization coefficient is unimportant because these basis functions will be multiplied by adaptive parameters $w_j$.

Another possibility is the sigmoidal basis function of the form
$$\phi_j\left(x\right)=\sigma\left(\frac{x-\mu_j}{s}\right)$$
where $\sigma\left(a\right)$ is the logistic sigmoid function defined by
$$\sigma\left(a\right)=\frac{1}{1+\exp\left(-a\right)}.$$

Equivalently, we can use the 'tanh' function because this is related to the logistic sigmoid by $\tanh\left(a\right)=2\sigma\left(2a\right)-1$, and so a general linear combination of logistic sigmoid functions is equivalent to a general linear combination of 'tanh' functions.
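As a minimal sketch, the polynomial, Gaussian, and sigmoidal basis functions described above can be evaluated for a scalar input as follows. The function names, the centres $\mu_j$, and the scale $s$ are illustrative assumptions rather than values taken from the text. The final lines check numerically that the 'tanh' function and the logistic sigmoid are related by $\tanh\left(a\right)=2\sigma\left(2a\right)-1$.

```python
import math

def polynomial_basis(x, j):
    """phi_j(x) = x**j."""
    return x ** j

def gaussian_basis(x, mu, s):
    """phi_j(x) = exp(-(x - mu)^2 / (2 s^2)); deliberately unnormalized."""
    return math.exp(-((x - mu) ** 2) / (2.0 * s ** 2))

def logistic_sigmoid(a):
    """sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

def sigmoidal_basis(x, mu, s):
    """phi_j(x) = sigma((x - mu) / s)."""
    return logistic_sigmoid((x - mu) / s)

def feature_vector(x, centres, s):
    """phi(x): the dummy basis function phi_0(x) = 1, then Gaussian basis functions."""
    return [1.0] + [gaussian_basis(x, mu, s) for mu in centres]

centres = [0.0, 0.5, 1.0]            # assumed locations mu_j
phi = feature_vector(0.5, centres, s=0.2)

# Largest deviation from tanh(a) = 2*sigma(2a) - 1 over a few test points.
tanh_gap = max(
    abs(math.tanh(a) - (2.0 * logistic_sigmoid(2.0 * a) - 1.0))
    for a in (-2.0, -0.3, 0.0, 0.7, 3.0)
)
```

Because each basis function is multiplied by its own adaptive weight $w_j$, leaving out the Gaussian normalization constant changes only the scale of the fitted weights, not the set of functions the model can represent.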

Yet another possible choice of basis function is the Fourier basis, which leads to an expansion in sinusoidal functions. Each basis function represents a specific frequency and has infinite spatial extent. By contrast, basis functions that are localized to finite regions of input space necessarily comprise a spectrum of different spatial frequencies. In many signal processing applications, it is of interest to consider basis functions that are localized in both space and frequency, leading to a class of functions known as wavelets. These are also defined to be mutually orthogonal, to simplify their application. Wavelets are most applicable when the input values live on a regular lattice, such as the successive time points in a temporal sequence, or the pixels in an image. 

Most of the discussion in this chapter, however, is independent of the particular choice of basis function set, and so for most of our discussion we shall not specify the particular form of the basis functions, except for the purposes of numerical illustration. Indeed, much of our discussion will be equally applicable to the situation in which the vector $\phi\left(\mathbf{x}\right)$ of basis functions is simply the identity $\phi\left(\mathbf{x}\right)=\mathbf{x}$. Furthermore, in order to keep the notation simple, we shall focus on the case of a single target variable $t$. However, in Section 3.1.5 we consider briefly the modifications needed to deal with multiple target variables.

### 3.1.1 Maximum likelihood and least squares

The sum-of-squares error function can be motivated as the maximum likelihood solution under an assumed Gaussian noise model.

As before, we assume that the target variable $t$ is given by a deterministic function $y\left(\mathbf{x},\mathbf{w}\right)$ with additive Gaussian noise so that
$$t=y\left(\mathbf{x},\mathbf{w}\right)+\epsilon$$
where $\epsilon$ is a zero mean Gaussian random variable with precision (inverse variance) $\beta$.

Thus we can write 
$$p\left(t|\mathbf{x},\mathbf{w},\beta\right)=\mathcal{N}\left(t|y\left(\mathbf{x},\mathbf{w}\right),\beta^{-1}\right).$$

If we assume a squared loss function, then the optimal prediction, for a new value of $\mathbf{x}$, will be given by the conditional mean of the target variable. In the case of a Gaussian conditional distribution of the form above, the conditional mean will be simply
$$\mathbb{E}\left[t|\mathbf{x}\right]=\int tp\left(t|\mathbf{x}\right)\mathrm{d}t=y\left(\mathbf{x},\mathbf{w}\right).$$

Note that the Gaussian noise assumption implies that the conditional distribution of $t$ given $\mathbf{x}$ is unimodal, which may be inappropriate for some applications. An extension to mixtures of conditional Gaussian distributions, which permit multimodal conditional distributions, will be discussed in Section 14.5.1.

Now consider a data set of inputs $\mathbf{X}=\{\mathbf{x}_1,\dots,\mathbf{x}_N\}$ with corresponding target values $t_1,\dots,t_N$. We group the target variables $\{t_n\}$ into a column vector that we denote by $\mathsf{t}$. Making the assumption that these data points are drawn independently from the conditional distribution above, we obtain the following expression for the likelihood function, which is a function of the adjustable parameters $\mathbf{w}$ and $\beta$, in the form
$$p\left(\mathsf{t}|\mathbf{X},\mathbf{w},\beta\right)=\prod_{n=1}^N\mathcal{N}\left(t_n|\mathbf{w}^\top\phi\left(\mathbf{x}_n\right),\beta^{-1}\right)$$

Note that in supervised learning problems such as regression (and classification), we are not seeking to model the distribution of the input variables. Thus $\mathbf{x}$ will always appear in the set of conditioning variables, and so from now on we will drop the explicit $\mathbf{x}$ from expressions such as $p\left(\mathsf{t}|\mathbf{x},\mathbf{w},\beta\right)$ in order to keep the notation uncluttered.

Taking the logarithm of the likelihood function, and making use of the standard form for the univariate Gaussian, we have
$$\ln p\left(\mathsf{t}|\mathbf{w},\beta\right)=\sum_{n=1}^N\ln\mathcal{N}\left(t_n|\mathbf{w}^\top\phi\left(\mathbf{x}_n\right),\beta^{-1}\right) \\
=\frac{N}{2}\ln\beta-\frac{N}{2}\ln\left(2\pi\right)-\beta E_D\left(\mathbf{w}\right)$$
where the sum-of-squares error function is defined by
$$E_D\left(\mathbf{w}\right)=\frac{1}{2}\sum_{n=1}^N\{t_n-\mathbf{w}^\top\phi\left(\mathbf{x}_n\right)\}^2.$$
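The following sketch checks, on a small made-up data set, that summing the individual Gaussian log densities gives the same value as the closed form $\frac{N}{2}\ln\beta-\frac{N}{2}\ln\left(2\pi\right)-\beta E_D\left(\mathbf{w}\right)$. The data, the weight vector, and the helper names are assumptions chosen only for illustration.

```python
import math

def log_gaussian(t, mean, beta):
    """ln N(t | mean, beta^{-1}) for a univariate Gaussian with precision beta."""
    return 0.5 * math.log(beta) - 0.5 * math.log(2.0 * math.pi) - 0.5 * beta * (t - mean) ** 2

def predict(w, phi_n):
    """y = w^T phi(x_n) for one feature vector."""
    return sum(wj * pj for wj, pj in zip(w, phi_n))

def sum_of_squares_error(w, Phi, t):
    """E_D(w) = 1/2 * sum_n (t_n - w^T phi(x_n))^2."""
    return 0.5 * sum((tn - predict(w, phi_n)) ** 2 for phi_n, tn in zip(Phi, t))

# Toy data (assumed): rows of Phi are phi(x_n) = (1, x_n).
Phi = [[1.0, 0.0], [1.0, 0.5], [1.0, 1.0], [1.0, 1.5]]
t = [0.1, 0.4, 1.2, 1.4]
w = [0.0, 1.0]
beta = 4.0
N = len(t)

# Log likelihood as a sum of per-point log densities.
direct = sum(log_gaussian(tn, predict(w, phi_n), beta) for phi_n, tn in zip(Phi, t))

# The same quantity from the closed form involving E_D(w).
closed_form = (0.5 * N * math.log(beta)
               - 0.5 * N * math.log(2.0 * math.pi)
               - beta * sum_of_squares_error(w, Phi, t))
```

Only the last term of the closed form depends on $\mathbf{w}$, which is why maximizing the likelihood over $\mathbf{w}$ reduces to minimizing $E_D\left(\mathbf{w}\right)$.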

Having written down the likelihood function, we can use maximum likelihood to determine $\mathbf{w}$ and $\beta$. Consider first the maximization with respect to $\mathbf{w}$. We see that maximization of the likelihood function under a conditional Gaussian noise distribution for a linear model is equivalent to minimizing a sum-of-squares error function given by $E_D\left(\mathbf{w}\right)$. The gradient of the log likelihood takes the form
$$\nabla\ln p\left(\mathsf{t}|\mathbf{w},\beta\right)=\beta\sum_{n=1}^N\{t_n-\mathbf{w}^\top\phi\left(\mathbf{x}_n\right)\}\phi\left(\mathbf{x}_n\right)^\top.$$

Setting this gradient to zero gives
$$\sum_{n=1}^Nt_n\phi\left(\mathbf{x}_n\right)^\top-\mathbf{w}^\top\left(\sum_{n=1}^N\phi\left(\mathbf{x}_n\right)\phi\left(\mathbf{x}_n\right)^\top\right)=0.$$

Solving for $\mathbf{w}$ we obtain
$$\mathbf{w}_{ML}=\left(\boldsymbol\Phi^\top\boldsymbol\Phi\right)^{-1}\boldsymbol\Phi^\top\mathsf{t}$$
which are known as the normal equations for the least squares problem. Here $\boldsymbol\Phi$ is an $N\times M$ matrix, called the design matrix, whose elements are given by $\Phi_{nj}=\phi_j\left(\mathbf{x}_n\right)$, so that
$$\boldsymbol\Phi=\begin{bmatrix} \phi_0\left(\mathbf{x}_1\right) & \phi_1\left(\mathbf{x}_1\right) & \cdots & \phi_{M-1}\left(\mathbf{x}_1\right) \\ \phi_0\left(\mathbf{x}_2\right) & \phi_1\left(\mathbf{x}_2\right) & \cdots & \phi_{M-1}\left(\mathbf{x}_2\right) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0\left(\mathbf{x}_N\right) & \phi_1\left(\mathbf{x}_N\right) & \cdots & \phi_{M-1}\left(\mathbf{x}_N\right) \end{bmatrix}.$$

The quantity
$$\boldsymbol\Phi^\dagger\equiv\left(\boldsymbol\Phi^\top\boldsymbol\Phi\right)^{-1}\boldsymbol\Phi^\top$$
is known as the Moore-Penrose pseudo-inverse of the matrix $\boldsymbol\Phi$. It can be regarded as a generalization of the notion of matrix inverse to nonsquare matrices. Indeed, if $\boldsymbol\Phi$ is square and invertible, then using the property $\left(AB\right)^{-1}=B^{-1}A^{-1}$ we see that $\boldsymbol\Phi^\dagger\equiv\boldsymbol\Phi^{-1}$.
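As an illustration, the sketch below builds a design matrix and computes $\mathbf{w}_{ML}$ in three equivalent ways with NumPy. The synthetic sinusoidal data, the polynomial basis, and the variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): noisy samples of sin(2*pi*x).
N = 30
x = rng.uniform(0.0, 1.0, size=N)
t = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.2, size=N)

# Design matrix with polynomial basis functions phi_j(x) = x**j, j = 0..M-1.
M = 4
Phi = np.vander(x, M, increasing=True)        # shape (N, M); first column is phi_0 = 1

# Normal equations: solve (Phi^T Phi) w = Phi^T t without forming the inverse.
w_normal = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# The same solution through the Moore-Penrose pseudo-inverse.
w_pinv = np.linalg.pinv(Phi) @ t

# The same solution through a dedicated least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)
```

Solving the linear system, or calling a least-squares routine, is generally preferable to forming $\left(\boldsymbol\Phi^\top\boldsymbol\Phi\right)^{-1}$ explicitly, since it avoids an unnecessary matrix inversion.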

At this point, we can gain some insight into the role of the bias parameter $w_0$. If we make the bias parameter explicit, then the error function becomes
$$E_D\left(\mathbf{w}\right)=\frac{1}{2}\sum_{n=1}^N\{t_n-w_0-\sum_{j=1}^{M-1}w_j\phi_j\left(\mathbf{x}_n\right)\}^2.$$

Setting the derivative with respect to $w_0$ equal to zero, and solving for $w_0$, we obtain 
$$w_0=\bar{t}-\sum_{j=1}^{M-1}w_j\bar{\phi_j}$$
where we have defined
$$\bar{t}=\frac{1}{N}\sum_{n=1}^Nt_n,\quad\quad\quad\bar{\phi_j}=\frac{1}{N}\sum_{n=1}^N\phi_j\left(\mathbf{x}_n\right).$$

Thus the bias $w_0$ compensates for the difference between the averages (over the training set) of the target values and the weighted sum of the averages of the basis function values.

We can also maximize the log likelihood function with respect to the noise precision parameter $\beta$, giving
$$\frac{1}{\beta_{ML}}=\frac{1}{N}\sum_{n=1}^N\{t_n-\mathbf{w}^\top_{ML}\phi\left(\mathbf{x}_n\right)\}^2$$
and so we see that the inverse of the noise precision is given by the residual variance of the target values around the regression function.
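The two results above, the expression for the bias $w_0$ and the estimate of the noise precision, can be checked numerically. The sketch below assumes a synthetic quadratic data set with a known noise precision; all names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (assumed): a quadratic trend plus Gaussian noise of precision 25.
N = 200
true_beta = 25.0
x = rng.uniform(-1.0, 1.0, size=N)
t = 0.5 + 2.0 * x - x**2 + rng.normal(scale=true_beta ** -0.5, size=N)

# Basis functions phi_0 = 1, phi_1 = x, phi_2 = x^2.
Phi = np.column_stack([np.ones(N), x, x**2])
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# Bias from the averages: w_0 = mean(t) - sum_j w_j * mean(phi_j).
w0_from_means = t.mean() - w_ml[1:] @ Phi[:, 1:].mean(axis=0)

# Noise precision: 1 / beta_ML is the mean squared residual.
residuals = t - Phi @ w_ml
beta_ml = 1.0 / np.mean(residuals**2)
```

Here `w0_from_means` agrees with the fitted `w_ml[0]`, and `beta_ml` should land in the neighbourhood of the precision used to generate the noise, though it will not match it exactly for a finite sample.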

### 3.1.2 Geometry of least squares

At this point, it is instructive to consider the geometrical interpretation of the least-squares solution. To do this we consider an N-dimensional space whose axes are given by the $t_n$, so that $\mathsf{t}=\left(t_1,\dots,t_N\right)^\top$ is a vector in this space. Each basis function $\phi_j\left(\mathbf{x}_n\right)$, evaluated at the $N$ data points, can also be represented as a vector in the same space, denoted by $\varphi_j$. Note that $\varphi_j$ corresponds to the $j^{th}$ column of $\Phi$, whereas $\phi\left(\mathbf{x}_n\right)$ corresponds to the $n^{th}$ row of $\Phi$. If the number $M$ of basis functions is smaller than the number $N$ of data points, then the $M$ vectors $\varphi_j$ will span a linear subspace $\mathcal{S}$ of dimensionality $M$. We define $\mathsf{y}$ to be an N-dimensional vector whose $n^{th}$ element is given by $y\left(\mathbf{x}_n, \mathbf{w}\right)$, where $n=1,\dots,N$. Because $\mathsf{y}$ is an arbitrary linear combination of the vectors $\varphi_j$, it can live anywhere in the M-dimensional subspace. The sum-of-squares error is then equal (up to a factor of 1/2) to the squared Euclidean distance between $\mathsf{y}$ and $\mathsf{t}$. Thus the least-squares solution for $\mathbf{w}$ corresponds to that choice of $\mathsf{y}$ that lies in subspace $\mathcal{S}$ and that is closest to $\mathsf{t}$. Intuitively, we anticipate that this solution corresponds to the orthogonal projection of $\mathsf{t}$ onto the subspace $\mathcal{S}$. This is indeed the case, as can easily be verified by noting that the solution for $\mathsf{y}$ is given by $\Phi\mathbf{w}_{ML}$, and then confirming that this takes the form of an orthogonal projection.
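The projection property can be confirmed numerically. In the sketch below, a random matrix stands in for the design matrix (an assumption made purely for illustration), and the matrix `P` is the product of the design matrix with its pseudo-inverse.

```python
import numpy as np

rng = np.random.default_rng(2)

# A random design matrix (assumed): its M columns play the role of the vectors varphi_j.
N, M = 8, 3
Phi = rng.normal(size=(N, M))
t = rng.normal(size=N)

# Least-squares fit and the fitted vector y = Phi w_ML.
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
y = Phi @ w_ml

# P = Phi Phi^dagger maps any vector onto the subspace S spanned by the columns of Phi.
P = Phi @ np.linalg.pinv(Phi)

# The residual t - y should be orthogonal to every column of Phi.
residual = t - y
```

An orthogonal projection matrix is symmetric and idempotent, so `P @ P` equals `P`, `P` equals its transpose, and `P @ t` reproduces `y`.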

In practice, a direct solution of the normal equations can lead to numerical difficulties when $\Phi^\top\Phi$ is close to singular. In particular, when two or more of the basis vectors $\varphi_j$ are co-linear, or nearly so, the resulting parameter values can have large magnitudes. Such near degeneracies will not be uncommon when dealing with real data sets. The resulting numerical difficulties can be addressed using the technique of singular value decomposition, or SVD. Note that the addition of a regularization term ensures that the matrix is nonsingular, even in the presence of degeneracies.
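One way the SVD can be used here is sketched below: singular values below a cutoff are discarded before inverting, which yields a well-behaved solution even when two basis vectors are almost co-linear. The data, the near-duplicate column, and the relative cutoff `1e-6` are assumptions for illustration, not a prescription from the text.

```python
import numpy as np

rng = np.random.default_rng(3)

# Design matrix (assumed) whose last two columns are almost identical.
N = 50
x = rng.uniform(0.0, 1.0, size=N)
Phi = np.column_stack([np.ones(N), x, x + 1e-9 * rng.normal(size=N)])
t = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=N)

# Phi^T Phi is numerically close to singular.
cond_gram = np.linalg.cond(Phi.T @ Phi)

# Thin SVD: Phi = U diag(s) Vt.
U, s, Vt = np.linalg.svd(Phi, full_matrices=False)

# Keep only directions whose singular value is above an assumed relative cutoff.
keep = s > 1e-6 * s.max()

# Truncated pseudo-inverse solution: w = V diag(1/s) U^T t over the kept directions.
w_svd = Vt[keep].T @ ((U[:, keep].T @ t) / s[keep])
```

Dropping the tiny singular value shares the weight between the two near-duplicate columns rather than letting them take large values of opposite sign. `np.linalg.lstsq` and `np.linalg.pinv` apply a similar cutoff through their `rcond` argument.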

### 3.1.3 Sequential learning

Batch techniques, such as the maximum likelihood solution, which involve processing the entire training set in one go, can be computationally costly for large data sets. If the data set is sufficiently large, it may be worthwhile to use sequential algorithms, also known as on-line algorithms, in which the data points are considered one at a time, and the model parameters updated after each such presentation. Sequential learning is also appropriate for real-time applications in which the data observations are arriving in a continuous stream, and predictions must be made before all of the data points are seen.

We can obtain a sequential learning algorithm by applying the technique of stochastic gradient descent, also known as sequential gradient descent, as follows. If the error function comprises a sum over data points $E=\sum_nE_n$, then after presentation of pattern $n$, the stochastic gradient descent algorithm updates the parameter vector $\mathbf{w}$ using
$$\mathbf{w}^{\left(\tau+1\right)}=\mathbf{w}^{\left(\tau\right)}-\eta\nabla E_n$$
where $\tau$ denotes the iteration number, and $\eta$ is a learning rate parameter. We shall discuss the choice of value for $\eta$ shortly. The value of $\mathbf{w}$ is initialized to some starting vector $\mathbf{w}^{\left(0\right)}$.

For the case of the sum-of-squares error function, this gives
$$\mathbf{w}^{\left(\tau+1\right)}=\mathbf{w}^{\left(\tau\right)}+\eta\left(t_n-\mathbf{w}^{\left(\tau\right)\top}\phi_n\right)\phi_n$$
where $\phi_n=\phi\left(\mathbf{x}_n\right)$. This is known as the least-mean-squares or LMS algorithm. The value of $\eta$ needs to be chosen with care to ensure that the algorithm converges.
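A minimal sketch of the LMS update is given below, applied to a simulated stream of observations. The generating weights, the learning rate $\eta=0.05$, and the number of steps are assumptions chosen so that this toy problem converges; they are not recommendations.

```python
import random

def lms_update(w, phi_n, t_n, eta):
    """One LMS step: w <- w + eta * (t_n - w^T phi_n) * phi_n."""
    error = t_n - sum(wj * pj for wj, pj in zip(w, phi_n))
    return [wj + eta * error * pj for wj, pj in zip(w, phi_n)]

random.seed(0)
true_w = [0.5, -1.5]      # assumed weights used to simulate the data stream
w = [0.0, 0.0]            # starting vector w^(0)
eta = 0.05                # assumed learning rate

for _ in range(5000):
    # Each observation arrives on its own and is used once, then discarded.
    x = random.uniform(-1.0, 1.0)
    phi_n = [1.0, x]      # phi_0(x) = 1, phi_1(x) = x
    t_n = true_w[0] + true_w[1] * x + random.gauss(0.0, 0.1)
    w = lms_update(w, phi_n, t_n, eta)
```

With a fixed learning rate the weights do not settle exactly but fluctuate around the least-squares solution. A smaller $\eta$ reduces the fluctuation at the cost of slower convergence, and too large a value makes the iteration diverge.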

### 3.1.4 Regularized least squares

We introduced the idea of adding a regularization term to an error function in order to control over-fitting, so that the total error function to be minimized takes the form 
$$E_D\left(\mathbf{w}\right)+\lambda E_W\left(\mathbf{w}\right)$$
where $\lambda$ is the regularization coefficient that controls the relative importance of the data-dependent error $E_D\left(\mathbf{w}\right)$ and the regularization term $E_W\left(\mathbf{w}\right)$.

One of the simplest forms of regularizer is given by the sum-of-squares of the weight vector elements
$$E_W\left(\mathbf{w}\right)=\frac{1}{2}\mathbf{w}^\top\mathbf{w}.$$

If we also consider the sum-of-squares error function given by
$$E_D\left(\mathbf{w}\right)=\frac{1}{2}\sum_{n=1}^N\left\{t_n-\mathbf{w}^\top\phi\left(\mathbf{x}_n\right)\right\}^2$$
then the total error function becomes 
$$\frac{1}{2}\sum_{n=1}^N\left\{t_n-\mathbf{w}^\top\phi\left(\mathbf{x}_n\right)\right\}^2+\frac{\lambda}{2}\mathbf{w}^\top\mathbf{w}.$$

This particular choice of regularizer is known in the machine learning literature as weight decay because in sequential learning algorithms, it encourages weight values to decay towards zero, unless supported by the data. In statistics, it provides an example of a parameter shrinkage method because it shrinks parameter values towards zero. It has the advantage that the error function remains a quadratic function of $\mathbf{w}$, and so its exact minimizer can be found in closed form. 

Specifically, setting the gradient with respect to $\mathbf{w}$ to zero, and solving for $\mathbf{w}$ as before, we obtain
$$\mathbf{w}=\left(\lambda\mathbf{I}+\Phi^\top\Phi\right)^{-1}\Phi^\top\mathsf{t}.$$
This represents a simple extension of the least-squares solution.
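The closed-form regularized solution can be sketched in a few lines. The helper name `ridge_fit`, the synthetic data, and the two values of $\lambda$ are assumptions for illustration.

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Regularized least squares: w = (lam * I + Phi^T Phi)^{-1} Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

rng = np.random.default_rng(4)

# Synthetic data (assumed) with a fairly flexible polynomial basis.
N = 20
x = rng.uniform(0.0, 1.0, size=N)
t = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.2, size=N)
Phi = np.vander(x, 8, increasing=True)

w_small = ridge_fit(Phi, t, lam=1e-6)    # very weak regularization
w_large = ridge_fit(Phi, t, lam=1.0)     # stronger regularization

# Gradient of the total error at w_large; it should vanish at the minimizer.
lam = 1.0
grad = -Phi.T @ (t - Phi @ w_large) + lam * w_large
```

Increasing $\lambda$ shrinks the norm of the weight vector, which is the weight decay behaviour described above. Adding $\lambda\mathbf{I}$ also keeps the matrix being inverted nonsingular for any $\lambda>0$.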

A more general regularizer is sometimes used, for which the regularized error takes the form
$$\frac{1}{2}\sum_{n=1}^N\left\{t_n-\mathbf{w}^\top\phi\left(\mathbf{x}_n\right)\right\}^2+\frac{\lambda}{2}\sum_{j=1}^M|w_j|^q$$
where $q=2$ corresponds to the quadratic regularizer.

The case of $q=1$ is known as the lasso in the statistics literature. It has the property that if $\lambda$ is sufficiently large, some of the coefficients $w_j$ are driven to zero, leading to a sparse model in which the corresponding basis functions play no role.

To see this, we first note that minimizing the regularized error function above is equivalent to minimizing the unregularized sum-of-squares error subject to the constraint
$$\sum_{j=1}^M |w_j|^q\leqslant\eta$$
for an appropriate value of the parameter $\eta$, where the two approaches can be related using Lagrange multipliers. As $\lambda$ is increased, an increasing number of parameters are driven to zero.
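The sparsity produced by the $q=1$ regularizer can be seen in a small numerical experiment. Unlike the quadratic case there is no closed-form solution, so the sketch below uses cyclic coordinate descent with soft thresholding. This optimizer is not discussed in the text and is brought in here as an assumption, as are the synthetic data and the values of $\lambda$.

```python
import numpy as np

def soft_threshold(rho, kappa):
    """Shrink rho towards zero by kappa, returning exactly zero if |rho| <= kappa."""
    return np.sign(rho) * max(abs(rho) - kappa, 0.0)

def lasso_coordinate_descent(Phi, t, lam, n_sweeps=200):
    """Minimize 0.5 * ||t - Phi w||^2 + (lam / 2) * sum_j |w_j| one coordinate at a time."""
    M = Phi.shape[1]
    w = np.zeros(M)
    col_sq = np.sum(Phi**2, axis=0)
    for _ in range(n_sweeps):
        for j in range(M):
            # Residual with the contribution of basis function j removed.
            r_j = t - Phi @ w + Phi[:, j] * w[j]
            rho = Phi[:, j] @ r_j
            # Exact minimizer of the objective along coordinate j.
            w[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return w

rng = np.random.default_rng(5)

# Synthetic data (assumed): only the first two of six basis functions matter.
Phi = rng.normal(size=(40, 6))
t = 3.0 * Phi[:, 0] - 2.0 * Phi[:, 1] + rng.normal(scale=0.1, size=40)

w_weak = lasso_coordinate_descent(Phi, t, lam=0.01)
w_mid = lasso_coordinate_descent(Phi, t, lam=40.0)

# For lam / 2 above max_j |phi_j^T t| the zero vector is optimal; 2.5 leaves a margin.
lam_big = 2.5 * np.max(np.abs(Phi.T @ t))
w_big = lasso_coordinate_descent(Phi, t, lam=lam_big)
```

In this setup the intermediate value of $\lambda$ is expected to zero out the weights of the irrelevant basis functions while retaining the two that generated the data, and the largest value removes every basis function.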

Regularization allows complex models to be trained on data sets of limited size without severe over-fitting, essentially by limiting the effective model complexity. However, the problem of determining the optimal model complexity is then shifted from one of finding the appropriate number of basis functions to one of determining a suitable value of the regularization coefficient $\lambda$. 

### 3.1.5 Multiple outputs

So far, we have considered the case of a single target variable $t$. In some applications, we may wish to predict $K>1$ target variables, which we denote collectively by the target vector $\mathbf{t}$. This could be done by introducing a different set of basis functions for each component of $\mathbf{t}$, leading to multiple, independent regression problems.

However, a more interesting, and more common, approach is to use the same set of basis functions to model all of the components of the target vector so that
$$\mathbf{y}\left(\mathbf{x},\mathbf{w}\right)=\mathbf{W}^\top\phi\left(\mathbf{x}\right)$$
where $\mathbf{y}$ is a K-dimensional column vector, $\mathbf{W}$ is an $M\times K$ matrix of parameters, and $\phi\left(\mathbf{x}\right)$ is an M-dimensional column vector with elements $\phi_j\left(\mathbf{x}\right)$, with $\phi_0\left(\mathbf{x}\right)=1$ as before.

Suppose we take the conditional distribution of the target vector to be an isotropic Gaussian of the form
$$p\left(\mathbf{t}|\mathbf{x},\mathbf{W},\beta\right)=\mathcal{N}\left(\mathbf{t}|\mathbf{W}^\top\phi\left(\mathbf{x}\right),\beta^{-1}\mathbf{I}\right).$$
If we have a set of observations $\mathbf{t}_1,\dots,\mathbf{t}_N$, we can combine these into a matrix $\mathbf{T}$ of size $N\times K$ such that the $n^{th}$ row is given by $\mathbf{t}_n^\top$. Similarly, we can combine the input vectors $\mathbf{x}_1,\dots,\mathbf{x}_N$ into a matrix $\mathbf{X}$. The log likelihood function is then given by
$$\ln p\left(\mathbf{T}|\mathbf{X},\mathbf{W},\beta\right)=\sum_{n=1}^N \ln\mathcal{N}\left(\mathbf{t}_n|\mathbf{W}^\top\phi\left(\mathbf{x}_n\right),\beta^{-1}\mathbf{I}\right) \\
=\frac{NK}{2}\ln\left(\frac{\beta}{2\pi}\right)-\frac{\beta}{2}\sum_{n=1}^N\|\mathbf{t}_n-\mathbf{W}^\top\phi\left(\mathbf{x}_n\right)\|^2.$$

As before, we can maximize this function with respect to $\mathbf{W}$, giving
$$\mathbf{W}_{ML}=\left(\Phi^\top\Phi\right)^{-1}\Phi^\top\mathbf{T}.$$

If we examine this result for each target variable $t_k$, we have 
$$\mathbf{w}_k=\left(\Phi^\top\Phi\right)^{-1}\Phi^\top\mathbf{t}_k=\Phi^\dagger\mathbf{t}_k$$
where $\mathbf{t}_k$ is an N-dimensional column vector with components $t_{nk}$ for $n=1,\dots,N$. Thus the solution to the regression problem decouples between the different target variables, and we need only compute a single pseudo-inverse matrix $\Phi^\dagger$, which is shared by all of the vectors $\mathbf{w}_k$.
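The decoupling can be checked directly: a single pseudo-inverse applied to the whole target matrix gives the same weights as solving each output separately. The dimensions, the polynomial basis, and the synthetic targets below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)

# Synthetic multi-output data (assumed): K targets sharing the same M basis functions.
N, M, K = 40, 4, 3
x = rng.uniform(-1.0, 1.0, size=N)
Phi = np.vander(x, M, increasing=True)                    # (N, M), with phi_0 = 1
W_true = rng.normal(size=(M, K))
T = Phi @ W_true + rng.normal(scale=0.05, size=(N, K))    # (N, K); row n is t_n^T

# A single pseudo-inverse, shared by all K target variables.
Phi_pinv = np.linalg.pinv(Phi)
W_ml = Phi_pinv @ T                                       # (M, K)

# The same result built one target variable at a time: w_k = Phi^dagger t_k.
W_cols = np.column_stack([Phi_pinv @ T[:, k] for k in range(K)])

def predict(W, phi_x):
    """y(x, W) = W^T phi(x), a K-dimensional prediction."""
    return W.T @ phi_x

y0 = predict(W_ml, Phi[0])
```

The cost of the pseudo-inverse is paid once regardless of $K$, so adding further target variables only adds a matrix-vector product each.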

The extension to general Gaussian noise distributions having arbitrary covariance matrices is straightforward. Again, this leads to a decoupling into $K$ independent regression problems. This result is unsurprising because the parameters $\mathbf{W}$ define only the mean of the Gaussian noise distribution, and we know that the maximum likelihood solution for the mean of a multivariate Gaussian is independent of the covariance.

## 3.2 The Bias-Variance Decomposition

So far in our discussion of linear models for regression, we have assumed that the form and number of basis functions are both fixed. The use of maximum likelihood, or equivalently least squares, can lead to severe over-fitting if complex models are trained using data sets of limited size. However, limiting the number of basis functions in order to avoid over-fitting has the side effect of limiting the flexibility of the model to capture interesting and important trends in the data. Although the introduction of regularization terms can control over-fitting for models with many parameters, this raises the question of how to determine a suitable value for the regularization coefficient $\lambda$. Seeking the solution that minimizes the regularized error function with respect to both the weight vector $\mathbf{w}$ and the regularization coefficient $\lambda$ is clearly not the right approach since this leads to the unregularized solution with $\lambda=0$.

We shall consider a frequentist viewpoint of the model complexity issue, known as the bias-variance trade-off.

We considered various loss functions, each of which leads to a corresponding optimal prediction once we are given the conditional distribution $p\left(t|\mathbf{x}\right)$. A popular choice is the squared loss function, for which the optimal prediction is given by the conditional expectation, which we denote by $h\left(\mathbf{x}\right)$ and which is given by
$$h\left(\mathbf{x}\right)=\mathbb{E}\left[t|\mathbf{x}\right]=\int tp\left(t|\mathbf{x}\right)\mathrm{d}t.$$

The expected squared loss can be written in the form
$$\mathbb{E}\left[L\right]=\int\{y\left(\mathbf{x}\right)-h\left(\mathbf{x}\right)\}^2p\left(\mathbf{x}\right)\mathrm{d}\mathbf{x}+\int\int \{h\left(\mathbf{x}\right)-t\}^2p\left(\mathbf{x},t\right)\mathrm{d}\mathbf{x}\mathrm{d}t.$$

Recall that the second term, which is independent of $y\left(\mathbf{x}\right)$, arises from the intrinsic noise on the data and represents the minimum achievable value of the expected loss. The first term depends on our choice for the function $y\left(\mathbf{x}\right)$, and we will seek a solution for $y\left(\mathbf{x}\right)$ which makes this term a minimum. Because it is nonnegative, the smallest that we can hope to make this term is zero. If we had an unlimited supply of data, we could in principle find the regression function $h\left(\mathbf{x}\right)$ to any desired degree of accuracy, and this would represent the optimal choice for $y\left(\mathbf{x}\right)$. However, in practice we have a data set $\mathcal{D}$ containing only a finite number $N$ of data points, and consequently we do not know the regression function $h\left(\mathbf{x}\right)$ exactly.

A frequentist treatment involves making a point estimate of $\mathbf{w}$ based on the data set $\mathcal{D}$, and tries instead to interpret the uncertainty of this estimate through the following thought experiment. Suppose we had a large number of data sets each of size $N$ and each drawn independently from the distribution $p\left(t,\mathbf{x}\right)$. For any given data set $\mathcal{D}$, we can run our learning algorithm and obtain a prediction function $y\left(\mathbf{x};\mathcal{D}\right)$. Different data sets from the ensemble will give different functions and consequently different values of the squared loss. The performance of a particular learning algorithm is then assessed by taking the average over this ensemble of data sets.

Consider the integrand of the first term, which for a particular data set $\mathcal{D}$ takes the form 
$$\{y\left(\mathbf{x};\mathcal{D}\right)-h\left(\mathbf{x}\right)\}^2.$$

Because this quantity will be dependent on the particular data set $\mathcal{D}$, we take its average over the ensemble of data sets. If we add and subtract the quantity $\mathbb{E}_D\left[y\left(\mathbf{x};\mathcal{D}\right)\right]$ inside the braces, and then expand, we obtain
$$\{y\left(\mathbf{x};\mathcal{D}\right)-\mathbb{E}_D\left[y\left(\mathbf{x};\mathcal{D}\right)\right]+\mathbb{E}_D\left[y\left(\mathbf{x};\mathcal{D}\right)\right]-h\left(\mathbf{x}\right)\}^2 \\
=\{y\left(\mathbf{x};\mathcal{D}\right)-\mathbb{E}_D\left[y\left(\mathbf{x};\mathcal{D}\right)\right]\}^2+\{\mathbb{E}_D\left[y\left(\mathbf{x};\mathcal{D}\right)\right]-h\left(\mathbf{x}\right)\}^2 \\+
2\{y\left(\mathbf{x};\mathcal{D}\right)-\mathbb{E}_D\left[y\left(\mathbf{x};\mathcal{D}\right)\right]\}\{\mathbb{E}_D\left[y\left(\mathbf{x};\mathcal{D}\right)\right]-h\left(\mathbf{x}\right)\}.$$

We now take the expectation of this expression with respect to $\mathcal{D}$ and note that the final term will vanish, giving
$$\mathbb{E}_D\left[\{y\left(\mathbf{x};\mathcal{D}\right)-h\left(\mathbf{x}\right)\}^2\right] \\
=\underbrace{\{\mathbb{E}_D\left[y\left(\mathbf{x};\mathcal{D}\right)\right]-h\left(\mathbf{x}\right)\}^2}_{\left(bias\right)^2}+\underbrace{\mathbb{E}_D\left[\{y\left(\mathbf{x};\mathcal{D}\right)-\mathbb{E}_D\left[y\left(\mathbf{x};\mathcal{D}\right)\right]\}^2\right]}_{variance}$$
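The reason the cross term vanishes can be spelled out in one line. Writing $y$ for $y\left(\mathbf{x};\mathcal{D}\right)$ and $h$ for $h\left(\mathbf{x}\right)$ to shorten the notation, the factor $\mathbb{E}_D\left[y\right]-h$ does not depend on the data set $\mathcal{D}$ and so can be taken outside the expectation, leaving

$$\mathbb{E}_D\left[2\{y-\mathbb{E}_D\left[y\right]\}\{\mathbb{E}_D\left[y\right]-h\}\right]=2\{\mathbb{E}_D\left[y\right]-h\}\,\mathbb{E}_D\left[y-\mathbb{E}_D\left[y\right]\right]=2\{\mathbb{E}_D\left[y\right]-h\}\{\mathbb{E}_D\left[y\right]-\mathbb{E}_D\left[y\right]\}=0.$$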

We see that the expected squared difference between $y\left(\mathbf{x};\mathcal{D}\right)$ and the regression function $h\left(\mathbf{x}\right)$ can be expressed as the sum of two terms. The first term, called the squared bias, represents the extent to which the average prediction over all data sets differs from the desired regression function. The second term, called the variance, measures the extent to which the solutions for individual data sets vary around their average, and hence this measures the extent to which the function $y\left(\mathbf{x};\mathcal{D}\right)$ is sensitive to the particular choice of data set.

So far, we have considered a single input value $\mathbf{x}$. If we substitute this expansion back into the expression for the expected squared loss, we obtain the following decomposition
$$expected\;loss=\left(bias\right)^2+variance+noise$$
where
$$\left(bias\right)^2=\int \{\mathbb{E}_D\left[y\left(\mathbf{x};\mathcal{D}\right)\right]-h\left(\mathbf{x}\right)\}^2p\left(\mathbf{x}\right)\mathrm{d}\mathbf{x} \\
variance = \int\mathbb{E}_D\left[\{y\left(\mathbf{x};\mathcal{D}\right)-\mathbb{E}_D\left[y\left(\mathbf{x};\mathcal{D}\right)\right]\}^2\right]p\left(\mathbf{x}\right)\mathrm{d}\mathbf{x} \\
noise = \int\int \{h\left(\mathbf{x}\right)-t\}^2p\left(\mathbf{x},t\right)\mathrm{d}\mathbf{x}\mathrm{d}t$$
and the bias and variance terms now refer to integrated quantities.

Our goal is to minimize the expected loss, which we have decomposed into the sum of a (squared) bias, a variance, and a constant noise term. There is a trade-off between bias and variance, with very flexible models having low bias and high variance, and relatively rigid models having high bias and low variance. The model with the optimal predictive capability is the one that leads to the best balance between bias and variance.

The bias-variance decomposition is based on averages with respect to ensembles of data sets, whereas in practice we have only the single observed data set. If we had a large number of independent training sets of a given size, we would be better off combining them into a single large training set, which of course would reduce the level of over-fitting for a given model complexity.
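Although the ensemble of data sets is a thought experiment, it can be simulated when the regression function is known by construction. The sketch below assumes $h\left(x\right)=\sin\left(2\pi x\right)$, inputs drawn uniformly from $\left[0,1\right]$, Gaussian basis functions, and the quadratic regularizer; the function names and all numerical settings are illustrative assumptions.

```python
import numpy as np

def gaussian_design(x, centres, s):
    """Design matrix: a constant column followed by Gaussian basis functions."""
    g = np.exp(-((x[:, None] - centres[None, :]) ** 2) / (2.0 * s**2))
    return np.column_stack([np.ones_like(x), g])

def bias_variance(lam, n_sets=100, N=25, noise_std=0.3, seed=0):
    """Estimate the integrated (bias)^2 and variance over simulated data sets."""
    rng = np.random.default_rng(seed)
    centres = np.linspace(0.0, 1.0, 9)
    s = 0.1
    x_test = np.linspace(0.0, 1.0, 100)
    h = np.sin(2.0 * np.pi * x_test)           # regression function, known here by construction
    Phi_test = gaussian_design(x_test, centres, s)
    preds = np.empty((n_sets, x_test.size))
    for l in range(n_sets):
        # Draw a fresh data set of size N and fit it with regularized least squares.
        x = rng.uniform(0.0, 1.0, size=N)
        t = np.sin(2.0 * np.pi * x) + rng.normal(scale=noise_std, size=N)
        Phi = gaussian_design(x, centres, s)
        w = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)
        preds[l] = Phi_test @ w
    mean_pred = preds.mean(axis=0)             # approximates E_D[y(x; D)]
    # Averaging over the uniform test grid approximates the integral against p(x).
    bias_sq = np.mean((mean_pred - h) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

bias_flexible, var_flexible = bias_variance(lam=np.exp(-3.0))
bias_rigid, var_rigid = bias_variance(lam=np.exp(3.0))
```

A small $\lambda$ leaves the model flexible, giving low bias and high variance, while a large $\lambda$ makes it rigid, giving high bias and low variance. Sweeping $\lambda$ between these extremes and summing the two estimates indicates where the balance lies for this particular simulated problem.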

## 3.3 Bayesian Linear Regression

The Bayesian treatment of linear regression will avoid the over-fitting problem of maximum likelihood, and will also lead to automatic methods of determining model complexity using the training data alone.

### 3.3.1 Parameter distribution

We begin our discussion of the Bayesian treatment of linear regression by introducing a prior probability distribution over the model parameters $\mathbf{w}$. For the moment, we shall treat the noise precision parameter $\beta$ as a known constant. First note that the likelihood function $p\left(\mathsf{t}|\mathbf{w}\right)$ is the exponential of a quadratic function of $\mathbf{w}$. The corresponding conjugate prior is therefore given by a Gaussian distribution of the form
$$p\left(\mathbf{w}\right)=\mathcal{N}\left(\mathbf{w}|\mathbf{m}_0,\mathbf{S}_0\right)$$
having mean $\mathbf{m}_0$ and covariance $\mathbf{S}_0$.

Next we compute the posterior distribution, which is proportional to the product of the likelihood function and the prior. Due to the choice of a conjugate Gaussian prior distribution, the posterior will also be Gaussian. The posterior distribution takes the form
$$p\left(\mathbf{w}|\mathsf{t}\right)=\mathcal{N}\left(\mathbf{w}|\mathbf{m}_N,\mathbf{S}_N\right)$$
where 
$$\mathbf{m}_N=\mathbf{S}_N\left(\mathbf{S}_0^{-1}\mathbf{m}_0+\beta\Phi^\top\mathsf{t}\right) \\
\mathbf{S}_N^{-1}=\mathbf{S}_0^{-1}+\beta\Phi^\top\Phi.$$

Note that because the posterior distribution is Gaussian, its mode coincides with its mean. Thus the maximum posterior weight vector is simply given by $\mathbf{w}_{MAP}=\mathbf{m}_N$.

If we consider an infinitely broad prior $\mathbf{S}_0=\alpha^{-1}\mathbf{I}$ with $\alpha\to0$, the mean $\mathbf{m}_N$ of the posterior distribution reduces to the maximum likelihood value $\mathbf{w}_{ML}$. Similarly, if $N\to0$, then the posterior distribution reverts to the prior. Furthermore, if data points arrive sequentially, then the posterior distribution at any stage acts as the prior distribution for the subsequent data point.
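
The sequential-update property can be checked numerically. The following is a minimal sketch, assuming synthetic data, an illustrative basis $\phi(x)=(1,x)^\top$, and a hypothetical helper `posterior` that implements the expressions for $\mathbf{m}_N$ and $\mathbf{S}_N$ given above; none of these choices come from the text.

```python
import numpy as np

def posterior(Phi, t, m0, S0, beta):
    """Gaussian posterior N(w | mN, SN) for prior N(w | m0, S0) and noise precision beta."""
    S0_inv = np.linalg.inv(S0)
    SN = np.linalg.inv(S0_inv + beta * Phi.T @ Phi)
    mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)
    return mN, SN

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=20)
Phi = np.column_stack([np.ones_like(x), x])        # illustrative basis: phi(x) = (1, x)
beta = 25.0                                        # assumed known noise precision
t = -0.3 + 0.5 * x + rng.normal(0.0, beta ** -0.5, size=x.shape)

m0, S0 = np.zeros(2), np.eye(2) / 2.0              # illustrative prior with alpha = 2

# Batch update: condition on all N points at once.
m_batch, S_batch = posterior(Phi, t, m0, S0, beta)

# Sequential update: each posterior becomes the prior for the next data point.
m_seq, S_seq = m0, S0
for phi_n, t_n in zip(Phi, t):
    m_seq, S_seq = posterior(phi_n[None, :], np.array([t_n]), m_seq, S_seq, beta)

print(np.allclose(m_batch, m_seq), np.allclose(S_batch, S_seq))
```

Under these assumptions the two routes should agree up to floating-point error, because multiplying the likelihood factors one at a time gives the same product as multiplying them all at once.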

We consider a zero-mean isotropic Gaussian prior governed by a single precision parameter $\alpha$ so that
$$p\left(\mathbf{w}|\alpha\right)=\mathcal{N}\left(\mathbf{w}|\mathbf{0},\alpha^{-1}\mathbf{I}\right)$$
for which the corresponding posterior distribution over $\mathbf{w}$ is again Gaussian, with
$$\mathbf{m}_N=\beta\mathbf{S}_N\Phi^\top\mathsf{t} \\
\mathbf{S}_N^{-1}=\alpha\mathbf{I}+\beta\Phi^\top\Phi.$$

The log of the posterior distribution is given by the sum of the log likelihood and the log of the prior and, as a function of $\mathbf{w}$, takes the form
$$\ln p\left(\mathbf{w}|\mathsf{t}\right)=-\frac{\beta}{2}\sum_{n=1}^N\{t_n-\mathbf{w}^\top\phi\left(\mathbf{x}_n\right)\}^2-\frac{\alpha}{2}\mathbf{w}^\top\mathbf{w}+const.$$

Maximization of this posterior distribution with respect to $\mathbf{w}$ is therefore equivalent to the minimization of the sum-of-squares error function with the addition of a quadratic regularization term, with $\lambda=\alpha/\beta$.
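
As a quick numerical check of this equivalence, the sketch below compares the posterior mean $\mathbf{m}_N$ with the regularized least-squares solution for $\lambda=\alpha/\beta$. The synthetic data, the cubic polynomial basis, and the values of $\alpha$ and $\beta$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=30)
Phi = np.vander(x, 4, increasing=True)             # illustrative cubic polynomial basis
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, size=x.shape)

alpha, beta = 0.5, 25.0                            # illustrative precisions
M = Phi.shape[1]

# Posterior mean for the zero-mean isotropic prior.
S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

# Regularized least squares with lambda = alpha / beta.
lam = alpha / beta
w_ridge = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

print(np.allclose(m_N, w_ridge))
```

The agreement follows from dividing $\alpha\mathbf{I}+\beta\Phi^\top\Phi$ through by $\beta$, so only the ratio $\alpha/\beta$ affects the location of the mode.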

Other forms of prior over the parameters can be considered. For instance, we can generalize the Gaussian prior to give
$$p\left(\mathbf{w}|\alpha\right)=\left[\frac{q}{2}\left(\frac{\alpha}{2}\right)^{1/q}\frac{1}{\Gamma\left(1/q\right)}\right]^M\exp\left(-\frac{\alpha}{2}\sum_{j=1}^M|w_j|^q\right)$$
in which $q=2$ corresponds to the Gaussian distribution, and only in this case is the prior conjugate to the likelihood function. Finding the maximum of the posterior distribution over $\mathbf{w}$ corresponds to minimization of the regularized error function. In the case of the Gaussian prior, the mode of the posterior distribution was equal to the mean, although this will no longer hold if $q\neq2$.

### 3.3.2 Predictive distribution

In practice, we are not usually interested in the value of $\mathbf{w}$ itself but rather in making predictions of $t$ for new values of $\mathbf{x}$. This requires that we evaluate the predictive distribution defined by
$$p\left(t|\mathsf{t},\alpha,\beta\right)=\int p\left(t|\mathbf{w},\beta\right)p\left(\mathbf{w}|\mathsf{t},\alpha,\beta\right)\mathrm{d}\mathbf{w}$$
in which $\mathsf{t}$ is the vector of target values from the training set.

The predictive distribution takes the form 
$$p\left(t|\mathbf{x},\mathsf{t},\alpha,\beta\right)=\mathcal{N}\left(t|\mathbf{m}_N^\top\phi\left(\mathbf{x}\right),\sigma_N^2\left(\mathbf{x}\right)\right)$$
where the variance $\sigma_N^2\left(\mathbf{x}\right)$ of the predictive distribution is given by
$$\sigma_N^2\left(\mathbf{x}\right)=\frac{1}{\beta}+\phi\left(\mathbf{x}\right)^\top\mathbf{S}_N\phi\left(\mathbf{x}\right).$$

The first term in this expression represents the noise on the data whereas the second term reflects the uncertainty associated with the parameters $\mathbf{w}$. Because the noise process and the distribution of $\mathbf{w}$ are independent Gaussians, their variances are additive. Note that, as additional data points are observed, the posterior distribution becomes narrower. As a consequence it can be shown that $\sigma_{N+1}^2\left(\mathbf{x}\right)\leqslant\sigma_{N}^2\left(\mathbf{x}\right)$. In the limit $N\to\infty$, the second term goes to zero, and the variance of the predictive distribution arises solely from the additive noise governed by the parameter $\beta$.
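
The shrinking of the predictive variance can be illustrated with a short sketch. It assumes the zero-mean isotropic prior, an illustrative basis $\phi(x)=(1,x,x^2)^\top$, randomly drawn training inputs, and a hypothetical helper `predictive_variance`. Note that the variance depends only on the inputs, not on the targets.

```python
import numpy as np

def predictive_variance(phi_x, Phi, alpha, beta):
    """sigma_N^2(x) = 1/beta + phi(x)^T S_N phi(x) for the isotropic prior."""
    M = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    return 1.0 / beta + phi_x @ S_N @ phi_x

def basis(x):
    # Illustrative basis: phi(x) = (1, x, x^2).
    return np.array([1.0, x, x ** 2])

rng = np.random.default_rng(2)
alpha, beta = 2.0, 25.0                            # illustrative precisions
x_train = rng.uniform(-1.0, 1.0, size=200)
Phi = np.array([basis(x) for x in x_train])
phi_query = basis(0.3)

# Variance at the query point after seeing the first N training inputs, N = 0..200.
variances = [predictive_variance(phi_query, Phi[:N], alpha, beta) for N in range(0, 201)]

print(variances[0], variances[10], variances[-1], 1.0 / beta)
```

With $N=0$ the variance is that of the prior plus noise, and as $N$ grows it should approach the noise floor $1/\beta$ from above.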

Note that, if both $\mathbf{w}$ and $\beta$ are treated as unknown, then we can introduce a conjugate prior distribution $p\left(\mathbf{w},\beta\right)$ that will be given by a Gaussian-gamma distribution. In this case, the predictive distribution is a Student's t-distribution.

### 3.3.3 Equivalent kernel

We see that the predictive mean can be written in the form 
$$y\left(\mathbf{x},\mathbf{m}_N\right)=\mathbf{m}_N^\top\phi\left(\mathbf{x}\right)=\beta\phi\left(\mathbf{x}\right)^\top\mathbf{S}_N\Phi^\top\mathsf{t}=\sum_{n=1}^N\beta\phi\left(\mathbf{x}\right)^\top\mathbf{S}_N\phi\left(\mathbf{x}_n\right)t_n.$$

Thus the mean of the predictive distribution at a point $\mathbf{x}$ is given by a linear combination of the training set target variables $t_n$, so that we can write
$$y\left(\mathbf{x},\mathbf{m}_N\right)=\sum_{n=1}^N k\left(\mathbf{x},\mathbf{x}_n\right)t_n$$
where the function 
$$k\left(\mathbf{x},\mathbf{x}^{'}\right)=\beta\phi\left(\mathbf{x}\right)^\top\mathbf{S}_N\phi\left(\mathbf{x}^{'}\right)$$
is known as the smoother matrix or the equivalent kernel.

Regression functions, such as this, which make predictions by taking linear combinations of the training set target values are known as linear smoothers.

Further insight into the role of the equivalent kernel can be obtained by considering the covariance between $y\left(\mathbf{x}\right)$ and $y\left(\mathbf{x}^{'}\right)$, which is given by
$$cov\left[y\left(\mathbf{x}\right),y\left(\mathbf{x}^{'}\right)\right]=cov\left[\phi\left(\mathbf{x}\right)^\top\mathbf{w},\mathbf{w}^\top\phi\left(\mathbf{x}^{'}\right)\right]=\phi\left(\mathbf{x}\right)^\top\mathbf{S}_N\phi\left(\mathbf{x}^{'}\right)=\beta^{-1}k\left(\mathbf{x},\mathbf{x}^{'}\right).$$

From the form of the equivalent kernel, we see that the predictive mean at nearby points will be highly correlated, whereas for more distant pairs of points the correlation will be smaller.

The formulation of linear regression in terms of a kernel function suggests an alternative approach to regression as follows. Instead of introducing a set of basis functions, which implicitly determines an equivalent kernel, we can instead define a localized kernel directly and use this to make predictions for new input vectors $\mathbf{x}$, given the observed training set. This leads to a practical framework for regression (and classification) called Gaussian processes.

We have seen that the effective kernel defines the weights by which the training set target values are combined in order to make a prediction at a new value of $\mathbf{x}$, and it can be shown that these weights sum to one, in other words
$$\sum_{n=1}^N k\left(\mathbf{x},\mathbf{x}_n\right)=1$$
for all values of $\mathbf{x}$.
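
The kernel view of the predictive mean can be sketched as follows. The data, the cubic polynomial basis (which includes the constant function), and the hypothetical helper `design` are assumptions made for illustration. A very small $\alpha$ is used because, as far as this sketch is concerned, the sum-to-one property is only expected to hold closely when the prior is broad and the constant function lies in the span of the basis.

```python
import numpy as np

rng = np.random.default_rng(3)
x_train = np.sort(rng.uniform(-1.0, 1.0, size=50))
t = np.sin(np.pi * x_train) + rng.normal(0.0, 0.1, size=x_train.shape)

def design(x):
    # Illustrative basis: polynomial terms up to degree 3, including the constant 1.
    return np.vander(np.atleast_1d(x), 4, increasing=True)

alpha, beta = 1e-6, 100.0                          # very broad prior, assumed noise precision
Phi = design(x_train)
S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

x_new = 0.2
phi_new = design(x_new)[0]
k = beta * phi_new @ S_N @ Phi.T                   # k(x_new, x_n) for n = 1..N

mean_from_weights = phi_new @ m_N                  # m_N^T phi(x)
mean_from_kernel = k @ t                           # sum_n k(x, x_n) t_n
kernel_sum = k.sum()

print(mean_from_weights, mean_from_kernel, kernel_sum)
```

Both routes to the predictive mean are the same computation with the matrix products grouped differently, so they should agree to rounding error. Individual kernel values may be negative even though they sum to approximately one.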

Finally, we note that the equivalent kernel satisfies an important property shared by kernel functions in general, namely that it can be expressed in the form of an inner product with respect to a vector $\psi\left(\mathbf{x}\right)$ of nonlinear functions, so that
$$k\left(\mathbf{x},\mathbf{z}\right)=\psi\left(\mathbf{x}\right)^\top\psi\left(\mathbf{z}\right)$$
where $\psi\left(\mathbf{x}\right)=\beta^{1/2}\mathbf{S}_N^{1/2}\phi\left(\mathbf{x}\right)$.

## 3.4 Bayesian Model Comparison

We highlighted the problem of over-fitting as well as the use of cross-validation as a technique for setting the values of regularization parameters or for choosing between alternative models.

The over-fitting associated with maximum likelihood can be avoided by marginalizing (summing or integrating) over the model parameters instead of making point estimates of their values. Models can then be compared directly on the training data, without the need for a validation set. This allows all available data to be used for training and avoids the multiple training runs for each model associated with cross-validation. It also allows multiple complexity parameters to be determined simultaneously as part of the training process.

The Bayesian view of model comparison simply involves the use of probabilities to represent uncertainty in the choice of model, along with a consistent application of the sum and product rules of probability. Suppose we wish to compare a set of $L$ models $\{\mathcal{M}_i\}$ where $i=1,\dots,L$. Here a model refers to a probability distribution over the observed data $\mathcal{D}$.

The model uncertainty is expressed through a prior probability distribution $p\left(\mathcal{M}_i\right)$. Given a training set $\mathcal{D}$, we then wish to evaluate the posterior distribution 
$$p\left(\mathcal{M}_i|\mathcal{D}\right)\propto p\left(\mathcal{M}_i\right)p\left(\mathcal{D}|\mathcal{M}_i\right).$$

The prior allows us to express a preference for different models. Let us simply assume that all models are given equal prior probability. The model evidence $p\left(\mathcal{D}|\mathcal{M}_i\right)$ expresses the preference shown by the data for different models. The model evidence is sometimes also called the marginal likelihood because it can be viewed as a likelihood function over the space of models, in which the parameters have been marginalized out. The ratio of model evidences $p\left(\mathcal{D}|\mathcal{M}_i\right)/p\left(\mathcal{D}|\mathcal{M}_j\right)$ for two models is known as a Bayes factor.

Once we know the posterior distribution over models, the predictive distribution is given, from the sum and product rules, by
$$p\left(t|\mathbf{x},\mathcal{D}\right)=\sum_{i=1}^Lp\left(t|\mathbf{x},\mathcal{M}_i,\mathcal{D}\right)p\left(\mathcal{M}_i|\mathcal{D}\right).$$

This is an example of a mixture distribution in which the overall predictive distribution is obtained by averaging the predictive distributions $p\left(t|\mathbf{x},\mathcal{M}_i,\mathcal{D}\right)$ of individual models, weighted by the posterior probabilities $p\left(\mathcal{M}_i|\mathcal{D}\right)$ of those models.
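
A minimal sketch of this model averaging is given below. The log evidences and the per-model Gaussian predictions are hypothetical numbers chosen only to exercise the formulas, and the helper names `model_posteriors`, `gaussian_pdf` and `mixture_density` are likewise illustrative.

```python
import math

def model_posteriors(log_evidences, priors=None):
    """p(M_i | D) from log evidences ln p(D | M_i) and priors p(M_i) (uniform by default)."""
    L = len(log_evidences)
    if priors is None:
        priors = [1.0 / L] * L
    log_joint = [le + math.log(p) for le, p in zip(log_evidences, priors)]
    shift = max(log_joint)                          # subtract the max for numerical stability
    weights = [math.exp(v - shift) for v in log_joint]
    total = sum(weights)
    return [w / total for w in weights]

def gaussian_pdf(t, mean, var):
    return math.exp(-0.5 * (t - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

# Hypothetical numbers: log evidences and per-model Gaussian predictions at one input x.
log_evidences = [-25.0, -23.0, -24.0]
predictions = [(0.8, 0.30), (1.0, 0.20), (1.1, 0.25)]   # (mean, variance) for each model

post = model_posteriors(log_evidences)

def mixture_density(t):
    # p(t | x, D) = sum_i p(t | x, M_i, D) p(M_i | D)
    return sum(p * gaussian_pdf(t, m, v) for p, (m, v) in zip(post, predictions))

# Bayes factor p(D | M_2) / p(D | M_1) in favour of the second model over the first.
bayes_factor_21 = math.exp(log_evidences[1] - log_evidences[0])

print(post, mixture_density(1.0), bayes_factor_21)
```

With equal priors, the ratio of any two posterior model probabilities equals the corresponding Bayes factor, which is why the evidences alone determine the mixture weights here.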

For a model governed by a set of parameters $\mathbf{w}$, the model evidence is given, from the sum and product rules of probability, by 
$$p\left(\mathcal{D}|\mathcal{M}_i\right)=\int p\left(\mathcal{D}|\mathbf{w},\mathcal{M}_i\right)p\left(\mathbf{w}|\mathcal{M}_i\right)\mathrm{d}\mathbf{w}.$$
From a sampling perspective, the marginal likelihood can be viewed as the probability of generating the data set $\mathcal{D}$ from a model whose parameters are sampled at random from the prior.

We can obtain some insight into the model evidence by making a simple approximation to the integral over parameters. Consider first the case of a model having a single parameter $w$. The posterior distribution over parameters is proportional to $p\left(\mathcal{D}|w\right)p\left(w\right)$, where we omit the dependence on the model $\mathcal{M}_i$ to keep the notation uncluttered. If we assume that the posterior distribution is sharply peaked around the most probable value $w_{MAP}$, with width $\Delta w_{posterior}$, then we can approximate the integral by the value of the integrand at its maximum times the width of the peak. If we further assume that the prior is flat with width $\Delta w_{prior}$ so that $p\left(w\right)=1/\Delta w_{prior}$, then we have
$$p\left(\mathcal{D}\right)=\int p\left(\mathcal{D}|w\right)p\left(w\right)\mathrm{d}w\simeq p\left(\mathcal{D}|w_{MAP}\right)\frac{\Delta w_{posterior}}{\Delta w_{prior}}$$
and so taking logs we obtain
$$\ln p\left(\mathcal{D}\right)\simeq\ln p\left(\mathcal{D}|w_{MAP}\right)+\ln\left(\frac{\Delta w_{posterior}}{\Delta w_{prior}}\right).$$

The first term represents the fit to the data given by the most probable parameter values, and for a flat prior this would correspond to the log likelihood. The second term penalizes the model according to its complexity. Because $\Delta w_{posterior}<\Delta w_{prior}$ this term is negative, and it increases in magnitude as the ratio $\Delta w_{posterior}/\Delta w_{prior}$ gets smaller. Thus, if parameters are finely tuned to the data in the posterior distribution, then the penalty term is large.

For a model having a set of $M$ parameters, we can make a similar approximation for each parameter in turn. Assuming that all parameters have the same ratio of $\Delta w_{posterior}/\Delta w_{prior}$, we obtain
$$\ln p\left(\mathcal{D}\right)\simeq\ln p\left(\mathcal{D}|\mathbf{w}_{MAP}\right)+M\ln\left(\frac{\Delta w_{posterior}}{\Delta w_{prior}}\right).$$

Thus, in this very simple approximation, the size of the complexity penalty increases linearly with the number $M$ of adaptive parameters in the model. As we increase the complexity of the model, the first term will typically increase, because a more complex model is better able to fit the data, whereas the second term will become more negative due to the dependence on $M$. The optimal model complexity, as determined by the maximum evidence, will be given by a trade-off between these two competing terms.
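
The trade-off can be made concrete with a small numerical sketch. The candidate models, their best-fit log likelihoods, and the common width ratio of $0.1$ are all hypothetical values, and `approx_log_evidence` is an illustrative helper that simply evaluates the approximation above.

```python
import math

def approx_log_evidence(log_lik_at_map, M, width_posterior, width_prior):
    """ln p(D) ~= ln p(D | w_MAP) + M ln(width_posterior / width_prior)."""
    return log_lik_at_map + M * math.log(width_posterior / width_prior)

# Hypothetical candidates: (number of parameters M, best-fit log likelihood).
candidates = [(1, -60.0), (3, -42.0), (6, -40.5), (9, -40.0)]
ratio = 0.1   # assumed common value of width_posterior / width_prior

scores = {M: approx_log_evidence(ll, M, ratio, 1.0) for M, ll in candidates}
best_M = max(scores, key=scores.get)

for M, score in scores.items():
    print(M, round(score, 2))
print("preferred M:", best_M)
```

In this made-up setting the fit improves steadily with $M$, but beyond $M=3$ the gain in log likelihood is smaller than the extra penalty of $\ln 0.1\approx-2.3$ per parameter, so the intermediate model has the largest approximate evidence.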

Implicit in the Bayesian model comparison framework is the assumption that the true distribution from which the data are generated is contained within the set of models under consideration. Provided this is so, we can show that Bayesian model comparison will on average favour the correct model. To see this, consider two models $\mathcal{M}_1$ and $\mathcal{M}_2$ in which the truth corresponds to $\mathcal{M}_1$. For a given finite data set, it is possible for the Bayes factor to be larger for the incorrect model. However, if we average the log of the Bayes factor over the distribution of data sets, we obtain the expected log Bayes factor in the form
$$\int p\left(\mathcal{D}|\mathcal{M}_1\right)\ln\frac{p\left(\mathcal{D}|\mathcal{M}_1\right)}{p\left(\mathcal{D}|\mathcal{M}_2\right)}\mathrm{d}\mathcal{D}$$
where the average has been taken with respect to the true distribution of the data. This quantity is an example of the Kullback-Leibler divergence and satisfies the property of always being positive unless the two distributions are equal in which case it is zero. Thus on average the Bayes factor will always favour the correct model.
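
For intuition, the same quantity can be evaluated when the space of data sets is discrete, so that the integral becomes a sum. The two distributions below are hypothetical, and `expected_log_bayes_factor` is an illustrative helper name.

```python
import math

def expected_log_bayes_factor(p_true, p_other):
    """sum_D p(D | M_1) ln [ p(D | M_1) / p(D | M_2) ] for discrete distributions."""
    return sum(p * math.log(p / q) for p, q in zip(p_true, p_other) if p > 0.0)

# Hypothetical distributions over four possible data sets.
p_m1 = [0.1, 0.4, 0.4, 0.1]      # data distribution under the true model M_1
p_m2 = [0.25, 0.25, 0.25, 0.25]  # data distribution under the alternative M_2

kl_12 = expected_log_bayes_factor(p_m1, p_m2)
kl_11 = expected_log_bayes_factor(p_m1, p_m1)

print(kl_12, kl_11)
```

Individual terms of the sum can be negative (here the first and last data sets are more probable under $\mathcal{M}_2$), yet the weighted total is positive, and it vanishes when the two distributions coincide.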

We have seen that the Bayesian framework avoids the problem of over-fitting and allows models to be compared on the basis of the training data alone. However, a Bayesian approach, like any approach to pattern recognition, needs to make assumptions about the form of the model, and if these are invalid then the results can be misleading.

In a practical application, therefore, it will be wise to keep aside an independent test set of data on which to evaluate the overall performance of the final system.

## 3.5 The Evidence Approximation

We discuss an approximation in which we set the hyperparameters to specific values determined by maximizing the marginal likelihood function obtained by first integrating over the parameters $\mathbf{w}$. This framework is known in the statistics literature as empirical Bayes, or type 2 maximum likelihood, or generalized maximum likelihood, and in the machine learning literature is also called the evidence approximation.

If we introduce hyperpriors over $\alpha$ and $\beta$, the predictive distribution is obtained by marginalizing over $\mathbf{w}$, $\alpha$ and $\beta$ so that 
$$p\left(t|\mathsf{t}\right)=\int\int\int p\left(t|\mathbf{w},\beta\right)p\left(\mathbf{w}|\mathsf{t},\alpha,\beta\right)p\left(\alpha,\beta|\mathsf{t}\right)\mathrm{d}\mathbf{w}\;\mathrm{d}\alpha\;\mathrm{d}\beta$$
Here we have omitted the dependence on the input variable $\mathbf{x}$ to keep the notation uncluttered.

If the posterior distribution $p\left(\alpha,\beta|\mathsf{t}\right)$ is sharply peaked around values $\hat{\alpha}$ and $\hat{\beta}$, then the predictive distribution is obtained simply by marginalizing over $\mathbf{w}$ in which $\alpha$ and $\beta$ are fixed to the values $\hat{\alpha}$ and $\hat{\beta}$, so that 
$$p\left(t|\mathsf{t}\right)\simeq p\left(t|\mathsf{t},\hat{\alpha},\hat{\beta}\right)=\int p\left(t|\mathbf{w},\hat{\beta}\right)p\left(\mathbf{w}|\mathsf{t},\hat{\alpha},\hat{\beta}\right)\mathrm{d}\mathbf{w}.$$

From Bayes' theorem, the posterior distribution for $\alpha$ and $\beta$ is given by 
$$p\left(\alpha,\beta|\mathsf{t}\right)\propto p\left(\mathsf{t}|\alpha,\beta\right)p\left(\alpha,\beta\right).$$
If the prior is relatively flat, then in the evidence framework the values of $\hat{\alpha}$ and $\hat{\beta}$ are obtained by maximizing the marginal likelihood function $p\left(\mathsf{t}|\alpha,\beta\right)$.

### 3.5.1 Evaluation of the evidence function

The marginal likelihood function $p\left(\mathsf{t}|\alpha,\beta\right)$ is obtained by integrating over the weight parameters $\mathbf{w}$, so that 
$$p\left(\mathsf{t}|\alpha,\beta\right)=\int p\left(\mathsf{t}|\mathbf{w},\beta\right)p\left(\mathbf{w}|\alpha\right)\mathrm{d}\mathbf{w}.$$

We can write the evidence function in the form 
$$p\left(\mathsf{t}|\alpha,\beta\right)=\left(\frac{\beta}{2\pi}\right)^{N/2}\left(\frac{\alpha}{2\pi}\right)^{M/2}\int\exp\{-E\left(\mathbf{w}\right)\}\mathrm{d}\mathbf{w}$$
where $M$ is the dimensionality of $\mathbf{w}$.

We have defined 
$$E\left(\mathbf{w}\right)=\beta E_D\left(\mathbf{w}\right)+\alpha E_W\left(\mathbf{w}\right) \\
=\frac{\beta}{2}\|\mathsf{t}-\Phi\mathbf{w}\|^2+\frac{\alpha}{2}\mathbf{w}^\top\mathbf{w}.$$
We recognize this expression as being equal, up to a constant of proportionality, to the regularized sum-of-squares error function.

We now complete the square over $\mathbf{w}$ giving 
$$E\left(\mathbf{w}\right)=E\left(\mathbf{m}_N\right)+\frac{1}{2}\left(\mathbf{w}-\mathbf{m}_N\right)^\top\mathbf{A}\left(\mathbf{w}-\mathbf{m}_N\right)$$
where we have introduced 
$$\mathbf{A}=\alpha\mathbf{I}+\beta\Phi^\top\Phi$$
together with 
$$E\left(\mathbf{m}_N\right)=\frac{\beta}{2}\|\mathsf{t}-\Phi\mathbf{m}_N\|^2+\frac{\alpha}{2}\mathbf{m}_N^\top\mathbf{m}_N.$$

Note that $\mathbf{A}$ corresponds to the matrix of second derivatives of the error function
$$\mathbf{A}=\nabla\nabla E\left(\mathbf{w}\right)$$
and is known as the Hessian matrix.

Here we have also defined $\mathbf{m}_N$ given by 
$$\mathbf{m}_N=\beta\mathbf{A}^{-1}\Phi^\top\mathsf{t}.$$
We see that $\mathbf{A}=\mathbf{S}_N^{-1}$, and hence this expression represents the mean of the posterior distribution.

The integral over $\mathbf{w}$ can now be evaluated simply by appealing to the standard result for the normalization coefficient of a multivariate Gaussian, giving
$$\int\exp\{-E\left(\mathbf{w}\right)\}\mathrm{d}\mathbf{w}=\exp\{-E\left(\mathbf{m}_N\right)\}\int\exp\left\{-\frac{1}{2}\left(\mathbf{w}-\mathbf{m}_N\right)^\top\mathbf{A}\left(\mathbf{w}-\mathbf{m}_N\right)\right\}\mathrm{d}\mathbf{w}=\exp\{-E\left(\mathbf{m}_N\right)\}\left(2\pi\right)^{M/2}|\mathbf{A}|^{-1/2}.$$

We can then write the log of the marginal likelihood in the form 
$$\ln p\left(\mathsf{t}|\alpha,\beta\right)=\frac{M}{2}\ln\alpha+\frac{N}{2}\ln\beta-E\left(\mathbf{m}_N\right)-\frac{1}{2}\ln|\mathbf{A}|-\frac{N}{2}\ln\left(2\pi\right)$$
which is the required expression for the evidence function.
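
The closed-form expression can be sketched in code and cross-checked against a direct evaluation. The cross-check relies on the standard linear-Gaussian result that, with $\mathbf{w}$ integrated out, $\mathsf{t}$ is Gaussian with zero mean and covariance $\beta^{-1}\mathbf{I}+\alpha^{-1}\Phi\Phi^\top$. The data, the quadratic basis, and the function names `log_evidence` and `log_evidence_direct` are illustrative assumptions.

```python
import numpy as np

def log_evidence(Phi, t, alpha, beta):
    """ln p(t | alpha, beta) from the closed-form expression above."""
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    m_N = beta * np.linalg.solve(A, Phi.T @ t)
    E_mN = 0.5 * beta * np.sum((t - Phi @ m_N) ** 2) + 0.5 * alpha * m_N @ m_N
    _, logdet_A = np.linalg.slogdet(A)
    return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta)
            - E_mN - 0.5 * logdet_A - 0.5 * N * np.log(2.0 * np.pi))

def log_evidence_direct(Phi, t, alpha, beta):
    """Cross-check: marginally t ~ N(0, I / beta + Phi Phi^T / alpha)."""
    N = Phi.shape[0]
    C = np.eye(N) / beta + Phi @ Phi.T / alpha
    _, logdet_C = np.linalg.slogdet(C)
    return -0.5 * (N * np.log(2.0 * np.pi) + logdet_C + t @ np.linalg.solve(C, t))

rng = np.random.default_rng(4)
x = rng.uniform(-1.0, 1.0, size=25)
Phi = np.vander(x, 3, increasing=True)             # illustrative quadratic basis
t = 0.5 - x + rng.normal(0.0, 0.2, size=x.shape)

value = log_evidence(Phi, t, alpha=1.0, beta=25.0)
check = log_evidence_direct(Phi, t, alpha=1.0, beta=25.0)

print(value, check)
```

The first function works with the $M\times M$ matrix $\mathbf{A}$ and the second with an $N\times N$ covariance, so the closed form is the cheaper route whenever $M<N$.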

### 3.5.2 Maximizing the evidence function

Let us first consider the maximization of $p\left(\mathsf{t}|\alpha,\beta\right)$ with respect to $\alpha$. This can be done by first defining the following eigenvector equation
$$\left(\beta\Phi^\top\Phi\right)\mathbf{u}_i=\lambda_i\mathbf{u}_i.$$
It then follows that $\mathbf{A}$ has eigenvalues $\alpha+\lambda_i$.

Now consider the derivative of the term involving $\ln|\mathbf{A}|$ with respect to $\alpha$. We have 
$$\frac{\mathrm{d}}{\mathrm{d}\alpha}\ln|\mathbf{A}|=\frac{\mathrm{d}}{\mathrm{d}\alpha}\ln\prod_i\left(\lambda_i+\alpha\right)=\frac{\mathrm{d}}{\mathrm{d}\alpha}\sum_i\ln\left(\lambda_i+\alpha\right)=\sum_i\frac{1}{\lambda_i+\alpha}.$$

Thus the stationary points with respect to $\alpha$ satisfy
$$\frac{M}{2\alpha}-\frac{1}{2}\mathbf{m}_N^\top\mathbf{m}_N-\frac{1}{2}\sum_i\frac{1}{\lambda_i+\alpha}=0.$$

Multiplying through by $2\alpha$ and rearranging, we obtain
$$\alpha\mathbf{m}_N^\top\mathbf{m}_N=M-\alpha\sum_i\frac{1}{\lambda_i+\alpha}=\gamma.$$

Since there are $M$ terms in the sum over $i$, the quantity $\gamma$ can be written
$$\gamma=\sum_i\frac{\lambda_i}{\alpha+\lambda_i}.$$
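
The intermediate step can be spelled out by writing $M$ as a sum of $M$ ones and combining it term by term with the sum over eigenvalues:
$$\gamma=M-\alpha\sum_{i=1}^M\frac{1}{\lambda_i+\alpha}=\sum_{i=1}^M\left(1-\frac{\alpha}{\lambda_i+\alpha}\right)=\sum_{i=1}^M\frac{\lambda_i}{\alpha+\lambda_i}.$$
Because $\beta\Phi^\top\Phi$ is positive semidefinite, each $\lambda_i\geqslant0$, so for $\alpha>0$ every term in the final sum lies between $0$ and $1$ and hence $0\leqslant\gamma\leqslant M$.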

We see that the value of $\alpha$ that maximizes the marginal likelihood satisfies
$$\alpha=\frac{\gamma}{\mathbf{m}_N^\top\mathbf{m}_N}.$$

Note that this is an implicit solution for $\alpha$ not only because $\gamma$ depends on $\alpha$, but also because the mode $\mathbf{m}_N$ of the posterior distribution itself depends on the choice of $\alpha$. We therefore adopt an iterative procedure in which we make an initial choice for $\alpha$ and use this to find $\mathbf{m}_N$, and also to evaluate $\gamma$. These values are then used to re-estimate $\alpha$, and the process repeated until convergence. 

Note that because the matrix $\Phi^\top\Phi$ is fixed, we can compute its eigenvalues once at the start and then simply multiply these by $\beta$ to obtain the $\lambda_i$.

It should be emphasized that the value of $\alpha$ has been determined purely by looking at the training data. In contrast to maximum likelihood methods, no independent data set is required in order to optimize the model complexity.

We can similarly maximize the log marginal likelihood with respect to $\beta$. To do this, we note that the eigenvalues $\lambda_i$ are proportional to $\beta$, and hence $\mathrm{d}\lambda_i/\mathrm{d}\beta=\lambda_i/\beta$ giving
$$\frac{\mathrm{d}}{\mathrm{d}\beta}\ln|\mathbf{A}|=\frac{\mathrm{d}}{\mathrm{d}\beta}\sum_i\ln\left(\lambda_i+\alpha\right)=\frac{1}{\beta}\sum_i\frac{\lambda_i}{\lambda_i+\alpha}=\frac{\gamma}{\beta}.$$

The stationary point of the marginal likelihood therefore satisfies
$$\frac{N}{2\beta}-\frac{1}{2}\sum_{n=1}^N\{t_n-\mathbf{m}_N^\top\phi\left(\mathbf{x}_n\right)\}^2-\frac{\gamma}{2\beta}=0$$
and rearranging we obtain
$$\frac{1}{\beta}=\frac{1}{N-\gamma}\sum_{n=1}^N\{t_n-\mathbf{m}_N^\top\phi\left(\mathbf{x}_n\right)\}^2.$$

Again, this is an implicit solution for $\beta$ and can be solved by choosing an initial value for $\beta$ and then using this to calculate $\mathbf{m}_N$ and $\gamma$ and then re-estimate $\beta$, repeating until convergence. If both $\alpha$ and $\beta$ are to be determined from the data, then their values can be re-estimated together after each update of $\gamma$.
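
The joint re-estimation of $\alpha$ and $\beta$ can be sketched as a simple fixed-point loop. This is one possible arrangement under stated assumptions rather than a definitive implementation: the function name `evidence_reestimate`, the starting values, the stopping rule, and the synthetic data (generated from a cubic polynomial with noise standard deviation $0.2$) are all illustrative choices.

```python
import numpy as np

def evidence_reestimate(Phi, t, alpha=1.0, beta=1.0, max_iter=200, tol=1e-8):
    """Alternate between computing (m_N, gamma) and re-estimating alpha and beta."""
    N, M = Phi.shape
    eig = np.linalg.eigvalsh(Phi.T @ Phi)          # fixed eigenvalues, computed once
    for _ in range(max_iter):
        lam = beta * eig                           # eigenvalues of beta * Phi^T Phi
        A = alpha * np.eye(M) + beta * Phi.T @ Phi
        m_N = beta * np.linalg.solve(A, Phi.T @ t)
        gamma = np.sum(lam / (alpha + lam))
        alpha_new = gamma / (m_N @ m_N)
        beta_new = (N - gamma) / np.sum((t - Phi @ m_N) ** 2)
        converged = (abs(alpha_new - alpha) < tol * alpha
                     and abs(beta_new - beta) < tol * beta)
        alpha, beta = alpha_new, beta_new
        if converged:
            break
    return alpha, beta, gamma, m_N

rng = np.random.default_rng(5)
x = rng.uniform(-1.0, 1.0, size=100)
Phi = np.vander(x, 4, increasing=True)             # illustrative cubic polynomial basis
w_true = np.array([0.5, -1.0, 0.0, 2.0])           # hypothetical generating weights
t = Phi @ w_true + rng.normal(0.0, 0.2, size=x.shape)

alpha_hat, beta_hat, gamma, m_N = evidence_reestimate(Phi, t)

print(alpha_hat, beta_hat, gamma)
```

Since the noise standard deviation used to generate the data is $0.2$, the estimate of $\beta$ would be expected to land in the neighbourhood of $1/0.2^2=25$, and $\gamma$ should lie strictly between $0$ and $M=4$.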

### 3.5.3 Effective number of parameters

The quantity $\gamma$ therefore measures the effective total number of well determined parameters.

The effective number of parameters that are determined by the data is $\gamma$, with the remaining $M-\gamma$ parameters set to small values by the prior. This is reflected in the Bayesian result for the variance that has a factor $N-\gamma$ in the denominator, thereby correcting for the bias of the maximum likelihood result.

If we consider the limit $N\gg M$ in which the number of data points is large in relation to the number of parameters, then all of the parameters will be well determined by the data because $\Phi^\top\Phi$ involves an implicit sum over data points, and so the eigenvalues $\lambda_i$ increase with the size of the data set. In this case, $\gamma=M$, and the re-estimation equations for $\alpha$ and $\beta$ become
$$\alpha=\frac{M}{2E_W\left(\mathbf{m}_N\right)} \\
\beta=\frac{N}{2E_D\left(\mathbf{m}_N\right)}.$$
These results can be used as an easy-to-compute approximation to the full evidence re-estimation formulae, because they do not require evaluation of the eigenvalue spectrum of the Hessian.
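
To see where these limiting forms come from, recall from the definition of $E\left(\mathbf{w}\right)$ that $E_W\left(\mathbf{m}_N\right)=\frac{1}{2}\mathbf{m}_N^\top\mathbf{m}_N$ and $E_D\left(\mathbf{m}_N\right)=\frac{1}{2}\sum_{n=1}^N\{t_n-\mathbf{m}_N^\top\phi\left(\mathbf{x}_n\right)\}^2$. Substituting $\gamma=M$ into the general re-estimation equations, and using $N-M\simeq N$ when $N\gg M$, gives
$$\alpha=\frac{\gamma}{\mathbf{m}_N^\top\mathbf{m}_N}=\frac{M}{2E_W\left(\mathbf{m}_N\right)} \\
\frac{1}{\beta}=\frac{2E_D\left(\mathbf{m}_N\right)}{N-\gamma}\simeq\frac{2E_D\left(\mathbf{m}_N\right)}{N}.$$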

## 3.6 Limitations of Fixed Basis Functions

We have focussed on models comprising a linear combination of fixed, nonlinear basis functions. We have seen that the assumption of linearity in the parameters led to a range of useful properties including closed-form solutions to the least-squares problem, as well as a tractable Bayesian treatment. Furthermore, for a suitable choice of basis functions, we can model arbitrary nonlinearities in the mapping from input variables to targets.

The main difficulty with such models stems from the assumption that the basis functions $\phi_j\left(\mathbf{x}\right)$ are fixed before the training data set is observed, and is a manifestation of the curse of dimensionality. As a consequence, the number of basis functions needs to grow rapidly, often exponentially, with the dimensionality $D$ of the input space.

There are two properties of real data sets that we can exploit to help alleviate this problem. First of all, the data vectors $\{\mathbf{x}_n\}$ typically lie close to a nonlinear manifold whose intrinsic dimensionality is smaller than that of the input space as a result of strong correlations between the input variables. The second property is that target variables may have significant dependence on only a small number of possible directions within the data manifold. 