# 5 Neural Networks

We considered models for regression and classification that comprised linear combinations of fixed basis functions. Their practical applicability was limited by the curse of dimensionality. 

Support vector machines (SVMs) address this by first defining basis functions that are centred on the training data points and then selecting a subset of these during training. One advantage of SVMs is that, although the training involves nonlinear optimization, the objective function is convex, and so the solution of the optimization problem is relatively straightforward. The number of basis functions in the resulting models is generally much smaller than the number of training points, although it is often still relatively large and typically increases with the size of the training set. The relevance vector machine also chooses a subset from a fixed set of basis functions and typically results in much sparser models. Unlike the SVM it also produces probabilistic outputs, although this is at the expense of a nonconvex optimization during training.

An alternative approach is to fix the number of basis functions in advance but allow them to be adaptive, in other words to use parametric forms for the basis functions in which the parameter values are adapted during training. The most successful model of this type in the context of pattern recognition is the feed-forward neural network, also known as the multilayer perceptron. In fact, 'multilayer perceptron' is really a misnomer, because the model comprises multiple layers of logistic regression models (with continuous nonlinearities) rather than multiple perceptrons (with discontinuous nonlinearities). For many applications, the resulting model can be significantly more compact, and hence faster to evaluate, than a support vector machine having the same generalization performance. The price to be paid for this compactness, as with the relevance vector machine, is that the likelihood function, which forms the basis for network training, is no longer a convex function of the model parameters. In practice, however, it is often worth investing substantial computational resources during the training phase in order to obtain a compact model that is fast at processing new data.

The term 'neural network' has its origins in attempts to find mathematical representations of information processing in biological systems.

We begin by considering the functional form of the network model, including the specific parameterization of the basis functions, and we then discuss the problem of determining the network parameters within a maximum likelihood framework, which involves the solution of a nonlinear optimization problem. This requires the evaluation of derivatives of the log likelihood function with respect to the network parameters, and we shall see how these can be obtained efficiently using the technique of error backpropagation. We shall also show how the backpropagation framework can be extended to allow other derivatives to be evaluated, such as the Jacobian and Hessian matrices. Next we discuss various approaches to regularization of neural network training and the relationships between them. We also consider some extensions to the neural network model, and in particular we describe a general framework for modelling conditional probability distributions known as mixture density networks. Finally, we discuss the use of Bayesian treatments of neural networks.

## 5.1 Feed-forward Network Functions

The linear models for regression and classification are based on linear combinations of fixed nonlinear basis functions $\phi_j\left(\mathbf{x}\right)$ and take the form
$$y\left(\mathbf{x},\mathbf{w}\right)=f\left(\sum_{j=1}^Mw_j\phi_j\left(\mathbf{x}\right)\right)$$
where $f\left(\cdot\right)$ is a nonlinear activation function in the case of classification and is the identity in the case of regression. Our goal is to extend this model by making the basis functions $\phi_j\left(\mathbf{x}\right)$ depend on parameters and then to allow these parameters to be adjusted, along with the coefficients $\{w_j\}$, during training. There are, of course, many ways to construct parametric nonlinear basis functions. Neural networks use basis functions that follow the same form, so that each basis function is itself a nonlinear function of a linear combination of the inputs, where the coefficients in the linear combination are adaptive parameters.

This leads to the basic neural network model, which can be described as a series of functional transformations. First we construct $M$ linear combinations of the input variables $x_1,\dots,x_D$ in the form
$$a_j=\sum_{i=1}^Dw_{ji}^{\left(1\right)}x_i+w_{j0}^{\left(1\right)}$$
where $j=1,\dots,M$, and the superscript $\left(1\right)$ indicates that the corresponding parameters are in the first 'layer' of the network. We shall refer to the parameters $w_{ji}^{\left(1\right)}$ as weights and the parameters $w_{j0}^{\left(1\right)}$ as biases. The quantities $a_j$ are known as activations.

Each of these activations is then transformed using a differentiable, nonlinear activation function $h\left(\cdot\right)$ to give
$$z_j=h\left(a_j\right)$$
These quantities correspond to the outputs of the basis functions that, in the context of neural networks, are called hidden units. The nonlinear functions $h\left(\cdot\right)$ are generally chosen to be sigmoidal functions such as the logistic sigmoid or the 'tanh' function.

These values are again linearly combined to give output unit activations 
$$a_k=\sum_{j=1}^Mw_{kj}^{\left(2\right)}z_j+w_{k0}^{\left(2\right)}$$
where $k=1,\dots,K$, and $K$ is the total number of outputs. This transformation corresponds to the second layer of the network, and again the $w_{k0}^{\left(2\right)}$ are bias parameters.

Finally, the output unit activations are transformed using an appropriate activation function to give a set of network outputs $y_k$. The choice of activation function is determined by the nature of the data and the assumed distribution of target variables and follows the same considerations as for linear models. Thus for standard regression problems, the activation function is the identity so that $y_k=a_k$. Similarly, for multiple binary classification problems, each output unit activation is transformed using a logistic sigmoid function so that 
$$y_k=\sigma\left(a_k\right)$$
where
$$\sigma\left(a\right)=\frac{1}{1+\exp\left(-a\right)}.$$
Finally, for multiclass problems, a softmax activation function is used.

We can combine these various stages to give the overall network function that, for sigmoidal output unit activation functions, takes the form
$$y_k\left(\mathbf{x},\mathbf{w}\right)=\sigma\left(\sum_{j=1}^Mw_{kj}^{\left(2\right)}h\left(\sum_{i=1}^Dw_{ji}^{\left(1\right)}x_i+w_{j0}^{\left(1\right)}\right)+w_{k0}^{\left(2\right)}\right)$$
where the set of all weight and bias parameters has been grouped together into a vector $\mathbf{w}$. Thus the neural network model is simply a nonlinear function from a set of input variables $\{x_i\}$ to a set of output variables $\{y_k\}$ controlled by a vector $\mathbf{w}$ of adjustable parameters. The process of evaluating this function can then be interpreted as a forward propagation of information through the network.
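The forward propagation just described can be sketched in a few lines of plain Python. This is a minimal illustration rather than a definitive implementation: the function names `sigmoid` and `forward`, the choice of `tanh` for $h(\cdot)$, and the example weight values are all assumptions made for the sketch.

```python
import math

def sigmoid(a):
    """Logistic sigmoid, sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, W1, b1, W2, b2, h=math.tanh):
    """Forward propagation through a two-layer network with sigmoid outputs.

    W1[j][i], b1[j]: first-layer weights and biases (M hidden units, D inputs).
    W2[k][j], b2[k]: second-layer weights and biases (K outputs).
    """
    # First layer: activations a_j, then hidden unit outputs z_j = h(a_j).
    a_hidden = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    z = [h(a) for a in a_hidden]
    # Second layer: output activations a_k, then network outputs y_k = sigma(a_k).
    a_out = [sum(w * zj for w, zj in zip(row, z)) + b for row, b in zip(W2, b2)]
    return [sigmoid(a) for a in a_out]

# Hypothetical example values: D = 2 inputs, M = 3 hidden units, K = 1 output.
W1 = [[0.5, -0.3], [0.1, 0.8], [-0.7, 0.2]]
b1 = [0.0, 0.1, -0.1]
W2 = [[1.0, -1.0, 0.5]]
b2 = [0.2]
y = forward([1.0, 2.0], W1, b1, W2, b2)
```

Each output lies strictly between 0 and 1 because of the sigmoid output activation; replacing the final `sigmoid` by the identity would give the regression form of the network.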

The bias parameters can be absorbed into the set of weight parameters by defining an additional input variable $x_0$ whose value is clamped at $x_0=1$, so that the first-layer activation takes the form
$$a_j=\sum_{i=0}^Dw_{ji}^{\left(1\right)}x_i.$$

We can similarly absorb the second-layer biases into the second-layer weights, so that the overall network function becomes
$$y_k\left(\mathbf{x},\mathbf{w}\right)=\sigma\left(\sum_{j=0}^Mw_{kj}^{\left(2\right)}h\left(\sum_{i=0}^Dw_{ji}^{\left(1\right)}x_i\right)\right).$$
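As a small check of the bias-absorption trick, the following sketch compares first-layer activations computed with explicit biases against those computed from an augmented weight matrix and an input with $x_0=1$ prepended. The function names and numerical values are hypothetical.

```python
def activations_with_bias(x, W, b):
    """a_j = sum_i W[j][i] * x[i] + b[j], with explicit bias terms."""
    return [sum(w * xi for w, xi in zip(row, x)) + bj for row, bj in zip(W, b)]

def activations_absorbed(x, W_aug):
    """a_j = sum_{i=0}^{D} W_aug[j][i] * x_aug[i], with x_aug[0] clamped to 1."""
    x_aug = [1.0] + list(x)
    return [sum(w * xi for w, xi in zip(row, x_aug)) for row in W_aug]

# Hypothetical first layer with D = 2 inputs and M = 2 hidden units.
W = [[0.5, -0.3], [0.1, 0.8]]
b = [0.2, -0.4]
# Prepend each bias as column 0 of the augmented weight matrix.
W_aug = [[bj] + row for row, bj in zip(W, b)]

x = [1.0, 2.0]
a_explicit = activations_with_bias(x, W, b)
a_absorbed = activations_absorbed(x, W_aug)
```

The two results agree up to floating-point rounding, which is all the absorbed notation claims.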

The neural network model comprises two stages of processing, each of which resembles the perceptron model, and for this reason the neural network is also known as the multilayer perceptron, or MLP. A key difference compared to the perceptron, however, is that the neural network uses continuous sigmoidal nonlinearities in the hidden units, whereas the perceptron uses step-function nonlinearities. This means that the neural network function is differentiable with respect to the network parameters, and this property will play a central role in the network training.

If the activation functions of all the hidden units in a network are taken to be linear, then for any such network we can always find an equivalent network without hidden units. This follows from the fact that the composition of successive linear transformations is itself a linear transformation. However, if the number of hidden units is smaller than either the number of input or output units, then the transformations that the network can generate are not the most general possible linear transformations from inputs to outputs because information is lost in the dimensionality reduction at the hidden units. We show that networks of linear units give rise to principal component analysis. In general, however, there is little interest in multilayer networks of linear units.
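The collapse of a linear-hidden-unit network into a single linear map can be illustrated with a short sketch. Biases are omitted for brevity, and the helper names `matmul` and `matvec` as well as the weight values are assumptions made for this example.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(A, v):
    """Product of a matrix (list of rows) with a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

# Hypothetical weights: D = 3 inputs, M = 2 linear hidden units, K = 3 outputs.
W1 = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0]]        # M x D
W2 = [[1.0, 2.0], [0.0, 1.0], [-1.0, 0.5]]      # K x M
x = [0.3, -0.2, 0.7]

two_layer = matvec(W2, matvec(W1, x))   # linear hidden units, so z = a
W_equiv = matmul(W2, W1)                # single K x D weight matrix
one_layer = matvec(W_equiv, x)
```

Here `W_equiv` reproduces the two-layer mapping exactly, but because $M<D$ and $M<K$ its rank is at most $M=2$, so it cannot represent an arbitrary $3\times 3$ linear transformation.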

The network architecture is easily generalized, for instance by considering additional layers of processing, each consisting of a weighted linear combination of the same form as the output unit activations, followed by an element-wise transformation using a nonlinear activation function.

Another generalization of the network architecture is to include skip-layer connections, each of which is associated with a corresponding adaptive parameter.

Furthermore, the network can be sparse, with not all possible connections within a layer being present. We shall see an example of a sparse network architecture when we consider convolutional neural networks.

Because there is a direct correspondence between a network diagram and its mathematical function, we can develop more general network mappings by considering more complex network diagrams. However, these must be restricted to a feed-forward architecture, in other words to one having no closed directed cycles, to ensure that the outputs are deterministic functions of the input.

The approximation properties of feed-forward networks have been found to be very general. Neural networks are therefore said to be universal approximators. For example, a two-layer network with linear outputs can uniformly approximate any continuous function on a compact input domain to arbitrary accuracy provided the network has a sufficiently large number of hidden units. This result holds for a wide range of hidden unit activation functions, but excluding polynomials. Although such theorems are reassuring, the key problem is how to find suitable parameter values given a set of training data, and we will show that there exist effective solutions to this problem based on both maximum likelihood and Bayesian approaches.

### 5.1.1 Weight-space symmetries

One property of feed-forward networks, which will play a role when we consider Bayesian model comparison, is that multiple distinct choices for the weight vector $\mathbf{w}$ can all give rise to the same mapping function from inputs to outputs.
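Two well-known instances of this symmetry can be checked numerically, assuming `tanh` hidden units (an assumption of this sketch, chosen because $\tanh(-a)=-\tanh(a)$): flipping the sign of all weights into and out of one hidden unit, and interchanging two hidden units together with their weights. The function name `network` and all weight values are hypothetical.

```python
import math

def network(x, W1, b1, W2, b2):
    """Two-layer network with tanh hidden units and linear outputs."""
    z = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    return [sum(w * zj for w, zj in zip(row, z)) + b for row, b in zip(W2, b2)]

W1 = [[0.5, -0.3], [0.1, 0.8]]
b1 = [0.2, -0.4]
W2 = [[1.5, -0.7]]
b2 = [0.1]
x = [0.9, -1.1]
y_ref = network(x, W1, b1, W2, b2)

# Sign-flip symmetry: negate everything feeding hidden unit 0 and its outgoing weight.
W1_flip = [[-w for w in W1[0]], W1[1]]
b1_flip = [-b1[0], b1[1]]
W2_flip = [[-W2[0][0], W2[0][1]]]
y_flip = network(x, W1_flip, b1_flip, W2_flip, b2)

# Interchange symmetry: swap the two hidden units along with their weights.
W1_swap = [W1[1], W1[0]]
b1_swap = [b1[1], b1[0]]
W2_swap = [[W2[0][1], W2[0][0]]]
y_swap = network(x, W1_swap, b1_swap, W2_swap, b2)
```

All three weight settings are distinct points in weight space, yet `y_ref`, `y_flip`, and `y_swap` coincide for every input.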

## 5.2 Network Training

We have viewed neural networks as a general class of parametric nonlinear functions from a vector $\mathbf{x}$ of input variables to a vector $\mathbf{y}$ of output variables. A simple approach to the problem of determining the network parameters is to minimize a sum-of-squares error function. Given a training set comprising a set of input vectors $\{\mathbf{x}_n\}$, where $n=1,\dots,N$, together with a corresponding set of target vectors $\{\mathsf{t}_n\}$, we minimize the error function
$$E\left(\mathbf{w}\right)=\frac{1}{2}\sum_{n=1}^N\|\mathbf{y}\left(\mathbf{x}_n,\mathbf{w}\right)-\mathsf{t}_n\|^2.$$

We start by discussing regression problems, and for the moment we consider a single target variable $t$ that can take any real value. We assume that $t$ has a Gaussian distribution with an $\mathbf{x}$-dependent mean, which is given by the output of the neural network, so that
$$p\left(t|\mathbf{x},\mathbf{w}\right)=\mathcal{N}\left(t|y\left(\mathbf{x},\mathbf{w}\right),\beta^{-1}\right)$$
where $\beta$ is the precision (inverse variance) of the Gaussian noise. For the conditional distribution, it is sufficient to take the output unit activation function to be the identity, because such a network can approximate any continuous function from $\mathbf{x}$ to $y$.

Given a data set of $N$ independent, identically distributed observations $\mathbf{X}=\{\mathbf{x}_1,\dots,\mathbf{x}_N\}$, along with corresponding target values $\mathbf{t}=\{t_1,\dots,t_N\}$, we can construct the corresponding likelihood function
$$p\left(\mathbf{t}|\mathbf{X},\mathbf{w},\beta\right)=\prod_{n=1}^Np\left(t_n|\mathbf{x}_n,\mathbf{w},\beta\right).$$

Taking the negative logarithm, we obtain the error function
$$\frac{\beta}{2}\sum_{n=1}^N\{y\left(\mathbf{x}_n,\mathbf{w}\right)-t_n\}^2-\frac{N}{2}\ln\beta+\frac{N}{2}\ln\left(2\pi\right)$$
which can be used to learn the parameters $\mathbf{w}$ and $\beta$. We consider a maximum likelihood approach. Note that in the neural networks literature, it is usual to consider the minimization of an error function rather than the maximization of the (log) likelihood, and so here we shall follow this convention.

Consider first the determination of $\mathbf{w}$. Maximizing the likelihood function is equivalent to minimizing the sum-of-squares error function given by
$$E\left(\mathbf{w}\right)=\frac{1}{2}\sum_{n=1}^N\{y\left(\mathbf{x}_n,\mathbf{w}\right)-t_n\}^2$$
where we have discarded additive and multiplicative constants. The value of $\mathbf{w}$ found by minimizing $E\left(\mathbf{w}\right)$ will be denoted $\mathbf{w}_{ML}$ because it corresponds to the maximum likelihood solution. In practice, the nonlinearity of the network function $y\left(\mathbf{x}_n,\mathbf{w}\right)$ causes the error $E\left(\mathbf{w}\right)$ to be nonconvex, and so local maxima of the likelihood may be found, corresponding to local minima of the error function.

Having found $\mathbf{w}_{ML}$, the value of $\beta$ can be found by minimizing the negative log likelihood to give
$$\frac{1}{\beta_{ML}}=\frac{1}{N}\sum_{n=1}^N\{y\left(\mathbf{x}_n,\mathbf{w}_{ML}\right)-t_n\}^2.$$
Note that this can be evaluated once the iterative optimization required to find $\mathbf{w}_{ML}$ is completed.
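The two-step procedure (first the weights, then $\beta$) can be made concrete with a small sketch. It assumes the network outputs at $\mathbf{w}_{ML}$ are already available as a list of predictions; the function names and the data values are hypothetical.

```python
import math

def sum_of_squares_error(y, t):
    """E(w) = 0.5 * sum_n (y_n - t_n)^2 for given network outputs y."""
    return 0.5 * sum((yn - tn) ** 2 for yn, tn in zip(y, t))

def negative_log_likelihood(y, t, beta):
    """Negative log likelihood under Gaussian noise with precision beta."""
    N = len(t)
    return (beta * sum_of_squares_error(y, t)
            - 0.5 * N * math.log(beta)
            + 0.5 * N * math.log(2.0 * math.pi))

def beta_ml(y, t):
    """Solve 1 / beta_ML = (1 / N) * sum_n (y_n - t_n)^2 for beta_ML."""
    return len(t) / (2.0 * sum_of_squares_error(y, t))

# Hypothetical network outputs at w_ML and the corresponding targets.
y_pred = [1.1, 1.9, 3.2, 3.8]
targets = [1.0, 2.0, 3.0, 4.0]

E = sum_of_squares_error(y_pred, targets)
beta = beta_ml(y_pred, targets)
```

For these values the mean squared residual is 0.025, so $\beta_{ML}$ is about 40, and the negative log likelihood is larger for any other choice of $\beta$ with the same predictions.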

If we have multiple target variables, and we assume that they are independent conditional on $\mathbf{x}$ and $\mathbf{w}$ with shared noise precision $\beta$, then the conditional distribution of the target values is given by
$$p\left(\mathsf{t}|\mathbf{x},\mathbf{w}\right)=\mathcal{N}\left(\mathsf{t}|\mathbf{y}\left(\mathbf{x},\mathbf{w}\right),\beta^{-1}\mathbf{I}\right).$$
Following the same argument as for a single target variable, we see that the maximum likelihood weights are determined by minimizing the sum-of-squares error function.

The noise precision is then given by 
$$\frac{1}{\beta_{ML}}=\frac{1}{NK}\sum_{n=1}^N\|\mathbf{y}\left(\mathbf{x}_n,\mathbf{w}_{ML}\right)-\mathsf{t}_n\|^2$$
where $K$ is the number of target variables. The assumption of independence can be dropped at the expense of a slightly more complex optimization problem.

There is a natural pairing of the error function (given by the negative log likelihood) and the output unit activation function. In the regression case, we can view the network as having an output activation function that is the identity, so that $y_k=a_k$. The corresponding sum-of-squares error function has the property
$$\frac{\partial E}{\partial a_k}=y_k-t_k$$
which we shall make use of when discussing error backpropagation.

Now consider the case of binary classification in which we have a single target variable $t$ such that $t=1$ denotes class $\mathcal{C}_1$ and $t=0$ denotes class $\mathcal{C}_2$. Following the discussion of canonical link functions, we consider a network having a single output whose activation function is a logistic sigmoid
$$y=\sigma\left(a\right)\equiv\frac{1}{1+\exp\left(-a\right)}$$
so that $0\leqslant y\left(\mathbf{x},\mathbf{w}\right)\leqslant1$. We can interpret $y\left(\mathbf{x},\mathbf{w}\right)$ as the conditional probability $p\left(\mathcal{C}_1|\mathbf{x}\right)$, with $p\left(\mathcal{C}_2|\mathbf{x}\right)$ given by $1-y\left(\mathbf{x},\mathbf{w}\right)$.

The conditional distribution of targets given inputs is then a Bernoulli distribution of the form
$$p\left(t|\mathbf{x},\mathbf{w}\right)=y\left(\mathbf{x},\mathbf{w}\right)^t\{1-y\left(\mathbf{x},\mathbf{w}\right)\}^{1-t}.$$

If we consider a training set of independent observations, then the error function, which is given by the negative log likelihood, is a cross-entropy error function of the form
$$E\left(\mathbf{w}\right)=-\sum_{n=1}^N\{t_n\ln y_n+\left(1-t_n\right)\ln\left(1-y_n\right)\}$$
where $y_n$ denotes $y\left(\mathbf{x}_n,\mathbf{w}\right)$. Note that there is no analogue of the noise precision $\beta$ because the target values are assumed to be correctly labelled. Simard et al. (2003) found that using the cross-entropy error function instead of the sum-of-squares for a classification problem leads to faster training as well as improved generalization.
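A standard property of pairing the logistic sigmoid output with the cross-entropy error is that the derivative with respect to the output activation is $y-t$ for each pattern. The sketch below checks this with a central finite difference for a single pattern; the function names and the values of `a` and `t` are illustrative assumptions.

```python
import math

def sigmoid(a):
    """Logistic sigmoid, sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

def cross_entropy(a, t):
    """Cross-entropy error for one pattern as a function of the output activation a."""
    y = sigmoid(a)
    return -(t * math.log(y) + (1 - t) * math.log(1 - y))

def numerical_derivative(f, a, eps=1e-6):
    """Central finite-difference estimate of df/da."""
    return (f(a + eps) - f(a - eps)) / (2 * eps)

a, t = 0.7, 1
analytic = sigmoid(a) - t                                          # y - t
numeric = numerical_derivative(lambda v: cross_entropy(v, t), a)
```

The two values agree to within the accuracy of the finite-difference estimate.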

If we have $K$ separate binary classifications to perform, then we can use a network having $K$ outputs each of which has a logistic sigmoid activation function. Associated with each output is a binary class label $t_k\in\{0,1\}$, where $k=1,\dots,K$. If we assume that the class labels are independent, given the input vector, then the conditional distribution of the targets is
$$p\left(\mathsf{t}|\mathbf{x},\mathbf{w}\right)=\prod_{k=1}^Ky_k\left(\mathbf{x},\mathbf{w}\right)^{t_k}\left[1-y_k\left(\mathbf{x},\mathbf{w}\right)\right]^{1-t_k}.$$

Taking the negative logarithm of the corresponding likelihood function then gives the following error function
$$E\left(\mathbf{w}\right)=-\sum_{n=1}^N\sum_{k=1}^K\{t_{nk}\ln y_{nk}+\left(1-t_{nk}\right)\ln\left(1-y_{nk}\right)\}$$
where $y_{nk}$ denotes $y_k\left(\mathbf{x}_n,\mathbf{w}\right)$. Again, the derivative of the error function with respect to the activation for a particular output unit takes the form $\partial E/\partial a_k=y_k-t_k$, just as in the regression case.

We see that the weight parameters in the first layer of the network are shared between the various outputs, whereas in the linear model each classification problem is solved independently. The first layer of the network can be viewed as performing a nonlinear feature extraction, and the sharing of features between the different outputs can save on computation and can also lead to improved generalization.

Finally, we consider the standard multiclass classification problem in which each input is assigned to one of $K$ mutually exclusive classes. The binary target variables $t_k\in\{0,1\}$ have a 1-of-$K$ coding scheme indicating the class, and the network outputs are interpreted as $y_k\left(\mathbf{x},\mathbf{w}\right)=p\left(t_k=1|\mathbf{x}\right)$, leading to the following error function
$$E\left(\mathbf{w}\right)=-\sum_{n=1}^N\sum_{k=1}^Kt_{nk}\ln y_k\left(\mathbf{x}_n,\mathbf{w}\right).$$

Following the discussion, we see that the output unit activation function, which corresponds to the canonical link, is given by the softmax function
$$y_k\left(\mathbf{x},\mathbf{w}\right)=\frac{\exp\left(a_k\left(\mathbf{x},\mathbf{w}\right)\right)}{\sum_j\exp\left(a_j\left(\mathbf{x},\mathbf{w}\right)\right)}$$
which satisfies $0\leqslant y_k\leqslant 1$ and $\sum_k y_k=1$. Note that the $y_k\left(\mathbf{x},\mathbf{w}\right)$ are unchanged if a constant is added to all of the $a_k\left(\mathbf{x},\mathbf{w}\right)$, causing the error function to be constant for some directions in weight space. This degeneracy is removed if an appropriate regularization term is added to the error function.

Once again, the derivative of the error function with respect to the activation for a particular output unit takes the familiar form.
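The following sketch illustrates two properties of the softmax output for a single pattern: invariance to adding a constant to all activations, and the derivative $y_k-t_k$ of the multiclass cross-entropy with respect to $a_k$, checked by finite differences. Subtracting the maximum activation inside `softmax` is a common numerical safeguard that relies on exactly this invariance; the function names and values are assumptions.

```python
import math

def softmax(a):
    """Softmax outputs y_k; subtracting max(a) avoids overflow and leaves y_k unchanged."""
    m = max(a)
    exps = [math.exp(ak - m) for ak in a]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(a, t):
    """Multiclass cross-entropy for one pattern with a 1-of-K target vector t."""
    return -sum(tk * math.log(yk) for tk, yk in zip(t, softmax(a)))

a = [2.0, -1.0, 0.5]
t = [0, 0, 1]
y = softmax(a)
y_shifted = softmax([ak + 10.0 for ak in a])   # same outputs after a constant shift

# Compare the analytic derivative y_k - t_k with central finite differences.
analytic = [yk - tk for yk, tk in zip(y, t)]
eps = 1e-6
numeric = []
for k in range(len(a)):
    a_plus = list(a)
    a_minus = list(a)
    a_plus[k] += eps
    a_minus[k] -= eps
    numeric.append((cross_entropy(a_plus, t) - cross_entropy(a_minus, t)) / (2 * eps))
```

Because the error is flat along the direction in which all $a_k$ change equally, the entries of `analytic` sum to zero.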

In summary, there is a natural choice of both output unit activation function and matching error function, according to the type of problem being solved. For regression we use linear outputs and a sum-of-squares error, for (multiple independent) binary classifications we use logistic sigmoid outputs and a cross-entropy error function, and for multiclass classification we use softmax outputs with the corresponding multiclass cross-entropy error function. For classification problems involving two classes, we can use a single logistic sigmoid output, or alternatively we can use a network with two outputs having a softmax output activation function.

### 5.2.1 Parameter optimization

We turn next to the task of finding a weight vector $\mathbf{w}$ which minimizes the chosen function $E\left(\mathbf{w}\right)$. At this point, it is useful to have a geometrical picture of the error function, which we can view as a surface sitting over weight space. First note that if we make a small step in weight space from $\mathbf{w}$ to $\mathbf{w}+\delta\mathbf{w}$ then the change in the error function is $\delta E\simeq\delta\mathbf{w}^\top\nabla E\left(\mathbf{w}\right)$, where the vector $\nabla E\left(\mathbf{w}\right)$ points in the direction of greatest rate of increase of the error function. Because the error $E\left(\mathbf{w}\right)$ is a smooth continuous function of $\mathbf{w}$, its smallest value will occur at a point in weight space such that the gradient of the error function vanishes, so that
$$\nabla E\left(\mathbf{w}\right)=0$$
as otherwise we could make a small step in the direction of $-\nabla E\left(\mathbf{w}\right)$ and thereby further reduce the error. Points at which the gradient vanishes are called stationary points, and may be further classified into minima, maxima, and saddle points.
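The first-order relation $\delta E\simeq\delta\mathbf{w}^\top\nabla E$ and the fact that a small step along $-\nabla E$ reduces the error can be checked on a toy example. The error function below is a hypothetical smooth function of two weights chosen for illustration, not a real network error.

```python
def error(w):
    """Hypothetical smooth error function of two weights (not a real network)."""
    return (w[0] ** 2 - w[1]) ** 2 + 0.5 * (w[0] - 1.0) ** 2

def grad_error(w):
    """Analytic gradient of the error function above."""
    r = w[0] ** 2 - w[1]
    return [4.0 * w[0] * r + (w[0] - 1.0), -2.0 * r]

w = [0.5, 0.3]
g = grad_error(w)

# Small step in the direction of the negative gradient.
step = 1e-3
delta_w = [-step * gi for gi in g]
w_new = [wi + di for wi, di in zip(w, delta_w)]

predicted_change = sum(di * gi for di, gi in zip(delta_w, g))   # delta_w^T grad E
actual_change = error(w_new) - error(w)
```

Both changes are negative and differ only by a term that is second order in the step size.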

Our goal is to find a vector $\mathbf{w}$ such that $E\left(\mathbf{w}\right)$ takes its smallest value. However, the error function typically has a highly nonlinear dependence on the weights and bias parameters, and so there will be many points in weight space at which the gradient vanishes (or is numerically very small).

Furthermore, there will typically be multiple inequivalent stationary points and in particular multiple inequivalent minima. A minimum that corresponds to the smallest value of the error function for any weight vector is said to be a global minimum. Any other minima corresponding to higher values of the error function are said to be local minima. For a successful application of neural networks, it may not be necessary to find the global minimum (and in general it will not be known whether the global minimum has been found) but it may be necessary to compare several local minima in order to find a sufficiently good solution.

Because there is clearly no hope of finding an analytical solution to the equation $\nabla E\left(\mathbf{w}\right)=0$ we resort to iterative numerical procedures. The optimization of continuous nonlinear functions is a widely studied problem and there exists an extensive literature on how to solve it efficiently. Most techniques involve choosing some initial value $\mathbf{w}^{\left(0\right)}$ for the weight vector and then moving through weight space in a succession of steps of the form
$$\mathbf{w}^{\left(\tau+1\right)}=\mathbf{w}^{\left(\tau\right)}+\Delta \mathbf{w}^{\left(\tau\right)}$$
where $\tau$ labels the iteration step. Different algorithms involve different choices for the weight vector update $\Delta\mathbf{w}^{\left(\tau\right)}$. Many algorithms make use of gradient information and therefore require that, after each update, the value of $\nabla E\left(\mathbf{w}\right)$ is evaluated at the new weight vector $\mathbf{w}^{\left(\tau+1\right)}$. In order to understand the importance of gradient information, it is useful to consider a local approximation to the error function based on a Taylor expansion.

### 5.2.2 Local quadratic approximation

Insight into the optimization problem, and into the various techniques for solving it, can be obtained by considering a local quadratic approximation to the error function.

Consider the Taylor expansion of $E\left(\mathbf{w}\right)$ around some point $\hat{\mathbf{w}}$ in weight space
$$E\left(\mathbf{w}\right)\simeq E\left(\hat{\mathbf{w}}\right)+\left(\mathbf{w}-\hat{\mathbf{w}}\right)^\top\mathbf{b}+\frac{1}{2}\left(\mathbf{w}-\hat{\mathbf{w}}\right)^\top\mathbf{H}\left(\mathbf{w}-\hat{\mathbf{w}}\right)$$
where cubic and higher terms have been omitted. Here $\mathbf{b}$ is defined to be the gradient of $E$ evaluated at $\hat{\mathbf{w}}$
$$\mathbf{b}\equiv\nabla E\big|_{\mathbf{w}=\hat{\mathbf{w}}}$$
and the Hessian matrix $\mathbf{H}=\nabla\nabla E$ has elements
$$\left(\mathbf{H}\right)_{ij}\equiv\frac{\partial^2 E}{\partial w_i \partial w_j}\bigg|_{\mathbf{w}=\hat{\mathbf{w}}}.$$
The corresponding local approximation to the gradient is given by
$$\nabla E\simeq\mathbf{b}+\mathbf{H}\left(\mathbf{w}-\hat{\mathbf{w}}\right).$$
For points $\mathbf{w}$ that are sufficiently close to $\hat{\mathbf{w}}$, these expressions will give reasonable approximations for the error and its gradient.

Consider the particular case of a local quadratic approximation around a point $\mathbf{w}^{*}$ that is a minimum of the error function. In this case there is no linear term, because $\nabla E=0$ at $\mathbf{w}^{*}$, and the loss function becomes 
$$E\left(\mathbf{w}\right)\simeq E\left(\mathbf{w}^*\right)+\frac{1}{2}\left(\mathbf{w}-\mathbf{w}^*\right)^\top\mathbf{H}\left(\mathbf{w}-\mathbf{w}^*\right)$$
where the Hessian $\mathbf{H}$ is evaluated at $\mathbf{w}^*$. In order to interpret this geometrically, consider the eigenvalue equation for the Hessian matrix
$$\mathbf{H}\mathbf{u}_i=\lambda_i\mathbf{u}_i$$
where the eigenvectors $\mathbf{u}_i$ form a complete orthonormal set so that 
$$\mathbf{u}_i^\top\mathbf{u}_j=\delta_{ij}.$$

We now expand $\left(\mathbf{w}-\mathbf{w}^*\right)$ as a linear combination of the eigenvectors in the form
$$\left(\mathbf{w}-\mathbf{w}^*\right)=\sum_i\alpha_i\mathbf{u}_i.$$
This can be regarded as a transformation of the coordinate system in which the origin is translated to the point $\mathbf{w}^*$, and the axes are rotated to align with the eigenvectors (through the orthogonal matrix whose columns are the $\mathbf{u}_i$).

This allows the error function to be written in the form
$$E\left(\mathbf{w}\right)\simeq E\left(\mathbf{w}^*\right)+\frac{1}{2}\left(\sum_i\alpha_i\mathbf{u}_i\right)^\top\mathbf{H}\left(\sum_i\alpha_i\mathbf{u}_i\right)=E\left(\mathbf{w}^*\right)+\frac{1}{2}\sum_i\lambda_i\alpha_i^2.$$

A matrix $\mathbf{H}$ is said to be positive definite if, and only if,
$$\mathbf{v}^\top\mathbf{H}\mathbf{v}>0\quad\text{for all }\mathbf{v}\neq 0.$$
Because the eigenvectors $\{\mathbf{u}_i\}$ form a complete set, an arbitrary vector $\mathbf{v}$ can be written in the form
$$\mathbf{v}=\sum_ic_i\mathbf{u}_i.$$
We have 
$$\mathbf{v}^\top\mathbf{H}\mathbf{v}=\sum_ic_i^2\lambda_i$$
and so $\mathbf{H}$ will be positive definite if, and only if, all of its eigenvalues are positive. In the new coordinate system, whose basis vectors are given by the eigenvectors $\{\mathbf{u}_i\}$, the contours of constant $E$ are ellipses centred on the origin. 
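The change of coordinates to the eigenvector basis can be verified numerically for a two-dimensional weight space. The sketch below diagonalizes a hypothetical symmetric $2\times 2$ Hessian with a single Jacobi rotation (a shortcut that is adequate only in two dimensions) and confirms that $\frac{1}{2}\left(\mathbf{w}-\mathbf{w}^*\right)^\top\mathbf{H}\left(\mathbf{w}-\mathbf{w}^*\right)=\frac{1}{2}\sum_i\lambda_i\alpha_i^2$. All names and values are assumptions.

```python
import math

def eig_sym_2x2(H):
    """Eigenvalues and orthonormal eigenvectors of a symmetric 2x2 matrix.

    Uses a single Jacobi rotation; adequate only for this 2x2 illustration.
    """
    a, b, c = H[0][0], H[0][1], H[1][1]
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    cs, sn = math.cos(theta), math.sin(theta)
    u1, u2 = [cs, sn], [-sn, cs]
    lam1 = a * cs * cs + 2.0 * b * cs * sn + c * sn * sn
    lam2 = a * sn * sn - 2.0 * b * cs * sn + c * cs * cs
    return [lam1, lam2], [u1, u2]

def quad_form(H, v):
    """v^T H v for a 2x2 matrix H."""
    return sum(v[i] * H[i][j] * v[j] for i in range(2) for j in range(2))

# Hypothetical Hessian at a minimum w*, and a displacement d = w - w*.
H = [[3.0, 1.0], [1.0, 2.0]]
d = [0.4, -0.7]

lams, us = eig_sym_2x2(H)
alphas = [sum(ui * di for ui, di in zip(u, d)) for u in us]   # alpha_i = u_i^T d

direct = 0.5 * quad_form(H, d)
via_eigen = 0.5 * sum(l * al ** 2 for l, al in zip(lams, alphas))
is_positive_definite = all(l > 0 for l in lams)
```

For this Hessian both eigenvalues are positive, so the quadratic term is positive for every nonzero displacement and the stationary point is a minimum.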

For a one-dimensional weight space, a stationary point $w^*$ will be a minimum if
$$\frac{\partial^2E}{\partial w^2}\bigg|_{w^*}>0.$$
The corresponding result in $D$ dimensions is that the Hessian matrix, evaluated at $\mathbf{w}^*$, should be positive definite.

### 5.2.3 Use of gradient information

It is possible to evaluate the gradient of an error function efficiently by means of the backpropagation procedure. The use of this gradient information can lead to significant improvements in the speed with which the minima of the error function can be located.

### 5.2.4 Gradient descent optimization

The simplest approach to using gradient information is to choose the weight update to comprise a small step in the direction of the negative gradient, so that
$$\mathbf{w}^{\left(\tau+1\right)}=\mathbf{w}^{\left(\tau\right)}-\eta\nabla E\left(\mathbf{w}^{\left(\tau\right)}\right)$$
where the parameter $\eta>0$ is known as the learning rate. After each such update, the gradient is re-evaluated for the new weight vector and the process repeated. Note that the error function is defined with respect to a training set, and so each step requires that the entire training set be processed in order to evaluate $\nabla E$. Techniques that use the whole data set at once are called batch methods. At each step the weight vector is moved in the direction of the greatest rate of decrease of the error function, and so this approach is known as gradient descent or steepest descent.
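The batch update rule can be sketched on a deliberately simple problem. To keep the example short, the model here is a hypothetical two-parameter linear model $y=w_0+w_1x$ with a sum-of-squares error rather than a neural network; the update rule itself is the same. The data, learning rate, and step count are assumptions.

```python
def gradient(w, xs, ts):
    """Gradient of E(w) = 0.5 * sum_n (w0 + w1 * x_n - t_n)^2 over the whole data set."""
    g0 = g1 = 0.0
    for x, t in zip(xs, ts):
        residual = w[0] + w[1] * x - t
        g0 += residual
        g1 += residual * x
    return [g0, g1]

def batch_gradient_descent(xs, ts, eta=0.05, steps=500):
    """Repeat w <- w - eta * grad E(w), using the full training set at every step."""
    w = [0.0, 0.0]
    for _ in range(steps):
        g = gradient(w, xs, ts)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

# Hypothetical noise-free data generated from t = 1 + 2 x.
xs = [0.0, 1.0, 2.0, 3.0]
ts = [1.0, 3.0, 5.0, 7.0]
w_fit = batch_gradient_descent(xs, ts)
```

With this learning rate the iteration converges to weights close to $(1, 2)$; a learning rate that is too large relative to the curvature of the error surface would instead cause the iterates to diverge.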

For batch optimization, there are more efficient methods, such as conjugate gradients and quasi-Newton methods, which are much more robust and much faster than simple gradient descent. Unlike gradient descent, these algorithms have the property that the error function always decreases at each iteration unless the weight vector has arrived at a local or global minimum.

In order to find a sufficiently good minimum, it may be necessary to run a gradient-based algorithm multiple times, each time using a different randomly chosen starting point, and comparing the resulting performance on an independent validation set.

There is, however, an on-line version of gradient descent that has proved useful in practice for training neural networks on large data sets. Error functions based on maximum likelihood for a set of independent observations comprise a sum of terms, one for each data point
$$E\left(\mathbf{w}\right)=\sum_{n=1}^N E_n\left(\mathbf{w}\right).$$
On-line gradient descent, also known as sequential gradient descent or stochastic gradient descent, makes an update to the weight vector based on one data point at a time, so that
$$\mathbf{w}^{\left(\tau+1\right)}=\mathbf{w}^{\left(\tau\right)}-\eta\nabla E_n\left(\mathbf{w}^{\left(\tau\right)}\right).$$
This update is repeated by cycling through the data either in sequence or by selecting points at random with replacement. There are of course intermediate scenarios in which the updates are based on batches of data points.
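A minimal sketch of the on-line update, again using a hypothetical linear model $y=w_0+w_1x$ in place of a network so that the per-pattern gradient $\nabla E_n$ is easy to read off. The data are cycled through in sequence; the function name, data, learning rate, and number of epochs are assumptions.

```python
def sgd(xs, ts, eta=0.05, epochs=500):
    """On-line gradient descent: one weight update per data point, cycling in sequence."""
    w = [0.0, 0.0]
    for _ in range(epochs):
        for x, t in zip(xs, ts):
            # Gradient of E_n = 0.5 * (w0 + w1 * x - t)^2 is residual * [1, x].
            residual = w[0] + w[1] * x - t
            w = [w[0] - eta * residual, w[1] - eta * residual * x]
    return w

# Hypothetical noise-free data generated from t = 1 + 2 x.
xs = [0.0, 1.0, 2.0, 3.0]
ts = [1.0, 3.0, 5.0, 7.0]
w_sgd = sgd(xs, ts)
```

Because these data are fitted exactly by a single weight vector, every $\nabla E_n$ vanishes at the solution and the iterates settle there; with noisy data the weights would instead keep fluctuating around the minimum unless the learning rate is reduced over time.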

One advantage of on-line methods compared to batch methods is that the former handle redundancy in the data much more efficiently. Another property of on-line gradient descent is the possibility of escaping from local minima, since a stationary point with respect to the error function for the whole data set will generally not be a stationary point for each data point individually.

## 5.3 Error Backpropagation

Our goal in this section is to find an efficient technique for evaluating the gradient of an error function $E\left(\mathbf{w}\right)$ for a feed-forward neural network. We shall see that this can be achieved using a local message passing scheme in which information is sent alternately forwards and backwards through the network and is known as error backpropagation, or sometimes simply as backprop.

Most training algorithms involve an iterative procedure for minimization of an error function, with adjustments to the weights being made in a sequence of steps. At each such step, we can distinguish between two distinct stages. In the first stage, the derivatives of the error function with respect to the weights must be evaluated. As we shall see, the important contribution of the backpropagation technique is in providing a computationally efficient method for evaluating such derivatives. Because it is at this stage that errors are propagated backwards through the network, we shall use the term backpropagation specifically to describe the evaluation of derivatives. In the second stage, the derivatives are then used to compute the adjustments to be made to the weights.

It is important to recognize that the two stages are distinct. Thus, the first stage, namely the propagation of errors backwards through the network in order to evaluate derivatives, can be applied to many other kinds of network and not just the multilayer perceptron. It can also be applied to error functions other than just the simple sum-of-squares, and to the evaluation of other derivatives. Similarly, the second stage of weight adjustment using the calculated derivatives can be tackled using a variety of optimization schemes, many of which are substantially more powerful than simple gradient descent.

### 5.3.1 Evaluation of error-function derivatives

We now derive the backpropagation algorithm for a general network having arbitrary feed-forward topology, arbitrary differentiable nonlinear activation functions, and a broad class of error function. The resulting formulae will then be illustrated using a simple layered network structure having a single layer of sigmoidal hidden units together with a sum-of-squares error.

Many error functions of practical interest, for instance those defined by maximum likelihood for a set of i.i.d. data, comprise a sum of terms, one for each data point in the training set, so that
$$E\left(\mathbf{w}\right)=\sum_{n=1}^N E_n\left(\mathbf{w}\right).$$
Here we shall consider the problem of evaluating $\nabla E_n\left(\mathbf{w}\right)$ for one such term in the error function. This may be used directly for sequential optimization, or the results can be accumulated over the training set in the case of batch methods.

Consider first a simple linear model in which the outputs $y_k$ are linear combinations of the input variables $x_i$ so that
$$y_k=\sum_i w_{ki}x_i$$
together with an error function that, for a particular input pattern $n$, takes the form
$$E_n=\frac{1}{2}\sum_k\left(y_{nk}-t_{nk}\right)^2$$
where $y_{nk}=y_k\left(\mathbf{x}_n,\mathbf{w}\right)$.

The gradient of this error function with respect to a weight $w_{ji}$ is given by 
$$\frac{\partial E_n}{\partial w_{ji}}=\left(y_{nj}-t_{nj}\right)x_{ni}$$
which can be interpreted as a 'local' computation involving the product of an 'error signal' $y_{nj}-t_{nj}$ associated with the output end of the link $w_{ji}$ and the variable $x_{ni}$ associated with the input end of the link. We saw how a similar formula arises with the logistic sigmoid activation function together with the cross entropy error function, and similarly for the softmax activation function together with its matching cross-entropy error function. We shall now see how this simple result extends to the more complex setting of multilayer feed-forward networks.
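As a small numerical illustration of this 'local' form, the following sketch evaluates the gradient of the linear model as an outer product of the error signal and the input, and compares it with central finite differences. The sizes, the random values, and all variable names are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 3, 4                       # numbers of outputs and inputs (arbitrary)
W = rng.normal(size=(K, D))       # W[k, i] plays the role of w_ki
x = rng.normal(size=D)            # one input pattern x_n
t = rng.normal(size=K)            # its target vector t_n

def error(W_):
    y = W_ @ x                    # y_k = sum_i w_ki x_i
    return 0.5 * np.sum((y - t) ** 2)

# Analytic gradient: dE_n/dw_ji = (y_j - t_j) x_i, an outer product.
grad = np.outer(W @ x - t, x)

# Central finite differences as an independent check.
eps = 1e-6
numeric = np.zeros_like(W)
for j in range(K):
    for i in range(D):
        step = np.zeros_like(W)
        step[j, i] = eps
        numeric[j, i] = (error(W + step) - error(W - step)) / (2 * eps)

max_abs_diff = np.max(np.abs(grad - numeric))
```

Each entry of `grad` involves only the quantity at the output end of a link and the quantity at its input end, which is the property that carries over to multilayer networks.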

In a general feed-forward network, each unit computes a weighted sum of its inputs of the form 
$$a_j=\sum_i w_{ji}z_i$$
where $z_i$ is the activation of a unit, or input, that sends a connection to unit $j$, and $w_{ji}$ is the weight associated with that connection. We saw that biases can be included in this sum by introducing an extra unit, or input, with activation fixed at $+1$. We therefore do not need to deal with biases explicitly.

The sum is transformed by a nonlinear activation function $h\left(\cdot\right)$ to give the activation $z_j$ of unit $j$ in the form 
$$z_j=h\left(a_j\right).$$
Note that one or more of the variables $z_i$ in the sum could be an input, and similarly, the unit $j$ could be an output.

For each pattern in the training set, we shall suppose that we have supplied the corresponding input vector to the network and calculated the activations of all of the hidden and output units in the network. This process is often called forward propagation because it can be regarded as a forward flow of information through the network.

Now consider the evaluation of the derivative of $E_n$ with respect to a weight $w_{ji}$. The output of the various units will depend on the particular input pattern $n$. However, in order to keep the notation uncluttered, we shall omit the subscript $n$ from the network variables. First we note that $E_n$ depends on the weight $w_{ji}$ only via the summed input $a_j$ to unit $j$. We can therefore apply the chain rule for partial derivatives to give
$$\frac{\partial E_n}{\partial w_{ji}}=\frac{\partial E_n}{\partial a_j}\frac{\partial a_j}{\partial w_{ji}}.$$

We now introduce a useful notation
$$\delta_j\equiv\frac{\partial E_n}{\partial a_j}$$
where the $\delta$'s are often referred to as errors for reasons we shall see shortly. We can write 
$$\frac{\partial a_j}{\partial w_{ji}}=z_i.$$
We then obtain
$$\frac{\partial E_n}{\partial w_{ji}}=\delta_j z_i.$$
This equation tells us that the required derivative is obtained simply by multiplying the value of $\delta$ for the unit at the output end of the weight by the value of $z$ for the unit at the input end of the weight (where $z=1$ in the case of a bias). Note that this takes the same form as for the simple linear model considered at the start of this section. Thus, in order to evaluate the derivatives, we need only to calculate the value of $\delta_j$ for each hidden and output unit in the network.

As we have seen already, for the output units, we have
$$\delta_k=y_k-t_k$$
provided we are using the canonical link as the output-unit activation function.

To evaluate the $\delta$'s for hidden units, we again make use of the chain rule for partial derivatives,
$$\delta_j\equiv\frac{\partial E_n}{\partial a_j}=\sum_k\frac{\partial E_n}{\partial a_k}\frac{\partial a_k}{\partial a_j}$$
where the sum runs over all units $k$ to which unit $j$ sends connections. Note that the units labelled $k$ could include other hidden units and/or output units. We are making use of the fact that variations in $a_j$ give rise to variations in the error function only through variations in the variables $a_k$. We obtain the following backpropagation formula
$$\delta_j=\sum_k\frac{\partial E_n}{\partial a_k}\frac{\partial a_k}{\partial a_j}=\sum_k\delta_k\frac{\partial}{\partial a_j}\left(\sum_i w_{ki}z_i\right)=h^{'}\left(a_j\right)\sum_k w_{kj}\delta_k$$
which tells us that the value of $\delta$ for a particular hidden unit can be obtained by propagating the $\delta$'s backwards from units higher up in the network. Note that the summation is taken over the first index on $w_{kj}$ (corresponding to backward propagation of information through the network), whereas in the forward propagation equation it is taken over the second index. Because we already know the values of the $\delta$'s for the output units, we can evaluate the $\delta$'s for all of the hidden units in a feed-forward network, regardless of its topology.

The backpropagation procedure can therefore be summarized as follows.
### Error Backpropagation
1. Apply an input vector $\mathbf{x}_n$ to the network and forward propagate through the network to find the activations of all the hidden and output units.
2. Evaluate the $\delta_k$ for all the output units.
3. Backpropagate the $\delta$'s to obtain $\delta_j$ for each hidden unit in the network.
4. Evaluate the required derivatives.

For batch methods, the derivative of the total error $E$ can then be obtained by repeating the above steps for each pattern in the training set and then summing over all patterns:
$$\frac{\partial E}{\partial w_{ji}}=\sum_n\frac{\partial E_n}{\partial w_{ji}}.$$
In the above derivation we have implicitly assumed that each hidden or output unit in the network has the same activation function $h\left(\cdot\right)$. The derivation is easily generalized, however, to allow different units to have individual activation functions, simply by keeping track of which form of $h\left(\cdot\right)$ goes with which unit.
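The four steps can be sketched for a network of arbitrary feed-forward topology as follows. This is a minimal illustration under stated assumptions, not a reference implementation: units are assumed to be numbered in topological order (inputs first, outputs last), hidden units use tanh, output units are linear with a sum-of-squares error so that $\delta_k=y_k-t_k$, and output units send no connections. The function name `backprop_general`, the `mask` representation of the topology, and the example network are all hypothetical.

```python
import numpy as np

def backprop_general(W, mask, x, t, n_out):
    """Per-pattern gradient of E_n = 0.5 * sum_k (y_k - t_k)^2.

    W[j, i] is the weight on link i -> j and mask[j, i] marks existing links.
    Hidden units use tanh; output units are linear and send no connections.
    """
    n_units, n_in = W.shape[0], x.size
    first_out = n_units - n_out
    Wm = np.where(mask, W, 0.0)

    # Step 1: forward propagation, a_j = sum_i w_ji z_i and z_j = h(a_j).
    z = np.zeros(n_units)
    z[:n_in] = x
    for j in range(n_in, n_units):
        a_j = Wm[j] @ z
        z[j] = a_j if j >= first_out else np.tanh(a_j)

    # Step 2: output deltas, delta_k = y_k - t_k.
    delta = np.zeros(n_units)
    delta[first_out:] = z[first_out:] - t

    # Step 3: hidden deltas in reverse order,
    # delta_j = h'(a_j) * sum_k w_kj delta_k, with h'(a) = 1 - tanh(a)^2.
    for j in range(first_out - 1, n_in - 1, -1):
        delta[j] = (1.0 - z[j] ** 2) * (Wm[:, j] @ delta)

    # Step 4: dE_n/dw_ji = delta_j z_i on every existing link.
    grad = np.where(mask, np.outer(delta, z), 0.0)
    return z[first_out:], grad

# Hypothetical topology: 3 inputs (the first fixed at 1 as a bias), 3 hidden
# units and 2 outputs, with every forward link present, including skip-layer
# and hidden-to-hidden links.
rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 3, 3, 2
n_units = n_in + n_hidden + n_out
mask = np.tril(np.ones((n_units, n_units), dtype=bool), k=-1)
mask[:n_in, :] = False               # nothing feeds an input unit
mask[:, n_units - n_out:] = False    # output units send no connections
W = 0.5 * rng.normal(size=(n_units, n_units))
x = np.array([1.0, 0.3, -0.8])
t = np.array([0.2, -0.4])

y, grad = backprop_general(W, mask, x, t, n_out)

def error(W_):
    y_, _ = backprop_general(W_, mask, x, t, n_out)
    return 0.5 * np.sum((y_ - t) ** 2)

# Finite-difference check on every existing link.
eps = 1e-6
numeric = np.zeros_like(W)
for j, i in zip(*np.nonzero(mask)):
    step = np.zeros_like(W)
    step[j, i] = eps
    numeric[j, i] = (error(W + step) - error(W - step)) / (2 * eps)
max_abs_diff = np.max(np.abs(grad - numeric))
```

For batch methods, the arrays returned for each pattern would simply be summed over the training set.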

### 5.3.2 A simple example

The above derivation of the backpropagation procedure allowed for general forms for the error function, the activation function, and the network topology. To illustrate the application of this algorithm, we now consider a particular example.

Specifically, we shall consider a two-layer network of the form illustrated, together with a sum-of-squares error, in which the output units have linear activation functions, so that $y_k=a_k$, while the hidden units have sigmoidal activation functions given by
$$h\left(a\right)\equiv\tanh\left(a\right)$$
where
$$\tanh\left(a\right)=\frac{e^a-e^{-a}}{e^a+e^{-a}}.$$
A useful feature of this function is that its derivative can be expressed in a particularly simple form:
$$h^{'}\left(a\right)=1-h\left(a\right)^2.$$

We also consider a standard sum-of-squares error function, so that for pattern $n$ the error is given by 
$$E_n=\frac{1}{2}\sum_{k=1}^K\left(y_k-t_k\right)^2$$
where $y_k$ is the activation of output unit $k$, and $t_k$ is the corresponding target, for a particular input pattern $\mathbf{x}_n$.

For each pattern in the training set in turn, we first perform a forward propagation using
$$\begin{aligned}
a_j&=\sum_{i=0}^D w_{ji}^{\left(1\right)}x_i \\
z_j&=\tanh\left(a_j\right) \\
y_k&=\sum_{j=0}^M w_{kj}^{\left(2\right)} z_j.
\end{aligned}$$

Next we compute the $\delta$'s for each output unit using
$$\delta_k=y_k-t_k.$$
Then we backpropagate these to obtain $\delta$'s for the hidden units using
$$\delta_j=\left(1-z_j^2\right)\sum_{k=1}^Kw_{kj}\delta_k.$$
Finally, the derivatives with respect to the first-layer and second-layer weights are given by 
$$\frac{\partial E_n}{\partial w_{ji}^{\left(1\right)}}=\delta_j x_i,\quad \frac{\partial E_n}{\partial w_{kj}^{\left(2\right)}}=\delta_k z_j.$$
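The forward and backward passes of this example translate almost line by line into NumPy. The sketch below is illustrative only: the function name `backprop_two_layer`, the layer sizes, the random initial weights, and the learning rate `eta` are assumptions. Biases are absorbed by prepending a constant $1$ to the inputs and to the hidden activations, matching the sums that start at index $0$.

```python
import numpy as np

def backprop_two_layer(W1, W2, x, t):
    """Return E_n and its gradients for a tanh hidden layer and linear outputs.

    W1 has shape (M, D + 1) and W2 has shape (K, M + 1); column 0 of each
    holds the biases.
    """
    x_ext = np.concatenate(([1.0], x))            # x_0 = 1 carries the bias
    z = np.tanh(W1 @ x_ext)                       # hidden activations z_j
    z_ext = np.concatenate(([1.0], z))            # z_0 = 1 carries the bias
    y = W2 @ z_ext                                # linear outputs y_k
    delta_k = y - t                               # output deltas
    # Hidden deltas: (1 - z_j^2) * sum_k w_kj delta_k (bias column excluded).
    delta_j = (1.0 - z ** 2) * (W2[:, 1:].T @ delta_k)
    E = 0.5 * np.sum(delta_k ** 2)
    return E, np.outer(delta_j, x_ext), np.outer(delta_k, z_ext)

rng = np.random.default_rng(0)
D, M, K = 2, 3, 2                                 # arbitrary sizes
W1 = 0.5 * rng.normal(size=(M, D + 1))
W2 = 0.5 * rng.normal(size=(K, M + 1))
x, t = np.array([0.4, -1.0]), np.array([1.0, 0.0])

E_before, g1, g2 = backprop_two_layer(W1, W2, x, t)

# One small gradient-descent step (the second stage of training).
eta = 0.01                                        # hypothetical learning rate
E_after, _, _ = backprop_two_layer(W1 - eta * g1, W2 - eta * g2, x, t)
```

Backpropagation supplies only the derivatives `g1` and `g2`; the final lines show one possible use of them, and any other optimization scheme could be substituted.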

## 5.5 Regularization in Neural Networks

The number of input and output units in a neural network is generally determined by the dimensionality of the data set, whereas the number $M$ of hidden units is a free parameter that can be adjusted to give the best predictive performance. Note that $M$ controls the number of parameters (weights and biases) in the network, and so we might expect that in a maximum likelihood setting there will be an optimum value of $M$ that gives the best generalization performance, corresponding to the optimum balance between under-fitting and over-fitting.

An alternative to tuning $M$ directly is to choose a relatively large value for $M$ and then to control complexity by the addition of a regularization term to the error function. The simplest regularizer is the quadratic, giving a regularized error of the form 
$$\tilde{E}\left(\mathbf{w}\right)=E\left(\mathbf{w}\right)+\frac{\lambda}{2}\mathbf{w}^\top\mathbf{w}.$$
This regularizer is also known as weight decay. The effective model complexity is then determined by the choice of the regularization coefficient $\lambda$. As we have seen previously, this regularizer can be interpreted as the negative logarithm of a zero-mean Gaussian prior distribution over the weight vector $\mathbf{w}$.
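The name 'weight decay' can be made concrete with a short sketch. Differentiating the regularized error gives $\nabla\tilde{E}=\nabla E+\lambda\mathbf{w}$, so that in the absence of any data gradient a gradient-descent step shrinks every weight by the factor $1-\eta\lambda$. The function name, the step size `eta`, and the numbers below are assumptions chosen for illustration.

```python
import numpy as np

def regularized_error_and_grad(E, grad_E, w, lam):
    """Add the quadratic regularizer (lam / 2) * w^T w and its gradient lam * w."""
    E_tilde = E + 0.5 * lam * (w @ w)
    grad_tilde = grad_E + lam * w
    return E_tilde, grad_tilde

w = np.array([1.0, -2.0])
lam, eta = 0.1, 0.5                       # hypothetical coefficient and step size

# With a zero data-error gradient, only the regularizer acts on the weights.
E_tilde, grad_tilde = regularized_error_and_grad(0.0, np.zeros(2), w, lam)
w_new = w - eta * grad_tilde              # equals (1 - eta * lam) * w
```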

### 5.5.1 Consistent Gaussian priors

One of the limitations of simple weight decay is that it is inconsistent with certain scaling properties of network mappings. To illustrate this, consider a multilayer perceptron network having two layers of weights and linear output units, which performs a mapping from a set of input variables $\{x_i\}$ to a set of output variables $\{y_k\}$. The activations of the hidden units in the first hidden layer take the form
$$z_j=h\left(\sum_iw_{ji}x_i+w_{j0}\right)$$
while the activations of the output units are given by
$$y_k=\sum_jw_{kj}z_j+w_{k0}.$$

Suppose we perform a linear transformation of the input data of the form
$$x_i\to\tilde{x}_i=ax_i+b.$$
Then we can arrange for the mapping performed by the network to be unchanged by making a corresponding linear transformation of the weights and biases from the input to the units in the hidden layer of the form
$$\begin{aligned}
w_{ji}&\to\tilde{w}_{ji}=\frac{1}{a}w_{ji} \\
w_{j0}&\to\tilde{w}_{j0}=w_{j0}-\frac{b}{a}\sum_iw_{ji}.
\end{aligned}$$

Substituting the transformed inputs and weights into the hidden-unit activation confirms that it is unchanged:
$$\begin{aligned}
\tilde{z}_j&=h\left(\sum_i\tilde{w}_{ji}\tilde{x}_i+\tilde{w}_{j0}\right) \\
&=h\left(\sum_i\left(\frac{1}{a}w_{ji}\right)\left(ax_i+b\right)+\left(w_{j0}-\frac{b}{a}\sum_iw_{ji}\right)\right) \\
&=h\left(\sum_iw_{ji}x_i+\sum_i\frac{b}{a}w_{ji}+w_{j0}-\sum_i\frac{b}{a}w_{ji}\right) \\
&=h\left(\sum_iw_{ji}x_i+w_{j0}\right)=z_j.
\end{aligned}$$

Similarly, a linear transformation of the output variables of the network of the form
$$y_k\to\tilde{y}_k=cy_k+d$$
can be achieved by making a transformation of the second-layer weights and biases using
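$$\begin{aligned}
w_{kj}&\to\tilde{w}_{kj}=cw_{kj} \\
w_{k0}&\to\tilde{w}_{k0}=cw_{k0}+d.
\end{aligned}$$

Because the hidden-unit activations are unchanged, $\tilde{z}_j=z_j$, this gives
$$\begin{aligned}
\tilde{y}_k&=\sum_j\tilde{w}_{kj}\tilde{z}_j+\tilde{w}_{k0} \\
&=\sum_jcw_{kj}z_j+cw_{k0}+d \\
&=c\left(\sum_jw_{kj}z_j+w_{k0}\right)+d \\
&=cy_k+d.
\end{aligned}$$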

If we train one network using the original data and one network using data for which the input and/or target variables are transformed by one of the above linear transformations, then consistency requires that we should obtain equivalent networks that differ only by the linear transformation of the weights as given. Any regularizer should be consistent with this property, otherwise it arbitrarily favours one solution over another, equivalent one. Clearly, simple weight decay, that treats all weights and biases on an equal footing, does not satisfy this property.
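These scaling properties are easy to confirm numerically. The sketch below builds a small two-layer tanh network with random parameters, applies the input and output transformations together with the corresponding weight transformations, and also evaluates the simple weight-decay penalty before and after. The sizes, the random seed, and the values of $a$, $b$, $c$, $d$ are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
D, M, K = 3, 4, 2                                   # arbitrary layer sizes
W1, b1 = rng.normal(size=(M, D)), rng.normal(size=M)
W2, b2 = rng.normal(size=(K, M)), rng.normal(size=K)

def network(x, W1, b1, W2, b2):
    z = np.tanh(W1 @ x + b1)                        # hidden units
    return W2 @ z + b2                              # linear outputs

a, b, c, d = 2.5, -0.7, 0.3, 1.2                    # hypothetical transformations
x = rng.normal(size=D)

# Transformed first-layer parameters (compensate x -> a x + b).
W1_t = W1 / a
b1_t = b1 - (b / a) * W1.sum(axis=1)
# Transformed second-layer parameters (produce y -> c y + d).
W2_t = c * W2
b2_t = c * b2 + d

y = network(x, W1, b1, W2, b2)
y_same = network(a * x + b, W1_t, b1_t, W2, b2)         # should equal y
y_scaled = network(a * x + b, W1_t, b1_t, W2_t, b2_t)   # should equal c * y + d

# Simple weight decay assigns different penalties to the two equivalent networks.
decay_before = sum(np.sum(p ** 2) for p in (W1, b1, W2, b2))
decay_after = sum(np.sum(p ** 2) for p in (W1_t, b1_t, W2_t, b2_t))
```

The two networks represent the same mapping up to the known linear transformation, yet `decay_before` and `decay_after` differ, which is the inconsistency described above.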

We therefore look for a regularizer which is invariant under the linear transformations. These require that the regularizer should be invariant to re-scaling of the weights and to shifts of the biases. Such a regularizer is given by
$$\frac{\lambda_1}{2}\sum_{w\in\mathcal{W}_1}w^2+\frac{\lambda_2}{2}\sum_{w\in\mathcal{W}_2}w^2$$
where $\mathcal{W}_1$ denotes the set of weights in the first layer, $\mathcal{W}_2$ denotes the set of weights in the second layer, and biases are excluded from the summation. This regularizer will remain unchanged under the weight transformations provided the regularization parameters are re-scaled using $\lambda_1\to a^{2}\lambda_1$ and $\lambda_2\to c^{-2}\lambda_2$, which compensates for the first-layer weights being scaled by $1/a$ and the second-layer weights by $c$.

### 5.5.2 Early stopping

An alternative to regularization as a way of controlling the effective complexity of a network is the procedure of early stopping. The training of nonlinear network models corresponds to an iterative reduction of the error function defined with respect to a set of training data. For many of the optimization algorithms used for network training, such as conjugate gradients, the error is a nonincreasing function of the iteration index. However, the error measured with respect to independent data, generally called a validation set, often shows a decrease at first, followed by an increase as the network starts to over-fit. Training can therefore be stopped at the point of smallest error with respect to the validation data set in order to obtain a network having good generalization performance.

The behaviour of the network in this case is sometimes explained qualitatively in terms of the effective number of degrees of freedom in the network, in which this number starts out small and then grows during the training process, corresponding to a steady increase in the effective complexity of the model. Halting training before a minimum of the training error has been reached then represents a way of limiting the effective network complexity.
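One common way to turn this idea into a stopping rule is sketched below. The `patience` parameter, which stops the scan once the validation error has failed to improve for a given number of consecutive evaluations, is an assumption that does not appear in the text; the function name and the synthetic validation curve are likewise illustrative.

```python
def early_stopping(validation_errors, patience=5):
    """Return (step, error) of the smallest validation error seen.

    `validation_errors` may be any iterable, for example a generator that
    performs one training step and then yields the current validation error.
    The scan stops after `patience` consecutive steps without improvement.
    """
    best_error, best_step, waited = float("inf"), -1, 0
    for step, err in enumerate(validation_errors):
        if err < best_error:
            best_error, best_step, waited = err, step, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_step, best_error

# Synthetic validation curve: decreases until step 10, then increases.
curve = [1.0 + (k - 10) ** 2 / 100.0 for k in range(50)]
best_step, best_error = early_stopping(curve, patience=5)
```

In a real training loop one would also keep a copy of the weights at `best_step`, since the weights at the moment the loop exits have already begun to over-fit.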

### 5.5.3 Invariances

In many applications of pattern recognition, it is known that predictions should be unchanged, or invariant, under one or more transformations of the input variables. For example, in the classification of objects in two-dimensional images, such as handwritten digits, a particular object should be assigned the same classification irrespective of its position within the image (translation invariance) or of its size (scale invariance). Such transformations produce significant changes in the raw data, expressed in terms of the intensities at each of the pixels in the image, and yet should give rise to the same output from the classification system. Similarly in speech recognition, small levels of nonlinear warping along the time axis, which preserve temporal ordering, should not change the interpretation of the signal.

If sufficiently large numbers of training patterns are available, then an adaptive model such as a neural network can learn the invariance, at least approximately. This involves including within the training set a sufficiently large number of examples of the effects of the various transformations. Thus, for translation invariance in an image, the training set should include examples of objects at many different positions.

This approach may be impractical, however, if the number of training examples is limited, or if there are several invariants (because the number of combinations of transformations grows exponentially with the number of such transformations). We therefore seek alternative approaches for encouraging an adaptive model to exhibit the required invariances. 

These can broadly be divided into four categories:
1. The training set is augmented using replicas of the training patterns, transformed according to the desired invariances. For instance, in our digit recognition example, we could make multiple copies of each example in which the digit is shifted to a different position in each image.
2. A regularization term is added to the error function that penalizes changes in the model output when the input is transformed. This leads to the technique of tangent propagation.
3. Invariance is built into the pre-processing by extracting features that are invariant under the required transformations. Any subsequent regression or classification system that uses such features as inputs will necessarily also respect these invariances.
4. The final option is to build the invariance properties into the structure of a neural network (or into the definition of a kernel function in the case of techniques such as the relevance vector machine). One way to achieve this is through the use of local receptive fields and shared weights, as discussed in the context of convolutional neural networks.

Approach 1 is often relatively easy to implement and can be used to encourage complex invariances. For sequential training algorithms, this can be done by transforming each input pattern before it is presented to the model so that, if the patterns are being recycled, a different transformation (drawn from an appropriate distribution) is applied each time. For batch methods, a similar effect can be achieved by replicating each data point a number of times and transforming each copy independently. The use of such augmented data can lead to significant improvements in generalization, although it can also be computationally costly.
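A minimal sketch of approach 1 for translation invariance is given below. It draws a random shift for each copy of an image and applies it with `numpy.roll`, which wraps pixels around the image border; that cyclic behaviour is an assumption that is only reasonable when the object sits on a blank background away from the edges. The function name `random_shift`, the image size, and the shift range are illustrative.

```python
import numpy as np

def random_shift(image, max_shift, rng):
    """Translate a 2-D image by a random offset drawn uniformly from
    [-max_shift, max_shift] in each direction (cyclic at the borders)."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    return np.roll(image, shift=(int(dy), int(dx)), axis=(0, 1))

rng = np.random.default_rng(0)
image = np.zeros((8, 8))
image[3:5, 3:5] = 1.0                      # a small blob standing in for a digit

# Batch setting: replicate the pattern and transform each copy independently.
n_copies = 5
augmented = np.stack([random_shift(image, 2, rng) for _ in range(n_copies)])
```

For sequential training, `random_shift` would instead be called on each pattern immediately before it is presented to the model, so that a fresh transformation is drawn on every pass.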

Approach 2 leaves the data set unchanged but modifies the error function through the addition of a regularizer. We shall show that this approach is closely related to approach 1.

One advantage of approach 3 is that it can correctly extrapolate well beyond the range of transformations included in the training set. However, it can be difficult to find hand-crafted features with the required invariances that do not also discard information that can be useful for discrimination.

### 5.5.4 Tangent propagation

We can use regularization to encourage models to be invariant to transformations of the input through the technique of tangent propagation. Consider the effect of a transformation on a particular input vector $\mathbf{x}_n$. Provided the transformation is continuous (such as translation or rotation, but not mirror reflection for instance), then the transformed pattern will sweep out a manifold $\mathcal{M}$ within the $D$-dimensional input space.

This is illustrated for the case of $D=2$ for simplicity. Suppose the transformation is governed by a single parameter $\xi$ (which might be rotation angle for instance). Then the subspace $\mathcal{M}$ swept out by $\mathbf{x}_n$ will be one-dimensional, and will be parameterized by $\xi$. Let the vector that results from acting on $\mathbf{x}_n$ by this transformation be denoted by $\mathbf{s}\left(\mathbf{x}_n,\xi\right)$, which is defined so that $\mathbf{s}\left(\mathbf{x}_n,0\right)=\mathbf{x}_n$. Then the tangent to the curve $\mathcal{M}$ is given by the directional derivative $\boldsymbol{\tau}=\partial\mathbf{s}/\partial\xi$, and the tangent vector at the point $\mathbf{x}_n$ is given by
$$\boldsymbol{\tau}_n=\frac{\partial\mathbf{s}\left(\mathbf{x}_n,\xi\right)}{\partial\xi}\bigg|_{\xi=0}.$$
Under a transformation of the input vector, the network output vector will, in general, change. The derivative of output $y_k$ with respect to $\xi$ is given by
$$\frac{\partial y_k}{\partial\xi}\bigg|_{\xi=0}=\sum_{i=1}^D\frac{\partial y_k}{\partial x_i}\frac{\partial x_i}{\partial\xi}\bigg|_{\xi=0}=\sum_{i=1}^D J_{ki}\tau_i$$
where $J_{ki}$ is the $\left(k,i\right)$ element of the Jacobian matrix $\mathbf{J}$.

The result can be used to modify the standard error function, so as to encourage local invariance in the neighbourhood of the data points, by the addition to the original error function $E$ of a regularization function $\Omega$ to give a total error function of the form
$$\tilde{E}=E+\lambda\Omega$$
where $\lambda$ is a regularization coefficient and 
$$\Omega=\frac{1}{2}\sum_n\sum_k\left(\frac{\partial y_{nk}}{\partial\xi}\bigg|_{\xi=0}\right)^2=\frac{1}{2}\sum_n\sum_k\left(\sum_{i=1}^D J_{nki}\tau_{ni}\right)^2.$$

The regularization function will be zero when the network mapping function is invariant under the transformation in the neighbourhood of each pattern vector, and the value of the parameter $\lambda$ determines the balance between fitting the training data and learning the invariance property.

In a practical implementation, the tangent vector $\boldsymbol{\tau}_n$ can be approximated using finite differences, by subtracting the original vector $\mathbf{x}_n$ from the corresponding vector after transformation using a small value of $\xi$, and then dividing by $\xi$.
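The finite-difference recipe, and the regularizer it feeds, can be sketched as follows for rotations in $D=2$. The functions `rotate`, `tangent_vector` and `tangent_penalty`, the two test mappings, and the step sizes are all assumptions for illustration. The product $\sum_i J_{ki}\tau_i$ is itself estimated with a central difference of the mapping along $\boldsymbol{\tau}_n$ rather than from an explicit Jacobian.

```python
import numpy as np

def rotate(x, xi):
    """s(x, xi): rotate a 2-D point by angle xi, so that s(x, 0) = x."""
    c, s = np.cos(xi), np.sin(xi)
    return np.array([c * x[0] - s * x[1], s * x[0] + c * x[1]])

def tangent_vector(transform, x, xi=1e-5):
    """Finite-difference tangent: (s(x, xi) - x) / xi for a small xi."""
    return (transform(x, xi) - x) / xi

def tangent_penalty(f, transform, X, eps=1e-5):
    """Omega = 0.5 * sum_n sum_k (sum_i J_nki tau_ni)^2, with J tau estimated
    by a central difference of the mapping f along the tangent direction."""
    total = 0.0
    for x in X:
        tau = tangent_vector(transform, x)
        dy = (f(x + eps * tau) - f(x - eps * tau)) / (2 * eps)
        total += 0.5 * np.sum(dy ** 2)
    return total

X = np.array([[1.0, 0.5], [-0.3, 2.0]])

def f_invariant(x):
    return np.array([np.tanh(x @ x)])      # depends only on the radius

def f_other(x):
    return np.array([np.tanh(x[0])])       # not rotation invariant

tau_0 = tangent_vector(rotate, X[0])       # close to (-0.5, 1.0)
omega_invariant = tangent_penalty(f_invariant, rotate, X)
omega_other = tangent_penalty(f_other, rotate, X)
```

A mapping that depends only on the radius incurs essentially no penalty, whereas one that depends on a single coordinate is penalized, in line with the statement that $\Omega$ vanishes when the mapping is locally invariant.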

The regularization function depends on the network weights through the Jacobian $\mathbf{J}$. A backpropagation formalism for computing the derivatives of the regularizer with respect to the network weights is easily obtained by extending the backpropagation techniques of Section 5.3.

If the transformation is governed by $L$ parameters (e.g., $L=3$ for the case of translations combined with in-plane rotations in a two-dimensional image), then the manifold $\mathcal{M}$ will have dimensionality $L$, and the corresponding regularizer is given by the sum of terms, one for each transformation. If several transformations are considered at the same time, and the network mapping is made invariant to each separately, then it will be (locally) invariant to combinations of the transformations.

A related technique, called tangent distance, can be used to build invariance properties into distance-based methods such as nearest-neighbour classifiers.

### 5.5.5 Training with transformed data

We have seen that one way to encourage invariance of a model to a set of transformations is to expand the training set using transformed versions of the original input patterns. Here we show that this approach is closely related to the technique of tangent propagation.

### 5.5.6 Convolutional networks

Another approach to creating models that are invariant to certain transformations of the inputs is to build the invariance properties into the structure of a neural network. This is the basis for the convolutional neural network, which has been widely applied to image data.

A key property of images is that nearby pixels are more strongly correlated than more distant pixels. Many of the modern approaches to computer vision exploit this property by extracting local features that depend only on small subregions of the image. Information from such features can then be merged in later stages of processing in order to detect higher-order features and ultimately to yield information about the image as a whole. Also, local features that are useful in one region of the image are likely to be useful in other regions of the image, for instance if the object of interest is translated.

These notions are incorporated into convolutional neural networks through three mechanisms: (i) local receptive fields, (ii) weight sharing, and (iii) subsampling. In the convolutional layer the units are organized into planes, each of which is called a feature map. Units in a feature map each take inputs only from a small subregion of the image, and all of the units in a feature map are constrained to share the same weight values. If we think of the units as feature detectors, then all of the units in a feature map detect the same pattern but at different locations in the input image. Due to the weight sharing, the evaluation of the activations of these units is equivalent to a convolution of the image pixel intensities with a 'kernel' comprising the weight parameters. If the input image is shifted, the activations of the feature map will be shifted by the same amount but will otherwise be unchanged. This provides the basis for the (approximate) invariance of the network outputs to translations and distortions of the input image. Because we will typically need to detect multiple features in order to build an effective model, there will generally be multiple feature maps in the convolutional layer, each having its own set of weight and bias parameters.

The outputs of the convolutional units form the inputs to the subsampling layer of the network. For each feature map in the convolutional layer, there is a plane of units in the subsampling layer and each unit takes inputs from a small receptive field in the corresponding feature map of the convolutional layer. These units perform subsampling. The receptive fields are chosen to be contiguous and nonoverlapping so that there are half the number of rows and columns in the subsampling layer compared with the convolutional layer. In this way, the response of a unit in the subsampling layer will be relatively insensitive to small shifts of the image in the corresponding regions of the input space.
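The following sketch illustrates one feature map and its subsampling plane. Several choices here are assumptions rather than details taken from the text: the kernel is applied without flipping (strictly a cross-correlation), a tanh nonlinearity is used, only positions where the kernel fits entirely inside the image are evaluated, and subsampling is a plain average over $2\times 2$ fields. Function names and sizes are illustrative.

```python
import numpy as np

def feature_map(image, kernel, bias=0.0):
    """Every unit sees a small patch and all units share `kernel` and `bias`."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel) + bias
    return np.tanh(out)

def subsample(fmap):
    """Average over contiguous, non-overlapping 2x2 receptive fields."""
    H2, W2 = fmap.shape[0] // 2, fmap.shape[1] // 2
    return fmap[:2 * H2, :2 * W2].reshape(H2, 2, W2, 2).mean(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.normal(size=(12, 12))
kernel = rng.normal(size=(3, 3))           # 9 shared weights (plus one bias)

fmap = feature_map(image, kernel)          # shape (10, 10)
pooled = subsample(fmap)                   # shape (5, 5)

# Shift the image down by 1 row and right by 2 columns.
shifted = np.roll(image, shift=(1, 2), axis=(0, 1))
fmap_shifted = feature_map(shifted, kernel)
```

Away from the border rows and columns affected by the wrap-around of `np.roll`, `fmap_shifted[1:, 2:]` coincides with `fmap[:-1, :-2]`: the feature map is shifted by the same amount as the image and is otherwise unchanged. The 100 units of this feature map are described by only 10 parameters because of the sharing.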

In a practical architecture, there may be several pairs of convolutional and subsampling layers. At each stage there is a larger degree of invariance to input transformations compared to the previous layer. There may be several feature maps in a given convolutional layer for each plane of units in the previous subsampling layer, so that the gradual reduction in spatial resolution is then compensated by an increasing number of features. The final layer of the network would typically be a fully connected, fully adaptive layer, with a softmax output nonlinearity in the case of multiclass classification.

The whole network can be trained by error minimization using backpropagation to evaluate the gradient of the error function. This involves a slight modification of the usual backpropagation algorithm to ensure that the shared-weight constraints are satisfied. Due to the use of local receptive fields, the number of weights in the network is smaller than if the network were fully connected. Furthermore, the number of independent parameters to be learned from the data is much smaller still, due to the substantial numbers of constraints on the weights.

### 5.5.7 Soft weight sharing

One way to reduce the effective complexity of a network with a large number of weights is to constrain weights within certain groups to be equal. This is the technique of weight sharing, introduced in the context of convolutional networks as a way of building translation invariance into networks used for image interpretation. It is only applicable, however, to particular problems in which the form of the constraints can be specified in advance. Here we consider a form of soft weight sharing in which the hard constraint of equal weights is replaced by a form of regularization in which groups of weights are encouraged to have similar values. Furthermore, the division of weights into groups, the mean weight value for each group, and the spread of values within the groups are all determined as part of the learning process.

Recall that the simple weight decay regularizer can be viewed as the negative log of a Gaussian prior distribution over the weights. We can encourage the weight values to form several groups, rather than just one group, by considering instead a probability distribution that is a mixture of Gaussians. The centres and variances of the Gaussian components, as well as the mixing coefficients, will be considered as adjustable parameters to be determined as part of the learning process.

Thus, we have a probability density of the form
$$p\left(\mathbf{w}\right)=\prod_ip\left(w_i\right)$$
where 
$$p\left(w_i\right)=\sum_{j=1}^M\pi_j\mathcal{N}\left(w_i|\mu_j,\sigma_j^2\right)$$
and $\pi_j$ are the mixing coefficients.

Taking the negative logarithm then leads to a regularization function of the form
$$\Omega\left(\mathbf{w}\right)=-\sum_i\ln\left(\sum_{j=1}^M\pi_j\mathcal{N}\left(w_i|\mu_j,\sigma_j^2\right)\right).$$

The total error function is then given by
$$\tilde{E}\left(\mathbf{w}\right)=E\left(\mathbf{w}\right)+\lambda\Omega\left(\mathbf{w}\right)$$
where $\lambda$ is the regularization coefficient. This error is minimized both with respect to the weights $w_i$ and with respect to the parameters $\{\pi_j,\mu_j,\sigma_j\}$ of the mixture model. Because the distribution of weights is itself evolving during the learning process, a joint optimization is performed simultaneously over the weights and the mixture-model parameters in order to avoid numerical instability. This can be done using a standard optimization algorithm such as conjugate gradients or quasi-Newton methods.

In order to minimize the total error function, it is necessary to be able to evaluate its derivatives with respect to the various adjustable parameters. To do this it is convenient to regard the $\{\pi_j\}$ as prior probabilities and to introduce the corresponding posterior probabilities which are given by Bayes' theorem in the form 
$$\gamma_j\left(w\right)=\frac{\pi_j\mathcal{N}\left(w|\mu_j,\sigma_j^2\right)}{\sum_k\pi_k\mathcal{N}\left(w|\mu_k,\sigma_k^2\right)}.$$

The derivatives of the total error function with respect to the weights are then given by 
$$\frac{\partial{\tilde{E}}}{\partial w_i}=\frac{\partial{E}}{\partial w_i}+\lambda\sum_j\gamma_j\left(w_i\right)\frac{\left(w_i-\mu_j\right)}{\sigma_j^2}.$$
The effect of the regularization term is therefore to pull each weight towards the centre of the $j^{\mathrm{th}}$ Gaussian, with a force proportional to the posterior probability of that Gaussian for the given weight. This is precisely the kind of effect that we are seeking.
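The regularizer, the posterior probabilities, and the weight derivative can be checked numerically with a few lines of NumPy. This is a sketch under assumed values: the weights, the two-component mixture parameters, and the function names are hypothetical, and only the regularizer part of the derivative (the term multiplying $\lambda$) is evaluated.

```python
import numpy as np

def gaussian(w, mu, sigma):
    return np.exp(-0.5 * ((w - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

def responsibilities(w, pi, mu, sigma):
    """gamma[i, j]: posterior probability that weight w_i belongs to component j."""
    joint = pi * gaussian(w[:, None], mu, sigma)       # shape (n_weights, M)
    return joint / joint.sum(axis=1, keepdims=True)

def omega(w, pi, mu, sigma):
    """Omega(w) = -sum_i ln sum_j pi_j N(w_i | mu_j, sigma_j^2)."""
    return -np.sum(np.log((pi * gaussian(w[:, None], mu, sigma)).sum(axis=1)))

def omega_grad_w(w, pi, mu, sigma):
    """dOmega/dw_i = sum_j gamma_j(w_i) (w_i - mu_j) / sigma_j^2."""
    gamma = responsibilities(w, pi, mu, sigma)
    return np.sum(gamma * (w[:, None] - mu) / sigma ** 2, axis=1)

w = np.array([-1.1, -0.9, 0.05, 1.0, 1.2])             # hypothetical weights
pi = np.array([0.3, 0.7])
mu = np.array([-1.0, 1.0])
sigma = np.array([0.2, 0.5])

gamma = responsibilities(w, pi, mu, sigma)
grad = omega_grad_w(w, pi, mu, sigma)

# Central finite differences as an independent check.
eps = 1e-6
numeric = np.array([
    (omega(w + eps * e, pi, mu, sigma) - omega(w - eps * e, pi, mu, sigma)) / (2 * eps)
    for e in np.eye(w.size)
])
max_abs_diff = np.max(np.abs(grad - numeric))
```

Each row of `gamma` sums to one, and a weight lying close to one centre is pulled almost entirely by that component.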

Derivatives of the error with respect to the centres of the Gaussians are also easily computed to give 
$$\frac{\partial{\tilde{E}}}{\partial \mu_j}=\lambda\sum_i\gamma_j\left(w_i\right)\frac{\left(\mu_j-w_i\right)}{\sigma_j^2}$$
which has a simple intuitive interpretation, because it pushes $\mu_j$ towards an average of the weight values, weighted by posterior probabilities that the respective weight parameters were generated by component $j$.

Similarly, the derivatives with respect to the variances are given by
$$\frac{\partial{\tilde{E}}}{\partial \sigma_j}=\lambda\sum_i\gamma_j\left(w_i\right)\left(\frac{1}{\sigma_j}-\frac{\left(w_i-\mu_j\right)^2}{\sigma_j^3}\right)$$
which drives $\sigma_j$ towards the weighted average of the squared deviations of the weights around the corresponding centre $\mu_j$, where the weighting coefficients are again given by the posterior probability that each weight is generated by component $j$.

Note that in a practical implementation, new variables $\xi_j$ defined by
$$\sigma_j^2=\exp\left(\xi_j\right)$$
are introduced, and the minimization is performed with respect to the $\xi_j$. This ensures that the parameters $\sigma_j$ remain positive. It also has the effect of discouraging pathological solutions in which one or more of the $\sigma_j$ goes to zero, corresponding to a Gaussian component collapsing onto one of the weight parameter values. Such solutions are discussed in more detail in the context of Gaussian mixture models.

For the derivatives with respect to the mixing coefficients $\pi_j$, we need to take account of the constraints
$$\sum_j\pi_j=1,\quad 0\leqslant\pi_j\leqslant 1$$
which follow from the interpretation of the $\pi_j$ as prior probabilities. This can be done by expressing the mixing coefficients in terms of a set of auxiliary variables $\{\eta_j\}$ using the softmax function given by
$$\pi_j=\frac{\exp\left(\eta_j\right)}{\sum_{k=1}^M\exp\left(\eta_k\right)}.$$

The derivatives of the regularized error function with respect to the $\{\eta_j\}$ then take the form
$$\frac{\partial{\tilde{E}}}{\partial \eta_j}=\lambda\sum_i\{\pi_j-\gamma_j\left(w_i\right)\}.$$
We see that $\pi_j$ is therefore driven towards the average posterior probability for component $j$.
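The derivatives with respect to the mixture parameters can be verified in the same spirit. The sketch below works with unconstrained variables: softmax variables `eta` for the mixing coefficients and log-variances `log_var` for the widths, so that $\sigma_j^2=\exp(\texttt{log\_var}_j)$. The names, the test values, and the use of the chain rule $\partial\sigma_j/\partial\,\texttt{log\_var}_j=\sigma_j/2$ are assumptions of this illustration, and the factor $\lambda$ is omitted since it multiplies every term.

```python
import numpy as np

def unpack(eta, log_var):
    pi = np.exp(eta - eta.max())
    pi /= pi.sum()                                   # softmax mixing coefficients
    return pi, np.exp(0.5 * log_var)                 # sigma_j = exp(log_var_j / 2)

def densities(w, mu, sigma):
    return np.exp(-0.5 * ((w[:, None] - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

def omega(w, eta, mu, log_var):
    pi, sigma = unpack(eta, log_var)
    return -np.sum(np.log((pi * densities(w, mu, sigma)).sum(axis=1)))

def omega_grads(w, eta, mu, log_var):
    pi, sigma = unpack(eta, log_var)
    joint = pi * densities(w, mu, sigma)
    gamma = joint / joint.sum(axis=1, keepdims=True)
    diff = w[:, None] - mu
    g_eta = np.sum(pi - gamma, axis=0)               # sum_i {pi_j - gamma_j(w_i)}
    g_mu = np.sum(gamma * (-diff) / sigma ** 2, axis=0)
    g_sigma = np.sum(gamma * (1 / sigma - diff ** 2 / sigma ** 3), axis=0)
    g_log_var = 0.5 * sigma * g_sigma                # chain rule through sigma
    return g_eta, g_mu, g_log_var

def numeric_grad(f, v, eps=1e-6):
    g = np.zeros_like(v)
    for k in range(v.size):
        step = np.zeros_like(v)
        step[k] = eps
        g[k] = (f(v + step) - f(v - step)) / (2 * eps)
    return g

w = np.array([-1.1, -0.9, 0.05, 1.0, 1.2])           # hypothetical weights
eta = np.array([0.2, -0.4])
mu = np.array([-1.0, 1.0])
log_var = np.log(np.array([0.2, 0.5]) ** 2)

g_eta, g_mu, g_log_var = omega_grads(w, eta, mu, log_var)
n_eta = numeric_grad(lambda v: omega(w, v, mu, log_var), eta)
n_mu = numeric_grad(lambda v: omega(w, eta, v, log_var), mu)
n_log_var = numeric_grad(lambda v: omega(w, eta, mu, v), log_var)
```

Because the softmax is unchanged when a constant is added to every `eta`, the entries of `g_eta` sum to zero.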

## 5.6 Mixture Density Networks

The goal of supervised learning is to model a conditional distribution $p\left(\mathrm{t}|\mathbf{x}\right)$, which for many simple regression problems is chosen to be Gaussian. However, practical machine learning problems can often have significantly non-Gaussian distributions. These can arise, for example, with inverse problems in which the distribution can be multimodal, in which case the Gaussian assumption can lead to very poor predictions.

We therefore seek a general framework for modelling conditional probability distributions. This can be achieved by using a mixture model for $p\left(\mathrm{t}|\mathbf{x}\right)$ in which both the mixing coefficients and the component densities are flexible functions of the input vector $\mathbf{x}$, giving rise to the mixture density network. For any given value of $\mathbf{x}$, the mixture model provides a general formalism for modelling an arbitrary conditional density function $p\left(\mathrm{t}|\mathbf{x}\right)$. Provided we consider a sufficiently flexible network, we then have a framework for approximating arbitrary conditional distributions.

Here we shall develop the model explicitly for Gaussian components, so that 
$$p\left(\mathrm{t}|\mathbf{x}\right)=\sum_{k=1}^K\pi_k\left(\mathbf{x}\right)\mathcal{N}\left(\mathrm{t}|\boldsymbol{\mu}_k\left(\mathbf{x}\right),\sigma_k^2\left(\mathbf{x}\right)\right).$$
Instead of Gaussians, we can use other distributions for the components.

We now take the various parameters of the mixture model, namely the mixing coefficients $\pi_k\left(\mathbf{x}\right)$, the means $\boldsymbol{\mu}_k\left(\mathbf{x}\right)$, and the variances $\sigma_k^2\left(\mathbf{x}\right)$, to be governed by the outputs of a conventional neural network that takes $\mathbf{x}$ as its input.

The neural network can be a two-layer network having sigmoidal ('tanh') hidden units. If there are $K$ components in the mixture model, and if $\mathrm{t}$ has $L$ components, then the network will have $K$ output-unit activations denoted by $a_k^\pi$ that determine the mixing coefficients $\pi_k\left(\mathbf{x}\right)$, $K$ outputs denoted by $a_k^\sigma$ that determine the kernel widths $\sigma_k\left(\mathbf{x}\right)$, and $K\times L$ outputs denoted by $a_{kj}^\mu$ that determine the components $\mu_{kj}\left(\mathbf{x}\right)$ of the kernel centres $\boldsymbol{\mu}_k\left(\mathbf{x}\right)$. The total number of network outputs is given by $\left(L+2\right)K$, as compared with the usual $L$ outputs for a network that simply predicts the conditional means of the target variables.

The mixing coefficients must satisfy the constraints
$$\sum_{k=1}^K\pi_k\left(\mathbf{x}\right)=1,\quad 0\leqslant\pi_k\left(\mathbf{x}\right)\leqslant 1$$
which can be achieved using a set of softmax outputs
$$\pi_k\left(\mathbf{x}\right)=\frac{\exp\left(a_k^\pi\right)}{\sum_{l=1}^K\exp\left(a_l^\pi\right)}.$$

Similarly, the variances must satisfy $\sigma_k^2\left(\mathbf{x}\right)\geqslant 0$ and so can be represented in terms of the exponentials of the corresponding network activations using 
$$\sigma_k\left(\mathbf{x}\right)=\exp\left(a_k^\sigma\right).$$

Finally, because the means $\boldsymbol{\mu}_k\left(\mathbf{x}\right)$ have real components, they can be represented directly by the network output activations
$$\mu_{kj}\left(\mathbf{x}\right)=a_{kj}^\mu.$$
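The three output transformations can be sketched as follows, using the index convention of the equations in this section ($K$ mixture components, $L$-dimensional target). The function names and the ordering of the activations within the flat output vector are assumptions made for this illustration.

```python
import math

def split_outputs(outputs, K, L):
    """Split a flat list of (L + 2) * K activations into three groups.

    Assumed layout: K mixing activations, then K width activations,
    then K blocks of L centre activations.
    """
    assert len(outputs) == (L + 2) * K
    a_pi = outputs[:K]
    a_sigma = outputs[K:2 * K]
    a_mu = [outputs[2 * K + k * L: 2 * K + (k + 1) * L] for k in range(K)]
    return a_pi, a_sigma, a_mu

def mdn_parameters(a_pi, a_sigma, a_mu):
    """Map raw output activations to mixture parameters."""
    top = max(a_pi)  # subtract the maximum for numerical stability
    exps = [math.exp(a - top) for a in a_pi]
    total = sum(exps)
    pi = [e / total for e in exps]          # softmax: non-negative, sums to one
    sigma = [math.exp(a) for a in a_sigma]  # exponential: strictly positive
    mu = [list(row) for row in a_mu]        # identity: unconstrained real values
    return pi, sigma, mu

# Hypothetical activations for K = 3 components and an L = 2 dimensional target.
K, L = 3, 2
outputs = [0.1, -0.3, 0.7, 0.0, -1.0, 0.5, 1.0, 2.0, -1.0, 0.5, 0.0, 3.0]
pi, sigma, mu = mdn_parameters(*split_outputs(outputs, K, L))
```

Because the constraints are built into the output transformations, the network activations themselves can be optimized without constraints.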

The adaptive parameters of the mixture density network comprise the vector $\mathbf{w}$ of weights and biases in the neural network, which can be set by maximum likelihood, or equivalently by minimizing an error function defined to be the negative logarithm of the likelihood. For independent data, this error function takes the form
$$E\left(\mathbf{w}\right)=-\sum_{n=1}^N\ln\left\{\sum_{k=1}^K\pi_k\left(\mathbf{x}_n,\mathbf{w}\right)\mathcal{N}\left(\mathrm{t}_n|\boldsymbol{\mu}_k\left(\mathbf{x}_n,\mathbf{w}\right),\sigma_k^2\left(\mathbf{x}_n,\mathbf{w}\right)\right)\right\}$$
where we have made the dependencies on $\mathbf{w}$ explicit.
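A minimal sketch of this error function is given below. It assumes isotropic Gaussian components with a single width $\sigma_k$ per component, and it assumes the per-pattern mixture parameters have already been computed from the network outputs. The names `log_gaussian` and `mdn_error` are hypothetical. The inner sum is evaluated with the log-sum-exp trick, a standard numerical device rather than something prescribed by the text.

```python
import math

def log_gaussian(t, mu, sigma):
    """Log of an isotropic Gaussian N(t | mu, sigma^2 I) for an L-dimensional t."""
    L = len(t)
    sq = sum((ti - mi) ** 2 for ti, mi in zip(t, mu))
    return -0.5 * L * math.log(2.0 * math.pi * sigma ** 2) - 0.5 * sq / sigma ** 2

def mdn_error(targets, params):
    """E = -sum_n ln sum_k pi_k N(t_n | mu_k, sigma_k^2).

    targets : list of N target vectors t_n
    params  : list of N tuples (pi, sigma, mu) evaluated at the inputs x_n
    """
    total = 0.0
    for t, (pi, sigma, mu) in zip(targets, params):
        # ln(pi_k * N_nk) for each component k
        terms = [math.log(p) + log_gaussian(t, m, s) for p, m, s in zip(pi, mu, sigma)]
        top = max(terms)  # log-sum-exp for numerical stability
        total -= top + math.log(sum(math.exp(v - top) for v in terms))
    return total

# With one unit-variance component centred on the target,
# the error reduces to 0.5 * ln(2 * pi) per pattern.
single = mdn_error([[0.0]], [([1.0], [1.0], [[0.0]])])

# Splitting that component into two identical halves leaves the density unchanged.
split = mdn_error([[0.0]], [([0.5, 0.5], [1.0, 1.0], [[0.0], [0.0]])])
```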

In order to minimize the error function, we need to calculate the derivatives of the error $E\left(\mathbf{w}\right)$ with respect to the components of $\mathbf{w}$. These can be evaluated by using the standard backpropagation procedure, provided we obtain suitable expressions for the derivatives of the error with respect to the output-unit activations. These represent error signals $\delta$ for each pattern and for each output unit, and can be backpropagated to the hidden units and the error function derivatives evaluated in the usual way. Because the error function is composed of a sum of terms, one for each training data point, we can consider the derivatives for a particular pattern $n$ and then find the derivatives of $E$ by summing over all patterns.

Because we are dealing with mixture distributions, it is convenient to view the mixing coefficients $\pi_k\left(\mathbf{x}\right)$ as $\mathbf{x}$-dependent prior probabilities and to introduce the corresponding posterior probabilities given by
$$\gamma_k=\gamma_k\left(\mathrm{t}_n|\mathbf{x}_n\right)=\frac{\pi_k\mathcal{N}_{nk}}{\sum_{l=1}^K\pi_l\mathcal{N}_{nl}}$$
where $\mathcal{N}_{nk}$ denotes $\mathcal{N}\left(\mathrm{t}_n|\boldsymbol{\mu}_k\left(\mathbf{x}_n\right),\sigma_k^2\left(\mathbf{x}_n\right)\right)$.

The derivatives with respect to the network output activations governing the mixing coefficients are given by
$$\frac{\partial E_n}{\partial a_k^\pi}=\pi_k-\gamma_k.$$

Similarly, the derivatives with respect to the output activations controlling the component means are given by
$$\frac{\partial E_n}{\partial a_{kl}^\mu}=\gamma_k\left\{\frac{\mu_{kl}-t_{nl}}{\sigma_k^2}\right\}.$$

Finally, the derivatives with respect to the output activations controlling the component variances are given by
$$\frac{\partial E_n}{\partial a_k^\sigma}=\gamma_k\left\{L-\frac{\|\mathrm{t}_n-\boldsymbol{\mu}_k\|^2}{\sigma_k^2}\right\}.$$
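These three expressions can be verified against finite differences. The sketch below assumes isotropic Gaussian components with $\sigma_k=\exp\left(a_k^\sigma\right)$ and an $L$-dimensional target. All function names and numerical values are hypothetical.

```python
import math

def log_sum_exp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def log_joint(a_pi, a_sigma, a_mu, t):
    """Return ln(pi_k * N_nk) for every component k of one pattern."""
    L = len(t)
    log_norm = log_sum_exp(a_pi)  # log of the softmax denominator
    out = []
    for a_p, a_s, mu in zip(a_pi, a_sigma, a_mu):
        var = math.exp(2.0 * a_s)  # sigma_k^2 with sigma_k = exp(a_sigma)
        sq = sum((tl - ml) ** 2 for tl, ml in zip(t, mu))
        out.append(a_p - log_norm - 0.5 * L * math.log(2.0 * math.pi * var) - 0.5 * sq / var)
    return out

def pattern_error(a_pi, a_sigma, a_mu, t):
    """E_n as a function of the output-unit activations."""
    return -log_sum_exp(log_joint(a_pi, a_sigma, a_mu, t))

def analytic_gradients(a_pi, a_sigma, a_mu, t):
    L = len(t)
    joint = log_joint(a_pi, a_sigma, a_mu, t)
    total = log_sum_exp(joint)
    gamma = [math.exp(v - total) for v in joint]  # posterior probabilities
    log_norm = log_sum_exp(a_pi)
    pi = [math.exp(a - log_norm) for a in a_pi]   # prior probabilities
    d_pi, d_sigma, d_mu = [], [], []
    for k in range(len(a_pi)):
        var = math.exp(2.0 * a_sigma[k])
        sq = sum((tl - ml) ** 2 for tl, ml in zip(t, a_mu[k]))
        d_pi.append(pi[k] - gamma[k])
        d_sigma.append(gamma[k] * (L - sq / var))
        d_mu.append([gamma[k] * (ml - tl) / var for tl, ml in zip(t, a_mu[k])])
    return d_pi, d_sigma, d_mu

def central_difference(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2.0 * h)

# Hypothetical activations: K = 3 components, L = 2 target dimensions.
a_pi = [0.2, -0.4, 0.1]
a_sigma = [-0.3, 0.0, 0.4]
a_mu = [[0.5, -1.0], [1.5, 0.3], [-0.7, 0.9]]
t = [0.4, 0.2]

d_pi, d_sigma, d_mu = analytic_gradients(a_pi, a_sigma, a_mu, t)

k, l = 1, 0  # component and target dimension to check
num_pi = central_difference(
    lambda v: pattern_error(a_pi[:k] + [v] + a_pi[k + 1:], a_sigma, a_mu, t), a_pi[k])
num_sigma = central_difference(
    lambda v: pattern_error(a_pi, a_sigma[:k] + [v] + a_sigma[k + 1:], a_mu, t), a_sigma[k])

def error_with_mu(v):
    mu = [list(row) for row in a_mu]  # copy so the original stays untouched
    mu[k][l] = v
    return pattern_error(a_pi, a_sigma, mu, t)

num_mu = central_difference(error_with_mu, a_mu[k][l])
```

In a full implementation, these quantities would serve as the output-unit error signals passed to backpropagation. Here they are only compared with numerical derivatives as a sanity check.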

Once a mixture density network has been trained, it can predict the conditional density function of the target data for any given value of the input vector. This conditional density represents a complete description of the generator of the data, so far as the problem of predicting the value of the output vector is concerned. From this density function we can calculate more specific quantities that may be of interest in different applications.

One of the simplest of these is the mean, corresponding to the conditional average of the target data, and is given by
$$\mathbb{E}\left[\mathrm{t}|\mathbf{x}\right]=\int\mathrm{t}p\left(\mathrm{t}|\mathbf{x}\right)\mathrm{d}\mathrm{t}=\sum_{k=1}^K\pi_k\left(\mathbf{x}\right)\boldsymbol{\mu}_k\left(\mathbf{x}\right).$$
Because a standard network trained by least squares is approximating the conditional mean, we see that a mixture density network can reproduce the conventional least-squares result as a special case. Of course, as we have already noted, for a multimodal distribution the conditional mean is of limited value.

We can similarly evaluate the variance of the density function about the conditional average, to give
$$\begin{aligned}
s^2\left(\mathbf{x}\right)&=\mathbb{E}\left[\|\mathrm{t}-\mathbb{E}\left[\mathrm{t}|\mathbf{x}\right]\|^2\big|\mathbf{x}\right] \\
&=\sum_{k=1}^K\pi_k\left(\mathbf{x}\right)\left\{L\sigma_k^2\left(\mathbf{x}\right)+\left\|\boldsymbol{\mu}_k\left(\mathbf{x}\right)-\sum_{l=1}^K\pi_l\left(\mathbf{x}\right)\boldsymbol{\mu}_l\left(\mathbf{x}\right)\right\|^2\right\}
\end{aligned}$$
where $L$ is the dimensionality of $\mathrm{t}$; for a scalar target the first term reduces to $\sigma_k^2\left(\mathbf{x}\right)$. This is more general than the corresponding least-squares result because the variance is a function of $\mathbf{x}$.
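For a scalar target, the conditional mean and variance follow directly from the mixture parameters at a given input. The sketch below uses hypothetical function names and a hypothetical bimodal example.

```python
def conditional_mean(pi, mu):
    """E[t | x] = sum_k pi_k mu_k for a scalar target."""
    return sum(p * m for p, m in zip(pi, mu))

def conditional_variance(pi, mu, sigma):
    """s^2(x) = sum_k pi_k {sigma_k^2 + (mu_k - E[t | x])^2} for a scalar target."""
    mean = conditional_mean(pi, mu)
    return sum(p * (s ** 2 + (m - mean) ** 2) for p, m, s in zip(pi, mu, sigma))

# Hypothetical bimodal conditional density at one input x:
# two narrow, equally weighted components centred at -1 and +1.
pi = [0.5, 0.5]
mu = [-1.0, 1.0]
sigma = [0.1, 0.1]

mean = conditional_mean(pi, mu)                  # 0.0, midway between the modes
variance = conditional_variance(pi, mu, sigma)   # approximately 1.01
```

In this example the conditional mean falls in a region of very low density, and almost all of the variance comes from the separation of the component centres rather than from the component widths.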

We have seen that for multimodal distributions, the conditional mean can give a poor representation of the data. In such cases, the conditional mode may be of more value. Because the conditional mode for the mixture density network does not have a simple analytical solution, this would require numerical iteration. A simple alternative is to take the mean of the most probable component (i.e., the one with the largest mixing coefficient) at each value of $\mathbf{x}$.
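The simple alternative to the conditional mode can be sketched in a few lines. The function name and the values are hypothetical, and ties are resolved in favour of the first component.

```python
def most_probable_component(pi):
    """Index of the component with the largest mixing coefficient."""
    return max(range(len(pi)), key=lambda k: pi[k])

# Hypothetical mixture parameters at one input x (scalar target).
pi = [0.3, 0.7]
mu = [-1.0, 1.0]

approx_mode = mu[most_probable_component(pi)]         # centre of component 1
cond_mean = sum(p * m for p, m in zip(pi, mu))        # lies between the two centres
```

This is only an approximation to the true mode. A component with a smaller mixing coefficient but a much narrower width can have a higher peak density than the component selected here.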

## 5.7 Bayesian Neural Networks

So far, our discussion of neural networks has focused on the use of maximum likelihood to determine the network parameters (weights and biases). Regularized maximum likelihood can be interpreted as a MAP (maximum a posteriori) approach in which the regularizer can be viewed as the logarithm of a prior parameter distribution. However, in a Bayesian treatment we need to marginalize over the distribution of parameters in order to make predictions.

We developed a Bayesian solution for a simple linear regression model under the assumption of Gaussian noise. We saw that the posterior distribution, which is Gaussian, could be evaluated exactly and that the predictive distribution could also be found in closed form. In the case of a multilayered network, the highly nonlinear dependence of the network function on the parameter values means that an exact Bayesian treatment can no longer be found. In fact, the log of the posterior distribution will be nonconvex, corresponding to the multiple local minima in the error function.

The most complete treatment has been based on the Laplace approximation, which we adopt here. We will approximate the posterior distribution by a Gaussian, centred at a mode of the true posterior. Furthermore, we shall assume that the covariance of this Gaussian is small so that the network function is approximately linear with respect to the parameters over the region of parameter space for which the posterior probability is significantly nonzero. With these two approximations, we will obtain models that are analogous to the linear regression and classification models discussed in earlier chapters, and so we can exploit the results obtained there. We can then make use of the evidence framework to provide point estimates for the hyperparameters and to compare alternative models (for example, networks having different numbers of hidden units).

### 5.7.1 Posterior parameter distribution

Consider the problem of predicting a single continuous target variable $t$ from a vector $\mathbf{x}$ of inputs (the extension to multiple targets is straightforward). We shall suppose that the conditional distribution $p\left(t|\mathbf{x}\right)$ is Gaussian, with an $\mathbf{x}$-dependent mean given by the output of a neural network model $y\left(\mathbf{x},\mathbf{w}\right)$, and with precision (inverse variance) $\beta$, so that
$$p\left(t|\mathbf{x},\mathbf{w},\beta\right)=\mathcal{N}\left(t|y\left(\mathbf{x},\mathbf{w}\right),\beta^{-1}\right).$$

Similarly, we shall choose a prior distribution over the weights $\mathbf{w}$ that is Gaussian of the form 
$$p\left(\mathbf{w}|\alpha\right)=\mathcal{N}\left(\mathbf{w}|\mathbf{0},\alpha^{-1}\mathbf{I}\right).$$

For an i.i.d. data set of $N$ observations $\mathbf{x}_1,\dots,\mathbf{x}_N$, with a corresponding set of target values $\mathcal{D}=\{t_1,\dots,t_N\}$, the likelihood function is given by
$$p\left(\mathcal{D}|\mathbf{w},\beta\right)=\prod_{n=1}^N\mathcal{N}\left(t_n|y\left(\mathbf{x}_n,\mathbf{w}\right),\beta^{-1}\right)$$
and so the resulting posterior distribution is
$$p\left(\mathbf{w}|\mathcal{D},\alpha,\beta\right)\propto p\left(\mathbf{w}|\alpha\right)p\left(\mathcal{D}|\mathbf{w},\beta\right)$$
which, as a consequence of the nonlinear dependence of $y\left(\mathbf{x},\mathbf{w}\right)$ on $\mathbf{w}$, will be non-Gaussian.
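The unnormalized log posterior is straightforward to evaluate for a small network. The sketch below assumes a two-layer network with 'tanh' hidden units and a single linear output. The flat weight layout, the function names, and the data are hypothetical. Terms independent of $\mathbf{w}$ are dropped.

```python
import math

def network(x, w, n_hidden):
    """Two-layer network y(x, w) with tanh hidden units and one linear output.

    Assumed flat layout of w: for each hidden unit, len(x) input weights
    followed by a bias; then n_hidden output weights followed by an output bias.
    """
    d = len(x)
    pos = 0
    hidden = []
    for _ in range(n_hidden):
        a = sum(w[pos + i] * x[i] for i in range(d)) + w[pos + d]
        hidden.append(math.tanh(a))
        pos += d + 1
    return sum(w[pos + j] * hidden[j] for j in range(n_hidden)) + w[pos + n_hidden]

def log_posterior_unnormalized(w, inputs, targets, alpha, beta, n_hidden):
    """ln p(w | alpha) + ln p(D | w, beta), up to terms independent of w."""
    log_prior = -0.5 * alpha * sum(wi * wi for wi in w)
    sq_err = sum((network(x, w, n_hidden) - t) ** 2 for x, t in zip(inputs, targets))
    return log_prior - 0.5 * beta * sq_err

# Hypothetical data and weights: one input, two hidden units (7 parameters).
inputs = [[-1.0], [0.0], [0.5], [2.0]]
targets = [0.3, -0.1, 0.4, 1.2]
w = [0.5, -0.1, -1.2, 0.3, 0.8, -0.6, 0.05]

# Flip the sign of every weight into and out of hidden unit 0.
# Because tanh is odd, the network function is unchanged.
w_flipped = list(w)
for i in (0, 1, 4):  # input weight, bias, and output weight of hidden unit 0
    w_flipped[i] = -w_flipped[i]

lp = log_posterior_unnormalized(w, inputs, targets, 1.0, 10.0, 2)
lp_flipped = log_posterior_unnormalized(w_flipped, inputs, targets, 1.0, 10.0, 2)
```

The two distinct weight vectors receive the same posterior density, so any mode with nonzero weights for that hidden unit has an equally high twin. This is one concrete way in which the posterior departs from a single-mode Gaussian.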