# $\S$ 11.4. Fitting Neural Networks

The neural network model has unknown parameters, often called _weights_, and we seek values for them that make the model fit the training data well.

We denote the complete set of weights by $\theta$, which consists of

\begin{align}
\{ \alpha_{0m}, \alpha_m : m=1,2,\cdots,M \} & & M(p+1) \text{ weights}, \\
\{ \beta_{0k}, \beta_k : k=1,2,\cdots,K \} & & K(M+1) \text{ weights}.
\end{align}

For regression, we use sum-of-squared errors as our measure of fit (error function)

\begin{equation}
R(\theta) = \sum_{k=1}^K \sum_{i=1}^N (y_{ik} - f_k(x_i))^2.
\end{equation}

For classification we use either squared error or cross-entropy (deviance)

\begin{equation}
R(\theta) = -\sum_{i=1}^N \sum_{k=1}^K y_{ik}\log f_k(x_i),
\end{equation}

and the corresponding classifier is

\begin{equation}
G(x) = \arg\max_k f_k(x).
\end{equation}

With the softmax activation function and the cross-entropy error function, the neural network model is exactly a linear logistic regression in the hidden units, and all the parameters are estimated by maximum likelihood.
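As a minimal sketch of the two error functions and the classifier, the NumPy snippet below evaluates them on a made-up pair of observations. The function names, the toy arrays, and the `eps` clipping guard are assumptions for illustration only; class indices are 0-based in the code, whereas $k$ runs from $1$ to $K$ in the formulas.

```python
import numpy as np

def squared_error(Y, F):
    """Sum-of-squares error: sum over cases i and outputs k of (y_ik - f_k(x_i))^2."""
    return float(np.sum((Y - F) ** 2))

def cross_entropy(Y, F, eps=1e-12):
    """Deviance: -sum_i sum_k y_ik log f_k(x_i). eps guards against log(0) (an assumption)."""
    return float(-np.sum(Y * np.log(np.clip(F, eps, 1.0))))

def classify(F):
    """G(x) = argmax_k f_k(x), returned as a 0-based class index per row."""
    return np.argmax(F, axis=1)

# Toy data: N = 2 cases, K = 3 classes, one-hot targets (hypothetical values).
Y = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
F = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2]])

print(squared_error(Y, F))   # approximately 0.52
print(cross_entropy(Y, F))   # approximately 1.0498, i.e. -(log 0.7 + log 0.5)
print(classify(F))           # [0 1]
```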

### Regularization

Typically we don't want the global minimizer of $R(\theta)$, as this is likely to be an overfit solution. Instead some regularization is needed: this is achieved directly through a penalty term, or indirectly by early stopping.

### Back-propagation = gradient descent

The generic approach to minimizing $R(\theta)$ is by gradient descent, called _back-propagation_ in this setting. Because of the compositional form of the model, the gradient can be easily derived using the chain rule for differentiation.

This can be computed by a forward and backward sweep over the network, keeping track only of quantities local to each unit.

### Back-propagation for squared error loss

Let
* $z_{mi} = \sigma(\alpha_{0m} + \alpha_m^T x_i)$, and
* $z_i = (z_{1i}, z_{2i}, \cdots, z_{Mi})$.

Then we have

\begin{align}
R(\theta) &\equiv \sum_{i=1}^N R_i \\
&= \sum_{i=1}^N \sum_{k=1}^K \left( y_{ik} - f_k(x_i) \right)^2,
\end{align}

with derivatives

\begin{align}
\frac{\partial R_i}{\partial\beta_{km}} &= -2\left( y_{ik} - f_k(x_i) \right) g_k'(\beta_k^T z_i) z_{mi}, \\
\frac{\partial R_i}{\partial\alpha_{ml}} &= -2\sum_{k=1}^K \left( y_{ik} - f_k(x_i) \right) g_k'(\beta_k^T z_i) \beta_{km} \sigma'(\alpha_m^T x_i) x_{il}.
\end{align}
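The first of these follows from the chain rule. As a sketch, assume that $g_k$ depends only on its own linear predictor $\beta_k^T z_i$ (true for the identity output used in regression, but not for softmax), and omit the bias term as in the formulas above:

\begin{align}
\frac{\partial R_i}{\partial\beta_{km}}
&= \frac{\partial R_i}{\partial f_k(x_i)} \cdot \frac{\partial f_k(x_i)}{\partial(\beta_k^T z_i)} \cdot \frac{\partial(\beta_k^T z_i)}{\partial\beta_{km}} \\
&= -2\left( y_{ik} - f_k(x_i) \right) \cdot g_k'(\beta_k^T z_i) \cdot z_{mi}.
\end{align}

The derivative with respect to $\alpha_{ml}$ applies the same rule one layer deeper: $z_{mi}$ feeds every output unit, so the contributions are summed over $k$, and each picks up the extra factors $\beta_{km}$, $\sigma'(\alpha_m^T x_i)$ and $x_{il}$.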

Given these derivatives, a gradient descent update at the $(r+1)$st iteration has the form

\begin{align}
\beta_{km}^{(r+1)} &= \beta_{km}^{(r)} - \gamma_r \sum_{i=1}^N \frac{\partial R_i}{\partial\beta_{km}^{(r)}}, \\
\alpha_{ml}^{(r+1)} &= \alpha_{ml}^{(r)} - \gamma_r \sum_{i=1}^N \frac{\partial R_i}{\partial \alpha_{ml}^{(r)}},
\end{align}

where $\gamma_r$ is the learning rate, discussed below.

### Back-propagation equations

Now write the gradients as

\begin{align}
\frac{\partial R_i}{\partial\beta_{km}} &= \delta_{ki} z_{mi}, \\
\frac{\partial R_i}{\partial\alpha_{ml}} &= s_{mi} x_{il}.
\end{align}

The quantities $\delta_{ki}$ and $s_{mi}$ are "errors" from the current model at the output and hidden layer units, respectively. From their definitions, these errors satisfy

\begin{equation}
s_{mi} = \sigma'(\alpha_m^T x_i) \sum_{k=1}^K \beta_{km} \delta_{ki},
\end{equation}

known as the _back-propagation equations_. Using this, the gradient descent updates can be implemented with a two-pass algorithm.
1. In the _forward pass_, the current weights are fixed and the predicted values $\hat{f}_k(x_i)$ are computed from the formula  
  
  \begin{align}
  Z_m &= \sigma(\alpha_{0m} + \alpha_m^T X), & m=1,\cdots,M, \\
  T_k &= \beta_{0k} + \beta_k^T Z, & k=1,\cdots,K, \\
  f_k(X) &= g_k(T), & k=1,\cdots,K.
  \end{align}
  
2. In the _backward pass_, the errors $\delta_{ki}$ are computed, and then back-propagated via  

  \begin{equation}
  s_{mi} = \sigma'(\alpha_m^T x_i) \sum_{k=1}^K \beta_{km} \delta_{ki},
  \end{equation}
  
  to give the errors $s_{mi}$.

Both sets of errors are then used to compute the gradients for the updates.
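The two passes can be sketched in NumPy as follows. This is an illustration under stated assumptions, not a reference implementation: $\sigma$ is taken to be the sigmoid, the output function is the identity (so $g_k' = 1$), and the biases $\alpha_{0m}$ and $\beta_{0k}$ are absorbed by prepending a constant 1 to the inputs and hidden units. The function names and the toy data are hypothetical.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def forward_pass(X, alpha, beta):
    """X: (N, p), alpha: (M, p+1), beta: (K, M+1). Weights are held fixed."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])   # constant 1 carries alpha_0m
    Z = sigmoid(X1 @ alpha.T)                        # hidden units z_mi, shape (N, M)
    Z1 = np.hstack([np.ones((Z.shape[0], 1)), Z])   # constant 1 carries beta_0k
    F = Z1 @ beta.T                                  # identity output: f_k = T_k
    return X1, Z1, F

def backward_pass(X1, Z1, F, Y, beta):
    """Compute output errors delta, back-propagate to s, and form the gradients."""
    delta = -2.0 * (Y - F)                           # delta_ki with g_k' = 1, shape (N, K)
    Z = Z1[:, 1:]                                    # drop the constant unit
    # s_mi = sigma'(.) * sum_k beta_km delta_ki, using sigma' = z (1 - z)
    s = Z * (1.0 - Z) * (delta @ beta[:, 1:])        # shape (N, M)
    grad_beta = delta.T @ Z1                         # sum_i delta_ki z_mi, shape (K, M+1)
    grad_alpha = s.T @ X1                            # sum_i s_mi x_il, shape (M, p+1)
    return grad_alpha, grad_beta

def batch_update(X, Y, alpha, beta, gamma):
    """One batch gradient descent step with learning rate gamma."""
    X1, Z1, F = forward_pass(X, alpha, beta)
    grad_alpha, grad_beta = backward_pass(X1, Z1, F, Y, beta)
    return alpha - gamma * grad_alpha, beta - gamma * grad_beta

# Toy problem (hypothetical sizes): N = 20, p = 3, M = 4, K = 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
Y = rng.normal(size=(20, 2))
alpha = 0.5 * rng.normal(size=(4, 3 + 1))
beta = 0.5 * rng.normal(size=(2, 4 + 1))
alpha_new, beta_new = batch_update(X, Y, alpha, beta, gamma=1e-3)
```

Only the hidden-unit columns of `beta` enter the back-propagation step, since the constant unit has no incoming weights. A finite-difference comparison against $R(\theta)$ is a useful sanity check for gradients written this way.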

### Advantages of back-propagation

This two-pass procedure is what is known as back-propagation. It has also been called the _delta rule_ (Widrow and Hoff, 1960). The computational components for cross-entropy have the same form as those for the sum of squares error function (Exercise 11.3).

The advantage of back-propagation is its simple, local nature. In the back-propagation algorithm, each hidden unit passes and receives information only to and from units that share a connection. Hence it can be implemented efficiently on a parallel architecture computer.

### Batch and online learning

The updates

\begin{align}
\beta_{km}^{(r+1)} &= \beta_{km}^{(r)} - \gamma_r \sum_{i=1}^N \frac{\partial R_i}{\partial\beta_{km}^{(r)}}, \\
\alpha_{ml}^{(r+1)} &= \alpha_{ml}^{(r)} - \gamma_r \sum_{i=1}^N \frac{\partial R_i}{\partial \alpha_{ml}^{(r)}},
\end{align}

are a kind of _batch learning_, with the parameter updates being a sum over all of the training cases.

Learning can also be carried out online: processing each observation one at a time, updating the weights after each training case, and cycling through the training cases many times. In this case, the sums in the updates are replaced by a single summand.

A _training epoch_ refers to one sweep through the entire training set. Online training allows the network to handle very large training sets, and also to update the weights as new observations come in.
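A rough sketch of online learning is given below: the weights are updated after every training case, and one call to `online_epoch` is one training epoch. It assumes a sigmoid $\sigma$, an identity output function, biases absorbed through a constant 1, and, for simplicity only, a fixed small learning rate. All names and the toy data are hypothetical.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def sse(X1, Y, alpha, beta):
    """Sum-of-squares error over the whole training set."""
    Z1 = np.hstack([np.ones((X1.shape[0], 1)), sigmoid(X1 @ alpha.T)])
    return float(np.sum((Y - Z1 @ beta.T) ** 2))

def online_epoch(X1, Y, alpha, beta, gamma):
    """One sweep through the training set; alpha and beta are modified in place."""
    for i in range(X1.shape[0]):
        x, y = X1[i], Y[i]
        z = sigmoid(alpha @ x)                        # hidden units for case i, shape (M,)
        z1 = np.concatenate(([1.0], z))               # constant unit carries beta_0k
        f = beta @ z1                                 # identity output, shape (K,)
        delta = -2.0 * (y - f)                        # output errors delta_ki
        s = z * (1.0 - z) * (beta[:, 1:].T @ delta)   # back-propagated errors s_mi
        beta -= gamma * np.outer(delta, z1)           # single summand, not a sum over i
        alpha -= gamma * np.outer(s, x)
    return alpha, beta

# Toy regression problem (hypothetical): N = 50, p = 1, M = 3, K = 1.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(50, 1))
Y = 1.0 + X
X1 = np.hstack([np.ones((50, 1)), X])                 # constant 1 carries alpha_0m
alpha = 0.5 * rng.normal(size=(3, 2))
beta = 0.5 * rng.normal(size=(1, 4))

error_before = sse(X1, Y, alpha, beta)
for epoch in range(50):
    online_epoch(X1, Y, alpha, beta, gamma=0.01)
error_after = sse(X1, Y, alpha, beta)
```

Because each update touches only one observation, the same loop could in principle consume new observations as they arrive instead of cycling over a stored training set.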

### Learning rate

The learning rate $\gamma_r$ for batch learning is usually taken to be a constant, and can also be optimized by a line search that minimizes the error function at each update.

With online learning $\gamma_r$ should decrease to zero as the iteration $r \rightarrow \infty$. This learning is a form of _stochastic approximation_ (Robbins and Monro, 1951); results in this field ensure convergence if
* $\gamma_r \rightarrow 0$,
* $\sum_r \gamma_r = \infty$, and
* $\sum_r \gamma_r^2 \lt \infty$.

These conditions are satisfied, for example, by $\gamma_r = 1/r$.
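A quick numerical illustration for $\gamma_r = 1/r$ (a sketch only; the function name and the cut-off values are arbitrary choices): the partial sums of $\gamma_r$ keep growing, roughly like $\log n$, while the partial sums of $\gamma_r^2$ level off below $\pi^2/6$.

```python
import math

def partial_sums(n):
    """Partial sums of gamma_r and gamma_r^2 for gamma_r = 1/r, r = 1..n."""
    sum_gamma = sum(1.0 / r for r in range(1, n + 1))
    sum_gamma_sq = sum(1.0 / r ** 2 for r in range(1, n + 1))
    return sum_gamma, sum_gamma_sq

for n in (10, 1_000, 100_000):
    s1, s2 = partial_sums(n)
    # s1 keeps increasing with n; s2 approaches pi^2 / 6 from below
    print(n, round(s1, 3), round(s2, 5), round(math.pi ** 2 / 6, 5))
```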

### It's slow

Back-propagation can be very slow, and for that reason is usually not the method of choice. Second-order techniques such as Newton's method are not attractive here, because the second derivative matrix of $R$ (the Hessian) can be very large.
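To give a sense of scale, the sketch below counts the weights of a single-hidden-layer network and the entries of the corresponding Hessian. The network sizes are hypothetical, and the memory figure assumes a dense matrix of 8-byte floats.

```python
def count_weights(p, M, K):
    """Total weights: M(p+1) for the hidden layer plus K(M+1) for the output layer."""
    return M * (p + 1) + K * (M + 1)

def hessian_entries(p, M, K):
    """The Hessian of R is square in the number of weights."""
    n = count_weights(p, M, K)
    return n * n

# Hypothetical network: p = 100 inputs, M = 50 hidden units, K = 10 outputs.
n_weights = count_weights(100, 50, 10)        # 5560
n_entries = hessian_entries(100, 50, 10)      # 30913600
dense_megabytes = n_entries * 8 / 1e6         # about 247 MB as dense float64
print(n_weights, n_entries, round(dense_megabytes, 1))
```

The gradient, by contrast, has only as many entries as there are weights, which is why methods that work from gradients alone scale more comfortably.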

Better approaches to fitting include conjugate gradient and variable metric methods. These avoid explicit computation of the second derivative matrix while still providing faster convergence.