# $\S$ 11.5. Some Issues in Training Neural Networks

> There is quite an art in training neural networks.

The model is generally overparametrized, and the optimization problem is nonconvex and unstable unless certain guidelines are followed.

## $\S$ 11.5.1. Starting Values

Note that if the weights are near zero, then the operative part of the sigmoid (FIGURE 11.3) is roughly linear, and hence the neural network collapses into an approximately linear model (Exercise 11.2).
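The near-linearity is easy to check numerically. The sketch below assumes the standard logistic sigmoid $\sigma(v) = 1/(1+e^{-v})$, whose first-order Taylor expansion at zero is $1/2 + v/4$; the function names are illustrative only.

```python
import math

def sigmoid(v):
    """Logistic sigmoid 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + math.exp(-v))

def sigmoid_linear(v):
    """First-order Taylor expansion of the sigmoid around v = 0."""
    return 0.5 + 0.25 * v

def max_linear_error(radius, n=201):
    """Largest |sigmoid - linear approximation| on a grid over [-radius, radius]."""
    grid = [-radius + 2.0 * radius * i / (n - 1) for i in range(n)]
    return max(abs(sigmoid(v) - sigmoid_linear(v)) for v in grid)

# The approximation is excellent for small arguments and degrades as |v| grows.
for radius in (0.1, 0.5, 2.0):
    print(f"|v| <= {radius}: max error = {max_linear_error(radius):.2e}")
```

With small weights the arguments to the sigmoid stay in the range where the error is negligible, which is why the whole network behaves like a linear model there.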

Usually starting values for weights are chosen to be random values near zero. Hence the model starts out nearly linear, and becomes nonlinear as the weights increase. Individual units localize to directions and introduce nonlinearities where needed.

Use of exact zero weights leads to zero derivatives and perfect symmetry, and the algorithm never moves.

Starting instead with large weights often leads to poor solutions.
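The symmetry problem with exact zero weights can be seen on a toy example. The sketch below makes several simplifying assumptions that are not in the text: a single output with identity output function, squared-error loss, no bias terms, and made-up data. Under those assumptions the hidden-layer gradient is exactly zero at the start, and the hidden units remain identical to one another throughout training.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gradients(alpha, beta, X, y):
    """Gradients of R = sum_i (y_i - f(x_i))^2, f(x) = sum_m beta[m] * sigmoid(alpha[m] . x)."""
    M, p = len(alpha), len(X[0])
    g_alpha = [[0.0] * p for _ in range(M)]
    g_beta = [0.0] * M
    for x, target in zip(X, y):
        z = [sigmoid(sum(a * xl for a, xl in zip(alpha[m], x))) for m in range(M)]
        resid = target - sum(b * zm for b, zm in zip(beta, z))
        for m in range(M):
            g_beta[m] += -2.0 * resid * z[m]
            s = -2.0 * resid * beta[m] * z[m] * (1.0 - z[m])  # back-propagated error
            for l in range(p):
                g_alpha[m][l] += s * x[l]
    return g_alpha, g_beta

def train(alpha, beta, X, y, rate=0.01, steps=200):
    """Plain batch gradient descent; returns updated copies of the weights."""
    alpha = [row[:] for row in alpha]
    beta = beta[:]
    for _ in range(steps):
        g_alpha, g_beta = gradients(alpha, beta, X, y)
        alpha = [[a - rate * g for a, g in zip(ra, rg)] for ra, rg in zip(alpha, g_alpha)]
        beta = [b - rate * g for b, g in zip(beta, g_beta)]
    return alpha, beta

# Toy data (made up for illustration): 2 inputs, 4 cases, 3 hidden units.
X = [[1.0, 0.5], [-0.5, 1.0], [0.3, -1.2], [-1.0, -0.4]]
y = [1.0, 0.0, 1.0, 0.0]
zero_alpha = [[0.0, 0.0] for _ in range(3)]
zero_beta = [0.0, 0.0, 0.0]

g_alpha0, g_beta0 = gradients(zero_alpha, zero_beta, X, y)
alpha_fit, beta_fit = train(zero_alpha, zero_beta, X, y)
print("initial hidden-layer gradient:", g_alpha0)
print("hidden-layer weights after training:", alpha_fit)
```

Every row of `alpha_fit` is the same, so the three hidden units act as one. Small random starting values break this tie.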

## $\S$ 11.5.2. Overfitting

Often neural networks have too many weights and will overfit the data at the global minimum of $R$.

In early developments of neural networks, either by design or by accident, an early stopping rule was used to avoid overfitting. Here we train the model only for a while, and stop well before we approach the global minimum. Since the weights start at a highly regularized (linear) solution, this has the effect of shrinking the final model toward a linear model. A validation dataset is useful for determining when to stop, since we expect the validation error to start increasing.
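One common way to turn "stop when the validation error starts increasing" into a rule is sketched below. The `patience` parameter (how many non-improving epochs to tolerate) is an assumption of this sketch, not something the text prescribes, and the validation curve is made up.

```python
def early_stopping_epoch(validation_errors, patience=5):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation errors.

    Training halts once the validation error has failed to improve for
    `patience` consecutive epochs; the weights saved at `best_epoch`
    would be the ones to keep.
    """
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    # Never triggered: training ran to the end of the sequence.
    return best_epoch, len(validation_errors) - 1

# Made-up validation curve: falls, then rises as the network starts to overfit.
curve = [0.90, 0.61, 0.48, 0.41, 0.39, 0.40, 0.42, 0.45, 0.49, 0.55, 0.62]
best, stopped = early_stopping_epoch(curve, patience=3)
print(f"keep weights from epoch {best}, stop at epoch {stopped}")
```

In a real training loop the same logic would run online, with a copy of the weights saved whenever the validation error improves.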

### Weight decay

A more explicit method for regularization is _weight decay_, which is analogous to ridge regression used for linear models ($\S$ 3.4.1). We add a penalty to the error function

\begin{equation}
R(\theta) + \lambda J(\theta),
\end{equation}

where

\begin{equation}
J(\theta) = \sum_{k,m} \beta_{km}^2 + \sum_{m,l} \alpha_{ml}^2
\end{equation}

and $\lambda \ge 0$ is a tuning parameter.

Larger values of $\lambda$ will tend to shrink the weights toward zero; typically cross-validation is used to estimate $\lambda$. The effect of the penalty is simply to add terms $2\lambda\beta_{km}$ and $2\lambda\alpha_{ml}$ to the respective gradient expressions (11.13).
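As a rough illustration, the penalty and its contribution to a gradient-descent update can be written as follows. Differentiating $\lambda J(\theta)$ with respect to a single weight $w$ gives $2\lambda w$. The helper names and the nested-list weight layout are assumptions of this sketch.

```python
def weight_decay_penalty(beta, alpha):
    """J(theta): sum of squared weights over both layers (nested lists)."""
    return (sum(b * b for row in beta for b in row)
            + sum(a * a for row in alpha for a in row))

def add_weight_decay(grad_R, weights, lam):
    """Add the gradient of lam * J, i.e. 2 * lam * w, to a gradient of R."""
    return [[g + 2.0 * lam * w for g, w in zip(g_row, w_row)]
            for g_row, w_row in zip(grad_R, weights)]

def decay_step(weights, grad_R, lam, rate):
    """One gradient-descent update on R + lam * J for a single weight matrix."""
    grad = add_weight_decay(grad_R, weights, lam)
    return [[w - rate * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(weights, grad)]

beta = [[0.8, -1.5, 0.1]]
flat_grad = [[0.0, 0.0, 0.0]]  # pretend R is flat, to isolate the penalty's effect
print(decay_step(beta, flat_grad, lam=0.1, rate=0.5))
```

With a flat $R$, each update multiplies every weight by $1 - 2 \cdot \text{rate} \cdot \lambda$ (here 0.9), which is where the name "weight decay" comes from.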

Other forms for the penalty have been proposed, e.g.,

\begin{equation}
J(\theta) = \sum_{k,m}\frac{\beta_{km}^2}{1+\beta_{km}^2} + \sum_{m,l}\frac{\alpha_{ml}^2}{1+\alpha_{ml}^2},
\end{equation}

known as the _weight elimination_ penalty. This has the effect of shrinking smaller weights more.
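To see how the two penalties differ, it helps to compare their derivatives for a single weight $w$: $2w$ for weight decay versus $2w/(1+w^2)^2$ for weight elimination. The sketch below, with illustrative function names, evaluates both at a small, a moderate, and a large weight.

```python
def weight_elimination(w):
    """Per-weight elimination penalty w^2 / (1 + w^2)."""
    return w * w / (1.0 + w * w)

def weight_elimination_grad(w):
    """Derivative of w^2 / (1 + w^2) with respect to w."""
    return 2.0 * w / (1.0 + w * w) ** 2

def weight_decay_grad(w):
    """Derivative of the weight-decay penalty w^2 with respect to w."""
    return 2.0 * w

for w in (0.1, 1.0, 10.0):
    print(f"w = {w:5.1f}  decay: {weight_decay_grad(w):7.3f}  "
          f"elimination: {weight_elimination_grad(w):7.4f}")
```

For small weights the two gradients nearly coincide, while for large weights the elimination gradient fades toward zero. Large weights are therefore left almost alone, and the shrinkage falls mainly on the small ones.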

### Examples

FIGURE 11.4 shows the results of training a neural network with 10 hidden units on the mixture example of Chapter 2, with and without weight decay. Weight decay has clearly improved the prediction.

FIGURE 11.5 shows heat maps of the estimated weights from the training (grayscale versions of these are called _Hinton diagrams_). We see that weight decay has dampened the weights in both layers: the resulting weights are spread fairly evenly over the 10 hidden units.

In [1]:
"""FIGURE 11.4. A neural network on the mixture example with and without weight decay."""
print('Under construction ...')

Under construction ...


In [2]:
"""FIGURE 11.5. Heat maps of the estimated weights"""
print('Under construction ...')

Under construction ...


## $\S$ 11.5.3. Scaling of the Inputs

Since the scaling of the inputs determines the effective scaling of the weights in the bottom layer, it can have a large effect on the quality of the final solution.

At the outset it is best to standardize all inputs to have mean zero and standard deviation one. This ensures all inputs are treated equally in the regularization process, and allows one to choose a meaningful range for the random starting weights.

With standardized inputs, it is typical to take random uniform weights over the range $[-0.7, +0.7]$.
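A minimal sketch of both steps, using only the standard library. The function names, the use of the population standard deviation, and the list-of-rows data layout are assumptions; a constant input column would need special handling, since its standard deviation is zero.

```python
import random
import statistics

def standardize(X):
    """Center each column to mean 0 and scale it to standard deviation 1.

    Returns the standardized rows together with the column means and
    standard deviations, so the same transform can be applied to new data.
    """
    columns = list(zip(*X))
    means = [statistics.fmean(col) for col in columns]
    sds = [statistics.pstdev(col) for col in columns]  # population SD (an assumption)
    Z = [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in X]
    return Z, means, sds

def init_weights(n_hidden, n_inputs, limit=0.7, seed=None):
    """Draw an n_hidden x n_inputs weight matrix uniformly from [-limit, +limit]."""
    rng = random.Random(seed)
    return [[rng.uniform(-limit, limit) for _ in range(n_inputs)]
            for _ in range(n_hidden)]

# Made-up inputs on very different scales.
X = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0], [4.0, 40.0]]
Z, means, sds = standardize(X)
alpha = init_weights(n_hidden=5, n_inputs=2, seed=1)
print(Z[0], alpha[0])
```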

## $\S$ 11.5.4. Number of Hidden Units and Layers

> Generally speaking it is better to have too many hidden units than too few.

* With too few hidden units, the model might not have enough flexibility to capture the nonlinearities in the data.
* With too many hidden units, the extra weights can be shrunk toward zero if appropriate regularization is used.

Typically the number of hidden units is somewhere in the range of 5 to 100, with the number increasing with the number of inputs and number of training cases. It is most common to put down a reasonably large number of units and train them with regularization.

Some researchers use cross-validation to estimate the optimal number, but this seems unnecessary if cross-validation is used to estimate the regularization parameter.

> Choice of the number of hidden layers is guided by background knowledge and experimentation.

Each layer extracts features of the input for regression or classification. Use of multiple hidden layers allows construction of hierarchical features at different levels of resolution (for an example see $\S$ 11.6).

## $\S$ 11.5.5. Multiple Minima

The error function $R(\theta)$ is nonconvex, possessing many local minima. As a result, the final solution obtained is quite dependent on the choice of starting weights. One must at least try a number of random starting configurations, and choose the solution giving lowest (penalized) error.

Probably a better approach is to use the average predictions over the collection of networks as the final prediction (Ripley, 1996). This is preferable to averaging the weights, since the nonlinearity of the model implies that this averaged solution could be quite poor.

Another approach is via _bagging_, which averages the predictions of networks trained on randomly perturbed versions of the training data ($\S$ 8.7).
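A deliberately contrived example shows why averaging weights can fail. The two networks below (scalar input, no bias terms, hand-picked weights, all assumptions of this sketch) contain the same two hidden units listed in opposite order, so they compute exactly the same function. Averaging their predictions reproduces that function, while averaging their weights cancels everything out.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def predict(net, x):
    """Single-hidden-layer network, scalar input: sum_m beta_m * sigmoid(alpha_m * x)."""
    alpha, beta = net
    return sum(b * sigmoid(a * x) for a, b in zip(alpha, beta))

def average_predictions(nets, x):
    """Average the outputs of several networks at the point x."""
    return sum(predict(net, x) for net in nets) / len(nets)

def average_weights(nets):
    """Average the weights unit by unit and return the resulting single network."""
    n = len(nets)
    alpha = [sum(col) / n for col in zip(*(net[0] for net in nets))]
    beta = [sum(col) / n for col in zip(*(net[1] for net in nets))]
    return alpha, beta

net_a = ([4.0, -4.0], [1.0, -1.0])
net_b = ([-4.0, 4.0], [-1.0, 1.0])  # the same two units, listed in the opposite order

x = 1.0
print("network A:           ", predict(net_a, x))
print("network B:           ", predict(net_b, x))
print("averaged predictions:", average_predictions([net_a, net_b], x))
print("averaged weights:    ", predict(average_weights([net_a, net_b]), x))
```

Networks fit from different random starts are rarely exact permutations of each other, but the same mismatch between corresponding units is what makes weight averaging unreliable.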
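The bagging recipe itself is short. In the sketch below, `fit` is a hypothetical callable standing in for whatever routine trains a network and returns a prediction function. To keep the example self-contained, a one-dimensional least-squares line is used as the learner instead of a neural network, and the data are made up.

```python
import random

def bagged_predict(fit, X, y, x_new, n_models=25, seed=0):
    """Average the predictions of models fit to bootstrap resamples of (X, y).

    `fit(X, y)` must return a callable model with model(x) -> prediction.
    """
    rng = random.Random(seed)
    n = len(X)
    total = 0.0
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # n cases drawn with replacement
        model = fit([X[i] for i in idx], [y[i] for i in idx])
        total += model(x_new)
    return total / n_models

def fit_line(X, y):
    """Stand-in learner (NOT a neural network): 1-D least-squares line."""
    n = len(X)
    mx, my = sum(X) / n, sum(y) / n
    sxx = sum((x - mx) ** 2 for x in X)
    # Fall back to a constant fit if the resample contains a single distinct x.
    slope = sum((x - mx) * (t - my) for x, t in zip(X, y)) / sxx if sxx > 0 else 0.0
    return lambda x: my + slope * (x - mx)

X = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.1, 0.9, 2.2, 2.8, 4.1, 5.0]
print(bagged_predict(fit_line, X, y, x_new=2.5))
```

Swapping `fit_line` for a network-training routine with the same interface gives bagged neural networks. Each network would also typically start from its own random weights.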