Training deep neural networks can involve a number of problems, such as

- gradients becoming too small to influence lower layers during backpropagation (vanishing)
- gradients becoming so large that training diverges (exploding)
- training data that's insufficient or impractical to label
- slow training
- too many parameters, not enough instances

Each of these issues can be addressed with the right techniques, which makes it feasible to train very deep networks.

# Vanishing and Exploding Gradients

Deep neural networks suffer from unstable gradients: different layers can learn at very different speeds. During backpropagation, gradients tend to shrink as they propagate down to the lower layers, sometimes to the point that the lower-layer weights barely change and training never converges to a good solution. The opposite can also happen: gradients grow from layer to layer, producing very large weight updates that make the algorithm diverge.

The problem was traced to a once-popular combination of two choices:

1. the logistic sigmoid activation function
2. weight initialization from a normal distribution with $\mu=0$ and $\sigma=1$

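The sigmoid function saturates at 0 for large negative inputs and at 1 for large positive inputs, and its derivative is close to 0 in both regions. With this initialization, the variance of each layer's outputs is larger than the variance of its inputs, so activations tend to saturate in the upper layers. When backpropagation starts, there is very little gradient to send back through the network, and what little there is keeps shrinking on its way to the lower layers.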
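The saturation effect can be checked numerically. The following is a minimal NumPy sketch, not taken from the original notes: the helper names `sigmoid` and `sigmoid_grad` and the layer sizes are illustrative assumptions. It evaluates the sigmoid derivative at a few points, then looks at the pre-activations of a single layer whose weights are drawn from $N(0, 1)$.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative peaks at 0.25 for x = 0 and collapses as |x| grows.
xs = np.array([-10.0, -5.0, 0.0, 5.0, 10.0])
grads = sigmoid_grad(xs)

# Pre-activations of one layer whose weights are drawn from N(0, 1).
rng = np.random.default_rng(0)
fan_in, n_units = 100, 50            # illustrative sizes (assumption)
x = rng.normal(size=(1000, fan_in))  # inputs with roughly unit variance
W = rng.normal(0.0, 1.0, size=(fan_in, n_units))
z = x @ W                            # variance grows to roughly fan_in

print("sigmoid'(x):", grads)
print("std of pre-activations:", z.std())
print("mean sigmoid'(z):", sigmoid_grad(z).mean())
```

Because the pre-activations are spread far wider than the narrow region where the derivative is appreciable, most units sit in a saturated zone and the average derivative ends up well below its 0.25 maximum.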

Glorot initialization was introduced to alleviate this issue. It rests on two requirements: the variance of each layer's outputs should equal the variance of its inputs, and the gradients should have equal variance before and after flowing through a layer in the reverse direction. Both conditions can hold exactly only when a layer has as many inputs as neurons, so in practice a compromise is used: the connection weights of each layer are initialized randomly as follows.

*Equation 1: Glorot initialization for use with the logistic activation function*

\begin{equation*}
\text{Normal distribution with } \mu=0 \text{ and } \sigma^2=\frac{1}{{fan}_{avg}}\\[3ex]
\text{or uniform distribution between } -r \text{ and } +r\text{, with }r = \sqrt{\frac{3}{{fan}_{avg}}}
\end{equation*}

where

- ${fan}_{avg} = \frac{{fan}_{in} + {fan}_{out}}{2}$
- ${fan}_{in}$ is the number of inputs to the layer
- ${fan}_{out}$ is the number of neurons in the layer (its outputs)
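Equation 1 can be sketched directly in NumPy. The function names `glorot_normal` and `glorot_uniform` and the layer sizes below are illustrative assumptions, and deep learning libraries may differ in details such as truncating the normal distribution. Note that a uniform distribution on $[-r, +r]$ has variance $r^2/3$, which is why $r = \sqrt{3/{fan}_{avg}}$ gives the same variance as the normal variant.

```python
import numpy as np

def glorot_normal(fan_in, fan_out, rng):
    """Weights from N(0, sigma^2) with sigma^2 = 1 / fan_avg."""
    fan_avg = (fan_in + fan_out) / 2
    sigma = np.sqrt(1.0 / fan_avg)
    return rng.normal(0.0, sigma, size=(fan_in, fan_out))

def glorot_uniform(fan_in, fan_out, rng):
    """Weights from U(-r, +r) with r = sqrt(3 / fan_avg)."""
    fan_avg = (fan_in + fan_out) / 2
    r = np.sqrt(3.0 / fan_avg)
    return rng.uniform(-r, r, size=(fan_in, fan_out))

rng = np.random.default_rng(42)
fan_in, fan_out = 300, 100               # illustrative layer shape (assumption)
target_var = 1.0 / ((fan_in + fan_out) / 2)

W_normal = glorot_normal(fan_in, fan_out, rng)
W_uniform = glorot_uniform(fan_in, fan_out, rng)

print("target variance: ", target_var)
print("normal variance: ", W_normal.var())
print("uniform variance:", W_uniform.var())
```

Both sample variances should land close to the target $1/{fan}_{avg}$, up to sampling noise.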

A similar strategy is LeCun initialization, which replaces ${fan}_{avg}$ with ${fan}_{in}$ (so $\sigma^2 = \frac{1}{{fan}_{in}}$) and pairs well with the SELU activation function. He initialization instead uses $\sigma^2 = \frac{2}{{fan}_{in}}$, which is better suited to ReLU and its variants.
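The three schemes differ only in the variance they assign to the weights. As a rough illustration, the sketch below pushes random inputs through a stack of ReLU layers and compares the mean squared activation at the top under Glorot and He initialization. The helper names `init_std` and `final_mean_square`, the width, and the depth are all assumptions made for this example, and only the normal-distribution variants are used.

```python
import numpy as np

def init_std(scheme, fan_in, fan_out):
    """Standard deviation of the normal variant of each scheme."""
    fan_avg = (fan_in + fan_out) / 2
    variances = {
        "glorot": 1.0 / fan_avg,
        "lecun": 1.0 / fan_in,
        "he": 2.0 / fan_in,
    }
    return np.sqrt(variances[scheme])

def final_mean_square(scheme, width=256, depth=20, seed=0):
    """Mean squared activation after `depth` ReLU layers of equal width."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(512, width))     # unit-variance inputs
    sigma = init_std(scheme, width, width)
    for _ in range(depth):
        W = rng.normal(0.0, sigma, size=(width, width))
        a = np.maximum(0.0, a @ W)        # ReLU activation
    return np.mean(a ** 2)

ms_glorot = final_mean_square("glorot")
ms_he = final_mean_square("he")

print("Glorot + ReLU:", ms_glorot)
print("He + ReLU:    ", ms_he)
```

ReLU zeroes out roughly half of each layer's pre-activations, so with Glorot variance the signal shrinks layer after layer, while the factor of 2 in He initialization compensates and keeps the signal at roughly its original scale.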