# Training Deep Neural Networks
The neural networks developed so far have been shallow, with only a few layers. What if we're tackling a much more complex problem, such as detecting hundreds of types of objects in high-resolution images? Such a task may call for a much deeper network.

Training deep neural networks can be problematic, for example:
- You may face the *vanishing/exploding gradients* problem. This is when the gradients grow smaller and smaller, or larger and larger, when flowing backward through the DNN during training. Either way, it makes the lower layers difficult to train.
- You might not have enough training data, or it may be too costly to label.
- Training may be extremely slow.
- A model with millions of parameters would severely risk overfitting the training set, especially if there are not enough training instances or the dataset is too noisy.

## The vanishing/exploding gradients problem

Recall the backpropagation algorithm used to train neural networks: it propagates the error gradient from the output layer back toward the input layer. The gradients often get smaller and smaller as the algorithm progresses down to the lower layers. As a result, the Gradient Descent update leaves the lower layers' connection weights virtually unchanged, and training never converges to a good solution. This is the *vanishing gradients* problem.
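The shrinking effect can be reproduced in a toy setting. The following is a minimal sketch, not taken from the source: it assumes a chain of single-neuron layers with sigmoid activations that all share one weight, and the name `gradient_per_layer` and its parameters are made up for illustration. Because the sigmoid derivative never exceeds 0.25, each backward step with a weight of 1 shrinks the gradient by at least a factor of 4.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradient_per_layer(weight, depth, x=0.5):
    """Return |d(output)/d(z_k)| for every layer k of a chain of
    single-neuron sigmoid layers (index 0 is the lowest layer)."""
    # Forward pass: keep each layer's activation for the backward pass.
    activations = []
    a = x
    for _ in range(depth):
        a = sigmoid(weight * a)
        activations.append(a)

    # Backward pass: start from the top layer and move down.
    grads = [0.0] * depth
    grad_a = 1.0  # d(output)/d(top activation)
    for k in reversed(range(depth)):
        a_k = activations[k]
        grad_z = grad_a * a_k * (1.0 - a_k)  # sigmoid'(z) = a * (1 - a)
        grads[k] = abs(grad_z)
        grad_a = grad_z * weight  # gradient handed to the layer below
    return grads


grads = gradient_per_layer(weight=1.0, depth=10)
for k, g in enumerate(grads, start=1):
    print(f"layer {k:2d}: |gradient| = {g:.2e}")
```

The printed magnitudes fall steadily from the top layer (layer 10) to the bottom layer (layer 1), which is why the lowest layers barely move during a Gradient Descent step.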

The opposite can also happen: the gradients can grow bigger and bigger until layers get extremely large weight updates and the algorithm diverges. This is the *exploding gradients* problem, which is mostly encountered in recurrent neural networks. More generally, deep networks suffer from unstable gradients: different layers may learn at widely different speeds.

In a [2010 paper](https://homl.info/47), the authors identified a few suspects for why gradients can be so unstable, including the combination of the popular logistic sigmoid activation function and the weight initialization technique that was most popular at the time (a normal distribution with a mean of 0 and a standard deviation of 1). They showed that with this activation function and this initialization scheme, the variance of the outputs of each layer is much greater than the variance of its inputs. Going forward in the network, the variance keeps increasing until the activation function saturates at the top layers. Once the function saturates at 0 or 1, its derivative is extremely close to 0, so backpropagation has almost no gradient to pass down through the network. (Figure 11-1 on page 333 illustrates this.)
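To get a feel for the variance growth described above, here is a minimal sketch using only the Python standard library. It is an illustration under assumed settings (100 unit-variance inputs, 50 neurons, 100 samples; names such as `layer_statistics` are made up here), not the experiment from the paper. It draws weights from a normal distribution with mean 0 and standard deviation 1, then measures the variance of the weighted sums feeding the sigmoid and how many sigmoid outputs land within 0.01 of 0 or 1.

```python
import math
import random
import statistics


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def layer_statistics(fan_in=100, n_neurons=50, n_samples=100, seed=42):
    """Return (variance of the weighted sums, fraction of saturated outputs)
    for one dense sigmoid layer whose weights are drawn from N(0, 1)."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0) for _ in range(fan_in)]
               for _ in range(n_neurons)]
    weighted_sums = []
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(fan_in)]  # unit-variance inputs
        for w in weights:
            weighted_sums.append(sum(wi * xi for wi, xi in zip(w, x)))
    outputs = [sigmoid(z) for z in weighted_sums]
    saturated = sum(1 for a in outputs if a < 0.01 or a > 0.99) / len(outputs)
    return statistics.pvariance(weighted_sums), saturated


z_var, saturated = layer_statistics()
print(f"input variance is about 1, weighted-sum variance is about {z_var:.1f}")
print(f"fraction of sigmoid outputs within 0.01 of 0 or 1: {saturated:.0%}")
```

With unit-variance weights, the variance of each weighted sum scales roughly with the number of inputs, which pushes the sigmoid into its flat regions.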

## Glorot and He initialization

The authors of the paper, Xavier Glorot and Yoshua Bengio, propose a way to mitigate the unstable gradients problem. They point out that we need the signal to flow properly in both directions: forward when making predictions, and in the reverse direction when backpropagating gradients. We don't want the signal to die out, nor to explode and saturate. They argue that we need the variance of the outputs of each layer to be equal to the variance of its inputs, and we need the gradients to have equal variance before and after flowing through a layer in the reverse direction.
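A rough sketch of why these two conditions constrain the weights, under simplifying assumptions that are not stated above (weights and inputs are independent with zero mean, and the activation is approximately linear around 0; $L$ denotes the loss): for a neuron computing $z = \sum_i w_i x_i$ over $fan_{\text{in}}$ inputs,

$$\mathrm{Var}(z) = fan_{\text{in}} \cdot \mathrm{Var}(w) \cdot \mathrm{Var}(x)$$

so the forward signal keeps its variance when $fan_{\text{in}} \cdot \mathrm{Var}(w) = 1$. In the reverse direction, each input receives gradient contributions from $fan_{\text{out}}$ neurons, which under the same assumptions gives

$$\mathrm{Var}\left(\frac{\partial L}{\partial x}\right) = fan_{\text{out}} \cdot \mathrm{Var}(w) \cdot \mathrm{Var}\left(\frac{\partial L}{\partial z}\right)$$

so the backward signal keeps its variance when $fan_{\text{out}} \cdot \mathrm{Var}(w) = 1$.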

It is not actually possible to guarantee both, unless a layer has an equal number of inputs and neurons (these numbers are called *fan-in* and *fan-out* of the layer), but the authors proposed a good compromise: the connection weights of each layer must be initialized randomly as described by the equation below:

$$\text{Normal distribution with mean 0 and variance } \sigma^2 = \frac{1}{fan_{\text{avg}}}$$
or
$$\text{Uniform distribution between } -r \text{ and } +r \text{, with } r = \sqrt{\frac{3}{fan_{\text{avg}}}}$$

where $fan_{\text{avg}} = (fan_{\text{in}} + fan_{\text{out}})/2$. This strategy is called *Xavier* or *Glorot initialization*. Using Glorot initialization can speed up training considerably.
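As a quick sanity check of the two formulas, the sketch below computes $\sigma$ and $r$ for a hypothetical layer with 300 inputs and 100 neurons, samples weights from both distributions, and compares their empirical variance with $1/fan_{\text{avg}}$. The function names are illustrative, not part of any library, and only the Python standard library is used.

```python
import math
import random
import statistics


def glorot_normal_std(fan_in, fan_out):
    """Standard deviation of the Glorot normal distribution."""
    fan_avg = (fan_in + fan_out) / 2
    return math.sqrt(1.0 / fan_avg)


def glorot_uniform_limit(fan_in, fan_out):
    """Limit r of the Glorot uniform distribution on [-r, +r]."""
    fan_avg = (fan_in + fan_out) / 2
    return math.sqrt(3.0 / fan_avg)


fan_in, fan_out = 300, 100  # hypothetical layer: 300 inputs, 100 neurons
target_variance = 2.0 / (fan_in + fan_out)  # 1 / fan_avg

rng = random.Random(0)
n_weights = fan_in * fan_out
sigma = glorot_normal_std(fan_in, fan_out)
r = glorot_uniform_limit(fan_in, fan_out)
normal_weights = [rng.gauss(0.0, sigma) for _ in range(n_weights)]
uniform_weights = [rng.uniform(-r, r) for _ in range(n_weights)]

print(f"target variance:           {target_variance:.5f}")
print(f"normal sample variance:    {statistics.pvariance(normal_weights):.5f}")
print(f"uniform sample variance:   {statistics.pvariance(uniform_weights):.5f}")
```

Both distributions end up with the same variance because a uniform distribution on $[-r, +r]$ has variance $r^2/3$.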

If we replace $fan_{\text{avg}}$ with $fan_{\text{in}}$, we get *LeCun initialization*, which Yann LeCun proposed in the 1990s. It is equivalent to Glorot initialization when $fan_{\text{in}} = fan_{\text{out}}$.

Some papers have provided similar initialization strategies for different activation functions. These strategies differ only by the scale of the variance and by whether they use $fan_{\text{avg}}$ or $fan_{\text{in}}$:

| Initialization | Activation functions           | $\sigma^2$ (Normal)     |
| -------------- | ------------------------------ | ----------------------- |
| Glorot         | None, tanh, logistic, softmax  | 1/$fan_{\text{avg}}$    |
| He             | ReLU and variants              | 2/$fan_{\text{in}}$     |
| LeCun          | SELU                           | 1/$fan_{\text{in}}$     |

For the uniform distribution, just compute $r=\sqrt{3\sigma^2}$. Note that for ReLU and its variants, the initialization is called *He initialization*.
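The three strategies can be summarized as a scale and a choice of fan. The sketch below encodes them in a small lookup, with He and LeCun based on $fan_{\text{in}}$ (consistent with the definition of LeCun initialization above) and Glorot based on $fan_{\text{avg}}$. The `SCHEMES` dictionary and the function names are hypothetical helpers written for this illustration, not a library API.

```python
import math

# (scale, fan mode) for each strategy, so that sigma^2 = scale / fan.
SCHEMES = {
    "glorot": (1.0, "fan_avg"),
    "he": (2.0, "fan_in"),
    "lecun": (1.0, "fan_in"),
}


def init_variance(scheme, fan_in, fan_out):
    """Variance sigma^2 of the normal distribution for the given strategy."""
    scale, mode = SCHEMES[scheme]
    fan = fan_in if mode == "fan_in" else (fan_in + fan_out) / 2
    return scale / fan


def uniform_limit(variance):
    """Limit r of the uniform distribution with the same variance."""
    return math.sqrt(3.0 * variance)


fan_in, fan_out = 300, 100  # hypothetical layer sizes
for name in SCHEMES:
    variance = init_variance(name, fan_in, fan_out)
    print(f"{name:>6}: sigma^2 = {variance:.5f}, r = {uniform_limit(variance):.4f}")
```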

By default, Keras uses Glorot initialization with a uniform distribution. When creating a layer, we can switch to He initialization by setting ```kernel_initializer="he_uniform"``` or ```kernel_initializer="he_normal"```.

If you want He initialization with a uniform distribution but based on $fan_{\text{avg}}$ rather than $fan_{\text{in}}$, you can use the ```VarianceScaling``` initializer as follows:

```python
import keras
from keras.layers import Dense

# He-style scaling (scale=2) computed from fan_avg instead of fan_in, with a uniform distribution
he_avg_init = keras.initializers.VarianceScaling(scale=2., mode='fan_avg', distribution='uniform')
Dense(10, activation='sigmoid', kernel_initializer=he_avg_init)
```

## Nonsaturating Activation Functions