# Chapter 11 - Training Deep Neural Networks

## The Vanishing/Exploding Gradients Problems

Unfortunately, gradients often get smaller and smaller as the algorithm
progresses down to the lower layers. As a result, the Gradient Descent update
leaves the lower layers’ connection weights virtually unchanged, and training
never converges to a good solution. We call this the vanishing gradients
problem. In some cases, the opposite can happen: the gradients can grow bigger
and bigger until layers get insanely large weight updates and the algorithm
diverges. This is the exploding gradients problem, which surfaces most often in recurrent
neural networks. More generally, deep neural networks suffer from unstable gradients: different layers may learn at widely different speeds.
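
To get a feel for how quickly gradients can shrink, the sketch below looks only at the slope of the logistic function, which never exceeds 0.25. Backpropagation multiplies one such slope per layer. The weights also enter that product, but this toy calculation ignores them, a simplifying assumption made to keep it short. The helper names are illustrative.

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) activation."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_slope(z):
    """Derivative of the logistic function: s(z) * (1 - s(z))."""
    s = logistic(z)
    return s * (1.0 - s)

# The slope peaks at z = 0, where it equals 0.25.
z = np.linspace(-10.0, 10.0, 2001)
max_slope = logistic_slope(z).max()

# Upper bound on the product of slopes across `depth` layers
# (weights are ignored here, which is a simplifying assumption).
slope_bound = {depth: max_slope ** depth for depth in (1, 5, 10, 20)}

for depth, bound in slope_bound.items():
    print(f"{depth:2d} layers: product of slopes <= {bound:.2e}")
```

Even in this best case the bound shrinks geometrically with depth, which is why the lower layers of a deep logistic network receive such small updates.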

To mitigate this problem, we can use activation functions other than the logistic function, and we can also change how the network's weights are initialized.

### Glorot and He Initialization

To mitigate vanishing and exploding gradients, Glorot and Bengio pointed out that the signal needs to flow properly in both directions of the network: forward when making predictions and backward when propagating gradients. For this to happen, they argue that the variance of each layer's outputs must equal the variance of its inputs, and that the gradients must have equal variance before and after flowing through a layer in the reverse direction.

Both conditions cannot hold at once unless a layer has as many inputs as outputs, so they proposed a compromise: an initialization scheme called Xavier initialization or Glorot initialization. Let's define two quantities, $fan_{in}$ and $fan_{out}$:

- The $fan_{in}$ of a layer is its number of inputs, and the $fan_{out}$ is its number of outputs (see this [image](https://miro.medium.com/max/424/1*aIMnYrXAlawJEOWIUgKFug.jpeg)). Their average is $fan_{avg} = (fan_{in} + fan_{out}) / 2$.
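
Why the average of the two? A rough sketch, assuming a linear layer whose weights $w$ and inputs $x$ are independent with zero mean (simplifying assumptions, not a full reproduction of the original argument), with $z$ the layer's output and $L$ the loss:

$$
\begin{aligned}
\mathrm{Var}(z) &= fan_{in}\,\mathrm{Var}(w)\,\mathrm{Var}(x)
  &&\Rightarrow\quad \mathrm{Var}(w) = \frac{1}{fan_{in}} \quad \text{(forward pass)} \\
\mathrm{Var}\!\left(\frac{\partial L}{\partial x}\right) &= fan_{out}\,\mathrm{Var}(w)\,\mathrm{Var}\!\left(\frac{\partial L}{\partial z}\right)
  &&\Rightarrow\quad \mathrm{Var}(w) = \frac{1}{fan_{out}} \quad \text{(backward pass)}
\end{aligned}
$$

Under these assumptions, the two requirements agree only when $fan_{in} = fan_{out}$. Using $\mathrm{Var}(w) = 1/fan_{avg}$ sits between the two.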

With Glorot initialization, the initial weights of each layer are drawn from one of the following distributions:

1. a normal distribution with mean 0 and variance $\sigma^2 = \frac{1}{fan_{avg}}$;
2. or a uniform distribution between $-r$ and $+r$, with $r = \sqrt{\frac{3}{fan_{avg}}}$.
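
Both rules give the same variance, since a uniform distribution on $[-r, r]$ has variance $r^2/3 = 1/fan_{avg}$. A minimal NumPy sketch of the two rules follows. The function names are illustrative, not a library API, and the layer size (300 inputs, 100 outputs) is an arbitrary choice. Keras ships its own `glorot_normal` and `glorot_uniform` initializers, whose sampling details may differ from this sketch.

```python
import numpy as np

def glorot_normal(fan_in, fan_out, rng):
    """Sample weights from N(0, 1 / fan_avg)."""
    fan_avg = (fan_in + fan_out) / 2
    sigma = np.sqrt(1.0 / fan_avg)
    return rng.normal(0.0, sigma, size=(fan_in, fan_out))

def glorot_uniform(fan_in, fan_out, rng):
    """Sample weights from U(-r, r) with r = sqrt(3 / fan_avg)."""
    fan_avg = (fan_in + fan_out) / 2
    r = np.sqrt(3.0 / fan_avg)
    return rng.uniform(-r, r, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
w_normal = glorot_normal(300, 100, rng)
w_uniform = glorot_uniform(300, 100, rng)

# Both empirical variances should be close to 1 / fan_avg = 1 / 200.
print(w_normal.var(), w_uniform.var(), 1 / 200)
```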

These initializations are recommended when using the tanh, logistic, or softmax activation functions. If you replace $fan_{avg}$ with $fan_{in}$, you get the initialization strategy proposed by Yann LeCun, called LeCun initialization. It is recommended when using the SELU activation function, which will be discussed soon. For ReLU and its variants, He initialization uses $\sigma^2 = 2/fan_{in}$ instead.

In summary:

| Initialization | Activation functions | $\sigma^2$ (Normal) |
|----------------|----------------------|--------------------|
| Glorot | None, tanh, logistic, softmax | $1/fan_{avg}$ |
| He | ReLU and variants (leaky ReLU, RReLU, PReLU) | $2/fan_{in}$ |
| LeCun | SELU | $1/fan_{in}$ |
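
The table can be written as a small lookup helper. This is only a sketch: `init_variance` and its scheme names are hypothetical, and the layer size in the worked example is arbitrary. In Keras you would normally just pass an initializer name such as `kernel_initializer="he_normal"` to the layer.

```python
def init_variance(scheme, fan_in, fan_out):
    """Variance of the normal distribution for each scheme in the table."""
    fan_avg = (fan_in + fan_out) / 2
    variances = {
        "glorot": 1.0 / fan_avg,
        "he": 2.0 / fan_in,
        "lecun": 1.0 / fan_in,
    }
    return variances[scheme]

# Worked example: a layer with 300 inputs and 100 outputs.
for scheme in ("glorot", "he", "lecun"):
    print(scheme, init_variance(scheme, fan_in=300, fan_out=100))
```

Note that the He variance is exactly twice the LeCun variance for the same layer.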

These initialization strategies help reduce vanishing and exploding gradients, but they are not enough on their own to solve the problem. Let's talk about activation functions.

### Nonsaturating Activation Functions

Choosing the right activation function for the network architecture can improve model performance and shorten training time. After years of using mostly the logistic function, researchers found that another activation function works well: ReLU.

ReLU is nice because it does not saturate for positive values and its derivative is straightforward to compute, but it comes at a cost: the dying ReLUs problem. This happens when a neuron's weights are such that the weighted sum of its inputs is negative for every instance in the training set. The neuron then always outputs zero, and because the gradient of ReLU is also zero for negative inputs, Gradient Descent no longer updates it: that neuron is effectively "dead".
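
A toy illustration of a dead neuron, assuming a single-input neuron with a made-up weight and a large negative bias chosen so that the weighted sum is negative for every input:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    """Derivative of ReLU (taken as 0 at z = 0)."""
    return (z > 0).astype(float)

# Hypothetical neuron: the values of w and b are chosen for illustration only.
x = np.linspace(-1.0, 1.0, 5)
w, b = 0.5, -3.0
z = w * x + b            # negative for every input in x

outputs = relu(z)        # all zeros: the neuron never fires
grads = relu_grad(z)     # all zeros: no gradient flows back to w or b
print(outputs, grads)
```

Since every gradient is zero, no Gradient Descent step can change `w` or `b`, so the neuron stays dead.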

---
A workaround for this problem is the leaky ReLU function: $LeakyReLU(z) = \max(\alpha z, z)$, where $\alpha$ is typically set to a low value, e.g., 0.02. This guarantees that when $z < 0$, the output and the gradient are small but not zero, so the neuron can still recover. Some variations of the leaky ReLU are:

- the randomized leaky ReLU (RReLU), where $\alpha$ is picked randomly in a given range during training and is fixed to an average value during testing;
- the parametric leaky ReLU (PReLU), where $\alpha$ is learned during training. This strategy works well only on large datasets; on small datasets it runs the risk of overfitting.
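
A minimal NumPy sketch of the basic leaky ReLU and its gradient, using the example value $\alpha = 0.02$ from the text. The function names are illustrative, and the formula assumes $0 < \alpha < 1$.

```python
import numpy as np

def leaky_relu(z, alpha=0.02):
    """max(alpha * z, z); valid for 0 < alpha < 1."""
    return np.maximum(alpha * z, z)

def leaky_relu_grad(z, alpha=0.02):
    """Slope is 1 for positive inputs and alpha otherwise."""
    return np.where(z > 0, 1.0, alpha)

z = np.array([-10.0, -1.0, 0.5, 3.0])
out = leaky_relu(z)
grad = leaky_relu_grad(z)
print(out, grad)
```

The gradient is never zero, so a neuron stuck in the negative region still receives small updates.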

---

Another proposed alternative to ReLU is the ELU activation function. In its authors' experiments, it outperformed all the ReLU variants. It alleviates the vanishing gradients problem and avoids the dead neurons problem. Further, if its parameter $\alpha$ is set to 1, it is smooth everywhere, including around $z = 0$, which helps speed up Gradient Descent.

The ELU is given by $ELU_{\alpha}(z) = \alpha(\exp(z) - 1)$ if $z < 0$, and $z$ if $z \geq 0$. The drawback of ELU is the exponential function, which makes it slower to compute than ReLU and its variants. During training, this cost is offset by faster convergence, but at test time an ELU network will still be slower than a ReLU network.
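
The definition translates directly to NumPy. The sketch below (illustrative function names) also checks the smoothness claim numerically: with $\alpha = 1$ the slope just left of zero matches the slope just right of zero, while any other $\alpha$ leaves a kink.

```python
import numpy as np

def elu(z, alpha=1.0):
    z = np.asarray(z, dtype=float)
    # np.minimum keeps expm1 from overflowing on large positive inputs,
    # whose branch is discarded by np.where anyway.
    return np.where(z < 0, alpha * np.expm1(np.minimum(z, 0.0)), z)

def elu_grad(z, alpha=1.0):
    z = np.asarray(z, dtype=float)
    return np.where(z < 0, alpha * np.exp(np.minimum(z, 0.0)), 1.0)

eps = 1e-6
left_slope = elu_grad(-eps)               # approaches alpha = 1
right_slope = elu_grad(eps)               # exactly 1
kinked_slope = elu_grad(-eps, alpha=0.5)  # approaches 0.5: not smooth at 0
print(left_slope, right_slope, kinked_slope)
```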

---
Last but not least comes the SELU function, a scaled variant of the ELU function (hence the name). Researchers showed that SELU can make the network **self-normalize**: the output of each layer tends to preserve zero mean and unit variance during training, which addresses the vanishing/exploding gradients problem. As a result, SELU often outperforms other activation functions in such networks. However, self-normalization only happens under some conditions:

1. Input features must be standardized (mean 0, variance 1);
2. Every hidden layer's weights must be initialized with LeCun normal initialization (`kernel_initializer="lecun_normal"`);
3. The network architecture must be sequential. It cannot have skip connections;
4. Self-normalization is only guaranteed if all layers are dense.
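
A rough NumPy experiment can illustrate self-normalization under these conditions: standardized inputs, a plain stack of dense layers, and LeCun normal initialization. The layer width, depth, batch size, and seed are arbitrary choices for this sketch. The two constants are the rounded values of SELU's $\alpha$ and scale parameters.

```python
import numpy as np

# SELU constants (rounded).
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805

def selu(z):
    return SCALE * np.where(z < 0, ALPHA * np.expm1(np.minimum(z, 0.0)), z)

rng = np.random.default_rng(42)
n_units, n_layers = 100, 100      # arbitrary sizes for this demo

a = rng.normal(size=(500, n_units))   # standardized inputs: mean 0, variance 1
for _ in range(n_layers):
    # LeCun normal initialization: variance 1 / fan_in
    w = rng.normal(scale=np.sqrt(1.0 / n_units), size=(n_units, n_units))
    a = selu(a @ w)               # dense layer without bias, for simplicity

final_mean, final_std = a.mean(), a.std()
print(f"after {n_layers} layers: mean {final_mean:.2f}, std {final_std:.2f}")
```

If self-normalization holds, the printed mean should stay near 0 and the standard deviation near 1 even after 100 layers, rather than collapsing or blowing up.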

To use it in Keras, type:

`layer = keras.layers.Dense(10, activation="selu", kernel_initializer="lecun_normal")`