# Training Deep Neural Networks
The NNs developed so far have been shallow, with only a few layers. What if we're tackling a much more complex problem, such as detecting hundreds of types of objects in high-resolution images?

Training deep NNs can be problematic, for example:
- You may face the *vanishing/exploding gradients* problem. This is when the gradients grow smaller and smaller, or larger and larger, when flowing backwards through the DNN during training. This makes it difficult to train the lower layers.
- You might not have enough training data, or it may be too costly to label.
- Training may be extremely slow.
- A model with millions of parameters would severely risk overfitting the training set, especially if there are not enough training instances or the dataset is too noisy.

## The vanishing/exploding gradients problem

Recall the backpropagation algorithm used to train neural nets. The gradients often get smaller and smaller as the algorithm progresses down to the lower layers. As a result, the Gradient Descent update leaves the lower layers' connection weights virtually unchanged, and training never converges to a good solution. This is the *vanishing gradients* problem.

The opposite can also happen: the gradients can grow bigger and bigger until layers get extremely large weight updates and the algorithm diverges. This is the *exploding gradients* problem, which surfaces in recurrent NNs. More generally, deep networks suffer from unstable gradients: different layers may learn at widely different speeds.
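To make the two failure modes concrete, here is a toy sketch rather than a real network. It assumes one neuron per layer, the same weight in every layer, and every sigmoid evaluated at $z=0$, where its slope reaches its maximum of 0.25. The function name `backprop_scale` and the weights 1 and 8 are illustrative choices only.

```python
def backprop_scale(weight, n_layers, sigmoid_slope=0.25):
    """Factor by which a gradient is scaled after flowing back through n_layers.

    Toy assumptions: one neuron per layer, identical weights, and each
    sigmoid sitting at z = 0, where its slope is at its maximum (0.25).
    """
    scale = 1.0
    for _ in range(n_layers):
        # Chain rule: each layer multiplies the gradient by weight * slope.
        scale *= weight * sigmoid_slope
    return scale


vanishing = backprop_scale(weight=1.0, n_layers=10)  # factor 0.25 per layer
exploding = backprop_scale(weight=8.0, n_layers=10)  # factor 2.0 per layer
print(f"weight=1: {vanishing:.2e}   weight=8: {exploding:.2e}")
```

Under these assumptions, a per-layer factor below 1 shrinks the gradient geometrically, and a factor above 1 grows it geometrically, which is why the lower layers are the ones affected most.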

In a [2010 paper](https://homl.info/47), the authors identified a few suspects for why gradients can be so unstable, including the combination of the popular logistic sigmoid activation function and the weight initialization technique that was popular at the time (a normal distribution with a mean of 0 and a standard deviation of 1). They showed that with this activation function and this initialization scheme, the variance of the outputs of each layer is much greater than the variance of its inputs. Going forward in the network, the variance keeps increasing until the activation function saturates at the top layers (fig 11-1 on pg 333 illustrates this). When the function saturates at 0 or 1, its derivative is close to 0, so backpropagation has almost no gradient to propagate back through the network.
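The variance growth described above can be checked numerically for a single layer. This is only a sketch: the layer width of 100, the batch of 1,000 random instances, and the thresholds 0.02 and 0.98 used to call an output "saturated" are all arbitrary assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in = 100  # assumed layer width, for illustration only


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Standardized inputs and the old-style N(0, 1) weight initialization.
X = rng.normal(0.0, 1.0, size=(1000, fan_in))
W = rng.normal(0.0, 1.0, size=(fan_in, fan_in))

Z = X @ W        # weighted sums: variance grows roughly by a factor of fan_in
A = sigmoid(Z)   # most outputs are pushed towards 0 or 1

saturated = np.mean((A < 0.02) | (A > 0.98))
print(f"input variance: {X.var():.2f}, weighted-sum variance: {Z.var():.1f}")
print(f"fraction of saturated sigmoid outputs: {saturated:.2f}")
```

With unit-variance weights, each weighted sum adds up `fan_in` terms of roughly unit variance, so the sigmoid receives inputs far out on its flat tails.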

## Glorot and He initialization

The authors of the paper, Xavier Glorot and Yoshua Bengio, propose a way to mitigate the unstable gradients problem. They point out that we need the signal to flow properly in both directions: forwards when making predictions and in the reverse direction when backpropagating gradients. We don't want the signal to die out, nor to explode and saturate. They argue that we need the variance of the outputs of each layer to be equal to the variance of its inputs, and we need the gradients to have equal variance before and after flowing through a layer in the reverse direction.

It is not actually possible to guarantee both, unless a layer has an equal number of inputs and neurons (these numbers are called *fan-in* and *fan-out* of the layer), but the authors proposed a good compromise: the connection weights of each layer must be initialized randomly as described by the equation below:

$$\text{Normal distribution with mean 0 and variance }\sigma^2 = \frac{1}{fan_{\text{avg}}}$$
or
$$\text{Uniform distribution between -r and +r with }r = \sqrt{\frac{3}{fan_{\text{avg}}}}$$

where $fan_{\text{avg}} = (fan_{in} + fan_{out})/2$. This strategy is called *Xavier* or *Glorot initialization*. Using Glorot initialization can speed up training considerably.
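As a quick sanity check of the two formulas, the sketch below samples weights both ways with NumPy and compares their empirical variance with $1/fan_{\text{avg}}$. The helper names and the layer size (300 inputs, 100 neurons) are assumptions made for illustration, and the normal version uses a plain, untruncated normal distribution.

```python
import numpy as np


def glorot_normal(fan_in, fan_out, rng):
    """Normal distribution with mean 0 and variance 1 / fan_avg."""
    fan_avg = (fan_in + fan_out) / 2
    return rng.normal(0.0, np.sqrt(1.0 / fan_avg), size=(fan_in, fan_out))


def glorot_uniform(fan_in, fan_out, rng):
    """Uniform distribution between -r and +r with r = sqrt(3 / fan_avg)."""
    fan_avg = (fan_in + fan_out) / 2
    r = np.sqrt(3.0 / fan_avg)
    return rng.uniform(-r, r, size=(fan_in, fan_out))


rng = np.random.default_rng(42)
fan_in, fan_out = 300, 100                    # hypothetical layer shape
target_var = 1.0 / ((fan_in + fan_out) / 2)   # 1 / fan_avg

W_normal = glorot_normal(fan_in, fan_out, rng)
W_uniform = glorot_uniform(fan_in, fan_out, rng)
print(f"target: {target_var:.4f}, normal: {W_normal.var():.4f}, "
      f"uniform: {W_uniform.var():.4f}")
```

Both samplers should land close to the same variance, because a uniform distribution on $[-r, r]$ has variance $r^2/3$.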

If we replace $fan_{\text{avg}}$ with $fan_{\text{in}}$ we get *LeCun initialization*, which was proposed in the 1990s. It is equivalent to Glorot initialization when $fan_{\text{in}} = fan_{\text{out}}$.

Some papers have provided similar strategies for different activation functions. They differ only by the scale of the variance and whether they use $fan_{\text{avg}}$ or $fan_{\text{in}}$:

| Initialization | Activation functions           | $\sigma^2$ (Normal)     |
| -------------- | ------------------------------ | ----------------------- |
| Glorot         | None, tanh, logistic, softmax  | 1/$fan_{\text{avg}}$    |
| He             | ReLU and variants              | 2/$fan_{\text{in}}$     |
| LeCun          | SELU                           | 1/$fan_{\text{in}}$     |

For the uniform distribution, just compute $r=\sqrt{3\sigma^2}$. The initialization strategy for ReLU and its variants is called *He initialization*.
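The three strategies can be summarized in one small helper. This sketch assumes the usual definitions: Glorot uses $1/fan_{\text{avg}}$, He uses $2/fan_{\text{in}}$, and LeCun uses $1/fan_{\text{in}}$. The names `INIT_RULES` and `init_params` are made up for this example and are not part of any library.

```python
import math

# (variance scale, which fan to divide by) for each strategy
INIT_RULES = {
    "glorot": (1.0, "avg"),
    "he": (2.0, "in"),
    "lecun": (1.0, "in"),
}


def init_params(name, fan_in, fan_out):
    """Return (sigma^2 for the normal version, r for the uniform version)."""
    scale, mode = INIT_RULES[name]
    fan = fan_in if mode == "in" else (fan_in + fan_out) / 2
    variance = scale / fan
    r = math.sqrt(3 * variance)  # uniform bound giving the same variance
    return variance, r


for name in INIT_RULES:
    variance, r = init_params(name, fan_in=300, fan_out=100)
    print(f"{name:>6}: sigma^2 = {variance:.5f}, r = {r:.4f}")
```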

By default, Keras uses Glorot initialization with a uniform distribution. When creating a layer, we can switch to He initialization by setting ```kernel_initializer="he_uniform"``` or ```kernel_initializer="he_normal"```.

If you want He initialization with a uniform distribution but based on $fan_\text{avg}$ rather than $fan_\text{in}$, you can use the ```VarianceScaling``` initializer as follows:

In [1]:
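import keras
from keras.layers import Dense

# He-style scaling (variance = 2 / fan_avg), sampled from a uniform distribution
he_avg_init = keras.initializers.VarianceScaling(scale=2., mode='fan_avg', distribution='uniform')
# Pass the initializer object to the layer instead of a string name
Dense(10, activation='sigmoid', kernel_initializer=he_avg_init)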

Using TensorFlow backend.


<keras.layers.core.Dense at 0x7fbe040a8d30>

## Nonsaturating Activation Functions

ReLU is a great choice of activation function for NNs because it doesn't saturate for positive values (unlike the sigmoid function) and it is fast to compute. It suffers, however, from the *dying ReLU* problem: during training, some neurons *'die'* and stop outputting anything other than 0. A neuron dies when its weights get tweaked in such a way that the weighted sum of its inputs is negative for all instances in the training set. When this happens, it just keeps outputting zeros, and Gradient Descent does not affect it anymore because the gradient of the ReLU function is zero when its input is negative.

A variant of ReLU, the *leaky ReLU*, can help solve this problem.

$$ \text{LeakyReLU}_\alpha(z) = \max(\alpha z, z) $$

The $\alpha$ hyperparameter defines how much the function 'leaks': it is the slope of the function for $z<0$ and is typically set to 0.01. This small slope ensures that leaky ReLUs never die; they can go into a long coma, but they have a chance to eventually wake up.
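A minimal NumPy sketch of the difference is shown below. The function names are illustrative, and the `max` formula assumes $0 \le \alpha \le 1$. Note how the ReLU gradient is exactly zero for negative inputs, while the leaky ReLU gradient stays at $\alpha$.

```python
import numpy as np


def relu_grad(z):
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise."""
    return (z > 0).astype(float)


def leaky_relu(z, alpha=0.01):
    """max(alpha * z, z), valid for 0 <= alpha <= 1."""
    return np.maximum(alpha * z, z)


def leaky_relu_grad(z, alpha=0.01):
    """Gradient of leaky ReLU: 1 for positive inputs, alpha otherwise."""
    return np.where(z > 0, 1.0, alpha)


z = np.array([-3.0, -0.5, 2.0])
print("leaky ReLU output: ", leaky_relu(z))
print("ReLU gradient:     ", relu_grad(z))
print("leaky ReLU gradient:", leaky_relu_grad(z))
```

Because the gradient for negative inputs is small but nonzero, Gradient Descent can still nudge the weights of a neuron whose weighted sum is negative on every instance.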

A [2015 paper](https://homl.info/49) compared several variants of the ReLU function, and one of its conclusions was that the leaky variants always outperformed the strict ReLU. Setting $\alpha=0.2$ (a huge leak) seemed to result in better performance than $\alpha=0.01$ (a small leak). The paper also evaluated the *randomized leaky ReLU* (RReLU), where $\alpha$ is picked randomly in a given range during training and is fixed to an average value during testing. It performed fairly well and seemed to act as a regularizer. Finally, it evaluated the *parametric leaky ReLU* (PReLU), where $\alpha$ is learned during training (i.e. it becomes a parameter that can be modified by backpropagation instead of a hyperparameter). PReLU was reported to strongly outperform ReLU on large image datasets, but on smaller datasets it runs the risk of overfitting the training set.

In 2015 the [*exponential linear unit (ELU)*](https://homl.info/50) was introduced. It outperformed all the ReLU variants in the authors' experiments: training time was reduced, and the neural network performed better on the test set.

$$ \text{ELU}_\alpha(z) = \begin{cases}
                          \alpha(\exp(z) - 1) & \text{if } z<0\\
                          z & \text{if } z\geq0
                          \end{cases} $$
                          
where the hyperparameter $\alpha$ defines the value that the ELU function approaches (namely $-\alpha$) when $z$ is a large negative number. The ELU function looks like the ReLU (fig 11-3 on pg 336) with a few major differences:
- It takes on negative values when $z<0$, allowing units to have an average output closer to zero, which helps alleviate the vanishing gradients problem
- It has a nonzero gradient for $z<0$, which avoids the dead neurons problem
- If $\alpha=1$ then the function is smooth everywhere, including around $z=0$, which helps speed up Gradient Descent, since it does not bounce as much to the left and right of $z=0$

The main drawback of ELU is that it is slower to compute than ReLU and its variants (because of the exponential function). Its faster convergence rate during training compensates for that slow computation, but at test time an ELU network will still be slower than a ReLU network.
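The ELU definition and the properties listed above can be checked with a short NumPy sketch. The function names are illustrative. The `np.minimum` call is only a numerical safeguard so that `exp` is never evaluated on large positive inputs in the branch that `np.where` discards.

```python
import numpy as np


def elu(z, alpha=1.0):
    """alpha * (exp(z) - 1) for z < 0, and z for z >= 0."""
    z = np.asarray(z, dtype=float)
    negative_branch = alpha * (np.exp(np.minimum(z, 0.0)) - 1.0)
    return np.where(z < 0, negative_branch, z)


def elu_grad(z, alpha=1.0):
    """alpha * exp(z) for z < 0, and 1 for z >= 0."""
    z = np.asarray(z, dtype=float)
    return np.where(z < 0, alpha * np.exp(np.minimum(z, 0.0)), 1.0)


z = np.array([-50.0, -1.0, 0.0, 2.0])
print("ELU(z):     ", elu(z))
print("ELU'(z):    ", elu_grad(z))
print("slope just left of 0 with alpha=1:", float(elu_grad(-1e-9)))
```

For large negative $z$ the output approaches $-\alpha$, the gradient stays positive for every $z<0$, and with $\alpha=1$ the slope on both sides of $z=0$ is 1, which is the smoothness property mentioned in the last bullet.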

In 2017 the [Scaled ELU (SELU)](https://homl.info/selu) was introduced. The authors showed that if you build a neural network composed exclusively of a stack of dense layers, and if all hidden layers use the SELU function, then the network will *self-normalize*: the output of each layer will tend to preserve a mean of 0 and a standard deviation of 1 during training, which solves the vanishing/exploding gradients problem. As a result, SELU often significantly outperforms other activation functions for such networks. However, there are certain conditions for self-normalization to happen:
- Input features must be standardized (mean 0 and standard deviation 1)
- Every hidden layer's weights must be initialized with LeCun normal initialization
- The network's architecture must be sequential
- All layers must be dense, since this is the only case for which the paper guarantees self-normalization

Note: For non-sequential architectures such as recurrent networks or networks with *skip connections*, self-normalization is not guaranteed. However, some researchers have noted that the SELU activation function can improve performance in convolutional neural nets as well.
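Self-normalization can be observed in a small NumPy experiment that only runs the forward pass through randomly initialized layers (no training). The two constants are the rounded $\alpha$ and scale values from the SELU paper. The width of 256 units, the depth of 30 layers, and the batch of 1,000 random instances are arbitrary assumptions for this sketch.

```python
import numpy as np

# Rounded constants from the SELU paper
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def selu(z):
    """Scaled ELU: scale * ELU_alpha(z)."""
    negative_branch = SELU_ALPHA * (np.exp(np.minimum(z, 0.0)) - 1.0)
    return SELU_SCALE * np.where(z < 0, negative_branch, z)


rng = np.random.default_rng(0)
n_units = 256                                    # assumed layer width
A = rng.normal(0.0, 1.0, size=(1000, n_units))   # standardized inputs

for _ in range(30):                              # stack of dense layers
    # LeCun normal initialization: variance 1 / fan_in
    W = rng.normal(0.0, np.sqrt(1.0 / n_units), size=(n_units, n_units))
    A = selu(A @ W)

final_mean, final_std = A.mean(), A.std()
print(f"after 30 layers: mean {final_mean:.2f}, std {final_std:.2f}")
```

The mean and standard deviation should stay in the neighborhood of 0 and 1 instead of collapsing or blowing up. Swapping in a different initialization or activation is an easy way to see the property break.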

In general:
$$ \text{SELU} > \text{ELU} >\text{leaky ReLU (and variants)} > \text{ReLU} > \text{tanh} > \text{sigmoid}$$

The network's architecture might prevent it from self-normalizing, in which case ELU may be a better choice than SELU. If you care about runtime latency, use leaky ReLU instead. If you don't want to tweak $\alpha$, use the Keras defaults. If you have spare time and computing power, use cross-validation to evaluate other activation functions such as RReLU and PReLU. That said, because ReLU is the most common activation function, many libraries and hardware accelerators provide ReLU-specific optimizations.


To use the leaky ReLU activation function, create a ```LeakyReLU``` layer and add it to the model just after the layer you want to apply it to:

In [6]:
model = keras.models.Sequential([
    keras.layers.Dense(10, kernel_initializer='he_normal'),
    keras.layers.LeakyReLU(alpha=0.2)
])

For PReLU, replace ```LeakyReLU(alpha=0.2)``` with ```PReLU()```. There's currently no implementation of RReLU in Keras, but you can fairly easily implement your own.

For SELU, set ```activation='selu'``` and ```kernel_initializer='lecun_normal'``` when creating a layer.

### Batch normalization

While He initialization along with ELU (or any ReLU variant) can significantly reduce the vanishing/exploding gradients problem at the beginning of training, it doesn't guarantee that the problem won't come back during training. [*Batch normalization*](https://homl.info/51) (BN) was introduced in 2015 to address these problems.

It consists of adding an operation just before or after the activation function of each hidden layer. This operation zero-centers and normalizes each input, then scales and shifts the result using two new parameter vectors per layer: one for scaling and the other for shifting. This way the model is allowed to learn the optimal scale and mean of each of the layer's inputs.
In many cases, adding a BN layer as the very first layer of the network means you don't need to standardize your training set.

The algorithm computes the mean and standard deviation of the input over the current mini-batch. The operation is summarized below

1. $$\boldsymbol{\mu}_B = \frac{1}{m_B}\sum_{i=1}^{m_B}\textbf{x}^{(i)} $$
2. $$\boldsymbol{\sigma}_B^2 = \frac{1}{m_B}\sum_{i=1}^{m_B}(\textbf{x}^{(i)} - \boldsymbol{\mu}_B)^2 $$
3. $$\hat{\textbf{x}}^{(i)} = \frac{\textbf{x}^{(i)} - \boldsymbol{\mu}_B}{\sqrt{\boldsymbol{\sigma}_B^2+\epsilon}} $$
4. $$\textbf{z}^{(i)} = \boldsymbol{\gamma}\otimes\hat{\textbf{x}}^{(i)}+ \boldsymbol{\beta}$$

Where 
- $\boldsymbol{\mu}_B$ is the vector of input means, evaluated over the whole mini-batch $B$
- $\boldsymbol\sigma_B$ is the vector of input standard deviations over mini-batch $B$
- $m_B$ is the number of instances in the mini batch
- $\hat{\textbf{x}}^{(i)}$ is the vector of zero centered and normalized inputs for instance $i$
- $\boldsymbol\gamma$ is the output scale parameter vector for the layer
- $\otimes$ is element-wise multiplication (each input is multiplied by its corresponding scale parameter)
- $\boldsymbol\beta$ is the output shift parameter vector for the layer. Each input is offset by its corresponding shift parameter
- $\epsilon$ is a tiny number that avoids division by zero, called a *smoothing term*
- $\textbf{z}^{(i)}$ is the output of the Batch Normalization
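The four steps above translate almost line by line into NumPy. This is a sketch of the training-time forward pass only, with no gradient computation. The function name `batch_norm_train`, the value `eps=1e-3`, and the example $\boldsymbol\gamma$ and $\boldsymbol\beta$ values are assumptions chosen for illustration.

```python
import numpy as np


def batch_norm_train(X, gamma, beta, eps=1e-3):
    """Batch Normalization forward pass over one mini-batch.

    X has shape (m_B, n_features); gamma and beta have shape (n_features,).
    """
    mu = X.mean(axis=0)                     # step 1: per-input mean
    var = X.var(axis=0)                     # step 2: per-input variance (1/m_B)
    X_hat = (X - mu) / np.sqrt(var + eps)   # step 3: zero-center and normalize
    return gamma * X_hat + beta             # step 4: scale and shift


rng = np.random.default_rng(0)
X = rng.normal(5.0, 3.0, size=(64, 4))      # mini-batch far from mean 0, std 1
gamma = np.full(4, 2.0)                     # example scale parameters
beta = np.full(4, -1.0)                     # example shift parameters

Z = batch_norm_train(X, gamma, beta)
print("output means:", Z.mean(axis=0).round(3))
print("output stds: ", Z.std(axis=0).round(3))
```

Whatever the scale of the incoming batch, each output column ends up with a mean equal to its $\beta$ entry and a standard deviation very close to its $\gamma$ entry (slightly smaller because of $\epsilon$).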

You might ask *'but what mean and standard deviation do I use at test time?'*. We might need to make a prediction for a single test instance rather than a batch, and even if we have a test batch, it may be too small or its samples might not be I.I.D., so the batch statistics would be unreliable.

One solution would be to wait until the end of training, then run the whole training set through the NN to compute the mean and standard deviation of each input of the BN layer. These "final" input means and standard deviations could then be used instead of the batch means and standard deviations when making predictions.

However, most implementations of BN estimate these final statistics during training by using a moving average of the layer's input means and standard deviations. Keras does this automatically.
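A moving average of this kind could be sketched as follows. The class name `MovingStats` is hypothetical, and the `momentum` and `eps` values are assumptions. The sketch tracks the variance rather than the standard deviation, which is a design choice of this example and not something stated above.

```python
import numpy as np


class MovingStats:
    """Exponential moving averages of per-input batch statistics."""

    def __init__(self, n_features, momentum=0.99):
        self.momentum = momentum
        self.mean = np.zeros(n_features)
        self.var = np.ones(n_features)

    def update(self, X):
        """Blend the statistics of mini-batch X into the running estimates."""
        m = self.momentum
        self.mean = m * self.mean + (1 - m) * X.mean(axis=0)
        self.var = m * self.var + (1 - m) * X.var(axis=0)

    def normalize(self, x, eps=1e-3):
        """Normalize test-time inputs with the running estimates."""
        return (x - self.mean) / np.sqrt(self.var + eps)


rng = np.random.default_rng(0)
stats = MovingStats(n_features=2, momentum=0.9)
for _ in range(200):                          # simulated training steps
    stats.update(rng.normal(3.0, 2.0, size=(64, 2)))

print("running mean:", stats.mean.round(2), "running var:", stats.var.round(2))
print("single test instance:", stats.normalize(np.array([3.0, 5.0])).round(2))
```

At test time only `normalize` is called, so a single instance can be processed without any batch statistics. A momentum closer to 1 gives smoother but slower-moving estimates.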

Batch Normalization also acts as a regularizer, reducing the need for other regularization techniques. It does, however, add some complexity to the model. It also makes predictions slower due to the extra computations required at each layer.

#### Batch Normalization with Keras