# Vanishing/Exploding Gradients Problems

A common problem in deep neural networks is that gradients tend to get smaller and smaller as backpropagation advances toward the lower layers. As a result, Gradient Descent leaves the weights of the lower layers almost unchanged, and training never converges to a good solution. This is called the *vanishing gradients* problem. The opposite can also happen: gradients can grow larger and larger, so many layers receive very large weight updates and the algorithm diverges. This is called the *exploding gradients* problem, and it is most commonly found in recurrent neural networks.

These problems can be mitigated by using a specific weight initialization and by changing the activation function of the neurons. Without these two changes, the variance of the signal increases after each layer until the activation function saturates at the top layers. During backpropagation, the gradient is then computed in the saturated region of the activation function, where it is close to zero, and the little gradient that remains is diluted further as backpropagation advances to the lower layers.
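The dilution described above can be illustrated with a small numerical sketch. It assumes a chain of logistic (sigmoid) units and ignores the weights (treating them as 1) to isolate the effect of the activation function; the helper names are hypothetical and not part of any library.

```python
import math

def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the logistic function; its maximum is 0.25 at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop_factor(pre_activations):
    """Product of the local sigmoid derivatives along one path through the
    network. Weights are ignored (treated as 1) in this simplified sketch."""
    factor = 1.0
    for z in pre_activations:
        factor *= sigmoid_grad(z)
    return factor

# Ten layers, all sitting at the steepest point of the sigmoid (z = 0).
best_case = backprop_factor([0.0] * 10)

# Ten layers in a saturated region (z = 5), where the sigmoid is almost flat.
saturated = backprop_factor([5.0] * 10)

print(best_case)  # 0.25 ** 10, roughly 9.5e-07
print(saturated)  # many orders of magnitude smaller
```

Even in the best case the factor shrinks by at least 4 per layer, and saturation makes it far worse, which is why the lower layers barely receive any gradient.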

## Xavier and He Initialization


Xavier and He initialization are based on the idea that the signal has to flow properly in both directions: forward when making predictions and backward when backpropagating gradients. Proper flow in both directions cannot be guaranteed unless the layer has the same number of inputs and outputs. However, initializing the weights with certain distributions has proven to work well in practice.

The first way to initialize the weights in a layer is using a normal distribution with **mean 0** and a standard deviation determined by the equation

$$\sigma = \sqrt{\frac{2}{n_{inputs}+ n_{outputs}}}$$

The second distribution used to initialize weights is a uniform distribution between $-r$ and $r$, with $r$ defined according to

$$r = \sqrt{\frac{6}{n_{inputs} + n_{outputs}}}$$

The equations for $\sigma$ and $r$ shown above are used when the activation function is the logistic function. Similar formulas exist for other activation functions, such as the hyperbolic tangent and ReLU (and its variants). The initialization strategy for ReLU and its variants is often called He initialization.
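As a minimal sketch of the two formulas above, the following plain-Python code computes $\sigma$ and $r$ for a layer and draws weights from the corresponding distributions. The function names and the layer sizes (300 inputs, 100 outputs) are assumptions chosen for illustration.

```python
import math
import random

def xavier_sigma(n_inputs, n_outputs):
    """Standard deviation of the zero-mean normal distribution."""
    return math.sqrt(2.0 / (n_inputs + n_outputs))

def xavier_r(n_inputs, n_outputs):
    """Half-width of the uniform distribution on [-r, r]."""
    return math.sqrt(6.0 / (n_inputs + n_outputs))

def xavier_normal_weights(n_inputs, n_outputs, rng):
    """Weight matrix (list of rows) drawn from N(0, sigma^2)."""
    sigma = xavier_sigma(n_inputs, n_outputs)
    return [[rng.gauss(0.0, sigma) for _ in range(n_outputs)]
            for _ in range(n_inputs)]

def xavier_uniform_weights(n_inputs, n_outputs, rng):
    """Weight matrix (list of rows) drawn uniformly from [-r, r]."""
    r = xavier_r(n_inputs, n_outputs)
    return [[rng.uniform(-r, r) for _ in range(n_outputs)]
            for _ in range(n_inputs)]

rng = random.Random(42)
n_inputs, n_outputs = 300, 100          # illustrative layer sizes

sigma = xavier_sigma(n_inputs, n_outputs)   # sqrt(2 / 400)
r = xavier_r(n_inputs, n_outputs)           # sqrt(6 / 400)
W_normal = xavier_normal_weights(n_inputs, n_outputs, rng)
W_uniform = xavier_uniform_weights(n_inputs, n_outputs, rng)
```

Note that a uniform distribution on $[-r, r]$ has variance $r^2/3$, so with $r = \sqrt{3}\,\sigma$ both options give the weights the same variance.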

The `tf.layers.dense()` function uses Xavier initialization (with a uniform distribution) by default. To switch to He initialization, pass an initializer created with `variance_scaling_initializer()`, as in the code below. Note that, by default, this He initialization only considers the number of inputs. To take both the number of inputs and outputs into account, set the argument `mode='FAN_AVG'` in the `variance_scaling_initializer()` function.

```python
# He initialization (by default, based on the number of inputs only)
he_init = tf.contrib.layers.variance_scaling_initializer()
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu,
                          kernel_initializer=he_init, name='hidden1')
```
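The difference between using only the number of inputs and using the average of inputs and outputs can be sketched in plain Python. The helper `variance_scaling_std` below is hypothetical: it only mirrors the idea of a variance-scaling rule with a `factor` and a `mode`, and leaves out details of the real TensorFlow initializer such as the exact sampling distribution.

```python
import math

def variance_scaling_std(n_inputs, n_outputs, factor=2.0, mode="FAN_IN"):
    """Standard deviation sqrt(factor / n), where n depends on the mode.

    FAN_IN  -> n is the number of inputs only
    FAN_AVG -> n is the average of the number of inputs and outputs
    """
    if mode == "FAN_IN":
        n = float(n_inputs)
    elif mode == "FAN_AVG":
        n = (n_inputs + n_outputs) / 2.0
    else:
        raise ValueError("mode must be 'FAN_IN' or 'FAN_AVG'")
    return math.sqrt(factor / n)

n_inputs, n_outputs = 300, 100  # illustrative layer sizes

# He-style scaling: only the number of inputs matters.
he_std = variance_scaling_std(n_inputs, n_outputs)

# Same factor, but averaging the number of inputs and outputs.
he_avg_std = variance_scaling_std(n_inputs, n_outputs, mode="FAN_AVG")

# factor=1 with FAN_AVG reproduces sigma = sqrt(2 / (n_inputs + n_outputs)).
xavier_std = variance_scaling_std(n_inputs, n_outputs, factor=1.0,
                                  mode="FAN_AVG")
```

With `factor=1.0` and `mode="FAN_AVG"` the rule reduces to the Xavier standard deviation given earlier, which shows that the two schemes belong to the same family.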



## Nonsaturating Activation Functions

One of the causes of the vanishing/exploding gradients problems turned out to be a poor choice of activation function. This is why activation functions other than the sigmoid are now widely used.

**ReLU activation function:**

One of the main advantages of the ReLU activation function is that it does not saturate for positive values. It is also fast to compute.

**Leaky ReLU:**

One problem with the strict ReLU is that neurons sometimes die: they keep outputting zero and cannot come back to life, because the gradient of the ReLU is zero for negative inputs. A way to solve this problem is to use a leaky ReLU, that is, a ReLU that, instead of being zero for $z \lt 0$, has a small slope defined by a hyperparameter $\alpha$. This small slope provides a non-zero gradient for negative values of $z$, so neurons can remain dormant for some time but still have a chance to come back to life. Leaky ReLUs have been reported to outperform strict ReLUs, and some leaky variants also appear to reduce overfitting.

**ELU Activation Function:**

Another commonly used activation function besides the ReLU variants is the Exponential Linear Unit (ELU). It is similar to a ReLU, but for $z \lt 0$ it uses an exponential function instead of a linear one. Because the exponential part takes negative values for $z \lt 0$, the average output of an ELU is closer to zero, which helps alleviate the vanishing gradients problem. Furthermore, the ELU has a non-zero gradient for negative values of $z$, which avoids the dying neurons problem. The ELU is also smooth for all values of $z$, including around $z = 0$ (when the scale of its negative part is set to 1), which helps Gradient Descent converge faster. Although ELUs are slower to compute than ReLUs, the faster convergence during training compensates for the extra computation. At test time, however, a network using ELUs will be slower than one using ReLUs.

To use the ELU activation function in TensorFlow, it is enough to set the `activation` argument of the `dense()` function to `tf.nn.elu`:

```python
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.elu, name='hidden1')
```

Although older versions of TensorFlow do not provide a predefined leaky ReLU function (more recent ones include `tf.nn.leaky_relu()`), it is easy to define one, as in the following code.

```python
def leaky_relu(z, name=None):
    return tf.maximum(0.01*z, z, name=name)
    
hidden1 = tf.layers.dense(X, n_hidden1, activation=leaky_relu, name='hidden1')
```
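To compare the three activation functions of this section side by side, here is a minimal plain-Python sketch of each function and its gradient. The default values `alpha=0.01` for the leaky ReLU and `alpha=1.0` for the ELU are assumptions chosen for illustration.

```python
import math

def relu(z):
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    # Equivalent to z for z >= 0 and alpha * z for z < 0 (when alpha < 1).
    return max(alpha * z, z)

def elu(z, alpha=1.0):
    return z if z >= 0 else alpha * (math.exp(z) - 1.0)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def leaky_relu_grad(z, alpha=0.01):
    return 1.0 if z > 0 else alpha

def elu_grad(z, alpha=1.0):
    return 1.0 if z > 0 else alpha * math.exp(z)

for z in (-2.0, -0.5, 0.5, 2.0):
    print(z,
          relu(z), leaky_relu(z), round(elu(z), 4),
          relu_grad(z), leaky_relu_grad(z), round(elu_grad(z), 4))
```

For negative inputs the ReLU gradient is exactly zero, while the leaky ReLU and the ELU keep a small positive gradient, which is what allows a dormant neuron to recover.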

## Batch Normalization

Although using an appropriate activation function and a suitable weight initialization can significantly reduce the vanishing/exploding gradients problems at the beginning of training, these problems can reappear later during training. Batch Normalization addresses this by tackling the underlying issue: the distribution of each layer's inputs changes during training as the parameters of the previous layers change. The technique, described by the steps below, consists of zero-centering and normalizing the inputs of each layer just before its activation function is applied, and then scaling and shifting the result with two parameters per layer. To do so, the algorithm needs the mean and the standard deviation of the inputs, which are estimated over the current mini-batch.

1. $\mu_{B} = \frac{1}{m_{B}}\sum^{m_{B}}_{i=1} \mathbf{x}^{(i)}$

2. $\sigma_{B}^{2} = \frac{1}{m_{B}} \sum_{i=1}^{m_{B}} \left(\mathbf{x}^{(i)} - \mu_{B} \right)^{2} $

3. $\mathbf{\hat{x}}^{(i)} = \frac{\mathbf{x}^{(i)} - \mu_{B}}{\sqrt{\sigma_{B}^{2} + \epsilon}}$

4. $\mathbf{z}^{(i)} = \gamma \mathbf{\hat{x}}^{(i)} + \beta$

The terms introduced in the previous steps are:

+ $\mu_{B}$: mini-batch mean
+ $\sigma_{B}$: mini-batch standard deviation
+ $m_{B}$: number of instances in the mini-batch
+ $\mathbf{\hat{x}}^{(i)}$: zero-centered and normalized input
+ $\epsilon$: small constant added to avoid division by zero
+ $\gamma$: scaling parameter for the layer
+ $\beta$: offset (shift) parameter for the layer
+ $\mathbf{z}^{(i)}$: scaled and shifted version of the input
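The four steps above can be sketched in plain Python for a mini-batch with a single scalar feature. This is only an illustration under that simplifying assumption; the function name, the example batch, and the value of `epsilon` are chosen here and do not come from a library.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, epsilon=1e-5):
    """Apply the four Batch Normalization steps to a list of scalars."""
    m = len(batch)
    mu = sum(batch) / m                                   # step 1: mean
    var = sum((x - mu) ** 2 for x in batch) / m           # step 2: variance
    x_hat = [(x - mu) / math.sqrt(var + epsilon)          # step 3: normalize
             for x in batch]
    z = [gamma * xh + beta for xh in x_hat]               # step 4: scale, shift
    return mu, var, x_hat, z

batch = [2.0, 4.0, 6.0, 8.0]
mu, var, x_hat, z = batch_norm(batch, gamma=2.0, beta=1.0)

print(mu, var)  # 5.0 5.0
print(x_hat)    # zero mean, variance very close to 1
print(z)        # mean equal to beta, spread controlled by gamma
```

In a real layer the same computation is applied independently to every feature, and $\gamma$ and $\beta$ are learned during training along with the other parameters.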

The advantages of using batch normalization are:
+ Less sensitivity to weight initialization
+ Larger learning rates can be used, which speeds up training.
+ Batch normalization acts like a regularizer.

The disadvantage of Batch Normalization is that it adds complexity to the model. It also carries a runtime penalty, since the network has to perform the extra normalization computations at every layer. If the implementation needs to run fast, it is worth checking how well plain ELU + He initialization performs before turning to Batch Normalization.

### Batch Normalization Implementation

The easiest way to implement Batch Normalization in TensorFlow is to use the `tf.layers.batch_normalization()` function. The code below implements Batch Normalization for a network with two hidden layers. The first line that is not self-explanatory is the one that defines `training`. This placeholder defaults to `False` and is set to `True` during training; it tells `tf.layers.batch_normalization()` whether to use the mean and standard deviation of the current mini-batch (during training) or the running averages accumulated over the whole training set (during testing). The next lines define the layers of the network. Note that no activation function is specified in the fully connected layers, because the activation function should be applied after each batch normalization layer.

The `tf.layers.batch_normalization()` function uses an exponential decay to compute the running averages, which is why it requires the `momentum` parameter. A good momentum value is typically close to 1, such as 0.9 or 0.99. The larger the dataset, the closer to 1 this parameter should be.

```python

import tensorflow as tf

n_inputs = 28*28
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')

training = tf.placeholder_with_default(False, shape=(), name='training')

hidden1 = tf.layers.dense(X, n_hidden1, name='hidden1')
bn1 = tf.layers.batch_normalization(hidden1, training=training, momentum=0.9)
bn1_act = tf.nn.elu(bn1)

hidden2 = tf.layers.dense(bn1_act, n_hidden2, name='hidden2')
bn2 = tf.layers.batch_normalization(hidden2, training=training, momentum=0.9)
bn2_act = tf.nn.elu(bn2)

logits_before_bn = tf.layers.dense(bn2_act, n_outputs, name='outputs')
logits = tf.layers.batch_normalization(logits_before_bn, training=training, momentum=0.9)
```

The previous code can seem repetitive when it comes to creating the layers. A cleaner implementation uses the `partial()` function from the `functools` module, as in the following code. `partial()` creates a wrapper around a function and lets you set default values for some of its parameters.

```python

from functools import partial

my_batch_norm_layer = partial(tf.layers.batch_normalization, training=training, momentum=0.9)

hidden1 = tf.layers.dense(X, n_hidden1, name='hidden1')
bn1 = my_batch_norm_layer(hidden1)
bn1_act = tf.nn.elu(bn1)

hidden2 = tf.layers.dense(bn1_act, n_hidden2, name='hidden2')
bn2 = my_batch_norm_layer(hidden2)
bn2_act = tf.nn.elu(bn2)
logits_before_bn = tf.layers.dense(bn2_act, n_outputs, name='outputs')
logits = my_batch_norm_layer(logits_before_bn)

```
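For readers unfamiliar with `partial()`, here is a small standalone sketch of how it behaves, independent of TensorFlow. The function `scale_and_shift` and its parameter names are hypothetical and exist only for this illustration.

```python
from functools import partial

def scale_and_shift(x, scale=1.0, shift=0.0):
    """Toy function with two keyword parameters."""
    return scale * x + shift

# Wrapper with new default values for scale and shift.
double_plus_one = partial(scale_and_shift, scale=2.0, shift=1.0)

print(double_plus_one(3.0))             # 7.0
print(double_plus_one(3.0, shift=0.0))  # 6.0, defaults can still be overridden
```

The same mechanism is what lets a batch normalization wrapper fix `training` and `momentum` once, while still allowing a different value to be passed for a specific layer.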

The execution phase with Batch Normalization is not very different from the one without it. There are two significant changes. First, the `training` placeholder must be set to `True` during training. Second, `tf.layers.batch_normalization()` creates a few operations that must be evaluated at each training step in order to update the moving averages. These operations are added to the `UPDATE_OPS` collection, so the list of operations has to be retrieved from that collection and run at each training iteration.

```python
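# Assumes init, training_op, accuracy, saver, y, mnist, n_epochs and batch_size
# were defined during the construction phase (not shown here).

# Operations that update the batch normalization moving averages
extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            # training=True: normalize with the mini-batch statistics
            sess.run([training_op, extra_update_ops],
                     feed_dict={training: True, X: X_batch, y: y_batch})

        # training keeps its default (False) here: moving averages are used
        accuracy_val = accuracy.eval(feed_dict={X: mnist.test.images,
                                                y: mnist.test.labels})
        print(epoch, 'Test accuracy:', accuracy_val)

    save_path = saver.save(sess, './my_model_final.ckpt')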
```
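To give an intuition for what the moving-average updates do and how the momentum affects them, here is a minimal plain-Python sketch of an exponential-decay running average. It assumes the common update rule `moving = momentum * moving + (1 - momentum) * batch_value`; the function names, the initial value of 0, and the constant batch means are assumptions for illustration.

```python
def update_moving_average(moving, batch_value, momentum=0.9):
    """One update: keep `momentum` of the old estimate and blend in
    (1 - momentum) of the new mini-batch statistic."""
    return momentum * moving + (1.0 - momentum) * batch_value

def track(batch_values, momentum, initial=0.0):
    """Run the update over a sequence of mini-batch statistics."""
    moving = initial
    for value in batch_values:
        moving = update_moving_average(moving, value, momentum)
    return moving

# Fifty mini-batches whose mean is always 5.0 (illustrative data).
batch_means = [5.0] * 50

fast = track(batch_means, momentum=0.9)
slow = track(batch_means, momentum=0.99)

print(fast)  # about 4.97: has nearly reached the true mean
print(slow)  # about 1.97: still far from it after 50 updates
```

A momentum closer to 1 averages over more mini-batches, so the estimate is smoother but needs more updates to settle, which is consistent with using values closer to 1 for larger datasets.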