<a href="https://colab.research.google.com/github/Utkarshp1/Learning_TensorFlow/blob/master/Training_Deep_Neural_Networks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


References: Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow by Aurélien Géron.

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
import matplotlib.pyplot as plt 
import pandas as pd

## Glorot and He Initialization

$fan_{in}$ is the number of inputs to the layer. <br />
$fan_{out}$ is the number of neurons in the layer (i.e., its number of outputs). <br />
$fan_{avg} = (fan_{in} + fan_{out})/2$

Glorot initialization when using the logistic activation function:
* Normal distribution with mean 0 and variance $\sigma^2 = \frac{1}{fan_{avg}}$
* Or a uniform distribution between $-r$ and $+r$, with $r = \sqrt{\frac{3}{fan_{avg}}}$

LeCun Initialization:
Replace $fan_{avg}$ with $fan_{in}$ in Glorot initialization. As a result, LeCun initialization is equivalent to Glorot initialization when $fan_{in} = fan_{out}$. <br /> <br />

|**Initialisation**| **Activation functions** | **$\sigma^2$ (Normal)** |
|--- |---| ---|
|Glorot| None, tanh, logistic, softmax | $\frac{1}{fan_{avg}}$ |
| He | ReLU and its variants | $\frac{2}{fan_{in}}$ |
| LeCun | SELU | $\frac{1}{fan_{in}}$ |

The table lists only the variance for the normal initialisation. If you want to use a uniform distribution in the range $-r$ to $+r$ instead, compute $r = \sqrt {3\sigma^2}$.
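
As a quick numerical check of the table, the sketch below computes $\sigma^2$ and the matching uniform bound $r$ for each scheme. The helper name `init_scale` and the layer sizes (784 inputs, 300 neurons) are illustrative assumptions, not part of Keras.

```python
import math

def init_scale(scheme, fan_in, fan_out):
    """Return (sigma2, r) for one of the schemes in the table above.

    `init_scale` is a hypothetical helper used for illustration only.
    """
    fan_avg = (fan_in + fan_out) / 2
    variances = {
        "glorot": 1 / fan_avg,
        "he": 2 / fan_in,
        "lecun": 1 / fan_in,
    }
    sigma2 = variances[scheme]
    r = math.sqrt(3 * sigma2)  # a uniform on [-r, +r] has variance r**2 / 3
    return sigma2, r

for scheme in ("glorot", "he", "lecun"):
    sigma2, r = init_scale(scheme, fan_in=784, fan_out=300)
    print(f"{scheme:>6}: sigma^2 = {sigma2:.5f}, r = {r:.5f}")
```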

By default, Keras uses Glorot initialisation with a uniform distribution. When creating a layer, you can change this to He initialisation by setting `kernel_initializer="he_uniform"` or `kernel_initializer="he_normal"`, like this:
```python
    keras.layers.Dense(10, activation="relu", kernel_initializer="he_normal")
```
If you want He initialization with a uniform distribution but based on $fan_{avg}$ rather than $fan_{in}$, you can use the `VarianceScaling` initializer like this:

```python
    he_avg_init = keras.initializers.VarianceScaling(scale=2., mode='fan_avg', distribution='uniform')
    keras.layers.Dense(10, activation='sigmoid', kernel_initializer=he_avg_init)
```
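
One way to sanity-check such an initializer is to sample a weight matrix from it and compare the empirical variance with $\frac{2}{fan_{avg}}$. This is only a sketch: the layer sizes and the seed are arbitrary assumptions, and it relies on the `VarianceScaling` class in `keras.initializers`.

```python
import numpy as np
from tensorflow import keras

fan_in, fan_out = 784, 300            # illustrative layer sizes (assumption)
fan_avg = (fan_in + fan_out) / 2

init = keras.initializers.VarianceScaling(
    scale=2.0, mode="fan_avg", distribution="uniform", seed=42)
weights = np.array(init(shape=(fan_in, fan_out)))

expected_var = 2.0 / fan_avg          # scale / fan_avg
limit = np.sqrt(3 * expected_var)     # uniform bound r = sqrt(3 * sigma^2)

print("empirical variance:", weights.var())
print("expected variance: ", expected_var)
print("max |w|:", np.abs(weights).max(), "with bound r =", limit)
```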


## Batch Normalization

In [4]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28,28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, activation='elu', kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax")
])

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_1 (Flatten)          (None, 784)               0         
_________________________________________________________________
batch_normalization_1 (Batch (None, 784)               3136      
_________________________________________________________________
dense (Dense)                (None, 300)               235500    
_________________________________________________________________
batch_normalization_2 (Batch (None, 300)               1200      
_________________________________________________________________
dense_1 (Dense)              (None, 100)               30100     
_________________________________________________________________
batch_normalization_3 (Batch (None, 100)               400       
_________________________________________________________________
dense_2 (Dense)              (None, 10)                1

In [6]:
# Variables of the first BN layer: two are trainable and two are not
[(var.name, var.trainable) for var in model.layers[1].variables]

[('batch_normalization_1/gamma:0', True),
 ('batch_normalization_1/beta:0', True),
 ('batch_normalization_1/moving_mean:0', False),
 ('batch_normalization_1/moving_variance:0', False)]

The last two parameters, **$\mu$** and **$\sigma$**, are the moving averages; they are not affected by backpropagation, so Keras calls them "non-trainable".
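
The parameter counts in the summary follow directly from this: each BN layer stores four values per input (two trainable, two non-trainable). A small arithmetic sketch, using the layer input sizes from the model above:

```python
# Input sizes of the three BN layers in the model above.
bn_input_sizes = [784, 300, 100]

# Each BN layer stores gamma, beta, moving mean and moving variance per input.
bn_params = [4 * n for n in bn_input_sizes]
total_bn = sum(bn_params)
non_trainable = total_bn // 2  # the two moving averages per input

print(bn_params)                # [3136, 1200, 400]
print(total_bn, non_trainable)  # 4736 2368
```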

When you create a BN layer in Keras, it also creates two operations that Keras calls at each iteration during training to update the moving averages. Since we are using the TensorFlow backend, these operations are TensorFlow operations. In recent TensorFlow 2 releases the updates are applied automatically, so the `updates` property may simply return an empty list:

In [8]:
model.layers[1].updates  # layers[1] is the first BN layer



[]

The authors of the BN paper argued in favour of adding the BN layers before the activation functions, rather than after (as we just did). To add the BN layers before the activation functions, you must remove the activation functions from the hidden layers and add them as separate layers after the BN layers. Moreover, since a Batch Normalization layer includes one offset parameter per input, you can remove the bias term from the previous layer (just pass `use_bias=False` when creating it):

In [None]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28,28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, kernel_initializer="he_normal", use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation('elu'),
    keras.layers.Dense(100, kernel_initializer='he_normal', use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation('elu'),
    keras.layers.Dense(10, activation="softmax")
])
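
To see why the bias term becomes redundant, note that BN subtracts the batch mean, which cancels any constant offset added by the previous layer. The NumPy sketch below illustrates this with a hypothetical `standardize` helper and random data; it leaves out the learned scale and offset and is not how Keras implements the layer.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 5))      # a batch of 32 instances with 5 features
W = rng.normal(size=(5, 3))       # weights of a 3-unit dense layer
b = rng.normal(size=3)            # bias of that layer

def standardize(z, eps=1e-3):
    """Zero-center and scale each column using batch statistics."""
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

with_bias = standardize(X @ W + b)
without_bias = standardize(X @ W)

# The bias shifts every instance by the same amount, so it vanishes
# once the batch mean is subtracted.
print(np.allclose(with_bias, without_bias))  # True
```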

**Hyperparameters of BN:**
* `momentum`: This hyperparameter is used by the `BatchNormalization` layer when it updates the exponential moving averages; given a new value **$v$** (i.e., a new vector of input means or standard deviations computed over the current batch), the layer updates the running average **$\hat{v}$** using the following equation:
$$ \hat{\textbf{v}} \leftarrow \hat{\textbf{v}} \times momentum + \textbf{v} \times (1-momentum) $$
A good momentum value is typically close to 1, for example 0.9, 0.99, or 0.999 (you want more 9s for larger datasets and smaller mini-batches). The defaults will usually be fine, but you may occasionally need to tweak them.

* `axis`: This hyperparameter determines which axis is normalized. It defaults to -1, meaning that by default the layer normalizes the last axis (using the means and standard deviations computed across the other axes). When the input batch is 2D (i.e., the batch shape is [*batch size*, *features*]), each input feature is normalized based on the mean and standard deviation computed across all the instances in the batch. If the input to the BN layer is 3D, with shape [*batch size*, *height*, *width*], each column is normalized based on the mean and standard deviation computed across all the instances in the batch and all the rows in that column. If you instead want to normalize each element of a training example (e.g., each pixel) independently, set `axis=[1,2]`.
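
Both hyperparameters can be illustrated with plain NumPy. The sketch below applies the moving-average update from the equation above to a stream of random batches, then shows which axes the statistics are reduced over for a 3D batch. The data, shapes, and variable names are illustrative assumptions; this is not the Keras implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Moving-average update, following the equation above.
momentum = 0.99
v_hat = np.zeros(3)                     # running average, one value per feature
for _ in range(1000):
    batch = rng.normal(loc=5.0, size=(32, 3))
    v = batch.mean(axis=0)              # statistic of the current batch
    v_hat = v_hat * momentum + v * (1 - momentum)
print(v_hat)                            # close to the true mean of 5

# Statistics for a 3D batch of shape [batch size, height, width].
images = rng.normal(size=(32, 28, 28))
mean_last_axis = images.mean(axis=(0, 1))   # axis=-1: one mean per column
mean_per_pixel = images.mean(axis=0)        # axis=[1, 2]: one mean per pixel
print(mean_last_axis.shape, mean_per_pixel.shape)
```

With the default `axis=-1` there are 28 sets of statistics (one per column), whereas `axis=[1,2]` keeps 28 × 28 of them (one per pixel).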

Notice that the BN layer does not perform the same computation during training and after training: it uses batch statistics during training and the "final" statistics after training (i.e. the final values of the moving averages). The source code for this class looks like:
```python
    class BatchNormalization(keras.layers.Layer):
        [...]
        def call(self, inputs, training=None):
            [...]
```
The `call()` method is the one that performs the computations; as you can see, it has an extra `training` argument, which is set to `None` by default, but the `fit()` method sets it to 1 during training.
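
The difference between the two modes can be observed by calling a BN layer directly with an explicit `training` flag. This is a minimal sketch on random data (the shapes and distribution parameters are arbitrary assumptions), using the layer's default hyperparameters:

```python
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(64, 4)).astype("float32")

bn = keras.layers.BatchNormalization()

# training=True: normalize with the statistics of this batch
# (this call also updates the moving averages).
out_train = np.array(bn(X, training=True))
# training=False: normalize with the moving averages accumulated so far.
out_infer = np.array(bn(X, training=False))

print("training mode mean: ", out_train.mean(axis=0))
print("inference mode mean:", out_infer.mean(axis=0))
```

In training mode each feature ends up with a mean near 0. After a single update the moving averages are still close to their initial values, so the inference-mode output stays far from zero-centered.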

**TIP**: If you ever need to write a custom layer that must behave differently during training and testing, add a `training` argument to the `call()` method and use this argument in the method to decide what to compute.
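
As a minimal sketch of this pattern, the hypothetical layer below (the name `NoisyIdentity` and its `stddev` parameter are made up for illustration) adds Gaussian noise during training and passes its inputs through unchanged otherwise:

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

class NoisyIdentity(keras.layers.Layer):
    """Hypothetical example layer: adds Gaussian noise only during training."""

    def __init__(self, stddev=1.0, **kwargs):
        super().__init__(**kwargs)
        self.stddev = stddev

    def call(self, inputs, training=None):
        if training:
            noise = tf.random.normal(tf.shape(inputs), stddev=self.stddev)
            return inputs + noise
        return inputs  # inference: leave the inputs untouched

layer = NoisyIdentity(stddev=0.5)
x = tf.ones((2, 3))
out_infer = np.array(layer(x, training=False))
out_train = np.array(layer(x, training=True))

print(out_infer)  # identical to x
print(out_train)  # x plus random noise
```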