# Why Batch Norm

* Although proper weight initialization prevents vanishing/exploding gradients at the beginning of training, it doesn't guarantee that they won't come back during training

* Hence, we need a way to prevent them throughout the course of training

## Batch Norm

* Step 1: zero-center and normalize the inputs
* Step 2: scale and shift the result using two new parameters

* Steps:
    * $\mu = \frac{1}{n} \sum x$
    * $\sigma^2 = \frac{1}{n} \sum (x - \mu)^2$

    * $\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}$
    * $z = \gamma \hat{x} + \beta$

* Batch Norm computes these statistics over one mini-batch of training data at a time, hence the name. The $n$ in the formulas is the batch size

* $\gamma$ -> output scale parameter, $\beta$ -> output offset/shift parameter

* z -> rescaled and shifted version of inputs
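
The steps above can be sketched in a few lines of NumPy. This is a minimal sketch of the training-time forward pass only, not a reference implementation: the function name `batch_norm_forward`, the random test data, and the value `eps=1e-3` are assumptions made for illustration.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-3):
    """Apply the Batch Norm steps to a mini-batch x of shape (n, features)."""
    mu = x.mean(axis=0)                     # per-feature mean over the batch
    var = x.var(axis=0)                     # per-feature variance (divides by n)
    x_hat = (x - mu) / np.sqrt(var + eps)   # step 1: zero-center and normalize
    return gamma * x_hat + beta             # step 2: scale and shift

# Hypothetical mini-batch: 32 instances, 4 features, far from zero mean / unit variance
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))

z = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(z.mean(axis=0))  # each feature is now centered near 0
print(z.std(axis=0))   # each feature now has a spread near 1
```

With $\gamma = 1$ and $\beta = 0$ the output is simply the normalized input; during training the network is free to move both away from these values.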

## Testing

* At test time we may predict on small batches or even single instances, so the batch mean and standard deviation can vary a lot and are unreliable

* One solution: after training, run the whole training set through the neural network and compute the mean and standard deviation of each input of the BN layer

* In practice, these final statistics are usually estimated during training instead, using a moving average of the layer's input means and standard deviations

* Hence, four vectors are learned at each batch-norm layer:
    * $\gamma$ : output scale vector
    * $\beta$ : output offset vector
    * $\mu$ : final input mean vector (exponential moving average)
    * $\sigma$ : final input standard deviation vector (exponential moving average)

* $\mu$ and $\sigma$ are estimated during training but used only after training, in place of the batch statistics
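
The moving-average idea can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions: the decay rate `momentum = 0.99`, `eps = 1e-3`, the synthetic batches, and the helper name `bn_inference` are all hypothetical. It tracks the variance rather than the standard deviation, which is what the `moving_variance` variable in Keras stores.

```python
import numpy as np

momentum = 0.99   # assumed decay rate of the moving averages
eps = 1e-3        # assumed smoothing term
n_features = 3

rng = np.random.default_rng(0)
moving_mean = np.zeros(n_features)
moving_var = np.ones(n_features)

# Training: every mini-batch nudges the running statistics a little.
for _ in range(2000):
    batch = rng.normal(loc=2.0, scale=0.5, size=(32, n_features))
    moving_mean = momentum * moving_mean + (1 - momentum) * batch.mean(axis=0)
    moving_var = momentum * moving_var + (1 - momentum) * batch.var(axis=0)

def bn_inference(x, gamma, beta):
    """Normalize with the stored statistics instead of the batch statistics."""
    x_hat = (x - moving_mean) / np.sqrt(moving_var + eps)
    return gamma * x_hat + beta

# After training: a single instance can be normalized without any batch.
x_single = np.array([2.0, 2.0, 2.0])
z_single = bn_inference(x_single, gamma=np.ones(n_features), beta=np.zeros(n_features))
print(moving_mean, moving_var, z_single)
```

Because the stored statistics do not depend on the current batch, the prediction for one instance is the same no matter which other instances are processed alongside it.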

## Summary:

* It strongly reduces the vanishing/exploding gradients problem, even with saturating activation functions (such as tanh and the logistic sigmoid)

* Convergence takes fewer epochs, but each epoch is slower (because of the extra computation)

In [1]:
import tensorflow as tf

In [2]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),    # 28x28 image -> 784 features
    tf.keras.layers.BatchNormalization(),             # normalizes the 784 inputs
    tf.keras.layers.Dense(20, activation='elu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])

In [3]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 784)               0         
_________________________________________________________________
batch_normalization (BatchNo (None, 784)               3136      
_________________________________________________________________
dense (Dense)                (None, 20)                15700     
_________________________________________________________________
dense_1 (Dense)              (None, 10)                210       
Total params: 19,046
Trainable params: 17,478
Non-trainable params: 1,568
_________________________________________________________________


* Notice that the batch norm layer has 4 parameters per input feature ($\gamma$, $\beta$, $\mu$, $\sigma$). Here the input has 784 features

* This gives 4 * 784 = 3136 parameters

* The $\mu, \sigma$ of the batch norm layer are moving averages that are not updated by backpropagation, hence they are shown as non-trainable params

* i.e. 2 * 784 = 1568
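
The counts in the summary can be checked by hand. The short sketch below only redoes the arithmetic for the model above; the variable names are illustrative.

```python
n_inputs, n_hidden, n_outputs = 784, 20, 10

bn_params = 4 * n_inputs                          # gamma, beta, moving mean, moving variance
bn_non_trainable = 2 * n_inputs                   # moving mean and moving variance
dense_params = n_inputs * n_hidden + n_hidden     # weights + biases of Dense(20)
output_params = n_hidden * n_outputs + n_outputs  # weights + biases of Dense(10)

total = bn_params + dense_params + output_params
trainable = total - bn_non_trainable

print(bn_params, dense_params, output_params)  # 3136 15700 210
print(total, trainable, bn_non_trainable)      # 19046 17478 1568
```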

In [4]:
[(var.name, var.trainable) for var in model.layers[1].variables]

[('batch_normalization/gamma:0', True),
 ('batch_normalization/beta:0', True),
 ('batch_normalization/moving_mean:0', False),
 ('batch_normalization/moving_variance:0', False)]

## Where to add Batch Norm layer

* The authors of the BN paper argued in favor of adding the BN layers before the activation
functions, rather than after.

* Which placement works better depends on the dataset, so it needs experimentation
* To add the BN layers before the activation functions, you must remove the activation function from the hidden layers and add them as separate layers after the BN layers.
* Moreover, since a Batch Normalization layer includes one offset parameter per input, you can remove the bias term from the previous layer (just pass `use_bias=False` when creating it)
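
The "BN before activation" placement could look like the sketch below. It assumes the same 28x28 inputs and layer sizes as the earlier model, and it declares the input with `tf.keras.Input` rather than `input_shape`, which is a choice made here and not taken from the original notebook.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.BatchNormalization(),
    # Hidden layer without an activation and without a bias term:
    # the BN layer that follows supplies the offset (beta) instead.
    tf.keras.layers.Dense(20, use_bias=False),
    tf.keras.layers.BatchNormalization(),
    # The activation is now a separate layer placed after BN.
    tf.keras.layers.Activation("elu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

model.summary()
```

With `use_bias=False` the hidden `Dense` layer has 784 * 20 = 15680 parameters instead of 15700, and the extra BN layer adds 4 * 20 = 80 parameters, half of them non-trainable.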