**Batch normalisation** is a technique for improving the performance and stability of neural networks. It also makes more sophisticated deep learning architectures, such as DCGANs, work in practice.

The idea is to normalise the inputs to each layer so that they have a mean of zero and a standard deviation of one. This is analogous to how the inputs to a network are standardised.

How does this help? We know that normalising the inputs to a network helps it learn. But a network is just a series of layers, where the output of one layer becomes the input to the next. That means we can think of any layer in a neural network as the first layer of a smaller subsequent network.

If we view the network as a series of smaller networks feeding into each other, we can normalise the output of one layer before applying the activation function, and then feed it into the following layer (sub-network).
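
As a rough illustration of where the normalisation sits, the sketch below runs one mini-batch through a single hidden layer in NumPy. The layer sizes, the random weights, and the small constant `eps` are assumptions chosen only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mini-batch: 64 examples with 10 features each
x = rng.normal(loc=5.0, scale=3.0, size=(64, 10))

# Hypothetical layer: 10 inputs -> 4 hidden units
W = rng.normal(size=(10, 4))
b = np.zeros(4)

z = x @ W + b                      # pre-activations of the layer

# Normalise each hidden unit over the batch, before the nonlinearity
eps = 1e-5                         # guards against division by zero
z_hat = (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

a = np.maximum(z_hat, 0.0)         # ReLU; `a` is the input to the next layer

print(z_hat.mean(axis=0))          # close to 0 for every hidden unit
print(z_hat.std(axis=0))           # close to 1 for every hidden unit
```

Here the statistics are computed per hidden unit (along `axis=0`), so every unit sees a standardised distribution regardless of the scale of the raw inputs.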

<p>
        <img src='assets/bn_algorithm.png' height="400px" width="600px"/>

Look at the last line of the algorithm. After normalising the input x, the result is passed through a linear function with parameters gamma and beta. These are learnable parameters of the BatchNorm layer, and they make it possible for the network to say “Hey! I don’t want zero mean/unit variance input, give me back the raw input, it’s better for me.” If gamma = sqrt(var(x)) and beta = mean(x), the original activation is restored. This is what makes BatchNorm really powerful: we initialise the BatchNorm parameters so that the input is transformed to zero mean and unit variance, but during training the network can learn that some other distribution works better.

By the way, it’s called “batch” normalisation because we perform this transformation and calculate the statistics only over a subset (a batch) of the entire training set.
</p>
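
The transform described above can be sketched in a few lines of NumPy. This is a minimal training-time forward pass under stated assumptions: the function name `batchnorm_forward`, the `eps` value, and the toy batch are illustrative and are not taken from the paper's or any library's code.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Normalise x of shape [N, D] over the batch axis, then scale and shift."""
    mu = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                    # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta            # learnable scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=4.0, size=(32, 3))  # toy batch (assumed)

# Usual initialisation: gamma = 1, beta = 0 gives plain normalisation
out = batchnorm_forward(x, gamma=np.ones(3), beta=np.zeros(3))

# Choosing gamma = sqrt(var(x)) and beta = mean(x) undoes the normalisation
# (eps is folded into gamma here so the match holds up to rounding error)
eps = 1e-5
restored = batchnorm_forward(
    x, gamma=np.sqrt(x.var(axis=0) + eps), beta=x.mean(axis=0), eps=eps
)
print(np.allclose(restored, x))
```

The second call is the "give me back the raw input" case: nothing forces the network to choose these values, but they are within reach of gradient descent because gamma and beta are ordinary trainable parameters.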
    

**Batch normalisation was introduced in [Ioffe & Szegedy’s 2015 paper](https://arxiv.org/pdf/1502.03167.pdf). The idea is that, instead of just 
normalising the inputs to the network, we normalise the inputs to layers within the network. It’s called “batch” 
normalisation because during training we normalise the activations of the previous layer for each batch, i.e. 
we apply a transformation that keeps the mean activation close to 0 and the activation standard deviation close 
to 1.**

Beyond the intuitive reasons, there are good mathematical reasons why it helps the network learn better, too. 
It helps combat what the authors call internal covariate shift.

#### Benefits of Batch Normalization:

The intention behind batch normalisation is to optimise network training. It has been shown to have several benefits:

- Networks train faster: Each training iteration is slower because of the extra normalisation calculations in the forward pass and the additional parameters (gamma and beta) to train during backpropagation. However, the network should converge in far fewer iterations, so training should be faster overall.

- Allows higher learning rates: Gradient descent usually requires small learning rates for the network to converge. As networks get deeper, gradients get smaller during backpropagation, so even more iterations are required. Batch normalisation allows much higher learning rates, which increases the speed at which networks train.

- Makes weights easier to initialise: Weight initialisation can be difficult, especially when creating deeper networks. Batch normalisation helps reduce the sensitivity to the initial starting weights.

- Makes more activation functions viable: Some activation functions don’t work well in certain situations. Sigmoids lose their gradient quickly, which makes them hard to use in deep networks, and ReLUs often die out during training (stop learning completely), so we must be careful about the range of values fed into them. Because batch normalisation regulates the values going into each activation function, nonlinearities that don’t work well in deep networks tend to become viable again.

- Simplifies the creation of deeper networks: The previous four points make it easier to build and faster to train deeper neural networks, and deeper networks generally produce better results.

- Provides some regularisation: Batch normalisation adds a little noise to your network, and in some cases (e.g. Inception modules) it has been shown to work as well as dropout. You can consider batch normalisation as a bit of extra regularisation, allowing you to reduce some of the dropout you might add to a network.

As batch normalisation helps train networks faster, it also facilitates greater experimentation, since you can iterate over more designs more quickly.
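
The reduced sensitivity to initial weights can be seen in a small experiment. The sketch below, which uses made-up layer sizes and a hypothetical helper `normalise`, shows that rescaling a layer's weights leaves its batch-normalised output essentially unchanged. It illustrates one contributing property only and does not demonstrate the training-speed claims.

```python
import numpy as np

def normalise(z, eps=1e-5):
    """Batch-normalise z of shape [N, D] per feature (no gamma/beta)."""
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(128, 20))   # toy mini-batch (assumed sizes)
W = rng.normal(size=(20, 8))     # toy weight matrix

small = x @ (0.1 * W)            # weights initialised "too small"
large = x @ (10.0 * W)           # weights initialised "too large"

# Raw pre-activations differ in magnitude by a factor of 100 ...
ratio = np.abs(large).mean() / np.abs(small).mean()
print(ratio)

# ... but after normalisation the two are nearly identical
max_diff = np.abs(normalise(large) - normalise(small)).max()
print(max_diff)
```

The tiny remaining difference comes from `eps`, which matters slightly more when the variance is small.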

### Background:

In 1998, Yann LeCun et al. highlighted the importance of normalising the inputs in their well-known paper *Efficient BackProp*. Preprocessing the inputs with normalisation is a standard machine learning procedure and is known to speed up convergence. Normalisation is done to achieve the following objectives:

- The average of each input variable (or feature) over the training set is close to zero (Mean subtraction).
- The covariances of the features are about the same (Scaling).
- The correlation among features is minimal (Whitening).

The first two are easy to implement:

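```
import numpy as np

# Toy input data matrix X of size [N x D]; substitute your own float array
X = np.random.randn(100, 3)
X -= np.mean(X, axis=0)  # Mean subtraction: zero-center each feature
X /= np.std(X, axis=0)   # Scaling: unit standard deviation per feature
```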


<p>
        <img src = 'assets/prepro1.jpeg'/ >
    Common data preprocessing pipeline. Left: Original toy, 2-dimensional input data. Middle: The data is zero-centered by subtracting the mean in each dimension. The data cloud is now centered around the origin. Right: Each dimension is additionally scaled by its standard deviation. The red lines indicate the extent of the data - they are of unequal length in the middle, but of equal length on the right.
</p>
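
The third objective, whitening, takes a few more lines. The sketch below uses PCA whitening via an eigendecomposition of the covariance matrix, which is one common approach rather than the only one. The toy data, the mixing matrix `A`, and the `eps` constant are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy correlated data of size [N x D] (assumed for the example)
A = np.array([[2.0, 0.0], [1.5, 0.5]])
X = rng.normal(size=(500, 2)) @ A

X = X - X.mean(axis=0)                  # zero-center first
cov = X.T @ X / X.shape[0]              # D x D covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh, since cov is symmetric

X_rot = X @ eigvecs                     # decorrelate: rotate onto the eigenbasis
eps = 1e-5                              # avoids dividing by a tiny eigenvalue
X_white = X_rot / np.sqrt(eigvals + eps)

# Covariance of the whitened data is close to the identity matrix
cov_white = X_white.T @ X_white / X_white.shape[0]
print(np.round(cov_white, 3))
```

Unlike mean subtraction and scaling, this step needs the full covariance matrix, which becomes expensive as the number of features grows.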


<p>
        <img src = 'assets/1.png'/>
</p>

<p>
        <img src = 'assets/2.png'/>
</p>
    
<p>
        <img src = 'assets/3.png'/>
</p>

##### REFERENCES:
- [Understanding gradient flow through batch normalization](https://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html)
- [Training Deep Neural Networks with Batch Normalization](https://zaffnet.github.io/batch-normalization)

- [batchnorm](https://wiseodd.github.io/techblog/2016/07/04/batchnorm/)