# Batch Normalization
## Unit Gaussian Activations

If you want unit Gaussian activations, you can simply make them so by applying batch normalization at every layer. Consider a batch of activations at some layer. We can make each dimension (denoted by $k$) unit Gaussian, that is, zero mean and unit variance, by applying:

$$
\hat{x}^{(k)} = \frac{x^{(k)} - E[x^{(k)}]}{\sqrt{Var[x^{(k)}]}}
$$

Each example in the batch has dimension `D`. Compute the empirical mean and variance independently for each dimension, across all examples in the batch.

For example:

```python
import numpy as np

# Three training activation examples, each of dimension 3
activations = np.array([[1, 0.9, 1], [1, 1, 1], [1, 1, 0.5]])

print(activations.mean(axis=0))  # per-dimension mean over the batch
print(activations.var(axis=0))   # per-dimension variance over the batch
```

Output:

```
[1.         0.96666667 0.83333333]
[0.         0.00222222 0.05555556]
```
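As a minimal sketch, the same three activations can be normalized with the statistics just computed. The first dimension has zero variance, so dividing by the standard deviation directly would give 0/0. The small constant `eps` below is an assumption added here to avoid that, and its value is only illustrative.

```python
import numpy as np

activations = np.array([[1, 0.9, 1], [1, 1, 1], [1, 1, 0.5]])

mean = activations.mean(axis=0)  # per-dimension mean over the batch
var = activations.var(axis=0)    # per-dimension variance over the batch

eps = 1e-5  # assumed small constant; guards against division by zero
normalized = (activations - mean) / np.sqrt(var + eps)

print(normalized.mean(axis=0))  # close to 0 in every dimension
print(normalized.var(axis=0))   # close to 1, except the constant first dimension (0)
```

The constant first dimension stays at zero after normalization, since every example already equals the mean there.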


Batch normalization is usually inserted after fully connected or convolutional layers and before the nonlinearity. For a convolutional layer, we compute one mean and one standard deviation per activation map, and normalize across all examples in the batch and all spatial locations within the map.
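The convolutional case can be sketched in NumPy as follows. The `(N, C, H, W)` layout, the batch shape, and `eps` are assumptions made for this illustration. The statistics are pooled over the batch axis and both spatial axes, so that each activation map gets exactly one mean and one variance.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical batch: 8 examples, 3 activation maps, each 4x4
x = rng.normal(loc=2.0, scale=3.0, size=(8, 3, 4, 4))

# One mean and one variance per activation map (axis 1),
# computed over the batch axis (0) and the spatial axes (2, 3)
mean = x.mean(axis=(0, 2, 3), keepdims=True)  # shape (1, 3, 1, 1)
var = x.var(axis=(0, 2, 3), keepdims=True)    # shape (1, 3, 1, 1)

eps = 1e-5  # assumed small constant for numerical stability
x_hat = (x - mean) / np.sqrt(var + eps)

print(mean.shape)                   # (1, 3, 1, 1)
print(x_hat.mean(axis=(0, 2, 3)))   # close to 0 for each map
print(x_hat.std(axis=(0, 2, 3)))    # close to 1 for each map
```

Keeping the reduced axes with `keepdims=True` lets the statistics broadcast back against `x` without any reshaping.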

## Avoid Constraints by Learning

If we have a tanh layer, we don't necessarily want to constrain its inputs to the linear regime. Normalization might force the inputs to stay near the center of the tanh, where it is approximately linear. We want flexibility, so ideally the network should be able to learn how much normalization to keep. In other words, we insert learnable parameters that can effectively cancel out batch normalization if the network sees fit.

We will apply the following operation to each normalized vector:

$$
y^{(k)} = \gamma^{(k)}\hat{x}^{(k)} + \beta^{(k)}
$$

This lets the network learn

$$
\begin{aligned}
\gamma^{(k)} &= \sqrt{Var[x^{(k)}]} \\
\beta^{(k)} &= E[x^{(k)}]
\end{aligned}
$$

With these values, $y^{(k)} = x^{(k)}$: the network recovers the identity mapping, as if batch normalization were not there.
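A quick numerical check of this identity, as a sketch on made-up data: the batch shape and the random values are assumptions, and no stabilizing constant is used so that the cancellation is exact up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical batch: 4 examples, 3 dimensions
x = rng.normal(loc=5.0, scale=2.0, size=(4, 3))

mean = x.mean(axis=0)
var = x.var(axis=0)
x_hat = (x - mean) / np.sqrt(var)  # normalized activations

# Choose gamma and beta as in the equations above
gamma = np.sqrt(var)
beta = mean
y = gamma * x_hat + beta

print(np.allclose(y, x))  # True: scale and shift undo the normalization
```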

## Summary

**Inputs**: Values of $x$ over a mini-batch $B = \{x_{1}, \ldots, x_{m}\}$; learnable parameters $\gamma$ and $\beta$

**Outputs**: $\{y_{i} = BN_{\gamma, \beta}(x_{i})\}$

Find mini-batch mean:
$$
\mu_{B} = \frac{1}{m} \sum^{m}_{i = 1} x_{i}
$$

Find mini-batch variance:
$$
\sigma_{B}^{2} = \frac{1}{m} \sum^{m}_{i = 1} (x_{i} - \mu_{B})^{2}
$$

Normalize, where $\epsilon$ is a small constant added for numerical stability:
$$
\hat{x}_{i} = \frac{x_{i} - \mu_{B}}{\sqrt{\sigma_{B}^{2} + \epsilon}}
$$

Scale and shift:
$$
y_{i} = \gamma \hat{x}_{i} + \beta = BN_{\gamma, \beta}(x_{i})
$$
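The four steps above translate almost line for line into NumPy. This is a minimal sketch of the mini-batch computation only. The function name `batchnorm_forward`, the `(m, D)` input layout, and the default `eps` are assumptions made for illustration.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Apply batch normalization to a mini-batch x of shape (m, D)."""
    mu = x.mean(axis=0)                    # mini-batch mean
    var = x.var(axis=0)                    # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    return gamma * x_hat + beta            # scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=4.0, size=(32, 5))  # hypothetical mini-batch

# With gamma = 1 and beta = 0 the output is simply the normalized input
y = batchnorm_forward(x, gamma=np.ones(5), beta=np.zeros(5))
print(y.mean(axis=0))  # close to 0 in every dimension
print(y.std(axis=0))   # close to 1 in every dimension
```

Other values of `gamma` and `beta` move the output to standard deviation `gamma` and mean `beta` in each dimension.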

### Benefits
* Improves gradient flow through the network
* Allows higher learning rates
* Reduces the strong dependence on initialization
* Acts as a form of regularization, because each example's output depends on the other examples in its mini-batch, and slightly reduces the need for dropout
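The regularization effect can be illustrated with a small sketch: the same example, placed in two different mini-batches, is normalized to different values, so the network sees a slightly perturbed version of it each time. The helper `normalize`, the batch contents, and `eps` are assumptions made for this illustration.

```python
import numpy as np

def normalize(batch, eps=1e-5):
    """Normalize each dimension of a (m, D) batch with its own statistics."""
    return (batch - batch.mean(axis=0)) / np.sqrt(batch.var(axis=0) + eps)

rng = np.random.default_rng(0)
example = np.array([1.0, 2.0])  # one fixed example of dimension 2

# Put the same example into two mini-batches with different companions
batch_a = np.vstack([example, rng.normal(size=(3, 2))])
batch_b = np.vstack([example, rng.normal(size=(3, 2))])

out_a = normalize(batch_a)[0]  # normalized example in the first batch
out_b = normalize(batch_b)[0]  # normalized example in the second batch

print(out_a)
print(out_b)  # differs from out_a although the input example is identical
```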