# Batch Normalization
## Unit Gaussian Activations

If you really want unit Gaussian activations (zero mean, unit variance), you can make them so by applying batch normalization at every layer. Consider a batch of activations at some layer. We can make each dimension (denoted by $k$) unit Gaussian by applying:

$$
\hat{x}^{(k)} = \frac{x^{(k)} - E[x^{(k)}]}{\sqrt{Var[x^{(k)}]}}
$$

Each training example in the batch has dimension `D`. Compute the empirical mean and variance independently for each dimension, across all the examples in the batch.

For example:

In [10]:
import numpy as np

# We have three training activation examples and each example has a dimension of 3
activations = np.array([[1, 0.9, 1],[1, 1, 1],[1, 1, 0.5]])

print(activations.mean(axis=0))  # per-dimension mean over the batch
print(activations.var(axis=0))   # per-dimension variance over the batch

[ 1.          0.96666667  0.83333333]
[ 0.          0.00222222  0.05555556]
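Note that the first dimension above has zero variance, so applying the formula literally would divide by zero. The sketch below applies the normalization to the same activations, with a small constant added under the square root. The helper name `normalize_batch` and the value `eps=1e-5` are assumptions made for illustration; they do not come from the text above.

```python
import numpy as np

def normalize_batch(x, eps=1e-5):
    """Normalize each column (dimension) of x using batch statistics.

    eps is an assumed small constant that avoids division by zero
    when a dimension has zero variance.
    """
    mean = x.mean(axis=0)  # empirical mean per dimension
    var = x.var(axis=0)    # empirical variance per dimension
    return (x - mean) / np.sqrt(var + eps)

# Same three examples as above, each with 3 dimensions
activations = np.array([[1, 0.9, 1], [1, 1, 1], [1, 1, 0.5]])
normalized = normalize_batch(activations)

print(normalized.mean(axis=0))  # close to 0 in every dimension
print(normalized.var(axis=0))   # close to 1, except the constant first dimension, which stays 0
```

The constant first dimension carries no information in this batch, so it simply maps to zeros instead of producing NaNs.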


## Avoid Constraints by Learning

If we have a tanh layer, we don't really want to constrain it to the linear regime. Normalization might force the inputs to stay near zero, the region where tanh is approximately linear. We want flexibility, so ideally the network should learn how much of the normalization to keep. In other words, we insert learnable parameters (a scale and a shift applied after normalization) that can effectively cancel out batch normalization if the network sees fit.
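As a rough sketch of this idea, the normalized activations can be rescaled and shifted per dimension: $y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$. The names `gamma`, `beta`, `eps`, and `batch_norm_forward` below follow common convention and are assumptions, not taken from the text above. In a real network `gamma` and `beta` would be learned by gradient descent; here they are set by hand only to show that one particular choice recovers the original activations.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize x per dimension, then apply a scale (gamma) and shift (beta)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.array([[1, 0.9, 1], [1, 1, 1], [1, 1, 0.5]])
eps = 1e-5

# gamma = 1, beta = 0 gives plain normalization
out_norm = batch_norm_forward(x, np.ones(3), np.zeros(3), eps)

# gamma = sqrt(var + eps), beta = mean cancels the normalization
gamma_id = np.sqrt(x.var(axis=0) + eps)
beta_id = x.mean(axis=0)
out_id = batch_norm_forward(x, gamma_id, beta_id, eps)

print(np.allclose(out_id, x))  # the original activations are recovered
```

Between these two extremes, the network is free to pick whatever scale and shift suit the nonlinearity that follows.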