# Batch Normalization Review<hr>
- Recall that before we input data into many ML algorithms, we like to normalize the data first
- \\(z = (x - \mu)/\sigma\\)<br>
Recall that normalization means subtracting the mean and then dividing by the standard deviation. In other words, we make sure that the data has a mean of 0 and a variance of 1.
- With batch norm, instead of manually normalizing data first, we do a normalization at every layer of the neural net
- Normalization is built into the neural net, rather than being a calculation you do yourself before the data is fed in
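
As a quick reminder of what the formula does, here is a minimal sketch in plain Python. The function name `standardize` and the sample values are made up for illustration, and it uses the population standard deviation (dividing by N), which is an assumption.

```python
import statistics

def standardize(xs):
    """Return z = (x - mean) / std for every x in xs (population std)."""
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]

# Hypothetical feature column
z = standardize([2.0, 4.0, 6.0, 8.0])
```

After this, `z` has mean 0 and variance 1 (up to floating-point error), whatever the scale of the original values was.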

## How does it work?
- It's called *"batch normalization"* because the statistics are computed over each batch used in (mini-)batch gradient descent
- During training, we consider a small batch of data for each gradient descent step<br>
\\(X_\beta =\\) next batch of data<br>
\\(\mu_\beta = mean(X_\beta)\\)<br>
\\(\sigma_\beta = std(X_\beta)\\)<br>
\\(Y_\beta = (X_\beta - \mu_\beta)/\sigma_\beta\\)

- Only applies during training, since only then will we have batches (we'll do something else for testing)

## Practical Issue
- Important: when does batch normalization actually happen?
![batchnomal2](../images/batchnomal2.PNG)

## Pretty Simple, No?
\\(X_\beta = \\) next batch of data (small note: "X" refers to the activation here)<br>
\\(\mu_\beta = mean(X_\beta)\\)<br>
\\({\sigma_\beta}^2 = var(X_\beta)\\)<br>

- One more small detail: add a small number \\(\epsilon\\) to the denominator to avoid dividing by 0

\\(Y_\beta = (X_\beta - \mu_\beta)/\sqrt{{\sigma_\beta}^2+\epsilon}\\)
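
The normalization step above can be sketched in plain Python. This is only an illustration: the function name `batch_normalize`, the list-of-rows layout (one row per sample, one column per feature), and the default `eps=1e-5` are all assumptions, not something a particular library prescribes.

```python
import math

def batch_normalize(batch, eps=1e-5):
    """Normalize each feature (column) of a batch given as a list of rows."""
    n = len(batch)
    columns = list(zip(*batch))  # one tuple per feature
    means = [sum(col) / n for col in columns]
    variances = [
        sum((v - m) ** 2 for v in col) / n
        for col, m in zip(columns, means)
    ]
    normalized = [
        [(v - m) / math.sqrt(var + eps)
         for v, m, var in zip(row, means, variances)]
        for row in batch
    ]
    return normalized, means, variances

# Hypothetical batch: the second feature is constant, so its variance is 0
batch = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
y, mu_b, var_b = batch_normalize(batch)
```

The constant second feature shows why epsilon is there: its batch variance is exactly 0, and without epsilon the division would fail. With it, that column simply normalizes to 0.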

But we are missing one step!

## Counter-Intuitive Step
- After going through all that trouble to normalize the data, we change its scale and location to something else!
![counter_intuitive](../images/counter_intuitive_step.PNG)

- Why do we "un-standardize" our data after standardizing it?
- Standardization may not be good (we don't know)
- Let gradient descent figure out what's best by updating \\(\gamma, \beta\\)


- Suppose standardization is good -> then the neural network will learn that \\(\gamma = 1, \beta = 0\\)
- \\(\gamma, \beta\\) should be whatever minimizes our cost
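
To make the scale-and-shift step concrete, here is a small sketch. The name `scale_and_shift` and the example values are hypothetical. Note that `beta` here is the learned shift, not the batch subscript used earlier.

```python
def scale_and_shift(normalized, gamma, beta):
    """Apply y = gamma * x_hat + beta feature-wise to a batch (list of rows)."""
    return [
        [g * v + b for v, g, b in zip(row, gamma, beta)]
        for row in normalized
    ]

# Hypothetical already-normalized activations (3 samples, 2 features)
x_hat = [[-1.0, 0.5], [0.0, -0.5], [1.0, 0.0]]

# gamma = 1, beta = 0 leaves the standardized values unchanged
identity = scale_and_shift(x_hat, gamma=[1.0, 1.0], beta=[0.0, 0.0])

# Any other values move the first feature to a new scale and location
shifted = scale_and_shift(x_hat, gamma=[2.0, 1.0], beta=[3.0, 0.0])
```

In a real network `gamma` and `beta` would be trainable parameters updated by gradient descent. Here they are fixed by hand just to show both cases.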

## We still have a problem
- We know how to train, but not how to test
- Suppose we want to make a prediction for 1 data point - if we subtract its mean (which is just itself), we get the zero vector!
- Would be nice if we kept track of a **"global mean"** and **"global variance"** during training, and used those to normalize the test samples
- That's exactly what we do! (The update looks like the exponential smoothing used in RL / RMSprop / Adam)
<br>
for each batch B:<br>
\\(\mu = decay*\mu + (1 - decay) * \mu_\beta\\)<br>
\\(\sigma^2 = decay*\sigma^2 + (1 - decay) * \sigma^2_\beta\\)<br>

- Theoretically, could just use sample mean/var of all training data, but may not scale
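
The running update can be sketched as follows. The helper name `update_running`, the starting value of 0, and `decay=0.9` are assumptions for illustration (libraries differ on initial values and on the default decay).

```python
def update_running(running, batch_stat, decay=0.9):
    """Smoothed update: running = decay * running + (1 - decay) * batch_stat."""
    return [decay * r + (1.0 - decay) * b for r, b in zip(running, batch_stat)]

# Hypothetical: one feature whose batch mean is 10.0 on three consecutive batches
running_mean = [0.0]
for batch_mean in [[10.0], [10.0], [10.0]]:
    running_mean = update_running(running_mean, batch_mean, decay=0.9)
```

The running value moves toward the batch statistic a little on every step (1.0, then 1.9, then 2.71 here), so a decay close to 1 gives a smoother but slower-moving estimate. The same function works for the running variance.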

### Batch norm test mode
\\(\mu, \sigma^2\\) collected during training<br>
\\(x_{test} = (x_{test} -\mu)/\sqrt{\sigma^2 + \epsilon}\\)<br>
\\(y_{test} = \gamma x_{test}+\beta\\)
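
Test mode for a single sample might look like the sketch below. It divides by the square root of the running variance plus epsilon, the same form as the training-time denominator. The function name `batch_norm_inference` and all the numbers are hypothetical.

```python
import math

def batch_norm_inference(x, running_mean, running_var, gamma, beta, eps=1e-5):
    """Normalize one sample with statistics collected during training."""
    x_hat = [
        (v - m) / math.sqrt(var + eps)
        for v, m, var in zip(x, running_mean, running_var)
    ]
    return [g * h + b for h, g, b in zip(x_hat, gamma, beta)]

# Hypothetical sample with two features and made-up running statistics
y_test = batch_norm_inference(
    [5.0, 1.0],
    running_mean=[3.0, 1.0],
    running_var=[4.0, 1.0],
    gamma=[1.0, 2.0],
    beta=[0.0, 0.5],
)
```

Because the statistics come from training rather than from the sample itself, a single data point no longer collapses to zero: the first feature comes out close to 1 (two units above a mean of 3, with a standard deviation of 2).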

## Implementation
- You could try to implement these from scratch (I think it would be a great exercise)
- But instead, use these:
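- TF: `tf.nn.batch_normalization` or `tf.contrib.layers.batch_norm` (TF 1.x only; `tf.contrib` was removed in TensorFlow 2, where `tf.keras.layers.BatchNormalization` plays this role)
- Theano:<br>
`from theano.tensor.nnet.bn import batch_normalization_train, batch_normalization_test`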
![batchnomal_implementation](../images/batchnomal_implementation.PNG)

## One Last Note
- These are all element-wise operations
- It applies equally to scalars, vectors, images
- Difference for images (e.g. convolution output): we treat the number of feature maps as the feature dimension<br>
So for an image of (H,W,C) = (28,28,512), \\(\gamma,\beta\\) are size 512, NOT 28 x 28 x 512!
- Direct analogy to a fully connected layer, where \\(\gamma,\beta\\) are the same size as the # of hidden units (a.k.a. # of features)
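
A small sketch of the per-channel case, assuming an (N, H, W, C) layout stored as nested lists and statistics pooled over the batch and both spatial dimensions. The function name `conv_batch_norm` and the tiny 2 x 2 x 2 x 3 input are made up for illustration.

```python
import math

def conv_batch_norm(x, gamma, beta, eps=1e-5):
    """x has shape (N, H, W, C) as nested lists; gamma and beta have length C."""
    num_channels = len(gamma)
    # Gather every value of each channel across batch, height, and width
    per_channel = [[] for _ in range(num_channels)]
    for image in x:
        for row in image:
            for pixel in row:
                for c, v in enumerate(pixel):
                    per_channel[c].append(v)
    means = [sum(vals) / len(vals) for vals in per_channel]
    variances = [
        sum((v - m) ** 2 for v in vals) / len(vals)
        for vals, m in zip(per_channel, means)
    ]
    # One (mean, variance, gamma, beta) per channel, shared by all positions
    return [
        [[[gamma[c] * (v - means[c]) / math.sqrt(variances[c] + eps) + beta[c]
           for c, v in enumerate(pixel)]
          for pixel in row]
         for row in image]
        for image in x
    ]

# Hypothetical input: N=2, H=2, W=2, C=3
x = [[[[float(n + h + w + 10 * c) for c in range(3)]
       for w in range(2)] for h in range(2)] for n in range(2)]
gamma, beta = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
out = conv_batch_norm(x, gamma, beta)
```

The output has the same shape as the input, but there are only 3 values each of `gamma` and `beta`, one per channel, no matter how large H and W are.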