# Batch Normalization

## Normalizing activations in a network

We have seen how normalizing can help us train faster. We cannot expect inputs to all fit within the same range. 

We are able to normalize the input values, but what about normalizing the values in the hidden layers?
- *Sidenote*
    - There is often debate about whether we should normalize before or after the activation function
    - In practice, normalizing before the activation function (normalizing z rather than a) is more common, and it is what we do here
    
The image below describes normalizing the z values in a specific hidden layer
- As shown in the previous section, we must calculate the mean and variance to normalize the values
- One thing to note is that we might not always want our values to be normalized around 0 with a standard deviation of 1.
    - Hence we have two additional parameters that we can tweak to give the normalized values a different distribution
        - $\gamma$ (gamma) and $\beta$ (beta)
        - These are values that we can modify...
        - If gamma were set to $\sqrt{\sigma^2 + \epsilon}$ and beta to $\mu$, we would invert the normalization and recover the original z
        
<img src="./images/improv_40.png" alt="Drawing" style="width: 550px;"/>
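
The normalization steps described above can be sketched in NumPy. This is a minimal sketch, not a reference implementation: the function name `batch_norm_forward`, the layout of `z` (one column per example), and the value of `eps` are assumptions made for illustration.

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-8):
    """Normalize z across examples, then rescale with gamma and beta.

    Assumed shapes: z is (n_units, m) with one column per example;
    gamma and beta are (n_units, 1).
    """
    mu = z.mean(axis=1, keepdims=True)       # per-unit mean
    var = z.var(axis=1, keepdims=True)       # per-unit variance (sigma^2)
    z_norm = (z - mu) / np.sqrt(var + eps)   # mean 0, variance close to 1
    z_tilde = gamma * z_norm + beta          # learnable scale and shift
    return z_tilde, mu, var

rng = np.random.default_rng(0)
z = rng.normal(loc=5.0, scale=3.0, size=(4, 32))

# gamma = 1 and beta = 0 keep the plain normalized values.
z_std, mu, var = batch_norm_forward(z, np.ones((4, 1)), np.zeros((4, 1)))

# gamma = sqrt(sigma^2 + eps) and beta = mu invert the normalization.
eps = 1e-8
z_back, _, _ = batch_norm_forward(z, np.sqrt(var + eps), mu, eps=eps)
```

With `gamma = 1` and `beta = 0` each row of `z_std` has mean 0 and variance close to 1, while the second call returns values that match the original `z`.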


## Fitting Batch Norm into a neural network
<img src="./images/improv_41.png" alt="Drawing" style="width: 450px;"/>

- Notice how we normalize z in each layer before applying the activation. We are not normalizing the output of the activation. The new values are used as the z for the activation.
- We also have new parameters to learn: a $\gamma$ and a $\beta$ for each layer.


This is typically done through mini-batches.
<img src="./images/improv_42.png" alt="Drawing" style="width: 450px;"/>
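
As a rough illustration of how this fits into a forward pass over mini-batches, here is a small sketch. The layer sizes, the ReLU activation, the batch size of 64, and the helper names `batch_norm` and `forward` are all assumptions chosen for the example. No bias term is included, in line with the parameter notes in this section.

```python
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-8):
    # Statistics are computed over the examples in the current mini-batch.
    mu = z.mean(axis=1, keepdims=True)
    var = z.var(axis=1, keepdims=True)
    return gamma * (z - mu) / np.sqrt(var + eps) + beta

def relu(z):
    return np.maximum(0, z)

def forward(x_batch, params):
    """Forward pass: linear step, then Batch Norm, then activation."""
    a = x_batch
    for W, gamma, beta in params:
        z = W @ a                             # linear step
        z_tilde = batch_norm(z, gamma, beta)  # normalize before activating
        a = relu(z_tilde)                     # activation uses the new z
    return a

rng = np.random.default_rng(0)
layer_sizes = [3, 5, 2]  # hypothetical: 3 inputs, 5 hidden units, 2 outputs
params = []
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    W = rng.normal(size=(n_out, n_in)) * 0.1
    params.append((W, np.ones((n_out, 1)), np.zeros((n_out, 1))))

X = rng.normal(size=(3, 256))  # 256 examples, one per column
batch_size = 64
outputs = [forward(X[:, i:i + batch_size], params)
           for i in range(0, X.shape[1], batch_size)]
```

Each mini-batch is normalized with its own mean and variance, so the same example could produce slightly different activations depending on which mini-batch it lands in.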


Parameters: 
- We will not be using the bias $b$ (the $b$ in $z = Wa + b$), since it is a constant added to every example.
- Subtracting the mean during normalization removes that constant from all the examples, so dropping $b$ changes nothing.
- The $\beta$ term (found in the equation where we rescale the normalized z) takes over its role. It controls the shift (the bias), since it sets the mean of the rescaled z.

<img src="./images/improv_43.png" alt="Drawing" style="width: 450px;"/>
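
The claim that a constant added to every example disappears under normalization can be checked numerically. The sketch below assumes the constant is a per-unit bias `b` in `z = W a + b`; the shapes and the helper name `normalize` are illustrative only.

```python
import numpy as np

def normalize(z, eps=1e-8):
    mu = z.mean(axis=1, keepdims=True)
    var = z.var(axis=1, keepdims=True)
    return (z - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 3))
a_prev = rng.normal(size=(3, 16))  # activations of the previous layer
b = rng.normal(size=(4, 1))        # the same constant for every example

z_without_b = W @ a_prev
z_with_b = W @ a_prev + b

# Subtracting the per-unit mean cancels b, so both versions normalize
# to the same values.
same = np.allclose(normalize(z_without_b), normalize(z_with_b))
```

Adding `b` shifts the mean by exactly `b` and leaves the variance unchanged, which is why `same` comes out true and the shift is left to the learnable beta instead.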

## Why does Batch Norm work?
- It makes the weights of later layers more robust to changes in the earlier layers
- This matters whether we are training a shallow network or a deep network:
    - If we only trained on black cats, our model will not do well on colored cats
    - This is called covariate shift: if the distribution of x changes, we may have to retrain the model, even when the true mapping from x to y stays the same.
    
    
The same issue shows up inside a network. Each hidden layer takes the outputs of the previous layers as its input, and those outputs change as the earlier weights are updated. If we do nothing about this, a later layer will suffer whenever the values coming from the previous layers change.

Since all the values in a layer are used in some way by future layers, we need a way to normalize these values.

This makes the job of the later layers easier because they receive similar values as input. These input values will not shift as much.

<img src="./images/improv_44.png" alt="Drawing" style="width: 550px;"/>

Because each mini-batch is normalized with its own mean and variance, the normalized values carry a little noise. This pushes the later layers not to rely on any one hidden unit, and thus normalization has a slight regularization effect.

<img src="./images/improv_45.png" alt="Drawing" style="width: 550px;"/>

In the image above, we notice that predicting on new values can lead to wrong answers if those values do not match the distribution the model was trained on. One way to reduce this is to make sure the values keep the same distribution (mean and variance).
- For example, the values themselves could change, but we would keep them at a mean of 0 with a variance of 1.
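
To make this concrete, the sketch below normalizes two sets of hidden-layer values whose distributions differ. The numbers are made up: `z_before` and `z_after` stand in for one layer's z values before and after the earlier layers have changed.

```python
import numpy as np

def normalize(z, eps=1e-8):
    mu = z.mean(axis=1, keepdims=True)
    var = z.var(axis=1, keepdims=True)
    return (z - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(2)
z_before = rng.normal(loc=0.5, scale=1.0, size=(3, 128))
z_after = rng.normal(loc=4.0, scale=6.0, size=(3, 128))  # hypothetical shift

stats = {}
for name, z in [("before", z_before), ("after", z_after)]:
    z_norm = normalize(z)
    # Record the per-unit mean and variance of the normalized values.
    stats[name] = (z_norm.mean(axis=1), z_norm.var(axis=1))
```

In both cases the normalized values end up with a mean of about 0 and a variance of about 1, so the next layer sees inputs on the same scale even though the raw values moved.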


## Batch Norm at test time
One of the drawbacks of relying on mini-batches is that at test time we often cannot process the test set in mini-batches, since we would typically have one example at a time. Computing the mean and variance of a single example does not make much sense.

During test time, we need a separate estimate of the mean and variance
- To get it, we estimate the values using an exponentially weighted average across mini-batches
- During training, we keep track of the $\mu$ and $\sigma^2$ of each mini-batch and use those values to calculate the exponentially weighted averages

<img src="./images/improv_46.png" alt="Drawing" style="width: 350px;"/>
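
A minimal sketch of this bookkeeping follows. It assumes a decay of 0.9 for the exponentially weighted average (the parameter called `momentum` here), simulated mini-batches drawn from a fixed distribution, and a hypothetical helper `update_running`; none of these come from the notes.

```python
import numpy as np

def update_running(running, batch_value, momentum=0.9):
    # Exponentially weighted average of the mini-batch statistics.
    return momentum * running + (1 - momentum) * batch_value

rng = np.random.default_rng(3)
n_units, eps = 4, 1e-8
gamma = np.ones((n_units, 1))
beta = np.zeros((n_units, 1))
running_mu = np.zeros((n_units, 1))
running_var = np.ones((n_units, 1))

# Training: update the running estimates after every mini-batch.
for _ in range(200):
    z = rng.normal(loc=2.0, scale=3.0, size=(n_units, 64))
    mu = z.mean(axis=1, keepdims=True)
    var = z.var(axis=1, keepdims=True)
    running_mu = update_running(running_mu, mu)
    running_var = update_running(running_var, var)

# Test time: a single example is normalized with the running estimates,
# not with statistics computed from the example itself.
z_test = rng.normal(loc=2.0, scale=3.0, size=(n_units, 1))
z_norm = (z_test - running_mu) / np.sqrt(running_var + eps)
z_tilde = gamma * z_norm + beta
```

The running estimates settle near the mean and variance of the training mini-batches, which lets a single test example be normalized in the same way as the training data.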
