### Batch Normalization
* Batch normalization is mostly a technique for improving optimization.
* When you have a large dataset, it's more important to optimize well, while regularization becomes less critical as your number of samples increases.
* Of course you can use both batch normalization and dropout at the same time.
* BatchNorm isn't really used for regularization, as BatchNorm doesn't smooth the cost.
* Instead, BatchNorm is added in order to improve the performance of backpropagation.
* In essence, it keeps the backpropagated gradient from getting too big or too small by rescaling and recentering the activations.
* As a technique, it is related to second-order optimization methods that attempt to model the curvature of the cost surface.
* BatchNorm can also be used to guarantee that the relative scaling is correct if you are going to add random noise to neural activations.

#### BatchNorm vs Dropout
* As a side effect, batch normalization also introduces some noise into the network, so it can regularize the model a little bit.
* Why? Recall that dropout multiplies the hidden units by a noise vector (containing ones and zeros).
* BatchNorm is similar to dropout in the sense that it multiplies each hidden unit by a random value at each step of training. 
* In this case, the random value is the inverse of the standard deviation of that hidden unit over the minibatch.
* Because different examples are randomly chosen for inclusion in the minibatch at each step, the standard deviation fluctuates randomly.
* BatchNorm also subtracts a random value (the mean of the minibatch) from each hidden unit at each step.
* Both of these sources of noise mean that every layer has to learn to be robust to a lot of variation in its input, just like with dropout.
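
The fluctuation described above can be observed directly. The following is a minimal NumPy sketch, not part of the original worksheet: the activation distribution, the batch size of 32, and the dropout keep probability are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical activations of one hidden unit over a 10,000-example dataset.
activations = rng.normal(loc=3.0, scale=2.0, size=10_000)

batch_size = 32
batch_means, batch_stds = [], []
for _ in range(1000):
    # Each training step sees a different random mini-batch.
    batch = rng.choice(activations, size=batch_size, replace=False)
    batch_means.append(batch.mean())
    batch_stds.append(batch.std())

batch_means = np.array(batch_means)
batch_stds = np.array(batch_stds)

# The statistics used for normalization jitter from step to step.
print("spread of mini-batch means:", batch_means.std())
print("spread of mini-batch stds: ", batch_stds.std())

# Dropout, for comparison, multiplies activations by a random 0/1 mask.
keep_prob = 0.5
mask = rng.random(batch_size) < keep_prob
dropped = batch * mask / keep_prob  # inverted-dropout rescaling
```

Larger batches shrink both spreads, which is one reason the regularizing effect of BatchNorm is usually described as mild.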

#### Why BatchNorm Works
* After normalizing a neural network's inputs, we no longer have to worry about input features having wildly different scales.
* Thus, gradient descent's oscillations are dampened when approaching a minimum in the loss surface, and convergence is faster.
* BatchNorm also reduces the impact of earlier layers on later layers in a deep neural network.
* If we take a slice from the middle of our network, e.g. layer #10, we can see that layer 10's input features change during training.
* This makes training more difficult, causing the model to take longer to converge.
* BatchNorm can reduce the impact of earlier layers by keeping the mean and variance of each layer's inputs roughly fixed, which makes the layers more independent of each other.
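
To make the point about input scale concrete, here is a small sketch (an illustration under assumed values, not taken from the worksheet) that runs plain gradient descent on a least-squares problem, once with raw features on very different scales and once with standardized features. The feature scales, learning rates, and step count are hand-picked assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical input features on very different scales.
n = 200
X = np.column_stack([rng.normal(0, 1, n), rng.normal(0, 100, n)])
true_w = np.array([2.0, 0.03])
y = X @ true_w

def gd_loss(X, y, lr, steps=100):
    """Run plain gradient descent on mean squared error; return the final loss."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return np.mean((X @ w - y) ** 2)

# Standardize each feature to zero mean and unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
y_centered = y - y.mean()

# Raw features need a tiny learning rate to stay stable along the large-scale feature,
# which leaves the small-scale feature's weight almost unchanged after 100 steps.
loss_raw = gd_loss(X, y, lr=1e-5)
# Standardized features tolerate a much larger learning rate.
loss_std = gd_loss(X_std, y_centered, lr=0.1)

print("final loss, raw features:         ", loss_raw)
print("final loss, standardized features:", loss_std)
```

With the same number of steps, the standardized problem should end at a far lower loss, since no single direction forces the step size down.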


#### Drawbacks of BatchNorm
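* BatchNorm has a computational cost: it adds two learnable parameters per normalized feature ($\gamma$ and $\beta$), and the mini-batch statistics must be computed at every training step.
* Because inference relies on an exponential moving average of the mini-batch statistics, model performance can be heavily impacted if the mini-batches do not properly represent the entire data distribution.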
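
The exponential moving average mentioned above can be sketched as follows. This is an assumed, simplified version of the bookkeeping: the momentum of 0.9, the initial values, and the batch size are illustrative choices, not values taken from this worksheet or from any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical activations of one feature across the full dataset.
data = rng.normal(loc=5.0, scale=2.0, size=10_000)

momentum = 0.9  # assumed decay rate for the moving averages
running_mean, running_var = 0.0, 1.0

for _ in range(200):
    batch = rng.choice(data, size=64, replace=False)
    # Blend each mini-batch statistic into the running estimate.
    running_mean = momentum * running_mean + (1 - momentum) * batch.mean()
    running_var = momentum * running_var + (1 - momentum) * batch.var()

# At inference time the running estimates replace the batch statistics.
x_new = np.array([3.0, 5.0, 7.0])
x_norm = (x_new - running_mean) / np.sqrt(running_var + 1e-5)

print("running mean:", running_mean, "dataset mean:", data.mean())
print("running var: ", running_var, "dataset var: ", data.var())
```

If the mini-batches were drawn from a skewed subset of the data, the running estimates would track that subset instead, which is the failure mode the bullet describes.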

#### Covariate Shift
* Batch norm limits the internal covariate shift by normalizing the data over and over again.
* So what is covariate shift and why does it matter?
* Covariate shift is when your inputs change on you, and your algorithm can't deal with it.
* More formally, covariate shift is a change in the distribution of a function's domain.
* So if a net's parameters were trained on distribution A, and we give it data from a different distribution, let's say B, then the trained model will not perform very well.
* Within a single training set, covariate shift isn't normally a problem.
* Even if you're taking subsets/mini-batches, the statistics between batches shouldn't be off by too much, provided you've randomized your dataset.
* But most deep learning architectures are hierarchical...
* At the first layer, you're looking at data from dataset D and the statistics between batches remain similar during training.
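* But the first layer feeds the second layer, the second feeds the third, and so on. Each layer's input distribution depends on the parameters of every layer before it, so by the time you reach, say, layer 100, the statistics can drift substantially whenever those parameters are updated.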
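
The drift across layers can be illustrated with a toy stack of ReLU layers. The sketch below is an assumption-laden illustration (random weights, ten layers, and a pretend "training step" that rescales every weight matrix by 10%); it only shows how the statistics reaching a deep layer react, with and without per-layer batch standardization.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_input_stats(weights, x, normalize):
    """Forward x through ReLU layers; return mean/std of what the next layer would receive."""
    h = x
    for w in weights:
        h = np.maximum(h @ w, 0.0)
        if normalize:
            # Batch-standardize each feature (no learned scale/shift here).
            h = (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + 1e-5)
    return h.mean(), h.std()

x = rng.normal(size=(256, 64))  # one mini-batch from a fixed dataset D
weights = [rng.normal(scale=np.sqrt(2 / 64), size=(64, 64)) for _ in range(10)]
# Pretend a training step rescaled the earlier layers' weights by 10%.
updated = [w * 1.1 for w in weights]

for normalize in (False, True):
    before = layer_input_stats(weights, x, normalize)
    after = layer_input_stats(updated, x, normalize)
    print("normalized" if normalize else "plain     ", before, after)
```

Without normalization the small per-layer change compounds through the stack, while the standardized version hands the deep layer nearly the same statistics before and after the update.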

### Batch Normalization Worksheet

In [None]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
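np.random.seed(678)
# The calls below are TensorFlow 1.x APIs (in TensorFlow 2.x they live under tf.compat.v1).
tf.set_random_seed(678)
config = tf.ConfigProto(device_count = {'GPU': 0})  # run on CPU only
sess = tf.InteractiveSession(config=config)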

### Create Some Random Data
* To simulate a real-world use case, we create 32*32 images from a random normal distribution, each with a random scale and offset.

In [None]:
# 30 single-channel images of size 32x32
test_data = np.zeros((30,32,32,1))
for i in range(30):
    # Gaussian noise with a random scale (0-4) and a random offset (0-59) per image
    new_random_image = np.random.randn(32,32) * np.random.randint(5) + np.random.randint(60)
    new_random_image = np.expand_dims(new_random_image,axis=2)
    test_data[i,:,:,:] = new_random_image

print('\n===================================')
print("Data Shape: ",test_data.shape, " (# of Images, Image Height, Image Width, Channels)")
print("Data Max: ",test_data.max())
print("Data Min: ",test_data.min())
print("Data Mean: ",test_data.mean())
print("Data Variance: ",test_data.var())
print('===================================')

### Sample two images from our data and plot them along with the data distribution

In [None]:
testdata_img_1 = np.squeeze(test_data[0,:,:,:])
testdata_img_2 = np.squeeze(test_data[4,:,:,:])

f, axarr = plt.subplots(1,3,figsize=(12,4))    
axarr[0].imshow(testdata_img_1,cmap='gray')
axarr[1].imshow(testdata_img_2,cmap='gray')
axarr[2].hist(test_data.flatten() ,bins='auto')

plt.show()

### --------- Case 1 normalize the entire dataset ------
$$\LARGE{X_{new} = \frac{X-X_{min}}{X_{max} - X_{min}}}$$

In [None]:
# axis=0 takes the min/max across the 30 images separately at each pixel position
normdata = (test_data - test_data.min(axis=0)) / \
(test_data.max(axis=0) - test_data.min(axis=0))
normdata_img_1 = np.squeeze(normdata[0,:,:,:])
normdata_img_2 = np.squeeze(normdata[4,:,:,:])

print('============== Normalized Data ==============')
print("Data Shape: ",normdata.shape)
print("Data Max: ",normdata.max())
print("Data Min: ",normdata.min())
print("Data Mean: ",normdata.mean())
print("Data Variance: ",normdata.var())
print('=============================================')
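
The formula above uses a single $X_{min}$ and $X_{max}$, whereas the cell computes them with `axis=0`, that is, per pixel position across images. The sketch below contrasts the two on stand-in data (the array `data` is a hypothetical substitute for `test_data`, generated here so the example is self-contained).

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in data with the same layout as test_data: (images, height, width, channels).
data = rng.normal(loc=30.0, scale=3.0, size=(30, 32, 32, 1))

# Global scaling: one min and one max for the whole dataset, as in the formula.
global_scaled = (data - data.min()) / (data.max() - data.min())

# Per-pixel scaling: axis=0 takes the min/max across images at each pixel position.
per_pixel_scaled = (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))

print("global:    ", global_scaled.min(), global_scaled.max())
print("per-pixel: ", per_pixel_scaled.min(), per_pixel_scaled.max())
# Per-pixel scaling makes every pixel position span the full [0, 1] range.
print(np.allclose(per_pixel_scaled.max(axis=0), 1.0))
```

Both variants land in the range 0 to 1, but only the global version preserves the relative brightness between images.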

#### We can now clearly see that while our images appear the same, the data now ranges between 0 and 1.

In [None]:
f, axarr = plt.subplots(1,3,figsize=(12,4))    
axarr[0].imshow(normdata_img_1,cmap='gray')
axarr[1].imshow(normdata_img_2,cmap='gray')
axarr[2].hist(normdata.flatten() ,bins='auto')
plt.show()

### --- Case 2 Standardization: standardize the whole dataset using standard deviation ---
$$\LARGE{X_{new} = \frac{X - \mu}{\sigma}}$$

In [None]:
standdata = (test_data - test_data.mean(axis=0)) / test_data.std(axis=0)
standdata_img_1 = np.squeeze(standdata[0,:,:,:])
standdata_img_2 = np.squeeze(standdata[4,:,:,:])

print('============== Standardized Data  ==============')
print("Data Shape: ",standdata.shape)
print("Data Max: ",standdata.max())
print("Data Min: ",standdata.min())
print("Data Mean: ",standdata.mean())
print("Data Variance: ",standdata.var())
print('================================================')

#### Following standardization, we can see that the mean of our data has shifted to around 0, and variance to 1, but visually, it still looks exactly the same

In [None]:
f, axarr = plt.subplots(1,3,figsize=(12,4))    
axarr[0].imshow(standdata_img_1,cmap='gray')
axarr[1].imshow(standdata_img_2,cmap='gray')
axarr[2].hist(standdata.flatten() ,bins='auto')
plt.show()

### --------- Case 3a batch normalize the first 10 images ------
* **Input:** Values of $x$ over a mini-batch: $\mathcal{B} = \{x_1, \ldots, x_m\}$
* Parameters to be learned: $\large{\gamma,\beta}$
* **Output:** ${y_i = BN_{\gamma,\beta}(x_i)}$

**mini-batch mean:**
$$\mu_{\mathcal{B}}\leftarrow\frac{1}{m}\sum_{i=1}^mx_i$$

**mini-batch variance:**
$$\sigma_{\mathcal{B}}^2\leftarrow\frac{1}{m}\sum_{i=1}^m(x_i-\mu_{\mathcal{B}})^2$$

**normalize:** `note this is also equivalent to the equation for Standardization`
$$\hat x_i\leftarrow\frac{x_i-\mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2+\epsilon}}$$

**scale and shift:**
$$y_i\leftarrow\gamma\hat x_i + \beta \equiv BN_{\gamma,\beta}(x_i)$$
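
The four steps above can be collected into one function. This is a minimal NumPy sketch rather than a reference implementation: the name `batch_norm_forward`, the default `eps`, and the toy input are assumptions, and $\gamma$ and $\beta$ are passed in as fixed numbers instead of being learned.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-8):
    """Apply the four steps above to a mini-batch x of shape (m, ...)."""
    mu = x.mean(axis=0)                      # mini-batch mean
    var = ((x - mu) ** 2).mean(axis=0)       # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize
    return gamma * x_hat + beta              # scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=40.0, scale=3.0, size=(10, 4))

# With gamma=1 and beta=0 the output is simply the standardized batch.
y = batch_norm_forward(x, gamma=1.0, beta=0.0)
print(y.mean(axis=0), y.var(axis=0))

# Other values of gamma and beta move the output to a new scale and offset.
y2 = batch_norm_forward(x, gamma=2.0, beta=5.0)
print(y2.mean(axis=0), y2.std(axis=0))
```

Setting $\gamma = 1$ and $\beta = 0$ reduces the transform to plain standardization over the batch, which is the setting used in the cells that follow.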

In [None]:
first10_data = test_data[:10,:,:,:]

# column-wise sums / # of samples
mini_batch_mean = first10_data.sum(axis=0) / len(first10_data)
mini_batch_var = ((first10_data - mini_batch_mean) ** 2).sum(axis=0) / len(first10_data)
batchnorm_data = (first10_data - mini_batch_mean)/ ( (mini_batch_var + 1e-8) ** 0.5 )

bndata_img_1 = np.squeeze(batchnorm_data[0,:,:,:])
bndata_img_2 = np.squeeze(batchnorm_data[4,:,:,:])
print('============== Case 3a Implementation ===================')
print("Data Shape: ",batchnorm_data.shape)
print("Data Max: ",batchnorm_data.max())
print("Data Min: ",batchnorm_data.min())
print("Data Mean: ",batchnorm_data.mean())
print("Data Variance: ",batchnorm_data.var())

#### Again, the mean is around zero, and variance is around 1, and the images do not look much different

In [None]:
f, axarr = plt.subplots(1,3,figsize=(12,4))    
axarr[0].imshow(bndata_img_1,cmap='gray')
axarr[1].imshow(bndata_img_2,cmap='gray')
axarr[2].hist(batchnorm_data.flatten() ,bins='auto')
plt.show()

### --------- Case 3b batch norm TensorFlow ------

In [None]:
bndataTF = tf.nn.batch_normalization(first10_data,
                mean = first10_data.mean(axis=0),
                variance = first10_data.var(axis=0),
                offset = 0.0,scale = 1.0,
                variance_epsilon = 1e-8
                ).eval()

bndataTF_img_1 = np.squeeze(bndataTF[0,:,:,:])
bndataTF_img_2 = np.squeeze(bndataTF[4,:,:,:])
print('============== Case 3b Tensorflow ===================')
print("Data Shape: ",bndataTF.shape)
print("Data Max: ",bndataTF.max())
print("Data Min: ",bndataTF.min())
print("Data Mean: ",bndataTF.mean())
print("Data Variance: ",bndataTF.var())
print('=================================')