<div style="    font-variant: small-caps;
    font-weight: normal;
    font-size: 30px;
    text-align: center;
    padding: 15px;
margin: 10px;">Reparametrization Trick for Batch Normalization</div>
<div style="    font-variant: small-caps;
    font-weight: normal;
    font-size: 20px;
    text-align: center;
    padding: 15px;">Proof of unaltered Accuracy</div>
<div style="  float:right;
    font-size: 12px;
    line-height: 12px;
padding: 10px 15px 8px;">Alberto IBARRONDO | Melek ONEN</div>

<div style=" display: inline-block; font-family: 'Lato', sans-serif; font-size: 12px; font-weight: bold; line-height: 12px; letter-spacing: 1px; padding: 10px 15px 8px; ">14/04/2018</div>

This notebook serves as proof of the Reparametrization Trick, which lets the immediately preceding Fully Connected (FC) or Convolutional (Conv) layer absorb a Batch Normalization (BN) layer. To prove it, we will:
1. Set up the environment and implement a CNN using Tensorflow.
2. Train a Convolutional Neural Network (CNN) containing BN layers from scratch, using the MNIST training dataset.
3. Evaluate the accuracy of this CNN on the MNIST test dataset.
4. Extract all the parameters from the trained CNN.
5. Define a CNN with the same architecture, but with the BN layers removed.
6. Apply the reparametrization trick to alter the weights and biases of the Conv and FC layers of the original CNN, and load them into the CNN without BN.
7. Evaluate the accuracy of the reparametrized CNN on the MNIST test dataset and compare it to the original.

To make our proof easy to reproduce, we include all the functions and the CNNet class in `CNNet.py`. They can be imported using:
             
             from CNNet import *
             
We have also saved the models `tinyCNN` (with BN) and `NoBN_tinyCNN` (without BN and reparametrized) inside the `Models/` folder.

# Setup & Dataset

We will use Tensorflow to implement the CNNs, and NumPy to perform the reparametrization trick.

In [None]:
import tensorflow as tf
import numpy as np
import time

As stated before, we will use the MNIST dataset, normalized to [0,1].

In [None]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
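As an illustration of what the loader's output looks like, the sketch below scales 8-bit pixel values to [0,1] and one-hot encodes integer labels with NumPy. The helper names `to_unit_range` and `one_hot` are hypothetical and are not part of `CNNet.py` or of the MNIST loader.

```python
import numpy as np

def to_unit_range(pixels_uint8):
    """Scale 8-bit pixel intensities (0..255) to floats in [0, 1]."""
    return pixels_uint8.astype(np.float32) / 255.0

def one_hot(labels, num_classes=10):
    """Turn integer class labels into one-hot rows."""
    encoded = np.zeros((len(labels), num_classes), dtype=np.float32)
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

# Toy data standing in for MNIST pixels and labels.
pixels = np.array([[0, 128, 255]], dtype=np.uint8)
labels = np.array([3, 0, 9])

scaled = to_unit_range(pixels)   # values now lie in [0, 1]
encoded = one_hot(labels)        # shape (3, 10), a single 1 per row
```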

# CNN implementation in Tensorflow

**CNNet** implements a simple CNN with only one Conv layer and two FC layers, with this architecture:
        
        28x28 => Conv1 -> BN -> ReLU -> Flat -> FC1 -> BN_1 -> ReLU -> FC2 => [0-9]
        
        
The details of the network architecture are:
- **Conv1**: 20 Kernels/filters of 5x5, 20 biases. {1x28x28 -> 20x28x28}
- **BN**: 20 betas, gammas, moving averages and variances. {20x28x28 -> 20x28x28}
- **ReLU**: no parameters. {20x28x28 -> 20x28x28}
- **Flat**: {20x28x28 -> 15680}
- **FC1**: 15680\*1000 weights, 1000 biases.  {15680 -> 1000}
- **BN_1**: 1000 betas, gammas, moving averages and variances. {1000 -> 1000}
- **ReLU**: no parameters. {1000 -> 1000}
- **FC2**: 1000\*10 weights, 10 biases.  {1000 -> 10}
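
The sizes in this list can be checked with a few lines of arithmetic. This is a sketch that assumes Conv1 uses stride 1 with 'SAME' padding, which is what a 28x28 -> 28x28 spatial size implies for a 5x5 kernel. The variable names are illustrative only.

```python
# Shape and parameter bookkeeping for the architecture listed above.
# Assumption: stride 1 and 'SAME' padding in Conv1, so 28x28 is preserved.
height, width, in_channels = 28, 28, 1
kernel, filters = 5, 20
fc1_units, classes = 1000, 10

flat = filters * height * width  # number of inputs to FC1: 20*28*28

params = {
    "Conv1": kernel * kernel * in_channels * filters + filters,
    "BN":    4 * filters,    # beta, gamma, moving mean, moving variance
    "FC1":   flat * fc1_units + fc1_units,
    "BN_1":  4 * fc1_units,
    "FC2":   fc1_units * classes + classes,
}

for name, count in params.items():
    print(f"{name:6s} {count:>12,d}")
print("Flat size:", flat)
```

Almost all parameters sit in FC1, so the FC1 + BN_1 pair is where folding BN touches the largest weight matrix.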

    


![CNNwBN.png](attachment:CNNwBN.png)

With this architecture we can use the reparametrization trick in **Conv1 + BN** and in **FC1 + BN_1**.

Our Python class **CNNet** handles everything. It includes the following methods:
- *init*: setup the whole tf graph and session.
- *del*: clean the tf session before deleting the object.
- *CNN_var_init*: initialize weights and biases as tf variables.
- *preproc*: simple preprocessing for the input data (mean subtraction).
- *CNN_feedforward*: define the feedforward operations of the network.
- *batchNorm*: wrapper of Batch Normalization in Tensorflow.
- *save & restore*: handle different models/instances.
- *train*: use MNIST train dataset to train the model.
- *benchmark*: use one of the MNIST dataset parts (train/validation/test) to evaluate accuracy.
- *predict*: perform inference on a single image.

ReLU will be our chosen activation function. Additionally, we initialize weights randomly and biases to a small constant. For all of the definitions we use the standard Tensorflow API.

In [None]:
cnn.save('Models/tinyCNN')

In [None]:
cnn = CNNet('tinyCNN', lr=0.002, loadInstance='Models/tinyCNN')

In [None]:
cnn.train(epochs=5)

# Evaluating CNN with BN

In [None]:
cnn.benchmark('TEST')

After this brief training, our CNN with BN layers yields a **test accuracy of 99.02%**.
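
For readers who want to see what such an accuracy figure measures, here is a minimal NumPy sketch. It assumes accuracy is the fraction of images whose largest output logit matches the one-hot label. The `accuracy` helper is hypothetical and is not the implementation of `CNNet.benchmark`.

```python
import numpy as np

def accuracy(logits, one_hot_labels):
    """Fraction of rows whose largest logit matches the labelled class."""
    predicted = np.argmax(logits, axis=1)
    expected = np.argmax(one_hot_labels, axis=1)
    return float(np.mean(predicted == expected))

# Toy example with 4 samples and 3 classes.
logits = np.array([[2.0, 0.1, -1.0],
                   [0.3, 0.2, 0.9],
                   [0.0, 1.5, 0.4],
                   [1.2, 0.8, 0.1]])
labels = np.array([[1, 0, 0],
                   [0, 0, 1],
                   [0, 0, 1],   # misclassified: the prediction is class 1
                   [1, 0, 0]])

acc = accuracy(logits, labels)
print(acc)  # 0.75
```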

# Extracting parameters from trained CNN

For the sake of completeness, we will be verbose and show all the parameters being extracted from the current model (obtained from the default tf graph):

- Trainable variables

In [None]:
trainVariables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)
trainVariables

- Moving means and variances from BN layers

In [None]:
modelVariables = tf.get_collection(tf.GraphKeys.MODEL_VARIABLES)[2:4] + \
tf.get_collection(tf.GraphKeys.MODEL_VARIABLES)[6:8]
modelVariables

We will save them all inside a dictionary due to its ease of access:

In [None]:
allVariables = {}
for var in trainVariables:
    allVariables[var.name]=var.eval(session=cnn.sess)
for var in modelVariables:
    allVariables[var.name]=var.eval(session=cnn.sess)

In [None]:
allVariables.keys()

# Defining a CNN without BN

This CNN has exactly the same architecture as the previous one, except for the BN layers, which have been removed:

![CNNrep.png](attachment:CNNrep.png)



In [None]:
NoBNcnn = CNNet('NoBN_tinyCNN', lr=0.01, flag_bNorm=False)

Since we haven't trained this network, its initial test accuracy (13.6%) results from the random initialization of the weights, and is therefore close to what picking a class at random for each image would give (10% chance of success).

In [None]:
NoBNcnn.benchmark('TEST')

If we assign the parameters of the trained tinyCNN (without assigning any BN parameters, since there are none in this model), we obtain some improvement in accuracy, but the result is still far from that of the trained network:

In [None]:
NoBNcnn.sess.run(NoBNcnn.W_c1.assign(allVariables['Model/W_c1:0']))  
NoBNcnn.sess.run(NoBNcnn.b_c1.assign(allVariables['Model/b_c1:0'])) 
NoBNcnn.sess.run(NoBNcnn.W_fc1.assign(allVariables['Model/W_fc1:0'])) 
NoBNcnn.sess.run(NoBNcnn.b_fc1.assign(allVariables['Model/b_fc1:0'])) 
NoBNcnn.sess.run(NoBNcnn.W_fc2.assign(allVariables['Model/W_fc2:0'])) 
NoBNcnn.sess.run(NoBNcnn.b_fc2.assign(allVariables['Model/b_fc2:0'])) 

# Applying the trick

Now we apply the reparametrization trick described in the paper, which is valid for both Conv and FC layers. It modifies the weights and biases using the parameters of the BN layer, so that the reparametrized Conv/FC layer absorbs the operations of the BN layer:

In [None]:
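def ReparamTrickBN(W, b, beta, gamma, mu, sigma2):
    """Fold the inference-time BN parameters into the preceding layer's W and b."""
    # gamma / sqrt(sigma2) has one entry per output channel and broadcasts
    # over the last axis of W (output filters for Conv, output units for FC).
    W_rep = W * gamma / np.sqrt(sigma2)
    b_rep = (b - mu) * gamma / np.sqrt(sigma2) + beta
    return W_rep, b_rep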
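
For reference, the two formulas in `ReparamTrickBN` follow from writing the layer output and the inference-time BN transform together. This is a sketch under the assumption that BN uses the stored moving mean $\mu$ and moving variance $\sigma^2$ at test time. The symbols $x$, $z$, $y$ and $\epsilon$ are introduced here for illustration, and every operation involving $\gamma$, $\beta$, $\mu$ and $\sigma^2$ acts per output channel.

$$
\begin{aligned}
z &= W x + b \\
y &= \gamma \, \frac{z - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta \\
  &= \underbrace{\frac{\gamma}{\sqrt{\sigma^2 + \epsilon}} \, W}_{W_{rep}} \, x
   + \underbrace{\frac{\gamma \, (b - \mu)}{\sqrt{\sigma^2 + \epsilon}} + \beta}_{b_{rep}}
\end{aligned}
$$

For a Conv layer, $W x$ stands for the convolution, and the same algebra holds because the convolution is linear in $W$. The function above corresponds to $\epsilon = 0$. Batch normalization implementations usually add a small $\epsilon$ to the variance for numerical stability, so if the BN layers in `CNNet` do this, using the same $\epsilon$ inside the square root would make the two networks match exactly rather than approximately.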

First we apply it to Conv1:

In [None]:
W_rep_c1, b_rep_c1 = ReparamTrickBN(
    W=allVariables['Model/W_c1:0'],
    b=allVariables['Model/b_c1:0'],
    beta=allVariables['BatchNorm/beta:0'],
    gamma=allVariables['BatchNorm/gamma:0'],
    mu=allVariables['BatchNorm/moving_mean:0'],
    sigma2=allVariables['BatchNorm/moving_variance:0'],
)
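
As a sanity check on the Conv case, the sketch below compares "convolution followed by BN" against "convolution with reparametrized parameters" on random data. It assumes the kernel layout `[height, width, in_channels, out_channels]` and, for brevity, uses a naive stride-1 convolution without padding. `conv2d_valid` is a hypothetical helper, not part of `CNNet.py`.

```python
import numpy as np

def conv2d_valid(x, W, b):
    """Naive stride-1, no-padding convolution (cross-correlation).
    x: [H, W, C_in], W: [kh, kw, C_in, C_out], b: [C_out]."""
    kh, kw, _, c_out = W.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((out_h, out_w, c_out))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i:i + kh, j:j + kw, :]
            # Contract the patch against every filter at once.
            out[i, j, :] = np.tensordot(patch, W, axes=([0, 1, 2], [0, 1, 2])) + b
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 1))            # toy image, smaller than 28x28
W = rng.normal(size=(5, 5, 1, 20))
b = rng.normal(size=20)
beta, gamma, mu = (rng.normal(size=20) for _ in range(3))
sigma2 = rng.uniform(0.5, 2.0, size=20)

# Inference-time BN per output channel (no epsilon, as in ReparamTrickBN).
with_bn = gamma * (conv2d_valid(x, W, b) - mu) / np.sqrt(sigma2) + beta

# The per-channel factor broadcasts over the last axis of the kernel.
scale = gamma / np.sqrt(sigma2)
W_rep = W * scale
b_rep = (b - mu) * scale + beta
without_bn = conv2d_valid(x, W_rep, b_rep)

max_diff = np.max(np.abs(with_bn - without_bn))
print(max_diff)  # only floating-point rounding error remains
```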

And then to FC1:

In [None]:
W_rep_fc1, b_rep_fc1 = ReparamTrickBN(
    W=allVariables['Model/W_fc1:0'],
    b=allVariables['Model/b_fc1:0'],
    beta=allVariables['BatchNorm_1/beta:0'],
    gamma=allVariables['BatchNorm_1/gamma:0'],
    mu=allVariables['BatchNorm_1/moving_mean:0'],
    sigma2=allVariables['BatchNorm_1/moving_variance:0'],
)
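
The FC case can be checked in the same way on toy data. The sketch below also measures how much the result changes if BN adds an epsilon to the variance while the reparametrization omits it. The value `eps = 1e-3` is an assumption made for illustration, not a value read from `CNNet.py`.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_out, batch = 12, 5, 4            # toy sizes, not the real FC1 shape
x = rng.normal(size=(batch, n_in))
W = rng.normal(size=(n_in, n_out))
b = rng.normal(size=n_out)
beta, gamma, mu = (rng.normal(size=n_out) for _ in range(3))
sigma2 = rng.uniform(0.5, 2.0, size=n_out)
eps = 1e-3                               # assumed BN epsilon

# FC followed by inference-time BN.
with_bn = gamma * ((x @ W + b) - mu) / np.sqrt(sigma2 + eps) + beta

# Same computation with BN folded into the FC parameters.
scale = gamma / np.sqrt(sigma2 + eps)
W_rep = W * scale                        # scales column j of W by scale[j]
b_rep = (b - mu) * scale + beta
without_bn = x @ W_rep + b_rep
print(np.allclose(with_bn, without_bn))  # True

# Folding without epsilon leaves a small, nonzero deviation.
scale0 = gamma / np.sqrt(sigma2)
no_eps = x @ (W * scale0) + (b - mu) * scale0 + beta
deviation = np.max(np.abs(with_bn - no_eps))
print(deviation)
```

A deviation of this size shifts the logits slightly. Whether it can change any predicted class depends on the trained network and the data.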

Since we now have the reparametrized Weights and Biases, we can finally load them into the CNN without BN:

In [None]:
NoBNcnn.sess.run(NoBNcnn.W_c1.assign(W_rep_c1))  
NoBNcnn.sess.run(NoBNcnn.b_c1.assign(b_rep_c1)) 
NoBNcnn.sess.run(NoBNcnn.W_fc1.assign(W_rep_fc1)) 
NoBNcnn.sess.run(NoBNcnn.b_fc1.assign(b_rep_fc1)) 
NoBNcnn.sess.run(NoBNcnn.W_fc2.assign(allVariables['Model/W_fc2:0'])) 
NoBNcnn.sess.run(NoBNcnn.b_fc2.assign(allVariables['Model/b_fc2:0'])) 

At this point we save the model to be able to reproduce our results if desired.

In [None]:
NoBNcnn.save('Models/NoBN_tinyCNN')

# Proving same accuracy

With the reparametrized weights and biases loaded into Conv1 and FC1, we evaluate the test accuracy once again by calling `NoBNcnn.benchmark('TEST')`.

The final test accuracy for the CNN without BN is **99.02%**, which is exactly the **same value** that the CNN with BN yielded (see section 4). We have successfully modified the weights and biases of the Conv and FC layers to absorb the subsequent BN layers.

**This serves as proof of the reparametrization trick for Batch Normalization. QED**

# EXTRA: Performance gain

In [None]:
from CNNet import CNNet

# Training a CNN with BN from scratch

To speed up training, we will perform two phases of 5 epochs each: the first with learning rate 0.01 and the second with learning rate 0.002 for finer tuning.

In [None]:
cnn = CNNet('tinyCNN', lr=0.01)
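
The two-phase schedule described above can be pictured as a simple lookup from epoch index to learning rate. This is a sketch only: `learning_rate_for_epoch` is a hypothetical helper, and it assumes each phase lasts 5 epochs. In the notebook the rate is instead set through the `lr` argument when constructing `CNNet`.

```python
def learning_rate_for_epoch(epoch, phase_length=5, rates=(0.01, 0.002)):
    """Return the learning rate for a 0-based epoch index.

    Epochs beyond the last phase keep the final (smallest) rate.
    """
    phase = min(epoch // phase_length, len(rates) - 1)
    return rates[phase]

# Ten epochs in total: five coarse ones, then five for finer tuning.
schedule = [learning_rate_for_epoch(e) for e in range(10)]
print(schedule)
```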