# Problem 6 - Batch Normalization, Dropout, MNIST

Batch normalization and dropout are both used as effective regularization techniques. However, it is not clear which one should be preferred, or whether their benefits add up when they are used together. In this problem we will compare batch normalization, dropout, and their combination using MNIST and LeNet-5 (see, e.g., http://yann.lecun.com/exdb/lenet/). LeNet-5 is one of the earliest convolutional neural networks developed for image classification, and implementations are available in all major frameworks. You can refer to the Lecture 3 slides for the definitions of standardization and batch normalization.


1. Explain the terms co-adaptation and internal covariate shift. Use examples if needed. You may need to refer to the two papers mentioned below to answer this question. (Papers are on my iPad.)

2. Batch normalization is traditionally used in hidden layers; for the input layer, standard normalization is used. In standard normalization, the mean and standard deviation are calculated using the entire training dataset, whereas in batch normalization these statistics are calculated for each mini-batch. Train LeNet-5 with standard normalization of the input and batch normalization for the hidden layers. What are the learned batch norm parameters for each layer?
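
The difference between the two kinds of statistics can be illustrated with a minimal NumPy sketch. The synthetic data, the array shapes, and the helper names `standardize` and `batch_norm` are assumptions made for illustration; this is not the Keras implementation, and it covers only the training-mode forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for a training set: 1000 samples, 4 features.
train = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))

# Standard normalization: one mean/std per feature, computed once
# from the entire training set and then reused for every batch.
data_mean = train.mean(axis=0)
data_std = train.std(axis=0)

def standardize(x):
    return (x - data_mean) / data_std

def batch_norm(x, gamma, beta, eps=1e-5):
    """Training-mode batch norm: statistics come from the mini-batch itself."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    # gamma (scale) and beta (shift) are the learned parameters.
    return gamma * x_hat + beta

batch = train[:256]
gamma = np.ones(4)   # initial scale
beta = np.zeros(4)   # initial shift

standardized = standardize(batch)
normalized = batch_norm(batch, gamma, beta)

print("standardized batch mean:", standardized.mean(axis=0))
print("batch-normalized mean:  ", normalized.mean(axis=0))
```

With `gamma = 1` and `beta = 0`, the batch-normalized output has zero mean within each mini-batch by construction, while the standardized batch is only approximately centered because its statistics come from the full dataset.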

In [6]:
import os
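# A `! export` runs in a subshell and does not change this process, so set the variable directly.
os.environ['TMPDIR'] = os.path.expanduser('~/tmp')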
os.environ['JOBLIB_TEMP_FOLDER'] = '/tmp'
from tensorflow.keras.layers import Dense, Input, Flatten, Conv2D, BatchNormalization, MaxPool2D, Normalization
from tensorflow.keras.models import Model
import tensorflow as tf
import tensorflow.keras as keras

class LeNet5_Norm(tf.keras.Model):
  def __init__(self, norm_layer, *args, **kwargs):
    super(LeNet5_Norm, self).__init__()
    self.conv1 = Conv2D(filters=6, kernel_size=(5,5), padding="same")
    self.norm1 = norm_layer(*args, **kwargs)
    self.relu = tf.nn.relu  # `relu` was never imported; use TensorFlow's op directly
    self.max_pool2x2 = MaxPool2D(pool_size=(2,2))
    self.conv2 = Conv2D(filters=16, kernel_size=(5,5), padding="same")
    self.norm2 = norm_layer(*args, **kwargs)
    self.flatten = Flatten()
    self.fc1 = Dense(units=120)
    self.norm3 = norm_layer(*args, **kwargs)
    self.fc2 = Dense(units=84)
    self.norm4 = norm_layer(*args, **kwargs)
    self.fc3 = Dense(units=10, activation="softmax")
  def call(self, input_tensor):
    conv1 = self.conv1(input_tensor)
    conv1 = self.norm1(conv1)
    conv1 = self.relu(conv1)
    maxpool1 = self.max_pool2x2(conv1)
    conv2 = self.conv2(maxpool1)
    conv2 = self.norm2(conv2)
    conv2 = self.relu(conv2)
    maxpool2 = self.max_pool2x2(conv2)
    flatten = self.flatten(maxpool2)
    fc1 = self.fc1(flatten)
    fc1 = self.norm3(fc1)
    fc1 = self.relu(fc1)
    fc2 = self.fc2(fc1)
    fc2 = self.norm4(fc2)
    fc2 = self.relu(fc2)
    fc3 = self.fc3(fc2)
    return fc3

# load dataset, using cifar10 to show greater improvement in accuracy
(trainX, trainY), (testX, testY) = keras.datasets.cifar10.load_data()

normalization_layer = Normalization()
normalization_layer.adapt(trainX)

input_layer = Input(shape=(32,32,3,))
x = LeNet5_Norm(BatchNormalization)(normalization_layer(input_layer))

model = Model(inputs=input_layer, outputs=x)

model.compile(optimizer="adam", loss=tf.keras.losses.SparseCategoricalCrossentropy(), metrics=["acc"])

history = model.fit(x=trainX, y=trainY, batch_size=256, epochs=10, validation_data=(testX, testY))

Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz

OSError: [Errno 28] No space left on device
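
Once training succeeds, the learned batch norm parameters asked for above could be read off the model along these lines. This is a sketch under assumptions: `collect_batchnorm_params` is a hypothetical helper name, and the small `toy` model exists only to show the call. It relies on the `gamma`, `beta`, `moving_mean`, and `moving_variance` attributes of Keras `BatchNormalization` layers.

```python
import tensorflow as tf

def collect_batchnorm_params(layer, prefix=""):
    """Recursively gather gamma/beta and moving statistics from every
    BatchNormalization layer reachable from `layer`."""
    found = {}
    name = prefix + layer.name
    if isinstance(layer, tf.keras.layers.BatchNormalization):
        found[name] = {
            "gamma": layer.gamma.numpy(),            # learned scale
            "beta": layer.beta.numpy(),              # learned shift
            "moving_mean": layer.moving_mean.numpy(),
            "moving_variance": layer.moving_variance.numpy(),
        }
    # Models (functional or subclassed) expose their sub-layers via `.layers`;
    # plain layers have no such attribute, so the loop is skipped for them.
    for child in getattr(layer, "layers", []):
        found.update(collect_batchnorm_params(child, prefix=name + "/"))
    return found

# Hypothetical toy model, used here only to demonstrate the helper.
toy = tf.keras.Sequential([
    tf.keras.layers.Dense(4),
    tf.keras.layers.BatchNormalization(),
])
toy(tf.zeros((2, 3)))  # call once so the layer variables are created

bn_params = collect_batchnorm_params(toy)
for layer_name, values in bn_params.items():
    print(layer_name, values["gamma"].shape, values["beta"].shape)
```

Applied to the notebook's `model`, the same call should descend into the nested `LeNet5_Norm` and return one entry per batch norm layer; the per-layer `gamma` and `beta` arrays are also what a violin plot of the parameter distributions would take as input.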


3. Next, instead of standard normalization, use batch normalization for the input layer as well and train the network. Plot the distribution of learned batch norm parameters for each layer (including the input) using violin plots. Compare the train/test accuracy and loss for the two cases. Did batch normalization for the input layer improve performance?


4. Train the network without batch normalization, but this time use dropout. For the hidden layers use a dropout probability of 0.5, and for the input layer use 0.2. Compare the test accuracy using dropout to the test accuracy obtained using batch normalization in parts 2 and 3.
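
To make the two dropout probabilities concrete, here is a minimal NumPy sketch of inverted dropout (the variant in which surviving units are rescaled during training, as Keras' `Dropout` layer does). The function name `dropout` and the all-ones test array are assumptions for illustration only.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate` and
    rescale the survivors so the expected activation is unchanged."""
    if not training or rate == 0.0:
        return x  # dropout is a no-op at inference time
    keep_prob = 1.0 - rate
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones((1000, 100))  # hypothetical activations

dropped_input = dropout(activations, rate=0.2, rng=rng)   # input-layer rate
dropped_hidden = dropout(activations, rate=0.5, rng=rng)  # hidden-layer rate

print("fraction zeroed (input): ", np.mean(dropped_input == 0))
print("fraction zeroed (hidden):", np.mean(dropped_hidden == 0))
print("mean after dropout (hidden):", dropped_hidden.mean())
```

Here `rate` is the probability of dropping a unit, so 0.2 removes roughly a fifth of the inputs and 0.5 roughly half of the hidden units, while the rescaling keeps the mean activation close to its original value.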


5. Now train the network using both batch normalization and dropout. How does the performance (test accuracy) of the network compare with the cases of dropout alone and batch normalization alone?