<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Batch-Normalization" data-toc-modified-id="Batch-Normalization-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Batch Normalization</a></span></li><li><span><a href="#Batch-Normalization-layer" data-toc-modified-id="Batch-Normalization-layer-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Batch Normalization layer</a></span></li><li><span><a href="#LeNet-5-architecture" data-toc-modified-id="LeNet-5-architecture-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>LeNet-5 architecture</a></span></li><li><span><a href="#Parameter-initialization" data-toc-modified-id="Parameter-initialization-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Parameter initialization</a></span></li><li><span><a href="#Using-Softmax-cross-entropy-Loss" data-toc-modified-id="Using-Softmax-cross-entropy-Loss-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Using Softmax cross-entropy Loss</a></span></li></ul></div>

## Batch Normalization
<i style='color:blue'> Internal Covariate Shift</i> is defined as the change in the distribution of network activations due to the change in network parameters during training. 
<img src='../images/batch.jpg'>
<p>Source: <a href='https://arxiv.org/pdf/1502.03167.pdf'>Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift</a></p>
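
As a rough illustration of the transform described in the paper cited above, the sketch below normalizes each feature of a mini-batch with the batch mean and variance, then applies a scale and a shift. The names `batch_norm_forward`, `gamma`, `beta`, and `eps` are illustrative assumptions, not part of this notebook, and NumPy is imported as `onp` so that it does not shadow the `np` imported from `mxnet` below. Only the training-time forward pass is shown.

```python
import numpy as onp

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize a (batch, features) array per feature, then scale and shift."""
    mean = x.mean(axis=0)                      # per-feature batch mean
    var = x.var(axis=0)                        # per-feature batch variance
    x_hat = (x - mean) / onp.sqrt(var + eps)   # zero mean, roughly unit variance
    return gamma * x_hat + beta                # learned scale and shift

rng = onp.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(200, 4))   # shifted, spread-out inputs
y = batch_norm_forward(x, gamma=onp.ones(4), beta=onp.zeros(4))

print(y.mean(axis=0))   # close to 0 for every feature
print(y.std(axis=0))    # close to 1 for every feature
```

With `gamma` set to ones and `beta` to zeros the output is simply the normalized batch; during training both would be learned parameters.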

For more on Batch Normalization, see:
<p><a href='https://arxiv.org/pdf/1502.03167.pdf'>Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift</a></p>
<p><a href='https://arxiv.org/pdf/1806.02375.pdf'>Understanding Batch Normalization</a></p>


In [15]:
import d2l
from mxnet import autograd, np, npx, init
from mxnet.gluon import nn
import mxnet as mx
from mxnet import gluon
npx.set_np()

In [16]:
ctx = mx.cpu()

In [17]:
batch_size = 200
num_inputs = 784
num_outputs = 10
num_examples = 60000

In [18]:
def transform(data,labels):
    return data.astype('float32')/255, labels.astype('float32')
train_data = mx.gluon.data.DataLoader(mx.gluon.data.vision.MNIST(train=True,transform=transform),
                                      batch_size, shuffle=True)
test_data = mx.gluon.data.DataLoader(mx.gluon.data.vision.MNIST(train=False, transform=transform),
                                     batch_size, shuffle=False)


## Batch Normalization layer

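A batch normalization layer is <b>placed before the activation layer</b>, as in the authors' original paper, rather than after it. Each block of the network in this notebook therefore follows the order: convolution or dense layer, then batch normalization, then ReLU.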
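
To make the ordering concrete, here is a minimal NumPy sketch of one dense block: the linear output is normalized first and only then passed through ReLU. The helper names `batch_norm` and `relu` are hypothetical, the learned scale and shift are omitted for brevity, and NumPy is imported as `onp` to avoid clashing with the `np` from `mxnet`.

```python
import numpy as onp

def batch_norm(z, eps=1e-5):
    # Per-feature normalization over the batch axis (scale and shift omitted).
    return (z - z.mean(axis=0)) / onp.sqrt(z.var(axis=0) + eps)

def relu(z):
    return onp.maximum(z, 0.0)

rng = onp.random.default_rng(1)
x = rng.normal(size=(200, 8))   # a batch of 200 examples with 8 features
w = rng.normal(size=(8, 3))     # weights of a toy dense layer with 3 units

z = x @ w                       # pre-activation output of the dense layer
h = relu(batch_norm(z))         # normalize first, then apply the activation
print(h.shape)
```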


## LeNet-5 architecture
We are going to integrate batch normalization into the LeNet-5 architecture displayed below.
<img src='../images/lenet5.jpg'>
(Source: Benjamin Planche and Eliot Andres, <i>Hands-On Computer Vision with TensorFlow 2: Leverage deep learning to create powerful image processing apps with TensorFlow 2.0 and Keras</i>, page 94.)
  

In [20]:
lenet = nn.Sequential()
lenet.add(nn.Conv2D(6, kernel_size=5,padding=2),
        nn.BatchNorm(),
        nn.Activation('relu'),
        nn.MaxPool2D(pool_size=2, strides=2),
        nn.Conv2D(16, kernel_size=5),
        nn.BatchNorm(),
        nn.Activation('relu'),
        nn.MaxPool2D(pool_size=2, strides=2),
        nn.Dense(120),
        nn.BatchNorm(),
        nn.Activation('relu'),
        nn.Dense(84),
        nn.BatchNorm(),
        nn.Activation('relu'),
        nn.Dense(10))
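
It can help to trace the spatial size of a 28x28 MNIST image through the layers defined above. The sketch below uses the standard output-size formula for convolution and pooling; the helpers `conv_out` and `pool_out` are illustrative names, not Gluon functions. Batch normalization and ReLU leave the shape unchanged, so they are skipped.

```python
def conv_out(size, kernel, padding=0, stride=1):
    # Output width/height of a convolution along one spatial axis.
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, pool, stride):
    # Output width/height of a pooling layer along one spatial axis.
    return (size - pool) // stride + 1

size = 28                                    # MNIST images are 28x28
size = conv_out(size, kernel=5, padding=2)   # Conv2D(6, kernel_size=5, padding=2)
size = pool_out(size, pool=2, stride=2)      # MaxPool2D(pool_size=2, strides=2)
size = conv_out(size, kernel=5)              # Conv2D(16, kernel_size=5)
size = pool_out(size, pool=2, stride=2)      # MaxPool2D(pool_size=2, strides=2)

flat_features = 16 * size * size             # inputs seen by Dense(120)
print(size, flat_features)                   # prints: 5 400
```

`nn.Dense` flattens its input by default, which is why no explicit flatten layer is needed between the last pooling layer and `nn.Dense(120)`.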

## Parameter initialization


In [21]:
lenet.initialize(mx.init.Xavier())
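
As a rough sketch of what Xavier initialization does, the code below computes the sampling range for one layer, assuming the defaults documented for `mx.init.Xavier` (uniform sampling, `factor_type='avg'`, `magnitude=3`). The function `xavier_uniform_bound` is a hypothetical helper, and the example uses the first dense layer, which has 400 inputs (16 channels of 5x5 feature maps) and 120 outputs.

```python
import math
import random

def xavier_uniform_bound(fan_in, fan_out, magnitude=3.0):
    # With factor_type='avg' the factor is the mean of fan_in and fan_out;
    # weights are then drawn uniformly from [-bound, bound].
    factor = (fan_in + fan_out) / 2.0
    return math.sqrt(magnitude / factor)

bound = xavier_uniform_bound(fan_in=400, fan_out=120)

random.seed(0)
weights = [random.uniform(-bound, bound) for _ in range(1000)]
print(round(bound, 4), min(weights), max(weights))
```

Under these assumptions the bound equals `sqrt(6 / (fan_in + fan_out))`, so wider layers start with smaller weights.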

## Using Softmax cross-entropy Loss

In [22]:
softmax_cross_entropy=gluon.loss.SoftmaxCrossEntropyLoss()
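
For intuition, the following sketch computes softmax cross-entropy for a single example from raw scores (logits) and an integer class label, which is how `SoftmaxCrossEntropyLoss` is used in this notebook. The function name `softmax_cross_entropy_single` is an illustrative assumption; the log-sum-exp form is used for numerical stability.

```python
import math

def softmax_cross_entropy_single(logits, label):
    # loss = -log(softmax(logits)[label]) = logsumexp(logits) - logits[label]
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[label]

# Ten equal scores: the model is guessing, so the loss is log(10).
uniform = softmax_cross_entropy_single([0.0] * 10, label=3)

# A large score on the correct class gives a loss near zero.
confident = softmax_cross_entropy_single([0.0, 8.0, 0.0], label=1)

print(uniform, confident)
```

A loss near `log(10)`, about 2.30, is therefore what an untrained 10-class classifier that guesses uniformly would produce.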

In [23]:
optimizer=gluon.Trainer(params=lenet.collect_params(),optimizer='adam',
                        optimizer_params= {'learning_rate': 0.001})

In [24]:
def evaluate_accuracy(data_iterator, net):
    acc = mx.metric.Accuracy()
    for i, (data, label) in enumerate(data_iterator):
        data = data.as_in_context(ctx)
        label = label.as_in_context(ctx)
        data = data.reshape((-1,1, 28, 28))
        output = net(data)
        predictions = output.argmax(axis=1)
        acc.update(preds=predictions, labels=label)
    return acc.get()[1]

In [25]:
evaluate_accuracy(test_data,lenet)

0.0923

In [26]:
epochs = 5
moving_loss = 0.
train_acc=[]
test_acc=[]
for e in range(epochs):
    cumulative_loss = 0
    for i, (data, label) in enumerate(train_data):
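        # Move the batch to the target device and reshape it from
        # (batch, 28, 28, 1) to (batch, 1, 28, 28), the layout Conv2D expects.
        data = data.as_in_context(ctx).reshape((-1, 1, 28, 28))
        label = label.as_in_context(ctx)
        with autograd.record():
            output = lenet(data)
            loss = softmax_cross_entropy(output, label)
        loss.backward()
        # step(batch_size) rescales the gradient by 1/batch_size.
        optimizer.step(batch_size)
        # Accumulates one batch-mean loss value per batch.
        cumulative_loss += np.mean(loss)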
    
    test_accuracy = evaluate_accuracy(test_data, lenet)
    train_accuracy = evaluate_accuracy(train_data, lenet)
    train_acc.append(train_accuracy)
    test_acc.append(test_accuracy)
    print("Epoch %s. Loss: %s, Train_acc %s, Test_acc %s" % (e, cumulative_loss/num_examples, 
                                                             train_accuracy, test_accuracy)) 
    

Epoch 0. Loss: 0.0011684117, Train_acc 0.98335, Test_acc 0.9817
Epoch 1. Loss: 0.00027387743, Train_acc 0.9907833333333333, Test_acc 0.985
Epoch 2. Loss: 0.00017864369, Train_acc 0.9939166666666667, Test_acc 0.9891
Epoch 3. Loss: 0.00012929647, Train_acc 0.99495, Test_acc 0.9878
Epoch 4. Loss: 9.661367e-05, Train_acc 0.9961166666666667, Test_acc 0.9885
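
The loss printed above is the sum of the per-batch mean losses divided by `num_examples`, so it is smaller than the average loss per example by a factor of `batch_size`. As a small sketch, the values can be rescaled as follows; the `reported` list simply copies the numbers from the output above.

```python
batch_size = 200
num_examples = 60000
num_batches = num_examples // batch_size   # 300 batches per epoch

# Loss values copied from the training output above.
reported = [0.0011684117, 0.00027387743, 0.00017864369,
            0.00012929647, 9.661367e-05]

# reported = sum_of_batch_means / num_examples, while the average loss per
# example is sum_of_batch_means / num_batches (all batches have equal size).
per_example = [r * num_examples / num_batches for r in reported]

for epoch, value in enumerate(per_example):
    print("Epoch %d: average loss per example %.4f" % (epoch, value))
```

Dividing `cumulative_loss` by the number of batches instead of `num_examples` inside the training loop would report this per-example average directly.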
