## BatchNorm

When we train a deep neural network, the scale of the activations can differ drastically from the first layer to the later layers, which makes it hard to pick a single learning rate that works well throughout training. We also need to be cautious about the weight initialization strategy and about using higher learning rates. What we need is a mechanism that standardizes the inputs to each layer, from the first layer deep into the network, so that training becomes faster and less sensitive to weight initialization and to vanishing or exploding gradients. That is where BatchNorm comes into the picture.

In this exercise, we will extend our layer API to support batch normalization.

References

https://gluon.mxnet.io/chapter04_convolutional-neural-networks/cnn-batch-norm-scratch.html

https://wiseodd.github.io/techblog/2016/07/04/batchnorm/

### Forward Propagation
Unlike dropout, batch norm is usually applied before the activation layer rather than after it. The main idea is to normalize the output of the linear layer (the input to batch norm) over the mini-batch so that each feature has zero mean and unit standard deviation, and then to scale and shift the result with the learnable parameters gamma and beta.


<img src="files/bforward.png">
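
The normalization step can be illustrated with a minimal NumPy sketch. The function name `batchnorm_forward` and the synthetic input are assumptions made for this illustration only; the layer class further down is the actual implementation used in this exercise.

```python
import numpy as np

def batchnorm_forward(X, gamma, beta, eps=1e-8):
    """Normalize each feature (column) of X over the batch, then scale and shift."""
    mean = X.mean(axis=0)                      # per-feature batch mean
    var = X.var(axis=0)                        # per-feature batch variance
    X_norm = (X - mean) / np.sqrt(var + eps)   # zero mean, unit variance per feature
    return gamma * X_norm + beta               # learnable scale and shift

# Hypothetical input: 1000 samples, 4 features, far from zero mean / unit variance
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))

out = batchnorm_forward(X, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0))  # close to 0 for every feature
print(out.std(axis=0))   # close to 1 for every feature
```

With gamma set to ones and beta to zeros the layer only normalizes; during training these two parameters let the network undo the normalization where that helps.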

### Backward Propagation

<img src="files/bback.png">
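
For reference, the gradients can be written out in closed form. This is a sketch of the standard batch norm gradients, in notation chosen here rather than taken from the figure: $m$ is the batch size, $\mu$ and $\sigma^2$ are the batch mean and variance, $\hat{x}_i = (x_i - \mu)/\sqrt{\sigma^2 + \epsilon}$ is the normalized input, and $y_i = \gamma \hat{x}_i + \beta$ is the output.

$$
\frac{\partial L}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial L}{\partial y_i},
\qquad
\frac{\partial L}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial L}{\partial y_i}\,\hat{x}_i
$$

$$
\frac{\partial L}{\partial x_i} = \frac{\gamma}{m\sqrt{\sigma^2 + \epsilon}}
\left( m\,\frac{\partial L}{\partial y_i}
- \sum_{j=1}^{m} \frac{\partial L}{\partial y_j}
- \hat{x}_i \sum_{j=1}^{m} \frac{\partial L}{\partial y_j}\,\hat{x}_j \right)
$$

The last expression is the compact form obtained after substituting the gradients with respect to $\mu$ and $\sigma^2$; all sums run over the batch dimension, separately for each feature.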

#### Besides that, at test time we want to use the mean and variance of the complete dataset instead of those of the mini-batches. In the implementation, we therefore accumulate a moving (running) mean and variance during training and use them for testing.
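
A small sketch of how such running statistics behave, assuming synthetic data and a batch size of 100 (both chosen only for this illustration): an exponential moving average of the per-batch statistics ends up close to the statistics of the whole dataset.

```python
import numpy as np

# Synthetic "dataset" (an assumption for this sketch): 5000 samples, 3 features
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=(5000, 3))

momentum = 0.9
moving_mean = None
moving_var = None

for start in range(0, len(data), 100):
    batch = data[start:start + 100]
    batch_mean = batch.mean(axis=0)
    batch_var = batch.var(axis=0)
    if moving_mean is None:
        # first batch initializes the running statistics
        moving_mean, moving_var = batch_mean, batch_var
    else:
        # exponential moving average: keep 90% of the old value, add 10% of the new one
        moving_mean = momentum * moving_mean + (1 - momentum) * batch_mean
        moving_var = momentum * moving_var + (1 - momentum) * batch_var

print("running mean :", moving_mean)
print("dataset mean :", data.mean(axis=0))
print("running var  :", moving_var)
print("dataset var  :", data.var(axis=0))
```

A higher momentum gives smoother estimates that react more slowly to recent batches; 0.9 matches the default used by the layer below.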

In [20]:
import numpy as np
from DeepLearnerBase import Layer 
import copy

In [50]:
class BathNorm(Layer):
    def __init__(self, momentum = 0.9,eps =1e-8):
        self.momentum = momentum
        self.moving_mean = None
        self.moving_varience = None  
        self.eps = eps
        
    def setup(self, optimizer,loss):
        # as in the references: gamma (scale) starts at 1 and beta (shift) at 0
        self.gamma = np.ones(self.inputshape)
        self.beta = np.zeros(self.inputshape)
        
        # parameter optimizers
        self.gamma_opt  = copy.copy(optimizer)
        self.beta_opt = copy.copy(optimizer)
    
    @property
    def shape(self):
        return (self.inputshape ,self.outputshape())
        
    def forward(self, X, training = True): 
        if(self.moving_mean is None) :
            self.moving_mean = np.mean(X,axis=0)
            self.moving_varience = np.var(X,axis =0)
            
        if(training):
            mean =  np.mean(X,axis=0)
            varience = np.var(X,axis =0)
            
            self.moving_mean = (self.momentum * self.moving_mean) + ((1-self.momentum) * mean)
            self.moving_varience = (self.momentum * self.moving_varience) + ((1-self.momentum) * varience)
        else:
            #in the testing process, we want to use the mean and variance of the complete dataset
            mean = self.moving_mean 
            varience = self.moving_varience
            
        #storing it for backward pass
        self.X_centered = X- mean
        # note: despite its name, std_dev holds the reciprocal 1/sqrt(variance + eps)
        self.std_dev = 1/(np.sqrt(varience+ self.eps))
        
        X_norm = self.X_centered * self.std_dev
        
        #scale and shift
        out =  (self.gamma * X_norm) + self.beta
        return out
        
    
    def backward(self, grad):
        # Keep the gamma used in the forward pass; self.gamma is updated below
        gamma = self.gamma

        X_norm = self.X_centered * self.std_dev
        grad_gamma = np.sum(grad * X_norm, axis=0)
        grad_beta = np.sum(grad, axis=0)

        self.gamma = self.gamma_opt.update(self.gamma, grad_gamma)
        self.beta = self.beta_opt.update(self.beta, grad_beta)

        batch_size = grad.shape[0]

        # The gradient of the loss with respect to the layer inputs (use weights and statistics from forward pass)
        accum_grad = (1 / batch_size) * gamma * self.std_dev * (
            batch_size * grad
            - np.sum(grad, axis=0)
            - self.X_centered * self.std_dev**2 * np.sum(grad * self.X_centered, axis=0)
            )

        return accum_grad
    
    def outputshape(self):
        return self.inputshape 
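
One way to gain confidence in the backward formula is a numerical gradient check. The sketch below is standalone and does not use the `Layer` API: `bn_forward`, `bn_backward` and `numerical_grad` are helper names introduced only for this check, and the scalar loss `sum(W * out)` with random `W` is an arbitrary choice that makes the upstream gradient equal to `W`.

```python
import numpy as np

def bn_forward(X, gamma, beta, eps=1e-8):
    X_centered = X - X.mean(axis=0)
    inv_std = 1.0 / np.sqrt(X.var(axis=0) + eps)
    out = gamma * X_centered * inv_std + beta
    return out, (X_centered, inv_std)

def bn_backward(grad, gamma, cache):
    # Same expression as in the layer's backward method (inv_std plays the role of std_dev)
    X_centered, inv_std = cache
    m = grad.shape[0]
    return (1.0 / m) * gamma * inv_std * (
        m * grad
        - np.sum(grad, axis=0)
        - X_centered * inv_std**2 * np.sum(grad * X_centered, axis=0)
    )

def numerical_grad(f, X, h=1e-5):
    # Central differences, one input entry at a time
    num = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        old = X[idx]
        X[idx] = old + h
        plus = f(X)
        X[idx] = old - h
        minus = f(X)
        X[idx] = old
        num[idx] = (plus - minus) / (2 * h)
    return num

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
gamma = rng.normal(size=3)
beta = rng.normal(size=3)
W = rng.normal(size=(6, 3))  # defines the scalar loss L = sum(W * out), so dL/dout = W

def loss(X_):
    return np.sum(W * bn_forward(X_, gamma, beta)[0])

out, cache = bn_forward(X, gamma, beta)
analytic = bn_backward(W, gamma, cache)
numeric = numerical_grad(loss, X)
max_err = np.max(np.abs(analytic - numeric))
print("max abs difference:", max_err)
```

A very small difference between the analytic and the numerical gradient indicates that the closed-form expression is consistent with the forward pass.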

In [51]:
from scipy.io import loadmat
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from DeepLearnerBase import Sequential, Dense, Activation, CrossEntropyForSoftMax,relu,softmax,SGD
%matplotlib inline

In [52]:
data = loadmat("data/handwritten.mat")  # forward slash avoids the invalid "\h" escape and works on all platforms
print(data['X'].shape)
print(data['y'].shape)

(5000, 400)
(5000, 1)


In [53]:
X = data['X']
y =  data['y']

In [54]:
X_train, X_valid, y_train, y_valid = train_test_split(
            X,y, test_size=0.20, random_state=42)
print(X_train.shape)
print(X_valid.shape)

(4000, 400)
(1000, 400)


In [55]:
optimizer = SGD(learning_rate =  0.1,momentum=0.9)
loss = CrossEntropyForSoftMax()

In [56]:
model = Sequential([    
    Dense(100),
    BathNorm(),
    Activation(relu),    
    Dense(50),
    BathNorm(),
    Activation(relu),    
    Dense(10),
    BathNorm(),
    Activation(softmax)    
],  optimizer, loss, X.shape[1])

In [57]:
model.summary()

+---------------+
| Model Summary |
+---------------+
Input Shape: 400
+------------+-------------+--------------+------------+
| Layer Name | Input Shape | Output Shape | Shape      |
+------------+-------------+--------------+------------+
| Dense      | 400         | 100          | (400, 100) |
| BathNorm   | 100         | 100          | (100, 100) |
| relu       | 100         | 100          | (100, 100) |
| Dense      | 100         | 50           | (100, 50)  |
| BathNorm   | 50          | 50           | (50, 50)   |
| relu       | 50          | 50           | (50, 50)   |
| Dense      | 50          | 10           | (50, 10)   |
| BathNorm   | 10          | 10           | (10, 10)   |
| softmax    | 10          | 10           | (10, 10)   |
+------------+-------------+--------------+------------+


In [58]:
model.fit(X_train,y_train,X_valid,y_valid,epochs= 10000,batchsize= 1000)

  0% (2 of 10000) |                       | Elapsed Time: 0:00:00 ETA:  0:23:37

Epoch# 0 Training Loss:1.0312971001220088 Validation Loss: 0.7682038774176346 Training Accuracy:0.788 Validation Accuracy:0.785


 10% (1001 of 10000) |##                  | Elapsed Time: 0:01:57 ETA:  0:18:08

Epoch# 1000 Training Loss:0.029824808597426655 Validation Loss: 0.2026289694657334 Training Accuracy:1.0 Validation Accuracy:0.946


 20% (2002 of 10000) |####                | Elapsed Time: 0:03:56 ETA:  0:17:46

Epoch# 2000 Training Loss:0.02845456320401412 Validation Loss: 0.20786208128671693 Training Accuracy:1.0 Validation Accuracy:0.946


 30% (3003 of 10000) |######              | Elapsed Time: 0:05:48 ETA:  0:13:13

Epoch# 3000 Training Loss:0.02800745170782575 Validation Loss: 0.21127903376849794 Training Accuracy:1.0 Validation Accuracy:0.946


 40% (4002 of 10000) |########            | Elapsed Time: 0:07:48 ETA:  0:12:53

Epoch# 4000 Training Loss:0.02778633210976665 Validation Loss: 0.21454089334569365 Training Accuracy:1.0 Validation Accuracy:0.945


 50% (5002 of 10000) |##########          | Elapsed Time: 0:10:49 ETA:  0:12:10

Epoch# 5000 Training Loss:0.027654664660758315 Validation Loss: 0.21704573942161415 Training Accuracy:1.0 Validation Accuracy:0.944


 60% (6002 of 10000) |############        | Elapsed Time: 0:12:44 ETA:  0:10:08

Epoch# 6000 Training Loss:0.02756729342278871 Validation Loss: 0.21931392291169435 Training Accuracy:1.0 Validation Accuracy:0.943


 70% (7002 of 10000) |##############      | Elapsed Time: 0:15:01 ETA:  0:05:59

Epoch# 7000 Training Loss:0.027505142916145732 Validation Loss: 0.22131256619582254 Training Accuracy:1.0 Validation Accuracy:0.944


 80% (8002 of 10000) |################    | Elapsed Time: 0:17:19 ETA:  0:06:23

Epoch# 8000 Training Loss:0.02745863445597696 Validation Loss: 0.22331078274313718 Training Accuracy:1.0 Validation Accuracy:0.944


 90% (9002 of 10000) |##################  | Elapsed Time: 0:19:22 ETA:  0:02:00

Epoch# 9000 Training Loss:0.027422554950695413 Validation Loss: 0.2252454186162289 Training Accuracy:1.0 Validation Accuracy:0.944


100% (10000 of 10000) |###################| Elapsed Time: 0:21:33 Time: 0:21:33


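### As the results above show, with BatchNorm the optimization zips through and converges much faster than the earlier model, even the one with Dropout. This is already visible well before epoch 2000: by epoch 1000 the training accuracy is 1.0 and the validation accuracy is about 0.946. After that the validation loss slowly rises (from about 0.203 to 0.225) while the validation accuracy stays around 0.944, so the remaining epochs bring no further gain.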