# Homework II (Part 2 out 2): Batch Normalization 



------------------------------------------------------
**Machine Learning. Master in Big Data Analytics**

*Pablo M. Olmos pamartin@ing.uc3m.es*

------------------------------------------------------


Batch normalization was introduced in Sergey Ioffe's and Christian Szegedy's 2015 paper [Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift](https://arxiv.org/pdf/1502.03167.pdf). The idea is that, instead of just normalizing the inputs to the network, we normalize the inputs to _layers within_ the network. 

> It's called **batch** normalization because during training, we normalize each layer's inputs by using the mean and variance of the values in the current *batch*.

We will first analyze the effect of Batch Normalization (BN) in a simple NN with dense layers. Then you will be able to incorportate BN into the CNN that you designed in the first part of Lab 3. 

Note: a big part of the following material is a personal wrap-up of [Facebook's Deep Learning Course in Udacity](https://www.udacity.com/course/deep-learning-pytorch--ud188). So all credit goes for them!!


## Batch Normalization in PyTorch<a id="implementation_1"></a>

This section of the notebook shows you one way to add batch normalization to a neural network built in PyTorch. 

The following cells import the packages we need in the notebook and load the MNIST dataset to use in our experiments.

## Batch normalization in dense and convolution layers

Batch normalization addresses the issue of changing input distribution deep in the network. For example, if I trained linear model at each iteration the distribution of the inputs would be stable. In a deep neural network the distribution at iteration 10 for a deep layer can very different from the distribution at iteration 1. This because the outputs of the previous layers change every time the weights are updated. Since changes in the output propagate trough the layers this issue becomes larger the deeper we are in the network. Since are updated after every mini batch the distribution changes betweent the minibatches which is why normalization occurs at the batch layer.
The output of each normalization is not merely the standardized value of the input but a the input scaled by a constant weight and a bias term which are learned over time.

In a dense layer the mean and variance are computed with respect to all inputs at each neuron. For the convolutional layer we need to create a mean for each channel.

In [0]:
%matplotlib inline

import numpy as np
import copy
import time

import torch
from torch import nn
from torch import optim
import matplotlib.pyplot as plt
from torchvision import datasets, transforms


In [5]:
### Run this cell


# Define a transform to normalize the data
transform = transforms.Compose([transforms.ToTensor(),
                              transforms.Normalize((0.5,), (0.5,)),
                              ])

# Download and load the training  data
trainset = datasets.MNIST('~/.pytorch/MNIST_data/', download=True, train=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)

# Download and load the test data
testset = datasets.MNIST('~/.pytorch/MNIST_data/', download=True, train=False, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=64, shuffle=True)

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to /root/.pytorch/MNIST_data/MNIST/raw/train-images-idx3-ubyte.gz


HBox(children=(IntProgress(value=1, bar_style='info', max=1), HTML(value='')))

Extracting /root/.pytorch/MNIST_data/MNIST/raw/train-images-idx3-ubyte.gz to /root/.pytorch/MNIST_data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to /root/.pytorch/MNIST_data/MNIST/raw/train-labels-idx1-ubyte.gz


HBox(children=(IntProgress(value=1, bar_style='info', max=1), HTML(value='')))

Extracting /root/.pytorch/MNIST_data/MNIST/raw/train-labels-idx1-ubyte.gz to /root/.pytorch/MNIST_data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to /root/.pytorch/MNIST_data/MNIST/raw/t10k-images-idx3-ubyte.gz


HBox(children=(IntProgress(value=1, bar_style='info', max=1), HTML(value='')))

Extracting /root/.pytorch/MNIST_data/MNIST/raw/t10k-images-idx3-ubyte.gz to /root/.pytorch/MNIST_data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to /root/.pytorch/MNIST_data/MNIST/raw/t10k-labels-idx1-ubyte.gz


HBox(children=(IntProgress(value=1, bar_style='info', max=1), HTML(value='')))

Extracting /root/.pytorch/MNIST_data/MNIST/raw/t10k-labels-idx1-ubyte.gz to /root/.pytorch/MNIST_data/MNIST/raw
Processing...
Done!





### Neural network classes

The following class, `MLP`, allows us to create identical neural networks **with and without batch normalization** to compare. We are defining a simple NN with **two dense layers** for classification; this design choice was made to support the discussion related to batch normalization and not to get the best classification accuracy.

Two importants points about BN:

- We use PyTorch's [BatchNorm1d](https://pytorch.org/docs/stable/nn.html#batchnorm1d). This is the function you use to operate on linear layer outputs; you'll use [BatchNorm2d](https://pytorch.org/docs/stable/nn.html#batchnorm2d) for 2D outputs like filtered images from convolutional layers. 
- We add the batch normalization layer **before** calling the activation function.


In [0]:
class MLP(nn.Module):
    def __init__(self,dimx,hidden1,hidden2,nlabels,use_batch_norm=False): #Nlabels will be 10 in our case
        
        super().__init__()
        
        # Keep track of whether or not this network uses batch normalization.
        self.use_batch_norm = use_batch_norm
        
        self.output1 = nn.Linear(dimx,hidden1)
        
        self.output2 = nn.Linear(hidden1,hidden2)        
        
        self.output3 = nn.Linear(hidden2,nlabels)
    
        self.relu = nn.ReLU()
        
        self.logsoftmax = nn.LogSoftmax(dim=1)
        
        if self.use_batch_norm:

            self.batch_norm1 = nn.BatchNorm1d(hidden1)
            
            self.batch_norm2 = nn.BatchNorm1d(hidden2)
            
        
    def forward(self, x):
        x = x.view(x.shape[0],-1)
        # Pass the input tensor through each of our operations
        x = self.output1(x)
        if self.use_batch_norm:
            x = self.batch_norm1(x)
        x = self.relu(x)
        x = self.output2(x)
        if self.use_batch_norm:
            x = self.batch_norm2(x)        
        x = self.relu(x)
        x = self.output3(x)
        x = self.logsoftmax(x) 
        
        return x

> **Exercise:** 
> 
> - Create a validation set with the 20% of training set
> - Extend the class above to incorporate a training method where both training and validation losses are computed, and a method to evaluate the classification performance on a given set

**Note:** As we do with Dropout, for BN we have to call the methods `self.eval()` and `self.train()` in both validation and training. Setting a model to evaluation mode is important for models with batch normalization layers!

>* Training mode means that the batch normalization layers will use **batch** statistics to calculate the batch norm. 
* Evaluation mode, on the other hand, uses the estimated **population** mean and variance from the entire training set, which should give us increased performance on this test data!  

In [0]:
train_len=int(len(trainset)*0.8)

valiloader=copy.deepcopy(trainloader)

trainloader.dataset.data=trainloader.dataset.data[:train_len]
trainloader.dataset.targets=trainloader.dataset.targets[:train_len]

valiloader.dataset.data=valiloader.dataset.data[train_len:]
valiloader.dataset.targets=valiloader.dataset.targets[train_len:]

In [0]:
# Your code here

In [0]:
class MLP_extended(MLP):

    def __init__(self,dimx,hidden1,hidden2,nlabels,use_batch_norm=True, epochs=10, lr=0.001):
      super().__init__(dimx,hidden1,hidden2,nlabels,use_batch_norm=False)
      self.epochs = epochs 
      self.optimizer=optim.Adam(self.parameters(), lr=lr)
      self.criterion=torch.nn.NLLLoss()


      self.vali_loss=[]
      self.accuracy=[]
      self.train_loss=[]
      self.test_accuracy=0
      # self.cuda()

    def validate(self, valiloader):
      self.eval()
      val_accuracy = 0
      val_loss = 0

      with torch.no_grad():
        for images, targets in valiloader:
          # images, targets = images.cuda(), targets.cuda()

          log_proba=self.forward(images)
         
          loss=self.criterion(log_proba,targets)

          val_loss+=loss.item()

          proba=torch.exp(log_proba)

          top_p, top_c = proba.topk(1, dim=1)
          equals = top_c ==targets.view(*top_c.shape)
          val_accuracy +=torch.mean(equals.type(torch.FloatTensor))
        
        else:
          self.vali_loss.append(val_loss/len(valiloader))
          self.accuracy.append(val_accuracy/len(valiloader)) 

          print("Validation loss: {}" .format(self.vali_loss[-1]))
          print("Validation Accuracy : {}\n\n" .format(self.accuracy[-1]))

          self.train()

    def train_loop(self, trainloader, valiloader=None):
      for e in range(self.epochs):
        start_time=time.time()
        running_loss=0

        for images, targets in trainloader:
        
          # images,targets=images.cuda(), targets.cuda()
          self.optimizer.zero_grad()

          log_proba=self(images)
          loss=self.criterion(log_proba,targets)
          loss.backward()

          self.optimizer.step()
        
          running_loss+=loss.item()

        else:
          end_time=time.time()
          self.train_loss.append(running_loss/len(trainloader))
          print(20*'_'+'Running Epoch {}'.format(e+1)+180*'_'+'\n')
          print("Training loss: {}".format(self.train_loss[-1]))
          print("This epoch took {} seconds to train\n".format(round(end_time-start_time)))
          if valiloader:
            self.validate(valiloader)

    def testing(self, testloader):
      self.eval()
      for images, targets in testloader:
        # images,targets=images.cuda(), targets.cuda()
      
        log_proba=self(images)
        proba=torch.exp(log_proba)

        top_p, top_c= proba.topk(1, dim=1)
        equals=top_c==targets.view(*top_c.shape)


        self.test_accuracy+=torch.mean(equals.type(torch.FloatTensor))


      else:
        print('Accuracy on the test set is: {}'.format(self.test_accuracy/len(testloader)))
      return



### Create two different models for testing

* `net_batchnorm` uses batch normalization applied to the output of its hidden layers
* `net_no_norm` does not use batch normalization

Besides the normalization layers, everthing about these models is the same.

In [99]:
net_batchnorm = MLP_extended(dimx=784,hidden1=128,hidden2=64,
                              nlabels=10,epochs=10,lr=1e-3,use_batch_norm=True)
net_no_norm = MLP_extended(dimx=784,hidden1=128,hidden2=64,
                              nlabels=10,epochs=10,lr=1e-3,use_batch_norm=False)

print(net_batchnorm)
print()
print(net_no_norm)

MLP_extended(
  (output1): Linear(in_features=784, out_features=128, bias=True)
  (output2): Linear(in_features=128, out_features=64, bias=True)
  (output3): Linear(in_features=64, out_features=10, bias=True)
  (relu): ReLU()
  (logsoftmax): LogSoftmax()
  (criterion): NLLLoss()
)

MLP_extended(
  (output1): Linear(in_features=784, out_features=128, bias=True)
  (output2): Linear(in_features=128, out_features=64, bias=True)
  (output3): Linear(in_features=64, out_features=10, bias=True)
  (relu): ReLU()
  (logsoftmax): LogSoftmax()
  (criterion): NLLLoss()
)


> **Exercise:** Train both models and compare the evolution of the train/validation loss in both cases

In [100]:
net_batchnorm.train_loop(trainloader,valiloader)
net_batchnorm.testing(testloader)


____________________Running Epoch 1____________________________________________________________________________________________________________________________________________________________________________________

Training loss: 0.6541432928505951
This epoch took 3 seconds to train

Validation loss: 0.3562145022855651
Validation Accuracy : 0.8933971524238586


____________________Running Epoch 2____________________________________________________________________________________________________________________________________________________________________________________

Training loss: 0.3309376639866732
This epoch took 2 seconds to train

Validation loss: 0.2936151852050135
Validation Accuracy : 0.9108222723007202


____________________Running Epoch 3____________________________________________________________________________________________________________________________________________________________________________________

Training loss: 0.2566576819049149
This epoch took 3

In [101]:
net_no_norm.train_loop(trainloader,valiloader)
net_no_norm.testing(testloader)


____________________Running Epoch 1____________________________________________________________________________________________________________________________________________________________________________________

Training loss: 0.6650253648801547
This epoch took 2 seconds to train

Validation loss: 0.3986229105822502
Validation Accuracy : 0.8783122301101685


____________________Running Epoch 2____________________________________________________________________________________________________________________________________________________________________________________

Training loss: 0.33500413893441844
This epoch took 2 seconds to train

Validation loss: 0.31427947527939276
Validation Accuracy : 0.9062860012054443


____________________Running Epoch 3____________________________________________________________________________________________________________________________________________________________________________________

Training loss: 0.27286241878944684
This epoch too

---
### Considerations for other network types

This notebook demonstrates batch normalization in a standard neural network with fully connected layers. You can also use batch normalization in other types of networks, but there are some special considerations.

#### ConvNets

Convolution layers consist of multiple feature maps. (Remember, the depth of a convolutional layer refers to its number of feature maps.) And the weights for each feature map are shared across all the inputs that feed into the layer. Because of these differences, batch normalizing convolutional layers requires batch/population mean and variance per feature map rather than per node in the layer.

> To apply batch normalization on the outputs of convolutional layers, we use [BatchNorm2d](https://pytorch.org/docs/stable/nn.html#batchnorm2d). To use it, we simply state the **number of input feature maps**. I.e. `nn.BatchNorm2d(num_features=nmaps)`


#### RNNs

Batch normalization can work with recurrent neural networks, too, as shown in the 2016 paper [Recurrent Batch Normalization](https://arxiv.org/abs/1603.09025). It's a bit more work to implement, but basically involves calculating the means and variances per time step instead of per layer. You can find an example where someone implemented recurrent batch normalization in PyTorch, in [this GitHub repo](https://github.com/jihunchoi/recurrent-batch-normalization-pytorch).

For 10 epochs, the model with batch normialization, is slightly 0.0017 better. Whether or not this is cooincidence or not is not clear from this approach.

> **Exercise:** Incorporate BN to your solution of Homework 2 (Part I). Compare the results with and without BN!!

In [8]:
# Define a transform to normalize the data
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.5,), (0.5,))])
# Download and load the training data
trainset = datasets.FashionMNIST('~/.pytorch/F_MNIST_data/', download=True, train=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=256, shuffle=True)

# Download and load the test data
testset = datasets.FashionMNIST('~/.pytorch/F_MNIST_data/', download=True, train=False, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=256, shuffle=True)




# creating the valiation dataset


train_len=int(len(trainset)*0.8)

valiloader=copy.deepcopy(trainloader)

trainloader.dataset.data=trainloader.dataset.data[:train_len]
trainloader.dataset.targets=trainloader.dataset.targets[:train_len]

valiloader.dataset.data=valiloader.dataset.data[train_len:]
valiloader.dataset.targets=valiloader.dataset.targets[train_len:]


Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz to /root/.pytorch/F_MNIST_data/FashionMNIST/raw/train-images-idx3-ubyte.gz


HBox(children=(IntProgress(value=1, bar_style='info', max=1), HTML(value='')))

Extracting /root/.pytorch/F_MNIST_data/FashionMNIST/raw/train-images-idx3-ubyte.gz to /root/.pytorch/F_MNIST_data/FashionMNIST/raw
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz to /root/.pytorch/F_MNIST_data/FashionMNIST/raw/train-labels-idx1-ubyte.gz


HBox(children=(IntProgress(value=1, bar_style='info', max=1), HTML(value='')))

Extracting /root/.pytorch/F_MNIST_data/FashionMNIST/raw/train-labels-idx1-ubyte.gz to /root/.pytorch/F_MNIST_data/FashionMNIST/raw
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz to /root/.pytorch/F_MNIST_data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz


HBox(children=(IntProgress(value=1, bar_style='info', max=1), HTML(value='')))

Extracting /root/.pytorch/F_MNIST_data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz to /root/.pytorch/F_MNIST_data/FashionMNIST/raw
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz to /root/.pytorch/F_MNIST_data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz


HBox(children=(IntProgress(value=1, bar_style='info', max=1), HTML(value='')))

Extracting /root/.pytorch/F_MNIST_data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz to /root/.pytorch/F_MNIST_data/FashionMNIST/raw
Processing...
Done!





In [9]:
train_on_gpu = torch.cuda.is_available()
if(train_on_gpu):
    print('Training on GPU!')
else: 
    print('No GPU available, training on CPU; consider making n_epochs very small.')

Training on GPU!


In [0]:
class Lenet5(nn.Module):
  def __init__(self, dimx, nlabels, drop_proba=0.2,use_drop_out=False,use_batch_norm=False):
    super().__init__()

    self.conv1=nn.Conv2d(1,6,kernel_size=5, stride=1, padding=0)

    self.conv2=nn.Conv2d(6,16,kernel_size=5,stride=1, padding=0)

    self.pool=nn.MaxPool2d(2,2)

    self.final_dim=int(((dimx-4)/2-4)/2)


    self.fc1=nn.Linear(256, 120)
    self.fc2=nn.Linear(120, 84)
    self.fc3=nn.Linear(84,10)
    self.relu=nn.ReLU()
    self.use_drop=use_drop_out
    self.use_batch_norm=use_batch_norm
    if self.use_drop:
      print('Using Dropout')
      self.dropout=nn.Dropout(p=drop_proba)
    else:
      print('Not using Dropout')
    self.softmax=nn.LogSoftmax(dim=1)

    if self.use_batch_norm:
      print('Using Batch Nornmalization')
      self.batch_norm1 = nn.BatchNorm2d(num_features=6,affine=True)
      self.batch_norm2 = nn.BatchNorm2d(num_features=16,affine=True)

    else:
      print('Not using Batch Normalization')




  def forward(self,x):
    x=self.conv1(x)
    if self.use_batch_norm:
      x=self.batch_norm1(x)
      # according to: https://machinelearningmastery.com/batch-normalization-for-training-of-deep-neural-networks/
      # we should use batchnormalization before the activation for non linear activation functions.
    x=self.relu(x)
    # th size is gonna change to dimx-kernel_size/1+1
    # pooling squashes both height and width by factor 2
    x=self.pool(x)
    x=self.conv2(x)
    if self.use_batch_norm:
      x=self.batch_norm2(x)
    x=self.relu(x)
    x=self.pool(x)
    # stackup the 3 dimensional tensor
    # 16 channels, and a height/widt of final_dim
    # -1 accounts for the batch size
    
    x=x.view(-1,16*self.final_dim**2)
    x=self.relu(self.fc1(x))
    if self.use_drop:
      x=self.dropout(x)
    x=self.relu(self.fc2(x))
    if self.use_drop:
      x=self.dropout(x)
    output=self.softmax(self.fc3(x))

    return output



In [0]:
class Lenet5_ext(Lenet5):
  def __init__(self, dimx, nlabels, lr=0.001, epochs=2,use_cuda=True,drop_proba=0.2,use_drop_out=False, use_batch_norm=False):
    super().__init__(dimx, nlabels,drop_proba,use_drop_out,use_batch_norm)

    self.epochs=epochs

    self.optimizer=optim.Adam(self.parameters(), lr=lr)
    self.criterion=nn.NLLLoss()

    self.vali_loss=[]
    self.accuracy=[]
    self.train_loss=[]
    self.use_cuda=use_cuda
    self.test_accuracy=0


    if train_on_gpu and self.use_cuda:
        print('Using GPU')
        self.cuda()
    else:
        print('Using CPU')
        self.cpu()

  def validate(self,valiloader):
      self.eval()
      val_accuracy=0
      val_loss=0

      with torch.no_grad():
        for images, targets in valiloader:
          if train_on_gpu and self.use_cuda:
            images,targets=images.cuda(),targets.cuda()

          log_proba=self.forward(images)
          loss=self.criterion(log_proba,targets)

          val_loss+=loss.item()

          proba=torch.exp(log_proba)

          top_p, top_c = proba.topk(1, dim=1)
          equals = top_c ==targets.view(*top_c.shape)
          val_accuracy +=torch.mean(equals.type(torch.FloatTensor))


        else:
          self.vali_loss.append(val_loss/len(valiloader))
          self.accuracy.append(val_accuracy/len(valiloader)) 

          print("Validation loss: {}" .format(self.vali_loss[-1]))
          print("Validation Accuracy : {}\n\n" .format(self.accuracy[-1]))

          self.train()

  def train_loop(self,trainloader,valiloader=None,use_cuda=True, early_stopping=False):
    increasing_valiLoss=0
      
    self.train()

    for e in range(self.epochs):
      

      # print(self.fc1.weight)
      start_time=time.time()

      running_loss=0

      for images, targets in trainloader:
        if train_on_gpu and self.use_cuda:
          images,targets=images.cuda(), targets.cuda()
        self.optimizer.zero_grad()
        
        log_proba=self(images)
        loss=self.criterion(log_proba,targets)
        loss.backward()
        # for param in self.parameters():
        #   print(param.grad.data.sum())

        # # start debugger
        # import pdb; pdb.set_trace()

        self.optimizer.step()

        running_loss+=loss.item()

        
      else:
        end_time=time.time()
        self.train_loss.append(running_loss/len(trainloader))
        print(20*'_'+'Running Epoch {}'.format(e+1)+180*'_'+'\n')
        print("Training loss: {}".format(self.train_loss[-1]))
        print("This epoch took {} seconds to train\n".format(round(end_time-start_time)))
        if valiloader:
            self.validate(valiloader)

            try:
              if early_stopping and self.validation_loss[-1]>self.validation_loss[-2] :
                increasing_valiLoss+=1
              else:
                increasing_valiLoss=0
            except:
              IndexError
            if  increasing_valiLoss>=2:
              return

  def testing(self, testloader):
    for images, targets in testloader:
      if train_on_gpu and self.use_cuda:
        images,targets=images.cuda(), targets.cuda()
      
      log_proba=self.forward(images)
      proba=torch.exp(log_proba)

      top_p, top_c= proba.topk(1, dim=1)
      equals=top_c==targets.view(*top_c.shape)


      self.test_accuracy+=torch.mean(equals.type(torch.FloatTensor))


    else:
      print('Accuracy on the test set is: {}'.format(self.test_accuracy/len(testloader)))
      return




We will compare the model from homework without dropout or early stopping against the the model with batch_normalization.

In [22]:
model_overfitting=Lenet5_ext(28, 10, lr=0.001, epochs=50,use_cuda=True,use_drop_out=False,use_batch_norm=True)
model_overfitting.train_loop(trainloader, valiloader)
model_overfitting.testing(testloader)

Not using Dropout
Using Batch Nornmalization
Using GPU
____________________Running Epoch 1____________________________________________________________________________________________________________________________________________________________________________________

Training loss: 0.6812205279761172
This epoch took 7 seconds to train

Validation loss: 0.43248051214725414
Validation Accuracy : 0.8411023020744324


____________________Running Epoch 2____________________________________________________________________________________________________________________________________________________________________________________

Training loss: 0.39090870915258186
This epoch took 7 seconds to train

Validation loss: 0.36498835619459763
Validation Accuracy : 0.8679711222648621


____________________Running Epoch 3_______________________________________________________________________________________________________________________________________________________________________________

In [23]:
model_overfitting=Lenet5_ext(28, 10, lr=0.001, epochs=50,use_cuda=True,use_drop_out=False,use_batch_norm=False)
model_overfitting.train_loop(trainloader, valiloader)
model_overfitting.testing(testloader)

Not using Dropout
Not using Batch Normalization
Using GPU
____________________Running Epoch 1____________________________________________________________________________________________________________________________________________________________________________________

Training loss: 0.8522664343423032
This epoch took 6 seconds to train

Validation loss: 0.5862548782470378
Validation Accuracy : 0.7754083871841431


____________________Running Epoch 2____________________________________________________________________________________________________________________________________________________________________________________

Training loss: 0.5235228717644164
This epoch took 6 seconds to train

Validation loss: 0.49906437637958123
Validation Accuracy : 0.8118588924407959


____________________Running Epoch 3______________________________________________________________________________________________________________________________________________________________________________

We see that the the performance on the training set with batch normalization and without it is pretty much the same although the model without normalization actually performs better by 0.2%.
