##Classifying MNIST handwritten digits with a Multi-Layer Perceptron (MLP)

* In this session, we will create a fully connected MLP with one hidden layer, then train and evaluate the network on the MNIST dataset.
* The MNIST images are relatively small (28*28 pixels), so a simple MLP is sufficient for training.

##What does our planned MLP look like? Draw the diagram

* How many inputs?
* How many hidden layers?
* How many neurons in the hidden layers?
* How many neurons in the output layer?
* What about activation functions?


##The Main Steps

Generally, the main steps for building a Deep Learning Neural Network are as follows. 

1. Import libraries, seed
2. Set data preprocessing (transform), download dataset, split train and test 
3. Set Dataloaders 
4. Define the model class
5. Set loss function, optimizer and learning rate
6. Training : Load the data
7. Training : Zero the parameter gradients
8. Training : Compute the forward pass
9. Training : Compute loss
10. Training : Compute the backward pass, step the optimizer (update weights)
11. Evaluation of trained model on test dataset

##1. Import libraries, seed the random number generators for reproducibility

In [2]:
import torch
import torch.nn.functional as F
import torch.nn as nn
import numpy as np 
from torchvision import datasets
from torchvision import transforms
from torch.utils.data import DataLoader

seed = 7
torch.manual_seed(seed)
torch.backends.cudnn.deterministic = True  
torch.backends.cudnn.benchmark = False #for a small dataset, simple network , this is not really needed
np.random.seed(seed)
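
The effect of seeding can be checked with a minimal sketch like the one below. The helper name `sample` is hypothetical and not part of the notebook; it only illustrates that re-seeding PyTorch's default generator reproduces the same random draws.

```python
import torch

def sample(seed):
    # Hypothetical helper: re-seed PyTorch's default generator, then draw 3 numbers.
    torch.manual_seed(seed)
    return torch.rand(3)

first = sample(7)
second = sample(7)   # same seed as above
other = sample(8)    # a different seed

print(torch.equal(first, second))  # compares the two draws made with seed 7
print(torch.equal(first, other))   # compares draws made with different seeds
```

The same idea applies to weight initialisation and to shuffling in a DataLoader, which also draw from PyTorch's random number generators.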


##2. Download the MNIST dataset and Data pre-processing

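* Each PIL image is converted to a PyTorch tensor using transforms.ToTensor(), which also scales the pixel values to the range [0, 1].
* The 28*28 image data is flattened into a vector using transforms.Lambda, which wraps a custom transformation. Here the lambda calls x.view(-1), which returns a new view of the input tensor with a single dimension whose length is the product of the input tensor's dimensions (1*28*28 = 784).

* The MNIST dataset is loaded as two separate datasets: a training set (train=True) and a test set (train=False).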
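
The flattening step can be tried in isolation. This is a minimal sketch that uses a random tensor as a stand-in for one converted MNIST image (an assumption made so that nothing has to be downloaded):

```python
import torch

# Stand-in for the output of transforms.ToTensor() on one MNIST image:
# a float tensor of shape (channels, height, width) = (1, 28, 28).
image = torch.rand(1, 28, 28)

# -1 asks PyTorch to infer the length of the single dimension: 1 * 28 * 28.
flat = image.view(-1)
print(image.shape, flat.shape)

# view() reuses the original storage, so no pixel data is copied.
print(flat.data_ptr() == image.data_ptr())
```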








In [3]:

transformCustom = transforms.Compose([
                                transforms.ToTensor(), # converts the PIL image to a tensor
                                transforms.Lambda(lambda x:x.view(-1))  # flattens 28*28 into a 784-element vector for each image
])


train = datasets.MNIST(root='.',train=True,transform=transformCustom, download=True)
test = datasets.MNIST(root='.', train=False, transform=transformCustom,download=True)

print(len(train.data))


Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ./MNIST/raw/train-images-idx3-ubyte.gz


100%|██████████| 9912422/9912422 [00:00<00:00, 102851876.64it/s]


Extracting ./MNIST/raw/train-images-idx3-ubyte.gz to ./MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ./MNIST/raw/train-labels-idx1-ubyte.gz


100%|██████████| 28881/28881 [00:00<00:00, 34550968.00it/s]


Extracting ./MNIST/raw/train-labels-idx1-ubyte.gz to ./MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ./MNIST/raw/t10k-images-idx3-ubyte.gz


100%|██████████| 1648877/1648877 [00:00<00:00, 25885152.10it/s]


Extracting ./MNIST/raw/t10k-images-idx3-ubyte.gz to ./MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ./MNIST/raw/t10k-labels-idx1-ubyte.gz


100%|██████████| 4542/4542 [00:00<00:00, 15718258.06it/s]


Extracting ./MNIST/raw/t10k-labels-idx1-ubyte.gz to ./MNIST/raw

60000


##3. DataLoaders

* Previously, as in the perceptron assignment, we passed the data manually. Depending on the implementation, each iteration receives one row of data, one mini-batch, or the whole dataset as a single batch. For Stochastic Gradient Descent we also had to shuffle the dataset, or pick one random sample from it, by hand.

* In PyTorch, we can use the DataLoader class, which automatically serves batches of data fetched from a Dataset object. We can also set the size of each batch and whether the DataLoader shuffles the data.

* Each batch is a pair: the images are the first element and the labels are the second.
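
To see how a DataLoader splits a dataset into batches, here is a minimal sketch on synthetic data. The names `toy_dataset` and `toy_loader` and the dataset size of 1000 are assumptions chosen for illustration; they are not the MNIST objects used in this notebook.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in for MNIST: 1000 random "images" of 784 values with random labels.
features = torch.rand(1000, 784)
labels = torch.randint(0, 10, (1000,))
toy_dataset = TensorDataset(features, labels)

toy_loader = DataLoader(toy_dataset, batch_size=128, shuffle=True)

batch_sizes = []
for xb, yb in toy_loader:
    # xb holds the inputs of one batch, yb the matching labels.
    batch_sizes.append(xb.shape[0])

print(len(toy_loader), batch_sizes)
```

When the dataset size is not a multiple of the batch size, the last batch is smaller than the others unless `drop_last=True` is passed to the DataLoader.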

In [4]:
#Set DataLoader
batchSize = 128  # Rule of thumb: use a power of 2. In this case 2^7
train_loader = DataLoader(train, batch_size=batchSize,shuffle=True)
test_loader = DataLoader(test,batch_size=batchSize, shuffle=False) # no need to shuffle test data


#Task:

What are the shapes of the train data and the train labels?

How many batches are there? Is the size of each batch the same?

In [None]:
count = 0
for xb, yb in train_loader:
  count += 1  # Your code here: also inspect xb.shape and yb.shape
print(f'There are {count} batches in train_loader')
# How many batches are there? 128*468 + 96 = 60,000, so 468 full batches plus one batch of 96

#Task: 
Do the same for test data (test_loader)

In [None]:
count=0
for x, y in test_loader:
  count += 1  # Your code here: also inspect x.shape and y.shape
print(f'There are {count} batches in test_loader')
# 78*128 + 16 = 10,000, so 78 full batches plus one batch of 16

There are 79 batches in test_loader


##4. Define the Neural Net model/class (MLP with one hidden layer, fully connected)

* We define our model in a class that extends nn.Module.
* nn.Module subclasses must do at least one thing: implement the forward method, which takes a batch of data and performs the forward pass.

* PyTorch's autograd system takes care of computing the gradients for the backward pass from the operations recorded during the forward pass. In the code below we also use the constructor of our model to instantiate the hidden and output layers.


* The nn.Module class defines an instance variable called `training` that is True while the model is in training mode (the default, or after model.train()) and False after model.eval() has been called.

* In our model definition we use a softmax activation function on the output layer to turn the outputs into probability-like values, but we only enable it when the model is not in training mode. We do this because during training we use PyTorch's implementation of cross-entropy loss (nn.CrossEntropyLoss), which **implicitly** applies a log-softmax to the logits before computing the negative log-likelihood loss.

* In our case the softmax isn't actually necessary for model evaluation if we're only interested in the most likely class; the logits (unnormalized log probabilities) produced by the final fully connected layer can be used directly, as the largest logit corresponds to the most likely class.
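
The points above about nn.CrossEntropyLoss, the softmax and the `training` flag can be checked numerically. This is a minimal sketch on made-up logits and labels (both hypothetical), not on the notebook's model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)            # hypothetical raw outputs: 4 samples, 10 classes
targets = torch.tensor([3, 0, 9, 1])   # hypothetical class labels

# CrossEntropyLoss applied to raw logits ...
ce = nn.CrossEntropyLoss()(logits, targets)
# ... compared with log-softmax followed by the negative log-likelihood loss.
manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)
print(torch.allclose(ce, manual))

# Softmax preserves the ordering of the logits, so the predicted class is unchanged.
probs = F.softmax(logits, dim=1)
print(torch.equal(logits.argmax(dim=1), probs.argmax(dim=1)))

# The training flag is switched by train() and eval().
layer = nn.Linear(10, 2)
print(layer.training)   # flag right after construction
layer.eval()
print(layer.training)   # flag after eval()
```

This is why the model can return raw logits during training: applying a softmax before nn.CrossEntropyLoss would normalise the outputs twice.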


In [5]:
class MLP(nn.Module):
  def __init__(self,input_size, hidden_size,num_classes):
    super(MLP,self).__init__()

    self.layer1 = nn.Linear(input_size,hidden_size) 
    self.layer2 = nn.Linear(hidden_size,num_classes)

  def forward(self,x): 

    out = self.layer1(x)
    #out = F.sigmoid(out) 
    out = torch.sigmoid(out)
    out = self.layer2(out)
    #out = softmax(out) #implicitly added during training by nn.CrossEntropyLoss

    if not self.training:
      out = F.softmax(out,dim=1)
    return out
 


##5. Set loss function, optimizer
##6-10. Training




##Training and Evaluating the Model
* One of the design decisions of PyTorch is that everything should be explicit so we have full control over our models and the training process. 

* This means that we actually need to write the model training loop by hand, and perform each of the various operations (perform the forward-pass, compute the loss, perform the backward-pass, and update the weights). 

* In the code below we'll fit the model to the data over several epochs using batches of 128 images provided by the DataLoader defined previously. 

* We'll use the Adam optimiser, as it tends to work well in practice despite its limitations.

In [None]:

model = MLP(784, 784, 10) #input_size,hidden_size,num_classes

#5. Set loss function and optimizer
loss_fn = nn.CrossEntropyLoss() 
opt = torch.optim.Adam(model.parameters()) # optimizer: the strategy used to update the weights, aiming to converge quickly and avoid poor local minima
# Rule of thumb for choosing an optimizer
# 1. If you want to keep things simple, use Adam
# 2. If you have time, use SGD and tune the learning rate and other hyperparameters (this is usually done by postgrad students)
# 3. If you are implementing a paper, use the same strategy as the authors

epochSize = 3 # obviously this isn't enough

for epoch in range(epochSize): # this training part can be made into a function, or defined as a method of class MLP

  # Training mode is the default, so this call is optional here; what matters is model.eval(), which we'll see later.
  # Refer to https://pytorch.org/docs/stable/generated/torch.nn.Module.html for more details.
  model.train()

  loss = 0 # running sum of the batch losses over this epoch
  # 6. Load the data 
  for input_batch, target_batch in train_loader:

    #7. Zero the gradients
    opt.zero_grad() 
    
    #8. Forward pass
    predict_batch = model(input_batch) 
    
    #9. Compute loss
    loss_batch = loss_fn(predict_batch,target_batch)  
    
    #10. Backward pass and update weights
    loss_batch.backward() 
    opt.step()

    loss += loss_batch.item() #store the loss
    

  print(f'Epoch: {epoch+1}  loss: {loss}')    
  
        
    


Epoch: 1  loss: 224.84893715381622
Epoch: 2  loss: 110.55411703139544
Epoch: 3  loss: 84.45046036690474


#Task: Visualize epoch vs loss/cost and minibatches vs loss/cost


* Above, we printed the total loss (the sum of the batch losses) at the end of each epoch. Write your own code to plot the loss per epoch and per mini-batch.



In [None]:
#plot the loss by epochs, minibatches here
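
One possible way to collect the values to plot is sketched below. It is not the notebook's training run: it uses a tiny synthetic dataset and a small stand-in model, and the names `toy_loader`, `toy_model`, `toy_loss_fn`, `toy_opt`, `batch_losses` and `epoch_losses` are all hypothetical.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
# Hypothetical toy problem so the sketch runs without downloading MNIST.
toy_loader = DataLoader(
    TensorDataset(torch.rand(256, 784), torch.randint(0, 10, (256,))),
    batch_size=64, shuffle=True)
toy_model = nn.Sequential(nn.Linear(784, 32), nn.Sigmoid(), nn.Linear(32, 10))
toy_loss_fn = nn.CrossEntropyLoss()
toy_opt = torch.optim.Adam(toy_model.parameters())

batch_losses, epoch_losses = [], []
for epoch in range(3):
    running = 0.0
    for xb, yb in toy_loader:
        toy_opt.zero_grad()
        loss_batch = toy_loss_fn(toy_model(xb), yb)
        loss_batch.backward()
        toy_opt.step()
        batch_losses.append(loss_batch.item())       # one point per mini-batch
        running += loss_batch.item()
    epoch_losses.append(running / len(toy_loader))   # mean batch loss for this epoch

print(len(batch_losses), len(epoch_losses))
```

Assuming matplotlib is installed, `plt.plot(batch_losses)` and `plt.plot(epoch_losses)` would then draw the two curves. Dividing by the number of batches makes the per-epoch value comparable with the per-batch values.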

#Task: Create a function that evaluates the trained model on the training dataset

* Compute the overall accuracy on the training set. Note that you need a call to model.eval(), which puts the model into evaluation mode so that training-only behaviour (such as dropout) is switched off. model.eval() does not disable gradient tracking by itself; wrap the evaluation loop in torch.no_grad() for that.
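
As a rough guide, an accuracy computation could look like the sketch below. The function name `accuracy_sketch`, the toy data and the untrained stand-in model are assumptions made for illustration; this is one possible approach, not the expected solution.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def accuracy_sketch(model, data_loader):
    """Return (evaluated, wrong, accuracy) for a classifier over a loader."""
    model.eval()                      # switch to evaluation mode
    evaluated, wrong = 0, 0
    with torch.no_grad():             # gradients are not needed for evaluation
        for input_batch, target_batch in data_loader:
            # The predicted class is the index of the largest output per row.
            predicted = model(input_batch).argmax(dim=1)
            evaluated += target_batch.size(0)
            wrong += (predicted != target_batch).sum().item()
    return evaluated, wrong, (evaluated - wrong) / evaluated

# Hypothetical toy data and untrained model, only to exercise the function.
torch.manual_seed(0)
toy_loader = DataLoader(
    TensorDataset(torch.rand(100, 784), torch.randint(0, 10, (100,))),
    batch_size=32)
toy_model = nn.Sequential(nn.Linear(784, 16), nn.Sigmoid(), nn.Linear(16, 10))
evaluated, wrong, acc = accuracy_sketch(toy_model, toy_loader)
print(evaluated, wrong, acc)
```

Because argmax gives the same result on logits and on softmax outputs, the sketch works whether or not the model applies a softmax in evaluation mode.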

In [None]:
def evaluate_model(data_loader): #or def compute_accuracy():  , use whatever conventions that you like


  model.eval() # sets the model in evaluation mode 
  for input_batch, target_batch in data_loader: #data_loader can take train or test dataset
  
    pass #Your code here 

  print('number of evaluated data')
  print(f'number of wrongly predicted label ')


  print(f'accuracy')




#11. Task: Evaluation of trained model on test data

* Compute the overall accuracy on the test set by reusing the function above. As before, model.eval() puts the model into evaluation mode so that training-only behaviour (such as dropout) is switched off; it does not disable gradient tracking by itself.

In [None]:

evaluate_model(test_loader)


