In the cell below, we import the Python packages needed to write the model-training script.
We import three main packages:


1.   `torch`: the main Python package of the PyTorch deep learning framework. It lets us define layers, optimizers, loss functions, activations, etc.
2.   `__future__`: a utility module that makes features from newer Python versions available in older ones.
3. `torchvision`: a package of computer-vision utilities for PyTorch. It contains popular computer-vision datasets, well-known model architectures, and utilities for image transformations.



In [1]:
# import print_function from __future__. On Python 2 this enables the Python 3 print() function; on Python 3 it has no effect.
from __future__ import print_function

# import the torch package. This gives us access to its modules.
import torch

# nn (neural network) is a core module of the torch package. It contains the building blocks for a neural network.
import torch.nn as nn

# `torch.nn.functional` contains functions that are commonly used in neural networks.
# This module is often used in conjunction with the `torch.nn` module.
import torch.nn.functional as F

# import the optimization module from `torch` and name it optim.
# It contains different optimizers, which update the parameters during model training.
import torch.optim as optim

# finally, import the datasets and transforms sub-modules from torchvision.
# datasets contains a collection of popular datasets like MNIST, COCO, etc.
# transforms contains functions for pre-processing datasets, such as normalization and augmentation.
from torchvision import datasets, transforms

In the cell below, we create a custom class called `Net` by inheriting from the `Module` class in the `nn` sub-module.
`Module` is the base class for all neural network modules in the `nn` package. It provides useful methods such as `parameters()`, which gives access to the network's parameters, and `train()`, which switches the network into training mode.
In our custom class `Net` we define the following:


1.   The neural network architecture, by specifying each layer.
2.   The `forward` function, which performs forward propagation to compute the predicted values.
3. The activation functions used in the architecture.




In [2]:
class Net(nn.Module): # define a class called `Net` by inheriting from `nn.Module`
    def __init__(self): # class constructor; called when we create an instance of this class
        super(Net, self).__init__() # initialize the parent class nn.Module; this sets up its internal state

        # Below we initialize all layers required for our neural network from the nn module.
        # Receptive fields are theoretical values: r_out = r_in + (kernel_size - 1) * jump,
        # where jump is the product of the strides of all earlier layers. Padding does not change them.

        # input_size = 28x28x1, output_size = 28x28x32
        # input_channel = 1, output_channel/no. of kernels = 32, kernel_size = 3, padding=1, receptive_field = 3x3
        # due to padding, input and output size (not considering the channels) are the same, i.e. 28x28
        self.conv1 = nn.Conv2d(1, 32, 3, padding=1)

        # input_size = 28x28x32, output_size = 28x28x64
        # input_channel = 32, output_channel/no. of kernels = 64, kernel_size = 3, padding=1, receptive_field = 5x5
        # due to padding, input and output size are still the same, i.e. 28x28
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)

        # input_size = 28x28x64, output_size = 14x14x64
        # input_channel = 64, output_channel = unchanged, kernel_size = 2, stride=2, padding=0, receptive_field = 6x6
        self.pool1 = nn.MaxPool2d(2, 2)

        # input_size = 14x14x64, output_size = 14x14x128
        # input_channel = 64, output_channel/no. of kernels = 128, kernel_size = 3, padding=1, receptive_field = 10x10
        self.conv3 = nn.Conv2d(64, 128, 3, padding=1)

        # input_size = 14x14x128, output_size = 14x14x256
        # input_channel = 128, output_channel/no. of kernels = 256, kernel_size = 3, padding=1, receptive_field = 14x14
        self.conv4 = nn.Conv2d(128, 256, 3, padding=1)

        # input_size = 14x14x256, output_size = 7x7x256
        # input_channel = 256, output_channel = unchanged, kernel_size = 2, stride=2, padding=0, receptive_field = 16x16
        self.pool2 = nn.MaxPool2d(2, 2)

        # input_size = 7x7x256, output_size = 5x5x512
        # input_channel = 256, output_channel/no. of kernels = 512, kernel_size = 3, padding=0, receptive_field = 24x24
        # without padding, each 3x3 convolution shrinks the spatial size by 2
        self.conv5 = nn.Conv2d(256, 512, 3)

        # input_size = 5x5x512, output_size = 3x3x1024
        # input_channel = 512, output_channel/no. of kernels = 1024, kernel_size = 3, padding=0, receptive_field = 32x32
        self.conv6 = nn.Conv2d(512, 1024, 3)

        # input_size = 3x3x1024, output_size = 1x1x10
        # input_channel = 1024, output_channel/no. of kernels = 10, kernel_size = 3, padding=0, receptive_field = 40x40 (covers the whole 28x28 image)
        # here output_channel also represents the number of classes we need to predict.
        self.conv7 = nn.Conv2d(1024, 10, 3)
    
    # `forward()` connects the layers defined above and produces the output as the input propagates through them
    def forward(self, x): # x is the input batch of images, shape (N, 1, 28, 28)

        # x -> conv1 -> F.relu -> conv2 -> F.relu -> pool1 -> updated x
        # F.relu is the activation function; it adds non-linearity by zeroing negative activations
        x = self.pool1(F.relu(self.conv2(F.relu(self.conv1(x)))))

        # updated x -> conv3 -> F.relu -> conv4 -> F.relu -> pool2 -> updated x
        x = self.pool2(F.relu(self.conv4(F.relu(self.conv3(x)))))

        # updated x -> conv5 -> F.relu -> conv6 -> F.relu -> updated x
        x = F.relu(self.conv6(F.relu(self.conv5(x))))

        # updated x -> conv7 -> F.relu -> updated x
        # in total we have 7 convolution layers and 2 pooling layers
        # note: this ReLU clamps negative class scores to 0 before the softmax, which may limit accuracy;
        # the final layer is usually left without an activation
        x = F.relu(self.conv7(x))

        # finally, flatten the 1x1x10 output into a vector of 10 values per image
        x = x.view(-1, 10)

        # return the log of the softmax probabilities, which is what F.nll_loss expects
        # softmax is used in the final layer for multi-class problems
        # dim=1 applies it across the 10 classes and avoids the implicit-dimension deprecation warning
        return F.log_softmax(x, dim=1)
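
The output sizes of the layers above follow from the usual convolution arithmetic, and the same loop can track the theoretical receptive field. The sketch below is a plain-Python illustration, not part of the original notebook; the `LAYERS` table and the `trace_shapes` helper are hypothetical names that mirror the kernel size, stride, and padding of each layer in `Net`. It assumes square inputs and the standard recurrence `r_out = r_in + (kernel - 1) * jump`, where `jump` is the product of all earlier strides.

```python
# (name, kernel_size, stride, padding), copied by hand from the Net class
LAYERS = [
    ("conv1", 3, 1, 1), ("conv2", 3, 1, 1), ("pool1", 2, 2, 0),
    ("conv3", 3, 1, 1), ("conv4", 3, 1, 1), ("pool2", 2, 2, 0),
    ("conv5", 3, 1, 0), ("conv6", 3, 1, 0), ("conv7", 3, 1, 0),
]


def trace_shapes(layers, size=28):
    """Return (name, output_size, receptive_field) for each layer."""
    receptive_field, jump = 1, 1
    rows = []
    for name, kernel, stride, padding in layers:
        # spatial output size of a convolution or pooling layer
        size = (size + 2 * padding - kernel) // stride + 1
        # the receptive field grows by (kernel - 1) input-pixel jumps
        receptive_field += (kernel - 1) * jump
        # a strided layer widens the jump seen by every later layer
        jump *= stride
        rows.append((name, size, receptive_field))
    return rows


for name, size, rf in trace_shapes(LAYERS):
    print(f"{name}: output {size}x{size}, receptive field {rf}x{rf}")
```

Under these assumptions padding affects only the output size, not the receptive field, and the last layer sees a region larger than the 28x28 input.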

In the cell below we install the `torchsummary` Python package and set up the GPU device to train the model efficiently.


*   `torchsummary` is a Python package that provides a utility function for inspecting a model: its layers, output shapes, and parameter counts.

*   In `torch` we can use CUDA devices (NVIDIA GPUs) to train the model faster. The cell below contains the code to select a CUDA GPU when one is available.



In [3]:
# install the torchsummary Python package
!pip install torchsummary
from torchsummary import summary # import the summary function from torchsummary

# check whether a CUDA device is available on this system.
# use_cuda is True if a device is available and False otherwise
use_cuda = torch.cuda.is_available()

# set up the device based on availability:
# if CUDA is available we use the GPU, otherwise we use the CPU
device = torch.device("cuda" if use_cuda else "cpu")

# create an instance of our custom class Net() called model.
# calling .to(device) on the instance moves all of its parameters and buffers to the selected device
model = Net().to(device)

# print the model summary: layers, output shapes, number of parameters, etc.
summary(model, input_size=(1, 28, 28))

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1           [-1, 32, 28, 28]             320
            Conv2d-2           [-1, 64, 28, 28]          18,496
         MaxPool2d-3           [-1, 64, 14, 14]               0
            Conv2d-4          [-1, 128, 14, 14]          73,856
            Conv2d-5          [-1, 256, 14, 14]         295,168
         MaxPool2d-6            [-1, 256, 7, 7]               0
            Conv2d-7            [-1, 512, 5, 5]       1,180,160
            Conv2d-8           [-1, 1024, 3, 3]       4,719,616
            Conv2d-9             [-1, 10, 1, 1]          92,170
Total params: 6,379,786
Trainable params: 6,379,786
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 
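
The parameter counts in the summary can be reproduced by hand: a `Conv2d` layer with bias has `(kernel_size * kernel_size * in_channels + 1) * out_channels` parameters. A small sketch, assuming 3x3 kernels with bias throughout as in `Net` (the `CONV_CHANNELS` list and `conv_params` helper are illustrative names, not part of the notebook):

```python
# (in_channels, out_channels) for conv1..conv7, copied from the Net class
CONV_CHANNELS = [(1, 32), (32, 64), (64, 128), (128, 256),
                 (256, 512), (512, 1024), (1024, 10)]


def conv_params(in_channels, out_channels, kernel_size=3):
    """Number of learnable parameters in a Conv2d layer with bias."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    biases = out_channels
    return weights + biases


counts = [conv_params(i, o) for i, o in CONV_CHANNELS]
total = sum(counts)
print(counts)
print(total)
```

The pooling layers have no learnable parameters, which is why they show 0 in the table.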



So far we have defined our model and the device on which we want to train it.
Now we will look at how to define the data and batch size, and how to load the data for training.

*   In the code below, we create an instance of `torch.utils.data.DataLoader`. The DataLoader streams data from storage into memory on the fly, in batches, and applies the dataset's pre-processing as items are loaded. Its arguments are:
  1. `dataset` - the dataset object to load from; here we create an MNIST dataset object.
  2. `batch_size` - the number of images in each batch, i.e. the number of images used for one parameter update.
  3. `shuffle` - flag that specifies whether the dataset should be shuffled. Shuffling helps break any bias related to the order of the images during training.
  
*   We use `batch_size` to set the number of input images used in each optimization step. If the model updates its parameters after every single image, it is called stochastic optimization; if it updates only after passing through all images, it is called batch optimization; anything in between is mini-batch optimization. It is common to choose a batch size that is a power of 2, e.g. 16, 64, 128.

* Here we use the existing MNIST dataset from the `datasets` module. Below are the arguments we pass to `datasets.MNIST()`:
  1. The path where the MNIST dataset should be downloaded.
  2. The `train` flag, which specifies whether the dataset instance is the training split or the evaluation split.
  3. `transforms.ToTensor()`, which converts the image data into PyTorch tensors with values scaled to the range 0-1.
  4. `transforms.Normalize()`, which normalizes the image using the provided mean and standard deviation: each value `x` becomes `(x - mean) / std`. Both transforms are combined with `transforms.Compose` and passed as the `transform` argument.
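
To make the two transforms concrete, the sketch below applies the same arithmetic to single pixel values in plain Python. It assumes that `ToTensor` divides 8-bit pixel values by 255 and that `Normalize` computes `(x - mean) / std`, and it uses the mean and standard deviation that appear in the code cell below; `normalize_pixel` is a hypothetical helper, not a torchvision function.

```python
MNIST_MEAN = 0.1307  # mean passed to transforms.Normalize in the notebook
MNIST_STD = 0.3081   # standard deviation passed to transforms.Normalize


def normalize_pixel(value):
    """Mimic ToTensor followed by Normalize for one 8-bit grayscale pixel."""
    scaled = value / 255.0                    # ToTensor: 0-255 becomes 0.0-1.0
    return (scaled - MNIST_MEAN) / MNIST_STD  # Normalize: subtract mean, divide by std


for value in (0, 128, 255):
    print(value, round(normalize_pixel(value), 4))
```

Black background pixels end up slightly below zero and bright strokes well above it.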











In [4]:


torch.manual_seed(1) # many values are initialized randomly; fixing the seed makes the random values reproducible across runs.

batch_size = 128 # the model updates its parameters after every batch of 128 images.

# if we are using a CUDA GPU, pass extra keyword arguments to the DataLoader.
# num_workers is the number of worker processes used for data loading; setting it above 0 loads data in parallel with training.
# pin_memory=True makes the DataLoader place batches in pinned (page-locked) host memory,
# which can speed up data transfer from CPU to GPU.
kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}

# create the DataLoader instance used for training the model.
# here we use the MNIST dataset available in torchvision
train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('../data', train=True, download=True,
                    transform=transforms.Compose([
                        transforms.ToTensor(),
                        transforms.Normalize((0.1307,), (0.3081,))
                    ])),
    batch_size=batch_size, shuffle=True, **kwargs)

# create the DataLoader instance used for evaluating the model.
# we set train to False to select the test split.
test_loader = torch.utils.data.DataLoader(
    datasets.MNIST('../data', train=False, transform=transforms.Compose([
                        transforms.ToTensor(),
                        transforms.Normalize((0.1307,), (0.3081,))
                    ])),
    batch_size=batch_size, shuffle=True, **kwargs)


Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ../data/MNIST/raw/train-images-idx3-ubyte.gz


  0%|          | 0/9912422 [00:00<?, ?it/s]

Extracting ../data/MNIST/raw/train-images-idx3-ubyte.gz to ../data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ../data/MNIST/raw/train-labels-idx1-ubyte.gz


  0%|          | 0/28881 [00:00<?, ?it/s]

Extracting ../data/MNIST/raw/train-labels-idx1-ubyte.gz to ../data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw/t10k-images-idx3-ubyte.gz


  0%|          | 0/1648877 [00:00<?, ?it/s]

Extracting ../data/MNIST/raw/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz


  0%|          | 0/4542 [00:00<?, ?it/s]

Extracting ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw
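
With `batch_size = 128`, the number of parameter updates per epoch follows directly from the dataset size. A quick check, assuming the standard MNIST split of 60,000 training and 10,000 test images and the DataLoader default of keeping the final, smaller batch (`batches_per_epoch` is an illustrative helper, not a PyTorch function):

```python
def batches_per_epoch(num_samples, batch_size):
    """Return (number of batches, size of the last batch) when the partial batch is kept."""
    full_batches, remainder = divmod(num_samples, batch_size)
    if remainder == 0:
        return full_batches, batch_size
    return full_batches + 1, remainder


train_batches, last_train = batches_per_epoch(60000, 128)
test_batches, last_test = batches_per_epoch(10000, 128)
print(train_batches, last_train)
print(test_batches, last_test)
```

The 469 training batches match the `469/469` counter shown by the progress bars in the training output further down.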



In the cell below we define our main `train()` and `test()` functions, which are responsible for training and evaluating our model.

Our `train()` and `test()` functions take the following input arguments:
*  model - an instance of our `Net()` module, which contains the model architecture.
*  device - the device on which the neural network is trained or evaluated.
*  train_loader - the DataLoader instance that handles the training dataset (`test()` takes `test_loader` for the test dataset instead).
*  optimizer - the optimizer used to perform gradient descent, e.g. Adam, SGD, or SGD with momentum (`train()` only).
*  epoch - the index of the current epoch (`train()` only). One epoch is one full pass over the training dataset, with a forward and backward pass for every batch; training for 20 epochs means the model goes through the complete dataset 20 times.



In [5]:
from tqdm import tqdm # a Python library that provides a progress bar for loops

# define the train() function, which performs the forward and backward passes for training
def train(model, device, train_loader, optimizer, epoch):
    model.train() # put the model in training mode (sets the internal training flag to True); this affects layers such as dropout and batch norm.
    pbar = tqdm(train_loader) # wrap train_loader to show a progress bar while iterating
    for batch_idx, (data, target) in enumerate(pbar): # iterate through train_loader and unpack the data and ground truth
        data, target = data.to(device), target.to(device) # move the data and target tensors to the compute device
        optimizer.zero_grad() # reset the gradients of all optimized parameters to zero
        output = model(data) # perform the forward pass, i.e. the prediction for this batch
        loss = F.nll_loss(output, target) # compute the negative log likelihood loss from the predictions and the ground truth
        loss.backward() # perform the backward pass to compute the gradients
        optimizer.step() # update the parameters/weights using the gradients
        pbar.set_description(desc=f'loss={loss.item()} batch_id={batch_idx}') # display the loss of each iteration.

# define the test() function, which evaluates model performance on data not used for training.
def test(model, device, test_loader):
    model.eval() # put the model in evaluation mode, i.e. set the training flag to False
    test_loss = 0
    correct = 0
    with torch.no_grad(): # switch off gradient computation
        for data, target in test_loader: # iterate through the batches of test_loader
            data, target = data.to(device), target.to(device) # move the tensors to the compute device
            output = model(data) # perform the prediction on the test data
            test_loss += F.nll_loss(output, target, reduction='sum').item()  # compute and sum up the batch loss
            pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item() # count the correct predictions by comparing them with the ground truth.

    test_loss /= len(test_loader.dataset) # compute the average loss per sample over the test dataset

    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))  # print the loss and accuracy
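
The loss used in `train()` and `test()` pairs `F.log_softmax` with `F.nll_loss`. The plain-Python sketch below reproduces that arithmetic for a single example so the numbers can be checked by hand; the `log_softmax` and `nll_loss` functions here are simplified stand-ins written for illustration (one sample, no batching, no reduction), not the PyTorch implementations.

```python
import math


def log_softmax(logits):
    """Log of the softmax probabilities, computed stably via the log-sum-exp trick."""
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(v - shift) for v in logits))
    return [v - log_sum_exp for v in logits]


def nll_loss(log_probs, target):
    """Negative log likelihood of the target class for one sample."""
    return -log_probs[target]


logits = [2.0, 0.5, -1.0]  # made-up class scores for a 3-class example
log_probs = log_softmax(logits)
prediction = max(range(len(log_probs)), key=log_probs.__getitem__)  # argmax
print(prediction, nll_loss(log_probs, target=0))

# equal scores for all 10 digits give a loss of ln(10)
uniform_loss = nll_loss(log_softmax([0.0] * 10), target=3)
print(uniform_loss)
```

A model that assigns equal scores to all 10 digits has a loss of ln(10), roughly 2.30, which is a useful reference point when reading the loss values printed during training.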

In the cell below, we first create our model instance `model`, then define our optimizer, SGD (Stochastic Gradient Descent) with learning rate = 0.01 and momentum = 0.9.
  * The learning rate controls how much the parameter values change at each update. A larger learning rate means the parameters are updated by larger amounts.
  * Momentum accumulates past gradients, which smooths the updates and can help the optimizer move past shallow local minima.

In this cell we also iterate over the epochs, calling the `train()` and `test()` functions in each one.
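
To illustrate what the momentum term does, the sketch below runs the update rule on a one-parameter toy problem, minimizing `f(w) = w**2`. It follows the formulation documented for `torch.optim.SGD` without dampening, weight decay, or Nesterov momentum (`v = momentum * v + grad`, then `w = w - lr * v`); the function `sgd_momentum` and the toy objective are assumptions made for illustration only.

```python
def sgd_momentum(w, lr=0.01, momentum=0.9, steps=200):
    """Minimize f(w) = w**2 with SGD plus momentum, starting from w."""
    velocity = 0.0
    for _ in range(steps):
        grad = 2.0 * w                          # derivative of w**2
        velocity = momentum * velocity + grad   # accumulate past gradients
        w = w - lr * velocity                   # parameter update
    return w


with_momentum = sgd_momentum(5.0, momentum=0.9)
without_momentum = sgd_momentum(5.0, momentum=0.0)  # plain gradient descent
print(abs(with_momentum), abs(without_momentum))
```

On this toy problem the run with momentum ends closer to the minimum after the same number of steps, because the velocity keeps accumulating gradients that point the same way.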

In [8]:

model = Net().to(device) # create an instance of Net() and move it to the selected device
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9) # set the optimizer for training

# iterate through the epochs; each epoch iterates through all the training data again.
# note: range(1, 20) runs epochs 1 to 19, i.e. 19 epochs in total.
for epoch in range(1, 20):
    train(model, device, train_loader, optimizer, epoch) # call train() to train for one epoch
    test(model, device, test_loader) # call test() to evaluate the model on the test dataset

loss=1.4207569360733032 batch_id=468: 100%|██████████| 469/469 [00:18<00:00, 24.96it/s]



Test set: Average loss: 1.4180, Accuracy: 5030/10000 (50%)



loss=1.4068306684494019 batch_id=468: 100%|██████████| 469/469 [00:17<00:00, 26.85it/s]



Test set: Average loss: 1.3802, Accuracy: 5082/10000 (51%)



loss=1.4327501058578491 batch_id=468: 100%|██████████| 469/469 [00:17<00:00, 26.30it/s]



Test set: Average loss: 1.3694, Accuracy: 5089/10000 (51%)



loss=1.4305704832077026 batch_id=468: 100%|██████████| 469/469 [00:17<00:00, 26.85it/s]



Test set: Average loss: 1.3681, Accuracy: 5103/10000 (51%)



loss=1.2765380144119263 batch_id=468: 100%|██████████| 469/469 [00:17<00:00, 26.14it/s]



Test set: Average loss: 1.1427, Accuracy: 6080/10000 (61%)



loss=0.8513968586921692 batch_id=468: 100%|██████████| 469/469 [00:18<00:00, 25.45it/s]



Test set: Average loss: 0.9220, Accuracy: 6087/10000 (61%)



loss=0.9032487869262695 batch_id=468: 100%|██████████| 469/469 [00:17<00:00, 26.69it/s]



Test set: Average loss: 0.9191, Accuracy: 6088/10000 (61%)



loss=0.7056861519813538 batch_id=468: 100%|██████████| 469/469 [00:17<00:00, 26.64it/s]



Test set: Average loss: 0.6945, Accuracy: 7037/10000 (70%)



loss=0.5310333967208862 batch_id=468: 100%|██████████| 469/469 [00:17<00:00, 26.16it/s]



Test set: Average loss: 0.6956, Accuracy: 7042/10000 (70%)



loss=0.8190118670463562 batch_id=468: 100%|██████████| 469/469 [00:17<00:00, 27.00it/s]



Test set: Average loss: 0.6959, Accuracy: 7031/10000 (70%)



loss=0.7738959789276123 batch_id=468: 100%|██████████| 469/469 [00:20<00:00, 23.20it/s]



Test set: Average loss: 0.6948, Accuracy: 7042/10000 (70%)



loss=0.6029121279716492 batch_id=468: 100%|██████████| 469/469 [00:17<00:00, 26.34it/s]



Test set: Average loss: 0.6994, Accuracy: 7022/10000 (70%)



loss=0.6089175343513489 batch_id=468: 100%|██████████| 469/469 [00:17<00:00, 26.70it/s]



Test set: Average loss: 0.6929, Accuracy: 7036/10000 (70%)



loss=0.5516868829727173 batch_id=468: 100%|██████████| 469/469 [00:18<00:00, 25.22it/s]



Test set: Average loss: 0.6964, Accuracy: 7040/10000 (70%)



loss=0.623724639415741 batch_id=468: 100%|██████████| 469/469 [00:17<00:00, 26.35it/s]



Test set: Average loss: 0.6916, Accuracy: 7051/10000 (71%)



loss=0.6434343457221985 batch_id=468: 100%|██████████| 469/469 [00:18<00:00, 25.11it/s]



Test set: Average loss: 0.6992, Accuracy: 7030/10000 (70%)



loss=0.7675644755363464 batch_id=468: 100%|██████████| 469/469 [00:17<00:00, 26.43it/s]



Test set: Average loss: 0.7072, Accuracy: 7033/10000 (70%)



loss=0.6762976050376892 batch_id=468: 100%|██████████| 469/469 [00:17<00:00, 26.15it/s]



Test set: Average loss: 0.6967, Accuracy: 7046/10000 (70%)



loss=0.7195828557014465 batch_id=468: 100%|██████████| 469/469 [00:18<00:00, 25.44it/s]



Test set: Average loss: 0.6997, Accuracy: 7054/10000 (71%)



Mon Dec  5 18:42:39 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   65C    P0    31W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces