## **D19021 - Pooja More - Feed Forward Network for MNIST Classification in PyTorch**

### Objective :
To build a feed-forward network for MNIST classification in PyTorch, trained for no more than 10 epochs.

### Parameters :
1. Number of parameters used in the model (the lower the better)
2. Validation data accuracy (the higher the better)
3. Experimentation details leading to the final set of parameters used in the model.
4. Uploading the code to your GitHub profile.

**Solution** :

Importing necessary Libraries

In [None]:
import torch
from torch import nn
import torch.nn.functional as F
from torchvision import datasets,transforms
from torch import optim

Loading the train and test data using dataloaders

In [None]:
transform=transforms.Compose([transforms.ToTensor()])
trainset=datasets.MNIST('~/.pytorch/MNIST_data/',train=True,transform=transform,download=True)
testset=datasets.MNIST('~/.pytorch/MNIST_data/',train=False,transform=transform,download=True)

trainloader=torch.utils.data.DataLoader(trainset,batch_size=100,shuffle=True,num_workers=0)
# The test set is only used for evaluation, so it does not need to be shuffled
testloader=torch.utils.data.DataLoader(testset,batch_size=100,shuffle=False,num_workers=0)

Defining the Neural Network Architecture

In [None]:
class Network(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 100)  # input (28x28 = 784 pixels) -> hidden layer 1
        self.fc2 = nn.Linear(100, 50)   # hidden layer 1 -> hidden layer 2
        #self.fc3 = nn.Linear(64, 50)   # optional extra hidden layer (disabled)
        self.fc4 = nn.Linear(50, 10)    # hidden layer 2 -> output (10 digit classes)

        # Dropout module with 0.2 drop probability.
        # Dropout helps avoid overfitting the model to the training data.
        self.dropout = nn.Dropout(p=0.2)

    def forward(self, x):
        # make sure input tensor is flattened
        x = x.view(x.shape[0], -1)

        # Now with dropout
        x = self.dropout(F.relu(self.fc1(x)))
        x = self.dropout(F.relu(self.fc2(x)))
        #x = self.dropout(F.relu(self.fc3(x)))

        # Output layer, so no dropout here.
        # Softmax is an activation function that turns raw scores (logits) into
        # probabilities that sum to one, i.e. a probability distribution over
        # the 10 possible digit classes. log_softmax returns the logarithm of
        # those probabilities.
        x = F.log_softmax(self.fc4(x), dim=1)

        return x

model=Network()
#Adam Optimizer
'''Adam is an adaptive learning rate optimization algorithm.
It uses adaptive learning rate methods to find an individual learning rate for each parameter.
It also has the advantages of Adagrad, which works really well in settings with sparse gradients.
In some settings Adam can find a worse solution than stochastic gradient descent.
'''
optimizer = optim.Adam(model.parameters(), lr=0.001)

#SGD - Stochastic Gradient Descent Optimizer
'''Stochastic gradient descent (often abbreviated SGD) is an iterative method for optimizing an objective function
with suitable smoothness properties (e.g. differentiable or subdifferentiable).
It can be regarded as a stochastic approximation of gradient descent optimization,
since it replaces the actual gradient (calculated from the entire data set) by an estimate
thereof (calculated from a randomly selected subset of the data).'''

#optimizer = optim.SGD(model.parameters(), lr=0.001,momentum=0.9)

#Negative Log Likelihood
'''The negative log-likelihood loss function is often used in combination with a softmax output
to measure how well the neural network classifies data.
For a single sample it is defined as loss = -log(y), where y is the probability assigned to the correct class.
It produces a high value when the output probabilities are spread evenly across the classes,
in other words when the classification is unclear.
It also produces relatively high values when the classification is wrong.
nn.NLLLoss expects log-probabilities as input, which is why the model ends in log_softmax.'''

criterion=nn.NLLLoss()
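
To make the pairing of `log_softmax` and `NLLLoss` concrete, here is a minimal pure-Python sketch (no PyTorch required) that computes the loss for one sample by hand. The helper names `log_softmax` and `nll` and the example logits are illustrative assumptions, not part of the notebook.

```python
import math

def log_softmax(logits):
    """Return log-probabilities for a list of raw scores (logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]

def nll(log_probs, target):
    """Negative log-likelihood of the target class for one sample."""
    return -log_probs[target]

# Hypothetical logits for a 10-class problem; the true class is 3.
uniform = [0.0] * 10      # the network is completely unsure
confident = [0.0] * 10
confident[3] = 8.0        # strongly favours the correct class
wrong = [0.0] * 10
wrong[7] = 8.0            # strongly favours an incorrect class

loss_uniform = nll(log_softmax(uniform), 3)
loss_confident = nll(log_softmax(confident), 3)
loss_wrong = nll(log_softmax(wrong), 3)

print(f"uniform:   {loss_uniform:.3f}")
print(f"confident: {loss_confident:.3f}")
print(f"wrong:     {loss_wrong:.3f}")
```

The uniform case gives ln 10 ≈ 2.303, the confident and correct case gives a loss close to zero, and the confident but wrong case gives the largest loss of the three.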

In [None]:
epochs=7
train_losses,test_losses=[],[]
for e in range(epochs):
    # ---- training ----
    model.train()                        # enable dropout
    running_loss=0
    for images,labels in trainloader:
        optimizer.zero_grad()
        log_ps=model(images)             # forward() flattens the images itself
        loss=criterion(log_ps,labels)    # mean loss over the batch, a single value such as 2.33
        loss.backward()
        optimizer.step()
        # weight by batch size so a smaller final batch is averaged correctly,
        # e.g. (2.33*64 + 2.22*64 + 2.12*33) / 161
        running_loss += loss.item() * images.shape[0]

    # ---- evaluation on the test set (used here as the validation data) ----
    test_loss=0
    accuracy=0
    model.eval()                         # disable dropout
    with torch.no_grad():
        for images,labels in testloader:
            log_ps=model(images)
            test_loss+=criterion(log_ps,labels).item()*images.shape[0]
            top_class=log_ps.argmax(dim=1)   # most likely class for each image
            accuracy+=(top_class==labels).sum().item()

    train_losses.append(running_loss/len(trainloader.dataset))
    test_losses.append(test_loss/len(testloader.dataset))

    print("Epoch: {}/{}.. ".format(e+1, epochs),
          "Training Loss: {:.3f}.. ".format(running_loss/len(trainloader.dataset)),
          "Test Loss: {:.3f}.. ".format(test_loss/len(testloader.dataset)),
          "Test Accuracy: {:.3f}".format(accuracy/len(testloader.dataset)))

Epoch: 1/7..  Training Loss: 0.065..  Test Loss: 0.077..  Test Accuracy: 0.978
Epoch: 2/7..  Training Loss: 0.062..  Test Loss: 0.077..  Test Accuracy: 0.977
Epoch: 3/7..  Training Loss: 0.062..  Test Loss: 0.076..  Test Accuracy: 0.979
Epoch: 4/7..  Training Loss: 0.058..  Test Loss: 0.076..  Test Accuracy: 0.979
Epoch: 5/7..  Training Loss: 0.055..  Test Loss: 0.081..  Test Accuracy: 0.978
Epoch: 6/7..  Training Loss: 0.052..  Test Loss: 0.076..  Test Accuracy: 0.980
Epoch: 7/7..  Training Loss: 0.053..  Test Loss: 0.083..  Test Accuracy: 0.978
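
The training loop multiplies each batch loss by the batch size before dividing by the dataset size. The following pure-Python sketch shows why, using hypothetical batch losses and sizes; the helper `epoch_mean_loss` is an illustrative name, not something defined in the notebook.

```python
def epoch_mean_loss(batch_losses, batch_sizes):
    """Average per-sample loss over an epoch from per-batch mean losses."""
    total = sum(loss * n for loss, n in zip(batch_losses, batch_sizes))
    return total / sum(batch_sizes)

# Hypothetical epoch: two full batches of 64 and a final partial batch of 33.
losses = [2.33, 2.22, 2.12]
sizes = [64, 64, 33]

weighted = epoch_mean_loss(losses, sizes)
naive = sum(losses) / len(losses)  # ignores that the last batch is smaller

print(f"weighted: {weighted:.4f}, naive: {naive:.4f}")
```

With the batch size of 100 used here, both the 60,000 training images and the 10,000 test images divide evenly into batches, so the two averages coincide; the weighting only matters when the last batch is smaller.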


Calculating the total number of parameters

In [None]:
print("Our model: \n\n", model, '\n')

pytorch_total_params = sum(p.numel() for p in model.parameters())
pytorch_total_params

Our model: 

 Network(
  (fc1): Linear(in_features=784, out_features=100, bias=True)
  (fc2): Linear(in_features=100, out_features=50, bias=True)
  (fc4): Linear(in_features=50, out_features=10, bias=True)
  (dropout): Dropout(p=0.2, inplace=False)
) 



84060
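
The total can be checked by hand: a fully connected layer with `n_in` inputs and `n_out` outputs has `n_in * n_out` weights plus `n_out` biases. A small sketch of that arithmetic follows, where `count_linear_params` is a hypothetical helper and the second architecture is only an illustration of how an extra layer changes the count.

```python
def count_linear_params(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

current = count_linear_params([784, 100, 50, 10])        # architecture used above
deeper = count_linear_params([784, 100, 64, 50, 10])     # hypothetical extra 64-unit layer

print(current)  # 78500 + 5050 + 510 = 84060
print(deeper)
```

Most of the parameters sit in the first layer (78,500 of 84,060), so the width of the first hidden layer has the largest effect on the model size.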

**Conclusion :**

As per the observations recorded in the Excel sheet, the number of parameters grows as more layers and neurons are added. So we have purposely avoided adding further hidden layers to our model.
<br>So far the best test accuracy we have got is **98.0%**, using the ReLU activation function and the Adam optimizer, with the neurons distributed as follows :-
<br>Input Layer : 784
<br>Hidden Layer 1 : 100
<br>Hidden Layer 2 : 50
<br>Output Layer : 10
<br><br>The specifications are as follows :-
<br>Batch Size : 100
<br>Learning Rate :  0.001
<br>Dropout : 0.2
<br>No. of Epochs : 7

<br>The accuracy of the model might be increased further by adding hidden layers and neurons, at the cost of a larger number of parameters, as well as by tuning the batch size.

<br> ***For now, I can conclude that this is the best-optimized solution I have found so far.***