# Digit Recognition - using NN and CNN (PyTorch) 
> Classify digit images using Neural Network and Convolutional Neural Network. 

- toc: false 
- badges: true
- comments: false
- categories: [jupyter, pytorch, neuralnetwork, convolutionalneuralnetwork]
- author: Venkataramani, Suja

## Overview


## Method

We use the MNIST dataset, a collection of black-and-white images of handwritten digits which have been centred and normalised to a standard size of 28 x 28 pixels.

PyTorch is a deep learning framework based on Torch, developed by Facebook. TensorFlow is another such framework, developed by Google. Keras is a wrapper framework for TensorFlow with a simpler interface, more suitable for smaller datasets. For a comparison between these frameworks, read [this](https://towardsdatascience.com/keras-vs-pytorch-for-deep-learning-a013cb63870d).

In this blog we will use PyTorch to build our deep learning models. Let's install the CPU version using:  

pip install torch==1.7.1+cpu torchvision==0.8.2+cpu torchaudio===0.7.2 -f https://download.pytorch.org/whl/torch_stable.html



In [37]:
# Import packages.
import torch
import torchvision
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
from PIL import Image
import matplotlib.pyplot as plt
from torch import nn, optim
from torch.nn import Linear, ReLU, Sequential, Conv2d, MaxPool2d
import torch.nn.functional as func

### What is a tensor?

A tensor is an n-dimensional array data structure used to store numbers on which mathematical operations can be performed for machine learning. In PyTorch, tensors can be placed on GPUs, which makes tensor computations such as slicing and mathematical operations extremely efficient. 

[DataCamp](https://www.datacamp.com/community/tutorials/investigating-tensors-pytorch)  

The input to the neural network is a tensor. Images are normally in the format (H, W, C); they first need to be converted into a tensor of the format (B, C, H, W), where 
    B = Number of Images (batch)  
    C = number of colour channels (Black and white = 1, colour = 3)  
    H = Height of the image  
    W = Width of the image
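
As a minimal sketch of this conversion (the all-zero image below is a hypothetical stand-in for a real 28 x 28 black and white image), `permute` reorders the axes and `unsqueeze` adds the batch dimension:

```python
import torch

# Hypothetical stand-in for one black and white image in (H, W, C) layout.
image_hwc = torch.zeros(28, 28, 1)

# Reorder the axes from (H, W, C) to (C, H, W).
image_chw = image_hwc.permute(2, 0, 1)

# Add a leading batch dimension to get (B, C, H, W).
batch = image_chw.unsqueeze(0)

print(tuple(image_hwc.shape))  # (28, 28, 1)
print(tuple(batch.shape))      # (1, 1, 28, 28)
```

In this blog the same work is done by `ToTensor` and `DataLoader`, so the manual version is only for illustration.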


In [38]:
# The "transforms" module provides image transformations. A set of transformations can be chained together using Compose.

# ToTensor converts a PIL image or numpy image array of (H, W, C) in the range [0, 255] into a tensor of (C, H, W) in the range [0, 1].

# Normalize accepts a mean and a standard deviation per channel and computes (image - mean) / std for every channel. With mean 0.5 and std 0.5, pixel values move from [0, 1] to [-1, 1], which keeps all the inputs in the same range and reduces the skew in the input data due to different ranges of numbers.
transform_step = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])

# torchvision.datasets makes the most commonly used deep learning datasets available for easy download. All datasets share a common interface, including the transform argument. train=True gets the training data (60,000 samples), train=False gets the test data (10,000 samples).
train_data = datasets.MNIST(root='./data/mmist_train', download=True, train=True, transform=transform_step)
test_data = datasets.MNIST(root='./data/minst_test', download=True, train=False, transform=transform_step)

# DataLoader creates iterable batches of data to aid with training a neural network model. Setting shuffle to True results in randomly shuffled batches of images.
train_data_loader = DataLoader(train_data, batch_size=32, shuffle=True)
test_data_loader = DataLoader(test_data, batch_size=32, shuffle=True)

In [None]:
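# Use the built-in next(); the iterator's .next() method is not available in newer PyTorch versions.
images, labels = next(iter(train_data_loader))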

In [None]:
# The squeeze function removes dimensions of size 1 from an array. Tensor images have the channel as the first dimension, and this needs to be removed before displaying.
print("Before squeeze", images[0].numpy().shape)
print("After squeeze", images[0].numpy().squeeze().shape)

In [None]:
# Let's look at a few images.
plt.figure(figsize=(10,10))

plt.subplot(221), plt.imshow(images[0].numpy().squeeze());
plt.subplot(222), plt.imshow(images[1].numpy().squeeze());
plt.subplot(223), plt.imshow(images[2].numpy().squeeze());
plt.subplot(224), plt.imshow(images[3].numpy().squeeze());

### Deep Neural Network

A deep neural network is a stacked set of nodes with more than one layer between the input and output. Suppose we have a multiple linear regression problem to solve:

y = a + bx1 + cx2

where x1 and x2 are the inputs, a is the bias, y is the expected value, and b and c are the coefficients we are trying to learn with the machine learning model.

At the beginning, b and c, also called weights, are assigned random values. The inputs are passed to a node, and a bias is assigned which is unrelated to the inputs x1 and x2. The values are combined and passed to an activation function, which decides the output of the node. The predicted y is then compared with the expected y and the error is calculated. 

In a feed-forward network trained with backpropagation, the error is sent back all the way to the initial weights, the weights are adjusted based on the error, and the data goes for a second round through the network. With a suitable learning rate, each pass through the network improves the result, so that the error reduces over successive passes.
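
The loop of predicting, measuring the error and adjusting the weights can be sketched in plain Python for the equation y = a + bx1 + cx2. This is only an illustration: the sample values, the fixed starting weights (standing in for random ones) and the learning rate are all assumptions made for this sketch.

```python
# Hypothetical training sample: inputs x1, x2 and the expected output y.
x1, x2, y_expected = 1.0, 2.0, 9.0

# Bias a and weights b, c; fixed starting values stand in for random initialisation.
a, b, c = 0.0, 0.5, 0.5
learning_rate = 0.05

squared_errors = []
for step in range(20):
    # Forward pass: compute the prediction.
    y_pred = a + b * x1 + c * x2
    error = y_pred - y_expected
    squared_errors.append(error ** 2)

    # Backward pass: gradient of the squared error with respect to a, b and c.
    grad_a = 2 * error
    grad_b = 2 * error * x1
    grad_c = 2 * error * x2

    # Adjust the bias and weights against the gradient.
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b
    c -= learning_rate * grad_c

print(squared_errors[0], squared_errors[-1])
```

With these values the squared error shrinks on every pass. A learning rate that is too large would make it grow instead.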

Deep neural networks are particularly useful when the input data has a large number of dimensions and the features need to be learnt by the model (automatic feature extraction) rather than being given as input, as in image recognition. The lower layers learn the low-level features. As the data advances through the node stack, each layer learns higher-level features based on the output of the previous layer. 

#### Activation Functions

Activation functions decide whether, and how strongly, a node passes its input on to the next layer. Some of them, such as sigmoid and tanh, also squash the output into the range (0, 1) or (-1, 1). The three main types are:

Binary Step Function:  
Given a threshold, returns 0 for values below the threshold and 1 for values greater than or equal to it.  

Linear Activation Function:  
The output is linearly dependent on the input.  

Non-linear Activation Function:  
Sigmoid (smooth S-shaped curve between 0 and 1)  
Hyperbolic tangent, tanh (curve centred around 0, between -1 and 1)  
ReLU (Rectified Linear Unit) - similar to linear for positive inputs but returns 0 for negative inputs; its simple derivative helps with backpropagation.  
Leaky ReLU - Has a small positive slope for negative values  
Parametric ReLU - Like Leaky ReLU, but the slope for negative values is learnt during training  
Softmax - Can give multi-class output, where the value is assigned a probability of belonging to each of the classes. Used in the final layer of the stack to assign the class.  
Swish - Self-gated activation function 

Bias can be considered equivalent to the intercept of a linear equation. It determines the threshold over which an activation function triggers. Weights determine how fast the activation function triggers. 
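
To make the list of activation functions concrete, here is a small plain-Python sketch of a few of them. The function names and the default slope of 0.01 for Leaky ReLU are choices made for this illustration; PyTorch ships its own implementations in `torch.nn`.

```python
import math

def binary_step(x, threshold=0.0):
    # 0 below the threshold, 1 at or above it.
    return 1.0 if x >= threshold else 0.0

def sigmoid(x):
    # Smooth curve between 0 and 1.
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    # 0 for negative inputs, the input itself otherwise.
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    # Small positive slope for negative inputs.
    return x if x > 0 else slope * x

def softmax(values):
    # Subtract the maximum first so that exp() cannot overflow.
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

for x in (-2.0, 0.0, 2.0):
    print(x, binary_step(x), round(sigmoid(x), 3), round(math.tanh(x), 3), relu(x), leaky_relu(x))

# Softmax turns a list of scores into probabilities that add up to 1.
print(softmax([1.0, 2.0, 3.0]))
```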


[PathMind](https://wiki.pathmind.com/neural-network)  
[MissingLink](https://missinglink.ai/guides/neural-network-concepts/7-types-neural-network-activation-functions-right/)  
[GeeksForGeeks](https://www.geeksforgeeks.org/effect-of-bias-in-neural-network/#:~:text=On%20the%20other%20hand%20Bias,best%20for%20the%20given%20data.)


In [60]:
# Let's set up a simple neural network with 3 layers, using ReLU as the activation function for the first two layers and LogSoftmax for the output. LogSoftmax applies a log to the softmax output, which is numerically more stable when the probabilities span a wide range.

# 28 * 28
input_size = 784
# First layer has 128 neurons, second layer has 64 neurons
hidden_size = [128, 64]
# Digits 0 - 9
output_size = 10

# Sequential stacks the layers one after another in the order given. The LogSoftmax parameter dim=1 is the dimension along which LogSoftmax will be calculated.
model = nn.Sequential(nn.Linear(input_size, hidden_size[0]),
                      nn.ReLU(),
                      nn.Linear(hidden_size[0], hidden_size[1]),
                      nn.ReLU(),
                      nn.Linear(hidden_size[1], output_size),
                      nn.LogSoftmax(dim=1))
print(model)

Sequential(
  (0): Linear(in_features=784, out_features=128, bias=True)
  (1): ReLU()
  (2): Linear(in_features=128, out_features=64, bias=True)
  (3): ReLU()
  (4): Linear(in_features=64, out_features=10, bias=True)
  (5): LogSoftmax(dim=1)
)


### Why do we use log functions in Machine Learning?

When the input values span a wide range, from very small to very large, we call the range skewed. Mathematical operations on such extreme values can underflow or overflow. 

In maths, the logarithm is the inverse of exponentiation. So when dealing with very high or low powers, we can minimise their effect by applying log, because:

e^a * e^b = e^(a+b)

log(a * b) = log(a) + log(b)

So instead of multiplying numbers, which leads to very big or very small results, we dampen the effect of the powers by adding their logs.
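
A short plain-Python sketch shows the problem and the fix. The list of 1,000 probabilities of 0.001 each is a made-up example chosen so that the direct product underflows:

```python
import math

# 1,000 hypothetical probabilities of 0.001 each.
probabilities = [1e-3] * 1000

# Multiplying them directly underflows to exactly 0.0.
product = 1.0
for p in probabilities:
    product *= p

# Adding the logs instead keeps the result representable.
log_sum = sum(math.log(p) for p in probabilities)

print(product)  # 0.0
print(log_sum)  # roughly -6907.76
```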

[Feedly](https://blog.feedly.com/tricks-of-the-trade-logsumexp/)

### What is a Criterion?

A machine learning model needs to measure the loss after every batch so that the weights can be adjusted to reduce this loss on the next pass. The criterion is the loss function used to calculate that loss, from which the gradients are computed. There are several loss functions:

AbsCriterion (Mean Absolute Error): loss(x,y) = sum(|xi - yi|)/n  
MSECriterion (Mean Squared Error): loss(x,y) = sum((xi - yi)^2)/n  
NLLLoss (Negative Log Likelihood): loss(y) = -log(y), summed over the correct classes. The higher the probability assigned to the right class, the lower the loss and the more correct the model.
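
The three loss functions above can be sketched in plain Python as follows. The function names and sample numbers are invented for illustration, and the NLL version averages over the samples, which is an assumption made here to mirror the default behaviour of `nn.NLLLoss`.

```python
import math

def absolute_error(predicted, expected):
    # Mean of the absolute differences.
    return sum(abs(p - e) for p, e in zip(predicted, expected)) / len(expected)

def mean_squared_error(predicted, expected):
    # Mean of the squared differences.
    return sum((p - e) ** 2 for p, e in zip(predicted, expected)) / len(expected)

def negative_log_likelihood(log_probabilities, correct_classes):
    # Mean of -log(probability of the correct class) over the samples.
    picked = [row[label] for row, label in zip(log_probabilities, correct_classes)]
    return -sum(picked) / len(picked)

print(absolute_error([1.0, 2.0, 4.0], [1.0, 3.0, 2.0]))      # 1.0
print(mean_squared_error([1.0, 2.0, 4.0], [1.0, 3.0, 2.0]))  # 5/3, about 1.667

# One sample, two classes, correct class = 0.
confident = [[math.log(0.9), math.log(0.1)]]
unsure = [[math.log(0.6), math.log(0.4)]]
print(negative_log_likelihood(confident, [0]))  # lower loss
print(negative_log_likelihood(unsure, [0]))     # higher loss
```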

[GitHub](https://ljvmiranda921.github.io/notebook/2017/08/13/softmax-and-the-negative-log-likelihood/)


### What is an Optimiser?

An optimiser is an algorithm which adjusts the weights (and, in some cases, the learning rate) of the machine learning model in order to reduce the loss. Some of the popular ones are:

Gradient Descent: Calculates the derivative of the loss function over the entire dataset before the weights are adjusted.  
Stochastic Gradient Descent (SGD): Calculates the derivative one sample at a time.  
Minibatch Gradient Descent: Calculates the derivative of the loss after every batch.  
Adaptive Moment Estimation (Adam): Adapts the step size of each parameter using running averages of past gradients and of their squares.  

[TowardsDataScience](https://towardsdatascience.com/optimizers-for-training-neural-network-59450d71caf6#:~:text=Optimizers%20are%20algorithms%20or%20methods,help%20to%20get%20results%20faster)  

### Parameter vs Hyperparameter

In machine learning, parameters are the coefficients of the equation that the model is trying to learn. They are calculated by the model and not given as input to it.

Hyperparameters are the inputs given by the user to control training, so that the model arrives at the best parameter values. For example: learning rate, momentum, number of epochs, batch size, etc.
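
As a worked example, the parameters of the 784 -> 128 -> 64 -> 10 network used in this blog can be counted by hand. The sketch below assumes fully connected layers with a bias on every output, as in `nn.Linear`:

```python
# Layer sizes of the fully connected network: 784 -> 128 -> 64 -> 10.
layer_sizes = [784, 128, 64, 10]

total = 0
for inputs, outputs in zip(layer_sizes, layer_sizes[1:]):
    # Each layer learns one weight per input/output pair plus one bias per output.
    layer_parameters = inputs * outputs + outputs
    print(inputs, "->", outputs, ":", layer_parameters)
    total += layer_parameters

print("Total parameters:", total)  # 109386
```

All 109,386 values are learnt by the model, whereas the learning rate, momentum, number of epochs and batch size are hyperparameters that we choose ourselves.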

### What is Learning Rate?

Learning rate is a hyperparameter which determines the rate at which the weights are adjusted; its value is usually set between 0 and 1. Ideally, the model should find the weights with the lowest loss without getting stuck in a local minimum.

Instead of remaining the same across all epochs, the learning rate can be made to vary. A decaying learning rate is one such technique, where the rate drops steadily as the model advances through the epochs. A scheduled learning rate drops the rate every few epochs. An adaptive learning rate is a technique where the rate increases and decreases depending on the gradient.
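
The first two strategies can be sketched in a few lines of plain Python. The exact formulas below (inverse-time decay and a step schedule) are one common choice picked for this illustration, not the only way to define them, and the function names are hypothetical:

```python
def decayed_rate(initial_rate, decay, epoch):
    # Decaying rate: shrinks a little with every epoch.
    return initial_rate / (1.0 + decay * epoch)

def scheduled_rate(initial_rate, drop, epochs_per_drop, epoch):
    # Scheduled rate: multiplied by `drop` once every `epochs_per_drop` epochs.
    return initial_rate * drop ** (epoch // epochs_per_drop)

for epoch in range(10):
    print(epoch, decayed_rate(0.1, 0.5, epoch), scheduled_rate(0.1, 0.5, 3, epoch))
```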

[MyGreatLearning](https://www.mygreatlearning.com/blog/understanding-learning-rate-in-machine-learning/)  

### What is Momentum?

In SGD, the loss is determined after every sample. When the samples are noisy, the steps taken towards the optimum weights can vary randomly depending on the next sample. Momentum is a hyperparameter, with a value between 0 and 1, which retains some portion of the previous updates when taking the next step. It is an attempt to smooth the direction of descent: a moving average of gradients that helps the model keep moving towards the lowest cost without getting stuck in local minima or local fluctuations.
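
A minimal sketch of a momentum update in plain Python is shown below. It follows the form velocity = momentum * velocity + gradient, weight = weight - learning rate * velocity. The cost function f(w) = w**2, the starting weight and the learning rate of 0.1 are assumptions chosen only to keep the example small.

```python
def sgd_momentum_step(weight, velocity, gradient, learning_rate=0.1, momentum=0.9):
    # Keep a fraction of the previous velocity and add the new gradient.
    velocity = momentum * velocity + gradient
    # Step against the accumulated direction.
    weight = weight - learning_rate * velocity
    return weight, velocity

# Minimise the hypothetical cost f(w) = w**2, whose gradient is 2 * w.
weight, velocity = 5.0, 0.0
for step in range(200):
    weight, velocity = sgd_momentum_step(weight, velocity, 2 * weight)

print(weight)  # close to 0, the minimum of f
```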

[TowardsDataScience](https://towardsdatascience.com/stochastic-gradient-descent-with-momentum-a84097641a5d)


In [61]:
# Train the model.
# Set the criterion = Negative Log Likelihood.
criterion = nn.NLLLoss()
# Set the optimiser as Stochastic Gradient Descent with a learning rate of 0.001 and momentum of 0.9.
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# The number of complete passes of the entire training dataset.
epochs = 10

for e in range(epochs):
    # Loads 32 random images at a time; this is mini-batch SGD. Multiple passes over the same dataset help learn the coefficients with the lowest cost.
    for images, labels in train_data_loader:
        # tensor.view reshapes the tensor without copying it. -1 means: calculate this dimension from the others so that the element count matches. [32, 1, 28, 28] changes to [32, 784].
        images = images.view(images.shape[0], -1)

        # Start by resetting the gradients of all the parameters to 0.
        optimizer.zero_grad()
        # Forward pass: LogSoftmax returns log probabilities.
        log_prob = model(images)
        # Calculate the loss by comparing the predicted log probabilities with the actual labels using NLL loss.
        loss = criterion(log_prob, labels)
        # Backward pass: computes the gradient of the loss for each parameter and stores it.
        loss.backward()
        # Updates every parameter using its gradient, taking into account the learning rate and momentum.
        optimizer.step()

In [62]:
# Test the model with the test data set.
correct_count = 0
count = 0

for images, labels in test_data_loader:
    # Pick one image at a time.
    for i in range(len(labels)):
        image = images[i].view(1, 784)
        label = labels[i]

        # De-activates autograd (gradient calculation).
        with torch.no_grad():
            log_prob = model(image)

        # Exponential operation is the inverse of log. 
        ps = torch.exp(log_prob)
        # Convert the tensor into numpy, convert into list.
        prob = list(ps.numpy()[0])
        # Find the index of the list with the highest probability; the index order is the order of the digits [0-9].
        pred_label = prob.index(max(prob))

        if (label == pred_label):
            correct_count += 1
        
        count += 1

In [64]:
accuracy = (correct_count/count) * 100
print(accuracy)

96.57


## CNN

A Convolutional Neural Network (CNN) is a variant of neural network that includes one or more convolutional layers in the stack. A CNN is good at reducing the dimensionality of the input without losing its features. A convolutional layer slides a kernel (a small matrix of numbers) over the input, multiplies the overlapping values and adds them up. This extracts features from the image. It is usually followed by a pooling layer, either max or average pooling. In max pooling, the maximum value of each small window is taken to the next level, the idea being that the noise is left behind and only the key features are kept. Finally, the data is passed on to a fully connected neural network and softmax for class prediction. 
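
The two operations can be sketched in plain Python on a tiny example. The 5 x 5 image, the 2 x 2 kernel and the helper names are all made up for this illustration. The sketch assumes square inputs, stride 1 and no padding, and, like `Conv2d`, applies the kernel without flipping it.

```python
def convolve2d(image, kernel):
    # Slide the kernel over the image (stride 1, no padding) and sum the products.
    k = len(kernel)
    size = len(image) - k + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(k) for j in range(k))
             for c in range(size)]
            for r in range(size)]

def max_pool2d(feature_map, window=2):
    # Keep the largest value of each non-overlapping window.
    size = len(feature_map) // window
    return [[max(feature_map[r * window + i][c * window + j]
                 for i in range(window) for j in range(window))
             for c in range(size)]
            for r in range(size)]

# Hypothetical 5 x 5 image with a vertical edge, and a 2 x 2 kernel that responds to it.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
kernel = [[-1, 1],
          [-1, 1]]

features = convolve2d(image, kernel)  # 4 x 4 feature map
pooled = max_pool2d(features)         # 2 x 2 after pooling

print(features)  # every row is [0, 2, 0, 0]: the edge lights up
print(pooled)    # [[2, 0], [2, 0]]
```

The pooled output is a quarter of the size of the feature map but still records where the edge is.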

[TowardsDataScience](https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53)  
[Jeremyjordan](https://www.jeremyjordan.me/convnet-architectures/)

In [39]:
# New class extending the nn.Module - the base class of all neural networks.
class Net(nn.Module):
    # Constructor which first calls the base class constructor.
    def __init__(self):
        super(Net, self).__init__()

        self.cnn_layers = Sequential(
            # in_channels: black and white = 1, colour image = 3; out_channels: number of feature maps to learn; kernel_size: size of the kernel matrix; stride: number of pixels to jump when applying the kernel; padding: number of pixels to add around the image.
            # Conv2d is used for images, Conv3d for videos.
            Conv2d(in_channels=1, out_channels=4, kernel_size=5),
            # Does the operation in place; this can save memory, but the original input is overwritten.
            ReLU(inplace=True),
            # Applies max pooling over a matrix of 2 x 2, jumps of 2.
            MaxPool2d(kernel_size=2),
            Conv2d(in_channels=4, out_channels=4, kernel_size=5),
            ReLU(inplace=True),
            MaxPool2d(kernel_size=2)
        )

        # Final fully connected layer.
        self.linear_layers = Sequential(
            # Input and output.
            Linear(4 * 4 * 4, 10)
        )
    
    def forward(self, x):
        x = self.cnn_layers(x)
        x = x.view(-1, 4 * 4 * 4)
        x = self.linear_layers(x)
        return func.log_softmax(x, dim=1)

### How are the input and output sizes calculated?
For a convolution with stride 1 and no padding: output size = input size - (filter size - 1). Max pooling with a 2 x 2 window halves the size.

Layer | Output size | Calculation 
----- | ----------- | ------------
input | 1 x 28 x 28 | 28 x 28 input image size  
conv2d-1 (1, 4, 5) | 4 x 24 x 24 | (28 - (5 - 1)) = `24`  
maxpool2d-1 (2) | 4 x 12 x 12 | 24/2 = `12`  
conv2d-2 (4, 4, 5) | 4 x 8 x 8 | (12 - (5 - 1)) = `8`  
maxpool2d-2 (2) | 4 x 4 x 4 | 8/2 = `4`  
fc1 (64, 10) | 10 | 4 x 4 x 4 = `64` inputs, 10 outputs  
  
[StackOverflow](https://stackoverflow.com/questions/42786717/how-to-calculate-the-number-of-parameters-for-convolutional-neural-network/42787467)
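
The same arithmetic can be checked with a small plain-Python sketch. The helper names are hypothetical. The convolution formula is the general one with stride and padding, which reduces to input size - (filter size - 1) for stride 1 and no padding.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    # General form; with stride 1 and no padding it reduces to input - (kernel - 1).
    return (input_size + 2 * padding - kernel_size) // stride + 1

def pool_output_size(input_size, kernel_size):
    # Max pooling here uses a stride equal to the kernel size.
    return (input_size - kernel_size) // kernel_size + 1

size = 28
size = conv_output_size(size, 5)  # 24
size = pool_output_size(size, 2)  # 12
size = conv_output_size(size, 5)  # 8
size = pool_output_size(size, 2)  # 4

out_channels = 4
print(size, out_channels * size * size)  # 4 64
```

The final value, 4 x 4 x 4 = 64, is the number of inputs to the fully connected layer.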


In [40]:
# Instantiate the CNN class.
cnn_model = Net()

cnn_model

Net(
  (cnn_layers): Sequential(
    (0): Conv2d(1, 4, kernel_size=(5, 5), stride=(1, 1))
    (1): ReLU(inplace=True)
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(4, 4, kernel_size=(5, 5), stride=(1, 1))
    (4): ReLU(inplace=True)
    (5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (linear_layers): Sequential(
    (0): Linear(in_features=64, out_features=10, bias=True)
  )
)

In [41]:
# Set the loss criterion and optimizer.
cnn_criterion = nn.NLLLoss()
cnn_optimizer = optim.SGD(cnn_model.parameters(), lr=0.001, momentum=0.9)
epochs = 10

for e in range (epochs):
    for images, labels in train_data_loader:
        # No flattening needed here: the CNN expects images in (B, C, H, W) format.
        
        cnn_optimizer.zero_grad()
        # Forward.
        log_prob = cnn_model(images)
        loss = cnn_criterion(log_prob, labels)
        # Backward.
        loss.backward()
        # Optimize.
        cnn_optimizer.step()

In [42]:
correct_count = 0
count = 0

for data in test_data_loader:
    # 32 images and labels.
    images, labels = data

    # Get the predictions.
    # 32 outputs with log probabilities of 10 each for each of the 10 digits.
    with torch.no_grad():
        outputs = cnn_model(images)

    # torch.max with dim=1 returns, for each of the 32 images, the maximum log probability across the 10 classes. It returns 2 values - the max value and its index; we are interested in the index. _ is used to ignore the first output.
    _, predicted = torch.max(outputs.data, 1)

    # Get the number of images - 32 in each batch except for the last batch.
    count += labels.size(0)

    # Get the number of correct guesses in this batch.
    correct_count += (predicted == labels).sum().item()

In [43]:
accuracy = (correct_count/count) * 100
print(accuracy)

97.48


## Conclusion

In this example we learnt how to build a simple neural network and a CNN. The simple neural network reached an accuracy of *96.57%* on the test data, while the CNN reached *97.48%* on the same data. In the next blog, let us find out how to make the CNN more accurate.
