# Improving CNN (Pytorch) 
> Classify digit images using Convolutional Neural Network. 

- toc: false 
- badges: true
- comments: false
- categories: [jupyter, pytorch, neuralnetwork, convolutionalneuralnetwork]
- author: Venkataramani, Suja

In [18]:
# Import packages.
import torch
import torchvision
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
from PIL import Image
import matplotlib.pyplot as plt
from torch import nn, optim
from torch.nn import Linear, ReLU, Sequential, Conv2d, MaxPool2d, Dropout, BatchNorm2d
import torch.nn.functional as F

In [19]:
# "tranforms" method aids image tranformations. A set of tranformations can be chained together using Compose.

# ToTensor converts a numpy image array of (H, W, C) in the range (0, 255) in to a tensor of (C, H, W) in the range (0, 1)

# Normalize method accepts mean and std deviation as input. For every channel performs (image - mean)/std. this arranges all the numbers of the channel within the same range and reduces the skews in input data dute to different ranges of numbers.
transform_step = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])

# TorchVision.datasets has the most commonly used deep learning datasets available for easy download. All datasets have common interface of tranform. train=True gets the training data (60,000 samples), train=False gets test data (10,00 samples).
train_data = datasets.MNIST(root='./data/mmist_train', download=True, train=True, transform=transform_step)
test_data = datasets.MNIST(root='./data/minst_test', download=True, train=False, transform=transform_step)

# DataLoader creates a iterable batches of data in order to aid with training a nn model. Setting shuffle to True results in a random suffled batch of images.
train_data_loader = DataLoader(train_data, batch_size=32, shuffle=True)
test_data_loader = DataLoader(test_data, batch_size=32, shuffle=True)

### What is Regularization?

When a model learns the training data so well that it has not learnt the pattern but instead the data itself, the model tends to not generalize well, i.e. it does not predict the test data well. Regularization is a penalty term added to the error term such that the co-efficients learnt by this model does not fluctuate wildly. 

In Neural Networks there are several techniques for applying  regularization:
1. Dropout: This is implemented as a layer where the parameters are dropped randomly with a probablity of retaining the values. This has the effect forcing the model to learn from sparse neurons. Probability of 1 means no dropouts, 0.2 means 0.2 of the the neurons will be dropped.  
2. L1 Regularization (Lasso Regression): This regression is applied such that the the weights are sparse and therefore the model has fewer parameters. Only the important parameters are retained and the co-efficient of less important parameters are reduced to zero, thus learning the general trend of the model. (nn.L1Loss)  
3. L2 Regularization (Ridge Regression): In this the squared magnitude of the cofficient is added to the penalty term, leaving us with a set of simple features that explain the model most, the less important ones have lesser weights.  

[TowardsDataScience](https://towardsdatascience.com/regularization-an-important-concept-in-machine-learning-5891628907ea#:~:text=Regularization%20is%20a%20technique%20used,don't%20take%20extreme%20values.)  
[EddenGerber] (https://medium.com/@edden.gerber/thanks-for-the-article-1003ad7478b2)  
[MachineLearningMastery](https://machinelearningmastery.com/dropout-for-regularizing-deep-neural-networks/)  


In [20]:
# New class extending the nn.Module - the base class of all neural networks.
class Net_d(nn.Module):
    # Constructor which first calls the base class constructor.
    def __init__(self):
        super(Net_d, self).__init__()

        self.cnn_layers = Sequential(
            Conv2d(in_channels=1, out_channels=4, kernel_size=5),
            ReLU(),
            MaxPool2d(kernel_size=2),
            Dropout(p=0.5),            
            Conv2d(in_channels=4, out_channels=4, kernel_size=5),
            ReLU(),            
            MaxPool2d(kernel_size=2),
            Dropout(p=0.5)
        )

        # Final fully connected layer.
        self.linear_layers = Sequential(
            # Input and output.
            Linear(4 * 4 * 4, 10)
        )
    
    def forward(self, x):
        x = self.cnn_layers(x)
        x = x.view(-1, 4 * 4 * 4)
        x = self.linear_layers(x)
        return F.log_softmax(x, dim=1)

In [10]:
# Instantiate dropout CNN.
cnn_model = Net_d()

cnn_model

Net(
  (cnn_layers): Sequential(
    (0): Conv2d(1, 4, kernel_size=(5, 5), stride=(1, 1))
    (1): BatchNorm2d(4, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
    (3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (4): Conv2d(4, 4, kernel_size=(5, 5), stride=(1, 1))
    (5): BatchNorm2d(4, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (6): ReLU()
    (7): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (linear_layers): Sequential(
    (0): Linear(in_features=64, out_features=10, bias=True)
  )
)

In [26]:
# Train the model.
cnn_criterion = nn.NLLLoss()
cnn_optimizer = optim.SGD(cnn_model.parameters(), lr=0.001, momentum=0.9)
epochs = 5

for e in range (epochs):
    for images, labels in train_data_loader:
        
        cnn_optimizer.zero_grad()
        # Forward.
        log_prob = cnn_model(images)
        loss = cnn_criterion(log_prob, labels)
        # Backward.
        loss.backward()
        # Optimize.
        cnn_optimizer.step()

In [27]:
# Test the model on the test set.
correct_count = 0
count = 0

for data in test_data_loader:
    # 32 images and labels.
    images, labels = data

    # Get the predictions.
    # 32 outputs with log probabilities of 10 each for each of the 10 digits.
    with torch.no_grad():
        outputs = cnn_model(images)

    # torch.max - with dim=1 (column) results in max of the probablilities for each of 32 images. It returns 2 values - max probabliltiy and            max index, we are interested in the max index. _, is used to ignore the first set of output.
    _, predicted = torch.max(outputs.data, 1)

    # Get the number of images - 32 in each batch except for the last batch.
    count += labels.size(0)

    # Get the number of correct guesses in this batch.
    correct_count += (predicted == labels).sum().item()

In [17]:
accuracy = (correct_count/count) * 100
print(accuracy)

97.92999999999999


### Dropout did not work well for our CNN. Why?
Using this CNN class, we got an accuracy of 81.69. Dropouts are not always effective, they are more effective when the training data - some of the reasons are using dropouts as the last step in NN gives the model no means to correct itself, when the network is small compared to the training data size, when there are not enough epochs to give reach convergence are all cited as reasons for this. 

Other reasons CNN particularly does not take to droput has been observed as high level of correlation between activations and use of max pooling to reduce the number of parameters. 

[StackExchange](https://stats.stackexchange.com/questions/299292/dropout-makes-performance-worse)  
[KDNuggets](https://www.kdnuggets.com/2018/09/dropou9t-convolutional-networks.html)


### Batch Normalization

When a model is made of tens of layers, and data is fed in batches, the weight inputs a a layer get adjusted after every batch, which causes the model constantly readjust its weights and take longer to converge. This is called internal covariate shift. To counteract this, the learning rates need to be reduced or the initial parameters must be selected with care. This also means higher epochs to achieve a good model convergence.

Batch normalization is technique where the input to a layer is normalised such than mean is 0 and standard deviation of 1, which calms the constant shift in weights and helps achieve convergence faster. This also allows us to get rid of dropout layers.  

[MachineCurve]https://www.machinecurve.com(/index.php/2020/01/14/what-is-batch-normalization-for-training-neural-networks/)
[MachineLearningMastery](https://machinelearningmastery.com/batch-normalization-for-training-of-deep-neural-networks/)  
[AIWorkBox](https://www.aiworkbox.com/lessons/batchnorm2d-how-to-use-the-batchnorm2d-module-in-pytorch)  
[OrignalPaper](https://arxiv.org/abs/1502.03167)

In [24]:
# CNN class with Batch normalisation.
class Net_b(nn.Module):
    # Constructor which first calls the base class constructor.
    def __init__(self):
        super(Net_b, self).__init__()

        self.cnn_layers = Sequential(
            Conv2d(in_channels=1, out_channels=4, kernel_size=5),
            BatchNorm2d(num_features=4, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True),
            ReLU(),
            MaxPool2d(kernel_size=2),
            Conv2d(in_channels=4, out_channels=4, kernel_size=5),
            BatchNorm2d(num_features=4, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True),
            ReLU(),            
            MaxPool2d(kernel_size=2),
        )

        # Final fully connected layer.
        self.linear_layers = Sequential(
            # Input and output.
            Linear(4 * 4 * 4, 10)
        )
    
    def forward(self, x):
        x = self.cnn_layers(x)
        x = x.view(-1, 4 * 4 * 4)
        x = self.linear_layers(x)
        return F.log_softmax(x, dim=1)

In [25]:
# Instantiate the batch norm CNN.
cnn_model = Net_b()

cnn_model

Net_b(
  (cnn_layers): Sequential(
    (0): Conv2d(1, 4, kernel_size=(5, 5), stride=(1, 1))
    (1): BatchNorm2d(4, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
    (3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (4): Conv2d(4, 4, kernel_size=(5, 5), stride=(1, 1))
    (5): BatchNorm2d(4, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (6): ReLU()
    (7): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (linear_layers): Sequential(
    (0): Linear(in_features=64, out_features=10, bias=True)
  )
)

In [28]:
accuracy = (correct_count/count) * 100
print(accuracy)

97.63


### Conclusion

We learnt important concepts such a dropout and batch normalization and applied it to our CNN. Neither actually performed better than the original CNN.