### Learning of CNN through different layers
![Different layers of CNN](../img/cnn_layers.png)

        The above pictures are of feature maps generated from different layers of VGG16(Visual Geometry Group of Oxford) trained on imagenet.
        The Layer 1 generates mostly horizontal, vetical and diagonal lines. There mostly used for detecting edges in an image. The Layer 2 will try to give more informations than first. It detects the corners. The CNN learns to do this on its own. There is no special instruction for the CNN to focus on more complex objects in deeper layers. That’s just how it normally works out when you feed training data into a CNN. Layer 3 is where we start to see some complex patterns like the eyes, face etc. We can assume that this feature maps are obtained from a model trained for detection of human faces. In Layer 4 we see our features finding patterns in the more complex parts of the faces such as eyes.
![5th layer of VGG16](../img/layer5.png)
        
        In Layer 5, you can the feature map generates the specific faces of humans, tyres of cars, faces of animals etc. This feature map contains to most information about the patters found in the images.
        
### Different Parts of a CNN
![Different Parts of CNN](../img/cnn_parts.png)

    Now, we have learnt a lot about CNNs, it's time to implement it using PyTorch. 

# Implementation of a CNN Classification Network

In [29]:
import numpy as np
import pandas as pd
import torch
from torch import nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
%matplotlib inline

In [30]:
ds = pd.read_csv('data/mnist.csv').values
ds.shape

(42000, 785)

In [31]:
# Reshaping our data according to our CNN
xtrain=ds[2000:10000, 1:].reshape((-1, 1, 28, 28))/255.0
ytrain=ds[2000:10000, 0]

xtest=ds[23000:24500, 1:].reshape((-1, 1, 28, 28))/255.0
ytest=ds[23000:24500, 0]

print(xtrain.shape, ytrain.shape)
print(xtest.shape, ytest.shape) # batch_size, no_of_channels, width, height

(8000, 1, 28, 28) (8000,)
(1500, 1, 28, 28) (1500,)


        Here a question comes, why we divided image by 255 or why it's necessary?
Ans:-

         These are all scaling techniques, the pixel values are small (Note that these small values still represents the original image), and hence the computation required and time to converge the model reduces significantly. CNN will converge despite taking 0–255 as inputs instead of scaled down to 0 -1 . However, it will converge very slowly.
## Data Normalization
        Data normalization is an important step which ensures that each input parameter (pixel, in this case) has a similar data distribution. This makes convergence faster while training the network. Data normalization is done by subtracting the mean from each pixel and then dividing the result by the standard deviation. The distribution of such data would resemble a Gaussian curve centered at zero. For image inputs we need the pixel numbers to be positive, so we might choose to scale the normalized data in the range [0,1] or [0, 255].
        In PyTorch Normalisation is done, channel-wise.

In [12]:
from torchvision import transforms
transforms.Normalize??

In [13]:
# Checking Output Values or labels
print(set(ytrain))

{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}


In [5]:
class CNN(nn.Module):     # inherited methods from super-class nn.Module
    def __init__(self):
        super(CNN, self).__init__()     # super() is used for calling init method of superclass and use in subclass CNN.
                                        #  super() allows to call methods of the superclass in subclass.
        self.conv1= nn.Sequential(
                nn.Conv2d(1, 16, 5, 1, 2),
                nn.ReLU(),
                nn.MaxPool2d(2))
        
        self.conv2=nn.Sequential(
                nn.Conv2d(16, 32, 5, 1, 2),
                nn.ReLU(),
                nn.MaxPool2d(2))
        
        self.out=nn.Linear(32*7*7, 10)
        
        
    def forward(self, x):
        x=self.conv1(x)
        x=self.conv2(x)
#         print(x.size())
        x=x.view(x.size(0), -1)
#         print(x.size())
        output=F.softmax(self.out(x))
#         print(output.size())
        return output
        
        

In [60]:
# nn.Conv2d??
# nn.MaxPool2d??

In [6]:
cnn = CNN()
print(cnn)

CNN(
  (conv1): Sequential(
    (0): Conv2d(1, 16, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (1): ReLU()
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (conv2): Sequential(
    (0): Conv2d(16, 32, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (1): ReLU()
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (out): Linear(in_features=1568, out_features=10, bias=True)
)


In [7]:
optimizer = torch.optim.Adam(cnn.parameters(), lr=0.0001)   

loss_func = nn.CrossEntropyLoss() # Also known as classification loss.
# L = -y.log(y^) here y is ground-truth and y^ is estimated.

In [9]:
def make_batch(train, labels, batch=32):
    start=0
    stop=start+batch
    while start<train.shape[0]:
        yield torch.FloatTensor(train[start:stop]), torch.LongTensor(labels[start:stop])
        start=stop
        stop=start+batch

In [64]:
for epoch in range(100):
    for step, (b_x, b_y) in enumerate(make_batch(xtrain, ytrain, 32)):   # gives batch data, normalize x when iterate train_loader
        #print (step,)
        
#         print(b_x.size())
        output = cnn(b_x)  
#         print(b_y.size())
#         cnn output
#         print output.size(), output.sum(dim=0)
        loss = loss_func(output, b_y)   # cross entropy loss
        optimizer.zero_grad()           # clear gradients for this training step
        loss.backward()                 # backpropagation, compute gradients
        
        # for params in cnn.parameters():
        #     print params.grad.cpu().data.sum() # Y U no train!!!
        optimizer.step()                # apply gradients

        if step % 250 == 0:
            test_output = cnn(torch.FloatTensor(xtest))
            outs = test_output.data.numpy().argmax(axis=1)
            acc = (outs == ytest).sum()*100.0 / test_output.shape[0]
            # pred_y = torch.max(test_output, 1)[1].data.squeeze().numpy()
            # accuracy = float((pred_y == test_y.data.numpy()).astype(int).sum()) / float(test_y.size(0))
            print('Epoch: ', epoch, '| Step: ', step, '| Acc: ', acc)



Epoch:  0 | Step:  0 | Acc:  92.66666666666667
Epoch:  1 | Step:  0 | Acc:  92.66666666666667
Epoch:  2 | Step:  0 | Acc:  92.93333333333334
Epoch:  3 | Step:  0 | Acc:  93.2
Epoch:  4 | Step:  0 | Acc:  93.26666666666667
Epoch:  5 | Step:  0 | Acc:  93.6
Epoch:  6 | Step:  0 | Acc:  93.66666666666667
Epoch:  7 | Step:  0 | Acc:  93.8
Epoch:  8 | Step:  0 | Acc:  94.0
Epoch:  9 | Step:  0 | Acc:  94.33333333333333
Epoch:  10 | Step:  0 | Acc:  94.33333333333333
Epoch:  11 | Step:  0 | Acc:  94.6
Epoch:  12 | Step:  0 | Acc:  94.53333333333333
Epoch:  13 | Step:  0 | Acc:  94.33333333333333
Epoch:  14 | Step:  0 | Acc:  94.4
Epoch:  15 | Step:  0 | Acc:  94.53333333333333
Epoch:  16 | Step:  0 | Acc:  94.86666666666666
Epoch:  17 | Step:  0 | Acc:  95.0
Epoch:  18 | Step:  0 | Acc:  95.0
Epoch:  19 | Step:  0 | Acc:  95.06666666666666
Epoch:  20 | Step:  0 | Acc:  95.2
Epoch:  21 | Step:  0 | Acc:  95.2
Epoch:  22 | Step:  0 | Acc:  95.2
Epoch:  23 | Step:  0 | Acc:  95.06666666666666
E

KeyboardInterrupt: 

In [73]:
print(cnn.state_dict().keys())

odict_keys(['conv1.0.weight', 'conv1.0.bias', 'conv2.0.weight', 'conv2.0.bias', 'out.weight', 'out.bias'])


In [65]:
for i in cnn.named_parameters():
    print(i[1].shape)
    break


torch.Size([16, 1, 5, 5])


In [58]:
optimizer.state_dict

<bound method Optimizer.state_dict of Adam (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    eps: 1e-08
    lr: 0.0001
    weight_decay: 0
)>

## CNN Architectures
    1. LeNet-5(1998, Yann LeCun)(60,000 parameters)
    2. AlexNet(2012, SuperVision group consisting of Alex Krizhevsky, Geoffrey Hinton, and Ilya Sutskever)
        It consisted 11x11, 5x5,3x3, convolutions, max pooling, dropout, data augmentation, ReLU activations, SGD with momentum. It attached ReLU activations after every convolutional and fully-connected layer. AlexNet was trained for 6 days simultaneously on two Nvidia Geforce GTX 580 GPUs which is the reason for why their network is split into two pipelines. (60 M parameters)
    3. VGG(2012, Visual Geometry Group, Oxford)(528 M for VGG16)
    4. Inception(2014, Google)
    5. ResNet
    6. ResNext
    7. EfficientNet
    
Links to read further about architectures:-

    https://towardsdatascience.com/cnn-architectures-a-deep-dive-a99441d18049
    https://medium.com/analytics-vidhya/cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5
    https://towardsdatascience.com/illustrated-10-cnn-architectures-95d78ace614d

In [69]:
import torchvision
vgg = torchvision.models.vgg16_bn()
print(vgg)

VGG(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU(inplace=True)
    (3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (4): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (5): ReLU(inplace=True)
    (6): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (7): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (8): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (9): ReLU(inplace=True)
    (10): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (12): ReLU(inplace=True)
    (13): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (14): Conv2d(128, 256

## Dropout
    The term “dropout” refers to dropping out units (both hidden and visible) in a neural network. I mean these units are not considered during a particular forward or backward pass. More technically, At each training stage, individual nodes are either dropped out of the net with probability 1-p or kept with probability p, so that a reduced network is left; incoming and outgoing edges to a dropped-out node are also removed. 
![Dropout](../img/dropout.png)

### Why do we need Dropout?
    The answer to this question is “to prevent over-fitting”. A fully connected layer occupies most of the parameters, and hence, neurons develop co-dependency amongst each other during training which curbs the individual power of each neuron leading to over-fitting of training data.
    
    Effects and Observations of Dropout:-
        1. Dropout forces a neural network to learn more robust features that are useful in conjunction with many different random subsets of the other neurons.
        2. Dropout roughly doubles the number of iterations required to converge. However, training time for each epoch is less.
        
        
## Batch Normalization
![Batch Normalisation](../img/batch_norm.png)   
    
    Batch Norm is a technique that aims to improve the training of deep neural networks by stabilizing the distribution of layer inputs. Batch normalization normalizes the output of a previous activation layer by subtracting the batch mean and dividing by the batch standard deviation. 
    Even if you don’t need to worry about overfitting there are many benefits to implementing batch normalization. Because of this, and its regularizing effect, batch normalization has largely replaced dropout in modern convolutional architectures. Because:-
    First, dropout is generally less effective at regularizing convolutional layers. The reason? Since convolutional layers have few parameters, they need less regularization to begin with. Furthermore, because of the spatial relationships encoded in feature maps, activations can become highly correlated. This renders dropout ineffective. (Source.)
    Second, what dropout is good at regularizing is becoming outdated. Large models like VGG16 included fully connected layers at the end of the network. For models like this, overfitting was combatted by including dropout between fully connected layers.
    
## So, why Batch-Norm is so effective?
    According to those who introduced Batch-Norm, It's success is due to the reduction of ICS(the change in the distribution of layer inputs caused by updating the layer inputs of preceding layers). To explain covariance shift, let’s have a deep network on cat detection. We train our data on only black cats’ images. So, if we now try to apply this network to data with colored cats, it is obvious; we’re not going to do well. The training set and the prediction set are both cats’ images but they differ a little bit. In other words, if an algorithm learned some X to Y mapping, and if the distribution of X changes, then we might need to retrain the learning algorithm by trying to align the distribution of X with the distribution of Y. 
    
    But Actually the performance is improved in CNNs due to Batch-Norm beacuse of other reason:-
According to Ian Goodfellow:-

        Deep Neural networks have higher-order interactions, which means changing weights of one layer might also effect the statistics of other layers in addition to the loss function. These cross layer interactions, when unaccounted lead to internal covariate shift. Every time we update the weights of a layer, there is a chance that it effects the statistics of a layer further in the neural network in an unfavorable way.

    Convergence may require careful initializing, hyperparameter tuning and longer training durations in such cases. However, when we add the batch normalized layer between the layers, the statistics of a layer are only effected by the two hyperparameters γ and β. Now our optimization algorithm has to adjust only two hyperparameters to control the statistics of any layer, rather than the entire weights in the previous layer. This greatly speeds up convergence, and avoids the need for careful initialization and hyperparameter tuning. Therefore, Batch Norm acts more like a check pointing mechanism.
    
According to paper published by MIT:-
 
![Effect of Batch Norm](../img/effect_batch.png)
    
    It reparametrize the underlying optimization problem to make its loss landscape significantly more smooth. That is, the loss changes at a smaller rate and the magnitudes of the gradients are smaller too.

In [None]:
.