$\color{blue}{\text{What is Neural Network }}$
A neural network is simply a group of connected neurons, there are some input neurons, some output neurons and a group of what we call hidden neurons in between. When we feed information to the input neurons we get some information from the output neurons. Information starts at the input neurons and travels to the next layers of neurons having whats called a weight and a bias applied to it. These weight and biases start out randomly determined and are tweaked as the network learns and sees more data. After reaching a new layer there is a function applied to each neurons value that is called an activation function.


$\color{blue}{\text{How does Neural Network work? }}$

y=w1*x1+w2*x2+w3*x3+w4(bias)
z=act(y)

First weights will be passed to hidden neurons, and two steps would happen. 
1. The summation of weights and features would happen and then bias is added
2.  Activation function is applied to the summation of weights

$\color{blue}{\text{How does activation function work? }}$
Lets suppose, if we keep our one hand on a hot object, then the neurons of that hand would get activated, but not of the other hands 
1. Sigmoid Activation Function: used in Logistic Regression, and Activation function is 1/(1+e^-y). Any value of y given to sigmoid would result either in 0 or in 1. There is not in between value. 
2. ReLu Activation Function: the product and summation of weights, features, and bias is sent to Relu, and max(y, 0) is found in Relu.

$\color{blue}{\text{Neural Network Training }}$
All the input features will pass to the hidden layer, then the two steps of how neural network works are followed and output layer produces the predicted output. y^=predicted value, y=actual value
Then loss function is applied. The difference of y-y^ is found. The loss function value should be reduced to minimal value so that y=y^. We can do this by updating weights and using optimizer. To achieve the minimal loss value, we got to update the weights using backpropagation. Lets suppose we want to update the bias w4, $\color{blue}{\text{w4'=w4-learning_rate* derivative of w4}}$ subsequently, w1, w2, w3 would be updated.

In [1]:
from torchvision import datasets as ds
from torch.utils.data import DataLoader
from torchvision import transforms
import torchvision as tv
import torch
import torch.nn as nn
import math
import numpy as np
from torch.autograd import Variable
from torch import optim
from matplotlib import pyplot as plt
import torch.backends.cudnn as cudnn

In [3]:
transform = transforms.Compose(
    [
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ]
)

###  $\color{blue}{\text{Normalize does the following for each channel:}}$

image = (image - mean) / std

The parameters mean, std are passed as 0.5, 0.5 in your case. This will normalize the image in the range [-1,1]. For example, the minimum value 0 will be converted to (0-0.5)/0.5=-1, the maximum value of 1 will be converted to (1-0.5)/0.5=1. Both parameters are “Sequences for each channel”. Color images have three channels (red, green, blue), therefore you need three parameters to normalize each channel. The first tuple (0.5, 0.5, 0.5) is the mean for all three channels and the second (0.5, 0.5, 0.5) is the standard deviation for all three channels.

if you would like to get your image back in [0,1] range, you could use,

image = ((image * std) + mean)

In [4]:
train_set = ds.CIFAR10(root='../input/', train=True, transform=transform, download=True)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=4, shuffle=True, num_workers=0)

testset = tv.datasets.CIFAR10(root='../input/', train=False, download=True, transform=transform)

testloader = torch.utils.data.DataLoader(testset,  batch_size=4, shuffle=True, num_workers=0)

Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ../input/cifar-10-python.tar.gz


HBox(children=(FloatProgress(value=0.0, max=170498071.0), HTML(value='')))


Extracting ../input/cifar-10-python.tar.gz to ../input/
Files already downloaded and verified


CIFAR10 in torch package has 60,000 images of 10 labels, with the size of 32x32 pixels. By default, torchvision.datasets.CIFAR10 will separate the dataset into 50,000 images for training and 10,000 images for testing.

$\color{blue}{\text{Dataset loader }}$
The dataset is divided in three categories: training, validation and test. The first one will be, obviously, used for trainig; the validation set will be used to measure the model performance during training and the test set will be used to evaluate our model performance once the training has finished.

$\color{blue}{\text{Utils}}$
Some utility function to visualize the dataset and the model's prediction

In [5]:
classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']

## $\color{blue}{\text{VGG16 Architecture}}$

![picture](https://drive.google.com/uc?export=view&id=1dimQsscYoFA63KN1FDsWe7bwIrvxpRhR)

1. VGG: The VGG16 is a CNN model that instead of having a large number of hyper-parameter they focused on having convolution layers of 3x3 filter with a stride 1 and always used same padding and maxpool layer of 2x2 filter of stride 2. It follows this arrangement of convolution and max pool layers consistently throughout the whole architecture. In the end it has 2 FC(fully connected layers) followed by a softmax for output. The 16 in VGG16 refers to it has 16 layers that have weights. This network is a pretty large network and it has about 138 million (approx) parameters.
2. I will be using Sequential method as I am creating a sequential model. Sequential model means that all the layers of the model will be arranged in sequence. 


  a. torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True, padding_mode='zeros')

stride controls the stride for the cross-correlation, a single number or a tuple.

padding controls the amount of implicit padding on both sides for padding number of points for each dimension.

in_channels (int) – Number of channels in the input image

out_channels (int) – Number of channels produced by the convolution

kernel_size (int or tuple) – Size of the convolving kernel

stride (int or tuple, optional) – Stride of the convolution. Default: 1

padding (int or tuple, optional) – Zero-padding added to both sides of the input. Default: 0

$\color{blue}{\text{relu(Rectified Linear Unit) activation to each layers so that all the negative values are not passed to the next layer. }}$

the spatial padding of conv. layer input is such that the spatial resolution is preserved after convolution, i.e. the padding is 1-pixel for 3×3 conv. layers. Spatial pooling is carried out by five max-pooling layers, which follow some of the conv.  layers (not all the conv. layers are followed by max-pooling). Max-pooling is performed over a 2×2 pixel window, with stride 2.

In [6]:
class VGGNet(nn.Module):
    def __init__(self, num_classes):
        super(VGGNet, self).__init__()
        self.features = nn.Sequential(
            
            nn.Conv2d(3, 32, 3, padding=1),  # Conv1
            nn.ReLU(True),
            nn.Conv2d(32, 32, 3, padding=1),  # Conv2
            nn.ReLU(True),
            nn.MaxPool2d(2, 2),  # Pool1
            nn.Conv2d(32, 64, 3, padding=1),  # Conv3
            nn.ReLU(True),
            nn.Conv2d(64, 64, 3, padding=1),  # Conv4
            nn.ReLU(True),
            nn.MaxPool2d(2, 2),  # Pool2
            nn.Conv2d(64, 128, 3, padding=1),  # Conv5
            nn.ReLU(True),
            nn.Conv2d(128, 128, 3, padding=1),  # Conv6
            nn.ReLU(True),
            nn.Conv2d(128, 128, 3, padding=1),  # Conv7
            nn.ReLU(True),
            nn.MaxPool2d(2, 2),  # Pool3
            nn.Conv2d(128, 256, 3, padding=1),  # Conv8
            nn.ReLU(True),
            nn.Conv2d(256, 256, 3, padding=1),  # Conv9
            nn.ReLU(True),
            nn.Conv2d(256, 256, 3, padding=1),  # Conv10
            nn.ReLU(True),
            nn.MaxPool2d(2, 2),  # Pool4
            nn.Conv2d(256, 256, 3, padding=1),  # Conv11
            nn.ReLU(True),
            nn.Conv2d(256, 256, 3, padding=1),  # Conv12
            nn.ReLU(True),
            nn.Conv2d(256, 256, 3, padding=1),  # Conv13
            nn.ReLU(True),
            # nn.MaxPool2d(2, 2)  # Pool5 
        )
#Sequential object from keras. A Sequential model simply defines a sequence of layers starting with the input layer and ending with the 
#output layer. Our model will have 3 layers, and input layer of 784 neurons (representing all of the 28x28 pixels in a picture) a hidden 
#layer of an arbitrary 128 neurons and an output layer of 10 neurons representing the probability of the picture being each of the 10 classes.
        self.classifier = nn.Sequential(
            nn.Linear(2 * 2 * 256, 512),
            nn.ReLU(True),
            nn.Dropout(),
            nn.Linear(512, 512),
            nn.ReLU(True),
            nn.Dropout(),
            nn.Linear(512, num_classes),
        )
        self._initialize_weights()

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2. / n))
                if m.bias is not None:
                    m.bias.data.zero_()
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()
            elif isinstance(m, nn.Linear):
                m.weight.data.normal_(0, 0.01)
                m.bias.data.zero_()

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)
        x = self.classifier(x)
        return x

In [10]:
net = VGGNet(16)
net.cuda()
lr = 1e-3
momentum = 0.9
num_epoch = 50

critierion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=lr, momentum=momentum)
print('Training with learning rate = %f, momentum = %f ' % (lr, momentum))

Training with learning rate = 0.001000, momentum = 0.900000 


n_total_step in my case is 1,250 steps, it is calculated by <total records>/<batch size>, so my case is$\color{blue}{\text{ 50,000/40 = 1,250. it means that in training stage, each epoch my code will execute a loop of 1,250 steps.}}$


$\color{red}{\text{CrossEntropyLoss}}$ $\color{blue}{\text{function in torch to calculate the loss value. This function received the predicted y value of n-features and the labels and does the}}$  $\color{blue}{\text{ softmax calculation, in my case, I have 10-feature predicted outputs for each image.}}$

In [11]:
loss_p = np.array([])
for t in range(num_epoch):
    running_loss = 0
    running_loss_sum_per_epoch = 0
    total_images = 0
    correct_images = 0
    if t == 25:
        optimizer = optim.SGD(net.parameters(), lr=lr/10, momentum=momentum)
    for i, data in enumerate(train_loader, 0):
        images, labels = data
        images = Variable(images.cuda())
        labels = Variable(labels.cuda())

        optimizer.zero_grad()
        outputs = net(images)
        _, predicts = torch.max(outputs.data, 1)
        loss = critierion(outputs, labels)

        loss.backward()

        optimizer.step()

        total_images += labels.size(0)
        correct_images += (predicts == labels).sum().item()
        loss_data = loss.data.item()
        running_loss += loss_data
        
        running_loss_sum_per_epoch = running_loss + running_loss_sum_per_epoch
        if i % 2000 == 1999:
            print('Epoch, batch [%d, %5d] loss: %.6f, Training accuracy: %.5f' %
                  (t + 1, i + 1, running_loss / 2000, 100 * correct_images / total_images))
            running_loss = 0
            total_images = 0
            correct_images = 0

    loss_p = np.append(loss_p, running_loss_sum_per_epoch)

print('Finished training.')

Epoch, batch [1,  2000] loss: 2.536399, Training accuracy: 10.48750
Epoch, batch [1,  4000] loss: 2.333488, Training accuracy: 10.18750
Epoch, batch [1,  6000] loss: 2.244716, Training accuracy: 15.25000
Epoch, batch [1,  8000] loss: 2.070428, Training accuracy: 18.42500
Epoch, batch [1, 10000] loss: 1.977798, Training accuracy: 19.43750
Epoch, batch [1, 12000] loss: 1.936953, Training accuracy: 20.28750
Epoch, batch [2,  2000] loss: 1.875491, Training accuracy: 24.26250
Epoch, batch [2,  4000] loss: 1.787872, Training accuracy: 28.47500
Epoch, batch [2,  6000] loss: 1.746399, Training accuracy: 30.95000
Epoch, batch [2,  8000] loss: 1.702240, Training accuracy: 32.63750
Epoch, batch [2, 10000] loss: 1.668457, Training accuracy: 32.86250
Epoch, batch [2, 12000] loss: 1.610163, Training accuracy: 36.57500
Epoch, batch [3,  2000] loss: 1.554412, Training accuracy: 39.67500
Epoch, batch [3,  4000] loss: 1.510543, Training accuracy: 42.42500
Epoch, batch [3,  6000] loss: 1.461720, Training

In [12]:
with torch.no_grad():
    number_corrects = 0
    number_samples = 0
    for i, (test_images_set , test_labels_set) in enumerate(testloader):
        test_images_set = test_images_set.cuda()
        test_labels_set = test_labels_set.cuda()
    
        y_predicted = net(test_images_set)
        labels_predicted = y_predicted.argmax(axis = 1)
        number_corrects += (labels_predicted==test_labels_set).sum().item()
        number_samples += test_labels_set.size(0)
    print(f'Overall accuracy {(number_corrects / number_samples)*100}%')

Overall accuracy 81.13%


 $\color{blue}{\text{Return evenly spaced numbers over a specified interval.}}$

In [1]:
e = np.linspace(0, num_epoch, num_epoch)

NameError: ignored

In [None]:
plt.plot(e, loss_p, color='red', linestyle='--', labels='Training Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()

 ## $\color{blue}{\text{Theory Behind the CNN}}$
 In a CNN, the input is a tensor with a shape: (number of inputs) x (input height) x (input width) x (input channels). After passing through a convolutional layer, the image becomes abstracted to a feature map, also called an activation map, with shape: (number of inputs) x (feature map height) x (feature map width) x (feature map channels). A convolutional layer within a CNN generally has the following attributes:

Convolutional filters/kernels defined by a width and height (hyper-parameters).
The number of input channels and output channels (hyper-parameters). One layer's input channels must equal the number of output channels (also called depth) of its input.
Additional hyperparameters of the convolution operation, such as: padding, stride, and dilation.
Convolutional layers convolve the input and pass its result to the next layer. This is similar to the response of a neuron in the visual cortex to a specific stimulus.[14] Each convolutional neuron processes data only for its receptive field. Although fully connected feedforward neural networks can be used to learn features and classify data, this architecture is generally impractical for larger inputs such as high resolution images. It would require a very high number of neurons, even in a shallow architecture, due to the large input size of images, where each pixel is a relevant input feature. For instance, a fully connected layer for a (small) image of size 100 x 100 has 10,000 weights for each neuron in the second layer. Instead, convolution reduces the number of free parameters, allowing the network to be deeper.[15] For example, regardless of image size, using a 5 x 5 tiling region, each with the same shared weights, requires only 25 learnable parameters. Using regularized weights over fewer parameters avoids the vanishing gradients and exploding gradients problems seen during backpropagation in traditional neural networks.[16][17] Furthermore, convolutional neural networks are ideal for data with a grid-like topology (such as images) as spatial relations between separate features are taken into account during convolution and/or pooling.

Pooling layers
Convolutional networks may include local and/or global pooling layers along with traditional convolutional layers. Pooling layers reduce the dimensions of data by combining the outputs of neuron clusters at one layer into a single neuron in the next layer. Local pooling combines small clusters, tiling sizes such as 2 x 2 are commonly used. Global pooling acts on all the neurons of the feature map.[18][19] There are two common types of pooling in popular use: max and average. Max pooling uses the maximum value of each local cluster of neurons in the feature map,[20][21] while average pooling takes the average value.

Fully connected layers
Fully connected layers connect every neuron in one layer to every neuron in another layer. It is the same as a traditional multi-layer perceptron neural network (MLP). The flattened matrix goes through a fully connected layer to classify the images.

Receptive field
In neural networks, each neuron receives input from some number of locations in the previous layer. In a convolutional layer, each neuron receives input from only a restricted area of the previous layer called the neuron's receptive field. Typically the area is a square (e.g. 5 by 5 neurons). Whereas, in a fully connected layer, the receptive field is the entire previous layer. Thus, in each convolutional layer, each neuron takes input from a larger area in the input than previous layers. This is due to applying the convolution over and over, which takes into account the value of a pixel, as well as its surrounding pixels. When using dilated layers, the number of pixels in the receptive field remains constant, but the field is more sparsely populated as its dimensions grow when combining the effect of several layers.

Weights
Each neuron in a neural network computes an output value by applying a specific function to the input values received from the receptive field in the previous layer. The function that is applied to the input values is determined by a vector of weights and a bias (typically real numbers). Learning consists of iteratively adjusting these biases and weights.

The vector of weights and the bias are called filters and represent particular features of the input (e.g., a particular shape). A distinguishing feature of CNNs is that many neurons can share the same filter. This reduces the memory footprint because a single bias and a single vector of weights are used across all receptive fields that share that filter, as opposed to each receptive field having its own bias and vector weighting.

ReLU layer
ReLU is the abbreviation of rectified linear unit, which applies the non-saturating activation function {\textstyle f(x)=\max(0,x)}{\textstyle f(x)=\max(0,x)}.[56] It effectively removes negative values from an activation map by setting them to zero.[69] It introduces nonlinearities to the decision function and in the overall network without affecting the receptive fields of the convolution layers.

Other functions can also be used to increase nonlinearity, for example the saturating hyperbolic tangent {\displaystyle f(x)=\tanh(x)}{\displaystyle f(x)=\tanh(x)}, {\displaystyle f(x)=|\tanh(x)|}{\displaystyle f(x)=|\tanh(x)|}, and the sigmoid function {\textstyle \sigma (x)=(1+e^{-x})^{-1}}{\textstyle \sigma (x)=(1+e^{-x})^{-1}}. ReLU is often preferred to other functions because it trains the neural network several times faster without a significant penalty to generalization accuracy.[70]

Fully connected layer
After several convolutional and max pooling layers, the final classification is done via fully connected layers. Neurons in a fully connected layer have connections to all activations in the previous layer, as seen in regular (non-convolutional) artificial neural networks. Their activations can thus be computed as an affine transformation, with matrix multiplication followed by a bias offset (vector addition of a learned or fixed bias term).

CNNs use more hyperparameters than a standard multilayer perceptron (MLP). While the usual rules for learning rates and regularization constants still apply, the following should be kept in mind when optimizing.

Number of filters
Since feature map size decreases with depth, layers near the input layer tend to have fewer filters while higher layers can have more. To equalize computation at each layer, the product of feature values va with pixel position is kept roughly constant across layers. Preserving more information about the input would require keeping the total number of activations (number of feature maps times number of pixel positions) non-decreasing from one layer to the next.

The number of feature maps directly controls the capacity and depends on the number of available examples and task complexity.

Filter size
Common filter sizes found in the literature vary greatly, and are usually chosen based on the data set.

The challenge is to find the right level of granularity so as to create abstractions at the proper scale, given a particular data set, and without overfitting.

Pooling type and size
In modern CNNs, max pooling is typically used, and often of size 2×2, with a stride of 2. This implies that the input is drastically downsampled, further improving the computational efficiency.

Very large input volumes may warrant 4×4 pooling in the lower layers.[71] However, choosing larger shapes will dramatically reduce the dimension of the signal, and may result in excess information loss. Often, non-overlapping pooling windows perform best.

##  $\color{blue}{\text{Using Pretrained Model From Torchvision}}$ 

In [None]:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms
from torchvision import models
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
num_epochs = 5
batch_size = 40
learning_rate = 0.001
classes = ('plane', 'car' , 'bird',
    'cat', 'deer', 'dog',
    'frog', 'horse', 'ship', 'truck')

cuda


In [None]:
transform = transforms.Compose([
    transforms.Resize(size=(224, 224)),
    transforms.ToTensor(),
    transforms.Normalize( 
       (0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010) 
    )
])
train_dataset = torchvision.datasets.CIFAR10(
    root= './data', train = True,
    download =True, transform = transform)
test_dataset = torchvision.datasets.CIFAR10(
    root= './data', train = False,
    download =True, transform = transform)

Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz


HBox(children=(FloatProgress(value=0.0, max=170498071.0), HTML(value='')))


Extracting ./data/cifar-10-python.tar.gz to ./data
Files already downloaded and verified


In [None]:
train_loader = torch.utils.data.DataLoader(train_dataset
    , batch_size = batch_size
    , shuffle = True)
test_loader = torch.utils.data.DataLoader(test_dataset
    , batch_size = batch_size
    , shuffle = True)
n_total_step = len(train_loader)
print(n_total_step)

1250


In [None]:
model = models.vgg16(pretrained = True)
input_lastLayer = model.classifier[6].in_features
model.classifier[6] = nn.Linear(input_lastLayer,10)
model = model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr = learning_rate, momentum=0.9,weight_decay=5e-4)

Downloading: "https://download.pytorch.org/models/vgg16-397923af.pth" to /root/.cache/torch/hub/checkpoints/vgg16-397923af.pth


HBox(children=(FloatProgress(value=0.0, max=553433881.0), HTML(value='')))




In [None]:
for epoch in range(num_epochs):
  for i, (imgs , labels) in enumerate(train_loader):
    imgs = imgs.to(device)
    labels = labels.to(device)

    labels_hat = model(imgs)
    n_corrects = (labels_hat.argmax(axis=1)==labels).sum().item()
    loss_value = criterion(labels_hat, labels)
    loss_value.backward()
    optimizer.step()
    optimizer.zero_grad()
    if(i+1)%250==0:
      print(f'epoch {epoch+1}/{num_epochs}, step: {i+1}/{n_total_step}: loss = {loss_value:.5f}, acc = {100*(n_corrects/labels.size(0)):.2f}%')
  print()

   

epoch 1/5, step: 250/1250: loss = 0.01601, acc = 100.00%
epoch 1/5, step: 500/1250: loss = 0.03563, acc = 97.50%
epoch 1/5, step: 750/1250: loss = 0.14697, acc = 95.00%
epoch 1/5, step: 1000/1250: loss = 0.06332, acc = 97.50%
epoch 1/5, step: 1250/1250: loss = 0.01718, acc = 100.00%

epoch 2/5, step: 250/1250: loss = 0.01432, acc = 100.00%
epoch 2/5, step: 500/1250: loss = 0.00585, acc = 100.00%
epoch 2/5, step: 750/1250: loss = 0.00607, acc = 100.00%
epoch 2/5, step: 1000/1250: loss = 0.00969, acc = 100.00%
epoch 2/5, step: 1250/1250: loss = 0.08723, acc = 97.50%

epoch 3/5, step: 250/1250: loss = 0.03872, acc = 100.00%
epoch 3/5, step: 500/1250: loss = 0.00851, acc = 100.00%
epoch 3/5, step: 750/1250: loss = 0.01612, acc = 100.00%
epoch 3/5, step: 1000/1250: loss = 0.01277, acc = 100.00%
epoch 3/5, step: 1250/1250: loss = 0.01875, acc = 100.00%

epoch 4/5, step: 250/1250: loss = 0.05796, acc = 95.00%
epoch 4/5, step: 500/1250: loss = 0.00648, acc = 100.00%
epoch 4/5, step: 750/1250: 

In [None]:
with torch.no_grad():
    number_corrects = 0
    number_samples = 0
    for i, (test_images_set , test_labels_set) in enumerate(test_loader):
        test_images_set = test_images_set.to(device)
        test_labels_set = test_labels_set.to(device)
    
        y_predicted = model(test_images_set)
        labels_predicted = y_predicted.argmax(axis = 1)
        number_corrects += (labels_predicted==test_labels_set).sum().item()
        number_samples += test_labels_set.size(0)
    print(f'Overall accuracy {(number_corrects / number_samples)*100}%')

Overall accuracy 93.47999999999999%
