# Here we manually create ResNet 34 for application to ImageNet - 1k classes

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models
import pdb

# Explore Input Dimensions

In [9]:
inp = torch.randn([2,3,224,224])

In [5]:
inp.shape # batch size, RGB channels, height, width

torch.Size([2, 3, 224, 224])

PyTorch's nn.Conv2d layer expects inputs of shape (batch_size, channels, height, width).

Here we use a batch size of 2, 3 RGB color channels, and a height and width of 224 each. It is a rank 4 tensor.

## Question - what is batch size?
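
As a partial answer, here is a minimal sketch: the batch size is the number of images passed through the network together, and it is the first axis of the input tensor. The names `img_a`, `img_b` and `batch` are made up for illustration.

```python
import torch

# Two hypothetical RGB images, each of shape (channels, height, width).
img_a = torch.randn(3, 224, 224)
img_b = torch.randn(3, 224, 224)

# Stacking along a new leading axis gives a batch of size 2.
batch = torch.stack([img_a, img_b], dim=0)

print(batch.shape)     # torch.Size([2, 3, 224, 224])
print(batch.shape[0])  # batch size: 2
print(batch.dim())     # rank: 4
```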

# Convolution block
First we are going to construct a small sequential network for the initial convolution block of ResNet. It's the very first layer in our resnet architecture.

It consists of 4 operations:

- Convolution
- Batch Normalization
- ReLU activation function
- Maxpooling

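Conv2d here: 3 input channels, 64 kernels creating 64 feature maps, 7 x 7 kernel size.
How do we get padding 3? For a 7 x 7 kernel, padding of (7 - 1) / 2 = 3 keeps the kernel centred on every input pixel, so at stride 1 the output would be the same size as the input. With stride 2 the output is half the size: 224 -> 112.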

In [10]:
conv_block = nn.Sequential(nn.Conv2d(3,64,kernel_size=7, stride=2, padding=3, bias=False), #112,112
                       nn.BatchNorm2d(64),
                       nn.ReLU(inplace=True),
                       nn.MaxPool2d(kernel_size=3, stride=2, padding=1)) # 56,56
conv_block

Sequential(
  (0): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (2): ReLU(inplace=True)
  (3): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
)

The Conv2d layer takes 3 input channels and applies 64 filters (kernels), generating 64 output channels, i.e. feature maps. We can choose how many feature maps the convolution operation generates.

Here, the kernel size is 7 x 7, the stride is 2 and the padding is 3. Padding adds a border of zeros around the input matrix.

The Conv2d layer downsamples the input when the stride is 2, i.e. the convolution window moves 2 pixels at a time, skipping over 1 pixel.

So after the Conv2d layer the output will be a (2,64,112,112) tensor of activations. The height and width of the input grid are halved.
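
To see how stride and padding each affect the output size, here is a small sketch that applies the same 7 x 7 kernel with three settings. The layer names `same_size`, `halved` and `no_pad` are made up for illustration; only the `halved` configuration is the one used in this notebook.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 224, 224)

# Same 7 x 7 kernel, three different stride/padding settings.
same_size = nn.Conv2d(3, 64, kernel_size=7, stride=1, padding=3, bias=False)
halved = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
no_pad = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=0, bias=False)

with torch.no_grad():
    print(same_size(x).shape)  # torch.Size([1, 64, 224, 224])
    print(halved(x).shape)     # torch.Size([1, 64, 112, 112])
    print(no_pad(x).shape)     # torch.Size([1, 64, 109, 109])
```

Padding of 3 is what makes the stride 2 output exactly half the input size rather than slightly less.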


In [7]:
# apply only the first convolution, without overwriting inp
nn.Conv2d(3,64,kernel_size=7, stride=2, padding=3, bias=False)(inp).shape

torch.Size([2, 64, 112, 112])

After the max pooling operation with a stride of 2, the output of Conv2d -> BatchNorm -> ReLU is downsampled by half again.

Let's check the shape of the output after the conv_block operations.

In [11]:
out=conv_block(inp)
out.shape

torch.Size([2, 64, 56, 56])

# Residual block
After 2 convolution operations, the input of those 2 convolutions is added to their output. This is the core of the residual method: the extra input x does not add parameters or complexity to the model.
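
A minimal sketch of that claim, assuming a toy block rather than the real BasicBlock defined below: `two_convs`, `ToyResidual` and `count_params` are hypothetical helpers. It compares a plain pair of convolutions with the same pair wrapped in a skip connection.

```python
import torch
import torch.nn as nn


def two_convs(channels):
    # Two 3 x 3 convolutions that keep the spatial size and channel count.
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
        nn.ReLU(),
        nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
    )


class ToyResidual(nn.Module):
    """Hypothetical block computing F(x) + x, with F being two_convs."""

    def __init__(self, channels):
        super().__init__()
        self.body = two_convs(channels)

    def forward(self, x):
        return self.body(x) + x  # the skip connection is a plain addition


def count_params(module):
    return sum(p.numel() for p in module.parameters())


plain = two_convs(64)
residual = ToyResidual(64)
x = torch.randn(2, 64, 56, 56)

print(count_params(plain), count_params(residual))  # identical counts
print(residual(x).shape)                            # same shape as x
```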

# Basic Block
Each basic block consists of 2 convolution operations.

Each convolutional layer is followed by a batch normalization layer. A ReLU activation follows the first batch norm; after the second batch norm, the input is added to the output and only then is ReLU applied. The input may have to be downsampled first so that its shape matches the output.

For the 3 x 3 kernel, padding of 1 and stride of 1 mean that the output is the same size as the input for the first conv layer.
## Question - how to get the expansion value from the paper? Why is bias False? What does the downsampling do - I know that later we may update downsampling, but why does that replace the original identity?
When the stride is not 1, i.e. in the first conv layer when stride = 2, the output of the second conv is smaller than the original input x. In order to add x to that output we must downsample x so that both have the same dimensions; this is the projection W_s in the original paper. This downsample is a 1 x 1 convolution + batch norm. Bias is False because each convolution is followed by batch norm, whose own learnable shift makes a separate conv bias redundant.


### Question: how do the number of filters, the number of channels and the input shapes interact?
To calculate the output dimensions we use
$\left\lfloor \frac{h - k + 2p}{s} \right\rfloor + 1$ where h = input dimension, k = kernel size, p = padding, s = stride
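
The formula can be checked against the sizes seen in this notebook. This is a sketch with a hypothetical helper, `conv_out_size`, which rounds the division down; the same arithmetic applies to max pooling with default dilation.

```python
def conv_out_size(h, k, p, s):
    """Output size along one dimension: floor((h - k + 2p) / s) + 1."""
    return (h - k + 2 * p) // s + 1


size = 224
size = conv_out_size(size, k=7, p=3, s=2)   # initial 7 x 7 convolution
print(size)  # 112
size = conv_out_size(size, k=3, p=1, s=2)   # max pooling
print(size)  # 56

print(conv_out_size(56, k=3, p=1, s=1))  # 56: 3 x 3 conv, stride 1 keeps the size
print(conv_out_size(56, k=3, p=1, s=2))  # 28: 3 x 3 conv, stride 2 halves it
print(conv_out_size(56, k=1, p=0, s=2))  # 28: 1 x 1 downsample conv, stride 2
```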

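Why does the number of channels change between input and output on the first conv layer but not the second? The first conv maps inplanes channels to planes channels; the second maps planes to planes, so the block's output has planes channels.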

In [13]:
class BasicBlock(nn.Module):
    expansion = 1

    def __init__(self, inplanes, planes, stride=1, downsample=None):
        super().__init__()
        self.conv1 = nn.Conv2d(inplanes, planes, kernel_size=3, stride=stride,
                     padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=1,
                     padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity
        out = self.relu(out)

        return out

In [14]:
BasicBlock(64,128)

BasicBlock(
  (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)

In [15]:
t = torch.randn((2,64,56,56))
t.shape

torch.Size([2, 64, 56, 56])

In [36]:
block = BasicBlock(64,64)
block(t).shape


torch.Size([2, 64, 56, 56])
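
The block above was only called with stride 1 and equal channel counts. Here is a sketch of the other case, stride 2 with 64 -> 128 channels, where the downsample argument is needed. To keep the snippet runnable on its own it repeats a condensed copy of BasicBlock; in the notebook the class from In [13] can be used directly.

```python
import torch
import torch.nn as nn


class BasicBlock(nn.Module):
    # Condensed copy of the class defined above, so the snippet runs alone.
    def __init__(self, inplanes, planes, stride=1, downsample=None):
        super().__init__()
        self.conv1 = nn.Conv2d(inplanes, planes, 3, stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(planes, planes, 3, 1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.downsample = downsample

    def forward(self, x):
        identity = x if self.downsample is None else self.downsample(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)


# 1 x 1 convolution with stride 2, then batch norm, reshapes the identity path.
downsample = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False),
    nn.BatchNorm2d(128),
)
block = BasicBlock(64, 128, stride=2, downsample=downsample)

t = torch.randn(2, 64, 56, 56)
print(block(t).shape)  # torch.Size([2, 128, 28, 28])
```

Without the downsample argument the same call fails at the addition, because the identity would still have shape (2, 64, 56, 56).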

# make_layer()
The next 4 layers after the initial conv_block are called layer 1, 2, 3 and 4 respectively. Each layer consists of multiple convolution blocks. A conv block here refers to a set of operations, Convolution -> BatchNorm -> ReLU activation, applied to an input. Instead of adding new top-level layers to create a deeper neural network, the ResNet authors stacked many blocks within each layer, thus keeping the number of top-level layers the same: 4. The 34 in ResNet 34 counts the individual convolutional and fully connected layers inside them.
### Question: why does coding them up in 'blocks' mean that they don't count as a whole new layer? Is it not just notation?
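
On the question above, a small sketch of the arithmetic suggests it is mostly notation: counting the convolutions inside the blocks gives back the 34 in the name. The variable names are made up for illustration, and the 1 x 1 downsample convolutions are conventionally left out of the count.

```python
# Blocks per layer for ResNet 34, as passed to _make_layer later in this notebook.
blocks_per_layer = [3, 4, 6, 3]
convs_per_basic_block = 2

initial_conv = 1   # the 7 x 7 convolution in conv_block
final_linear = 1   # the fully connected classifier

depth = initial_conv + convs_per_basic_block * sum(blocks_per_layer) + final_linear
print(depth)  # 34
```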

In the PyTorch implementation they distinguish between the blocks that include 2 operations – Basic Block – and the blocks that include 3 operations – Bottleneck Block.

The make_layer() function takes the type of block to use as an argument, the number of input and output filters, and the number of blocks to be stacked together. Layers 2, 3 and 4 downsample the input at the start using a stride of 2, i.e. in the 1st convolutional layer of the 1st block of the layer; layer 1 keeps stride 1 because the max pooling has already downsampled its input. All the remaining convolution layers have stride 1. Also, if the input of a block has to be downsampled for the addition, a 1 x 1 convolution with the same stride is used, followed by BatchNorm.


In [21]:
def _make_layer(block, inplanes,planes, blocks, stride=1):
    downsample = None  
    if stride != 1 or inplanes != planes:
        downsample = nn.Sequential(            
            nn.Conv2d(inplanes, planes, 1, stride, bias=False),
            nn.BatchNorm2d(planes),
        )
    layers = []
    layers.append(block(inplanes, planes, stride, downsample))
    inplanes = planes
    for _ in range(1, blocks):
        layers.append(block(inplanes, planes))
    return nn.Sequential(*layers)

In [22]:
layers=[3, 4, 6, 3]

In [23]:
layer1 =_make_layer(BasicBlock, inplanes=64,planes=64, blocks=layers[0])
layer1

Sequential(
  (0): BasicBlock(
    (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (relu): ReLU(inplace=True)
    (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
  (1): BasicBlock(
    (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (relu): ReLU(inplace=True)
    (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
  (2): BasicBlock(
    (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (bn1): BatchNorm2d(64, eps=1e-05, mome

In [24]:
layer2 = _make_layer(BasicBlock, 64, 128, layers[1], stride=2)
layer2

Sequential(
  (0): BasicBlock(
    (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
    (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (relu): ReLU(inplace=True)
    (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (downsample): Sequential(
      (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
      (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (1): BasicBlock(
    (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (relu): ReLU(inplace=True)
    (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (bn2): BatchNorm2d(128, eps=1e-

# Why we need to downsample input

In [25]:
t = torch.rand((2,64,56,56))
t.shape # batch, channels (feature maps), height, width

torch.Size([2, 64, 56, 56])

In [26]:
o = nn.Conv2d(64,128,kernel_size = 3, stride = 2, padding = 1)(t)
o.shape

torch.Size([2, 128, 28, 28])

In [27]:
o+t  # shapes not matching: channels 128 != 64, height and width 28,28 != 56,56

RuntimeError: The size of tensor a (28) must match the size of tensor b (56) at non-singleton dimension 3

In [37]:
# Here we apply a 1 x 1 convolution with stride 2 to t to create something that we can add to o, the output of the conv layers
t_d = nn.Conv2d(64, 128, kernel_size=1, stride=2, padding=0)(t)
o.shape,t_d.shape

(torch.Size([2, 128, 28, 28]), torch.Size([2, 128, 28, 28]))

In [29]:
(o+t_d).shape

torch.Size([2, 128, 28, 28])
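
To see what the 1 x 1 convolution with stride 2 does spatially, here is a sketch on a small made-up tensor. The weights are set by hand to an identity mapping between channels (an assumption for illustration; in the real network they are learned), so only the spatial effect remains: the layer keeps every second pixel in each direction.

```python
import torch
import torch.nn as nn

x = torch.randn(2, 4, 8, 8)

# 1 x 1 convolution with stride 2; weights set by hand to the identity
# mapping between channels so that only the spatial effect is visible.
conv = nn.Conv2d(4, 4, kernel_size=1, stride=2, bias=False)
with torch.no_grad():
    conv.weight.copy_(torch.eye(4).view(4, 4, 1, 1))
    y = conv(x)

print(y.shape)                               # torch.Size([2, 4, 4, 4])
print(torch.allclose(y, x[:, :, ::2, ::2]))  # True
```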

# Classifier block
This is the fully connected part of the network. It's the final layer in our resnet architecture. It consists of 3 operations:

- Average pooling layer - aggregates each feature map into a single value. Its output grid size is 1 x 1, which gives a rank 3 tensor of shape (512,1,1) per image.

- Flatten - the linear layer expects a vector per image instead of a rank 3 tensor. So in forward() we call 'torch.flatten' to remove the unit axes and make it just a vector of length 512.
        x = torch.flatten(x, 1)
- Linear Layer - a fully connected layer just before applying softmax, which takes in 512 features and outputs 1000 class scores (logits); softmax turns these into class probabilities. (In the case of ImageNet, we have 1000 categories.)

In [30]:
num_classes=1000
nn.Sequential(nn.AdaptiveAvgPool2d((1, 1)),
              nn.Linear(512 , num_classes))

Sequential(
  (0): AdaptiveAvgPool2d(output_size=(1, 1))
  (1): Linear(in_features=512, out_features=1000, bias=True)
)
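
The Sequential above has no flatten step, so it cannot be called directly on a (N, 512, 7, 7) tensor. Here is a sketch of a callable version, assuming `nn.Flatten` in place of the `torch.flatten` call used in forward(). The names `classifier` and `features` are made up for illustration.

```python
import torch
import torch.nn as nn

num_classes = 1000
classifier = nn.Sequential(
    nn.AdaptiveAvgPool2d((1, 1)),  # (N, 512, 7, 7) -> (N, 512, 1, 1)
    nn.Flatten(1),                 # (N, 512, 1, 1) -> (N, 512)
    nn.Linear(512, num_classes),   # (N, 512)       -> (N, 1000)
)

features = torch.randn(2, 512, 7, 7)   # stand-in for the output of layer 4
logits = classifier(features)
probs = torch.softmax(logits, dim=1)   # softmax turns scores into probabilities

print(logits.shape)      # torch.Size([2, 1000])
print(probs.sum(dim=1))  # each row sums to 1 (up to rounding)
```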

# Resnet class

TO DO 
turn first layer into callable module

In [38]:
class ResNet(nn.Module):

    def __init__(self, block, layers, num_classes=1000):
        super().__init__()
        
        self.inplanes = 64

        self.conv1 = nn.Conv2d(3, self.inplanes, kernel_size=7, stride=2, padding=3,
                               bias=False)
        self.bn1 = nn.BatchNorm2d(self.inplanes)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2)
        
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 , num_classes)


    def _make_layer(self, block, planes, blocks, stride=1):
        downsample = None  
   
        if stride != 1 or self.inplanes != planes:
            downsample = nn.Sequential(
                nn.Conv2d(self.inplanes, planes, 1, stride, bias=False),
                nn.BatchNorm2d(planes),
            )

        layers = []
        layers.append(block(self.inplanes, planes, stride, downsample))
        
        self.inplanes = planes
        
        for _ in range(1, blocks):
            layers.append(block(self.inplanes, planes))

        return nn.Sequential(*layers)
    
    
    def forward(self, x):
        x = self.conv1(x)           # 112x112
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)         # 56x56

        x = self.layer1(x)          # 56x56
        x = self.layer2(x)          # 28x28
        x = self.layer3(x)          # 14x14
        x = self.layer4(x)          # 7x7

        x = self.avgpool(x)         # 1x1
        x = torch.flatten(x, 1)     # remove the 1 x 1 grid: (N, 512, 1, 1) -> (N, 512)
        x = self.fc(x)

        return x

In [32]:
def resnet34():
    layers=[3, 4, 6, 3]
    model = ResNet(BasicBlock, layers)
    return model

In [33]:
model=resnet34()

In [34]:
model

ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
  