### ResNet

### Deep Residual Learning for Image Recognition (He K. et al., 2016)

*We explicitly reformulate the layers as learning __residual functions__ with reference to the layer inputs, instead of learning unreferenced functions... We provide comprehensive empirical evidence showing that these __residual networks are easier to optimize, and can gain accuracy from considerably increased depth__.*


[Paper](https://arxiv.org/abs/1512.03385)

In [1]:
import os
import re
import numpy as np
import netron
import torch
import torch.nn as nn
import torchvision

assert torch.cuda.is_available() is True
%load_ext watermark

In [2]:
%watermark -p torch,ignite,numpy,netron

torch : 1.10.2
ignite: 0.4.8
numpy : 1.22.1
netron: 5.7.8




*When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated...*



<img src="../assets/2_resnet.png" width="500">

<img src="../assets/5_resnet.png" width="600">

Proposed basic block (left and right):

$$y = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x} $$

Projection block for matching dimensions (adds a 1x1 convolution to the skip connection):


$$y = \mathcal{F}(\mathbf{x}, \{W_i\}) + W_s\mathbf{x} $$
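
One common way to see why the identity shortcut may ease optimization is to differentiate the first formula with respect to the input. This is a standard back-of-the-envelope argument rather than a derivation from this notebook; $I$ denotes the identity matrix and $\mathcal{L}$ a hypothetical training loss:

$$\frac{\partial y}{\partial \mathbf{x}} = \frac{\partial \mathcal{F}(\mathbf{x}, \{W_i\})}{\partial \mathbf{x}} + I, \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{x}} = \frac{\partial \mathcal{L}}{\partial y}\left(\frac{\partial \mathcal{F}(\mathbf{x}, \{W_i\})}{\partial \mathbf{x}} + I\right)$$

The $I$ term passes the upstream gradient through unchanged even when the Jacobian of $\mathcal{F}$ is small. With the projection shortcut, $I$ is replaced by $W_s$.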

Bottleneck designs are used mainly for practical reasons: they keep the computational cost of very deep networks affordable.
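
To make the cost argument concrete, the sketch below counts multiply-accumulates for a 256-channel input, the width used in the paper's bottleneck figure. The helper `conv_macs`, the 60x60 spatial size, and the comparison against two plain 3x3 convolutions at 256 channels are illustrative assumptions, not figures taken from the paper.

```python
def conv_macs(kernel, c_in, c_out, height, width):
    """Multiply-accumulates of a stride-1, 'same'-padded convolution."""
    k_h, k_w = kernel
    return k_h * k_w * c_in * c_out * height * width


H = W = 60  # same spatial size as the examples in this notebook

# Two 3x3 convolutions operating directly on 256 channels.
plain_256 = 2 * conv_macs((3, 3), 256, 256, H, W)

# Bottleneck: 1x1 reduce to 64, 3x3 at 64, 1x1 expand back to 256.
bottleneck_256 = (conv_macs((1, 1), 256, 64, H, W)
                  + conv_macs((3, 3), 64, 64, H, W)
                  + conv_macs((1, 1), 64, 256, H, W))

print(f'plain 3x3 pair : {plain_256 / 1e6:.1f} M')
print(f'bottleneck     : {bottleneck_256 / 1e6:.1f} M')
print(f'ratio          : {plain_256 / bottleneck_256:.1f}x')
```

Under these assumptions the bottleneck needs roughly 17 times fewer multiply-accumulates than the plain pair at the same input and output width.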

In [3]:
class BasicConv2d(nn.Module):
    def __init__(self, in_channels: int, out_channels: int, kernel: int, pad: int = 0, **kwargs) -> None:
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels,
                              kernel_size=kernel, padding=pad, bias=False, **kwargs)
        self.bn = nn.BatchNorm2d(out_channels, eps=0.001)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.bn(self.conv(x))
        return z

In [4]:
class BuildingBlock(nn.Module):
    def __init__(self, in_channels: int, out_channels: int, **kwargs) -> None:
        super().__init__()
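        # The first convolution maps to out_channels (as in torchvision's BasicBlock);
        # the identity shortcut still requires in_channels == out_channels.
        self.conv1 = BasicConv2d(in_channels, out_channels, 3, 1)
        self.conv2 = BasicConv2d(out_channels, out_channels, 3, 1)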
        self.relu = nn.ReLU()
        
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.relu(self.conv1(x))
        z = self.conv2(z)
        z += x
        print(f'Total feature maps: {z.shape[1]} of size: {z.shape[2:]}')
        return self.relu(z)

In [5]:
resnet18_conv2x = BuildingBlock(64, 64)
resnet18_conv2x

BuildingBlock(
  (conv1): BasicConv2d(
    (conv): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (bn): BatchNorm2d(64, eps=0.001, momentum=0.1, affine=True, track_running_stats=True)
  )
  (conv2): BasicConv2d(
    (conv): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (bn): BatchNorm2d(64, eps=0.001, momentum=0.1, affine=True, track_running_stats=True)
  )
  (relu): ReLU()
)

In [6]:
x = torch.randn(1, 64, 60, 60)
os.makedirs('onnx_graphs', exist_ok=True)  # the export fails if the directory is missing
model_path = os.path.join('onnx_graphs', 'resnet18_conv2x.onnx')
torch.onnx.export(resnet18_conv2x, x, model_path,
                  input_names=['input'], output_names=['output'], opset_version=10)

Total feature maps: 64 of size: torch.Size([60, 60])


In [None]:
netron.start(model_path, 30000)

In [7]:
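def count_conv_ops(kernel, output_channel, input_shape):
    # Multiply-accumulates of a stride-1, 'same'-padded convolution:
    # k_h * k_w * C_out * C_in * H * W, with input_shape = (C_in, H, W).
    return np.prod([*kernel, output_channel, *input_shape])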

In [8]:
ops_cnt = np.array((count_conv_ops((3, 3), 64, (64, 60, 60)),
                    count_conv_ops((3, 3), 64, (64, 60, 60)),))
print("%s\n%s" % (ops_cnt, ops_cnt/np.sum(ops_cnt)))
print('Total operations: %.3f M' % (np.sum(ops_cnt)/1e6))

[132710400 132710400]
[0.5 0.5]
Total operations: 265.421 M


In [9]:
class BottleneckBlock(nn.Module):
    def __init__(self, in_channels: int, out_channels: tuple, **kwargs) -> None:
        super().__init__()
        self.branch1 = nn.Sequential(BasicConv2d(in_channels, out_channels[0], 1),
                                     nn.ReLU(inplace=True),
                                     BasicConv2d(out_channels[0], out_channels[1], 3, 1),
                                     nn.ReLU(inplace=True),
                                     BasicConv2d(out_channels[1], out_channels[2], 1))
        
        self.branch2 = nn.Sequential(BasicConv2d(in_channels, out_channels[2], 1))
        self.relu = nn.ReLU()
        
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.branch1(x) 
        sc = self.branch2(x) 
        z += sc
        print(f'Total feature maps: {z.shape[1]} of size: {z.shape[2:]}')
        return self.relu(z)

In [10]:
resnet50_conv2x = BottleneckBlock(64, (64, 64, 256))
resnet50_conv2x

BottleneckBlock(
  (branch1): Sequential(
    (0): BasicConv2d(
      (conv): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn): BatchNorm2d(64, eps=0.001, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): ReLU(inplace=True)
    (2): BasicConv2d(
      (conv): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn): BatchNorm2d(64, eps=0.001, momentum=0.1, affine=True, track_running_stats=True)
    )
    (3): ReLU(inplace=True)
    (4): BasicConv2d(
      (conv): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn): BatchNorm2d(256, eps=0.001, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (branch2): Sequential(
    (0): BasicConv2d(
      (conv): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn): BatchNorm2d(256, eps=0.001, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (relu): ReLU()
)

In [11]:
x = torch.randn(1, 64, 60, 60)
os.makedirs('onnx_graphs', exist_ok=True)  # the export fails if the directory is missing
model_path = os.path.join('onnx_graphs', 'resnet50_conv2x.onnx')
torch.onnx.export(resnet50_conv2x, x, model_path, 
                  input_names=['input'], output_names=['output'], opset_version=10)

Total feature maps: 256 of size: torch.Size([60, 60])


In [12]:
netron.start(model_path, 30000)

Serving 'onnx_graphs/resnet50_conv2x.onnx' at http://localhost:30000


('localhost', 30000)

In [13]:
ops_cnt = np.array((count_conv_ops((1, 1), 64, (64, 60, 60)),
                    count_conv_ops((3, 3), 64, (64, 60, 60)),
                    count_conv_ops((1, 1), 256, (64, 60, 60)),
                   ))
print("%s\n%s" % (ops_cnt, ops_cnt/np.sum(ops_cnt)))
print('Total operations: %.3f M' % (np.sum(ops_cnt)/1e6))

[ 14745600 132710400  58982400]
[0.07142857 0.64285714 0.28571429]
Total operations: 206.438 M
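
The count above covers only the main branch. `BottleneckBlock` as defined in this notebook also applies a 1x1 projection on the shortcut (`branch2`), which a complete tally would include. A minimal sketch using the same counting convention; the helper `conv_ops` is a hypothetical NumPy-free copy of `count_conv_ops` so that the snippet is self-contained:

```python
from math import prod


def conv_ops(kernel, output_channel, input_shape):
    # Same convention as count_conv_ops: k_h * k_w * C_out * C_in * H * W.
    return prod([*kernel, output_channel, *input_shape])


main_branch = (conv_ops((1, 1), 64, (64, 60, 60))
               + conv_ops((3, 3), 64, (64, 60, 60))
               + conv_ops((1, 1), 256, (64, 60, 60)))
shortcut = conv_ops((1, 1), 256, (64, 60, 60))  # branch2: 64 -> 256 projection
basic_block = 2 * conv_ops((3, 3), 64, (64, 60, 60))

print('Main branch : %.3f M' % (main_branch / 1e6))
print('Shortcut    : %.3f M' % (shortcut / 1e6))
print('Bottleneck  : %.3f M' % ((main_branch + shortcut) / 1e6))
print('Basic block : %.3f M' % (basic_block / 1e6))
```

With the projection included, this particular bottleneck block costs as much as the basic block while producing four times as many output channels (256 instead of 64).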


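So a bottleneck block typically first reduces the number of feature maps with a 1x1 convolution, applies the 3x3 convolution in the reduced space, and then expands the channel count again with the final 1x1 convolution.

How can we discard so many features without losing much information?

The answer lies in the structure of image data: feature maps are highly correlated, so a lower-dimensional projection can retain most of the signal.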

<img src="../assets/4_resnet.png" width="800">

<img src="../assets/6_resnet.png" width="500">

ResNet interpretations:

* An ensemble of relatively shallow networks: unrolling the skip connections shows that a residual network is a collection of many paths of different lengths, so it combines parallel and serial modules: [Link](https://arxiv.org/abs/1605.06431)

* A link to recurrent networks and models of the visual cortex: [Link](https://arxiv.org/abs/1604.03640)
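
The ensemble interpretation in the first item can be sketched by enumerating paths: each residual block offers a choice between the skip connection and the residual branch, so `n` blocks give `2**n` paths whose lengths follow a binomial distribution. The function name and the small `n` below are illustrative assumptions.

```python
from itertools import product
from math import comb


def path_length_histogram(num_blocks):
    """Count paths by the number of residual branches they pass through."""
    histogram = [0] * (num_blocks + 1)
    # Each block offers a binary choice: take the skip (0) or the residual branch (1).
    for choices in product((0, 1), repeat=num_blocks):
        histogram[sum(choices)] += 1
    return histogram


n = 3
hist = path_length_histogram(n)
closed_form = [comb(n, k) for k in range(n + 1)]

print(hist)       # number of paths of length 0..n
print(sum(hist))  # total number of paths
```

Most paths have an intermediate length, which is the sense in which the network resembles an ensemble of relatively shallow sub-networks.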

#### Visualizing the Loss Landscape of Neural Nets (Li H. et al., 2018)
[Paper](https://arxiv.org/pdf/1712.09913.pdf)


<img src="../assets/1_resnet.png" width="600">

#### Torchvision [implementation](https://github.com/pytorch/vision/blob/e13206d9749e81fd8b3aec5e664f697a73febf9f/torchvision/models/resnet.py#L164)

* Basic block (resnet{18,34}):

```python
class BasicBlock(nn.Module):
    expansion: int = 1

    def __init__(
        self,
        inplanes: int,
        planes: int,
        stride: int = 1,
        downsample: Optional[nn.Module] = None,
        groups: int = 1,
        base_width: int = 64,
        dilation: int = 1,
        norm_layer: Optional[Callable[..., nn.Module]] = None,
    ) -> None:
        super().__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        if groups != 1 or base_width != 64:
            raise ValueError("BasicBlock only supports groups=1 and base_width=64")
        if dilation > 1:
            raise NotImplementedError("Dilation > 1 not supported in BasicBlock")
        # Both self.conv1 and self.downsample layers downsample the input when stride != 1
        self.conv1 = conv3x3(inplanes, planes, stride)
        self.bn1 = norm_layer(planes)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(planes, planes)
        self.bn2 = norm_layer(planes)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x: Tensor) -> Tensor:
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity
        out = self.relu(out)

        return out
```

* Modified Bottleneck layer (resnet{50,101,152}):


```python
class Bottleneck(nn.Module):
    # Bottleneck in torchvision places the stride for downsampling at 3x3 convolution(self.conv2)
    # while original implementation places the stride at the first 1x1 convolution(self.conv1)
    # according to "Deep residual learning for image recognition"https://arxiv.org/abs/1512.03385.
    # This variant is also known as ResNet V1.5 and improves accuracy according to
    # https://ngc.nvidia.com/catalog/model-scripts/nvidia:resnet_50_v1_5_for_pytorch.
...

```


* Zero-initializing the last BN in each residual branch improves accuracy:

```python
class ResNet(nn.Module):
    def __init__(
        self,
        block: Type[Union[BasicBlock, Bottleneck]],
        layers: List[int],
        num_classes: int = 1000,
        zero_init_residual: bool = False,
        groups: int = 1,
        width_per_group: int = 64,
        replace_stride_with_dilation: Optional[List[bool]] = None,
        norm_layer: Optional[Callable[..., nn.Module]] = None,
    ) -> None:
        
...

        # Zero-initialize the last BN in each residual branch,
        # so that the residual branch starts with zeros, and each residual block behaves like an identity.
        # This improves the model by 0.2~0.3% according to https://arxiv.org/abs/1706.02677
        if zero_init_residual:
            for m in self.modules():
                if isinstance(m, Bottleneck):
                    nn.init.constant_(m.bn3.weight, 0)  # type: ignore[arg-type]
                elif isinstance(m, BasicBlock):
                    nn.init.constant_(m.bn2.weight, 0)  # type: ignore[arg-type]
```
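
The effect of zero initialization can be checked on a small example. The block below is a hypothetical stand-in for a basic block (`TinyResidualBlock` is not part of torchvision); with the scale of its last BN set to zero, the residual branch outputs zeros and the block reduces to `relu(x)`.

```python
import torch
import torch.nn as nn


class TinyResidualBlock(nn.Module):
    """Hypothetical stand-in for a basic block: conv-BN-ReLU-conv-BN plus identity."""

    def __init__(self, channels: int) -> None:
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)


torch.manual_seed(0)
block = TinyResidualBlock(8)
nn.init.constant_(block.bn2.weight, 0)  # zero the scale (gamma) of the last BN

x = torch.randn(2, 8, 16, 16)
with torch.no_grad():
    y = block(x)

# The residual branch outputs zeros, so the block reduces to relu(x).
print(torch.equal(y, torch.relu(x)))
```

At initialization every block then starts close to an identity mapping, and the residual branches take effect only as training moves the BN scales away from zero.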

In [14]:
tuple(arch for arch in dir(torchvision.models) if re.match('resnet', arch))

('resnet', 'resnet101', 'resnet152', 'resnet18', 'resnet34', 'resnet50')

#### Your training code here

In [None]:
# Define data transformation pipeline.


# Initialize dataset and dataloaders.


# Initialize pretrained network, replace Linear layer with a new one for your dataset.


# Initialize optimizer, loss function and training procedure with handlers/callbacks.

#### References

* http://cs231n.stanford.edu/slides/2021/lecture_9.pdf
* https://onnx.ai/
* https://pytorch.org/docs/stable/index.html