Before ResNet, deep networks suffered from:

Vanishing gradients: gradients shrink as they flow backward through layers.

Degradation: past a certain depth, accuracy gets worse as layers are added, even on the training set, so deeper isn't always better.

ResNet introduced a simple idea:

Instead of learning a full transformation,
just learn the difference from the input (i.e., the residual).

In a plain network: out = F(x)

In a ResNet block: out = F(x) + x

This lets the network focus on just learning the residual (how the output differs from the input).

This also helps with vanishing gradients: the skip connection gives gradients a direct path back to earlier layers.
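
To see why the skip connection helps, it can be useful to write out the chain rule for out = F(x) + x. This is only a sketch: the loss L and the identity matrix I are symbols introduced here for illustration, and any activation applied after the addition is ignored.

$$
\frac{\partial L}{\partial x}
= \frac{\partial L}{\partial \mathrm{out}} \cdot \frac{\partial \mathrm{out}}{\partial x}
= \frac{\partial L}{\partial \mathrm{out}} \left( \frac{\partial F(x)}{\partial x} + I \right)
$$

The I term passes the upstream gradient through unchanged, so even when the gradient through F is very small, the gradient reaching x does not vanish along the shortcut path.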

The network can easily learn an identity mapping if needed (just set the residual weights ≈ 0).
You can keep stacking layers.

If they're not useful, the model just skips them by making F(x) = 0.
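
The F(x) = 0 case can be checked with a small experiment. The sketch below uses a hypothetical `TinyResidual` module (not part of these notes) whose residual branch F is a single linear layer with no activation after the addition. Zeroing its weights should make the whole unit behave as an identity mapping.

```python
import torch
import torch.nn as nn

class TinyResidual(nn.Module):
    """Hypothetical minimal residual unit: out = F(x) + x, with F a linear layer."""
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Linear(dim, dim)

    def forward(self, x):
        return self.f(x) + x

block = TinyResidual(4)

# Drive the residual branch to zero so that F(x) = 0 for every input.
nn.init.zeros_(block.f.weight)
nn.init.zeros_(block.f.bias)

x = torch.randn(2, 4)
out = block(x)

# With F(x) = 0 the unit simply passes its input through.
is_identity = torch.allclose(out, x)
print(is_identity)
```

Note that the ResidualBlock in these notes applies a ReLU after the addition, so with F(x) = 0 it would return relu(x) rather than x itself.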


If F(x) and x have different shapes (different channel count or spatial size), we use a shortcut that converts x into the same shape as F(x) using a 1×1 convolution.
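
As a quick shape check, here is a minimal sketch of the projection shortcut on its own. The tensor sizes (64 to 128 channels, 32×32 input, stride 2) are assumptions chosen for illustration, and the single 3×3 conv stands in for the full residual branch F.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)  # (batch, channels, height, width)

# Stand-in for F(x): a 3x3 conv that doubles the channels and halves the spatial size.
main = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1, bias=False)

# Projection shortcut: a 1x1 conv with the same stride and output channels.
shortcut = nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False)

fx = main(x)
sx = shortcut(x)
print(fx.shape, sx.shape)

# The shapes now match, so the element-wise addition is valid.
out = fx + sx
```

Without the shortcut, `fx + x` would fail because x still has 64 channels and a 32×32 spatial size.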

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: two 3x3 conv + BatchNorm layers plus a skip connection."""
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()

        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)

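        # Projection shortcut: only needed when the stride or channel count
        # makes F(x) a different shape from x.
        self.downsample = None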
        if stride != 1 or in_channels != out_channels:
            self.downsample = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x):
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        # Residual connection: out = F(x) + x
        out += identity
        out = self.relu(out)
        return out
