In [3]:
import torch
from torch import nn
from d2l import torch as d2l

import sys
import os

# Get the path of the parent directory (your_project)
parent_dir = os.path.abspath(os.path.join(os.getcwd(), os.pardir))

# Add the parent directory to the Python path
if parent_dir not in sys.path:
    sys.path.append(parent_dir)

# Now you can import correctly
from models.models import Conv2D, MaxPool2d, ReLU, GlobalAvgPool2d, LinearRegression, SoftmaxRegression, SGDFromScratch, CrossEntropyError

# 3.0 ResNet Architecture
The ResNet architecture introduces the **Residual Learning** block, a fundamental improvement for training deeper models. We will talk about the motivation for deeper neural networks, the core problem that hinders them, and how ResNet mitigates this problem.

For this notebook, I'd like to quote my Multivariable Calculus professor, I-Shen Lai. Say hello to Professor Lai!

> "Haha... you can quote me if you want."
> I-Shen Lai, September 3rd 2025.

## 3.1 Deep Learning
Deep Learning models are made up of multiple sequential layers, usually composed of the models we built in earlier notebooks. Adding more layers allows the network to represent more complex features, so more information can be learned, which in principle results in an even lower loss [[1]](#ref1).

However, there is a limit to the number of layers we can effectively add. In practice, once a plain network reaches tens of layers, it faces the **Degradation Problem**: accuracy saturates and then degrades, accompanied by higher training error [[1]](#ref1).

He et al. hypothesized that stacked non-linear layers have difficulty fitting or approximating the **identity function**, which is the function that simply returns its input [[1]](#ref1).

So if we could somehow skip the layers we deem useless, or that would only lead to overfitting, then theoretically adding more layers would not worsen the model's performance.
> *Drum Rolls*

## 3.2 Residual Learning Block
The Residual Learning block introduced by He et al. uses skip connections, with which a layer can learn either the underlying function that we desire or the identity function.

What is this underlying function? It's the function that models reality and is the ultimate truth. Throughout our lessons, we have been using `LinearRegression` to learn linear relationships, and adding `ReLU` to model non-linear, more complex ones. We define the following: the input $\mathbf{x}$, the underlying function $U(\mathbf{x})$, and the layer's function $F(\mathbf{x})$. Instead of learning $U$ directly, the layer learns the residual:

$$
F(\mathbf{x}) = U(\mathbf{x}) - \mathbf{x}
$$

Cleverly, He et al. rearranged this into the following:

$$
U(\mathbf{x}) = F(\mathbf{x}) + \mathbf{x}
$$

The skip connection reduces to the identity function when $F(\mathbf{x})$ outputs $\mathbf{0}$, since then $U(\mathbf{x}) = \mathbf{x}$.
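To make this concrete, here is a minimal sketch of a residual block whose layer function $F$ is a single linear layer with its weights and bias set to zero. `TinyResidual` is a hypothetical toy class written with `torch.nn` for illustration only; it is not the block we build later in this notebook.

```python
import torch
from torch import nn

class TinyResidual(nn.Module):
    """Hypothetical toy block computing U(x) = F(x) + x."""
    def __init__(self, dim):
        super().__init__()
        self.F = nn.Linear(dim, dim)
        # Zero the weights and bias so that F(x) = 0 for every input
        nn.init.zeros_(self.F.weight)
        nn.init.zeros_(self.F.bias)

    def forward(self, x):
        # Skip connection: add the input back onto the layer's output
        return self.F(x) + x

x = torch.randn(4, 8)
block = TinyResidual(8)
# With F(x) = 0 the block passes its input through unchanged
identity_recovered = torch.equal(block(x), x)
print(identity_recovered)
```

The point of the sketch is that the layer only has to push its weights toward zero to behave like the identity, which should be easier than fitting the identity through a stack of non-linear layers.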

## 3.3 Batch Normalization
Before we work out the implementation in code, note that He et al. adopt batch normalization in their architecture, which helps the model converge faster during training [[2]](#ref2). It is applied between the affine function (here, the convolution) and the non-linearity.

According to D2L, batch normalization normalizes a layer's output with the minibatch's mean $\hat{\mu}_\mathcal{B}$ and standard deviation $\hat{\sigma}_\mathcal{B}$, then scales and shifts the result with the learnable parameters $\gamma$ and $\beta$. A small constant $\epsilon > 0$ is added to the variance so that we never divide by zero [[2]](#ref2).

$$
\begin{align*}
BN(\mathbf{x}) &= \gamma \odot \frac{\mathbf{x} - \hat{\mu}_\mathcal{B}}{\hat{\sigma}_\mathcal{B}} + \beta \\
\hat{\mu}_\mathcal{B} &= \frac{1}{|\mathcal{B}|} \sum_{\mathbf{x} \in \mathcal{B}} \mathbf{x} \\
\hat{\sigma}_\mathcal{B}^2 &= \frac{1}{|\mathcal{B}|} \sum_{\mathbf{x} \in \mathcal{B}} (\mathbf{x} - \hat{\mu}_\mathcal{B})^2 + \epsilon
\end{align*}
$$
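As a quick sanity check of the formula, the sketch below normalizes a small random minibatch by hand and compares the result with PyTorch's built-in `torch.nn.functional.batch_norm` in training mode. The batch size, feature count, and `eps` value are arbitrary choices for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randn(6, 3) * 5 + 2  # minibatch of 6 examples with 3 features
eps = 1e-5

# Per-feature statistics over the batch dimension
mu = X.mean(dim=0)
var = ((X - mu) ** 2).mean(dim=0)  # biased variance, as in the formula
X_hat = (X - mu) / torch.sqrt(var + eps)

# Scale and shift with gamma = 1 and beta = 0 (their initial values)
gamma = torch.ones(3)
beta = torch.zeros(3)
manual = gamma * X_hat + beta

# Built-in version; training=True makes it use the batch statistics
reference = F.batch_norm(X, None, None, weight=gamma, bias=beta,
                         training=True, eps=eps)
max_diff = (manual - reference).abs().max().item()
print(max_diff)
```

After normalization each feature has a mean close to 0 and a variance close to 1 across the batch, until $\gamma$ and $\beta$ learn otherwise.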

In [4]:
# The following definitions are a direct copy from the d2l, I'm too lazy to work this one out
def batch_norm(X, gamma, beta, moving_mean, moving_var, eps, momentum):
    # Use is_grad_enabled to determine whether we are in training mode
    if not torch.is_grad_enabled():
        # In prediction mode, use mean and variance obtained by moving average
        X_hat = (X - moving_mean) / torch.sqrt(moving_var + eps)
    else:
        assert len(X.shape) in (2, 4)
        if len(X.shape) == 2:
            # When using a fully connected layer, calculate the mean and
            # variance on the feature dimension
            mean = X.mean(dim=0)
            var = ((X - mean) ** 2).mean(dim=0)
        else:
            # When using a two-dimensional convolutional layer, calculate the
            # mean and variance on the channel dimension (axis=1). Here we
            # need to maintain the shape of X, so that the broadcasting
            # operation can be carried out later
            mean = X.mean(dim=(0, 2, 3), keepdim=True)
            var = ((X - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
        # In training mode, the current mean and variance are used
        X_hat = (X - mean) / torch.sqrt(var + eps)
        # Update the mean and variance using moving average
        moving_mean = (1.0 - momentum) * moving_mean + momentum * mean
        moving_var = (1.0 - momentum) * moving_var + momentum * var
    Y = gamma * X_hat + beta  # Scale and shift
    return Y, moving_mean.data, moving_var.data

class BatchNorm2d(d2l.Module):    
    # num_features: the number of outputs for a fully connected layer or the
    # number of output channels for a convolutional layer. num_dims: 2 for a
    # fully connected layer and 4 for a convolutional layer
    def __init__(self, num_features, num_dims):
        super().__init__()
        if num_dims == 2:
            shape = (1, num_features)
        else:
            shape = (1, num_features, 1, 1)
        # The scale parameter and the shift parameter (model parameters) are
        # initialized to 1 and 0, respectively
        self.gamma = nn.Parameter(torch.ones(shape))
        self.beta = nn.Parameter(torch.zeros(shape))
        # The variables that are not model parameters are initialized to 0 and
        # 1
        self.moving_mean = torch.zeros(shape)
        self.moving_var = torch.ones(shape)

    def forward(self, X):
        # If X is not on the main memory, copy moving_mean and moving_var to
        # the device where X is located
        if self.moving_mean.device != X.device:
            self.moving_mean = self.moving_mean.to(X.device)
            self.moving_var = self.moving_var.to(X.device)
        # Save the updated moving_mean and moving_var
        Y, self.moving_mean, self.moving_var = batch_norm(
            X, self.gamma, self.beta, self.moving_mean,
            self.moving_var, eps=1e-5, momentum=0.1)
        return Y

## 3.4 ResNet-50 Implementation
We need to design a layer before constructing our model. Since we are building a ResNet-50, we use the bottleneck design. Each Residual Learning block consists of a 1x1, a 3x3, and a 1x1 convolution layer (in that order): the first 1x1 layer reduces the number of channels, and the last 1x1 layer expands it again to four times the bottleneck width [[1]](#ref1). Every convolution layer is followed by `BatchNorm2d` for stability and `ReLU` for non-linearity; the final `ReLU` is applied after the skip connection is added.
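To see why the bottleneck is worth the trouble, here is a rough parameter count comparing a 1x1, 3x3, 1x1 bottleneck (256 to 64 to 64 to 256 channels) against two plain 3x3 convolutions at 256 channels. This sketch uses PyTorch's built-in `nn.Conv2d` rather than our own `Conv2D`, and it ignores biases and batch normalization parameters, so treat the numbers as an approximation.

```python
from torch import nn

def count_params(module):
    """Total number of learnable parameters in a module."""
    return sum(p.numel() for p in module.parameters())

# Bottleneck: reduce to 64 channels, convolve, expand back to 256
bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1, bias=False),
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),
    nn.Conv2d(64, 256, kernel_size=1, bias=False),
)

# Plain alternative: two 3x3 convolutions at the full 256 channels
plain = nn.Sequential(
    nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False),
    nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False),
)

bottleneck_params = count_params(bottleneck)  # 16384 + 36864 + 16384
plain_params = count_params(plain)            # 2 * 589824
print(bottleneck_params, plain_params)
```

Under these assumptions the bottleneck needs roughly 17 times fewer weights than the plain pair, which is what makes a 50-layer network affordable.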

In [5]:
class ResNetLayer(d2l.Module):
    def __init__(self, in_channels, out_channels, use_1x1conv=False, strides=1):
        super().__init__()
        self.in_channels = in_channels
        self.out_channels = out_channels
        
        self.ReLU = ReLU()
        
        self.conv1 = Conv2D(in_channels, out_channels, kernel_size=1, stride=strides)
        self.bn1 = BatchNorm2d(out_channels, 4)
        self.conv2 = Conv2D(out_channels, out_channels, kernel_size=3, padding=1)
        self.bn2 = BatchNorm2d(out_channels, 4)
        self.conv3 = Conv2D(out_channels, out_channels * 4, kernel_size=1)
        self.bn3 = BatchNorm2d(out_channels * 4, 4)
        
        # Skip connection to match input/output dimensions if needed
        if use_1x1conv or in_channels != out_channels * 4:
            self.conv4 = Conv2D(in_channels, out_channels * 4, kernel_size=1, stride=strides)
            self.bn4 = BatchNorm2d(out_channels * 4, 4)
        else:
            self.conv4 = None
    
    def forward(self, X):
        Y = self.ReLU(self.bn1(self.conv1(X)))
        Y = self.ReLU(self.bn2(self.conv2(Y)))
        Y = self.bn3(self.conv3(Y))
        
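        # Project X with the 1x1 convolution when the shapes of X and Y
        # differ; compare against None rather than relying on truthiness
        if self.conv4 is not None:
            X = self.bn4(self.conv4(X))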
            
        Y += X
        
        return self.ReLU(Y)

With the individual layers implemented, we can start building our ResNet-50 model.

In [6]:
class ResNet50(d2l.Classifier):
    def __init__(self, num_classes, lr):
        super().__init__()
        self.lr = lr
        self.bias = True
        self.conv1 = Conv2D(kernel_size=7, in_channels=3, out_channels=64, stride=2)
        self.pool1 = MaxPool2d(kernel_size=3, stride=2)
        self.conv2 = nn.Sequential(
            ResNetLayer(in_channels=64, out_channels=64, use_1x1conv=True),
            ResNetLayer(in_channels=256, out_channels=64, use_1x1conv=True),
            ResNetLayer(in_channels=256, out_channels=64, use_1x1conv=True)
        )
        # The first block of conv3, conv4 and conv5 uses strides=2 to halve
        # the spatial resolution, as in He et al. [1]
        self.conv3 = nn.Sequential(
            ResNetLayer(in_channels=256, out_channels=128, use_1x1conv=True, strides=2),
            ResNetLayer(in_channels=512, out_channels=128, use_1x1conv=True),
            ResNetLayer(in_channels=512, out_channels=128, use_1x1conv=True),
            ResNetLayer(in_channels=512, out_channels=128, use_1x1conv=True)
        )
        self.conv4 = nn.Sequential(
            ResNetLayer(in_channels=512, out_channels=256, use_1x1conv=True, strides=2),
            ResNetLayer(in_channels=1024, out_channels=256, use_1x1conv=True),
            ResNetLayer(in_channels=1024, out_channels=256, use_1x1conv=True),
            ResNetLayer(in_channels=1024, out_channels=256, use_1x1conv=True),
            ResNetLayer(in_channels=1024, out_channels=256, use_1x1conv=True),
            ResNetLayer(in_channels=1024, out_channels=256, use_1x1conv=True),
        )
        self.conv5 = nn.Sequential(
            ResNetLayer(in_channels=1024, out_channels=512, use_1x1conv=True, strides=2),
            ResNetLayer(in_channels=2048, out_channels=512, use_1x1conv=True),
            ResNetLayer(in_channels=2048, out_channels=512, use_1x1conv=True),
        )
        self.pool2 = GlobalAvgPool2d()
        self.fc = LinearRegression(in_features=2048, out_features=1000, lr=self.lr, bias=self.bias)
        self.softmax = SoftmaxRegression(1000, num_classes, lr=self.lr, bias=self.bias)
    
    def forward(self, X):
        Y = self.pool1(self.conv1(X))
        Y = self.conv2(Y)
        Y = self.conv3(Y)
        Y = self.conv4(Y)
        Y = self.conv5(Y)
        Y = self.pool2(Y)
        Y = Y.reshape(Y.shape[0], -1)
        Y = self.softmax(self.fc(Y))
        return Y
        
        
    def loss(self, y_hat, y):
        return CrossEntropyError(y_hat, y)
    
    def configure_optimizers(self):
        return SGDFromScratch(self.parameters(), self.lr)

Let us test the output of this model to verify that everything is working correctly. We have done a lot of scaffolding and building without testing any of it. In the future, we should do better by testing at every step.

In [None]:
model = ResNet50(num_classes=10, lr=0.01)
model(torch.randn(1, 3, 228, 228))

tensor([[0.0949, 0.0995, 0.0984, 0.0912, 0.1113, 0.1022, 0.0931, 0.1063, 0.0963,
         0.1067]], grad_fn=<DivBackward0>)

Tada! It works somehow. We have finally finished constructing a ResNet-50 architecture. We can call `eval` to print the structure of our model. Next, let us try training it on Fashion-MNIST.

In [18]:
model = ResNet50(num_classes=10, lr=0.01)
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128, resize=(96, 96))
trainer.fit(model, data)


RuntimeError: mat1 and mat2 shapes cannot be multiplied (259200x49 and 147x64)
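Reading the shapes in the error message gives a plausible diagnosis. Assuming our `Conv2D` flattens each receptive field into a row and multiplies it by a flattened kernel (an assumption, since its implementation lives in `models.models` and is not shown here), 147 = 3·7·7 matches `conv1` with `in_channels=3`, while 49 = 1·7·7 matches a single-channel Fashion-MNIST image. The 259200 rows would then be 128 images times 45·45 output positions. The sketch below checks that arithmetic and shows one possible workaround, repeating the grayscale channel three times. `conv_out_size` and `to_three_channels` are hypothetical helpers, not part of d2l or of our models.

```python
import torch

def conv_out_size(n, kernel_size, stride=1, padding=0):
    """Output side length of a convolution on an n x n input."""
    return (n + 2 * padding - kernel_size) // stride + 1

# conv1 in our model: 7x7 kernel, stride 2, no padding, on 96x96 images
side = conv_out_size(96, kernel_size=7, stride=2)
rows = 128 * side * side   # batch size times output positions
patch_gray = 1 * 7 * 7     # flattened patch of a 1-channel image
patch_rgb = 3 * 7 * 7      # flattened patch conv1 expects
print(side, rows, patch_gray, patch_rgb)

def to_three_channels(X):
    """Hypothetical helper: repeat a (N, 1, H, W) batch to (N, 3, H, W)."""
    if X.shape[1] == 1:
        return X.repeat(1, 3, 1, 1)
    return X

batch = torch.randn(2, 1, 96, 96)  # stand-in for a Fashion-MNIST minibatch
rgb_like = to_three_channels(batch)
print(rgb_like.shape)
```

An alternative under the same assumption would be to build `conv1` with `in_channels=1` when training on grayscale data. Either way, where exactly to apply the change (for example at the start of `forward`) is left open here.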

## References
<a name="ref1">[1]</a> K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778.

<a name="ref2">[2]</a> A. Zhang, Z. C. Lipton, M. Li, and A. J. Smola, *Dive into Deep Learning*. 2021. [Online]. Available: https://d2l.ai