# ResNet

Original Paper: Deep Residual Learning for Image Recognition

## Problem Statement
Before introducing ResNet, we want to understand the problem it intends to solve. Let's consider two networks: a shallow one and a deeper counterpart that stacks more layers on top of it. We should expect the deeper one to perform at least as well, because in the worst case its extra layers could all be identity mappings, making its output equivalent to that of the shallow network.

<div style = "text-align:center;">
<img src="resnet1.png" width="500" height="200">
</div>

The figure shows that the intuitive conclusion "more layers = better performance" typically does not hold in practice (the paper calls this the degradation problem). The problem then becomes how to make it easy for the network to learn identity mappings, which is hard for traditional networks.

## Approach
ResNet introduces essentially two innovations.

---

Instead of learning the direct transformation, ResNet learns the residual between the output and the input, that is, $\mathcal{F}: x_{l-1} \to x_l-x_{l-1}$, where $\mathcal{F}$ is the function the block tries to fit, $x_l$ is the block's output, and $x_{l-1}$ is its input.

For a residual block, the output $x_l$ is defined as: 
\begin{equation} x_l = x_{l-1} + \mathcal{F}(x_{l-1}) \end{equation}
<div style = "text-align:center;">
<img src="resnet2.png" width="250" height="200">
</div>

The added $x_{l-1}$ comes through a skip connection, as shown in the figure. It is added at the end of the block, so the output is still $x_l$, but what the block has to learn during training changes. If $x_{l-1}$ and the residual $\mathcal{F}(x_{l-1})$ have different spatial dimensions or channel counts, the skip connection goes through a $1\times 1$ convolution that projects $x_{l-1}$ to the correct shape.
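The block described above can be sketched in PyTorch as follows. This is a minimal illustration, not the reference implementation: the class name `ResidualBlock`, the choice of two $3\times 3$ convolutions with batch normalization for $\mathcal{F}$, and the `stride` parameter are assumptions made for the example. The ReLU that the paper applies after the addition is left out so that the code matches the equation above.

```python
import torch
from torch import nn


class ResidualBlock(nn.Module):
    """Hypothetical basic block computing x_l = x_{l-1} + F(x_{l-1})."""

    def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
        super().__init__()
        # Residual branch F: two 3x3 convolutions (an assumed layout).
        self.residual = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        # Skip connection: identity when shapes match, 1x1 projection otherwise.
        if stride != 1 or in_channels != out_channels:
            self.skip = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
        else:
            self.skip = nn.Identity()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.skip(x) + self.residual(x)


# The channel count doubles and the resolution halves, so the projection is used.
block = ResidualBlock(64, 128, stride=2)
out = block(torch.randn(1, 64, 56, 56))
print(out.shape)
```

With matching channel counts and `stride=1`, the skip path is a plain `nn.Identity`, which adds no parameters.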

## Why

### Easier To Fit Identity Function
The essential problem is that traditional networks struggle to fit the identity function. Fitting a block of $n$ layers to the identity function would require:
\begin{equation}
 \sigma(W_n\sigma(W_{n-1}\cdots\sigma(W_1x))) = x
\end{equation}
This is very hard because of the non-linear activation functions. Take ReLU, $\sigma(z) = \max(0,z)$, as an example: it zeroes every negative value. Fitting the block to the identity function would therefore require it to accurately recreate any negative component of $x$ out of no information, since that value has already been set to zero by ReLU.

<div style = "text-align:center;">
<img src="resnet3.png" width="500" height="200">
</div>
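As a small illustration of this information loss, the sketch below applies a single plain layer $\sigma(Wx)$ with $W$ set to the identity matrix. The input values are arbitrary and chosen only for the example.

```python
import torch

x = torch.tensor([-2.0, -1.0, 0.5, 3.0])

# A single "plain" layer sigma(W x), with W chosen as the identity matrix.
W = torch.eye(4)
y = torch.relu(W @ x)

print(x)
print(y)  # the two negative entries of x come out as 0
```

Even with the most favorable weights, the layer's output differs from its input wherever the input is negative.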

For a residual block to fit the identity function, it would require:
\begin{equation}
\mathcal{F}(x) + x = x
\end{equation}
This simplifies to $\mathcal{F}(x) = 0$, which is far easier to satisfy than the requirement for a plain block, as it can be met simply by setting all weights to $0$.
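A quick check of this claim, as a sketch: the residual branch below (two linear layers with a ReLU in between) is a hypothetical stand-in for $\mathcal{F}$, and all of its parameters are set to zero by hand.

```python
import torch
from torch import nn

# Hypothetical residual branch F: two linear layers with a ReLU in between.
residual = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 4))

# Set every weight and bias of F to zero, so F(x) = 0 for all x.
with torch.no_grad():
    for p in residual.parameters():
        p.zero_()

x = torch.tensor([[-2.0, -1.0, 0.5, 3.0]])
out = x + residual(x)  # x_l = x_{l-1} + F(x_{l-1})

print(torch.equal(out, x))  # True: the block is exactly the identity
```

The negative entries of `x` survive because they travel through the skip connection and never pass through the ReLU.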

### Prevent Gradient Vanishing
During backpropagation, for a loss $\mathcal{L}$, the gradient of $\mathcal{L}$ with respect to $x_{l-1}$ is:
\begin{equation} \frac{\partial \mathcal{L}}{\partial x_{l-1}} = \frac{\partial \mathcal{L}}{\partial x_l} \cdot (1 + \frac{\partial \mathcal{F}}{\partial x_{l-1}}) \end{equation}

This follows from the chain rule: since $x_l = x_{l-1} + \mathcal{F}(x_{l-1})$, we have $\frac{\partial \mathcal{L}}{\partial x_{l-1}} = \frac{\partial \mathcal{L}}{\partial x_l} \cdot \frac{\partial}{\partial x_{l-1}}\left(x_{l-1} + \mathcal{F}(x_{l-1})\right)$.

The "$+1$" in the equation ensures that the gradient can flow directly from $x_l$ to $x_{l-1}$. In a traditional network, where $x_l = f(x_{l-1})$, the gradient is instead:
\begin{equation}
\frac{\partial \mathcal{L}}{\partial x_{l-1}} = \frac{\partial \mathcal{L}}{\partial x_l} \cdot \frac{\partial f}{\partial x_{l-1}}
\end{equation}

Here the chain rule applies directly, since $x_l = f(x_{l-1})$. The problem is that $\frac{\partial f}{\partial x_{l-1}}$ can itself be decomposed into a product of per-layer gradients, so it approaches $0$ exponentially with depth if every factor has magnitude less than $1$. ResNet's "$+1$" ensures that even if $\frac{\partial \mathcal{F}}{\partial x_{l-1}}$ vanishes, $\frac{\partial \mathcal{L}}{\partial x_l}$ is still passed back unchanged.
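The effect can be checked numerically with autograd. The sketch below is a toy setup, not a real network: each layer is assumed to be the scalar map $\mathcal{F}(x) = 0.01x$, so every per-layer gradient is $0.01$, and the loss is simply the final output. The names `DEPTH`, `SCALE`, and `input_gradient` are made up for the example.

```python
import torch

DEPTH = 20
SCALE = 0.01  # hypothetical layer F(x) = SCALE * x, so dF/dx = 0.01


def input_gradient(residual: bool) -> float:
    """Return dL/dx_0 for L = x_DEPTH after stacking DEPTH toy layers."""
    x0 = torch.tensor(1.0, dtype=torch.float64, requires_grad=True)
    x = x0
    for _ in range(DEPTH):
        fx = SCALE * x
        # Residual layer: x + F(x). Plain layer: F(x) only.
        x = x + fx if residual else fx
    x.backward()
    return x0.grad.item()


plain_grad = input_gradient(residual=False)    # product of 20 factors of 0.01
residual_grad = input_gradient(residual=True)  # product of 20 factors of 1.01

print(f"plain:    {plain_grad:.3e}")
print(f"residual: {residual_grad:.3f}")
```

In the plain stack the gradient is $0.01^{20}$, which is effectively zero, while in the residual stack each factor is $1 + 0.01$ and the gradient stays close to $1$.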




```python
import torch


def get_resnet(depth, pretrained=True):
    """
    Load a ResNet model of the specified depth using torch.hub.

    Args:
        depth (int): ResNet depth (18, 34, 50, 101, 152)
        pretrained (bool): Whether to load pre-trained weights

    Returns:
        nn.Module: ResNet model in evaluation mode
    """
    # PyTorch's vision repo supports these ResNet variants
    supported_depths = {18, 34, 50, 101, 152}
    if depth not in supported_depths:
        raise ValueError(
            f"Unsupported ResNet depth: {depth}. "
            f"Supported depths: {sorted(supported_depths)}"
        )

    model = torch.hub.load(
        "pytorch/vision:v0.10.0",  # Repo and version
        f"resnet{depth}",          # Model name (e.g., "resnet18")
        pretrained=pretrained,
    )
    model.eval()  # Set to evaluation mode
    return model


# Example usage
if __name__ == "__main__":
    # Load ResNet-18
    resnet18 = get_resnet(18, pretrained=True)
    print("ResNet-18 loaded.")

    # Load ResNet-50
    resnet50 = get_resnet(50, pretrained=True)
    print("ResNet-50 loaded.")

    # Load ResNet-152
    resnet152 = get_resnet(152, pretrained=True)
    print("ResNet-152 loaded.")

    # Test with a sample input
    x = torch.randn(1, 3, 224, 224)  # Batch of 1, 3-channel, 224x224 image
    with torch.no_grad():  # Disable gradient computation for inference
        output = resnet18(x)
        print(f"ResNet-18 output shape: {output.shape}")  # Expected: torch.Size([1, 1000])
```

