## Residual Neural Networks (ResNet)


When we stack more and more layers in a deep neural network, training becomes harder:

* **Vanishing/exploding gradients**: gradients shrink or grow as they backpropagate, making early layers learn very slowly or unstably.
* **Degradation problem**: simply adding more layers sometimes *reduces* training accuracy (not just test accuracy).

> The issue wasn’t overfitting — it was optimization difficulty.

---

#### Key idea: “Residual” learning

Instead of making a stack of layers learn a direct mapping $H(x)$ from input $x$ to output, ResNet makes the stack learn a **residual mapping**:

$
F(x) = H(x) - x \quad \Rightarrow \quad H(x) = F(x) + x
$

So the block learns only the *change* to apply to the input. The original input is added back at the end via a **skip connection**.
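
As a minimal sketch of this idea (the `TinyResidual` class and its zero initialization are illustrative assumptions, not part of ResNet itself): when the learned residual $F(x)$ is exactly zero, the block reduces to the identity mapping, so an extra block does not have to make the output worse.

```python
import torch
import torch.nn as nn

class TinyResidual(nn.Module):
    """Hypothetical toy block computing H(x) = F(x) + x with a linear F."""
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Linear(dim, dim)
        # Force F(x) = 0 so only the skip connection contributes.
        nn.init.zeros_(self.f.weight)
        nn.init.zeros_(self.f.bias)

    def forward(self, x):
        return self.f(x) + x  # residual + input

x = torch.randn(4, 8)
block = TinyResidual(8)
y = block(x)
print(torch.equal(y, x))  # True: with F(x) = 0 the block is the identity
```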

---

#### A residual block

**Standard block:**

```text
x ─┬─> [Conv → BN → ReLU → Conv → BN] ──> + ──> ReLU ──> output
   │                                      ^
   └──────────── skip connection ─────────┘
```

Mathematically:

$
\text{output} = \text{ReLU}(F(x; W) + x)
$

* $F(x; W)$: the output of the two convolutions (the “residual”)
* $x$: the identity/skip connection input

This allows gradients to flow directly through the skip connection during backpropagation, which stabilizes training even with very deep networks.

#### What does "skip connection" mean?

A **skip connection** (also called a **shortcut connection**) is literally what it sounds like:
a pathway that *skips over* one or more layers and feeds the input directly to a later point in the network.

---

**In a normal feed-forward block**

```
x ──> [Layer(s)] ──> output
```

All information flows through the layers.

---

**With a skip (shortcut) connection**

```
x ─┬─> [Layer(s)] ──> + ──> output
   │                  ^
   └──── (skip x) ────┘
```

You still process $x$ through some layers to get $F(x)$, **but at the same time you also send $x$ forward unchanged** and add it back in at the end.

Mathematically:
$
\text{output} = F(x) + x
$

* $F(x)$: what the block learned (residual)
* $x$: original input passed along the shortcut path

---

#### Two main skip-connection types

* **Identity skip** (when input/output have same shape): just add $x$ to $F(x)$.
* **Projection skip** (when shapes differ): apply a $1 \times 1$ convolution (and maybe stride) to $x$ before adding.
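
The projection case can be sketched as follows. The tensor sizes (64 → 128 channels, 56×56 → 28×28) and the layer choices are assumed purely for illustration; the point is only that the $1 \times 1$ convolution with a stride makes the shortcut match the shape of $F(x)$ so the two can be added.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)

# Residual branch F(x): doubles the channels and halves the resolution.
residual_branch = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1, bias=False),
    nn.ReLU(),
    nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=1, bias=False),
)

# Projection skip: 1x1 conv with the same stride, so the shapes line up.
projection = nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False)

f_x = residual_branch(x)
skip = projection(x)
out = f_x + skip  # an identity skip would fail here: x is (1, 64, 56, 56)

print(tuple(f_x.shape), tuple(skip.shape))  # both (1, 128, 28, 28)
```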

---


#### Does the network learn $F(x)$ or $H(x)$?


A plain stack of layers takes an input $x$ and tries to directly learn
$
H(x) \quad \text{(desired mapping)}
$

**What ResNet does instead**

ResNet rewrites the problem so that the stacked layers learn the **residual function**

$
F(x) = H(x) - x
$

and then adds the input back:

$
\text{Output} = F(x) + x = H(x)
$

---

**Residual block forward pass**

$
y = x + F(x;\,W)
$

where $F(x;W)$ is the function computed by the block's layers with parameters $W$ (two convs, etc.), and $x$ is the input to the block.

---

#### Gradient through a normal (plain) block

If you had

$
y = F(x;W)
$

then

$
\frac{\partial L}{\partial x}=\frac{\partial L}{\partial y}\frac{\partial y}{\partial x}=
\frac{\partial L}{\partial y}
\frac{\partial F(x;W)}{\partial x}
$


So all the gradient information must flow through the derivative of $F$. If $\partial F/\partial x$ is very small (vanishing gradient), the gradient almost disappears before it reaches earlier layers.

---

#### Gradient through a block with a skip connection

With
$
y = x + F(x;W)
$

we have

$
\frac{\partial L}{\partial x}=\frac{\partial L}{\partial y}
\frac{\partial (x + F(x;W))}{\partial x}
=\frac{\partial L}{\partial y}\,\left(I + \frac{\partial F}{\partial x}\right)
$

Notice the **identity term** $I$, which comes from $\partial x/\partial x = I$ (simply $1$ in the scalar case).

This means that even if $\partial F/\partial x$ is tiny, there is still a direct gradient path:

$
\frac{\partial L}{\partial x} \approx \frac{\partial L}{\partial y}
$

So the gradient flows directly back to the input $x$ (and thus to the layers before the block) without being multiplied by small numbers. That’s the “highway” for gradients people talk about.
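
The two gradient formulas can be checked numerically with autograd. This is a toy sketch under a strong assumption: each "layer" is the scalar function $F(h) = w\,h$ with a small fixed $w$, so $\partial F/\partial h = w$. The helper name `input_gradient` is made up for the example.

```python
import torch

def input_gradient(depth, weight, residual):
    """Return dL/dx for a chain of `depth` toy layers, with L = final output."""
    x = torch.tensor(1.0, dtype=torch.float64, requires_grad=True)
    h = x
    for _ in range(depth):
        f = weight * h                # toy F(h), so dF/dh = weight
        h = h + f if residual else f  # y = x + F(x)  vs.  y = F(x)
    h.backward()
    return x.grad.item()

plain = input_gradient(depth=20, weight=1e-2, residual=False)
res = input_gradient(depth=20, weight=1e-2, residual=True)

# Plain chain: product of 20 small factors, roughly 1e-40.
# Residual chain: product of 20 factors (1 + 0.01), roughly 1.22.
print(f"plain: {plain:.3e}  residual: {res:.4f}")
```

In the plain chain every layer multiplies the gradient by $w$, while in the residual chain every layer multiplies it by $1 + w$, so the signal survives all 20 layers.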

---



* The gradient can reach **the layer that produced $x$** directly via the identity/skip branch.
* It doesn’t have to be squeezed entirely through the more complicated $F(x)$ path.
* This keeps earlier layers trainable even in very deep networks.

---

### Simple PyTorch example of a residual block

```python
import torch
import torch.nn as nn

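class BasicBlock(nn.Module):
    # Output channels = out_channels * expansion (1 for basic blocks).
    # The ResNet class reads this attribute via `block.expansion`.
    expansion = 1

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()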
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)

        # projection if dimensions change
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1,
                          stride=stride, bias=False),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x):
        identity = self.shortcut(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += identity    # skip connection
        out = self.relu(out)
        return out
```

---

The blocks can then be stacked into a full ImageNet-style ResNet-18/34 (this continues the code above and reuses `torch`, `nn`, and `BasicBlock`):



```python
class ResNet(nn.Module):
    def __init__(self, block, layers, num_classes=1000, in_channels=3):
        """
        block   : block class (BasicBlock)
        layers  : list with number of blocks in each stage, e.g. [2,2,2,2] for ResNet-18
        """
        super().__init__()
        self.inplanes = 64

        # Stem
        self.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1   = nn.BatchNorm2d(64)
        self.relu  = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

        # Stages
        self.layer1 = self._make_layer(block,  64, layers[0], stride=1)  # /4 spatial
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2)  # /8
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2)  # /16
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2)  # /32

        # Head
        self.avgpool = nn.AdaptiveAvgPool2d(1)  # (N, C, 1, 1)
        self.fc = nn.Linear(512 * block.expansion, num_classes)

        # Kaiming init (common for ResNets)
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

    def _make_layer(self, block, planes, blocks, stride):
        """
        planes : out_channels of this stage
        blocks : how many blocks in this stage
        stride : stride of the first block (downsample if 2)
        """
        layers = []
        layers.append(block(self.inplanes, planes, stride=stride))
        self.inplanes = planes * block.expansion
        for _ in range(1, blocks):
            layers.append(block(self.inplanes, planes, stride=1))
        return nn.Sequential(*layers)

    def forward(self, x):
        # Stem
        x = self.relu(self.bn1(self.conv1(x)))
        x = self.maxpool(x)
        # Stages
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        # Head
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.fc(x)
        return x


# -------- Factory helpers (ImageNet-style) --------
def resnet18(num_classes=1000, in_channels=3):
    return ResNet(BasicBlock, [2, 2, 2, 2], num_classes=num_classes, in_channels=in_channels)

def resnet34(num_classes=1000, in_channels=3):
    return ResNet(BasicBlock, [3, 4, 6, 3], num_classes=num_classes, in_channels=in_channels)


# -------- Quick smoke test --------
if __name__ == "__main__":
    model = resnet18(num_classes=10)  # e.g., 10 classes
    x = torch.randn(2, 3, 224, 224)   # batch of 2 images
    logits = model(x)
    print(logits.shape)  # -> torch.Size([2, 10])
```    


#### Architecture overview

* **ResNet-18/34**: Basic blocks (two 3×3 conv layers per block).
* **ResNet-50/101/152**: Bottleneck blocks (1×1 → 3×3 → 1×1 conv) to reduce computation.

A typical ResNet-50 looks like:

```
Conv7x7 → MaxPool → [3 residual blocks] →
[4 residual blocks] → [6 residual blocks] →
[3 residual blocks] → GlobalAvgPool → FC
```
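
The numbers in the model names follow from counting weighted layers: one stem convolution, the convolutions inside the residual blocks, and the final fully connected layer. A small sketch of that arithmetic (the helper name `resnet_depth` is made up here, and the $1 \times 1$ projection shortcuts are not counted):

```python
def resnet_depth(blocks_per_stage, convs_per_block):
    """Count weighted layers: stem conv + convs in all blocks + final FC."""
    stem_conv = 1
    fc = 1
    return stem_conv + convs_per_block * sum(blocks_per_stage) + fc

print(resnet_depth([2, 2, 2, 2], convs_per_block=2))   # 18  (basic blocks)
print(resnet_depth([3, 4, 6, 3], convs_per_block=2))   # 34  (basic blocks)
print(resnet_depth([3, 4, 6, 3], convs_per_block=3))   # 50  (bottleneck blocks)
print(resnet_depth([3, 4, 23, 3], convs_per_block=3))  # 101 (bottleneck blocks)
```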

To experiment with a pretrained model, load ResNet-18 from `torchvision`:

```python
import torch
import torchvision
import torchviz

# ResNet18_Weights.DEFAULT is equivalent to ResNet18_Weights.IMAGENET1K_V1.
resnet18 = torchvision.models.resnet18(
    weights=torchvision.models.ResNet18_Weights.DEFAULT, progress=True
)
```

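## How to find the classifier's input and output size

The final fully connected layer `fc` reports how many features it consumes and how many classes it predicts (this is the size of the classifier head, not of the input image):

```python
print("resnet18 fc input features: ", resnet18.fc.in_features)
print("resnet18 fc output features: ", resnet18.fc.out_features)
```

```text
resnet18 fc input features:  512
resnet18 fc output features:  1000
```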


```python
# resnet18 ends with an adaptive average-pooling layer, so the input resolution
# is flexible as long as the feature maps stay large enough for every kernel.
dummy_input = torch.randn(size=[1, 3, 128, 128])

resnet18_graph = torchviz.make_dot(resnet18(dummy_input), dict(resnet18.named_parameters()))
resnet18_graph.format = 'svg'
resnet18_graph.save('images/resnet18_graph')
resnet18_graph.render()
```

```text
'images/resnet18_graph.svg'
```

![](images/resnet18_graph.svg)
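
Why the input resolution is flexible can be seen on the pooling layer alone. The sketch below assumes the overall downsampling factor of 32 used by ImageNet-style ResNets, so inputs of 128, 224 and 320 pixels give final feature maps of side 4, 7 and 10; random tensors stand in for those feature maps.

```python
import torch
import torch.nn as nn

pool = nn.AdaptiveAvgPool2d(1)  # always pools down to 1x1, whatever the input size

shapes = []
for side in (4, 7, 10):  # final feature-map side for 128, 224, 320 pixel inputs
    feature_map = torch.randn(1, 512, side, side)
    pooled = torch.flatten(pool(feature_map), 1)
    shapes.append(tuple(pooled.shape))

print(shapes)  # [(1, 512), (1, 512), (1, 512)]
```

Since the pooled vector always has 512 entries, the `fc` layer sees the same input size regardless of image resolution.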

## Finetune the model on a new dataset with 10 labels


Let’s say we want to fine-tune the model on a new dataset with `10` labels. In ResNet, the classifier is the last linear layer, `model.fc` (here `resnet18.fc`). We can simply replace it with a new linear layer (unfrozen by default) that acts as our classifier.


```python
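# Freeze every pretrained parameter.
for params in resnet18.parameters():
    params.requires_grad = False

# The new head is created after freezing, so it is trainable by default.
resnet18.fc = torch.nn.Linear(512, 10)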

```

Now all parameters in the model, except those of `model.fc`, are frozen. The only parameters that receive gradients are the `weight` and `bias` of `model.fc`.

```python
optimizer=torch.optim.SGD(resnet18.fc.parameters(),lr=1e-2,momentum=0.9)
```

## Grayscale (Black-and-White) Image Input

The pretrained ResNet-18 expects 3-channel RGB at `conv1`. With a single-channel (monochrome) input you have a few good options:

### 1) Duplicate the channel (fastest, no weight surgery)

Preprocess your 1-channel image to 3 channels by repeating it. `transforms.Grayscale` does this depending on `num_output_channels`:
- If `num_output_channels == 1`: the returned image is single-channel
- If `num_output_channels == 3`: the returned image has 3 channels with r == g == b

```python
# during transforms
from torchvision import transforms

tfm = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),  # 1→3 by duplication
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485,0.456,0.406), std=(0.229,0.224,0.225)),  # ImageNet stats
])
```

* Pros: zero code change to the model; you keep pretrained weights intact.
* Cons: a tiny bit redundant, but works very well in practice.

### 2) Replace `conv1` with 1 input channel and **port** pretrained weights

Average the RGB kernels into a single channel:

```python
import torch
import torchvision.models as models
import torch.nn as nn

m = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)  # pass the enum member, not a quoted string
w = m.conv1.weight  # [64, 3, 7, 7]

m.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)

with torch.no_grad():
    m.conv1.weight[:] = w.mean(dim=1, keepdim=True)  # [64,1,7,7]
```

(You can also use a weighted sum like `0.2989 R + 0.5870 G + 0.1140 B` instead of `.mean`.)

* Pros: no wasted computation; uses pretrained filters sensibly.
* Cons: a tiny bit of code; but this is the cleanest if you’re truly single-channel end-to-end.
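
How the reduction over the RGB kernels relates to Option 1 can be checked numerically. The sketch below uses a randomly initialized convolution as a stand-in for the pretrained `conv1` (to avoid a download) and a made-up helper `port_to_gray`. It assumes the three duplicated input channels are identical, which no longer holds exactly once per-channel ImageNet normalization is applied.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for a pretrained 3-channel conv1 (random weights, same geometry).
rgb_conv = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)

def port_to_gray(conv, reduce):
    """Build a 1-channel conv whose kernel is the sum or mean over the RGB axis."""
    gray = nn.Conv2d(1, conv.out_channels, kernel_size=7, stride=2, padding=3, bias=False)
    with torch.no_grad():
        w = conv.weight  # [64, 3, 7, 7]
        reduced = w.sum(dim=1, keepdim=True) if reduce == "sum" else w.mean(dim=1, keepdim=True)
        gray.weight.copy_(reduced)  # [64, 1, 7, 7]
    return gray

x = torch.randn(2, 1, 32, 32)
with torch.no_grad():
    ref = rgb_conv(x.repeat(1, 3, 1, 1))        # Option 1: duplicated channels
    out_sum = port_to_gray(rgb_conv, "sum")(x)
    out_mean = port_to_gray(rgb_conv, "mean")(x)

print(torch.allclose(out_sum, ref, atol=1e-5))       # True: summing reproduces Option 1
print(torch.allclose(out_mean * 3, ref, atol=1e-5))  # True: the mean is the same response / 3
```

Averaging therefore keeps the filter shapes but scales the first-layer activations by one third compared with channel duplication; summing keeps the scale. Either works as a starting point for fine-tuning.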

### 3) Add a learnable 1→3 adapter in front

Keep the pretrained model intact and learn a shallow mapping:

```python
class GrayToRGB(nn.Module):
    def __init__(self):
        super().__init__()
        self.map = nn.Conv2d(1, 3, kernel_size=1, bias=False)
        nn.init.constant_(self.map.weight, 1.0)  # start as an exact channel “repeat” (1/3 would scale the input down)

    def forward(self, x):
        return self.map(x)

adapter = GrayToRGB()
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
full = nn.Sequential(adapter, model)
```

* Pros: lets the network learn the best 1→3 projection.
* Cons: a few extra params; slightly more moving parts.
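
The effect of the adapter's initial weights can be verified directly. Because the $1 \times 1$ convolution has a single input channel, each output channel is just the input multiplied by one scalar, so a constant of `1.0` reproduces channel repetition exactly, while `1/3` yields the input scaled by one third in each channel. A minimal check with a standalone adapter:

```python
import torch
import torch.nn as nn

adapter = nn.Conv2d(1, 3, kernel_size=1, bias=False)
x = torch.randn(2, 1, 8, 8)
repeated = x.repeat(1, 3, 1, 1)  # plain channel duplication

with torch.no_grad():
    nn.init.constant_(adapter.weight, 1.0)
    out_ones = adapter(x)
    nn.init.constant_(adapter.weight, 1 / 3)
    out_third = adapter(x)

print(torch.allclose(out_ones, repeated))       # True: weight 1.0 is an exact repeat
print(torch.allclose(out_third, repeated / 3))  # True: weight 1/3 scales the input down
```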

---

### Normalization notes

* If you **duplicate to 3-ch**, you can keep ImageNet mean/std as above, or compute dataset-specific stats and use those for all 3 channels (same numbers repeated).
* If you **switch to 1-ch conv1**, use single mean/std (e.g., `(mean,)` and `(std,)`) matching your grayscale dataset.
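
Dataset-specific statistics can be computed along these lines. The random tensor is only a stand-in for your grayscale images scaled to `[0, 1]`; for a real dataset you would accumulate the statistics over a `DataLoader` instead.

```python
import torch

torch.manual_seed(0)
images = torch.rand(64, 1, 32, 32)  # stand-in for a grayscale dataset in [0, 1]

# Single-channel statistics over all pixels of all images.
mean = images.mean().item()
std = images.std().item()

# 1-channel conv1: use (mean,) and (std,), e.g. in transforms.Normalize.
normalized = (images - mean) / std

# Duplicated 3-channel input: repeat the same numbers for every channel.
mean_3ch, std_3ch = (mean,) * 3, (std,) * 3

print(round(normalized.mean().item(), 3), round(normalized.std().item(), 3))
```

After normalization the data has roughly zero mean and unit standard deviation.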

### Which should you pick?

* **Quick wins / transfer learning**: Option **1** (duplicate) is perfectly fine and very common.
* **Purist, minimal compute**: Option **2** (port weights) is elegant, avoids redundant input channels, and typically performs well.
* **Data shift concerns** (e.g., MRI/CT with unusual intensity): Option **3** gives flexibility; also consider dataset-specific normalization and fine-tuning early layers.

### How to train the Option 3 network (learnable 1→3 adapter in front)

With option 3 (a learnable 1→3 “adapter” in front of a pretrained ResNet-18), you’ve got three common training strategies. Pick one based on data size and how different your grayscale data is from ImageNet.

### A. Freeze backbone first, train adapter + head (safe start)

1–3 epochs:

* Freeze **all** ResNet18 params.
* Train only the `GrayToRGB` adapter and the final `fc`.

Then unfreeze the backbone (optionally with a lower LR) and fine-tune.

```python
import torch
import torch.nn as nn
import torchvision.models as models

n_classes = 10  # placeholder: set to the number of labels in your dataset

class GrayToRGB(nn.Module):
    def __init__(self):
        super().__init__()
        self.map = nn.Conv2d(1, 3, kernel_size=1, bias=False)
        nn.init.constant_(self.map.weight, 1.0)  # start as an exact channel "repeat"

    def forward(self, x):
        return self.map(x)

backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
backbone.fc = nn.Linear(backbone.fc.in_features, n_classes)  # replace head

model = nn.Sequential(GrayToRGB(), backbone)

# --- Phase 1: freeze backbone, but keep the new head trainable ---
for p in backbone.parameters():
    p.requires_grad = False
for p in backbone.fc.parameters():
    p.requires_grad = True  # backbone.parameters() includes fc, so re-enable it

# optimize only adapter + fc
params = list(model[0].parameters()) + list(backbone.fc.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3, weight_decay=1e-4)
```

After a few epochs:

```python
# --- Phase 2: unfreeze backbone with smaller LR ---
for p in backbone.parameters():
    p.requires_grad = True

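optimizer = torch.optim.Adam([
    {"params": model[0].parameters(), "lr": 1e-3},          # adapter
    # The stem (conv1 + bn1) is unfrozen too; without its own group it would never be updated.
    # Its LR is an assumption: the same value as layer1.
    {"params": list(backbone.conv1.parameters())
               + list(backbone.bn1.parameters()), "lr": 5e-4},
    {"params": backbone.layer1.parameters(), "lr": 5e-4},
    {"params": backbone.layer2.parameters(), "lr": 2.5e-4},
    {"params": backbone.layer3.parameters(), "lr": 1.25e-4},
    {"params": backbone.layer4.parameters(), "lr": 1.25e-4},
    {"params": backbone.fc.parameters(),     "lr": 1e-3},
], weight_decay=1e-4)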
```

### B. Train everything, but with **discriminative learning rates** (faster)

Good when you have a moderate dataset and want quick convergence without a freezing phase.

```python
optimizer = torch.optim.AdamW([
    {"params": model[0].parameters(),           "lr": 1e-3},  # adapter highest
    # Include the stem (conv1 + bn1) so it is trained as well; its LR is an assumption.
    {"params": list(backbone.conv1.parameters())
               + list(backbone.bn1.parameters()), "lr": 5e-4},
    {"params": backbone.layer1.parameters(),    "lr": 5e-4},
    {"params": backbone.layer2.parameters(),    "lr": 3e-4},
    {"params": backbone.layer3.parameters(),    "lr": 2e-4},
    {"params": backbone.layer4.parameters(),    "lr": 2e-4},
    {"params": backbone.fc.parameters(),        "lr": 1e-3},  # head high
], weight_decay=1e-4)
```

### C. Unfreeze progressively (“gradual unfreezing”)

Start with only adapter+fc, then unfreeze layers one block at a time every few epochs (layer4 → layer3 → …). This is handy with small datasets.

---

### Do we freeze the adapter conv?

* **Usually not.** Let it learn a smart projection beyond simple channel copy.
* If the dataset is tiny and unstable, you *can* freeze it for the first few hundred steps.

### BatchNorm tips

* If batch size is small (≤16), consider putting the backbone’s BN layers in **eval** mode during early training:

  ```python
  def set_bn_eval(m):
      if isinstance(m, nn.BatchNorm2d):
          m.eval()
  backbone.apply(set_bn_eval)
  ```

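  (Params can still be trainable; this just freezes the running stats. Note that calling `model.train()` puts BN layers back into training mode, so re-apply `set_bn_eval` after every `model.train()` call.)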
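
One caveat with this tip is that `train()` switches every submodule, including BatchNorm, back to training mode. The sketch below demonstrates this on a tiny stand-in network and shows one possible workaround: a wrapper (the hypothetical `FrozenBNWrapper`) that re-applies `set_bn_eval` whenever `train()` is called.

```python
import torch.nn as nn

def set_bn_eval(m):
    if isinstance(m, nn.BatchNorm2d):
        m.eval()

class FrozenBNWrapper(nn.Module):
    """Hypothetical helper: keeps BatchNorm2d layers in eval mode across train() calls."""
    def __init__(self, net):
        super().__init__()
        self.net = net

    def train(self, mode=True):
        super().train(mode)          # switches all submodules, BN included
        self.net.apply(set_bn_eval)  # then put BN back into eval mode
        return self

    def forward(self, x):
        return self.net(x)

net = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())

net.apply(set_bn_eval)
after_apply = net[1].training          # False: BN uses its running stats

net.train()
after_train = net[1].training          # True: train() undid set_bn_eval

wrapped = FrozenBNWrapper(net)
wrapped.train()
after_wrapped_train = net[1].training  # False: the wrapper re-applied it

print(after_apply, after_train, after_wrapped_train)  # False True False
```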

### Weight decay hygiene (optional but nice)

Avoid weight decay on BN and bias:

```python
decay, no_decay = [], []
for n, p in model.named_parameters():
    if not p.requires_grad: continue
    if p.ndim <= 1:  # biases and BN weights/biases are 1-D; name checks miss e.g. the BN in `downsample`
        no_decay.append(p)
    else:
        decay.append(p)
optimizer = torch.optim.AdamW([
    {"params": decay, "weight_decay": 1e-4, "lr": 3e-4},
    {"params": no_decay, "weight_decay": 0.0, "lr": 3e-4},
])
```

### Schedulers (keep it simple)

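* **Cosine annealing** (optionally preceded by a warmup phase) or **OneCycleLR** both work well:

```python
# T_max counts scheduler.step() calls (i.e. epochs, if you step once per epoch).
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
# or (stepped once per batch; `loader` is your DataLoader and `E` the number of epochs):
# scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=1e-3, steps_per_epoch=len(loader), epochs=E)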
```
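
If you want an explicit warmup before the cosine decay, one option is a custom multiplier passed to `LambdaLR`. This is a sketch with assumed values (5 warmup steps, 50 steps in total, base LR `1e-3`); the function `warmup_cosine` and the dummy parameter are made up for the example.

```python
import math
import torch

def warmup_cosine(step, warmup_steps, total_steps):
    """LR multiplier: linear warmup up to 1.0, then cosine decay towards 0."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

params = [torch.nn.Parameter(torch.zeros(1))]  # dummy parameter for the demo
optimizer = torch.optim.SGD(params, lr=1e-3)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: warmup_cosine(step, warmup_steps=5, total_steps=50)
)

lrs = []
for _ in range(50):
    lrs.append(optimizer.param_groups[0]["lr"])
    optimizer.step()   # the training step would go here
    scheduler.step()

print(lrs[0], max(lrs), lrs[-1])  # starts low, peaks at the base LR, ends near 0
```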

### Quick guidance

* **Small dataset / big domain shift (e.g., MRI):** A or C.
* **Moderate dataset / some domain shift:** B with discriminative LRs.
* **Plenty of data:** Train all, normal LRs, standard fine-tune.

