# Residual Networks (ResNets)

<hr>

- Very deep neural networks are hard to train due to vanishing and exploding gradients. ResNet addresses this problem.
- Skip (or shortcut) connections allow activations from one layer to be fed directly to a later layer, bypassing one or more intermediate layers. This is the core idea behind ResNets.

<div style="text-align:center">
    <img src="media/skip_connection.jpg" width=500>
</div>

<br>

<div style="text-align:center">
    <img src="media/resnet_paths.png" width=800>
</div>

**Residual Block:**
- The fundamental unit of ResNet.
- In a traditional sequential network, an activation $a^{[l]}$ goes through transformations to become $a^{[l+2]}$.
- In a residual block, the original activation $a^{[l]}$ is added to $z^{[l+2]}$ before the final activation function is applied: $a^{[l+2]} = g(z^{[l+2]} + a^{[l]})$, where $g$ is the ReLU function.
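
The residual block formula can be sketched in a few lines of plain Python. This is a minimal illustration for fully connected layers, not a reference implementation: the helper names `relu`, `affine`, and `residual_block` and the example weights are assumptions made up for this sketch.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]

def affine(W, a, b):
    """Compute W a + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, a)) + b_i for row, b_i in zip(W, b)]

def residual_block(a_l, W1, b1, W2, b2):
    """Two-layer residual block: a[l+2] = g(z[l+2] + a[l])."""
    a_l1 = relu(affine(W1, a_l, b1))   # a[l+1] = g(z[l+1])
    z_l2 = affine(W2, a_l1, b2)        # z[l+2] = W[l+2] a[l+1] + b[l+2]
    # Skip connection: add a[l] before the final ReLU.
    return relu([z + a for z, a in zip(z_l2, a_l)])

# Hypothetical example values, chosen only to make the arithmetic easy to follow.
a_l = [1.0, 2.0]
W1 = [[1.0, 0.0], [0.0, 1.0]]
b1 = [0.0, 0.0]
W2 = [[0.5, 0.0], [0.0, -1.0]]
b2 = [0.0, 0.0]

out = residual_block(a_l, W1, b1, W2, b2)
print(out)  # [1.5, 0.0]
```

Here $z^{[l+2]} = (0.5, -2)$, so the sum with $a^{[l]} = (1, 2)$ is $(1.5, 0)$ and the final ReLU leaves it unchanged.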

**Network Architecture:**
- ResNets are composed of stacked residual blocks.
- Each block has two or more layers, with skip connections adding the input of the block to its output.

**Effectiveness in Training Deep Networks:**
- Skip connections make it possible to train very deep networks (over 100 layers) by mitigating the vanishing and exploding gradient problems.
- Empirically, ResNets continue to benefit from increased depth, showing decreased training error with more layers.
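
One way to see why the shortcut helps gradients is to differentiate a two-layer residual block with respect to its input. The following is a sketch in the notation used above; the symbol $u$ is introduced only for this illustration.

\begin{align}
u &= z^{[l+2]} + a^{[l]}, \qquad a^{[l+2]} = g(u) \\
\frac{\partial a^{[l+2]}}{\partial a^{[l]}} &= \operatorname{diag}\left( g'(u) \right) \left( \frac{\partial z^{[l+2]}}{\partial a^{[l]}} + I \right)
\end{align}

The identity term $I$ comes from the skip connection. Wherever the final ReLU is active, the gradient therefore has a path back to $a^{[l]}$ that does not pass through $W^{[l+1]}$ or $W^{[l+2]}$, even when the term through the weights is very small.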

**Comparison to Plain Networks:**
- In plain deep networks (without skip connections), increasing depth can lead to increased training error after a certain point.
- ResNets, however, demonstrate continued improvement in training performance with added depth.

<div style="text-align:center">
    <img src="media/plain_vs_resnet.png">
</div>

<br>

<div style="text-align:center">
    <img src="media/plain_vs_resnet1.png" width=600>
</div>

## Why do ResNets Work?

<hr>

<div style="text-align:center">
    <img src="media/resnet_identity.png" width=500>
</div>

Because $a^{[l]}$ is itself the output of a ReLU, we know that $a^{[l]} \geq 0$.

\begin{align}
a^{[l+2]} &= g\left( z^{[l+2]} + a^{[l]} \right) & (\text{by skip connection}) \\
&= g\left( W^{[l+2]} a^{[l+1]} + b^{[l+2]} + a^{[l]} \right) \\
&= g\left(a^{[l]} \right) & (\text{if} \ W^{[l+2]} = 0 \ \text{and} \ b^{[l+2]} = 0) \\
&= a^{[l]} & (\text{since} \ a^{[l]} \geq 0)
\end{align}

- One of the main reasons ResNets work well is their ability to easily learn the identity function. In a ResNet, when the optimal function for a layer is close to the identity function, the network can easily push the weights towards zero, effectively making the layer approximate an identity mapping.
- This means that adding extra layers doesn't necessarily hurt performance because these layers can effectively become "no-op" operations if needed, ensuring that the network's performance doesn't degrade.
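
The "no-op" behaviour can be checked numerically. The sketch below assumes fully connected layers and uses made-up helper names and weights. It sets the second layer's weights and bias to zero, stacks 50 such blocks, and confirms that a non-negative input passes through unchanged.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]

def affine(W, a, b):
    """Compute W a + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, a)) + b_i for row, b_i in zip(W, b)]

def residual_block(a_l, W1, b1, W2, b2):
    """Two-layer residual block: a[l+2] = g(z[l+2] + a[l])."""
    a_l1 = relu(affine(W1, a_l, b1))
    z_l2 = affine(W2, a_l1, b2)
    return relu([z + a for z, a in zip(z_l2, a_l)])

# Non-negative input, as it would be after an earlier ReLU.
a_l = [0.5, 0.0, 2.0]

# First layer: arbitrary illustrative weights (they do not matter here).
W1 = [[0.3, -1.2, 0.7], [1.1, 0.4, -0.6], [-0.9, 0.8, 0.2]]
b1 = [0.1, -0.2, 0.3]

# Second layer: weights and bias set to zero, so z[l+2] = 0.
W2 = [[0.0] * 3 for _ in range(3)]
b2 = [0.0] * 3

# Stack 50 identical blocks; each one reduces to the identity.
out = a_l
for _ in range(50):
    out = residual_block(out, W1, b1, W2, b2)

print(out == a_l)  # True
```

A plain two-layer block with the same zero weights would output $g(0) = 0$ and lose the input entirely, which is why the identity is much harder for a plain network to represent.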

## Spectrum of Depth
<hr>

<div style="text-align:center">
    <img src="media/depth_spectrum.png" width=700>
</div>

<br>

**Identity Block**
- Each conv is followed by batch normalization (`BN`) before the `ReLU`. The input and output dimensions are the same, so the shortcut can be added directly.
- The skip shown here spans 2 layers. A skip connection can also span $n$ layers, where $n > 2$.

<div style="text-align:center">
    <img src="media/identity_block.png" width=800>
</div>

**Convolution Block**
- The conv can be a $1 \times 1$ bottleneck convolution.

<div style="text-align:center">
    <img src="media/conv_block.png" width=800>
</div>
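
A $1 \times 1$ convolution applies the same channel-mixing matrix at every spatial position, so it can change the number of channels without touching height or width. The sketch below illustrates that idea on the shortcut path, under stated assumptions: the main path is reduced to a single $1 \times 1$ convolution as a stand-in for the real conv and `BN` layers, batch normalization is omitted, and all function names and kernel values are hypothetical.

```python
def conv1x1(x, kernel):
    """1x1 convolution. x is H x W x C_in (nested lists), kernel is C_out x C_in."""
    return [[[sum(k * v for k, v in zip(row, pixel)) for row in kernel]
             for pixel in line]
            for line in x]

def add_maps(x, y):
    """Element-wise sum of two feature maps with identical shapes."""
    return [[[a + b for a, b in zip(px, py)] for px, py in zip(lx, ly)]
            for lx, ly in zip(x, y)]

def relu_map(x):
    """Element-wise ReLU on an H x W x C feature map."""
    return [[[max(0.0, v) for v in pixel] for pixel in line] for line in x]

def conv_block(x, main_kernel, shortcut_kernel):
    main = conv1x1(x, main_kernel)          # stand-in for the main path
    shortcut = conv1x1(x, shortcut_kernel)  # resizes the skip path to C_out channels
    return relu_map(add_maps(main, shortcut))

# A 1 x 2 feature map with 2 channels, mapped to 3 channels.
x = [[[1.0, 2.0], [0.0, 1.0]]]
main_kernel = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
shortcut_kernel = [[1.0, -1.0], [0.0, 0.0], [-2.0, 0.0]]

out = conv_block(x, main_kernel, shortcut_kernel)
print(out)  # [[[0.0, 2.0, 1.0], [0.0, 1.0, 1.0]]]
```

With 2 input channels and 3 output channels, the input cannot be added to the main path directly. The shortcut convolution is what makes the two shapes agree before the sum.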