### ✅ Shortcut Connections in Transformers

Deep neural networks often suffer from the **vanishing gradient problem**: as the gradient of the loss is backpropagated through many layers, it can shrink toward zero before it reaches the parameters of the early layers.  
The weight updates for those layers then become negligible, so learning slows down or stalls.

---

### ✅ How Skip Connections Help

Transformers use **residual (skip) connections** of the form:

$$
x_2 = x_1 + \text{layer}_1(x_1)
$$

This simple addition dramatically improves gradient flow.
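
As a minimal sketch of the formula above, the addition can be wrapped in a small module. The `Residual` class name and the choice of `nn.Linear(4, 4)` as the wrapped layer are illustrative assumptions, not part of the original notebook.

```python
import torch
import torch.nn as nn


class Residual(nn.Module):
    """Hypothetical wrapper that computes x_2 = x_1 + layer_1(x_1)."""

    def __init__(self, layer):
        super().__init__()
        self.layer = layer

    def forward(self, x):
        # The wrapped layer must return the same shape as x,
        # otherwise the element-wise addition is not defined.
        return x + self.layer(x)


torch.manual_seed(0)
block = Residual(nn.Linear(4, 4))  # assumed layer, chosen so shapes match
x1 = torch.randn(2, 4)
x2 = block(x1)
print(x2.shape)
```

The only requirement is that the wrapped layer preserves the input shape, which is why the network later in this notebook checks shapes before adding.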

---

### ✅ Gradient Flow Through a Skip Connection

We want the gradient of the loss with respect to the input \( x_1 \):

$$
\frac{\partial L}{\partial x_1}
  = \frac{\partial L}{\partial x_2} \cdot \frac{\partial x_2}{\partial x_1}
$$

Given:

$$
x_2 = x_1 + \text{layer}_1(x_1)
$$

Differentiate:

$$
\frac{\partial x_2}{\partial x_1}
= 1 + \frac{\partial\, \text{layer}_1(x_1)}{\partial x_1}
$$

Thus:

$$
\frac{\partial L}{\partial x_1}
= \frac{\partial L}{\partial x_2} \cdot 
\left( 1 + \frac{\partial\, \text{layer}_1(x_1)}{\partial x_1} \right)
$$
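
The result above can be checked numerically with autograd. The sketch below assumes a linear `layer_1` (so its Jacobian is simply its weight matrix) and a toy squared-sum loss; both are illustrative choices. For vector inputs, the factor \( 1 + \partial\,\text{layer}_1 / \partial x_1 \) becomes the matrix \( I + W \), and the chain rule multiplies the upstream gradient by its transpose.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer_1 = nn.Linear(3, 3)                  # assumed stand-in for layer_1
x_1 = torch.randn(3, requires_grad=True)

x_2 = x_1 + layer_1(x_1)                   # skip connection
x_2.retain_grad()                          # keep dL/dx_2 on this non-leaf tensor
loss = (x_2 ** 2).sum()                    # toy loss, for illustration only
loss.backward()

# Manual chain rule: dL/dx_1 = (I + W)^T @ dL/dx_2
jacobian = torch.eye(3) + layer_1.weight.detach()
manual_grad = jacobian.T @ x_2.grad

print(torch.allclose(x_1.grad, manual_grad, atol=1e-5))
```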

---

### ✅ Why This Helps

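- Even if the layer's own derivative vanishes,
  $$
  \frac{\partial\, \text{layer}_1(x_1)}{\partial x_1}
  \approx 0
  $$
  (vanishing gradient inside the layer),

- the **“1” term remains**, so \( \frac{\partial L}{\partial x_1} \approx \frac{\partial L}{\partial x_2} \) and the upstream gradient passes through almost unchanged.

The gradient therefore always has the identity path to flow through, which makes it far less likely to vanish as it travels back through many layers.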

This is why residual connections are essential in **Transformers, ResNets, LLMs**, and other deep architectures.



In [1]:
import torch 
import torch.nn as nn

class GeLU(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) * 
            (x + 0.044715 * torch.pow(x, 3))
        ))
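
As a quick sanity check, the tanh approximation used in `GeLU` can be compared against PyTorch's built-in version. This sketch rewrites the same formula as a plain function (`gelu_tanh` is a hypothetical name) and assumes a PyTorch release that supports the `approximate="tanh"` argument of `torch.nn.functional.gelu` (1.12 or newer).

```python
import math

import torch
import torch.nn.functional as F


def gelu_tanh(x):
    # Same tanh approximation as the GeLU module, written as a plain function
    return 0.5 * x * (1 + torch.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    ))


x = torch.linspace(-3, 3, steps=7)
# Assumes F.gelu accepts approximate="tanh" (PyTorch >= 1.12)
matches_builtin = torch.allclose(
    gelu_tanh(x), F.gelu(x, approximate="tanh"), atol=1e-6
)
print(matches_builtin)
```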

In [8]:
# Let's build a deep neural network

class DeepNeuralNet(nn.Module):
    def __init__(self, use_skip_connection, layer_size):
        super().__init__()
        self.use_skip_connection = use_skip_connection
        # Each block is a Linear layer followed by a GeLU activation.
        # GeLU() must be a separate module inside nn.Sequential; passing it
        # to nn.Linear would set the `bias` argument instead.
        self.layers = nn.ModuleList([
            nn.Sequential(nn.Linear(layer_size[0], layer_size[1]), GeLU()),
            nn.Sequential(nn.Linear(layer_size[1], layer_size[2]), GeLU()),
            nn.Sequential(nn.Linear(layer_size[2], layer_size[3]), GeLU()),
            nn.Sequential(nn.Linear(layer_size[3], layer_size[4]), GeLU()),
            nn.Sequential(nn.Linear(layer_size[4], layer_size[5]), GeLU())
        ])

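    def forward(self, x):
        for layer in self.layers:
            layer_output = layer(x)
            # Add the shortcut only when input and output shapes match;
            # otherwise the element-wise addition is not defined.
            if self.use_skip_connection and layer_output.shape == x.shape:
                x = x + layer_output
            else:
                x = layer_output
        return x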

In [11]:
layer_sizes = [25, 25, 43, 43, 54, 54]

input1 = torch.rand(2, 3, 25)
neural_net = DeepNeuralNet(True, layer_sizes)


In [12]:
output = neural_net(input1)
output.shape

torch.Size([2, 3, 54])
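
To see the effect on gradient flow directly, one option is to compare the first layer's gradient magnitude with and without shortcuts. The sketch below is self-contained and makes several assumptions that are not in the notebook above: a hypothetical `SmallDeepNet` with 20 equal-width layers (so every layer can use the shortcut), the built-in `nn.GELU` in place of the custom `GeLU`, and a toy mean-squared-error loss against a target of ones.

```python
import torch
import torch.nn as nn


class SmallDeepNet(nn.Module):
    """Hypothetical equal-width network so every layer can use the shortcut."""

    def __init__(self, use_skip_connection, width=16, depth=20):
        super().__init__()
        self.use_skip_connection = use_skip_connection
        self.layers = nn.ModuleList(
            [nn.Sequential(nn.Linear(width, width), nn.GELU())
             for _ in range(depth)]
        )

    def forward(self, x):
        for layer in self.layers:
            out = layer(x)
            x = x + out if self.use_skip_connection else out
        return x


def first_layer_grad(use_skip_connection):
    torch.manual_seed(123)  # same weights and input for both variants
    model = SmallDeepNet(use_skip_connection)
    x = torch.randn(4, 16)
    target = torch.ones(4, 16)  # toy target, for illustration only
    loss = nn.functional.mse_loss(model(x), target)
    loss.backward()
    # Mean absolute gradient of the first Linear layer's weight matrix
    return model.layers[0][0].weight.grad.abs().mean().item()


grad_plain = first_layer_grad(False)
grad_skip = first_layer_grad(True)
print(f"first-layer grad without skip: {grad_plain:.3e}")
print(f"first-layer grad with skip:    {grad_skip:.3e}")
```

With this depth, the gradient reaching the first layer should be noticeably larger when shortcuts are enabled, which is the behavior the derivation above predicts. Exact values depend on the seed and initialization.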