In [1]:
import torch
import torch.nn as nn

We will implement a small feed-forward neural network submodule that is part of the LLM transformer block.

In [2]:
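class GELU(nn.Module):
    # Tanh approximation of GELU:
    # 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi))
            * (x + 0.044715 * torch.pow(x, 3))
        ))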

* GELU is smooth everywhere, which can lead to better optimization properties during training, as it allows for more nuanced adjustments to the model's parameters.
* In contrast, ReLU has a sharp corner at zero (it is not differentiable there), which can sometimes make optimization harder, especially in networks that are very deep or have complicated architectures.
* Moreover, GELU produces a small, non-zero output for negative inputs, whereas ReLU outputs exactly zero. This means that during training, neurons that receive negative inputs can still contribute to the learning process, albeit to a lesser extent than those receiving positive inputs.
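The points above can be checked numerically. The following is a minimal sketch, assuming a PyTorch version that provides `torch.pi` and the `approximate="tanh"` option of `torch.nn.functional.gelu`; the sample inputs and the variable names (`gelu_out`, `relu_grad`, and so on) are chosen only for illustration. It compares the outputs and gradients of GELU and ReLU on a few points, and also checks the hand-written formula against PyTorch's built-in tanh approximation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GELU(nn.Module):
    # Same tanh approximation as the GELU class defined above.
    def forward(self, x):
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi))
            * (x + 0.044715 * torch.pow(x, 3))
        ))


# Illustrative inputs: -3, -2, -1, 0, 1, 2, 3
x = torch.linspace(-3.0, 3.0, 7, requires_grad=True)

gelu_out = GELU()(x)
relu_out = torch.relu(x)

# Gradient of each activation with respect to its input
gelu_grad = torch.autograd.grad(gelu_out.sum(), x)[0]
relu_grad = torch.autograd.grad(relu_out.sum(), x)[0]

# Largest deviation from PyTorch's built-in tanh approximation
builtin = F.gelu(x, approximate="tanh")
max_diff = (gelu_out - builtin).abs().max().item()

print("x        :", x.detach())
print("GELU(x)  :", gelu_out.detach())
print("ReLU(x)  :", relu_out.detach())
print("GELU'(x) :", gelu_grad)
print("ReLU'(x) :", relu_grad)
print("max |custom - builtin|:", max_diff)
```

For the negative inputs, the GELU row should show small negative outputs and non-zero gradients, while the ReLU row shows zeros for both.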


In [3]:
class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            # Expand: emb_dim -> 4 * emb_dim
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
            GELU(),
            # Contract: 4 * emb_dim -> emb_dim
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
        )

    def forward(self, x):
        return self.layers(x)

* The FeedForward module plays a crucial role in enhancing the model's ability to learn from the data and generalize beyond it.
* Although the input and output dimensions of this module are the same, it internally expands the embedding dimension into a higher-dimensional space (4 times larger) through the first linear layer.
* This expansion is followed by a non-linear GELU activation, and then a contraction back to the original dimension through the second linear layer.
* The intermediate expansion gives the model a richer representation space to work in, while the matching input and output dimensions make the module easy to stack.
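The expand-then-contract behaviour described above can be seen by tracing tensor shapes through the module. The sketch below is self-contained and makes a few assumptions that are not in the original: the configuration value `emb_dim = 768`, the batch shape `(2, 3, 768)`, and the use of PyTorch's built-in `nn.GELU(approximate="tanh")` as a stand-in for the custom `GELU` class.

```python
import torch
import torch.nn as nn

# Hypothetical configuration; 768 is an assumed embedding size for illustration.
cfg = {"emb_dim": 768}


class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
            # Stand-in for the custom GELU class (same tanh approximation)
            nn.GELU(approximate="tanh"),
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
        )

    def forward(self, x):
        return self.layers(x)


torch.manual_seed(123)
ffn = FeedForward(cfg)

# Assumed input: batch of 2 sequences, 3 tokens each, emb_dim features per token
x = torch.rand(2, 3, cfg["emb_dim"])

hidden = ffn.layers[0](x)  # output of the first (expanding) linear layer
out = ffn(x)               # output of the full module

n_params = sum(p.numel() for p in ffn.parameters())

print("input shape :", x.shape)
print("hidden shape:", hidden.shape)
print("output shape:", out.shape)
print("parameters  :", n_params)
```

The linear layers act only on the last dimension, so the batch and token dimensions pass through unchanged, and the output shape matches the input shape.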