Name: Aditya Singh | [LinkedIn](https://www.linkedin.com/in/adityasingh2022/) | [GitHub](https://github.com/adityasinghcoding)

# __Transformer from Scratch__

A `Transformer` (from transform/translate) is a model that captures the real-world meaning of data by weighting the relations between its elements according to their importance.
<br>___Note:___ A Transformer often acts as the core model inside a larger model.

---
### __Components of Transformer__
1. ___Encoder Layer___
   - Input Embeddings
   - Positional Encoding
   - Self-Attention Layer
   - Feed-Forward Network
2. ___Decoder Layer___
   - Masked Self-Attention
   - Encoder-Decoder Attention
   - Feed-Forward Network

---
**Attention**: Finds, measures, and evaluates how strongly one word (or any data element) relates to the other elements in the same batch/sequence/matrix.
Two forms of attention are used here:
- **Self-Attention**: Evaluates the relation of each word/data element to every other element in the same sequence.
- **Multi-Head Attention**: Runs multiple self-attention operations in parallel to capture more complex relations between entities (data).

---
### __Architecture of Transformers__
#### __Encoder Layer__
- **Input** (Data as vectors; embeddings of words)
- **Positional Encoding** (Information about token order, encoded with sine/cosine functions)
- **Self-Attention Layer** (Captures and evaluates relations; in technical terms, it computes attention weights)
   - $\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$ <br>_Softmax normalizes the scores so that each row sums to 1._
      - _Q, K, V:_ Query, Key, Value
      - ${QK^T}$: Computes the scores (interaction of each query with every key).
      - $V$: The value vectors, which are averaged using the attention weights.
      - $\sqrt{d_k}$: Scales down large dot products, which would otherwise push softmax into regions with very small gradients ($d_k$ is the key dimension).
- __Feed-Forward Network__ (A simple fully connected neural network applied to each position)
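
The attention formula above can be sketched in a few lines of PyTorch. This is a minimal, single-head illustration, not the implementation used later in this notebook; the function name `scaled_dot_product_attention` and the tensor sizes are assumptions chosen for the example.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    # q: (..., query_len, d_k), k: (..., key_len, d_k), v: (..., key_len, d_v)
    d_k = q.shape[-1]
    # Scores: every query against every key, scaled by sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (..., query_len, key_len)
    # Softmax over the key axis so each query's weights sum to 1
    weights = torch.softmax(scores, dim=-1)
    # Weighted sum of the value vectors
    return weights @ v, weights

torch.manual_seed(0)
q = torch.randn(2, 5, 64)  # hypothetical sizes: batch=2, query_len=5, d_k=64
k = torch.randn(2, 7, 64)  # key_len=7
v = torch.randn(2, 7, 64)
out, weights = scaled_dot_product_attention(q, k, v)
print(out.shape)             # torch.Size([2, 5, 64])
print(weights.sum(dim=-1))   # every entry is 1 up to floating-point error
```

The output has one vector per query, and the weight matrix has one row per query with one column per key.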

#### __Decoder Layer__
- __Masked Self-Attention__ (Hides future tokens during training so each position attends only to itself and earlier positions)
- __Encoder-Decoder Attention__ (___Q___ comes from the decoder, ___K & V___ from the encoder, so the decoder can focus on the relevant parts of the input)
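
To illustrate how masking hides future tokens, here is a small sketch of a causal (lower-triangular) mask applied before softmax. The uniform `scores` tensor is a made-up stand-in for real attention scores, and `-inf` is used as the fill value here purely for clarity.

```python
import torch

seq_len = 4  # hypothetical sequence length

# Lower-triangular matrix: position i may attend to positions 0..i only
mask = torch.tril(torch.ones(seq_len, seq_len))

# Stand-in for raw attention scores (all equal, so the effect of the mask is easy to see)
scores = torch.zeros(seq_len, seq_len)

# Masked positions get -inf, so softmax assigns them a weight of exactly 0
masked = scores.masked_fill(mask == 0, float("-inf"))
weights = torch.softmax(masked, dim=-1)
print(weights)
```

Each row still sums to 1, but all weight above the diagonal (the future positions) is zero; the first token can only attend to itself.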

In [3]:
import torch 
import torch.nn as nn
import torch.nn.functional as F

### Self-Attention Layer

In [20]:
class SelfAttention(nn.Module):
   def __init__(self, embed_size, heads):
      super(SelfAttention, self).__init__()
      self.embed_size = embed_size # Dimension of input embeddings (e.g., 512)
      self.heads= heads # Number of attention heads (e.g., 8)
      self.head_dim = embed_size // heads # Dimension per head (e.g., 512/8=64)

      # Ensuring embed_size is divisible by the number of heads
      assert self.head_dim*heads == embed_size, "Embed size must be divisible by heads"

      # Linear layers to project each head's slice of the embeddings into Query (Q), Key (K), Value (V) vectors
      self.values = nn.Linear(self.head_dim, self.head_dim, bias=False)
      self.keys = nn.Linear(self.head_dim, self.head_dim, bias=False)
      self.queries = nn.Linear(self.head_dim, self.head_dim, bias=False)

      # Final linear layer to combine outputs from all heads
      self.fc_out = nn.Linear(embed_size, embed_size)
   
   def forward(self, values, keys, queries, mask=None):
      # Get batch size (N) and sequence lengths for values, keys, queries
      N = queries.shape[0] #Batch Size
      value_len, key_len, query_len = values.shape[1], keys.shape[1], queries.shape[1]

      # Splitting embeddings into multiple heads (reshape for parallel computation)
      # New shape: (N, seq_len, heads, head_dim)
      values = values.reshape(N, value_len, self.heads, self.head_dim)
      keys = keys.reshape(N, key_len, self.heads, self.head_dim)
      queries = queries.reshape(N, query_len, self.heads, self.head_dim)
      
      # Projecting each head's slice to Q, K, V using the linear layers
      values = self.values(values)
      keys = self.keys(keys)
      queries = self.queries(queries)

      # Raw attention scores (Q * K^T), computed per head with Einstein summation:
      # queries (batch, query_len, heads, head_dim) x keys (batch, key_len, heads, head_dim)
      # Result shape: (batch, heads, query_len, key_len)
      energy = torch.einsum("nqhd,nkhd->nhqk", [queries, keys])

      # Applying mask (if provided) to ignore certain positions (e.g., padding or future tokens)
      if mask is not None:
         # Replace masked positions with a very large negative value so softmax gives them ~0 weight
         energy = energy.masked_fill(mask == 0, float("-1e20"))

      # Scaling by sqrt(d_k) (the per-head dimension) for stability, then normalizing over the key axis
      attention = torch.softmax(energy / (self.head_dim ** 0.5), dim=3)

      # Computing weighted sum of values using attention scores
      # Result shape: (batch, query_len, heads, head_dim)
      out = torch.einsum("nhql, nlhd->nqhd", [attention, values])
      
      # Reshaping back to (batch, query_len, embed_size) and pass through final linear layer
      out = out.reshape(N, query_len, self.embed_size)
      out = self.fc_out(out)
      return out
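
The two Einstein summations used in multi-head attention can be checked in isolation. The sketch below uses made-up sizes and random tensors as stand-ins for the projected queries, keys, and values, and compares the score computation against an equivalent `permute` + matrix-multiply formulation.

```python
import torch

torch.manual_seed(0)
N, query_len, key_len, heads, head_dim = 2, 5, 7, 8, 16  # hypothetical sizes
queries = torch.randn(N, query_len, heads, head_dim)
keys = torch.randn(N, key_len, heads, head_dim)
values = torch.randn(N, key_len, heads, head_dim)

# Scores per head: contract over head_dim (d)
energy = torch.einsum("nqhd,nkhd->nhqk", queries, keys)
attention = torch.softmax(energy / head_dim ** 0.5, dim=3)

# Weighted sum of values: contract over key positions (l)
out = torch.einsum("nhql,nlhd->nqhd", attention, values)

# Same scores via permute + matmul, to confirm the einsum equation
reference = queries.permute(0, 2, 1, 3) @ keys.permute(0, 2, 3, 1)

print(energy.shape)  # torch.Size([2, 8, 5, 7])
print(out.shape)     # torch.Size([2, 5, 8, 16])
print(torch.allclose(energy, reference, atol=1e-4))
```

Reshaping `out` to `(N, query_len, heads * head_dim)` concatenates the heads back into a single embedding per position.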
       

### Positional Encoding

In [21]:
class PositionalEncoding(nn.Module):
   def __init__(self, embed_size, max_seq_len):
      super(PositionalEncoding, self).__init__()

      # Creating a matrix of shape (max_seq_len, embed_size) initialized to zeros
      pe = torch.zeros(max_seq_len, embed_size)

      # Generating positions from 0 to max_seq_len-1
      position = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)

      # Computing divisor term for scaling positional encoding
      # Using exp and log to avoid numerical instability
      div_term = torch.exp(torch.arange(0, embed_size, 2).float()*(-torch.log(torch.tensor(10000.0)) / embed_size))
      
      # Applying sine to even embedding indices and cosine to odd embedding indices
      pe[:, 0::2] = torch.sin(position * div_term) # Even embedding indices
      pe[:, 1::2] = torch.cos(position * div_term) # Odd embedding indices

      # Register as a buffer (non-trainable parameter) for saving/loading
      self.register_buffer("pe", pe.unsqueeze(0)) # (1, max_seq_len, embed_size)


   def forward(self, x):
      # Add positional encoding to input embeddings
      # x shape: (batch, seq_len, embed_size)
      # pe shape: (1, max_seq_len, embed_size), sliced to seq_len and broadcast over the batch
      return x + self.pe[:, :x.shape[1], :]
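
As a quick sanity check of the sinusoidal scheme, the standalone sketch below rebuilds the same table as a plain function and inspects it. The helper name `sinusoidal_encoding` and the small sizes are assumptions for illustration only.

```python
import math
import torch

def sinusoidal_encoding(max_seq_len, embed_size):
    # One row per position, one column per embedding index
    position = torch.arange(max_seq_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, embed_size, 2).float() * (-math.log(10000.0) / embed_size)
    )
    pe = torch.zeros(max_seq_len, embed_size)
    pe[:, 0::2] = torch.sin(position * div_term)  # even embedding indices
    pe[:, 1::2] = torch.cos(position * div_term)  # odd embedding indices
    return pe

pe = sinusoidal_encoding(max_seq_len=10, embed_size=8)
print(pe.shape)  # torch.Size([10, 8])
print(pe[0])     # position 0: sin(0) = 0 at even indices, cos(0) = 1 at odd indices

# Adding to a batch of embeddings: the table is sliced to seq_len and broadcast over the batch
x = torch.zeros(2, 6, 8)
encoded = x + pe[:6].unsqueeze(0)
print(encoded.shape)  # torch.Size([2, 6, 8])
```

Because the table depends only on position and embedding index, every sequence in the batch receives the same offsets.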
         
          

### Transformer Block (Encoder Layer)

In [22]:
class TransformerBlock(nn.Module):
   def __init__(self, embed_size, heads, dropout = 0.1):
      super(TransformerBlock, self).__init__()

      # Multi-head self-attention layer
      self.attention = SelfAttention(embed_size, heads)

      # Layer normalization for stabilizing training
      self.norm1 = nn.LayerNorm(embed_size)
      self.norm2 = nn.LayerNorm(embed_size)

      # Feed-forward network (expands and contracts embeddings)
      self.ff = nn.Sequential(
         nn.Linear(embed_size, 4 * embed_size), # Expand to 4 * embed_size
         nn.ReLU(), # Non-linearity
         nn.Linear(4 * embed_size, embed_size), # Contract back to embed_size
      )

      # Dropout for regularization
      self.dropout = nn.Dropout(dropout)
      
   def forward(self, x, mask = None):
      # Step 1: Compute self-attention (values, keys, queries all come from x)
      attention = self.attention(x, x, x, mask)

      # Step 2: Residual connection + layer norm
      x = self.norm1(attention + x) # Residual skip connection
      x = self.dropout(x)

      # Step 3: Feed-forward network
      ff = self.ff(x)

      # Step 4: Residual connection + layer norm
      x = self.norm2(ff + x)
      x = self.dropout(x)
      return x
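
The "residual connection + layer norm" pattern can be seen on its own in the sketch below. It uses a small feed-forward sublayer with made-up sizes; with `nn.LayerNorm` at its default initialization, each token's output vector ends up with roughly zero mean and unit variance.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_size = 16  # hypothetical size
x = torch.randn(2, 5, embed_size)  # (batch, seq_len, embed_size)

# Stand-in sublayer: expand to 4 * embed_size, then contract back
sublayer = nn.Sequential(
    nn.Linear(embed_size, 4 * embed_size),
    nn.ReLU(),
    nn.Linear(4 * embed_size, embed_size),
)
norm = nn.LayerNorm(embed_size)

# Residual connection: add the input back, then normalize over the embedding axis
y = norm(sublayer(x) + x)

print(y.shape)                             # torch.Size([2, 5, 16])
print(y.mean(dim=-1).abs().max())          # close to 0
print(y.var(dim=-1, unbiased=False))       # close to 1 for every token
```

The residual path requires the sublayer to return the same shape it receives, which is why the feed-forward network contracts back to `embed_size`.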
          

### Full Transformer (Encoder Stack)

In [25]:
class Transformer(nn.Module):
   def __init__(self, src_vocab_size, embed_size, num_layers, heads, max_seq_len, dropout = 0.1):
      super(Transformer, self).__init__()

      # Embedding layer to convert token IDs to vectors
      self.embed = nn.Embedding(src_vocab_size, embed_size)

      # Positional encoding to add sequence information
      self.pe = PositionalEncoding(embed_size, max_seq_len)

      # Stack multiple transformer blocks (encoder layers)
      self.layers = nn.ModuleList(
         [TransformerBlock(embed_size, heads, dropout) for _ in range(num_layers)]
      )

      # Final linear layer to project embeddings back to vocabulary size
      self.fc_out = nn.Linear(embed_size, src_vocab_size)


   def forward(self, x, mask = None):
      # Step 1: Convert token IDs to embeddings
      x = self.embed(x) # (batch, seq_len) → (batch, seq_len, embed_size)

      # Step 2: Add positional encoding
      x = self.pe(x)

      # Step 3: Pass through each transformer block
      for layer in self.layers:
         x = layer(x, mask)

      # Step 4: Project embeddings to vocabulary logits
      x = self.fc_out(x) # (batch, seq_len, vocab_size)
      return x
          
       

### Training a Toy Example

In [26]:
# Hyperparameters
embed_size = 128 # Dimension of embeddings
heads = 8 # Number of attention heads
num_layers = 3 # Number of transformer blocks
max_seq_len = 10 # Maximum sequence length
vocab_size = 10 # Vocabulary size (e.g., 10 tokens: 0-9)

# Initializing model, loss, and optimizer
model = Transformer(vocab_size, embed_size, num_layers, heads, max_seq_len)
criterion = nn.CrossEntropyLoss() # For classification tasks
optimizer = torch.optim.Adam(model.parameters(), lr= 0.001)

# Generating toy data (input and target are the same for a copy task)
src = torch.randint(0, vocab_size, (32, max_seq_len)) # Fake input (batch_size=32)
trg = src.clone() # Target is same as input (simple copy task)


# Training loop
for epoch in range(100):
   # Forward pass: compute model predictions
   output = model(src) # Shape: (batch, seq_len, vocab_size)

   # Compute loss (flatten batch and sequence dimensions for cross-entropy)
   loss = criterion(output.view(-1, vocab_size), trg.view(-1))

   # Backpropagation
   optimizer.zero_grad() # Clear gradients
   loss.backward() # Compute gradients
   optimizer.step() # Update weights
   print(f"Epoch {epoch}, Loss: {loss.item()}")
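
The flattening step in the loss computation, plus a simple way to measure token-level accuracy, can be tried without a model. In this sketch, `logits` is a random stand-in for the model output and all sizes are made up; the accuracy metric is an addition for illustration, not part of the training loop above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, vocab_size = 4, 10, 10  # hypothetical sizes
logits = torch.randn(batch, seq_len, vocab_size)           # stand-in for model output
targets = torch.randint(0, vocab_size, (batch, seq_len))   # stand-in for target token IDs

# CrossEntropyLoss expects (num_items, num_classes) logits and (num_items,) class IDs,
# so the batch and sequence dimensions are merged into one
criterion = nn.CrossEntropyLoss()
loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))

# Token-level accuracy: predicted token is the highest-scoring vocabulary entry
predictions = logits.argmax(dim=-1)                         # (batch, seq_len)
accuracy = (predictions == targets).float().mean()

print(loss.item(), accuracy.item())
```

On a copy task, the loss should fall and this accuracy should rise toward 1.0 as training progresses.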
