# Transformers

In [1]:
import torch 
import torch.nn as nn
import math
import torch.nn.functional as F

 ![Transformer](./assets/transformer.png)

The transformer architecture consists of two main blocks, the encoder and the decoder, which are built from a few shared components:

- `Encoder`: consists of multiple identical layers that read and process the input sequence and generate context-rich numerical representations. It does this using **self-attention** and **feed-forward** networks.

- `Decoder`: generates the output sequence one token at a time, based on the encoded input sequence coming from the `encoder` and on the tokens it has already produced.

- `Positional encoding`: included in both blocks. Because tokens are processed in parallel rather than one after another, the model has no built-in notion of order, so each token's position in the sequence is encoded and added to its embedding. This allows the model to recognize the relationships between tokens and their order (essential for making sense of sentences and capturing context).

- `Attention mechanisms`: used to highlight the most important tokens and their relationships, which improves the quality of the generated text.

- `Self-attention`: a type of attention mechanism that assigns a weight to every token in the sequence simultaneously, capturing long-range dependencies.

- `Multi-head attention`: extends `self-attention` by using multiple heads to focus on different aspects of the input sequence in parallel. Each head can capture distinct relational patterns within the data, leading to richer representations.

- `Position-wise feed-forward networks`: a simple feed-forward network (FFN) that applies the same non-linear transformation to each token's embedding independently. Because every position is transformed on its own, with no interaction between tokens, the network is called `position-wise`.

In [2]:
model = nn.Transformer(
    d_model=128,
    nhead=4,
    num_encoder_layers=6,
    num_decoder_layers=6,
)
model



Transformer(
  (encoder): TransformerEncoder(
    (layers): ModuleList(
      (0-5): 6 x TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=128, out_features=128, bias=True)
        )
        (linear1): Linear(in_features=128, out_features=2048, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=2048, out_features=128, bias=True)
        (norm1): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
    )
    (norm): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
  )
  (decoder): TransformerDecoder(
    (layers): ModuleList(
      (0-5): 6 x TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=128, o

# Embedding and positional encoding

## Embedding

In [3]:
class InputEmbeddings(nn.Module):
    def __init__(self, vocab_size:int, d_model:int) ->None:
        super().__init__()
        self.d_model = d_model
        self.vocab_size = vocab_size
        self.embedding = nn.Embedding(vocab_size, d_model)
    def forward(self, x):
        return self.embedding(x) * math.sqrt(self.d_model) 

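The `math.sqrt(self.d_model)` scaling comes from the original Transformer paper. The usual explanation is:

- Depending on the initialization, embedding values can be small relative to the positional encodings, whose components lie in $[-1, 1]$.

- Scaling by $\sqrt{d_{model}}$ (about 11.3 for $d_{model} = 128$) increases their magnitude so that the token information is not drowned out when the positional encodings are added later, which helps keep training stable.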

In [4]:
embedding_layer = InputEmbeddings(vocab_size=10, d_model=128)
sentence = [0,1,2,3,4,5,6,7,1,1,1,1] # every token ID must lie in [0, vocab_size - 1]
batch_sentences= [
    [1,2,3,4,4,5], 
    [5,6,7,1,3,4]
] # input shape: (batch_size, sequence_length); all sequences in one batch must have the same length

x = torch.tensor(sentence)
z = torch.tensor(batch_sentences)
embedded_output_sentence = embedding_layer(x)
embedded_output_batch = embedding_layer(z)
embedded_output_sentence.shape , embedded_output_batch.shape

(torch.Size([12, 128]), torch.Size([2, 6, 128]))

- Batch sentences need to have the same length because a PyTorch tensor for a batch must be rectangular: every row (sequence) holds the same number of elements.

- If sequences have different lengths, we `pad` the shorter ones with a special token (e.g., token ID 0) up to the maximum length in the batch.

- When do you need to pad?

    - When you process multiple sequences together in a batch (batch size > 1) and these sequences have different lengths.

-  When can you skip padding?

    - If you’re processing sequences one by one (batch size = 1), no padding is necessary.

    - If all sequences in your batch already have the same length, no padding is needed.
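
The padding rule above can be sketched in plain Python. This is a minimal illustration, not part of the notebook's model: the helper name `pad_batch` and the `pad_id=0` default are assumptions made for the example.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence with pad_id up to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding; this can later serve as a padding mask.
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


batch = [[1, 2, 3, 4, 4, 5], [5, 6, 7], [8]]
padded, mask = pad_batch(batch)
print(padded)
print(mask)
```

After padding, every row has the same length, so `torch.tensor(padded)` would produce a rectangular `(batch_size, sequence_length)` tensor. Note that the padding ID should be reserved for padding rather than shared with a real token.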

## Positional encoding

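- Encode each token's position in the sequence as a `positional embedding` and add it to the `token embedding` to capture positional information.

- The token and positional embeddings have the same dimensionality so that they can be added element-wise.

- In the original Transformer, the positional embedding is computed with a fixed formula. In the equations below, $pos$ is the token's position and $i$ indexes the sine/cosine pairs, with $0 \le i < d_{model}/2$.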

- $ \text{Input}_i = \text{TokenEmbedding}_i + \text{PositionalEncoding}_i $

- $ PE_{pos, 2i} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) $

- $ PE_{pos, 2i+1} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) $
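
As a worked example of the two formulas, the sketch below evaluates them directly for a tiny model dimension. The function name `positional_encoding` and the choice `d_model=4` are assumptions made for illustration, and `d_model` is assumed to be even.

```python
import math


def positional_encoding(pos, d_model):
    """Sinusoidal encoding vector for a single position (d_model assumed even)."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle))  # dimension 2i
        pe.append(math.cos(angle))  # dimension 2i + 1
    return pe


for pos in range(3):
    print(pos, positional_encoding(pos, d_model=4))
```

Position 0 always maps to alternating zeros and ones, because $\sin(0) = 0$ and $\cos(0) = 1$. For later positions, the first pair of dimensions changes quickly while the later pairs change slowly, which is the "multiple frequencies" idea.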

**Why use sine and cosine functions for positional encoding?**

- Periodic functions encode positions smoothly.

- By assigning each position a vector of sine and cosine values at different frequencies, we get a unique “signature” for each position.

- These signatures vary smoothly as position changes, so nearby positions have similar encodings, while positions far apart generally have less similar ones.

**Why a vector, not a scalar?**

- A single number (scalar) wouldn’t capture complex position info.

- The vector’s different components capture patterns at multiple frequencies, encoding both fine and coarse positional details.

- This multi-dimensional encoding lets the Transformer distinguish many positions and compute relationships between them.


**Positional encoding vector**

- The dimension of each positional encoding vector is the same as the model’s embedding size, usually called `d_model`.

- So $PE(pos) \in \mathbb{R}^{d_{model}}$. If $d_{model}=128$, then $PE(pos) = [PE_0, PE_1, \dots, PE_{127}]$, where each $PE_i$ is computed by the sine or cosine formula.

- Why the same dimension?

    - Because you add the positional encoding vector to the token embedding vector element-wise.

    - For addition to work, both vectors must have the same shape, i.e., (d_model,).

**Positional Encoding matrix shape**

- Suppose your maximum sequence length is `max_len` (e.g., 512 tokens max in a sentence).

- Your model dimension (embedding size) is `d_model` (e.g., 512).

- The positional encoding matrix has shape `(max_len, d_model)`.

- Each row corresponds to one position (from 0 to max_len - 1).

- Each column corresponds to one dimension of the positional encoding vector.


**Does the positional encoding vector change during training?**

- The original Transformer’s sinusoidal positional encoding is fixed: it is computed once from the formula and is not updated during training.

- Some Transformer variants replace sinusoidal encoding with learnable positional embeddings (just like token embeddings, but for positions).

In [5]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, max_sequence_len:int):
        super().__init__()
        positional_encoding = torch.zeros(max_sequence_len, d_model)
        position = torch.arange(0, max_sequence_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))

        positional_encoding[:, 0::2] = torch.sin(position * div_term)
        positional_encoding[:, 1::2] = torch.cos(position * div_term)
        
        self.register_buffer("positional_encoding", positional_encoding.unsqueeze(0))

        # register_buffer stores the positional encoding tensor in the module without making it a learnable parameter.
        # .unsqueeze(0) adds a batch dimension → shape becomes (1, max_sequence_len, d_model) so it can be broadcast over batches during addition.

    def forward(self, x):
        return x + self.positional_encoding[:, :x.size(1)]  
      
    # x.size(1) is the current sequence length (number of tokens) in the batch — could be smaller or equal to max_sequence_len.
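
The `div_term` line in the class above does not look like the formula at first glance. It relies on the identity $10000^{-2i/d_{model}} = \exp\left(-2i \cdot \ln(10000) / d_{model}\right)$. The plain-Python sketch below checks that both forms agree; the helper name `div_terms` and the value `d_model=8` are assumptions made for the example.

```python
import math


def div_terms(d_model):
    """Return the frequency terms computed directly and via the exp/log identity."""
    direct = []
    via_exp = []
    for two_i in range(0, d_model, 2):  # two_i plays the role of 2i in the formula
        direct.append(1.0 / (10000 ** (two_i / d_model)))
        via_exp.append(math.exp(two_i * (-math.log(10000.0) / d_model)))
    return direct, via_exp


direct, via_exp = div_terms(d_model=8)
for a, b in zip(direct, via_exp):
    print(a, b, math.isclose(a, b, rel_tol=1e-9))
```

Multiplying `position` by these terms gives the same angles as dividing by $10000^{2i/d_{model}}$.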


In [6]:
max_seq_len = 4
embedding_layer = InputEmbeddings(vocab_size=10, d_model=128)
input_sentence = [
    [1,2,3,4], # the sequence length must be <= max_seq_len, and all sequences in a batch must have the same length
    [5,6,7,1],
] 
input_sentence = torch.tensor(input_sentence)
embedded_output = embedding_layer(input_sentence)
pe_layer = PositionalEncoding(d_model=128, max_sequence_len=max_seq_len)
pe_output = pe_layer(embedded_output)
pe_output.shape


torch.Size([2, 4, 128])

# Multi-head attention

## Self-attention

- `Self-attention`: allows transformers to identify the relationships between tokens and focus on the ones most relevant to the task.

- Given a sequence of token embeddings, each embedding is projected into 3 matrices: `Q`, `K`, `V` of equal dimensions using separate linear transformations with learned weights.

    - `Q: Query matrix`: indicate what each token is looking for in other tokens. “What am I looking for?”

    - `K: Key matrix`: represent the content of each token that other tokens might find relevant. “What do I have that others might want?”

    - `V: Value matrix`: the actual content to be aggregated, weighted by the attention scores. “The actual information I can give.”

- Projecting each token embedding into these three roles helps the model learn more nuanced token relationships.

- Attention scores are computed as the dot product of the `Q` and `K` matrices, divided by $\sqrt{d_k}$ (the key dimension, `head_dim` in the code) to keep the softmax inputs at a reasonable scale: $$\text{scores} = \frac{Q K^\top}{\sqrt{d_k}}$$

- This measures how much each token’s query matches every token’s key.

- Attention weights are computed by applying a softmax function to the attention scores.

    - Apply softmax to the scores along each row → weights sum to 1.

    - This gives “how much attention” each token should pay to every other token.


- The attention weights reflect the relevance (attention) the model assigns to each token in the sequence.

- Finally, we multiply the attention weights by the `V` matrix (the value projections of the token embeddings) to update each token's representation with information from the tokens it attends to.

- The mechanism described so far uses one attention head with one set of `Q`, `K`, `V` matrices. In practice, the embeddings are split across multiple heads that focus on different aspects of the input sequence in parallel.

- Multi-head attention concatenates the outputs of all heads and applies a linear transformation so that the result matches the input (embedding) dimension.

- The resulting embeddings capture token meaning, positional encoding, and contextual relationships.
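
The single-head steps above (scores, softmax, weighted sum of values) can be traced on a toy example in plain Python. This is only a sketch: the `q`, `k`, `v` values are hand-picked rather than learned projections, and the helper names are assumptions made for the example.

```python
import math


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    # (n x k) times (k x m) using nested lists
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def attention(q, k, v):
    d_k = len(k[0])
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, transpose(k))]
    weights = [softmax(row) for row in scores]  # each row sums to 1
    return matmul(weights, v), weights


# Three tokens with d_k = 2
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
output, weights = attention(q, k, v)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` says how much one token attends to every token in the sequence, and each row of `output` is the corresponding weighted average of the value vectors.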

In [7]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, nhead: int):
        super().__init__()
        assert d_model % nhead == 0
        self.nhead = nhead
        self.d_model = d_model
        self.head_dim = d_model // nhead # embedding dimension per head

        self.query_linear = nn.Linear(d_model, d_model,bias=False)
        self.key_linear = nn.Linear(d_model, d_model,bias=False)
        self.value_linear = nn.Linear(d_model, d_model,bias=False)

        self.out_linear = nn.Linear(d_model, d_model)



    def split_heads(self, x,batch_size):  
        seq_length = x.size(1)
        x = x.reshape(batch_size, seq_length, self.nhead, self.head_dim)  
        x = x.permute(0, 2, 1, 3)  # (batch_size, nhead, seq_length, head_dim)
        return x 
    
    
    def compute_attention(self, query, key, value,mask=None):
        scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(self.head_dim)

        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))

        attention_weights = F.softmax(scores, dim=-1)

        return torch.matmul(attention_weights, value)
    
    def combine_heads(self,x,batch_size):
       
        # x: (batch_size, nhead, seq_length, head_dim)

        seq_length = x.size(2)
       
        # Move heads back to last dimension: (batch_size, seq_length, nhead, head_dim)
        x = x.permute(0, 2, 1, 3)

        # Merge nhead and head_dim back into d_model
        x = x.reshape(batch_size, seq_length, self.d_model)

        return x
    
    def forward(self, query, key, value, mask=None):

        batch_size = query.size(0)

        query = self.split_heads(self.query_linear(query),batch_size)
        key = self.split_heads(self.key_linear(key),batch_size)
        value = self.split_heads(self.value_linear(value),batch_size)

        attention_output = self.compute_attention(query,key,value,mask)

        output = self.combine_heads(attention_output,batch_size)

        return self.out_linear(output)

In [9]:
multihead_attn = MultiHeadAttention(d_model=128, nhead=4)
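
To make the `split_heads` / `combine_heads` reshaping easier to picture, here is a plain-Python sketch for a single sequence (no batch dimension). It mirrors the idea of the reshape and permute in the class above using nested lists; the function names are reused only for readability, and the toy input is an assumption for the example.

```python
def split_heads(x, nhead):
    """x: [seq_len][d_model] -> [nhead][seq_len][head_dim]."""
    d_model = len(x[0])
    assert d_model % nhead == 0, "d_model must be divisible by nhead"
    head_dim = d_model // nhead
    # Head h receives the contiguous slice h*head_dim:(h+1)*head_dim of every token.
    return [
        [token[h * head_dim:(h + 1) * head_dim] for token in x]
        for h in range(nhead)
    ]


def combine_heads(heads):
    """[nhead][seq_len][head_dim] -> [seq_len][d_model] by concatenating the heads."""
    seq_len = len(heads[0])
    return [
        [value for head in heads for value in head[t]]
        for t in range(seq_len)
    ]


x = [[0, 1, 2, 3, 4, 5, 6, 7],
     [10, 11, 12, 13, 14, 15, 16, 17]]  # 2 tokens, d_model = 8
heads = split_heads(x, nhead=4)         # head_dim = 2
print(heads[0])
print(combine_heads(heads) == x)
```

With `d_model=128` and `nhead=4`, as in the cell above, each head would work on a 32-dimensional slice of every token.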

# Encoder-Decoder (Full transformer)

 ![Transformer2](./assets/original_transformer.png)

# Encoder only transformers

![Encoder](./assets/encoder.png)


- Focus on understanding and representing the input data, for tasks such as text classification.

- Consists of two main components: 
 
   - **Transformer body**

   - **Transformer head**

**Transformer body**:
- The body (also called the encoder) is a stack of $N$ encoder layers designed to learn complex patterns from the input data.

- Each encoder layer incorporates:

   - A multi-head self-attention mechanism that captures the relationships between tokens in the input sequence.

   - A feed-forward network (sublayer) that maps this knowledge into abstract non-linear representations.

   - Both components are usually combined with other techniques such as layer normalization and dropout to improve training.

**Transformer head**: 

- The final layer, designed to produce task-specific output.

- Maps the encoded inputs to predictions (classification, regression).

## Feed-Forward sublayer in encoder layer

- Contains two fully connected linear layers with a ReLU activation in between.

- `d_ff` is the dimension of the hidden layer in the feed-forward sublayer.

In [10]:
class FeedForward(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.linear2(F.relu(self.linear1(x)))
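
Written as a formula, the sublayer above computes $$\text{FFN}(x) = \max(0,\, x W_1 + b_1)\, W_2 + b_2$$ for each token vector $x$ separately. As a rough size check, the sketch below counts the weights and biases of the two linear layers; the helper name `ffn_parameter_count` is an assumption made for the example, and the sizes are the ones shown in the `nn.Transformer` printout earlier (`d_model=128`, hidden size 2048).

```python
def ffn_parameter_count(d_model, d_ff):
    """Parameters of Linear(d_model, d_ff) followed by Linear(d_ff, d_model)."""
    linear1 = d_model * d_ff + d_ff      # weight matrix + bias
    linear2 = d_ff * d_model + d_model   # weight matrix + bias
    return linear1 + linear2


print(ffn_parameter_count(d_model=128, d_ff=2048))  # 526464
```

Because `d_ff` is usually several times larger than `d_model`, the feed-forward sublayer tends to hold a large share of each layer's parameters.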

In [None]:
class EncoderLayer(nn.Module):
    def __init__(self, d_model: int, nhead: int, d_ff: int, dropout: float):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, nhead)
        self.feed_forward = FeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Self-attention sublayer: dropout, residual connection, then layer norm
        atten_output = self.self_attn(x, x, x, mask)
        atten_output = self.dropout1(atten_output)
        x = self.norm1(x + atten_output)

        # Feed-forward sublayer with the same dropout / residual / norm pattern
        feed_forward_output = self.feed_forward(x)
        feed_forward_output = self.dropout2(feed_forward_output)
        x = self.norm2(x + feed_forward_output)

        return x

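- When we call the attention mechanism in the forward pass, the same input embeddings are passed as the query, key, and value.

- The mask prevents the model from attending to padding tokens in the input sequence.

- Padded tokens are irrelevant to the task, so we exclude them from the attention mechanism.

- We do that by applying a padding mask to the attention scores: scores linked to padded tokens are set to $-\infty$ before the softmax, so their attention weights become 0.

- With the `MultiHeadAttention` class above, the mask must be broadcastable to the score shape `(batch_size, nhead, seq_length, seq_length)`, for example `(batch_size, 1, 1, seq_length)` with 1 for real tokens and 0 for padding.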
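
The effect of the padding mask on one row of attention scores can be checked in plain Python. This is a minimal sketch: the helper name `masked_softmax` and the toy scores are assumptions made for the example, and the mask convention (1 = real token, 0 = padding) follows the `mask == 0` test in `compute_attention`.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over one row of scores; positions where mask == 0 get weight 0."""
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    m_max = max(masked)
    exps = [math.exp(s - m_max) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.5, 0.5]
mask = [1, 1, 0, 0]  # the last two positions are padding
weights = masked_softmax(scores, mask)
print([round(w, 3) for w in weights])  # [0.731, 0.269, 0.0, 0.0]
```

The real tokens share all of the attention, and the padded positions contribute nothing to the weighted sum. One caveat: if every position in a row is masked, the softmax has nothing to normalize over and the result is NaN, so each sequence should contain at least one real token.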

In [11]:
class TransformerEncoder(nn.Module):
    def __init__(self, vocab_size,d_model,num_layers,num_heads,d_ff,dropout,max_seq_length):
        super().__init__()
        self.embedding = InputEmbeddings(vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model, max_seq_length)
        self.encoder_layers = nn.ModuleList(
            [EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)]
        )

    def forward(self, x, mask=None):
        x = self.embedding(x)
        x = self.positional_encoding(x)
        for encoder_layer in self.encoder_layers:
            x = encoder_layer(x,mask)
        return x

In [12]:
class ClassificationHead(nn.Module):
    def __init__(self, d_model: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(d_model, num_classes)

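    def forward(self, x):
        logits = self.fc(x)
        # Log-probabilities over the classes; pair with nn.NLLLoss
        # (nn.CrossEntropyLoss expects raw logits instead).
        output = F.log_softmax(logits, dim=-1)
        return output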



class RegressionHead(nn.Module):
    def __init__(self, d_model: int,output_dim: int):
        super().__init__()
        self.fc = nn.Linear(d_model, output_dim)

    def forward(self, x):
        
        return self.fc(x)  
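
To show what the classification head's `log_softmax` output means, the plain-Python sketch below turns a row of logits into log-probabilities and computes the negative log-likelihood of one class. The logits, the `target` index, and the helper name are assumptions made for the example.

```python
import math


def log_softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum_exp for v in logits]


logits = [2.0, 0.5, -1.0]
log_probs = log_softmax(logits)
probs = [math.exp(lp) for lp in log_probs]  # exponentiating recovers probabilities
target = 0
nll = -log_probs[target]  # negative log-likelihood of the assumed true class
print([round(p, 3) for p in probs], round(nll, 3))
```

The loss shrinks toward 0 as the probability assigned to the true class approaches 1.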