#### Transformer Code Architecture

The Transformer architecture, introduced in the 2017 paper "Attention Is All You Need," reshaped NLP by replacing the recurrence of recurrent neural networks (RNNs) with self-attention. Here's a breakdown of the Transformer’s core components:

- Encoder-Decoder Structure: The Transformer is composed of stacked encoder and decoder layers.

    - Encoder: The encoder processes the input sequence, generating a set of representations.
    - Decoder: The decoder uses these representations to generate the output sequence.
- Self-Attention Mechanism: Self-attention computes the relationships between elements in a sequence, capturing dependencies regardless of their distance.

- Positional Encoding: Since Transformers lack recurrence, positional encodings add sequence information to the inputs.

- Multi-Head Attention: Each attention head captures different aspects of relationships, enriching the learned representations.

- Feed-Forward Network (FFN): A position-wise FFN is applied after each attention sub-layer. Its activation function supplies the non-linearity, and dropout is commonly added for regularization.

- Layer Normalization and Residual Connections: Residual connections and normalization stabilize and improve the training of the deep network.
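
To make the multi-head idea concrete, the sketch below traces tensor shapes as a `d_model`-wide representation is split into heads and merged back. The sizes (`batch=2`, `seq_len=5`, `d_model=8`, `num_heads=2`) are arbitrary values chosen for illustration, not settings from the paper.

```python
import torch

batch, seq_len, d_model, num_heads = 2, 5, 8, 2  # illustrative sizes (assumed)
d_k = d_model // num_heads                        # per-head width: 4

x = torch.randn(batch, seq_len, d_model)

# Split the last dimension into (num_heads, d_k), then move the head axis
# ahead of the sequence axis so each head can attend independently.
heads = x.view(batch, seq_len, num_heads, d_k).transpose(1, 2)
print(heads.shape)  # torch.Size([2, 2, 5, 4])

# Undo the split: put the head axis back next to d_k and flatten to d_model.
merged = heads.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
print(torch.equal(merged, x))  # True
```

Splitting and merging only rearranges values, so no information is lost; the heads differ only because each one works on its own slice of the projected features.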

#### Explanation of Terms in the Original Paper
- Attention: Mechanism to focus on different parts of the sequence, assigning weights to relevant parts.
- Scaled Dot-Product Attention: Computes the dot product of the query and key matrices, divides it by the square root of the key dimension, and applies a softmax to obtain the attention weights, which are then used to take a weighted average of the values.
- Multi-Head Attention: Multiple attention heads to capture diverse relationships within the sequence.
- Query (Q), Key (K), Value (V): Matrices derived from the input that help compute attention weights and apply them to obtain relevant information.
- Positional Encoding: A way to encode the order of tokens, since the Transformer has no built-in notion of sequence order.
- Feed-Forward Neural Network (FFN): A fully connected network applied to each position independently after the attention sub-layer.
- Residual Connection: Adds the input of a sub-layer to that sub-layer’s output to improve information and gradient flow.
- Layer Normalization: Stabilizes training by normalizing inputs to each layer.
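
For reference, the terms above correspond to the following formulas from the original paper, where $d_k$ is the key dimension, $h$ is the number of heads, $pos$ is the token position, and $i$ indexes a pair of embedding dimensions:

```latex
\begin{aligned}
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V \\
\mathrm{MultiHead}(Q, K, V) &= \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O},
\quad \mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q},\, K W_i^{K},\, V W_i^{V}) \\
PE_{(pos,\, 2i)} &= \sin\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right),
\quad PE_{(pos,\, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right) \\
\mathrm{FFN}(x) &= \max(0,\, x W_1 + b_1)\, W_2 + b_2 \\
\text{sub-layer output} &= \mathrm{LayerNorm}\big(x + \mathrm{Sublayer}(x)\big)
\end{aligned}
```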

#### Coding the Transformer Architecture
Below is a simplified PyTorch implementation of the Transformer:

import torch
import torch.nn as nn
import math

# ScaledDotProductAttention computes the attention weights for a given query, key, and value set.
# This is a core part of the self-attention mechanism in the Transformer.
class ScaledDotProductAttention(nn.Module):
    def __init__(self, d_k):
        super().__init__()
        # Scaling factor for the dot product, helps with gradient stability
        self.scale = math.sqrt(d_k)
    
    def forward(self, Q, K, V, mask=None):
        # Compute the attention scores by taking the dot product of Q and the transpose of K
        attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale
        # Apply the mask if provided (useful for tasks like language generation)
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
        # Compute the attention weights by applying softmax to the scores
        attn_weights = torch.softmax(attn_scores, dim=-1)
        # Multiply the attention weights with the values to get the output
        return torch.matmul(attn_weights, V), attn_weights

# MultiHeadAttention splits the attention mechanism across multiple heads, allowing the model
# to capture different aspects of relationships in the input data.
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        # d_model must split evenly across heads, otherwise the head reshape in forward() breaks
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        # Number of dimensions for each head
        self.d_k = d_model // num_heads
        # Number of attention heads
        self.num_heads = num_heads
        # Linear layers to transform the input into query, key, and value matrices
        self.qkv_layers = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(3)])
        # Scaled dot-product attention, created once instead of on every forward pass
        self.attention = ScaledDotProductAttention(self.d_k)
        # Final linear layer to consolidate multi-head attention output
        self.out_layer = nn.Linear(d_model, d_model)

    def forward(self, Q, K, V, mask=None):
        batch_size = Q.size(0)
        # Project Q, K, V, then reshape to (batch, num_heads, seq_len, d_k)
        Q, K, V = [layer(x).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
                   for layer, x in zip(self.qkv_layers, (Q, K, V))]
        # Calculate scaled dot-product attention for each head (the weights are not returned here)
        attn_output, _ = self.attention(Q, K, V, mask)
        # Concatenate attention heads and reshape back to (batch, seq_len, d_model)
        attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads * self.d_k)
        # Apply the final linear transformation
        return self.out_layer(attn_output)

# PositionalEncoding adds positional information to each token in the sequence, enabling the model to differentiate
# the relative position of tokens, which is crucial since Transformers lack a sequential structure like RNNs.
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        # Create a matrix to hold positional encodings
        pe = torch.zeros(max_len, d_model)
        # Generate position indices for each position
        position = torch.arange(0, max_len).unsqueeze(1)
        # Compute sinusoidal values for each position across even and odd dimensions
        div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
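        # Register as a buffer: it follows the module across devices and is saved in
        # the state_dict, but it is not a parameter and is not updated during training
        self.register_buffer("pe", pe.unsqueeze(0))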
    
    def forward(self, x):
        # Add positional encoding to input embeddings
        x = x + self.pe[:, :x.size(1)].to(x.device)
        return x

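# TransformerLayer represents a single encoder-style layer: a multi-head self-attention sub-layer
# followed by a feed-forward sub-layer, each wrapped in a residual connection and layer normalization.
# (A decoder layer in the original paper additionally has a cross-attention sub-layer, omitted here.)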
class TransformerLayer(nn.Module):
    def __init__(self, d_model, num_heads, ff_hidden):
        super().__init__()
        # Multi-head attention module
        self.attention = MultiHeadAttention(d_model, num_heads)
        # Feed-forward network
        self.ff = nn.Sequential(nn.Linear(d_model, ff_hidden), nn.ReLU(), nn.Linear(ff_hidden, d_model))
        # Layer normalization modules to stabilize training
        self.layernorm1 = nn.LayerNorm(d_model)
        self.layernorm2 = nn.LayerNorm(d_model)

    def forward(self, x, mask=None):
        # Apply multi-head attention and add residual connection
        attn_output = self.layernorm1(x + self.attention(x, x, x, mask))
        # Apply feed-forward network and add another residual connection
        return self.layernorm2(attn_output + self.ff(attn_output))

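# Transformer combines embedding, positional encoding, and multiple transformer layers
# into a single self-attention stack with a projection to the vocabulary.
# It is not the full encoder-decoder model: there is no cross-attention between two stacks.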
class Transformer(nn.Module):
    def __init__(self, d_model, num_heads, ff_hidden, num_layers, max_len, vocab_size):
        super().__init__()
        # Embedding layer for converting input indices to vectors
        self.embedding = nn.Embedding(vocab_size, d_model)
        # Positional encoding to inject positional information
        self.pos_encoding = PositionalEncoding(d_model, max_len)
        # Stack of transformer layers
        self.layers = nn.ModuleList([TransformerLayer(d_model, num_heads, ff_hidden) for _ in range(num_layers)])
        # Final output layer that projects the model's outputs to the vocabulary size
        self.fc_out = nn.Linear(d_model, vocab_size)

    def forward(self, x, mask=None):
        # Embed the input and add positional encoding
        x = self.pos_encoding(self.embedding(x))
        # Pass through each transformer layer
        for layer in self.layers:
            x = layer(x, mask)
        # Project to vocabulary size
        return self.fc_out(x)
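
The `mask` argument above is left to the caller. As one possible way to build it, the sketch below constructs a causal (look-ahead) mask using the same convention as `masked_fill(mask == 0, -1e9)`, where entries equal to 0 are blocked. The sequence length of 4 and the all-zero scores are assumptions chosen to keep the result easy to read.

```python
import torch

seq_len = 4  # illustrative length (assumed)

# Lower-triangular matrix: 1 = position may be attended to, 0 = blocked.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

# Uniform scores, so the effect of the mask is easy to see.
scores = torch.zeros(seq_len, seq_len)
masked_scores = scores.masked_fill(causal_mask == 0, -1e9)
weights = torch.softmax(masked_scores, dim=-1)

# Row i spreads its weight evenly over positions 0..i and gives
# zero weight to every later position.
print(weights)
```

A mask of shape `(seq_len, seq_len)` broadcasts across the batch and head dimensions of the attention scores, so it could be passed as `mask` to the `forward` methods above.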


#### Explanation of the Code Flow and Key Components
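- Scaled Dot-Product Attention: Computes attention scores between the query and key matrices, turns them into weights with a softmax, and uses those weights to average the value matrix. Dividing the scores by the square root of `d_k` keeps large dot products from saturating the softmax, which helps with gradient stability.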

- Multi-Head Attention: Runs attention in parallel on several learned projections of the input. In this implementation each head works on a `d_k = d_model // num_heads` slice of the projected features, which lets different heads capture different relationships within the input sequence before their outputs are concatenated and passed through a final linear layer.

- Positional Encoding: Adds positional context to each token in the sequence. Transformers do not inherently recognize sequence order, so sinusoidal positional encodings help the model learn the position of each token, which is critical in sequence processing tasks.

- Transformer Layer: Combines multi-head attention and a feed-forward network, with layer normalization and residual connections to stabilize training and enhance information flow. Residual connections add the input of each sub-layer to the output, aiding gradient flow and helping avoid vanishing gradient problems.

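- Stacked Transformer Model: Combines embeddings, positional encoding, multiple transformer layers, and a final linear layer that projects to the vocabulary. As written, the stack uses self-attention only, so it corresponds to an encoder-style model (or, with a causal mask, a decoder-only language model) rather than the full encoder-decoder with cross-attention. The same building blocks underlie many NLP models and have been adapted to other domains such as computer vision.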
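
To illustrate why the scaling factor matters, here is a small worked example in plain Python. The raw scores `[8.0, 0.0, -8.0]` and `d_k = 64` are made-up values for illustration, and the `softmax` helper is defined locally rather than taken from the code above.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64                                     # assumed key dimension
raw = [8.0, 0.0, -8.0]                       # hypothetical unscaled dot products
scaled = [s / math.sqrt(d_k) for s in raw]   # [1.0, 0.0, -1.0]

print([round(w, 4) for w in softmax(raw)])     # [0.9997, 0.0003, 0.0]
print([round(w, 4) for w in softmax(scaled)])  # [0.6652, 0.2447, 0.09]
```

Without scaling, nearly all of the weight lands on one position and the softmax is almost flat with respect to the others, so little gradient flows through them. After dividing by the square root of `d_k`, the distribution is much softer.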

#### Importance of Each Part
- Attention Mechanisms: Let every position attend directly to every other position, so long-range dependencies are captured in a single step. The trade-off is that compute and memory grow quadratically with sequence length.
- Positional Encoding: Addresses the lack of recurrence, essential for sequence data.
- Multi-Layer Structure: Enables the model to learn complex patterns through stacked attention and feed-forward layers.
- Residual Connections and Normalization: Keep activations well-scaled and give gradients a direct path through the stack, which stabilizes training and mitigates vanishing gradients in deep architectures.
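
As a concrete look at the positional encoding, the plain-Python sketch below evaluates the same sinusoidal formula used in `PositionalEncoding` for a tiny model. The function name `sinusoidal_encoding` and the sizes (`d_model=4`, three positions) are illustrative assumptions.

```python
import math

def sinusoidal_encoding(pos, d_model):
    """Return the encoding vector for one position (d_model assumed even)."""
    vec = []
    for i in range(0, d_model, 2):
        # Same frequency term as div_term in the PyTorch class above.
        freq = math.exp(-i * math.log(10000.0) / d_model)
        vec.append(math.sin(pos * freq))  # even index
        vec.append(math.cos(pos * freq))  # odd index
    return vec

for pos in range(3):
    print(pos, [round(v, 4) for v in sinusoidal_encoding(pos, d_model=4)])
# Position 0 always encodes to [0.0, 1.0, 0.0, 1.0], since sin(0) = 0 and cos(0) = 1.
```

The first pair of dimensions changes quickly from one position to the next, while later pairs change slowly, so each position within `max_len` receives a distinct vector.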

Great tutorial on Transformers: https://www.youtube.com/watch?v=QCJQG4DuHT0&list=PLTl9hO2Oobd97qfWC40gOSU8C0iu0m2l4&index=1