#### Transformer Architecture

##### Overview of the Encoder-Decoder architecture
+ The Transformer's encoder-decoder architecture changed how sequence-to-sequence tasks are approached by replacing recurrence with attention.
+ The `encoder` creates **rich, contextual representations of the input data**, while the `decoder` uses these representations—along with its own past outputs—to **generate coherent and contextually accurate** sequences. 

+ At its core, the encoder-decoder framework consists of two main components:
    + 1. Encoder: Processes the input sequence and **transforms** it into a set of continuous representations (often referred to as "hidden states"); these representations capture the essential features and context of the input data.
    + 2. Decoder: Uses the encoder's representations along with its own previous outputs to generate the target sequence; it **“decodes”** the information into the desired output, step by step.

#### How the Transformer Implements the Encoder-Decoder Architecture

#### 1. Encoder: Understanding the Input
##### Layer Structure:
+ The Transformer encoder is composed of a stack of identical layers. Each layer has two primary sub-layers:

1. Multi-Head Self-Attention: This mechanism allows the encoder to weigh the importance of different parts of the input sequence relative to each token. It helps capture dependencies regardless of their distance in the sequence.

2. Feed-Forward Neural Network (FFN): A position-wise network that further processes the representations generated by the attention mechanism.
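
As a rough illustration of what a single attention head computes, the sketch below implements scaled dot-product attention, softmax(QKᵀ / √d_k)·V. The function name `scaled_dot_product_attention` and the tensor sizes are chosen here for illustration only; a real multi-head layer additionally applies learned projections to the queries, keys, and values and runs several heads side by side.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """Single-head attention: softmax(q k^T / sqrt(d_k)) v."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (batch, len_q, len_k)
    weights = torch.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ v, weights

torch.manual_seed(0)
x = torch.rand(2, 5, 16)  # (batch_size, seq_length, d_k)
out, weights = scaled_dot_product_attention(x, x, x)  # self-attention: q = k = v
print(out.shape)      # torch.Size([2, 5, 16])
print(weights.shape)  # torch.Size([2, 5, 5])
```

Each row of `weights` says how strongly one token attends to every token in the sequence, regardless of how far apart they are.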

##### Positional Encoding:
+ Since the Transformer does not inherently encode sequence order (unlike recurrent networks), positional encodings are added to the input embeddings to provide information about the order of tokens.
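
To see why order information has to be injected, the following sketch (an illustration with arbitrary sizes, not part of the model defined later) shuffles the tokens of a sequence and feeds both versions through the same self-attention layer. Without positional encodings, the outputs are simply shuffled in the same way, so the layer by itself cannot tell one ordering from another.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)
attn.eval()  # disable dropout so the two passes are comparable

x = torch.rand(1, 6, 16)   # (batch_size, seq_length, d_model)
perm = torch.randperm(6)   # a random reordering of the 6 positions
x_perm = x[:, perm]

with torch.no_grad():
    out, _ = attn(x, x, x)
    out_perm, _ = attn(x_perm, x_perm, x_perm)

# Reordering the inputs only reorders the outputs in the same way.
same_up_to_order = torch.allclose(out[:, perm], out_perm, atol=1e-5)
print(same_up_to_order)
```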

##### Layer Normalization and Residual Connections:
+ Each sub-layer is wrapped in a residual connection (i.e., the input of the sub-layer is added to its output), and the sum is then passed through layer normalization. Together these facilitate training and stabilize gradients.

#### 2. Decoder: Generating the Output
##### Layer Structure:
+ Like the encoder, the decoder consists of a stack of layers. However, each decoder layer has three main sub-layers:

##### 1. Masked Multi-Head Self-Attention: 
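+ The “masking” ensures that the model cannot "cheat" by looking ahead at future tokens in the output sequence during training: each position may attend only to itself and to earlier positions. This keeps training consistent with inference, where the decoder generates the output one token at a time.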
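
A minimal sketch of the masking step, assuming dummy scores that are all zero so that the effect of the mask is easy to read. The variable names are illustrative.

```python
import torch

seq_length = 4
scores = torch.zeros(seq_length, seq_length)  # dummy scores, rows = queries, columns = keys

# True above the diagonal marks "future" keys that a query must not see.
future = torch.triu(torch.ones(seq_length, seq_length, dtype=torch.bool), diagonal=1)

# Setting a score to -inf gives it exactly zero weight after the softmax.
weights = torch.softmax(scores.masked_fill(future, float("-inf")), dim=-1)
print(weights)
```

Row `t` ends up spreading its weight evenly over positions `0` through `t`, and every entry above the diagonal is zero.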

##### 2. Encoder-Decoder (Cross) Attention: 
+ This layer enables the decoder to focus on relevant parts of the encoder's output. It computes attention scores between the decoder’s current state and all encoder outputs, effectively “translating” the input representation into the context needed for output generation.
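
The shapes involved in cross-attention can be sketched with `nn.MultiheadAttention`, using made-up sizes: queries come from the decoder, while keys and values come from the encoder output. The source and target lengths are deliberately different here to show that they do not need to match.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, nhead = 32, 4
cross_attn = nn.MultiheadAttention(d_model, nhead)  # expects (seq_length, batch_size, d_model)

memory = torch.rand(7, 2, d_model)  # encoder output: source length 7, batch size 2
tgt = torch.rand(5, 2, d_model)     # decoder states: target length 5, batch size 2

# Queries come from the decoder; keys and values come from the encoder.
out, weights = cross_attn(query=tgt, key=memory, value=memory)
print(out.shape)      # torch.Size([5, 2, 32])
print(weights.shape)  # torch.Size([2, 5, 7])
```

The output has one vector per target position, and the weights hold one distribution over the 7 source positions for each of the 5 target positions.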

##### 3. Feed-Forward Neural Network (FFN):
+ Like in the encoder, this network processes the outputs of the attention mechanisms.

##### Positional Encoding and Residual Connections:
+ Positional encodings are also used in the decoder to maintain information about the sequence order. Residual connections and layer normalization are similarly applied for stability and performance.
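
For comparison with the hand-written blocks in the cells further down, PyTorch also ships a ready-made stack of encoder and decoder layers in `nn.Transformer`. The sketch below uses small, arbitrary sizes; note that `nn.Transformer` does not include token embeddings or positional encodings, so the random tensors stand in for already-embedded inputs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Transformer(d_model=64, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       dim_feedforward=128, dropout=0.1)

src = torch.rand(10, 3, 64)  # (source length, batch_size, d_model)
tgt = torch.rand(8, 3, 64)   # (target length, batch_size, d_model)

# Causal mask for the decoder: -inf above the diagonal, 0 elsewhere.
tgt_mask = torch.triu(torch.full((8, 8), float("-inf")), diagonal=1)

out = model(src, tgt, tgt_mask=tgt_mask)
print(out.shape)  # torch.Size([8, 3, 64])
```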

#### Key Benefits of the Encoder-Decoder Framework in Transformers

##### Parallel Processing:
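+ Unlike traditional RNN-based encoder-decoder models, the Transformer processes all positions of a sequence in parallel during training, which significantly speeds up computation. The decoder can do this because the full target sequence is known at training time and the causal mask hides future tokens; at inference, it still generates one token at a time.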
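
The sketch below illustrates why parallel training is possible on the decoder side. It uses a simplified single-head causal self-attention without learned weights (the helper `causal_self_attention` is an assumption made for this example) and compares one pass over the whole sequence with a token-by-token loop over growing prefixes.

```python
import math
import torch

def causal_self_attention(x):
    """Single-head self-attention where position t sees only positions <= t."""
    seq_length, d = x.size(-2), x.size(-1)
    scores = x @ x.transpose(-2, -1) / math.sqrt(d)
    future = torch.triu(torch.ones(seq_length, seq_length, dtype=torch.bool), diagonal=1)
    weights = torch.softmax(scores.masked_fill(future, float("-inf")), dim=-1)
    return weights @ x

torch.manual_seed(0)
x = torch.rand(6, 8)  # (seq_length, d_model), standing in for a known target sequence

# Training-style: all positions are computed in a single pass.
parallel = causal_self_attention(x)

# Generation-style: grow the prefix one token at a time and keep the newest output.
stepwise = torch.stack([causal_self_attention(x[: t + 1])[-1] for t in range(6)])

matches = torch.allclose(parallel, stepwise, atol=1e-5)
print(matches)
```

Because the mask hides future tokens, the single pass produces the same per-position results as the step-by-step loop, which is what lets training run over all positions at once.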

##### Long-Range Dependencies:
+ The self-attention mechanism within the encoder (and the cross-attention in the decoder) can capture relationships between tokens regardless of their position, making it highly effective for modeling long-range dependencies.

##### Modularity and Flexibility:
+ The clear separation between the encoder and decoder allows for flexible design choices. For example, one might pretrain the encoder on a large corpus and then fine-tune the decoder for a specific task, or vice versa.
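
One way this separation might be used in practice is to freeze one component while training the other. The sketch below assumes a hypothetical setting in which the encoder has already been pretrained (here it is just randomly initialized) and only the decoder should be updated; the sizes and learning rate are arbitrary.

```python
import torch
import torch.nn as nn

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, dim_feedforward=128), num_layers=2)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=64, nhead=4, dim_feedforward=128), num_layers=2)

# Freeze the (hypothetically pretrained) encoder so its weights are not updated.
for param in encoder.parameters():
    param.requires_grad = False

# Hand only the decoder's parameters to the optimizer.
trainable = [p for p in decoder.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)

frozen = [p for p in encoder.parameters() if not p.requires_grad]
print("trainable tensors:", len(trainable), "| frozen tensors:", len(frozen))
```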

##### Scalability:
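+ Because self-attention is not bound by the sequential nature of recurrent architectures, the computation for all positions in a sequence can run in parallel, which lets Transformers make good use of modern hardware. Note that the cost of self-attention grows quadratically with sequence length, so very long sequences remain expensive.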
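
As a rough illustration of how attention cost relates to sequence length, the sketch below counts the entries of the score matrix for a few lengths. The feature size of 64 is arbitrary.

```python
import torch

sizes = {}
for seq_length in (128, 256, 512):
    x = torch.rand(seq_length, 64)  # (seq_length, d_model)
    scores = x @ x.T                # one score per (query, key) pair
    sizes[seq_length] = scores.numel()
    print(seq_length, tuple(scores.shape), scores.numel())
```

The counts are 16384, 65536, and 262144: doubling the sequence length quadruples the number of scores that have to be computed and stored.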



### Practical Implications and Use Cases

#### 1. Machine Translation:
+ In tasks like translating text from one language to another, the encoder processes the source language sentence to capture its meaning, and the decoder then generates the corresponding sentence in the target language, guided by the context provided by the encoder.

#### 2. Text Summarization:
+ The encoder can distill the main points of a long document, and the decoder can produce a concise summary.

#### 3. Question Answering:
+ The encoder might be used to comprehend a passage of text, and the decoder then generates an answer based on the information encoded.

-----------

#### Importing Libraries and Defining the Positional Encoding

In [5]:
import torch
import torch.nn as nn
import math

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        """
        Args:
            d_model: the dimension of embeddings.
            max_len: the maximum length of sequences.
        """
        super(PositionalEncoding, self).__init__()
        # Create a long enough matrix of positional encodings (max_len, d_model)
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float32).unsqueeze(1)
        # Compute the positional encodings once in log space.
        div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32) *
                             -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # sine on even feature indices
        # Cosine on odd feature indices; the slice keeps shapes aligned when d_model is odd.
        pe[:, 1::2] = torch.cos(position * div_term[: d_model // 2])
        pe = pe.unsqueeze(0)  # shape: (1, max_len, d_model)
        self.register_buffer('pe', pe)

    def forward(self, x):
        """
        Args:
            x: Tensor of shape (batch_size, seq_length, d_model)
        Returns:
            Tensor with positional encodings added.
        """
        seq_length = x.size(1)
        x = x + self.pe[:, :seq_length]
        return x

# Quick test of positional encoding:
d_model = 512
pe = PositionalEncoding(d_model)
dummy_input = torch.zeros(2, 10, d_model)  # (batch_size, seq_length, d_model)
print("Positional Encoding applied, shape:", pe(dummy_input).shape)


Positional Encoding applied, shape: torch.Size([2, 10, 512])


#### Define the Transformer Encoder Block

In [6]:
class TransformerEncoderBlock(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward, dropout=0.1):
        super(TransformerEncoderBlock, self).__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(dim_feedforward, d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.activation = nn.ReLU()

    def forward(self, src, src_mask=None, src_key_padding_mask=None):
        """
        Args:
            src: input tensor of shape (seq_length, batch_size, d_model)
            src_mask: optional mask for attention weights.
            src_key_padding_mask: optional padding mask.
        Returns:
            Processed tensor of same shape.
        """
        # Self-attention: each position attends to every other position.
        attn_output, _ = self.self_attn(src, src, src,
                                        attn_mask=src_mask,
                                        key_padding_mask=src_key_padding_mask)
        src = src + self.dropout1(attn_output)
        src = self.norm1(src)
        # Feed-forward network.
        ff_output = self.linear2(self.dropout(self.activation(self.linear1(src))))
        src = src + self.dropout2(ff_output)
        src = self.norm2(src)
        return src

# Quick test of the encoder block with dummy data:
d_model = 512
nhead = 8
dim_feedforward = 2048
dropout = 0.1
seq_length = 10
batch_size = 32

encoder_block = TransformerEncoderBlock(d_model, nhead, dim_feedforward, dropout)
dummy_src = torch.rand(seq_length, batch_size, d_model)  # (sequence length, batch size, d_model)
encoder_output = encoder_block(dummy_src)
print("Encoder output shape:", encoder_output.shape)


Encoder output shape: torch.Size([10, 32, 512])


#### Define the Transformer Decoder Block

In [7]:
class TransformerDecoderBlock(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward, dropout=0.1):
        super(TransformerDecoderBlock, self).__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        self.multihead_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(dim_feedforward, d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.dropout3 = nn.Dropout(dropout)
        self.activation = nn.ReLU()

    def forward(self, tgt, memory, tgt_mask=None, memory_mask=None,
                tgt_key_padding_mask=None, memory_key_padding_mask=None):
        """
        Args:
            tgt: target sequence tensor (tgt_seq_length, batch_size, d_model)
            memory: encoder output tensor (src_seq_length, batch_size, d_model)
            tgt_mask: mask for target sequence self-attention (e.g., to prevent looking ahead).
            memory_mask: mask for encoder-decoder attention.
            tgt_key_padding_mask: padding mask for target.
            memory_key_padding_mask: padding mask for memory.
        Returns:
            Processed tensor of shape (tgt_seq_length, batch_size, d_model)
        """
        # Masked self-attention in decoder (ensures causal/auto-regressive generation).
        attn_output, _ = self.self_attn(tgt, tgt, tgt,
                                        attn_mask=tgt_mask,
                                        key_padding_mask=tgt_key_padding_mask)
        tgt = tgt + self.dropout1(attn_output)
        tgt = self.norm1(tgt)
        # Encoder-decoder (cross) attention.
        attn_output, _ = self.multihead_attn(tgt, memory, memory,
                                             attn_mask=memory_mask,
                                             key_padding_mask=memory_key_padding_mask)
        tgt = tgt + self.dropout2(attn_output)
        tgt = self.norm2(tgt)
        # Feed-forward network.
        ff_output = self.linear2(self.dropout(self.activation(self.linear1(tgt))))
        tgt = tgt + self.dropout3(ff_output)
        tgt = self.norm3(tgt)
        return tgt

# Quick test of the decoder block with dummy data:
decoder_block = TransformerDecoderBlock(d_model, nhead, dim_feedforward, dropout)
dummy_tgt = torch.rand(seq_length, batch_size, d_model)
# For testing, we can use the encoder output from the previous cell as 'memory'
decoder_output = decoder_block(dummy_tgt, encoder_output)
print("Decoder output shape:", decoder_output.shape)


Decoder output shape: torch.Size([10, 32, 512])


#### Define a Helper Function for Target Masking and Integrate Encoder-Decoder

In [8]:
def generate_square_subsequent_mask(sz):
    """
    Generates a square mask for the sequence.
    The masked positions are filled with -inf.
    Unmasked positions are filled with 0.
    """
    # Entries above the diagonal (future positions) are -inf; all others are 0.
    mask = torch.triu(torch.full((sz, sz), float('-inf')), diagonal=1)
    return mask

# Define sequence lengths and create dummy input data.
src_seq_length = 10  # source sequence length
tgt_seq_length = 10  # target sequence length

dummy_src = torch.rand(src_seq_length, batch_size, d_model)  # (L, N, E)
dummy_tgt = torch.rand(tgt_seq_length, batch_size, d_model)  # (L, N, E)

# Create target mask to prevent the decoder from looking ahead.
tgt_mask = generate_square_subsequent_mask(tgt_seq_length)

# Pass through the encoder block.
memory = encoder_block(dummy_src)
print("Memory (encoder output) shape:", memory.shape)

# Pass through the decoder block using the encoder's output as memory.
decoder_output = decoder_block(dummy_tgt, memory, tgt_mask=tgt_mask)
print("Final Decoder output shape:", decoder_output.shape)


Memory (encoder output) shape: torch.Size([10, 32, 512])
Final Decoder output shape: torch.Size([10, 32, 512])


-------

### Layer Normalization

+ Layer normalization is a technique used to stabilize and accelerate the training of deep neural networks by **normalizing the inputs to each layer**.
+ It is particularly well-suited for models like Transformers, where inputs across time steps or sequence positions are processed simultaneously.
+ It standardizes the inputs to a layer across the features of each individual sample, rather than across the batch.

#### Key Characteristics
##### 1. Normalization Across Features:
+ Unlike batch normalization, which normalizes across the batch dimension, layer normalization normalizes the activations within a layer across the feature dimension.
+ This means that for each sample (or time step in a sequence), the normalization is performed over all hidden units in that layer.
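
To make the normalization axis concrete, the sketch below computes the statistics by hand over the last (feature) dimension and compares the result with `nn.LayerNorm`. It relies on the default settings of `nn.LayerNorm`, where the learnable scale starts at 1, the shift at 0, and `eps` is `1e-5`; the tensor sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.rand(2, 4, 8)  # (batch_size, seq_length, features)

# Statistics are computed over the feature dimension only,
# separately for every sample and every position.
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + 1e-5)

layer_norm = nn.LayerNorm(8)
matches = torch.allclose(manual, layer_norm(x), atol=1e-5)
print(matches)
```

Batch normalization would instead compute `mean` and `var` across the batch dimension, so each sample's result would depend on the other samples in the batch.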

##### 2. Independence from Batch Size:
+ Because the normalization is done per sample, layer normalization works well even with very small batch sizes or in settings where the batch size is 1 (such as in some reinforcement learning or online learning scenarios).
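
A quick check of this property, using arbitrary sizes: the same sample is normalized once as part of a batch of four and once on its own.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer_norm = nn.LayerNorm(8)
batch = torch.rand(4, 8)  # (batch_size, features)

in_batch = layer_norm(batch)[0]   # sample 0, normalized alongside three other samples
alone = layer_norm(batch[:1])[0]  # sample 0, normalized as a batch of size 1

unchanged = torch.allclose(in_batch, alone, atol=1e-6)
print(unchanged)
```

The result for sample 0 does not depend on what else is in the batch, which is why batch size 1 poses no problem.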

##### 3. Simplicity in Recurrent and Transformer Models:
+ In models that handle sequences—like RNNs and Transformers—layer normalization helps stabilize hidden state dynamics over time and improves convergence.

### Benefits of Layer Normalization
#### 1. Stabilizes Training:
+ By normalizing the inputs to each layer, the gradients during backpropagation become more stable, helping the network converge faster and more reliably.

#### 2. Handles Variable Batch Sizes:
+ Since the normalization is done per sample, it is less sensitive to the batch size.
+ This makes it a robust choice when training on small batches or even on a single sample at a time.

#### 3. Ideal for Sequence Models:
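+ In natural language processing (NLP) tasks, the sequences in a batch often have different lengths and are padded, which makes statistics computed across the batch less reliable. Layer normalization sidesteps this because it normalizes each position on its own, making it more appropriate than batch normalization here.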

#### 4. Improves Generalization:
+ Normalizing activations is often credited with reducing internal covariate shift (i.e., the change in the distribution of network activations during training), which in turn can help the network generalize better to unseen data.

### Use in Transformer Models
+ In Transformer architectures, layer normalization is applied at several points:

#### Within Encoder and Decoder Blocks:
+ After the multi-head self-attention and feed-forward sub-layers, layer normalization is used (often in combination with residual connections) to ensure that the inputs to subsequent layers are well-behaved.

#### Before Attention Mechanisms:
+ Some variants of the Transformer apply layer normalization before the attention layers (known as pre-norm Transformers) to further improve training stability.
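
The two arrangements differ only in where the normalization sits relative to the residual connection. The sketch below is a simplified illustration: the helper names `post_norm` and `pre_norm` are made up for this example, and a single linear layer stands in for the attention or feed-forward sub-layer.

```python
import torch
import torch.nn as nn

def post_norm(x, sublayer, norm):
    """Post-norm: normalize after the residual addition."""
    return norm(x + sublayer(x))

def pre_norm(x, sublayer, norm):
    """Pre-norm: normalize the sub-layer input and leave the residual path untouched."""
    return x + sublayer(norm(x))

torch.manual_seed(0)
d_model = 16
sublayer = nn.Linear(d_model, d_model)  # stand-in for attention or the FFN
norm = nn.LayerNorm(d_model)
x = torch.rand(2, 5, d_model)

post_out = post_norm(x, sublayer, norm)
pre_out = pre_norm(x, sublayer, norm)
print(post_out.shape)  # torch.Size([2, 5, 16])
print(pre_out.shape)   # torch.Size([2, 5, 16])
```

In the pre-norm form the input `x` reaches the output through an unnormalized skip path, which is the usual explanation given for its more stable training in deep stacks.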

-----

In [11]:
# Cell 1: Import required libraries
import torch
import torch.nn as nn
import numpy as np

# For reproducibility
torch.manual_seed(42)


<torch._C.Generator at 0x108e59690>

In [12]:
# Cell 2: Prepare a simple NLP example

# Define a simple sentence and tokenize it.
sentence = "hello world this is a test".split()
print("Tokenized Sentence:", sentence)

# Build a small vocabulary from the sentence.
# (In practice, you would have a larger vocabulary and better tokenization.)
# dict.fromkeys drops repeated words so that the indices stay contiguous.
vocab = {word: idx for idx, word in enumerate(dict.fromkeys(sentence))}
print("Vocabulary:", vocab)

# Convert tokens to indices.
indices = [vocab[word] for word in sentence]
print("Token Indices:", indices)

# Convert indices to a tensor of shape (batch_size, sequence_length).
# Here, we assume a batch size of 1.
input_indices = torch.tensor([indices])  # shape: (1, sequence_length)
print("Input Indices Tensor:\n", input_indices)


Tokenized Sentence: ['hello', 'world', 'this', 'is', 'a', 'test']
Vocabulary: {'hello': 0, 'world': 1, 'this': 2, 'is': 3, 'a': 4, 'test': 5}
Token Indices: [0, 1, 2, 3, 4, 5]
Input Indices Tensor:
 tensor([[0, 1, 2, 3, 4, 5]])


In [13]:
# Cell 3: Create an Embedding Layer and Apply it

# Set the embedding dimension.
embedding_dim = 8  # You can choose any dimension

# Create an embedding layer (vocab_size x embedding_dim).
vocab_size = len(vocab)
embedding = nn.Embedding(vocab_size, embedding_dim)

# Get embeddings for the input indices.
embeddings = embedding(input_indices)  # shape: (batch_size, sequence_length, embedding_dim)
print("\nRaw Embeddings:\n", embeddings)

# Optionally, view the shape
print("Embeddings Shape:", embeddings.shape)



Raw Embeddings:
 tensor([[[ 1.9269,  1.4873,  0.9007, -2.1055,  0.6784, -1.2345, -0.0431,
          -1.6047],
         [-0.7521,  1.6487, -0.3925, -1.4036, -0.7279, -0.5594, -0.7688,
           0.7624],
         [ 1.6423, -0.1596, -0.4974,  0.4396, -0.7581,  1.0783,  0.8008,
           1.6806],
         [ 1.2791,  1.2964,  0.6105,  1.3347, -0.2316,  0.0418, -0.2516,
           0.8599],
         [-1.3847, -0.8712, -0.2234,  1.7174,  0.3189, -0.4245,  0.3057,
          -0.7746],
         [-1.5576,  0.9956, -0.8798, -0.6011, -1.2742,  2.1228, -1.2347,
          -0.4879]]], grad_fn=<EmbeddingBackward0>)
Embeddings Shape: torch.Size([1, 6, 8])


In [14]:
# Cell 4: Apply Layer Normalization on the Embeddings

# Layer normalization is typically applied over the embedding dimension.
# For each token in the sequence, we want to normalize its embedding vector.
layer_norm = nn.LayerNorm(embedding_dim)

# Apply layer normalization to the embeddings.
# Note: LayerNorm expects the normalized shape to match the last dimension.
normalized_embeddings = layer_norm(embeddings)
print("\nNormalized Embeddings:\n", normalized_embeddings)

# View the shape of the normalized embeddings
print("Normalized Embeddings Shape:", normalized_embeddings.shape)



Normalized Embeddings:
 tensor([[[ 1.3737,  1.0601,  0.6418, -1.5020,  0.4833, -0.8809, -0.0312,
          -1.1448],
         [-0.5176,  2.0823, -0.1281, -1.2231, -0.4913, -0.3089, -0.5357,
           1.1225],
         [ 1.2722, -0.7856, -1.1714, -0.1013, -1.4692,  0.6281,  0.3112,
           1.3160],
         [ 1.0335,  1.0605, -0.0108,  1.1203, -1.3260, -0.8990, -1.3572,
           0.3787],
         [-1.3584, -0.7856, -0.0628,  2.1023,  0.5421, -0.2872,  0.5274,
          -0.6778],
         [-1.0002,  1.1404, -0.4319, -0.1983, -0.7626,  2.0854, -0.7294,
          -0.1034]]], grad_fn=<NativeLayerNormBackward0>)
Normalized Embeddings Shape: torch.Size([1, 6, 8])


-------------

### Feed-Forward Networks
+ Feed-forward networks (FFNs) are a critical component in Transformer architectures.
+ They are applied independently and identically to each position (or token) in the sequence, providing a way to further process and transform the output of the attention layers.

### Feed-Forward Networks in Transformers
+ In the context of Transformer models, a feed-forward network is typically a two-layer multilayer perceptron (MLP) applied to each position in the sequence independently.
+ The key idea is to further process the token representations after the self-attention mechanism has captured relationships among tokens.

### Key Characteristics
#### 1. Position-wise Application:
+ The same feed-forward network is applied to each token separately.
+ This means that while the network is shared across all positions, it operates independently for each position, allowing parallel computation.

#### 2. Two-Layer Structure:
+ A standard FFN in a Transformer consists of:

+ 1. A linear layer that projects the input from the model's hidden size to a higher-dimensional space.
+ 2. A non-linear activation function (commonly ReLU or GELU) to introduce non-linearity.
+ 3. A second linear layer that projects the output back to the hidden size.
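
The three pieces listed above can be sketched with `nn.Sequential`. The sizes 512 and 2048 match the ones used in the code cells of this notebook; the parameter count is included to give a sense of how much of a layer's capacity sits in the FFN.

```python
import torch.nn as nn

d_model, d_ff = 512, 2048
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),  # expand to the higher-dimensional space
    nn.ReLU(),                 # non-linearity
    nn.Linear(d_ff, d_model),  # project back to the hidden size
)

n_params = sum(p.numel() for p in ffn.parameters())
expected = 2 * d_model * d_ff + d_ff + d_model  # two weight matrices plus two biases
print(n_params, expected)  # 2099712 2099712
```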

### Why Use Feed-Forward Networks?
##### 1. Non-Linearity:
+ The introduction of a non-linear activation function allows the network to capture complex patterns in the data that linear transformations alone cannot capture.

##### 2. Dimensional Expansion:
+ By projecting to a higher-dimensional space, the network can model more complex interactions among features. Then, projecting back to the original space allows these enhanced representations to be integrated back into the overall model.

##### 3. Position-wise Independence:
+ Since the FFN is applied to each token independently, it does not mix information across different positions. This complements the attention mechanism, which is responsible for mixing information across positions.
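
This independence can be checked directly. The sketch below, with small arbitrary sizes and no dropout, changes a single token and measures how much the FFN output moves at each position.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
ffn = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 8))

x = torch.rand(1, 5, 8)  # (batch_size, seq_length, d_model)
x_changed = x.clone()
x_changed[0, 2] += 1.0   # perturb only the token at position 2

# Total absolute change in the output, per position.
diff = (ffn(x) - ffn(x_changed)).abs().sum(dim=-1)[0]
print(diff)
```

Only the entry for position 2 is non-zero. An attention layer given the same perturbation would change the outputs at other positions as well, since it mixes information across the sequence.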

##### 4. Efficiency:
+ The feed-forward network is applied in parallel across tokens, making it computationally efficient.

-----

In [15]:
import torch
import torch.nn as nn

class FeedForwardNetwork(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        """
        Args:
            d_model: The input and output dimensionality (model hidden size).
            d_ff: The inner-layer dimensionality (feed-forward dimension).
            dropout: Dropout probability.
        """
        super(FeedForwardNetwork, self).__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.activation = nn.ReLU()  # Alternatively, use nn.GELU() for smoother activations.
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x):
        # Apply the first linear transformation
        x = self.linear1(x)
        # Apply non-linearity
        x = self.activation(x)
        # Apply dropout for regularization
        x = self.dropout(x)
        # Project back to d_model
        x = self.linear2(x)
        return x

# Quick test of the FFN module with dummy data:
d_model = 512       # Hidden size
d_ff = 2048         # Feed-forward size
batch_size = 2      # Number of sequences in the batch
seq_length = 10     # Length of each sequence

ffn = FeedForwardNetwork(d_model, d_ff)
dummy_input = torch.rand(batch_size, seq_length, d_model)
ffn_output = ffn(dummy_input)
print("Input shape:", dummy_input.shape)
print("FFN output shape:", ffn_output.shape)


Input shape: torch.Size([2, 10, 512])
FFN output shape: torch.Size([2, 10, 512])


In [16]:
# Cell 2: An example using word embeddings

# Sample sentence tokenization (for illustration)
sentence = "hello world this is a test".split()
vocab = {word: idx for idx, word in enumerate(dict.fromkeys(sentence))}  # de-duplicate so indices stay contiguous
indices = [vocab[word] for word in sentence]

# Create a tensor of token indices (batch size = 1, sequence length = len(sentence))
input_indices = torch.tensor([indices])
print("Token indices:", input_indices)

# Create an embedding layer and obtain embeddings.
embedding_dim = d_model  # Must match d_model of FFN
embedding = nn.Embedding(len(vocab), embedding_dim)
embeddings = embedding(input_indices)  # Shape: (1, sequence_length, embedding_dim)
print("Raw embeddings:\n", embeddings)

# Apply the Feed-Forward Network on the embeddings.
ffn_output = ffn(embeddings)
print("\nFFN output:\n", ffn_output)
print("FFN output shape:", ffn_output.shape)


Token indices: tensor([[0, 1, 2, 3, 4, 5]])
Raw embeddings:
 tensor([[[ 0.4575,  0.9477,  1.7342,  ...,  0.0348,  1.3952, -0.3006],
         [-0.4981, -2.1107,  1.0544,  ...,  0.2418, -0.7503, -1.9291],
         [-1.0991,  0.0943, -0.2119,  ..., -0.3632,  0.4274,  0.7224],
         [ 1.2295,  0.2509,  0.1474,  ..., -0.9729,  1.7464,  0.6245],
         [ 0.9247, -0.4319,  0.9695,  ...,  1.2066, -0.3008, -0.8452],
         [-1.0557,  1.2071, -0.0109,  ..., -1.6180,  0.1539,  1.0028]]],
       grad_fn=<EmbeddingBackward0>)

FFN output:
 tensor([[[-0.0936,  0.1393, -0.1026,  ...,  0.0180,  0.0283, -0.0821],
         [ 0.0056,  0.2733,  0.4768,  ..., -0.0036,  0.2604,  0.1568],
         [-0.1127,  0.3853,  0.4542,  ..., -0.2403,  0.0419,  0.5705],
         [-0.0903, -0.0647,  0.1574,  ...,  0.0601, -0.6056, -0.0466],
         [-0.3474,  0.0300, -0.0115,  ..., -0.0618,  0.3574, -0.1310],
         [-0.2085, -0.0121, -0.0578,  ...,  0.1855, -0.1635,  0.0790]]],
       grad_fn=<ViewBackward0>)
