1. **MultiHeadAttention**: This class implements the multi-head attention mechanism, a crucial component of the Transformer architecture. It computes attention for multiple attention heads independently, allowing the model to focus on different parts of the input simultaneously. The class contains linear layers to transform the input into Query (Q), Key (K), and Value (V) matrices, which are then split into multiple heads. Scaled dot-product attention is calculated for each head, and the outputs are concatenated and passed through a final linear layer.

2. **PositionalEncoding**: This class adds positional information to the input sequences. Self-attention has no inherent notion of word order, so positional encodings are added to the input embeddings to tell the model where each word sits in the sequence. The class precomputes sinusoidal positional encodings and adds them to the input during the forward pass.

3. **TransformerBlock**: This class represents a single layer in the Transformer architecture. Each layer contains a multi-head attention mechanism followed by layer normalization, then a position-wise feed-forward network followed by a second layer normalization. The output of each sub-layer passes through dropout and is added to that sub-layer's input (a residual connection) before being normalized. The `TransformerBlock` class uses the `MultiHeadAttention` class to perform the multi-head attention operation and combines it with the other components to create a single Transformer layer.

4. **Transformer**: This class assembles the model by applying the `PositionalEncoding` to the input data and stacking multiple `TransformerBlock` layers. During the forward pass, the input first passes through the `PositionalEncoding` layer and then sequentially through the stack of `TransformerBlock` layers, each used as self-attention (the same tensor serves as query, key, and value). The output of the final `TransformerBlock` layer is returned as the output of the model.
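
For reference, the two formulas that the descriptions above rely on can be written as follows. The symbols are assumptions chosen to match the code below: $d_k$ is the per-head dimension (`head_dim`), $pos$ is the position in the sequence, and $i$ indexes the sine/cosine pairs along the embedding dimension.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
```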


In [7]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchsummary import summary


In [2]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        self.head_dim = d_model // num_heads
        
        assert self.head_dim * num_heads == d_model, "Incompatible dimensions"

        self.query = nn.Linear(d_model, d_model)
        self.key = nn.Linear(d_model, d_model)
        self.value = nn.Linear(d_model, d_model)
        self.fc_out = nn.Linear(d_model, d_model)
        
    def forward(self, query, key, value):
        batch_size = query.shape[0]
        
        # Apply linear layers
        Q = self.query(query)
        K = self.key(key)
        V = self.value(value)

        # Split d_model into num_heads * head_dim
        Q = Q.view(batch_size, -1, self.num_heads, self.head_dim).permute(0, 2, 1, 3)
        K = K.view(batch_size, -1, self.num_heads, self.head_dim).permute(0, 2, 1, 3)
        V = V.view(batch_size, -1, self.num_heads, self.head_dim).permute(0, 2, 1, 3)

        # Calculate scaled dot-product attention
        attn_weights = torch.matmul(Q, K.permute(0, 1, 3, 2)) / (self.head_dim ** 0.5)
        attn_weights = F.softmax(attn_weights, dim=-1)
        attn_output = torch.matmul(attn_weights, V)

        # Concatenate and pass through final linear layer
        attn_output = attn_output.permute(0, 2, 1, 3).contiguous().view(batch_size, -1, self.d_model)
        return self.fc_out(attn_output)
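
To make the scaled dot-product step concrete, here is a minimal dependency-free sketch of what a single head computes. It uses plain Python lists instead of tensors; the helper names `softmax` and `scaled_dot_product_attention` and the toy values are illustrative assumptions, not part of the class above.

```python
import math

def softmax(row):
    # Subtract the max for numerical stability before exponentiating
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V are lists of vectors: one row per sequence position
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        all_weights.append(weights)
        # Weighted sum of the value vectors
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs, all_weights

# Toy example: two positions, head_dim = 2 (hypothetical values)
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
output, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` sums to 1, and each query puts most of its weight on the key it matches, so `output[0]` leans toward `V[0]`.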

In [3]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_length):
        super(PositionalEncoding, self).__init__()
        encoding = torch.zeros(max_seq_length, d_model)
        pos = torch.arange(0, max_seq_length).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * -(torch.log(torch.tensor(10000.0)) / d_model))

        # Even columns hold sines, odd columns hold cosines
        encoding[:, 0::2] = torch.sin(pos * div_term)
        encoding[:, 1::2] = torch.cos(pos * div_term)

        # Shape (1, max_seq_length, d_model). Registering the table as a buffer
        # keeps it out of the trainable parameters and moves it with .to(device).
        self.register_buffer("encoding", encoding.unsqueeze(0))

    def forward(self, x):
        # x: (batch_size, seq_length, d_model); slice the table along the sequence axis
        return x + self.encoding[:, :x.size(1), :]
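
As a sanity check on the sinusoidal table, the same values can be sketched without PyTorch. The function name `sinusoidal_encoding` and the small sizes are assumptions for illustration; the frequency term mirrors `div_term` in the class above.

```python
import math

def sinusoidal_encoding(max_seq_length, d_model):
    table = []
    for pos in range(max_seq_length):
        row = [0.0] * d_model
        for i in range(0, d_model, 2):
            # Same frequency as div_term: exp(-i * ln(10000) / d_model)
            angle = pos * math.exp(-math.log(10000.0) * i / d_model)
            row[i] = math.sin(angle)          # even index: sine
            if i + 1 < d_model:
                row[i + 1] = math.cos(angle)  # odd index: cosine
        table.append(row)
    return table

table = sinusoidal_encoding(max_seq_length=10, d_model=8)
```

Position 0 always encodes to alternating zeros and ones, because sin(0) = 0 and cos(0) = 1, and the first column at position 1 is sin(1).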

In [4]:
class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(TransformerBlock, self).__init__()
        
        # Multi-head attention layer
        self.attention = MultiHeadAttention(d_model, num_heads)
        
        # Layer normalization layers
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        
        # Position-wise feed-forward network
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )
        
        # Dropout layer
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, query, key, value):
        # Pass input through the multi-head attention layer
        attn_output = self.attention(query, key, value)
        
        # Apply residual connection and layer normalization
        out1 = self.norm1(query + self.dropout(attn_output))
        
        # Pass output through the feed-forward network
        ff_output = self.feed_forward(out1)
        
        # Apply residual connection and layer normalization
        out2 = self.norm2(out1 + self.dropout(ff_output))
        
        return out2
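
The "add and normalize" step that follows each sub-layer can be sketched on a single vector as below. This is a simplified illustration under stated assumptions: dropout is omitted (as in evaluation mode), the learnable scale and shift of `nn.LayerNorm` are left at their initial values of 1 and 0, and the helper names `layer_norm` and `add_and_norm` are hypothetical.

```python
import math

def layer_norm(x, eps=1e-5):
    # Normalize one feature vector to zero mean and unit (biased) variance
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer_out):
    # Residual connection first, then layer normalization (post-norm ordering)
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])

x = [1.0, 2.0, 3.0, 4.0]              # input to the sub-layer
sublayer_out = [0.5, -0.5, 0.5, -0.5]  # stand-in for attention or feed-forward output
y = add_and_norm(x, sublayer_out)
```

The result `y` has mean close to 0 and variance close to 1, regardless of the scale of the inputs.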

In [5]:
class Transformer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, num_layers, max_seq_length, dropout):
        super(Transformer, self).__init__()
        self.layers = nn.ModuleList([
            TransformerBlock(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])
        self.positional_encoding = PositionalEncoding(d_model, max_seq_length)

    def forward(self, x):
        x = self.positional_encoding(x)
        for layer in self.layers:
            x = layer(x, x, x)
        return x

In [14]:
# Example usage

# Set the dimension of the input and output embeddings (model's hidden size)
d_model = 64

# Set the number of attention heads in the multi-head attention mechanism
num_heads = 4

# Set the dimension of the feed-forward network's hidden layer
d_ff = 256

# Set the number of layers in the Transformer architecture
num_layers = 3

# Set the maximum sequence length for input sequences
max_seq_length = 10

# Set the dropout rate for the dropout layers in the Transformer architecture
dropout = 0.1

# Set the batch size for input data
batch_size = 1

# Generate a random input tensor of shape (batch_size, max_seq_length, d_model)
input_data = torch.randn(batch_size, max_seq_length, d_model)


transformer = Transformer(d_model, num_heads, d_ff, num_layers, max_seq_length, dropout)
output_data = transformer(input_data)
print(output_data.shape)  # Expected output shape: (batch_size, max_seq_length, d_model)


torch.Size([1, 10, 64])


- **MultiHeadAttention** is used within the `TransformerBlock` class to implement the multi-head attention mechanism.
- **PositionalEncoding** and **TransformerBlock** are used within the `Transformer` class. The `PositionalEncoding` adds positional information to the input sequences, and the `TransformerBlock` represents a single layer in the Transformer architecture. Multiple `TransformerBlock` instances are stacked together to create the full Transformer model.
- The `Transformer` class is instantiated and called at the end to build the model and process the input data.

In summary, the `Transformer` class encapsulates the full Transformer architecture, which includes the `PositionalEncoding` and multiple `TransformerBlock` layers. Each `TransformerBlock` layer uses the `MultiHeadAttention` class to perform the multi-head attention operation.


In [15]:
# Move the model to the appropriate device (e.g., GPU or CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
transformer = transformer.to(device)

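# Display the summary of the model. The input size excludes the batch dimension
# and must end in d_model, since the model expects (batch, seq_length, d_model).
summary(transformer, input_size=(max_seq_length, d_model), batch_size=-1, device=device.type)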

Layer (type:depth-idx)                   Param #
├─ModuleList: 1-1                        --
|    └─TransformerBlock: 2-1             --
|    |    └─MultiHeadAttention: 3-1      16,640
|    |    └─LayerNorm: 3-2               128
|    |    └─LayerNorm: 3-3               128
|    |    └─Sequential: 3-4              33,088
|    |    └─Dropout: 3-5                 --
|    └─TransformerBlock: 2-2             --
|    |    └─MultiHeadAttention: 3-6      16,640
|    |    └─LayerNorm: 3-7               128
|    |    └─LayerNorm: 3-8               128
|    |    └─Sequential: 3-9              33,088
|    |    └─Dropout: 3-10                --
|    └─TransformerBlock: 2-3             --
|    |    └─MultiHeadAttention: 3-11     16,640
|    |    └─LayerNorm: 3-12              128
|    |    └─LayerNorm: 3-13              128
|    |    └─Sequential: 3-14             33,088
|    |    └─Dropout: 3-15                --
├─PositionalEncoding: 1-2                --
Total params: 149,952
Trainable params: 1
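
The per-layer counts in the summary can be checked by hand. The sketch below assumes the layer shapes defined in the classes above (four `d_model x d_model` linear layers in attention, two layer norms, and a two-layer feed-forward network); the helper names `linear_params` and `block_params` are illustrative.

```python
def linear_params(n_in, n_out):
    # Weight matrix plus bias vector of an nn.Linear(n_in, n_out)
    return n_in * n_out + n_out

def block_params(d_model, d_ff):
    # Query, key, value, and output projections
    attention = 4 * linear_params(d_model, d_model)
    # Two LayerNorm modules, each with a scale and a shift vector
    layer_norms = 2 * (2 * d_model)
    # Position-wise feed-forward network: d_model -> d_ff -> d_model
    feed_forward = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)
    return attention, layer_norms, feed_forward

attention, layer_norms, feed_forward = block_params(d_model=64, d_ff=256)
total = 3 * (attention + layer_norms + feed_forward)  # num_layers = 3
print(attention, feed_forward, total)  # 16640 33088 149952
```

The positional encoding contributes nothing to the total because its table is precomputed rather than learned.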

In [13]:
# Instantiate the Transformer model
d_model = 512
num_heads = 8
d_ff = 2048
num_layers = 6
max_seq_length = 512
dropout = 0.1

model = Transformer(d_model, num_heads, d_ff, num_layers, max_seq_length, dropout)

# Move the model to the appropriate device (e.g., GPU or CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Display the summary of the model (input size excludes the batch dimension)
summary(model, input_size=(max_seq_length, d_model), batch_size=-1, device=device.type)

Layer (type:depth-idx)                   Param #
├─ModuleList: 1-1                        --
|    └─TransformerBlock: 2-1             --
|    |    └─MultiHeadAttention: 3-1      1,050,624
|    |    └─LayerNorm: 3-2               1,024
|    |    └─LayerNorm: 3-3               1,024
|    |    └─Sequential: 3-4              2,099,712
|    |    └─Dropout: 3-5                 --
|    └─TransformerBlock: 2-2             --
|    |    └─MultiHeadAttention: 3-6      1,050,624
|    |    └─LayerNorm: 3-7               1,024
|    |    └─LayerNorm: 3-8               1,024
|    |    └─Sequential: 3-9              2,099,712
|    |    └─Dropout: 3-10                --
|    └─TransformerBlock: 2-3             --
|    |    └─MultiHeadAttention: 3-11     1,050,624
|    |    └─LayerNorm: 3-12              1,024
|    |    └─LayerNorm: 3-13              1,024
|    |    └─Sequential: 3-14             2,099,712
|    |    └─Dropout: 3-15                --
|    └─TransformerBlock: 2-4             --
|    |    └
