# Introduction

Transformers are deep learning models introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. They have since become the state-of-the-art architecture for a wide range of natural language processing (NLP) tasks. Unlike earlier sequence models, which relied heavily on recurrent (RNN) or convolutional (CNN) networks, Transformers use self-attention to weigh how relevant each word in a sequence is to every other word.

# Objective

The objective of this notebook is to implement the Transformer architecture from scratch. Building it step by step lets us understand its core components and the reasoning behind each design choice. By the end, we aim to have a working model that can be trained and evaluated on NLP tasks.

# Setting Up

In this section, we set up our environment by importing PyTorch and its neural-network module, which are the only libraries the model needs.

In [1]:
import torch
import torch.nn as nn

# Implementing the Self-Attention Mechanism

The self-attention mechanism is at the heart of the Transformer architecture. It allows the model to focus on different words in the input when producing an output, assigning different attention scores to different words.
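Before the multi-head version, it may help to see the core computation on its own. The following is a minimal single-head sketch of scaled dot-product attention as given in the paper, softmax(QK^T / sqrt(d_k))V. The function name `scaled_dot_product_attention` and the toy tensor sizes are illustrative assumptions, not part of the notebook's model.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Single-head attention sketch (hypothetical helper, for illustration only).

    q, k, v: tensors of shape (batch, seq_len, d_k).
    mask: optional tensor broadcastable to (batch, q_len, k_len); 0 marks blocked positions.
    """
    d_k = q.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        # Blocked positions get a very negative score, so softmax gives them ~0 weight
        scores = scores.masked_fill(mask == 0, float("-1e20"))
    # Normalize over the key axis: each query's weights sum to 1
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights

torch.manual_seed(0)
x = torch.randn(1, 4, 8)  # one sequence of 4 tokens, 8 features each
out, weights = scaled_dot_product_attention(x, x, x)
print(out.shape)      # torch.Size([1, 4, 8])
print(weights.shape)  # torch.Size([1, 4, 4])
```

Each row of `weights` is a probability distribution over the input positions, which is what "assigning different attention scores to different words" means in practice.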

In [2]:
class SelfAttention(nn.Module):
    def __init__(self, embed_size, heads):
        super(SelfAttention, self).__init__()
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads

        # Ensure the embedding size is divisible by number of heads
        assert (
            self.head_dim * heads == embed_size
        ), "Embedding size needs to be divisible by heads"

        # Linear layers for the queries, keys, and values
        self.values = nn.Linear(embed_size, embed_size)
        self.keys = nn.Linear(embed_size, embed_size)
        self.queries = nn.Linear(embed_size, embed_size)

        # Output fully connected layer
        self.fc_out = nn.Linear(embed_size, embed_size)

    def forward(self, values, keys, query, mask):
        # Batch size (N) and the sequence lengths of the three inputs
        N = query.shape[0]
        value_len, key_len, query_len = values.shape[1], keys.shape[1], query.shape[1]

        # Transform the input using the query, key, and value linear layers
        values = self.values(values)
        keys = self.keys(keys)
        queries = self.queries(query)

        # Split the embedding size into the number of heads
        # This allows for multiple attention scores
        values = values.reshape(N, value_len, self.heads, self.head_dim)
        keys = keys.reshape(N, key_len, self.heads, self.head_dim)
        queries = queries.reshape(N, query_len, self.heads, self.head_dim)

        # Scaled dot-product attention: energy has shape (N, heads, query_len, key_len)
        energy = torch.einsum("nqhd,nkhd->nhqk", [queries, keys])
        if mask is not None:
            energy = energy.masked_fill(mask == 0, float("-1e20"))

        # Scale by sqrt(head_dim), the per-head key dimension d_k used in the paper,
        # then normalize over the key axis so each query's weights sum to 1
        attention = torch.softmax(energy / (self.head_dim ** (1 / 2)), dim=3)

        # Multiply the attention scores with the values to get the final output
        out = torch.einsum("nhql,nlhd->nqhd", [attention, values]).reshape(
            N, query_len, self.heads * self.head_dim
        )

        # Apply the output linear layer
        out = self.fc_out(out)
        return out
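The `einsum` string `"nqhd,nkhd->nhqk"` used in the cell above can be hard to read at first. As a sanity check, the sketch below compares it with an explicit permute-and-matmul formulation on random tensors. The tensor sizes are arbitrary illustrative values.

```python
import torch

torch.manual_seed(0)
N, query_len, key_len, heads, head_dim = 2, 3, 5, 4, 8  # illustrative sizes

queries = torch.randn(N, query_len, heads, head_dim)
keys = torch.randn(N, key_len, heads, head_dim)

# einsum form: for each batch n and head h, dot every query q with every key k over d
energy_einsum = torch.einsum("nqhd,nkhd->nhqk", queries, keys)

# Equivalent explicit form: move heads next to batch, then batched matrix multiply
q = queries.permute(0, 2, 1, 3)  # (N, heads, query_len, head_dim)
k = keys.permute(0, 2, 3, 1)     # (N, heads, head_dim, key_len)
energy_matmul = q @ k            # (N, heads, query_len, key_len)

print(energy_einsum.shape)  # torch.Size([2, 4, 3, 5])
print(torch.allclose(energy_einsum, energy_matmul, atol=1e-4))
```

The two results agree up to floating-point rounding; the `einsum` version simply avoids the intermediate permutes.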

In [3]:
class TransformerBlock(nn.Module):
    def __init__(self, embed_size, heads, dropout, forward_expansion):
        super(TransformerBlock, self).__init__()
        self.attention = SelfAttention(embed_size, heads)
        self.norm1 = nn.LayerNorm(embed_size)
        self.norm2 = nn.LayerNorm(embed_size)

        self.feed_forward = nn.Sequential(
            nn.Linear(embed_size, forward_expansion * embed_size),
            nn.ReLU(),
            nn.Linear(forward_expansion * embed_size, embed_size),
        )

        self.dropout = nn.Dropout(dropout)

    def forward(self, value, key, query, mask):
        attention = self.attention(value, key, query, mask)

        # Add skip connection, run through normalization and finally dropout
        x = self.dropout(self.norm1(attention + query))
        forward = self.feed_forward(x)
        out = self.dropout(self.norm2(forward + x))
        return out
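The "add and normalize" step in the block above can be looked at in isolation. The sketch below applies a feed-forward network with a residual connection followed by `nn.LayerNorm`, and checks that the shape is preserved and that each token vector ends up with roughly zero mean. The sizes are illustrative assumptions, and dropout is omitted to keep the output deterministic.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_size, forward_expansion = 8, 4  # illustrative sizes

norm = nn.LayerNorm(embed_size)
feed_forward = nn.Sequential(
    nn.Linear(embed_size, forward_expansion * embed_size),  # expand
    nn.ReLU(),
    nn.Linear(forward_expansion * embed_size, embed_size),  # project back
)

x = torch.randn(2, 3, embed_size)  # (batch, seq_len, embed_size)

# Residual connection first, then layer normalization over the feature axis
out = norm(feed_forward(x) + x)

print(out.shape)          # torch.Size([2, 3, 8])
print(out.mean(dim=-1))   # close to 0 for every token
```

Because the residual path requires matching shapes, the feed-forward network must map back to `embed_size`, which is why the second linear layer undoes the expansion.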

# Building the Encoder

The encoder maps the input sequence to a sequence of contextual representations, one vector per input token, which serves as the 'context' or 'memory' that the decoder attends to. It consists of a stack of layers, each combining self-attention with a feed-forward neural network.
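Self-attention on its own has no notion of word order, so the encoder in this notebook adds a learned position embedding to each token embedding before the first layer. The following is a small sketch of that input step alone; the vocabulary size, embedding size, and token ids are illustrative assumptions.

```python
import torch
import torch.nn as nn

vocab_size, max_length, embed_size = 10, 100, 16  # illustrative values

word_embedding = nn.Embedding(vocab_size, embed_size)
position_embedding = nn.Embedding(max_length, embed_size)

tokens = torch.tensor([[1, 5, 6, 4], [1, 8, 7, 3]])  # (N, seq_len) token ids
N, seq_len = tokens.shape

# Position indices 0..seq_len-1, repeated for every sequence in the batch
positions = torch.arange(seq_len).expand(N, seq_len)

# Token identity and position are combined by simple addition
encoder_input = word_embedding(tokens) + position_embedding(positions)
print(encoder_input.shape)  # torch.Size([2, 4, 16])
```

Note that `max_length` bounds the longest sequence the model can handle, since positions beyond it have no embedding row.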

In [4]:

class Encoder(nn.Module):
    def __init__(
        self,
        src_vocab_size,
        embed_size,
        num_layers,
        heads,
        device,
        forward_expansion,
        dropout,
        max_length
    ):
        super(Encoder, self).__init__()

        # Set the embedding size and device
        self.embed_size = embed_size
        self.device = device

        # Create word and position embeddings
        self.word_embedding = nn.Embedding(src_vocab_size, embed_size)
        self.position_embedding = nn.Embedding(max_length, embed_size)

        # Initialize encoder layers
        self.layers = nn.ModuleList(
            [
                TransformerBlock(
                    embed_size,
                    heads,
                    dropout=dropout,
                    forward_expansion=forward_expansion
                )
                for _ in range(num_layers)
            ]
        )

        # Dropout layer
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        # Get number of training examples and sequence length
        N, seq_length = x.shape

        # Create position indices 0..seq_length-1 for every sequence in the batch
        positions = torch.arange(0, seq_length).expand(N, seq_length).to(self.device)

        # Combine word embeddings and position embeddings
        out = self.dropout(
            (self.word_embedding(x) + self.position_embedding(positions))
        )

        # Pass the input through the encoder layers
        for layer in self.layers:
            out = layer(out, out, out, mask)

        return out

# Constructing the Decoder

The decoder generates the output sequence, using the context provided by the encoder. Like the encoder, it is a stack of layers, but each layer has two attention steps: a masked self-attention over the target tokens, followed by an encoder-decoder attention layer whose queries come from the decoder and whose keys and values come from the encoder output.
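The decoder's self-attention must not look at future target tokens. A lower-triangular mask is one common way to enforce this, and the sketch below shows its effect on a toy score matrix. Uniform zero scores are an assumption chosen so the resulting weights are easy to read.

```python
import torch

trg_len = 4  # illustrative target length

# Lower-triangular matrix: row i has ones at columns 0..i, zeros for future positions
causal_mask = torch.tril(torch.ones(trg_len, trg_len))
print(causal_mask)

# Toy attention scores (all equal) so the effect of the mask is easy to see
scores = torch.zeros(trg_len, trg_len)
masked_scores = scores.masked_fill(causal_mask == 0, float("-1e20"))

# After softmax, position i spreads its attention evenly over positions 0..i only
weights = torch.softmax(masked_scores, dim=-1)
print(weights)
```

The first row puts all of its weight on position 0, the second row splits it evenly between positions 0 and 1, and so on; future positions always receive zero weight.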

In [5]:

class DecoderBlock(nn.Module):
    def __init__(self, embed_size, heads, forward_expansion, dropout, device):
        super(DecoderBlock, self).__init__()

        # Layer normalization
        self.norm = nn.LayerNorm(embed_size)

        # Self attention mechanism for the decoder block
        self.attention = SelfAttention(embed_size, heads=heads)

        # Transformer block used here for encoder-decoder (cross) attention followed by a feed-forward network
        self.transformer_block = TransformerBlock(
            embed_size, heads, dropout, forward_expansion
        )

        # Dropout layer
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, value, key, src_mask, trg_mask):
        # Apply self attention
        attention = self.attention(x, x, x, trg_mask)

        # Add and normalize
        query = self.dropout(self.norm(attention + x))

        # Pass through the transformer block
        out = self.transformer_block(value, key, query, src_mask)
        return out

class Decoder(nn.Module):
    def __init__(
        self,
        trg_vocab_size,
        embed_size,
        num_layers,
        heads,
        forward_expansion,
        dropout,
        device,
        max_length,
    ):
        super(Decoder, self).__init__()

        # Set device and initialize word and position embeddings
        self.device = device
        self.word_embedding = nn.Embedding(trg_vocab_size, embed_size)
        self.position_embedding = nn.Embedding(max_length, embed_size)

        # Create multiple decoder blocks
        self.layers = nn.ModuleList(
            [
                DecoderBlock(embed_size, heads, forward_expansion, dropout, device)
                for _ in range(num_layers)
            ]
        )

        # Fully connected output layer
        self.fc_out = nn.Linear(embed_size, trg_vocab_size)

        # Dropout layer
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_out, src_mask, trg_mask):
        # Get batch size and sequence length, then create position indices
        N, seq_length = x.shape
        positions = torch.arange(0, seq_length).expand(N, seq_length).to(self.device)

        # Combine word embeddings and position embeddings
        x = self.dropout((self.word_embedding(x) + self.position_embedding(positions)))

        # Pass the input through the decoder layers
        for layer in self.layers:
            x = layer(x, enc_out, enc_out, src_mask, trg_mask)

        # Apply the output linear layer
        out = self.fc_out(x)
        return out

# Piecing Together the Transformer

In this section, we combine our encoder and decoder to form the complete Transformer architecture. The encoder processes the input and the decoder uses this processed input to generate the output.

In [6]:
class Transformer(nn.Module):
    def __init__(
        self,
        src_vocab_size,
        trg_vocab_size,
        src_pad_idx,
        trg_pad_idx,
        embed_size=512,
        num_layers=6,
        forward_expansion=4,
        heads=8,
        dropout=0,
        device="cpu",
        max_length=100,
    ):
        super(Transformer, self).__init__()

        # Encoder to process the source sequence
        self.encoder = Encoder(
            src_vocab_size,
            embed_size,
            num_layers,
            heads,
            device,
            forward_expansion,
            dropout,
            max_length,
        )

        # Decoder to generate the target sequence
        self.decoder = Decoder(
            trg_vocab_size,
            embed_size,
            num_layers,
            heads,
            forward_expansion,
            dropout,
            device,
            max_length,
        )

        # Define padding indices for source and target sequences
        self.src_pad_idx = src_pad_idx
        self.trg_pad_idx = trg_pad_idx
        self.device = device

    def make_src_mask(self, src):
        # Create mask for source sequence to ignore padding tokens
        src_mask = (src != self.src_pad_idx).unsqueeze(1).unsqueeze(2)
        return src_mask.to(self.device)

    def make_trg_mask(self, trg):
        # Create mask for target sequence to avoid attending to future tokens
        N, trg_len = trg.shape
        trg_mask = torch.tril(torch.ones((trg_len, trg_len))).expand(
            N, 1, trg_len, trg_len
        )
        return trg_mask.to(self.device)

    def forward(self, src, trg):
        # Create masks
        src_mask = self.make_src_mask(src)
        trg_mask = self.make_trg_mask(trg)

        # Process source sequence with encoder
        enc_src = self.encoder(src, src_mask)

        # Generate target sequence with decoder
        out = self.decoder(trg, enc_src, src_mask, trg_mask)
        return out
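The source mask built in `make_src_mask` has shape `(N, 1, 1, src_len)`, which broadcasts across heads and query positions when it is applied to the attention scores. The sketch below illustrates that broadcast on toy data; the token ids, head count, and uniform scores are illustrative assumptions.

```python
import torch

src_pad_idx = 0
# Two source sequences; the first ends with two padding tokens
src = torch.tensor([[1, 5, 6, 0, 0], [1, 8, 7, 3, 2]])

# True for real tokens, False for padding; shape (N, 1, 1, src_len)
src_mask = (src != src_pad_idx).unsqueeze(1).unsqueeze(2)
print(src_mask.shape)  # torch.Size([2, 1, 1, 5])

N, heads, query_len, key_len = 2, 8, 5, src.shape[1]
energy = torch.zeros(N, heads, query_len, key_len)  # toy uniform scores

# The mask broadcasts over the heads and query axes
masked = energy.masked_fill(src_mask == 0, float("-1e20"))
weights = torch.softmax(masked, dim=3)

print(weights[0, 0, 0])  # padding columns get zero weight
print(weights[1, 0, 0])  # no padding, so weight is spread over all 5 tokens
```

For the first sequence, every query in every head distributes its attention over the three real tokens only.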

# Testing the Transformer

In [7]:
if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(device)

    x = torch.tensor([[1, 5, 6, 4, 3, 9, 5, 2, 0], [1, 8, 7, 3, 4, 5, 6, 7, 2]]).to(
        device
    )
    trg = torch.tensor([[1, 7, 4, 3, 5, 9, 2, 0], [1, 5, 6, 2, 4, 7, 6, 2]]).to(device)

    src_pad_idx = 0
    trg_pad_idx = 0
    src_vocab_size = 10
    trg_vocab_size = 10
    model = Transformer(src_vocab_size, trg_vocab_size, src_pad_idx, trg_pad_idx, device=device).to(
        device
    )
    out = model(x, trg[:, :-1])
    print(out.shape)

cpu
torch.Size([2, 7, 10])


# Conclusion

In this notebook, we built a Transformer model from scratch: multi-head self-attention, the encoder and decoder stacks, and the masking logic that ties them together. We then ran a forward pass on a small toy batch to confirm that the model produces an output of the expected shape.

Building the Transformer step by step has given us insight into its design and functioning. The self-attention mechanism's ability to weigh the importance of different words in a sequence is a significant advantage, and it is central to the state-of-the-art performance Transformers achieve on various NLP tasks.

# References

- [Attention Is All You Need](https://arxiv.org/abs/1706.03762) - The original Transformer paper.