<a href="https://colab.research.google.com/github/arkeodev/nlp/blob/main/Simple_Transformers/simple_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transformer Architecture

This notebook implements the Transformer architecture in PyTorch, one component at a time. The model, introduced in the paper "Attention is All You Need", consists of an encoder and a decoder, each built from a stack of layers that combine attention with feed-forward neural networks.

This guide and explanation of the Transformer architecture draws extensively from "The Annotated Transformer" by Austin Huang, Suraj Subramanian, Jonathan Sum, Khalid Almubarak, and Stella Biderman, building upon the original work by Sasha Rush. This comprehensive article offers a detailed line-by-line implementation of the Transformer model, providing insights into its inner workings and illustrating the theoretical concepts with practical code examples. It serves as an invaluable resource for understanding and implementing the Transformer model.

## Time Complexity Comparison

<figure>
    <img src="https://raw.githubusercontent.com/arkeodev/nlp/main/Simple_Transformers/images/time-complexity-of-transformers.png" width="1000" height="300" alt="Time complexity">
    <figcaption>Time complexity</figcaption>
</figure>

The table from "Attention is All You Need" (Vaswani et al., 2017) compares different neural network layer types across three dimensions: complexity per layer, number of sequential operations, and maximum path length:

- **Self-Attention Layer**:
  - Computational complexity per layer is \( O(n^2 \cdot d) \), meaning it scales quadratically with the sequence length and linearly with the dimensionality of the representation.
  - Sequential operations are constant (\( O(1) \)), allowing for high parallelization during training and inference.
  - Maximum path length is also constant (\( O(1) \)), ensuring that long-range dependencies can be learned effectively as any position in the sequence can directly attend to any other position.

- **Recurrent Layer**:
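  - Complexity is \( O(n \cdot d^2) \), which scales linearly with the sequence length but quadratically with the dimensionality of the representation. Self-attention is therefore cheaper per layer when the sequence length \( n \) is smaller than \( d \), and more expensive when \( n \) is larger.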
  - Requires \( O(n) \) sequential operations, limiting parallelization and potentially increasing training time.
  - Has a linear maximum path length (\( O(n) \)), which can make learning long-range dependencies more challenging.

- **Convolutional Layer**:
  - Shows a complexity of \( O(k \cdot n \cdot d^2) \), where \( k \) is the kernel size, suggesting an increase in computation if larger receptive fields are needed.
  - Has constant sequential operations (\( O(1) \)), similar to self-attention, facilitating parallelization.
  - Exhibits a logarithmic maximum path length (\( O(\log_k(n)) \)), improving the learning of dependencies over recurrent layers but potentially less effective than self-attention for very long sequences.

- **Restricted Self-Attention Layer (or Local Attention)**:
  - Complexity per layer is reduced to \( O(r \cdot n \cdot d) \) by restricting attention to a window of \( r \) surrounding tokens, balancing computational efficiency and the ability to capture dependencies.
  - Maintains constant sequential operations (\( O(1) \)), benefiting parallelization.
  - Features a maximum path length of \( O(n/r) \), indicating a compromise in the ability to learn dependencies as the sequence length increases compared to full self-attention.
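
To make the comparison concrete, the per-layer costs from the table can be evaluated for a few sequence lengths. This is only a rough sketch: constant factors are ignored, `layer_costs` is a hypothetical helper, and the defaults `k=3` and `r=64` are illustrative assumptions rather than values from the paper.

```python
def layer_costs(n, d, k=3, r=64):
    """Per-layer operation counts from the table, ignoring constant factors."""
    return {
        "self_attention": n * n * d,             # O(n^2 * d)
        "recurrent": n * d * d,                  # O(n * d^2)
        "convolutional": k * n * d * d,          # O(k * n * d^2)
        "restricted_self_attention": r * n * d,  # O(r * n * d)
    }

d = 512
for n in (64, 4096):
    costs = layer_costs(n, d)
    ratio = costs["self_attention"] / costs["recurrent"]
    # The ratio simplifies to n / d: below 1 self-attention is cheaper,
    # above 1 the recurrent layer is cheaper.
    print(f"n={n}: self-attention / recurrent = {ratio}")
```

With `d = 512`, self-attention does less work per layer than a recurrent layer at `n = 64` and more at `n = 4096`, which is the crossover at `n = d` implied by the two complexity expressions.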

Let's dive into the self-attention mechanism, which forms the core component of transformers.

## Self-Attention

Self-attention lets a Transformer process all positions of a sequence in parallel. For each element of the input sequence, it computes a weight for every element in the sequence (including itself), based on how relevant that element is.

### Description of Self Attention

Self-attention can be described with three main components: Queries, Keys, and Values.

- **Queries**: Vectors that are matched against the keys to decide which elements of the sequence are most relevant.
- **Keys**: Vectors paired one-to-one with the values; each query is compared against every key to produce a relevance score.
- **Values**: Vectors that hold the actual information of each element; they are combined using the weights derived from the query-key scores.

Imagine you are in a library with a huge collection of books (the sequence), and you are looking for information on a specific topic.

- The **query** is like your question about the topic you’re interested in.
- The **keys** represent the index or summary of each book.
- The **values** are the actual contents of the books.

The librarian (the self-attention mechanism) checks your question against all summaries (keys) to determine which books (values) have the information you need. This process is done simultaneously for all the questions in parallel, which is what makes the transformer model so powerful and efficient.

### Alternative Attention Approaches

- **Additive Attention**: While the scaled dot-product attention is widely used, alternative attention mechanisms exist, such as additive (or Bahdanau) attention, which uses a feed-forward network to compute attention scores instead of dot products. Each variant has its own advantages and use cases.

- **Multi-Head Attention**: The Transformer model extends the basic scaled dot-product attention mechanism through the use of multi-head attention, where the model runs several attention processes in parallel. This approach allows the model to simultaneously attend to information from different representation subspaces at different positions, improving the model's ability to capture various types of relationships in the data.

- **Self-Attention vs. Cross-Attention**: The module as described implements self-attention, where queries, keys, and values all come from the same source. In cross-attention settings, keys and values can come from a different source than queries, enabling interactions between different sequences (e.g., between an encoder and a decoder in a sequence-to-sequence model).
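
As a point of comparison with dot-product scoring, additive attention can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the function name, the projection matrices `W_q` and `W_k`, the vector `v`, and the hidden size `d_a` are all hypothetical and randomly initialized here, whereas in a real model they would be learned.

```python
import numpy as np

def additive_attention(Q, K, V, W_q, W_k, v):
    """Additive (Bahdanau-style) scoring: score(q, k) = v^T tanh(W_q q + W_k k)."""
    q_proj = Q @ W_q  # (m, d_a)
    k_proj = K @ W_k  # (n, d_a)
    # Broadcast so every query is paired with every key: (m, n, d_a)
    hidden = np.tanh(q_proj[:, None, :] + k_proj[None, :, :])
    scores = hidden @ v  # (m, n)
    # Numerically stable softmax over the keys
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = exp / exp.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
m, n, d, d_a, d_v = 2, 3, 4, 5, 6
Q = rng.normal(size=(m, d))
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d_v))
W_q = rng.normal(size=(d, d_a))
W_k = rng.normal(size=(d, d_a))
v = rng.normal(size=d_a)

out, weights = additive_attention(Q, K, V, W_q, W_k, v)
print(out.shape, weights.shape)  # (2, 6) (2, 3)
```

Only the scoring function differs from dot-product attention; the softmax over keys and the weighted sum of values are the same.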

### Mathematical Foundation of Attention

The following formula represents the scaled dot-product attention, which is the foundation of self-attention:

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

**Components and Dimensions:**

- $( Q )$: Queries matrix of dimension $( m \times d_k )$, where $( m )$ is the number of queries and $( d_k )$ is the dimension of each query vector.
- $( K )$: Keys matrix of dimension $( n \times d_k )$, where $( n )$ is the number of keys (and values) and $( d_k )$ is the dimension of each key vector.
- $( V )$: Values matrix of dimension $( n \times d_v )$, where $( n )$ is the number of values and $( d_v )$ is the dimension of each value vector.
- $( d_k )$: Dimension of the keys (and queries), used for scaling the dot product scores.

**1. Dot Product of $( Q )$ and $( K^T )$**

- $( QK^T )$: Since $( Q )$ is $( m \times d_k )$ and $( K )$ is $( n \times d_k )$, when we take the transpose of $( K )$ to get $( K^T )$ which is $( d_k \times n )$, the resulting dot product $( QK^T )$ will be a matrix of dimension $( m \times n )$. Each element of this matrix represents a score reflecting the relevance of a query to a key.

- This step calculates the similarity between all queries and all keys, serving as the basis for determining how the values should be weighted according to each query.

**2. Scaling by $( \sqrt{d_k} )$**

- The scores in the $( m \times n )$ matrix are scaled down by $( \sqrt{d_k} )$, which doesn't change the dimensions of the matrix but normalizes the scores to prevent them from becoming too large.

- Scaling helps maintain numerical stability, particularly in the softmax step, by ensuring the scores don't lead to extremely small gradients when the dimensionality $( d_k )$ is large.

**3. Softmax Function**

- The softmax function is applied across the rows of the scaled $( m \times n )$ matrix. The dimensionality of the output remains $( m \times n )$, but now each row sums to 1, representing probabilities.

- Converting the scores to probabilities allows the model to probabilistically decide which keys (and thus, which values) are most relevant to each query, enabling a soft selection mechanism.

**4. Multiplication by $( V )$**

- The softmax output, still $( m \times n )$, is then multiplied by $( V )$, which is $( n \times d_v )$. The resulting matrix will have dimensions $( m \times d_v )$, representing the final output of the attention mechanism.

- This step effectively combines the values based on their relevance to each query, as determined by the weighted scores. The final dimension $( m \times d_v )$ corresponds to the weighted sum of values for each query, now ready for further processing in the model.
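
The four steps above can be traced on small random matrices to check the dimensions. The sizes `m=2`, `n=3`, `d_k=4`, `d_v=5` are arbitrary values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d_k, d_v = 2, 3, 4, 5
Q = rng.normal(size=(m, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))

scores = Q @ K.T                 # step 1: (m, n) query-key similarities
scaled = scores / np.sqrt(d_k)   # step 2: same shape, scaled by sqrt(d_k)
# step 3: row-wise softmax (subtracting the row max for numerical stability)
exp = np.exp(scaled - scaled.max(axis=-1, keepdims=True))
weights = exp / exp.sum(axis=-1, keepdims=True)
output = weights @ V             # step 4: (m, d_v) weighted sums of values

print(scores.shape, weights.shape, output.shape)  # (2, 3) (2, 3) (2, 5)
print(weights.sum(axis=-1))      # each row sums to 1, up to rounding
```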


### Sample Implementation

In [28]:
import numpy as np

# Define the softmax function
def softmax(x, axis=-1):
    """Compute softmax values for each set of scores in x over the specified axis."""
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / np.sum(e_x, axis=axis, keepdims=True)

# Define the scaled dot product attention function
def scaled_dot_product_attention(Q, K, V):
    matmul_qk = np.matmul(Q, K.transpose(0, 2, 1))
    dk = K.shape[-1]
    scaled_attention_logits = matmul_qk / np.sqrt(dk)
    attention_weights = softmax(scaled_attention_logits)
    output = np.matmul(attention_weights, V)
    return output, attention_weights

# Let's consider two sentences for our demonstration
sentences = ["The cat sits on the mat", "A dog lies on the rug"]

# Define embeddings for each unique word in the sentences
# Embeddings are crafted so that similar words have closer embeddings
word_embeddings = {
    "the": np.array([1, 0, 0, 0, 0, 0, 0, 0]),
    "cat": np.array([0, 1, 0, 0, 0.1, 0.2, 0.3, 0.4]),
    "sits": np.array([0, 0.9, 1, 0, 0.2, 0.1, 0.4, 0.3]),
    "on": np.array([0, 0, 0, 1, 0, 0, 0, 0]),
    "mat": np.array([0, 0.8, 0.6, 0.3, 0, 0.6, 0.3, 0.2]),
    "a": np.array([1, 0, 0, 0, 0, 0, 0, 0.1]),
    "dog": np.array([0, 0.9, 0.1, 0, 0, 0.3, 0.4, 0.3]),
    "lies": np.array([0, 1, 0.8, 0.1, 0.3, 0.1, 0.4, 0.2]),
    "rug": np.array([0, 0.9, 0.6, 0.3, 0, 0.5, 0.3, 0.1])
}

# Tokenize the sentences
tokenized_sentences = [sentence.lower().split() for sentence in sentences]

# Convert sentences to embeddings using the predefined embeddings
sentence_embeddings = [[word_embeddings[word] for word in sentence] for sentence in tokenized_sentences]

# Pad sentences to the same length
max_length = max(len(sentence) for sentence in tokenized_sentences)
embedding_dim = 8  # must match the length of the vectors in word_embeddings
padded_embeddings = [np.array(sentence + [np.zeros(embedding_dim)] * (max_length - len(sentence))) for sentence in sentence_embeddings]

# Stack the embeddings to create the Q, K, and V matrices
Q = np.array([np.vstack(sentence) for sentence in padded_embeddings])
K = np.array([np.vstack(sentence) for sentence in padded_embeddings])
V = np.array([np.vstack(sentence) for sentence in padded_embeddings])

# Apply the scaled dot product attention function
attention_output, attention_weights = scaled_dot_product_attention(Q, K, V)

# Let's print the attention weights for the first sentence
print("Attention weights for the first sentence:")
print(attention_weights[0])

# Let's print the attention weights for the second sentence
print("Attention weights for the second sentence:")
print(attention_weights[1])

Attention weights for the first sentence:
[[0.20795408 0.14602296 0.14602296 0.14602296 0.20795408 0.14602296]
 [0.1320772  0.20914044 0.20045296 0.1320772  0.1320772  0.194175  ]
 [0.11958619 0.18149541 0.25215275 0.11958619 0.11958619 0.20759326]
 [0.15299844 0.15299844 0.15299844 0.21788799 0.15299844 0.17011824]
 [0.20795408 0.14602296 0.14602296 0.14602296 0.20795408 0.14602296]
 [0.12397355 0.18226132 0.21520941 0.13784561 0.12397355 0.21673656]]
Attention weights for the second sentence:
[[0.20789086 0.14701445 0.1464956  0.14546337 0.20715715 0.14597857]
 [0.13342497 0.19895022 0.20393542 0.13201726 0.13201726 0.19965486]
 [0.12073931 0.18519945 0.23888728 0.12420309 0.11988857 0.2110823 ]
 [0.15216063 0.15216063 0.15763656 0.21669485 0.15216063 0.16918669]
 [0.20795408 0.14602296 0.14602296 0.14602296 0.20795408 0.14602296]
 [0.12305363 0.18544201 0.21589025 0.13633987 0.12261934 0.21665489]]


**For the first sentence "The cat sits on the mat":**

1. The first "the" (first row) puts its highest weights (0.2079) on the two occurrences of "the" (first and fifth columns) and a uniform 0.1460 on every other word. The embedding of "the" is orthogonal to all the other embeddings in this sentence, so its only non-zero dot products are with itself.

2. "cat" (second row) has its highest weight on itself (0.2091), followed by "sits" (0.2005) and "mat" (0.1942), the words whose hand-crafted embeddings overlap most with its own.

3. "sits" (third row) has its highest weight on itself (0.2522), followed by "mat" (0.2076) and "cat" (0.1815).

4. "mat" (sixth row) attends most to itself (0.2167), with "sits" a close second (0.2152) and "cat" third (0.1823).

**For the second sentence "A dog lies on the rug":**

1. "dog" (second row) gives its highest weight to "lies" (0.2039), followed by "rug" (0.1997) and itself (0.1990).

2. "lies" (third row) has the highest weight on itself (0.2389), followed by "rug" (0.2111) and "dog" (0.1852).

3. "rug" (sixth row), much like "mat" in the first sentence, attends most to itself (0.2167), then to "lies" (0.2159) and "dog" (0.1854).

In both sentences the largest weights fall between words whose embeddings are most similar. Because this example uses hand-crafted embeddings directly as Q, K, and V, with no learned projections, the weights reflect embedding similarity rather than learned grammatical structure. In a trained Transformer, the learned query, key, and value projections are what allow attention to capture relationships such as those between subjects, verbs, and objects.

# Implementing the Model

## Preliminary Setup

First, ensure all required packages are installed and import necessary libraries.

In [32]:
import torch
import torch.nn as nn
import torch.nn.functional as F

import math
import copy

### Helper Functions

In [36]:
# Helper function to produce N identical layers
def clones(module, N):
  "Produce N identical layers."
  return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

## Core Components of the Transformer

### Scaled Dot-Product Attention

The core of the transformer model is the attention mechanism.

<figure>
    <img src="https://raw.githubusercontent.com/arkeodev/nlp/main/Simple_Transformers/images/attention.png" width="180" height="300" alt="Attention Mechanism">
    <figcaption>Attention Mechanism</figcaption>
</figure>

In [45]:
class ScaledDotProductAttention(nn.Module):
    def __init__(self, dropout=None):
        super(ScaledDotProductAttention, self).__init__()
        # Initialize dropout for attention scores, if provided.
        self.dropout = nn.Dropout(dropout) if dropout is not None else None

    def forward(self, query, key, value, mask=None):
        # Dimension of the query/key vectors, used for scaling down the dot products.
        d_k = query.size(-1)
        # Compute the dot products between queries and keys for each batch and head.
        scores = torch.matmul(query, key.transpose(-2, -1))
        # Scale the results
        scaled_score = scores / math.sqrt(d_k)
        # Fill masked positions (mask == 0) with a large negative value so that
        # they receive near-zero probability after the softmax.
        if mask is not None:
            scaled_score = scaled_score.masked_fill(mask == 0, -1e9)
        # Apply softmax to obtain attention probabilities.
        p_attn = F.softmax(scaled_score, dim=-1)
        # Optionally apply dropout to the attention probabilities.
        if self.dropout is not None:
            p_attn = self.dropout(p_attn)
        # Weight the values by the computed attention probabilities.
        return torch.matmul(p_attn, value), p_attn

The `ScaledDotProductAttention` module implements the attention mechanism that is at the heart of the Transformer model's effectiveness. Its roles and effects on the structure are as follows:

- **Attention Calculation**: This module computes the attention scores by taking the dot product of the query with the key. It scales the dot product by the square root of the dimensionality of the key to prevent extremely small gradients when the dimensionality is large. This scaling helps stabilize gradients during training.

- **Masking**: The optional mask allows the model to selectively ignore certain positions within the input sequence. This is particularly useful for masking out padding tokens in the input sequence or for preventing the model from peeking at future tokens when processing sequences in an autoregressive manner (e.g., during training a language model).

- **Softmax**: The softmax function converts the attention scores into probabilities, ensuring that they are non-negative and sum up to 1. This step allows the model to essentially "focus" on the most relevant parts of the input.

- **Dropout**: Dropout is applied to the attention probabilities, after the softmax. It is a regularization technique that helps prevent overfitting by randomly zeroing out some of the weights before they are used to combine the values.

- **Output**: The module outputs the weighted sum of the value vectors, scaled by the attention probabilities, which is then used in subsequent layers of the model. It also returns the attention probabilities themselves, which can be useful for analysis or for visualizing the model's attention.
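
The effect of masking can be seen in isolation with a causal mask on random inputs. This is a small sketch, not part of the model code: it repeats the same scale, `masked_fill`, and softmax steps outside the module, and the sizes are arbitrary.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_k = 4, 8
q = torch.randn(1, seq_len, d_k)
k = torch.randn(1, seq_len, d_k)

# Causal mask: position i may attend only to positions <= i (True = keep).
mask = torch.tril(torch.ones(1, seq_len, seq_len)).bool()

scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
scores = scores.masked_fill(mask == 0, -1e9)  # same fill value as in the module
p_attn = F.softmax(scores, dim=-1)

# Entries above the diagonal are zero, and each row still sums to 1.
print(p_attn[0])
```

The first row has a single allowed position, so all of its probability mass lands on the first token.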



### Multi-Head Attention

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.

<figure>
    <img src="https://raw.githubusercontent.com/arkeodev/nlp/main/Simple_Transformers/images/mutl-head-attention.png" width="180" height="300" alt="Multi Head Attention">
    <figcaption>Multi Head Attention</figcaption>
</figure>

In [47]:
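class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        super(MultiHeadedAttention, self).__init__()
        # d_model is split evenly across the h heads.
        assert d_model % h == 0
        self.d_k = d_model // h
        self.h = h
        # Four projections: query, key, value, and the final output projection.
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        # Dropout is applied to the attention probabilities inside this module.
        self.attn = ScaledDotProductAttention(dropout)

    def forward(self, query, key, value, mask=None):
        if mask is not None:
            # Add a head dimension so the same mask is broadcast to all h heads.
            mask = mask.unsqueeze(1)
        nbatches = query.size(0)
        # 1) Project, then reshape (batch, seq, d_model) -> (batch, h, seq, d_k).
        query, key, value = [
            l(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
            for l, x in zip(self.linears, (query, key, value))
        ]
        # 2) Apply attention to all heads in parallel.
        x, attn_weights = self.attn(query, key, value, mask=mask)
        # 3) Concatenate the heads back into (batch, seq, d_model).
        x = x.transpose(1, 2).contiguous().view(nbatches, -1, self.h * self.d_k)
        # 4) Apply the final output projection.
        return self.linears[-1](x)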
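
The `view` and `transpose` calls in `forward` are the least obvious part of this class. The sketch below isolates them on a tensor of consecutive integers, with arbitrary small sizes, to show how the heads are split and merged.

```python
import torch

nbatches, seq_len, h, d_k = 2, 5, 4, 3
d_model = h * d_k
x = torch.arange(nbatches * seq_len * d_model, dtype=torch.float32)
x = x.view(nbatches, seq_len, d_model)

# Split: (batch, seq, d_model) -> (batch, seq, h, d_k) -> (batch, h, seq, d_k)
heads = x.view(nbatches, -1, h, d_k).transpose(1, 2)
print(heads.shape)  # torch.Size([2, 4, 5, 3])

# Head j holds the j-th contiguous slice of d_k features at every position.
assert torch.equal(heads[:, 1], x[:, :, d_k:2 * d_k])

# Merge: undo the transpose, then flatten the heads back into d_model.
merged = heads.transpose(1, 2).contiguous().view(nbatches, -1, h * d_k)
assert torch.equal(merged, x)
```

The `contiguous()` call is needed because `transpose` returns a non-contiguous view, and `view` requires a compatible memory layout.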

### Position-wise Feed-Forward Networks

Each layer in the encoder and decoder contains a fully connected feed-forward network.

In [9]:
class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.w_2(self.dropout(F.relu(self.w_1(x))))
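
"Position-wise" means the same two linear layers are applied to every position independently. A quick way to check this, sketched here with small arbitrary sizes and dropout left out so the result is deterministic, is to compare one pass over the whole sequence with a pass over each position separately.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d_model, d_ff = 8, 32
w_1 = nn.Linear(d_model, d_ff)
w_2 = nn.Linear(d_ff, d_model)

def ffn(x):
    # FFN(x) = max(0, x W_1 + b_1) W_2 + b_2 (dropout omitted in this sketch)
    return w_2(F.relu(w_1(x)))

x = torch.randn(2, 5, d_model)  # (batch, seq, d_model)
whole = ffn(x)
per_position = torch.stack([ffn(x[:, t]) for t in range(x.size(1))], dim=1)

# The two results agree up to floating-point rounding.
print(torch.allclose(whole, per_position, atol=1e-5))
```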

### Embeddings and Softmax


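Learned embeddings convert the input tokens and output tokens to vectors of dimension `d_model`. The embedding output is multiplied by `sqrt(d_model)`.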

In [24]:
class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        super(Embeddings, self).__init__()
        self.lut = nn.Embedding(vocab, d_model)
        self.d_model = d_model

    def forward(self, x):
        return self.lut(x) * math.sqrt(self.d_model)

### Positional Encoding

Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence.

In [19]:
class PositionalEncoding(nn.Module):
    "Implement the PE function."

    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer("pe", pe)

    def forward(self, x):
        x = x + self.pe[:, : x.size(1)].requires_grad_(False)
        return self.dropout(x)
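
The class above implements the sinusoidal encoding from the paper, where even dimensions use `sin(pos / 10000^(2i / d_model))` and odd dimensions use the matching cosine. The sketch below, with small arbitrary sizes, checks that the log-space `div_term` equals the direct form `1 / 10000^(2i / d_model)` and builds the same table.

```python
import math
import torch

d_model, max_len = 16, 50
position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
two_i = torch.arange(0, d_model, 2, dtype=torch.float32)            # 0, 2, 4, ...

# Log-space form used in the class, and the direct form of the same quantity.
div_term_log = torch.exp(two_i * -(math.log(10000.0) / d_model))
div_term_direct = 1.0 / torch.pow(torch.tensor(10000.0), two_i / d_model)
print(torch.allclose(div_term_log, div_term_direct, rtol=1e-4))

pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term_log)  # even dimensions
pe[:, 1::2] = torch.cos(position * div_term_log)  # odd dimensions

# At position 0 every sine is 0 and every cosine is 1.
print(pe[0])
```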

### Layer Normalisation

We employ a residual connection around each of the two sub-layers, followed by layer normalization.

In [11]:
class LayerNorm(nn.Module):
    "Construct a layernorm module (See citation for details)."

    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2
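
With the scale `a_2` at its initial value of ones and the shift `b_2` at zeros, this module standardizes each feature vector. The following sketch repeats the same arithmetic on a random tensor with a deliberately large mean and spread.

```python
import torch

torch.manual_seed(0)
x = torch.randn(2, 5, 8) * 3.0 + 10.0  # features with mean ~10 and std ~3
eps = 1e-6

mean = x.mean(-1, keepdim=True)
std = x.std(-1, keepdim=True)  # unbiased estimate, as in the class above
normed = (x - mean) / (std + eps)

# Along the last dimension the result has mean ~0 and standard deviation ~1.
print(normed.mean(-1).abs().max())
print(normed.std(-1))
```

Note that this differs slightly from `torch.nn.LayerNorm`, which uses the biased variance and adds `eps` inside the square root, so the two do not produce bit-identical outputs.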

### Sublayer Connection

The output of each sub-layer is

$$
\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))
$$

where $\mathrm{Sublayer}(x)$ is the function implemented by the sub-layer itself. We apply dropout to the output of each sub-layer, before it is added to the sub-layer input and normalized.

To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension $d_{\text{model}} = 512$.


In [13]:
class SublayerConnection(nn.Module):
    """
    A residual connection followed by a layer norm.
    Note for code simplicity the norm is first as opposed to last.
    """

    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        "Apply residual connection to any sublayer with the same size."
        return x + self.dropout(sublayer(self.norm(x)))
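
As the docstring notes, the code normalizes the input before the sub-layer, whereas the formula normalizes after the residual addition. The sketch below contrasts the two orderings. It uses `nn.LayerNorm` and a single linear layer as stand-ins (both are assumptions made for brevity) and omits dropout.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 8
norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)  # stand-in for attention or feed-forward
x = torch.randn(2, 5, d_model)

# Post-norm: LayerNorm(x + Sublayer(x)), as written in the formula.
post_norm = norm(x + sublayer(x))
# Pre-norm: x + Sublayer(LayerNorm(x)), as computed by SublayerConnection.
pre_norm = x + sublayer(norm(x))

print(post_norm.shape, pre_norm.shape)
print(torch.allclose(post_norm, pre_norm))  # the two orderings give different values
```

Because pre-norm leaves the residual path unnormalized, the `Encoder` and `Decoder` classes apply one final `LayerNorm` to the output of the stack.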

### Generator

In [33]:
class Generator(nn.Module):
    "Define standard linear + softmax generation step."

    def __init__(self, d_model, vocab):
        super(Generator, self).__init__()
        self.proj = nn.Linear(d_model, vocab)

    def forward(self, x):
        return F.log_softmax(self.proj(x), dim=-1)

### Encoder Layer

The encoder is composed of a stack of $N = 6$ identical layers.

In [15]:
class Encoder(nn.Module):
    "Core encoder is a stack of N layers"

    def __init__(self, layer, N):
        super(Encoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, mask):
        "Pass the input (and mask) through each layer in turn."
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)

In [16]:
class EncoderLayer(nn.Module):
    "Encoder layer is made up of self-attn and feed forward (defined above)"

    def __init__(self, size, self_attn, feed_forward, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 2)
        self.size = size

    def forward(self, x, mask):
        "Follow Figure 1 (left) for connections."
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
        return self.sublayer[1](x, self.feed_forward)

### Decoder Layer

The decoder is also composed of a stack of $N=6$ identical layers.

In [17]:
class Decoder(nn.Module):
    "Generic N layer decoder with masking."

    def __init__(self, layer, N):
        super(Decoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, memory, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer(x, memory, src_mask, tgt_mask)
        return self.norm(x)

In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization.

In [18]:
class DecoderLayer(nn.Module):
    "Decoder layer is made of self-attn, src-attn, and feed forward (defined above)"

    def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
        super(DecoderLayer, self).__init__()
        self.size = size
        self.self_attn = self_attn
        self.src_attn = src_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 3)

    def forward(self, x, memory, src_mask, tgt_mask):
        "Follow Figure 1 (right) for connections."
        m = memory
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        x = self.sublayer[1](x, lambda x: self.src_attn(x, m, m, src_mask))
        return self.sublayer[2](x, self.feed_forward)

### Subsequent Mask

We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position $i$ can depend only on the known outputs at positions less than $i$.

In [22]:
def subsequent_mask(size):
    "Mask out subsequent positions."
    attn_shape = (1, size, size)
    subsequent_mask = torch.triu(torch.ones(attn_shape), diagonal=1).type(
        torch.uint8
    )
    return subsequent_mask == 0
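
Printing the mask for a short sequence shows its structure. The sketch below rebuilds the same mask under the hypothetical name `causal_mask` so that it runs on its own.

```python
import torch

def causal_mask(size):
    # Same computation as subsequent_mask above: True where attention is allowed.
    return torch.triu(torch.ones(1, size, size), diagonal=1) == 0

print(causal_mask(4)[0])
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
```

Row `i` marks the positions that target position `i` may attend to: itself and everything before it.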

## Full Transformer Model

<figure>
    <img src="https://raw.githubusercontent.com/arkeodev/nlp/main/Simple_Transformers/images/overall-architecture-of-transformers.png" width="400" height="400" alt="Transformers Architecture">
    <figcaption>Transformers Architecture</figcaption>
</figure>

Now, let's put together the main Transformer architecture combining the encoder, decoder, and other components previously described.

In [48]:
class Transformer(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, N=6, d_model=512, d_ff=2048, h=8, dropout=0.1):
        super(Transformer, self).__init__()
        c = copy.deepcopy
        attn = MultiHeadedAttention(h, d_model, dropout)
        ff = PositionwiseFeedForward(d_model, d_ff, dropout)
        position = PositionalEncoding(d_model, dropout)
        self.encoder = Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout), N)
        self.decoder = Decoder(DecoderLayer(d_model, c(attn), c(attn), c(ff), dropout), N)
        self.src_embed = nn.Sequential(Embeddings(d_model, src_vocab), c(position))
        self.tgt_embed = nn.Sequential(Embeddings(d_model, tgt_vocab), c(position))
        self.generator = Generator(d_model, tgt_vocab)

        # Parameter Initialization
        for p in self.parameters():
            if p.dim() > 1:
                nn.init.xavier_uniform_(p)

    def forward(self, src, tgt, src_mask, tgt_mask):
        return self.decode(self.encode(src, src_mask), src_mask, tgt, tgt_mask)

    def encode(self, src, src_mask):
        return self.encoder(self.src_embed(src), src_mask)

    def decode(self, memory, src_mask, tgt, tgt_mask):
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)


### Inference Test

In [49]:
def inference_test():
    # Adjust parameters as needed
    test_model = Transformer(src_vocab=11, tgt_vocab=11, N=2, d_model=512, d_ff=2048, h=8, dropout=0.1)
    test_model.eval()
    src = torch.LongTensor([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
    src_mask = torch.ones(1, 1, 10)
    memory = test_model.encode(src, src_mask)
    ys = torch.zeros(1, 1).type_as(src)

    for i in range(9):
        out = test_model.decode(
            memory, src_mask, ys, subsequent_mask(ys.size(1)).type_as(src.data)
        )
        prob = test_model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.item()
        ys = torch.cat(
            [ys, torch.tensor([[next_word]]).type_as(src.data)], dim=1
        )

    print("Example Untrained Model Prediction:", ys)

def run_tests():
    for _ in range(10):
        inference_test()

# Run the test
run_tests()

Example Untrained Model Prediction: tensor([[ 0,  1,  3,  5,  0,  1,  3,  5,  0, 10]])
Example Untrained Model Prediction: tensor([[0, 4, 2, 2, 4, 2, 2, 2, 2, 4]])
Example Untrained Model Prediction: tensor([[0, 7, 2, 8, 8, 8, 8, 8, 8, 8]])
Example Untrained Model Prediction: tensor([[0, 9, 2, 0, 7, 2, 0, 7, 2, 6]])
Example Untrained Model Prediction: tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
Example Untrained Model Prediction: tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
Example Untrained Model Prediction: tensor([[0, 7, 6, 6, 7, 6, 6, 6, 6, 6]])
Example Untrained Model Prediction: tensor([[0, 8, 5, 5, 5, 5, 5, 5, 5, 5]])
Example Untrained Model Prediction: tensor([[ 0, 10,  1,  1,  1,  9,  8,  5,  8,  5]])
Example Untrained Model Prediction: tensor([[0, 8, 3, 9, 9, 9, 9, 9, 9, 9]])


# Training

## Training Infrastructure

Below is a simple training loop and data generation for a copy task, which will serve as our "simple task". For real data, you would replace the data generation and possibly the training loop to suit your dataset and task.

In [None]:
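def make_std_mask(tgt, pad):
    "Create a mask to hide padding and future positions."
    tgt_mask = (tgt != pad).unsqueeze(-2)
    return tgt_mask & subsequent_mask(tgt.size(-1))

def train(model, data_generator, optimizer, num_epochs=10):
    # The Generator emits log-probabilities, so NLLLoss is the matching criterion.
    # Index 0 is treated as padding and ignored.
    criterion = nn.NLLLoss(ignore_index=0)
    # Materialize the batches once so that every epoch can iterate over them.
    batches = list(data_generator)
    model.train()
    for epoch in range(num_epochs):
        total_loss = 0
        for src, full_tgt in batches:
            tgt = full_tgt[:, :-1]    # decoder input
            tgt_y = full_tgt[:, 1:]   # expected output, shifted by one position
            src_mask = (src != 0).unsqueeze(-2)
            tgt_mask = make_std_mask(tgt, 0)
            out = model(src, tgt, src_mask, tgt_mask)
            log_probs = model.generator(out)  # (batch, seq, vocab)
            loss = criterion(log_probs.reshape(-1, log_probs.size(-1)), tgt_y.reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        print(f"Epoch {epoch} Loss {total_loss / len(batches)}")

def data_gen(V, batch, nbatches):
    "Generate random data for a src-tgt copy task."
    for i in range(nbatches):
        data = torch.randint(1, V, size=(batch, 10))
        data[:, 0] = 1  # Start token
        yield data, data.clone()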

## Run

Let's put it all together and run the training for our Transformer model on the simple copy task.

In [None]:
# Model hyperparameters
V = 11  # Vocabulary size
N = 6  # Number of layers
d_model = 512  # Embedding dimension
d_ff = 2048  # Feed-forward dimension
h = 8  # Number of heads

# Instantiate model
model = Transformer(V, V, N=N, d_model=d_model, d_ff=d_ff, h=h)

# Optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)

# Data generation
gen = data_gen(V, 30, 20)

# Training
train(model, gen, optimizer, num_epochs=10)

# Applying the Model to Real Data


This implementation is quite high-level and abstracts away many details for brevity. When working with real-world datasets, consider using existing frameworks like Hugging Face's Transformers, which provide pre-implemented versions of Transformer models and utilities for processing data, or adjust the data loading and processing parts of this guide to fit your specific needs.

# References



- Huang, A., Subramanian, S., Sum, J., Almubarak, K., & Biderman, S. (2022). *The Annotated Transformer*. Original by Sasha Rush. Retrieved from [https://nlp.seas.harvard.edu/annotated-transformer/](https://nlp.seas.harvard.edu/annotated-transformer/)

- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). *Attention is All You Need*. In Advances in Neural Information Processing Systems (NIPS). Retrieved from [https://arxiv.org/abs/1706.03762](https://arxiv.org/abs/1706.03762)