<a href="https://colab.research.google.com/github/Zain-Haider-ML/Transformer-Seq2Seq-Model-for-Translation-from-Scratch-with-PyTorch-and-Tensorflow/blob/main/Seq2Seq_Model_for_Translation_with_PyTorch_and_Tensorflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Installing necessary libraries

In [None]:
!pip install datasets

In [2]:
!nvidia-smi

/bin/bash: line 1: nvidia-smi: command not found


# Importing necessary libraries

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
from transformers import AutoConfig, AutoModel, AutoTokenizer, TFAutoModel
import math
import torch
from datasets import load_dataset

In [4]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cpu


In [None]:
model_ckpt = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
config = AutoConfig.from_pretrained(model_ckpt)

This code snippet does the following:

- **Model Checkpoint Selection**: Specifies the name of a pre-trained BERT model to be used, in this case, "bert-base-uncased".
- **Tokenizer Initialization**: Initializes a tokenizer from the pre-trained BERT model using `AutoTokenizer`.
- **Model Configuration Retrieval**: Retrieves the configuration for the specified BERT model using `AutoConfig`.
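
The classes defined later read only a handful of fields from `config`. As a reading aid, the sketch below collects those fields with the values that `bert-base-uncased` ships with. The `SimpleNamespace` named `bert_base` is a stand-in assumed here so the snippet runs without a download; in the notebook the real object comes from `AutoConfig`.

```python
from types import SimpleNamespace

# Stand-in for AutoConfig.from_pretrained("bert-base-uncased"); only the
# fields used by the classes in this notebook are listed.
bert_base = SimpleNamespace(
    vocab_size=30522,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=512,
    hidden_dropout_prob=0.1,
)

# Each attention head works in an equal slice of the hidden size.
head_dim = bert_base.hidden_size // bert_base.num_attention_heads
print(head_dim)  # 64
```

A stand-in like this, with smaller numbers, is also a convenient way to instantiate the modules below for quick shape checks.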

# PyTorch Implementation of Transformer

**Embeddings**

In [6]:
class Embeddings(torch.nn.Module):
  def __init__(self, config):
    super().__init__()
    self.token_embeddings = torch.nn.Embedding(config.vocab_size, config.hidden_size)
    self.positional_embeddings = torch.nn.Embedding(config.max_position_embeddings, config.hidden_size)
    self.norm = torch.nn.LayerNorm(config.hidden_size, eps=1e-12)
    self.dropout = torch.nn.Dropout(config.hidden_dropout_prob)  # use the configured rate, not the default p=0.5

  def forward(self, input_ids):
    seq_len = input_ids.size(1)
    position_ids = torch.arange(seq_len, dtype=torch.long, device=input_ids.device).unsqueeze(0)

    tok_emb = self.token_embeddings(input_ids)
    pos_emb = self.positional_embeddings(position_ids)
    embeddings = tok_emb + pos_emb
    embeddings = self.norm(embeddings)
    embeddings = self.dropout(embeddings)
    return embeddings

The `Embeddings` class defines an embedding layer commonly used in Transformer-based architectures. It combines token embeddings and positional embeddings to represent input sequences. Here's a detailed explanation of its components and how it operates:

- **Inheritance from `torch.nn.Module`**: The `Embeddings` class inherits from `torch.nn.Module`, indicating that it is a PyTorch neural network module.

- **Initialization (`__init__`)**:
  - The constructor initializes several components for embedding and processing input sequences:
    - `self.token_embeddings`: An embedding layer that maps vocabulary tokens to vectors of size `config.hidden_size`. It uses `config.vocab_size` to determine the vocabulary size.
    - `self.positional_embeddings`: An embedding layer that assigns a unique vector to each position in a sequence, based on `config.max_position_embeddings`.
    - `self.norm`: A layer normalization operation with `eps=1e-12`, the small constant added to the variance for numerical stability.
    - `self.dropout`: A dropout layer that randomly zeroes out elements during training, helping reduce overfitting.

- **Forward Method**:
  - The `forward` method processes the input tensor `input_ids`, representing a sequence of token indices (e.g., word tokens).
  - It determines the length of the input sequence (`seq_len`) and creates `position_ids`, which is a tensor with a range of sequence indices from 0 to `seq_len - 1`.
  - The token embeddings (`tok_emb`) are obtained by passing `input_ids` through `self.token_embeddings`.
  - The positional embeddings (`pos_emb`) are obtained by passing `position_ids` through `self.positional_embeddings`.
  - These embeddings are then summed, adding token embeddings and positional embeddings, to create `embeddings`.
  - The embeddings are then normalized using `self.norm` to ensure consistent scaling and stability.
  - A dropout layer (`self.dropout`) is applied to the embeddings for regularization.
  - The resulting `embeddings` represent the final output, which is a tensor with positional and token-based information for the input sequence.

Overall, the `Embeddings` class combines token and positional embeddings, normalizes them, and applies dropout to ensure robust and regularized embeddings for Transformer models. This layer is typically used at the beginning of such models to convert raw token inputs into suitable representations for further processing.
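
The shape handling in `forward` can be checked in isolation. The following is a minimal sketch with toy sizes chosen for illustration (they are not the notebook's values); it shows that `position_ids` has a leading dimension of 1, so the positional embeddings broadcast across the batch when added to the token embeddings.

```python
import torch

torch.manual_seed(0)
batch, seq_len, vocab, max_pos, hidden = 2, 5, 100, 16, 8  # toy sizes (assumed)

token_embeddings = torch.nn.Embedding(vocab, hidden)
positional_embeddings = torch.nn.Embedding(max_pos, hidden)

input_ids = torch.randint(0, vocab, (batch, seq_len))
position_ids = torch.arange(seq_len, dtype=torch.long).unsqueeze(0)  # (1, seq_len)

tok_emb = token_embeddings(input_ids)          # (batch, seq_len, hidden)
pos_emb = positional_embeddings(position_ids)  # (1, seq_len, hidden)
embeddings = tok_emb + pos_emb                 # broadcasts over the batch axis

print(tok_emb.shape, pos_emb.shape, embeddings.shape)
```

Because positions are looked up in a table of fixed size, a sequence longer than `max_position_embeddings` would index past the end of that table and fail.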

**Scaled Dot-Product Attention with Mask**

In [8]:
def scaled_dot_product_attention(query, key, value, mask=None):
  dim_k = query.size(-1)
  scores = torch.bmm(query, key.transpose(1, 2)) / math.sqrt(dim_k)
  if mask is not None:
    scores = scores.masked_fill(mask == 0, float("-inf")).to(device)
  weights = torch.nn.functional.softmax(scores, dim=-1)
  return weights.bmm(value)

The `scaled_dot_product_attention` function calculates attention scores, applies a softmax operation to generate attention weights, and then computes an attention-based output. Here's a breakdown of its functionality:

- **Input Parameters**:
  - `query`, `key`, `value`: These tensors represent the query, key, and value components used in the attention mechanism. Their dimensions typically align with an attention head in a Transformer.
  - `mask`: An optional tensor used to mask specific positions in the attention mechanism, typically for handling padding or constraints during attention.

- **Calculate Attention Scores**:
  - The attention scores are computed by performing a batch matrix multiplication (`torch.bmm`) between the `query` tensor and the transpose of the `key` tensor. This operation computes the dot product for each query-key pair.
  - The scores are then divided by the square root of `dim_k`, which is read from the last dimension of `query` (the projected `key` has the same last dimension). Without this scaling, dot products grow with the vector size and push softmax into regions with very small gradients.

- **Apply Masking**:
  - If a `mask` is provided, it is applied to the `scores` tensor using `masked_fill`. Masking typically involves setting certain positions to `float("-inf")` where the mask value is zero. This effectively excludes these positions from attention after softmax, ensuring they don't contribute to attention computations.
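  - The trailing `.to(device)` is called on the result of `masked_fill`, so it moves the masked `scores` to the global `device`; it does not move `mask`. The `mask` tensor therefore needs to be on the same device as `scores` beforehand.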

- **Apply Softmax**:
  - A softmax over the last dimension (`dim=-1`) turns each query's scores into weights that are non-negative and sum to 1 across the key positions. Positions filled with `-inf` receive a weight of exactly 0.

- **Calculate Attention-Weighted Output**:
  - The `weights` tensor, representing normalized attention probabilities, is used to compute the attention-weighted output by performing a batch matrix multiplication with the `value` tensor (`weights.bmm(value)`). This operation applies the attention weights to the `value` tensor, generating the final attention-based output.

- **Return Value**:
  - The function returns the computed attention-based output, representing the contextually transformed information based on the scaled dot product attention.

In summary, the `scaled_dot_product_attention` function calculates attention scores, applies optional masking, performs softmax to generate normalized attention probabilities, and computes the final attention-based output by weighting the `value` tensor based on the attention scores.
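
To see the effect of the mask concretely, the sketch below repeats the score, mask, and softmax steps but returns the weights instead of the weighted values. The helper name `attention_weights` and the toy tensors are assumptions for illustration, and the `.to(device)` call is left out so the snippet does not depend on the notebook's global `device`.

```python
import math
import torch

def attention_weights(query, key, mask=None):
    # Same score/mask/softmax steps as scaled_dot_product_attention,
    # but returning the weights so they can be inspected.
    scores = torch.bmm(query, key.transpose(1, 2)) / math.sqrt(query.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return torch.nn.functional.softmax(scores, dim=-1)

torch.manual_seed(0)
query = torch.randn(1, 3, 4)  # (batch, query_len, dim_k)
key = torch.randn(1, 3, 4)    # (batch, key_len, dim_k)

# Padding-style mask: the last key position is hidden from every query.
mask = torch.tensor([[[1, 1, 0]]])  # (batch, 1, key_len), broadcast over queries

weights = attention_weights(query, key, mask)
print(weights)              # last column is all zeros
print(weights.sum(dim=-1))  # each row sums to 1
```

One caveat follows from the same arithmetic: if a mask hides every key for some query, that row of scores is all `-inf` and the softmax produces NaN, so masks should leave at least one visible position per query.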

**Self Attention**

In [9]:
class AttentionHead(torch.nn.Module):
  def __init__(self, embed_dim, head_dim):
    super(AttentionHead, self).__init__()
    self.q = torch.nn.Linear(embed_dim, head_dim)
    self.k = torch.nn.Linear(embed_dim, head_dim)
    self.v = torch.nn.Linear(embed_dim, head_dim)

  def forward(self, query, key, value, mask = None):
    attn_outputs = scaled_dot_product_attention(self.q(query), self.k(key), self.v(value), mask)
    return attn_outputs

The `AttentionHead` class represents a single attention head in a multi-head attention mechanism, a core component of Transformer-based architectures. This class is responsible for creating the query, key, and value projections and applying scaled dot product attention. Let's break down the components and the functionality of this class:

- **Inheritance from `torch.nn.Module`**:
  - The `AttentionHead` class inherits from `torch.nn.Module`, indicating it's a PyTorch neural network module.

- **Initialization (`__init__`)**:
  - This constructor initializes linear layers for transforming inputs into query (`self.q`), key (`self.k`), and value (`self.v`) projections.
  - The parameters `embed_dim` and `head_dim` represent the dimensionality of the input embeddings and the size of the attention head, respectively. The linear layers map from `embed_dim` to `head_dim`.

- **Forward Method**:
  - This method calculates the attention outputs using the `scaled_dot_product_attention` function.
  - It accepts the following parameters:
    - `query`, `key`, `value`: The tensors used to generate attention. These represent the input to the attention mechanism.
    - `mask`: An optional tensor for masking certain positions during attention. This is used for handling padding or other constraints.
  - The `forward` method performs the following steps:
    - Projects the `query`, `key`, and `value` inputs through their respective linear layers to get the respective projections using `self.q(query)`, `self.k(key)`, and `self.v(value)`.
    - Calls the `scaled_dot_product_attention` function with the transformed `query`, `key`, `value`, and optional `mask`. This function computes the attention scores, applies softmax to get attention weights, and calculates the final attention outputs.
  - Returns `attn_outputs`, which represents the attention-based output after applying the scaled dot product attention.
  
The `AttentionHead` class implements the basic functionality of an attention head in a multi-head attention mechanism, projecting inputs into query, key, and value representations and calculating attention outputs. It's typically used within a larger multi-head attention context, allowing for modularity and reusability. The optional `mask` parameter supports flexible attention constraints, such as sequence padding or other attention-specific scenarios.
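
Because `query`, `key`, and `value` are separate arguments, the same head can serve self-attention and cross-attention. The sketch below, using toy sizes and stand-in projection layers named `q_proj`, `k_proj`, and `v_proj` (assumptions for illustration), shows the cross-attention case, where the query sequence and the key/value sequence have different lengths.

```python
import math
import torch

torch.manual_seed(0)
embed_dim, head_dim = 16, 4  # toy sizes (assumed)
q_proj = torch.nn.Linear(embed_dim, head_dim)
k_proj = torch.nn.Linear(embed_dim, head_dim)
v_proj = torch.nn.Linear(embed_dim, head_dim)

decoder_states = torch.randn(2, 3, embed_dim)  # (batch, target_len, embed_dim)
encoder_states = torch.randn(2, 7, embed_dim)  # (batch, source_len, embed_dim)

q = q_proj(decoder_states)  # (2, 3, head_dim)
k = k_proj(encoder_states)  # (2, 7, head_dim)
v = v_proj(encoder_states)  # (2, 7, head_dim)

scores = torch.bmm(q, k.transpose(1, 2)) / math.sqrt(head_dim)  # (2, 3, 7)
out = torch.softmax(scores, dim=-1).bmm(v)                       # (2, 3, head_dim)
print(scores.shape, out.shape)
```

The output always has one row per query position, regardless of how long the key/value sequence is.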

In [10]:
class MultiHeadAttention(torch.nn.Module):
  def __init__(self, config):
    super().__init__()

    num_heads = config.num_attention_heads
    embed_dim = config.hidden_size
    head_dim = embed_dim // num_heads

    self.heads = torch.nn.ModuleList([AttentionHead(embed_dim, head_dim) for _ in range(num_heads)])
    self.linear = torch.nn.Linear(embed_dim, embed_dim)

  def forward(self, q, k, v, mask = None):
    x = torch.cat([h(q, k, v, mask) for h in self.heads], dim = -1)
    x = self.linear(x)
    return x

The `MultiHeadAttention` class runs several independent `AttentionHead` instances over the same inputs, so each head can attend to different parts of the input sequence. In this implementation the heads are evaluated one after another in a list comprehension rather than in a single batched operation. Let's break down its components and functionality:

- **Inheritance from `torch.nn.Module`**:
  - The `MultiHeadAttention` class inherits from `torch.nn.Module`, indicating that it is a PyTorch neural network module.

- **Initialization (`__init__`)**:
  - This constructor initializes multiple attention heads based on the provided configuration (`config`).
  - Key parameters:
    - `num_heads`: Number of attention heads, derived from `config.num_attention_heads`.
    - `embed_dim`: Dimensionality of the embeddings, typically representing the total hidden size, derived from `config.hidden_size`.
    - `head_dim`: Dimensionality of each individual attention head, calculated as `embed_dim // num_heads`. This split allows multiple heads to operate in parallel within the overall embedding space.
  - The `self.heads` attribute is a `torch.nn.ModuleList` containing `num_heads` instances of `AttentionHead`. This allows for creating multiple attention heads to perform multi-head attention.
  - The `self.linear` attribute is a linear transformation layer, used to map the concatenated outputs from the attention heads back to the original `embed_dim`.

- **Forward Method**:
  - This method calculates the multi-head attention output.
  - It accepts the following parameters:
    - `q`, `k`, `v`: These represent the query, key, and value inputs for the attention mechanism.
    - `mask`: An optional tensor used for masking certain positions during attention, typically for handling padding or constraints.
  - The `forward` method performs the following steps:
    - Loops over the `self.heads` to compute attention for each head, applying the `AttentionHead` class's `forward` method with the provided `q`, `k`, `v`, and `mask`. This produces multiple attention outputs, one from each head.
    - Concatenates the outputs from each attention head along the last dimension using `torch.cat([h(q, k, v, mask) for h in self.heads], dim=-1)`.
    - Applies the linear transformation (`self.linear`) to map the concatenated output back to the original embedding size.
  - Returns the final output (`x`), representing the multi-head attention result after concatenation and linear transformation.
  
The `MultiHeadAttention` class implements multi-head attention by creating multiple `AttentionHead` instances, allowing parallel attention computations across different "heads." The final output is a concatenation of these individual attention outputs, transformed back to the original embedding size through a linear layer. This approach allows the Transformer to focus on different parts of the input sequence simultaneously, enhancing its ability to capture diverse patterns and relationships.
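
The concatenate-then-project step can be sketched on its own. The tensors in `head_outputs` below are random stand-ins for the per-head attention results, and the sizes are toy values assumed for illustration.

```python
import torch

torch.manual_seed(0)
batch, seq_len, embed_dim, num_heads = 2, 5, 12, 3  # toy sizes (assumed)
head_dim = embed_dim // num_heads                   # 4

# Stand-ins for the per-head attention outputs, one tensor per head.
head_outputs = [torch.randn(batch, seq_len, head_dim) for _ in range(num_heads)]

concatenated = torch.cat(head_outputs, dim=-1)      # (batch, seq_len, embed_dim)
output_projection = torch.nn.Linear(embed_dim, embed_dim)
mixed = output_projection(concatenated)

print(concatenated.shape, mixed.shape)
```

The concatenated width is `num_heads * head_dim`, which equals `embed_dim` only when the hidden size is divisible by the number of heads. With integer division, an uneven split would produce a narrower tensor than `self.linear` expects.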

**Feed Forward Network**

In [11]:
class FeedForward(torch.nn.Module):
  def __init__(self, config):
    super().__init__()
    self.linear_1 = torch.nn.Linear(config.hidden_size, config.intermediate_size)
    self.linear_2 = torch.nn.Linear(config.intermediate_size, config.hidden_size)
    self.gelu = torch.nn.GELU()
    self.dropout = torch.nn.Dropout(config.hidden_dropout_prob)

  def forward(self, x):
    x = self.linear_1(x)
    x = self.gelu(x)
    x = self.linear_2(x)
    x = self.dropout(x)
    return x

The `FeedForward` class applies a two-layer feedforward neural network with an activation function and dropout for regularization. Let's break down the components and functionality of this class:

- **Inheritance from `torch.nn.Module`**:
  - The `FeedForward` class inherits from `torch.nn.Module`, indicating it is a PyTorch neural network module.

- **Initialization (`__init__`)**:
  - This constructor initializes the components of the feedforward network based on the provided configuration (`config`).
  - The following attributes are initialized:
    - `self.linear_1`: A linear transformation layer that maps from the input dimension (`config.hidden_size`) to an intermediate dimension (`config.intermediate_size`). This is the first part of the feedforward network.
    - `self.linear_2`: A linear transformation layer that maps from the intermediate dimension back to the original hidden size. This is the second part of the feedforward network.
    - `self.gelu`: A GELU (Gaussian Error Linear Unit) activation function, applied after the first linear layer to introduce non-linearity.
    - `self.dropout`: A dropout layer that applies random dropout during training for regularization, with a dropout probability set to `config.hidden_dropout_prob`.

- **Forward Method**:
  - This method defines the feedforward transformation applied to the input tensor (`x`).
  - It performs the following steps:
    - Applies the first linear transformation (`self.linear_1`) to map `x` to the intermediate size.
    - Applies the GELU activation function (`self.gelu`) to introduce non-linearity.
    - Applies the second linear transformation (`self.linear_2`) to map back to the original hidden size.
    - Applies dropout (`self.dropout`) to the result for regularization, reducing overfitting risk during training.
  - Returns the final output (`x`), representing the feedforward transformation result.

In summary, the `FeedForward` class implements a typical feedforward neural network layer used in Transformer-based architectures. It applies a two-layer feedforward transformation with a non-linear activation (GELU) and dropout for regularization. This component is commonly used in Transformer encoder and decoder layers to add depth and capacity for modeling complex relationships in the data.
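
As a rough sense of scale, the parameter count of one `FeedForward` block can be worked out by hand. The sketch below assumes the `bert-base-uncased` sizes (`hidden_size=768`, `intermediate_size=3072`).

```python
hidden_size = 768          # bert-base-uncased hidden_size
intermediate_size = 3072   # bert-base-uncased intermediate_size

# torch.nn.Linear(in, out) holds an (out x in) weight matrix plus `out` biases.
linear_1_params = hidden_size * intermediate_size + intermediate_size
linear_2_params = intermediate_size * hidden_size + hidden_size

total = linear_1_params + linear_2_params
print(total)  # 4722432
```

The GELU activation and the dropout layer add no parameters, so the two linear layers account for the whole block.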

**Encoder**

In [12]:
class TransformerEncoderLayer(torch.nn.Module):
  def __init__(self, config):
    super().__init__()

    self.norm1 = torch.nn.LayerNorm(config.hidden_size)
    self.norm2 = torch.nn.LayerNorm(config.hidden_size)

    self.attention = MultiHeadAttention(config)
    self.feed_forward = FeedForward(config)

  def forward(self, x):
    multi_head_att = self.attention(x, x, x, None)
    x = self.norm1(x + multi_head_att)

    ff = self.feed_forward(x)
    x = self.norm2(x + ff)
    return x

The `TransformerEncoderLayer` class represents a single layer within the encoder component of a Transformer architecture. This layer typically consists of two sublayers: multi-head self-attention and a position-wise feedforward network, each followed by layer normalization. Let's break down the components and functionality of this class:

- **Inheritance from `torch.nn.Module`**:
  - The `TransformerEncoderLayer` class inherits from `torch.nn.Module`, indicating it's a PyTorch neural network module.

- **Initialization (`__init__`)**:
  - This constructor initializes the components of the encoder layer based on the provided configuration (`config`).
  - The following attributes are initialized:
    - `self.norm1` and `self.norm2`: Layer normalization layers. `self.norm1` is applied after the attention output has been added to its input, and `self.norm2` after the feedforward output has been added to its input (a post-norm arrangement). This helps stabilize training.
    - `self.attention`: Multi-head self-attention mechanism. This sublayer attends to different parts of the input sequence and captures dependencies within the sequence.
    - `self.feed_forward`: Feedforward neural network layer. This sublayer applies a position-wise feedforward transformation to each position in the sequence independently.

- **Forward Method**:
  - This method defines the forward pass of the encoder layer.
  - It accepts the following parameter:
    - `x`: The input tensor to the encoder layer.
  - The forward method performs the following steps:
    - Computes multi-head self-attention (`multi_head_att`) using the input tensor `x` as queries, keys, and values. The mask argument is `None`, so every position, including any padding, can be attended to.
    - Adds the output of self-attention (`multi_head_att`) to the input tensor `x` and normalizes the result using layer normalization (`self.norm1`).
    - Passes the normalized output through the feedforward neural network (`self.feed_forward`) and adds the output to the normalized input tensor `x`. The result is again normalized using layer normalization (`self.norm2`).
  - Returns the final output tensor (`x`) from the encoder layer.

In summary, the `TransformerEncoderLayer` class encapsulates a single layer within the Transformer encoder. It consists of multi-head self-attention and a position-wise feedforward network, each followed by layer normalization. This design allows the encoder to capture complex patterns and dependencies within the input sequence, making it suitable for various natural language processing tasks.
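
Both sublayers follow the same "add, then normalize" pattern. A minimal sketch of that pattern, with a single `Linear` layer standing in for the attention or feedforward sublayer (an assumption for brevity):

```python
import torch

torch.manual_seed(0)
hidden = 8
norm = torch.nn.LayerNorm(hidden)
sublayer = torch.nn.Linear(hidden, hidden)  # stand-in for attention or feed-forward

x = torch.randn(2, 5, hidden)
out = norm(x + sublayer(x))  # residual add first, LayerNorm second (post-norm)

# With LayerNorm's default initialisation (weight 1, bias 0), every position
# comes out with roughly zero mean across the hidden axis.
print(out.shape)
print(out.mean(dim=-1).abs().max())
```

The residual connection requires the sublayer to return a tensor of the same shape as its input, which is why both sublayers map back to `hidden_size`.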

In [13]:
class TransformerEncoder(torch.nn.Module):
  def __init__(self, config):
    super().__init__()

    self.embeddings = Embeddings(config)
    self.layers = torch.nn.ModuleList([TransformerEncoderLayer(config) for _ in range(config.num_hidden_layers)])

  def forward(self, x):
    x = self.embeddings(x)

    for l in self.layers:
      x = l(x)

    return x

The `TransformerEncoder` class represents the entire encoder component of a Transformer-based architecture. It comprises an embedding layer and multiple encoder layers, providing the core mechanism for encoding input sequences. Let's break down the components and functionality of this class:

- **Inheritance from `torch.nn.Module`**:
  - The `TransformerEncoder` class inherits from `torch.nn.Module`, indicating it is a PyTorch neural network module.

- **Initialization (`__init__`)**:
  - This constructor initializes the key components of the Transformer encoder based on the provided configuration (`config`).
  - The following attributes are initialized:
    - `self.embeddings`: An instance of the `Embeddings` class, responsible for mapping input tokens to a higher-dimensional representation and adding positional information to preserve sequence order.
    - `self.layers`: A `torch.nn.ModuleList` containing a series of `TransformerEncoderLayer` instances. The number of layers is determined by `config.num_hidden_layers`, allowing the encoder to have a stacked structure for deeper encoding capabilities.

- **Forward Method**:
  - This method defines the forward pass of the encoder.
  - It accepts the following parameter:
    - `x`: The input tensor, typically representing token IDs (e.g., from a tokenizer).
  - The forward method performs the following steps:
    - Applies the embedding layer (`self.embeddings`) to the input tensor `x` to convert token IDs into embeddings and add positional information.
    - Loops through the list of encoder layers (`self.layers`), applying each `TransformerEncoderLayer` to the output of the previous step. This iterative process allows the encoder to capture complex relationships within the input sequence.
  - Returns the final encoded output (`x`), representing the fully processed sequence after the embedding layer and all encoder layers.

In summary, the `TransformerEncoder` class represents the entire encoder component of a Transformer-based architecture. It initializes with an embedding layer and multiple encoder layers, with each layer consisting of multi-head self-attention and feedforward neural networks. The forward pass applies the embeddings to the input sequence, then passes the result through a series of encoder layers, allowing the Transformer encoder to extract complex patterns and relationships from the input sequence. This structure is a key component of many Transformer-based architectures, used in various natural language processing tasks such as machine translation, text classification, and more.
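
The choice of `torch.nn.ModuleList` over a plain Python list matters for training. The sketch below uses two hypothetical toy modules, `WithModuleList` and `WithPlainList`, to show that only the former registers its layers' parameters with the parent module.

```python
import torch

class WithModuleList(torch.nn.Module):
    def __init__(self, depth):
        super().__init__()
        self.layers = torch.nn.ModuleList([torch.nn.Linear(4, 4) for _ in range(depth)])

class WithPlainList(torch.nn.Module):
    def __init__(self, depth):
        super().__init__()
        # A plain Python list is not registered as a submodule container.
        self.layers = [torch.nn.Linear(4, 4) for _ in range(depth)]

registered = sum(p.numel() for p in WithModuleList(3).parameters())
unregistered = sum(p.numel() for p in WithPlainList(3).parameters())
print(registered, unregistered)  # 60 0
```

Parameters that are not registered are invisible to `model.parameters()`, so an optimizer built from it would never update them, and `model.to(device)` would not move them.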

**Decoder**

In [16]:
class TransformerDecoderLayer(torch.nn.Module):
  def __init__(self, config):
    super().__init__()

    self.norm1 = torch.nn.LayerNorm(config.hidden_size)
    self.norm2 = torch.nn.LayerNorm(config.hidden_size)
    self.norm3 = torch.nn.LayerNorm(config.hidden_size)

    self.selfAttention = MultiHeadAttention(config)
    self.crossAttention = MultiHeadAttention(config)
    self.feed_forward = FeedForward(config)

  def forward(self, x, y, mask):

    selfmulti_head_att = self.selfAttention(y, y, y, mask)
    y = self.norm1(y + selfmulti_head_att)

    cross_multi_head_att = self.crossAttention(y, x, x, None)
    y = self.norm2(y + cross_multi_head_att)

    ff = self.feed_forward(y)
    y = self.norm3(y + ff)
    return y

The `TransformerDecoderLayer` class represents a single layer in the decoder component of a Transformer-based architecture. This layer typically consists of multi-head self-attention, cross-attention with an encoder, and a feedforward neural network, with layer normalization following each sublayer. Here's an explanation of its components and functionality:

- **Inheritance from `torch.nn.Module`**:
  - The `TransformerDecoderLayer` class inherits from `torch.nn.Module`, indicating it's a PyTorch neural network module.

- **Initialization (`__init__`)**:
  - This constructor initializes the essential components for a decoder layer based on the provided configuration (`config`).
  - The following attributes are set up:
    - `self.norm1`, `self.norm2`, and `self.norm3`: Layer normalization layers. These ensure consistent scaling and stability during training.
    - `self.selfAttention`: Multi-head self-attention mechanism. This sublayer lets the decoder attend to different parts of its own sequence; which positions are visible is decided entirely by the `mask` passed to `forward`.
    - `self.crossAttention`: Multi-head cross-attention mechanism. This sublayer allows the decoder to attend to outputs from the encoder, capturing relevant context from the encoder.
    - `self.feed_forward`: A position-wise feedforward network, providing additional processing and non-linearity.

- **Forward Method**:
  - This method defines the forward pass of the decoder layer.
  - It accepts the following parameters:
    - `x`: The encoder's output tensor. This is used by the cross-attention mechanism.
    - `y`: The decoder's input tensor. This is typically a sequence of token embeddings.
    - `mask`: A tensor used for masking in self-attention, often to ensure causality (preventing future information from influencing the present).
  - The forward method performs the following steps:
    - **Self-Attention**: Computes multi-head self-attention (`selfmulti_head_att`) using `y` as queries, keys, and values, with the provided `mask` restricting which positions each token may attend to. The result is added to the input `y` and then normalized with `self.norm1`.
    - **Cross-Attention**: Computes multi-head cross-attention (`cross_multi_head_att`) with `y` as queries and `x` (the encoder's output) as keys and values, with no mask. This enables the decoder to focus on relevant parts of the encoder's output. The result is added to the output from the self-attention step and then normalized with `self.norm2`.
    - **Feedforward Network**: Applies the feedforward transformation (`ff`) to the output from the cross-attention step. The result is added to that output and then normalized with `self.norm3`.
  - Returns the final output (`y`) from the decoder layer, representing the fully processed sequence after self-attention, cross-attention, and feedforward processing.

In summary, the `TransformerDecoderLayer` class represents a single layer in a Transformer decoder. It consists of multi-head self-attention, cross-attention with the encoder's output, and a feedforward neural network, with layer normalization after each sublayer. This design allows the decoder to focus on its own sequence and context from the encoder, making it suitable for sequence-to-sequence tasks like machine translation and text generation.
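
Causality in self-attention comes from the mask, not from the layer itself. One common way to build such a mask, sketched here with toy values as an assumption rather than as the notebook's method, is a lower-triangular matrix combined with the tokenizer's padding mask.

```python
import torch

seq_len = 4
# Lower-triangular matrix: position i may attend to positions 0..i only.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.long))

# Tokenizer-style padding mask for a batch of two sequences (1 = real token).
attention_mask = torch.tensor([[1, 1, 1, 1],
                               [1, 1, 0, 0]])

# (batch, 1, seq_len) * (seq_len, seq_len) broadcasts to (batch, seq_len, seq_len).
combined = attention_mask.unsqueeze(1) * causal_mask
print(combined[1])
```

A tensor shaped like `combined` can be passed as `mask`, since `scaled_dot_product_attention` only tests `mask == 0`. As long as the first token of each sequence is not padding, every row keeps at least one visible position.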

In [17]:
class TransformerDecoder(torch.nn.Module):
  def __init__(self, config):
    super().__init__()

    self.embeddings = Embeddings(config)
    self.layers = torch.nn.ModuleList([TransformerDecoderLayer(config) for _ in range(config.num_hidden_layers)])

  def forward(self, x, y, mask):
    y = self.embeddings(y)

    for l in self.layers:
       y = l(x, y, mask)
    return y

The `TransformerDecoder` class represents the complete decoder component of a Transformer-based architecture. It consists of an embedding layer and a series of decoder layers, with each layer comprising self-attention, cross-attention, and a feedforward neural network. Here's an explanation of its components and functionality:

- **Inheritance from `torch.nn.Module`**:
  - The `TransformerDecoder` class inherits from `torch.nn.Module`, indicating it's a PyTorch neural network module.

- **Initialization (`__init__`)**:
  - This constructor initializes the key components of the Transformer decoder based on the provided configuration (`config`).
  - The following attributes are set up:
    - `self.embeddings`: An instance of the `Embeddings` class, responsible for mapping input token IDs into embeddings and adding positional information to preserve sequence order.
    - `self.layers`: A `torch.nn.ModuleList` containing a series of `TransformerDecoderLayer` instances. The number of layers is determined by `config.num_hidden_layers`, indicating the depth of the decoder.

- **Forward Method**:
  - This method defines the forward pass of the decoder.
  - It accepts the following parameters:
    - `x`: The encoder's output tensor, typically representing the encoded context from the source sequence. This is used in the cross-attention mechanism in each decoder layer.
    - `y`: The decoder's input tensor of target-sequence token IDs, which `self.embeddings` converts to vectors. During generation it holds the tokens produced so far, beginning with a start token.
    - `mask`: A tensor used to mask certain positions in self-attention, typically to ensure causality and prevent attention to future tokens.
  - The forward method performs the following steps:
    - **Embeddings**: Converts `y` into embeddings using `self.embeddings`, allowing the decoder to work with higher-dimensional representations.
    - **Decoder Layers**: Iterates through the list of decoder layers (`self.layers`), applying each `TransformerDecoderLayer` to the output from the previous step. This allows each decoder layer to process the input with self-attention, cross-attention, and feedforward transformations, following the Transformer architecture pattern.
  - Returns the final output (`y`), representing the processed output after the embedding layer and all decoder layers.

In summary, the `TransformerDecoder` class represents the complete decoder component in a Transformer-based architecture. It consists of an embedding layer and multiple decoder layers, each with self-attention, cross-attention, and a feedforward neural network. This structure allows the decoder to process sequences, attend to both its own context and the encoder's context, and generate output sequences, making it suitable for tasks like machine translation and text generation.

**Sequence To Sequence Model for Machine Translation**

In [18]:
class TransformerForSeqToSeq(torch.nn.Module):
  def __init__(self, config):
    super().__init__()

    self.encoder = TransformerEncoder(config)
    self.decoder = TransformerDecoder(config)
    self.dropout = torch.nn.Dropout(config.hidden_dropout_prob)
    self.linear = torch.nn.Linear(config.hidden_size, config.vocab_size)

  def forward(self, en_ids, de_ids):

    x = self.encoder(en_ids.input_ids)

    mask = de_ids.attention_mask.unsqueeze(1)
    y = self.decoder(x, de_ids.input_ids, mask)

    y = self.dropout(y)
    logits = self.linear(y)

    probs = torch.nn.functional.softmax(logits, dim = -1)
    return probs

The `TransformerForSeqToSeq` class represents a complete sequence-to-sequence (seq2seq) Transformer model. This architecture is designed for tasks that involve transforming one sequence into another, such as machine translation, text summarization, or text generation. It typically consists of an encoder, a decoder, and additional processing components. Here's a breakdown of its components and functionality:

- **Inheritance from `torch.nn.Module`**:
  - The `TransformerForSeqToSeq` class inherits from `torch.nn.Module`, indicating it is a PyTorch neural network module.

- **Initialization (`__init__`)**:
  - This constructor initializes the main components of the seq2seq model based on the provided configuration (`config`).
  - The following attributes are set up:
    - `self.encoder`: An instance of the `TransformerEncoder` class, responsible for encoding the source sequence into a context-rich representation.
    - `self.decoder`: An instance of the `TransformerDecoder` class, responsible for decoding the encoded representation along with a target sequence to generate a new sequence.
    - `self.dropout`: A dropout layer to add regularization and reduce overfitting risk.
    - `self.linear`: A linear transformation layer that maps from the hidden size to the vocabulary size, enabling the model to produce logits corresponding to vocabulary tokens.

- **Forward Method**:
  - This method defines the forward pass of the seq2seq model.
  - It accepts two parameters:
    - `en_ids`: An object containing the encoder's input data, typically with a property like `input_ids` representing the token IDs of the source sequence.
    - `de_ids`: An object containing the decoder's input data, also typically with an `input_ids` property representing the token IDs of the target sequence, along with an `attention_mask` to indicate valid positions (to manage padding).
  - The forward method performs the following steps:
    - **Encoding**: The encoder (`self.encoder`) processes `en_ids.input_ids` to generate an encoded representation (`x`), capturing contextual information from the source sequence.
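    - **Mask Generation**: Builds the decoder mask (`mask`) from `de_ids.attention_mask`, adding a dimension with `unsqueeze(1)` so that it broadcasts over query positions. As written, this mask only hides padding tokens; it does not prevent attention to future tokens, so causal masking would have to be added separately.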
    - **Decoding**: The decoder (`self.decoder`) processes the encoded representation (`x`) along with the decoder's input (`de_ids.input_ids`) and the generated `mask`. This step uses self-attention and cross-attention mechanisms to generate contextually relevant outputs.
    - **Dropout**: Applies dropout to the decoder's output (`y`) for regularization.
    - **Linear Transformation**: Maps the output from the decoder to logits with the linear layer (`self.linear`), enabling the model to predict token probabilities.
    - **Softmax**: Converts logits into probabilities using softmax, providing probabilities for each token in the vocabulary.
  - Returns the final probabilities (`probs`), representing the output sequence in terms of token probabilities.

In summary, `TransformerForSeqToSeq` wires the encoder and decoder together: the encoder builds a representation of the source sequence, the decoder combines it with the target tokens through self-attention and cross-attention, and the dropout and linear layers turn the decoder output into a probability distribution over the vocabulary at each target position.
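
The mask built in `forward` comes straight from `attention_mask`, so it marks padding only. Below is a minimal sketch of how a causal (lower-triangular) mask could be combined with it, assuming the decoder's attention treats a mask value of 0 as "do not attend" and accepts a mask that broadcasts against scores of shape `(batch, tgt_len, tgt_len)`. The helper name `build_decoder_mask` is hypothetical and not part of the notebook.

```python
import torch

def build_decoder_mask(attention_mask: torch.Tensor) -> torch.Tensor:
    """Combine a padding mask with a causal mask.

    attention_mask: (batch, tgt_len), 1 for real tokens and 0 for padding.
    Returns a mask of shape (batch, tgt_len, tgt_len) where entry [b, i, j]
    is 1 only if query position i may attend to key position j.
    """
    tgt_len = attention_mask.size(1)
    # Padding mask, broadcast over query positions: (batch, 1, tgt_len)
    padding = attention_mask.unsqueeze(1)
    # Causal mask: position i sees positions 0..i only: (1, tgt_len, tgt_len)
    causal = torch.tril(
        torch.ones(tgt_len, tgt_len,
                   dtype=attention_mask.dtype,
                   device=attention_mask.device)
    ).unsqueeze(0)
    return padding * causal

# Two target sequences of length 4; the second ends with one padding token.
attention_mask = torch.tensor([[1, 1, 1, 1],
                               [1, 1, 1, 0]])
mask = build_decoder_mask(attention_mask)
print(mask.shape)
print(mask[1])
```

Each row of `mask[b]` lists the key positions a given query position may see, so the padded last column of the second sequence stays zero in every row.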

**Training Model**

In [19]:
dataset = load_dataset("wmt16", "de-en")
ds = pd.DataFrame(dataset['train']['translation'][163:167])
ds.head()

| | de | en |
|---|---|---|
| 0 | Nach der Tagesordnung folgt der Bericht (A5-01... | The next item is the report (A5-0105/1999) by ... |
| 1 | Verehrte Frau Kommissarin, verehrte Präsidenti... | Commissioner, Madam President, ladies and gent... |
| 2 | Erstens: Wir mußten formal tätig werden, um de... | Firstly, we needed to take action on a formal ... |
| 3 | Zweitens: Wir erzielen mit dieser Richtlinie a... | Secondly, by adopting this directive we achiev... |


In [20]:
englishTokenizedText = tokenizer(ds['en'].tolist(), add_special_tokens=False, return_tensors='pt', padding=True, truncation=True).to(device)
germanTokenizedText = tokenizer(ds['de'].tolist(), add_special_tokens=False, return_tensors='pt', padding=True, truncation=True).to(device)
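
During training, a decoder is usually fed the target sequence shifted right by one position and scored against the unshifted target, so that each position predicts the next token. The cell above tokenizes with `add_special_tokens=False`, so no start token is present. The sketch below assumes a start token is prepended; `shift_right` and `start_id` are illustrative names, not part of the notebook.

```python
import torch

def shift_right(target_ids: torch.Tensor, start_id: int) -> torch.Tensor:
    """Build decoder inputs by prepending start_id and dropping the last token."""
    start = torch.full((target_ids.size(0), 1), start_id,
                       dtype=target_ids.dtype, device=target_ids.device)
    return torch.cat([start, target_ids[:, :-1]], dim=1)

# Toy target batch; 101 is only a stand-in for a start token id (an assumption).
labels = torch.tensor([[7, 8, 9],
                       [4, 5, 0]])
decoder_input_ids = shift_right(labels, start_id=101)
print(decoder_input_ids)
```

With this layout the decoder would receive `decoder_input_ids`, and the loss would be computed against `labels`.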

**Creating Model Instance for Training**

In [22]:
model = TransformerForSeqToSeq(config).to(device)

**Training Loop**

In [None]:
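# Define the loss function and optimizer.
# The model returns softmax probabilities, so take their log and use NLLLoss;
# CrossEntropyLoss expects raw logits and would apply softmax a second time.
# Padding positions are excluded from the loss via ignore_index.
criterion = torch.nn.NLLLoss(ignore_index=tokenizer.pad_token_id)
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)

model.train()
num_epochs = 10
for epoch in range(num_epochs):
    # Clear previous gradients
    optimizer.zero_grad()

    # Forward pass: probabilities of shape (batch, tgt_len, vocab_size)
    output = model(englishTokenizedText, germanTokenizedText)

    # Calculate loss on log-probabilities (clamped to avoid log(0))
    log_probs = torch.log(output.clamp_min(1e-9))
    loss = criterion(log_probs.view(-1, config.vocab_size), germanTokenizedText.input_ids.view(-1))

    # Backward pass
    loss.backward()

    # Optimization step
    optimizer.step()

    # Logging loss
    print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item()}')

print("Training finished.")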
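
`torch.nn.CrossEntropyLoss` applies `log_softmax` to its input, so it expects raw logits. The small check below uses a hand-picked toy example (not the notebook's model) to show how the loss changes when softmax probabilities are passed instead, and how `NLLLoss` on log-probabilities recovers the logits-based value:

```python
import torch

# One position, toy vocabulary of 3; the model is confident in class 0.
logits = torch.tensor([[4.0, 0.0, 0.0]])
targets = torch.tensor([0])

ce = torch.nn.CrossEntropyLoss()
nll = torch.nn.NLLLoss()

probs = torch.softmax(logits, dim=-1)

loss_from_logits = ce(logits, targets)                 # intended usage
loss_from_probs = ce(probs, targets)                   # softmax applied twice
loss_from_log_probs = nll(torch.log(probs), targets)   # equivalent to the first

print(loss_from_logits.item(), loss_from_probs.item(), loss_from_log_probs.item())
```

Even though the prediction is confident and correct, the loss computed from probabilities stays well above the loss computed from logits, because the second softmax flattens the distribution.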

#TensorFlow Implementation of Transformer

In [3]:
class Embeddings(tf.keras.layers.Layer):
  def __init__(self, config):
    super().__init__()
    self.token_embeddings = tf.keras.layers.Embedding(config.vocab_size, config.hidden_size)
    self.positional_embeddings = tf.keras.layers.Embedding(config.max_position_embeddings, config.hidden_size)
    self.norm = tf.keras.layers.LayerNormalization(epsilon = 1e-12)
    self.dropout = tf.keras.layers.Dropout(0.1)

  def call(self, input_ids):
    seq_len = input_ids.shape[1]
    position_ids = tf.range(0, seq_len, dtype = tf.int64)
    position_ids = tf.expand_dims(position_ids, 0)

    tok_emb = self.token_embeddings(input_ids)
    pos_emb = self.positional_embeddings(position_ids)

    embeddings = tok_emb + pos_emb
    embeddings = self.norm(embeddings)
    embeddings = self.dropout(embeddings)
    return embeddings

In [4]:
def scaled_dot_product_attention(query, key, value, mask=None):
  dim_k = query.shape[-1]
  scores = tf.matmul(query, tf.transpose(key, [0, 2, 1])) / math.sqrt(dim_k)
  if mask is not None:
    scores = tf.where(mask == 0, float('-inf'), scores)
  weights = tf.keras.activations.softmax(scores, -1)
  return tf.matmul(weights, value)
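
To see how a padding mask of shape `(batch, 1, seq_len)` broadcasts over the score matrix, here is a small NumPy re-implementation of the weight computation. It is a sketch for inspection only, not the TensorFlow function above, and `masked_attention_weights` is a hypothetical name.

```python
import math
import numpy as np

def masked_attention_weights(query, key, mask=None):
    """Softmax attention weights; masked key positions are set to -inf first."""
    dim_k = query.shape[-1]
    scores = query @ key.transpose(0, 2, 1) / math.sqrt(dim_k)
    if mask is not None:
        scores = np.where(mask == 0, -np.inf, scores)
    # Subtract the row maximum for numerical stability before exponentiating.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 3, 4))      # (batch, seq_len, head_dim)
k = rng.normal(size=(1, 3, 4))
mask = np.array([[[1, 1, 0]]])      # (batch, 1, seq_len): last key is padding

weights = masked_attention_weights(q, k, mask)
print(weights.round(3))
```

The last column of `weights` is zero for every query position, and each row still sums to 1 over the remaining keys.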

In [5]:
class AttentionHead(tf.keras.layers.Layer):
  def __init__(self, head_dim):
    super(AttentionHead, self).__init__()
    self.q = tf.keras.layers.Dense(head_dim)
    self.k = tf.keras.layers.Dense(head_dim)
    self.v = tf.keras.layers.Dense(head_dim)

  def call(self, query, key, value, mask = None):
    attn_outputs = scaled_dot_product_attention(self.q(query), self.k(key), self.v(value), mask)
    return attn_outputs

In [6]:
class MultiHeadAttention(tf.keras.layers.Layer):
  def __init__(self, config):

    super(MultiHeadAttention, self).__init__()
    num_heads = config.num_attention_heads
    embed_dim = config.hidden_size
    head_dim = embed_dim // num_heads

    self.heads = [AttentionHead(head_dim) for _ in range(num_heads)]
    self.linear = tf.keras.layers.Dense(embed_dim)

  def call(self, q, k, v, mask = None):
    x = tf.concat([h(q, k, v, mask) for h in self.heads], axis = -1)
    x = self.linear(x)
    return x

In [7]:
class FeedForward(tf.keras.layers.Layer):
  def __init__(self, config):
    super().__init__()
    self.linear_1 = tf.keras.layers.Dense(config.intermediate_size)
    self.linear_2 = tf.keras.layers.Dense(config.hidden_size)
    self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)

  def call(self, x):
    x = self.linear_1(x)
    x = tf.keras.activations.gelu(x)
    x = self.linear_2(x)
    x = self.dropout(x)
    return x

In [8]:
class TransformerEncoderLayer(tf.keras.layers.Layer):
  def __init__(self, config):
    super().__init__()
    self.norm1 = tf.keras.layers.LayerNormalization()
    self.norm2 = tf.keras.layers.LayerNormalization()

    self.attention = MultiHeadAttention(config)
    self.feed_forward = FeedForward(config)

  def call(self, x):
    multi_head_att = self.attention(x, x, x, None)
    x = self.norm1(x + multi_head_att)

    ff = self.feed_forward(x)
    x = self.norm2(x + ff)
    return x

In [9]:
class TransformerEncoder(tf.keras.Model):
  def __init__(self, config):
    super().__init__()
    self.embeddings = Embeddings(config)
    self.enlayers = [TransformerEncoderLayer(config) for _ in range(config.num_hidden_layers)]

  def call(self, x):
    x = self.embeddings(x)

    for l in self.enlayers:
      x = l(x)
    return x

In [10]:
class TransformerDecoderLayer(tf.keras.layers.Layer):
  def __init__(self, config):
    super().__init__()
    self.norm1 = tf.keras.layers.LayerNormalization()
    self.norm2 = tf.keras.layers.LayerNormalization()
    self.norm3 = tf.keras.layers.LayerNormalization()

    self.attention1 = MultiHeadAttention(config)
    self.attention2 = MultiHeadAttention(config)
    self.feed_forward = FeedForward(config)

  def call(self, x, y, mask):

    selfmulti_head_att = self.attention1(y, y, y, mask)
    y = self.norm1(y + selfmulti_head_att)

    cross_multi_head_att = self.attention2(y, x, x, None)
    y = self.norm2(y + cross_multi_head_att)

    ff = self.feed_forward(y)
    y = self.norm3(y + ff)
    return y

In [11]:
class TransformerDecoder(tf.keras.Model):
  def __init__(self, config):
    super().__init__()

    self.embeddings = Embeddings(config)
    self.delayers = [TransformerDecoderLayer(config) for _ in range(config.num_hidden_layers)]

  def call(self, x, y, mask):
    y = self.embeddings(y)

    for l in self.delayers:
       y = l(x, y, mask)
    return y

In [24]:
class TransformerForSeqToSeq(tf.keras.Model):
  def __init__(self, config):
    super().__init__()

    self.encoder = TransformerEncoder(config)
    self.decoder = TransformerDecoder(config)

    self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)
    self.linear = tf.keras.layers.Dense(config.vocab_size)

  def call(self, inputs):
    x = self.encoder(inputs[0].input_ids)
    y = self.decoder(x, inputs[1].input_ids, inputs[2])

    y = self.dropout(y)
    logits = self.linear(y)

    probs = tf.keras.activations.softmax(logits, axis = -1)
    return probs

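  # Note: this overrides tf.keras.Model.predict with a different signature.
  # It runs the same forward pass as call, but without a decoder mask.
  def predict(self, inputs):

    x = self.encoder(inputs[0].input_ids)
    y = self.decoder(x, inputs[1].input_ids, None)

    y = self.dropout(y)
    logits = self.linear(y)

    probs = tf.keras.activations.softmax(logits, axis = -1)
    return probs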

In [14]:
dataset = load_dataset("wmt16", "de-en")
ds = pd.DataFrame(dataset['train']['translation'][200:208])
ds

| | de | en |
|---|---|---|
| 0 | Oder denken Sie an Schiffer von osteuropäische... | Or ships from Eastern Europe which moor adjace... |
| 1 | Außerdem belegen Kontrollen in belgischen, fin... | Furthermore, it has transpired that research i... |
| 2 | Kurzum, ein ernstes Problem. | In short, the issue is an important one. |
| 3 | Was die Sicherheitsberater betrifft, so ist in... | If we look at the situation where safety advis... |
| 4 | Die Umsetzung ist gegenwärtig insbesondere in ... | There will be major problems with enforcing th... |
| 5 | Entweder sie verkaufen ihre Ladung oder mische... | These smaller companies either dispose of thei... |
| 6 | Gefordert werden deshalb auch die Erfassung di... | It is therefore also being requested that ISO ... |
| 7 | Die eigentliche Arbeit ist getan, nun geht es ... | The work is done. All that remains is the busi... |


In [15]:
englishTokenizedText = tokenizer(ds['en'].tolist(), add_special_tokens=False, return_tensors='tf', padding=True, truncation=True)
germanTokenizedText = tokenizer(ds['de'].tolist(), add_special_tokens=False, return_tensors='tf', padding=True, truncation=True)

In [30]:
model = TransformerForSeqToSeq(config)

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
optimizer = tf.keras.optimizers.Adam()

mask = tf.expand_dims(germanTokenizedText.attention_mask, axis = 1)

# Number of epochs for training
num_epochs = 7

# Custom training loop
for epoch in range(num_epochs):
    # Track the gradients
    with tf.GradientTape() as tape:
        # Forward pass
        predictions = model([englishTokenizedText, germanTokenizedText, mask])  # Feed the input data into the model

        # Calculate the loss
        loss = loss_fn(germanTokenizedText.input_ids, predictions)

    # Compute the gradients and apply them
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    # Optional: Print training progress
    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {loss.numpy()}")


Epoch [1/7], Loss: 10.392840385437012
Epoch [2/7], Loss: 9.211803436279297
Epoch [3/7], Loss: 7.721303939819336
Epoch [4/7], Loss: 6.644387722015381
Epoch [5/7], Loss: 5.663522720336914
Epoch [6/7], Loss: 4.955041885375977
Epoch [7/7], Loss: 4.541843891143799
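
As a sanity check on the first logged value: a model that spreads probability uniformly over the vocabulary has a cross-entropy of `ln(vocab_size)`. Assuming the `bert-base-uncased` vocabulary size of 30522 (the value `config.vocab_size` should hold for this checkpoint), that baseline works out to roughly 10.33, close to the 10.39 logged for epoch 1:

```python
import math

vocab_size = 30522  # assumed: vocabulary size of bert-base-uncased
# Cross-entropy of a uniform distribution over the vocabulary.
uniform_loss = -math.log(1.0 / vocab_size)
print(round(uniform_loss, 2))
```

A first-epoch loss near this value is what a randomly initialized model would be expected to produce.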
