# Team
---
Gamaliel Marines Olvera
A01708746

Uri Jared Gopar Morales
A01709413

José Antonio Miranda Baños
A01611795

María Fernanda Moreno Gómez
A01708653

Oskar Adolfo Villa López
A01275287

Luis Ángel Cruz García
A01736345



####Activity 3: Implementing a Translator

- Objective

To understand the Transformer architecture by implementing a translator.

- Instructions

    This activity requires submission in teams. While teamwork is encouraged, each member is expected to contribute individually to the assignment. The final submission should feature the best arguments and solutions from each team member. Only one person per team needs to submit the completed work, but it is imperative that the names of all team members are listed in a Markdown cell at the very beginning of the notebook (either the first or second cell). Failure to include all team member names will result in the grade being awarded solely to the individual who submitted the assignment, with zero points given to other team members (no exceptions will be made to this rule).

    Follow the provided code. It already implements a transformer from scratch, as explained in one of [week 9's videos](https://youtu.be/XefFj4rLHgU).

    Since the provided code already implements a simple translator, your job for this assignment is to understand it fully, and document it using pictures, figures, and markdown cells.  You should test your translator with at least 10 sentences. The dataset used for this task was obtained from [Tatoeba, a large dataset of sentences and translations](https://tatoeba.org/en/downloads).
  
- Evaluation Criteria

    - Code Readability and Comments
    - Training a translator
    - Translating at least 10 sentences.

- Submission

Submit this Jupyter Notebook in Canvas with your complete solution, ensuring your code is well-commented and includes Markdown cells that explain your design choices, results, and any challenges you encountered.



##**Environment**

###**Libraries**

### Importing Libraries

To build a translator using transformers, we first import essential libraries that will help with data manipulation, mathematical operations, model construction, and utilities.

#### Data Manipulation
- `pandas` (`import pandas as pd`): Used for handling and processing datasets efficiently.

#### Mathematical Operations
- `math`: Provides access to various mathematical functions and constants.
- `numpy` (`import numpy as np`): A core library for numerical computations and handling arrays, which is useful for manipulating data before feeding it into our model.

#### PyTorch (Deep Learning Framework)
- `torch` (`import torch`): The core PyTorch library for creating and manipulating tensors, which are fundamental data structures in deep learning.
- `torch.nn` (`import torch.nn as nn`): A module that provides classes and functions to build neural networks.
- `torch.nn.functional` (`import torch.nn.functional as F`): Contains functions for various neural network layers and operations, like activation functions, which are commonly used in model definitions.
- `torch.optim` (`import torch.optim as optim`): Optimizers for training neural networks, allowing for different strategies to adjust model parameters.
- `torch.utils.data` (`from torch.utils.data import Dataset, DataLoader`):
  - `Dataset`: Used to define and organize the dataset for training.
  - `DataLoader`: Allows us to load data in batches, making training more efficient and manageable.

#### Utilities
- `collections.Counter` (`from collections import Counter`): A `dict` subclass for counting hashable objects, helping in tasks like word frequency counting.
- `re` (`import re`): Provides regular expression support for efficient text processing and cleaning.

Together, these libraries provide the tools necessary for data preparation, model building, and training in a neural translation model.


In [None]:
# Libraries

# Data Manipulation
import pandas as pd

# Mathematical Operations
import math
import numpy as np

# PyTorch (Deep Learning Framework)
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

# Utilities
from collections import Counter
import re

###**Drive**

In [None]:
# Google Drive in Google Colab.
# Access to files and directories stored in Google Drive from a Colab notebook.
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


###**Device**

### Setting the Device for Computation

To ensure that the code leverages available hardware resources efficiently, we set up a device for running computations:

- `torch.device('cuda' if torch.cuda.is_available() else 'cpu')`:
  - This line checks if a CUDA-enabled GPU is available on the system.
  - If a GPU is available, it sets the device to `'cuda'`, allowing for faster computations as GPUs handle parallel operations efficiently.
  - If no GPU is detected, it defaults to `'cpu'`, where computations will still run but might be slower than on a GPU.

- `print(device)`:
  - This line prints the selected device, showing either `'cuda'` for GPU or `'cpu'` for CPU. This confirmation helps you verify that the code is using the expected hardware.

By dynamically setting the device, we ensure that the model will run on the most powerful hardware available, optimizing training and inference performance.
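
As a minimal sketch of what the device selection implies in practice, the example below places a small layer and its input on the same device before running them. The layer sizes and tensor shapes are arbitrary values chosen for illustration.

```python
import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Model parameters and input tensors must live on the same device.
layer = nn.Linear(4, 2).to(device)
x = torch.randn(3, 4, device=device)

y = layer(x)
print(y.device.type == device.type)  # True
```

Mixing devices (for example, a model on the GPU and inputs on the CPU) raises a runtime error, which is why the same `device` object is reused throughout the notebook.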


In [None]:
# Check if a CUDA-enabled GPU is available; if so, set the device to GPU, otherwise use CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

print(device)

cuda


##**Data Loading**

###**Load Dataset**

Loading the Dataset

To build a translation model, we start by loading a Spanish-English dataset, which will be used to train the model.

- `PATH = '/content/drive/MyDrive/Colab Notebooks/eng-spa2024.csv'`:
  - Specifies the file path for the Spanish-English dataset stored in Google Drive.
  - This path points to the location where the dataset is saved, making it accessible for loading.

- `df = pd.read_csv(PATH, encoding='latin1', header=None)`:
  - Loads the dataset into a pandas DataFrame for easier handling and manipulation.
  - `encoding='latin1'`: Specifies the encoding to ensure correct handling of special characters common in Spanish.
  - `header=None`: Indicates that the file does not contain a header row, so columns will be assigned default numerical labels.
  
This dataset will provide the Spanish-English text pairs necessary for training the translation model.


In [None]:
# Define file path for the Spanish-English dataset.
PATH = '/content/drive/MyDrive/Colab Notebooks/eng-spa2024.csv'

# Load dataset into a DataFrame; the file has no header row, so columns get integer labels.
df = pd.read_csv(PATH, encoding='latin1', header=None)

###**CSV File to TXT File**

Preprocessing the Dataset

After loading the dataset, we perform several preprocessing steps to prepare it for model training.

- **Select Relevant Columns**:
  - `eng_spa_cols = df.iloc[:, [1, 3]]`: Selects only the columns containing English and Spanish text.
  - The columns are chosen by their positions: `1` (English text) and `3` (Spanish text), simplifying the dataset to the text pairs we need.

- **Calculate and Sort by Text Length**:
  - `eng_spa_cols['length'] = eng_spa_cols.iloc[:, 0].str.len()`: Calculates the character length of each entry in the English column and stores it in a new column called `length`. This length information will help us sort the dataset.
  - `eng_spa_cols = eng_spa_cols.sort_values(by='length')`: Sorts the DataFrame based on the `length` column, arranging text pairs in ascending order of English text length. Keeping sentences of similar length next to each other can reduce the amount of padding per batch when consecutive rows are batched together, and it places the shortest, simplest pairs at the start of the file.

- **Clean Up the DataFrame**:
  - `eng_spa_cols = eng_spa_cols.drop(columns=['length'])`: Removes the `length` column after sorting, as it's no longer needed for training.

- **Save the Processed Data**:
  - `output_file_path = '/content/drive/MyDrive/Colab Notebooks/eng-spa2024.txt'`: Specifies the output file path where the cleaned dataset will be saved.
  - `eng_spa_cols.to_csv(output_file_path, sep='\t', index=False, header=False)`: Saves the processed DataFrame to a new text file using tab (`'\t'`) as the separator, without including the index or header.

By selecting only the necessary columns, sorting by text length, and saving the cleaned data, we create a more manageable and optimized dataset for training the translation model.
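
The steps above can be tried on a toy DataFrame. This is a sketch under assumptions: the three rows and the column layout (ids in columns 0 and 2, text in columns 1 and 3) are made up to mirror the selection by position, and the result is printed instead of being written to Drive.

```python
import pandas as pd

# Toy stand-in for the loaded dataset (hypothetical rows, same column layout).
toy_df = pd.DataFrame([
    [1, "I like green apples.", 10, "Me gustan las manzanas verdes."],
    [2, "Hi.", 20, "Hola."],
    [3, "Thank you.", 30, "Gracias."],
])

# .copy() gives an independent DataFrame, so adding a column is unambiguous.
pairs = toy_df.iloc[:, [1, 3]].copy()
pairs["length"] = pairs.iloc[:, 0].str.len()
pairs = pairs.sort_values(by="length").drop(columns=["length"])

# With no path argument, to_csv returns the tab-separated text as a string.
print(pairs.to_csv(sep="\t", index=False, header=False))
```

The shortest English sentence ends up on the first line and the longest on the last.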


In [None]:
# Select only the relevant columns for English and Spanish from the DataFrame.
# .copy() creates an independent DataFrame, so adding a column below does not trigger SettingWithCopyWarning.
eng_spa_cols = df.iloc[:, [1, 3]].copy()

# Calculate the length of each entry in the first column (English text) and store it as a new column.
eng_spa_cols['length'] = eng_spa_cols.iloc[:, 0].str.len()

# Sort the DataFrame based on the 'length' column to order entries by the length of the English text.
eng_spa_cols = eng_spa_cols.sort_values(by='length')

# Remove the 'length' column after sorting, as it is no longer needed.
eng_spa_cols = eng_spa_cols.drop(columns=['length'])

# Define output file path and save the processed DataFrame as a tab-separated file without index or header.
output_file_path = '/content/drive/MyDrive/Colab Notebooks/eng-spa2024.txt'
eng_spa_cols.to_csv(output_file_path, sep='\t', index=False, header=False)


##**Transformer - Attention Is All You Need**

Setting Up Transformer Parameters

To ensure consistent training results and manage input data effectively, we define a random seed and a maximum sequence length.

- **Random Seed for Reproducibility**:
  - `torch.manual_seed(23)`: Sets a fixed random seed for PyTorch's random number generator, so weight initialization and other random operations produce the same values on every run. This makes results far easier to reproduce when debugging and comparing model configurations, although some GPU operations can still be nondeterministic.

- **Maximum Sequence Length**:
  - `MAX_SEQ_LEN = 128`: Sets the maximum sequence length for the input data. This limits the number of tokens that the model will process per input sequence.
  - Limiting sequence length helps control memory usage and computation time, as longer sequences require more resources. Setting an appropriate length ensures that the model can handle most sentences without exceeding resource limits.

These initial configurations help create a stable and manageable setup for building a transformer-based translation model.
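
A small sketch of what the seed does: re-seeding resets the generator, so the same values are drawn again. The tensor size is arbitrary and the example runs on the CPU.

```python
import torch

# Re-seeding restores the generator state, so the same "random" values repeat.
torch.manual_seed(23)
first = torch.rand(3)

torch.manual_seed(23)
second = torch.rand(3)

print(torch.equal(first, second))  # True: both draws start from the same state

# Without re-seeding, the generator keeps advancing and the values differ.
third = torch.rand(3)
print(torch.equal(first, third))
```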


In [None]:
# Setting a random seed for reproducibility in PyTorch operations.
torch.manual_seed(23)

# Define the maximum sequence length for input data, setting a limit for processing.
MAX_SEQ_LEN = 128

####**Positional Embedding**

The positional embedding layer encodes positional information into the embeddings of tokens, allowing the transformer model to understand the order of tokens in a sequence, since self-attention by itself has no notion of token order.

Key Components of the PositionalEmbedding Class
- **Positional Encoding Matrix**: A matrix is created to store position encodings for each position in the sequence. These encodings use a combination of sine and cosine functions, ensuring that each position has a unique pattern of values. This helps the model distinguish between different token positions in a consistent way.
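- **Encoding Formula**: Even dimensions use the sine and odd dimensions use the cosine of `pos / 10000^(2i / d_model)`, where `pos` is the token position and `2i` is the even dimension index of each sine/cosine pair, so every pair has its own frequency. Because the encoding at a fixed offset is a linear function of the encoding at the current position, the embeddings also reflect relative position information.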
- **Integration with Input Embeddings**: In the forward pass, these positional encodings are added directly to the token embeddings, enhancing them with position-related information.

This layer allows the transformer to incorporate sequence order information, which is essential for tasks like translation where word order influences meaning.
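
To make the formula concrete, here is a plain-Python sketch that computes the encoding of a single position. The helper name `sinusoidal_encoding` and the tiny `d_model=4` are illustrative choices, not part of the provided code; the loop variable `i` plays the role of the even dimension index.

```python
import math

def sinusoidal_encoding(position, d_model):
    """Return the d_model-dimensional encoding for one token position."""
    encoding = []
    for i in range(0, d_model, 2):
        # Each (sin, cos) pair shares one frequency, which shrinks as i grows.
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))  # even dimension
        encoding.append(math.cos(angle))  # odd dimension
    return encoding

pe0 = sinusoidal_encoding(0, d_model=4)
pe1 = sinusoidal_encoding(1, d_model=4)
print(pe0)  # [0.0, 1.0, 0.0, 1.0]: sin(0) = 0 and cos(0) = 1 in every pair
print(pe1)
```

Every position yields a different vector, and the later pairs change more slowly with position than the first pair.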



In [None]:
# Define a positional embedding layer for adding position information to token embeddings.
class PositionalEmbedding(nn.Module):

    def __init__(self, d_model, max_seq_len=MAX_SEQ_LEN):

        super().__init__()

        # Create a matrix to store positional encodings for each token position.
        self.pos_embed_matrix = torch.zeros(max_seq_len, d_model, device=device)

        # Calculate sine (even dimensions) and cosine (odd dimensions) position encodings.
        token_pos = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        self.pos_embed_matrix[:, 0::2] = torch.sin(token_pos * div_term)
        self.pos_embed_matrix[:, 1::2] = torch.cos(token_pos * div_term)

        # Add a leading batch dimension: [1, max_seq_len, d_model].
        self.pos_embed_matrix = self.pos_embed_matrix.unsqueeze(0)

    def forward(self, x):

        # x is batch-first: [batch_size, seq_len, d_model]. Slice the encodings along the
        # sequence dimension and let them broadcast over the batch.
        return x + self.pos_embed_matrix[:, :x.size(1), :]

####**Multi Head Attention**

The multi-head attention mechanism enables the model to attend to different parts of a sequence simultaneously, which is critical for capturing various aspects of linguistic relationships in translation tasks. In this class, we implement a multi-head attention layer to split the attention into multiple "heads" and capture different representation subspaces.

#### Key Components of the MultiHeadAttention Class
- **Initialization**: The model dimension (`d_model`) and number of heads (`num_heads`) are specified. `d_model` must be divisible by `num_heads`; each head works on a slice of size `d_model // num_heads`, so the heads attend in parallel over different representation subspaces.
- **Linear Projections for Q, K, V**: The input is projected into Query (Q), Key (K), and Value (V) matrices through linear layers. This separation allows the model to determine "what to attend to" in the input sequence for each head.
- **Scaled Dot-Product Attention**: The dot products between queries and keys are divided by `sqrt(d_k)` so that large values do not saturate the softmax, masked positions are filled with a large negative value, and a softmax turns the scores into attention weights that determine how much each token contributes to the output.
- **Output Projection**: The outputs from all attention heads are concatenated and linearly projected to match the original model dimension, integrating the multi-head attention information.

This multi-head attention layer captures multiple aspects of context in the sequence, enhancing the model’s understanding of complex relationships in language and helping it handle tasks like translation effectively.
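
The reshaping and the attention computation can be followed on small tensors. This is a simplified sketch: the learned projections `W_q`, `W_k`, `W_v`, and `W_o` are omitted, so `Q`, `K`, and `V` are all the same reshaped input, and the sizes are arbitrary.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, seq_len, d_model, num_heads = 2, 5, 16, 4
d_k = d_model // num_heads

# Split the model dimension into heads: [batch, seq, d_model] -> [batch, heads, seq, d_k].
x = torch.randn(batch_size, seq_len, d_model)
Q = K = V = x.view(batch_size, seq_len, num_heads, d_k).transpose(1, 2)

# Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
weights = F.softmax(scores, dim=-1)
context = torch.matmul(weights, V)

# Merge the heads back: [batch, heads, seq, d_k] -> [batch, seq, d_model].
merged = context.transpose(1, 2).contiguous().view(batch_size, seq_len, d_model)

print(weights.shape)  # torch.Size([2, 4, 5, 5])
print(merged.shape)   # torch.Size([2, 5, 16])
```

Each row of `weights` sums to 1, so every output position is a weighted average of the value vectors within its head.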


In [None]:
# Define a multi-head attention layer for capturing various representation subspaces.
class MultiHeadAttention(nn.Module):

    def __init__(self, d_model=512, num_heads=8):

        super().__init__()
        assert d_model % num_heads == 0, 'Embedding size must be divisible by number of heads'

        # Define dimensions for each attention head.
        self.d_v = d_model // num_heads
        self.d_k = self.d_v
        self.num_heads = num_heads

        # Linear layers for projecting inputs into Q, K, V spaces.
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, Q, K, V, mask=None):

        batch_size = Q.size(0)

        # After projection: Q, K, V -> [batch_size, seq_len, num_heads * d_k].
        # After view and transpose: Q, K, V -> [batch_size, num_heads, seq_len, d_k].

        # Project and reshape Q, K, V to enable multi-head attention.
        Q = self.W_q(Q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(K).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(V).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        # Calculate attention output.
        weighted_values, attention = self.scale_dot_product(Q, K, V, mask)

        # Reshape the output back to original dimensions.
        weighted_values = weighted_values.transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads * self.d_k)
        weighted_values = self.W_o(weighted_values)

        return weighted_values, attention

    def scale_dot_product(self, Q, K, V, mask=None):

        # Compute attention scores and apply mask if provided.
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)

        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)

        attention = F.softmax(scores, dim=-1)

        # Calculate weighted sum of values.
        weighted_values = torch.matmul(attention, V)

        return weighted_values, attention

####**Position Feed Forward**

The position feedforward layer is a component within the Transformer architecture that applies a fully connected neural network to each position independently. This layer helps the model capture complex relationships by transforming each position's embedding through a multi-layer perceptron (MLP).

Key Components of the PositionFeedForward Class
- **Two Linear Transformations**: The feedforward layer consists of two linear (fully connected) layers. The first transforms the embedding dimension (`d_model`) to a larger intermediate dimension (`d_ff`), while the second reduces it back to the original model dimension.
- **ReLU Activation**: A ReLU activation function is applied between the two linear layers, introducing non-linearity to help the model capture more complex patterns.

This feedforward layer is applied to each position separately, allowing the model to learn and refine the representation of each token independently of the others. It’s a crucial component for increasing the model’s capacity to learn diverse features in the sequence.
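
One way to check the "each position separately" claim is to compare a single pass over the whole sequence with a loop over individual positions. The sketch below uses two standalone `nn.Linear` layers with small, arbitrary sizes rather than the class itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d_model, d_ff = 8, 32
linear1 = nn.Linear(d_model, d_ff)
linear2 = nn.Linear(d_ff, d_model)

def feed_forward(x):
    # Expand to d_ff, apply ReLU, project back to d_model.
    return linear2(F.relu(linear1(x)))

x = torch.randn(1, 4, d_model)  # [batch, seq_len, d_model]

# Whole sequence at once versus one position at a time.
full = feed_forward(x)
per_position = torch.stack([feed_forward(x[:, t]) for t in range(x.size(1))], dim=1)

print(torch.allclose(full, per_position, atol=1e-5))
```

Both routes give the same result because the linear layers only act on the last dimension and never mix information across positions.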

In [None]:
# Define a feedforward neural network layer used within the Transformer.
class PositionFeedForward(nn.Module):

    def __init__(self, d_model, d_ff):

        super().__init__()

        # Two linear layers for the feedforward network.
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)

    def forward(self, x):

        # Apply ReLU activation between two linear transformations.
        return self.linear2(F.relu(self.linear1(x)))

####**Encoder Sub Layer**

The EncoderSubLayer class defines a core component of the Transformer encoder, combining self-attention and feedforward layers with normalization and residual connections. Each sublayer enhances the model's ability to capture relationships in the input sequence.

Key Components of the EncoderSubLayer Class
- **Self-Attention and Feedforward Layers**:
  - The self-attention layer enables the model to focus on different parts of the input sequence, capturing contextual relationships.
  - The feedforward layer refines the representation of each token independently, adding complexity to the model's understanding of each position.

- **Normalization and Dropout**:
  - Layer normalization stabilizes and accelerates training by maintaining consistent input distributions.
  - Dropout provides regularization, helping prevent overfitting by randomly zeroing out elements of each sublayer's output during training.

- **Residual Connections**:
  - Residual (or skip) connections allow the original input to bypass each layer, ensuring that information flows through the network and making training more stable.

This sublayer is the building block of the Transformer encoder: multiple instances are stacked to create deeper and more expressive models for tasks like translation.
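
The residual-then-normalize pattern can be isolated in a few lines. In this sketch a plain `nn.Linear` is a hypothetical stand-in for the attention or feedforward component, and dropout is switched to evaluation mode so the output is repeatable.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 8
sublayer = nn.Linear(d_model, d_model)  # hypothetical stand-in for attention or feedforward
norm = nn.LayerNorm(d_model)
dropout = nn.Dropout(0.1)
dropout.eval()  # disable dropout so the result is repeatable

x = torch.randn(2, 5, d_model)  # [batch, seq_len, d_model]

# Residual block: add the sublayer output to its input, then normalize the sum.
out = norm(x + dropout(sublayer(x)))

print(out.shape)                     # same shape as the input
print(out.mean(dim=-1).abs().max())  # close to 0: each position is normalized
```

Because the shape is preserved, blocks of this form can be chained any number of times.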

In [None]:
# Define a sublayer within the encoder, which includes attention and feedforward layers.
class EncoderSubLayer(nn.Module):

    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):

        super().__init__()

        # Self-attention and feedforward layers.
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.ffn = PositionFeedForward(d_model, d_ff)

        # Normalization and dropout layers.
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, mask=None):

        # Apply self-attention, normalization, and residual connections.
        attention_score, _ = self.self_attn(x, x, x, mask)
        x = x + self.dropout1(attention_score)
        x = self.norm1(x)

        # Apply feedforward, normalization, and residual connections.
        x = x + self.dropout2(self.ffn(x))

        return self.norm2(x)

####**Encoder**

The Encoder class implements the Transformer encoder, which consists of multiple stacked layers to learn complex representations of the input sequence. This stack of layers allows the model to progressively refine its understanding of the sequence.

Key Components of the Encoder Class
- **Stacked Encoder Sublayers**:
  - The encoder is composed of multiple `EncoderSubLayer` instances, where each sublayer includes self-attention and feedforward components with normalization and residual connections.
  - Stacking several sublayers helps the model capture increasingly abstract patterns and relationships across the input.

- **Layer Normalization**:
  - After the input passes through all encoder layers, a final layer normalization step is applied to stabilize the output. This helps maintain consistent representation quality across the entire model.

Forward Pass
- The input sequence passes sequentially through each encoder layer, where each layer processes and refines the input. This approach allows the encoder to build a rich and hierarchical representation of the sequence, essential for downstream tasks like translation.

The encoder structure forms a powerful sequence processor, enabling the model to learn deep, contextual representations that enhance its understanding of the input text.
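
The stacking relies on `nn.ModuleList`, which registers every layer so that its parameters are visible to the optimizer and are moved by `.to(device)`. The sketch below uses a hypothetical `Stack` class with `nn.Linear` layers in place of the encoder sublayers, purely to show the pattern.

```python
import torch
import torch.nn as nn

d_model, num_layers = 8, 3

class Stack(nn.Module):
    def __init__(self):
        super().__init__()
        # ModuleList registers each layer as a submodule of Stack.
        self.layers = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(num_layers)])
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # Feed the output of each layer into the next one.
        for layer in self.layers:
            x = layer(x)
        return self.norm(x)

stack = Stack()
num_params = sum(p.numel() for p in stack.parameters())
print(num_params)  # 3 * (8*8 + 8) + 2*8 = 232
```

A plain Python list would run the same forward pass, but its layers would not appear in `stack.parameters()` and would therefore not be trained.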

In [None]:
# Define the encoder consisting of multiple sublayers for representation learning.
class Encoder(nn.Module):

    def __init__(self, d_model, num_heads, d_ff, num_layers, dropout=0.1):

        super().__init__()

        # Stack multiple encoder sublayers.
        self.layers = nn.ModuleList([EncoderSubLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, mask=None):

        # Pass the input through each encoder layer.
        for layer in self.layers:
            x = layer(x, mask)

        return self.norm(x)

####**Decoder Sub Layer**

The DecoderSubLayer class implements a sublayer within the Transformer decoder, combining self-attention, cross-attention, and feedforward layers. This layer enables the decoder to focus on both the target sequence and the encoder's output, allowing it to generate meaningful output sequences.

Key Components of the DecoderSubLayer Class
- **Self-Attention Layer**:
  - This layer focuses on the target sequence itself, enabling the model to learn dependencies within the generated sequence. A target mask is applied to prevent the model from attending to future tokens.

- **Cross-Attention Layer**:
  - The cross-attention layer attends to the encoder's output, allowing the decoder to align with relevant parts of the input sequence. This enables the model to use information from the input sequence when generating each token in the target sequence.

- **Feedforward Network**:
  - A feedforward layer further refines the representation at each position, adding complexity to the model's understanding of each token in context.

- **Normalization and Dropout**:
  - Dropout is applied to the output of each main component before it is added back to that component's input (the residual connection), and layer normalization is applied to the sum. This stabilizes training and helps prevent overfitting.

Forward Pass
- The target sequence is passed through self-attention, cross-attention, and feedforward layers sequentially, with residual connections, dropout, and normalization applied throughout.

This sublayer is crucial for the decoder’s ability to generate contextually relevant output by dynamically focusing on both the generated sequence and the encoded input.
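
The effect of the target mask can be verified directly: if only a later token changes, the outputs at earlier positions should stay the same. The sketch below uses a simplified single-head attention without learned projections; the function name `masked_self_attention` and the sizes are illustrative.

```python
import math
import torch
import torch.nn.functional as F

def masked_self_attention(x, mask):
    # Single-head attention without learned projections (a simplification).
    scores = x @ x.transpose(-2, -1) / math.sqrt(x.size(-1))
    scores = scores.masked_fill(mask == 0, -1e9)
    return F.softmax(scores, dim=-1) @ x

torch.manual_seed(0)
seq_len, d_model = 4, 8
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

x = torch.randn(1, seq_len, d_model)
x_changed = x.clone()
x_changed[:, -1] = torch.randn(d_model)  # alter only the last (future) token

out = masked_self_attention(x, causal_mask)
out_changed = masked_self_attention(x_changed, causal_mask)

# Positions before the altered token are unaffected by the change.
print(torch.allclose(out[:, :-1], out_changed[:, :-1]))
```

This property lets the decoder be trained on whole target sentences at once while still predicting each token only from the tokens before it.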

In [None]:
# Define a sublayer within the decoder, incorporating self-attention, cross-attention, and feedforward layers.
class DecoderSubLayer(nn.Module):

    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):

        super().__init__()

        # Self-attention, cross-attention, and feedforward layers.
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionFeedForward(d_model, d_ff)

        # Normalization and dropout layers.
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.dropout3 = nn.Dropout(dropout)

    def forward(self, x, encoder_output, target_mask=None, encoder_mask=None):

        # Self-attention with target mask.
        attention_score, _ = self.self_attn(x, x, x, target_mask)
        x = x + self.dropout1(attention_score)
        x = self.norm1(x)

        # Cross-attention with encoder output and mask.
        encoder_attn, _ = self.cross_attn(x, encoder_output, encoder_output, encoder_mask)
        x = x + self.dropout2(encoder_attn)
        x = self.norm2(x)

        # Feedforward network with residual and normalization.
        ff_output = self.feed_forward(x)
        x = x + self.dropout3(ff_output)

        return self.norm3(x)

####**Decoder**

The Decoder class defines the Transformer decoder, which consists of multiple stacked sublayers designed to generate a target sequence based on the encoder’s output. The decoder progressively refines its understanding of the input and target sequence, enabling accurate and contextually relevant sequence generation.

Key Components of the Decoder Class
- **Stacked Decoder Sublayers**:
  - The decoder is composed of several `DecoderSubLayer` instances, each of which includes self-attention, cross-attention, and feedforward layers.
  - Stacking these sublayers allows the model to progressively build a rich representation of the target sequence, incorporating both self-attended target context and cross-attended input context.

- **Layer Normalization**:
  - After passing through all sublayers, a final layer normalization is applied to stabilize the output representation. This ensures consistency across the model and improves overall training stability.

Forward Pass
- The target sequence is passed through each decoder layer, with each layer adding information by attending to the target sequence itself and to the encoder’s output. This approach enables the model to capture nuanced relationships within the target and between the target and input sequences.

The decoder structure forms the backbone of the Transformer’s generation capabilities, allowing it to produce accurate translations or other sequence-to-sequence outputs.

In [None]:
# Define the decoder consisting of multiple sublayers for sequence generation.
class Decoder(nn.Module):

    def __init__(self, d_model, num_heads, d_ff, num_layers, dropout=0.1):

        super().__init__()

        # Stack multiple decoder sublayers.
        self.layers = nn.ModuleList([DecoderSubLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, encoder_output, target_mask, encoder_mask):

        # Pass the input through each decoder layer.
        for layer in self.layers:
            x = layer(x, encoder_output, target_mask, encoder_mask)

        return self.norm(x)

###**Transformer**

The Transformer class implements an encoder-decoder structure for sequence-to-sequence tasks, like language translation. This model uses attention mechanisms and positional encodings to capture complex relationships in the input and output sequences.

Key Components of the Transformer Class
- **Embedding Layers**:
  - Separate embedding layers for the source and target vocabulary map tokens to continuous vector representations of a specified dimension (`d_model`).
  - Positional encoding is added to these embeddings, helping the model understand token order within sequences.

- **Encoder and Decoder**:
  - The encoder processes the input sequence, learning a rich representation of it.
  - The decoder, leveraging both the target sequence (for self-attention) and encoder’s output (for cross-attention), generates the target sequence progressively.

- **Output Layer**:
  - A final linear layer maps the decoder’s output to the target vocabulary, producing logits for each token in the target sequence.

Forward Pass
1. **Generate Masks**: Masks are created for both the source and target sequences to prevent attending to padding tokens and future tokens in the target sequence.
2. **Encoding**: The source sequence is embedded, positionally encoded, and passed through the encoder to produce the encoder’s output.
3. **Decoding**: The target sequence is embedded, positionally encoded, and passed through the decoder, which also attends to the encoder’s output.
4. **Output Mapping**: The decoder’s output is mapped to the target vocabulary, producing logits over possible next tokens at each position; applying a softmax turns them into a probability distribution.

This architecture forms the basis of a powerful Transformer model capable of handling a wide range of sequence generation tasks by effectively encoding and decoding input and output sequences.
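
The mask generation step can be inspected on a single short sequence. This sketch assumes, as the `!= 0` checks in the model do, that token id 0 marks padding; the token ids themselves are made up.

```python
import torch

PAD = 0  # padding id, matching the `!= 0` checks in the mask method

# One target sequence of length 4 whose last token is padding.
target = torch.tensor([[5, 7, 2, PAD]])

padding_mask = (target != PAD).unsqueeze(1).unsqueeze(2)       # [batch, 1, 1, tgt_len]
size = target.size(1)
causal_mask = torch.tril(torch.ones((1, size, size))).bool()   # [1, tgt_len, tgt_len]
target_mask = padding_mask & causal_mask                       # [batch, 1, tgt_len, tgt_len]

print(target_mask.shape)
print(target_mask[0, 0].int())
```

Row `i` of the printed matrix marks which positions token `i` may attend to: nothing after itself, and never the padded last column.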

In [None]:
# Define a Transformer model with encoder-decoder structure.
class Transformer(nn.Module):

    def __init__(self, d_model, num_heads, d_ff, num_layers, input_vocab_size, target_vocab_size, max_len=MAX_SEQ_LEN, dropout=0.1):

        super().__init__()

        # Define embedding layers for input and target vocabulary.
        self.encoder_embedding = nn.Embedding(input_vocab_size, d_model)
        self.decoder_embedding = nn.Embedding(target_vocab_size, d_model)

        # Positional embedding to encode token positions.
        self.pos_embedding = PositionalEmbedding(d_model, max_len)

        # Define encoder and decoder modules.
        self.encoder = Encoder(d_model, num_heads, d_ff, num_layers, dropout)
        self.decoder = Decoder(d_model, num_heads, d_ff, num_layers, dropout)

        # Output layer to map decoder output to vocabulary space.
        self.output_layer = nn.Linear(d_model, target_vocab_size)

    def forward(self, source, target):

        # Generate masks for encoder and decoder inputs.
        source_mask, target_mask = self.mask(source, target)

        # Apply embedding and positional encoding to source input.
        source = self.encoder_embedding(source) * math.sqrt(self.encoder_embedding.embedding_dim)
        source = self.pos_embedding(source)

        # Pass through encoder.
        encoder_output = self.encoder(source, source_mask)

        # Apply embedding and positional encoding to target input.
        target = self.decoder_embedding(target) * math.sqrt(self.decoder_embedding.embedding_dim)
        target = self.pos_embedding(target)

        # Pass through decoder.
        output = self.decoder(target, encoder_output, target_mask, source_mask)

        # Map decoder output to target vocabulary size.
        return self.output_layer(output)

    def mask(self, source, target):

        # Create source mask (1 for non-padding tokens, 0 for padding).
        source_mask = (source != 0).unsqueeze(1).unsqueeze(2)

        # Create target mask (1 for non-padding tokens, 0 for padding).
        target_mask = (target != 0).unsqueeze(1).unsqueeze(2)

        # Generate triangular mask to prevent attending to future tokens.
        size = target.size(1)
        no_mask = torch.tril(torch.ones((1, size, size), device=target.device)).bool()
        target_mask = target_mask & no_mask

        return source_mask, target_mask
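
To make the masking logic concrete, here is a minimal plain-Python sketch of what `mask` computes for a single target sequence. It is not the notebook's PyTorch code: the helper names and the toy sequence are hypothetical, and lists of booleans stand in for tensors. The padding index `0` matches the notebook.

```python
# Plain-Python illustration of the masking logic (no PyTorch needed).
PAD = 0  # padding index, as in the notebook


def padding_mask(tokens):
    """True where the token is real, False where it is padding."""
    return [tok != PAD for tok in tokens]


def causal_mask(size):
    """Lower-triangular matrix: position i may attend to positions 0..i."""
    return [[j <= i for j in range(size)] for i in range(size)]


def target_mask(tokens):
    """Allowed only if the attended position is real AND not in the future."""
    pad = padding_mask(tokens)
    causal = causal_mask(len(tokens))
    return [[pad[j] and causal[i][j] for j in range(len(tokens))]
            for i in range(len(tokens))]


toy_target = [5, 7, 9, 0]  # hypothetical sentence with one padding slot
mask = target_mask(toy_target)
for row in mask:
    print([int(x) for x in row])
```

Each row corresponds to one decoder position; the last column stays at 0 in every row because it holds a padding token.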

##**Test**

Testing the Transformer Model

To test the Transformer model, we define key parameters for the sequence length, batch size, and vocabulary size, and then generate random input data for the source and target sequences.

Parameters
- **seq_len_source**: Specifies the length of each source sequence (e.g., 10 tokens per sequence).
- **seq_len_target**: Specifies the length of each target sequence (e.g., 10 tokens per sequence).
- **batch_size**: Defines the number of samples in each batch (e.g., 2 sequences per batch).
- **input_vocab_size**: Sets the vocabulary size for the source language (e.g., 50 unique tokens).
- **target_vocab_size**: Sets the vocabulary size for the target language (e.g., 50 unique tokens).

Generating Input Data
- **source**: A random tensor simulating a batch of source sequences, where each token is randomly selected from the source vocabulary.
- **target**: A random tensor simulating a batch of target sequences, where each token is randomly selected from the target vocabulary.

These randomly generated sequences let us verify that the Transformer's forward pass runs without errors and returns a tensor of the expected shape. Because the model is untrained and the data is random, this check says nothing about translation quality.

In [None]:
# Parameters
seq_len_source = 10           # Length of each source sequence.
seq_len_target = 10           # Length of each target sequence.
batch_size = 2                # Number of samples in each batch.
input_vocab_size = 50         # Vocabulary size for source language.
target_vocab_size = 50        # Vocabulary size for target language.

# Generate random source and target sequences as input data.
source = torch.randint(1, input_vocab_size, (batch_size, seq_len_source))
target = torch.randint(1, target_vocab_size, (batch_size, seq_len_target))

To initialize the Transformer model, we define key hyperparameters that control the model's structure and complexity:

- **d_model**: Dimensionality of the model and embedding size (e.g., 512), which determines the size of each token’s representation.
- **num_heads**: Number of attention heads in multi-head attention (e.g., 8), allowing the model to focus on different parts of the input sequence in parallel.
- **d_ff**: Dimensionality of the feedforward layer (e.g., 2048), enabling the model to learn complex transformations on token representations.
- **num_layers**: Number of layers in both the encoder and decoder (e.g., 6), which defines the depth of the model and increases its representational capacity.

Model Instantiation
- **Transformer Model**: Using the specified hyperparameters, the `Transformer` model is instantiated, defining the encoder-decoder structure for processing source and target sequences.
- **Device Assignment**: The model and input tensors (`source` and `target`) are moved to the specified device (GPU or CPU) for efficient computation.

With these hyperparameters and setup, the model is ready for a forward pass with the defined test sequences, enabling us to test the end-to-end functionality of the Transformer.
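
As a quick sanity check on these values, the sketch below computes the per-head dimension. It assumes that the multi-head attention module splits `d_model` evenly across heads, which is the usual design but is not shown in this section; `head_dim` is a name introduced here for illustration.

```python
d_model = 512
num_heads = 8

# Assumption: the attention module splits d_model evenly across the heads.
assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
head_dim = d_model // num_heads
print(head_dim)
```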

In [None]:
# Hyperparameters for the Transformer Model
d_model = 512
num_heads = 8
d_ff = 2048
num_layers = 6

# Instantiate the Transformer model with the specified parameters.
model = Transformer(d_model, num_heads, d_ff, num_layers, input_vocab_size, target_vocab_size, max_len=MAX_SEQ_LEN, dropout=0.1)

# Move the model and input tensors to the specified device (GPU or CPU).
model = model.to(device)
source = source.to(device)
target = target.to(device)

After setting up the model and test data, we perform a forward pass through the Transformer to obtain predictions for the target sequence.

- **Forward Pass**:
  - `output = model(source, target)`: Feeds the `source` and `target` sequences into the model, generating a predicted output tensor.
  - The expected shape of the output tensor is `[batch_size, seq_len_target, target_vocab_size]`. In this test case, the expected shape would be `[2, 10, 50]`, where:
    - `2` represents the batch size,
    - `10` is the target sequence length, and
    - `50` is the size of the target vocabulary.

- **Output Shape Verification**:
  - `print(f'output.shape {output.shape}')`: Prints the shape of the output tensor, allowing us to confirm that the model produces predictions with the expected dimensions.

This test only confirms that the forward pass runs end to end and that the output has the structure expected for translation: one score per target-vocabulary entry at every target position.

In [None]:
# Perform a forward pass through the model with source and target sequences.
output = model(source, target)  # Get the model's output for the given input sequences.

In [None]:
# Expected output shape -> [batch, seq_len_target, target_vocab_size] i.e. [2, 10, 50]

# Print the shape of the output tensor to verify dimensions.
print(f'output.shape {output.shape}')

output.shape torch.Size([2, 10, 50])


## **Translator Eng-Spa**

To train a translation model, we load a dataset of English-Spanish sentence pairs from a text file.

Key Steps
- **File Path**:
  - `PATH` specifies the location of the text file containing English-Spanish pairs, with each line containing one pair separated by a tab (`\t`).

- **Reading the File**:
  - `with open(PATH, 'r', encoding='utf-8') as f`: Opens the file with UTF-8 encoding to handle special characters.
  - `lines = f.readlines()`: Reads all lines from the file.

- **Splitting into Pairs**:
  - `eng_spa_pairs = [line.strip().split('\t') for line in lines if '\t' in line]`: Processes each line to create a list of English-Spanish pairs by splitting on the tab separator. Lines without a tab are ignored.

- **Extracting Sentences**:
  - `eng_sentences` and `spa_sentences` are lists containing only the English and Spanish sentences, respectively, extracted from `eng_spa_pairs`.

- **Preview of Data**:
  - Display the first 10 pairs to verify that the data has been loaded correctly.

This setup provides a list of English and Spanish sentences, ready for tokenization and further preprocessing in the translation model.
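
The splitting step can be tried on a few in-memory lines before touching the real file. This is a minimal sketch; `sample_lines` is a hypothetical stand-in for the result of `f.readlines()`.

```python
# Hypothetical in-memory stand-in for the lines read from the data file.
sample_lines = [
    "Go.\tVe.\n",
    "Hi.\tHola.\n",
    "this line has no tab and is skipped\n",
]

# Same filtering and splitting as in the notebook cell.
pairs = [line.strip().split("\t") for line in sample_lines if "\t" in line]
eng = [pair[0] for pair in pairs]
spa = [pair[1] for pair in pairs]

print(eng)
print(spa)
```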

In [None]:
# Define the path to the text file containing English-Spanish sentence pairs.
PATH = '/content/drive/MyDrive/Colab Notebooks/eng-spa2024.csv'

# Open the file and read all lines with UTF-8 encoding.
with open(PATH, 'r', encoding='utf-8') as f:
    lines = f.readlines()

# Split each line into English-Spanish pairs, ignoring lines without a tab separator.
eng_spa_pairs = [line.strip().split('\t') for line in lines if '\t' in line]

# Peek at the first 10 pairs (a bare expression is only rendered when it is the last line of a cell).
eng_spa_pairs[:10]

# Extract the English sentences from the pairs.
eng_sentences = [pair[0] for pair in eng_spa_pairs]

# Extract the Spanish sentences from the pairs.
spa_sentences = [pair[1] for pair in eng_spa_pairs]

# Print the first 10 English and Spanish sentences.
print(eng_sentences[:10])
print(spa_sentences[:10])

['Ow!', 'So?', 'Go.', 'Hi.', 'OK.', 'Go!', 'Ah!', 'Go.', 'Go!', 'OK.']
['Â¡Ay!', 'Â¿Y?', 'Ve.', 'Hola.', 'Â¡Ã\x93rale!', 'Vete', 'Â¡Anda!', 'VÃ¡yase.', 'VÃ¡yase', 'Bueno.']
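
The Spanish preview contains sequences such as `Â¡` and `Ã¡`. This is the typical pattern left behind when UTF-8 bytes were decoded as Latin-1 at some earlier stage. The sketch below shows one way such strings could be repaired. `fix_mojibake` is a hypothetical helper, and the Latin-1 assumption is inferred from the preview only, not checked against the full dataset.

```python
def fix_mojibake(text):
    """Undo UTF-8 text that was decoded as Latin-1 (hypothetical helper)."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The string was not double-encoded this way; leave it unchanged.
        return text


samples = ["Â¡Ay!", "Â¿Y?", "VÃ¡yase.", "Hola."]
print([fix_mojibake(s) for s in samples])
```

Strings that are already clean pass through unchanged, so the helper can be applied to every sentence.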


###**Preprocess Sentences**

Sentence Preprocessing

The `preprocess_sentence` function prepares text data by cleaning and normalizing sentences. This process ensures that the input text is consistent and formatted for the model.

Key Steps in Sentence Preprocessing
- **Standardize Text**: Converts text to lowercase, collapses repeated whitespace, and replaces the accented vowels á, é, í, ó, ú with their unaccented forms.
- **Remove Unnecessary Characters**: Replaces every character outside `a-z` with a space. This covers digits and punctuation, and also letters such as ñ or ü, which are not normalized.
- **Add Special Tokens**: Adds `<sos>` and `<eos>` tokens to mark the start and end of each sentence, providing clear boundaries for the model.

This preprocessing function ensures that sentences are in a clean, consistent format, ready for model input.

In [None]:
def preprocess_sentence(sentence):

    # Convert sentence to lowercase and remove leading/trailing whitespace.
    sentence = sentence.lower().strip()

    # Collapse runs of whitespace into a single space.
    sentence = re.sub(r'\s+', " ", sentence)

    # Normalize accented characters to their non-accented equivalents.
    sentence = re.sub(r"[á]+", "a", sentence)
    sentence = re.sub(r"[é]+", "e", sentence)
    sentence = re.sub(r"[í]+", "i", sentence)
    sentence = re.sub(r"[ó]+", "o", sentence)
    sentence = re.sub(r"[ú]+", "u", sentence)

    # Replace every run of characters outside a-z (digits, punctuation, ñ, ü, ...) with a space.
    sentence = re.sub(r"[^a-z]+", " ", sentence)

    # Remove leading/trailing spaces after cleaning.
    sentence = sentence.strip()

    # Add start and end tokens to sentence.
    sentence = '<sos> ' + sentence + ' <eos>'

    return sentence

In [None]:
s1 = '¿Hola @ cómo estás? 123'

In [None]:
print(s1)
print(preprocess_sentence(s1))

¿Hola @ cómo estás? 123
<sos> hola como estas <eos>
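
The function above only maps á, é, í, ó, ú. Letters such as ñ or ü fall through to the `a-z` filter and become spaces, so `niño` would turn into `ni o`. The sketch below is an alternative, not the notebook's method: it uses Unicode decomposition from the standard `unicodedata` module. `strip_accents` and `preprocess_sentence_nfd` are hypothetical names.

```python
import re
import unicodedata


def strip_accents(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def preprocess_sentence_nfd(sentence):
    """Hypothetical variant of preprocess_sentence built on strip_accents."""
    sentence = strip_accents(sentence.lower().strip())
    sentence = re.sub(r"[^a-z]+", " ", sentence).strip()
    return "<sos> " + sentence + " <eos>"


print(preprocess_sentence_nfd("¿Dónde está el pingüino del niño?"))
```

Switching to this variant would change the vocabulary, so the model would need to be retrained on the re-processed sentences.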


In [None]:
eng_sentences = [preprocess_sentence(sentence) for sentence in eng_sentences]
spa_sentences = [preprocess_sentence(sentence) for sentence in spa_sentences]

In [None]:
spa_sentences[:10]

['<sos> ay <eos>',
 '<sos> y <eos>',
 '<sos> ve <eos>',
 '<sos> hola <eos>',
 '<sos> rale <eos>',
 '<sos> vete <eos>',
 '<sos> anda <eos>',
 '<sos> v yase <eos>',
 '<sos> v yase <eos>',
 '<sos> bueno <eos>']

###**Build Vocabulary**

The `build_vocab` function creates a vocabulary from a list of sentences, mapping each unique word to a corresponding index.

Key Steps in Building Vocabulary
- **Tokenize and Count Words**: Splits sentences into words and counts the frequency of each word.
- **Sort by Frequency**: Orders words by their frequency in descending order, allowing the most common words to have the lowest indices.
- **Create Word-to-Index Mapping**: Assigns a unique index to each word, starting from index 2. Special tokens are added for padding (`<pad>`) and unknown words (`<unk>`) at indices 0 and 1, respectively.
- **Create Index-to-Word Mapping**: Reverses the word-to-index mapping, enabling conversion back from indices to words.

This function generates vocabulary dictionaries that facilitate the conversion between text and numeric indices, essential for processing input data for the model.

In [None]:
def build_vocab(sentences):

    # Flatten the list of sentences into individual words.
    words = [word for sentence in sentences for word in sentence.split()]

    # Count the occurrences of each word.
    word_count = Counter(words)

    # Sort words by frequency in descending order.
    sorted_word_counts = sorted(word_count.items(), key=lambda x:x[1], reverse=True)

    # Create a mapping of words to indices starting from index 2.
    word2idx = {word: idx for idx, (word, _) in enumerate(sorted_word_counts, 2)}

    # Add special tokens for padding and unknown words.
    word2idx['<pad>'] = 0
    word2idx['<unk>'] = 1

    # Reverse the mapping: indices to words.
    idx2word = {idx: word for word, idx in word2idx.items()}

    return word2idx, idx2word

In [None]:
# Build vocabulary for English and Spanish sentences.
eng_word2idx, eng_idx2word = build_vocab(eng_sentences)
spa_word2idx, spa_idx2word = build_vocab(spa_sentences)

# Get the vocabulary sizes for both languages.
eng_vocab_size = len(eng_word2idx)
spa_vocab_size = len(spa_word2idx)

# Print the vocabulary sizes.
print(eng_vocab_size, spa_vocab_size)

27672 43296
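
To see how indices are assigned, the sketch below runs the same logic on two toy sentences. The function is restated in compact form so the example runs on its own; the toy sentences are hypothetical.

```python
from collections import Counter


def build_vocab(sentences):
    """Same logic as the notebook function, restated so the example runs alone."""
    counts = Counter(word for s in sentences for word in s.split())
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    word2idx = {word: idx for idx, (word, _) in enumerate(ordered, 2)}
    word2idx["<pad>"] = 0
    word2idx["<unk>"] = 1
    idx2word = {idx: word for word, idx in word2idx.items()}
    return word2idx, idx2word


toy = ["<sos> hola <eos>", "<sos> hola mundo <eos>"]
word2idx, idx2word = build_vocab(toy)
print(word2idx)
```

Words with equal counts keep their first-seen order because Python's sort is stable, so the least frequent word (`mundo`) receives the highest index.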


###**English-Spanish Dataset**

The `EngSpaDataset` class defines a custom dataset for loading English-Spanish sentence pairs, preparing them for use in the translation model.

Key Components of the EngSpaDataset Class
- **Initialization**:
  - Accepts lists of English and Spanish sentences along with vocabulary mappings for each language. These mappings (`eng_word2idx` and `spa_word2idx`) are used to convert words into index representations.

- **Dataset Length**:
  - The `__len__` method returns the number of sentence pairs in the dataset, which is the total number of training examples.

- **Get Item**:
  - The `__getitem__` method retrieves a sentence pair (English and Spanish) by index.
  - Each sentence is tokenized by converting words to their respective indices using the vocabulary dictionaries, with unknown words replaced by the `<unk>` token.
  - The method returns the tokenized English and Spanish sentences as tensors.

This dataset class enables efficient loading and tokenization of English-Spanish sentence pairs, readying them for model training.

In [None]:
# Define a custom Dataset for English-Spanish sentence pairs.
class EngSpaDataset(Dataset):

    # Initialize dataset with English and Spanish sentences and vocab mappings.
    def __init__(self, eng_sentences, spa_sentences, eng_word2idx, spa_word2idx):

        self.eng_sentences = eng_sentences  # List of English sentences.
        self.spa_sentences = spa_sentences  # List of Spanish sentences.

        self.eng_word2idx = eng_word2idx  # English word-to-index dictionary.
        self.spa_word2idx = spa_word2idx  # Spanish word-to-index dictionary.

    # Return the number of sentences in the dataset.
    def __len__(self):

        return len(self.eng_sentences)

    # Return the tokenized index version of an English-Spanish sentence pair.
    def __getitem__(self, idx):

        eng_sentence = self.eng_sentences[idx]  # Get the English sentence at the given index.
        spa_sentence = self.spa_sentences[idx]  # Get the Spanish sentence at the given index.

        # Convert English and Spanish sentences to indices using respective vocabularies.
        eng_idxs = [self.eng_word2idx.get(word, self.eng_word2idx['<unk>']) for word in eng_sentence.split()]
        spa_idxs = [self.spa_word2idx.get(word, self.spa_word2idx['<unk>']) for word in spa_sentence.split()]

        # Return the tokenized English and Spanish sentences as tensors.
        return torch.tensor(eng_idxs), torch.tensor(spa_idxs)


In [None]:
# Custom collate function to process a batch of sentences for the DataLoader.
def collate_fn(batch):

    # Unzip the batch into English and Spanish sentence pairs.
    eng_batch, spa_batch = zip(*batch)

    # Truncate English sentences to at most MAX_SEQ_LEN tokens (padding happens below).
    eng_batch = [seq[:MAX_SEQ_LEN].clone().detach() for seq in eng_batch]

    # Truncate Spanish sentences to at most MAX_SEQ_LEN tokens (padding happens below).
    spa_batch = [seq[:MAX_SEQ_LEN].clone().detach() for seq in spa_batch]

    # Pad the English sentences to ensure all sequences in the batch are of equal length.
    eng_batch = torch.nn.utils.rnn.pad_sequence(eng_batch, batch_first=True, padding_value=0)

    # Pad the Spanish sentences to ensure all sequences in the batch are of equal length.
    spa_batch = torch.nn.utils.rnn.pad_sequence(spa_batch, batch_first=True, padding_value=0)

    # Return the padded English and Spanish sentence batches.
    return eng_batch, spa_batch
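
The effect of `collate_fn` on one batch can be illustrated with plain lists. This is a sketch of the same truncate-then-pad idea, not a replacement for `pad_sequence`; `truncate_and_pad` and the token ids are hypothetical.

```python
def truncate_and_pad(batch, max_len, pad_value=0):
    """Cut each sequence to max_len, then right-pad to the longest one left."""
    truncated = [seq[:max_len] for seq in batch]
    width = max(len(seq) for seq in truncated)
    return [seq + [pad_value] * (width - len(seq)) for seq in truncated]


# Hypothetical token ids: 2 = <sos>, 3 = <eos>.
batch = [[2, 7, 8, 9, 3], [2, 7, 3]]
padded = truncate_and_pad(batch, max_len=4)
print(padded)
```

Note that slicing a sequence longer than the limit also drops its trailing `<eos>` token, which is what happens to the first toy sequence here.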

###**Training**

The `train` function defines the training loop for the Transformer model, iteratively adjusting the model’s parameters to minimize the loss between predicted and target translations.

Key Steps in the Training Loop
- **Model Training Mode**:
  - `model.train()` sets the model to training mode, enabling dropout layers and other training-specific behavior.

- **Epoch Loop**:
  - For each epoch, the model processes all batches in the dataloader. The total loss for the epoch is tracked to monitor training progress.

- **Batch Processing**:
  - Each batch of English and Spanish sentences is loaded and moved to the specified device (GPU or CPU).
  - The target (Spanish) batch is split into `target_input` (input for the decoder) and `target_output` (correct output for loss calculation). The last token is removed from `target_input`, and the first token is removed from `target_output` to align them for prediction.

- **Forward Pass and Loss Calculation**:
  - The model generates predictions from the `eng_batch` and `target_input`.
  - The output is reshaped to match the shape of `target_output`, and the loss is calculated between the model's predictions and the actual target output.

- **Backpropagation and Optimization**:
  - Gradients are reset, and the model performs backpropagation to calculate gradients of the loss with respect to model parameters.
  - The optimizer updates the model's parameters to minimize the loss.

- **Epoch Loss Tracking**:
  - The average loss for the epoch is calculated and printed, allowing us to track the model's learning progress over time.

This training loop optimizes the model’s parameters, gradually reducing the loss and improving the model's translation accuracy.
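
The split into `target_input` and `target_output` is easier to follow on a single row. The sketch below uses a plain list with hypothetical token ids instead of a tensor batch.

```python
# Hypothetical token ids: 2 = <sos>, 3 = <eos>, 0 = <pad>.
spa_row = [2, 11, 12, 3, 0]

target_input = spa_row[:-1]   # what the decoder sees
target_output = spa_row[1:]   # what it must predict at each position

for seen, expected in zip(target_input, target_output):
    print(f"decoder input {seen:>2} -> expected output {expected:>2}")
```

The last pair asks for a `<pad>` token; with `ignore_index=0` in the loss function, such positions do not contribute to the loss.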

In [None]:
# Training loop for the Transformer model
def train(model, dataloader, loss_function, optimiser, epochs):

    # Set model to training mode
    model.train()

    # Loop over epochs
    for epoch in range(epochs):
        total_loss = 0  # Initialize total loss for the epoch

        # Loop over batches in the dataloader
        for i, (eng_batch, spa_batch) in enumerate(dataloader):
            # Move batches to the device (GPU or CPU)
            eng_batch = eng_batch.to(device)
            spa_batch = spa_batch.to(device)

            # Preprocess target (Spanish) sentences for the decoder
            target_input = spa_batch[:, :-1]
            target_output = spa_batch[:, 1:].contiguous().view(-1)

            # Zero the gradients before backpropagation
            optimiser.zero_grad()

            # Run the model and get output
            output = model(eng_batch, target_input)
            output = output.view(-1, output.size(-1))

            # Compute loss between model output and target output
            loss = loss_function(output, target_output)

            # Backpropagation and parameter update
            loss.backward()
            optimiser.step()

            # Accumulate loss for the current batch
            total_loss += loss.item()

        # Calculate average loss for the epoch
        avg_loss = total_loss / len(dataloader)

        # Print progress at the end of the epoch
        print(f'Epoch: {epoch}/{epochs}, Loss: {avg_loss:.4f}')

In [None]:
# Define batch size for training.
BATCH_SIZE = 64

# Initialize the dataset for English-Spanish sentence pairs.
dataset = EngSpaDataset(eng_sentences, spa_sentences, eng_word2idx, spa_word2idx)

# Create a DataLoader for batching, shuffling, and padding sequences.
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_fn)

In [None]:
# Initialize the Transformer model with specified hyperparameters.
model = Transformer(
    d_model=512,
    num_heads=8,
    d_ff=2048,
    num_layers=6,
    input_vocab_size=eng_vocab_size,  # Vocabulary size for input (English).
    target_vocab_size=spa_vocab_size, # Vocabulary size for output (Spanish).
    max_len=MAX_SEQ_LEN,              # Maximum sequence length.
    dropout=0.1
)

In [None]:
# Move the model to the specified device (GPU/CPU).
model = model.to(device)

# Define the loss function as CrossEntropyLoss, ignoring padding index (0).
loss_function = nn.CrossEntropyLoss(ignore_index=0)

# Set up the Adam optimizer with a learning rate of 0.0001 for model parameters.
optimiser = optim.Adam(model.parameters(), lr=0.0001)

In [None]:
# Train the model using the provided data loader, loss function, optimizer, and number of epochs (10).
train(model, dataloader, loss_function, optimiser, epochs=10)

Epoch: 0/10, Loss: 3.4942
Epoch: 1/10, Loss: 2.1202
Epoch: 2/10, Loss: 1.6389
Epoch: 3/10, Loss: 1.3261
Epoch: 4/10, Loss: 1.0884
Epoch: 5/10, Loss: 0.8961
Epoch: 6/10, Loss: 0.7403
Epoch: 7/10, Loss: 0.6183
Epoch: 8/10, Loss: 0.5259
Epoch: 9/10, Loss: 0.4601
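
One way to read these numbers: `nn.CrossEntropyLoss` uses the natural logarithm, so the exponential of the loss is roughly the model's per-token perplexity. The sketch below applies that to the first and last values of the log. These are training-set figures averaged per batch, so they are only a rough indicator and do not measure performance on unseen sentences.

```python
import math

# Average losses copied from the training log above (first and last epoch).
first_epoch_loss = 3.4942
last_epoch_loss = 0.4601

ppl_first = math.exp(first_epoch_loss)
ppl_last = math.exp(last_epoch_loss)

print(f"{ppl_first:.1f}")
print(f"{ppl_last:.2f}")
```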


###**Translate Sentences**

The `sentence_to_indices` and `indices_to_sentence` functions handle conversions between sentences and their indexed representations, which is essential for processing text in the model.

`sentence_to_indices`
- **Purpose**: Converts a sentence (string) into a list of word indices based on a word-to-index (`word2idx`) mapping.
- **Functionality**: Each word in the sentence is replaced by its corresponding index. If a word is not in the vocabulary, it is replaced by the index for the `<unk>` (unknown) token.

`indices_to_sentence`
- **Purpose**: Converts a list of indices back into a readable sentence using an index-to-word (`idx2word`) mapping.
- **Functionality**: Each index is replaced by its corresponding word. Padding tokens (`<pad>`) are excluded to avoid unnecessary spaces in the reconstructed sentence.

These functions facilitate the conversion between text and numeric representations, making it possible to input text data to the model and convert model predictions back to human-readable text.

In [None]:
# Convert a sentence into a list of word indices using the provided word-to-index mapping.
def sentence_to_indices(sentence, word2idx):
    return [word2idx.get(word, word2idx['<unk>']) for word in sentence.split()]

# Convert a list of indices back into a sentence using the provided index-to-word mapping.
def indices_to_sentence(indices, idx2word):
    return ' '.join([idx2word[idx] for idx in indices if idx in idx2word and idx2word[idx] != '<pad>'])
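
A short round trip shows both helpers at work, including the `<unk>` fallback and the removal of padding. The two functions are restated so the example runs on its own; the toy vocabulary is hypothetical.

```python
def sentence_to_indices(sentence, word2idx):
    return [word2idx.get(word, word2idx["<unk>"]) for word in sentence.split()]


def indices_to_sentence(indices, idx2word):
    return " ".join(idx2word[i] for i in indices
                    if i in idx2word and idx2word[i] != "<pad>")


# Hypothetical toy vocabulary.
word2idx = {"<pad>": 0, "<unk>": 1, "<sos>": 2, "<eos>": 3, "hola": 4}
idx2word = {idx: word for word, idx in word2idx.items()}

ids = sentence_to_indices("<sos> hola amigo <eos>", word2idx)
restored = indices_to_sentence(ids + [0, 0], idx2word)
print(ids)
print(restored)
```

The word `amigo` is not in the toy vocabulary, so it comes back as `<unk>`: the round trip is lossy for out-of-vocabulary words.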

The `translate_sentence` function uses a trained Transformer model to translate an input sentence from English to Spanish by encoding the input and generating a target sequence token-by-token.

Key Steps in Translation
- **Model Evaluation Mode**:
  - `model.eval()` sets the model to evaluation mode, which disables dropout layers and ensures consistent inference behavior.

- **Sentence Preprocessing**:
  - The input sentence is preprocessed to match the format expected by the model (e.g., lowercase, removing extra spaces, adding special tokens).

- **Convert Sentence to Indices**:
  - `sentence_to_indices` is used to map the words in the preprocessed sentence to their corresponding indices based on the English vocabulary.
  - This indexed sentence is then converted to a tensor and moved to the appropriate device (CPU or GPU).

- **Initialize Target Sequence**:
  - The target sequence begins with the `<sos>` (start-of-sequence) token. This token acts as a starting point for the model to begin generating the translation.

- **Token Generation Loop**:
  - The function generates each token in the target sentence iteratively:
    - The model's output is computed based on the input tensor and the current state of the target sequence.
    - The most probable next token is determined by selecting the index with the highest probability.
    - This token is added to the target sequence, and the process continues until either the maximum length is reached or the `<eos>` (end-of-sequence) token is generated.

- **Convert Indices to Sentence**:
  - Once the target indices are generated, `indices_to_sentence` converts them back to a readable Spanish sentence.

This function allows for sentence-by-sentence translation using a trained Transformer model, outputting the generated translation based on the model’s learned representations.
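
The token generation loop is a greedy decoder. The sketch below isolates that loop with a hypothetical lookup table standing in for the model, so it runs without PyTorch. `NEXT_SCORES`, `greedy_decode`, and the token ids are made up for illustration.

```python
SOS, EOS = 2, 3  # hypothetical token ids

# Hypothetical stand-in for the model: next-token scores given the last token.
NEXT_SCORES = {
    SOS: {4: 0.9, 5: 0.1},
    4: {5: 0.7, EOS: 0.3},
    5: {EOS: 0.8, 4: 0.2},
}


def greedy_decode(max_len):
    """Pick the highest-scoring token at every step until <eos> or max_len."""
    tokens = [SOS]
    for _ in range(max_len):
        scores = NEXT_SCORES[tokens[-1]]
        next_token = max(scores, key=scores.get)
        tokens.append(next_token)
        if next_token == EOS:
            break
    return tokens


full = greedy_decode(max_len=10)
cut = greedy_decode(max_len=1)
print(full)
print(cut)
```

Unlike this table, the real model conditions on the whole source sentence and on all previously generated tokens. The stopping rule is the same: stop at `<eos>` or after `max_len` steps.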

In [None]:
# Translate a sentence using the trained model by encoding the input and generating a target sequence.
def translate_sentence(model, sentence, eng_word2idx, spa_idx2word, max_len=MAX_SEQ_LEN, device='cpu'):
    model.eval()  # Set the model to evaluation mode.
    sentence = preprocess_sentence(sentence)  # Preprocess the input sentence.
    input_indices = sentence_to_indices(sentence, eng_word2idx)  # Convert the sentence to indices.
    input_tensor = torch.tensor(input_indices).unsqueeze(0).to(device)  # Convert indices to tensor.

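    # Rebuild the Spanish word-to-index mapping from spa_idx2word so the function
    # does not depend on a global spa_word2idx.
    spa_word2idx = {word: idx for idx, word in spa_idx2word.items()}

    # Initialize the target sequence with the <sos> token.
    tgt_indices = [spa_word2idx['<sos>']]
    tgt_tensor = torch.tensor(tgt_indices).unsqueeze(0).to(device)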

    with torch.no_grad():  # Disable gradient computation during inference.
        for _ in range(max_len):  # Generate tokens until max length or <eos> is reached.
            output = model(input_tensor, tgt_tensor)  # Get model's output.
            output = output.squeeze(0)
            next_token = output.argmax(dim=-1)[-1].item()  # Get the most probable token.
            tgt_indices.append(next_token)  # Append the token to the target sequence.
            tgt_tensor = torch.tensor(tgt_indices).unsqueeze(0).to(device)  # Update target tensor.
            if next_token == spa_word2idx['<eos>']:  # Stop if <eos> token is generated.
                break

    return indices_to_sentence(tgt_indices, spa_idx2word)  # Convert generated indices back to a sentence.

###**Evaluate Translations**

The `evaluate_translations` function takes a list of English sentences, translates each using the trained Transformer model, and displays the original sentences alongside their translations.

Key Steps in Translation Evaluation
- **Iterate Through Sentences**:
  - For each sentence in the provided list, the function generates a Spanish translation by calling `translate_sentence`, which leverages the model to predict the target sequence.

- **Display Results**:
  - The original input sentence and its corresponding translation are printed side-by-side for easy comparison.

This function supports a quick qualitative inspection of the model's translations on several sentences. It does not compute a quantitative metric.

In [None]:
# Evaluate the translations of a list of sentences using the trained model.
def evaluate_translations(model, sentences, eng_word2idx, spa_idx2word, max_len=MAX_SEQ_LEN, device='cpu'):

    # Iterate through each sentence in the provided list.
    for sentence in sentences:
        # Translate the sentence using the trained model.
        translation = translate_sentence(model, sentence, eng_word2idx, spa_idx2word, max_len, device)

        # Print the original sentence and its translation.
        print(f'Input Sentence: {sentence}')
        print(f'Translation: {translation}')
        print()
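
The printed translations keep the `<sos>` and `<eos>` markers. For display purposes they could be removed with a small helper such as the sketch below. `strip_special_tokens` is a hypothetical name and is not part of the notebook.

```python
def strip_special_tokens(translation, specials=("<sos>", "<eos>", "<pad>")):
    """Hypothetical display helper: drop marker tokens from a decoded sentence."""
    return " ".join(word for word in translation.split() if word not in specials)


clean = strip_special_tokens("<sos> demos un paseo en el parque <eos>")
print(clean)
```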

In [None]:
# Sentences to test the translator.
test_sentences = [

    "What time is it right now?",
    "I need to buy groceries today.",
    "The stars look beautiful tonight.",
    "Can you show me how to do this?",
    "This movie is really entertaining.",
    "I enjoy listening to music in my free time.",
    "The water in the lake is so clear.",
    "Let’s take a walk in the park.",
    "He hide himself in the citchen",
    "She asked for a candy"
]

In [None]:
# Check if a GPU is available and set the device accordingly.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move the model to the selected device (GPU/CPU).
model = model.to(device)

# Evaluate the translations for the test sentences.
evaluate_translations(model, test_sentences, eng_word2idx, spa_idx2word, max_len=MAX_SEQ_LEN, device=device)

Input Sentence: What time is it right now?
Translation: <sos> ahora mismo es lo que es <eos>

Input Sentence: I need to buy groceries today.
Translation: <sos> tengo que comprar hoy <eos>

Input Sentence: The stars look beautiful tonight.
Translation: <sos> las estrellas se ven hermosas esta noche <eos>

Input Sentence: Can you show me how to do this?
Translation: <sos> puedes mostrarme c mo hacer esto <eos>

Input Sentence: This movie is really entertaining.
Translation: <sos> esta pel cula es muy entretenida <eos>

Input Sentence: I enjoy listening to music in my free time.
Translation: <sos> me gusta escuchar m sica en mi tiempo libre <eos>

Input Sentence: The water in the lake is so clear.
Translation: <sos> el lago est tan clara en el agua <eos>

Input Sentence: Let’s take a walk in the park.
Translation: <sos> demos un paseo en el parque <eos>

Input Sentence: He hide himself in the citchen
Translation: <sos> l se oculta el dinero en el lugar <eos>

Input Sentence: She asked for