<a href="https://colab.research.google.com/github/arkeodev/nlp/blob/main/Neural_Machine_Translation_with_RNNs/Neural_Machine_Translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

$$
\begin{array}{c}
\text{$\Large We\ must\ cultivate\ our\ garden.$} \\
{\text{$\small Voltaire\ -\ Candide$}} \\
\end{array}
$$

# Neural Machine Translation with Sequence to Sequence Networks

Neural Machine Translation (NMT) is a sophisticated approach to language translation that leverages deep learning techniques to facilitate the automatic translation of text from one language to another. As a subset of natural language processing (NLP), NMT has transformed the landscape of translation by introducing models that can understand and translate whole sentences, often preserving the context and semantic meaning better than previous rule-based and statistical methods.

## Sequence Networks

<figure>
    <img src="https://raw.githubusercontent.com/arkeodev/nlp/main/Neural_Machine_Translation_with_RNNs/images/sequence_models_types.png" width="1000" height="300" alt="Sequence Network Types">
    <figcaption>Sequence Network Types</figcaption>
</figure>

Sequence networks are specialized architectures within the field of neural networks designed to handle sequential data. Unlike traditional neural networks that assume all inputs (and outputs) are independent of each other, sequence networks are adept at managing data where the order is significant, such as time series data, sentences in natural language processing, or steps in a video. These networks can capture temporal dynamics and relationships within the data, allowing them to perform tasks like predicting the next word in a sentence, generating text, understanding spoken language, or forecasting time-dependent variables. The adaptability of sequence networks to various input-output mappings—such as one-to-one, one-to-many, many-to-one, and many-to-many—makes them extraordinarily versatile and powerful for a wide range of sequential processing tasks in machine learning.

Here is the Sequence Network Types:

- **One to One**: This type of network has a single input and produces a single output. It is the simplest form of sequence modeling and is not typically used for sequences. An example task would be a traditional neural network that maps one input to one output, like a simple regression.

- **One to Many**: This network takes a single input and produces a sequence of outputs. It is useful for tasks where one piece of information can lead to a sequence of results. An example task would be image captioning, where an image input results in a sequence of words forming the caption.

- **Many to One**: Here, a sequence of inputs leads to a single output. This is often used for tasks where the entire sequence is necessary to produce a meaningful result. An example task would be sentiment analysis, where a sequence of words (a sentence or document) is classified as expressing a positive or negative sentiment.

- **Many to Many (synced)**: This type of network processes a sequence of inputs and produces a sequence of outputs where the output is generated synchronously with the input. This is often used in tasks where each time step in the input is directly related to a time step in the output. An example task would be video frame prediction, where each input frame corresponds to an output prediction.

- **Many to Many (asynced)**: This network type processes a sequence of inputs and produces a sequence of outputs, but the outputs are not synchronously produced with the inputs. This can be used in tasks where a sequence is processed before the output sequence is generated. An example task would be machine translation, where an entire sentence must be read before the translation can begin.

## Impact on Language Processing

The advent of Sequence networks and the NMT has notably impacted the field of language processing by introducing several key capabilities:

- **Handling of Idiomatic Expressions**: Traditional models often struggled with idioms and culturally specific phrases. NMT's contextual understanding significantly improves handling such expressions, translating them more naturally and accurately.

- **Reduction in Translation Latency**: NMT models can translate texts substantially faster than traditional models, especially when optimized and deployed on appropriate hardware. This speed is crucial for applications requiring real-time translation.

- **Improved Scalability**: Given the right computational resources, NMT models can be scaled to accommodate large-scale translation tasks that were previously impractical, making it feasible to offer high-quality translation services on a global scale.

- **Accessibility**: By lowering language barriers, NMT increases accessibility, allowing more people to access content in their native or preferred languages. This inclusivity is crucial in educational contexts and information dissemination.

## Evolution of Neural Machine Translation

The development of Neural Machine Translation (NMT) marks a significant milestone in the progression from basic models to sophisticated neural networks designed to handle complex language processing tasks. This evolution is not only a story of technological advancement but also of conceptual shifts in how machines understand and process human languages.


### Early Translation Models

1. **Rule-Based Translation Systems (RBMT)**: The earliest attempts at machine translation were rule-based, relying on a comprehensive set of language rules and bilingual dictionaries. These systems required extensive manual work to define grammatical structures and vocabulary mappings between the source and target languages. While they were somewhat effective for languages with similar structures, their performance dropped significantly with complex or unrelated language pairs.

2. **Statistical Machine Translation (SMT)**: Emerging in the late 1980s and coming into prominence in the 1990s, SMT represented a shift towards data-driven approaches. SMT models used statistical methods to translate text based on the probability distributions of words and phrases, learned from large corpora of translated texts. This method was more flexible and scalable than RBMT, but still struggled with syntactic and semantic nuances, often producing literal translations that lacked contextual coherence.

### The Role of Embeddings in NMT Evolution

#### Introduction to Word Embeddings

Word embeddings are a type of word representation that allows words with similar meanings to have a similar representation. They are low-dimensional, continuous, dense vectors that are learned from text data. These vectors aim to capture syntactic and semantic word relationships based on the contexts in which words appear. The development and use of word embeddings have been fundamental in advancing NMT by providing a more nuanced and effective means of representing language data in neural networks.

#### Early Embeddings: From One-Hot to Distributed Representations

1. **One-Hot Encodings**: Initially, language models, including early translation systems, relied on one-hot encoding to represent words. Each word in the vocabulary was represented by a vector where only one element is one, and all others are zero. This method was simple but had major limitations, such as high dimensionality and an inability to capture semantic relationships between words.

2. **Distributed Representations (Word2Vec, GloVe)**: The shift to distributed representations marked a significant improvement. Techniques like Word2Vec and GloVe allowed words to be represented as dense vectors where semantically similar words were mapped to proximate points in vector space. These embeddings were pre-trained on large corpora and could then be used to initialize the first layer of neural networks in NMT systems, providing a richer and more expressive input representation.

#### Embeddings in NMT

1. **Improved Semantic Capture**: Embeddings provided a way for models to grasp semantic meanings and relationships, which are crucial for accurate translation. For instance, synonyms or contextually related words could be recognized as closer in the embedding space, aiding in more coherent and contextually appropriate translations.

2. **Efficiency and Scalability**: By reducing the dimensionality of the input space compared to one-hot encodings, embeddings made neural models more computationally efficient and easier to train. This scalability was essential as NMT models began to be applied to larger and more complex language pairs and datasets.

3. **Custom Embeddings for NMT**: As NMT systems evolved, researchers started training custom embeddings as part of the end-to-end training process of the translation model. This allowed the embeddings to be optimized specifically for the translation task, rather than relying solely on pre-trained embeddings. This integration led to further improvements in translation quality by tailoring the embeddings to capture nuances specific to the languages and textual contexts involved in the translation tasks.

#### Continuous Improvement with Contextual Embeddings

With the arrival of models like ELMo and later BERT, the concept of embeddings expanded from static representations to contextual embeddings. These are dynamic embeddings that change based on the words' context in a sentence, providing even richer information about word usage and meaning. This advancement significantly improved the subtlety with which NMT systems could handle language, leading to even better translations, particularly in handling idiomatic and nuanced phrases.

### Introduction of Neural Networks

1. **Initial Neural Approaches**: The introduction of neural networks to machine translation began with feedforward neural networks, which were initially used to improve specific components of SMT systems, like language modeling and re-ranking of translation hypotheses. These early neural components hinted at the potential of fully neural systems.

2. **Recurrent Neural Networks (RNNs)**: The true breakthrough came with the application of RNNs, particularly Long Short-Term Memory (LSTM) networks, which could remember long sequences of words—crucial for maintaining context in sentences. The encoder-decoder architecture, where one RNN encoded the input sentence into a context vector and another RNN decoded this vector into a translation, became a foundational model for NMT.

<figure>
    <img src="https://raw.githubusercontent.com/arkeodev/nlp/main/Neural_Machine_Translation_with_RNNs/images/neural_machine_translation.png" width="1000" height="300" alt="Neural Machine Translation">
    <figcaption>Neural Machine Translation</figcaption>
</figure>

### Advancements and Modern Architectures


1. **Attention Mechanisms**: The introduction of attention mechanisms was a pivotal improvement in NMT. It allowed the model to focus on different parts of the input sentence while translating, mimicking how human translators revisit different words and phrases. This led to translations that were not only more fluent but also more accurate in terms of context and semantics.

2. **Transformers and Self-Attention**: The development of the Transformer model in 2017 marked the next significant evolution. Transformers replaced recurrence with self-attention layers, which process all words in the sentence simultaneously. This parallel processing significantly increased the speed and efficiency of training and improved the handling of long-range dependencies in text, setting new standards for translation quality.

3. **Integration of BERT and Pre-trained Models**: Following the success of Transformers, the use of bidirectional and pre-trained models like BERT (Bidirectional Encoder Representations from Transformers) further pushed the boundaries. These models, pre-trained on vast amounts of text before being fine-tuned for translation, brought improvements in understanding contextual nuances and generative capabilities.


## Implementation of NMT

### Setting Up the Environment

In [1]:
# Import necessary libraries
import torch
import torch.nn as nn
import torch.optim as optim
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset

import torchtext
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

from pathlib import Path
from tqdm import tqdm

import numpy as np
import matplotlib.pyplot as plt

import spacy

### Data Preparation

#### Downloading the Dataset

In [2]:
!wget https://www.manythings.org/anki/fra-eng.zip
!unzip -o fra-eng.zip -d dataset

# Path to the data txt file on disk.
data_path = 'dataset/fra.txt'
Path("fra-eng.zip").unlink()

--2024-04-23 23:12:30--  https://www.manythings.org/anki/fra-eng.zip
Resolving www.manythings.org (www.manythings.org)... 173.254.30.110
Connecting to www.manythings.org (www.manythings.org)|173.254.30.110|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7943074 (7,6M) [application/zip]
Saving to: ‘fra-eng.zip’


2024-04-23 23:12:34 (3,02 MB/s) - ‘fra-eng.zip’ saved [7943074/7943074]

Archive:  fra-eng.zip
  inflating: dataset/_about.txt      
  inflating: dataset/fra.txt         


#### Preprocessing Data

In [3]:
import os
os.environ['KMP_DUPLICATE_LIB_OK']='TRUE'
! python -m spacy download en_core_web_sm -q
! python -m spacy download fr_core_news_sm -q

[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('fr_core_news_sm')


**Load Data:**

This function opens the dataset and extracts the English and French sentences from each line, ignoring metadata and attribution text.

In [4]:
# Load spaCy models
spacy_en = spacy.load('en_core_web_sm')
spacy_fr = spacy.load('fr_core_news_sm')

def load_data(file_path):
    """Load and preprocess data from file."""
    data = []
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            src, trg, _ = line.strip().split('\t', 2)
            data.append((src, trg))
    return data

**Tokenization:**

Depending on the language, the appropriate spaCy tokenizer is applied. This is handled in a separate function to keep tokenization modular and reusable.

In [5]:
def tokenize(data, lang):
    """Tokenize sentences using the specified spaCy tokenizer."""
    tokenizer = spacy_en.tokenizer if lang == 'en' else spacy_fr.tokenizer
    tokenized_data = []
    for src, trg in data:
        sentence = src if lang == 'en' else trg
        tokenized_sentence = [token.text for token in tokenizer(sentence)]
        tokenized_data.append(tokenized_sentence)
    return tokenized_data

**Yield Tokens:**

This generator function is used to iterate over the data for building the vocabulary.

It includes <start> and <end> tokens in each sequence, which are essential for the LSTM model's training and inference phases.

In [6]:
def yield_tokens(data, lang):
    """Yield tokens for vocabulary building."""
    for sentence in tokenize(data, lang):
        yield ['<start>'] + sentence + ['<end>']

In [7]:
data_path = 'dataset/fra.txt'
data = load_data(data_path)

# Build vocabularies
vocab_en = build_vocab_from_iterator(yield_tokens(data, 'en'), specials=['<unk>', '<start>', '<end>'])
vocab_fr = build_vocab_from_iterator(yield_tokens(data, 'fr'), specials=['<unk>', '<start>', '<end>'])

# Set default unknown token index
vocab_en.set_default_index(vocab_en["<unk>"])
vocab_fr.set_default_index(vocab_fr["<unk>"])

print("Sample English vocabulary:", list(vocab_en.get_itos())[:10])
print("Sample French vocabulary:", list(vocab_fr.get_itos())[:10])


Sample English vocabulary: ['<unk>', '<start>', '<end>', '.', 'I', 'you', 'to', '?', 'the', "n't"]
Sample French vocabulary: ['<unk>', '<start>', '<end>', '.', 'de', 'Je', '?', 'pas', 'est', 'que']


In [8]:
class TranslationDataset(Dataset):
    """
    A dataset class for handling machine translation datasets.

    Attributes:
    data (list of tuples): A list where each tuple contains a source sentence and a target sentence.
    src_vocab (dict): A dictionary mapping source language tokens to indices.
    trg_vocab (dict): A dictionary mapping target language tokens to indices.
    """
    def __init__(self, data, src_vocab, trg_vocab):
        """
        Initialize the dataset with source and target data and vocabularies.

        Parameters:
        data (list): List of tuples, where each tuple is (source_sentence, target_sentence).
        src_vocab (dict): Dictionary mapping source vocabulary words to integers.
        trg_vocab (dict): Dictionary mapping target vocabulary words to integers.
        """
        self.data = data
        self.src_vocab = src_vocab
        self.trg_vocab = trg_vocab

    def __len__(self):
        """Return the number of items in the dataset."""
        return len(self.data)

    def __getitem__(self, idx):
        """
        Retrieve an item from the dataset at the specified index.

        Parameters:
        idx (int): Index of the item to retrieve.

        Returns:
        tuple: A tuple containing the source indices and target indices as tensors.
        """
        src_sentence, trg_sentence = self.data[idx]
        # Convert source sentence to indices, adding start and end tokens
        src_indices = [self.src_vocab['<start>']] + [self.src_vocab[token] for token in src_sentence] + [self.src_vocab['<end>']]
        # Convert target sentence to indices, adding start and end tokens
        trg_indices = [self.trg_vocab['<start>']] + [self.trg_vocab[token] for token in trg_sentence] + [self.trg_vocab['<end>']]
        return torch.tensor(src_indices, dtype=torch.long), torch.tensor(trg_indices, dtype=torch.long)

def collate_fn(batch):
    """
    A function to collate data samples into batch tensors.

    Parameters:
    batch (list): A list of tuples, where each tuple is (source_indices, target_indices).

    Returns:
    tuple: A tuple containing batched source and target sequences.
    """
    src_batch, trg_batch = zip(*batch)  # Unpack batch data
    # Pad sequences for uniform length in the batch, using <unk> token for padding
    src_batch = pad_sequence(src_batch, padding_value=vocab_en["<unk>"])
    trg_batch = pad_sequence(trg_batch, padding_value=vocab_fr["<unk>"])
    return src_batch, trg_batch

# Load data
split_ratio = 0.8
train_size = int(len(data) * split_ratio)
train_data = TranslationDataset(data[:train_size], vocab_en, vocab_fr)
valid_data = TranslationDataset(data[train_size:], vocab_en, vocab_fr)

# Create DataLoaders
batch_size = 32
train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
valid_loader = DataLoader(valid_data, batch_size=batch_size, shuffle=False, collate_fn=collate_fn)


### Model Architecture

#### Encoder

Processes the input English sentences and creates context vectors.

In [9]:
class Encoder(nn.Module):
    """
        Initializes the Encoder part of a Seq2Seq model.

        Parameters:
        input_dim (int): The size of the input vocabulary.
        emb_dim (int): The dimension of the embedding layer.
        hid_dim (int): The dimension of the hidden states in the LSTM.
        n_layers (int): The number of LSTM layers.
        dropout (float): The dropout rate used for regularizing the model.
    """
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        self.n_layers = n_layers
        self.hid_dim = hid_dim
        
        self.embedding = nn.Embedding(input_dim, emb_dim) # Embedding layer to convert input indices into dense vectors
        # The hidden dimension is divided by 2 for bidirectional output compatibility
        self.rnn = nn.GRU(emb_dim, hid_dim // 2, n_layers, dropout=dropout, bidirectional=True)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src):
        """
        Forward pass through the encoder.
        The encoder uses a bidirectional GRU. The hidden states 
        from both directions are concatenated to form a single output 
        vector.

        Parameters:
        src (Tensor): The input sequence to the encoder (batch of input indices).

        Returns:
        Tuple[Tensor, Tensor]: A tuple containing the final hidden state and the final cell state of the LSTM.
        """
        embedded = self.dropout(self.embedding(src)) # Apply embedding and dropout to the input sequence
        _, hidden = self.rnn(embedded)  # Pass the embedded input through the GRU
        # Reshape hidden state to handle bidirectional outputs correctly:
        hidden = hidden.view(self.n_layers, 2, src.size(1), self.hid_dim // 2)
        hidden = torch.cat((hidden[:, 0, :, :], hidden[:, 1, :, :]), dim=2)  # Concatenate the hidden states

        return hidden # Return only the hidden and cell states to be used by the decoder


#### Decoder

Uses the context vectors to start generating the translated French sentences.

In [10]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout):
        """
        Decoder initialization.

        Parameters:
        output_dim (int): Size of the output vocabulary.
        emb_dim (int): Size of the embeddings.
        hid_dim (int): Dimensionality of the hidden state and cell state.
        n_layers (int): Number of layers in the LSTM.
        dropout (float): Dropout rate for regularization.
        """
        super().__init__()
        self.output_dim = output_dim  # Target vocabulary size
        self.embedding = nn.Embedding(output_dim, emb_dim)  # Embedding layer
        # Adjust GRU to accept double the layer count
        self.rnn = nn.GRU(emb_dim, hid_dim, n_layers, dropout=dropout)
        self.fc_out = nn.Linear(hid_dim, output_dim)  # Fully connected layer to get the output
        self.dropout = nn.Dropout(dropout)  # Dropout layer for regularization

    def forward(self, input, hidden):
        """
        Forward pass of the decoder.
        The decoder's initial hidden state needs to handle the doubled output 
        from the bidirectional encoder.

        Parameters:
        input (Tensor): Input tensor for the decoder.
        hidden (Tensor): Hidden state from the last time step or the encoder.
        cell (Tensor): Cell state from the last time step or the encoder.

        Returns:
        Tuple[Tensor, Tensor, Tensor]: A tuple containing the output predictions, the new hidden state, and the new cell state.
        """
        input = input.unsqueeze(0)  # Add a batch dimension (batch size is assumed to be 1)
        embedded = self.dropout(self.embedding(input))  # Embed the input word and apply dropout
        output, hidden = self.rnn(embedded, hidden)
        prediction = self.fc_out(output.squeeze(0))  # Pass the GRU output through the linear layer to get predictions
        
        return prediction, hidden # Return the prediction, new hidden state

#### Seq2Seq Wrapper

Manages the data flow from the encoder to the decoder and structures the training process.

In [11]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        """
        Initializes the Seq2Seq model which encompasses an encoder and decoder.

        Parameters:
        encoder (nn.Module): The encoder neural network module.
        decoder (nn.Module): The decoder neural network module.
        device (torch.device): The device to run the model on ('cpu' or 'cuda').
        """
        super().__init__()
        self.encoder = encoder  # The RNN/LSTM/GRU encoder module
        self.decoder = decoder  # The RNN/LSTM/GRU decoder module
        self.device = device    # Device on which to perform computations

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        """
        The forward pass for the Seq2Seq model.

        Parameters:
        src (Tensor): The input tensor containing the source sequence.
        trg (Tensor): The target tensor containing the target sequence.
        teacher_forcing_ratio (float, optional): The probability of using teacher forcing.

        Returns:
        Tensor: The output tensor containing the predicted sequence.
        """
        # Determine the batch size and target sequence length
        trg_len = trg.shape[0]
        batch_size = trg.shape[1]
        trg_vocab_size = self.decoder.output_dim # The size of the target vocabulary
        
        # Initialize a tensor to hold the decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        
        # Run the source sequence through the encoder
        hidden = self.encoder(src)
        
        # Take the first token of the target sequence as the initial decoder input
        input = trg[1,:]  # Start token
        
        # Iterate over each token in the target sequence
        for t in range(1, trg_len):
            # Run the input and the previous hidden and cell states through the decoder
            output, hidden = self.decoder(input, hidden)
            outputs[t] = output
            
            # Get the highest probability word from the decoder's output
            top1 = output.argmax(1)
            
            # Use the actual next token as the next input (teacher forcing),
            # or use the predicted token (no teacher forcing)
            input = trg[t] if (torch.rand(1) < teacher_forcing_ratio) else top1
            
        return outputs  # Return the tensor containing all the decoder outputs
    
    def loss(self, Y_hat, Y):
        """
        Compute the masked cross-entropy loss, ignoring the padding tokens in the target sequences.

        Args:
        Y_hat (torch.Tensor): Predicted output probabilities from the model (logits).
        Y (torch.Tensor): Ground truth target sequences (indices).

        Returns:
        torch.Tensor: Scalar value of the average loss computed over non-padded tokens.
        """
        # Apply softmax and calculate log probabilities
        log_probs = torch.nn.functional.log_softmax(Y_hat, dim=-1)
        
        # Calculate negative log-likelihood loss with no reduction to keep losses for each element
        l = torch.nn.functional.nll_loss(log_probs.transpose(1, 2), Y, reduction='none')
        
        # Create a mask for padding tokens (0 where Y is padding, 1 elsewhere)
        mask = (Y != self.tgt_pad).type(torch.float32)
        
        # Apply mask to the losses
        masked_loss = l * mask
        
        # Calculate the mean loss only over non-masked elements
        return masked_loss.sum() / mask.sum()

### Training

#### Parameters

In [12]:
# Model architecture parameters
model_params = {
    'input_dim': len(vocab_en),  # vocabulary size for encoder
    'output_dim': len(vocab_fr), # vocabulary size for decoder
    'enc_emb_dim': 300,           # encoder embedding dimension
    'dec_emb_dim': 300,           # decoder embedding dimension
    'hid_dim': 512,              # dimension of the hidden layers
    'n_layers': 1,               # number of layers in each the encoder and decoder
    'enc_dropout': 0.5,          # dropout rate for the encoder
    'dec_dropout': 0.5           # dropout rate for the decoder
}

# Training parameters
training_params = { 
    'n_epochs': 5,               # number of epochs to train
    'train_losses': [],          # list to record training losses
    'valid_losses': [],          # list to record validation losses
    'print_step_size': 1000      # print step size
}

#### Model Initialisation

In [13]:
# Path management for model saving/loading
model_path = './model/best_model.pt'
model_dir = Path('./model')

def prepare_model_directory():
    if model_dir.exists():
        print("Model directory already exists.")
    else:
        model_dir.mkdir(parents=True, exist_ok=True)
        print("Model directory was created.")

def load_or_initialize_model(device):
    if load_saved_model and model_dir.joinpath('best_model.pt').exists():
        print("Loading model from:", model_path)
        # Initialize the model first
        enc = Encoder(model_params['input_dim'], model_params['enc_emb_dim'], model_params['hid_dim'], model_params['n_layers'], model_params['enc_dropout'])
        dec = Decoder(model_params['output_dim'], model_params['dec_emb_dim'], model_params['hid_dim'], model_params['n_layers'], model_params['dec_dropout'])
        model = Seq2Seq(enc, dec, device).to(device)
        # Load the state dictionary
        model.load_state_dict(torch.load(model_path, map_location=device))
        return model
    else:
        prepare_model_directory()
        print("No saved model found. Initializing a new model.")
        enc = Encoder(model_params['input_dim'], model_params['enc_emb_dim'], model_params['hid_dim'], model_params['n_layers'], model_params['enc_dropout'])
        dec = Decoder(model_params['output_dim'], model_params['dec_emb_dim'], model_params['hid_dim'], model_params['n_layers'], model_params['dec_dropout'])
        model = Seq2Seq(enc, dec, device).to(device)
        return model

# Adjust this variable according to load an existing model or not
load_saved_model = True

# Initialise device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load or initialize model
model = load_or_initialize_model(device)

Loading model from: ./model/best_model.pt




#### Train Loop



In [14]:
def train_model(model, train_loader, valid_loader, optimizer, criterion, n_epochs, device, print_step_size, save_path='best_model.pt'):
    model.train()
    best_valid_loss = float('inf')

    for epoch in range(n_epochs):
        model.train()
        epoch_loss = 0
        for i, (src, trg) in enumerate(tqdm(train_loader, desc=f"Training Epoch {epoch+1}")):
            src, trg = src.to(device), trg.to(device)

            optimizer.zero_grad()
            output = model(src, trg)

            # trg is of shape [trg_len, batch_size]
            # output is of shape [trg_len, batch_size, output_dim]
            output_dim = output.shape[-1]
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)

            loss = criterion(output, trg)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()

            if (i + 1) % print_step_size == 0:
                print(f'Step {i+1}, Training Loss: {loss.item():.4f}')

        # Validation loss
        model.eval()
        valid_loss = 0
        with torch.no_grad():
            for src, trg in valid_loader:
                src, trg = src.to(device), trg.to(device)
                output = model(src, trg, 0)  # Turn off teacher forcing
                output_dim = output.shape[-1]
                output = output[1:].view(-1, output_dim)
                trg = trg[1:].view(-1)
                loss = criterion(output, trg)
                valid_loss += loss.item()

        average_train_loss = epoch_loss / len(train_loader)
        training_params["train_losses"].append(average_train_loss)

        average_valid_loss = valid_loss / len(valid_loader)
        training_params["valid_losses"].append(average_valid_loss)

        print(f'Epoch: {epoch+1}, Train Loss: {average_train_loss:.4f}, Valid Loss: {average_valid_loss:.4f}')

        # Save the best model
        if average_valid_loss < best_valid_loss:
            best_valid_loss = average_valid_loss
            torch.save(model.state_dict(), save_path)
            print(f'Best model saved at Epoch {epoch+1} with Validation Loss: {average_valid_loss:.4f}')

    return model


#### Print Losses Curve

In [15]:
def plot_loss_curve():
    plt.figure(figsize=(10, 5))
    plt.plot(training_params["train_losses"], label='Training Loss')
    plt.plot(training_params["valid_losses"], label='Validation Loss')
    plt.title('Training and Validation Losses')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.legend()
    plt.show()

#### Inference: Translating Sentences

To use the trained model for translating an English sentence into French, we need to encode the sentence using the trained Encoder and then iteratively decode the output using the trained Decoder.

In [16]:
def translate_sentence(model, sentence, src_vocab, trg_vocab, device, max_len=50):
    model.eval()

    # Tokenize and numericalize the input sentence
    tokens = ['<start>'] + [token.text.lower() for token in spacy_en.tokenizer(sentence)] + ['<end>']
    src_indexes = [src_vocab[token] for token in tokens]
    src_tensor = torch.LongTensor(src_indexes).unsqueeze(1).to(device)

    with torch.no_grad():
        hidden = model.encoder(src_tensor)

    trg_indexes = [trg_vocab['<start>']]

    for i in range(max_len):
        trg_tensor = torch.LongTensor([trg_indexes[-1]]).to(device)

        with torch.no_grad():
            output, hidden = model.decoder(trg_tensor, hidden)
            pred_token = output.argmax(1).item()

        trg_indexes.append(pred_token)

        if pred_token == trg_vocab['<end>']:
            break

    trg_tokens = [trg_vocab.get_itos()[i] for i in trg_indexes]
    return trg_tokens[1:-1]  # Remove the start and end tokens

#### Run Training or Inference

In [17]:
run_inference = True

if run_inference:
    # Run Inference
    sentence = "Hello, how are you?"
    print(f"English sentence: {sentence}")
    translation = translate_sentence(model, sentence, vocab_en, vocab_fr, device)
    print(f"French sentence: : {' '.join(translation)}")
else:
    # Run training

    # Optimiser and Loss Function
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    criterion = nn.CrossEntropyLoss(ignore_index=vocab_fr["<unk>"])  # Ignore index for padding or <unk>

    # Start training loop
    trained_model = train_model(model, train_loader, valid_loader, optimizer, criterion, training_params["n_epochs"], device, training_params["print_step_size"])

    # Plot the loss curve
    plot_loss_curve()

English sentence: Hello, how are you?
French sentence: : , n c ' a s t - c a s s a s s a n s a n s ?


## BLEU Score

In neural machine translation, to assess the translation performance of sequence-to-sequence (Seq2Seq) models, such as those consisting of Recurrent Neural Networks (RNNs), the Bilingual Evaluation Understudy (BLEU) score is often employed.

The BLEU score is a metric for evaluating a generated sentence to a set of reference sentences. A higher BLEU score corresponds to a translation that is more similar to the reference translations, with a score of 1 being a perfect match and a score of 0 indicating no overlap. BLEU score considers precision of n-grams in the generated text compared to the reference texts, and also includes a brevity penalty to prevent favoring overly short translations.

The BLEU score is calculated as follows:

1. **Compute the n-gram precision, Pn, for n=1 to N**:

  $$
   P_n = \frac{\sum_{\text{cand}\in\text{Candidates}} \sum_{\text{n-gram}\in\text{cand}} \min(\text{Count}_{\text{cand}}(\text{n-gram}), \text{Max-Count}_{\text{ref}}(\text{n-gram}))}{\sum_{\text{cand}\in\text{Candidates}} \sum_{\text{n-gram}\in\text{cand}} \text{Count}_{\text{cand}}(\text{n-gram})}
  $$

   Where:
   - **Candidates**: The set of candidate translated sentences.
   - **n-gram**: A contiguous sequence of n words from the candidate translation.
   - **$\text{Count}_{\text{cand}}(\text{n-gram})$**: The count of n-grams in the candidate translation.
   - **$\text{Max-Count}_{\text{ref}}(\text{n-gram})$**: The maximum count of the n-gram in any single reference translation.

2. **Compute the brevity penalty, BP**:
  $$
     BP = \begin{cases}
       1 & \text{if } c > r \\
       e^{(1-r/c)} & \text{if } c \leq r
     \end{cases}
  $$
   
  Where:
   - **c**: The length of the candidate translation.
   - **r**: The effective reference corpus length.

3. **Compute the BLEU score**:
  $$
  \text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log P_n\right)
  $$

   Where:
   - **wn**: The weight for each n-gram precision score (typically uniform, such as 1/N for N=4).


**Example Calculation:**

Consider the following example with a candidate translation and two reference translations:

- Candidate: "The cat is on the mat."
- Reference 1: "The cat is on the mat."
- Reference 2: "There is a cat on the mat."

For simplicity, let's calculate the BLEU score for up to bigrams (N=2) with uniform weights for unigram and bigram precision (w1=w2=0.5).

1. **Unigram Precision (P1)**:
   The candidate has six unigrams, all of which appear in both reference sentences. So, P1 is 6/6 = 1.

  - Candidate unigrams: "The", "cat", "is", "on", "the", "mat"
  - Reference 1 unigrams: "The", "cat", "is", "on", "the", "mat"
  - Reference 2 unigrams: "There", "is", "a", "cat", "on", "the", "mat"

  All unigrams in the candidate sentence appear in both reference sentences. The counts for each unigram in the candidate are:

  - "The": 2
  - "cat": 1
  - "is": 1
  - "on": 1
  - "the": 1 (again for the second occurrence)
  - "mat": 1

  Since all unigrams are found in the reference sentences, we get:
  $$
  P_1 = \frac{6}{6} = 1
  $$

2. **Bigram Precision (P2)**:
   There are five bigrams in the candidate sentence. All five appear in Reference 1, and four appear in Reference 2. The maximum count from the references is used for each bigram, and hence P2 is also 5/5 = 1.

    - Candidate bigrams: "The cat", "cat is", "is on", "on the", "the mat"
    - Reference 1 bigrams: "The cat", "cat is", "is on", "on the", "the mat"
    - Reference 2 bigrams: "There is", "is a", "a cat", "cat on", "on the", "the mat"

  In the candidate, each bigram appears once, and all but one ("The cat") also appear in Reference 2. For "The cat", we take the count from Reference 1:

  - "The cat": 1 (from Reference 1)
  - "cat is": 1
  - "is on": 1
  - "on the": 1
  - "the mat": 1

  The counts match for all bigrams, so the precision is:
  $$
  P_2 = \frac{5}{5} = 1
  $$

3. **Brevity Penalty (BP)**:
    - Candidate sentence length (c): 6 words
    - Reference sentence lengths: 6 words (Reference 1), 7 words (Reference 2)
    - The effective reference corpus length (r) is the one that is closest to the candidate length, which in this case is 6 words from Reference 1.

  Since the candidate length c is not less than the reference length r, the brevity penalty BP is 1:
  $$
  BP = 1 \text{ (because c is not less than r)}
  $$

4. **BLEU Score**:
  With uniform weights (0.5 each for unigram and bigram precision), the BLEU score is:

   $$
   \text{BLEU} = BP \cdot \exp\left(0.5 \log P_1 + 0.5 \log P_2\right)
   \text{BLEU} = 1 \cdot \exp\left(0.5 \log 1 + 0.5 \log 1\right) = 1
   $$

In this example, the BLEU score is 1, indicating a perfect match with the reference translations. However, it's important to note that in real-world scenarios, BLEU scores are typically less than 1, reflecting various degrees of translation quality.


In [18]:
import nltk
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction

def calculate_bleu(reference_sentences, candidate_sentence):
    # The reference sentences should be tokenized and passed as a list of lists
    # The candidate sentence should be tokenized and passed as a list
    # Each reference and candidate sentence should start and end with <start> and <end> tokens respectively

    # We'll use nltk's method for smoothing.
    smoothing = SmoothingFunction().method1

    # Calculate BLEU score
    bleu_score = sentence_bleu(reference_sentences, candidate_sentence, smoothing_function=smoothing)

    return bleu_score

sentence = "Hello, how are you?"
print(f"English sentence: {sentence}")
translation = ' '.join(translate_sentence(model, sentence, vocab_en, vocab_fr, device))
print(f"French sentence: : {translation}")

# Tokenize the reference sentences
references = [translation]

# Calculate BLEU score
bleu_score = calculate_bleu(references, [sentence])
print(f"BLEU score: {bleu_score}")

English sentence: Hello, how are you?
French sentence: : , n c ' a s t - c a s s a s s a n s a n s ?
BLEU score: 0


## Conclusion

Today, NMT systems are integral parts of global communication, supporting instant translation across numerous language pairs with increasing reliability. As NMT continues to evolve, it integrates more deeply with other AI technologies, pushing towards truly interactive, real-time multilingual communication and making the dream of removing language barriers more attainable than ever. The evolution from simple rule-based systems to complex neural architectures reflects broader trends in AI towards more holistic and context-aware systems, promising exciting developments for the future of language processing.

## References

- A ten-minute introduction to sequence-to-sequence learning in Keras: https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html