$$
\begin{array}{c}
\text{$\Large We\ must\ cultivate\ our\ garden.$} \\
{\text{$\small Voltaire\ -\ Candide$}} \\
\end{array}
$$

# Neural Machine Translation with Recurrent Neural Networks

Neural Machine Translation (NMT) uses deep learning to translate text automatically from one language to another. As a subfield of natural language processing (NLP), NMT changed machine translation by introducing models that translate whole sentences at once, often preserving context and meaning better than earlier rule-based and statistical methods.

## Impact on Language Processing

The advent of NMT has notably impacted the field of language processing by introducing several key capabilities:

- **Handling of Idiomatic Expressions**: Traditional models often struggled with idioms and culturally specific phrases. Because NMT models condition on the whole sentence, they tend to handle such expressions better and translate them more naturally.

- **Reduction in Translation Latency**: NMT models can translate text quickly when they are optimized and deployed on suitable hardware such as GPUs. This speed is crucial for applications requiring real-time translation.

- **Improved Scalability**: Given the right computational resources, NMT models can be scaled to accommodate large-scale translation tasks that were previously impractical, making it feasible to offer high-quality translation services on a global scale.

- **Accessibility**: By lowering language barriers, NMT increases accessibility, allowing more people to access content in their native or preferred languages. This inclusivity is crucial in educational contexts and information dissemination.

## Evolution of Neural Machine Translation

The development of Neural Machine Translation (NMT) marks a significant milestone in the progression from basic models to sophisticated neural networks designed to handle complex language processing tasks. This evolution is not only a story of technological advancement but also of conceptual shifts in how machines understand and process human languages.


### Early Translation Models

1. **Rule-Based Translation Systems (RBMT)**: The earliest attempts at machine translation were rule-based, relying on a comprehensive set of language rules and bilingual dictionaries. These systems required extensive manual work to define grammatical structures and vocabulary mappings between the source and target languages. While they were somewhat effective for languages with similar structures, their performance dropped significantly with complex or unrelated language pairs.

2. **Statistical Machine Translation (SMT)**: Emerging in the late 1980s and coming into prominence in the 1990s, SMT represented a shift towards data-driven approaches. SMT models used statistical methods to translate text based on the probability distributions of words and phrases, learned from large corpora of translated texts. This method was more flexible and scalable than RBMT, but still struggled with syntactic and semantic nuances, often producing literal translations that lacked contextual coherence.

### The Role of Embeddings in NMT Evolution

#### Introduction to Word Embeddings

Word embeddings are a type of word representation that allows words with similar meanings to have a similar representation. They are low-dimensional, continuous, dense vectors that are learned from text data. These vectors aim to capture syntactic and semantic word relationships based on the contexts in which words appear. The development and use of word embeddings have been fundamental in advancing NMT by providing a more nuanced and effective means of representing language data in neural networks.

#### Early Embeddings: From One-Hot to Distributed Representations

1. **One-Hot Encodings**: Initially, language models, including early translation systems, relied on one-hot encoding to represent words. Each word in the vocabulary was represented by a vector where only one element is one, and all others are zero. This method was simple but had major limitations, such as high dimensionality and an inability to capture semantic relationships between words.

2. **Distributed Representations (Word2Vec, GloVe)**: The shift to distributed representations marked a significant improvement. Techniques like Word2Vec and GloVe allowed words to be represented as dense vectors where semantically similar words were mapped to proximate points in vector space. These embeddings were pre-trained on large corpora and could then be used to initialize the first layer of neural networks in NMT systems, providing a richer and more expressive input representation.
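The contrast between the two representations above can be made concrete with a small sketch. The dense vectors here are hand-picked illustrative values (an assumption), not trained Word2Vec or GloVe embeddings, and the helper names `one_hot` and `cosine` are introduced only for this example.

```python
import math

vocab = ["cat", "dog", "car"]

def one_hot(word, vocab):
    """Return a vector with a single 1 at the word's vocabulary index."""
    return [1.0 if w == word else 0.0 for w in vocab]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical 2-dimensional dense embeddings, chosen by hand for illustration.
dense = {
    "cat": [0.9, 0.1],
    "dog": [0.8, 0.2],
    "car": [0.1, 0.9],
}

# Distinct one-hot vectors are always orthogonal, so their similarity is uninformative.
print(cosine(one_hot("cat", vocab), one_hot("dog", vocab)))  # 0.0
# Dense vectors can place related words closer together than unrelated ones.
print(cosine(dense["cat"], dense["dog"]) > cosine(dense["cat"], dense["car"]))  # True
```

With one-hot vectors every pair of different words looks equally unrelated, whereas a dense space leaves room for graded similarity.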

#### Embeddings in NMT

1. **Improved Semantic Capture**: Embeddings provided a way for models to grasp semantic meanings and relationships, which are crucial for accurate translation. For instance, synonyms or contextually related words could be recognized as closer in the embedding space, aiding in more coherent and contextually appropriate translations.

2. **Efficiency and Scalability**: By reducing the dimensionality of the input space compared to one-hot encodings, embeddings made neural models more computationally efficient and easier to train. This scalability was essential as NMT models began to be applied to larger and more complex language pairs and datasets.

3. **Custom Embeddings for NMT**: As NMT systems evolved, researchers started training custom embeddings as part of the end-to-end training process of the translation model. This allowed the embeddings to be optimized specifically for the translation task, rather than relying solely on pre-trained embeddings. This integration led to further improvements in translation quality by tailoring the embeddings to capture nuances specific to the languages and textual contexts involved in the translation tasks.

#### Continuous Improvement with Contextual Embeddings

With the arrival of models like ELMo and later BERT, the concept of embeddings expanded from static representations to contextual embeddings. These are dynamic embeddings that change based on the words' context in a sentence, providing even richer information about word usage and meaning. This advancement significantly improved the subtlety with which NMT systems could handle language, leading to even better translations, particularly in handling idiomatic and nuanced phrases.

### Introduction of Neural Networks

1. **Initial Neural Approaches**: The introduction of neural networks to machine translation began with feedforward neural networks, which were initially used to improve specific components of SMT systems, like language modeling and re-ranking of translation hypotheses. These early neural components hinted at the potential of fully neural systems.

2. **Recurrent Neural Networks (RNNs)**: The true breakthrough came with the application of RNNs, particularly Long Short-Term Memory (LSTM) networks, which could remember long sequences of words—crucial for maintaining context in sentences. The encoder-decoder architecture, where one RNN encoded the input sentence into a context vector and another RNN decoded this vector into a translation, became a foundational model for NMT.
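In probabilistic terms, the encoder-decoder idea described above can be written as a factorization over target tokens. The notation is introduced here for illustration only: $x_1, \dots, x_S$ is the source sentence, $y_1, \dots, y_T$ the target sentence, and $c$ the context vector produced by the encoder.

$$
c = \mathrm{Encoder}(x_1, \dots, x_S), \qquad
p(y_1, \dots, y_T \mid x_1, \dots, x_S) = \prod_{t=1}^{T} p(y_t \mid y_1, \dots, y_{t-1}, c)
$$

Each factor is computed by one decoder step, which is why the decoder is run token by token during both training and inference.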

### Advancements and Modern Architectures


1. **Attention Mechanisms**: The introduction of attention mechanisms was a pivotal improvement in NMT. It allowed the model to focus on different parts of the input sentence while translating, mimicking how human translators revisit different words and phrases. This led to translations that were not only more fluent but also more accurate in terms of context and semantics.

2. **Transformers and Self-Attention**: The development of the Transformer model in 2017 marked the next significant evolution. Transformers replaced recurrence with self-attention layers, which process all words in the sentence simultaneously. This parallel processing significantly increased the speed and efficiency of training and improved the handling of long-range dependencies in text, setting new standards for translation quality.

3. **Integration of BERT and Pre-trained Models**: Following the success of Transformers, bidirectional pre-trained models like BERT (Bidirectional Encoder Representations from Transformers) further pushed the boundaries. These models are pre-trained on vast amounts of text before being fine-tuned for a downstream task, which improves their handling of contextual nuances. Because BERT is an encoder-only model, it contributes to translation mainly through better source-side representations rather than by generating text itself.
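The core computation behind the attention mechanisms in this list can be sketched in a few lines of plain Python. This is a minimal sketch that assumes dot-product scoring (one of several scoring functions in use) and made-up toy vectors; the function names are hypothetical and not taken from any library.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_attention(query, keys, values):
    """Weight each value by the softmax of its key's dot product with the query."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return context, weights

# Toy encoder states for a three-token source sentence (made-up numbers).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]

context, weights = dot_product_attention(decoder_state, encoder_states, encoder_states)
print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

The weights sum to one, so the context vector is a weighted average of the encoder states that is recomputed at every decoding step.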


## Implementation of NMT

### Setting Up the Environment

In [40]:
# Import necessary libraries
import torch
import torch.nn as nn
import torch.optim as optim
from torch.nn import Transformer

import torchtext
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

from pathlib import Path
from tqdm import tqdm

import numpy as np
import matplotlib.pyplot as plt

import spacy

In [22]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

### Data Preparation

#### Downloading the Dataset

In [24]:
!wget https://www.manythings.org/anki/fra-eng.zip
!unzip fra-eng.zip -d dataset

# Path to the data txt file on disk.
data_path = 'dataset/fra.txt'

--2024-04-21 21:56:20--  https://www.manythings.org/anki/fra-eng.zip
Resolving www.manythings.org (www.manythings.org)... 173.254.30.110
Connecting to www.manythings.org (www.manythings.org)|173.254.30.110|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7943074 (7,6M) [application/zip]
Saving to: ‘fra-eng.zip’


2024-04-21 21:56:23 (3,34 MB/s) - ‘fra-eng.zip’ saved [7943074/7943074]

Archive:  fra-eng.zip
  inflating: dataset/_about.txt      
  inflating: dataset/fra.txt         


#### Preprocessing Data

In [25]:
import os
os.environ['KMP_DUPLICATE_LIB_OK']='TRUE'

In [26]:
! python -m spacy download en_core_web_sm -q
! python -m spacy download fr_core_news_sm -q

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
✔ Download and installation successful
You can now load the package via spacy.load('fr_core_news_sm')


**Load Data:** 

This function opens the dataset and extracts the English and French sentences from each line, ignoring metadata and attribution text.

In [27]:
# Load spaCy models
spacy_en = spacy.load('en_core_web_sm')
spacy_fr = spacy.load('fr_core_news_sm')

def load_data(file_path):
    """Load and preprocess data from file."""
    data = []
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            src, trg, _ = line.strip().split('\t', 2)
            data.append((src, trg))
    return data
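As a small illustration of the parsing step, the sketch below splits a single line into its source and target sentences. It assumes each line has the form `English<TAB>French<TAB>attribution`; the sample line is made up, and `parse_line` is a hypothetical helper that, unlike `load_data`, skips malformed lines instead of raising.

```python
def parse_line(line):
    """Split one tab-separated line into (source, target), ignoring extra fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        return None  # Skip blank or malformed lines instead of raising.
    return fields[0], fields[1]

# A made-up line in the assumed format: English, French, attribution.
sample = "Go.\tVa !\tCC-BY 2.0 (France) Attribution: example\n"
print(parse_line(sample))  # ('Go.', 'Va !')
print(parse_line("\n"))    # None
```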

**Tokenization:** 

Depending on the language, the appropriate spaCy tokenizer is applied. This is handled in a separate function to keep tokenization modular and reusable.

In [28]:
def tokenize(data, lang):
    """Tokenize sentences using the specified spaCy tokenizer."""
    tokenizer = spacy_en.tokenizer if lang == 'en' else spacy_fr.tokenizer
    tokenized_data = []
    for src, trg in data:
        sentence = src if lang == 'en' else trg
        tokenized_sentence = [token.text for token in tokenizer(sentence)]
        tokenized_data.append(tokenized_sentence)
    return tokenized_data

**Yield Tokens:** 

This generator function is used to iterate over the data for building the vocabulary. 

It wraps each sequence in `<start>` and `<end>` tokens, which mark the sentence boundaries the LSTM model relies on during training and inference.

In [29]:
def yield_tokens(data, lang):
    """Yield tokens for vocabulary building."""
    for sentence in tokenize(data, lang):
        yield ['<start>'] + sentence + ['<end>']

In [30]:
data_path = 'dataset/fra.txt' 
data = load_data(data_path)

# Build vocabularies
vocab_en = build_vocab_from_iterator(yield_tokens(data, 'en'), specials=['<unk>', '<start>', '<end>'])
vocab_fr = build_vocab_from_iterator(yield_tokens(data, 'fr'), specials=['<unk>', '<start>', '<end>'])

# Set default unknown token index
vocab_en.set_default_index(vocab_en["<unk>"])
vocab_fr.set_default_index(vocab_fr["<unk>"])

print("Sample English vocabulary:", list(vocab_en.get_itos())[:10])
print("Sample French vocabulary:", list(vocab_fr.get_itos())[:10])


Sample English vocabulary: ['<unk>', '<start>', '<end>', '.', 'I', 'you', 'to', '?', 'the', "n't"]
Sample French vocabulary: ['<unk>', '<start>', '<end>', '.', 'de', 'Je', '?', 'pas', 'est', 'que']
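To show roughly what a vocabulary object does, here is a simplified stand-in written with the standard library. It is not torchtext's implementation: the alphabetical tie-breaking and the names `build_vocab` and `lookup` are assumptions made for this sketch.

```python
from collections import Counter

def build_vocab(token_lists, specials=("<unk>", "<start>", "<end>")):
    """Map tokens to indices: specials first, then tokens by descending frequency."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    for special in specials:
        counts.pop(special, None)  # Specials get fixed indices below.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos

def lookup(stoi, token):
    """Return the token's index, falling back to <unk> for unseen tokens."""
    return stoi.get(token, stoi["<unk>"])

sentences = [
    ["<start>", "I", "run", ".", "<end>"],
    ["<start>", "I", "walk", ".", "<end>"],
]
stoi, itos = build_vocab(sentences)
print(itos)                  # ['<unk>', '<start>', '<end>', '.', 'I', 'run', 'walk']
print(lookup(stoi, "swim"))  # 0
```

The fallback in `lookup` plays the same role as `set_default_index` in the cell above: any token missing from the vocabulary maps to the `<unk>` index.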


In [33]:
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset

class TranslationDataset(Dataset):
    def __init__(self, data, src_vocab, trg_vocab):
        self.data = data
        self.src_vocab = src_vocab
        self.trg_vocab = trg_vocab

    def __len__(self):
        return len(self.data)

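    def __getitem__(self, idx):
        src_sentence, trg_sentence = self.data[idx]
        # The stored sentences are raw strings, so tokenize them first;
        # iterating over a string directly would yield characters, not tokens.
        src_tokens = [token.text for token in spacy_en.tokenizer(src_sentence)]
        trg_tokens = [token.text for token in spacy_fr.tokenizer(trg_sentence)]
        src_indices = [self.src_vocab['<start>']] + [self.src_vocab[token] for token in src_tokens] + [self.src_vocab['<end>']]
        trg_indices = [self.trg_vocab['<start>']] + [self.trg_vocab[token] for token in trg_tokens] + [self.trg_vocab['<end>']]
        return torch.tensor(src_indices, dtype=torch.long), torch.tensor(trg_indices, dtype=torch.long)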

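def collate_fn(batch):
    """Pad a batch of (src, trg) tensors to [max_len, batch_size] each."""
    src_batch, trg_batch = zip(*batch)
    # No dedicated padding token is defined, so the <unk> index doubles as the
    # padding value; the loss function ignores this index on the target side.
    src_batch = pad_sequence(src_batch, padding_value=vocab_en["<unk>"])
    trg_batch = pad_sequence(trg_batch, padding_value=vocab_fr["<unk>"])
    return src_batch, trg_batch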

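# Load data
# Shuffle before splitting: if the file is ordered (for example by sentence length),
# a sequential split would leave the validation set unrepresentative.
import random
random.Random(42).shuffle(data)  # Fixed, arbitrary seed for a reproducible split
split_ratio = 0.8
train_size = int(len(data) * split_ratio)
train_data = TranslationDataset(data[:train_size], vocab_en, vocab_fr)
valid_data = TranslationDataset(data[train_size:], vocab_en, vocab_fr)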

# Create DataLoaders
batch_size = 32
train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
valid_loader = DataLoader(valid_data, batch_size=batch_size, shuffle=False, collate_fn=collate_fn)
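The effect of the padding step in `collate_fn` can be seen without PyTorch. The sketch below is a plain-Python stand-in (the function name `pad_time_major` and the toy indices are assumptions) that pads variable-length index lists and returns them time-major, which is the layout `pad_sequence` produces with its default `batch_first=False`.

```python
def pad_time_major(sequences, pad_value=0):
    """Pad variable-length index lists and return a [max_len][batch] nested list."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]
    # Transpose from [batch][time] to [time][batch].
    return [list(step) for step in zip(*padded)]

# Two toy sentences of different lengths (1 = <start>, 2 = <end>, 0 = padding).
batch = [[1, 5, 6, 2], [1, 7, 2]]
print(pad_time_major(batch))  # [[1, 1], [5, 7], [6, 2], [2, 0]]
```

Because the batch is time-major, indexing the first dimension (as `trg[0, :]` does in the model) selects one time step across all sentences in the batch.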


### Model Architecture

#### Encoder

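The encoder embeds the input English tokens and runs them through a multi-layer LSTM. Its final hidden and cell states serve as the context vectors that are passed to the decoder.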

In [31]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src):
        embedded = self.dropout(self.embedding(src))
        outputs, (hidden, cell) = self.rnn(embedded)
        return hidden, cell

#### Decoder

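The decoder is initialized with the encoder's context vectors and generates the French sentence one token at a time: each call takes the previous token and the current LSTM state and returns scores over the French vocabulary along with the updated state.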

In [32]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        self.output_dim = output_dim
        self.embedding = nn.Embedding(output_dim, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout)
        self.fc_out = nn.Linear(hid_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input, hidden, cell):
        input = input.unsqueeze(0)
        embedded = self.dropout(self.embedding(input))
        output, (hidden, cell) = self.rnn(embedded, (hidden, cell))
        prediction = self.fc_out(output.squeeze(0))
        return prediction, hidden, cell

#### Seq2Seq Wrapper

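The wrapper passes the encoder's final state to the decoder and runs the decoding loop. At each step it feeds the decoder either the ground-truth token (teacher forcing, applied with probability `teacher_forcing_ratio`) or the model's own previous prediction.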

In [35]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        batch_size = trg.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        hidden, cell = self.encoder(src)
        
        input = trg[0,:]  # Start token is assumed to be the first in the trg sequence
        
        for t in range(1, trg_len):
            output, hidden, cell = self.decoder(input, hidden, cell)
            outputs[t] = output
            teacher_force = torch.rand(1) < teacher_forcing_ratio
            top1 = output.argmax(1) 
            input = trg[t] if teacher_force else top1
        
        return outputs
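The teacher-forcing choice inside the loop can be isolated in a small, PyTorch-free sketch. The function `decoder_inputs` and the toy token ids are hypothetical; the sketch only mirrors which token would be fed to the decoder at each step, not the model itself.

```python
import random

def decoder_inputs(target, predictions, teacher_forcing_ratio, rng):
    """Return the token fed to the decoder at each decoding step.

    target: gold token ids, starting with the <start> id.
    predictions: pretend model predictions per step; index 0 is unused.
    """
    inputs = [target[0]]  # The first decoder input is always <start>.
    for t in range(1, len(target) - 1):
        use_gold = rng.random() < teacher_forcing_ratio
        inputs.append(target[t] if use_gold else predictions[t])
    return inputs

rng = random.Random(0)
target = [1, 10, 11, 2]          # <start>, two words, <end>
predictions = [None, 10, 99, 2]  # The pretend model gets the second word wrong.

print(decoder_inputs(target, predictions, 1.0, rng))  # [1, 10, 11]
print(decoder_inputs(target, predictions, 0.0, rng))  # [1, 10, 99]
```

With a ratio of 1.0 the decoder always sees the gold history; with 0.0 it must recover from its own mistakes, which is the situation it faces at inference time.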

In [41]:
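# Parameters
n_epochs = 10
save_path = 'model/best_model.pt'
# torch.save does not create missing directories, so make sure the folder exists
Path(save_path).parent.mkdir(parents=True, exist_ok=True)
train_losses = []
valid_losses = []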
INPUT_DIM = len(vocab_en)
OUTPUT_DIM = len(vocab_fr)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 2
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

# Model instantiation
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT)

model = Seq2Seq(enc, dec, device).to(device)

# Example initialization of optimizers and loss
optimizer = optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss(ignore_index=vocab_fr["<unk>"])  # Ignore index for padding or <unk>
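What `ignore_index` does to the loss can be sketched in plain Python: positions whose target equals the ignored index are left out of the average. This is an illustrative re-implementation under simplifying assumptions (lists instead of tensors, a hypothetical function name), not PyTorch's code.

```python
import math

def masked_cross_entropy(logits, targets, ignore_index):
    """Mean negative log-likelihood over positions whose target is not ignore_index."""
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # Ignored (padded) positions contribute nothing to the loss.
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[target]
        count += 1
    # Returning 0.0 when every position is ignored is a choice made for this sketch.
    return total / count if count else 0.0

logits = [[0.0, 0.0, 0.0], [5.0, 1.0, 1.0]]
targets = [1, 0]  # The second position is padding (index 0) and is skipped.
loss = masked_cross_entropy(logits, targets, ignore_index=0)
print(round(loss, 4))
```

Since the padding value here is the `<unk>` index, target positions that are genuinely `<unk>` are excluded from the loss as well.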

### Training the Model

In [42]:
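def train_model(model, train_loader, valid_loader, optimizer, criterion, n_epochs, device, save_path='best_model.pt'):
    """Train the model, validate after each epoch, and save the best checkpoint.

    Per-epoch losses are appended to the module-level train_losses and valid_losses lists.
    """
    best_valid_loss = float('inf')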

    for epoch in range(n_epochs):
        model.train()
        epoch_loss = 0
        for i, (src, trg) in enumerate(tqdm(train_loader, desc=f"Training Epoch {epoch+1}")):
            src, trg = src.to(device), trg.to(device)

            optimizer.zero_grad()
            output = model(src, trg)

            # trg is of shape [trg_len, batch_size]
            # output is of shape [trg_len, batch_size, output_dim]
            output_dim = output.shape[-1]
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)

            loss = criterion(output, trg)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()

            if (i + 1) % 200 == 0:
                print(f'Step {i+1}, Training Loss: {loss.item():.4f}')

        # Validation loss
        model.eval()
        valid_loss = 0
        with torch.no_grad():
            for src, trg in valid_loader:
                src, trg = src.to(device), trg.to(device)
                output = model(src, trg, 0)  # Turn off teacher forcing
                output_dim = output.shape[-1]
                output = output[1:].view(-1, output_dim)
                trg = trg[1:].view(-1)
                loss = criterion(output, trg)
                valid_loss += loss.item()

        average_train_loss = epoch_loss / len(train_loader)
        train_losses.append(average_train_loss)
        
        average_valid_loss = valid_loss / len(valid_loader)
        valid_losses.append(average_valid_loss)

        print(f'Epoch: {epoch+1}, Train Loss: {average_train_loss:.4f}, Valid Loss: {average_valid_loss:.4f}')

        # Save the best model
        if average_valid_loss < best_valid_loss:
            best_valid_loss = average_valid_loss
            torch.save(model.state_dict(), save_path)
            print(f'Best model saved at Epoch {epoch+1} with Validation Loss: {average_valid_loss:.4f}')

    return model

trained_model = train_model(model, train_loader, valid_loader, optimizer, criterion, n_epochs, device, save_path)



#### Plot the Loss Curves

In [None]:
plt.figure(figsize=(10, 5))
plt.plot(train_losses, label='Training Loss')
plt.plot(valid_losses, label='Validation Loss')
plt.title('Training and Validation Losses')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

### Inference: Translating Sentences

To use the trained model for translating an English sentence into French, we need to encode the sentence using the trained Encoder and then iteratively decode the output using the trained Decoder.

In [None]:
def translate_sentence(model, sentence, src_vocab, trg_vocab, device, max_len=50):
    model.eval()
    
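    # Tokenize and numericalize the input sentence.
    # Keep the original casing: the vocabulary was built from cased tokens (e.g. 'I'),
    # so lowercasing here would map many known words to <unk>.
    tokens = ['<start>'] + [token.text for token in spacy_en.tokenizer(sentence)] + ['<end>']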
    src_indexes = [src_vocab[token] for token in tokens]
    src_tensor = torch.LongTensor(src_indexes).unsqueeze(1).to(device)

    with torch.no_grad():
        hidden, cell = model.encoder(src_tensor)

    trg_indexes = [trg_vocab['<start>']]

    for i in range(max_len):
        trg_tensor = torch.LongTensor([trg_indexes[-1]]).to(device)

        with torch.no_grad():
            output, hidden, cell = model.decoder(trg_tensor, hidden, cell)
            pred_token = output.argmax(1).item()

        trg_indexes.append(pred_token)

        if pred_token == trg_vocab['<end>']:
            break

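    # Build the index-to-token list once instead of once per token
    itos = trg_vocab.get_itos()
    trg_tokens = [itos[i] for i in trg_indexes[1:]]  # Skip the start token
    # Drop the end token only if decoding actually produced one before max_len
    if trg_tokens and trg_tokens[-1] == '<end>':
        trg_tokens = trg_tokens[:-1]
    return trg_tokens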

# Example usage
sentence = "Hello, how are you?"
translation = translate_sentence(model, sentence, vocab_en, vocab_fr, device)
print("Translated:", " ".join(translation))
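The decoding loop in `translate_sentence` is a form of greedy decoding: at every step it keeps only the single highest-scoring token. The sketch below strips the idea down to plain Python so it can be run without a trained model. `greedy_decode` and `toy_step` are hypothetical names, and `toy_step` is a hand-written stand-in for the decoder.

```python
def greedy_decode(step_fn, start_id, end_id, max_len=50):
    """Repeatedly pick the highest-scoring next token until <end> or max_len."""
    tokens = [start_id]
    state = None
    for _ in range(max_len):
        scores, state = step_fn(tokens[-1], state)
        next_id = max(range(len(scores)), key=scores.__getitem__)
        if next_id == end_id:
            break
        tokens.append(next_id)
    return tokens[1:]  # Drop <start>; <end> is never appended.

def toy_step(prev_id, state):
    """Hypothetical stand-in for the decoder: deterministically scores the next id."""
    table = {1: 3, 3: 4, 4: 2}  # <start>=1 -> 3 -> 4 -> <end>=2
    scores = [0.0] * 5
    scores[table[prev_id]] = 1.0
    return scores, state

print(greedy_decode(toy_step, start_id=1, end_id=2))  # [3, 4]
```

Greedy decoding is simple and fast, but it can commit to an early token that leads to a worse overall sentence; beam search is a common alternative that keeps several candidates per step.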

## Conclusion

Today, NMT systems are integral parts of global communication, supporting instant translation across numerous language pairs with increasing reliability. As NMT continues to evolve, it integrates more deeply with other AI technologies, pushing towards truly interactive, real-time multilingual communication and making the dream of removing language barriers more attainable than ever. The evolution from simple rule-based systems to complex neural architectures reflects broader trends in AI towards more holistic and context-aware systems, promising exciting developments for the future of language processing.