# Sequence-to-sequence (seq2seq) models and machine translation

The `17_seq2seq_translation` notebook focuses on sequence-to-sequence (seq2seq) models for machine translation, a key application of neural networks in natural language processing. It covers preparing a dataset for translation tasks, building both the Encoder and Decoder models, and combining them into a complete seq2seq architecture. 

The notebook further explores training the model, evaluating its performance, translating new sentences, and experimenting with hyperparameters to fine-tune the model’s accuracy and fluency in translation tasks.

## Table of contents

1. [Understanding seq2seq models and machine translation](#understanding-seq2seq-models-and-machine-translation)
2. [Setting up the environment](#setting-up-the-environment)
3. [Preparing the dataset for machine translation](#preparing-the-dataset-for-machine-translation)
4. [Building the Encoder model](#building-the-encoder-model)
5. [Building the Decoder model](#building-the-decoder-model)
6. [Combining Encoder and Decoder into a seq2seq model](#combining-encoder-and-decoder-into-a-seq2seq-model)
7. [Training the seq2seq model](#training-the-seq2seq-model)
8. [Evaluating the seq2seq model](#evaluating-the-seq2seq-model)
9. [Translating new sentences](#translating-new-sentences)
10. [Experimenting with hyperparameters](#experimenting-with-hyperparameters)

## Understanding seq2seq models and machine translation

Sequence-to-sequence (seq2seq) models are a class of neural networks designed to transform one sequence into another, making them particularly effective for tasks where input and output are sequences of varying lengths. These models are widely used in tasks like **machine translation**, where the input is a sentence in one language and the output is the translation in another language. Other applications include text summarization, speech recognition, and image captioning.

The seq2seq model is based on a **recurrent neural network (RNN)** architecture and is typically composed of two main parts: an **encoder** and a **decoder**. The encoder processes the input sequence and compresses it into a fixed-size context vector, and the decoder takes this context vector to generate the output sequence.

### **How seq2seq models work**

Seq2seq models are designed to handle input and output sequences of different lengths, which makes them ideal for translation tasks where sentences in different languages vary in length and structure. The core idea is to read the input sequence, encode it into a compact representation, and then use this representation to generate the output sequence.

#### **Encoder**

The encoder processes the input sequence, which can be a sequence of words (such as a sentence) or other time-ordered data. Input tokens (words) are fed through an RNN one at a time, and the RNN updates its hidden state at each step. The hidden state at the final time step summarizes the entire input sequence and serves as the **context vector** that the decoder uses to generate the output sequence.

In traditional seq2seq models, this context vector is the only information passed to the decoder, making it a crucial component of the model. It needs to encode all relevant information from the input sequence.

#### **Decoder**

The decoder is another RNN that takes the context vector generated by the encoder and produces the output sequence. At each time step, the decoder generates one token of the output sequence, conditioned on the context vector and the tokens generated so far.

The decoder also maintains its own hidden state, which evolves as it generates the output sequence token by token. It typically uses a **teacher forcing** strategy during training, where the ground truth output token from the previous step is provided as input for the next step rather than the token predicted by the model.

In tasks like machine translation, the decoder will generate words in the target language until it outputs a special **end-of-sequence** token, signaling the end of the translation.

### **Training seq2seq models**

Seq2seq models are trained by minimizing the difference between the predicted output sequence and the actual target sequence. This is typically done using a loss function like **cross-entropy**, which compares the predicted probabilities for each output token to the actual token in the target sequence.

During training, the model learns to map variable-length input sequences to variable-length output sequences. However, relying on the context vector alone (as in traditional seq2seq models) can lead to information loss, especially for long sequences, which is why extensions like **attention mechanisms** were developed.

### **Limitations of vanilla seq2seq models**

While seq2seq models have been highly successful, they come with limitations, particularly when handling long input sequences. In a vanilla seq2seq model, the entire input sequence is compressed into a single context vector. This vector must carry all the necessary information for the decoder to generate the entire output sequence. For short sentences, this works relatively well, but for longer or more complex sequences, important details can be lost.

Some key limitations include:
- **Information bottleneck**: The encoder must compress all the information from the input sequence into a single fixed-length vector, which can lead to an information bottleneck, especially for long sequences.
- **Difficulty with long-term dependencies**: Seq2seq models, especially when based on vanilla RNNs, struggle to capture long-term dependencies in the data because gradients tend to vanish over many time steps. Gated units such as LSTMs and GRUs help mitigate this issue, but they still face challenges when handling very long sequences.

### **Machine translation with seq2seq models**

Seq2seq models are particularly well-suited for machine translation tasks. In this setting, the input sequence is a sentence in the source language, and the output sequence is its translation in the target language. The seq2seq model learns to map the structure and meaning of the source sentence into a context vector, which the decoder then uses to generate the translated sentence.

The process of machine translation with seq2seq models typically follows these steps:
1. **Input processing**: The encoder reads the input sentence (in the source language) one word at a time, updating its hidden state at each time step.
2. **Context vector creation**: Once the entire sentence has been processed, the encoder produces a final hidden state, known as the context vector, which summarizes the input sentence.
3. **Decoding**: The decoder takes the context vector and starts generating the translated sentence word by word, based on the context and the previous words generated in the target language.
4. **End of sequence**: The decoding process continues until an end-of-sequence token is generated, signaling that the translation is complete.

Seq2seq models for machine translation can be trained on large parallel corpora, where sentences in the source language are paired with their translations in the target language. The model learns to align and map the structure of sentences across different languages.
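
The decoding and end-of-sequence steps above can be sketched with a toy example. This is a minimal illustration, not the notebook's model: the lookup table `NEXT_TOKEN` is a hypothetical stand-in for a trained decoder, and it ignores the context vector that a real decoder would condition on.

```python
# Toy greedy decoder: a hypothetical lookup table stands in for a trained network.
BOS, EOS = "<bos>", "<eos>"

# Hypothetical "model": maps the previous token to the most likely next token.
NEXT_TOKEN = {
    "<bos>": "ein",
    "ein": "mann",
    "mann": "spielt",
    "spielt": "gitarre",
    "gitarre": "<eos>",
}

def greedy_decode(next_token_table, max_len=10):
    """Generate tokens one at a time until <eos> appears or max_len is reached."""
    generated = []
    previous = BOS
    for _ in range(max_len):
        token = next_token_table.get(previous, EOS)
        if token == EOS:
            break  # end-of-sequence: the translation is complete
        generated.append(token)
        previous = token  # the generated token becomes the next input
    return generated

translation = greedy_decode(NEXT_TOKEN)
print(translation)
```

The `max_len` cap is a safeguard for the case where the model never emits the end-of-sequence token.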

### **Variants and improvements to seq2seq models**

To address the limitations of vanilla seq2seq models, several improvements have been introduced over the years. One of the most important innovations is the **attention mechanism**, which allows the decoder to focus on different parts of the input sequence at each decoding step, rather than relying solely on the context vector.

#### **Attention mechanisms**

Attention mechanisms allow the model to selectively focus on different parts of the input sequence while generating the output. Instead of compressing the entire input into a single fixed-length vector, the attention mechanism provides the decoder with a dynamic weighted combination of the encoder's hidden states, allowing the model to "attend" to specific words in the input sequence during each step of the decoding process.

By doing so, attention helps the model overcome the information bottleneck issue and improves its ability to handle long input sequences. This is particularly useful in machine translation, where the correspondence between words in the source and target languages can vary significantly.
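
As a rough sketch of the weighted combination described above, the snippet below computes attention for a single decoding step with NumPy. Dot-product scoring is assumed here purely for brevity; it is one common choice among several, and the function name, shapes, and random inputs are illustrative rather than part of the notebook.

```python
import numpy as np

def dot_product_attention(decoder_state, encoder_states):
    """Return (context, weights) for one decoding step.

    decoder_state:  shape (hid_dim,)
    encoder_states: shape (src_len, hid_dim), one hidden state per source token
    """
    scores = encoder_states @ decoder_state   # similarity per source position
    scores = scores - scores.max()            # subtract max for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over source positions
    context = weights @ encoder_states        # weighted sum of encoder states
    return context, weights

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 4))  # 5 source tokens, hidden size 4
decoder_state = rng.normal(size=4)

context, weights = dot_product_attention(decoder_state, encoder_states)
print("attention weights:", weights.round(3))
print("context vector shape:", context.shape)
```

Because the weights are recomputed at every decoding step, each output token can draw on a different mix of source positions.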

#### **Bidirectional RNNs**

Another enhancement to seq2seq models is the use of **bidirectional RNNs** in the encoder. A bidirectional RNN processes the input sequence in both forward and backward directions, allowing the model to capture context from both the past and the future at each time step. This helps improve the quality of the context vector by giving the encoder access to information about the entire sequence.

### **Maths**

#### **Encoder**

In a seq2seq model, the encoder processes the input sequence one element (token) at a time, updating its hidden state at each step. Let the input sequence be $ X = (x_1, x_2, \dots, x_T) $, where $ T $ is the length of the input sequence. The encoder uses a recurrent neural network (RNN), such as a vanilla RNN, a GRU (Gated Recurrent Unit), or an LSTM (Long Short-Term Memory), to produce a sequence of hidden states $ h_t $ at each time step $ t $.

For an RNN, the hidden state update can be represented as:

$$
h_t = f(W_{hx} x_t + W_{hh} h_{t-1} + b_h)
$$

where:
- $ x_t $ is the input at time step $ t $,
- $ h_{t-1} $ is the hidden state from the previous time step,
- $ W_{hx} $ and $ W_{hh} $ are the weight matrices for the input and the previous hidden state, respectively,
- $ b_h $ is the bias term,
- $ f $ is a non-linear activation function (e.g., tanh or ReLU).

The final hidden state of the encoder $ h_T $ is used as the context vector, which summarizes the entire input sequence:

$$
c = h_T
$$

This context vector $ c $ is then passed to the decoder.
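
The recurrence above can be traced in a few lines of NumPy. This is a minimal sketch assuming a vanilla RNN with $ f = \tanh $, a zero initial hidden state, and small random weights; the dimensions are arbitrary and chosen only for illustration.

```python
import numpy as np

def encode(inputs, W_hx, W_hh, b_h):
    """Apply h_t = tanh(W_hx x_t + W_hh h_{t-1} + b_h) over the sequence; return h_T."""
    h = np.zeros(W_hh.shape[0])  # assumed initial hidden state h_0 = 0
    for x_t in inputs:
        h = np.tanh(W_hx @ x_t + W_hh @ h + b_h)
    return h

rng = np.random.default_rng(0)
emb_dim, hid_dim, T = 3, 4, 6
X = rng.normal(size=(T, emb_dim))                 # T input vectors x_1..x_T
W_hx = rng.normal(size=(hid_dim, emb_dim)) * 0.1
W_hh = rng.normal(size=(hid_dim, hid_dim)) * 0.1
b_h = np.zeros(hid_dim)

c = encode(X, W_hx, W_hh, b_h)  # context vector c = h_T
print(c.shape)
```

Note that the context vector has the same size whether the input has two tokens or two hundred, which is the source of the bottleneck discussed earlier.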

#### **Decoder**

The decoder generates the output sequence $ Y = (y_1, y_2, \dots, y_{T'}) $, where $ T' $ is the length of the output sequence. The decoder is also an RNN, and it generates the output one token at a time, conditioned on the context vector $ c $ from the encoder and its own previous hidden state.

At each time step $ t $ in the decoder, the hidden state is updated as follows:

$$
s_t = f(W_{sy} y_{t-1} + W_{sc} c + W_{ss} s_{t-1} + b_s)
$$

where:
- $ y_{t-1} $ is the previous output token (used as input during training in the teacher forcing setup),
- $ c $ is the context vector from the encoder,
- $ s_{t-1} $ is the previous hidden state of the decoder,
- $ W_{sy}, W_{sc}, W_{ss} $ are the weight matrices for the previous output token, the context vector, and the previous hidden state, respectively,
- $ b_s $ is the bias term.

At each step, the decoder produces an output $ \hat{y_t} $, which is the probability distribution over the possible output tokens. This is usually done by applying a softmax function to the decoder’s output at each time step:

$$
\hat{y_t} = \text{softmax}(W_o s_t + b_o)
$$

where:
- $ W_o $ is the output weight matrix,
- $ b_o $ is the output bias.

The softmax function normalizes the output into a probability distribution over the vocabulary, allowing the model to predict the next token in the sequence.
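
To make the normalization concrete, here is a small pure-Python sketch of softmax applied to a made-up vector of three logits (standing in for $ W_o s_t + b_o $ with a three-word vocabulary).

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # illustrative scores for a 3-word vocabulary
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)  # index of the most likely token

print([round(p, 3) for p in probs])
print("predicted index:", predicted)
```

Softmax preserves the ordering of the scores, so taking the argmax of the logits selects the same token as taking the argmax of the probabilities.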

#### **Sequence generation**

During training, the seq2seq model uses **teacher forcing**, where the true output token from the previous time step is provided as input to the decoder for the next time step. During inference (or testing), the model uses its own predictions as input for the next time step, generating the output sequence token by token.

The output sequence is generated until the model produces an **end-of-sequence** (EOS) token, signaling that the sequence is complete.
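
The difference between training-time and inference-time inputs can be sketched as a single choice per step. The helper name `choose_next_input` and the example token lists are hypothetical; the ratio-based random choice mirrors the common practice of mixing both modes during training.

```python
import random

def choose_next_input(true_token, predicted_token, teacher_forcing_ratio, rng):
    """Pick the decoder's next input: ground truth with the given probability, else the prediction."""
    return true_token if rng.random() < teacher_forcing_ratio else predicted_token

rng = random.Random(42)
target = ["ein", "mann", "spielt", "gitarre", "<eos>"]
predicted = ["ein", "hund", "spielt", "ball", "<eos>"]  # hypothetical model outputs

# Ratio 1.0: pure teacher forcing. Ratio 0.0: the model always feeds on its own predictions.
always_teacher = [choose_next_input(t, p, 1.0, rng) for t, p in zip(target, predicted)]
never_teacher = [choose_next_input(t, p, 0.0, rng) for t, p in zip(target, predicted)]
mixed = [choose_next_input(t, p, 0.5, rng) for t, p in zip(target, predicted)]

print(always_teacher)
print(never_teacher)
print(mixed)
```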

#### **Loss function**

The seq2seq model is trained to minimize the difference between the predicted sequence $ \hat{Y} $ and the true sequence $ Y $. A common loss function for this is the **cross-entropy loss**, which measures the difference between the predicted probability distribution $ \hat{y_t} $ and the true one-hot encoded output $ y_t $ at each time step:

$$
L = - \sum_{t=1}^{T'} \sum_{k=1}^{V} y_{t,k} \log(\hat{y_{t,k}})
$$

where:
- $ T' $ is the length of the output sequence,
- $ V $ is the size of the output vocabulary,
- $ y_{t,k} $ is the true one-hot encoded value for the $ k $-th word in the vocabulary at time step $ t $,
- $ \hat{y_{t,k}} $ is the predicted probability for the $ k $-th word at time step $ t $.

The goal is to minimize this loss over the entire training dataset, adjusting the model's parameters using gradient descent or another optimization algorithm.
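
A small worked example may help. Because $ y_t $ is one-hot, the inner sum over $ k $ keeps only the log-probability of the correct token at each step. The probabilities below are made up for a two-step sequence over a three-word vocabulary.

```python
import math

def sequence_cross_entropy(predicted_probs, target_indices):
    """Sum of -log p(correct token) over all time steps."""
    return -sum(math.log(p[k]) for p, k in zip(predicted_probs, target_indices))

predicted = [
    [0.7, 0.2, 0.1],  # step 1: the correct token (index 0) gets probability 0.7
    [0.1, 0.8, 0.1],  # step 2: the correct token (index 1) gets probability 0.8
]
targets = [0, 1]

loss = sequence_cross_entropy(predicted, targets)
print(f"{loss:.4f}")  # -(ln 0.7 + ln 0.8) = -ln 0.56, roughly 0.58
```

The loss approaches zero as the probabilities assigned to the correct tokens approach 1.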

#### **Gradient flow and backpropagation through time (BPTT)**

Training seq2seq models involves **backpropagation through time (BPTT)**, which is a form of backpropagation applied to sequences. In BPTT, the gradients of the loss with respect to the model’s parameters are computed over the entire sequence, and the parameters are updated accordingly.

The error signals from the output sequence are propagated backward through the decoder's time steps and then, via the context vector, through the encoder's time steps. Because the gradients of the loss reach both networks in this way, the encoder and decoder weights are trained jointly, end to end.

## Setting up the environment


##### **Q1: How do you install the necessary libraries for building and training seq2seq models in PyTorch?**


In [1]:
# !!conda install -y pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
# !pip install nltk

##### **Q2: How do you import the required modules for model building, training, and data loading in PyTorch?**


In [9]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
import random
import numpy as np
import nltk
import os
import urllib.request
import gzip
import shutil
import re

##### **Q3: How do you set up the environment to use a GPU for training seq2seq models, and how do you fall back to the CPU in PyTorch?**


In [3]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')

Using device: cuda


##### **Q4: How do you set random seeds in PyTorch to ensure reproducibility when training seq2seq models?**

In [4]:
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed(42)
np.random.seed(42)
random.seed(42)

## Preparing the dataset for machine translation


##### **Q5: How do you load a machine translation dataset (e.g., English to German) to use in PyTorch?**


In [6]:
nltk.download('punkt')

url_train_en = 'https://raw.githubusercontent.com/multi30k/dataset/master/data/task1/raw/train.en.gz'
url_train_de = 'https://raw.githubusercontent.com/multi30k/dataset/master/data/task1/raw/train.de.gz'

os.makedirs('data', exist_ok=True)

def download_and_extract(url, filename):
    filepath = os.path.join('data', filename)
    if not os.path.exists(filepath):
        urllib.request.urlretrieve(url, filepath + '.gz')
        with gzip.open(filepath + '.gz', 'rb') as f_in:
            with open(filepath, 'wb') as f_out:
                shutil.copyfileobj(f_in, f_out)
        os.remove(filepath + '.gz')

download_and_extract(url_train_en, 'train.en')
download_and_extract(url_train_de, 'train.de')

[nltk_data] Downloading package punkt to /home/fellmir/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


##### **Q6: How do you preprocess the dataset by tokenizing the sentences and converting them into sequences of indices?**


In [10]:
def simple_tokenizer(text):
    return re.findall(r'\b\w+\b', text.lower())

In [11]:
# Assign the tokenizer functions:
tokenizer_src = simple_tokenizer  # For English
tokenizer_trg = simple_tokenizer  # For German

with open('data/train.en', 'r', encoding='utf-8') as f:
    sentences_en = f.readlines()

with open('data/train.de', 'r', encoding='utf-8') as f:
    sentences_de = f.readlines()

assert len(sentences_en) == len(sentences_de)  # Ensure both files have the same number of sentences

tokenized_en = [tokenizer_src(sentence) for sentence in sentences_en]
tokenized_de = [tokenizer_trg(sentence) for sentence in sentences_de]  # Tokenize the sentences

In [None]:
# Alternative using NLTK's wordpunct_tokenize:
# from nltk.tokenize import wordpunct_tokenize

# tokenizer_src = wordpunct_tokenize  # For English
# tokenizer_trg = wordpunct_tokenize  # For German

# Tokenize the sentences
# tokenized_en = [tokenizer_src(sentence.lower()) for sentence in sentences_en]
# tokenized_de = [tokenizer_trg(sentence.lower()) for sentence in sentences_de]

##### **Q7: How do you build vocabulary for both the source and target languages?**


In [12]:
from collections import Counter

def build_vocab(tokenized_sentences, min_freq):
    counter = Counter()
    for tokens in tokenized_sentences:
        counter.update(tokens)
    vocab = {'<unk>': 0, '<pad>': 1, '<bos>': 2, '<eos>': 3}
    idx = 4
    for word, freq in counter.items():
        if freq >= min_freq:
            vocab[word] = idx
            idx += 1
    return vocab

MIN_FREQ = 2
vocab_src = build_vocab(tokenized_en, MIN_FREQ)
vocab_trg = build_vocab(tokenized_de, MIN_FREQ)

inv_vocab_src = {idx: word for word, idx in vocab_src.items()}
inv_vocab_trg = {idx: word for word, idx in vocab_trg.items()}  # Inverse vocabularies for decoding
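
To see how the frequency threshold and the special tokens interact, here is a self-contained toy version. The helper names `build_toy_vocab` and `encode` and the two-sentence corpus are illustrative only; the logic follows the same idea as `build_vocab` above.

```python
from collections import Counter

SPECIALS = ['<unk>', '<pad>', '<bos>', '<eos>']

def build_toy_vocab(tokenized_sentences, min_freq):
    """Assign indices to the special tokens first, then to every word seen at least min_freq times."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    vocab = {tok: i for i, tok in enumerate(SPECIALS)}
    for word, freq in counts.items():
        if freq >= min_freq:
            vocab[word] = len(vocab)
    return vocab

def encode(tokens, vocab):
    """Map tokens to indices, using <unk> for unknown words, and wrap with <bos>/<eos>."""
    ids = [vocab.get(t, vocab['<unk>']) for t in tokens]
    return [vocab['<bos>']] + ids + [vocab['<eos>']]

corpus = [['a', 'man', 'plays'], ['a', 'man', 'sings']]
vocab = build_toy_vocab(corpus, min_freq=2)
print(vocab)  # 'plays' and 'sings' occur once each, so they are left out

encoded = encode(['a', 'woman', 'plays'], vocab)
print(encoded)  # [2, 4, 0, 0, 3]: 'woman' and 'plays' both map to <unk>
```

Raising `min_freq` shrinks the vocabulary (and the embedding and output layers) at the cost of mapping more words to `<unk>`.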

##### **Q8: How do you create DataLoaders for batching the source-target sentence pairs during training?**

In [13]:
class TranslationDataset(Dataset):
    def __init__(self, tokenized_src, tokenized_trg, vocab_src, vocab_trg):
        self.data = []
        for src_tokens, trg_tokens in zip(tokenized_src, tokenized_trg):
            src_indices = [vocab_src.get('<bos>')] + [vocab_src.get(token, vocab_src['<unk>']) for token in src_tokens] + [vocab_src.get('<eos>')]
            trg_indices = [vocab_trg.get('<bos>')] + [vocab_trg.get(token, vocab_trg['<unk>']) for token in trg_tokens] + [vocab_trg.get('<eos>')]
            self.data.append((torch.tensor(src_indices), torch.tensor(trg_indices)))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

def collate_fn(batch):
    src_batch, trg_batch = zip(*batch)
    src_padded = nn.utils.rnn.pad_sequence(src_batch, padding_value=vocab_src['<pad>'], batch_first=True)
    trg_padded = nn.utils.rnn.pad_sequence(trg_batch, padding_value=vocab_trg['<pad>'], batch_first=True)
    return src_padded, trg_padded

In [14]:
from sklearn.model_selection import train_test_split

train_src, valid_src, train_trg, valid_trg = train_test_split(tokenized_en, tokenized_de, test_size=0.1, random_state=42)

train_dataset = TranslationDataset(train_src, train_trg, vocab_src, vocab_trg)
valid_dataset = TranslationDataset(valid_src, valid_trg, vocab_src, vocab_trg)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, collate_fn=collate_fn)
valid_loader = DataLoader(valid_dataset, batch_size=32, shuffle=False, collate_fn=collate_fn)
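
The padding performed inside `collate_fn` can be pictured with plain Python lists. This sketch only mimics the effect of `pad_sequence` with `batch_first=True`; the helper `pad_batch` is hypothetical and the indices are made up, with 1 standing in for `<pad>`.

```python
def pad_batch(sequences, pad_value):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(s) for s in sequences)
    return [s + [pad_value] * (max_len - len(s)) for s in sequences]

PAD = 1  # index of '<pad>' in the vocabularies above
batch = [
    [2, 4, 5, 3],  # <bos> w1 w2 <eos>
    [2, 4, 3],     # <bos> w1 <eos>
]
padded = pad_batch(batch, PAD)
print(padded)  # [[2, 4, 5, 3], [2, 4, 3, 1]]
```

Padding is done per batch, so each batch is only as wide as its own longest sentence.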

## Building the Encoder model


##### **Q9: How do you define the architecture of the Encoder model using PyTorch’s `nn.Module`?**


In [15]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, n_layers, batch_first=True)
        
    def forward(self, src):
        embedded = self.embedding(src)
        outputs, hidden = self.rnn(embedded)
        return hidden

##### **Q10: How do you implement the forward pass of the Encoder to process input sequences and generate the context vector?**


In [16]:
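# The forward pass is the `forward` method of the Encoder class defined in Q9:
# it embeds the source indices, runs them through the GRU, and returns the final
# hidden state of shape (n_layers, batch_size, hid_dim), which serves as the context vector.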

##### **Q11: How do you specify the number of layers and hidden units in the Encoder?**

In [17]:
INPUT_DIM = len(vocab_src)
ENC_EMB_DIM = 256
HID_DIM = 512
ENC_N_LAYERS = 2

encoder = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, ENC_N_LAYERS).to(device)  # Initialize the Encoder

## Building the Decoder model


##### **Q12: How do you define the Decoder architecture using PyTorch’s `nn.Module`?**


In [18]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers):
        super(Decoder, self).__init__()
        self.embedding = nn.Embedding(output_dim, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, n_layers, batch_first=True)
        self.fc_out = nn.Linear(hid_dim, output_dim)
        
    def forward(self, input, hidden):
        input = input.unsqueeze(1)
        embedded = self.embedding(input)
        output, hidden = self.rnn(embedded, hidden)
        prediction = self.fc_out(output.squeeze(1))
        return prediction, hidden

##### **Q13: How do you implement the forward pass of the Decoder to generate translated sequences from the context vector?**


In [None]:
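# The forward pass is the `forward` method of the Decoder class defined in Q12.
# It processes one target token per call: the token is embedded, passed through the GRU
# together with the previous hidden state (initially the Encoder's context vector),
# and projected to vocabulary logits by fc_out.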

##### **Q14: How do you use the `nn.Linear` and `nn.Softmax` layers to convert the Decoder's output into predicted tokens?**

In [None]:
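# The Decoder's fc_out layer (nn.Linear) maps the GRU output to one logit per target-vocabulary word.
# No nn.Softmax layer is applied in the model: nn.CrossEntropyLoss expects raw logits, and
# argmax over the logits selects the same token as argmax over the softmax probabilities.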

## Combining Encoder and Decoder into a seq2seq model


##### **Q15: How do you combine the Encoder and Decoder models into a complete seq2seq model for machine translation?**


In [19]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        batch_size = src.shape[0]
        trg_len = trg.shape[1]
        trg_vocab_size = self.decoder.embedding.num_embeddings
        
        outputs = torch.zeros(batch_size, trg_len, trg_vocab_size).to(self.device)
        
        hidden = self.encoder(src)
        
        input = trg[:, 0]
        
        for t in range(1, trg_len):
            output, hidden = self.decoder(input, hidden)
            outputs[:, t, :] = output
            teacher_force = random.random() < teacher_forcing_ratio
            top1 = output.argmax(1)
            input = trg[:, t] if teacher_force else top1
        return outputs

In [20]:
OUTPUT_DIM = len(vocab_trg)
DEC_EMB_DIM = 256
DEC_N_LAYERS = 2

decoder = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, DEC_N_LAYERS).to(device)
seq2seq_model = Seq2Seq(encoder, decoder, device).to(device)

##### **Q16: How do you implement teacher forcing in the training loop to improve the Decoder’s performance during training?**


In [21]:
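# Teacher forcing is implemented in the Seq2Seq model's forward method (Q15):
# at each step, with probability teacher_forcing_ratio, the next decoder input is the
# ground-truth token trg[:, t]; otherwise it is the model's own argmax prediction.
# The random draw is made once per step for the whole batch.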

##### **Q17: How do you implement the forward pass for the combined seq2seq model, using the context vector from the Encoder to initialize the Decoder?**

In [22]:
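# The forward pass for the combined seq2seq model (Q15) passes the Encoder's final hidden state
# directly to the Decoder as its initial hidden state. This requires the Encoder and Decoder
# to use the same hid_dim and the same number of layers.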

## Training the seq2seq model


##### **Q18: How do you define the loss function (e.g., CrossEntropyLoss) for training the seq2seq model on sequence data?**


In [23]:
PAD_IDX = vocab_trg['<pad>']
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)  # Define the loss function with padding index ignored
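
The effect of ignoring the padding index can be illustrated without PyTorch. The sketch below averages the per-token loss over non-padding targets only, which is what `ignore_index` does under the default mean reduction; the function `masked_cross_entropy` and the probabilities are illustrative, and it assumes at least one non-padding target.

```python
import math

PAD_IDX = 1

def masked_cross_entropy(probs, targets, pad_idx=PAD_IDX):
    """Average -log p(target) over positions whose target is not padding."""
    losses = [-math.log(p[t]) for p, t in zip(probs, targets) if t != pad_idx]
    return sum(losses) / len(losses)  # assumes at least one non-padding target

probs = [
    [0.1, 0.1, 0.7, 0.1],      # real token, target index 2
    [0.1, 0.1, 0.2, 0.6],      # real token, target index 3
    [0.25, 0.25, 0.25, 0.25],  # padding position
    [0.25, 0.25, 0.25, 0.25],  # padding position
]
targets = [2, 3, PAD_IDX, PAD_IDX]

loss = masked_cross_entropy(probs, targets)
print(f"{loss:.4f}")
```

Whatever the model predicts at padded positions has no influence on the loss, so it is not rewarded or penalized for them.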

##### **Q19: How do you configure an optimizer (e.g., Adam) to update the parameters of both the Encoder and Decoder models during training?**


In [24]:
optimizer = optim.Adam(seq2seq_model.parameters())

##### **Q20: How do you implement the training loop for the seq2seq model, including the forward pass, loss calculation, and backpropagation?**


In [26]:
num_epochs = 30

for epoch in range(num_epochs):
    seq2seq_model.train()
    epoch_loss = 0
    for src_batch, trg_batch in train_loader:
        src_batch = src_batch.to(device)
        trg_batch = trg_batch.to(device)
        
        optimizer.zero_grad()
        output = seq2seq_model(src_batch, trg_batch)
        
        # output: (batch_size, trg_len, trg_vocab_size)
        output_dim = output.shape[-1]
        output = output[:, 1:, :].reshape(-1, output_dim)
        trg = trg_batch[:, 1:].reshape(-1)
        
        loss = criterion(output, trg)
        loss.backward()
        optimizer.step()
        
        epoch_loss += loss.item()
        
    if (epoch + 1) % 10 == 0:
        print(f'Epoch {epoch+1}/{num_epochs}, Loss: {epoch_loss/len(train_loader):.4f}')

Epoch 10/30, Loss: 0.7457
Epoch 20/30, Loss: 0.6495
Epoch 30/30, Loss: 0.6860
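
The slicing and reshaping before the loss call can be checked on dummy arrays. This NumPy sketch uses made-up sizes and mirrors the `[:, 1:]` slice, which drops position 0 (the `<bos>` slot that the decoder never predicts), followed by flattening so each row is one token prediction.

```python
import numpy as np

batch_size, trg_len, vocab_size = 2, 5, 7
outputs = np.zeros((batch_size, trg_len, vocab_size))  # stand-in for model logits
targets = np.arange(batch_size * trg_len).reshape(batch_size, trg_len)  # dummy target ids

# Drop position 0, then merge the batch and time dimensions.
flat_outputs = outputs[:, 1:, :].reshape(-1, vocab_size)
flat_targets = targets[:, 1:].reshape(-1)

print(flat_outputs.shape)  # (8, 7): batch_size * (trg_len - 1) predictions
print(flat_targets)        # [1 2 3 4 6 7 8 9]: ids 0 and 5 (position 0) are gone
```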


##### **Q21: How do you monitor and log the training loss over epochs to ensure the seq2seq model is learning effectively?**

In [None]:
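# Training loss is monitored in the training loop above (Q20): batch losses are accumulated
# in epoch_loss, and the average loss per batch is printed every 10 epochs.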

## Evaluating the seq2seq model


##### **Q22: How do you evaluate the seq2seq model on a validation dataset using metrics such as the BLEU score?**


In [27]:
from nltk.translate.bleu_score import corpus_bleu

def evaluate(model, data_loader):
    model.eval()
    references = []
    hypotheses = []
    
    with torch.no_grad():
        for src_batch, trg_batch in data_loader:
            src_batch = src_batch.to(device)
            trg_batch = trg_batch.to(device)
            output = model(src_batch, trg_batch, teacher_forcing_ratio=0)
            output_tokens = output.argmax(2)
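            for i in range(src_batch.size(0)):
                trg_indices = trg_batch[i].cpu().numpy()
                output_indices = output_tokens[i].cpu().numpy()
                # Map indices back to tokens, dropping <pad> and <bos>.
                # Caveat: position 0 of `output` is never written by the model, so its argmax is
                # index 0 ('<unk>'), and predictions after the first <eos> are not truncated.
                # Both effects tend to lower the reported BLEU score.
                trg_tokens = [inv_vocab_trg[idx] for idx in trg_indices if idx != PAD_IDX and idx != vocab_trg['<bos>']]
                output_tokens_list = [inv_vocab_trg[idx] for idx in output_indices if idx != PAD_IDX and idx != vocab_trg['<bos>']]
                # corpus_bleu expects a list of reference translations for each hypothesis
                references.append([trg_tokens])
                hypotheses.append(output_tokens_list)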
    bleu = corpus_bleu(references, hypotheses)
    return bleu

In [28]:
bleu_score = evaluate(seq2seq_model, valid_loader)
print(f'Validation BLEU score: {bleu_score*100:.2f}')

Validation BLEU score: 7.99
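
BLEU is built on clipped n-gram precision, which can be sketched in pure Python. This is a simplified illustration, not NLTK's implementation: full BLEU additionally combines the precisions for several n-gram orders with a geometric mean and applies a brevity penalty. The sentences are made up.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, hypothesis, n):
    """Fraction of hypothesis n-grams found in the reference, with counts clipped."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    if not hyp_counts:
        return 0.0
    # Each n-gram is credited at most as often as it appears in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return clipped / sum(hyp_counts.values())

reference = "ein mann spielt gitarre".split()
hypothesis = "ein mann spielt spielt".split()

p1 = modified_precision(reference, hypothesis, 1)
p2 = modified_precision(reference, hypothesis, 2)
print(p1)            # 0.75: the repeated 'spielt' is credited only once
print(round(p2, 4))  # 0.6667: two of three bigrams match
```

Clipping is what prevents a hypothesis from scoring well by repeating a single correct word.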


##### **Q23: How do you implement a function to calculate the BLEU score to assess the quality of the machine-translated sequences?**


In [None]:
# The "evaluate" function calculates the BLEU score to assess translation quality

##### **Q24: How do you compare the model's predictions to the target translations during evaluation to measure performance?**

In [None]:
# The model's predictions are compared to the target translations during evaluation

## Translating new sentences


##### **Q25: How do you implement a function to translate new sentences using the trained seq2seq model?**


In [29]:
def translate_sentence(sentence, vocab_src, vocab_trg, model, tokenizer_src, max_len=50):
    model.eval()
    tokens = ['<bos>'] + tokenizer_src(sentence.lower()) + ['<eos>']
    src_indices = [vocab_src.get(token, vocab_src['<unk>']) for token in tokens]
    src_tensor = torch.LongTensor(src_indices).unsqueeze(0).to(device)
    
    input_token = torch.LongTensor([vocab_trg['<bos>']]).to(device)
    outputs = []

    # Run both the encoder and the greedy decoding loop without tracking gradients
    with torch.no_grad():
        hidden = model.encoder(src_tensor)
        for _ in range(max_len):
            output, hidden = model.decoder(input_token, hidden)
            top1 = output.argmax(1)  # greedy choice: most likely next token
            outputs.append(top1.item())
            if top1.item() == vocab_trg['<eos>']:
                break
            input_token = top1
    translated_tokens = [inv_vocab_trg.get(idx, '<unk>') for idx in outputs]
    return translated_tokens

##### **Q26: How do you handle sentences of varying lengths when translating new sentences with the seq2seq model?**


In [None]:
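# The translate_sentence function accepts inputs of any length because a single sentence needs no padding.
# The output length is also variable: decoding stops at the first <eos> token,
# or after max_len tokens if no <eos> is produced.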

##### **Q27: How do you visualize the original, translated, and reference (ground truth) sentences to evaluate the model’s translation performance?**

In [30]:
def display_translation(sentence):
    print(f'Original: {sentence}')
    translation = translate_sentence(sentence, vocab_src, vocab_trg, seq2seq_model, tokenizer_src)
    print(f'Translated: {" ".join(translation)}')

In [31]:
test_sentence = "A man is playing a guitar."
display_translation(test_sentence)

Original: A man is playing a guitar.
Translated: ein mann spielt gitarre einer trommel <eos>


In [32]:
display_translation('the book is on the table')

Original: the book is on the table
Translated: die am tisch beim essen <eos>


In [37]:
display_translation('words')

Original: words
Translated: straßenkünstler <unk> die arbeit <eos>


## Experimenting with hyperparameters


##### **Q28: How do you adjust the learning rate and observe its effect on the seq2seq model’s training stability and performance?**


In [38]:
learning_rates = [0.001, 0.0005, 0.0001]

def train_with_learning_rate(lr):
    print(f"\nTraining with learning rate: {lr}")
    # Re-initialize the model:
    encoder = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, ENC_N_LAYERS).to(device)
    decoder = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, DEC_N_LAYERS).to(device)
    seq2seq_model = Seq2Seq(encoder, decoder, device).to(device)
    
    optimizer = optim.Adam(seq2seq_model.parameters(), lr=lr)
    
    criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
    
    num_epochs = 10
    for epoch in range(num_epochs):
        seq2seq_model.train()
        epoch_loss = 0
        for src_batch, trg_batch in train_loader:
            src_batch = src_batch.to(device)
            trg_batch = trg_batch.to(device)
            
            optimizer.zero_grad()
            output = seq2seq_model(src_batch, trg_batch)
            
            # output: (batch_size, trg_len, trg_vocab_size)
            output_dim = output.shape[-1]
            output = output[:, 1:, :].reshape(-1, output_dim)
            trg = trg_batch[:, 1:].reshape(-1)
            
            loss = criterion(output, trg)
            loss.backward()
            optimizer.step()
            
            epoch_loss += loss.item()
        avg_loss = epoch_loss / len(train_loader)
        print(f'Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss:.4f}')
    
    bleu_score = evaluate(seq2seq_model, valid_loader)
    print(f'Validation BLEU score: {bleu_score*100:.2f}')
    return avg_loss, bleu_score

In [39]:
results_lr = []
for lr in learning_rates:
    avg_loss, bleu = train_with_learning_rate(lr)
    results_lr.append({'learning_rate': lr, 'loss': avg_loss, 'bleu_score': bleu})

print("\nLearning Rate Experiment Results:")
for res in results_lr:
    print(f"Learning Rate: {res['learning_rate']}, Final Loss: {res['loss']:.4f}, BLEU Score: {res['bleu_score']*100:.2f}")


Training with learning rate: 0.001
Epoch 1/10, Loss: 4.6847
Epoch 2/10, Loss: 3.6639
Epoch 3/10, Loss: 3.1361
Epoch 4/10, Loss: 2.7249
Epoch 5/10, Loss: 2.3774
Epoch 6/10, Loss: 2.0795
Epoch 7/10, Loss: 1.8155
Epoch 8/10, Loss: 1.6143
Epoch 9/10, Loss: 1.4452
Epoch 10/10, Loss: 1.2949
Validation BLEU score: 9.00

Training with learning rate: 0.0005
Epoch 1/10, Loss: 4.8586
Epoch 2/10, Loss: 3.7968
Epoch 3/10, Loss: 3.2709
Epoch 4/10, Loss: 2.8791
Epoch 5/10, Loss: 2.5498
Epoch 6/10, Loss: 2.2679
Epoch 7/10, Loss: 1.9957
Epoch 8/10, Loss: 1.7537
Epoch 9/10, Loss: 1.5253
Epoch 10/10, Loss: 1.3123
Validation BLEU score: 10.03

Training with learning rate: 0.0001
Epoch 1/10, Loss: 5.4526
Epoch 2/10, Loss: 4.8022
Epoch 3/10, Loss: 4.4589
Epoch 4/10, Loss: 4.1844
Epoch 5/10, Loss: 3.9739
Epoch 6/10, Loss: 3.7985
Epoch 7/10, Loss: 3.6333
Epoch 8/10, Loss: 3.4942
Epoch 9/10, Loss: 3.3708
Epoch 10/10, Loss: 3.2538
Validation BLEU score: 6.85

Learning Rate Experiment Results:
Learning Rate: 0.

##### **Q29: How do you experiment with different batch sizes to observe how they impact training speed and memory usage?**


In [40]:
import time

batch_sizes = [16, 32, 64]

def train_with_batch_size(batch_size):
    print(f"\nTraining with batch size: {batch_size}")
    # Re-create DataLoaders with specified batch size:
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
    valid_loader = DataLoader(valid_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_fn)

    encoder = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, ENC_N_LAYERS).to(device)
    decoder = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, DEC_N_LAYERS).to(device)
    seq2seq_model = Seq2Seq(encoder, decoder, device).to(device)

    optimizer = optim.Adam(seq2seq_model.parameters(), lr=0.001)
    criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)

    num_epochs = 10
    # Time the full training run for this batch size:
    start_time = time.time()
    for epoch in range(num_epochs):
        seq2seq_model.train()
        epoch_loss = 0
        for src_batch, trg_batch in train_loader:
            src_batch = src_batch.to(device)
            trg_batch = trg_batch.to(device)
            
            optimizer.zero_grad()
            output = seq2seq_model(src_batch, trg_batch)
            
            # output: (batch_size, trg_len, trg_vocab_size)
            output_dim = output.shape[-1]
            output = output[:, 1:, :].reshape(-1, output_dim)
            trg = trg_batch[:, 1:].reshape(-1)
            
            loss = criterion(output, trg)
            loss.backward()
            optimizer.step()
            
            epoch_loss += loss.item()
        avg_loss = epoch_loss / len(train_loader)
        print(f'Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss:.4f}')
    end_time = time.time()
    total_time = end_time - start_time
    
    bleu_score = evaluate(seq2seq_model, valid_loader)
    print(f'Validation BLEU score: {bleu_score*100:.2f}')
    print(f"Training time: {total_time:.2f} seconds")
    return avg_loss, bleu_score, total_time

In [41]:
results_bs = []
for batch_size in batch_sizes:
    avg_loss, bleu, total_time = train_with_batch_size(batch_size)
    results_bs.append({'batch_size': batch_size, 'loss': avg_loss, 'bleu_score': bleu, 'training_time': total_time})

print("\nBatch Size Experiment Results:")
for res in results_bs:
    print(f"Batch Size: {res['batch_size']}, Final Loss: {res['loss']:.4f}, BLEU Score: {res['bleu_score']*100:.2f}, Training Time: {res['training_time']:.2f} seconds")


Training with batch size: 16
Epoch 1/10, Loss: 4.4521
Epoch 2/10, Loss: 3.3737
Epoch 3/10, Loss: 2.8145
Epoch 4/10, Loss: 2.3840
Epoch 5/10, Loss: 2.0419
Epoch 6/10, Loss: 1.8051
Epoch 7/10, Loss: 1.6383
Epoch 8/10, Loss: 1.5304
Epoch 9/10, Loss: 1.4372
Epoch 10/10, Loss: 1.3601
Validation BLEU score: 9.88
Training time: 829.96 seconds

Training with batch size: 32
Epoch 1/10, Loss: 4.6968
Epoch 2/10, Loss: 3.6638
Epoch 3/10, Loss: 3.1340
Epoch 4/10, Loss: 2.7402
Epoch 5/10, Loss: 2.3981
Epoch 6/10, Loss: 2.0969
Epoch 7/10, Loss: 1.8206
Epoch 8/10, Loss: 1.6418
Epoch 9/10, Loss: 1.4646
Epoch 10/10, Loss: 1.3230
Validation BLEU score: 8.94
Training time: 537.01 seconds

Training with batch size: 64
Epoch 1/10, Loss: 5.1268
Epoch 2/10, Loss: 4.1523
Epoch 3/10, Loss: 3.6510
Epoch 4/10, Loss: 3.2843
Epoch 5/10, Loss: 2.9597
Epoch 6/10, Loss: 2.7052
Epoch 7/10, Loss: 2.4615
Epoch 8/10, Loss: 2.2732
Epoch 9/10, Loss: 2.0692
Epoch 10/10, Loss: 1.8595
Validation BLEU score: 7.67
Training time
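
Q29 asks about memory usage as well as speed, but the loop above records only wall-clock time. The following is a minimal sketch of one way to capture both, not the notebook's own method: `measure_run` is a hypothetical helper that wraps any callable, and it reads PyTorch's CUDA peak-memory counters only when a CUDA device is available. Otherwise it reports `None` for memory.

```python
import time

try:
    import torch
    HAS_CUDA = torch.cuda.is_available()
except ImportError:  # allows the sketch to run without PyTorch installed
    torch = None
    HAS_CUDA = False


def measure_run(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds, peak_mb).

    peak_mb is the peak GPU memory allocated by tensors during the call,
    or None when no CUDA device is available.
    """
    if HAS_CUDA:
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()  # finish pending GPU work before timing starts

    start = time.perf_counter()
    result = fn(*args, **kwargs)
    if HAS_CUDA:
        torch.cuda.synchronize()  # wait for queued kernels before stopping the clock
    elapsed = time.perf_counter() - start

    peak_mb = torch.cuda.max_memory_allocated() / 1024**2 if HAS_CUDA else None
    return result, elapsed, peak_mb


# Dummy workload standing in for a training run:
total, elapsed, peak_mb = measure_run(sum, range(1_000_000))
print(f"result={total}, elapsed={elapsed:.4f}s, peak_mb={peak_mb}")
```

In the notebook, this could wrap a call such as `measure_run(train_with_batch_size, 32)`. The elapsed time would then include the BLEU evaluation, unlike `total_time`, which covers only the training epochs. For reference, the logs above show ten epochs taking 829.96 seconds at batch size 16 and 537.01 seconds at batch size 32, roughly 83 and 54 seconds per epoch.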

##### **Q30: How do you modify the number of training epochs and analyze how it affects the model’s convergence and translation accuracy?**


In [42]:
epoch_numbers = [5, 15, 20]

def train_with_epochs(num_epochs):
    print(f"\nTraining with number of epochs: {num_epochs}")
    encoder = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, ENC_N_LAYERS).to(device)
    decoder = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, DEC_N_LAYERS).to(device)
    seq2seq_model = Seq2Seq(encoder, decoder, device).to(device)
    
    optimizer = optim.Adam(seq2seq_model.parameters(), lr=0.001)
    criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
    
    for epoch in range(num_epochs):
        seq2seq_model.train()
        epoch_loss = 0
        for src_batch, trg_batch in train_loader:
            src_batch = src_batch.to(device)
            trg_batch = trg_batch.to(device)
            
            optimizer.zero_grad()
            output = seq2seq_model(src_batch, trg_batch)
            
            # output: (batch_size, trg_len, trg_vocab_size)
            output_dim = output.shape[-1]
            output = output[:, 1:, :].reshape(-1, output_dim)
            trg = trg_batch[:, 1:].reshape(-1)
            
            loss = criterion(output, trg)
            loss.backward()
            optimizer.step()
            
            epoch_loss += loss.item()
        avg_loss = epoch_loss / len(train_loader)
        print(f'Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss:.4f}')
    
    bleu_score = evaluate(seq2seq_model, valid_loader)
    print(f'Validation BLEU score after {num_epochs} epochs: {bleu_score*100:.2f}')
    return avg_loss, bleu_score

In [43]:
results_epochs = []
for num_epochs in epoch_numbers:
    avg_loss, bleu = train_with_epochs(num_epochs)
    results_epochs.append({'num_epochs': num_epochs, 'loss': avg_loss, 'bleu_score': bleu})

print("\nEpoch Number Experiment Results:")
for res in results_epochs:
    print(f"Number of Epochs: {res['num_epochs']}, Final Loss: {res['loss']:.4f}, BLEU Score: {res['bleu_score']*100:.2f}")


Training with number of epochs: 5
Epoch 1/5, Loss: 4.6984
Epoch 2/5, Loss: 3.6317
Epoch 3/5, Loss: 3.0833
Epoch 4/5, Loss: 2.6554
Epoch 5/5, Loss: 2.2912
Validation BLEU score after 5 epochs: 9.15

Training with number of epochs: 15
Epoch 1/15, Loss: 4.7550
Epoch 2/15, Loss: 3.7759
Epoch 3/15, Loss: 3.2685
Epoch 4/15, Loss: 2.8572
Epoch 5/15, Loss: 2.5082
Epoch 6/15, Loss: 2.1977
Epoch 7/15, Loss: 1.9555
Epoch 8/15, Loss: 1.7371
Epoch 9/15, Loss: 1.5629
Epoch 10/15, Loss: 1.4165
Epoch 11/15, Loss: 1.2908
Epoch 12/15, Loss: 1.1703
Epoch 13/15, Loss: 1.0776
Epoch 14/15, Loss: 1.0027
Epoch 15/15, Loss: 0.9374
Validation BLEU score after 15 epochs: 8.78

Training with number of epochs: 20
Epoch 1/20, Loss: 4.7713
Epoch 2/20, Loss: 3.7686
Epoch 3/20, Loss: 3.2649
Epoch 4/20, Loss: 2.8748
Epoch 5/20, Loss: 2.5597
Epoch 6/20, Loss: 2.2812
Epoch 7/20, Loss: 2.0288
Epoch 8/20, Loss: 1.8197
Epoch 9/20, Loss: 1.6428
Epoch 10/20, Loss: 1.5024
Epoch 11/20, Loss: 1.3940
Epoch 12/20, Loss: 1.2498
Ep
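
In the logs above, the training loss keeps falling between the 5-epoch and 15-epoch runs (2.2912 to 0.9374), while validation BLEU does not improve (9.15 versus 8.78). That pattern is consistent with overfitting, although one run per setting cannot confirm it. A common response is early stopping on a validation metric, which the notebook does not implement. The sketch below shows the idea under stated assumptions: `EarlyStopping` is a hypothetical helper, and the scores in the demo are made up for illustration rather than taken from any run.

```python
class EarlyStopping:
    """Signal a stop when a validation score has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience    # epochs to wait after the last improvement
        self.min_delta = min_delta  # smallest increase that counts as an improvement
        self.best_score = None
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, score, epoch):
        """Record this epoch's score; return True when training should stop."""
        if self.best_score is None or score > self.best_score + self.min_delta:
            self.best_score = score
            self.best_epoch = epoch
            self.bad_epochs = 0
            return False
        self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Illustrative (made-up) validation BLEU values, one per epoch:
fake_bleu_per_epoch = [0.050, 0.080, 0.090, 0.088, 0.087, 0.100]

stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, bleu in enumerate(fake_bleu_per_epoch, start=1):
    if stopper.step(bleu, epoch):
        stopped_at = epoch
        break

print(f"stopped at epoch {stopped_at}, best epoch {stopper.best_epoch}")
```

Using this in `train_with_epochs` would require calling `evaluate` on `valid_loader` after every epoch and saving the model weights at the best epoch, which adds evaluation time to each epoch.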

##### **Q31: How do you experiment with different recurrent layers (e.g., LSTM vs. GRU) to evaluate their impact on translation quality?**

In [44]:
def train_model(model_type='GRU'):
    print(f"\nTraining with {model_type} model")
    if model_type == 'GRU':
        # Define GRU-based model:
        encoder = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, ENC_N_LAYERS).to(device)
        decoder = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, DEC_N_LAYERS).to(device)
        seq2seq_model = Seq2Seq(encoder, decoder, device).to(device)
    elif model_type == 'LSTM':
        # Define LSTM-based model:
        class EncoderLSTM(nn.Module):
            def __init__(self, input_dim, emb_dim, hid_dim, n_layers):
                super(EncoderLSTM, self).__init__()
                self.embedding = nn.Embedding(input_dim, emb_dim)
                self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, batch_first=True)
                
            def forward(self, src):
                embedded = self.embedding(src)
                outputs, (hidden, cell) = self.rnn(embedded)
                return hidden, cell
        
        class DecoderLSTM(nn.Module):
            def __init__(self, output_dim, emb_dim, hid_dim, n_layers):
                super(DecoderLSTM, self).__init__()
                self.embedding = nn.Embedding(output_dim, emb_dim)
                self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, batch_first=True)
                self.fc_out = nn.Linear(hid_dim, output_dim)
                
            def forward(self, input, hidden, cell):
                input = input.unsqueeze(1)
                embedded = self.embedding(input)
                output, (hidden, cell) = self.rnn(embedded, (hidden, cell))
                prediction = self.fc_out(output.squeeze(1))
                return prediction, hidden, cell
        
        class Seq2SeqLSTM(nn.Module):
            def __init__(self, encoder, decoder, device):
                super(Seq2SeqLSTM, self).__init__()
                self.encoder = encoder
                self.decoder = decoder
                self.device = device
                
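            def forward(self, src, trg, teacher_forcing_ratio=0.5):
                batch_size = src.shape[0]
                trg_len = trg.shape[1]
                trg_vocab_size = self.decoder.embedding.num_embeddings

                # Buffer for the decoder logits at every target position:
                outputs = torch.zeros(batch_size, trg_len, trg_vocab_size).to(self.device)

                # The encoder's final hidden and cell states initialize the decoder:
                hidden, cell = self.encoder(src)

                # The first decoder input is the first target token of each sequence:
                input = trg[:, 0]

                for t in range(1, trg_len):
                    output, hidden, cell = self.decoder(input, hidden, cell)
                    outputs[:, t, :] = output
                    # Teacher forcing: feed the ground-truth token with probability
                    # teacher_forcing_ratio, otherwise feed the model's own prediction:
                    teacher_force = random.random() < teacher_forcing_ratio
                    top1 = output.argmax(1)
                    input = trg[:, t] if teacher_force else top1
                return outputs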
        
        encoder = EncoderLSTM(INPUT_DIM, ENC_EMB_DIM, HID_DIM, ENC_N_LAYERS).to(device)
        decoder = DecoderLSTM(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, DEC_N_LAYERS).to(device)
        seq2seq_model = Seq2SeqLSTM(encoder, decoder, device).to(device)
    else:
        raise ValueError("model_type must be 'GRU' or 'LSTM'")
    
    optimizer = optim.Adam(seq2seq_model.parameters(), lr=0.001)
    criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
    
    num_epochs = 10
    for epoch in range(num_epochs):
        seq2seq_model.train()
        epoch_loss = 0
        for src_batch, trg_batch in train_loader:
            src_batch = src_batch.to(device)
            trg_batch = trg_batch.to(device)
            
            optimizer.zero_grad()
            output = seq2seq_model(src_batch, trg_batch)
            
            # output: (batch_size, trg_len, trg_vocab_size)
            output_dim = output.shape[-1]
            output = output[:, 1:, :].reshape(-1, output_dim)
            trg = trg_batch[:, 1:].reshape(-1)
            
            loss = criterion(output, trg)
            loss.backward()
            optimizer.step()
            
            epoch_loss += loss.item()
        avg_loss = epoch_loss / len(train_loader)
        print(f'Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss:.4f}')
    
    bleu_score = evaluate(seq2seq_model, valid_loader)
    print(f'Validation BLEU score: {bleu_score*100:.2f}')
    return avg_loss, bleu_score

In [45]:
gru_loss, gru_bleu = train_model(model_type='GRU')  # Train and evaluate GRU-based model

lstm_loss, lstm_bleu = train_model(model_type='LSTM')  # Train and evaluate LSTM-based model

print("\nRecurrent Layer Experiment Results:")
print(f"GRU Model - Final Loss: {gru_loss:.4f}, BLEU Score: {gru_bleu*100:.2f}")
print(f"LSTM Model - Final Loss: {lstm_loss:.4f}, BLEU Score: {lstm_bleu*100:.2f}")


Training with GRU model
Epoch 1/10, Loss: 4.6459
Epoch 2/10, Loss: 3.5893
Epoch 3/10, Loss: 3.0096
Epoch 4/10, Loss: 2.5689
Epoch 5/10, Loss: 2.2054
Epoch 6/10, Loss: 1.8771
Epoch 7/10, Loss: 1.6052
Epoch 8/10, Loss: 1.4044
Epoch 9/10, Loss: 1.2310
Epoch 10/10, Loss: 1.0996
Validation BLEU score: 9.33

Training with LSTM model
Epoch 1/10, Loss: 4.9880
Epoch 2/10, Loss: 4.1204
Epoch 3/10, Loss: 3.7079
Epoch 4/10, Loss: 3.3919
Epoch 5/10, Loss: 3.1239
Epoch 6/10, Loss: 2.8745
Epoch 7/10, Loss: 2.6470
Epoch 8/10, Loss: 2.4340
Epoch 9/10, Loss: 2.2234
Epoch 10/10, Loss: 2.0312
Validation BLEU score: 8.62

Recurrent Layer Experiment Results:
GRU Model - Final Loss: 1.0996, BLEU Score: 9.33
LSTM Model - Final Loss: 2.0312, BLEU Score: 8.62
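
One structural difference between the two models compared above is size: an LSTM layer has four gate blocks where a GRU layer has three, so with identical dimensions the LSTM has 4/3 as many recurrent parameters. The sketch below counts them following the per-layer weight layout of `nn.GRU` and `nn.LSTM` (one input-to-hidden matrix, one hidden-to-hidden matrix, and two bias vectors). The function name and the dimension values are hypothetical, since the notebook's `ENC_EMB_DIM`, `HID_DIM`, and layer counts are defined outside this section. Whether the size difference explains the slower LSTM convergence in the logs above is not established by this experiment.

```python
def recurrent_param_count(n_gates, input_size, hidden_size, num_layers=1):
    """Count trainable parameters of a unidirectional stacked GRU or LSTM.

    Per layer: weight_ih (n_gates*hidden x layer_input),
    weight_hh (n_gates*hidden x hidden), and two bias vectors
    (bias_ih, bias_hh) of length n_gates*hidden each.
    """
    total = 0
    for layer in range(num_layers):
        # Layers after the first receive the previous layer's hidden state as input.
        layer_input = input_size if layer == 0 else hidden_size
        total += n_gates * hidden_size * (layer_input + hidden_size + 2)
    return total


# Hypothetical dimensions, chosen only for illustration:
EMB_DIM, HID_DIM, N_LAYERS = 256, 512, 2

gru_params = recurrent_param_count(3, EMB_DIM, HID_DIM, N_LAYERS)   # GRU: 3 gate blocks
lstm_params = recurrent_param_count(4, EMB_DIM, HID_DIM, N_LAYERS)  # LSTM: 4 gate blocks

print(f"GRU recurrent parameters:  {gru_params:,}")
print(f"LSTM recurrent parameters: {lstm_params:,}")
print(f"LSTM / GRU ratio: {lstm_params / gru_params:.3f}")
```

The count covers only the recurrent layers. The embedding and output projection layers are the same size in both variants and are left out.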


In [46]:
import shutil
import os

if os.path.exists('data'):
    shutil.rmtree('data')
    print("Folder 'data' has been deleted.")
else:
    print("Folder 'data' does not exist.")

Folder 'data' has been deleted.
