<a href="https://colab.research.google.com/github/deepakgarg08/llm-diary/blob/main/llm_chronicles_rnn_encoder_decoder_translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Language Translation with RNNs

In this notebook we'll implement an encoder/decoder RNN for language translation from English to Italian, similar to the model described in the 2014 paper "Sequence to Sequence Learning with Neural Networks" by Ilya Sutskever, Oriol Vinyals, and Quoc V. Le.

https://proceedings.neurips.cc/paper/2014/file/a14ac55a4f27472c5d894ec1c3c743d2-Paper.pdf

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence

In [None]:
# Device-independent code
DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
DEVICE

device(type='cuda', index=0)

# 1 - Processing dataset

For this network, we'll use a distilled version of the "English_Italian_Sentence_Translation" dataset from Kaggle (https://www.kaggle.com/datasets/ncsaayali/english-italian-sentence-translation).

I have pre-processed this dataset by reducing the vocabulary and removing punctuation, so that our small RNN can quickly learn some useful patterns.

![picture](https://raw.githubusercontent.com/kyuz0/llm-chronicles/main/4.5%20-%20Lab%20-%20Translation/dataset.png)


In [None]:
!wget https://raw.githubusercontent.com/kyuz0/llm-chronicles/main/datasets/eng_ita_v2.txt

--2023-12-19 17:05:34--  https://raw.githubusercontent.com/kyuz0/llm-chronicles/main/datasets/eng_ita_v2.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7240475 (6.9M) [text/plain]
Saving to: ‘eng_ita_v2.txt’


2023-12-19 17:05:34 (162 MB/s) - ‘eng_ita_v2.txt’ saved [7240475/7240475]



In [None]:
import numpy as np

# Function to read the file and extract pairs
def read_data(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        lines = file.read().strip().split('\n')
    pairs = [[s for s in line.split(' -> ')] for line in lines]
    return pairs

In [None]:
# Path to text file
file_path = 'eng_ita_v2.txt'

# Process the data
pairs = read_data(file_path)
len(pairs)

120746

![picture](https://raw.githubusercontent.com/kyuz0/llm-chronicles/main/4.5%20-%20Lab%20-%20Translation/vocab.png)

In [None]:
# Create vocabulary
def tokenize(sentence):
  return sentence.lower().split()

# Tokenize and build vocabularies
def build_vocab(pairs):
    eng_vocab = set()
    ita_vocab = set()
    for eng, ita in pairs:
        eng_vocab.update(tokenize(eng))
        ita_vocab.update(tokenize(ita))
    return eng_vocab, ita_vocab

english_vocab, italian_vocab = build_vocab(pairs)

# Creating word to integer mapping
eng_word2int = {word: i for i, word in enumerate(english_vocab)}
ita_word2int = {word: i for i, word in enumerate(italian_vocab)}

# Creating integer to word mapping
eng_int2word = {i: word for word, i in eng_word2int.items()}
ita_int2word = {i: word for word, i in ita_word2int.items()}

print('English vocabulary size:', len(english_vocab))
print('Italian vocabulary size:', len(italian_vocab))

English vocabulary size: 4997
Italian vocabulary size: 13673


In [None]:
# Example usage
eng_example = "Who are you"
ita_example = "chi sei tu"

# Encoding
eng_encoded = np.array([eng_word2int[word] for word in tokenize(eng_example)], dtype=np.int32)
ita_encoded = np.array([ita_word2int[word] for word in tokenize(ita_example)], dtype=np.int32)

print('English text encoded:', eng_encoded)
print('Italian text encoded:', ita_encoded)

# Decoding
print('Decoded English:', ' '.join([eng_int2word[i] for i in eng_encoded]))
print('Decoded Italian:', ' '.join([ita_int2word[i] for i in ita_encoded]))

English text encoded: [4492 3186 4986]
Italian text encoded: [ 4502  2356 10759]
Decoded English: who are you
Decoded Italian: chi sei tu


## 1.1 - Dataset and Dataloader

As usual, we'll wrap our dataset in PyTorch `Dataset` and `DataLoader` objects, so that we can interact with it in a standard way during training.

We'll also add some special tokens to our dataset. These tokens play crucial roles in the processing and generation of sequences:

- **PAD_TOKEN** (`<PAD>`): This token is used for padding shorter sequences to match the length of the longest sequence in a batch. Padding ensures that all sequences in a batch have the same length, which is a requirement for many models.

- **EOS_TOKEN** (`<EOS>`): The End Of Sequence/Sentence token signifies the end of a sequence. This is particularly important in models that generate text, as it indicates when the model should stop generating further output.

- **SOS_TOKEN** (`<SOS>`): The Start Of Sequence/Sentence token is used to signal the beginning of a new sequence. This is often used in models that generate sequences, as it indicates the start of a new textual output.

- **UNK_TOKEN** (`<UNK>`): The Unknown token is used to represent words that are not in the vocabulary. This is a common practice in NLP applications to handle rare or unknown words that the model may encounter.

By incorporating these tokens into our dataset, we can effectively manage sequence lengths and handle special cases in sequence generation and processing. In the next steps, we will integrate these tokens into our dataset preprocessing and model training pipeline.
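
As a minimal sketch of how these tokens fit together, the snippet below encodes two toy sentences, maps an out-of-vocabulary word to `<UNK>`, appends `<EOS>`, and pads the shorter sequence with `<PAD>`. The vocabulary `toy_word2int` and the helper `encode` are hypothetical names used only for illustration; they are not the notebook's real mappings.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical toy vocabulary; the special tokens take the first indices
toy_word2int = {"<PAD>": 0, "<SOS>": 1, "<EOS>": 2, "<UNK>": 3,
                "are": 4, "who": 5, "you": 6}

def encode(sentence, word2int):
    # Unknown words fall back to <UNK>; every sequence ends with <EOS>
    ids = [word2int.get(w, word2int["<UNK>"]) for w in sentence.lower().split()]
    return torch.tensor(ids + [word2int["<EOS>"]], dtype=torch.long)

sequences = [encode("Who are you", toy_word2int), encode("hello", toy_word2int)]

# Pad every sequence in the batch to the length of the longest one
padded = pad_sequence(sequences, batch_first=True,
                      padding_value=toy_word2int["<PAD>"])
print(padded)
```

The second row starts with the `<UNK>` index because "hello" is not in the toy vocabulary, and it ends with two `<PAD>` entries so that both rows have the same length.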

In [None]:
# Special tokens
PAD_TOKEN = "<PAD>"
EOS_TOKEN = "<EOS>"
SOS_TOKEN = "<SOS>"
UNK_TOKEN = "<UNK>"

# Update the function to create mappings to include the special tokens
def create_mappings(vocab):
    vocab = [PAD_TOKEN, SOS_TOKEN, EOS_TOKEN, UNK_TOKEN] + sorted(vocab)
    word2int = {word: i for i, word in enumerate(vocab)}
    int2word = {i: word for word, i in word2int.items()}
    return word2int, int2word

# Update the vocabularies
eng_word2int, eng_int2word = create_mappings(english_vocab)
ita_word2int, ita_int2word = create_mappings(italian_vocab)

In [None]:
class TranslationDataset(Dataset):
    def __init__(self, pairs, eng_word2int, ita_word2int):
        self.pairs = pairs
        self.eng_word2int = eng_word2int
        self.ita_word2int = ita_word2int

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        eng, ita = self.pairs[idx]
        eng_tensor = torch.tensor([self.eng_word2int[word] for word in tokenize(eng)]
                                  + [self.eng_word2int[EOS_TOKEN]], dtype=torch.long)
        ita_tensor = torch.tensor([self.ita_word2int[word] for word in tokenize(ita)]
                                  + [self.ita_word2int[EOS_TOKEN]], dtype=torch.long)
        return eng_tensor, ita_tensor

# Custom collate function to handle padding
def collate_fn(batch):
    eng_batch, ita_batch = zip(*batch)
    eng_batch_padded = pad_sequence(eng_batch, batch_first=True, padding_value=eng_word2int[PAD_TOKEN])
    ita_batch_padded = pad_sequence(ita_batch, batch_first=True, padding_value=ita_word2int[PAD_TOKEN])
    return eng_batch_padded, ita_batch_padded

In [None]:
from torch.nn.utils.rnn import pad_sequence

# Create the dataset and DataLoader
translation_dataset = TranslationDataset(pairs, eng_word2int, ita_word2int)
batch_size = 64
translation_dataloader = DataLoader(translation_dataset, batch_size=batch_size,
                                    shuffle=True,  drop_last=True, collate_fn=collate_fn)

print("Translation samples: ", len(translation_dataset))
print("Translation batches: ", len(translation_dataloader))

Translation samples:  120746
Translation batches:  1886


In [None]:
# Example: iterating over the DataLoader
for eng, ita in translation_dataloader:
    print("English batch:", eng)
    print("Italian batch:", ita)
    break # remove this to iterate over the whole dataset

English batch: tensor([[4848,  253, 4407,  ...,    0,    0,    0],
        [4478,  646, 2629,  ...,    0,    0,    0],
        [4478, 2395, 2629,  ...,    0,    0,    0],
        ...,
        [3825, 1658, 2030,  ...,    0,    0,    0],
        [2484, 4392, 2074,  ...,    0,    0,    0],
        [4985, 1982, 4467,  ...,    0,    0,    0]])
Italian batch: tensor([[ 3264, 12016,  9301,  ...,     0,     0,     0],
        [12656,  2347,  7031,  ...,     0,     0,     0],
        [12656, 10590,  2331,  ...,     0,     0,     0],
        ...,
        [ 5580,  9932,  6453,  ...,     0,     0,     0],
        [ 2331,  6122,  6634,  ...,     0,     0,     0],
        [ 7774,  8382, 13332,  ...,     0,     0,     0]])


# 2 - Encoder / Decoder RNN

We'll now implement a delayed sequence-to-sequence model, where the output is generated only after the entire input has been read. It has two distinct components:

- **Encoder**: the first RNN component processes the input sentence and converts it into a fixed-size hidden representation.


- **Decoder**: the second RNN component takes the hidden representation from the encoder and produces the translated sentence in the target language.

![picture](https://raw.githubusercontent.com/kyuz0/llm-chronicles/main/4.5%20-%20Lab%20-%20Translation/encoder-decoder_1.png)
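
To make "fixed-size hidden representation" concrete, here is a small sketch with toy sizes (the names `toy_embedding` and `toy_lstm` and all dimensions are assumptions for illustration, not the notebook's hyperparameters). The per-step outputs grow with the sentence length, while the final hidden and cell states keep the same shape for short and long inputs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes, chosen only for illustration
toy_embedding = nn.Embedding(num_embeddings=20, embedding_dim=8)
toy_lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

short_ids = torch.randint(0, 20, (1, 3))    # one sentence of 3 tokens
long_ids = torch.randint(0, 20, (1, 12))    # one sentence of 12 tokens

with torch.no_grad():
    out_s, (h_s, c_s) = toy_lstm(toy_embedding(short_ids))
    out_l, (h_l, c_l) = toy_lstm(toy_embedding(long_ids))

print(out_s.shape, out_l.shape)  # per-step outputs: length depends on the input
print(h_s.shape, h_l.shape)      # final hidden state: same shape for both inputs
```

This is why the decoder can be initialized the same way regardless of how long the source sentence is.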

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers=1):
        super().__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers=num_layers,
                            batch_first=True)

    def forward(self, x):
        # Reversing the sequence of indices
        x = torch.flip(x, [1])
        embedded = self.embedding(x)
        outputs, (hidden, cell) = self.lstm(embedded)
        return outputs, hidden, cell
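
The encoder above reverses each source sequence before embedding it, following the input-reversal trick from the paper. A small sketch of what `torch.flip` does to a padded batch (the token indices are made up, and `PAD = 0` is assumed to match the mappings above):

```python
import torch

PAD = 0  # assumed padding index, matching the mappings above

# Two made-up sequences; the first one is padded to length 5
batch = torch.tensor([[5, 6, 2, PAD, PAD],
                      [7, 8, 9, 4, 2]])

# Reverse along the time dimension (dim 1)
flipped = torch.flip(batch, [1])
print(flipped)
```

Note that for padded rows the padding ends up at the front of the reversed sequence, so the LSTM reads the `<PAD>` embeddings before the real words.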

In [None]:
class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers=1):
        super().__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers=num_layers,
                            batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, hidden, cell):
        out = self.embedding(x)
        out, (hidden, cell) = self.lstm(out, (hidden, cell))
        out = self.fc(out).reshape(out.size(0), -1)
        return out, hidden, cell

In [None]:
# Hyperparameters
eng_vocab_size = len(eng_word2int)
ita_vocab_size = len(ita_word2int)
embed_size = 256
hidden_size = 512
num_layers = 1

# Initialize the models
encoder = Encoder(eng_vocab_size, embed_size, hidden_size, num_layers).to(DEVICE)
decoder = Decoder(ita_vocab_size, embed_size, hidden_size, num_layers).to(DEVICE)

# 3 - Translation (inference)

Both the encoder and the decoder take in a sequence of tokens and project them into embeddings. For the encoder, we discard the outputs at the individual time steps and keep only the final hidden and cell states, which we use to initialize the decoder. The decoder then works like a regular language model, predicting the Italian translation one word at a time.

As in our basic language model, the decoder ends with a linear layer that outputs logits; applying a softmax to them gives the probability distribution over the Italian vocabulary for the next word. The only difference from a regular language model is that we use a special SOS (start of sentence) token as the seed for the first time step. We keep generating words until the model outputs another special token, EOS (end of sentence), or until a maximum length is reached.


![picture](https://raw.githubusercontent.com/kyuz0/llm-chronicles/main/4.5%20-%20Lab%20-%20Translation/encoder-decoder_2.png)
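
As a rough illustration of the last decoding step, the sketch below turns made-up logits into probabilities with a softmax and picks the most likely word (greedy decoding). The four-word vocabulary and the logit values are hypothetical. Because softmax preserves the ordering of the logits, taking the argmax of the logits directly selects the same word.

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

toy_vocab = ["<EOS>", "ciao", "tom", "casa"]   # hypothetical 4-word vocabulary
toy_logits = [0.5, 2.0, 1.0, -1.0]             # made-up decoder outputs

probs = softmax(toy_logits)

# Greedy choice: index of the highest probability
best = max(range(len(probs)), key=probs.__getitem__)
print(toy_vocab[best], [round(p, 3) for p in probs])
```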








In [None]:
def translate(encoder, decoder, sentence, eng_word2int, ita_int2word, max_length=15):
    encoder.eval()
    decoder.eval()
    # Special-token indices on the Italian (target) side; relies on the global ita_word2int
    sos_idx = ita_word2int[SOS_TOKEN]
    eos_idx = ita_word2int[EOS_TOKEN]
    with torch.inference_mode():
        # Tokenize and encode the sentence, mapping unseen words to <UNK>
        unk_idx = eng_word2int[UNK_TOKEN]
        input_tensor = torch.tensor([eng_word2int.get(word, unk_idx) for word in tokenize(sentence)]
                                    + [eng_word2int[EOS_TOKEN]], dtype=torch.long)
        input_tensor = input_tensor.view(1, -1).to(DEVICE)  # shape (1, seq_len), batch_first=True

        # Pass the input through the encoder
        _, encoder_hidden, encoder_cell = encoder(input_tensor)

        # Initialize the decoder state with the encoder's final hidden and cell states
        decoder_hidden, decoder_cell = encoder_hidden, encoder_cell

        # Decode greedily, starting from the SOS token
        decoded_words = []
        last_word = torch.tensor([[sos_idx]], dtype=torch.long, device=DEVICE)
        for _ in range(max_length):
            logits, decoder_hidden, decoder_cell = decoder(last_word, decoder_hidden, decoder_cell)
            next_token = logits.argmax(dim=1)  # greedy: most likely next word, shape (1,)
            if next_token.item() == eos_idx:
                break
            decoded_words.append(ita_int2word[next_token.item()])
            last_word = next_token.view(1, 1)  # feed the prediction back in as the next input

        return ' '.join(decoded_words)

In [None]:
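# Example usage (run this cell after the training section below;
# an untrained model produces arbitrary words)
# Other sentences to try:
#   s/he likes music, listening to music
#   tom left yesterday, they left yesterday
#   i want to go home now
#   tom likes chocolate
#   tom was right about that
#   tom said he would not come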

sentence = "tom said he would not come"
translated_sentence = translate(encoder, decoder, sentence, eng_word2int, ita_int2word)
print("Translated:", translated_sentence)

Translated: tom ha detto che non sarebbe venuto


# 3 - Training

Let's now implement the training loop, using teacher forcing. **Teacher forcing** is a training method used in the context of sequential models, particularly in language modeling and machine translation. This technique involves using the actual target output from the previous time step as the current input, rather than using the model's prediction from the previous step.

During training, this approach accelerates and stabilizes learning by providing the model with the correct context for generating the subsequent part of the sequence. However, while teacher forcing can speed up training and often improve the model's performance on the training data, it can also lead to issues like the model becoming overly reliant on this guidance. This reliance can result in a discrepancy between training and inference phases, as the model may struggle to generate accurate predictions on its own during inference, a phenomenon known as exposure bias.

![picture](https://raw.githubusercontent.com/kyuz0/llm-chronicles/main/4.5%20-%20Lab%20-%20Translation/training.png)
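
A minimal sketch of what teacher forcing means for the decoder inputs, using a made-up target sentence: at every step the decoder receives the true previous token, so the input sequence is simply the target shifted right by one position with SOS in front.

```python
SOS, EOS = "<SOS>", "<EOS>"

# Made-up target sentence, ending with EOS as in the dataset class above
target = ["tom", "ha", "ragione", EOS]

# Teacher forcing: the input at step t is the true token from step t-1
decoder_inputs = [SOS] + target[:-1]

for step, (inp, tgt) in enumerate(zip(decoder_inputs, target)):
    print(f"step {step}: input={inp!r:12} target={tgt!r}")
```

At inference time the true tokens are not available, so each input is the model's own previous prediction instead; this mismatch is the source of exposure bias.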

As an optimizer we use **AdamW** (https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html). The AdamW optimizer is an extension of the Adam optimizer, known for combining momentum and RMSProp techniques. The key addition in AdamW is the modification of the weight decay component. Unlike traditional **L2 regularization**, which is applied directly to the loss function, AdamW decouples weight decay from the loss function and directly applies it to the parameter updates. This approach helps in better controlling overfitting by **penalizing large weights**, and it's particularly effective when training deep neural networks, as it allows for more precise and stable optimization of complex models.
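
The difference between the two forms of weight decay can be seen in a single optimizer step. The sketch below is an illustration under toy assumptions (one weight at 1.0, a zero gradient from the loss, and exaggerated `lr` and `weight_decay` values); the helper `one_step` is a hypothetical name. With `Adam`, the decay term is added to the gradient and then rescaled by the adaptive step; with `AdamW`, the weight is shrunk directly by a factor of `1 - lr * weight_decay`.

```python
import torch

def one_step(optimizer_cls):
    # Single weight at 1.0 with a zero gradient coming from the loss
    p = torch.nn.Parameter(torch.tensor([1.0]))
    opt = optimizer_cls([p], lr=0.1, weight_decay=0.5)
    p.grad = torch.zeros_like(p)
    opt.step()
    return p.item()

adam_l2 = one_step(torch.optim.Adam)    # decay folded into the gradient (L2 style)
adamw = one_step(torch.optim.AdamW)     # decay applied directly to the weight
print(f"Adam + L2: {adam_l2:.4f}, AdamW: {adamw:.4f}")
```

In this toy setting the AdamW weight shrinks by exactly `lr * weight_decay = 0.05`, whereas the Adam step is roughly the full learning rate because the adaptive normalization cancels the size of the decay term.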


In [None]:
import torch.optim as optim
import torch.nn as nn

# Loss function: ignore padding positions in the (Italian) target
loss_fn = nn.CrossEntropyLoss(ignore_index=ita_word2int[PAD_TOKEN])

# Optimizers
encoder_optimizer = optim.AdamW(encoder.parameters())
decoder_optimizer = optim.AdamW(decoder.parameters())

# Number of epochs
num_epochs = 10

# Training Loop
encoder.train()
decoder.train()

for epoch in range(num_epochs):
    for i, (input_tensor, target_tensor) in enumerate(translation_dataloader):
        input_tensor, target_tensor = input_tensor.to(DEVICE), target_tensor.to(DEVICE)

        # Zero gradients of both optimizers
        encoder_optimizer.zero_grad()
        decoder_optimizer.zero_grad()

        # Read sizes from the batch rather than relying on the global batch_size
        current_batch_size, target_length = target_tensor.shape

        # Encoder
        _, encoder_hidden, encoder_cell = encoder(input_tensor)

        # Decoder: the first input is SOS, the initial state comes from the encoder
        decoder_input = torch.full((current_batch_size, 1), ita_word2int[SOS_TOKEN],
                                   dtype=torch.long, device=DEVICE)
        decoder_hidden = encoder_hidden
        decoder_cell = encoder_cell

        loss = 0

        for di in range(target_length):
            logits, decoder_hidden, decoder_cell = decoder(decoder_input, decoder_hidden, decoder_cell)
            loss += loss_fn(logits, target_tensor[:, di])
            decoder_input = target_tensor[:, di].reshape(current_batch_size, 1)  # Teacher forcing

        # Backpropagation
        loss.backward()

        encoder_optimizer.step()
        decoder_optimizer.step()

        if i % 100 == 0:  # Print the average per-step loss every 100 batches
            print(f'Epoch {epoch}, Batch {i}, Loss: {loss.item() / target_length:.4f}')


Epoch 0, Batch 0, Loss: 9.5280
Epoch 0, Batch 100, Loss: 4.4034
Epoch 0, Batch 200, Loss: 4.1602
Epoch 0, Batch 300, Loss: 4.5719
Epoch 0, Batch 400, Loss: 3.5316
Epoch 0, Batch 500, Loss: 3.4764
Epoch 0, Batch 600, Loss: 3.1723
Epoch 0, Batch 700, Loss: 2.8174
Epoch 0, Batch 800, Loss: 2.9532
Epoch 0, Batch 900, Loss: 2.5940
Epoch 0, Batch 1000, Loss: 2.8760
Epoch 0, Batch 1100, Loss: 2.6314
Epoch 0, Batch 1200, Loss: 2.4217
Epoch 0, Batch 1300, Loss: 3.2909
Epoch 0, Batch 1400, Loss: 2.4033
Epoch 0, Batch 1500, Loss: 2.5604
Epoch 0, Batch 1600, Loss: 2.1422
Epoch 0, Batch 1700, Loss: 2.2601
Epoch 0, Batch 1800, Loss: 3.0945
Epoch 1, Batch 0, Loss: 0.7630
Epoch 1, Batch 100, Loss: 1.4685
Epoch 1, Batch 200, Loss: 1.6303
Epoch 1, Batch 300, Loss: 1.3961
Epoch 1, Batch 400, Loss: 1.3776
Epoch 1, Batch 500, Loss: 1.4368
Epoch 1, Batch 600, Loss: 1.2745
Epoch 1, Batch 700, Loss: 1.0998
Epoch 1, Batch 800, Loss: 1.1950
Epoch 1, Batch 900, Loss: 1.4213
Epoch 1, Batch 1000, Loss: 1.2337
Epoc

# 4 - Bi-directional RNN

A possible improvement is to use a bi-directional RNN for the encoder. In this setup, the model processes the input sequence in both the original and the reverse order. This results in two hidden states for each position in the sequence, giving the model a fuller view of the sentence context from both directions, which can lead to more accurate translations. Because the final forward and backward states are concatenated before being passed to the decoder, the decoder's hidden size must be twice the encoder's.


![picture](https://raw.githubusercontent.com/kyuz0/llm-chronicles/main/4.5%20-%20Lab%20-%20Translation/bi-directional.png)
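
Before looking at the full model, here is a small shape check with toy sizes (the name `toy_lstm` and all dimensions are assumptions for illustration). A single-layer bi-directional LSTM returns one final hidden state per direction; concatenating them along the feature dimension gives a state of size `2 * hidden_size`.

```python
import torch
import torch.nn as nn

# Toy single-layer bi-directional LSTM
toy_lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True, bidirectional=True)
x = torch.randn(4, 5, 8)  # (batch, seq_len, embed_size), toy sizes

with torch.no_grad():
    outputs, (hidden, cell) = toy_lstm(x)

# hidden has one slice per direction: (2, batch, hidden_size)
merged = torch.cat((hidden[0], hidden[1]), dim=1).unsqueeze(0)
print(outputs.shape, hidden.shape, merged.shape)
```

The merged state has shape `(1, batch, 2 * hidden_size)`, which is the shape a single-layer uni-directional decoder LSTM with doubled hidden size expects as its initial state.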

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers=1):
        super().__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers=num_layers,
                            batch_first=True, bidirectional=True)

    def forward(self, x):
        embedded = self.embedding(x)
        outputs, (hidden, cell) = self.lstm(embedded)

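        # Concatenate the final forward and backward states along the feature dimension.
        # Indexing [0] and [1] assumes num_layers=1; the result has shape (1, batch, 2 * hidden_size)
        hidden = torch.cat((hidden[0,:,:], hidden[1,:,:]), dim=1).unsqueeze(0)
        cell = torch.cat((cell[0,:,:], cell[1,:,:]), dim=1).unsqueeze(0)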

        return outputs, hidden, cell

class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers=1):
        super().__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers=num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, hidden, cell):
        out = self.embedding(x)
        out, (hidden, cell) = self.lstm(out, (hidden, cell))
        out = self.fc(out).reshape(out.size(0), -1)
        return out, hidden, cell

In [None]:
# Hyperparameters
eng_vocab_size = len(eng_word2int)
ita_vocab_size = len(ita_word2int)
embed_size = 256
hidden_size = 512
num_layers = 1

# Initialize the models
encoder = Encoder(eng_vocab_size, embed_size, hidden_size, num_layers).to(DEVICE)
decoder = Decoder(ita_vocab_size, embed_size, hidden_size*2, num_layers).to(DEVICE)