# Translation with Encoder-Decoder RNN

We will use this notebook to implement a simple encoder-decoder architecture. The result will definitely not be the best possible model, but we hope that the general approach will become apparent.

Some parts of this notebook were inspired by different official [PyTorch tutorials](https://pytorch.org/tutorials/). The site offers many more tutorials, if you need to deepen your knowledge.

In [1]:
import re
import random

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

In [2]:
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

We are going to use very simple language pairs for our translation task, specifically German-English pairs. The data comes from a collection of sentence pairs prepared for Anki, a flashcard program based on spaced repetition. Many other language pairs are available on the [website](https://www.manythings.org/anki/), if you would like to implement a similar model for a different language.

Below we utilize a couple of bash commands to download and unzip the data.

In [3]:
!cd ../datasets/ && { curl -O https://www.manythings.org/anki/deu-eng.zip ; cd -; }

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 9376k  100 9376k    0     0  1321k      0  0:00:07  0:00:07 --:--:-- 1809k
/home/petruschka/repos/World4AI/website/src/notebooks/sequence_modelling


In [4]:
!rm -rf ../datasets/deu_eng/
!unzip ../datasets/deu-eng.zip -d ../datasets/deu_eng

Archive:  ../datasets/deu-eng.zip
  inflating: ../datasets/deu_eng/deu.txt  
  inflating: ../datasets/deu_eng/_about.txt  


In [5]:
!ls ../datasets/deu_eng

_about.txt  deu.txt


The first language pairs in the file are really simple, consisting only of a single word or expression. You should also notice that each pair is followed by license information, which we will need to remove at a later point.

In [6]:
!head ../datasets/deu_eng/deu.txt

Go.	Geh.	CC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #8597805 (Roujin)
Hi.	Hallo!	CC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #380701 (cburgmer)
Hi.	Grüß Gott!	CC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #659813 (Esperantostern)
Run!	Lauf!	CC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #941078 (Fingerhut)
Run.	Lauf!	CC-BY 2.0 (France) Attribution: tatoeba.org #4008918 (JSakuragi) & #941078 (Fingerhut)
Wow!	Potzdonner!	CC-BY 2.0 (France) Attribution: tatoeba.org #52027 (Zifre) & #2122382 (Pfirsichbaeumchen)
Wow!	Donnerwetter!	CC-BY 2.0 (France) Attribution: tatoeba.org #52027 (Zifre) & #2122391 (Pfirsichbaeumchen)
Duck!	Kopf runter!	CC-BY 2.0 (France) Attribution: tatoeba.org #280158 (CM) & #9968521 (wolfgangth)
Fire!	Feuer!	CC-BY 2.0 (France) Attribution: tatoeba.org #1829639 (Spamster) & #1958697 (Tamy)
Help!	Hilfe!	CC-BY 2.0 (France) Attribution: tatoeba.org #435084 (lukaszpp) & #575889 (MUIRIEL)


Later lines consist of fairly complex sentences, which are much harder to translate.

In [7]:
!tail -n 2 ../datasets/deu_eng/deu.txt

I know that adding sentences only in your native or strongest language is probably not as much fun as practicing writing foreign languages, but please don't add sentences to the Tatoeba Corpus if you are not absolutely sure they are correct. If you want to practice languages that you are studying, please do so by using a website designed for that purpose such as www.lang-8.com.	Ich weiß wohl, dass das ausschließliche Beitragen von Sätzen in der Muttersprache – oder der am besten beherrschten Sprache – nicht ganz so viel Spaß macht, wie sich im Schreiben von Fremdsprachen zu üben; steuere beim Tatoeba-Korpus aber bitte trotzdem keine Sätze bei, über deren Korrektheit du dir nicht völlig im Klaren bist. Wenn du Sprachen, die du gerade lernst, üben möchtest, verwende dazu bitte Netzangebote, die eigens hierfür eingerichtet wurden, wie zum Beispiel www.lang-8.com.	CC-BY 2.0 (France) Attribution: tatoeba.org #3847634 (CM) & #4878147 (Pfirsichbaeumchen)
Doubtless there exists in this world p

Our tokenizer is extremely simple. We lowercase the sentence, strip leading and trailing whitespace and put a space between a word and a following `.`, `!` or `?`, so that punctuation becomes a separate token. We also insert two special tokens: `<sos>` to indicate the start of the sentence and `<eos>` to indicate the end of the sentence.

In [8]:
def normalize(s):
    s = s.lower().strip()
    s = re.sub(r"([.!?])", r" \1", s)
    return s

def tokenizer(s):
    s = normalize(s)
    s = s.split(' ')
    s.insert(0, '<sos>')
    s.append('<eos>')
    return s
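
To make the behaviour concrete, the sketch below repeats the two functions in a compact form and applies them to a sample sentence. The sentence is our own example and is not taken from the dataset.

```python
import re

def normalize(s):
    # lowercase, trim the ends and detach . ! ? from the preceding word
    s = s.lower().strip()
    s = re.sub(r"([.!?])", r" \1", s)
    return s

def tokenizer(s):
    # wrap the whitespace-separated tokens in the two special tokens
    return ['<sos>'] + normalize(s).split(' ') + ['<eos>']

tokens = tokenizer("  Where are you?  ")
print(tokens)
```

Note that splitting on a single space means that two consecutive spaces inside a sentence would produce an empty token.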

We also drop the license column and remove sentence pairs in which either side has more than 20 tokens. Eventually we return two lists with the English and German sequences respectively.

In [9]:
def read_pairs(max_len=20):
    print("Reading lines...")
    en_seq = []
    de_seq = []
    with open('../datasets/deu_eng/deu.txt', 'r', encoding='utf-8') as file:
        print(f"Tokenizing and removing sentences larger than {max_len}")
        for line in file:
            pairs = line.split('\t')
            
            en_sentence, de_sentence = tokenizer(pairs[0]), tokenizer(pairs[1])
            
            if len(en_sentence) <= max_len and len(de_sentence) <= max_len:
                en_seq.append(en_sentence)
                de_seq.append(de_sentence)
        print(f"The dataset has {len(en_seq)} pairs")
        return en_seq, de_seq

In [10]:
en_seq, de_seq = read_pairs()

Reading lines...
Tokenizing and removing sentences larger than 20
The dataset has 255279 pairs


Below is an example of what the tokenization process returns.

In [11]:
en_seq[0]

['<sos>', 'go', '.', '<eos>']

We use our usual procedure to divide the dataset into train, validation and test sets.

In [12]:
from sklearn.model_selection import train_test_split

In [13]:
# split into train (80%), validation (10%) and test (10%) sets:
# first hold out 20% of the pairs, then cut that part in half
train_en, test_val_en, train_de, test_val_de = train_test_split(en_seq, de_seq, test_size=0.2)
val_en, test_en, val_de, test_de = train_test_split(test_val_en, test_val_de, test_size=0.5)

The PyTorch dataset is initialized with the two lists and returns the elements from the provided index. There is no magic here.

In [14]:
class PairDataset(Dataset):
    def __init__(self, en, de):
        assert len(en) == len(de)
        self.en = en
        self.de = de
    
    def __len__(self):
        return len(self.en)
    
    def __getitem__(self, idx):
        return self.en[idx], self.de[idx]

In [15]:
train_dataset = PairDataset(train_en, train_de)
val_dataset = PairDataset(val_en, val_de)
test_dataset = PairDataset(test_en, test_de)

We need to create a vocabulary for each language in order to transform the tokens into indices. We count the tokens of the training set and use torchtext to build the mapping.

In [16]:
from collections import Counter, OrderedDict

In [17]:
en_counter = Counter()
de_counter = Counter()

for line in train_en:
    en_counter.update(line)

for line in train_de:
    de_counter.update(line)

In [18]:
en_sorted_by_freq_tuples = sorted(en_counter.items(), key=lambda x: x[1], reverse=True)
en_ordered_dict = OrderedDict(en_sorted_by_freq_tuples)

de_sorted_by_freq_tuples = sorted(de_counter.items(), key=lambda x: x[1], reverse=True)
de_ordered_dict = OrderedDict(de_sorted_by_freq_tuples)

We have 4 special tokens. `<pad>` for zero-padding, `<unk>` for out-of-vocabulary tokens, `<sos>` to identify the start of the sentence and `<eos>` to identify the end of the sentence.

In [19]:
import torchtext
en_vocab = torchtext.vocab.vocab(en_ordered_dict, min_freq = 5, specials=['<pad>', '<unk>', '<sos>', '<eos>'], special_first = True)
de_vocab = torchtext.vocab.vocab(de_ordered_dict, min_freq = 5, specials=['<pad>', '<unk>', '<sos>', '<eos>'], special_first = True)

en_vocab.set_default_index(1)
de_vocab.set_default_index(1)

Here is an example of what those vocabularies produce. The indices 2 and 3 correspond to `<sos>` and `<eos>` respectively.

In [20]:
print(en_vocab(en_seq[0]))
print(de_vocab(de_seq[0]))

[2, 49, 4, 3]
[2, 629, 4, 3]
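
For readers without torchtext installed, the mapping can be approximated in plain Python. The sketch below is a stand-in under our own assumptions, not torchtext's implementation: `build_vocab` and `encode` are hypothetical helper names, and the toy sentences and `min_freq=2` are chosen for illustration only.

```python
from collections import Counter

SPECIALS = ['<pad>', '<unk>', '<sos>', '<eos>']

def build_vocab(sequences, min_freq=5):
    """Map tokens to indices: specials first, then tokens by descending frequency."""
    counter = Counter()
    for seq in sequences:
        counter.update(seq)
    token_to_idx = {tok: i for i, tok in enumerate(SPECIALS)}
    for tok, freq in counter.most_common():
        # rare tokens are left out and will later map to <unk>
        if freq >= min_freq and tok not in token_to_idx:
            token_to_idx[tok] = len(token_to_idx)
    return token_to_idx

def encode(tokens, token_to_idx):
    # out-of-vocabulary tokens fall back to the <unk> index
    unk = token_to_idx['<unk>']
    return [token_to_idx.get(tok, unk) for tok in tokens]

toy = [['<sos>', 'go', '.', '<eos>'],
       ['<sos>', 'go', 'home', '.', '<eos>']]
vocab = build_vocab(toy, min_freq=2)
ids = encode(['<sos>', 'run', '.', '<eos>'], vocab)
print(ids)
```

The word `run` never appears in the toy data, so it is mapped to index 1, the `<unk>` token.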


In our collate function we pad the shorter sentences with zeros, the index of the `<pad>` token, so that all sequences in a batch have the same length.

In [21]:
def collate(batch):
    en, de = [], []
    for en_token, de_token in batch:
        en.append(torch.tensor(en_vocab(en_token), dtype=torch.int64))
        de.append(torch.tensor(de_vocab(de_token), dtype=torch.int64))
    en_padded = nn.utils.rnn.pad_sequence(en, batch_first=True)
    de_padded = nn.utils.rnn.pad_sequence(de, batch_first=True)
    return en_padded, de_padded
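
What the padding does can be shown without tensors. The following plain-Python sketch mimics right-padding with the `<pad>` index 0. `pad_batch` is a hypothetical helper for illustration and the index lists are made up.

```python
def pad_batch(sequences, pad_idx=0):
    """Right-pad every sequence with pad_idx up to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_idx] * (max_len - len(seq)) for seq in sequences]

batch = [[2, 49, 4, 3], [2, 7, 8, 9, 4, 3]]
padded = pad_batch(batch)
for row in padded:
    print(row)
```

The English and German sides are padded independently of each other, so the two tensors of a batch can have different sequence lengths.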

Finally we create the dataloaders we can loop over during training or inference.

In [22]:
BATCH_SIZE=128
train_dataloader = DataLoader(dataset=train_dataset, 
                              batch_size=BATCH_SIZE,
                              shuffle=True,
                              num_workers=2,
                              drop_last=True,
                              collate_fn=collate)
val_dataloader = DataLoader(dataset=val_dataset, 
                              batch_size=BATCH_SIZE,
                              shuffle=False,
                              num_workers=2,
                              drop_last=False,
                              collate_fn=collate)
test_dataloader = DataLoader(dataset=test_dataset, 
                              batch_size=BATCH_SIZE,
                              shuffle=False,
                              num_workers=2,
                              drop_last=False,
                              collate_fn=collate)

The encoder is just an embedding layer and a two-layer LSTM. At the end of the forward pass we return the hidden state `h_n` and the cell state `c_n`, which are later used as the initial state of the decoder.

In [23]:
class Encoder(nn.Module):
    
    def __init__(self, num_embeddings, embedding_dim=128, hidden_size=128, lstm_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(num_embeddings, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_size, num_layers=lstm_layers, batch_first=True)
        
        
    def forward(self, x):
        x = self.embedding(x)
        _, (h_n, c_n) = self.lstm(x)
        return h_n, c_n
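
The shapes of the returned states are easy to get wrong, so here is a small shape check. It is a sketch with arbitrary sizes (a vocabulary of 50 tokens, a batch of 4 sequences with 7 tokens each) and random indices, not the data from this notebook.

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=50, embedding_dim=128, padding_idx=0)
lstm = nn.LSTM(input_size=128, hidden_size=128, num_layers=2, batch_first=True)

# random token indices: batch of 4 sequences, 7 tokens each
x = torch.randint(low=1, high=50, size=(4, 7))
output, (h_n, c_n) = lstm(embedding(x))

print(output.shape)  # per-step outputs of the top layer: (batch, seq_len, hidden)
print(h_n.shape)     # final hidden state of each layer: (layers, batch, hidden)
print(c_n.shape)     # final cell state of each layer: (layers, batch, hidden)
```

Even with `batch_first=True` the states keep the layer dimension first, which is why the decoder can index them with `h[i]` and `c[i]` per layer.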

The decoder is slightly more involved. An `LSTMCell` is a single LSTM cell that processes one time step, so we can loop over the steps ourselves. The reason why we do not use a full LSTM layer is our requirement to generate a word at each step and use that word as an input in the next step. For that we return the logits, which are used to greedily select the word. Then we use that word and invoke the forward pass of the decoder again, until the sequence is exhausted.

In [24]:
class Decoder(nn.Module):
    
    def __init__(self, num_embeddings, embedding_dim=128, hidden_size=128, lstm_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(num_embeddings, embedding_dim, padding_idx=0)
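        # the first cell receives the embedding, every later cell receives the hidden state of the cell below
        self.lstm_cell_list = nn.ModuleList([nn.LSTMCell(input_size=embedding_dim if i == 0 else hidden_size, hidden_size=hidden_size) for i in range(lstm_layers)])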
        self.fc = nn.Linear(hidden_size, num_embeddings)
    
    def forward(self, x, h, c):
        x = self.embedding(x)
        h_n, c_n = torch.zeros_like(h, device=DEVICE), torch.zeros_like(c, device=DEVICE)
        for i, lstm_cell in enumerate(self.lstm_cell_list):
            (h_n[i], c_n[i]) = lstm_cell(x, (h[i], c[i]))
            x = h_n[i].clone()
        logits = self.fc(x)
        return logits, h_n, c_n

The `EncoderDecoder` combines the two. Here we use a technique called teacher forcing. Teacher forcing means that at some steps we do not feed the word that our model generated into the decoder, but the word that is actually contained in the reference translation. That can stabilize training. At each time step we use teacher forcing with a probability of 50%.

The `Encoder` output is generated in a single run, but the `Decoder` is utilized in a loop, as the next word needs to be generated first.

In [25]:
class EncoderDecoder(nn.Module):
    def __init__(self, encoder, decoder, teacher_forcing_ratio=0.5):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.teacher_forcing_ratio = teacher_forcing_ratio
    
    def forward(self, en_sequence, de_sequence):
        batch_size, sequence_len, num_de_embeddings = de_sequence.size()[0], de_sequence.size()[1], self.decoder.embedding.num_embeddings
        
        # minus 1 due to fewer predictions as inputs, we don't predict <sos>
        outputs = torch.zeros(batch_size, sequence_len-1, num_de_embeddings, device=DEVICE)

        h_n, c_n = self.encoder(en_sequence)
        inp = de_sequence[:, 0]
        for i in range(1, sequence_len):
            logits, h_n, c_n = self.decoder(inp, h_n, c_n)
            outputs[:, i-1] = logits
            
            force = random.random() < self.teacher_forcing_ratio
            if force:
                inp = de_sequence[:, i]
            else:
                inp = logits.argmax(dim=1)
        
        return outputs
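
The teacher forcing decision can be illustrated in isolation. The sketch below only simulates the coin flip with the standard `random` module and does not involve the model. The seed and the number of draws are arbitrary choices for illustration.

```python
import random

random.seed(0)  # arbitrary seed, only to make this illustration repeatable
teacher_forcing_ratio = 0.5
num_steps = 10_000

# True: feed the ground-truth token, False: feed the model's own prediction
decisions = [random.random() < teacher_forcing_ratio for _ in range(num_steps)]
forced_fraction = sum(decisions) / num_steps
print(f"{forced_fraction:.3f}")
```

In the forward pass the decision is drawn once per time step and therefore applies to all sentences of the batch at that step.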

In [26]:
def track_performance(dataloader, model, criterion):
    # switch to evaluation mode
    model.eval()
    loss_sum = 0
    num_iterations = 0

    # no need to calculate gradients
    with torch.inference_mode():
        for en_sequence, de_sequence in dataloader:
            en_sequence = en_sequence.to(DEVICE)
            de_sequence = de_sequence.to(DEVICE)

            logits = model(en_sequence, de_sequence)
            
            # we don't actually predict the <sos> token
            labels = de_sequence[:, 1:]
            # we need to reshape in order to be able to use these tensors with CrossEntropyLoss
            logits = logits.reshape(-1, logits.size()[2])
            labels = labels.reshape(-1)
            loss = criterion(logits, labels)
            loss_sum += loss.cpu().item()
            num_iterations+=1

    # we return the average loss
    return loss_sum/num_iterations


The train function is mostly the same as the one we used in the previous sections.

The target sentence is sliced twice: once to produce the decoder inputs and once to produce the labels. Let's assume the following sentence is the target translation.

`<sos>, what, is, your, name, <eos>`

We use the following part as the input into the decoder.

`<sos>, what, is, your, name`

And this is what we expect the model to generate.

`what, is, your, name, <eos>`

Therefore only this part is used in the loss function.
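
The slicing described above can be written out with plain Python lists. This is a sketch of the fully teacher-forced case, in which every decoder input is the ground-truth token.

```python
target = ['<sos>', 'what', 'is', 'your', 'name', '<eos>']

decoder_inputs = target[:-1]   # everything except the last token
expected_outputs = target[1:]  # everything except <sos>

# each input token is paired with the token the model should predict next
for inp, out in zip(decoder_inputs, expected_outputs):
    print(f"{inp:>6} -> {out}")
```

Both slices have one element less than the target, which matches the `sequence_len-1` predictions that the model returns.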

In [27]:
def train(num_epochs, train_dataloader, val_dataloader, model, optimizer, criterion, scheduler=None):
    min_loss = float("inf")
    for epoch in range(num_epochs):
        loss_sum = 0
        num_iterations = 0
        for en_sequence, de_sequence in train_dataloader:
            model.train()

            optimizer.zero_grad()
            en_sequence = en_sequence.to(DEVICE)
            de_sequence = de_sequence.to(DEVICE)

            logits = model(en_sequence, de_sequence)
            # we don't actually predict the <sos> token
            labels = de_sequence[:, 1:]

            # we need to reshape in order to be able to use these tensors with CrossEntropyLoss
            logits = logits.reshape(-1, logits.size()[2])
            labels = labels.reshape(-1)
            loss = criterion(logits, labels)
            loss.backward()
            optimizer.step()
            
            loss_sum += loss.cpu().item()
            num_iterations += 1
        train_loss=loss_sum/num_iterations
        val_loss = track_performance(val_dataloader, model, criterion)
        if scheduler:
            scheduler.step(val_loss)
        print(f'Epoch: {epoch+1:>2}/{num_epochs} | Train Loss: {train_loss:.5f} | Val Loss: {val_loss:.5f}')
        
        if val_loss < min_loss:
            print("Saving Weights!")
            min_loss = val_loss
            torch.save({'encoder_weights': model.encoder.state_dict(), 'decoder_weights': model.decoder.state_dict()}, f='../temp/encoder_decoder.pt')

In [28]:
encoder = Encoder(num_embeddings=len(en_vocab), embedding_dim=128)
decoder = Decoder(num_embeddings=len(de_vocab), embedding_dim=128)
seq2seq = EncoderDecoder(encoder, decoder).to(DEVICE)

In [29]:
optimizer = optim.Adam(seq2seq.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss(ignore_index=0)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer,
                                                       factor=0.1,
                                                       mode='min',
                                                       patience=2,
                                                       verbose=True)

num_epochs=25
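
`ignore_index=0` makes the loss skip every position whose label is the `<pad>` index, so padding does not influence the average. The plain-Python sketch below reproduces that averaging rule for a toy example. `masked_cross_entropy` is a hypothetical helper and the logits are made-up numbers.

```python
import math

def masked_cross_entropy(logits, labels, ignore_index=0):
    """Mean negative log-likelihood over positions whose label is not ignore_index."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # padded position: contributes nothing to the loss
        log_sum_exp = math.log(sum(math.exp(v) for v in row))
        total += log_sum_exp - row[label]
        count += 1
    return total / count

logits = [[0.0, 2.0, 0.0],   # label 1
          [0.0, 0.0, 3.0],   # label 2
          [5.0, 0.0, 0.0]]   # label 0 (<pad>), ignored
labels = [1, 2, 0]

loss_with_pad = masked_cross_entropy(logits, labels)
loss_without_pad = masked_cross_entropy(logits[:2], labels[:2])
print(loss_with_pad == loss_without_pad)
```

The padded position changes neither the sum nor the number of positions that the sum is divided by.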

In [30]:
train(num_epochs, train_dataloader, val_dataloader, seq2seq, optimizer, criterion, scheduler)

Epoch:  1/25 | Train Loss: 4.77573 | Val Loss: 4.09706
Saving Weights!
Epoch:  2/25 | Train Loss: 3.74642 | Val Loss: 3.46187
Saving Weights!
Epoch:  3/25 | Train Loss: 3.25335 | Val Loss: 3.09955
Saving Weights!
Epoch:  4/25 | Train Loss: 2.93766 | Val Loss: 2.85532
Saving Weights!
Epoch:  5/25 | Train Loss: 2.68451 | Val Loss: 2.68323
Saving Weights!
Epoch:  6/25 | Train Loss: 2.49218 | Val Loss: 2.53332
Saving Weights!
Epoch:  7/25 | Train Loss: 2.33968 | Val Loss: 2.44510
Saving Weights!
Epoch:  8/25 | Train Loss: 2.21576 | Val Loss: 2.35043
Saving Weights!
Epoch:  9/25 | Train Loss: 2.10026 | Val Loss: 2.28960
Saving Weights!
Epoch: 10/25 | Train Loss: 2.02329 | Val Loss: 2.23866
Saving Weights!
Epoch: 11/25 | Train Loss: 1.94164 | Val Loss: 2.18903
Saving Weights!
Epoch: 12/25 | Train Loss: 1.87876 | Val Loss: 2.17249
Saving Weights!
Epoch: 13/25 | Train Loss: 1.82498 | Val Loss: 2.14898
Saving Weights!
Epoch: 14/25 | Train Loss: 1.77670 | Val Loss: 2.11008
Saving Weights!
Epoch:

We load the model with the best weights for prediction.

In [31]:
weights = torch.load('../temp/encoder_decoder.pt')
encoder_weights = weights['encoder_weights']
decoder_weights = weights['decoder_weights']

In [32]:
encoder.load_state_dict(encoder_weights)
decoder.load_state_dict(decoder_weights)

<All keys matched successfully>

Here we use a provided English sentence and generate a translation. For that we generate words from the Decoder until the `<eos>` token is generated.

In [33]:
def translate_sentence(sentence, vocab, encoder, decoder, max_len=50):
    # vocab has to be the vocabulary of the target language (German)
    with torch.inference_mode():
        outputs = []

        start_idx = vocab(["<sos>"])[0]
        end_idx = vocab(["<eos>"])[0]

        h_n, c_n = encoder(sentence)
        inp = torch.tensor([start_idx], device=DEVICE)
        # stop at <eos> or after max_len tokens, so that decoding always terminates
        for _ in range(max_len):
            logits, h_n, c_n = decoder(inp, h_n, c_n)
            inp = logits.argmax(dim=1)
            outputs.append(inp.item())
            if inp.item() == end_idx:
                break
        return outputs
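
Greedy decoding with a length cap can also be sketched without a neural network. Below, a hand-written table of scores stands in for the decoder. The table is purely hypothetical, as real scores come from the model, and `max_len` guards against sequences that never reach `<eos>`.

```python
def greedy_decode(next_scores, start, end, max_len=20):
    """Repeatedly pick the highest-scoring next token until `end` or `max_len` tokens."""
    outputs, token = [], start
    while len(outputs) < max_len:
        scores = next_scores[token]
        token = max(scores, key=scores.get)  # greedy: take the argmax
        outputs.append(token)
        if token == end:
            break
    return outputs

# toy "model": scores for the next token given the current one
next_scores = {
    '<sos>': {'wie': 0.7, 'was': 0.3},
    'wie':   {'geht': 0.6, '<eos>': 0.4},
    'geht':  {'es': 0.9, '<eos>': 0.1},
    'es':    {'<eos>': 0.8, 'geht': 0.2},
}
translation = greedy_decode(next_scores, '<sos>', '<eos>')
print(translation)
```

Greedy search commits to the best token at every step and never revisits a choice, which is one reason why the translations are not optimal.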

Below we show example translations of 10 sentences. The quality of the translation is not state of the art, but given the small model, the small dataset and greedy search, this is an acceptable result.

In [34]:
en_sequence, de_sequence = next(iter(test_dataloader))
en_sequence = en_sequence.to(DEVICE)

In [35]:
for i in range(10):
    en_sentence = en_sequence[i].unsqueeze(0)
    de_sentence = de_sequence[i].unsqueeze(0)
    # the start and end tokens belong to the German side, so we pass the German vocabulary
    translation = translate_sentence(en_sentence, de_vocab, encoder, decoder)
    print('-'*130)
    print(f'English Sentence: {en_vocab.lookup_tokens(en_sentence[0].cpu().tolist())}')
    print(f'German Translation: {de_vocab.lookup_tokens(de_sentence[0].cpu().tolist())}')
    print(f'Model Translation: {de_vocab.lookup_tokens(translation)}')
    

----------------------------------------------------------------------------------------------------------------------------------
English Sentence: ['<sos>', 'tom', 'did', 'only', 'what', 'he', 'had', 'to', 'do', '.', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
German Translation: ['<sos>', 'tom', 'tat', 'nur', 'seine', 'pflicht', '.', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
Model Translation: ['tom', 'hat', 'was', 'was', 'er', 'tun', 'muss', '.', '<eos>']
----------------------------------------------------------------------------------------------------------------------------------
English Sentence: ['<sos>', "i'm", 'sure', "you'd", 'feel', 'better', 'if', 'you', 'ate', 'something', '.', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
German Translation: ['<sos>', 'es', 'ging