### Imports

In [1]:
import numpy as np
import torch
from torch import nn
import re
import os

try:
    from tensorboardX import SummaryWriter
except ModuleNotFoundError:
    # TensorBoard logging is optional; training still runs without it.
    print("TensorboardX not available")

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f'device = {device}')

device = cpu


## Translating text

How do you translate text automatically? Sentences differ in length, grammar differs between languages, and the meaning of a word depends on its context. In this notebook we cover one way to handle this: sequence-to-sequence learning.

![](img/hello-lead.png)

## Sequence to sequence learning
We will use PyTorch to translate short sentences between languages, here from English to French.

Some concepts that will be covered:
- Embeddings
- Recurrent neural networks
- Encoder / decoders
- Attention

In [2]:
# download the needed data
if not os.path.isfile('data.zip'):
    ! curl -o data.zip https://download.pytorch.org/tutorial/data.zip && unzip data.zip 

In [3]:
# Take a quick look at the data.
with open('data/eng-fra.txt') as f:
    f.seek(1000)
    print(f.read(200))

 de question !
Really?	Vraiment ?
Really?	Vrai ?
Really?	Ah bon ?
Thanks.	Merci !
We try.	On essaye.
We won.	Nous avons gagné.
We won.	Nous gagnâmes.
We won.	Nous l'avons emporté.
We won.	Nous l'empor


# Preparing the data 0
Throughout this notebook we repeatedly convert words to indexes and back, so we need a mapping in each direction. Something like:

**indexes to word**
```python
{0: 'SOS',
 1: 'EOS',
 2: 'The'
 ...
 n: 'World'
}
```

**words to indexes**
```python
{'SOS': 0,
 'EOS': 1,
 'The': 2
 ...
 'World': n
}
```

A nice way to do this is to create an object that stores these mappings. This is already done for you. To check, go to: `utils.Language`.
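
As an illustration, here is a minimal sketch of what such an object could look like. The class name `MiniLanguage` is hypothetical. Its method and attribute names mirror the ones used later in this notebook (`add_sentence`, `translate_words`, `n_words`, `index2word`), but the real `utils.Language` may well be implemented differently.

```python
class MiniLanguage:
    """Hypothetical stand-in for utils.Language (the real class may differ)."""

    def __init__(self, name):
        self.name = name
        # Reserve index 0 and 1 for the start- and end-of-sentence tokens.
        self.word2index = {'SOS': 0, 'EOS': 1}
        self.index2word = {0: 'SOS', 1: 'EOS'}

    @property
    def n_words(self):
        return len(self.index2word)

    def add_sentence(self, sentence):
        # Give every unseen word the next free index.
        for word in sentence.split():
            if word not in self.word2index:
                index = len(self.index2word)
                self.word2index[word] = index
                self.index2word[index] = word

    def translate_words(self, sentence):
        # Words -> indexes; raises KeyError for unknown words.
        return [self.word2index[word] for word in sentence.split()]


lang = MiniLanguage('eng')
lang.add_sentence('i m EOS')
indexes = lang.translate_words('i m EOS')
print(indexes)
```

Keeping both dictionaries in one object means the two directions cannot drift out of sync.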

# Preparing the data 1

What should we do?
- Read the data from file
- Convert to lowercase
- Remove non-letter characters
- Mark the end of the sentence
- Mark the start of the sentence
- Strip accents from rare letters (á, ò, ê)
- ...
- Translate words into numbers?

This is already done for you. To check, go to: `preprocessing.normalize_string`, `preprocessing.unicode2ascii` and `preprocessing.read_lang_pairs`.
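
A few of the steps above can be sketched as follows. This is only an approximation under stated assumptions: the function names `unicode_to_ascii` and `normalize` are hypothetical, and the real functions in `preprocessing` may treat punctuation and markers differently.

```python
import re
import unicodedata


def unicode_to_ascii(text):
    # Decompose accented characters (NFD) and drop the combining marks,
    # so that 'é' becomes 'e'.
    return ''.join(
        char for char in unicodedata.normalize('NFD', text)
        if unicodedata.category(char) != 'Mn'
    )


def normalize(text):
    # Assumption: lowercase, strip accents, replace every run of
    # non-letter characters with a space, then append the EOS marker.
    text = unicode_to_ascii(text.lower().strip())
    text = re.sub(r'[^a-z]+', ' ', text).strip()
    return text + ' EOS'


print(normalize("I'm busy!"))
print(normalize('Nous avons gagné.'))
```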

# Preparing the data 2
The dataset contains many example sentences, and we want a model that trains quickly, so we trim it down to relatively short and simple sentences. Here the maximum length is 10 words (that includes ending punctuation), and we keep only sentences that start with forms such as "I am" or "He is" (apostrophes were already replaced during normalization, so "I'm" appears as "i m").

In short:
- Only sentences shorter than 10 words
- Only sentences that start with 'I am', 'He is', etc.

This function is already created. To check it out, go to: `preprocessing.filter_pairs_eng2other`.
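
The filter could look roughly like the sketch below. The name `keep_pair` and the exact prefix list are assumptions made for illustration; check `preprocessing.filter_pairs_eng2other` for the real rules.

```python
MAX_LENGTH = 10

# Assumed prefixes, written the way they appear after normalization.
PREFIXES = (
    'i am ', 'i m ', 'he is ', 'he s ', 'she is ', 'she s ',
    'you are ', 'you re ', 'we are ', 'we re ', 'they are ', 'they re ',
)


def keep_pair(pair, max_length=MAX_LENGTH, prefixes=PREFIXES):
    english, other = pair
    short_enough = (
        len(english.split()) < max_length
        and len(other.split()) < max_length
    )
    # str.startswith accepts a tuple and matches any of its entries.
    return short_enough and english.startswith(prefixes)


pairs = [
    ['i m busy EOS', 'je suis occupe EOS'],
    ['go away EOS', 'va t en EOS'],
]
kept = [pair for pair in pairs if keep_pair(pair)]
print(kept)
```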

# Preparing the data 3

It is also convenient to have an object that holds the data. This object can help with several tasks, such as querying the data or shuffling the sentences, which we need later on in the training process.

We also need to:
- Create a `Data` class

This is already done for you. To check, go to: `utils.Data`.
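
For intuition, a container like this might be sketched as below. `MiniData` is a hypothetical name, and it takes plain word-to-index dictionaries instead of `Language` objects to stay self-contained. The attribute names `pairs` and `idx_pairs` and the `shuffle` method follow their use later in this notebook, but the real `utils.Data` may differ.

```python
import numpy as np


class MiniData:
    """Hypothetical stand-in for utils.Data (the real class may differ)."""

    def __init__(self, pairs, word2index_in, word2index_out):
        self.pairs = np.array(pairs)
        # Pre-compute the index version of every sentence pair.
        self.idx_pairs = [
            ([word2index_in[w] for w in src.split()],
             [word2index_out[w] for w in tgt.split()])
            for src, tgt in pairs
        ]

    def shuffle(self):
        # Apply the same permutation to both views so they stay aligned.
        order = np.random.permutation(len(self.pairs))
        self.pairs = self.pairs[order]
        self.idx_pairs = [self.idx_pairs[i] for i in order]


w_in = {'SOS': 0, 'EOS': 1, 'i': 2, 'm': 3, 'busy': 4, 'late': 5}
w_out = {'SOS': 0, 'EOS': 1, 'je': 2, 'suis': 3, 'occupe': 4, 'en': 5, 'retard': 6}
toy_pairs = [
    ['i m busy EOS', 'je suis occupe EOS'],
    ['i m late EOS', 'je suis en retard EOS'],
]

np.random.seed(42)
toy_data = MiniData(toy_pairs, w_in, w_out)
toy_data.shuffle()
print(toy_data.pairs[0], toy_data.idx_pairs[0])
```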

# Preparing the data 4

Now we have to tie it all together. We need to:
- Initialize the `Language` objects
- Preprocess the sentence pairs
- Filter out simple cases for this training

We can of course put this in our `preprocessing` module as well, but for illustration purposes, we've put it below:

In [4]:
from utils import Language, Data
from preprocessing import read_lang_pairs, filter_pairs_eng2other


def prepare_dataset(from_lang, to_lang):
    """ Initializes the Language objects (still empty), creates the sentence pairs
    and returns a Data object containing all languages and sentence pairs.
    """
    pairs = read_lang_pairs(from_lang, to_lang)
    print(f"Read {len(pairs)} sentence pairs")
    
    # Reduce data. We haven't got all day to train a model.
    if from_lang != 'eng':
        raise ValueError(f'No filter implemented for translation from {from_lang} to {to_lang}')
    
    pairs = filter_pairs_eng2other(pairs) 
    print(f"Trimmed to {len(pairs)} sentence pairs")
    
    input_lang = Language(from_lang)
    output_lang = Language(to_lang)
    # Add pairs to the languages
    for pair in pairs:
        input_lang.add_sentence(pair[0])
        output_lang.add_sentence(pair[1])
        
    print("Counted words:")
    print(input_lang.name, input_lang.n_words)
    print(output_lang.name, output_lang.n_words)
    
    return input_lang, output_lang, Data(pairs, input_lang, output_lang)

In [5]:
np.random.seed(42)

eng, fra, data = prepare_dataset('eng', 'fra')
print(f"First data pair: {data.pairs[0]}")

Reading lines...
Read 135842 sentence pairs
Trimmed to 10853 sentence pairs
Counted words:
eng 2922
fra 4486
First data pair: ['i m EOS' 'j ai ans EOS']


# Sequence to sequence model overview
So this is what we're going to build:

![](img/seq2seq.png)

Looking at the statistics printed above (of our simplified dataset), do you see any interesting output?
- More French words than English
- Quite a lot of words

## The Encoder

The encoder of a seq2seq network is an RNN that reads the input sentence one word at a time. For every input word the encoder produces an output vector and a hidden state, and it uses the hidden state for the next input word. Every output can be seen as the context of the sentence up to that point.
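
To make the difference between the per-word outputs and the hidden state concrete, here is a small sketch with a single-layer `nn.GRU` and random inputs. The sizes are arbitrary values chosen for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

seq_len, batch_size, input_size, hidden_size = 3, 1, 10, 4
rnn = nn.GRU(input_size, hidden_size)

inputs = torch.randn(seq_len, batch_size, input_size)
h0 = torch.zeros(1, batch_size, hidden_size)

outputs, h_n = rnn(inputs, h0)
print(outputs.shape)  # one output vector per input word
print(h_n.shape)      # one final hidden state

# For a single-layer, unidirectional GRU the last output
# is the same tensor as the final hidden state.
same = torch.allclose(outputs[-1], h_n[0])
print(same)
```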

<img src="img/training_seq2seq_many2may.svg" alt="drawing" style="width:300px;"/>

As mentioned above, our dictionaries contain thousands of words. Therefore, it is a good idea to represent each word as a small dense embedding vector instead of a large sparse one, since the encoder only needs to pass on context anyway.
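
A quick sketch of what an embedding lookup does. The vocabulary size of 2922 is the English word count printed earlier, and the embedding size of 10 is an arbitrary choice for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

# One trainable row of 10 numbers for each of the 2922 words.
embedding = nn.Embedding(num_embeddings=2922, embedding_dim=10)

sentence = torch.tensor([2, 3, 1])  # word indexes, e.g. 'i m EOS'
vectors = embedding(sentence)

print(embedding.weight.shape)
print(vectors.shape)
```

Each word index simply selects one row of the weight matrix, so a sentence of three words becomes three vectors of length 10.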

![](img/encoder-network.png)

In [6]:
class Encoder(nn.Module):
    def __init__(self, n_words, embedding_size, hidden_size, device=device):
        super(Encoder, self).__init__()
        self.hidden_size = hidden_size
        # The word embeddings will also be trained simultaneously with the NN weights
        # To freeze them --> m.embedding.weight.requires_grad = False
        self.embedding = nn.Embedding(n_words, embedding_size)  
        self.rnn = nn.GRU(embedding_size, hidden_size)
        
        self.device = device
        if device == 'cuda':
            self.cuda()
                    
    def forward(self, x):
        # shape (seq_length, batch_size, input_size)
        dense_vector = self.embedding(x).view(x.shape[0], 1, -1)
        
        # init hidden layer at beginning of sequence --> SOS
        h = torch.zeros(1, 1, self.hidden_size, device=self.device)
        
        x, h = self.rnn(dense_vector, h)

        return x, h

In [11]:
params = {
    'n_words': eng.n_words,
    'embedding_size': 10,
    'hidden_size': 2,
    'device': device
}        

m = Encoder(**params)

eng_sentence = data.pairs[0][0]
sentence = torch.tensor(eng.translate_words(eng_sentence), device=device)
enc_out, enc_hidden = m(sentence)

print(f"Test sentence: '{eng_sentence}'")
print(f"Test tensor  : {sentence}")
print(f"output shape : {enc_out.shape}")

Test sentence: 'i m EOS'
Test tensor  : tensor([2, 3, 1])
output shape : torch.Size([3, 1, 2])


# Simple Decoder

In the simplest seq2seq decoder we use only the last output of the encoder. This last output is sometimes called the context vector, as it encodes context from the entire sequence. This context vector is used as the initial hidden state of the decoder.

At every step of decoding, the decoder is given an input token and hidden state. The initial input token is the start-of-string <SOS> token, and the first hidden state is the context vector (the encoder’s last hidden state).
    
![](img/decoder-network-adapted.png)
    

The power of this model lies in the fact that it can map sequences of different lengths to each other. As you can see, the inputs and outputs are not aligned word for word and their lengths can differ. This opens up a whole new range of problems that can now be solved with such an architecture.
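
The decoding procedure described above can be sketched as a loop that keeps feeding the decoder its own previous prediction until it emits `EOS` or reaches a maximum length. The layers below are untrained and their sizes are arbitrary, so the predicted tokens are meaningless. The helper name `greedy_decode` is hypothetical.

```python
import torch
from torch import nn

torch.manual_seed(0)

SOS, EOS, MAX_LENGTH = 0, 1, 10
vocab_size, embedding_size, hidden_size = 12, 8, 16

embedding = nn.Embedding(vocab_size, embedding_size)
rnn = nn.GRU(embedding_size, hidden_size)
out = nn.Linear(hidden_size, vocab_size)


def greedy_decode(context, max_length=MAX_LENGTH):
    word = torch.tensor([SOS])  # the first input is always <SOS>
    h = context                 # the first hidden state is the context vector
    decoded = []
    with torch.no_grad():
        for _ in range(max_length):
            x, h = rnn(embedding(word).view(1, 1, -1), h)
            # Pick the most likely next word and feed it back in.
            word = out(x).argmax(dim=2).view(1)
            decoded.append(word.item())
            if decoded[-1] == EOS:
                break
    return decoded


context = torch.zeros(1, 1, hidden_size)  # stands in for the encoder's last hidden state
tokens = greedy_decode(context)
print(tokens)
```

Note that the loop length depends only on when `EOS` appears, not on the length of the input sentence.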
    
<img src="img/unfolded-encoder-decoder.png" alt="drawing" style="width:500px;float: left;"/>

In [13]:
class Decoder(nn.Module):
    def __init__(self, embedding_size, hidden_size, output_size, device=device):
        super(Decoder, self).__init__()
        self.decoder = 'simple'
        self.hidden_size = hidden_size
        # Lookup table for the last word activation.
        self.embedding = nn.Embedding(output_size, embedding_size)
        self.rnn = nn.GRU(embedding_size, hidden_size)
        self.out = nn.Sequential(
            nn.Linear(hidden_size, output_size),
            nn.LogSoftmax(dim=2)
        )
        self.device = device
        if device == 'cuda':
            self.cuda()
            
    def forward(self, word, h):
        """ Forward pass of the NN
        
        Parameters
        ----------
        word : torch.tensor
            Last word or start of sentence token.
        h : torch.tensor
            Hidden state or context tensor.
        """
        # Map from shape (seq_len, embedding_size) to (seq_len, batch, embedding_size)
        # Note: seq_len is the number of words in the sentence
        word_embedding = self.embedding(word).view(1, 1, -1)
        x, h = self.rnn(word_embedding, h)

        return self.out(x), h

params = {
    'embedding_size': 10,
    'hidden_size': 20,
    'output_size': eng.n_words,
    'device': device
}  
m = Decoder(**params)
m.train(False)
out, hidden = m(torch.tensor([1], device=device), torch.zeros(1, 1, 20, device=device))
out.size(), hidden.size()

(torch.Size([1, 1, 2922]), torch.Size([1, 1, 20]))

## What is wrong with the simple decoder?

![](img/seq2seq.png)
![](img/vanishing_context.png)

## Solution: Attention
<img src="img/seq2seq-attn.png" alt="drawing" style="height:400px;"/>

![](img/attention-decoder-network-adapted.png)
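
The core of the attention step used in the decoder below is a weighted sum over the (zero-padded) encoder outputs. The sketch uses random scores in place of learned ones, and the sizes are arbitrary. It only illustrates the `softmax` plus `bmm` mechanics.

```python
import torch

torch.manual_seed(0)

max_length, hidden_size = 10, 4

# Encoder outputs for a 3-word sentence, padded with zeros up to max_length.
encoder_outputs = torch.zeros(max_length, hidden_size)
encoder_outputs[:3] = torch.randn(3, hidden_size)

# One score per encoder position, turned into weights that sum to 1.
scores = torch.randn(1, 1, max_length)
weights = torch.softmax(scores, dim=2)

# (1, 1, max_length) x (1, max_length, hidden_size) -> (1, 1, hidden_size)
context = torch.bmm(weights, encoder_outputs.unsqueeze(0))

# The same result written out as an explicit weighted sum.
manual = (weights.view(-1, 1) * encoder_outputs).sum(dim=0)
print(context.shape)
```

The padded rows are all zeros, so they add nothing to the sum regardless of their weight.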


In [14]:
class AttentionDecoder(nn.Module):
    def __init__(self, embedding_size, hidden_size, output_size, dropout=0.1, max_length=10, device=device):
        super(AttentionDecoder, self).__init__()
        self.decoder = 'attention'
        self.max_length = max_length
        self.device = device
        self.embedding = nn.Sequential(
            nn.Embedding(output_size, embedding_size),
        )
        
        # Separate neural network to learn the attention weights
        self.attention_weights = nn.Sequential(
            nn.Linear(embedding_size + hidden_size, max_length),
            nn.Softmax(2)
        )
        self.attention_combine = nn.Sequential(
            nn.Linear(hidden_size + embedding_size, hidden_size),
            nn.ReLU()
        )
        self.rnn = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Sequential(
            nn.Linear(hidden_size, output_size),
            nn.LogSoftmax(2)
        )
        
        if device == 'cuda':
            self.cuda()
        
    def forward(self, word, h, encoder_outputs):
        """
        :param word: (LongTensor) The word index. This is the last activated word or the start-of-sentence token.
        :param h: (tensor) The hidden state from the previous step. In the first step, the hidden state of the encoder.
        :param encoder_outputs: (tensor) Zero-padded encoder outputs of shape (max_length, hidden_size).
        """
        # map from shape (seq_len, embedding_size) to (seq_len, batch, embedding_size) 
        # Note: seq length is the number of words in the sentence
        word_embedding = self.embedding(word).view(1, 1, -1)
        # Concatenate the word embedding and the last hidden state, so that attention weights can be determined.
        x = torch.cat([word_embedding, h], dim=2)
        
        # attention applied
        attention_weights = self.attention_weights(x)
        
        x = torch.bmm(attention_weights, encoder_outputs.unsqueeze(0))  # could also be done with matmul
   
        # attention combined
        x = torch.cat((word_embedding, x), 2)
        x = self.attention_combine(x)
        
        x, h = self.rnn(x, h)
        x = self.out(x)

        return x, h

In [15]:
params = {
    'n_words': eng.n_words,
    'embedding_size': 256,
    'hidden_size': 256,
    'device': device
}

enc      = Encoder(**params)
sentence = torch.tensor([1, 23, 9], device=device)
out, h   = enc(sentence)
print("out.shape:", out.shape)

out.shape: torch.Size([3, 1, 256])


In [16]:
# in case sentence is shorter than max_length, pad with zeros
max_length = 10
encoder_outputs = torch.zeros(max_length, out.shape[-1], device=device)
encoder_outputs[:out.shape[0], :out.shape[-1]] = out.view(out.shape[0], -1)
print(f'encoder_outputs.shape: {encoder_outputs.shape}')

encoder_outputs.shape: torch.Size([10, 256])


In [17]:
params = {
    'embedding_size': 256,
    'hidden_size': 256,
    'output_size': 2,
    'device': device
}

a_dec = AttentionDecoder(**params)
a_dec(torch.tensor([1], device=device), h, encoder_outputs)[0].shape

torch.Size([1, 1, 2])

## Utility function to run the decoder & calculate the loss

In [18]:
def run_decoder(decoder, criterion, sentence, h, teacher_forcing=False, encoder_outputs=None):
    loss = 0
    word = torch.tensor([0], device=device) # <SOS>
    for j in range(sentence.shape[0]):
        if decoder.decoder == 'attention':
            x, h = decoder(word, h, encoder_outputs)
        else:
            x, h = decoder(word, h)

        loss += criterion(x.view(1, -1), sentence[j].view(-1))
        if teacher_forcing:
            word = sentence[j]
        else:
            word = x.argmax().detach()
        if word.item() == 1: # <EOS>
            break
    return loss
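
Both decoders in this notebook end in a `LogSoftmax` layer, which is why the training code pairs them with `nn.NLLLoss`. That combination gives the same value as `nn.CrossEntropyLoss` applied to raw scores, as this small check with made-up numbers shows:

```python
import torch
from torch import nn

torch.manual_seed(0)

logits = torch.randn(1, 5)   # raw scores for a 5-word vocabulary
target = torch.tensor([3])   # index of the correct word

nll = nn.NLLLoss()(torch.log_softmax(logits, dim=1), target)
ce = nn.CrossEntropyLoss()(logits, target)

print(nll.item(), ce.item())
```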

## Training the model

In [19]:
embedding_size        = 100
context_vector_size   = 256

enc_params = {
    'n_words': eng.n_words,
    'embedding_size': embedding_size,
    'hidden_size': context_vector_size,
    'device': device
}
encoder = Encoder(**enc_params)

dec_params = {
    'embedding_size': embedding_size,
    'hidden_size': context_vector_size,
    'output_size': fra.n_words,
    'device': device
}
decoder = AttentionDecoder(**dec_params)

if 'SummaryWriter' in globals():
    writer = SummaryWriter('tb/train-3')

In [None]:
epochs                = 10
teacher_forcing_ratio = 0.5

def train(encoder, decoder):
    # Criterion
    criterion = nn.NLLLoss()
    
    # Optimizers
    optim_encoder = torch.optim.SGD(encoder.parameters(), lr=0.01)
    optim_decoder = torch.optim.SGD(decoder.parameters(), lr=0.01)  
    
    # Models
    encoder.train(True)
    decoder.train(True)

    # Train loop
    for epoch in range(epochs):
        data.shuffle()
        for i in range(data.pairs.shape[0]):
            optim_decoder.zero_grad()
            optim_encoder.zero_grad()
            
            pair = data.idx_pairs[i]
            eng_sentence = torch.tensor(pair[0], device=device)
            fra_sentence = torch.tensor(pair[1], device=device)

            # Encode the input language
            out, h = encoder(eng_sentence)        
            
            # pad encoder outputs with zeros
            encoder_outputs = torch.zeros(max_length, out.shape[-1], device=device)
            if decoder.decoder == 'attention':
                encoder_outputs[:out.shape[0], :out.shape[-1]] = out.view(out.shape[0], -1) # remove batch dim
            
            # implement teacher_forcing
            teacher_forcing = np.random.rand() < teacher_forcing_ratio
            loss = run_decoder(decoder, criterion, fra_sentence, h, teacher_forcing, encoder_outputs)
            loss.backward()
            
            if 'SummaryWriter' in globals():
                writer.add_scalar('loss', loss.cpu().item() / (len(fra_sentence)))

            optim_decoder.step()
            optim_encoder.step()

        print(f'epoch {epoch}')

train(encoder, decoder)

## Or load a pretrained model

In [21]:
encoder = Encoder(eng.n_words, embedding_size, context_vector_size)
encoder.load_state_dict(torch.load('models/encoder_10_epochs.pt', map_location=device))

decoder = AttentionDecoder(embedding_size, context_vector_size, fra.n_words)
decoder.load_state_dict(torch.load('models/decoder_10_epochs.pt', map_location=device))

<All keys matched successfully>

## Start translating some sentences from English to French

In [22]:
def translate(start, end):
    for i in range(start, end):
        pair = data.idx_pairs[i]
        eng_sentence = torch.tensor(pair[0], device=device)
        fra_sentence = torch.tensor(pair[1], device=device)

        print('English sentence:\t', ' '.join([eng.index2word[i.item()] for i in eng_sentence[:-1]]))
        print('French sentence:\t', ' '.join([fra.index2word[i.item()] for i in fra_sentence[:-1]]))

        # Encode the input language
        out, h = encoder(eng_sentence)        
        encoder_outputs = torch.zeros(max_length, out.shape[-1], device=device)
        encoder_outputs[:out.shape[0], :out.shape[-1]] = out.view(out.shape[0], -1)
        
        word = torch.tensor([0], device=device) # <SOS>
  
        translation = []
        for j in range(eng_sentence.shape[0]):
            x, h = decoder(word, h, encoder_outputs=encoder_outputs)
  
            word = x.argmax().detach()
            translation.append(word.cpu().data.tolist())

            if word.item() == 1: # <EOS>
                break
        print('\nModel translation:\t', ' '.join([fra.index2word[i] for i in translation][:-1]), '\n' + '-'*50)
        
translate(0, 60)

English sentence:	 you re jealous
French sentence:	 tu es jalouse

Model translation:	 vous etes jalouse 
--------------------------------------------------
English sentence:	 i m not your husband anymore
French sentence:	 je ne suis plus votre epoux

Model translation:	 je ne suis plus votre mari 
--------------------------------------------------
English sentence:	 i m somewhat dizzy
French sentence:	 j ai un peu la tete qui tourne

Model translation:	 j ai un peu 
--------------------------------------------------
English sentence:	 i m too busy
French sentence:	 je suis trop affaire

Model translation:	 je suis trop occupe 
--------------------------------------------------
English sentence:	 i am interested in swimming
French sentence:	 la natation m interesse

Model translation:	 je natation interesse a la 
--------------------------------------------------
English sentence:	 i m going to work
French sentence:	 je vais travailler

Model translation:	 je vais travailler 
---------