<a href="https://colab.research.google.com/github/aayu24/co-learning-lounge/blob/master/Technology/Artificial%20Intelligence/Natural%20Language%20Processing/Concepts/Machine%20Translation/Phase_1%20Machine_Translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Phase 1: Basic Machine Translation with RNN

In this phase we'll be building a machine learning model for translation task, using PyTorch and TorchText. Translation will be done from German to English using a specific type of architecture called Sequence to Sequence Models. The same architecture is being used for various tasks like summarization, utterance generation, etc.

In this first phase, we'll start simple to understand the general concepts by implementing the model using RNN

## Introduction

The most common sequence-to-sequence (seq2seq) models are encoder-decoder models, which commonly use a recurrent neural network (RNN) to encode the source (input) sentence into a single vector. In this notebook, we'll refer to this single vector as a context vector. We can think of the context vector as being an abstract representation of the entire input sentence. This vector is then decoded by a second RNN which learns to output the target (output) sentence by generating it one word at a time.


![](https://drive.google.com/uc?id=1AZ8D6qSu_rMWOHa4STtLaPM4cKvTdZvL)

The source sentence "guten morgen" is passed through an embedding layer and then input into the EncoderRNN (green). At each time step $t$, the input to the EncoderRNN is both the embedding of the current word $x_t$, $e(x_t)$ and the hidden state from the previous time step, $h_{t-1}$. Then the EncoderRNN outputs a new hidden state $h_t$. This can be represented as following:

$$h_t = EncoderRNN(e(x_t), h_{t-1})$$


We also append the `<sos>` and `<eos>` tokens to indicate the start and end of the sentence.

Here, we will denote the source sentence as $X = \{x_1, x_2, ...., x_T\}$ where T is the number of words. We compute the hidden state for each time step as mentioned above, and we use the final hidden state $h_T$, as the context vector, i.e $h_T = z$. Intuitively the context vector represents the source sentence.

Now we have the context vector, $z$, we can use it to start the decoding. The DecoderRNN (blue) is also similar to similar to EncoderRNN, at each time step, it takes the input as the embedding of the current word $y_t$, $e^{'}(y_t)$, as well as the hidden state  from the previous time step, $s_{t-1}$. Then the DecoderRNN outputs a new hidden state $s_t$. This can be represented as:

$$s_t = DecoderRNN(e^{'}(y_t), s_{t-1})$$

**Note**: The hidden state for the initial step of decoding is context vector, i.e $s_0 = z = h_T$, i.e the decoder initial hidden state is the final encoder hidden state.

**Note**: The embedding layers are different for the Encoder and Decoder since the source and target languages are different. Though in diagram it is represented in the same color (yellow), we differentiate it in the equations as $e$ for source embedding and $e^{'}$ for target embedding layer.

In the decoder, we need to go from the hidden state $s_t$ to an acutal word $\hat{y}_t$. This will be done by passing the state $s_t$ through a linear layer which predicts the most probable word.

$$\hat{y}_t = softmax(f(s_t))$$

The decoding happens step-by-step, i.e words are generated one after another. We always use `<sos>` as the first input to the decoder for the first time step, i.e $y_1 = $ \<sos\>. But for subsequent inputs, $y_{t>1}$ we use the actual, ground truth as the input.

Since during the inference the ground truth is not available we will use the predicted word $\hat{y}_t$ as the input word to the next time step $t+1$. During inference we will keep generating the words until the model outputs an \<eos\> token or for a certain number of steps.

Once we have our predicted target sequence, $\hat{Y} = \{\hat{y}_1, \hat{y}_2, ..., \hat{y}_T\}$, we compare it against out actual target sentence, $Y = \{y_1, y_2, ..., y_T\}$, to calculate our loss. We then use this loss to update the parameters of the model.

 

## Imports

In [1]:
import time
import math
import random
import spacy

import numpy as np

import torch
import torch.nn as nn
import torch.optim as optim

from torchtext.datasets import Multi30k
from torchtext.data import Field, BucketIterator

In [2]:
!python -m spacy download de

Collecting de_core_news_sm==2.2.5
[?25l  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-2.2.5/de_core_news_sm-2.2.5.tar.gz (14.9MB)
[K     |████████████████████████████████| 14.9MB 534kB/s 
Building wheels for collected packages: de-core-news-sm
  Building wheel for de-core-news-sm (setup.py) ... [?25l[?25hdone
  Created wheel for de-core-news-sm: filename=de_core_news_sm-2.2.5-cp36-none-any.whl size=14907056 sha256=580d864602cec6e228f0368ab4a3624cdee4487cd89fa63668361ccdfc8b00f3
  Stored in directory: /tmp/pip-ephem-wheel-cache-_9z1iwhr/wheels/ba/3f/ed/d4aa8e45e7191b7f32db4bfad565e7da1edbf05c916ca7a1ca
Successfully built de-core-news-sm
Installing collected packages: de-core-news-sm
Successfully installed de-core-news-sm-2.2.5
[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('de_core_news_sm')
[38;5;2m✔ Linking successful[0m
/usr/local/lib/python3.6/dist-packages/de_core_news_sm -->
/usr/local/

In [3]:
spacy_de = spacy.load('de')
spacy_en = spacy.load('en')

### Tokenization

In [4]:
def tokenize_de(text):
    """
    Tokenizes the German text from a string into a list of tokens
    """
    # TODO
    # Implement the tokenization of german sentence using spacy_de
    tokens = [token.text for token in spacy_de.tokenizer(text)] #Reverse the tokens as reverse of input sentence is fed so that decoder?
    return tokens

In [5]:
def tokenize_en(text):
    """
    Tokenizes the English text from a string into a list of tokens
    """
    # TODO
    # Implement the tokenization of english sentence using spacy_en
    tokens = [token.text for token in spacy_en.tokenizer(text)]
    return tokens

### Fields

Learn about Torchtext Fields here:

- [Youtube Tutorials](https://www.youtube.com/watch?v=KRgq4VnCr7I)
- [Blog](http://anie.me/On-Torchtext/)

In Machine Translation, the source language and target languages are different.

Field defines how the data should be processed.

Since we are tokenizing using spacy, we can pass our tokenizer method to the argument tokenizer

In order to indicate the starting and ending of a sentence, we can init_token as <sos> and eos_token as <eos>

In [6]:
SRC = Field(
        tokenize=tokenize_de,
        init_token='<sos>',
        eos_token='<eos>',
        lower=True)

TRG = Field(
        tokenize=tokenize_en,
        init_token='<sos>',
        eos_token='<eos>',
        lower=True)

### Translation Dataset

We will be using Multi30K dataset. Torchtext provides support for multi open datasets. This is a dataset with ~31,000 parallel English, German and French sentences.

exts specifies which languages to use as the source and target (source goes first) and fields specifies which field to use for the source and target.

In [7]:
train_data, valid_data, test_data = Multi30k.splits(exts=('.de', '.en'), fields=(SRC, TRG))

downloading training.tar.gz


training.tar.gz: 100%|██████████| 1.21M/1.21M [00:02<00:00, 557kB/s]


downloading validation.tar.gz


validation.tar.gz: 100%|██████████| 46.3k/46.3k [00:00<00:00, 172kB/s]


downloading mmt_task1_test2016.tar.gz


mmt_task1_test2016.tar.gz: 100%|██████████| 66.2k/66.2k [00:00<00:00, 162kB/s]


In [8]:
print(f"Number of training examples: {len(train_data.examples)}")
print(f"Number of validation examples: {len(valid_data.examples)}")
print(f"Number of testing examples: {len(test_data.examples)}")

Number of training examples: 29000
Number of validation examples: 1014
Number of testing examples: 1000


In [9]:
print(vars(train_data.examples[0]))

{'src': ['zwei', 'junge', 'weiße', 'männer', 'sind', 'im', 'freien', 'in', 'der', 'nähe', 'vieler', 'büsche', '.'], 'trg': ['two', 'young', ',', 'white', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.']}


### Vocabulary

Using the **min_freq** argument, we only allow tokens that appear at least 2 times to appear in our vocabulary. Tokens that appear only once are converted into an **\<unk\>** (unknown) token.

It is important to note that our vocabulary should only be built from the training set and not the validation/test set.

In [10]:
# TODO
# build vocabulary on SRC, TRG fields using training data and a min_freq of 2
SRC.build_vocab(train_data,min_freq=2)
TRG.build_vocab(train_data,min_freq=2)

In [11]:
print(f"Unique tokens in source (de) vocabulary: {len(SRC.vocab)}")
print(f"Unique tokens in target (en) vocabulary: {len(TRG.vocab)}")

Unique tokens in source (de) vocabulary: 7855
Unique tokens in target (en) vocabulary: 5893


In [12]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

### Iterators

The final step of preparing the data is to create the iterators. These can be iterated on to return a batch of data which will have a src attribute (the PyTorch tensors containing a batch of numericalized source sentences) and a trg attribute (the PyTorch tensors containing a batch of numericalized target sentences).

When we get a batch of examples using an iterator we need to make sure that all of the source sentences are padded to the same length, the same with the target sentences. Luckily, TorchText iterators handle this for us!

We use a BucketIterator instead of the standard Iterator as it creates batches in such a way that it minimizes the amount of padding in both the source and target sentences.

In [13]:
BATCH_SIZE = 64

In [14]:
train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size=BATCH_SIZE,
    device=device
)

In [15]:
# sample checking
temp = next(iter(train_iterator))
temp.src.shape, temp.trg.shape

(torch.Size([25, 64]), torch.Size([26, 64]))

### Encoder

The encoder we are using is a 2 layer RNN. The paper we are implementing uses a 4-layer LSTM, but in the interest of training time we cut this down to 2-layers.

Once the source sentence is passed through the RNN, the final step's hidden state in all the layers will be used for decoding.

This final hidden state is called context vector, which represents the encoding of the source sentence. This will be used as the initial hidden state for decoder.

> Note: In order for this to work, the number of layers in encoder and decoder must be same, if not then some extra strategy needs to be applied. For example, if the encoder is 1 layer and decoder is 2 layers means, then encoder final hidden state needs to replicated and sent as initial state to decoder



In [16]:
class Encoder(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, num_layers, dropout):
        super().__init__()

        self.hidden_dim = hidden_dim
        self.num_layers = num_layers

        # TODO
        # Declare the embedding layer
        self.embedding = nn.Embedding(input_dim,embedding_dim)

        # TODO
        # Declare the encoder rnn - the input dimension is same as embedding since our input is passed through the embedding layer, after which it becomes the input for rnn
        self.rnn = nn.RNN(embedding_dim,hidden_dim,num_layers,dropout=dropout) 

        self.dropout = nn.Dropout(dropout)
    
    def forward(self, src):
        # TODO
        # comment the src shape - [src_len,batch_size]
        
        embedded_input = self.embedding(src)
        embedded_input = self.dropout(embedded_input)
        # TODO
        # comment the embedded_input shape - [src_len,batch_size,embedding_dim]

        outputs, (hidden) = self.rnn(embedded_input)
        # TODO
        # comment the shapes of outputs and hidden - [],[]

        return hidden

### Decoder

The decoder we are using is a 2 layer RNN like encoder. The Decoder class does a single step of decoding, i.e. it ouputs single token per time-step.

The initial hidden and cell states to the decoder are context vectors $z^1, z^2$, which are the final hidden and cell states of the encoder from the same layer.

As we are only decoding one token at a time, the input tokens will always have a sequence length of 1. We unsqueeze the input tokens to add a sentence length dimension of 1. Then it is passed through the RNN. There is an additional Linear layer, used to make the predictions from the top layer hidden state.

In [17]:
class Decoder(nn.Module):
    def __init__(self, output_dim, embedding_dim, hidden_dim, num_layers, dropout):
        super().__init__()

        self.output_dim = output_dim
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers

        # TODO
        # declare the embedding layer - since input tokens will be of length 1
        self.embedding = nn.Embedding(output_dim,embedding_dim)

        # TODO
        # declare the decoder rnn
        self.rnn = nn.RNN(embedding_dim,hidden_dim,num_layers=num_layers,dropout=dropout) 
        
        # TODO
        # declare the output prediction layer
        self.out = nn.Linear(in_features=hidden_dim,out_features=output_dim)

        self.dropout = nn.Dropout(dropout)
    
    def forward(self, dec_input, hidden):
        # TODO
        # comment the shapes of dec_input & hidden inputs
        # dec-input=[batch_size] - since only 1 token(or seq_len=1)

        dec_input = dec_input.unsqueeze(0)
        # TODO
        # comment why this is required
        # To convert shape of dec_input to [1,batch_size]

        dec_input_embedded = self.embedding(dec_input)
        dec_input_embedded = self.dropout(dec_input_embedded)
        # TODO
        # comment the shape of dec_input_embedded
        # dec_input_embedded=[1,batch_size,embedding_dim]

        output, (hidden) = self.rnn(dec_input_embedded, (hidden))
        # TODO
        # comment the shapes of output, hidden

        # TODO
        # pass the output through the prediction layers and get the logits of tokens
        prediction = self.out(output.squeeze(0))

        return prediction, hidden

### Sequence-to-Sequence Model


For the final part of the implemenetation, we'll implement the seq2seq model. This will handle:

- receiving the input/source sentence
- using the encoder to produce the context vectors
- using the decoder to produce the predicted output/target sentence

The forward method takes the source sentence, target sentence.

When decoding, at each time-step we will predict what the next token in the target sequence will be from the previous tokens decoded, $\hat{y}_{t+1}=f(s_t^L)$. We will use the actual ground-truth next token in the sequence as the input to the decoder during the next time-step.

The first input to the decoder is the start of sequence <sos> token. As our trg tensor already has the <sos> token appended we get our $y_1$ by slicing into it. We know how long our target sentences should be (max_len), so we loop that many times. The last token input into the decoder is the one before the <eos> token


In [18]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()

        self.encoder = encoder
        self.decoder = decoder
    
    def forward(self, src, trg):
        # TODO
        # comment the shapes of src and trg

        # TODO
        # get the values of the following from inputs
        batch_size = src.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = len(TRG.vocab)

        # outputs => to store the predictions at each decoding step
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(device)

        hidden = self.encoder(src)

        # TODO
        # initial input to decoder is <sos> which indicates to start the decoding process
        # since the first token is always <sos> take that as initial decoder input
        decoder_input = trg[0,:]

        # Now let's run the decoding till the trg_len steps
        # skipping the first step as we have taken input as <sos>
        for step in range(1, trg_len):
            # decode step
            output, hidden = self.decoder(decoder_input, hidden)

            # save the output
            outputs[step] = output

            # TODO
            # get the input for decoder, for the next step
            decoder_input = output.argmax(1)
        
        return outputs

In [19]:
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 300
DEC_EMB_DIM = 300
HID_DIM = 1024
N_LAYERS = 2
DROPOUT = 0.5

In [20]:
encoder = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, DROPOUT)
decoder = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DROPOUT)

model = Seq2Seq(encoder, decoder).to(device)

In [21]:
def init_weights(model):
    for name, param in model.named_parameters():
        nn.init.uniform_(param.data, -0.08, 0.08)

model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(7855, 300)
    (rnn): RNN(300, 1024, num_layers=2, dropout=0.5)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(5893, 300)
    (rnn): RNN(300, 1024, num_layers=2, dropout=0.5)
    (out): Linear(in_features=1024, out_features=5893, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

In [22]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"The model has {count_parameters(model):,} trainable parameters")

The model has 17,078,773 trainable parameters


### Optimizer

In [23]:
optimizer = optim.Adam(model.parameters())

### Loss Criterion

In [24]:
# TODO
# get the padding index from the target vocabulary
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]
criterion = nn.CrossEntropyLoss(ignore_index=TRG_PAD_IDX)

### Training Method

At each iteration:

- get the source and target sentences from the batch, $X$ and $Y$
- zero the gradients calculated from the last batch
- feed the source and target into the model to get the output, $\hat{Y}$
- as the loss function only works on 2d inputs with 1d targets we need to flatten each of them with .view
- we slice off the first column of the output and target tensors (<sos> not used)
- calculate the gradients with loss.backward()
- clip the gradients to prevent them from exploding
- update the parameters of our model by doing an optimizer step
- update the loss

Finally, we return the loss that is averaged over all batches.

In [30]:
def train(model, iterator, optimizer, criterion, clip):
    model.train()
    epoch_loss = 0

    for i, batch in enumerate(iterator):
        src = batch.src
        trg = batch.trg
        # TODO
        # comment the shapes of the src, trg

        # zero the gradients
        optimizer.zero_grad()

        # forward pass
        output = model(src, trg)        

        # TODO
        # reshape the output, trg to make it compatible for loss cal.
        output_dim=output.shape[-1]
        loss = criterion(output.view(-1,output_dim), trg.view(-1))

        # backward pass
        loss.backward()

        # gradient clipping
        nn.utils.clip_grad_norm_(model.parameters(), clip)

        # update the parameters of the model
        optimizer.step()

        # update the loss
        epoch_loss += loss.item()
    
    return epoch_loss / len(iterator)

### Evaluation Method

In [31]:
def evaluate(model, iterator, criterion):
    model.eval()
    epoch_loss = 0

    with torch.no_grad():
        for i, batch in enumerate(iterator):
            src = batch.src
            trg = batch.trg
            # TODO
            # comment the shapes of the src, trg

            # forward pass
            output = model(src, trg)        

            # TODO
            # reshape the output, trg to make it compatible for loss cal.
            output_dim=output.shape[-1]
            loss = criterion(output.view(-1,output_dim), trg.view(-1))

            # update the loss
            epoch_loss += loss.item()
    
    return epoch_loss / len(iterator)

In [32]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = elapsed_time - (elapsed_mins * 60)
    return elapsed_mins, elapsed_secs

### Training Loop

In [34]:
N_EPOCHS = 10
CLIP = 2

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    start_time = time.time()
    
    train_loss = train(model, train_iterator,optimizer,criterion ,CLIP) #Correct the base code here-order of optimizer/criterion
    valid_loss = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)

    # TODO
    # update the best validation loss and save the model 
    if valid_loss < best_valid_loss:
      best_valid_loss = valid_loss
      torch.save(model.state_dict(),'assignment1-model.pt')

    print(f"Epoch {epoch + 1} | Time: {epoch_mins}m {epoch_secs}s")
    print(f"\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f} |")
    print(f"\tValid Loss: {valid_loss:.3f} | Valid PPL: {math.exp(valid_loss):7.3f} |")

Epoch 1 | Time: 0m 48.666165828704834s
	Train Loss: 5.581 | Train PPL: 265.364 |
	Valid Loss: 5.403 | Valid PPL: 222.093 |
Epoch 2 | Time: 0m 48.23629450798035s
	Train Loss: 5.489 | Train PPL: 241.967 |
	Valid Loss: 5.387 | Valid PPL: 218.541 |
Epoch 3 | Time: 0m 48.04100728034973s
	Train Loss: 5.485 | Train PPL: 241.039 |
	Valid Loss: 5.433 | Valid PPL: 228.872 |
Epoch 4 | Time: 0m 48.32613778114319s
	Train Loss: 5.478 | Train PPL: 239.338 |
	Valid Loss: 5.426 | Valid PPL: 227.308 |
Epoch 5 | Time: 0m 47.906588554382324s
	Train Loss: 5.474 | Train PPL: 238.469 |
	Valid Loss: 5.379 | Valid PPL: 216.732 |
Epoch 6 | Time: 0m 48.27138686180115s
	Train Loss: 5.470 | Train PPL: 237.499 |
	Valid Loss: 5.382 | Valid PPL: 217.452 |
Epoch 7 | Time: 0m 48.45842409133911s
	Train Loss: 5.465 | Train PPL: 236.190 |
	Valid Loss: 5.451 | Valid PPL: 232.973 |
Epoch 8 | Time: 0m 47.88945722579956s
	Train Loss: 5.466 | Train PPL: 236.474 |
	Valid Loss: 5.408 | Valid PPL: 223.282 |
Epoch 9 | Time: 0m 47.

### Testing

In [37]:
# TODO
# load the trained model
model = Seq2Seq(encoder, decoder).to(device)
model.load_state_dict(torch.load('assignment1-model.pt'))
# TODO
# calculate test loss by using test iterator
test_loss = evaluate(model,test_iterator,criterion)

print(f"\tTest Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |")

	Test Loss: 5.391 | Test PPL: 219.366 |
