# Introduction

This is an implementation of a sequence-to-sequence encoder-decoder model with attention, built with GRUs and trained for English-to-French translation. It is based on the paper [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473) and the notebook [Pytorch Seq2Seq - RNN with Attention](https://github.com/SethHWeidman/pytorch-seq2seq/blob/master/3%20-%20Neural%20Machine%20Translation%20by%20Jointly%20Learning%20to%20Align%20and%20Translate.ipynb).

It is implemented as an exercise to gain a deeper understanding of encoder-decoder models and to build on it when exploring later advancements. The notebook is a follow-up to the first notebook, which used a 2-layer LSTM encoder-decoder model for translating English sentences to French.

Dataset used - [English - French Translations](https://www.kaggle.com/datasets/dhruvildave/en-fr-translation-dataset)

### Drawback of Notebook 1 - RNN Encoder - Decoder
In the previous notebook we used a 2-layer LSTM encoder-decoder model.
The main drawback of such an architecture is that all the information in the sentence, from start to end, has to be crammed into a single context vector: the final hidden state after the sentence has passed through the encoder.
This leads to 3 effects - 
1. A single vector is often insufficient to capture the context of the whole sentence, especially for longer sentences.
2. Information from tokens at the beginning of the sentence goes through many non-linear operations, while tokens at the end go through only a few. Hence, tokens are processed unevenly.
3. Further, when this context vector is passed through the decoder, the first decoding steps have much more information about the source sentence than later ones. This is because the hidden state used to decode later tokens has to pack in information about all previously generated tokens along with the initial context vector.

Certain improvements were implemented in the paper - [Learning Phrase Representations - RNN Encoder Decoder](https://arxiv.org/abs/1406.1078), where the context vector from the encoder is fed as input to every decoder step along with the previous hidden state, giving each step more information about the source sentence.

Along with the above, the attention mechanism discussed in the referenced material uses the encoder outputs for the whole sentence (one vector for each token), so that each decoder step has information about every source token.

Another mechanism used in this notebook is a bidirectional RNN, which processes the sentence token by token in both directions. This is equivalent to running a second RNN over the reversed input and concatenating its outputs with those of the first. PyTorch sets this up internally when the parameter bidirectional=True is passed.
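
As a rough illustration of what bidirectional=True changes, the sketch below runs a toy batch through a single-layer `nn.GRU` and inspects the result. The sizes (`seq_len`, `batch`, `emb_dim`, `hid_dim`) are arbitrary values assumed for the example, not the ones used later in the notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

seq_len, batch, emb_dim, hid_dim = 5, 2, 4, 3  # toy sizes, assumed for illustration

gru = nn.GRU(emb_dim, hid_dim, bidirectional=True)
x = torch.randn(seq_len, batch, emb_dim)  # [seq_len, batch, emb_dim]

outputs, hidden = gru(x)

# outputs holds the forward and backward states side by side for every token
print(outputs.shape)  # torch.Size([5, 2, 6])
# hidden holds the final state of each direction: [forward, backward]
print(hidden.shape)   # torch.Size([2, 2, 3])

# the forward direction finishes at the last token, the backward one at the first
fwd_matches = torch.allclose(hidden[0], outputs[-1, :, :hid_dim])
bwd_matches = torch.allclose(hidden[1], outputs[0, :, hid_dim:])
print(fwd_matches, bwd_matches)
```

The last two checks show why the encoder later concatenates `hidden[-2]` and `hidden[-1]`: together they summarise the sentence read in both directions.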

### Importing modules

In [1]:
import numpy as np
import pandas as pd
import spacy
from string import digits
import random
from torchtext.data.utils import get_tokenizer
import torch
import torchtext
from collections import Counter
from torchtext.vocab import vocab
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader
import torch.optim as optim
import torch.nn as nn
torch.cuda.empty_cache()

import math
import time

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/en-fr-translation-dataset/en-fr.csv


In [2]:
SEED = 97

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

### Downloading Spacy models for use as tokenizers

In [3]:
!python -m spacy download en_core_web_sm
!python -m spacy download fr_core_news_sm

Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
Collecting fr-core-news-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-3.5.0/fr_core_news_sm-3.5.0-py3-none-any.whl (16.3 MB)
Installing collected packages: fr-core-news-sm
Successfully installed fr-core-news-sm-3.5.0
✔ Download and installation successful
You can now load the package via spacy.load('fr_core_news_sm')


### Data import
Reading only 30,000 rows for demonstration purposes. Training an RNN is slow, as it takes input tokens one by one, and I am limited by the compute available.

In [4]:
data = pd.read_csv('/kaggle/input/en-fr-translation-dataset/en-fr.csv', nrows=30000)
data = data.dropna().drop_duplicates()
data.head(5)

Unnamed: 0,en,fr
0,Changing Lives | Changing Society | How It Wor...,Il a transformé notre vie | Il a transformé la...
1,Site map,Plan du site
2,Feedback,Rétroaction
3,Credits,Crédits
4,Français,English


### Initializing the tokenizers

In [5]:
fr_tokenizer = get_tokenizer('spacy', language='fr_core_news_sm')
en_tokenizer = get_tokenizer('spacy', language='en_core_web_sm')

### Splitting train, test and val datasets
Train - 85%, Val - 10%, Test - 5%

In [6]:
val_frac = 0.1
test_frac = 0.05
val_split_idx = int(len(data)*val_frac)
test_split_idx = int(len(data)*(val_frac + test_frac))
data_idx = list(range(len(data)))
np.random.shuffle(data_idx)

val_idx, test_idx, train_idx = data_idx[:val_split_idx], data_idx[val_split_idx:test_split_idx], data_idx[test_split_idx:]
print('len of train: ', len(train_idx))
print('len of val: ', len(val_idx))
print('len of test: ', len(test_idx))

df_train = data.iloc[train_idx].reset_index().drop('index',axis=1)
df_test = data.iloc[test_idx].reset_index().drop('index',axis=1)
df_val = data.iloc[val_idx].reset_index().drop('index',axis=1)

len of train:  25500
len of val:  2999
len of test:  1500


In [7]:
def build_vocab(data, source_tokenizer, target_tokenizer):
    en_counter = Counter()
    fr_counter = Counter()
    translations = data.values.tolist()
    for translation in translations:
        en_counter.update(source_tokenizer(translation[0]))
        fr_counter.update(target_tokenizer(translation[1]))
    return vocab(en_counter, specials=['<unk>', '<pad>', '<bos>', '<eos>'], min_freq=5), vocab(fr_counter, specials=['<unk>', '<pad>', '<bos>', '<eos>'], min_freq=5)

### Building separate vocabs for English & French
Note: Using only training data to build the vocab.

In [8]:
en_vocab, fr_vocab = build_vocab(df_train, en_tokenizer, fr_tokenizer)
en_vocab.set_default_index(en_vocab['<unk>'])
fr_vocab.set_default_index(fr_vocab['<unk>'])
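
The effect of building the vocab from training data alone, with a minimum frequency and a default index, can be sketched without torchtext. The helpers `build_index` and `encode` and the toy sentences are hypothetical stand-ins, and whitespace splitting replaces the spaCy tokenizers.

```python
from collections import Counter

SPECIALS = ['<unk>', '<pad>', '<bos>', '<eos>']

def build_index(sentences, min_freq=2):
    """Map tokens seen at least min_freq times to ids; specials come first."""
    counter = Counter(tok for s in sentences for tok in s.split())
    tokens = SPECIALS + [t for t, c in counter.items() if c >= min_freq]
    return {tok: idx for idx, tok in enumerate(tokens)}

def encode(sentence, index):
    """Tokens missing from the index fall back to the <unk> id."""
    unk = index['<unk>']
    return [index.get(tok, unk) for tok in sentence.split()]

train_sentences = ['the cat sat', 'the dog sat', 'a bird flew']
index = build_index(train_sentences, min_freq=2)

print(index)
# 'bird' occurs only once in training, so it is encoded as <unk>
encoded = encode('the bird sat', index)
print(encoded)
```

Any validation or test token that was rare or absent in training is handled the same way, which is what `set_default_index` arranges for the real vocab objects.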

In [9]:
def data_process(data):
    translations = data.values.tolist()
    pairs = []
    for translation in translations:
        en_tensor = torch.tensor([en_vocab[token] for token in en_tokenizer(translation[0])],
                            dtype=torch.long)
        fr_tensor = torch.tensor([fr_vocab[token] for token in fr_tokenizer(translation[1])],
                            dtype=torch.long)
        pairs.append((en_tensor, fr_tensor))
    return pairs

In [10]:
train_data = data_process(df_train)
val_data = data_process(df_val)
test_data = data_process(df_test)

### Note:
Keeping a small batch size because of limited GPU memory.

In [11]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

BATCH_SIZE = 8
PAD_IDX = en_vocab['<pad>']
BOS_IDX = en_vocab['<bos>']
EOS_IDX = en_vocab['<eos>']

In [12]:
def generate_batch(data_batch):
    en_batch, fr_batch = [], []
    for (en_item, fr_item) in data_batch:
        en_batch.append(torch.cat([torch.tensor([BOS_IDX]), en_item, torch.tensor([EOS_IDX])], dim=0).to(device))
        fr_batch.append(torch.cat([torch.tensor([BOS_IDX]), fr_item, torch.tensor([EOS_IDX])], dim=0).to(device))  
        
    en_batch = pad_sequence(en_batch, padding_value=PAD_IDX)
    fr_batch = pad_sequence(fr_batch, padding_value=PAD_IDX)
    return en_batch, fr_batch
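
To see what this kind of collate step produces, here is a minimal sketch with two toy sentences. The ids `PAD`, `BOS`, `EOS` are assumed to follow the order of the specials list passed to the vocab, and `wrap` is a hypothetical helper.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD, BOS, EOS = 1, 2, 3  # assumed special-token ids

def wrap(ids):
    """Add <bos>/<eos> ids around a list of token ids."""
    return torch.tensor([BOS] + ids + [EOS], dtype=torch.long)

sentences = [[10, 11, 12], [20]]  # two toy sentences of different length
batch = pad_sequence([wrap(s) for s in sentences], padding_value=PAD)

print(batch.shape)  # torch.Size([5, 2]) -> [max_len, batch_size]
print(batch.t())    # one row per sentence, the shorter one padded with PAD
```

The batch is sequence-first (`[max_len, batch_size]`), which matches the `[src sent len, batch size]` layout the encoder and decoder expect.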

In [13]:
train_loader = DataLoader(train_data, batch_size=BATCH_SIZE,
                        shuffle=True, collate_fn=generate_batch)
val_loader = DataLoader(val_data, batch_size=BATCH_SIZE,
                        shuffle=True, collate_fn=generate_batch)
test_loader = DataLoader(test_data, batch_size=BATCH_SIZE,
                        shuffle=True, collate_fn=generate_batch)

#### Encoder process -
The encoder process is similar to the previous notebook, with a few additions.
1. Bidirectional processing - RNN processes the input in both directions (first to last word and last word to first).
2. Bidirectional processing doubles the size of the hidden vector (hidden vector - [forward hidden vector, backward hidden vector]). This vector needs to be mapped to the shape of the decoder's initial hidden vector. For this, a linear_layer maps from enc_hidden_vector*2 -> dec_hidden_vector.
3. Both the output of the linear_layer discussed above and the final outputs for all the tokens are returned from the encoder.

example pseudocode for one sentence pass through the encoder - 

<code>
    vocab = build_vocab(source_sentence)
    sentence_tensor = [vocab[tokenize(word)] for word in source_sentence.split(' ')]
    hidden = 0
    for word_tensor in sentence_tensor:
        emb_vector = embed(word_tensor)
        [hidden_forward, hidden_backward], output = rnn_unit(emb_vector, hidden) # for 1 layer
        # for multiple layers, hidden state are stacked on top of each other -> [forward1, backward1, forward2, backward2..]
        # output = [forwardLast, backwardLast]
        hidden = [hidden_forward, hidden_backward]
    context = linear_layer(concat(hidden[-2], hidden[-1])) # output is size of decoder hidden input
</code>

#### Inputs
* input_dim - source sentence vocab size
* emb_dim - output size of embedding layer
* enc_hid_dim - Output size of hidden vector generated by RNN unit (actual output would have this size X 2 for the bidirectional use case)
* dec_hid_dim - Size of the decoder hidden state; a linear layer uses it to transform the final encoder hidden state into the decoder's initial hidden state.
* dropout - dropout probability, used to avoid overfitting

In [14]:
class Encoder(nn.Module):
    def __init__(self, 
                 input_dim: int, 
                 emb_dim: int, 
                 enc_hid_dim: int, 
                 dec_hid_dim: int, 
                 dropout: float):
        super().__init__()
        
        self.input_dim = input_dim
        self.emb_dim = emb_dim
        self.enc_hid_dim = enc_hid_dim
        self.dec_hid_dim = dec_hid_dim
        self.dropout = dropout
        
        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.gru = nn.GRU(emb_dim, enc_hid_dim, bidirectional = True)
        self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src):
        
        #src = [src sent len, batch size]
        embedded = self.dropout(self.embedding(src))
        
        #embedded = [src sent len, batch size, emb dim] 
        # each token is converted into an embedding vector of size emb_dim
        
        outputs, hidden = self.gru(embedded)
                
        #outputs = [src sent len, batch size, hid dim * num directions] last layer output of gru
        #hidden = [n layers * num directions, batch size, hid dim] hidden state output of last token pass through gru
        
        #hidden is stacked [forward_1, backward_1, forward_2, backward_2, ...]
        
        #initial decoder hidden is final hidden state of the forwards and backwards 
        #  encoder RNNs fed through a linear layer
        
        # here hidden[-2,:,:] = forward rnn last hidden state, hidden[-1,:,:] = backward rnn last hidden state
        # here we have only one layer, in case of multiple layers the hidden states are further stacked
        
        # Note: torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)
        # is of shape [batch_size, enc_hid_dim * 2]
        
        hidden = torch.tanh(self.fc(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)))
        
        #tanh is applied to keep the results in the range [-1, 1]
        #outputs = [src sent len, batch size, enc hid dim * 2]
        #hidden = [batch size, dec hid dim]
        
        return outputs, hidden


#### Attention process - 
1. Takes in the encoder outputs and the decoder hidden state (initially the final encoder hidden state passed through the linear layer).
2. Outputs shape - [1, sent_len, enc_hidden_size * 2] (multiplied by 2 for forward and backward)
3. Hidden shape - [1, dec_hidden_state]
4. Repeat the hidden state sent_len times and concatenate it with the outputs.
5. Pass through a linear layer to map to an attention vector.
6. The attention vector is used in the decoder to calculate a weighted sum of encoder outputs, which is fed to the decoder RNN unit for token prediction. This helps the decoder learn to attend to the parts of the source sentence that matter for the token being generated.

<code>
    inputs -> encoder_outputs, hidden
    hidden_repeated = repeat(hidden, sent_len)
    attention_input = concat(encoder_outputs, hidden_repeated)
    attention_vector = linear_layer(attention_input)
    # attention vector size -> [1, sent_len, attention_dim]
    # we need to transform it to [1, sent_len]
    attention_vector = transform(attention_vector) # either by summing across last dimension or through mat mul with a randomly initialized vector that can be trained
    return attention_vector
 </code>
 
#### Inputs
* enc_hid_dim = hidden dimension of encoder rnn unit
* dec_hid_dim = hidden dimension of decoder rnn unit
* attn_dim = dimension of attention vector to be generated by the attention unit.

Note: -
Using the summation approach here to reduce the attention vector from [batch_size, src_len, attn_dim] to [batch_size, src_len], which reduces training time.

In [15]:
class Attention(nn.Module):
    def __init__(self, 
                 enc_hid_dim: int, 
                 dec_hid_dim: int,
                 attn_dim: int):
        super().__init__()
        
        self.enc_hid_dim = enc_hid_dim
        self.dec_hid_dim = dec_hid_dim
        self.attn_in = (enc_hid_dim * 2) + dec_hid_dim
        
        self.attn = nn.Linear(self.attn_in, attn_dim)
        # self.v = nn.Parameter(torch.rand(attn_dim))
        
    def forward(self, decoder_hidden, encoder_outputs):
        
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [src sent len, batch size, enc hid dim * 2]
        
        batch_size = encoder_outputs.shape[1]
        src_len = encoder_outputs.shape[0]
        
        #repeat decoder hidden state src_len times
        repeated_decoder_hidden = decoder_hidden.unsqueeze(1).repeat(1, src_len, 1)
        
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        
        #repeated_decoder_hidden = [batch size, src sent len, dec hid dim]
        #encoder_outputs = [batch size, src sent len, enc hid dim * 2]
        
        # attention layer takes encoder_outputs and decoder_hidden concatenated together.
        # Hence the need to repeat decoder_hidden src_len times, as encoder_outputs is src_len long
        
        # concatenate along last dim and feed through attention layer
        # apply tanh for restricting values to [-1, 1]
        
        energy = torch.tanh(self.attn(torch.cat((
            repeated_decoder_hidden, 
            encoder_outputs), 
            dim = 2))) 
        
        #energy = [batch size, src sent len, attn_dim]
        
        # each source token's energy vector is attn_dim long;
        # we need to reduce it to 1 value for each source token.
        # here this is done by summing over the last dim of the energy vector
        
        attention = torch.sum(energy, dim=2)
        #attention = [batch size, src len]
        
        # alternative (commented out, marked ###): learn a parameter v = [attn_dim]
        # and reduce the energy vector with a batch matrix mul instead of a sum
        
        ### energy = energy.permute(0, 2, 1)
        #energy = [batch size, attn_dim, src sent len]
        
        ### v = self.v.repeat(batch_size, 1).unsqueeze(1)
        #v = [batch size, 1, attn_dim]
        
        # batch matrix mul with v will transform energy to size - [batch size, 1, src_len]
        ### attention = torch.bmm(v, energy).squeeze(1)
        #attention = [batch size, src len]
        
        # at the end we have one number for each token in a sentence in attention vector.
        # this will be used to calculate the weighted attention vector in decoder.
        
        return torch.nn.functional.softmax(attention, dim=1)
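
The two ways of reducing the energy tensor to one score per source token can be compared on a toy tensor. This is only a sketch: the sizes are assumed, and `v` is a plain random vector standing in for what would be a trainable `nn.Parameter` in a real model.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

batch_size, src_len, attn_dim = 2, 4, 3  # toy sizes, assumed for illustration
energy = torch.tanh(torch.randn(batch_size, src_len, attn_dim))

# Option 1: sum over the attention dimension (no extra parameters)
scores_sum = energy.sum(dim=2)            # [batch_size, src_len]

# Option 2: dot product with a vector v (trainable in a real model)
v = torch.rand(attn_dim)
scores_v = torch.matmul(energy, v)        # [batch_size, src_len]

# summing is the special case where v is fixed to all ones
same_as_ones = torch.allclose(scores_sum, torch.matmul(energy, torch.ones(attn_dim)))

weights_sum = F.softmax(scores_sum, dim=1)
weights_v = F.softmax(scores_v, dim=1)

print(weights_sum.shape, weights_v.shape)
print(weights_sum.sum(dim=1))             # each row sums to 1
print(same_as_ones)
```

Both options yield a `[batch_size, src_len]` distribution over source tokens. The summation variant simply fixes the weighting of the attention dimensions instead of learning it.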

#### Decoder process - 
The decoder process is similar to that of the previous notebook, except for a few changes.
1. The decoder contains the attention layer and uses the outputs and hidden state returned by the encoder to generate the attention vector.
2. It performs a matrix multiplication of the attention vector with the encoder outputs to generate the weighted attention vector.
3. The weighted attention vector, along with the previous hidden state and the embedding of the previous token, is passed as input to the rnn unit.
4. The output generated by the rnn unit, along with the weighted attention vector and the embedding vector for the previous token, is passed to linear_layer to generate the final output scores over the target vocab. Argmax is taken to get the output index from the target vocab.

example pseudocode for one sentence pass through decoder - 

<code>
    vocab = build_vocab(target_sentence)
    sentence_tensor = [vocab[tokenize(word)] for word in target_sentence.split(' ')]
    hidden = hidden state from encoder
    word_tensor = sentence_tensor[0]
    predicted_sentence = []
    for i in range(len(sentence_tensor)):
        emb_vector = embed(word_tensor)
        attention_vector = attention(hidden, encoder_outputs)
        weighted_attention = attention_vector * encoder_outputs
        hidden, output = rnn_unit(concat(weighted_attention, emb_vector), hidden) # hidden = output for 1 layer rnn
        prediction = argmax(linear_layer(concat(emb_vector, weighted_attention, output)))
        word_tensor = prediction
        predicted_sentence.append(prediction)
</code>
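
The weighted attention step in the pseudocode above is a per-sentence weighted sum of encoder outputs. The sketch below checks, on toy tensors with assumed sizes, that a batched matrix product gives the same result as writing the sum out explicitly.

```python
import torch

torch.manual_seed(0)

batch_size, src_len, enc_dim = 2, 3, 4  # toy sizes; enc_dim stands for enc_hid_dim * 2
encoder_outputs = torch.randn(src_len, batch_size, enc_dim)
a = torch.softmax(torch.randn(batch_size, src_len), dim=1)  # attention weights

# batched matrix product: [B, 1, S] x [B, S, D] -> [B, 1, D]
context = torch.bmm(a.unsqueeze(1), encoder_outputs.permute(1, 0, 2)).squeeze(1)

# the same result written as an explicit weighted sum over source tokens
explicit = sum(a[:, s].unsqueeze(1) * encoder_outputs[s] for s in range(src_len))

print(context.shape)  # torch.Size([2, 4])
matches = torch.allclose(context, explicit, atol=1e-6)
print(matches)
```

Because the weights in each row of `a` sum to 1, `context` is a convex combination of the encoder outputs for that sentence.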


#### Inputs 
* output_dim = target vocab size
* emb_dim = embedding size
* enc_hid_dim = size of encoder rnn unit hidden state
* dec_hid_dim = size of decoder rnn unit hidden state
* dropout = for avoiding overfitting
* attention = attention module object

In [16]:
class Decoder(nn.Module):
    def __init__(self, 
                 output_dim: int, 
                 emb_dim: int, 
                 enc_hid_dim: int, 
                 dec_hid_dim: int, 
                 dropout: float, 
                 attention: nn.Module):
        super().__init__()

        self.emb_dim = emb_dim
        self.enc_hid_dim = enc_hid_dim
        self.dec_hid_dim = dec_hid_dim
        self.output_dim = output_dim
        self.dropout = dropout
        self.attention = attention

        self.embedding = nn.Embedding(output_dim, emb_dim)
        self.gru = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim)
        # self.attention.attn_in = (enc_hid_dim*2) + dec_hid_dim
        self.out = nn.Linear(self.attention.attn_in + emb_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
        
        
    def _weighted_encoder_rep(self, decoder_hidden, encoder_outputs):
        
        # Outputs a vector summing to 1 of length seq_len for each observation
        a = self.attention(decoder_hidden, encoder_outputs)

        #a = [batch size, src len]
        a = a.unsqueeze(1)
        #a = [batch size, 1, src len]

        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        #encoder_outputs = [batch size, src sent len, enc hid dim * 2]

        weighted_encoder_rep = torch.bmm(a, encoder_outputs)
        # batch matrix mul of encoder_outputs and attention vector
        # essentially this step's output helps the decoder gather context from the entire sentence
        
        #weighted_encoder_rep = [batch size, 1, enc hid dim * 2]
        weighted_encoder_rep = weighted_encoder_rep.permute(1, 0, 2)
        #weighted_encoder_rep = [1, batch size, enc hid dim * 2]
        
        return weighted_encoder_rep
        
        
    def forward(self, input, decoder_hidden, encoder_outputs):
             
        #input = [batch size] Note: "one word(token) at a time"
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [src sent len, batch size, enc hid dim * 2]
        
        input = input.unsqueeze(0)
        #input = [1, batch size]
        
        embedded = self.dropout(self.embedding(input))
        #embedded = [1, batch size, emb dim]
        
        weighted_encoder_rep = self._weighted_encoder_rep(decoder_hidden, 
                                                          encoder_outputs)
        
        # Then, the input to the decoder for the current token is a concatenation of:
        # the weighted attention calculated from encoder outputs and attention vector and
        # The embedding itself
        # Decoder hidden state is fed separately without concatenating
        # earlier we used to only use the embedding along with the decoder hidden state
        
        rnn_input = torch.cat((embedded, weighted_encoder_rep), dim = 2)
        
        #rnn_input = [1, batch size, (enc hid dim * 2) + emb dim]
        output, decoder_hidden = self.gru(rnn_input, decoder_hidden.unsqueeze(0))
        
        #output = [sent len, batch size, dec hid dim * n directions] 1 direction in this case
        # sent len is 1 as we pass token by token 
        # so output size is [1, batch size, dec_hid_dim]
        
        #decoder_hidden = [n layers * n directions, batch size, dec hid dim]

        #output = [1, batch size, dec hid dim]
        #hidden = [1, batch size, dec hid dim]
        #this also means that output == hidden
        assert (output == decoder_hidden).all()
        
        embedded = embedded.squeeze(0)
        output = output.squeeze(0)
        weighted_encoder_rep = weighted_encoder_rep.squeeze(0)
        
        # rnn output, weighted encoder representation and embedding are concatenated and fed through the output linear layer
        output = self.out(torch.cat((output, 
                                     weighted_encoder_rep, 
                                     embedded), dim = 1))
        
        #output = [batch size, output dim]
        
        return output, decoder_hidden.squeeze(0)

### Model 
1. The source sentence is passed through the encoder, which returns the encoder outputs and the last hidden state.
2. These values are passed to the decoder along with one target token at a time for the entire batch.
3. The decoder returns output and the hidden state. output is used to generate the token by taking the argmax, and hidden is used for the next token generation.
4. This process is repeated in a loop for all the tokens in the target sentence.
5. Teacher forcing is used for better training. With a given probability, it feeds the actual target token, instead of the token generated by the model, into the next rnn decoder step.

In [17]:
class Seq2Seq(nn.Module):
    def __init__(self, 
                 encoder: nn.Module, 
                 decoder: nn.Module, 
                 device: torch.device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
    def forward(self, src, trg, 
                teacher_forcing_ratio: float = 0.5):
        
        #src = [src sent len, batch size]
        #trg = [trg sent len, batch size]
        #teacher_forcing_ratio is probability to use teacher forcing
        #e.g. if teacher_forcing_ratio is 0.75 we use teacher forcing 75% of the time
        
        batch_size = src.shape[1]
        max_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        #tensor to store decoder outputs
        outputs = torch.zeros(max_len, batch_size, trg_vocab_size).to(self.device)
        
        #encoder_outputs is all hidden states of the input sequence, back and forwards
        #hidden is the final forward and backward hidden states, passed through a linear layer
        encoder_outputs, hidden = self.encoder(src)
                
        #first input to the decoder is the <bos> tokens
        output = trg[0,:]
        
        for t in range(1, max_len):
            output, hidden = self.decoder(output, hidden, encoder_outputs)
            outputs[t] = output
            teacher_force = random.random() < teacher_forcing_ratio
            top1 = output.max(1)[1]
            output = (trg[t] if teacher_force else top1)

        return outputs
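
The teacher-forcing decision inside the loop can be isolated in a few lines of plain Python. The function `next_decoder_input` and the string tokens are hypothetical and used only for illustration.

```python
import random

def next_decoder_input(target_token, predicted_token, teacher_forcing_ratio, rng=random):
    """Pick the ground-truth token with probability teacher_forcing_ratio."""
    use_teacher = rng.random() < teacher_forcing_ratio
    return target_token if use_teacher else predicted_token

rng = random.Random(0)
picks = [next_decoder_input('gold', 'pred', 0.5, rng) for _ in range(10_000)]
gold_fraction = picks.count('gold') / len(picks)
print(gold_fraction)  # close to 0.5

# the extremes are deterministic
always_gold = next_decoder_input('gold', 'pred', 1.0, rng)
always_pred = next_decoder_input('gold', 'pred', 0.0, rng)
print(always_gold, always_pred)
```

A ratio of 0 means the decoder always consumes its own predictions, which is the setting normally wanted at evaluation time.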

### Initializing the model

In [18]:
INPUT_DIM = len(en_vocab)
OUTPUT_DIM = len(fr_vocab)
ENC_EMB_DIM = 64
DEC_EMB_DIM = 64
ENC_HID_DIM = 128
DEC_HID_DIM = 128
ATTN_DIM = 32
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

attn = Attention(ENC_HID_DIM, DEC_HID_DIM, ATTN_DIM)
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT, attn)

model = Seq2Seq(enc, dec, device).to(device)

In [19]:
len(en_vocab), len(fr_vocab)

(8560, 9249)

### Initializing the model weights from a normal distribution (biases set to zero)

In [20]:
def init_weights(m: nn.Module):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)
            
model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(8560, 64)
    (gru): GRU(64, 128, bidirectional=True)
    (fc): Linear(in_features=256, out_features=128, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (attention): Attention(
      (attn): Linear(in_features=384, out_features=32, bias=True)
    )
    (embedding): Embedding(9249, 64)
    (gru): GRU(320, 128)
    (out): Linear(in_features=448, out_features=9249, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

In [21]:
def count_parameters(model: nn.Module):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 5,659,585 trainable parameters
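
The parameter count can be cross-checked by hand from the layer sizes in the printed model. The sketch below assumes the standard single-layer GRU layout (three gates, each with input and hidden weights plus two bias vectors); `gru_params` and `linear_params` are hypothetical helper names.

```python
def gru_params(input_size, hidden_size, bidirectional=False):
    """Single-layer GRU: 3 gates, each with input/hidden weights and two bias vectors."""
    per_direction = 3 * hidden_size * (input_size + hidden_size) + 2 * 3 * hidden_size
    return per_direction * (2 if bidirectional else 1)

def linear_params(in_features, out_features):
    """Weight matrix plus bias vector."""
    return in_features * out_features + out_features

EN_VOCAB, FR_VOCAB = 8560, 9249
EMB, ENC_HID, DEC_HID, ATTN = 64, 128, 128, 32

counts = {
    'encoder.embedding': EN_VOCAB * EMB,
    'encoder.gru': gru_params(EMB, ENC_HID, bidirectional=True),
    'encoder.fc': linear_params(ENC_HID * 2, DEC_HID),
    'attention.attn': linear_params(ENC_HID * 2 + DEC_HID, ATTN),
    'decoder.embedding': FR_VOCAB * EMB,
    'decoder.gru': gru_params(ENC_HID * 2 + EMB, DEC_HID),
    'decoder.out': linear_params(ENC_HID * 2 + DEC_HID + EMB, FR_VOCAB),
}
total = sum(counts.values())
for name, n in counts.items():
    print(f'{name:20s} {n:>10,}')
print(f'total {total:,}')  # 5,659,585
```

Most of the parameters sit in the final output layer, since its size scales with the target vocab.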


### Initializing the optimizer and loss function

In [22]:
optimizer = optim.Adam(model.parameters())

PAD_IDX = fr_vocab['<pad>']
criterion = nn.CrossEntropyLoss(ignore_index = PAD_IDX)

In [23]:
def train(model, dataloader, optimizer, criterion, clip):
    
    model.train()
    
    epoch_loss = 0
    
    for _, (src, trg) in enumerate(dataloader):
        #print (src.size())
        optimizer.zero_grad()
        output = model(src, trg)
        
        #trg = [trg sent len, batch size]
        #output = [trg sent len, batch size, output dim]
        
        output = output[1:].view(-1, output.shape[-1])
        trg = trg[1:].view(-1)
        
        #trg = [(trg sent len - 1) * batch size]
        #output = [(trg sent len - 1) * batch size, output dim]
        
        loss = criterion(output, trg)
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
        
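    # note: criterion already averages over the non-pad tokens in each batch,
    # so dividing by BATCH_SIZE as well scales the reported loss down by that factor
    return epoch_loss / (len(dataloader)*BATCH_SIZE)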

In [24]:
def evaluate(model, dataloader, criterion):
    
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
    
        for _, (src, trg) in enumerate(dataloader):
            output = model(src, trg, 0) #turn off teacher forcing

            #trg = [trg sent len, batch size]
            #output = [trg sent len, batch size, output dim]

            output = output[1:].view(-1, output.shape[-1])
            trg = trg[1:].view(-1)

            #trg = [(trg sent len - 1) * batch size]
            #output = [(trg sent len - 1) * batch size, output dim]

            loss = criterion(output, trg)

            epoch_loss += loss.item()
        
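    # note: criterion already averages over the non-pad tokens in each batch,
    # so dividing by BATCH_SIZE as well scales the reported loss down by that factor
    return epoch_loss / (len(dataloader)*BATCH_SIZE)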

In [25]:
def epoch_time(start_time: float, 
               end_time: float):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [26]:
MODEL_PATH = 'rnn-attn-model.pt'

In [27]:
del df_train
del df_val
del df_test
del data
del en_tokenizer
del fr_tokenizer

In [28]:
N_EPOCHS = 5
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_loader, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, val_loader, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), MODEL_PATH)
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 6m 23s
	Train Loss: 0.721 | Train PPL:   2.057
	 Val. Loss: 0.725 |  Val. PPL:   2.066
Epoch: 02 | Time: 6m 18s
	Train Loss: 0.631 | Train PPL:   1.880
	 Val. Loss: 0.693 |  Val. PPL:   2.000
Epoch: 03 | Time: 6m 18s
	Train Loss: 0.581 | Train PPL:   1.788
	 Val. Loss: 0.675 |  Val. PPL:   1.965
Epoch: 04 | Time: 6m 20s
	Train Loss: 0.547 | Train PPL:   1.728
	 Val. Loss: 0.665 |  Val. PPL:   1.944
Epoch: 05 | Time: 6m 26s
	Train Loss: 0.521 | Train PPL:   1.684
	 Val. Loss: 0.660 |  Val. PPL:   1.935


In [29]:
model.load_state_dict(torch.load(MODEL_PATH))
test_loss = evaluate(model, test_loader, criterion)
print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')

| Test Loss: 0.655 | Test PPL:   1.925 |
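
The perplexity printed above is the exponential of the reported loss. The sketch below reproduces that relation in plain Python and shows how the normalisation matters. It assumes each batch loss is already a per-token mean, which is the default `reduction='mean'` of `nn.CrossEntropyLoss`; `perplexity` and `epoch_mean` are hypothetical helper names.

```python
import math

def perplexity(mean_token_loss):
    """Perplexity is the exponential of the mean per-token cross-entropy."""
    return math.exp(mean_token_loss)

def epoch_mean(batch_losses):
    """Average of per-batch mean losses (each batch already averaged over tokens)."""
    return sum(batch_losses) / len(batch_losses)

BATCH_SIZE = 8
reported = 0.655  # test loss printed above
print(round(perplexity(reported), 3))

# The train/evaluate loops also divide by BATCH_SIZE. If each batch loss is
# already a per-token mean, undoing that division gives the per-token figure.
per_token = reported * BATCH_SIZE
print(round(per_token, 2), round(perplexity(per_token), 1))
```

Under that assumption, the reported losses are best compared with each other across epochs rather than read as per-token perplexities.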
