#Table of Contents

1. Import Libraries
2. Load Dataset
3. Create Field Objects
4. Data Preparation
  - Build Vocabulary
  - Create Dataloaders
  
5. Define Model Architecture
  - Encoder Architecture
  - Attention Mechanism
  - Decoder Architecture
  - Sequence-to-Sequence Architecture
6. Train Sequence-to-Sequence Model
7. Model Inference
  - Build Inference Function
  - Translate Russian Sentences in the Test Dataset
  
The goal of this exercise is to add the attention mechanism between the encoder and decoder to improve translation accuracy.

#1. Import Libraries

In [1]:
import re
import time
import math
import random

import numpy as np
import pandas as pd
import spacy

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from tqdm import notebook
pd.set_option('display.max_colwidth', 200)

In [2]:
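# https://stackoverflow.com/questions/51452412/cant-import-torchtext-module-in-jupyter-notebook-while-using-pytorch

# This notebook uses the legacy torchtext API (Field, TabularDataset, BucketIterator).
# It is available as torchtext.data in torchtext < 0.9, moved to torchtext.legacy.data
# in torchtext 0.9-0.11, and was removed in torchtext 0.12.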
import sys
sys.path.append("C:/Users/czwea/anaconda3/bin/")
import torchtext
#from torchtext import data
# from torchtext.legacy import data

In [5]:
# check GPU availability

# https://queirozf.com/entries/suppressing-ignoring-warnings-in-python-reference-and-examples#:~:text=In%20order%20to%20disable%20all%20warnings%20in%20the,of%20a%20given%20type%20only%2C%20using%20category%3D%20parameter%3A
import warnings
with warnings.catch_warnings():
    # this will suppress all warnings in this block
    warnings.simplefilter("ignore")
    
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    print(device)

cpu


#2. Load Dataset

In [None]:
# mount google drive
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
# extract the zip file from your Google Drive
# ! unzip '/content/drive/My Drive/Course_Notes/NLP using PyTorch/Seq2Seq/nmt_data.zip'

In [14]:
# read the training and test datasets from local disk
df = pd.read_csv("D:/LargeData/Analytics_Vidhya/NLP_Deep/nmt_data.csv")
test_df = pd.read_csv("D:/LargeData/Analytics_Vidhya/NLP_Deep/nmt_data_test.csv")

# shape of datasets
df.shape, test_df.shape

((187053, 2), (46668, 2))

#3. Create Field Objects

In [15]:
# import Russian spacy model to tokenize Russian text
from spacy.lang.ru import Russian

In [16]:
# dependency for spaCy Russian tokenizer
# !pip install pymorphy2

In [17]:
# spacy object for Russian
nlp_ru = Russian()

# spacy object for English
nlp_en = spacy.load("en_core_web_sm", disable = ["parser", "tagger", "ner"])

In [18]:
## functions to perform tokenization

# tokenizes Russian text from a string into a list of tokens
def tokenize_ru(text):
  return [tok.text for tok in nlp_ru.tokenizer(text)]

# tokenizes English text from a string into a list of tokens
def tokenize_en(text):
  return [tok.text for tok in nlp_en.tokenizer(text)]

In [19]:
## Create Field objects

# Field object for Russian
SRC = torchtext.data.Field(tokenize = tokenize_ru, 
                 include_lengths = True, 
                 lower = True)

# Field object for English
TRG = torchtext.data.Field(tokenize = tokenize_en, 
                 init_token = '<sos>', # "start" token
                 eos_token = '<eos>', # "end" token
                 include_lengths = True, 
                 lower = True)

fields = [('rus', SRC), ('eng', TRG)]

* Refer to the video "Text preprocessing in PyTorch" in the course "Fundamentals of Deep Learning" to learn more about TorchText's Field objects.

#4. Data Preparation

###4.1 Build Vocabulary

In [21]:
# importing data from csv
nmt_data = torchtext.data.TabularDataset(path="../../../../../LargeData/Analytics_Vidhya/NLP_Deep/nmt_data.csv", format='csv', fields=fields)



In [22]:
# build vocabulary for Russian sequences
SRC.build_vocab(nmt_data, max_size=4000)

# build vocabulary for English sequences
TRG.build_vocab(nmt_data, max_size=4000)

In [23]:
# check size of vocabulary
len(SRC.vocab), len(TRG.vocab)

(4002, 4004)
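
The sizes are 4002 and 4004 rather than 4000 because `max_size` caps only the ordinary tokens; special tokens are added on top of that limit. The source field adds `<unk>` and `<pad>`, and the target field also adds `<sos>` and `<eos>`. A minimal plain-Python sketch of that counting, where `build_vocab` is a hypothetical stand-in and not the torchtext implementation:

```python
from collections import Counter

def build_vocab(token_lists, max_size, specials):
    """Toy stand-in for Field.build_vocab: specials first, then the most frequent tokens."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    itos = list(specials) + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

# Tiny made-up corpus, used only to illustrate the arithmetic
corpus = [["i", "know", "tom"], ["i", "know", "you"], ["we", "know", "tom"]]

src_itos, _ = build_vocab(corpus, max_size=3, specials=["<unk>", "<pad>"])
trg_itos, _ = build_vocab(corpus, max_size=3,
                          specials=["<unk>", "<pad>", "<sos>", "<eos>"])

# max_size + 2 specials for the source, max_size + 4 specials for the target
print(len(src_itos), len(trg_itos))
```

With `max_size=4000` the same arithmetic gives 4000 + 2 and 4000 + 4.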

###4.2 Create Dataloaders

In [24]:
# Split the translation pairs into training and validation sets (80/20)
train_data, val_data = nmt_data.split(split_ratio=0.8)

In [26]:
# Create a set of iterators for each split
train_iterator, valid_iterator = torchtext.data.BucketIterator.splits(
    (train_data, val_data), 
    batch_size = 64, 
    sort_within_batch = True, 
    sort_key = lambda x:len(x.rus),
    device = device)
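
The point of bucketing is to put sentences of similar length in the same batch so that little padding is needed. The following is a simplified plain-Python sketch of that idea; `bucket_batches` is a hypothetical helper, and the real `BucketIterator` also shuffles, which is left out here:

```python
def bucket_batches(examples, batch_size, pad="<pad>"):
    """Sort by length, cut into batches, and pad each batch to its own longest example."""
    ordered = sorted(examples, key=len)
    batches = []
    for start in range(0, len(ordered), batch_size):
        chunk = ordered[start:start + batch_size]
        width = max(len(ex) for ex in chunk)
        batches.append([ex + [pad] * (width - len(ex)) for ex in chunk])
    return batches

# Made-up token lists of lengths 1, 4, 2 and 3
sentences = [["a"], ["a", "b", "c", "d"], ["a", "b"], ["a", "b", "c"]]

batches = bucket_batches(sentences, batch_size=2)
pad_count = sum(row.count("<pad>") for batch in batches for row in batch)
print(batches)
print("pad tokens used:", pad_count)
```

Batching the same four sentences in their original order would need more padding, since the one-token sentence would be padded up to length four.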



#5. Define Model Architecture

###5.1 Encoder Architecture

In [27]:
## embedding layer: 
##    input dimensions = size of Russian vocabulary
##    output dimensions = embedding_size

## GRU layer:
##    input dimensions = embedding_size
##    hidden units = hidden_size
##    layers = num_layers
##    output dim = hidden_size

class Encoder(nn.Module):
  
  def __init__(self, hidden_size, embedding_size, num_layers=2, dropout=0.3):
    
    super(Encoder, self).__init__()
    
    # Basic network params
    self.hidden_size = hidden_size
    self.embedding_size = embedding_size
    self.num_layers = num_layers
    self.dropout = dropout
    
    # Embedding layer for the source (Russian) vocabulary
    self.embedding = nn.Embedding(len(SRC.vocab), embedding_size)
    # GRU layer
    self.gru = nn.GRU(embedding_size, hidden_size,
                      num_layers=num_layers,
                      dropout=dropout)
      
  def forward(self, input_sequence):
      
    # Convert input_sequence to word embeddings
    embedded = self.embedding(input_sequence)
            
    outputs, hidden = self.gru(embedded)
    
    # The output of a GRU has shape -> (seq_len, batch, hidden_size)
    return outputs, hidden

###5.2 Attention Mechanism

In [28]:
class Attention(nn.Module):
  def __init__(self, hidden_size):
    super(Attention, self).__init__()        
    self.hidden_size = hidden_size
      
    
  def dot_score(self, hidden_state, encoder_states):
    return torch.sum(hidden_state * encoder_states, dim=2)
  
          
  def forward(self, hidden, encoder_outputs, mask):
      
    attn_scores = self.dot_score(hidden, encoder_outputs)
    
    # Transpose max_length and batch_size dimensions
    attn_scores = attn_scores.t()
    
    # Apply mask so network does not attend <pad> tokens        
    attn_scores = attn_scores.masked_fill(mask == 0, -1e5)
    
    # Return softmax over attention scores      
    return F.softmax(attn_scores, dim=1).unsqueeze(1)
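
To make the arithmetic concrete, here is a minimal plain-Python sketch of the same three steps (dot-product score, masking, softmax) for a single decoder state. The function name and the toy numbers are illustrative assumptions; the class above does this for whole batches with tensors:

```python
import math

def masked_dot_attention(hidden, encoder_states, mask, fill=-1e5):
    """Dot-product scores, masked positions set to `fill`, then softmax."""
    scores = [sum(h * e for h, e in zip(hidden, state)) for state in encoder_states]
    scores = [s if keep else fill for s, keep in zip(scores, mask)]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

hidden = [1.0, 0.0]                                     # one decoder state
encoder_states = [[2.0, 0.0], [0.0, 3.0], [5.0, 5.0]]   # last position is padding
mask = [1, 1, 0]                                        # 0 marks <pad>

weights = masked_dot_attention(hidden, encoder_states, mask)
print([round(w, 3) for w in weights])
```

The padded position has the largest raw score, yet after masking it receives a weight of zero, so the weights over the real tokens still sum to one.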

###5.3 Decoder Architecture

In [29]:
## embedding layer: 
##    input dimensions = output_size (size of English vocabulary), 
##    output dimensions = embedding_size

## GRU layer:
##    input dimensions = embedding_size
##    hidden units = hidden_size
##    layers = n_layers
##    output dim = hidden_size

## concat layer:
##    input dimensions = hidden_size * 2
##    output dimensions = hidden_size

## fully Connected layer:
##    input dimensions = hidden_size, 
##    output dimensions = output_size (size of English vocabulary)

class Decoder(nn.Module):
  def __init__(self, embedding_size, hidden_size, output_size, n_layers=2, dropout=0.3):
      
    super(Decoder, self).__init__()
    
    # Basic network params
    self.hidden_size = hidden_size
    self.output_size = output_size
    self.n_layers = n_layers
    self.dropout = dropout
    self.embedding = nn.Embedding(output_size, embedding_size)
            
    self.gru = nn.GRU(embedding_size, hidden_size, n_layers, 
                      dropout=dropout)
    
    self.concat = nn.Linear(hidden_size * 2, hidden_size)
    self.out = nn.Linear(hidden_size, output_size)
    self.attn = Attention(hidden_size)
      
  def forward(self, current_token, hidden_state, encoder_outputs, mask):
    
    # convert current_token to word_embedding
    embedded = self.embedding(current_token)
    
    # Pass through GRU
    gru_output, hidden_state = self.gru(embedded, hidden_state)
    
    # Calculate attention weights
    attention_weights = self.attn(gru_output, encoder_outputs, mask)
    
    # Calculate context vector (weighted average of the encoder outputs)
    context = attention_weights.bmm(encoder_outputs.transpose(0, 1))
    
    # Concatenate context vector and GRU output
    gru_output = gru_output.squeeze(0)
    context = context.squeeze(1)
    concat_input = torch.cat((gru_output, context), 1)
    concat_output = torch.tanh(self.concat(concat_input))
    
    # Pass concat_output to final output layer
    output = self.out(concat_output)
    
    # Return output and final hidden state
    return output, hidden_state
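
The context vector is a weighted sum of the encoder outputs, and it is concatenated with the GRU output before the `concat` layer, which is why that layer takes `hidden_size * 2` inputs. A small plain-Python sketch with made-up numbers (hidden size 2, three source positions); `context_vector` is a hypothetical helper:

```python
def context_vector(weights, encoder_states):
    """Weighted sum of encoder states, one weight per source position."""
    dim = len(encoder_states[0])
    return [sum(w * state[i] for w, state in zip(weights, encoder_states))
            for i in range(dim)]

weights = [0.75, 0.25, 0.0]                             # attention weights, sum to 1
encoder_states = [[2.0, 0.0], [0.0, 4.0], [5.0, 5.0]]   # one vector per source token
gru_output = [1.0, -1.0]                                # decoder GRU output for this step

context = context_vector(weights, encoder_states)
concat_input = gru_output + context   # list concatenation, length = hidden_size * 2
print(context, len(concat_input))
```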

###5.4 Sequence-to-Sequence Architecture

In [30]:
class seq2seq(nn.Module):
  def __init__(self, embedding_size, hidden_size, vocab_size, device, pad_idx, eos_idx, sos_idx):
    super(seq2seq, self).__init__()
    
    # Embedding layer (not used in forward: the encoder and decoder define their own embeddings)
    self.embedding = nn.Embedding(vocab_size, embedding_size)
    
    # Encoder network
    self.encoder = Encoder(hidden_size, 
                            embedding_size,
                            num_layers=2,
                            dropout=0.3)
    
    # Decoder network        
    self.decoder = Decoder(embedding_size,
                            hidden_size,
                            vocab_size,
                            n_layers=2,
                            dropout=0.3)
    
    
    # Indices of special tokens and hardware device 
    self.pad_idx = pad_idx
    self.eos_idx = eos_idx
    self.sos_idx = sos_idx
    self.device = device
      
  def create_mask(self, input_sequence):
    return (input_sequence != self.pad_idx).permute(1, 0)
      
      
  def forward(self, input_sequence, output_sequence):
    
    # Unpack input_sequence tuple
    input_tokens = input_sequence[0]
  
    # Unpack output_tokens, or create an empty tensor for text generation
    if output_sequence is None:
      inference = True
      output_tokens = torch.zeros((100, input_tokens.shape[1])).long().fill_(self.sos_idx).to(self.device)
    else:
      inference = False
      output_tokens = output_sequence[0]
    
    vocab_size = self.decoder.output_size
    batch_size = len(input_sequence[1])
    max_seq_len = len(output_tokens)
    
    # tensor to store decoder outputs
    outputs = torch.zeros(max_seq_len, batch_size, vocab_size).to(self.device)        
    
    # pass input sequence to the encoder
    encoder_outputs, hidden = self.encoder(input_tokens)
    
    # first input to the decoder is the <sos> tokens
    output = output_tokens[0,:]
    
    # create mask
    mask = self.create_mask(input_tokens)
    
    
    # Step through the length of the output sequence one token at a time
    for t in range(1, max_seq_len):
      output = output.unsqueeze(0)
      
      output, hidden = self.decoder(output, hidden, encoder_outputs, mask)
      outputs[t] = output
      
      if inference:
        output = output.max(1)[1]
      else:
        output = output_tokens[t]
      
      # If we're in inference mode, keep generating until we produce an
      # <eos> token (output.item() assumes a batch size of 1)
      if inference and output.item() == self.eos_idx:
        return outputs[:t]
        
    return outputs
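
The loop above feeds the decoder differently in the two modes: during training the next input is the ground-truth token (teacher forcing), while at inference it is the model's own best guess (greedy decoding). A simplified plain-Python sketch of that control flow follows. `decode` and `toy_step` are hypothetical, the toy "model" is not a neural network, and unlike the notebook the target list here does not start with `<sos>`:

```python
def decode(step_fn, sos, eos, max_len, targets=None):
    """Run a decoder step by step.

    With `targets`, the next input is the ground-truth token (teacher forcing).
    Without it, the next input is the model's own best guess (greedy decoding).
    """
    token, predictions = sos, []
    steps = len(targets) if targets is not None else max_len
    for t in range(steps):
        scores = step_fn(token)
        best = max(range(len(scores)), key=scores.__getitem__)  # argmax
        predictions.append(best)
        if targets is not None:
            token = targets[t]
        else:
            token = best
            if best == eos:
                break
    return predictions

# Hypothetical toy "model": always prefers the token id after the current one
def toy_step(token, vocab_size=5):
    return [1.0 if i == (token + 1) % vocab_size else 0.0 for i in range(vocab_size)]

greedy = decode(toy_step, sos=0, eos=3, max_len=10)
forced = decode(toy_step, sos=0, eos=3, max_len=10, targets=[2, 2, 3])
print(greedy, forced)
```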

#6. Train Seq2Seq Model

In [31]:
# extract special tokens
pad_idx = TRG.vocab.stoi['<pad>']
eos_idx = TRG.vocab.stoi['<eos>']
sos_idx = TRG.vocab.stoi['<sos>']

# If pre-trained word embeddings are used, embedding_dim must match their dimension
embedding_dim = 100
hidden_dim = 256
vocab_size = len(TRG.vocab)

In [32]:
model = seq2seq(embedding_dim,
                hidden_dim, 
                vocab_size, 
                device, pad_idx, eos_idx, sos_idx).to(device)

In [33]:
# print model architecture
model

seq2seq(
  (embedding): Embedding(4004, 100)
  (encoder): Encoder(
    (embedding): Embedding(4002, 100)
    (gru): GRU(100, 256, num_layers=2, dropout=0.3)
  )
  (decoder): Decoder(
    (embedding): Embedding(4004, 100)
    (gru): GRU(100, 256, num_layers=2, dropout=0.3)
    (concat): Linear(in_features=512, out_features=256, bias=True)
    (out): Linear(in_features=256, out_features=4004, bias=True)
    (attn): Attention()
  )
)

In [34]:
# Adam optimizer
optimizer = optim.Adam(model.parameters())

# cross entropy loss with softmax
criterion = nn.CrossEntropyLoss(ignore_index = pad_idx)
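
Setting `ignore_index = pad_idx` means padded target positions contribute nothing to the loss, and the mean is taken over the remaining positions only. A minimal plain-Python sketch of that behaviour, with made-up logits and index 1 standing in for `<pad>`:

```python
import math

def cross_entropy(logits_rows, targets, ignore_index):
    """Mean negative log-likelihood over positions whose target is not `ignore_index`."""
    losses = []
    for logits, target in zip(logits_rows, targets):
        if target == ignore_index:
            continue  # padding contributes nothing to the loss
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_norm - logits[target])
    return sum(losses) / len(losses)

logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [9.0, 9.0, 9.0]]
targets = [0, 2, 1]   # the last target is the stand-in <pad> index

loss = cross_entropy(logits, targets, ignore_index=1)
print(round(loss, 4))
```

The third row could hold any values without changing the result, because its target is ignored.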

In [35]:
def train(model, iterator, criterion, optimizer):
  # Put the model in training mode!
  model.train()
  
  epoch_loss = 0
  
  for idx, batch in notebook.tqdm(enumerate(iterator), total=len(iterator)):
    input_sequence = batch.rus
    output_sequence = batch.eng

    target_tokens = output_sequence[0]

    # zero out the gradient for the current batch
    optimizer.zero_grad()

    # Run the batch through our model
    output = model(input_sequence, output_sequence)

    # Throw it through our loss function
    output = output[1:].view(-1, output.shape[-1])
    target_tokens = target_tokens[1:].view(-1)

    loss = criterion(output, target_tokens)

    # Perform back-prop and calculate the gradient of our loss function
    loss.backward()

    # Update model parameters
    optimizer.step()

    epoch_loss += loss.item()
      
  return epoch_loss / len(iterator)

In [36]:
def evaluate(model, iterator, criterion):
  # Put the model in evaluation mode (disables dropout)
  model.eval()
  
  epoch_loss = 0
  
  # No gradients are needed for validation, so skip building the autograd graph
  with torch.no_grad():
    for idx, batch in notebook.tqdm(enumerate(iterator), total=len(iterator)):
      input_sequence = batch.rus
      output_sequence = batch.eng

      target_tokens = output_sequence[0]

      # Run the batch through our model
      output = model(input_sequence, output_sequence)

      # Drop the <sos> position and flatten for the loss function
      output = output[1:].view(-1, output.shape[-1])
      target_tokens = target_tokens[1:].view(-1)

      loss = criterion(output, target_tokens)

      epoch_loss += loss.item()
      
  return epoch_loss / len(iterator)

In [37]:
# function to compute time taken by an epoch (in mm:ss)
def epoch_time(start_time, end_time):
  elapsed_time = end_time - start_time
  elapsed_mins = int(elapsed_time / 60)
  elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
  return elapsed_mins, elapsed_secs
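
As a quick check of the helper, the sketch below repeats the function and applies it to two made-up timestamps that are 606.4 seconds apart:

```python
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

# Hypothetical timestamps in seconds, e.g. values returned by time.time()
mins, secs = epoch_time(100.0, 706.4)
print(f"{mins}m {secs}s")
```

The fractional part of the seconds is truncated, not rounded.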

In [39]:
N_EPOCHS = 10

best_valid_loss = float('inf')

# start model training
for epoch in range(N_EPOCHS):
    
  start_time = time.time()
  
  train_loss = train(model, train_iterator, criterion, optimizer)
  valid_loss = evaluate(model, valid_iterator, criterion)
  
  end_time = time.time()
  
  epoch_mins, epoch_secs = epoch_time(start_time, end_time)
  
  # compare validation loss
  if valid_loss < best_valid_loss:
    best_valid_loss = valid_loss
    torch.save(model.state_dict(), '../../../../../LargeData/Analytics_Vidhya/NLP_Deep/best_model_russian_gru_attention.pt')
  
  print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
  print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
  print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 10m 6s
	Train Loss: 1.704 | Train PPL:   5.496
	 Val. Loss: 1.591 |  Val. PPL:   4.910


Epoch: 02 | Time: 11m 2s
	Train Loss: 1.416 | Train PPL:   4.122
	 Val. Loss: 1.440 |  Val. PPL:   4.220


Epoch: 03 | Time: 10m 43s
	Train Loss: 1.257 | Train PPL:   3.516
	 Val. Loss: 1.365 |  Val. PPL:   3.916


Epoch: 04 | Time: 10m 59s
	Train Loss: 1.154 | Train PPL:   3.171
	 Val. Loss: 1.315 |  Val. PPL:   3.724


Epoch: 05 | Time: 10m 38s
	Train Loss: 1.079 | Train PPL:   2.941
	 Val. Loss: 1.288 |  Val. PPL:   3.624


Epoch: 06 | Time: 10m 54s
	Train Loss: 1.021 | Train PPL:   2.777
	 Val. Loss: 1.269 |  Val. PPL:   3.559


Epoch: 07 | Time: 10m 45s
	Train Loss: 0.977 | Train PPL:   2.656
	 Val. Loss: 1.258 |  Val. PPL:   3.517


Epoch: 08 | Time: 10m 41s
	Train Loss: 0.940 | Train PPL:   2.561
	 Val. Loss: 1.254 |  Val. PPL:   3.503


Epoch: 09 | Time: 10m 57s
	Train Loss: 0.910 | Train PPL:   2.483
	 Val. Loss: 1.248 |  Val. PPL:   3.484


Epoch: 10 | Time: 11m 3s
	Train Loss: 0.883 | Train PPL:   2.419
	 Val. Loss: 1.246 |  Val. PPL:   3.476
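
The PPL column is the exponential of the loss column, as in the `math.exp(...)` calls of the training loop. The sketch below recomputes it from the rounded losses printed above, so the last digit can differ slightly from the log, which used the unrounded values:

```python
import math

def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)

# Validation losses copied from epochs 1 and 10 of the log above
first_ppl = perplexity(1.591)
last_ppl = perplexity(1.246)
print(f"epoch 1 val ppl:  {first_ppl:.3f}")
print(f"epoch 10 val ppl: {last_ppl:.3f}")
```

A perplexity near 3.5 can be read loosely as the model being about as uncertain as if it were choosing uniformly among three or four tokens at each step.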


#7. Model Inference

In [40]:
# load saved model weights
path = '../../../../../LargeData/Analytics_Vidhya/NLP_Deep/best_model_russian_gru_attention.pt'
model.load_state_dict(torch.load(path))

<All keys matched successfully>

###7.1 Build Inference Function

In [41]:
def translate_sentence(model, sentence):
    model.eval()
    
    # tokenization
    tokenized = nlp_ru(sentence) 
    # convert tokens to lowercase
    tokenized = [t.lower_ for t in tokenized]
    # convert tokens to integers
    int_tokenized = [SRC.vocab.stoi[t] for t in tokenized] 
    
    # convert list to tensor
    sentence_length = torch.LongTensor([len(int_tokenized)]).to(model.device) 
    tensor = torch.LongTensor(int_tokenized).unsqueeze(1).to(model.device) 
    
    # get predictions
    translation_tensor_logits = model((tensor, sentence_length), None) 
    
    # get token index with highest score
    translation_tensor = torch.argmax(translation_tensor_logits.squeeze(1), 1)
    # convert indices (integers) to tokens
    translation = [TRG.vocab.itos[t] for t in translation_tensor]
 
    # Skip position 0: it is the placeholder slot for <sos>, not a predicted token
    translation = translation[1:]
    return " ".join(translation)
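
One detail of the lookup step is worth spelling out: words outside the 4000-token vocabulary are mapped to the `<unk>` index, which is why `<unk>` can show up in translations. A plain-Python sketch of that behaviour, assuming a toy vocabulary and a whitespace split in place of the spaCy tokenizer:

```python
from collections import defaultdict

# Toy vocabulary; index 0 is reserved for <unk>
itos = ["<unk>", "<pad>", "это", "новый"]
# Missing keys fall back to 0, i.e. the <unk> index
stoi = defaultdict(int, {tok: idx for idx, tok in enumerate(itos)})

def numericalize(sentence):
    """Lowercase, split on whitespace (a stand-in for the spaCy tokenizer), map to ids."""
    return [stoi[tok] for tok in sentence.lower().split()]

ids = numericalize("Это новый дом")
tokens = [itos[i] for i in ids]
print(ids, tokens)
```

Here the third word is not in the toy vocabulary, so it is replaced by `<unk>` before the model ever sees it.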

In [42]:
sentence = "это новый"
response = translate_sentence(model, sentence)
print(response)

is this new


The reference translation is *This is new*. The model produces the right words in a different order, so the output is close but not exact.

###7.2 Translate Russian Sentences in the Test Dataset

In [55]:
# read test file 
test_df = pd.read_csv('../../../../../LargeData/Analytics_Vidhya/NLP_Deep/nmt_test_translations.csv')

In [56]:
# attention based translations
attn_translations = [translate_sentence(model, sent) for sent in notebook.tqdm(test_df["rus"])]

In [57]:
test_df["attn_translations"] = attn_translations

In [58]:
# check translations
test_df.sample(20)

| Unnamed: 0 | rus | eng | attn_translations |
|---|---|---|---|
| 17443 | том забеспокоился | tom became concerned | tom is <unk> |
| 14588 | я хорошо знаю тома | i know tom well | i know tom well |
| 44496 | мы живём у моря | we live near the sea | we live near a park |
| 16875 | почему бы вам не уйти | why don't you leave | why do n't you leave |
| 8732 | вас нет в городе | aren't you in town | you 're not in town |
| 28807 | это дело меня не касается | the matter does not concern me | it 's not my concern |
| 9206 | мы немного опоздали | we've arrived a little late | we 're a little late |
| 23014 | он раздал все свои деньги | he gave away all his money | he was all his money |
| 8161 | мудрость приходит с возрастом | wisdom comes with age | the <unk> is sinking with <unk> |
| 28723 | где находится женский туалет | where is the ladies' room | where is the bathroom |


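These translations look improved: several match the reference exactly or are close paraphrases, while rare words still come out as <unk>.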

Things to try next:
- Try different vocabulary sizes for both languages
- Change batch_size from 64 to 128
- Replace GRU layers with LSTM layers
- Try different values of the learning rate (0.0001, 0.005, 0.05, etc.)