# 5 - Convolutional Sequence to Sequence Learning
In this notebook we'll be implementing the [Convolutional Sequence to Sequence Learning model](https://arxiv.org/abs/1705.03122). [[Tutorial](https://github.com/bentrevett/pytorch-seq2seq/blob/master/5%20-%20Convolutional%20Sequence%20to%20Sequence%20Learning.ipynb)]

This model is drastically different from the previous models in that it does not
use recurrent networks. It uses convolutional layers instead. A convolutional layer applies a set of filters,
each with a width (and, in the case of images, a height). A filter of width 3 sees
3 consecutive tokens at a time. Each conv layer has many such filters (1024 in this tutorial).
Each filter slides from the beginning of the sequence to the end, looking at 3 consecutive tokens at each step. The idea is that each of the 1024 filters learns to extract a different feature from the text. The extracted features are then used by the model, potentially as input to another conv layer. Stacked together, these layers extract features from the source sentence that are used to translate it into the target language.
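
To make the "sliding" concrete, here is a minimal sketch that lists the windows a width-3 filter would see over a toy sequence of 6 positions. The integers are stand-ins for tokens, not real vocabulary indices, and `Tensor.unfold` is used only to enumerate the windows (a real conv layer also multiplies each window by learned weights).

```python
import torch

# A toy "sentence" of 6 positions; the values stand in for tokens.
tokens = torch.arange(6)

# unfold(dimension, size, step) returns every window of `size` consecutive
# elements, moving `step` positions at a time.
windows = tokens.unfold(0, 3, 1)

print(windows.shape)  # torch.Size([4, 3])
print(windows)
```

A sequence of 6 positions yields 4 windows, which is why an unpadded convolution shortens the sequence.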

## Data Preparation

Mostly the same as in the previous tutorials. The only difference is that we now set `batch_first` to `True` so that SRC & TRG tokens come out as [batch_size, seq_len]. The previous tutorials used RNNs, and PyTorch RNN modules expect tensors of shape [seq_len, batch_size] by default. The conv layers used here expect the batch dimension first.
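
As a rough illustration of the shapes involved, the sketch below embeds a batch-first tensor of token indices and rearranges it into the [batch, channels, length] layout that `nn.Conv1d` expects. All sizes are made up for the example.

```python
import torch
import torch.nn as nn

batch_sz, seq_len, vocab_sz, emb_dim = 2, 5, 10, 8  # illustrative sizes only

# With batch_first=True a batch of token indices is [batch_sz, seq_len].
token_ids = torch.randint(0, vocab_sz, (batch_sz, seq_len))

embedded = nn.Embedding(vocab_sz, emb_dim)(token_ids)  # [batch_sz, seq_len, emb_dim]

# nn.Conv1d expects [batch_sz, channels, length], so swap the last two dims.
conv_ready = embedded.permute(0, 2, 1)                 # [batch_sz, emb_dim, seq_len]

conv = nn.Conv1d(in_channels=emb_dim, out_channels=16, kernel_size=3, padding=1)
out = conv(conv_ready)                                 # [batch_sz, 16, seq_len]
print(out.shape)
```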

In [2]:
sandbox_path = "models/"

# Pin torchtext to 0.6.0, which still provides Field and BucketIterator in torchtext.data
!pip install torchtext==0.6.0

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

from torchtext.datasets import Multi30k
from torchtext.data import Field, BucketIterator

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

import spacy
import numpy as np

import random
import math
import time

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

!python -m spacy download en
!python -m spacy download de

spacy_en = spacy.load('en')
spacy_de = spacy.load('de')
print(spacy_en)

def tokenize_de(text):
  "Tokenize a German language string"
  # Not reversing for this one
  return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
  return [tok.text for tok in spacy_en.tokenizer(text)]


foo = spacy_en.tokenizer("Hello. how do you do good, sir!")
print(type(foo))
print(foo)
[type(t.text) for t in foo]

SRC = Field(tokenize = tokenize_de, init_token='<sos>', 
            eos_token='<eos>', lower=True, batch_first = True)
TRG = Field(tokenize = tokenize_en, init_token='<sos>', 
            eos_token='<eos>', lower=True, batch_first = True)

train_data, valid_data, test_data = Multi30k.splits(exts = ('.de', '.en'),
                                                    fields = (SRC, TRG))
print(f"Number of training examples: {len(train_data.examples)}")
print(f"Number of validation examples: {len(valid_data.examples)}")
print(f"Number of testing examples: {len(test_data.examples)}")
print(vars(train_data.examples[0]))

SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(train_data, min_freq=2)
print(f"Unique token in SRC (de) vocab: {len(SRC.vocab)}")
print(f"Unique token in TRG (en) vocab: {len(TRG.vocab)}")

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

BATCH_SIZE = 128

train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data), batch_size=BATCH_SIZE, device=device)

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/google/home/ammarh/anaconda3/lib/python3.7/site-packages/en_core_web_sm
-->
/usr/local/google/home/ammarh/anaconda3/lib/python3.7/site-packages/spacy/data/en
You can now load the model via spacy.load('en')
✔ Download and installation successful
You can now load the model via spacy.load('de_core_news_sm')
✔ Linking successful
/usr/local/google/home/ammarh/anaconda3/lib/python3.7/site-packages/de_core_news_sm
-->
/usr/local/google/home/ammarh/anaconda3/lib/python3.7/site-packages/spacy/data/de
You can now load the model via spacy.load('de')
<spacy.lang.en.English object at 0x7fc0c2079bd0>
<class 'spacy.tokens.doc.Doc'>
Hello. how do you do good, sir!
Number of training examples: 29000
Number of validation examples: 1014
Number of testing examples: 1000
{'src': ['zwei', 'junge', 'weiße', 'männer', 'sind', 'im

## Encoder

Previously, the recurrent encoders compressed the entire input sequence into a single context vector, the last hidden state (the per-token states were only used for the attention computation). In the convolutional seq2seq model we instead get 2 context vectors per input token: the *conved vector* and the *combined vector*. The conved vector is the output of the last conv layer after being projected down to emb_size. The combined vector is the sum of the conved vector and the original word embedding.

First, we pass the token through an embedding layer (which is standard in NLP). However, since there is no recurrence (so the model has no built-in notion of token order), we use a second, positional embedding layer. The token and positional embeddings are elementwise summed to get the final embedding vector, which contains information about both the token and its position in the sequence.
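
The effect of adding the positional embedding can be sketched as follows: the same token placed at two different positions has identical token embeddings but different summed embeddings. The vocabulary size, embedding size and token indices here are arbitrary values chosen for the illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_sz, max_length, emb_dim = 20, 100, 4  # illustrative sizes only
tok_embedding = nn.Embedding(vocab_sz, emb_dim)
pos_embedding = nn.Embedding(max_length, emb_dim)

# The same token (index 7) appears at positions 0 and 2.
src = torch.tensor([[7, 3, 7]])                # [batch_sz=1, src_len=3]
pos = torch.arange(src.shape[1]).unsqueeze(0)  # [[0, 1, 2]]

tok_only = tok_embedding(src)                  # [1, 3, emb_dim]
embedded = tok_only + pos_embedding(pos)       # [1, 3, emb_dim]

# Token embeddings match, summed embeddings do not (positions differ).
same_tok = torch.equal(tok_only[0, 0], tok_only[0, 2])
same_emb = torch.equal(embedded[0, 0], embedded[0, 2])
print(same_tok, same_emb)
```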

This vector is then passed through a bunch of conv blocks (10 in this case) for the "magic" to happen. After passing through all the conv blocks, the vector is then fed through another linear layer to transform it back from hidden dim size to embedding dim size. This is our conved vector and we have one per token.

### Convolutional Block
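We pad the input sequence in order to maintain its length. Without padding, the output of the conv layer would be $filter\_size - 1$ elements shorter than the sequence entering it. Padding $(filter\_size - 1)/2$ elements on each side only works out evenly for odd-sized filters, which is why the encoder asserts an odd kernel size.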
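
A quick shape check of this, using small made-up sizes: an unpadded `nn.Conv1d` shortens the sequence by `kernel_size - 1`, while padding `(kernel_size - 1) // 2` on each side keeps the length unchanged.

```python
import torch
import torch.nn as nn

hid_dim, src_len, kernel_size = 8, 10, 3  # illustrative sizes only
x = torch.randn(1, hid_dim, src_len)      # [batch_sz, hid_dim, src_len]

no_pad = nn.Conv1d(hid_dim, hid_dim, kernel_size)
same_pad = nn.Conv1d(hid_dim, hid_dim, kernel_size,
                     padding=(kernel_size - 1) // 2)

no_pad_out = no_pad(x)
same_pad_out = same_pad(x)

print(no_pad_out.shape)    # torch.Size([1, 8, 8])
print(same_pad_out.shape)  # torch.Size([1, 8, 10])
```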

The filters are designed so that their output hidden dimension is twice the input hidden dimension (in computer vision terminology these hidden dims are called channels). We double the output hidden dim because we use a special activation function called the gated linear unit (GLU). GLUs have a gating mechanism (similar to LSTMs & GRUs) that halves the size of their input, whereas most activation functions keep the input dimension the same. The output vector from the GLU (which is now back to hid dim size) is then elementwise summed with the pre-conv vector (a residual connection), producing the output vector for a single conv block.

We stack these conv blocks: each subsequent block takes the output of the previous block and performs the same steps, though the blocks do not share parameters. The output of the last conv block is fed through a linear layer that projects from hid dim to emb dim to get the *conved vector*.
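
The halving done by the GLU can be checked directly. In this small sketch (sizes are arbitrary), `F.glu` splits the channel dimension in two and multiplies the first half by the sigmoid of the second half, which is reproduced by hand for comparison.

```python
import torch
import torch.nn.functional as F

hid_dim, src_len = 4, 6                        # illustrative sizes only
conved = torch.randn(1, 2 * hid_dim, src_len)  # conv output: [1, 2 * hid_dim, src_len]

gated = F.glu(conved, dim=1)                   # [1, hid_dim, src_len]

# Manual equivalent: first half gated by the sigmoid of the second half.
a, b = conved.chunk(2, dim=1)
manual = a * torch.sigmoid(b)

print(gated.shape)                     # torch.Size([1, 4, 6])
print(torch.allclose(gated, manual))   # True
```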

Note: 
- The scale variable is used by the authors to "ensure that the variance throughout the network does not change dramatically". The performance of the model seems to vary wildly using different seeds if this is not used.
- The positional embedding is initialized to have a "vocabulary" of 100. This means it can handle sequences up to 100 elements long, indexed from 0 to 99. This can be increased if used on a dataset with longer sequences.
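
One way to get a feel for the scale factor is the idealized case of two independent, unit-variance inputs: their sum has variance close to 2, and multiplying by $\sqrt{0.5}$ brings it back to about 1. This is only a sketch under that independence assumption; inside the network the two summed terms are not independent.

```python
import math
import torch

torch.manual_seed(0)
x = torch.randn(100_000)  # unit-variance samples
y = torch.randn(100_000)  # independent unit-variance samples

summed = x + y
scaled = summed * math.sqrt(0.5)

print(summed.var().item())  # close to 2
print(scaled.var().item())  # close to 1
```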



In [3]:
class Encoder(nn.Module):
  def __init__(self, input_dim, emb_dim, hid_dim, n_layers, kernel_size, 
               dropout_p, device, max_length = 100):
    super().__init__()  
    assert kernel_size % 2 == 1 # Ensure odd kernel size

    self.device = device

    self.scale = torch.sqrt(torch.FloatTensor([0.5])).to(device)

    self.tok_embedding = nn.Embedding(input_dim, emb_dim)
    self.pos_embedding = nn.Embedding(max_length, emb_dim) # positional embedding are learned as well.

    self.emb2hid = nn.Linear(emb_dim, hid_dim)
    self.hid2emb = nn.Linear(hid_dim, emb_dim)

    self.convs = nn.ModuleList([nn.Conv1d(in_channels = hid_dim,
                                          out_channels = 2*hid_dim,
                                          kernel_size = kernel_size,
                                          padding = (kernel_size - 1)//2)
                                for _ in range(n_layers)])
    self.dropout = nn.Dropout(dropout_p)

  def forward(self, src):
    # src: [batch_sz, src_len]
    batch_sz = src.shape[0]
    src_len = src.shape[1]

    # create position tensor
    pos = torch.arange(0, src_len).unsqueeze(0).repeat(batch_sz, 1).to(self.device) # [batch_sz, src_len]
    # pos = [0, 1, 2, ... , src_len-1]

    tok_embedded = self.tok_embedding(src)
    pos_embedded = self.pos_embedding(pos)

    embedded = self.dropout(tok_embedded + pos_embedded) # [batch_sz, src_len, emb_dim]

    # pass through linear layer to project to hid dim
    conv_input = self.emb2hid(embedded) # [batch_sz, src_len, hid_dim]

    # permute for convolving across the src length
    conv_input = conv_input.permute(0, 2, 1) # [batch_sz, hid_dim, src_len]
    
    # begin convolution blocks ...
    for i, conv in enumerate(self.convs):
      # pass through conv layer
      conved = conv(self.dropout(conv_input)) # [batch_sz, 2 * hid_dim, src_len]
      conved = F.glu(conved, dim = 1) # [batch_sz, hid_dim, src_len]
      # apply residual connection
      conved = (conved + conv_input) * self.scale # [batch_sz, hid_dim, src_len]
      conv_input = conved

    # permute and project back to emb dim
    conved = self.hid2emb(conved.permute(0,2,1)) # [batch_sz, src_len, emb_dim]

    combined = (conved + embedded) * self.scale # [batch_sz, src_len, emb_dim]

    return conved, combined



## Decoder

During training, the decoder predicts all tokens within the sentence in parallel. There is no sequential processing, i.e. no decoding loop. The decoder is similar to the encoder, with a few changes within the conv blocks & the main decoder model.

First the token embeddings do not have a residual connection that connects after the conv blocks to create a "combined vector" (as was the case for the encoder).

Second, to compute attention the "conved" (key) & "combined" (value) vectors from the encoder are used within each conv block of the decoder.

Lastly, the output of the decoder is passed through a linear layer from emb_dim to output_dim, which is used to predict the next word.

### Decoder Convolutional Block

Instead of padding equally on both sides to ensure sentence length stays the same, we only pad at the beginning of the sentence. This is a trick to make sure the decoder cannot cheat by being able to look at the next word. *Intuition*: I think this is a way to predict the last token of a filter window rather than the center one.  
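
The "no peeking" property can be verified with a small experiment: with left-only padding, changing the last input position leaves every earlier output position untouched. This is a sketch with made-up sizes, and it pads with zeros via `F.pad` for simplicity, which is an assumption of the example and not necessarily the padding value used in the decoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
hid_dim, trg_len, kernel_size = 4, 6, 3  # illustrative sizes only
conv = nn.Conv1d(hid_dim, hid_dim, kernel_size)

def causal_conv(x):
    # Pad kernel_size - 1 positions on the left only (zeros: sketch assumption).
    return conv(F.pad(x, (kernel_size - 1, 0)))

x = torch.randn(1, hid_dim, trg_len)
x_changed = x.clone()
x_changed[:, :, -1] += 1.0  # alter only the last position

out = causal_conv(x)                  # [1, hid_dim, trg_len]
out_changed = causal_conv(x_changed)  # [1, hid_dim, trg_len]

# Earlier outputs never saw the last position, so they are unchanged.
unchanged_prefix = torch.allclose(out[:, :, :-1], out_changed[:, :, :-1])
last_differs = not torch.allclose(out[:, :, -1], out_changed[:, :, -1])
print(unchanged_prefix, last_differs)
```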

After the GLU activation and before the residual connection, we calculate & apply attention. This uses the encoded representations ("conved" & "combined" vectors), the token embedding (only for the current word) and output from the conv layer (GLU activated).

Attention is calculated by first projecting the conv layer output from hid_dim to emb_dim using a linear layer. Then the token embedding is summed into it via a residual connection. This combination is then used as the "query" vector to see how much it "matches" against the "key" vector which is the "conved" vector from the encoder. This produces a weighted vector over the src_len that sums to 1. The weighted vector is then used to compute a weighted sum over the "value" vectors that are the "combined" vectors produced by the encoder. Why this separation between the "key" & "value" vectors? The paper argues that the encoded "conved" is good for getting a larger context over the encoded sequence, whereas the encoded "combined" has more information about the specific token (because the token embedding is summed in) and is therefore more useful for making a prediction. 
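
The core of this computation can be sketched in a few lines. Random tensors stand in for the real query, keys and values, the sizes are arbitrary, and the linear projections and scaling are left out so that only the key/value mechanics remain.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_sz, trg_len, src_len, emb_dim = 1, 3, 5, 8  # illustrative sizes only

query = torch.randn(batch_sz, trg_len, emb_dim)             # conv output + token embedding
encoder_conved = torch.randn(batch_sz, src_len, emb_dim)    # "key" vectors
encoder_combined = torch.randn(batch_sz, src_len, emb_dim)  # "value" vectors

# How well each target position matches each source position.
energy = torch.matmul(query, encoder_conved.permute(0, 2, 1))  # [batch_sz, trg_len, src_len]

# Normalize over the source length so each row sums to 1.
attention = F.softmax(energy, dim=2)

# Weighted sum of the value vectors.
attended = torch.matmul(attention, encoder_combined)           # [batch_sz, trg_len, emb_dim]

print(attention.sum(dim=2))  # every entry is (numerically) 1
```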

In [4]:
class Decoder(nn.Module):
  def __init__(self, output_dim, emb_dim, hid_dim, n_layers, kernel_size, 
               dropout_p, trg_pad_idx, device, max_length = 100):
    super().__init__()

    self.kernel_size = kernel_size
    self.trg_pad_idx = trg_pad_idx
    self.device = device

    self.scale = torch.sqrt(torch.FloatTensor([0.5])).to(device)

    self.tok_embedding = nn.Embedding(output_dim, emb_dim)
    self.pos_embedding = nn.Embedding(max_length, emb_dim)
    
    self.emb2hid = nn.Linear(emb_dim, hid_dim)
    self.hid2emb = nn.Linear(hid_dim, emb_dim)

    # Should the next 2 be layer specific or shared?? Try layer specific.
    self.attn_hid2emb = nn.Linear(hid_dim, emb_dim)
    self.attn_emb2hid = nn.Linear(emb_dim, hid_dim)

    self.fc_out = nn.Linear(emb_dim, output_dim)

    self.convs = nn.ModuleList([nn.Conv1d(in_channels = hid_dim,
                                          out_channels = 2*hid_dim,
                                          kernel_size = kernel_size)
                                for _ in range(n_layers)])
    
    self.dropout = nn.Dropout(dropout_p)

  def forward(self, trg, encoder_conved, encoder_combined):
    # trg: [batch_sz, trg_len]
    # encoder_conved & encoder_combined: [batch_sz, src_len, emb_dim]

    batch_sz = trg.shape[0]
    trg_len = trg.shape[1]

    # create a positional tensor
    pos = torch.arange(0, trg_len).unsqueeze(0)\
                      .repeat(batch_sz, 1).to(self.device) # [batch_sz, trg_len]

    tok_embedded = self.tok_embedding(trg) # [batch_sz, trg_len, emb_dim]
    pos_embedded = self.pos_embedding(pos) # [batch_sz, trg_len, emb_dim]

    embedded = self.dropout(tok_embedded + pos_embedded) # [batch_sz, trg_len, emb_dim]

    # project embedding to hid_dim size to pass through conv blocks
    conv_input = self.emb2hid(embedded) # [batch_sz, trg_len, hid_dim]

    # permute to convolve over the target length
    conv_input = conv_input.permute(0,2,1) # [batch_sz, hid_dim, trg_len]

    batch_sz = conv_input.shape[0]
    hid_dim = conv_input.shape[1]

    for i, conv in enumerate(self.convs):
      conv_input = self.dropout(conv_input) # [batch_sz, hid_dim, trg_len]
      # create padding to prefix the sequence. this prevents the decoder from looking "forward"
      padding = torch.zeros(batch_sz, hid_dim, self.kernel_size-1)\
                  .fill_(self.trg_pad_idx).to(self.device) # [batch_sz, hid_dim, kernel_size-1]

      padded_conv_input = torch.cat((padding, conv_input), 
                                    dim=2) # [batch_sz, hid_dim, trg_len + kernel_size-1]

      conved = conv(padded_conv_input) # [batch_sz, 2*hid_dim, trg_len]
      conved = F.glu(conved, dim=1) # [batch_sz, hid_dim, trg_len]

      attention, conved = self.calculate_attention(embedded, conved,
                                                   encoder_conved, encoder_combined)

      # attention: [batch_sz, trg_len, src_len]
      # conved: [batch_sz, hid_dim, trg_len]

      # apply a residual connection
      conved = (conved + conv_input) * self.scale # [batch_sz, hid_dim, trg_len]

      conv_input = conved

    # permute back 
    conved = conved.permute(0,2,1) # [batch_sz, trg_len, hid_dim]
    conved = self.hid2emb(conved) # [batch_sz, trg_len, emb_dim]

    output = self.fc_out(self.dropout(conved)) # [batch_sz, trg_len, output_dim]
    
    return output, attention

  def calculate_attention(self, embedded, conved, 
                          encoder_conved, encoder_combined):
    # embedded: [batch_sz, trg_len, emb_dim]
    # conved: [batch_sz, hid_dim, trg_len]
    # encoder_conved & encoder_combined: [batch_sz, src_len, emb_dim]

    # permute & project down to emb_dim
    conved_emb = self.attn_hid2emb(conved.permute(0,2,1)) # [batch_sz, trg_len, emb_dim]

    combined = (conved_emb + embedded) * self.scale # [batch_sz, trg_len, emb_dim]

    energy = torch.matmul(combined, encoder_conved.permute(0,2,1)) # [batch_sz, trg_len, src_len]

    attention = F.softmax(energy, dim=2) # [batch_sz, trg_len, src_len]

    attended_encoding = torch.matmul(attention, encoder_combined) # [batch_sz, trg_len, emb_dim]

    attended_encoding = self.attn_emb2hid(attended_encoding) # [batch_sz, trg_len, hid_dim]

    # apply residual connection 
    attended_combined = (conved + attended_encoding.permute(0,2,1)) * self.scale # [batch_sz,hid_dim, trg_len]

    return attention, attended_combined 



## Seq2Seq Model

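We do not pass the final `<eos>` token to the decoder, since there is no next token to predict after it. Encoding works as before, except that the encoder returns two context vectors per token, encoder_conved and encoder_combined. Since the decoder is all convolutions and runs in parallel over the target, there is no decoding loop during training. The padding at the beginning of the sequence inside each decoder conv block prevents it from peeking forward. This also means that we cannot choose when to apply "teacher forcing": the decoder always receives the ground-truth target tokens during training.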



In [5]:
class Seq2Seq(nn.Module):
  def __init__(self, encoder, decoder):
    super().__init__()

    self.encoder = encoder
    self.decoder = decoder

  def forward(self, src, trg):
    # src: [batch_sz, src_len]
    # trg: [batch_sz, trg_len - 1] # <eos> token chopped off the end

    encoder_conved, encoder_combined = self.encoder(src)
    # encoder_conved & encoder_combined: [batch_sz, src_len, emb_dim]

    # Calculate predictions of next words
    output, attention = self.decoder(trg, encoder_conved, encoder_combined)
    # output: [batch_sz, trg_len - 1, output_dim]
    # output is a batch of predictions for each word in the trg sentence
    # attention: [batch_sz, trg_len - 1, src_len]
    # attention is a batch of attention scores across the src sentence for each word in the trg sentence

    return output, attention

### Training the Seq2Seq Model

The rest of the training is similar to all the previous tutorials. In the paper they find that it is more beneficial to use a small filter (kernel of 3) and a high number of layers (5+).
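
One way to see what stacking buys is the receptive field, i.e. how many input tokens can influence a single output position. The helper below is a hypothetical illustration (not part of the notebook) using the standard formula for stacked stride-1, undilated convolutions.

```python
def receptive_field(n_layers, kernel_size):
    # Each stacked stride-1 conv layer widens the window by kernel_size - 1 tokens.
    return 1 + n_layers * (kernel_size - 1)

print(receptive_field(1, 3))   # 3
print(receptive_field(10, 3))  # 21
print(receptive_field(2, 11))  # 21
```

Ten layers of width 3 cover the same 21-token span as two layers of width 11, but with more nonlinearities in between.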

In [6]:
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
EMB_DIM = 256
HID_DIM = 512
ENC_LAYERS = 10 # number of conv blocks.
DEC_LAYERS = 10
ENC_KERNEL_SZ = 3 # must be odd
DEC_KERNEL_SZ = 3 # can be odd/even
DROPOUT_P = 0.25
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]

enc = Encoder(INPUT_DIM, EMB_DIM, HID_DIM, ENC_LAYERS, ENC_KERNEL_SZ,
              DROPOUT_P, device)
dec = Decoder(OUTPUT_DIM, EMB_DIM, HID_DIM, DEC_LAYERS, DEC_KERNEL_SZ,
              DROPOUT_P, TRG_PAD_IDX, device)

model = Seq2Seq(enc, dec).to(device)

# define a function that will calculate the number of trainable parameters in the model.
def count_parameters(model):
  return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Model has {count_parameters(model):,} trainable parameters.")

# function to tell us how long an epoch takes.
def epoch_time(start_time, end_time):
  elapsed_time = end_time - start_time
  elapsed_mins = int(elapsed_time / 60)
  elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
  return elapsed_mins, elapsed_secs 

optimizer = optim.Adam(model.parameters())
# The CrossEntropyLoss function calculates both the log softmax as well 
# as the negative log-likelihood of our predictions.
# Our loss function calculates the average loss per token, however by passing
# the index of the <pad> token as the ignore_index argument we ignore the loss 
# whenever the target token is a padding token.
criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX)

Model has 37,351,429 trainable parameters.


We strip off the `<eos>` token before passing `trg` into the decoder, and we hope the decoder will learn to predict `<eos>` itself. In the RNN-based models we achieved the same thing by stopping the decoder loop right before the `<eos>` token. The expected decoder output is $[y_1, y_2, y_3, eos]$, which we pass to the loss function along with $trg[:, 1:] = [x_1, x_2, x_3, eos]$ (that is, the target with the `<sos>` token stripped).
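
The two slices can be seen on a tiny hand-made batch. The index values for `<sos>`, `<eos>` and `<pad>` below are hypothetical and chosen only for this sketch.

```python
import torch

# Hypothetical vocabulary indices, for illustration only.
SOS, EOS, PAD = 2, 3, 1

trg = torch.tensor([[SOS, 10, 11, 12, EOS],
                    [SOS, 20, 21, EOS, PAD]])  # [batch_sz, trg_len]

decoder_input = trg[:, :-1]  # drops the last column
loss_target = trg[:, 1:]     # drops <sos>

print(decoder_input)
print(loss_target)
```

Note that for the shorter, padded sentence the slice removes a `<pad>` rather than the `<eos>`. The extra position has `<pad>` as its target, so it is skipped by a loss that ignores the padding index.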


In [7]:
def train(model, iterator, optimizer, criterion, clip):
  model.train()
  epoch_loss = 0
  for i, batch in enumerate(iterator):
    src = batch.src
    trg = batch.trg
    optimizer.zero_grad()
    output, _ = model(src, trg[:,:-1]) # -1 to chop off <eos>
    # output: [batch_sz, trg_len - 1, out_dim]

    output_dim = output.shape[-1]

    # Flatten out across the batch to compute loss (possibly helps with parallelization?)
    output = output.contiguous().view(-1, output_dim) # [batch_sz * (trg_len - 1), out_dim]
    trg = trg[:,1:].contiguous().view(-1) # [batch_sz * (trg_len - 1)] # 1: to chop off <sos>

    loss = criterion(output, trg)

    loss.backward()

    torch.nn.utils.clip_grad_norm_(model.parameters(), clip)

    optimizer.step()

    epoch_loss += loss.item()

  return epoch_loss/len(iterator)

# Evaluation loop is same as training, just without gradients
def evaluate(model, iterator, criterion):
  model.eval()
  epoch_loss = 0
  with torch.no_grad():
    for i, batch in enumerate(iterator):
      src = batch.src
      trg = batch.trg
      output, _ = model(src, trg[:,:-1])
      output_dim = output.shape[-1]

      output = output.contiguous().view(-1, output_dim) # [batch_sz * (trg_len - 1), out_dim]
      trg = trg[:,1:].contiguous().view(-1) # [batch_sz * (trg_len - 1)] # 1: to chop off <sos>

      loss = criterion(output, trg)

      epoch_loss += loss.item()

  return epoch_loss/len(iterator)

N_EPOCHS = 10
CLIP = 0.1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
  start_time = time.time()

  train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
  valid_loss = evaluate(model, valid_iterator, criterion)

  end_time = time.time()

  epoch_mins, epoch_secs = epoch_time(start_time, end_time)

  if valid_loss < best_valid_loss:
    best_valid_loss = valid_loss
    torch.save(model.state_dict(), sandbox_path + 'tut5-model.pt')

  print(f"Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s") 
  print(f"\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}")
  print(f"\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}")

Epoch: 01 | Time: 1m 42s
	Train Loss: 4.422 | Train PPL:  83.236
	 Val. Loss: 3.020 |  Val. PPL:  20.492
Epoch: 02 | Time: 1m 42s
	Train Loss: 3.057 | Train PPL:  21.271
	 Val. Loss: 2.356 |  Val. PPL:  10.552
Epoch: 03 | Time: 1m 42s
	Train Loss: 2.620 | Train PPL:  13.738
	 Val. Loss: 2.107 |  Val. PPL:   8.225


KeyboardInterrupt: 

In [None]:
# Load the model with the best validation loss and run it on the test set
model.load_state_dict(torch.load(sandbox_path + 'tut5-model.pt'))

test_loss = evaluate(model, test_iterator, criterion)
print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')

## Inference

Decoding is done sequentially in a loop during inference.

Steps:
- Tokenize the source sentence; add `<sos>` at the start and `<eos>` at the end.
- Numericalize, convert to a tensor & add a batch_dim.
- Feed src_sent to encoder.
- Initialize a list to hold the output sentence with a <sos> token.
- While we have not hit max length:
  - Convert current output sentence prediction into a tensor with a batch_dim.
  - Pass to decoder with 2 encoder context outputs and get next out prediction.
  - Add prediction to current output sentence.
  - Break if prediction was an <eos>.



In [None]:
def translate_sentence(sentence, src_field, trg_field, model, device, max_len=50):
  model.eval()

  if isinstance(sentence, str):
    nlp = spacy.load('de')
    tokens = [token.text.lower() for token in nlp(sentence)]
  else:
    tokens = [token.lower() for token in sentence]

  tokens = [src_field.init_token] + tokens + [src_field.eos_token]
  src_indexes = [src_field.vocab.stoi[token] for token in tokens]
  src_tensor = torch.LongTensor(src_indexes).unsqueeze(0).to(device)

  with torch.no_grad():
    encoder_conved, encoder_combined = model.encoder(src_tensor)

  trg_indexes = [trg_field.vocab.stoi[trg_field.init_token]]

  for i in range(max_len):
    trg_tensor = torch.LongTensor(trg_indexes).unsqueeze(0).to(device)
    with torch.no_grad():
      output, attention = model.decoder(trg_tensor, encoder_conved, encoder_combined)

    pred_token = output.argmax(2)[:,-1].item()

    trg_indexes.append(pred_token)

    if pred_token == trg_field.vocab.stoi[trg_field.eos_token]:
      break

  trg_tokens = [trg_field.vocab.itos[i] for i in trg_indexes]

  return trg_tokens[1:], attention

def display_attention(sentence, translation, attention):
  fig = plt.figure(figsize=(10,10))
  ax = fig.add_subplot(111)

  # attention: [batch_sz, trg_len - 1, src_len]
  # assumes batch sz is 1 so squeeze it out.
  attention = attention.squeeze(0).cpu().detach().numpy()

  cax = ax.matshow(attention, cmap='bone')

  ax.tick_params(labelsize=15)
  ax.set_xticklabels([''] + ['<sos>'] + [t.lower() for t in sentence] + ['<eos>'],
                     rotation=45)
  ax.set_yticklabels([''] + translation)

  ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
  ax.yaxis.set_major_locator(ticker.MultipleLocator(1))
      
  plt.show()
  plt.close()
    

In [None]:
example_idx = 2
src = vars(train_data.examples[example_idx])['src']
trg = vars(train_data.examples[example_idx])['trg']

print(f'src = {src}')
print(f'trg = {trg}')

translation, attention = translate_sentence(src, SRC, TRG, model, device)
print(f'predicted trg = {translation}')

display_attention(src, translation, attention)


In [None]:

example_idx = 2

src = vars(valid_data.examples[example_idx])['src']
trg = vars(valid_data.examples[example_idx])['trg']

print(f'src = {src}')
print(f'trg = {trg}')
translation, attention = translate_sentence(src, SRC, TRG, model, device)
print(f'predicted trg = {translation}')

display_attention(src, translation, attention)

## BLEU


In [None]:
from torchtext.data.metrics import bleu_score

def calculate_bleu(data, src_field, trg_field, model, device, max_len=50):
  trgs = []
  pred_trgs = []

  for datum in data:
    src = vars(datum)['src']
    trg = vars(datum)['trg']

    pred_trg, _ = translate_sentence(src, src_field, trg_field, model, device, max_len)
    # cut off <eos>
    pred_trg = pred_trg[:-1]

    pred_trgs.append(pred_trg)
    trgs.append([trg])

  return bleu_score(pred_trgs, trgs)

# Use a name other than bleu_score so the imported function is not overwritten
bleu = calculate_bleu(test_data, SRC, TRG, model, device)

print(f"BLEU score = {bleu*100:.2f}")