<a href="https://colab.research.google.com/github/abdur75648/Seq2Seq-NMT-PyTorch/blob/main/NMT_Transformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Language Translation with ``nn.Transformer`` and torchtext

This tutorial shows how to:
*   Train a translation model from scratch using a Transformer.
*   Use the torchtext library to access the [Multi30k](http://www.statmt.org/wmt16/multimodal-task.html#task1) dataset and train a *German to English* translation model.


Make sure to install the dependencies.

In [1]:
!pip install torch==2.1.0+cu121
!pip install torchtext==0.16.0
!pip install torchdata==0.7.0
!pip install "portalocker>=2.0.0"
!pip install spacy==3.6.0
!python -m spacy download en_core_web_sm
!python -m spacy download de_core_news_sm


**Now restart the session, because new libraries were installed.**

## Data Sourcing and Processing

The [torchtext library](https://pytorch.org/text/stable/) has utilities for creating datasets that can be easily
iterated through for the purposes of creating a language translation
model. In this example, we show how to use torchtext's built-in datasets,
tokenize a raw text sentence, build a vocabulary, and numericalize tokens into tensors. We will use the
[Multi30k dataset from the torchtext library](https://pytorch.org/text/stable/datasets.html#multi30k),
which yields pairs of source-target raw sentences.

To access torchtext datasets, please install torchdata following instructions at https://github.com/pytorch/data.




In [1]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.datasets import multi30k, Multi30k
from typing import Iterable, List


# We need to modify the URLs for the dataset since the links to the original dataset are broken
# Refer to https://github.com/pytorch/text/issues/1756#issuecomment-1163664163 for more info
multi30k.URL["train"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz"
multi30k.URL["valid"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz"
multi30k.URL["test"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/mmt16_task1_test.tar.gz"

data_train_iterator = Multi30k(split='train')
data_val_iterator = Multi30k(split='valid')
data_test_iterator = Multi30k(split='test')

SRC_LANGUAGE = 'de' # German
TGT_LANGUAGE = 'en' # English

# Place-holders
token_transform = {}
vocab_transform = {}

train_dataset = list(data_train_iterator)
print(f"Length of the train dataset: {len(train_dataset)}")
# Print the first 5 samples
for i in range(5):
    src_sentence, tgt_sentence = train_dataset[i]
    print(f"Sample {i+1}:")
    print(f"Source Sentence: {src_sentence}")
    print(f"Target Sentence: {tgt_sentence}")
    print()

Length of the train dataset: 29001
Sample 1:
Source Sentence: Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.
Target Sentence: Two young, White males are outside near many bushes.

Sample 2:
Source Sentence: Mehrere Männer mit Schutzhelmen bedienen ein Antriebsradsystem.
Target Sentence: Several men in hard hats are operating a giant pulley system.

Sample 3:
Source Sentence: Ein kleines Mädchen klettert in ein Spielhaus aus Holz.
Target Sentence: A little girl climbing into a wooden playhouse.

Sample 4:
Source Sentence: Ein Mann in einem blauen Hemd steht auf einer Leiter und putzt ein Fenster.
Target Sentence: A man in a blue shirt is standing on a ladder cleaning a window.

Sample 5:
Source Sentence: Zwei Männer stehen am Herd und bereiten Essen zu.
Target Sentence: Two men are at the stove preparing food.



In [3]:
token_transform[SRC_LANGUAGE] = get_tokenizer('spacy', language='de_core_news_sm')
token_transform[TGT_LANGUAGE] = get_tokenizer('spacy', language='en_core_web_sm')


# helper generator that yields one list of tokens per sentence
def yield_tokens(data_iter: Iterable, language: str) -> Iterable[List[str]]:
    language_index = {SRC_LANGUAGE: 0, TGT_LANGUAGE: 1}

    for data_sample in data_iter:
        yield token_transform[language](data_sample[language_index[language]])

# Define special symbols and indices
UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX = 0, 1, 2, 3
# Make sure the tokens are in order of their indices to properly insert them in vocab
special_symbols = ['<unk>', '<pad>', '<bos>', '<eos>']

for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    # Training data Iterator
    train_iter = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    # Create torchtext's Vocab object
    vocab_transform[ln] = build_vocab_from_iterator(yield_tokens(train_iter, ln),
                                                    min_freq=2,
                                                    specials=special_symbols,
                                                    special_first=True)

# Set ``UNK_IDX`` as the default index. This index is returned when the token is not found.
# If not set, it throws ``RuntimeError`` when the queried token is not found in the Vocabulary.
for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    vocab_transform[ln].set_default_index(UNK_IDX)


# Print vocab length & first 10 tokens
for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    print(f"Length of {ln} vocabulary: {len(vocab_transform[ln])}")
    print(f"First 10 tokens in {ln} vocabulary: {list(vocab_transform[ln].get_itos())[:10]}")

Length of de vocabulary: 8014
First 10 tokens in de vocabulary: ['<unk>', '<pad>', '<bos>', '<eos>', '.', 'Ein', 'einem', 'in', ',', 'und']
Length of en vocabulary: 6191
First 10 tokens in en vocabulary: ['<unk>', '<pad>', '<bos>', '<eos>', 'a', '.', 'A', 'in', 'the', 'on']
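
To make the roles of `min_freq`, `specials`, and the default index more concrete, here is a framework-free sketch of the same idea in plain Python. `build_toy_vocab` and `numericalize` are hypothetical helpers written for illustration, not torchtext APIs. Whitespace splitting stands in for the spaCy tokenizers, and the ordering of equally frequent tokens is an assumption that may differ from torchtext.

```python
from collections import Counter
from typing import Dict, Iterable, List

SPECIALS = ['<unk>', '<pad>', '<bos>', '<eos>']  # same order as special_symbols above
UNK = 0  # same value as UNK_IDX above


def build_toy_vocab(token_lists: Iterable[List[str]], min_freq: int = 2) -> Dict[str, int]:
    """Map tokens to indices: specials first, then tokens seen at least min_freq times."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Sort by descending frequency; ties are broken alphabetically (an assumption)
    kept = sorted((t for t, c in counts.items() if c >= min_freq),
                  key=lambda t: (-counts[t], t))
    return {tok: idx for idx, tok in enumerate(SPECIALS + kept)}


def numericalize(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Look up each token, falling back to the <unk> index for unknown tokens."""
    return [vocab.get(tok, UNK) for tok in tokens]


corpus = ["ein Mann steht", "ein Hund steht", "ein Mann läuft"]
toy_vocab = build_toy_vocab(s.split() for s in corpus)
ids = numericalize("ein Hund läuft".split(), toy_vocab)
print(toy_vocab)
print(ids)
```

"Hund" and "läuft" each occur only once, so they fall below `min_freq=2` and map to the `<unk>` index, which is the behaviour `set_default_index(UNK_IDX)` provides in the real vocabulary.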


## Seq2Seq Network using Transformer

The Transformer is a Seq2Seq model introduced in the [“Attention is all you
need”](https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf)
paper for solving machine translation tasks.
Below, we will create a Seq2Seq network that uses a Transformer. The network
consists of three parts. The first part is the embedding layer, which converts a tensor of input indices
into the corresponding tensor of input embeddings. These embeddings are further augmented with positional
encodings to give the model position information about the input tokens. The second part is the
actual [Transformer](https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html) model.
Finally, the output of the Transformer model is passed through a linear layer
that gives unnormalized scores (logits) for each token in the target-language vocabulary.
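
As a rough illustration of the positional encodings mentioned above, the sketch below computes the sinusoidal table in plain Python, using the same formula as the `PositionalEncoding` module in this notebook: even columns hold `sin(pos * den)` and odd columns hold `cos(pos * den)`, with `den = exp(-i * ln(10000) / emb_size)`. The function name `sinusoidal_encoding` is hypothetical, and the tiny sizes are chosen only so the output is easy to read.

```python
import math
from typing import List


def sinusoidal_encoding(maxlen: int, emb_size: int) -> List[List[float]]:
    """Return a maxlen x emb_size table; even columns use sin, odd columns use cos."""
    table = [[0.0] * emb_size for _ in range(maxlen)]
    for pos in range(maxlen):
        for i in range(0, emb_size, 2):
            # Same frequency term as ``den`` in PositionalEncoding
            angle = pos * math.exp(-i * math.log(10000) / emb_size)
            table[pos][i] = math.sin(angle)
            if i + 1 < emb_size:
                table[pos][i + 1] = math.cos(angle)
    return table


pe = sinusoidal_encoding(maxlen=4, emb_size=4)
for row in pe:
    print([round(v, 4) for v in row])
```

Each position gets a distinct vector, and the later column pairs oscillate more slowly, which is what lets the model tell positions apart after the table is added to the token embeddings.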




In [4]:
from torch import Tensor
import torch
import torch.nn as nn
from torch.nn import Transformer
import math
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# helper Module that adds positional encoding to the token embedding to introduce a notion of word order.
class PositionalEncoding(nn.Module):
    def __init__(self,
                 emb_size: int,
                 dropout: float,
                 maxlen: int = 5000):
        super(PositionalEncoding, self).__init__()
        den = torch.exp(- torch.arange(0, emb_size, 2)* math.log(10000) / emb_size)
        pos = torch.arange(0, maxlen).reshape(maxlen, 1)
        pos_embedding = torch.zeros((maxlen, emb_size))
        pos_embedding[:, 0::2] = torch.sin(pos * den)
        pos_embedding[:, 1::2] = torch.cos(pos * den)
        pos_embedding = pos_embedding.unsqueeze(-2)

        self.dropout = nn.Dropout(dropout)
        self.register_buffer('pos_embedding', pos_embedding)

    def forward(self, token_embedding: Tensor):
        return self.dropout(token_embedding + self.pos_embedding[:token_embedding.size(0), :])

# helper Module to convert tensor of input indices into corresponding tensor of token embeddings
class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size: int, emb_size):
        super(TokenEmbedding, self).__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.emb_size = emb_size

    def forward(self, tokens: Tensor):
        return self.embedding(tokens.long()) * math.sqrt(self.emb_size)

# Seq2Seq Network
class Seq2SeqTransformer(nn.Module):
    def __init__(self,
                 num_encoder_layers: int,
                 num_decoder_layers: int,
                 emb_size: int,
                 nhead: int,
                 src_vocab_size: int,
                 tgt_vocab_size: int,
                 dim_feedforward: int = 512,
                 dropout: float = 0.1):
        super(Seq2SeqTransformer, self).__init__()
        self.transformer = Transformer(d_model=emb_size,
                                       nhead=nhead,
                                       num_encoder_layers=num_encoder_layers,
                                       num_decoder_layers=num_decoder_layers,
                                       dim_feedforward=dim_feedforward,
                                       dropout=dropout)
        self.generator = nn.Linear(emb_size, tgt_vocab_size)
        self.src_tok_emb = TokenEmbedding(src_vocab_size, emb_size)
        self.tgt_tok_emb = TokenEmbedding(tgt_vocab_size, emb_size)
        self.positional_encoding = PositionalEncoding(
            emb_size, dropout=dropout)

    def forward(self,
                src: Tensor,
                trg: Tensor,
                src_mask: Tensor,
                tgt_mask: Tensor,
                src_padding_mask: Tensor,
                tgt_padding_mask: Tensor,
                memory_key_padding_mask: Tensor):
        src_emb = self.positional_encoding(self.src_tok_emb(src))
        tgt_emb = self.positional_encoding(self.tgt_tok_emb(trg))
        outs = self.transformer(src_emb, tgt_emb, src_mask, tgt_mask, None,
                                src_padding_mask, tgt_padding_mask, memory_key_padding_mask)
        return self.generator(outs)

    def encode(self, src: Tensor, src_mask: Tensor):
        return self.transformer.encoder(self.positional_encoding(
                            self.src_tok_emb(src)), src_mask)

    def decode(self, tgt: Tensor, memory: Tensor, tgt_mask: Tensor):
        return self.transformer.decoder(self.positional_encoding(
                          self.tgt_tok_emb(tgt)), memory,
                          tgt_mask)

During training, we need a subsequent-word mask that prevents the model from looking at
future words when making predictions. We also need masks to hide
source and target padding tokens. Below, we define functions that take care of both.




In [5]:
def generate_square_subsequent_mask(sz):
    mask = (torch.triu(torch.ones((sz, sz), device=DEVICE)) == 1).transpose(0, 1)
    mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
    return mask


def create_mask(src, tgt):
    src_seq_len = src.shape[0]
    tgt_seq_len = tgt.shape[0]

    tgt_mask = generate_square_subsequent_mask(tgt_seq_len)
    src_mask = torch.zeros((src_seq_len, src_seq_len),device=DEVICE).type(torch.bool)

    src_padding_mask = (src == PAD_IDX).transpose(0, 1)
    tgt_padding_mask = (tgt == PAD_IDX).transpose(0, 1)
    return src_mask, tgt_mask, src_padding_mask, tgt_padding_mask
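
To see what these masks contain, here is a small plain-Python sketch under simplified assumptions: nested lists instead of tensors, and batch-first input for the padding mask. `causal_mask` and `padding_mask` are hypothetical names used only for this illustration.

```python
from typing import List

NEG_INF = float('-inf')
PAD = 1  # same value as PAD_IDX above


def causal_mask(sz: int) -> List[List[float]]:
    """Row i may attend to columns 0..i (0.0); later columns are blocked (-inf)."""
    return [[0.0 if col <= row else NEG_INF for col in range(sz)] for row in range(sz)]


def padding_mask(batch_first_ids: List[List[int]]) -> List[List[bool]]:
    """True marks a <pad> position that attention should ignore."""
    return [[tok == PAD for tok in seq] for seq in batch_first_ids]


for row in causal_mask(3):
    print(row)
print(padding_mask([[2, 7, 3], [2, 3, 1]]))
```

The causal mask is added to the attention scores, so `-inf` entries vanish after the softmax, while the boolean padding mask flags positions to skip entirely.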

Let's now define the hyperparameters of our model and instantiate it. Below, we also
define the loss function (cross-entropy) and the optimizer used for training.




In [6]:
torch.manual_seed(0)

SRC_VOCAB_SIZE = len(vocab_transform[SRC_LANGUAGE])
TGT_VOCAB_SIZE = len(vocab_transform[TGT_LANGUAGE])
EMB_SIZE = 512
NHEAD = 8
FFN_HID_DIM = 512
BATCH_SIZE = 128
NUM_ENCODER_LAYERS = 3
NUM_DECODER_LAYERS = 3

transformer = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, EMB_SIZE,
                                 NHEAD, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, FFN_HID_DIM)

for p in transformer.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

transformer = transformer.to(DEVICE)

loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)

optimizer = torch.optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)



## Collation

As seen in the ``Data Sourcing and Processing`` section, our data iterator yields a pair of raw strings.
We need to convert these string pairs into the batched tensors that can be processed by our ``Seq2Seq`` network
defined previously. Below we define our collate function that converts a batch of raw strings into batch tensors that
can be fed directly into our model.




In [7]:
from torch.nn.utils.rnn import pad_sequence

# helper function to club together sequential operations
def sequential_transforms(*transforms):
    def func(txt_input):
        for transform in transforms:
            txt_input = transform(txt_input)
        return txt_input
    return func

# function to add BOS/EOS and create tensor for input sequence indices
def tensor_transform(token_ids: List[int]):
    return torch.cat((torch.tensor([BOS_IDX]),
                      torch.tensor(token_ids),
                      torch.tensor([EOS_IDX])))

# ``src`` and ``tgt`` language text transforms to convert raw strings into tensors indices
text_transform = {}
for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    text_transform[ln] = sequential_transforms(token_transform[ln], #Tokenization
                                               vocab_transform[ln], #Numericalization
                                               tensor_transform) # Add BOS/EOS and create tensor


# function to collate data samples into batch tensors
def collate_fn(batch):
    src_batch, tgt_batch = [], []
    for src_sample, tgt_sample in batch:
        src_batch.append(text_transform[SRC_LANGUAGE](src_sample.rstrip("\n")))
        tgt_batch.append(text_transform[TGT_LANGUAGE](tgt_sample.rstrip("\n")))

    src_batch = pad_sequence(src_batch, padding_value=PAD_IDX)
    tgt_batch = pad_sequence(tgt_batch, padding_value=PAD_IDX)
    return src_batch, tgt_batch
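
The shape of the collated batch matters later on: `pad_sequence` with its default `batch_first=False` returns a tensor laid out as (sequence length, batch size). The plain-Python sketch below mimics that layout with nested lists. `add_bos_eos` and `pad_time_major` are hypothetical helpers for illustration only.

```python
from typing import List

PAD, BOS, EOS = 1, 2, 3  # same values as PAD_IDX, BOS_IDX, EOS_IDX above


def add_bos_eos(token_ids: List[int]) -> List[int]:
    """Wrap a sequence of token ids with the <bos> and <eos> markers."""
    return [BOS] + token_ids + [EOS]


def pad_time_major(sequences: List[List[int]]) -> List[List[int]]:
    """Pad to the longest sequence; rows are time steps, columns are samples."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [PAD] * (max_len - len(seq)) for seq in sequences]
    # Transpose (batch, time) -> (time, batch)
    return [list(step) for step in zip(*padded)]


batch = pad_time_major([add_bos_eos([10, 11, 12]), add_bos_eos([20])])
for step in batch:
    print(step)
```

Row 0 holds the `<bos>` token of every sample, so a single sentence is a column of the batch, not a row.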

Let's define the training and evaluation loops that will be called for each
epoch.




In [32]:
from torch.utils.data import DataLoader
from tqdm import tqdm

def train_epoch(model, optimizer):
    model.train()
    losses = 0
    num_batches = 0
    train_iter = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    train_dataloader = DataLoader(train_iter, batch_size=BATCH_SIZE, collate_fn=collate_fn)

    for src, tgt in tqdm(train_dataloader):
        src = src.to(DEVICE)
        tgt = tgt.to(DEVICE)

        # Decoder input: every target token except the last
        tgt_input = tgt[:-1, :]

        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)

        logits = model(src, tgt_input, src_mask, tgt_mask, src_padding_mask, tgt_padding_mask, src_padding_mask)

        optimizer.zero_grad()

        # Expected output: every target token except the first
        tgt_out = tgt[1:, :]
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        loss.backward()

        optimizer.step()
        losses += loss.item()
        num_batches += 1

    # Average over the batches seen, without iterating the dataloader a second time just to count
    return losses / num_batches


def evaluate(model):
    model.eval()
    losses = 0
    num_batches = 0

    val_iter = Multi30k(split='valid', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    val_dataloader = DataLoader(val_iter, batch_size=BATCH_SIZE, collate_fn=collate_fn)

    # Gradients are not needed for evaluation; disabling them saves memory and time
    with torch.no_grad():
        for src, tgt in val_dataloader:
            src = src.to(DEVICE)
            tgt = tgt.to(DEVICE)

            tgt_input = tgt[:-1, :]

            src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)

            logits = model(src, tgt_input, src_mask, tgt_mask, src_padding_mask, tgt_padding_mask, src_padding_mask)

            tgt_out = tgt[1:, :]
            loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
            losses += loss.item()
            num_batches += 1

    return losses / num_batches
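
Both loops slice the target as `tgt[:-1]` for the decoder input and `tgt[1:]` for the expected output, a setup commonly called teacher forcing. A minimal sketch on a single sequence, using a plain list and made-up token ids (40, 41, 42), shows how the two slices line up. `shift_for_teacher_forcing` is a hypothetical helper name.

```python
from typing import List, Tuple

BOS, EOS = 2, 3  # same values as BOS_IDX and EOS_IDX above


def shift_for_teacher_forcing(tgt: List[int]) -> Tuple[List[int], List[int]]:
    """Decoder input drops the last token; the expected output drops the first."""
    return tgt[:-1], tgt[1:]


tgt_input, tgt_out = shift_for_teacher_forcing([BOS, 40, 41, 42, EOS])
for seen, expected in zip(tgt_input, tgt_out):
    print(f"decoder sees {seen:>2} -> should predict {expected}")
```

At every position the decoder is given the true previous token and is trained to predict the next one, ending with `<eos>`.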

Now we have all the ingredients to train our model. Let's do it!




In [16]:
from timeit import default_timer as timer
NUM_EPOCHS = 20

for epoch in range(1, NUM_EPOCHS+1):
    start_time = timer()
    train_loss = train_epoch(transformer, optimizer)
    end_time = timer()
    val_loss = evaluate(transformer)
    print(f"Epoch: {epoch}, Train loss: {train_loss:.3f}, Val loss: {val_loss:.3f}, Epoch time = {(end_time - start_time):.3f}s")


Epoch: 1, Train loss: 3.293, Val loss: 2.845, Epoch time = 41.608s
Epoch: 2, Train loss: 2.830, Val loss: 2.514, Epoch time = 41.221s
Epoch: 3, Train loss: 2.514, Val loss: 2.289, Epoch time = 42.268s
Epoch: 4, Train loss: 2.279, Val loss: 2.120, Epoch time = 42.462s
Epoch: 5, Train loss: 2.088, Val loss: 1.990, Epoch time = 42.524s
Epoch: 6, Train loss: 1.928, Val loss: 1.888, Epoch time = 41.136s
Epoch: 7, Train loss: 1.794, Val loss: 1.813, Epoch time = 42.177s
Epoch: 8, Train loss: 1.673, Val loss: 1.755, Epoch time = 43.136s
Epoch: 9, Train loss: 1.567, Val loss: 1.717, Epoch time = 41.571s
Epoch: 10, Train loss: 1.475, Val loss: 1.685, Epoch time = 42.502s
Epoch: 11, Train loss: 1.391, Val loss: 1.657, Epoch time = 42.002s
Epoch: 12, Train loss: 1.314, Val loss: 1.643, Epoch time = 43.261s
Epoch: 13, Train loss: 1.244, Val loss: 1.626, Epoch time = 44.719s
Epoch: 14, Train loss: 1.182, Val loss: 1.612, Epoch time = 41.756s
Epoch: 15, Train loss: 1.121, Val loss: 1.607, Epoch time = 43.726s
Epoch: 16, Train loss: 1.066, Val loss: 1.608, Epoch time = 42.129s
Epoch: 17, Train loss: 1.011, Val loss: 1.610, Epoch time = 41.412s
Epoch: 18, Train loss: 0.959, Val loss: 1.617, Epoch time = 42.220s
Epoch: 19, Train loss: 0.911, Val loss: 1.627, Epoch time = 42.382s
Epoch: 20, Train loss: 0.866, Val loss: 1.642, Epoch time = 41.618s
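
In the log above, the training loss keeps falling while the validation loss stops improving partway through, which often indicates the onset of overfitting. As a small sketch, the snippet below picks the epoch with the lowest validation loss from the logged values. `best_epoch` is a hypothetical helper, and saving a checkpoint at that epoch is not something this notebook does.

```python
from typing import List, Tuple

# Validation losses copied from the training log above, epochs 1..20
val_losses = [2.845, 2.514, 2.289, 2.120, 1.990, 1.888, 1.813, 1.755, 1.717, 1.685,
              1.657, 1.643, 1.626, 1.612, 1.607, 1.608, 1.610, 1.617, 1.627, 1.642]


def best_epoch(losses: List[float]) -> Tuple[int, float]:
    """Return the 1-based epoch with the lowest loss, together with that loss."""
    index = min(range(len(losses)), key=losses.__getitem__)
    return index + 1, losses[index]


epoch, loss = best_epoch(val_losses)
print(f"Lowest validation loss {loss:.3f} at epoch {epoch}")
```

One option, under these assumptions, would be to stop training or keep the model weights from around that epoch rather than from epoch 20.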


In [24]:
# function to generate output sequence using greedy algorithm
def greedy_decode(model, src, src_mask, max_len, start_symbol):
    src = src.to(DEVICE)
    src_mask = src_mask.to(DEVICE)

    memory = model.encode(src, src_mask)
    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(DEVICE)
    for i in range(max_len-1):
        memory = memory.to(DEVICE)
        tgt_mask = (generate_square_subsequent_mask(ys.size(0))
                    .type(torch.bool)).to(DEVICE)
        out = model.decode(ys, memory, tgt_mask)
        out = out.transpose(0, 1)
        prob = model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.item()

        ys = torch.cat([ys,
                        torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=0)
        if next_word == EOS_IDX:
            break
    return ys


# actual function to translate input sentence into target language
def translate(model: torch.nn.Module, src_sentence: str):
    model.eval()
    src = text_transform[SRC_LANGUAGE](src_sentence).view(-1, 1)
    num_tokens = src.shape[0]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.bool)
    tgt_tokens = greedy_decode(
        model,  src, src_mask, max_len=num_tokens + 5, start_symbol=BOS_IDX).flatten()
    return " ".join(vocab_transform[TGT_LANGUAGE].lookup_tokens(list(tgt_tokens.cpu().numpy()))).replace("<bos>", "").replace("<eos>", "")

In [25]:
print(translate(transformer, "Eine Gruppe von Menschen steht vor einem Iglu ."))

 A group of people stand in front of an igloo . 
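
The control flow of `greedy_decode` can be seen in isolation with a toy scorer in place of the Transformer. In this sketch, `greedy_decode_toy`, the `TABLE` of scores, and the token ids 7 and 8 are all hypothetical. The real function scores the whole prefix with the decoder, whereas this toy scorer looks only at the last token.

```python
from typing import Callable, Dict, List

BOS, EOS = 2, 3  # same values as BOS_IDX and EOS_IDX above


def greedy_decode_toy(next_scores: Callable[[List[int]], Dict[int, float]],
                      max_len: int, start_symbol: int = BOS) -> List[int]:
    """Repeatedly append the highest-scoring next token until <eos> or max_len."""
    ys = [start_symbol]
    for _ in range(max_len - 1):
        scores = next_scores(ys)
        next_word = max(scores, key=scores.get)  # argmax over candidate tokens
        ys.append(next_word)
        if next_word == EOS:
            break
    return ys


# Hypothetical scorer: candidate scores depend only on the last token
TABLE = {BOS: {7: 0.9, 8: 0.1}, 7: {8: 0.6, EOS: 0.4}, 8: {EOS: 0.8, 7: 0.2}}
decoded = greedy_decode_toy(lambda prefix: TABLE[prefix[-1]], max_len=10)
print(decoded)
```

Greedy decoding commits to the single best token at each step, so it is fast but can miss sequences that a wider search such as beam search would find.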


In [29]:
# Load the validation dataset
valid_iter = Multi30k(split='valid', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
valid_dataloader = DataLoader(valid_iter, batch_size=1, collate_fn=collate_fn)

# Translate the first 5 sentences
for i, (src, tgt) in enumerate(valid_dataloader):
    # ``src`` has shape (seq_len, 1); flatten it to a plain list of token ids
    src_sentence = " ".join(vocab_transform[SRC_LANGUAGE].lookup_tokens(src.flatten().tolist())).replace("<bos>", "").replace("<eos>", "")
    print(f"Source Sentence: {src_sentence}")
    print(f"Translated Sentence: {translate(transformer, src_sentence)}")
    print()
    if i == 4:
        break

Source Sentence:  Eine Gruppe von Männern lädt <unk> auf einen Lastwagen 
Translated Sentence:  A group of men loading <unk> onto a truck onto a truck . 

Source Sentence:  Ein Mann schläft in einem grünen Raum auf einem Sofa . 
Translated Sentence:  A man sleeping on a couch in a green room . 

Source Sentence:  Ein Junge mit Kopfhörern sitzt auf den Schultern einer Frau . 
Translated Sentence:  A boy with headphones sitting on his shoulders on a woman . 

Source Sentence:  Zwei Männer bauen eine blaue <unk> auf einem <unk> See auf 
Translated Sentence:  Two men building a blue <unk> structure on a <unk> lake . 

Source Sentence:  Ein Mann mit beginnender Glatze , der eine rote Rettungsweste trägt , sitzt in einem kleinen Boot . 
Translated Sentence:  A balding man wearing a red life jacket sitting in a small life boat . 



In [30]:
!pip install python-Levenshtein



In [41]:
import Levenshtein as lev

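def ids_to_sentence(ids, language):
    # Drop <pad>, <bos>, and <eos>, then join the remaining tokens with spaces
    special_ids = {PAD_IDX, BOS_IDX, EOS_IDX}
    tokens = vocab_transform[language].lookup_tokens([i for i in ids if i not in special_ids])
    return " ".join(tokens)


def test(model, dataloader):
    model.eval()
    total_distance = 0
    total_length = 0

    with torch.no_grad():
        for src, tgt in dataloader:
            # Batches are (seq_len, batch_size), so each column is one sentence.
            # Indexing ``src[0]`` would select the first time step (all ``<bos>``) instead.
            for b in range(src.shape[1]):
                src_sentence = ids_to_sentence(src[:, b].tolist(), SRC_LANGUAGE)
                tgt_sentence = ids_to_sentence(tgt[:, b].tolist(), TGT_LANGUAGE)
                translated_sentence = translate(model, src_sentence).strip()

                # Compute Levenshtein distance
                distance = lev.distance(translated_sentence, tgt_sentence)
                total_distance += distance
                total_length += len(tgt_sentence)

    # Compute character-wise accuracy
    accuracy = 1 - total_distance / total_length

    return accuracy


# Load the validation/test dataset
valid_iter = Multi30k(split='valid', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
valid_dataloader = DataLoader(valid_iter, batch_size=BATCH_SIZE, collate_fn=collate_fn)

# Evaluate the model
accuracy = test(transformer, valid_dataloader)
print(f'Validation Accuracy: {accuracy*100:.4f}%')


An earlier run of this cell printed `Validation Accuracy: 2.3833%`. That run indexed `src[0]` and `tgt[0]`, which select the first time step (all `<bos>` tokens) rather than a sentence, so the figure does not measure translation quality. Rerun the cell to obtain the score for the column-wise loop.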
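
The metric computed by `test` is `1 - total edit distance / total reference length`, measured in characters. To show what that means without the `python-Levenshtein` dependency, here is a plain-Python sketch. `edit_distance` and `char_accuracy` are hypothetical helpers written for illustration, not the library's implementation.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming over one rolling row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def char_accuracy(hypotheses: List[str], references: List[str]) -> float:
    """1 - (total edit distance / total reference length)."""
    total_distance = sum(edit_distance(h, r) for h, r in zip(hypotheses, references))
    total_length = sum(len(r) for r in references)
    return 1 - total_distance / total_length


print(edit_distance("kitten", "sitting"))
print(char_accuracy(["A man sleeps ."], ["A man sleeping ."]))
```

Because the distance can exceed the reference length, this score can drop below zero for very poor hypotheses, so it is better read as a rough similarity measure than as a true accuracy.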


## References

1. Attention Is All You Need. https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
2. The Annotated Transformer. https://nlp.seas.harvard.edu/2018/04/03/attention.html#positional-encoding

