The site https://paperswithcode.com/ links published papers to reference code. This code is based on https://github.com/bentrevett/pytorch-seq2seq, which was referenced from the paperswithcode.com page.

The code implements the paper "Sequence to Sequence Learning with Neural Networks", Ilya Sutskever, Oriol Vinyals, Quoc V. Le, available at: https://arxiv.org/abs/1409.3215

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import random
import numpy as np
import spacy
import datasets
import torchtext
import tqdm
import evaluate
# NOTE: Encoder and Decoder are defined later in this notebook, so they are not imported here.

Set the seed in all libraries so that the initial weights and other random parameters are the same in every run.

In [2]:
seed = 36

random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cudnn.deterministic = True
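
As a minimal illustration of why seeding matters, the sketch below uses only the standard-library random module. The helper name sample_weights is hypothetical, and the drawn numbers merely stand in for initial model weights.

```python
import random

def sample_weights(seed, n=3):
    # Re-seed the generator, then draw n pseudo-random values
    # (stand-ins for randomly initialised parameters).
    random.seed(seed)
    return [random.random() for _ in range(n)]

first_run = sample_weights(36)
second_run = sample_weights(36)   # same seed as the first run
other_seed = sample_weights(37)   # different seed

print(first_run == second_run)    # identical draws when the seed is identical
print(first_run == other_seed)    # a different seed gives different draws
```

The same idea applies to np.random.seed() and torch.manual_seed() above: each library keeps its own generator, which is why each one is seeded separately.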

The dataset used here is Multi30k, a set of English-German sentence pairs. The original dataset is available at https://github.com/multi30k/dataset. The Hugging Face (HF) datasets library provides it as "bentrevett/multi30k", so we will load it from HF.

This dataset is already split into training, validation and test sets, like many datasets in the HF library.

In [3]:
dataset = datasets.load_dataset("bentrevett/multi30k")
dataset # Print the data split

DatasetDict({
    train: Dataset({
        features: ['en', 'de'],
        num_rows: 29000
    })
    validation: Dataset({
        features: ['en', 'de'],
        num_rows: 1014
    })
    test: Dataset({
        features: ['en', 'de'],
        num_rows: 1000
    })
})

Assign the training, validation and test splits to variables:


In [4]:
train_data, valid_data, test_data = (
    dataset["train"],
    dataset["validation"],
    dataset["test"],
)

The following shows a sample row, train_data[0]:

{'en': 'Two young, White males are outside near many bushes.',
 'de': 'Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.'}

For tokenization we will use spaCy, which is a newer library than NLTK. We first need to download the tokenization models for each language, which are en and de in this case.

The simplest way to install the models is from the command line, so run the following commands before loading the tokenization models.

python -m spacy download en_core_web_sm
python -m spacy download de_core_news_sm

The models are installed under the virtual environment's site-packages folder, for example:
.venv\Lib\site-packages\en_core_web_sm\en_core_web_sm-3.7.1


In [5]:
en_nlp = spacy.load("en_core_web_sm")
de_nlp = spacy.load("de_core_news_sm")

# Print download path
print(en_nlp._path)

c:\ChaitanyaBelwal\ACTIVE\Development\GitHub\Python\MachineLearning\ReferenceCode\NLP\PapersBasedCode\.venv\Lib\site-packages\en_core_web_sm\en_core_web_sm-3.7.1


Let us manually call the tokenizer on some sample text.

In [6]:
string = "Top Gun is my favorite movie!"
tokens = en_nlp.tokenizer(string)

[token.text for token in tokens]


['Top', 'Gun', 'is', 'my', 'favorite', 'movie', '!']

Define the function that will be passed to the map() method of the Dataset objects. It tokenizes both sentences and adds the start-of-sentence (sos) and end-of-sentence (eos) tokens, which are passed to the function as arguments.

In [7]:
# These arguments can be passed as a kwargs dict.
def tokenize_for_map(example, en_nlp, de_nlp, max_length, lower, sos_token, eos_token):
    # Tokenize each sentence and keep at most max_length tokens
    # (truncation happens before the sos/eos tokens are added)
    en_tokens = [token.text for token in en_nlp.tokenizer(example["en"])][:max_length]
    de_tokens = [token.text for token in de_nlp.tokenizer(example["de"])][:max_length]
    if lower:
        en_tokens = [token.lower() for token in en_tokens]
        de_tokens = [token.lower() for token in de_tokens]
    en_tokens = [sos_token] + en_tokens + [eos_token]
    de_tokens = [sos_token] + de_tokens + [eos_token]
    # Return as a dict.
    return {"en_tokens": en_tokens, "de_tokens": de_tokens}
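
To see the order of operations without loading the spaCy models, here is a small sketch. It assumes a plain whitespace split as a stand-in for the spaCy tokenizer, and the name tokenize_sketch is hypothetical.

```python
def tokenize_sketch(text, max_length, lower, sos_token, eos_token):
    # str.split() is only a stand-in for the spaCy tokenizer (assumption).
    tokens = text.split()[:max_length]      # truncate first
    if lower:
        tokens = [t.lower() for t in tokens]
    # The special tokens are added after truncation, so the result
    # can hold up to max_length + 2 items.
    return [sos_token] + tokens + [eos_token]

tokens = tokenize_sketch(
    "Top Gun is my favorite movie !",
    max_length=4,
    lower=True,
    sos_token="<sos>",
    eos_token="<eos>",
)
print(tokens)  # ['<sos>', 'top', 'gun', 'is', 'my', '<eos>']
```

Because truncation happens before the special tokens are added, every sequence still starts with the sos token and ends with the eos token, however long the sentence was.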


Set the parameters to pass to tokenize_for_map() as keyword arguments.

In [8]:
max_length = 1000
lower = True
sos_token = "<sos>"
eos_token = "<eos>"

kwargs = {
    "en_nlp": en_nlp,
    "de_nlp": de_nlp,
    "max_length": max_length,
    "lower": lower,
    "sos_token": sos_token,
    "eos_token": eos_token,
}

Call tokenize_for_map() with the above arguments on each split.

In [9]:
train_data = train_data.map(tokenize_for_map, fn_kwargs=kwargs)
valid_data = valid_data.map(tokenize_for_map, fn_kwargs=kwargs)
test_data = test_data.map(tokenize_for_map, fn_kwargs=kwargs)

See a sample of the data after the tokenize_for_map() operations

In [10]:
train_data[0]

{'en': 'Two young, White males are outside near many bushes.',
 'de': 'Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.',
 'en_tokens': ['<sos>',
  'two',
  'young',
  ',',
  'white',
  'males',
  'are',
  'outside',
  'near',
  'many',
  'bushes',
  '.',
  '<eos>'],
 'de_tokens': ['<sos>',
  'zwei',
  'junge',
  'weiße',
  'männer',
  'sind',
  'im',
  'freien',
  'in',
  'der',
  'nähe',
  'vieler',
  'büsche',
  '.',
  '<eos>']}

We will now build the vocabulary, which assigns a unique token id to each token and serves as a lookup table between tokens and numbers. We also add the unknown '<unk>' and padding '<pad>' tokens.

The special_tokens variable is set to a list that will be passed to torchtext.vocab.build_vocab_from_iterator().

The min_freq parameter specifies that only tokens that appear at least min_freq times in the dataset are added to the vocabulary. Any token that appears fewer than min_freq times will be treated as an <unk> token.

The parameters used to create the vocabulary are specified first:

In [38]:

min_freq = 2
unk_token = "<unk>"
pad_token = "<pad>"

special_tokens = [
    unk_token,
    pad_token,
    sos_token,
    eos_token,
]

Now we will call the function to build both the 'en' and 'de' vocabularies. The vocabulary should be built only from the training data; if a token is present in the test/validation data but not in the training data, it should be treated as unknown.

In [47]:
en_vocab = torchtext.vocab.build_vocab_from_iterator(
    train_data["en_tokens"],
    min_freq=min_freq,
    specials=special_tokens,
)

de_vocab = torchtext.vocab.build_vocab_from_iterator(
    train_data["de_tokens"],
    min_freq=min_freq,
    specials=special_tokens,
)

Print a sample of the vocabulary: the first 10 tokens of each. The regular tokens are ordered from most frequent to least frequent, while the special tokens are placed first and are not subject to this ordering. get_itos() returns the list of strings indexed by token id, while get_stoi() returns a dict that maps each string to its index.

In [13]:
# NOTE: In .ipynb only the value of the last expression in a cell is displayed
en_vocab.get_itos()[:10], de_vocab.get_itos()[:10]

(['<unk>', '<pad>', '<sos>', '<eos>', 'a', '.', 'in', 'the', 'on', 'man'],
 ['<unk>', '<pad>', '<sos>', '<eos>', '.', 'ein', 'einem', 'in', 'eine', ','])
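
The effect of min_freq and the special tokens can be approximated in pure Python. This is a sketch of the idea, not torchtext's implementation: build_vocab_sketch is a hypothetical helper, and breaking frequency ties alphabetically is an assumption made here.

```python
from collections import Counter

def build_vocab_sketch(token_lists, min_freq, specials):
    # Count every token over all tokenized sentences.
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Keep tokens seen at least min_freq times, most frequent first
    # (ties broken alphabetically: an assumption of this sketch).
    kept = sorted(
        (t for t, c in counts.items() if c >= min_freq and t not in specials),
        key=lambda t: (-counts[t], t),
    )
    itos = list(specials) + kept            # special tokens come first
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

corpus = [
    ["<sos>", "a", "man", "runs", "<eos>"],
    ["<sos>", "a", "dog", "runs", "<eos>"],
    ["<sos>", "a", "man", "sleeps", "<eos>"],
]
itos, stoi = build_vocab_sketch(
    corpus, min_freq=2, specials=["<unk>", "<pad>", "<sos>", "<eos>"]
)
print(itos)  # ['<unk>', '<pad>', '<sos>', '<eos>', 'a', 'man', 'runs']
```

"dog" and "sleeps" occur only once, so with min_freq=2 they are left out and would later map to the unknown token.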

Check some stoi() values

In [46]:
en_vocab.get_stoi()["my"]


0

Can also use the object as a dict.

In [15]:
en_vocab["my"]

1916

Get index of special tokens

In [16]:
en_vocab[unk_token], en_vocab[pad_token], de_vocab[unk_token], de_vocab[pad_token]

(0, 1, 0, 1)

We can also look up indices of multiple words

In [48]:
tokens = ["my","name","is","the","sugar"]
en_vocab.lookup_indices(tokens)



[1916, 1507, 10, 7, 0]

Get size of the vocabulary in both languages

In [40]:
len(en_vocab), len(de_vocab)

(5893, 7853)

The vocab. can be used like a Map and similar operations can be performed on it.

The special tokens will have the same id and using an Assert we can confirm that.

In [19]:
assert en_vocab[unk_token] == de_vocab[unk_token]
assert en_vocab[pad_token] == de_vocab[pad_token]

Set the default index. The default index is returned when a token is not found in the vocabulary. This is a very important step: without it, looking up a token that is not in the vocabulary raises an error.

In [49]:
en_vocab.set_default_index(en_vocab.get_stoi()[unk_token])
de_vocab.set_default_index(de_vocab.get_stoi()[unk_token])
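
The behaviour of the default index can be pictured with a small stand-in class. VocabSketch is hypothetical and only mimics the lookup logic described above; the exception type it raises is a choice made for this sketch.

```python
class VocabSketch:
    """Tiny stand-in for a vocabulary with an optional default index."""

    def __init__(self, stoi):
        self.stoi = dict(stoi)
        self.default_index = None

    def set_default_index(self, index):
        self.default_index = index

    def __getitem__(self, token):
        if token in self.stoi:
            return self.stoi[token]
        if self.default_index is None:
            # No fallback configured: unknown tokens are an error.
            raise RuntimeError(f"Token {token!r} not found and no default index set")
        return self.default_index

    def lookup_indices(self, tokens):
        return [self[t] for t in tokens]

vocab = VocabSketch({"<unk>": 0, "<pad>": 1, "the": 2, "my": 3})

try:
    vocab["sugar"]
    failed_without_default = False
except RuntimeError:
    failed_without_default = True

vocab.set_default_index(vocab["<unk>"])
ids = vocab.lookup_indices(["my", "the", "sugar"])
print(failed_without_default, ids)  # True [3, 2, 0]
```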

The following function returns the token ids for the tokens of one example. Like tokenize_for_map(), it is written to be passed to map().

In [28]:
def numericalize_example(example, en_vocab, de_vocab):
    en_ids = en_vocab.lookup_indices(example["en_tokens"])
    de_ids = de_vocab.lookup_indices(example["de_tokens"])
    return {"en_ids": en_ids, "de_ids": de_ids} 

Using the above function let's add the token ids to the train, validation and test data

In [50]:
fn_kwargs = {"en_vocab": en_vocab, "de_vocab": de_vocab}

# Add the en_ids and de_ids columns
train_data = train_data.map(numericalize_example, fn_kwargs=fn_kwargs)
valid_data = valid_data.map(numericalize_example, fn_kwargs=fn_kwargs)
test_data = test_data.map(numericalize_example, fn_kwargs=fn_kwargs)

Map: 100%|██████████| 29000/29000 [00:02<00:00, 11001.75 examples/s]
Map: 100%|██████████| 1014/1014 [00:00<00:00, 12143.28 examples/s]
Map: 100%|██████████| 1000/1000 [00:00<00:00, 12170.35 examples/s]


Check one row and see the additional column that is added. 'en_ids' and 'de_ids' contain the ids (or token ids) of the words:

In [None]:
train_data[0]

Convert the en_ids and de_ids columns to torch tensors. We will use the with_format() function.

In [52]:
data_type = "torch"
format_columns = ["en_ids", "de_ids"]

train_data = train_data.with_format(
    type=data_type,
    columns=format_columns,
    output_all_columns=True,
)

valid_data = valid_data.with_format(
    type=data_type,
    columns=format_columns,
    output_all_columns=True,
)

test_data = test_data.with_format(
    type=data_type,
    columns=format_columns,
    output_all_columns=True,
)

PyTorch's DataLoader class will be used to create the batches. DataLoader can call a user-supplied function when assembling a batch, and we will use a collate (combine) function to pad the sequences. Padding is needed because the sequences in a batch have different lengths, while a batch tensor must have a single fixed shape. pad_sequence() from nn.utils.rnn will be used to pad the sequences.

Note that we use a closure for get_collate_fn(). When DataLoader calls the collate function, it passes only the batch as a parameter, not the pad_index. By defining the function within a function, the pad_index value needs to be passed only once; collate_fn() can then be called directly and will use the pad_index value captured earlier.

In [53]:

def get_collate_fn(pad_index): # This () is called once, assigns value of pad_index
    def collate_fn(batch):     # Called by the dataloader.
        batch_en_ids = [example["en_ids"] for example in batch]
        batch_de_ids = [example["de_ids"] for example in batch]
        batch_en_ids = nn.utils.rnn.pad_sequence(batch_en_ids, padding_value=pad_index)
        batch_de_ids = nn.utils.rnn.pad_sequence(batch_de_ids, padding_value=pad_index)
        batch = {
            "en_ids": batch_en_ids,
            "de_ids": batch_de_ids,
        }
        return batch

    return collate_fn
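
The closure and the padding layout can be tried out without torch. The sketch below works on plain Python lists and is only an approximation of what pad_sequence() does with its default settings; get_collate_fn_sketch is a hypothetical name.

```python
def get_collate_fn_sketch(pad_index):
    # pad_index is captured by the inner function (closure).
    def collate_fn(batch):
        max_len = max(len(seq) for seq in batch)
        # Right-pad every sequence to the length of the longest one.
        padded = [seq + [pad_index] * (max_len - len(seq)) for seq in batch]
        # Transpose to [max_len, batch_size]: the time-major layout that
        # pad_sequence returns by default (batch_first=False).
        return [list(step) for step in zip(*padded)]
    return collate_fn

collate = get_collate_fn_sketch(pad_index=1)   # pad_index passed once
batch = [[2, 10, 11, 3], [2, 12, 3]]           # two sequences of token ids
padded = collate(batch)                        # later calls receive only the batch
print(padded)  # [[2, 2], [10, 12], [11, 3], [3, 1]]
```

Each inner list is one time step across the batch, which matches the "src = [src length, batch size]" convention used by the model code.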

Create get_data_loader(), which builds the collate function and returns a DataLoader object. It is called once each for the train, validation and test sets.

In [54]:
def get_data_loader(dataset, batch_size, pad_index, shuffle=False):
    collate_fn = get_collate_fn(pad_index) # Get collate_fn as a value, same pad_index will be used
    data_loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=batch_size,
        collate_fn=collate_fn,
        shuffle=shuffle,
    )
    return data_loader

Now call get_data_loader() for the train, validation and test sets. Use a large batch size; if a GPU is available, use the largest batch size that fits in GPU memory. The training data should be shuffled, but shuffling is not needed for validation and test.

In [55]:
batch_size = 128
pad_index = en_vocab[pad_token]

train_data_loader = get_data_loader(train_data, batch_size, pad_index, shuffle=True)
valid_data_loader = get_data_loader(valid_data, batch_size, pad_index)
test_data_loader = get_data_loader(test_data, batch_size, pad_index)
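
As a quick sanity check on the arithmetic, the number of batches per epoch follows from the split sizes printed earlier and the batch size of 128. The sketch assumes the final, smaller batch is kept, which is the DataLoader default (drop_last=False).

```python
import math

batch_size = 128
split_sizes = {"train": 29000, "validation": 1014, "test": 1000}

# Number of batches per pass over each split (last partial batch kept).
batches_per_epoch = {
    name: math.ceil(n / batch_size) for name, n in split_sizes.items()
}
# Size of the final batch in each split.
last_batch_size = {
    name: (n % batch_size) or batch_size for name, n in split_sizes.items()
}

print(batches_per_epoch)  # {'train': 227, 'validation': 8, 'test': 8}
print(last_batch_size)    # {'train': 72, 'validation': 118, 'test': 104}
```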

At this point we are ready to build the model. The model will be built in 3 parts: the Encoder, the Decoder, and then the seq2seq wrapper that links the two. Note that this uses PyTorch's nn.Embedding layer, so the embedding weights will also be learned (we are not using pretrained embeddings such as word2vec).

In [None]:
class Encoder(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, n_layers, dropout):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.n_layers = n_layers
        self.embedding = nn.Embedding(input_dim, embedding_dim)
        # nn.LSTM(input size, hidden size, number of layers); the output size equals the hidden size
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout=dropout)
        self.dropout = nn.Dropout(dropout) # dropout = prob. of dropout (for randomly zeroing the input tensor values)

    def forward(self, src):
        # src = [src length, batch size]
        embedded = self.dropout(self.embedding(src))
        # embedded = [src length, batch size, embedding dim]
        outputs, (hidden, cell) = self.rnn(embedded)
        # outputs = [src length, batch size, hidden dim * n directions]
        # hidden = [n layers * n directions, batch size, hidden dim]
        # cell = [n layers * n directions, batch size, hidden dim]
        # outputs are always from the top/last hidden layer
        return hidden, cell
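
To get a feel for the size of this Encoder, the number of learnable parameters can be estimated by hand. The hyperparameter values below (embedding_dim=256, hidden_dim=512, n_layers=2) are hypothetical, since the notebook has not set them at this point. input_dim is taken from len(en_vocab) shown earlier. The formula assumes a unidirectional nn.LSTM with biases enabled, as constructed above.

```python
def lstm_param_count(input_size, hidden_size, n_layers):
    total = 0
    for layer in range(n_layers):
        # The first layer sees the embeddings; later layers see the
        # hidden state of the layer below.
        layer_input = input_size if layer == 0 else hidden_size
        # 4 gates, each with input weights, recurrent weights
        # and two bias vectors.
        total += 4 * (
            layer_input * hidden_size
            + hidden_size * hidden_size
            + 2 * hidden_size
        )
    return total

input_dim = 5893        # len(en_vocab) from the output above
embedding_dim = 256     # hypothetical value
hidden_dim = 512        # hypothetical value
n_layers = 2            # hypothetical value

embedding_params = input_dim * embedding_dim
lstm_params = lstm_param_count(embedding_dim, hidden_dim, n_layers)
print(embedding_params, lstm_params, embedding_params + lstm_params)
# 1508608 3678208 5186816
```

Dropout adds no parameters, so the embedding table and the LSTM weights account for the whole Encoder.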

The code for the Decoder is shown next. It has a fully connected (FC) layer at the end that produces a score for each token in the output vocabulary; a softmax over these scores gives the probability of each token.

In [None]:
class Decoder(nn.Module):
    def __init__(self, output_dim, embedding_dim, hidden_dim, n_layers, dropout):
        super().__init__()
        self.output_dim = output_dim
        self.hidden_dim = hidden_dim
        self.n_layers = n_layers
        self.embedding = nn.Embedding(output_dim, embedding_dim) # 1
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout=dropout) #2
        self.fc_out = nn.Linear(hidden_dim, output_dim) # 3: To make predictions for next token
        self.dropout = nn.Dropout(dropout)

    def forward(self, input, hidden, cell):
        # input = [batch size]
        # hidden = [n layers * n directions, batch size, hidden dim]
        # cell = [n layers * n directions, batch size, hidden dim]
        # n directions in the decoder will always be 1, therefore:
        # hidden = [n layers, batch size, hidden dim]
        # cell = [n layers, batch size, hidden dim]
        input = input.unsqueeze(0)
        # input = [1, batch size]
        embedded = self.dropout(self.embedding(input))
        # embedded = [1, batch size, embedding dim]
        output, (hidden, cell) = self.rnn(embedded, (hidden, cell))
        # output = [seq length, batch size, hidden dim * n directions]
        # hidden = [n layers * n directions, batch size, hidden dim]
        # cell = [n layers * n directions, batch size, hidden dim]
        # seq length and n directions will always be 1 in this decoder, therefore:
        # output = [1, batch size, hidden dim]
        # hidden = [n layers, batch size, hidden dim]
        # cell = [n layers, batch size, hidden dim]
        prediction = self.fc_out(output.squeeze(0))
        # prediction = [batch size, output dim]
        return prediction, hidden, cell
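
Note that forward() returns the raw scores from fc_out and does not apply a softmax itself. The sketch below shows, in plain Python, how such scores could be turned into probabilities and a predicted token id. The score values are made up for illustration.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores for a 4-token vocabulary (one row of `prediction`).
logits = [2.0, 0.5, -1.0, 0.0]
probs = softmax(logits)

# The most likely next token is the index with the highest probability,
# which is also the index of the highest raw score.
predicted_id = max(range(len(probs)), key=probs.__getitem__)
print(predicted_id, round(sum(probs), 6))  # 0 1.0
```

Because softmax preserves the ordering of the scores, taking the argmax of the raw prediction gives the same token id as taking the argmax of the probabilities.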