### Transformers 101: Using Transformers for text generation


Transformers have taken over the deep learning field. But, to be honest, I always found them difficult to understand. So I kept reading about them over and over until I had filled the gaps in my understanding. In this project I will go through them one step at a time and try to understand the Transformer network by coding it up, with careful explanations along the way.


As our working example we are going to use the [Poems dataset](https://www.kaggle.com/datasets/charunisa/english-poems-dataset).


Without further ado, let's get started and import the necessary packages.


In [1]:
## Importing necessary packages ##

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

from tqdm import tqdm

## For displaying exact values and not scientific notation ##
torch.set_printoptions(sci_mode=False)

## Setting the device ##
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")


#### Loading and Understanding our dataset


As a first step we need to look at what we are doing and what our dataset looks like. I like to summarize it as follows.

- **Main Goal** : Generate English poems using Transformer Networks.
- **Data Loading** (_Not DataLoader_) : Read the text file and merge its contents into a single list of words.


In [2]:
## Reading the text file ##

data = []

with open("poems.txt", "r", encoding="utf8") as file:
    for i, each_line in enumerate(file):
        each_line = "<start> " + each_line.lower().strip("\n") + " <end>"
        data.extend(each_line.split())

print(f"Total dataset length is {len(data)}")


Total dataset length is 18033


Now our data is stored as a single big text corpus. What we want to do is randomly pick a sub-sequence of it and feed it to the Transformer so that it learns to generate new words.


But before that we must build our word corpus, or vocabulary. The easiest way is to make a set of all the unique words in the corpus and give each one an index. A word-to-index mapping is also necessary to achieve this final goal.


In [3]:
## Making our vocabulary ##

# Sorted list of unique words, so the indices are the same on every run #
vocab = sorted(set(data))
vocab.remove("<start>")
vocab.remove("<end>")
vocab.insert(0, "<start>")
vocab.insert(1, "<end>")
print(f"Vocabulary length is {len(vocab)}")


Vocabulary length is 4113


Now let us set up a word-to-index mapping, and its inverse, for later use.


In [4]:
## Word 2 index mapping ##

word_2_idx = {k: v for v, k in enumerate(vocab)}

print(
    f"Start word index : {word_2_idx['<start>']}, end word index : {word_2_idx['<end>']}"
)
## Index 2 word mapping for generation ##

idx_2_word = {v: k for k, v in word_2_idx.items()}
print(f"Word at index 0 : {idx_2_word[0]}, Word at index 1 : {idx_2_word[1]}")


Start word index : 0, end word index : 1
Word at index 0 : <start>, Word at index 1 : <end>
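
To make the two mappings concrete, here is a minimal round-trip sketch on a made-up corpus. The names `toy_data`, `encode` and `decode` are my own illustrations and not part of the notebook; the recipe (sorted unique words, special tokens moved to the front) is the same as above.

```python
# Toy corpus standing in for the tokenised poems (hypothetical data).
toy_data = "<start> the rose is red <end> <start> the sky is blue <end>".split()

# Same recipe as above: sorted unique words, special tokens placed first.
toy_vocab = ["<start>", "<end>"] + sorted(set(toy_data) - {"<start>", "<end>"})

toy_word_2_idx = {word: idx for idx, word in enumerate(toy_vocab)}
toy_idx_2_word = {idx: word for word, idx in toy_word_2_idx.items()}


def encode(words):
    """Map a list of words to a list of integer indices."""
    return [toy_word_2_idx[w] for w in words]


def decode(indices):
    """Map a list of integer indices back to words."""
    return [toy_idx_2_word[i] for i in indices]


sentence = ["<start>", "the", "rose", "is", "red", "<end>"]
encoded = encode(sentence)
print(encoded)          # [0, 7, 5, 3, 4, 1]
print(decode(encoded))  # the original sentence again
```

Encoding followed by decoding gives back the original words, which is all we need from the two dictionaries.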


Perfect, we are nicely set up. Now the idea is to create a PyTorch Dataset instance which does the following.

- Given a data corpus and a sequence length, it plucks a sequence of that length from the corpus (starting at a given index, which the dataloader will later pick at random) and returns it along with the labels, which are simply the same sequence shifted by one index. This will become clearer in the code.


In [5]:
## Pytorch dataset ##


class PoemsDataset(Dataset):
    """The custom poems dataset"""

    def __init__(self, data_corpus, sequence_length):
        super().__init__()
        self.data_corpus = data_corpus
        self.sequence_length = sequence_length

    def __getitem__(self, idx):
        inp = torch.tensor(
            [word_2_idx[w] for w in self.data_corpus[idx : idx + self.sequence_length]]
        )
        label = torch.tensor(
            [
                word_2_idx[w]
                for w in self.data_corpus[idx + 1 : idx + 1 + self.sequence_length]
            ]
        )
        return inp, label

    def __len__(self):
        return len(self.data_corpus) - (self.sequence_length + 1)


In [6]:
## Utility function to show data ##


def show_data(data, verbose=True):
    """Given a data tensor, maps them to string and prints them."""
    str_data = [idx_2_word[each_word.item()] for each_word in data.data]
    if verbose:
        print(str_data)
    else:
        return str_data


In [7]:
## For reproducing the same results ##
torch.manual_seed(97)

## Dataset ##

poems_data = PoemsDataset(data, 8)

## Testing our dataset ##
random_idx = int(torch.randint(low=0, high=len(poems_data), size=(1,)))
random_idx

print(f" Raw data : {poems_data[random_idx]}")

print("Inputs ->")
show_data(poems_data[random_idx][0])

print("Labels ->")
show_data(poems_data[random_idx][1])


 Raw data : (tensor([ 228, 4011, 2141, 1634, 3773, 3507,  275,    1]), tensor([4011, 2141, 1634, 3773, 3507,  275,    1,    0]))
Inputs ->
['art', 'with', 'me', 'here', 'upon', 'the', 'banks', '<end>']
Labels ->
['with', 'me', 'here', 'upon', 'the', 'banks', '<end>', '<start>']
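
To double-check the shift-by-one relationship between inputs and labels, here is a minimal sketch on a made-up corpus. `make_window` and `toy_corpus` are illustrative names, not part of the notebook; the slicing simply mirrors what `PoemsDataset.__getitem__` does, just without the index lookup.

```python
def make_window(corpus, idx, sequence_length):
    """Return (input, label) word lists; the label is the input shifted by one."""
    inp = corpus[idx : idx + sequence_length]
    label = corpus[idx + 1 : idx + 1 + sequence_length]
    return inp, label


# Hypothetical toy corpus with the same <start>/<end> markers as the poems.
toy_corpus = "<start> roses are red <end> <start> violets are blue <end>".split()
seq_len = 4

inp, label = make_window(toy_corpus, 1, seq_len)
print(inp)    # ['roses', 'are', 'red', '<end>']
print(label)  # ['are', 'red', '<end>', '<start>']

# Number of start indices that still give a full-length label window.
n_windows = len(toy_corpus) - seq_len
print(n_windows)  # 6
```

On this toy corpus there are `len(corpus) - sequence_length` usable windows. `PoemsDataset.__len__` returns one fewer than that, which is safe, just slightly conservative.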


Now that we have gotten our hands dirty with the Dataset, we also need to build our DataLoader, which will feed batches of such sequences to the Transformer network.


In [8]:
## Building dataloader ##


def build_dl(dataset, batch_size=4, shuffle=True):
    """Returns the dataloader object."""

    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)


Perfect, our dataloader is also ready.

Now it would be nice to see how our dataloader sends out the data.


In [9]:
## Dataloader check ##

poems_dl = build_dl(poems_data)

for inp, labels in poems_dl:

    print(f"Input shape : {inp.shape} and Labels shape : {labels.shape}")

    for each_inp, each_label in zip(inp, labels):
        print("----------------------")
        print("Input :\t", end="")
        show_data(each_inp)
        print("Label :\t", end="")
        show_data(each_label)

    break


Input shape : torch.Size([4, 8]) and Labels shape : torch.Size([4, 8])
----------------------
Input :	['did', 'i', 'count!', '<end>', '<start>', 'blest', 'was', 'i']
Label :	['i', 'count!', '<end>', '<start>', 'blest', 'was', 'i', 'then']
----------------------
Input :	['mortality;', '<end>', '<start>', 'another', 'race', 'hath', 'been,', 'and']
Label :	['<end>', '<start>', 'another', 'race', 'hath', 'been,', 'and', 'other']
----------------------
Input :	['on', 'the', 'farm,', 'she', 'did', '<end>', '<start>', 'a']
Label :	['the', 'farm,', 'she', 'did', '<end>', '<start>', 'a', 'childlike']
----------------------
Input :	['it', 'die', 'away,', '<end>', '<start>', 'and', 'fade', 'into']
Label :	['die', 'away,', '<end>', '<start>', 'and', 'fade', 'into', 'the']


Perfect!! Our dataloader performs exactly like we want.

But before I forget, let me explain the structure of the input and the output.

So what are we building? A **generation** model which will generate beautiful poems.

Now the model can't just burp out everything all at once (even though that would have been nice!). It generates one word at a time, very similar to how we write (one word at a time)!

Now what we want to teach the model is this: given the current word and all the words before it (at generation time, the ones the model has already predicted), the model must output the next word. So the input grows with each step. At the first step the input is 'from', at the second step 'from', 'his', at the third step 'from', 'his', "father's", and so on. The corresponding next word is given by the label: 1st step -> 'his', 2nd step -> "father's", 3rd step -> 'eyes!', and so on.
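
The growing-context idea can be spelled out in a few lines of plain Python. This is only a sketch with a hypothetical helper `prefix_targets`; it uses the same three words as the example in the text.

```python
def prefix_targets(inp, label):
    """Pair every growing prefix of `inp` with the next word taken from `label`."""
    pairs = []
    for t in range(len(inp)):
        context = inp[: t + 1]  # everything seen up to and including step t
        target = label[t]       # the word that should come next
        pairs.append((context, target))
    return pairs


inp = ["from", "his", "father's"]
label = ["his", "father's", "eyes!"]

pairs = prefix_targets(inp, label)
for context, target in pairs:
    print(context, "->", target)
```

A single (input, label) window of length 3 therefore contains three training examples, one per position.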

I hope this makes everything clear, since it will be at the core of the Transformer block.

In the next few cells I will try to show how you can build this type of input and output combination using basic PyTorch functions.


In [10]:
## Checking the masking of the inputs ##

test_data = poems_data[0]

test_inp, test_label = test_data
test_inp, test_label = test_inp.float(), test_label.float()

print(f"test_inp shape : {test_inp.shape}")

print(f"------------")
print(f"Current Input :\t{test_inp}")

## Making the input to have the required orientation ##
mask = torch.tril(torch.ones((8, 8)))

print(f"Mask :\t{mask}")

## Element-wise product of mask and test_inp (broadcast over the rows) does the cascading ##
cascaded_inp = mask * test_inp

print(f"Cascaded Input :\t{cascaded_inp}")


test_inp shape : torch.Size([8])
------------
Current Input :	tensor([   0., 1739., 1571., 4007.,   82.,  379., 4049., 1283.])
Mask :	tensor([[1., 0., 0., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 1.]])
Cascaded Input :	tensor([[   0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.],
        [   0., 1739.,    0.,    0.,    0.,    0.,    0.,    0.],
        [   0., 1739., 1571.,    0.,    0.,    0.,    0.,    0.],
        [   0., 1739., 1571., 4007.,    0.,    0.,    0.,    0.],
        [   0., 1739., 1571., 4007.,   82.,    0.,    0.,    0.],
        [   0., 1739., 1571., 4007.,   82.,  379.,    0.,    0.],
        [   0., 1739., 1571., 4007.,   82.,  379., 4049.,    0.],
        [   0., 1739., 1571., 4007.,   82.,  379., 4049., 1283.]])


Perfect, we can already build the cascading inputs. But the thing is, we now want to compress the multi-word context used for predicting the next word into a single value, so that even though we are considering multiple words, we actually pass along a single number per step. So, how do we do that?

- One simple way is to just sum each row across its columns. And here we are going to do just that.


In [11]:
## Adding up the columns to convert to single element ##

print(f"Attended inp : {cascaded_inp.sum(1)}")


Attended inp : tensor([    0.,  1739.,  3310.,  7317.,  7399.,  7778., 11827., 13110.])


But this entire setup took two steps. Can we merge them into a single step? The answer is : **DOT** products (that is, a matrix multiplication of the mask with the input)!!


In [12]:
## Doing dot product to do the same thing ##

test_data = poems_data[0]

test_inp, test_label = test_data
test_inp, test_label = test_inp.float(), test_label.float()

print(test_inp.shape)

attended_inp = mask @ test_inp.unsqueeze(1)
print(f"Attended inp : {attended_inp.squeeze(1)}")


torch.Size([8])
Attended inp : tensor([    0.,  1739.,  3310.,  7317.,  7399.,  7778., 11827., 13110.])


And this is exactly the same as the result before. Amazing!!


Spoiler! Spoiler! Spoiler! This simple formulation is nothing but masked attention, which sits at the heart of the Transformer decoder.

But it's a bit more nuanced. Here, since I was using ones as my mask, I was just summing the context (dividing each row by its number of ones would turn that into a plain average). That is very simple. In attention there are no 1's; instead we have learned values. These learned values come from the product of two batches of matrices called the **Query** and the **Key**. To convert the product into probabilities, it is passed through a softmax.
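
Here is a small sketch of what the all-ones mask really computes, and of how normalising each row turns it into an average. The vector `x` is made up for illustration.

```python
import torch

# Hypothetical 4-step input, chosen so the results are easy to check by hand.
x = torch.tensor([2.0, 4.0, 6.0, 8.0])
ones_mask = torch.tril(torch.ones(4, 4))

# The all-ones lower-triangular mask gives a running sum of the context.
running_sum = ones_mask @ x
print(running_sum)   # tensor([ 2.,  6., 12., 20.])

# Dividing each row by its number of ones gives weights that sum to 1,
# so the same product now computes a running average instead.
avg_mask = ones_mask / ones_mask.sum(dim=1, keepdim=True)
running_mean = avg_mask @ x
print(running_mean)  # tensor([2., 3., 4., 5.])
```

Attention keeps this "rows that sum to 1" structure but lets the model learn the weights instead of fixing them to be equal.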

The thing is, we only want non-zero values in the lower triangle of the mask; everything in the upper triangle must be 0. So how can we achieve this directly after the softmax? The easiest way is, after taking the product of the two matrices, to set everything above the diagonal to -inf. When passed through the softmax, all of these entries become 0, which is exactly what we need. In the next cell let's try this as a sanity check.
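
To see why the fill value has to be -inf and not 0, here is a tiny hand-checkable sketch. The `scores` matrix is made up; only the masking and the softmax matter.

```python
import torch

# Hypothetical 3x3 score matrix (every row is [1, 2, 3]).
scores = torch.tensor([[1.0, 2.0, 3.0],
                       [1.0, 2.0, 3.0],
                       [1.0, 2.0, 3.0]])

# True above the diagonal, i.e. at the "future" positions we want to hide.
future = torch.triu(torch.ones(3, 3), diagonal=1).bool()

# Filling with 0 does NOT hide the future: exp(0) = 1 still gets weight.
zero_filled = torch.softmax(scores.masked_fill(future, 0.0), dim=1)

# Filling with -inf does: exp(-inf) = 0, so those weights vanish exactly.
inf_filled = torch.softmax(scores.masked_fill(future, float("-inf")), dim=1)

print(zero_filled[0])  # future positions still get some probability
print(inf_filled[0])   # tensor([1., 0., 0.])
```

With the -inf fill every row still sums to 1, but all of the probability mass sits on the current and earlier positions.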


In [13]:
## For reproducibility ##
torch.manual_seed(97)

## Creating the mask ##

query = torch.randn(8, 8)
key = torch.randn(8, 8)

print(f"Key :\t\n{key}")
print("------------------")
print(f"Query :\t\n{query}")
print("------------------")

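## Fill everything above the diagonal with -inf. A boolean mask is safer than ##
## testing `prod == 0`, which would also hit genuine zeros in the lower triangle ##
causal_mask = torch.tril(torch.ones(8, 8)).bool()
prod = (query @ key).masked_fill(~causal_mask, float("-inf"))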

print(f"Prod before softmax :\t\n{prod}")
print("------------------")

prod = torch.nn.functional.softmax(prod, dim=1)

print(f"Prod after softmax :\t\n{prod}")


Key :	
tensor([[ 0.0493,  2.3374,  1.5298,  0.2551,  0.0994, -0.3862, -1.7110,  1.5092],
        [-1.4439,  1.1032, -0.7929,  1.1246,  0.3582,  2.4612,  0.1289, -0.4013],
        [-0.2494, -0.4767,  1.3613,  0.1978,  1.3810,  0.5879, -0.4244,  0.4247],
        [ 1.3456,  1.2036, -0.9834, -0.9720, -0.6042, -0.2600,  0.1238,  1.3787],
        [-0.8120,  1.6090, -1.4952, -1.6685, -1.3468, -1.5940, -0.0069, -0.5587],
        [-2.2360,  1.3833,  1.0433,  1.1669, -0.3820,  0.7385, -0.5996,  1.3503],
        [-0.0796, -1.4015,  0.8312,  0.3304,  0.0167, -3.0018,  0.7606,  0.4953],
        [-2.6711,  0.2311, -0.6963, -1.4831,  0.7733,  0.0064, -0.9866,  0.8588]])
------------------
Query :	
tensor([[    -0.7015,      0.8426,     -0.9616,     -1.8412,      1.0544,
              1.5917,      1.2750,     -1.5763],
        [     0.3223,      0.4414,     -0.1807,      1.3214,     -0.4039,
             -1.1014,     -1.4851,      0.3144],
        [     0.9922,      0.1144,      0.8940,      0.0193,  

So we need to do something like this in our Transformer network. This creates the masks, which in the paper are known as the attention "weights". Now, as we saw earlier, to complete the attention block we need to multiply this mask with the actual values. And funnily enough, we are going to do exactly that via another matrix called the **Values**.
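
Putting the three matrices together, a minimal single-head sketch could look like the following. `causal_attention` is a hypothetical helper for unbatched `(L, d)` tensors, and for simplicity I feed the same random tensor in as query, key and value rather than using learned projections. The check at the end confirms the property we care about: changing a later token never changes the output at earlier positions.

```python
import torch


def causal_attention(q, k, v):
    """Single-head masked attention for unbatched tensors of shape (L, d)."""
    L, d = q.shape
    scores = q @ k.T / d**0.5                       # (L, L) scaled scores
    future = torch.triu(torch.ones(L, L), diagonal=1).bool()
    scores = scores.masked_fill(future, float("-inf"))
    weights = torch.softmax(scores, dim=-1)         # each row sums to 1
    return weights @ v, weights


torch.manual_seed(0)
x = torch.randn(5, 4)
out, weights = causal_attention(x, x, x)

# Replace only the last token and recompute.
x_changed = x.clone()
x_changed[-1] = torch.randn(4)
out_changed, _ = causal_attention(x_changed, x_changed, x_changed)

# Earlier positions never look at the last token, so they are unaffected.
print(torch.allclose(out[:-1], out_changed[:-1]))  # True
```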


Now I can say we are ready to build our Transformer Decoder block. (In this work we only need to do generation, hence a Transformer Decoder suffices for our needs!!)


In [14]:
## Building transformer decoder ##


class TransformerDecoder(nn.Module):
    """The Transformer Decoder Block."""

    def __init__(self, sequence_length=8, d_model=512):
        """Constructor"""
        super().__init__()

        # Embedding #
        self.embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=d_model)

        # positional encoding network #
        self.positional_embed = nn.Embedding(
            num_embeddings=sequence_length, embedding_dim=d_model
        )

        # Key, query, value mlp #
        self.keys_mlp = nn.Linear(in_features=d_model, out_features=d_model, bias=False)
        self.queries_mlp = nn.Linear(
            in_features=d_model, out_features=d_model, bias=False
        )
        self.values_mlp = nn.Linear(
            in_features=d_model, out_features=d_model, bias=False
        )

        self.layer_norm = nn.LayerNorm(d_model)
        self.ffn = nn.Linear(in_features=d_model, out_features=d_model)

        self.linear = nn.Linear(in_features=d_model, out_features=len(vocab))

    def forward(self, x, device=device):

        # Embedding input #
        x = self.embed(x)
        B, L, C = x.shape

        # positional embedding #
        pos = torch.arange(L, device=device).repeat(B, 1)
        positional_embedding = self.positional_embed(pos)

        # Adding input with positional embedding #
        x = x + positional_embedding

        # Mapping keys, queries and values #
        k = self.keys_mlp(x)
        q = self.queries_mlp(x)
        v = self.values_mlp(x)

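        # Attention! Wohooo! #
        # Scaled scores of every query against every key, shape (B, L, L) #
        out = q @ k.transpose(-1, -2) / C**0.5
        # Causal mask: position i may only attend to positions <= i. A boolean #
        # mask avoids `out == 0`, which would also hide genuine zero scores #
        causal_mask = torch.tril(torch.ones(L, L, device=x.device)).bool()
        out = out.masked_fill(~causal_mask, float("-inf"))
        out = nn.functional.softmax(out, dim=-1)
        out = out @ v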

        # Residual connection and layernorm #
        x = self.layer_norm(out + x)

        # Residual connection and final feedforward #
        x = x + self.ffn(x)

        # Final linear layer #
        x = self.linear(x)

        return x


Boom!! Done... We just created our very simple Transformer network. It is very simple (but has all the core pieces), with a single head and only a single layer. I will derive the multi-head version in the next notebook. This notebook will grow very slowly and naively for now.


In [15]:
## Testing ##

test_decoder = TransformerDecoder()

for batch_inp, batch_labels in poems_dl:
    a = test_decoder(batch_inp, device="cpu")
    print(a.shape)
    break


torch.Size([4, 8, 4113])


So, let's try training our model and see how well it learns to predict.
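
Before training, it helps to have a reference point for the loss. A model that predicts every word with equal probability has a cross-entropy of ln(vocabulary size), which for our 4113 words is about 8.32. The sketch below checks this, using all-zero logits as a stand-in for such a uniform prediction, and also shows the `(B * L, vocab)` flattening that `CrossEntropyLoss` expects. The labels are random and purely illustrative.

```python
import math

import torch

vocab_size = 4113  # vocabulary length reported earlier in the notebook
B, L = 4, 8        # batch size and sequence length used by the dataloader

# All-zero logits are a stand-in for a perfectly uniform prediction.
logits = torch.zeros(B, L, vocab_size)
labels = torch.randint(0, vocab_size, (B, L))  # random, illustrative labels

loss_func = torch.nn.CrossEntropyLoss()

# Flatten so that every position in every sequence is one classification example.
loss = loss_func(logits.view(B * L, -1), labels.view(B * L))

print(loss.item(), math.log(vocab_size))
```

Any training loss well below this value means the model has learned something beyond guessing uniformly.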


In [16]:
## Creating dataloader ##
poems_dl = build_dl(poems_data)

## Setting our model ##
model = TransformerDecoder()
model = model.to(device)

## Loss function and optimizer ##

loss_func = torch.nn.CrossEntropyLoss()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)


In [17]:
## Training Loop ##

batch_loss = []

for i in range(50):
    loop = tqdm(poems_dl)

    minibatch_loss = []

    for batch_inp, batch_labels in loop:
        batch_inp = batch_inp.to(device)
        batch_labels = batch_labels.to(device)
        B, L = batch_inp.shape
        pred = model(batch_inp)
        pred = pred.view(B * L, -1)
        batch_labels = batch_labels.view(B * L)
        loss = loss_func(pred, batch_labels)
        # NOTE: the "/ 200" in the label is hard-coded; the loop above runs 50 epochs #
        loop.set_description(f"Epoch : {i + 1} / 200 ::")
        minibatch_loss.append(loss.item())
        loop.set_postfix(Loss=sum(minibatch_loss) / len(minibatch_loss))
        optim.zero_grad()
        loss.backward()
        optim.step()

    batch_loss.append(sum(minibatch_loss) / len(minibatch_loss))


Epoch : 1 / 200 ::: 100%|██████████████████████████████████████████████| 4506/4506 [00:42<00:00, 106.02it/s, Loss=4.41]
Epoch : 2 / 200 ::: 100%|███████████████████████████████████████████████| 4506/4506 [00:41<00:00, 107.83it/s, Loss=3.3]
Epoch : 3 / 200 ::: 100%|██████████████████████████████████████████████| 4506/4506 [00:44<00:00, 101.50it/s, Loss=2.97]
Epoch : 4 / 200 ::: 100%|██████████████████████████████████████████████| 4506/4506 [00:42<00:00, 105.26it/s, Loss=2.46]
Epoch : 5 / 200 ::: 100%|██████████████████████████████████████████████| 4506/4506 [00:42<00:00, 106.47it/s, Loss=2.12]
Epoch : 6 / 200 ::: 100%|██████████████████████████████████████████████| 4506/4506 [00:42<00:00, 106.32it/s, Loss=1.81]
Epoch : 7 / 200 ::: 100%|██████████████████████████████████████████████| 4506/4506 [00:41<00:00, 107.98it/s, Loss=1.62]
Epoch : 8 / 200 ::: 100%|██████████████████████████████████████████████| 4506/4506 [00:42<00:00, 105.56it/s, Loss=1.58]
Epoch : 9 / 200 ::: 100%|███████████████

In [18]:
## Generating ##

model.eval()

## Index 0 is <start>; use the device selected earlier rather than hard-coding "cuda" ##
start_idx = torch.zeros((1, 1), device=device).int()

for i in range(20):
    idx = start_idx
    while True:
        logits = model(idx[:, -8:])
        logits = logits[:, -1, :]
        prob = torch.nn.functional.softmax(logits, dim=1)
        next_idx = torch.multinomial(prob, 1)
        next_idx = next_idx.int()
        if next_idx == 1:
            print(" ".join(show_data(idx[0, 1:].data, verbose=False)))
            break
        idx = torch.cat([idx, next_idx], dim=1)


i dare to tell,
to be brim, and even above the grove, some waif

except forget to go in fear perhaps.
and, in the dust the grass was dry,
"how many are you, then," said i,
and, that immortal sea
a deep piece of some old running river
had we can by were to we but who could pass by
while the earth herself is adorning,
nor know that she's there,
it becomes a habit.
is too much with us; late and soon,
you must have seen me there,
"you say that two at conway dwell,
of my mother's door,
on the best way out as a look-off where we faced
beside the wall stands bare,
in this bush our sparrow built her nest,
i have to).
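
The generation loop above runs until the model samples `<end>`, which in principle may never happen. Below is a sketch of the same loop wrapped in a function with a cap on the number of new tokens. The names `generate`, `max_new_tokens` and `dummy_model` are my own assumptions; `dummy_model` is a deterministic stand-in over a 5-token vocabulary so the example runs without the trained network.

```python
import torch


def generate(model, start_token, end_token, context_len=8, max_new_tokens=50):
    """Sample tokens one at a time until <end> or the token cap is reached."""
    idx = torch.tensor([[start_token]])
    with torch.no_grad():
        for _ in range(max_new_tokens):
            # Only the last `context_len` tokens are fed to the model.
            logits = model(idx[:, -context_len:])[:, -1, :]
            prob = torch.softmax(logits, dim=1)
            next_idx = torch.multinomial(prob, 1)
            if next_idx.item() == end_token:
                break
            idx = torch.cat([idx, next_idx], dim=1)
    return idx[0, 1:].tolist()  # drop the <start> token


def dummy_model(idx):
    """Hypothetical stand-in for the decoder: maps 0 -> 2 -> 3 -> 4 -> 1 (<end>)."""
    vocab_size = 5
    next_token = {0: 2, 2: 3, 3: 4, 4: 1}
    B, T = idx.shape
    logits = torch.full((B, T, vocab_size), float("-inf"))
    for b in range(B):
        for t in range(T):
            logits[b, t, next_token[int(idx[b, t])]] = 0.0
    return logits


print(generate(dummy_model, start_token=0, end_token=1))                    # [2, 3, 4]
print(generate(dummy_model, start_token=0, end_token=1, max_new_tokens=2))  # [2, 3]
```

With the real model you would pass the trained decoder and the `<start>`/`<end>` indices (0 and 1) instead, and move the starting tensor to the right device.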


Amazing... This looks like a nice place to stop; we will jump back in with the next notebook.
