# TinyTransformer  
Following the excellent lectures by Andrej Karpathy, we are going to code a small deep learning model that works just like GPT-2/GPT-3.
These systems are in essence decoder-only transformers: the decoder stack is kept, but the cross-attention to a transformer encoder is dropped, leaving only masked self-attention. We call it tinyTransformer because we will build a model with very much the same architecture, just a little smaller.

In [16]:
import os
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(2112)


<torch._C.Generator at 0x7fa90871ed50>

# Data
We will start by loading our toy dataset. In this case it is the Maarten van Rossem dataset, a set of transcriptions which I collected previously using `whisper`. The goal of our model will be to generate realistic-sounding continuations of this dataset.

In [2]:
paths = os.listdir("./data/maarten/")
files = []
for path in paths:
    with open("./data/maarten/" + path, "r") as file:
        files.append(file.read())


In [3]:
# concatenate all transcripts into one long string
dataset = "".join(files)


In [4]:
print(dataset[:1000])


Kijk, je kunt in Amerika ook universitair getraind stewardess worden.
Dat maakt niet uit daar.
Op alle niveaus is wel onderwijs wat bij ons denk veel hbo-achtige opleidingen zijn.
Dat is thuis, als allemaal in principe is...
Wat tertiaire is, is tertiaire onderwijs.
Spijt van Hossum, kunt u mij horen?
Mijn pensioen beschouwen is een bijzonder gelukkige fase van mijn leven.
Je doet hele leuke dingen?
Ja, ik doe leuke dingen.
Vooral de vervelende dingen kan ik zelf afschaffen.
Ik kan gewoon zeggen, sorry, maar ik ga helemaal niet vergaderen.
Jullie mogen zoveel vergaderen als je wilt.
Dat heb ik ruim voldoende gedaan in mijn leven en nog steeds ben ik van mening...
dat al die uren volkomen en totaal verspeeld zijn geweest.
Ik denk hierbij aan vergaderingen in de universitaire kring.
Ik hoef niet meer naar te kijken.
Waar ik een enorme hekel aan had, eerlijk gezegd.
Omdat het een stuk waardeloos werk is.
Dan word je zo moe van al die verkeerde antwoorden.
Van die scripties die zo slecht z

In [5]:
print(len(dataset))


11687778


In [32]:
# all unique characters in this dataset are:
chars = sorted(list(set(dataset)))
vocab_size = len(chars)
print(r"".join(chars))



 !%&'+,-.0123456789?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzÖßàáäçèéêëïóôöüēńș…脚


Note that character 0 is the newline character, which is why the output above starts with a line break rather than a visible symbol.

It's interesting that the transcription contains a Chinese character (脚), as Chinese is not a language used in this Dutch podcast, but we'll leave it in the dataset for now.
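
To see where such an unexpected character comes from, one option is to count character frequencies and print some context around the rare ones. The sketch below runs on a made-up string, `toy_dataset`, which is an assumption standing in for the real `dataset`; the same lines should work on the real data.

```python
from collections import Counter

# Toy stand-in for the real `dataset` string (assumption for illustration).
toy_dataset = "Dat is zo.\nDat is zo.\n脚\n"

counts = Counter(toy_dataset)

# Characters that occur only once are likely transcription glitches.
rare = sorted(ch for ch, n in counts.items() if n == 1)
print(rare)

# Show a little context around the first occurrence of each rare character.
for ch in rare:
    pos = toy_dataset.find(ch)
    print(repr(toy_dataset[max(0, pos - 10) : pos + 10]))
```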

In [37]:
# create a mapping to tokens
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: "".join(itos[i] for i in l)
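
A quick way to check such a character-level tokenizer is a round trip: encoding a string and decoding the result should give back the original. The sketch below rebuilds the same mapping on a tiny made-up corpus; the names `toy_text`, `toy_encode` and `toy_decode` are assumptions for illustration only.

```python
# Toy corpus standing in for the real `dataset` string (assumption for illustration).
toy_text = "hallo wereld\n"

# Same construction as above: the sorted unique characters define the vocabulary.
toy_chars = sorted(set(toy_text))
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_itos = {i: ch for i, ch in enumerate(toy_chars)}


def toy_encode(s):
    """Map a string to a list of integer token ids."""
    return [toy_stoi[c] for c in s]


def toy_decode(ids):
    """Map a list of integer token ids back to a string."""
    return "".join(toy_itos[i] for i in ids)


ids = toy_encode("hallo")
print(ids)
print(toy_decode(ids))
```

Because the characters are sorted before numbering, the newline ends up as token 0 here as well, just like in the real vocabulary.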


In [8]:
data = torch.tensor(encode(dataset), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:25])


torch.Size([11687778]) torch.int64
tensor([31, 55, 56, 57,  7,  1, 56, 51,  1, 57, 67, 60, 66,  1, 55, 60,  1, 21,
        59, 51, 64, 55, 57, 47,  1])


In [9]:
# create a train, validation, and test set
n_tr, n_val = int(0.8 * len(data)), int(0.9 * len(data))
train_data = data[:n_tr]
validation_data = data[n_tr:n_val]
test_data = data[n_val:]


In [10]:
test_data


tensor([ 1, 59, 47,  ..., 65,  9,  0])

In [48]:
torch.manual_seed(2112)
batch_size = 64
block_size = 32


def get_batch(split):
    # return a random batch of inputs x and next-character targets y from the chosen split
    data = train_data if split == "train" else validation_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i : i + block_size] for i in ix])
    y = torch.stack([data[i + 1 : i + 1 + block_size] for i in ix])
    return x, y


xb, yb = get_batch("train")
print(xb)
print(yb)
for b in range(batch_size):
    for t in range(block_size):
        context = xb[b, : t + 1]
        target = yb[b, t]
        print(f"when input is {context.tolist()} the target: {target}.")


tensor([[58,  7,  1,  ..., 65,  1, 50],
        [51,  1, 52,  ...,  1, 60, 47],
        [50, 55, 51,  ..., 61, 51, 66],
        ...,
        [66,  7,  1,  ...,  1, 23, 47],
        [47, 47, 64,  ..., 65,  1, 51],
        [72, 51,  1,  ..., 54, 51, 66]])
tensor([[ 7,  1, 57,  ...,  1, 50, 47],
        [ 1, 52, 61,  ..., 60, 47, 66],
        [55, 51,  1,  ..., 51, 66,  1],
        ...,
        [ 7,  1, 60,  ..., 23, 47, 60],
        [47, 64,  1,  ...,  1, 51, 51],
        [51,  1, 60,  ..., 51, 66,  1]])
when input is [58] the target: 7.
when input is [58, 7] the target: 1.
when input is [58, 7, 1] the target: 57.
when input is [58, 7, 1, 57] the target: 47.
when input is [58, 7, 1, 57, 47] the target: 47.
when input is [58, 7, 1, 57, 47, 47] the target: 64.
when input is [58, 7, 1, 57, 47, 47, 64] the target: 66.
when input is [58, 7, 1, 57, 47, 47, 64, 66] the target: 7.
when input is [58, 7, 1, 57, 47, 47, 64, 66, 7] the target: 1.
when input is [58, 7, 1, 57, 47, 47, 64, 66, 7, 1] th
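
The printout above shows that every position in a block is its own training example, with the target being the next character. A compact way to check this is to confirm that the target tensor is the input tensor shifted by one position. The sketch below assumes a toy token stream `toy_data` (simply the numbers 0 to 99) and a copy of the batching logic called `toy_get_batch`; both are illustrative names, not part of the notebook.

```python
import torch

torch.manual_seed(0)

# Toy token stream standing in for `train_data` (assumption for illustration).
toy_data = torch.arange(100)
batch_size, block_size = 4, 8


def toy_get_batch(data):
    """Sample random windows and their one-step-shifted targets."""
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i : i + block_size] for i in ix])
    y = torch.stack([data[i + 1 : i + 1 + block_size] for i in ix])
    return x, y


x, y = toy_get_batch(toy_data)

# Every target row is the input row shifted one position to the left.
print(torch.equal(x[:, 1:], y[:, :-1]))
# Because toy_data is 0, 1, 2, ..., each target is simply input + 1.
print(torch.equal(y, x + 1))
```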

## Start with a simple bigram model

In [49]:
class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)  # shape (Batchsize, Blocksize, Tokens)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape

            # We need to make views that conform to the pytorch input conventions
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
            loss = F.cross_entropy(logits, targets)
        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :]  # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1)  # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1)  # (B, T+1)
        return idx


M_bigram = BigramLanguageModel(vocab_size)
logits, loss = M_bigram(xb, yb)
print(logits.shape)
print(loss)

print(
    decode(
        M_bigram.generate(
            idx=torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100
        )[0].tolist()
    )
)


torch.Size([2048, 93])
tensor(5.0019, grad_fn=<NllLossBackward0>)

.wD6óLTlvRô8Hv5
ëëQ9脚5ēïuX1Hàw.4ZM'EcáLNKhSoD&,u,RjïmPSü,ń%qDEjdÖS,ZoazXbEuUZHTC-RI+'Whv6w脚T!èwäGF'u
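
As a sanity check on the untrained loss: a model that spreads its probability evenly over all 93 characters would have a cross-entropy of ln(93). The short calculation below makes that number explicit. The loss of about 5.0 reported above is somewhat higher, which is plausible because `nn.Embedding` starts from random normal weights, so the initial logits are not uniform.

```python
import math

vocab_size = 93  # number of unique characters, as in the logits shape above

# Cross-entropy of a uniform prediction over the vocabulary: -ln(1 / vocab_size).
uniform_loss = math.log(vocab_size)
print(f"{uniform_loss:.4f}")
```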


In [50]:
optimizer = torch.optim.AdamW(M_bigram.parameters(), lr=1e-3)


In [51]:
for step in range(10000):
    xb, yb = get_batch("train")

    # evaluate the loss
    logits, loss = M_bigram(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print(loss.item())


2.2846109867095947
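
The value printed above is the loss on the last training batch only, so it is noisy and says nothing about held-out data. One common remedy is to average the loss over several random batches for both the training and the validation split. The sketch below shows the idea on toy stand-ins: `toy_data`, `splits`, `toy_get_batch`, the tiny `model` and the helper `estimate_loss` are all hypothetical names chosen for this example.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)

# Toy stand-ins (assumptions): a random token stream and a tiny lookup-table model.
vocab_size, batch_size, block_size = 10, 8, 4
toy_data = torch.randint(vocab_size, (1000,))
splits = {"train": toy_data[:800], "val": toy_data[800:]}
model = nn.Embedding(vocab_size, vocab_size)


def toy_get_batch(split):
    """Sample a random batch of inputs and shifted targets from one split."""
    data = splits[split]
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i : i + block_size] for i in ix])
    y = torch.stack([data[i + 1 : i + 1 + block_size] for i in ix])
    return x, y


@torch.no_grad()
def estimate_loss(eval_iters=50):
    """Hypothetical helper: mean cross-entropy over several random batches per split."""
    model.eval()
    out = {}
    for split in splits:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            x, y = toy_get_batch(split)
            logits = model(x)  # (B, T, C)
            losses[k] = F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))
        out[split] = losses.mean().item()
    model.train()
    return out


print(estimate_loss())
```

Calling such a helper every few hundred steps inside the training loop would show whether the training and validation losses start to diverge.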


In [52]:
print(
    decode(
        M_bigram.generate(
            idx=torch.zeros((1, 1), dtype=torch.long), max_new_tokens=300
        )[0].tolist()
    )
)



TOfoechelstwegelink d hent ok n.
En jneteniserl haroret s, maaan jnzoeeweldaaat el hen drijeg rorij zt d daan dys h was a Qantar to nindeget ijäzochende den wat ierik i hijk ereenimien digen iters chel donterdiece gt pat.
miedijangerasst he elaacit je en oo'schezessst g wenscijkur was e ewen, PS-che


This already looks like actual text, and it looks a bit like Dutch. It resembles what a non-Dutch speaker might write when imitating Dutch while only knowing what the language sounds like.

# Now we try to implement self-attention

In [55]:
# a single head of masked (causal) self-attention
torch.manual_seed(2112)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)

# let's see a single Head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x)   # (B, T, 16)
q = query(x) # (B, T, 16)
wei =  q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)

tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)

v = value(x)
out = wei @ v

out.shape

torch.Size([4, 8, 16])
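
Two properties of this masked head are worth checking: each row of the attention weights sums to one, and an output at position t never depends on tokens after t. The sketch below wraps the same computation in a hypothetical function `attend` and tests both. It also includes the 1/sqrt(head_size) scaling from scaled dot-product attention, which the cell above leaves out; that scaling is an addition in this sketch, not part of the original cell.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)
B, T, C, head_size = 2, 8, 32, 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
tril = torch.tril(torch.ones(T, T))


def attend(x):
    """Single masked self-attention head with 1/sqrt(head_size) scaling."""
    k, q, v = key(x), query(x), value(x)
    wei = q @ k.transpose(-2, -1) * head_size**-0.5  # (B, T, T)
    wei = wei.masked_fill(tril == 0, float("-inf"))  # hide future positions
    wei = F.softmax(wei, dim=-1)
    return wei, wei @ v


x = torch.randn(B, T, C)
wei, out = attend(x)

# Each row of the attention matrix is a probability distribution over past positions.
print(torch.allclose(wei.sum(dim=-1), torch.ones(B, T)))

# Changing the last token must leave all earlier outputs untouched.
x_changed = x.clone()
x_changed[:, -1, :] = torch.randn(B, C)
_, out_changed = attend(x_changed)
print(torch.allclose(out[:, :-1], out_changed[:, :-1]))
```

The scaling keeps the variance of the raw scores roughly independent of `head_size`, so the softmax does not saturate as the head grows.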