<a href="https://colab.research.google.com/github/dominiksakic/zero_to_hero/blob/main/adv_04_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

- EX2: Train the GPT on your own dataset of choice! What other data could be fun to blabber on about? (A fun advanced suggestion if you like: train a GPT to do addition of two numbers, i.e. a+b=c. You may find it helpful to predict the digits of c in reverse order, as the typical addition algorithm (that you're hoping it learns) would proceed right to left too. You may want to modify the data loader to simply serve random problems and skip the generation of train.bin, val.bin. You may want to mask out the loss at the input positions of a+b that just specify the problem using y=-1 in the targets (see CrossEntropyLoss ignore_index). Does your Transformer learn to add? Once you have this, swole doge project: build a calculator clone in GPT, for all of +-*/. Not an easy problem. You may need Chain of Thought traces.)
- EX3: Find a dataset that is very large, so large that you can't see a gap between train and val loss. Pretrain the transformer on this data, then initialize with that model and finetune it on tiny shakespeare with a smaller number of steps and lower learning rate. Can you obtain a lower validation loss by the use of pretraining?
- EX4: Read some transformer papers and implement one additional feature or change that people seem to use. Does it improve the performance of your GPT?

In [1]:
!pip install kagglehub -q

In [2]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("ffatty/plain-text-wikipedia-simpleenglish")
print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/ffatty/plain-text-wikipedia-simpleenglish?dataset_version_number=2...


100%|██████████| 128M/128M [00:01<00:00, 78.2MB/s]

Extracting files...





Path to dataset files: /root/.cache/kagglehub/datasets/ffatty/plain-text-wikipedia-simpleenglish/versions/2


In [3]:
import os
file_path = os.path.join(path, "AllCombined.txt")
with open(file_path, 'r', encoding='utf-8') as f:
    text = f.read()

In [4]:
print("Length of loaded text:", len(text))
print("First 500 characters:\n", text[:500])

Length of loaded text: 178255102
First 500 characters:
 
April

April (Apr.) is the fourth month of the year in the Julian and Gregorian calendars, and comes between March and May. It is one of four months to have 30 days.

April always begins on the same day of the week as July, and additionally, January in leap years. April always ends on the same day of the week as December.

April comes between March and May, making it the fourth month of the year. It also comes first in the year out of the four months that have 30 days, as June, September and No


In [5]:
## STARTING CODE ##
import math
import torch
import torch.nn as nn
from torch.nn import functional as F

## PREP DATA
chars = sorted(list(set(text)))
vocab_size = len(chars)
stoi = {ch:i  for i, ch in enumerate(chars)}
itos = {i : ch for i, ch in enumerate(chars)}

encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

data = torch.tensor(encode(text), dtype=torch.long)

n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]

In [6]:
print(f"Vocabulary consist of: {chars}")
print(f"Vocab size is: {len(chars)}")
print(f"Train data has the length: {len(train_data)}")
print(f"Val data has the length: {len(val_data)}")

Vocabulary consist of: ['\n', ' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', '\\', ']', '^', '_', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '|', '}', '~', '\x80', '\x92', '\x9d', '\xa0', '¡', '¢', '£', '¤', '¥', '§', '¨', '©', 'ª', '«', '¬', '\xad', '®', '¯', '°', '±', '²', '³', '´', 'µ', '¶', '·', '¸', '¹', 'º', '»', '¼', '½', '¾', '¿', 'À', 'Á', 'Â', 'Ã', 'Ä', 'Å', 'Æ', 'Ç', 'È', 'É', 'Ê', 'Ë', 'Ì', 'Í', 'Î', 'Ð', 'Ñ', 'Ò', 'Ó', 'Ô', 'Õ', 'Ö', '×', 'Ø', 'Ù', 'Ú', 'Û', 'Ü', 'Ý', 'Þ', 'ß', 'à', 'á', 'â', 'ã', 'ä', 'å', 'æ', 'ç', 'è', 'é', 'ê', 'ë', 'ì', 'í', 'î', 'ï', 'ð', 'ñ', 'ò', 'ó', 'ô', 'õ', 'ö', '÷', 'ø', 'ù', 'ú', 'û', 'ü', 'ý', 'þ',

In [7]:
def apply_rope(q, k):
    # q, k: (B, n_head, T, head_size), where head_size must be even
    B, nh, T, hs = q.shape
    assert hs % 2 == 0, "head_size must be even for RoPE"
    half = hs // 2

    freqs = torch.exp(-torch.arange(0, half, dtype=torch.float32) * math.log(10000) / half).to(q.device)  # (half,)
    positions = torch.arange(T, device=q.device).float()  # (T,)
    angles = torch.einsum('t,d->td', positions, freqs)  # (T, half)
    sin = angles.sin().unsqueeze(0).unsqueeze(0)  # (1, 1, T, half)
    cos = angles.cos().unsqueeze(0).unsqueeze(0)  # (1, 1, T, half)

    q1, q2 = q[..., :half], q[..., half:]
    k1, k2 = k[..., :half], k[..., half:]
    q_rotated = torch.cat([q1 * cos - q2 * sin, q1 * sin + q2 * cos], dim=-1)
    k_rotated = torch.cat([k1 * cos - k2 * sin, k1 * sin + k2 * cos], dim=-1)
    return q_rotated, k_rotated


batch_size = 16
block_size = 32
max_iters = 5000
eval_interval = 100
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 64
n_head = 4
n_layer = 4
dropout = 0.0

torch.manual_seed(1337)


@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

In [8]:
class MultiHeadAttention(nn.Module):
    """Multiple heads of self-attention in parallel."""

    def __init__(self, n_embd, num_heads, dropout):
        super().__init__()
        assert n_embd % num_heads == 0
        self.n_head = num_heads
        self.head_size = n_embd // num_heads
        self.dropout = nn.Dropout(dropout)

        # Linear projections for q, k, v (combined head projection)
        self.key = nn.Linear(n_embd, n_embd, bias=False)
        self.query = nn.Linear(n_embd, n_embd, bias=False)
        self.value = nn.Linear(n_embd, n_embd, bias=False)

        # Final projection layer
        self.proj = nn.Linear(n_embd, n_embd)

    def forward(self, x):
        B, T, C = x.shape  # (batch, time, channels)

        # Project and reshape into multiple heads
        k = self.key(x).view(B, T, self.n_head, self.head_size).transpose(1, 2)  # (B, nh, T, hs)
        q = self.query(x).view(B, T, self.n_head, self.head_size).transpose(1, 2)  # (B, nh, T, hs)
        v = self.value(x).view(B, T, self.n_head, self.head_size).transpose(1, 2)  # (B, nh, T, hs)

        # Apply Rotary Positional Embedding if available
        q, k = apply_rope(q, k)

        # Compute attention weights
        wei = q @ k.transpose(-2, -1) * (self.head_size ** -0.5)  # (B, nh, T, T)

        # Mask to prevent attending to future tokens (causal attention)
        mask = torch.tril(torch.ones(T, T, device=x.device)).unsqueeze(0).unsqueeze(0)  # (1, 1, T, T)
        wei = wei.masked_fill(mask == 0, float('-inf'))

        wei = F.softmax(wei, dim=-1)
        wei = self.dropout(wei)

        # Weighted sum of values
        out = wei @ v  # (B, nh, T, hs)
        out = out.transpose(1, 2).contiguous().view(B, T, C)  # reassemble heads (B, T, C)
        out = self.dropout(self.proj(out))
        return out



class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_embd, n_head, dropout)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

In [9]:
class GPTLanguageModelRoPE(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

        # better init, not covered in the original GPT video, but important, will cover in followup video
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        x = self.blocks(tok_emb) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = GPTLanguageModelRoPE()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

0.88893 M parameters


In [10]:
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

In [11]:
for iter in range(max_iters):
    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

step 0: train loss 8.5840, val loss 8.5854
step 100: train loss 2.9957, val loss 3.0391
step 200: train loss 2.6121, val loss 2.6511
step 300: train loss 2.4580, val loss 2.4953
step 400: train loss 2.3523, val loss 2.3959
step 500: train loss 2.3153, val loss 2.3614
step 600: train loss 2.2760, val loss 2.3009
step 700: train loss 2.2111, val loss 2.2384
step 800: train loss 2.1979, val loss 2.2225
step 900: train loss 2.1794, val loss 2.1970
step 1000: train loss 2.1609, val loss 2.1870
step 1100: train loss 2.1257, val loss 2.1713
step 1200: train loss 2.1058, val loss 2.1448
step 1300: train loss 2.0817, val loss 2.1264
step 1400: train loss 2.0632, val loss 2.0827
step 1500: train loss 2.0433, val loss 2.0757
step 1600: train loss 2.0374, val loss 2.0752
step 1700: train loss 2.0302, val loss 2.0491
step 1800: train loss 2.0188, val loss 2.0355
step 1900: train loss 1.9894, val loss 2.0271
step 2000: train loss 1.9814, val loss 2.0181
step 2100: train loss 1.9878, val loss 2.0089


In [12]:
# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))




اheallory

Haxe Ebañamy (born 24 January, 2018) was goody the metern lesong cances. If 1 kingther defence closes built of childramily in claw mayns in the and became though in gained to east (subs down ouvated he rese tolandted censuse of a four term (an be due-sturni-fermal Libie" for these namedy That have at.

Nore get 21,003 and who have number has many who and Saeli Kean, when that the raisional dechever to Ł in the East Pokldeban down In 2020, backs and hee eason his largest compuper, kn


In [15]:
torch.save(model, "transformer_wikipedia")

#fine tune the model with new dataset and lower learning rate!


In [16]:
## LOAD DATA Tiny shakespear
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt


with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

## PREP DATA
chars = sorted(list(set(text)))
vocab_size = len(chars)
stoi = {ch:i  for i, ch in enumerate(chars)}
itos = {i : ch for i, ch in enumerate(chars)}

encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

data = torch.tensor(encode(text), dtype=torch.long)

n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]

--2025-08-05 12:32:10--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2025-08-05 12:32:10 (23.8 MB/s) - ‘input.txt’ saved [1115394/1115394]



In [17]:
learning_rate = 5e-5
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

In [18]:
for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

step 0: train loss 5.8224, val loss 5.8176
step 100: train loss 3.7831, val loss 3.8245
step 200: train loss 3.4517, val loss 3.4820
step 300: train loss 3.2261, val loss 3.2573
step 400: train loss 3.0831, val loss 3.0872
step 500: train loss 2.9432, val loss 2.9466
step 600: train loss 2.8312, val loss 2.8379
step 700: train loss 2.7398, val loss 2.7421
step 800: train loss 2.6709, val loss 2.6679
step 900: train loss 2.6056, val loss 2.6077
step 1000: train loss 2.5526, val loss 2.5507
step 1100: train loss 2.5123, val loss 2.5149
step 1200: train loss 2.4809, val loss 2.4827
step 1300: train loss 2.4423, val loss 2.4469
step 1400: train loss 2.4106, val loss 2.4230
step 1500: train loss 2.3847, val loss 2.4008
step 1600: train loss 2.3587, val loss 2.3699
step 1700: train loss 2.3318, val loss 2.3484
step 1800: train loss 2.3079, val loss 2.3426
step 1900: train loss 2.2999, val loss 2.3079
step 2000: train loss 2.2855, val loss 2.2918
step 2100: train loss 2.2617, val loss 2.2821


In [19]:
# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))


MORES:
Moke mary.

BIARWOIURUS:
Is have my good nomes the men.

CORY IsOz:
I he will wetived, with courfs:
I shout migh.

And as wetlace.

SONsCLed Perbent oveven moparst to the fkeever soie so ume his rare, lake ny bet Cloovecest he hand dusher.
For mone.

NENEHA:
That
The vin
Now my To with sels Fond the well no crive,
Tchead
For, ild.
I know the no not with the so the of tius done no come sut now portanot what that the muke peiclesenke
Pon so hance paib.
That chotive worter, for ;
you conto e
