<a href="https://colab.research.google.com/github/dominiksakic/zero_to_hero/blob/main/basics_06_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

- source: https://www.youtube.com/watch?v=kCc8FmEb1nY&t=126s

In [1]:
import torch
import torch.nn as nn
from torch.nn import functional as F

In [2]:
# get data
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2025-07-19 07:19:32--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2025-07-19 07:19:32 (22.5 MB/s) - ‘input.txt’ saved [1115394/1115394]



In [3]:
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

print(text[:100])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You


In [4]:
# make decoder, encoder
chars = sorted(list(set(text)))
vocab_size = len(chars)
stoi = {ch:i  for i, ch in enumerate(chars)}
itos = {i : ch for i, ch in enumerate(chars)}

encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])
print(encode(text[:50]))
print(decode(encode(text[:50])))
print(f"Vocav size: {vocab_size}")

[18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 0, 14, 43, 44, 53, 56, 43, 1, 61, 43, 1, 54, 56, 53, 41, 43, 43, 42, 1, 39, 52, 63, 1, 44, 59, 56, 58, 46, 43, 56, 6, 1, 46, 43, 39, 56]
First Citizen:
Before we proceed any further, hear
Vocav size: 65


In [5]:
# Tokenize data, and create test/val
data = torch.tensor(encode(text), dtype=torch.long)

n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]

In [None]:
# excursion into how the model predicts next token from one sentence
block_size = 8

x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]  #<---
    target = y[t]
    print(f'timestep {t}: when input is {context} the target is {target}')

timestep 0: when input is tensor([18]) the target is 47
timestep 1: when input is tensor([18, 47]) the target is 56
timestep 2: when input is tensor([18, 47, 56]) the target is 57
timestep 3: when input is tensor([18, 47, 56, 57]) the target is 58
timestep 4: when input is tensor([18, 47, 56, 57, 58]) the target is 1
timestep 5: when input is tensor([18, 47, 56, 57, 58,  1]) the target is 15
timestep 6: when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target is 47
timestep 7: when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target is 58


  - Result is that the model learns to complete from various lengths.
  - from one characters up to 8.


In [None]:
# lets make the example more complex by introducing a batch dimension
torch.manual_seed(1337)
batch_size = 4

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+ block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch('train')

for b in range(batch_size):
  for t in range(block_size):
    context = xb[b, :t+1]
    target = yb[b, t]

    print(f"Batch {b}: when context is {context}, target is {target}")
  print("\n")

Batch 0: when context is tensor([24]), target is 43
Batch 0: when context is tensor([24, 43]), target is 58
Batch 0: when context is tensor([24, 43, 58]), target is 5
Batch 0: when context is tensor([24, 43, 58,  5]), target is 57
Batch 0: when context is tensor([24, 43, 58,  5, 57]), target is 1
Batch 0: when context is tensor([24, 43, 58,  5, 57,  1]), target is 46
Batch 0: when context is tensor([24, 43, 58,  5, 57,  1, 46]), target is 43
Batch 0: when context is tensor([24, 43, 58,  5, 57,  1, 46, 43]), target is 39


Batch 1: when context is tensor([44]), target is 53
Batch 1: when context is tensor([44, 53]), target is 56
Batch 1: when context is tensor([44, 53, 56]), target is 1
Batch 1: when context is tensor([44, 53, 56,  1]), target is 58
Batch 1: when context is tensor([44, 53, 56,  1, 58]), target is 46
Batch 1: when context is tensor([44, 53, 56,  1, 58, 46]), target is 39
Batch 1: when context is tensor([44, 53, 56,  1, 58, 46, 39]), target is 58
Batch 1: when context is 

In [None]:
# create a baseline model/bigram
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):
  def __init__(self, vocab_size):
    super().__init__()
    self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

  def forward(self, idx, targets=None):
    logits = self.token_embedding_table(idx) # (B,T,C)

    if targets is None:
      loss = None
    else:
      B, T, C = logits.shape

      logits = logits.view(B*T, C)
      targets = targets.view(B*T)
      loss = F.cross_entropy(logits, targets)

    return logits, loss

  def generate(self, idx, max_new_tokens):
      # idx is (B, T) array of indices in the current context
      for _ in range(max_new_tokens):
          # get the predictions
          logits, loss = self(idx)
          # focus only on the last time step
          logits = logits[:, -1, :] # becomes (B, C)
          # apply softmax to get probabilities
          probs = F.softmax(logits, dim=-1) # (B, C)
          # sample from the distribution
          idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
          # append sampled index to the running sequence
          idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
      return idx

m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)
print(logits.shape)
print(loss.item())

print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=50)[0].tolist()))

torch.Size([256, 65])
4.648044586181641

Sr?qP-QWktXoL&jLDJgOLVz'RIoDqHdhsV&vLLxatjscMpwLER


In [None]:
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

In [None]:
# train the baseline
batch_size = 32

for s in range(100000):
  # trainig data
  xb, yb = get_batch('train')

  # Forward pass
  logits, loss = m(xb, yb)
  optimizer.zero_grad(set_to_none=True)

  #Backward pass
  loss.backward()
  optimizer.step()

print(loss.item())

2.4081521034240723


In [None]:
# sample after training
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=500)[0].tolist()))



MAUEdingn I
GBENoutl, Thasuriopro sorllll le.

Mat,
FI s ICKI ambe mithevis LLEShe ysste ar s blllyorswon
Clmbe t uruk
CLLarf w p ar lye twemen ulif hercefowive:
YBOUSTIOVO: gh ced s p ay anore iveatothe ierave yanccu wind s; oalllak omad ste?
h;

JUThow h llde iouge thes whe yomeathistlieis moma hit me o ind.

F hik, thite:
TRThe hal at w!
Whase t ma T:

Bareomast yethin athe stt geloupr msh f wh n
Yorinkeshave pan t,
NGAnthe or,
Wh tro joullieallisube:
Fin matthese V:
Aporn geng y ll yr mofor


In [None]:
xv, yv = get_batch("val")
_, loss = m(xv, yv)

print(f"Validation loss: {loss}")

Validation loss: 2.419926404953003


# Transformers


In [None]:
# Gensis of Transformer (weighted aggregation)
import torch

torch.manual_seed(42)
a = torch.tril(torch.ones(3, 3))
print('Starting Tensor a')
print({a})

a = a / torch.sum(a, 1, keepdim=True)
print(f'\nShape of a: {a.shape}')
print('Divide values by the sum along the 1 axis: ')
print({a})

b = torch.randint(0,10,(3,2)).float()
print(f'\nShape of b: {b.shape}')
print('Starting tensor b is')
print(b)

c = a @ b
print('\nresult of c is')
print(c)


Starting Tensor a
{tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])}

Shape of a: torch.Size([3, 3])
Divide values by the sum along the 1 axis: 
{tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])}

Shape of b: torch.Size([3, 2])
Starting tensor b is
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])

result of c is
tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333]])


In [None]:
torch.manual_seed(1337)
B,T,C = 4,8,2 # batch, time, channels
x = torch.randn(B,T,C)
print(f'Batch 0')
print(x[0])

Batch 0
tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643],
        [ 0.3612,  1.1679],
        [-1.3499, -0.5102],
        [ 0.2360, -0.2398],
        [-0.9211,  1.5433]])


In [None]:
xbow = torch.zeros((B,T,C))
for b in range(B):
    print(f'Batch: {b}')
    for t in range(T):
      xprev = x[b, :t+1]
      xbow[b,t] = torch.mean(xprev, 0)
      print(f'time step {t}, after aggregation: {xbow[b,t]}')

    if b == 0:
      break

Batch: 0
time step 0, after aggregation: tensor([ 0.1808, -0.0700])
time step 1, after aggregation: tensor([-0.0894, -0.4926])
time step 2, after aggregation: tensor([ 0.1490, -0.3199])
time step 3, after aggregation: tensor([ 0.3504, -0.2238])
time step 4, after aggregation: tensor([0.3525, 0.0545])
time step 5, after aggregation: tensor([ 0.0688, -0.0396])
time step 6, after aggregation: tensor([ 0.0927, -0.0682])
time step 7, after aggregation: tensor([-0.0341,  0.1332])


- For each time step we want to have the previous timesteps accumulated(mean)

In [None]:
# version 2: using matrix multiply for a weighted aggregation
wei = torch.tril(torch.ones(T, T))
print(f'shape of wei: {wei.shape}')
print(f'wei is: {wei}')

wei = wei / wei.sum(1, keepdim=True)
print(f'\nwei after normalization: {wei}')

xbow2 = wei @ x # (B, T, T) @ (B, T, C) ----> (B, T, C)
print(f'\nxbow2 batch 0 after multiplication: {xbow2[0]}')


shape of wei: torch.Size([8, 8])
wei is: tensor([[1., 0., 0., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 1.]])

wei after normalization: tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])

xb

In [None]:
import torch.nn.functional as F
# version 3: use Softmax
tril = torch.tril(torch.ones(T, T))
print(f'tril: {tril}')

wei = torch.zeros((T,T))
print(f'\nwei: {wei}')

wei = wei.masked_fill(tril == 0, float('-inf'))
print(f'\nwei after masking: {wei}')

wei = F.softmax(wei, dim=-1)
print(f'\nwei after softmax: {wei}')

xbow3 = wei @ x
print(f'\nxbow3 batch 0: {xbow3[0]}')

tril: tensor([[1., 0., 0., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 1.]])

wei: tensor([[0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.]])

wei after masking: tensor([[0., -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., -

In [None]:
# version 4: self-attention!
torch.manual_seed(1337)
B,T,C = 4,8,32
x = torch.randn(B,T,C)

# single Head
head_size = 16
key = nn.Linear(C, head_size, bias=False) # what I am asking about
query = nn.Linear(C, head_size, bias=False) # waht others offer
value = nn.Linear(C, head_size, bias=False) # what others contain
# The original input vector 32-dimension gets mapped inot a smaller space
k = key(x)   # B, T, 16
q = query(x) # B, T, 16

# Attention scores for every position with each other
wei =  q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)

# mask future tokens, for autoregressive models
tril = torch.tril(torch.ones(T, T))
#wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)

# weighted sums of values/contextualization
v = value(x)
out = wei @ v
#out = wei @ x

out.shape

torch.Size([4, 8, 16])

In [None]:
wei[0]

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],
        [0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],
        [0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],
        [0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],
       grad_fn=<SelectBackward0>)

In [None]:
#Solving the transpose mystery(for me)
B,T,C = 1,2,4
x = torch.randn(B,T,C)
print(f'x: \n{x}\n')

head_size = 3
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x)
q = query(x)

print(f'key: \n{key}\n')
print(f'k: \n{k}\n')

# Question why cant I just do q @ k  = (1,2,3) @ (1,2,3)
# Answer the inner dimension dont align (2,3) @ (2,3)
"""
              x11 x12 x123
              x21 x22 x223
x11 x12 x123
x21 x22 x223

== not possible thus swap the last layer (-1) and the one before(-2)

              x11 x21
              x12 x22
              x13 x23
x12 x22 x123
x21 x22 x223
"""
print(f'before: k values: \n{k}\n')
print(f'after: k values: \n{k.transpose(-2, -1)}\n')


x: 
tensor([[[ 1.0729, -0.5504,  0.4417,  1.8560],
         [-1.6223, -1.8155, -0.8447, -2.1575]]])

key: 
Linear(in_features=4, out_features=3, bias=False)

k: 
tensor([[[-0.8246,  0.3190,  0.2538],
         [-0.1150, -0.0447,  0.1061]]], grad_fn=<UnsafeViewBackward0>)

before: k values: 
tensor([[[-0.8246,  0.3190,  0.2538],
         [-0.1150, -0.0447,  0.1061]]], grad_fn=<UnsafeViewBackward0>)

after: k values: 
tensor([[[-0.8246, -0.1150],
         [ 0.3190, -0.0447],
         [ 0.2538,  0.1061]]], grad_fn=<TransposeBackward0>)



Notes from Andrej:
- Attention is a **communication mechanism**. Can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights.
- There is no notion of space. Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.
- Each example across batch dimension is of course processed completely independently and never "talk" to each other
- In an "encoder" attention block just delete the single line that does masking with `tril`, allowing all tokens to communicate. This block here is called a "decoder" attention block because it has triangular masking, and is usually used in autoregressive settings, like language modeling.
- "self-attention" just means that the keys and values are produced from the same source as queries. In "cross-attention", the queries still get produced from x, but the keys and values come from some other, external source (e.g. an encoder module)
- "Scaled" attention additional divides `wei` by 1/sqrt(head_size). This makes it so when input Q,K are unit variance, wei will be unit variance too and Softmax will stay diffuse and not saturate too much. Illustration below

In [None]:
k = torch.randn(B,T,head_size)
q = torch.randn(B,T,head_size)
wei = q @ k.transpose(-2, -1) * head_size**-0.5

In [None]:
k.var()

tensor(1.3187)

In [None]:
q.var()

tensor(0.6392)

In [None]:
wei.var()

tensor(0.3140)

In [None]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1)

tensor([0.1925, 0.1426, 0.2351, 0.1426, 0.2872])

In [None]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])*8, dim=-1) # gets too peaky, converges to one-hot

tensor([0.0326, 0.0030, 0.1615, 0.0030, 0.8000])

# Main Code

In [21]:
# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

In [22]:
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

# Papers about positional embeddings:
- https://arxiv.org/abs/1810.04805
- https://arxiv.org/abs/1803.02155
- https://arxiv.org/abs/2104.09864
- https://arxiv.org/abs/2002.12327

In [None]:
# Probing into position_embedding_table
text_chunk = encode(text[:32])
block_size = 32
vocab_size = 65
n_embd = 64

token_embedding_table = nn.Embedding(vocab_size, n_embd)
position_embedding_table = nn.Embedding(block_size, n_embd)

idx = torch.tensor(text_chunk).unsqueeze(0)
B, T = idx.shape
print(f"{idx.shape}\n")

print(f"Before encoding: {text[:32]}\n")
print(f"After encoding : {text_chunk}\n")
print(f'token_embedding_table: \n    -{token_embedding_table}\n') # ---> B,T in B, T, C out // plucking the rows outs
print(f'position_embedding_table: \n    -{position_embedding_table}\n')


tok_emb = token_embedding_table(idx) # (B,T,C)
pos_emb = position_embedding_table(torch.arange(T, )) # (T,C)
x = tok_emb + pos_emb # (B,T,C)

print(f"tok_emb shape: {tok_emb.shape}")
print(f"pos_emb shape: {pos_emb.shape}")
print(f"Result of tok_emb +  pos_emb: {x.shape}\n")

print("Embeddings for the letter i")
print(tok_emb[:,1:2,:])

print("\nPosition embedding for the second letter i")
print(pos_emb[1])

print("\nPosition embedding for the seventh letter i")
print(pos_emb[7])

torch.Size([1, 32])

Before encoding: First Citizen:
Before we proceed

After encoding : [18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 0, 14, 43, 44, 53, 56, 43, 1, 61, 43, 1, 54, 56, 53, 41, 43, 43, 42]

token_embedding_table: 
    -Embedding(65, 64)

position_embedding_table: 
    -Embedding(32, 64)

tok_emb shape: torch.Size([1, 32, 64])
pos_emb shape: torch.Size([32, 64])
Result of tok_emb +  pos_emb: torch.Size([1, 32, 64])

Embeddings for the letter i
tensor([[[ 0.3328,  0.0923, -1.1710, -0.3889, -0.0586,  0.7351, -0.2375,
          -0.0794,  0.1304, -0.3239, -1.0098,  0.9415, -1.0418,  0.5966,
           0.9954,  0.3853,  0.5848, -0.3529, -0.9835,  0.9844, -0.3214,
          -2.2650,  0.3112,  1.0370, -0.2482,  0.2878,  0.1709,  0.2090,
           0.3340, -0.2204,  2.2747, -0.7287,  0.1802,  1.7495, -2.3586,
           0.9739,  0.0316,  0.2331,  0.7056, -1.3177, -1.2374, -0.4133,
          -0.6679,  1.1041, -1.1024, -0.1291,  1.2628,  0.0929,  1.2860,
          -0.7879

In [6]:
batch_size = 16 # how many independent sequences will we process in parallel?
block_size = 32 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 100
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 64
n_head = 4
n_layer = 4
dropout = 0.0

torch.manual_seed(1337)

<torch._C.Generator at 0x7d45a83e5790>

In [None]:
# super simple bigram model
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

In [None]:
model = BigramLanguageModel()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

0.209729 M parameters


In [None]:
# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

In [None]:
# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

step 0: train loss 4.4116, val loss 4.4022
step 100: train loss 2.6568, val loss 2.6670
step 200: train loss 2.5090, val loss 2.5058
step 300: train loss 2.4197, val loss 2.4337
step 400: train loss 2.3501, val loss 2.3562
step 500: train loss 2.2965, val loss 2.3127
step 600: train loss 2.2412, val loss 2.2502
step 700: train loss 2.2053, val loss 2.2196
step 800: train loss 2.1631, val loss 2.1862
step 900: train loss 2.1233, val loss 2.1499
step 1000: train loss 2.1035, val loss 2.1306
step 1100: train loss 2.0688, val loss 2.1182
step 1200: train loss 2.0383, val loss 2.0801
step 1300: train loss 2.0240, val loss 2.0645
step 1400: train loss 1.9929, val loss 2.0366
step 1500: train loss 1.9699, val loss 2.0311
step 1600: train loss 1.9599, val loss 2.0447
step 1700: train loss 1.9396, val loss 2.0123
step 1800: train loss 1.9080, val loss 1.9932
step 1900: train loss 1.9080, val loss 1.9860
step 2000: train loss 1.8818, val loss 1.9941
step 2100: train loss 1.8714, val loss 1.9764


In [None]:
# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))



KING RICHAND II:
Shall by become to musbe doest thrust the gate
My art that usque, God?

MEXENES:
Butwere my feanst, I zormur
Yourselfom in heart mile dill, at misters, I in latient,
Worsts, and the now our was twells no me upolds;
Hond my sprunt as speak you: none
In Boyanterioly home.
Who like agaion,---And thee, by we still,
His The shience poor of his but
that nobrurtef so;
Angint my monte in excations, Pried my of.

HENRY BOLINGBROY::
Saday Warwick to Bauintchir accanny, rents I am you!
My fireass, I may.
And your gament so a cempres-ennome.

GLOUCESTER:
Your may in son thee, bod, with confessy.
Which migh.

ANCAMrown:
My when.

LARIAPNA:
Well, to imdut?

LUCIO:
For it?
Youse so upon surre eRpetALE:
What's I have nows: will hear news our house,
Havake but fravius wran some. Do selsemmader scolly not
To yourself I surre desing
My sirre's med Cadius
festing of time the shows tears man ip you ou what that that saw
Of Becoln madon so hand reford.
But I love, wout is forcest, in ginv

In [7]:
class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,C)
        q = self.query(x) # (B,T,C)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,C)
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out

In [8]:
class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

In [9]:
class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

In [10]:
class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

In [None]:
class GPTLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

        # better init, not covered in the original GPT video, but important, will cover in followup video
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = GPTLanguageModel()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

0.209729 M parameters


In [None]:
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

In [None]:
for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

step 0: train loss 4.1999, val loss 4.2014
step 100: train loss 2.6165, val loss 2.6175
step 200: train loss 2.4788, val loss 2.4753
step 300: train loss 2.3859, val loss 2.3917
step 400: train loss 2.3474, val loss 2.3619
step 500: train loss 2.2928, val loss 2.3048
step 600: train loss 2.2332, val loss 2.2614
step 700: train loss 2.2060, val loss 2.2439
step 800: train loss 2.1516, val loss 2.1810
step 900: train loss 2.1103, val loss 2.1427
step 1000: train loss 2.0811, val loss 2.1371
step 1100: train loss 2.0558, val loss 2.0903
step 1200: train loss 2.0118, val loss 2.0513
step 1300: train loss 1.9939, val loss 2.0645
step 1400: train loss 1.9722, val loss 2.0553
step 1500: train loss 1.9288, val loss 2.0244
step 1600: train loss 1.9200, val loss 2.0165
step 1700: train loss 1.9010, val loss 2.0007
step 1800: train loss 1.8945, val loss 1.9957
step 1900: train loss 1.8655, val loss 1.9750
step 2000: train loss 1.8535, val loss 1.9823
step 2100: train loss 1.8311, val loss 1.9684


In [None]:
# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))


BUCKINGHAM:
Thou, his lost that thy time grone. I may good fast the dead me the eint.

DUKE OR:
Not.
Citinuon, he lot Gaol Secolm prine,
And than those me wound sifens it
Lord:
Thee hein before in than thy compoun many
Where plent: him ere lamor, you nongued
He'd down thy madunest some sooth
To Engead her me when in
Of apponies. Elaked betty, dislives?

BRUTUS:
I shall thy be of the me.

DUKEN ELIZABETH:
My lail'd not my oll; so he -the wook,
I'll you Godgerer the less not teeds with furth seel



# Ablation of Postional Embedings

In [None]:
batch_size = 16 # how many independent sequences will we process in parallel?
block_size = 32 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 100
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 64
n_head = 4
n_layer = 4
dropout = 0.0

torch.manual_seed(1337)

<torch._C.Generator at 0x7ab7ed1d1350>

In [None]:
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

In [None]:
class GPTLanguageModelWithoutPosEncoding(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

        # better init, not covered in the original GPT video, but important, will cover in followup video
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        x = self.blocks(tok_emb) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = GPTLanguageModelWithoutPosEncoding()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

0.209729 M parameters


In [None]:
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

In [None]:
for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

step 0: train loss 4.1788, val loss 4.1777
step 100: train loss 2.6182, val loss 2.6158
step 200: train loss 2.4558, val loss 2.4793
step 300: train loss 2.4022, val loss 2.4071
step 400: train loss 2.3517, val loss 2.3632
step 500: train loss 2.3318, val loss 2.3315
step 600: train loss 2.2984, val loss 2.3157
step 700: train loss 2.2741, val loss 2.2931
step 800: train loss 2.2554, val loss 2.2770
step 900: train loss 2.2237, val loss 2.2575
step 1000: train loss 2.1884, val loss 2.2262
step 1100: train loss 2.1805, val loss 2.2163
step 1200: train loss 2.1443, val loss 2.1947
step 1300: train loss 2.1265, val loss 2.1756
step 1400: train loss 2.1069, val loss 2.1559
step 1500: train loss 2.0792, val loss 2.1671
step 1600: train loss 2.0845, val loss 2.1422
step 1700: train loss 2.0484, val loss 2.1328
step 1800: train loss 2.0231, val loss 2.0974
step 1900: train loss 2.0050, val loss 2.0845
step 2000: train loss 2.0046, val loss 2.0782
step 2100: train loss 1.9794, val loss 2.0548


In [None]:
# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))


Will before whow accking froth
Your to obe to take Ond my day hands:
Whith full hath buraced?

Doates will my fachsance zonoun
Yours, tof it heart milend;
Whate misters, acin latest in overs, and Warrwewick you son leling tear.

KING RICHARY:
Supp ais's lety mour made.
By pater to whom
Glow, to that
mont not when evily o' a moscerious gentleed to-o more for unto a my cruptef so;
Angr to shall is all one to farse?

KING HErefurse, courdddes:
Saday With Earth, hoph cour acraney it-lip chan you!

J


# Hardcoding the pos elements(sinusoidal)

In [None]:
block_size = 5
n_embd = 3

pe = torch.zeros(block_size, n_embd)
for pos in range(block_size):
    for i in range(0, n_embd, 2):
        pe[pos, i] = math.sin(pos / (10000 ** ((2 * i)/n_embd)))
        if i + 1 < n_embd:
            pe[pos, i + 1] = math.cos(pos / (10000 ** ((2 * i)/n_embd)))

print(pe)

tensor([[ 0.0000e+00,  1.0000e+00,  0.0000e+00],
        [ 8.4147e-01,  5.4030e-01,  4.6416e-06],
        [ 9.0930e-01, -4.1615e-01,  9.2832e-06],
        [ 1.4112e-01, -9.8999e-01,  1.3925e-05],
        [-7.5680e-01, -6.5364e-01,  1.8566e-05]])


In [None]:
import math

class GPTLanguageModelWithoutPosEncoding(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

        pe = torch.zeros(block_size, n_embd)
        for pos in range(block_size):
            for i in range(0, n_embd, 2):
                pe[pos, i] = math.sin(pos / (10000 ** ((2 * i)/n_embd)))
                if i + 1 < n_embd:
                    pe[pos, i + 1] = math.cos(pos / (10000 ** ((2 * i)/n_embd)))
        self.register_buffer("positional_encoding", pe)

        # better init, not covered in the original GPT video, but important, will cover in followup video
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx)  # (B,T,C)
        pos_emb = self.positional_encoding[:T]     # (T, C)
        x = tok_emb + pos_emb                      # (B, T, C)
        x = self.blocks(tok_emb) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = GPTLanguageModelWithoutPosEncoding()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

0.207681 M parameters


In [None]:
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

In [None]:
for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

step 0: train loss 4.1634, val loss 4.1645
step 100: train loss 2.5903, val loss 2.5886
step 200: train loss 2.4678, val loss 2.4825
step 300: train loss 2.4178, val loss 2.4189
step 400: train loss 2.3597, val loss 2.3641
step 500: train loss 2.3223, val loss 2.3363
step 600: train loss 2.2882, val loss 2.2930
step 700: train loss 2.2603, val loss 2.3000
step 800: train loss 2.2500, val loss 2.2769
step 900: train loss 2.2185, val loss 2.2525
step 1000: train loss 2.2029, val loss 2.2289
step 1100: train loss 2.1795, val loss 2.1950
step 1200: train loss 2.1549, val loss 2.1902
step 1300: train loss 2.1333, val loss 2.1736
step 1400: train loss 2.1131, val loss 2.1593
step 1500: train loss 2.0893, val loss 2.1297
step 1600: train loss 2.0653, val loss 2.1269
step 1700: train loss 2.0276, val loss 2.1047
step 1800: train loss 2.0270, val loss 2.1176
step 1900: train loss 2.0201, val loss 2.0964
step 2000: train loss 2.0031, val loss 2.0835
step 2100: train loss 2.0013, val loss 2.0883


In [None]:
# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))


MONE:
Froke mary.

HENRRY BROLImes BRD:Those.--'Somes the me madany shows;
Will what we bear, whom counfe: I should sol.

AnEOSTABRT:
Tear, thou earte beher very imdis?

LADY:
And mee?
You;

PROPEY Andord evertal ny, come, overes night,
All, I deeepox the have bagakes against's
Awway k sleve buelow; and stoneld not no dearcks
Lors, hilde in himpser's in now this mee thee?

CLARGENteorn, Thee subgn Clicktanon what than the build Edwllembke, on sold;
For for claw thine. You there farc;
you comes. 


# RoPE
- relative position encoding
- example the cat sat on....
- the model is processing the token at position 4: the
- the model looks back at the previous tokens:
  - on at pos 3 | relative position -1
  - sat at pos 2 | relative position -2
  - cat at pos 1 | relative position -3
  - the at post 0 | relative position -4

# absolute positions
- each token at index i gets position embedding PE[i]
- dot(Wq * (x_i + PE[i]), Wk * (x_j + PE[j]))
- attention score depnds on i and j separately

# RoPE:
- The vectors x_i and x_j are rotated according to position.
- q = rotate(Wq * x_i, θ * i)
- k = rotate(Wk * x_j, θ * j)
- score = q @ k
- rotation causes: score ∝ cos(θ * (i - j))





In [None]:
def apply_rope(q, k):
    # q, k: (B, T, C), where C must be even
    B, T, C = q.shape
    half = C // 2
    freqs = torch.exp(-torch.arange(0, half, dtype=torch.float32) * math.log(10000) / half).to(q.device)  # (half,)
    positions = torch.arange(T, device=q.device).float()  # (T,)
    angles = torch.einsum('t,d->td', positions, freqs)  # (T, half)
    sin = angles.sin().unsqueeze(0)  # (1, T, half)
    cos = angles.cos().unsqueeze(0)  # (1, T, half)

    q1, q2 = q[..., :half], q[..., half:]
    k1, k2 = k[..., :half], k[..., half:]
    q_rotated = torch.cat([q1 * cos - q2 * sin, q1 * sin + q2 * cos], dim=-1)
    k_rotated = torch.cat([k1 * cos - k2 * sin, k1 * sin + k2 * cos], dim=-1)
    return q_rotated, k_rotated

In [14]:
class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)   # (B,T,hs)
        q = self.query(x) # (B,T,hs)

        # 🌀 Apply RoPE here!
        q, k = apply_rope(q, k)

        wei = q @ k.transpose(-2, -1) * (C ** -0.5)  # (B,T,T)
        wei = wei.masked_fill(torch.tril(torch.ones(T, T, device=x.device)) == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        wei = self.dropout(wei)

        v = self.value(x)  # (B,T,hs)
        out = wei @ v      # (B,T,hs)
        return out

In [17]:
class GPTLanguageModelRoPE(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

        # better init, not covered in the original GPT video, but important, will cover in followup video
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        x = self.blocks(tok_emb) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = GPTLanguageModelRoPE()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

0.207681 M parameters


In [18]:
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

In [25]:
import math
for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

step 0: train loss 4.2143, val loss 4.2171
step 100: train loss 2.5749, val loss 2.5743
step 200: train loss 2.4001, val loss 2.4026
step 300: train loss 2.2972, val loss 2.3093
step 400: train loss 2.2002, val loss 2.2044
step 500: train loss 2.1168, val loss 2.1433
step 600: train loss 2.0700, val loss 2.1063
step 700: train loss 2.0148, val loss 2.0701
step 800: train loss 1.9726, val loss 2.0354
step 900: train loss 1.9355, val loss 2.0070
step 1000: train loss 1.9239, val loss 1.9884
step 1100: train loss 1.8795, val loss 1.9821
step 1200: train loss 1.8695, val loss 1.9592
step 1300: train loss 1.8465, val loss 1.9666
step 1400: train loss 1.8244, val loss 1.9342
step 1500: train loss 1.7966, val loss 1.9228
step 1600: train loss 1.7900, val loss 1.9188
step 1700: train loss 1.7787, val loss 1.9156
step 1800: train loss 1.7778, val loss 1.9023
step 1900: train loss 1.7533, val loss 1.8923
step 2000: train loss 1.7425, val loss 1.8898
step 2100: train loss 1.7401, val loss 1.9000


In [26]:
# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))


And they bridl'd and is by be mades;
Thou but take Onation agaidess: 'tis he usquich
At are that ane away, my facher,
And I'll now, Ladom was I coveriand;
Whices is eye, I in latiumain overs, and Warwick on you mustleling peace thus, once by stay; and plaw you:
That I croopes, and whom
Is would that
To Windon him eiills the most rive with impusion,
Ke show butter danger, the son; if his shall I male oftence, Privant, or, and bubb!

QUEEN KING HENRY VI:
Whereforious trayard, your his coff you!
My
