## Building a GPT

Companion notebook to the [Zero To Hero](https://karpathy.ai/zero-to-hero.html) video on GPT.

In [None]:
# read it in to inspect it
with open('kinyas_kayra_clean.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [None]:
print("length of dataset in characters: ", len(text))
#1115394

NameError: name 'text' is not defined

In [None]:
# let's look at the first 1000 characters
print(text[:1000])

Hepsi yaralar, sonuncusu öldürür! Birinci Kitap Kinyas, Kayra ve Hayat Asansör dördüncü katta durdu.

Kapısında 17 yazan daireye girdik.

Tahmin ettiğim gibi evde çok az mobilya vardı.

Salonun duvarları fotoğraflar ve afişlerle kaplanmıştı.

Ortada, eskiciden alınmış izlenimi veren ceviz yemek masası, ucuz barlarda çıkması muhtemel kavgalarda hasarı önlemek amacıyla yere çakılmışçasına duruyordu.

Ve dört adet çelik sandalye tarafından kuşatılmıştı.

Yerlerde yüzlerce içki şişesi parkeyi bir halı gibi kaplıyordu.

Kapalı perdelerden, pencerelerin çok uzun zamandır açılmadığı anlaşılıyordu.

Zaten havaya hkim olan keskin alkol ve tütün kokusu da bunu gösteriyordu.

Masanın üstündeki boş ve dağınık kğıtlar, cesetler gibi, birileri tarafından toplanmayı bekliyordu.

Ve salondaki en değerli eşya kğıtların yanında duran, üç ayrı köşedeki abajurun ışığıyla hayat bulan, olduğu yere kendini hiç de ait hissetmeyen ve benim çok eskilerden hatırladığım altın kaplamalı dolmakalemdi.

Hareketsiz, 

### 🎭 **Vocabulary Analizi - Detaylı Açıklama**

**Ne yapıyor:** Metindeki tüm unique karakterleri bulup vocabulary oluşturuyor.

**Derin bilgiler:**
- **`set(text)`:** String'i set'e çevirerek duplicate karakterleri kaldırır
- **`sorted(list(...))`:** Karakterleri alfabetik sıraya koyar (tutarlılık için)
- **Vocabulary size: 65 karakter**
  - Harfler: a-z, A-Z (52 adet)
  - Rakamlar: 0-9 (sadece 3 adet görünüyor)
  - Noktalama: !$&',-.3:;? ve boşluk
- **Character encoding implications:**
  - Her karakter bir index alacak (0-64)
  - Embedding table 65x(embedding_dim) olacak
- **Comparison with word-level:**
  - Word vocabulary: 10K-50K+ words
  - Character vocabulary: ~65 chars
  - Çok daha kompakt representation


In [None]:
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)


 !+,-.0123456789:;=?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzÇÖÜçöüğİıŞş
84


### 🔢 **Tokenization: String ↔ Integer Dönüşümü - Detaylı Açıklama**

**Ne yapıyor:** Karakterleri integer'lara ve integer'ları karakterlere dönüştüren encoder/decoder fonksiyonları oluşturuyor.

**Derin bilgiler:**
- **`stoi` (string to integer):** Dictionary mapping karakterden sayıya
- **`itos` (integer to string):** Dictionary mapping sayıdan karaktere
- **`encode` lambda fonksiyonu:**
  - Input: String ("hii there")
  - Output: List of integers ([46, 47, 47, 1, 58, 46, 43, 56, 43])
  - Her karakter vocabulary'deki index'ine çevriliyor
- **`decode` lambda fonksiyonu:**
  - Input: List of integers
  - Output: Original string
  - Reverse operation of encode
- **Neural network requirement:**
  - NN'ler sayılarla çalışır, metinle değil
  - Bu mapping bidirectional ve lossless olmalı
- **Lambda functions:** Concise function definition syntax


In [None]:
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

print(encode("Kinyas ve Kayra"))
print(decode(encode("Kinyas ve Kayra")))

[31, 55, 60, 71, 47, 65, 1, 68, 51, 1, 31, 47, 71, 64, 47]
Kinyas ve Kayra


### 🧠 **PyTorch Tensor'e Dönüşüm - Detaylı Açıklama**

**Ne yapıyor:** Tüm metni encode edip PyTorch tensor'ına çeviriyor.

**Derin bilgiler:**
- **`torch.tensor()`:** Python list'ini PyTorch tensor'ına çevirir
- **`dtype=torch.long`:** 64-bit integer type
  - Token index'leri için yeterli (0-64 arası)
  - GPU operasyonları için optimize
  - Embedding layer input olarak gerekli
- **Tensor shape:** `[1115394]` - 1D tensor
- **Memory efficiency:**
  - Original text: ~1.1MB (UTF-8 strings)
  - Tensor: ~4.4MB (int64 * 1,115,394)
  - Trade-off: memory vs. computation speed
- **GPU readiness:** Tensor format GPU'ya transfer edilebilir
- **Vectorization:** Batch operations için hazır format
- **Data type importance:** Wrong dtype → runtime errors


In [None]:
# let's now encode the entire text dataset and store it into a torch.Tensor
import torch # we use PyTorch: https://pytorch.org
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:1000]) # the 1000 characters we looked at earier will to the GPT look like this

torch.Size([1097793]) torch.int64
tensor([28, 51, 62, 65, 55,  1, 71, 47, 64, 47, 58, 47, 64,  4,  1, 65, 61, 60,
        67, 60, 49, 67, 65, 67,  1, 77, 58, 50, 78, 64, 78, 64,  2,  1, 22, 55,
        64, 55, 60, 49, 55,  1, 31, 55, 66, 47, 62,  1, 31, 55, 60, 71, 47, 65,
         4,  1, 31, 47, 71, 64, 47,  1, 68, 51,  1, 28, 47, 71, 47, 66,  1, 21,
        65, 47, 60, 65, 77, 64,  1, 50, 77, 64, 50, 78, 60, 49, 78,  1, 57, 47,
        66, 66, 47,  1, 50, 67, 64, 50, 67,  6,  0,  0, 31, 47, 62, 81, 65, 81,
        60, 50, 47,  1,  8, 14,  1, 71, 47, 72, 47, 60,  1, 50, 47, 55, 64, 51,
        71, 51,  1, 53, 55, 64, 50, 55, 57,  6,  0,  0, 40, 47, 54, 59, 55, 60,
         1, 51, 66, 66, 55, 79, 55, 59,  1, 53, 55, 48, 55,  1, 51, 68, 50, 51,
         1, 76, 61, 57,  1, 47, 72,  1, 59, 61, 48, 55, 58, 71, 47,  1, 68, 47,
        64, 50, 81,  6,  0,  0, 39, 47, 58, 61, 60, 67, 60,  1, 50, 67, 68, 47,
        64, 58, 47, 64, 81,  1, 52, 61, 66, 61, 79, 64, 47, 52, 58, 47, 64,  1,
      

In [None]:
# Let's now split up the data into train and validation sets
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

### 🎯 **Block Size Kavramı - Detaylı Açıklama**

**Ne yapıyor:** Context window boyutunu 8 karakter olarak belirler ve örnek gösterir.

**Derin bilgiler:**
- **Block size = Context length = Sequence length:** Aynı kavram
- **8 karakterlik window:** Model aynı anda 8 karaktere kadar bakabilir
- **tensor([18, 47, 56, 57, 58, 1, 15, 47, 58]):** 9 element
  - İlk 8: input context
  - Son 8: target predictions (1 shift)
- **Sliding window approach:** Her pozisyon için prediction
- **Transformer limitation:**
  - Fixed maximum context length
  - Real GPT models: 2048, 4096, 100K+ tokens
- **Memory complexity:** O(n²) attention computation
- **Training efficiency:** Küçük block size = daha hızlı training

In [None]:
block_size = 8
train_data[:block_size+1]

tensor([28, 51, 62, 65, 55,  1, 71, 47, 64])

### 🎯 **Autoregressive Training Yaklaşımı - Detaylı Açıklama**

**Ne yapıyor:** Her pozisyon için context-target çiftlerini gösterir.

**Derin bilgiler:**
- **Autoregressive modeling:** Her token, önceki tüm token'lara bakarak tahmin edilir
- **8 farklı training example:** Tek sequence'dan 8 öğrenme örneği
  - Context [18] → Target: 47
  - Context [18,47] → Target: 56
  - ... vb.
- **Teacher forcing:** Training sırasında gerçek token'ları kullan
- **Progressive context:** Giderek daha fazla bilgi veriliyor
- **Efficiency:** Tek forward pass'te 8 prediction
- **Causal masking:** Gelecek token'ları görme yasağı
- **Maximum likelihood training:** Next token probability maximize edilir
- **Sequence modeling temel prensibi:** P(w₁,w₂,...,wₙ) = ∏P(wᵢ|w₁,...,wᵢ₋₁)


In [None]:
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target: {target}")

when input is tensor([28]) the target: 51
when input is tensor([28, 51]) the target: 62
when input is tensor([28, 51, 62]) the target: 65
when input is tensor([28, 51, 62, 65]) the target: 55
when input is tensor([28, 51, 62, 65, 55]) the target: 1
when input is tensor([28, 51, 62, 65, 55,  1]) the target: 71
when input is tensor([28, 51, 62, 65, 55,  1, 71]) the target: 47
when input is tensor([28, 51, 62, 65, 55,  1, 71, 47]) the target: 64


### 🚀 **Batch Processing Sistemi - Detaylı Açıklama**

**Ne yapıyor:** Mini-batch'ler oluşturup paralel training için hazırlar.

**Derin bilgiler:**
- **Batch size = 4:** 4 farklı sequence paralel işlenir
- **Random sampling:** `torch.randint()` ile rastgele başlangıç pozisyonları
- **Tensor shapes:**
  - Input `xb`: [4, 8] - 4 sequence, her biri 8 token
  - Target `yb`: [4, 8] - shifted targets
- **get_batch() fonksiyonu:**
  - Split parametresi: 'train' veya 'val'
  - Dynamic data loading
  - GPU-ready tensors
- **Parallelization benefits:**
  - GPU cores'u etkili kullanım
  - Batch normalization için gerekli
  - Gradient estimation iyileşir
- **32 training example:** 4 sequence × 8 position = 32 simultaneous prediction
- **Memory vs. Speed trade-off:** Büyük batch = daha fazla memory, daha stabil gradients


In [None]:
torch.manual_seed(1337)
batch_size = 4 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

print('----')

for b in range(batch_size): # batch dimension
    for t in range(block_size): # time dimension
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f"when input is {context.tolist()} the target: {target}")

inputs:
torch.Size([4, 8])
tensor([[79, 81,  1, 55, 76, 55, 60,  1],
        [71, 58, 51, 59, 51, 57,  1, 55],
        [47,  1, 78, 76,  1, 53, 78, 60],
        [48, 47, 50, 81, 64,  6,  0,  0]])
targets:
torch.Size([4, 8])
tensor([[81,  1, 55, 76, 55, 60,  1, 71],
        [58, 51, 59, 51, 57,  1, 55, 65],
        [ 1, 78, 76,  1, 53, 78, 60, 50],
        [47, 50, 81, 64,  6,  0,  0, 31]])
----
when input is [79] the target: 81
when input is [79, 81] the target: 1
when input is [79, 81, 1] the target: 55
when input is [79, 81, 1, 55] the target: 76
when input is [79, 81, 1, 55, 76] the target: 55
when input is [79, 81, 1, 55, 76, 55] the target: 60
when input is [79, 81, 1, 55, 76, 55, 60] the target: 1
when input is [79, 81, 1, 55, 76, 55, 60, 1] the target: 71
when input is [71] the target: 58
when input is [71, 58] the target: 51
when input is [71, 58, 51] the target: 59
when input is [71, 58, 51, 59] the target: 51
when input is [71, 58, 51, 59, 51] the target: 57
when input is [71

### 👁️ **Input Tensor İncelemesi - Detaylı Açıklama**

**Ne yapıyor:** Transformer'a gidecek input tensor'ını gösterir.

**Derin bilgiler:**
- **Tensor içeriği:** 4×8 matrix, her element bir token ID (0-64 arası)
- **Batch dimension (dim=0):** 4 farklı sequence
- **Sequence dimension (dim=1):** Her sequence'ta 8 token
- **Token meanings:**
  - 24 → 'L', 43 → 'e', 58 → 't', vb.
  - Gerçek Shakespeare karakterleri
- **No embeddings yet:** Ham token ID'leri, henüz vector representation'a çevrilmedi
- **Transformer input format:** Standard [Batch, Sequence, ...] convention
- **Memory layout:** Contiguous tensor, GPU transfer için optimize
- **Next step:** Bu integer'lar embedding table'dan vector'lara çevrilecek


In [None]:
print(xb) # our input to the transformer

tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])


### 🤖 **Bigram Language Model - Detaylı Açıklama**

**Ne yapıyor:** En basit language model'i implement eder: sadece önceki karaktere bakar.

**Derin bilgiler:**
- **Bigram Model:** P(next_char | previous_char) - sadece 1 token geriye bakar
- **Architecture:**
  - `token_embedding_table`: [vocab_size, vocab_size] = [65, 65]
  - Her token ID → probability distribution over next tokens
- **Forward pass:**
  - Input: token indices [B, T]
  - Embedding lookup → logits [B, T, C]
  - Cross-entropy loss hesaplanır
- **Loss = 4.8786:** Random baseline ~4.17 (log(65)), biraz daha iyi
- **Generate method:**
  - Autoregressive sampling
  - Multinomial sampling from softmax probabilities
  - No temperature control (raw probabilities)
- **Limitations:** Çok kısa memory, complex patterns öğrenemez
- **Baseline model:** Daha complex modeller için karşılaştırma noktası


In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):

        # idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx) # (B,T,C)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)
print(logits.shape)
print(loss)

print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))


torch.Size([32, 84])
tensor(4.8234, grad_fn=<NllLossBackward0>)

,F.,-HÖ;BmNJ,l
VMMPmgçRRçO
J:SFe6WGSPeÖ5IZ0e+sıVhJWXöxmoI!tUöf8tOnBO-oJRçeXN7!9;WZgPsıve9ii?Ta;aFJjZ


### ⚙️ **Optimizer Kurulumu - Detaylı Açıklama**

**Ne yapıyor:** AdamW optimizer'ı learning rate 1e-3 ile kurar.

**Derin bilgiler:**
- **AdamW (Adam with Weight Decay):**
  - Adaptive moment estimation
  - Weight decay regularization
  - Transformer'lar için best practice
- **Learning rate 1e-3 = 0.001:**
  - Conservative başlangıç
  - Too high → unstable training
  - Too low → very slow convergence
- **m.parameters():** Model'deki tüm trainable weights
  - Embedding table: 65×65 = 4,225 params
  - Bias yok (bias=False)
- **Optimizer state:**
  - Momentum (first moment)
  - Variance (second moment)
  - Memory usage ~2x model parameters
- **AdamW vs. Adam:** Better generalization with proper weight decay


In [None]:
# create a PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

In [None]:
batch_size = 32
for steps in range(100): # increase number of steps for good results...

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print(loss.item())


4.713395118713379


In [None]:
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=500)[0].tolist()))


9?PEk:8-JO8 +
hXCMnx7ıTkjHaV,gç4 o:ığ4+LI:?imBde2İZ,9aVHçbbO
:Ed;nrZ
ŞDı8ööoFDÖZT6wFJvchi3qbvüxjC9qb8:RIraO+yMOüdOApI-U6Mmw+6WvWW0öjvSsNWuk!;+0zü
dnVG
AŞVç1M2qTÇğtbhlAp1ŞNbŞğühoS.2ğ7öÖÇğp00?ZhoeaEU.5Ya+yyNVgQqÖö8,; J
9D
0I5zğ29G+XÇğNKCcUQCRaRvKQdğ4=QNI:O7fŞ
CuEe:QdY?ZaxHİVt,:vivŞİIMşTcRWS15Ou0ö+xAkü+68 EJqK=;ç:A5kWcvşq-Er+AtbüxJe6V!FNbH72qIJq2iiW iGjHeYYCthsDwH2üspRDPlSP9?VzLıRtYpZM6xt2ğÇ.,üiSQOT3PRk.H=ğFEÇ35:
iA9ÖG!:a
püPoqdÜXj,X5zğOT;ömJüÇlİKxö:2ğ3KqXFZy!N19CUÖ8k6 xŞ9sJÜ7İ8OPSwıSPBşEdTyÇmpwa,Ü


## The mathematical trick in self-attention

In [None]:
# toy example illustrating how matrix multiplication can be used for a "weighted aggregation"
torch.manual_seed(42)
a = torch.tril(torch.ones(3, 3))
a = a / torch.sum(a, 1, keepdim=True)
b = torch.randint(0,10,(3,2)).float()
c = a @ b
print('a=')
print(a)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)

a=
tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
--
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
--
c=
tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333]])


In [None]:
# consider the following toy example:

torch.manual_seed(1337)
B,T,C = 4,8,2 # batch, time, channels
x = torch.randn(B,T,C)
x.shape

torch.Size([4, 8, 2])

In [None]:
# We want x[b,t] = mean_{i<=t} x[b,i]
xbow = torch.zeros((B,T,C))
for b in range(B):
    for t in range(T):
        xprev = x[b,:t+1] # (t,C)
        xbow[b,t] = torch.mean(xprev, 0)


In [None]:
# version 2: using matrix multiply for a weighted aggregation
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
xbow2 = wei @ x # (B, T, T) @ (B, T, C) ----> (B, T, C)
torch.allclose(xbow, xbow2)

False

In [None]:
# version 3: use Softmax
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
xbow3 = wei @ x
torch.allclose(xbow, xbow3)


False

In [None]:
# version 4: self-attention!
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)

# let's see a single Head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x)   # (B, T, 16)
q = query(x) # (B, T, 16)
wei =  q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)

tril = torch.tril(torch.ones(T, T))
#wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)

v = value(x)
out = wei @ v
#out = wei @ x

out.shape

torch.Size([4, 8, 16])

In [None]:
wei[0]

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],
        [0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],
        [0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],
        [0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],
       grad_fn=<SelectBackward0>)

Notes:
- Attention is a **communication mechanism**. Can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights.
- There is no notion of space. Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.
- Each example across batch dimension is of course processed completely independently and never "talk" to each other
- In an "encoder" attention block just delete the single line that does masking with `tril`, allowing all tokens to communicate. This block here is called a "decoder" attention block because it has triangular masking, and is usually used in autoregressive settings, like language modeling.
- "self-attention" just means that the keys and values are produced from the same source as queries. In "cross-attention", the queries still get produced from x, but the keys and values come from some other, external source (e.g. an encoder module)
- "Scaled" attention additional divides `wei` by 1/sqrt(head_size). This makes it so when input Q,K are unit variance, wei will be unit variance too and Softmax will stay diffuse and not saturate too much. Illustration below

In [None]:
k = torch.randn(B,T,head_size)
q = torch.randn(B,T,head_size)
wei = q @ k.transpose(-2, -1) * head_size**-0.5

In [None]:
k.var()

tensor(1.0449)

In [None]:
q.var()

tensor(1.0700)

In [None]:
wei.var()

tensor(1.0918)

In [None]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1)

tensor([0.1925, 0.1426, 0.2351, 0.1426, 0.2872])

In [None]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])*8, dim=-1) # gets too peaky, converges to one-hot

tensor([0.0326, 0.0030, 0.1615, 0.0030, 0.8000])

In [None]:
class LayerNorm1d: # (used to be BatchNorm1d)

  def __init__(self, dim, eps=1e-5, momentum=0.1):
    self.eps = eps
    self.gamma = torch.ones(dim)
    self.beta = torch.zeros(dim)

  def __call__(self, x):
    # calculate the forward pass
    xmean = x.mean(1, keepdim=True) # batch mean
    xvar = x.var(1, keepdim=True) # batch variance
    xhat = (x - xmean) / torch.sqrt(xvar + self.eps) # normalize to unit variance
    self.out = self.gamma * xhat + self.beta
    return self.out

  def parameters(self):
    return [self.gamma, self.beta]

torch.manual_seed(1337)
module = LayerNorm1d(100)
x = torch.randn(32, 100) # batch size 32 of 100-dimensional vectors
x = module(x)
x.shape

torch.Size([32, 100])

In [None]:
x[:,0].mean(), x[:,0].std() # mean,std of one feature across all batch inputs

(tensor(0.1469), tensor(0.8803))

In [None]:
x[0,:].mean(), x[0,:].std() # mean,std of a single input from the batch, of its features

(tensor(-9.5367e-09), tensor(1.0000))

In [None]:
# French to English translation example:

# <--------- ENCODE ------------------><--------------- DECODE ----------------->
# les réseaux de neurones sont géniaux! <START> neural networks are awesome!<END>



### Full finished code, for reference

You may want to refer directly to the git repo instead though.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [60]:
import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 64 # how many independent sequences will we process in parallel?
block_size = 256 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 500
learning_rate = 3e-4

grad_clip = 1.0
best_val_loss = float('inf')
patience_counter = 0
patience = 5

eval_iters = 200
n_embd = 384
n_head = 6
n_layer = 6
dropout = 0.3

device = 'cuda' if torch.cuda.is_available() else 'cpu'
# ------------

In [48]:
torch.manual_seed(1337)

with open('/content/drive/MyDrive/ML-MODELS/GPT/GPT - Base/kinyas_kayra_clean.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [49]:
import torch
from collections import Counter

class ByteLevelBPE:
    def __init__(self, text, num_merges=500):
        self.text = text
        self.num_merges = num_merges
        self.vocab = None
        self.merges = []
        self.token_to_id = {}
        self.id_to_token = {}
        self._learn_bpe()
        self._build_token_vocab()

    def _get_vocab(self):
        vocab = Counter()
        words = self.text.strip().split()

        for word in words:
            word_bytes = list(word.encode('utf-8'))
            word_bytes_str = [f"{b:03d}" for b in word_bytes]
            tokenized = ' '.join(word_bytes_str + ['</w>'])
            vocab[tokenized] += 1

        return vocab

    def _get_stats(self, vocab):
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for i in range(len(symbols) - 1):
                if symbols[i+1] == '</w>':
                    continue
                pairs[(symbols[i], symbols[i+1])] += freq
        return pairs

    def _merge_vocab(self, pair, vocab_in):
        vocab_out = {}
        replacement = pair[0] + pair[1]

        for word, freq in vocab_in.items():
            symbols = word.split()
            new_symbols = []
            i = 0

            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i+1]) == pair:
                    new_symbols.append(replacement)
                    i += 2
                else:
                    new_symbols.append(symbols[i])
                    i += 1

            new_word = ' '.join(new_symbols)
            vocab_out[new_word] = freq

        return vocab_out

    def _learn_bpe(self):
        self.vocab = self._get_vocab()
        for i in range(self.num_merges):
            pairs = self._get_stats(self.vocab)
            if not pairs:
                break
            best = max(pairs, key=pairs.get)
            self.vocab = self._merge_vocab(best, self.vocab)
            self.merges.append(best)
            if i % 100 == 0 or i == self.num_merges - 1:
                print(f"Merge {i+1}: {best}")

        self.merges_set = set(self.merges)

    def _build_token_vocab(self):
        # BPE tokenları: başlangıçta tüm byte kodları + merged tokenlar
        tokens = set()
        # Tüm kelimeler
        for word in self.vocab.keys():
            for token in word.split():
                tokens.add(token)
        # Ayrıca merge'lerle oluşan tokenlar
        for a,b in self.merges:
            tokens.add(a+b)
        tokens.discard('</w>')  # </w> genelde tokenize edilmez veya farklı işlenir
        tokens = sorted(list(tokens))
        self.token_to_id = {tok: idx for idx, tok in enumerate(tokens)}
        self.id_to_token = {idx: tok for tok, idx in self.token_to_id.items()}

    def encode(self, word):
        word_bytes = [f"{b:03d}" for b in word.encode('utf-8')] + ['</w>']

        while True:
            pairs = [(word_bytes[i], word_bytes[i+1]) for i in range(len(word_bytes)-1)]
            mergeable = [p for p in pairs if p in self.merges]

            if not mergeable:
                break

            best = None
            for merge in self.merges:
                if merge in pairs:
                    best = merge
                    break

            if best is None:
                break

            new_word = []
            i = 0
            while i < len(word_bytes):
                if i < len(word_bytes) - 1 and (word_bytes[i], word_bytes[i+1]) == best:
                    new_word.append(word_bytes[i] + word_bytes[i+1])
                    i += 2
                else:
                    new_word.append(word_bytes[i])
                    i += 1

            word_bytes = new_word

        encoded_ids = []
        for token in word_bytes:
            if token == '</w>':
                continue
            encoded_ids.append(self.token_to_id[token])
        return encoded_ids

    def decode(self, token_ids):
        tokens = [self.id_to_token[id_] for id_ in token_ids]
        byte_sequence = []
        for token in tokens:
            for i in range(0, len(token), 3):
                byte_sequence.append(int(token[i:i+3]))
        return bytes(byte_sequence).decode('utf-8', errors='replace')

bpe = ByteLevelBPE(text, num_merges=8000)

word = "Kinyas"
encoded = bpe.encode(word)
print("Encoded:", encoded)

decoded = bpe.decode(encoded)
print("Decoded:", decoded)



# here are all the unique characters that occur in this text
#chars = sorted(list(set(text)))
#vocab_size = len(chars)
# create a mapping from characters to integers
#stoi = { ch:i for i,ch in enumerate(chars) }
#itos = { i:ch for i,ch in enumerate(chars) }
#encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
#decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

Merge 1: ('196', '177')
Merge 101: ('197159', '101')
Merge 201: ('105', '104')
Merge 301: ('075', '105110121097')
Merge 401: ('097114', '107')
Merge 501: ('116', '097114097')
Merge 601: ('108097', '121097')
Merge 701: ('111108109097', '108196177')
Merge 801: ('196176', '107105')
Merge 901: ('105110', '099105')
Merge 1001: ('097110', '110101')
Merge 1101: ('107097114', '196177')
Merge 1201: ('103', '195188108')
Merge 1301: ('117196159', '114097')
Merge 1401: ('107097', '102')
Merge 1501: ('100097110', '046')
Merge 1601: ('112', '105122')
Merge 1701: ('098097', '122196177')
Merge 1801: ('076', '111')
Merge 1901: ('197159', '097110')
Merge 2001: ('097', '105116')
Merge 2101: ('098101110122101', '121101110')
Merge 2201: ('101116', '116105109')
Merge 2301: ('116105', '116114101')
Merge 2401: ('100195188110121097', '121097')
Merge 2501: ('111108117114', '100117')
Merge 2601: ('107097114197159196177', '108196177196159196177110100097')
Merge 2701: ('107097108196177', '114')
Merge 2801: ('10011

In [61]:
def encode_text_with_bpe_ids(bpe_obj, text):
    tokens = []
    for word in text.strip().split():
        tokens.extend(bpe_obj.encode(word))
    return tokens

tokens = encode_text_with_bpe_ids(bpe, text)
data = torch.tensor(tokens, dtype=torch.long)

n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]

print(f"Total token: {len(data)}")
print(f"Train data size: {len(train_data)}")
print(f"Val data size: {len(val_data)}")

# Train and test splits
#data = torch.tensor(encode(text), dtype=torch.long)
#n = int(0.9*len(data)) # first 90% will be train, rest val
#train_data = data[:n]
#val_data = data[n:]

Total token: 213475
Train data size: 192127
Val data size: 21348


In [71]:
# data loading
def get_batch(split):
    data_split = train_data if split == 'train' else val_data
    ix = torch.randint(len(data_split) - block_size, (batch_size,))
    x = torch.stack([data_split[i:i+block_size] for i in ix])
    y = torch.stack([data_split[i+1:i+block_size+1] for i in ix])
    return x.to(device), y.to(device)

In [72]:
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

In [73]:
class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,C)
        q = self.query(x) # (B,T,C)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,C)
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out

In [74]:
class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

In [75]:
class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

In [76]:
class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x


In [77]:
# super simple bigram model
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -block_size:]
            logits, _ = self(idx_cond)
            logits = logits[:, -1, :]  # (B, vocab_size)

            logits = logits / temperature

            if top_k is not None:
                v, _ = torch.topk(logits, top_k)
                min_v = v[:, -1].unsqueeze(1)
                logits = torch.where(logits < min_v, torch.full_like(logits, -float('Inf')), logits)

            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)
        return idx

In [78]:
vocab_size = len(bpe.token_to_id)
vocab_size

8071

In [79]:
model = BigramLanguageModel()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate, weight_decay=1e-2)

16.945543 M parameters


In [None]:
import torch
import math
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/bpe_transformer")

def get_lr(it, warmup_iters=500, max_lr=1e-3, total_iters=5000):
    if it < warmup_iters:
        return max_lr * it / warmup_iters
    elif it > total_iters:
        return 0.0
    else:
        decay_ratio = (it - warmup_iters) / (total_iters - warmup_iters)
        return max_lr * 0.5 * (1.0 + math.cos(math.pi * decay_ratio))

for iter in range(max_iters):
    # Learning rate scheduler
    lr = get_lr(iter)
    for g in optimizer.param_groups:
        g['lr'] = lr

    # Değerlendirme ve log
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        train_loss = losses['train']
        val_loss = losses['val']

        print(f"Step {iter}: Train {train_loss:.4f}, Val {val_loss:.4f}, LR {lr:.6f}")
        writer.add_scalar("Loss/train", train_loss, iter)
        writer.add_scalar("Loss/val", val_loss, iter)
        writer.add_scalar("Learning Rate", lr, iter)

        if val_loss < best_val_loss:
            best_val_loss = val_loss
            patience_counter = 0
            torch.save(model.state_dict(), 'best_model.pt')
            print("✨ Best model saved.")
        else:
            patience_counter += 1
            if patience_counter >= patience:
                print("🛑 Early stopping.")
                break

    # Eğitim adımı
    xb, yb = get_batch('train')
    logits, loss = model(xb, yb)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()


#for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
#    if iter % eval_interval == 0 or iter == max_iters - 1:
#        losses = estimate_loss()
#        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
#    xb, yb = get_batch('train')

    # evaluate the loss
#    logits, loss = model(xb, yb)
#    optimizer.zero_grad(set_to_none=True)
#    loss.backward()
#    optimizer.step()


Step 0: Train 9.1643, Val 9.1696, LR 0.000000
✨ Best model saved.
Step 500: Train 4.3856, Val 7.1656, LR 0.001000
✨ Best model saved.
Step 1000: Train 1.3357, Val 9.2076, LR 0.000970


In [None]:
prompt = "Kinyas ve"
prompt_tokens = []
for w in prompt.strip().split():
    prompt_tokens.extend(bpe.encode(w))
context = torch.tensor(prompt_tokens, dtype=torch.long, device=device).unsqueeze(0)

generated_ids = model.generate(context, max_new_tokens=50, temperature=0.7, top_k=50)[0].tolist()
print("Generated text:")
print(bpe.decode(generated_ids))

# generate from the model
#context = torch.zeros((1, 1), dtype=torch.long, device=device)
#print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))