# Project: Transformer-based Language Model from Scratch

Training of a nano-GPT style Transformer language model on tiny Shakespeare dataset.

Conceptual Overview

The heart of modern LLMs is the Self-Attention mechanism.

**Concept: Why Attention?**

In the Bigram model, when processing the sentence "The dog barks," the model at the word "barks" no longer knew that "dog" came before. A Transformer looks at all previous tokens and dynamically decides which ones are important.

This works through three vectors that each token possesses (the so-called "Key, Query, Value" analogy):

1. Query (Q): What am I looking for? (e.g., "I am a verb, I'm looking for the subject that performs the action").
2. Key (K): What do I offer? (e.g., "I am a noun/subject").
3. Value (V): What is my actual content? (e.g., "dog").

When Query and Key match (high mathematical similarity), then much of the Value flows into the current token. Here is the mathematical formula that we will see in the code shortly:

$$Attention(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

### Import of required libraries for building a bigram language model using PyTorch.

In [1]:
import torch
import torch.nn as nn
from torch.nn import functional as F

Hyperparameter-Definitionen für das Bigram-Sprachmodell

In [2]:
# --- Hyperparameter ---
batch_size = 64 # More power through M4
block_size = 256 # Context: The model looks back 256 characters
max_iters = 5000 # Reduced for quicker training
eval_interval = 250     # evaluate every 250 steps
learning_rate = 3e-4 # slightly lower for more complex networks
eval_iters = 200
n_embd = 384     # size of the embedding vectors (dimension)
n_head = 6       # number of attention heads (384 / 6 = 64 dim per head)
n_layer = 6      # number of transformer blocks
dropout = 0.2    # against overfitting

# device configuration
device = 'mps' if torch.backends.mps.is_available() else 'cpu' # M4 Check!
print(f"Using device: {device}")


Using device: mps


## 1. Load data and tokenization

Loading data from a text file and creating character-level tokenization

**Tokenization & Encoding**
Wir nutzen hier Character-Level Tokenization. a -> 1, b -> 2.

Modernere Modelle wie GPT-4 nutzen "Sub-word Tokenization" (Tiktoken), wo häufige Wortteile (z.B. "ing" oder "Pre") ein einziges Token sind. Für unser Verständnis reicht Character-Level völlig aus und macht den Code schlanker.

In [3]:
DATAPATH = 'data/tinyshakespeare.txt'

In [4]:
# !curl -o {DATAPATH} https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

In [5]:
# set random seed for reproducibility
torch.manual_seed(42)

# Load text data
with open(DATAPATH, 'r', encoding='utf-8') as f:
    text = f.read()
    print("Text data loaded.")
    print(f"Length of dataset in characters: {len(text)}")

Text data loaded.
Length of dataset in characters: 1115394


Sorting and Mapping of characters to indices and vice versa

In [6]:
# Sorting and Mapping of characters to indices and vice versa
chars = sorted(list(set(text)))
vocab_size = len(chars)
print("All unique characters:", ''.join(chars))
print(f"Vocab size: {vocab_size}")

# Mapping: Zeichen zu Integers (Tokenization)
stoi = { ch:i for i,ch in enumerate(chars) } # string to int
itos = { i:ch for i,ch in enumerate(chars) } # int to string
encode = lambda s: [stoi[c] for c in s] # Encoder: String -> Liste von ints
decode = lambda l: ''.join([itos[i] for i in l]) # Decoder: Liste von ints -> String
print(encode("hello world"))
print(decode(encode("hello world")))

All unique characters: 
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Vocab size: 65
[46, 43, 50, 50, 53, 1, 61, 53, 56, 50, 42]
hello world


Data preparation: splitting into training and validation sets

In [7]:
# Train/Test Split
data = torch.tensor(encode(text), dtype=torch.long) # Convert the entire text into a list of token IDs
# Split into training and validation data
n = int(0.9*len(data)) # 90% for Training, 10% for Validation
train_data = data[:n] # train_data 
val_data = data[n:] # val_data

Auxillary functions for data batching and loss estimation

In [8]:
# --- Helper function: Data batching ---
def get_batch(split):
    # Generates a small batch of inputs (x) and targets (y)
    data = train_data if split == 'train' else val_data
    # We choose random starting points in the text
    ix = torch.randint(len(data) - block_size, (batch_size,))
    # x is the context, y is the target (the next character)
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device) # Move to the M4
    return x, y

# --- Helper function: Loss estimation (without backprop) ---
@torch.no_grad() # Disable gradient tracking for efficiency
def estimate_loss(model):
    out = {}
    model.eval() # set model to evaluation mode
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train() # set model back to training mode
    return out

## 2. Definition of Transformer Model Components

### What is a Transformer Model?

A transformer model is a type of neural network architecture that is designed to process sequential data, such as text. Unlike traditional recurrent neural networks (RNNs), which process data sequentially, transformers use a mechanism called self-attention to weigh the importance of different words in a sentence, regardless of their position. This allows transformers to capture long-range dependencies and relationships between words more effectively.

### How it works?

1. **Tokens & Positional Encoding**: The model is trained on a large body of text (a corpus). Each word is converted into a vector (embedding), and positional encodings are added to give the model information about the position of each word in the sequence.
2. **Self-Attention Mechanism**: The core innovation. It lets the model weigh how relevant every other word in the input is to the current word, understanding context (e.g., what "it" refers to in a sentence). It uses self-attention to compute a representation of each word in the context of all other words in the sentence.
3. **Multi-Head Attention**: Instead of having a single attention mechanism, transformers use multiple attention heads to capture different types of relationships and dependencies in the data.

In [13]:
class Head(nn.Module):
    """ One attention head """
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Register buffer, so that it is not a trainable parameter
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size))) # Lower triangular matrix
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,head_size)
        q = self.query(x) # (B,T,head_size)
        # Compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B,T,T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # type: ignore # Masking future tokens
        wei = F.softmax(wei, dim=-1) # (B,T,T)
        wei = self.dropout(wei)
        # Perform the weighted aggregation of the values
        v = self.value(x) # (B,T,head_size)
        out = wei @ v # (B,T,head_size)
        return out
    
class MultiHeadAttention(nn.Module):
    """ Multiple Heads in parallel """
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Output der Heads konkatenieren
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedFoward(nn.Module):
    """ Ein einfaches lineares Layer gefolgt von Nicht-Linearität """
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), # Expansion (Standard in Transformern)
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd), # Projection zurück
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer Block: Kommunikation gefolgt von Berechnung """
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size) # Communication
        self.ffwd = FeedFoward(n_embd)                  # Computation
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        # Residual Connections (x + ...) sind extrem wichtig für tiefes Lernen!
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x



### The GPT Model Definition


In [14]:
class GPTLanguageModel(nn.Module):
    """ The GPT Language Model """
    def __init__(self):
        super().__init__()
        # Embedding für Token-Identität
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        # Embedding für Position im Satz (WICHTIG!)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        
        # Die Transformer Blöcke
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # Finaler Layer Norm
        self.lm_head = nn.Linear(n_embd, vocab_size) # Projektion auf Vokabular

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # Token Emb + Positional Emb
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        
        # Durch die Transformer Blöcke
        x = self.blocks(x) 
        x = self.ln_f(x)
        
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            # Kontext beschneiden, wenn er zu lang wird (wir haben max block_size PosEmbeddings)
            idx_cond = idx[:, -block_size:]
            logits, loss = self(idx_cond)
            logits = logits[:, -1, :] 
            probs = F.softmax(logits, dim=-1) 
            idx_next = torch.multinomial(probs, num_samples=1) 
            idx = torch.cat((idx, idx_next), dim=1) 
        return idx

**Note to Embeddings (nn.Embedding Layer == Table):**

In this simple model, the embedding table does not yet function as a semantic vector space (like "King - Man + Woman = Queen"). Here it is a simple lookup table. When the model sees the letter "a", it looks up row "a" in the table. There are probability scores (logits) for all possible letters that could come next.

## 3. Initialization and Training of Transformer Model

### Model initialization

Initialize the model and move to device

In [15]:
# initialize the model and move to device
model = GPTLanguageModel()
model = model.to(device) # Move model to M4
# Print number of parameters
print(f"Number of parameters: {sum(p.numel() for p in model.parameters())}")

# Optimizer (AdamW is standard for LLMs)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

Number of parameters: 10788929


### Training Loop
Training the bigram language model using mini-batch gradient descent and periodic loss estimation.

In [38]:
print("Start training ... (this may take a while)")
for iter in range(max_iters):
    # Every eval_interval iterations, estimate loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss(model)
        print(f"Step {iter}: Train Loss {losses['train']:.4f}, Val Loss {losses['val']:.4f}")

    # Get a batch of data
    xb, yb = get_batch('train')

    # Forward pass
    logits, loss = model(xb, yb)

    # Backward pass and optimization step
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

Start training ... (this may take a while)
Step 0: Train Loss 4.3327, Val Loss 4.3355
Step 250: Train Loss 2.3843, Val Loss 2.4145
Step 500: Train Loss 2.0241, Val Loss 2.0974
Step 750: Train Loss 1.7672, Val Loss 1.9069
Step 1000: Train Loss 1.6152, Val Loss 1.7761
Step 1250: Train Loss 1.5155, Val Loss 1.7087
Step 1500: Train Loss 1.4436, Val Loss 1.6387
Step 1750: Train Loss 1.3956, Val Loss 1.6084
Step 2000: Train Loss 1.3482, Val Loss 1.5897
Step 2250: Train Loss 1.3177, Val Loss 1.5654
Step 2500: Train Loss 1.2822, Val Loss 1.5357
Step 2750: Train Loss 1.2556, Val Loss 1.5129
Step 3000: Train Loss 1.2312, Val Loss 1.5123
Step 3250: Train Loss 1.2116, Val Loss 1.5103
Step 3500: Train Loss 1.1880, Val Loss 1.4995
Step 3750: Train Loss 1.1678, Val Loss 1.4918
Step 4000: Train Loss 1.1518, Val Loss 1.4965
Step 4250: Train Loss 1.1316, Val Loss 1.5024
Step 4500: Train Loss 1.1117, Val Loss 1.4885
Step 4750: Train Loss 1.0938, Val Loss 1.4892
Step 4999: Train Loss 1.0801, Val Loss 1.50

## 4. Save the trained model

In [16]:
model_path = "models/nano_gpt_shakespeare.pt"
torch.save(model.state_dict(), model_path)
print(f"\nModell-Gewichte gespeichert unter: {model_path}")


Modell-Gewichte gespeichert unter: models/nano_gpt_shakespeare.pt


## 5. Deployment and Text Generation

In [39]:
print("Generation of text:")
context = torch.zeros((1, 1), dtype=torch.long, device=device) # start with a single zero token
generated_indices = model.generate(context, max_new_tokens=500)[0].tolist()
print(decode(generated_indices))

Generation of text:

Little heaven, and finst thee, believe thee
And not in him plainted stones.

Provost:
What more, than my bosom, here had they were they?
And withal: this woman a two, Tramn.

DUKE VINCENTIO:
Say thou art a just: donestrate hour reposed:
Most redeat frtning fools, thou spoke
Look'd with outragely madam?

DUKE VINCENTIO:
I am gled time well.

LUCIO:
By good, sir, to-morrow.

Nurse:
Should you hence!

ESpator:
When you are now toe, and since mewhat worse't, what's
stain'd off cales.
Bad, there, Vau
