# Project: Transformer-based Language Model from Scratch

Training a nanoGPT-style Transformer language model on the Tiny Shakespeare dataset.

## Conceptual Overview

The heart of modern LLMs is the Self-Attention mechanism.

**Concept: Why Attention?**

In the bigram model, each prediction depends only on the current token: when processing the sentence "The dog barks," the model at the word "barks" no longer knows that "dog" came before. A Transformer looks at all previous tokens and dynamically decides which ones are important.

This works through three vectors that each token possesses (the so-called "Key, Query, Value" analogy):

1. Query (Q): What am I looking for? (e.g., "I am a verb, I'm looking for the subject that performs the action").
2. Key (K): What do I offer? (e.g., "I am a noun/subject").
3. Value (V): What is my actual content? (e.g., "dog").

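When a Query and a Key match (their dot product is large), a large share of the corresponding Value flows into the current token's representation. Dividing by $\sqrt{d_k}$, the square root of the key dimension, keeps the scores in a range where the softmax does not saturate. This is the scaled dot-product attention formula: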

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
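The formula can be tried out directly on small random tensors. The following is a minimal sketch, not the notebook's actual implementation: the sizes `T` and `d_k` are illustrative assumptions, and the causal mask is added here so that each token only looks at itself and previous tokens, as described above.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, d_k = 4, 8                       # illustrative sizes: 4 tokens, 8 dimensions per vector
q = torch.randn(T, d_k)             # one query per token
k = torch.randn(T, d_k)             # one key per token
v = torch.randn(T, d_k)             # one value per token

# Similarity of every query with every key, scaled by sqrt(d_k)
scores = q @ k.T / math.sqrt(d_k)   # shape (T, T)

# Causal mask: position t may only attend to positions <= t
causal_mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
scores = scores.masked_fill(~causal_mask, float("-inf"))

weights = F.softmax(scores, dim=-1) # each row sums to 1
out = weights @ v                   # weighted mix of values, shape (T, d_k)
```

Because the first token can only attend to itself, `out[0]` equals `v[0]`; later rows are mixtures of the values seen so far.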

### Import of the required libraries for building a Transformer language model using PyTorch.

In [49]:
import torch
import torch.nn as nn
from torch.nn import functional as F

Hyperparameter definitions for training the language model

In [50]:
# Hyperparameter settings for training a transformer model
batch_size = 32       # number of sequences processed in parallel
max_iters = 5000      # total training steps (reduced for quicker training)
eval_interval = 250   # evaluate every 250 steps
learning_rate = 3e-4  # slightly lower for more complex networks
eval_iters = 200      # number of batches averaged per loss estimate


# device configuration
device = 'mps' if torch.backends.mps.is_available() else 'cpu' # M4 Check!
print(f"Using device: {device}")


Using device: mps


## 1. Load data and tokenization

Loading data from a text file and creating character-level tokenization

**Tokenization & Encoding**
We use character-level tokenization here: every character is mapped to an integer (schematically a -> 1, b -> 2; the actual IDs come from the sorted vocabulary built below).

More modern models such as GPT-4 use sub-word tokenization (tiktoken), where frequent word fragments (e.g., "ing" or "Pre") form a single token. Character-level tokenization is entirely sufficient for our understanding and keeps the code leaner.

In [51]:
DATAPATH = 'data/tinyshakespeare.txt'

In [52]:
# !curl -o {DATAPATH} https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

In [53]:
# set random seed for reproducibility
torch.manual_seed(42)

# Load text data
with open(DATAPATH, 'r', encoding='utf-8') as f:
    text = f.read()
    print("Text data loaded.")
    print(f"Length of dataset in characters: {len(text)}")

Text data loaded.
Length of dataset in characters: 1115394


Sorting and Mapping of characters to indices and vice versa

In [54]:
# Sorting and Mapping of characters to indices and vice versa
chars = sorted(list(set(text)))
vocab_size = len(chars)
print("All unique characters:", ''.join(chars))
print(f"Vocab size: {vocab_size}")

# Mapping: characters to integers (tokenization)
stoi = { ch:i for i,ch in enumerate(chars) } # string to int
itos = { i:ch for i,ch in enumerate(chars) } # int to string
encode = lambda s: [stoi[c] for c in s] # Encoder: string -> list of ints
decode = lambda l: ''.join([itos[i] for i in l]) # Decoder: list of ints -> string
print(encode("hello world"))
print(decode(encode("hello world")))

All unique characters: 
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Vocab size: 65
[46, 43, 50, 50, 53, 1, 61, 53, 56, 50, 42]
hello world


Data preparation: splitting into training and validation sets

In [55]:
# Train/validation split
data = torch.tensor(encode(text), dtype=torch.long) # Convert the entire text into a 1-D tensor of token IDs
# Split into training and validation data
n = int(0.9*len(data)) # 90% for training, 10% for validation
train_data = data[:n] # first 90% of the tokens
val_data = data[n:] # remaining 10% of the tokens

## 2. Definition of Transformer Model Components

### What is a Transformer Model?

A transformer model is a type of neural network architecture that is designed to process sequential data, such as text. Unlike traditional recurrent neural networks (RNNs), which process data sequentially, transformers use a mechanism called self-attention to weigh the importance of different words in a sentence, regardless of their position. This allows transformers to capture long-range dependencies and relationships between words more effectively.

### How does it work?

1. **Tokens & Positional Encoding**: The model is trained on a body of text (a corpus). Each token (here a single character; in larger models a word or word fragment) is converted into a vector (embedding), and positional encodings are added so the model knows where each token sits in the sequence.
2. **Self-Attention Mechanism**: The core innovation. For each token, the model weighs how relevant every other token in the input is and builds a context-aware representation from them (e.g., resolving what "it" refers to in a sentence). In a GPT-style language model, each token attends only to itself and to earlier tokens.
3. **Multi-Head Attention**: Instead of having a single attention mechanism, transformers use multiple attention heads to capture different types of relationships and dependencies in the data.
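The three steps above can be sketched with PyTorch's built-in modules. This is only an illustration under assumptions: the sizes are deliberately small, the positions are encoded with a learned `nn.Embedding` (one common choice), and `nn.MultiheadAttention` stands in for whatever attention code `llmlib` actually uses.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, n_embd, n_head, block_size = 65, 32, 4, 8  # small illustrative sizes

tok_emb = nn.Embedding(vocab_size, n_embd)   # step 1: token ID -> vector
pos_emb = nn.Embedding(block_size, n_embd)   # step 1: position -> vector (assumed learned)
mha = nn.MultiheadAttention(n_embd, n_head, batch_first=True)  # steps 2 and 3

idx = torch.randint(vocab_size, (2, block_size))      # (B, T) batch of token IDs
x = tok_emb(idx) + pos_emb(torch.arange(block_size))  # (B, T, n_embd)

# Boolean mask: True marks positions a token is NOT allowed to attend to (the future)
future = torch.triu(torch.ones(block_size, block_size, dtype=torch.bool), diagonal=1)
y, _ = mha(x, x, x, attn_mask=future, need_weights=False)  # (B, T, n_embd)
```

With the mask in place, changing the last token of a sequence leaves the outputs at all earlier positions unchanged, which is the property a next-token predictor needs.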

**Note on Embeddings (nn.Embedding Layer == Table):**

In a simple bigram model, the embedding table does not function as a semantic vector space (like "King - Man + Woman = Queen"). It is a plain lookup table: when the model sees the letter "a", it looks up row "a", and that row holds the scores (logits) for all possible letters that could come next. In the Transformer used here, the lookup works the same way, but each row is an `n_embd`-dimensional vector that the attention layers process further before the logits are computed.
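The "table" behaviour of `nn.Embedding` is easy to verify: indexing with token ID `i` returns row `i` of the weight matrix. A minimal sketch, using this notebook's vocabulary size and `n_embd` as the row width (in a bigram-style table the row width would equal the vocabulary size instead); the token ID 39 corresponds to "a" in the sorted vocabulary shown earlier.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, n_embd = 65, 384
table = nn.Embedding(vocab_size, n_embd)  # one trainable row per token

token_id = torch.tensor([39])             # ID of "a" in the sorted vocabulary
vec = table(token_id)                     # lookup: shape (1, n_embd)

# The lookup is literally row selection from the weight matrix
same_row = torch.equal(vec[0], table.weight[39])
```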

## 3. Initialization and Training of Transformer Model

### Model initialization

Initialize the model and move to device

In [56]:
from llmlib import nanoGPT

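# Hyperparameters (listed for reference: GPTLanguageModel() below is called without
# arguments, so llmlib presumably applies its own defaults)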
n_embd = 384     # size of the embedding vectors (dimension)
n_head = 6       # number of attention heads (384 / 6 = 64 dim per head)
n_layer = 6      # number of transformer blocks
block_size = 256 # Context: The model looks back 256 characters
vocab_size = 65  # number of unique characters in the vocabulary
dropout = 0.2    # against overfitting

device = 'mps' if torch.backends.mps.is_available() else 'cpu'

# initialize the model and move to device
model = nanoGPT.GPTLanguageModel()
model = model.to(device) # Move model to M4
# Print number of parameters
print(f"Number of parameters: {sum(p.numel() for p in model.parameters())}")

# Optimizer (AdamW is standard for LLMs)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

Number of parameters: 10788929
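The reported parameter count can be reproduced by hand. The arithmetic below is a sketch under assumptions, since the internals of `llmlib` are not shown here: it assumes the common nanoGPT lecture layout (bias-free query/key/value projections, an output projection with bias, a feed-forward layer with 4x expansion, two LayerNorms per block, learned position embeddings, and a final LayerNorm plus an output layer with bias).

```python
n_embd, n_head, n_layer, block_size, vocab_size = 384, 6, 6, 256, 65

tok_emb = vocab_size * n_embd                 # token embedding table
pos_emb = block_size * n_embd                 # position embedding table (assumed learned)

qkv = 3 * n_embd * n_embd                     # Q, K, V for all heads, no bias (assumed)
proj = n_embd * n_embd + n_embd               # attention output projection with bias
ffn = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)  # two linear layers
norms = 2 * (2 * n_embd)                      # two LayerNorms (weight + bias each)
per_block = qkv + proj + ffn + norms

final_norm = 2 * n_embd                       # LayerNorm before the output layer
lm_head = n_embd * vocab_size + vocab_size    # projection to logits with bias

total = tok_emb + pos_emb + n_layer * per_block + final_norm + lm_head
print(total)
```

Under these assumptions the total comes out to 10788929, matching the number printed by the notebook. Almost all of the parameters sit in the six Transformer blocks; the embedding tables are comparatively small.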


Auxiliary functions for data batching and loss estimation

In [57]:
# --- Helper function: Data batching ---
def get_batch(split):
    # Generates a small batch of inputs (x) and targets (y)
    data = train_data if split == 'train' else val_data
    # We choose random starting points in the text
    ix = torch.randint(len(data) - block_size, (batch_size,))
    # x is the context, y is the target (the next character)
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device) # Move to the M4
    return x, y

# --- Helper function: Loss estimation (without backprop) ---
@torch.no_grad() # Disable gradient tracking for efficiency
def estimate_loss(model):
    out = {}
    model.eval() # set model to evaluation mode
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train() # set model back to training mode
    return out
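The one-position shift between `x` and `y` in `get_batch` means that a single block yields `block_size` training examples at once: at position `t`, the context is `x[:t+1]` and the target is `y[t]`. A small sketch with stand-in data (the integers 0 to 9 instead of real token IDs, and an illustrative `block_size` of 4):

```python
import torch

data = torch.arange(10)   # stand-in for a tensor of token IDs
block_size = 4            # illustrative; the notebook uses 256
i = 2                     # one starting offset, as drawn by torch.randint in get_batch

x = data[i:i + block_size]           # context tokens
y = data[i + 1:i + block_size + 1]   # the same window shifted one step to the right

# Each prefix of x is paired with the token that follows it
pairs = [(x[:t + 1].tolist(), y[t].item()) for t in range(block_size)]
for context, target in pairs:
    print(f"context {context} -> target {target}")
```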

### Training Loop
Training the Transformer language model using mini-batch gradient descent and periodic loss estimation.

In [58]:
import os

# Path to save the trained model
train_path = 'train/nanoGPT_shakespeare.pt'
os.makedirs(os.path.dirname(train_path), exist_ok=True) # make sure the target folder exists
best_train_loss = float('inf')
best_val_loss = float('inf')

# --- Training Loop ---
print("Start training ... (this may take a while)")
for step in range(max_iters):
    # Every eval_interval iterations (and on the last one), estimate loss on train and val sets
    if step % eval_interval == 0 or step == max_iters - 1:
        losses = estimate_loss(model)
        print(f"Step {step}: Train Loss {losses['train']:.4f}, Val Loss {losses['val']:.4f}")
        # Save a checkpoint only when both validation and training loss beat the best values so far
        if losses['val'] < best_val_loss and losses['train'] < best_train_loss:
            best_val_loss = float(losses['val'])
            best_train_loss = float(losses['train'])
            torch.save(model.state_dict(), train_path)
            print(f"Model saved to {train_path}")


    # Get a batch of data
    xb, yb = get_batch('train')

    # Forward pass
    logits, loss = model(xb, yb)

    # Backward pass and optimization step
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

Start training ... (this may take a while)
Step 0: Train Loss 4.3329, Val Loss 4.3363
Model saved to train/nanoGPT_shakespeare.pt
Step 250: Train Loss 2.4187, Val Loss 2.4492
Model saved to train/nanoGPT_shakespeare.pt
Step 500: Train Loss 2.1459, Val Loss 2.1931
Model saved to train/nanoGPT_shakespeare.pt
Step 750: Train Loss 1.8992, Val Loss 2.0110
Model saved to train/nanoGPT_shakespeare.pt
Step 1000: Train Loss 1.7299, Val Loss 1.8751
Model saved to train/nanoGPT_shakespeare.pt
Step 1250: Train Loss 1.6188, Val Loss 1.7879
Model saved to train/nanoGPT_shakespeare.pt
Step 1500: Train Loss 1.5468, Val Loss 1.7358
Model saved to train/nanoGPT_shakespeare.pt
Step 1750: Train Loss 1.4889, Val Loss 1.6901
Model saved to train/nanoGPT_shakespeare.pt
Step 2000: Train Loss 1.4439, Val Loss 1.6416
Model saved to train/nanoGPT_shakespeare.pt
Step 2250: Train Loss 1.4030, Val Loss 1.6090
Model saved to train/nanoGPT_shakespeare.pt
Step 2500: Train Loss 1.3719, Val Loss 1.5933
Model saved to tr
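A useful sanity check for the first line of the log: a model that assigns equal probability to all 65 characters has a cross-entropy loss of ln(65) ≈ 4.174. The step-0 losses of about 4.33 sit close to that baseline, which is what one would expect from an untrained network. A minimal sketch of the baseline (the all-zero logits are just a convenient way to produce a uniform distribution):

```python
import math
import torch
import torch.nn.functional as F

vocab_size = 65
logits = torch.zeros(1, vocab_size)  # equal scores -> uniform distribution after softmax
target = torch.tensor([0])           # any target gives the same loss under a uniform guess

uniform_loss = F.cross_entropy(logits, target).item()
print(f"uniform baseline: {uniform_loss:.4f}, ln(65) = {math.log(vocab_size):.4f}")
```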

## 4. Save the trained model

In [59]:
model_path = "models/nano_gpt_shakespeare.pt"
torch.save(model.state_dict(), model_path)
print(f"\nModel weights saved to: {model_path}")


Model weights saved to: models/nano_gpt_shakespeare.pt
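Saved weights are only useful if they can be loaded back into a model of the same architecture. The sketch below shows the usual `state_dict` round trip; a tiny `nn.Linear` is a hypothetical stand-in for the GPT model, and a temporary directory replaces the `models/` folder so the example runs anywhere.

```python
import os
import tempfile
import torch
import torch.nn as nn

torch.manual_seed(0)
trained = nn.Linear(4, 2)   # hypothetical stand-in for the trained model

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "weights.pt")
    torch.save(trained.state_dict(), path)   # same call as in the cell above

    restored = nn.Linear(4, 2)               # fresh instance with the same architecture
    state = torch.load(path, map_location="cpu")  # map_location avoids device mismatches
    restored.load_state_dict(state)

restored.eval()  # switch off dropout and similar layers before inference
```

For the real model, the fresh instance would be created the same way as during training before calling `load_state_dict`.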


## 5. Deployment and Text Generation

In [60]:
print("Generation of text:")
context = torch.zeros((1, 1), dtype=torch.long, device=device) # start with a single zero token (ID 0 is the newline character)
generated_indices = model.generate(context, max_new_tokens=500)[0].tolist()
print(decode(generated_indices))

Generation of text:

Title there.

THOMASTASASTINGBROKE:
Moonio!
Ay, not in her mistrikes to be power a bone.
Is it is't, a man: belihful, and ask to the bright;
My garments are not a woman-placed clied
And Itally kneets and respere deaught.

GLOUCESTER:
Here's humother string for your doop!
That the valour it eyes of Edward Raven and a hrone?

EDWARD:
Ay, let's flesh the confesseth of dale was train'd
'What thou shalt slay that I say
Bear I do the rather and my fair trow'd,
And learn that I mean, all my favour,
Whi
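The implementation of `model.generate` lives in `llmlib` and is not shown here. Autoregressive sampling for this kind of model typically follows the loop sketched below, which is an assumption about the general technique rather than the library's code: crop the context to `block_size`, take the logits of the last position, sample one token, append it, and repeat. `TinyLM` is a hypothetical stand-in that returns `(logits, loss)` like the notebook's model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    """Hypothetical stand-in model: maps each token directly to next-token logits."""
    def __init__(self, vocab_size=65):
        super().__init__()
        self.table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        return self.table(idx), None   # (B, T, vocab_size) logits, no loss

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]               # keep at most block_size tokens of context
        logits, _ = model(idx_cond)
        probs = F.softmax(logits[:, -1, :], dim=-1)   # distribution over the next token
        next_id = torch.multinomial(probs, num_samples=1)  # sample instead of taking the argmax
        idx = torch.cat((idx, next_id), dim=1)        # append and continue
    return idx

torch.manual_seed(0)
context = torch.zeros((1, 1), dtype=torch.long)       # start from token ID 0
out = generate(TinyLM(), context, max_new_tokens=20, block_size=8)
```

Sampling with `torch.multinomial` rather than always picking the most likely character is what gives the generated text its variety.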
