A pedagogical implementation of continuous batching for LLM inference, featuring PagedAttention and concurrent request serving. Built to understand how production inference engines like vLLM work under the hood.
This engine demonstrates the core techniques used to serve multiple LLM requests efficiently:
- PagedAttention: the KV cache is stored in fixed-size memory blocks (like virtual memory paging in an OS), avoiding fragmentation and enabling dynamic memory allocation per request.
- Continuous Batching: multiple requests are processed concurrently in decode steps, maximizing GPU utilization instead of handling one request at a time.
- Batched Prefill: requests with identical prompt lengths are prefilled together in a single forward pass.
- Custom Attention Kernel: replaces HuggingFace's native attention with a custom implementation that reads from and writes to the paged KV cache.
On Apple Silicon (MPS) with Qwen2.5-1.5B-Instruct:
| Mode | Wall Time (8 requests, 465 tokens) | Throughput |
|---|---|---|
| Paged Sequential | 31.5s | 14.8 tok/s |
| Continuous Batching | 24.4s | 19.1 tok/s |
| Speedup | 1.29x | 29% faster |
Per-request latency (p = prompt tokens, g = generated tokens):

| Request | Tokens | Sequential | Batched | Improvement |
|---|---|---|---|---|
| Req-8 | 13p+100g | 31.5s | 23.0s | 8.5s faster |
| Req-7 | 13p+120g | 24.8s | 24.4s | 0.4s faster |
| Req-6 | 8p+50g | 16.5s | 15.8s | 0.7s faster |
Short requests experience higher latency (waiting in batch with long requests), but the system serves all requests 29% faster overall.
```
toy-llm-engine/
├── src/
│   ├── benchmark.py          # Performance comparison suite
│   ├── models/
│   │   └── qwen.py           # Custom PagedAttention implementation
│   └── toy_llm_engine/
│       ├── scheduler.py      # ContinuousBatchingScheduler
│       ├── worker.py         # InferenceWorker (model execution)
│       ├── memory.py         # PagedMemory + KVCachePool
│       └── utils.py          # Device detection
└── pyproject.toml
```
`memory.py` (PagedMemory + KVCachePool):

- Pre-allocates a large KV cache pool in GPU memory
- Allocates/frees fixed-size blocks (16 tokens per block) to requests
- Each request has a `block_table` mapping logical token positions to physical blocks
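The allocation scheme can be sketched as a free list of physical block IDs plus per-request block tables. This is a hypothetical simplification; the actual `PagedMemory` API in `memory.py` may differ.

```python
# Hypothetical free-list allocator illustrating paged KV allocation;
# the real PagedMemory in memory.py may expose a different API.
class PagedMemorySketch:
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of physical block IDs
        self.block_tables = {}                      # request_id -> [physical blocks]

    def allocate(self, request_id: str, num_tokens: int) -> None:
        """Grow the request's block table until it can hold num_tokens."""
        table = self.block_tables.setdefault(request_id, [])
        while len(table) * self.block_size < num_tokens:
            if not self.free_blocks:
                raise MemoryError("KV cache pool exhausted")
            table.append(self.free_blocks.pop(0))

    def free(self, request_id: str) -> None:
        """Return a finished request's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))


pool = PagedMemorySketch(num_blocks=200, block_size=16)
pool.allocate("req-1", 8)          # 8 tokens fit in 1 block
pool.allocate("req-1", 20)         # growing to 20 tokens needs a 2nd block
print(pool.block_tables["req-1"])  # [0, 1]
pool.free("req-1")
print(len(pool.free_blocks))       # 200: all blocks back in the pool
```

Because a request only ever holds whole blocks, freeing returns memory in reusable fixed-size units, which is what prevents fragmentation.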
`scheduler.py` (ContinuousBatchingScheduler):

- Maintains a waiting queue and a running batch
- Prefill priority: new requests are prefilled (possibly in batches) until the batch is full
- Decode step: generates 1 token for all active requests simultaneously
- Evicts finished requests and admits new ones
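The loop above can be sketched in a few lines. This toy version stubs out prefill and decode and assumes every request generates exactly `max_new_tokens` tokens; the real `ContinuousBatchingScheduler` is richer.

```python
from collections import deque

# Toy continuous-batching loop: admit until full, decode one token for
# all active requests, evict finished ones. Prefill/decode are stubbed.
def serve(request_ids, max_batch_size=4, max_new_tokens=5):
    waiting = deque(request_ids)              # waiting queue (FIFO)
    running = []                              # active decode batch
    generated = {r: 0 for r in request_ids}
    steps = 0
    while waiting or running:
        # Prefill priority: admit new requests until the batch is full.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())  # (real engine prefills here)
        # Decode step: one token for every active request at once.
        for req in running:
            generated[req] += 1
        steps += 1
        # Evict finished requests so waiting ones can be admitted.
        running = [r for r in running if generated[r] < max_new_tokens]
    return generated, steps

out, steps = serve(["a", "b", "c", "d", "e"], max_batch_size=4, max_new_tokens=5)
print(out)    # every request got 5 tokens
print(steps)  # 10 decode steps, vs 25 if run one request at a time
```

The batched loop finishes in 10 decode steps because four requests share each step; a sequential engine would need 5 requests × 5 tokens = 25 steps.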
`qwen.py` (custom PagedAttention kernel):

- Replaces the native attention in Qwen2.5-1.5B
- Writes new K/V tensors directly to physical memory blocks
- Gathers the full KV history from paged memory for each request
- Uses batched `scaled_dot_product_attention` with padding masks for efficiency
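The gather step amounts to translating each logical token position into a (physical block, offset) slot. A pure-Python sketch of that address translation (the real kernel does this with tensor indexing before calling `scaled_dot_product_attention`):

```python
BLOCK_SIZE = 16  # tokens per block, matching the pool configuration

# Hypothetical gather: read a request's KV history in logical order
# out of non-contiguous physical blocks.
def gather_kv(kv_pool, block_table, num_tokens):
    """kv_pool: physical block id -> list of BLOCK_SIZE slots.
    block_table: logical block index -> physical block id."""
    history = []
    for pos in range(num_tokens):
        block_id = block_table[pos // BLOCK_SIZE]  # which physical block
        offset = pos % BLOCK_SIZE                  # slot within that block
        history.append(kv_pool[block_id][offset])
    return history

# Toy pool: 4 physical blocks, each slot tagged with (block, offset)
kv_pool = {b: [(b, o) for o in range(BLOCK_SIZE)] for b in range(4)}
# This request owns physical blocks 2 then 0; its 18 tokens span both
history = gather_kv(kv_pool, [2, 0], 18)
print(history[0])   # (2, 0): token 0 lives in block 2
print(history[16])  # (0, 0): token 16 starts the second logical block
```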
`worker.py` (InferenceWorker):

- Loads the model and patches it with custom attention layers
- Constructs correct position IDs for RoPE (rotary embeddings)
- Executes forward passes for prefill (batch_size × seq_len) and decode (batch_size × 1)
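Position-ID construction for RoPE can be sketched as follows, assuming prefill covers the whole prompt and each decode step adds exactly one token per request (a simplification of what `worker.py` actually does):

```python
# Hypothetical position-ID helpers for RoPE.
def prefill_position_ids(batch_size: int, seq_len: int):
    """Prefill: every sequence in the batch gets positions 0..seq_len-1."""
    return [list(range(seq_len)) for _ in range(batch_size)]

def decode_position_ids(lengths):
    """Decode: each request's single new token sits at position = tokens so far."""
    return [[n] for n in lengths]

print(prefill_position_ids(2, 4))       # [[0, 1, 2, 3], [0, 1, 2, 3]]
print(decode_position_ids([8, 13, 5]))  # [[8], [13], [5]]
```

Getting these right matters because RoPE rotates each K/Q vector by an angle derived from its absolute position; a wrong position ID silently corrupts attention scores.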
```bash
# Clone and install dependencies
cd toy-llm-engine
uv sync  # or pip install -e .
```

Then run the scheduler demo:

```bash
python src/toy_llm_engine/scheduler.py
```

Submits 3 test requests and streams output in real time.
```bash
python src/benchmark.py
```

Compares three modes:

- HF Sequential: vanilla HuggingFace model, one request at a time
- Paged Sequential: custom paged attention, but sequential (`batch_size=1`)
- Continuous Batching: full batching with paged attention (`batch_size=8`)
Output includes:
- Wall time and throughput
- Per-request latency (from user's t=0)
- P50/P95 latencies
- TTFT (time to first token)
Edit constants in `scheduler.py`:
```python
pool = KVCachePool(
    num_layers=28,    # Model depth
    num_kv_heads=2,   # GQA heads
    head_dim=128,     # Attention head dimension
    num_blocks=200,   # Total memory blocks
    block_size=16,    # Tokens per block
    device=device,
)

scheduler = ContinuousBatchingScheduler(
    worker,
    manager,
    max_batch_size=4,    # Max concurrent requests
    verbose=True,        # Print scheduling logs
    stream_output=True,  # Stream per-token output
)
```

Prefill (new request):

```
User: "Write a haiku about AI"
        ↓
Tokenizer: [101, 5559, 2023, ...] (8 tokens)
        ↓
Allocate: 1 block (16-token capacity)
        ↓
Forward pass: process all 8 tokens → establish KV cache
        ↓
Generate: first output token → add to running batch
```
Decode (one step, batch of 3 shown):

```
Running Batch: [Req-1, Req-2, Req-3]
        ↓
Allocate: 1 token of memory for each request
        ↓
Gather: read KV history from paged blocks
        ↓
Attention: compute attention with padding mask
        ↓
Forward: [batch_size, 1] → [batch_size, vocab_size]
        ↓
Output: argmax → 3 new tokens (one per request)
        ↓
Check: remove finished requests, admit new ones
```
Physical Memory Pool (200 blocks × 16 tokens):

```
┌──────┬──────┬──────┬──────┬──────┬──────┬──────┐
│ Blk0 │ Blk1 │ Blk2 │ Blk3 │ Blk4 │ Blk5 │ ...  │
└──────┴──────┴──────┴──────┴──────┴──────┴──────┘

Request Block Tables:
  Req-1: [0, 4]    → 32 tokens capacity
  Req-2: [1, 3]    → 32 tokens capacity
  Req-3: [2, 5, 6] → 48 tokens capacity
```
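Note that capacity depends only on how many blocks a request holds, not on which physical blocks they are; Req-3's non-contiguous [2, 5, 6] is exactly what avoids fragmentation. In code:

```python
BLOCK_SIZE = 16  # tokens per block

# Block tables from the diagram above; physical block IDs need not be
# contiguous or ordered.
block_tables = {
    "Req-1": [0, 4],
    "Req-2": [1, 3],
    "Req-3": [2, 5, 6],
}

for req, table in block_tables.items():
    print(req, len(table) * BLOCK_SIZE, "tokens capacity")  # 32, 32, 48
```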
- Prefill: a lower-triangular causal mask prevents attention to future tokens
- Decode: the single new token attends to the full history (no mask needed)
- All attention is computed in float32 to avoid NaN/overflow issues on MPS
- SDPA handles the internal numerics correctly for padded sequences
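The two mask regimes above can be sketched as boolean matrices, assuming the convention that `True` means "may attend" (frameworks differ on this, so check before reusing):

```python
# Hypothetical mask builders; True = "may attend".
def prefill_mask(seq_len: int):
    """Lower-triangular causal mask: token i attends to positions <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def decode_mask(history_len: int):
    """A single new token attends to the entire history: no masking."""
    return [True] * history_len

print(prefill_mask(3))
# [[True, False, False], [True, True, False], [True, True, True]]
print(decode_mask(4))
# [True, True, True, True]
```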
- Groups same-length prompts together: `[[prompt1], [prompt2], ...]`
- Reduces prefill steps from `N` to `~N/batch_size`
- Position IDs expanded: `[0,1,2,...,seq_len]` → `[[0,1,2,...], [0,1,2,...]]`
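The grouping step is a simple bucket-by-length pass; prompts here are token-id lists, and only identical lengths share a prefill batch (a sketch, not the exact implementation):

```python
from collections import defaultdict

# Bucket prompts by token length so each bucket prefills in one pass.
def group_by_length(prompts):
    groups = defaultdict(list)
    for p in prompts:
        groups[len(p)].append(p)
    return list(groups.values())

prompts = [[1, 2, 3], [4, 5], [6, 7, 8], [9, 10]]
print(group_by_length(prompts))
# [[[1, 2, 3], [6, 7, 8]], [[4, 5], [9, 10]]]: 4 prompts -> 2 prefill passes
```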
Individual component tests:

```bash
# Test memory allocation
python src/toy_llm_engine/memory.py

# Test scheduler with 3 requests
python src/toy_llm_engine/scheduler.py

# Full benchmark suite
python src/benchmark.py
```

When continuous batching wins:

✅ Long-running requests: a high decode step count amortizes the prefill cost
✅ Heterogeneous workloads: long requests don't block short ones in the queue
✅ High-throughput scenarios: system-wide tokens/s increases significantly
When it doesn't:

❌ Very short requests: batching overhead outweighs the benefit for 5–10 token outputs
❌ MPS backend: lacks FlashAttention; padded SDPA is slower than native HF kernels
❌ Small batch sizes: you need 10+ concurrent requests to see clear wins
- CUDA with FlashAttention: ~2–3x speedup expected
- MPS (current): 1.29x speedup due to the naive SDPA implementation
- Padding overhead on MPS hurts performance more than on CUDA
This project teaches:
- How PagedAttention reduces memory fragmentation
- Why continuous batching improves GPU utilization
- The tradeoffs between latency and throughput
- How to instrument and benchmark inference systems
- The gap between research ideas and production-ready code
- vLLM Paper: original PagedAttention design
- FlashAttention: efficient attention kernels
- Orca Paper: the continuous batching concept
This is a pedagogical project. Focus areas for improvement:
- Prefix caching (share KV cache for common prompts)
- Speculative decoding
- Priority scheduling (shortest-job-first)
- Quantized KV cache (int8/int4)
- Multi-GPU support
MIT License: free to use for educational purposes.
Built to learn. Optimized for understanding.