A pedagogical implementation of continuous batching for LLM inference, featuring PagedAttention and concurrent request serving. Built to understand how production inference engines like vLLM work under the hood.
This engine demonstrates the core techniques used to serve multiple LLM requests efficiently:
- PagedAttention: the KV cache is stored in fixed-size memory blocks (like virtual memory paging in an OS), avoiding fragmentation and enabling dynamic memory allocation per request.
- Continuous Batching: multiple requests are processed concurrently in decode steps, maximizing GPU utilization instead of handling one request at a time.
- Batched Prefill: requests with identical prompt lengths are prefilled together in a single forward pass.
- Custom Attention Kernel: replaces HuggingFace's native attention with a custom implementation that reads from and writes to the paged KV cache.
On Apple Silicon (MPS) with Qwen2.5-1.5B-Instruct:
| Mode | Wall Time (8 requests, 465 tokens) | Throughput |
|---|---|---|
| Paged Sequential | 31.5s | 14.8 tok/s |
| Continuous Batching | 24.4s | 19.1 tok/s |
| Speedup | 1.29x | 29% faster |
Per-request latency (p = prompt tokens, g = generated tokens):

| Request | Tokens | Sequential | Batched | Improvement |
|---|---|---|---|---|
| Req-8 | 13p+100g | 31.5s | 23.0s | 8.5s faster |
| Req-7 | 13p+120g | 24.8s | 24.4s | 0.4s faster |
| Req-6 | 8p+50g | 16.5s | 15.8s | 0.7s faster |
Short requests experience higher latency (waiting in batch with long requests), but the system serves all requests 29% faster overall.
```
toy-llm-engine/
├── src/
│   ├── benchmark.py          # Performance comparison suite
│   ├── models/
│   │   └── qwen.py           # Custom PagedAttention implementation
│   └── toy_llm_engine/
│       ├── scheduler.py      # ContinuousBatchingScheduler
│       ├── worker.py         # InferenceWorker (model execution)
│       ├── memory.py         # PagedMemory + KVCachePool
│       └── utils.py          # Device detection
└── pyproject.toml
```
`memory.py` (PagedMemory + KVCachePool):

- Pre-allocates a large KV cache pool in GPU memory
- Allocates/frees fixed-size blocks (16 tokens per block) to requests
- Each request has a `block_table` mapping logical token positions to physical blocks
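The allocation scheme can be sketched as a free list of physical block IDs plus per-request block tables. This is a hypothetical simplification; the actual `PagedMemory` API in `memory.py` may differ.

```python
# Hypothetical free-list allocator illustrating paged KV allocation;
# the real PagedMemory in memory.py may expose a different API.
class PagedMemorySketch:
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of physical block IDs
        self.block_tables = {}                      # request_id -> [physical blocks]

    def allocate(self, request_id: str, num_tokens: int) -> None:
        """Grow the request's block table until it can hold num_tokens."""
        table = self.block_tables.setdefault(request_id, [])
        while len(table) * self.block_size < num_tokens:
            if not self.free_blocks:
                raise MemoryError("KV cache pool exhausted")
            table.append(self.free_blocks.pop(0))

    def free(self, request_id: str) -> None:
        """Return a finished request's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))


pool = PagedMemorySketch(num_blocks=200, block_size=16)
pool.allocate("req-1", 8)          # 8 tokens fit in 1 block
pool.allocate("req-1", 20)         # growing to 20 tokens needs a 2nd block
print(pool.block_tables["req-1"])  # [0, 1]
pool.free("req-1")
print(len(pool.free_blocks))       # 200: all blocks back in the pool
```

Because a request only ever holds whole blocks, freeing returns memory in reusable fixed-size units, which is what prevents fragmentation.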
`scheduler.py` (ContinuousBatchingScheduler):

- Maintains a waiting queue and a running batch
- Prefill priority: new requests are prefilled (possibly in batches) until the batch is full
- Decode step: generates 1 token for all active requests simultaneously
- Evicts finished requests and admits new ones
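The loop above can be sketched in a few lines. This toy version stubs out prefill and decode and assumes every request generates exactly `max_new_tokens` tokens; the real `ContinuousBatchingScheduler` is richer.

```python
from collections import deque

# Toy continuous-batching loop: admit until full, decode one token for
# all active requests, evict finished ones. Prefill/decode are stubbed.
def serve(request_ids, max_batch_size=4, max_new_tokens=5):
    waiting = deque(request_ids)              # waiting queue (FIFO)
    running = []                              # active decode batch
    generated = {r: 0 for r in request_ids}
    steps = 0
    while waiting or running:
        # Prefill priority: admit new requests until the batch is full.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())  # (real engine prefills here)
        # Decode step: one token for every active request at once.
        for req in running:
            generated[req] += 1
        steps += 1
        # Evict finished requests so waiting ones can be admitted.
        running = [r for r in running if generated[r] < max_new_tokens]
    return generated, steps

out, steps = serve(["a", "b", "c", "d", "e"], max_batch_size=4, max_new_tokens=5)
print(out)    # every request got 5 tokens
print(steps)  # 10 decode steps, vs 25 if run one request at a time
```

The batched loop finishes in 10 decode steps because four requests share each step; a sequential engine would need 5 requests × 5 tokens = 25 steps.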
`qwen.py` (custom PagedAttention kernel):

- Replaces the native attention in Qwen2.5-1.5B
- Writes new K/V tensors directly to physical memory blocks
- Gathers the full KV history from paged memory for each request
- Uses batched `scaled_dot_product_attention` with padding masks for efficiency
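The gather step amounts to translating each logical token position into a (physical block, offset) slot. A pure-Python sketch of that address translation (the real kernel does this with tensor indexing before calling `scaled_dot_product_attention`):

```python
BLOCK_SIZE = 16  # tokens per block, matching the pool configuration

# Hypothetical gather: read a request's KV history in logical order
# out of non-contiguous physical blocks.
def gather_kv(kv_pool, block_table, num_tokens):
    """kv_pool: physical block id -> list of BLOCK_SIZE slots.
    block_table: logical block index -> physical block id."""
    history = []
    for pos in range(num_tokens):
        block_id = block_table[pos // BLOCK_SIZE]  # which physical block
        offset = pos % BLOCK_SIZE                  # slot within that block
        history.append(kv_pool[block_id][offset])
    return history

# Toy pool: 4 physical blocks, each slot tagged with (block, offset)
kv_pool = {b: [(b, o) for o in range(BLOCK_SIZE)] for b in range(4)}
# This request owns physical blocks 2 then 0; its 18 tokens span both
history = gather_kv(kv_pool, [2, 0], 18)
print(history[0])   # (2, 0): token 0 lives in block 2
print(history[16])  # (0, 0): token 16 starts the second logical block
```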
`worker.py` (InferenceWorker):

- Loads the model and patches it with custom attention layers
- Constructs correct position IDs for RoPE (rotary embeddings)
- Executes forward passes for prefill (batch_size × seq_len) and decode (batch_size × 1)
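Position-ID construction for RoPE can be sketched as follows, assuming prefill covers the whole prompt and each decode step adds exactly one token per request (a simplification of what `worker.py` actually does):

```python
# Hypothetical position-ID helpers for RoPE.
def prefill_position_ids(batch_size: int, seq_len: int):
    """Prefill: every sequence in the batch gets positions 0..seq_len-1."""
    return [list(range(seq_len)) for _ in range(batch_size)]

def decode_position_ids(lengths):
    """Decode: each request's single new token sits at position = tokens so far."""
    return [[n] for n in lengths]

print(prefill_position_ids(2, 4))       # [[0, 1, 2, 3], [0, 1, 2, 3]]
print(decode_position_ids([8, 13, 5]))  # [[8], [13], [5]]
```

Getting these right matters because RoPE rotates each K/Q vector by an angle derived from its absolute position; a wrong position ID silently corrupts attention scores.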
```bash
# Clone and install dependencies
cd toy-llm-engine
uv sync  # or pip install -e .
```

Then run the scheduler demo:

```bash
python src/toy_llm_engine/scheduler.py
```

Submits 3 test requests and streams output in real time.
```bash
python src/benchmark.py
```

Compares three modes:

- HF Sequential: vanilla HuggingFace model, one request at a time
- Paged Sequential: custom paged attention, but sequential (`batch_size=1`)
- Continuous Batching: full batching with paged attention (`batch_size=8`)
Output includes:
- Wall time and throughput
- Per-request latency (from user's t=0)
- P50/P95 latencies
- TTFT (time to first token)
Edit constants in `scheduler.py`:
```python
pool = KVCachePool(
    num_layers=28,    # Model depth
    num_kv_heads=2,   # GQA heads
    head_dim=128,     # Attention head dimension
    num_blocks=200,   # Total memory blocks
    block_size=16,    # Tokens per block
    device=device,
)

scheduler = ContinuousBatchingScheduler(
    worker,
    manager,
    max_batch_size=4,    # Max concurrent requests
    verbose=True,        # Print scheduling logs
    stream_output=True,  # Stream per-token output
)
```

Prefill (new request):

```
User: "Write a haiku about AI"
        ↓
Tokenizer: [101, 5559, 2023, ...] (8 tokens)
        ↓
Allocate: 1 block (16-token capacity)
        ↓
Forward pass: process all 8 tokens → establish KV cache
        ↓
Generate: first output token → add to running batch
```
Decode (one step, batch of 3 shown):

```
Running Batch: [Req-1, Req-2, Req-3]
        ↓
Allocate: 1 token of memory for each request
        ↓
Gather: read KV history from paged blocks
        ↓
Attention: compute attention with padding mask
        ↓
Forward: [batch_size, 1] → [batch_size, vocab_size]
        ↓
Output: argmax → 3 new tokens (one per request)
        ↓
Check: remove finished requests, admit new ones
```
Physical Memory Pool (200 blocks × 16 tokens):

```
┌──────┬──────┬──────┬──────┬──────┬──────┬──────┐
│ Blk0 │ Blk1 │ Blk2 │ Blk3 │ Blk4 │ Blk5 │ ...  │
└──────┴──────┴──────┴──────┴──────┴──────┴──────┘

Request Block Tables:
  Req-1: [0, 4]    → 32 tokens capacity
  Req-2: [1, 3]    → 32 tokens capacity
  Req-3: [2, 5, 6] → 48 tokens capacity
```
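Note that capacity depends only on how many blocks a request holds, not on which physical blocks they are; Req-3's non-contiguous [2, 5, 6] is exactly what avoids fragmentation. In code:

```python
BLOCK_SIZE = 16  # tokens per block

# Block tables from the diagram above; physical block IDs need not be
# contiguous or ordered.
block_tables = {
    "Req-1": [0, 4],
    "Req-2": [1, 3],
    "Req-3": [2, 5, 6],
}

for req, table in block_tables.items():
    print(req, len(table) * BLOCK_SIZE, "tokens capacity")  # 32, 32, 48
```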
- Prefill: a lower-triangular causal mask prevents attention to future tokens
- Decode: the single new token attends to the full history (no mask needed)
- All attention is computed in float32 to avoid NaN/overflow issues on MPS
- SDPA handles the internal numerics correctly for padded sequences
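The two mask regimes above can be sketched as boolean matrices, assuming the convention that `True` means "may attend" (frameworks differ on this, so check before reusing):

```python
# Hypothetical mask builders; True = "may attend".
def prefill_mask(seq_len: int):
    """Lower-triangular causal mask: token i attends to positions <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def decode_mask(history_len: int):
    """A single new token attends to the entire history: no masking."""
    return [True] * history_len

print(prefill_mask(3))
# [[True, False, False], [True, True, False], [True, True, True]]
print(decode_mask(4))
# [True, True, True, True]
```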
- Groups same-length prompts together: `[[prompt1], [prompt2], ...]`
- Reduces prefill steps from `N` to `~N/batch_size`
- Position IDs expanded: `[0,1,2,...,seq_len]` → `[[0,1,2,...], [0,1,2,...]]`
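The grouping step is a simple bucket-by-length pass; prompts here are token-id lists, and only identical lengths share a prefill batch (a sketch, not the exact implementation):

```python
from collections import defaultdict

# Bucket prompts by token length so each bucket prefills in one pass.
def group_by_length(prompts):
    groups = defaultdict(list)
    for p in prompts:
        groups[len(p)].append(p)
    return list(groups.values())

prompts = [[1, 2, 3], [4, 5], [6, 7, 8], [9, 10]]
print(group_by_length(prompts))
# [[[1, 2, 3], [6, 7, 8]], [[4, 5], [9, 10]]]: 4 prompts -> 2 prefill passes
```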
Individual component tests:

```bash
# Test memory allocation
python src/toy_llm_engine/memory.py

# Test scheduler with 3 requests
python src/toy_llm_engine/scheduler.py

# Full benchmark suite
python src/benchmark.py
```

When continuous batching wins:

✅ Long-running requests: a high decode step count amortizes the prefill cost
✅ Heterogeneous workloads: long requests don't block short ones in the queue
✅ High-throughput scenarios: system-wide tokens/s increases significantly
When it doesn't:

❌ Very short requests: batching overhead outweighs the benefit for 5–10 token outputs
❌ MPS backend: lacks FlashAttention; padded SDPA is slower than native HF kernels
❌ Small batch sizes: you need 10+ concurrent requests to see clear wins
- CUDA with FlashAttention: ~2–3x speedup expected
- MPS (current): 1.29x speedup due to the naive SDPA implementation
- Padding overhead on MPS hurts performance more than on CUDA
This project teaches:
- How PagedAttention reduces memory fragmentation
- Why continuous batching improves GPU utilization
- The tradeoffs between latency and throughput
- How to instrument and benchmark inference systems
- The gap between research ideas and production-ready code
- vLLM Paper: original PagedAttention design
- FlashAttention: efficient attention kernels
- Orca Paper: the continuous batching concept
This is a pedagogical project. Focus areas for improvement:
- Prefix caching (share KV cache for common prompts)
- Speculative decoding
- Priority scheduling (shortest-job-first)
- Quantized KV cache (int8/int4)
- Multi-GPU support
MIT License: free to use for educational purposes.
Built to learn. Optimized for understanding.