
# Toy LLM Engine

A pedagogical implementation of continuous batching for LLM inference, featuring PagedAttention and concurrent request serving. Built to understand how production inference engines like vLLM work under the hood.

## 🎯 What This Project Does

This engine demonstrates the core techniques used to serve multiple LLM requests efficiently:

1. **PagedAttention**: the KV cache is stored in fixed-size memory blocks (like virtual-memory paging in an OS), avoiding fragmentation and enabling dynamic per-request memory allocation.
2. **Continuous Batching**: multiple requests are processed concurrently in decode steps, maximizing GPU utilization instead of handling one request at a time.
3. **Batched Prefill**: requests with identical prompt lengths are prefilled together in a single forward pass.
4. **Custom Attention Kernel**: replaces HuggingFace's native attention with a custom implementation that reads from and writes to the paged KV cache.

## 📊 Benchmark Results

On Apple Silicon (MPS) with Qwen2.5-1.5B-Instruct:

### Wall Time Performance

| Mode | 8 Requests (465 tokens) | Throughput |
|---|---|---|
| Paged Sequential | 31.5s | 14.8 tok/s |
| Continuous Batching | 24.4s | 19.1 tok/s |
| **Speedup** | **1.29x** | **29% faster** |

### Per-Request Latency (Long Requests Win)

| Request | Tokens | Sequential | Batched | Improvement |
|---|---|---|---|---|
| Req-8 | 13p+100g | 31.5s | 23.0s | 8.5s faster |
| Req-7 | 13p+120g | 24.8s | 24.4s | 0.4s faster |
| Req-6 | 8p+50g | 16.5s | 15.8s | 0.7s faster |

Short requests experience higher latency (they wait in the batch alongside long requests), but the system serves all requests 29% faster overall.

πŸ—οΈ Architecture

```
toy-llm-engine/
├── src/
│   ├── benchmark.py          # Performance comparison suite
│   ├── models/
│   │   └── qwen.py           # Custom PagedAttention implementation
│   └── toy_llm_engine/
│       ├── scheduler.py      # ContinuousBatchingScheduler
│       ├── worker.py         # InferenceWorker (model execution)
│       ├── memory.py         # PagedMemory + KVCachePool
│       └── utils.py          # Device detection
└── pyproject.toml
```

### Key Components

#### 1. MemoryManager (`memory.py`)

- Pre-allocates a large KV cache pool in GPU memory
- Allocates and frees fixed-size blocks (16 tokens per block) to requests
- Each request has a `block_table` mapping logical token positions to physical blocks

#### 2. ContinuousBatchingScheduler (`scheduler.py`)

- Maintains a waiting queue and a running batch
- **Prefill priority**: new requests are prefilled (possibly in batches) until the batch is full
- **Decode step**: generates 1 token for all active requests simultaneously
- Evicts finished requests and admits new ones

#### 3. CustomQwenPagedAttention (`models/qwen.py`)

- Replaces the native attention in Qwen2.5-1.5B
- Writes new K/V tensors directly to physical memory blocks
- Gathers the full KV history from paged memory for each request
- Uses batched `scaled_dot_product_attention` with padding masks for efficiency

#### 4. InferenceWorker (`worker.py`)

- Loads the model and patches it with custom attention layers
- Constructs correct position IDs for RoPE (rotary embeddings)
- Executes forward passes for prefill (batch_size × seq_len) and decode (batch_size × 1)

## 🚀 Quick Start

### Installation

```bash
# Clone and install dependencies
cd toy-llm-engine
uv sync  # or pip install -e .
```

### Run Standalone Engine

```bash
python src/toy_llm_engine/scheduler.py
```

Submits 3 test requests and streams output in real time.

### Run Benchmark

```bash
python src/benchmark.py
```

Compares three modes:

1. **HF Sequential**: vanilla HuggingFace model, one request at a time
2. **Paged Sequential**: custom paged attention, but sequential (batch_size=1)
3. **Continuous Batching**: full batching with paged attention (batch_size=8)

Output includes:

- Wall time and throughput
- Per-request latency (measured from the user's t=0)
- P50/P95 latencies
- TTFT (time to first token)

## 🔧 Configuration

Edit the constants in `scheduler.py`:

```python
pool = KVCachePool(
    num_layers=28,        # Model depth
    num_kv_heads=2,       # GQA heads
    head_dim=128,         # Attention head dimension
    num_blocks=200,       # Total memory blocks
    block_size=16,        # Tokens per block
    device=device
)

scheduler = ContinuousBatchingScheduler(
    worker,
    manager,
    max_batch_size=4,     # Max concurrent requests
    verbose=True,         # Print scheduling logs
    stream_output=True    # Stream per-token output
)
```

## 📚 How It Works

### Prefill Phase

```
User: "Write a haiku about AI"
  ↓
Tokenizer: [101, 5559, 2023, ...]  (8 tokens)
  ↓
Allocate: 1 block (16-token capacity)
  ↓
Forward pass: process all 8 tokens → establish KV cache
  ↓
Generate: first output token → add to running batch
```

### Decode Phase

```
Running Batch: [Req-1, Req-2, Req-3]
  ↓
Allocate: 1 token of memory for each request
  ↓
Gather: read KV history from paged blocks
  ↓
Attention: compute attention with padding mask
  ↓
Forward: [batch_size, 1] → [batch_size, vocab_size]
  ↓
Output: argmax → 3 new tokens (one per request)
  ↓
Check: remove finished requests, admit new ones
```

### Memory Layout

```
Physical Memory Pool (200 blocks × 16 tokens):
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│ Blk0│ Blk1│ Blk2│ Blk3│ Blk4│ Blk5│ ... │
└─────┴─────┴─────┴─────┴─────┴─────┴─────┘

Request Block Tables:
Req-1: [0, 4]        → 32 tokens capacity
Req-2: [1, 3]        → 32 tokens capacity
Req-3: [2, 5, 6]     → 48 tokens capacity
```
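Looking up where a token lives is ordinary paging arithmetic: the quotient of the logical position indexes the block table, the remainder is the offset inside the block. For example, with Req-3's table `[2, 5, 6]`, logical position 20 lands at offset 4 of physical block 5:

```python
def physical_slot(block_table: list[int], pos: int,
                  block_size: int = 16) -> tuple[int, int]:
    """Translate a logical token position to (physical block, offset)."""
    return block_table[pos // block_size], pos % block_size
```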

## 🛠️ Implementation Details

### Causal Masking

- **Prefill**: a lower-triangular mask prevents attention to future tokens
- **Decode**: the single new token attends to the full history (no mask needed)

### Numerical Stability

- All attention is computed in float32 to avoid NaNs and overflow on MPS
- SDPA handles the internal numerics of padded sequences correctly

### Batched Prefill

- Groups same-length prompts together: `[[prompt1], [prompt2], ...]`
- Reduces the number of prefill steps from N to ~N/batch_size
- Position IDs are expanded per request: `[0, 1, ..., seq_len-1]` → `[[0, 1, ...], [0, 1, ...]]`

## 🧪 Testing

Individual component tests:

```bash
# Test memory allocation
python src/toy_llm_engine/memory.py

# Test scheduler with 3 requests
python src/toy_llm_engine/scheduler.py

# Full benchmark suite
python src/benchmark.py
```

## 📈 Performance Notes

### Where Continuous Batching Wins

- ✅ **Long-running requests**: a high decode step count amortizes the prefill cost
- ✅ **Heterogeneous workloads**: long requests don't block short ones in the queue
- ✅ **High-throughput scenarios**: system-wide tokens/s increases significantly

### Where It Falls Short

- ❌ **Very short requests**: batching overhead outweighs the benefit for 5–10 token outputs
- ❌ **MPS backend**: lacks FlashAttention; padded SDPA is slower than native HF kernels
- ❌ **Small batch sizes**: roughly 10+ concurrent requests are needed to see clear wins

### On CUDA vs MPS

- CUDA with FlashAttention: ~2–3x speedup expected
- MPS (current): 1.29x speedup due to the naive SDPA implementation
- Padding overhead hurts performance more on MPS than on CUDA

## 🎓 Learning Outcomes

This project teaches:

1. How PagedAttention reduces memory fragmentation
2. Why continuous batching improves GPU utilization
3. The tradeoffs between latency and throughput
4. How to instrument and benchmark inference systems
5. The gap between research ideas and production-ready code


## 🤝 Contributing

This is a pedagogical project. Focus areas for improvement:

- Prefix caching (share the KV cache for common prompts)
- Speculative decoding
- Priority scheduling (shortest-job-first)
- Quantized KV cache (int8/int4)
- Multi-GPU support

## 📝 License

MIT License. Free to use for educational purposes.


*Built to learn. Optimized for understanding.* 🚀
