Companion code for a three-part blog series on inference infrastructure by Christopher Kosubinsky.
The series covers the full stack of serving large language models in production: from gateway architecture, to GPU internals and batching strategies, to building a cache-aware request router. This repo contains the working code, deployment manifests, benchmarking scripts, and infrastructure modules referenced throughout.
| # | Post | Companion Code |
|---|---|---|
| 1 | Designing an Inference Gateway | 01-architecture/ |
| 2 | Inside the GPU Server | 02-gpu-server/ |
| 3 | Routing Inference Requests | 03-routing/ |
This repo is designed as a hands-on companion to the blog series. Each numbered directory corresponds to a blog post. Start with the post that interests you, then explore the code.
You do not need to go in order. If you are most curious about how cache-aware routing works, jump straight to 03-routing/ and its DESIGN.md. If you want to understand GPU memory and KV cache behavior, start with 02-gpu-server/ and its testing notes. The architecture overview in 01-architecture/ ties everything together and is a good starting point if you have not read any of the posts yet.
This repo also includes a CLAUDE.md file that provides context for AI-assisted learning. Open it in Claude Code and ask questions as you work through the material.

    01-architecture/      Gateway architecture overview
    02-gpu-server/        vLLM deployment manifests, benchmarking scripts, testing notes
    03-routing/           FastAPI router with cache-aware routing, backpressure, and load tests
    04-infrastructure/    Terraform modules for GKE and EKS GPU clusters
    appendix/
      k8s-basics/         Local Kubernetes learning environment (kind)
      learning-journal/   Weekly journal entries documenting the learning process
      reading-list/       Paper breakdowns (vLLM, vLLM V1) and curated reading list

The main code artifact is the inference router in 03-routing/. It is a FastAPI application that proxies OpenAI-compatible requests to vLLM backends with:
- Cache-aware routing -- hashes system prompts to direct requests to backends with cached KV blocks
- Least-queue scoring -- ranks backends by `(queue_depth + in_flight + 1) * latency_ewma`
- Backpressure with priority shedding -- three priority tiers (high, normal, low) with configurable queue thresholds
- Health polling -- monitors backends via the vLLM `/metrics` endpoint
The experimental setup used EKS (T4) and GKE (L4) clusters running Qwen 1.5B on vLLM. Key findings from the benchmarks:
- Cache-aware routing cut time-to-first-token by 42% at low concurrency (65 ms vs 113 ms)
- At high concurrency (32 concurrent requests), round-robin achieved 2.3x the throughput of cache-aware routing
- The queue threshold parameter controls this tradeoff between cache affinity and even load distribution
- Priority shedding successfully preserved capacity for high-priority traffic under load
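
The last two findings can be made concrete with a small sketch. The README does not spell out how the thresholds are applied, so everything here is an assumption: each priority tier is given its own queue-depth cutoff, and cache affinity is kept only while the cache-holding backend's queue stays below the queue threshold. The names `SHED_THRESHOLDS`, `should_shed`, and `use_cached_backend`, and all numeric values, are hypothetical.

```python
# Hypothetical cutoffs: the queue depth at which each tier starts being
# rejected. Values are illustrative, not the repo's defaults.
SHED_THRESHOLDS = {"low": 8, "normal": 16, "high": 32}


def should_shed(priority: str, queue_depth: int) -> bool:
    """Return True if a request of this priority should be rejected."""
    return queue_depth >= SHED_THRESHOLDS[priority]


def use_cached_backend(cached_queue_depth: int, queue_threshold: int) -> bool:
    """Assumed fallback rule: keep cache affinity only while the cache-holding
    backend's queue is below the threshold; otherwise spread the load."""
    return cached_queue_depth < queue_threshold
```

Under these assumptions, a low `queue_threshold` gives up cache hits early and behaves more like round-robin, while a high one favors cache hits at the cost of queueing on a hot backend. Because low-priority traffic is rejected first, the remaining queue capacity stays available for high-priority requests.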
The appendix/ directory contains material from the learning process itself: a weekly journal, paper breakdowns covering the vLLM paper and vLLM V1, and a curated reading list. Start there if you want to go deeper on the research behind these systems.
To run the router locally:

    cd 03-routing
    uv sync
    uv run uvicorn router.main:app --reload

You will need uv installed for dependency management.
You will also need vLLM backends running and accessible. See 02-gpu-server/ for deployment manifests, or point the router at any OpenAI-compatible endpoint. To provision GPU clusters on GKE or EKS, see 04-infrastructure/.
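
Before starting the router, it can help to confirm that a backend's `/metrics` endpoint returns the values health polling would rely on. The sketch below parses Prometheus text output for two queue gauges. The metric names are ones vLLM exports at the time of writing, so verify them against your version. The sample payload, the `parse_gauges` helper, and the `example-model` label are made up for illustration, and the router's own polling code may read different metrics.

```python
# Gauges assumed to be exported by vLLM; check your backend's /metrics output.
WANTED = {"vllm:num_requests_running", "vllm:num_requests_waiting"}

# Made-up sample of Prometheus text exposition output.
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="example-model"} 2.0
vllm:num_requests_waiting{model_name="example-model"} 5.0
process_cpu_seconds_total 12.5
"""


def parse_gauges(text: str, wanted: set[str]) -> dict[str, float]:
    """Extract selected metrics, summing values across label sets.

    Assumes lines carry no trailing timestamp, so the value is the last
    space-separated field.
    """
    values: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name in wanted:
            values[name] = values.get(name, 0.0) + float(value)
    return values


gauges = parse_gauges(SAMPLE, WANTED)
```

To run this against a live backend, replace `SAMPLE` with the body of an HTTP GET to that backend's `/metrics` URL.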
Tech stack: Python 3.11, FastAPI, vLLM, Terraform, Kubernetes (EKS + GKE)
License: MIT