Cloud Inference

Companion code for a three-part blog series on inference infrastructure by Christopher Kosubinsky.

The series covers the full stack of serving large language models in production: from gateway architecture, to GPU internals and batching strategies, to building a cache-aware request router. This repo contains the working code, deployment manifests, benchmarking scripts, and infrastructure modules referenced throughout.

Blog Series

#	Post	Companion Code
1	Designing an Inference Gateway	`01-architecture/`
2	Inside the GPU Server	`02-gpu-server/`
3	Routing Inference Requests	`03-routing/`

How to use this repo

This repo is designed as a hands-on companion to the blog series. Each numbered directory corresponds to a blog post. Start with the post that interests you, then explore the code.

You do not need to go in order. If you are most curious about how cache-aware routing works, jump straight to 03-routing/ and its DESIGN.md. If you want to understand GPU memory and KV cache behavior, start with 02-gpu-server/ and its testing notes. The architecture overview in 01-architecture/ ties everything together and is a good starting point if you have not read any of the posts yet.

This repo also includes a CLAUDE.md file that provides context for AI-assisted learning. Open it in Claude Code and ask questions as you work through the material.

What's Here

01-architecture/           Gateway architecture overview
02-gpu-server/             vLLM deployment manifests, benchmarking scripts, testing notes
03-routing/                FastAPI router with cache-aware routing, backpressure, and load tests
04-infrastructure/         Terraform modules for GKE and EKS GPU clusters
appendix/
  k8s-basics/              Local Kubernetes learning environment (kind)
  learning-journal/        Weekly journal entries documenting the learning process
  reading-list/            Paper breakdowns (vLLM, vLLM V1) and curated reading list

The Router

The main code artifact is the inference router in 03-routing/. It is a FastAPI application that proxies OpenAI-compatible requests to vLLM backends with:

Cache-aware routing -- hashes system prompts to direct requests to backends with cached KV blocks
Least-queue scoring -- ranks backends by (queue_depth + in_flight + 1) * latency_ewma
Backpressure with priority shedding -- three priority tiers (high, normal, low) with configurable queue thresholds
Health polling -- monitors backends via the vLLM /metrics endpoint
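
The scoring formula and the prompt-hashing idea above can be sketched in a few lines of Python. This is a minimal illustration, not the code in 03-routing/: the Backend record, its field names, and the hash-modulo mapping from prompt to backend are all assumptions made for the example. The real router may track cache affinity differently.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Backend:
    # Hypothetical record; field names are illustrative, not the repo's.
    name: str
    queue_depth: int     # requests waiting on the backend
    in_flight: int       # requests sent by the router and not yet finished
    latency_ewma: float  # exponentially weighted moving average, in seconds


def score(b: Backend) -> float:
    """Least-queue score from the list above; lower means less loaded."""
    return (b.queue_depth + b.in_flight + 1) * b.latency_ewma


def least_loaded(backends: list[Backend]) -> Backend:
    """Pick the backend with the lowest score."""
    return min(backends, key=score)


def preferred_backend(system_prompt: str, backends: list[Backend]) -> Backend:
    """Map a system prompt to a stable backend by hashing its text.

    Assumption: a simple hash-modulo mapping, so identical system prompts
    keep landing on the same backend and can reuse its cached KV blocks.
    """
    digest = hashlib.sha256(system_prompt.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


backends = [
    Backend("a", queue_depth=0, in_flight=1, latency_ewma=0.10),
    Backend("b", queue_depth=3, in_flight=2, latency_ewma=0.05),
]
print(least_loaded(backends).name)
print(preferred_backend("You are a helpful assistant.", backends).name)
```

The + 1 in the score keeps an idle backend from scoring zero regardless of its latency, so a slow idle backend still ranks behind a fast idle one.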

The experimental setup used EKS (T4) and GKE (L4) clusters running Qwen 1.5B on vLLM. Key findings from the benchmarks:

Cache-aware routing cut time-to-first-token by 42% at low concurrency (65 ms vs 113 ms)
At high concurrency (32 concurrent requests), round-robin achieved 2.3x higher throughput than cache-aware routing
The queue threshold parameter controls this tradeoff between low-concurrency latency and high-concurrency throughput
Priority shedding successfully preserved capacity for high-priority traffic under load
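
One way to picture how a queue threshold and priority shedding could interact is the sketch below. It is written under stated assumptions: the threshold values, the Priority enum, and the rule "stay with the cached backend until its queue reaches the threshold" are hypothetical and chosen for illustration. See 03-routing/ and its DESIGN.md for the actual behavior.

```python
from enum import IntEnum


class Priority(IntEnum):
    HIGH = 0
    NORMAL = 1
    LOW = 2


# Hypothetical shedding limits: a tier is admitted only while the queue
# depth is below its limit, so low-priority traffic is dropped first.
SHED_LIMITS = {Priority.LOW: 8, Priority.NORMAL: 16, Priority.HIGH: 32}


def admit(priority: Priority, queue_depth: int) -> bool:
    """Return True if a request of this priority should be accepted."""
    return queue_depth < SHED_LIMITS[priority]


def routing_mode(cached_backend_queue: int, queue_threshold: int) -> str:
    """Assumed rule: prefer the cache-warm backend until its queue fills.

    A low threshold spreads load sooner (better throughput under load);
    a high threshold keeps cache affinity longer (better time-to-first-token
    when traffic is light).
    """
    return "cache" if cached_backend_queue < queue_threshold else "spread"


print(admit(Priority.LOW, 10), admit(Priority.HIGH, 10))
print(routing_mode(2, queue_threshold=4), routing_mode(6, queue_threshold=4))
```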

Appendix

The appendix/ directory contains material from the learning process itself: a weekly journal, paper breakdowns covering the vLLM paper and vLLM V1, and a curated reading list. Start there if you want to go deeper on the research behind these systems.

Getting Started

To run the router locally:

cd 03-routing
uv sync
uv run uvicorn router.main:app --reload

You will need uv installed for dependency management.

You will also need vLLM backends running and accessible. See 02-gpu-server/ for deployment manifests, or point the router at any OpenAI-compatible endpoint. To provision GPU clusters on GKE or EKS, see 04-infrastructure/.
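
Once a backend is up, a quick way to confirm it is reporting load is to read its /metrics output, which uses the Prometheus text format. The sketch below parses a sample of that text. The sample lines, the model_name label value, and the choice of the vllm:num_requests_waiting and vllm:num_requests_running gauges are assumptions for illustration, so check the metric names against the vLLM version you deploy.

```python
def parse_gauges(metrics_text: str, names: set[str]) -> dict[str, float]:
    """Sum the samples for each requested metric in Prometheus text format."""
    totals: dict[str, float] = {}
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # Sample with labels: name{label="value"} 1.0
            name = line[: line.index("{")]
            value_part = line[line.rindex("}") + 1:]
        else:
            # Sample without labels: name 1.0
            name, _, value_part = line.partition(" ")
        if name in names:
            totals[name] = totals.get(name, 0.0) + float(value_part.split()[0])
    return totals


# Illustrative sample only; real output contains many more metrics.
SAMPLE = """\
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="example-model"} 3.0
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="example-model"} 2.0
"""

wanted = {"vllm:num_requests_waiting", "vllm:num_requests_running"}
load = parse_gauges(SAMPLE, wanted)
print(load)
```

In a live setup the text would come from an HTTP GET against the backend's /metrics endpoint instead of a string literal.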

Tech Stack

Python 3.11, FastAPI, vLLM, Terraform, Kubernetes (EKS + GKE)

License

MIT
