Skip to content

christopherkosubinsky/cloud-inference

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Cloud Inference

Companion code for a three-part blog series on inference infrastructure by Christopher Kosubinsky.

The series covers the full stack of serving large language models in production: from gateway architecture, to GPU internals and batching strategies, to building a cache-aware request router. This repo contains the working code, deployment manifests, benchmarking scripts, and infrastructure modules referenced throughout.

Blog Series

# Post Companion Code
1 Designing an Inference Gateway 01-architecture/
2 Inside the GPU Server 02-gpu-server/
3 Routing Inference Requests 03-routing/

How to use this repo

This repo is designed as a hands-on companion to the blog series. Each numbered directory corresponds to a blog post. Start with the post that interests you, then explore the code.

You do not need to go in order. If you are most curious about how cache-aware routing works, jump straight to 03-routing/ and its DESIGN.md. If you want to understand GPU memory and KV cache behavior, start with 02-gpu-server/ and its testing notes. The architecture overview in 01-architecture/ ties everything together and is a good starting point if you have not read any of the posts yet.

This repo also includes a CLAUDE.md file that provides context for AI-assisted learning. Open it in Claude Code and ask questions as you work through the material.

What's Here

01-architecture/           Gateway architecture overview
02-gpu-server/             vLLM deployment manifests, benchmarking scripts, testing notes
03-routing/                FastAPI router with cache-aware routing, backpressure, and load tests
04-infrastructure/         Terraform modules for GKE and EKS GPU clusters
appendix/
  k8s-basics/              Local Kubernetes learning environment (kind)
  learning-journal/        Weekly journal entries documenting the learning process
  reading-list/            Paper breakdowns (vLLM, vLLM V1) and curated reading list

The Router

The main code artifact is the inference router in 03-routing/. It is a FastAPI application that proxies OpenAI-compatible requests to vLLM backends with:

  • Cache-aware routing -- hashes system prompts to direct requests to backends with cached KV blocks
  • Least-queue scoring -- ranks backends by (queue_depth + in_flight + 1) * latency_ewma
  • Backpressure with priority shedding -- three priority tiers (high, normal, low) with configurable queue thresholds
  • Health polling -- monitors backends via the vLLM /metrics endpoint

The experimental setup used EKS (T4) and GKE (L4) clusters running Qwen 1.5B on vLLM. Key findings from the benchmarks:

  • Cache-aware routing cut time-to-first-token by 42% at low concurrency (65ms vs 113ms)
  • At high concurrency (32 concurrent requests), round-robin achieved 2.3x better throughput
  • The queue threshold parameter controls this tradeoff
  • Priority shedding successfully preserved capacity for high-priority traffic under load

Appendix

The appendix/ directory contains material from the learning process itself: a weekly journal, paper breakdowns covering the vLLM paper and vLLM V1, and a curated reading list. Start there if you want to go deeper on the research behind these systems.

Getting Started

To run the router locally:

cd 03-routing
uv sync
uv run uvicorn router.main:app --reload

You will need uv installed for dependency management.

You will also need vLLM backends running and accessible. See 02-gpu-server/ for deployment manifests, or point the router at any OpenAI-compatible endpoint. To provision GPU clusters on GKE or EKS, see 04-infrastructure/.

Tech Stack

Python 3.11, FastAPI, vLLM, Terraform, Kubernetes (EKS + GKE)

License

MIT

About

No description or website provided.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors