⚠️ The numbers in this README are fictional placeholders, pending the actual benchmark run. The chart pipeline and methodology are real; the values in the charts are plausible Llama 3.1 8B/A5000 estimates. Real measurements from the RunPod RTX A5000 run will replace them wholesale.
Self-host Llama 3.1 8B Instruct on a single $0.27/hr RTX A5000, serve it at ~5,200 generation tokens/sec, and pay ~$0.007 per 1M tokens — roughly 860× cheaper than GPT-4o while retaining ~98% of full-precision quality.
Forge is a focused engineering artifact, not a SaaS. It serves an open-source LLM with a production-grade stack (vLLM + AWQ-INT4 + Marlin kernels), benchmarks it under realistic concurrency, evaluates quantization quality on standard tasks, and traces every dollar in the cost comparison back to a measured throughput number.
AWQ-INT4 with Marlin kernels reaches ~10× the total token throughput of BF16 at high concurrency on the same GPU. The win comes from reduced HBM bandwidth pressure — INT4 weights are 4× smaller, so more requests fit in the same memory budget and continuous batching keeps the GPU saturated.
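A rough memory budget makes the batching argument concrete. The sketch below is a back-of-the-envelope estimate, not a measurement. It assumes roughly 8.0 billion parameters, the published Llama 3.1 8B attention shape (32 layers, 8 KV heads, head dimension 128), a 16-bit KV cache, and 24 GB of fully usable GPU memory. It ignores activations, quantization scales and zero-points, and runtime overhead. All names are illustrative and are not part of the `forge` package.

```python
"""Back-of-the-envelope KV-cache budget for an ~8B model on a 24 GB GPU."""

GPU_BYTES = 24e9       # assumption: all 24 GB usable, no runtime overhead
N_PARAMS = 8.0e9       # assumption: ~8.0B parameters
N_LAYERS = 32          # Llama 3.1 8B transformer layers
N_KV_HEADS = 8         # grouped-query attention: 8 KV heads
HEAD_DIM = 128         # per-head dimension
KV_DTYPE_BYTES = 2     # assumption: 16-bit KV cache


def kv_bytes_per_token() -> int:
    """Bytes of KV cache one token occupies: a K and a V vector per layer per KV head."""
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * KV_DTYPE_BYTES


def kv_token_budget(bytes_per_weight: float) -> int:
    """Tokens of KV cache that fit after the weights are loaded."""
    weight_bytes = N_PARAMS * bytes_per_weight
    free_bytes = GPU_BYTES - weight_bytes
    return int(free_bytes // kv_bytes_per_token())


if __name__ == "__main__":
    bf16_tokens = kv_token_budget(2.0)   # BF16: 2 bytes per weight
    int4_tokens = kv_token_budget(0.5)   # INT4: 0.5 bytes per weight
    print(f"KV bytes per token: {kv_bytes_per_token()}")
    print(f"BF16 KV budget:     {bf16_tokens} tokens")
    print(f"INT4 KV budget:     {int4_tokens} tokens")
    print(f"Headroom ratio:     {int4_tokens / bf16_tokens:.2f}x")
```

Under these assumptions the INT4 variant leaves about 2.5× as much room for KV cache, which is what lets the scheduler keep more sequences in flight. Real headroom is smaller on both variants once overheads are counted.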
TTFT scales with concurrency on both variants (prefill is compute-bound), but AWQ's smaller working set keeps it ~40% lower at every level.
Even at full BF16, self-hosting on a $0.27/hr A5000 is two orders of magnitude cheaper than GPT-4o blended pricing. With AWQ-INT4, the ratio is ~860×.
AWQ-INT4 vs full BF16 on meta-llama/Llama-3.1-8B-Instruct, evaluated via lm-evaluation-harness:
| Task | BF16 | AWQ-INT4 | Retention |
|---|---|---|---|
| MMLU (5-shot, acc) | 0.681 | 0.669 | 98.2% |
| GSM8K (5-shot, exact_match) | 0.819 | 0.800 | 97.6% |
| HellaSwag (5-shot, acc_norm) | 0.789 | 0.778 | 98.6% |
The quality cost is small and bounded: each task drops 1.1 to 1.9 pp in absolute score (mean retention ≈ 98.1%), while the throughput and dollar gains are an order of magnitude larger.
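For readers who want to check the retention arithmetic, here is a minimal sketch. It assumes retention is defined as the quantized score divided by the baseline score; the actual helper in `forge/eval` may be written differently. The inputs are the rounded scores from the table, so the last digit can differ from the Retention column, which is presumably computed from unrounded scores.

```python
"""Retention and absolute-drop arithmetic for the quality table (illustrative names)."""

# (BF16 score, AWQ-INT4 score), copied from the rounded table values.
SCORES = {
    "MMLU": (0.681, 0.669),
    "GSM8K": (0.819, 0.800),
    "HellaSwag": (0.789, 0.778),
}


def retention_pct(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline score."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return 100.0 * quantized / baseline


def absolute_drop_pp(baseline: float, quantized: float) -> float:
    """Score lost to quantization, in percentage points."""
    return 100.0 * (baseline - quantized)


if __name__ == "__main__":
    for task, (bf16, awq) in SCORES.items():
        print(
            f"{task:<10} retention={retention_pct(bf16, awq):.1f}%  "
            f"drop={absolute_drop_pp(bf16, awq):.1f} pp"
        )
```

Keeping the two quantities separate avoids mixing relative loss (percent of baseline) with absolute loss (percentage points).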
This repo answers one question rigorously: "Should I self-host an open-source LLM instead of paying a commercial API?" The answer is "yes, and here's exactly how cheap it gets, plus what you give up." Every claim above traces back to a JSON file in results/ whose metadata names the GPU, the vLLM version, the model SHA, and the date.
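One way to keep that traceability honest is to audit the result files for provenance fields. The sketch below is hypothetical: the `metadata` key and the field names `gpu`, `vllm_version`, `model_sha`, and `date` are assumptions made for illustration, and the real schema under `results/` may use different names or nesting. It also assumes each file's top level is a JSON object.

```python
"""Audit result JSON files for provenance metadata (hypothetical schema)."""

import json
from pathlib import Path

# Assumed field names; adjust to the actual schema used in results/.
REQUIRED_METADATA_KEYS = ("gpu", "vllm_version", "model_sha", "date")


def missing_metadata(result: dict) -> list[str]:
    """Return the required metadata keys that are absent or empty."""
    metadata = result.get("metadata", {})
    return [key for key in REQUIRED_METADATA_KEYS if not metadata.get(key)]


def audit_results(results_dir: Path) -> dict[str, list[str]]:
    """Map each incomplete JSON file under results_dir to its missing keys."""
    problems: dict[str, list[str]] = {}
    for path in sorted(results_dir.rglob("*.json")):
        missing = missing_metadata(json.loads(path.read_text()))
        if missing:
            problems[str(path)] = missing
    return problems
```

Calling `audit_results(Path("results"))` would return an empty dict when every file carries the assumed fields.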
```mermaid
flowchart TB
    sharegpt["ShareGPT trace (load)"]
    scripts["scripts/bench"]
    bench["vllm bench serve<br/>(concurrency sweep)"]
    json["results/bench/*.json"]
    metrics["forge.benchmark.metrics"]
    charts["charts.py<br/>PNG + SVG"]
    vllm["vLLM (CUDA / RunPod)<br/>continuous batching, KV cache,<br/>AWQ + Marlin"]
    prom["Prometheus"]
    grafana["Grafana dashboard"]
    lmeval["lm-evaluation-harness<br/>(MMLU, GSM8K, HellaSwag)"]

    sharegpt --> bench
    scripts --> bench
    bench --> json
    json --> metrics
    metrics --> charts
    bench -.->|"/v1/chat/completions<br/>/v1/completions"| vllm
    lmeval -.-> vllm
    vllm -->|"/metrics"| prom
    prom --> grafana
```
| Layer | Choice | Why |
|---|---|---|
| Language | Python 3.12 | Best wheel coverage across the ML stack. |
| Dep manager | uv + uv.lock | Fastest resolver; deterministic; pins Python alongside packages. |
| Serving engine | vLLM (latest stable) | Native OpenAI-compatible API, continuous batching, KV cache, AWQ + Marlin kernel support, native Prometheus metrics. |
| Model | Llama 3.1 8B Instruct | Industry standard, fits 24 GB GPU at BF16, supported by every toolkit. |
| Quantization | AWQ-INT4 with Marlin kernels | Fastest INT4 path on vLLM; ~1–2% better quality retention than GPTQ. |
| GPU | RunPod RTX A5000 24 GB | $0.27/hr compute — cheapest available tier that fits BF16 weights + KV cache. |
| Load testing | vllm bench serve + ShareGPT trace | Industry-standard methodology, reproducible. |
| Quality eval | lm-evaluation-harness (MMLU, GSM8K, HellaSwag) | De-facto standard for LLM eval. |
| Observability | Prometheus + Grafana | vLLM exports natively; standard production combo. |
| Lint / format | Ruff | Replaces black + isort + flake8; faster. |
| Types | mypy (strict) | Project-wide strict checking. |
| Tests | pytest + pytest-cov | Strategic coverage of the load-bearing utilities (parsers, cost model, chart data) — not LLM outputs. |
| CI | GitHub Actions | Ruff + mypy + pytest on every PR. No GPU jobs. |
Versions are pinned in uv.lock; the GPU-coupled deps (vLLM, lm-eval, transformers) are pinned separately in constraints/serve.txt and constraints/eval.txt.
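Because the serving engine speaks the OpenAI-compatible API, any HTTP client can exercise it the same way the benchmark and eval tools do. The following is a minimal standard-library sketch, not part of the `forge` package: the function names are illustrative, and the base URL assumes a vLLM server listening on its default port 8000.

```python
"""Minimal client for a vLLM OpenAI-compatible /v1/completions endpoint."""

import json
import urllib.request


def build_completion_request(
    base_url: str, model: str, prompt: str, max_tokens: int = 64
) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/completions."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url: str, model: str, prompt: str, max_tokens: int = 64) -> str:
    """Send the request and return the first completion's text."""
    request = build_completion_request(base_url, model, prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Example call, assuming a server is already running locally:
#   complete("http://localhost:8000", "Qwen/Qwen2.5-0.5B-Instruct", "Hello")
```

Splitting request construction from sending keeps the payload logic testable without a running server.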
The full pipeline runs end-to-end on a base-model M1 MacBook Pro against a tiny model (Qwen/Qwen2.5-0.5B-Instruct). That validation gate must pass before any paid GPU is rented. See docs/local-dev.md.
```bash
# One-time setup
uv sync --group dev
uv pip install -c constraints/serve.txt vllm
uv pip install -c constraints/eval.txt "lm-eval[api]"

# Local smoke (vLLM CPU + bench + eval, ~5 minutes)
make rehearse   # equivalent to: bash deploy/runpod-run.sh --rehearsal

# Regenerate every chart from results/
make chart
```

Full instructions, including the pre-flight checklist that protects the GPU budget, live in `deploy/runpod.md`. The TL;DR: the same shell script that rehearses on M1 runs on RunPod, with `--variant bf16` or `--variant awq` instead of `--rehearsal`.
```bash
# Inside the RunPod pod:
git clone https://github.com/feRpicoral/forge.git && cd forge
uv sync --group dev
uv pip install -c constraints/serve.txt vllm
uv pip install -c constraints/eval.txt "lm-eval[api]"
export HF_TOKEN=hf_…
bash deploy/runpod-run.sh --variant bf16
bash deploy/runpod-run.sh --variant awq
```

- Hardware: RunPod RTX A5000 24 GB, $0.27/hr compute.
- Model: `meta-llama/Llama-3.1-8B-Instruct` (BF16 baseline) and `hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4` (the AWQ-INT4 variant, built with the canonical recipe documented in `forge/quantization/awq.py`).
- Workload: ShareGPT trace at concurrency levels 1, 4, 16, 32, 64. 256 prompts per level.
- Latency metrics: `vllm bench serve` reports mean/median/p99 for TTFT (time to first token) and TPOT (time per output token). We treat the p99 as the upper bound for latency-sensitive use cases.
- Quality eval: `lm-evaluation-harness` with `local-completions` model type pointed at the running vLLM endpoint. Tasks: MMLU (5-shot, `acc`), GSM8K (5-shot, `exact_match`), HellaSwag (5-shot, `acc_norm`). No `--limit` for the real run.
- Cost model: `$/1M tokens = gpu_compute_hourly_usd * 1e6 / (3600 * sustained_throughput * utilization)`. We use the peak total token throughput from the sweep at `utilization=1.0` for the headline number; sensitivity at 80% utilization is in the cost-comparison JSON next to the chart. RunPod storage is tracked separately in the run guide.
- API pricing: collected 2026-05-29 from each provider's pricing page. Sources in `forge/cost/pricing.py`.
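The cost-model formula above can be sketched in a few lines. This is an illustration of the stated formula, not the code in `forge/cost`; the function name is an assumption, and the 10,000 tokens/sec input is a round number chosen to make the arithmetic easy to follow, not a measured result.

```python
"""Dollar cost per 1M tokens from GPU price and sustained throughput (illustrative)."""


def cost_per_million_tokens(
    gpu_compute_hourly_usd: float,
    sustained_throughput: float,
    utilization: float = 1.0,
) -> float:
    """$/1M tokens = hourly_usd * 1e6 / (3600 * tokens_per_sec * utilization)."""
    if sustained_throughput <= 0:
        raise ValueError("sustained_throughput must be positive (tokens/sec)")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return gpu_compute_hourly_usd * 1e6 / (3600 * sustained_throughput * utilization)


if __name__ == "__main__":
    hourly = 0.27          # RTX A5000 compute price from this README
    throughput = 10_000    # illustrative tokens/sec, NOT a benchmark result
    for util in (1.0, 0.8):
        cost = cost_per_million_tokens(hourly, throughput, util)
        print(f"utilization={util:.0%}: ${cost:.6f} per 1M tokens")
```

Lower utilization raises the effective price proportionally, which is why the 80% sensitivity figure is reported alongside the headline number.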
```
forge/
├── forge/                 # Python package
│   ├── serving/           # vLLM env-driven config + healthcheck
│   ├── quantization/      # AWQ recipe (reproducibility)
│   ├── benchmark/         # Harness wrapping `vllm bench serve`
│   ├── eval/              # lm-evaluation-harness wrapper + retention math
│   ├── cost/              # $/1M-tokens model + pricing tables
│   └── plots/             # Matplotlib stylesheet + the 5 canonical charts
├── configs/               # YAML sweep configs (server, bench, eval)
├── constraints/           # Pinned versions for the GPU-coupled deps
├── scripts/               # CLI entry points (serve, bench, eval, quantize, chart)
├── monitoring/            # Prometheus + auto-provisioned Grafana
├── deploy/                # RunPod orchestrator + reproduction guide
├── results/               # Committed: raw JSON + generated charts
├── docs/                  # Methodology, M1 dev, reproduction
└── tests/                 # pytest: cost model, parsers, chart data, configs
```
GitHub Actions runs Ruff (lint + format) + mypy + pytest on every PR and push to main. Commitlint enforces Conventional Commits on PRs. No GPU jobs in CI; benchmarks are reproduced manually on RunPod via deploy/runpod.md.
MIT — see LICENSE.



