v0.9.1 — Hook-based decode: 11 → 40 tok/s

@wizzense wizzense released this 02 Apr 20:59
· 19 commits to main since this release

Hook-based vLLM Integration (v0.9.1)

New turboquant.vllm.hooks module: monkey-patches TritonAttentionImpl.forward() to add TQ encode/decode without registering a custom attention backend. This preserves torch.compile + CUDA graphs compatibility that the custom backend approach broke.
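The patching idea can be illustrated with a minimal pure-Python sketch. Everything in it is a hypothetical stand-in: `FakeAttentionImpl`, `fake_encode`, `fake_decode` and `apply_hooks` are not vLLM or turboquant code. It only shows how wrapping `forward()` in place adds an encode/decode step while the class itself, and whatever was registered around it, stays unchanged.

```python
import functools


class FakeAttentionImpl:
    """Hypothetical stand-in for an attention implementation class."""

    def forward(self, kv):
        return [x * 2 for x in kv]  # pretend attention computation


def fake_encode(kv):
    # Hypothetical "quantize" step: round to integers.
    return [round(x) for x in kv]


def fake_decode(kv):
    # Hypothetical "dequantize" step: back to floats.
    return [float(x) for x in kv]


def apply_hooks(cls):
    """Wrap cls.forward in place; calling it twice does not double-wrap."""
    if getattr(cls.forward, "_hooked", False):
        return
    original = cls.forward

    @functools.wraps(original)
    def hooked(self, kv):
        # Run the encode/decode round trip, then the original forward.
        return original(self, fake_decode(fake_encode(kv)))

    hooked._hooked = True
    cls.forward = hooked


apply_hooks(FakeAttentionImpl)
apply_hooks(FakeAttentionImpl)  # second call is a no-op
result = FakeAttentionImpl().forward([1.2, 2.7])
```

The idempotence flag matters in practice because a hook installer may be invoked more than once per process; without it each call would stack another encode/decode layer.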

Performance: 11 → 40 tok/s (3.6x)

Three optimizations that eliminated Python overhead in the decode path:

Fix | Graph breaks eliminated | Detail
Guard init functions | 108/token → 0 | _tq_init_layer, _ensure_quantizer, _ensure_norms were creating graph breaks on every call even though they're no-ops after warmup
Merge encode + decode | 72/token → 36 | Single @torch.compiler.disable call instead of two separate calls per layer
Branchless encode | 72 CPU-GPU syncs → 0 | Removed .any() boolean checks that triggered implicit CUDA synchronization
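The guard pattern from the first row can be sketched without torch. In this illustration the `opaque` decorator is a hypothetical stand-in for `@torch.compiler.disable` that simply counts each call as one graph break; `Layer`, `init_layer` and the two step functions are likewise invented for the example and are not the module's actual code.

```python
breaks = 0


def opaque(fn):
    """Hypothetical stand-in for @torch.compiler.disable: each call counts as one break."""
    def wrapper(*args, **kwargs):
        global breaks
        breaks += 1
        return fn(*args, **kwargs)
    return wrapper


class Layer:
    def __init__(self):
        self.ready = False


@opaque
def init_layer(layer):
    # A no-op after warmup, but calling it still crosses the opaque boundary.
    if not layer.ready:
        layer.ready = True


def step_unguarded(layer):
    init_layer(layer)  # crosses the boundary on every token


def step_guarded(layer):
    if not layer.ready:  # cheap check stays in the caller
        init_layer(layer)


def count_breaks(step, tokens=5):
    """Run `step` for a number of tokens on a fresh layer and return the break count."""
    global breaks
    breaks = 0
    layer = Layer()
    for _ in range(tokens):
        step(layer)
    return breaks


unguarded = count_breaks(step_unguarded)
guarded = count_breaks(step_guarded)
```

With the guard hoisted into the caller, the opaque function is entered only during warmup, which matches the "no-ops after warmup" description in the table.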

Net result: 144 → 36 graph breaks/token, 72 → 0 CPU-GPU syncs. L0 decode: 29.3ms → 1.9ms.

Usage

from turboquant.vllm.hooks import apply_tq_hooks

# Call after vLLM model is loaded:
apply_tq_hooks()

Install via pip: pip install aither-kvcache==0.9.1

Measurements (RTX 5090, Nemotron-8B-AWQ)

Metric | Before | After
Single-request tok/s | 11.0 | 40.1
256-token wall time | 24.3s | 6.4s
L0 merged decode | 29.3ms | 1.9ms
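As a quick consistency check on the table, the ratios can be recomputed from the reported numbers. This is plain arithmetic on the figures above, not an additional measurement; the variable names are chosen here for illustration.

```python
# Figures copied from the measurements table.
before_tok_s, after_tok_s = 11.0, 40.1
before_wall_s, after_wall_s = 24.3, 6.4
before_l0_ms, after_l0_ms = 29.3, 1.9
tokens = 256

# Throughput ratio behind the headline speedup.
speedup = after_tok_s / before_tok_s

# Throughput implied by the 256-token wall times.
implied_before = tokens / before_wall_s
implied_after = tokens / after_wall_s

# Ratio of the L0 merged decode timings.
l0_ratio = before_l0_ms / after_l0_ms

print(f"speedup: {speedup:.2f}x")
print(f"implied tok/s: {implied_before:.1f} -> {implied_after:.1f}")
print(f"L0 decode ratio: {l0_ratio:.1f}x")
```

The wall-time rows imply throughput close to the reported single-request tok/s, so the two measurements agree with each other.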

Full writeup: blog post