v0.9.1 — Hook-based decode: 11 → 40 tok/s
Hook-based vLLM Integration (v0.9.1)
New turboquant.vllm.hooks module: monkey-patches TritonAttentionImpl.forward() to add TurboQuant (TQ) encode/decode without registering a custom attention backend. This preserves the torch.compile and CUDA graph compatibility that the custom-backend approach broke.
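To illustrate the patching pattern described above, here is a minimal sketch using a toy class. It is not the turboquant implementation: `ToyAttentionImpl`, `apply_toy_hooks`, and the rounding "codec" are hypothetical stand-ins, and the real hook wraps vLLM's TritonAttentionImpl.forward() with actual quantization logic.

```python
import functools


class ToyAttentionImpl:
    """Hypothetical stand-in for an attention implementation class."""

    def forward(self, query, kv_cache):
        # Pretend attention: the query scaled by the sum of cached values.
        return query * sum(kv_cache)


def apply_toy_hooks(cls=ToyAttentionImpl):
    """Replace cls.forward with a wrapper; calling this twice is safe."""
    if getattr(cls.forward, "_toy_hooked", False):
        return  # already patched, so do not wrap a second time

    original_forward = cls.forward

    @functools.wraps(original_forward)
    def hooked_forward(self, query, kv_cache):
        # Placeholder for encode/decode: round-trip the cache through a
        # lossy one-decimal representation before the original forward runs.
        roundtripped = [round(v, 1) for v in kv_cache]
        return original_forward(self, query, roundtripped)

    hooked_forward._toy_hooked = True
    cls.forward = hooked_forward


impl = ToyAttentionImpl()            # instance created before patching
before = impl.forward(2.0, [0.26, 0.5])
apply_toy_hooks()
apply_toy_hooks()                    # second call is a no-op
after = impl.forward(2.0, [0.26, 0.5])
```

Because the method is replaced on the class, instances created before the patch pick up the new behaviour too, which is why a hook of this kind can be applied after the model has been loaded.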
Performance: 11 → 40 tok/s (3.6x)
Three optimizations that eliminated Python overhead in the decode path:
| Fix | Overhead eliminated | Detail |
|---|---|---|
| Guard init functions | 108/token → 0 | _tq_init_layer, _ensure_quantizer, _ensure_norms were creating graph breaks on every call even though they're no-ops after warmup |
| Merge encode + decode | 72/token → 36 | Single @torch.compiler.disable call instead of two separate calls per layer |
| Branchless encode | 72 CPU-GPU syncs → 0 | Removed .any() boolean checks that triggered implicit CUDA synchronization |
Net result: 144 → 36 graph breaks/token, 72 → 0 CPU-GPU syncs. L0 decode: 29.3ms → 1.9ms.
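The first two fixes in the table can be sketched in plain Python. This is a toy model under stated assumptions: the `opaque` decorator is a hypothetical stand-in for @torch.compiler.disable that counts calls (each call standing for one graph break), a single `init_layer` stands in for the three init functions, and the layer count is arbitrary, so the totals are not meant to reproduce the figures above.

```python
from collections import Counter

opaque_calls = Counter()  # calls from "compiled" code into opaque Python


def opaque(fn):
    """Stand-in for @torch.compiler.disable: one call counts as one break."""
    def wrapper(*args, **kwargs):
        opaque_calls[fn.__name__] += 1
        return fn(*args, **kwargs)
    return wrapper


class Layer:
    def __init__(self):
        self.ready = False
        self.cache = []


@opaque
def init_layer(layer):
    layer.ready = True  # idempotent: a no-op after the first call


@opaque
def encode(layer, x):
    layer.cache.append(round(x, 1))


@opaque
def decode(layer):
    return list(layer.cache)


@opaque
def encode_decode(layer, x):
    # Merged version: one opaque call does both steps.
    layer.cache.append(round(x, 1))
    return list(layer.cache)


def step_before(layer, x):
    init_layer(layer)  # opaque call on every token, even when it does nothing
    encode(layer, x)
    return decode(layer)


def step_after(layer, x):
    if not layer.ready:  # cheap flag check skips the opaque call after warmup
        init_layer(layer)
    return encode_decode(layer, x)


def breaks_per_token(step, n_layers=4):
    layers = [Layer() for _ in range(n_layers)]
    for layer in layers:  # warmup token
        step(layer, 0.5)
    opaque_calls.clear()
    for layer in layers:  # steady-state token
        step(layer, 0.25)
    return sum(opaque_calls.values())


before = breaks_per_token(step_before)  # 3 opaque calls per layer
after = breaks_per_token(step_after)    # 1 opaque call per layer
```

With four toy layers the steady-state count drops from 12 opaque calls per token to 4. The guard removes the init call and the merge halves the remaining two.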
Usage
from turboquant.vllm.hooks import apply_tq_hooks
# Call after vLLM model is loaded:
apply_tq_hooks()

Install this release via pip: pip install aither-kvcache==0.9.1
Measurements (RTX 5090, Nemotron-8B-AWQ)
| Metric | Before | After |
|---|---|---|
| Single-request tok/s | 11.0 | 40.1 |
| 256-token wall time | 24.3s | 6.4s |
| L0 merged decode | 29.3ms | 1.9ms |
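As a quick consistency check on the table, the speedups and the throughput implied by the 256-token wall times can be recomputed from the reported figures. The variable names are illustrative; the numbers are taken directly from the table.

```python
# Reported (before, after) values from the measurements table.
tok_s = (11.0, 40.1)        # single-request tokens per second
wall_s = (24.3, 6.4)        # seconds to generate 256 tokens
decode_ms = (29.3, 1.9)     # L0 merged decode latency in milliseconds

throughput_speedup = tok_s[1] / tok_s[0]         # about 3.6x
wall_speedup = wall_s[0] / wall_s[1]             # about 3.8x
decode_speedup = decode_ms[0] / decode_ms[1]     # about 15.4x

# Throughput implied by the wall-clock times for 256 tokens.
implied_before = 256 / wall_s[0]                 # about 10.5 tok/s
implied_after = 256 / wall_s[1]                  # 40.0 tok/s

print(f"throughput: {throughput_speedup:.1f}x, wall time: {wall_speedup:.1f}x")
print(f"implied tok/s: {implied_before:.1f} -> {implied_after:.1f}")
```

The implied throughput (roughly 10.5 and 40.0 tok/s) is close to the reported 11.0 and 40.1 tok/s, and the L0 decode step itself improved by a much larger factor than end-to-end throughput.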
Full writeup: blog post