Release v0.9.1 — Hook-based decode: 11 → 40 tok/s · Aitherium/aitherkvcache

Hook-based vLLM Integration (v0.9.1)

New `turboquant.vllm.hooks` module: it monkey-patches `TritonAttentionImpl.forward()` to add TQ encode/decode without registering a custom attention backend. This keeps `torch.compile` and CUDA graphs working, which the custom-backend approach broke.
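The general pattern behind a hook like this can be sketched in a few lines. Everything named here (`FakeAttentionImpl`, `tq_roundtrip`, `apply_hooks_sketch`) is a hypothetical stand-in, not the vLLM class or the `turboquant` API; the sketch only illustrates wrapping an existing `forward()` in place instead of registering a new backend.

```python
import functools


class FakeAttentionImpl:
    """Hypothetical stand-in for an attention implementation class."""

    def forward(self, query, key, value):
        # Toy "attention": elementwise product, just so there is something to wrap.
        return [q * k * v for q, k, v in zip(query, key, value)]


def tq_roundtrip(values):
    """Hypothetical placeholder for quantize + dequantize (rounds to 1 decimal)."""
    return [round(v, 1) for v in values]


def apply_hooks_sketch(cls):
    """Wrap cls.forward in place. Safe to call more than once."""
    if getattr(cls.forward, "_tq_patched", False):
        return  # already patched, do not double-wrap
    original = cls.forward

    @functools.wraps(original)
    def patched(self, query, key, value):
        # Run the extra encode/decode step, then defer to the original method.
        key = tq_roundtrip(key)
        value = tq_roundtrip(value)
        return original(self, query, key, value)

    patched._tq_patched = True
    cls.forward = patched


apply_hooks_sketch(FakeAttentionImpl)
apply_hooks_sketch(FakeAttentionImpl)  # second call is a no-op
result = FakeAttentionImpl().forward([1.0], [0.26], [2.0])
```

Because the class itself is left in place and only its method is replaced, callers that already hold a reference to the class pick up the new behaviour without any registration step.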

Performance: 11 → 40 tok/s (3.6x)

Three optimizations that eliminated Python overhead in the decode path:

| Fix | Graph breaks eliminated | Detail |
| --- | --- | --- |
| Guard init functions | 108/token → 0 | `_tq_init_layer`, `_ensure_quantizer`, `_ensure_norms` were creating graph breaks on every call even though they're no-ops after warmup |
| Merge encode + decode | 72/token → 36 | Single `@torch.compiler.disable` call instead of two separate calls per layer |
| Branchless encode | 72 CPU-GPU syncs → 0 | Removed `.any()` boolean checks that triggered implicit CUDA synchronization |

Net result: 144 → 36 graph breaks/token, 72 → 0 CPU-GPU syncs. L0 decode: 29.3ms → 1.9ms.
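The first two rows of the table can be illustrated with a toy counting model. This is a simulation under stated assumptions, not the library's code: the layer count of 36 is inferred from 108 = 3 × 36 and 72 = 2 × 36, and `opaque_call` is a hypothetical stand-in for entering a compiler-disabled function, which is counted here as one graph break.

```python
NUM_LAYERS = 36   # assumption: inferred from 108 = 3 * 36 and 72 = 2 * 36
INIT_HELPERS = 3  # one per init function named in the table


def opaque_call(counts, kind):
    """Stand-in for entering a compiler-disabled function (one graph break)."""
    counts[kind] += 1


def step_unguarded(layers, counts):
    """Per-token work before the fixes: init helpers always entered, two codec calls."""
    for _layer in layers:
        for _ in range(INIT_HELPERS):
            opaque_call(counts, "init")   # no-op after warmup, but still entered
        opaque_call(counts, "codec")      # encode
        opaque_call(counts, "codec")      # decode


def step_guarded_merged(layers, counts):
    """Per-token work after the fixes: flag check at the call site, one merged call."""
    for layer in layers:
        if not layer["ready"]:            # guard skips the helpers after warmup
            for _ in range(INIT_HELPERS):
                opaque_call(counts, "init")
            layer["ready"] = True
        opaque_call(counts, "codec")      # merged encode + decode


def breaks_for_one_token(step):
    """Run one uncounted warmup token, then count opaque calls for the next token."""
    layers = [{"ready": False} for _ in range(NUM_LAYERS)]
    step(layers, {"init": 0, "codec": 0})
    counts = {"init": 0, "codec": 0}
    step(layers, counts)
    return counts


before = breaks_for_one_token(step_unguarded)
after = breaks_for_one_token(step_guarded_merged)
print(before)  # {'init': 108, 'codec': 72}
print(after)   # {'init': 0, 'codec': 36}
```

The model reproduces the per-row figures in the table: the guard removes the init calls entirely after warmup, and merging halves the remaining per-layer calls.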

Usage

```python
from turboquant.vllm.hooks import apply_tq_hooks

# Call after the vLLM model is loaded:
apply_tq_hooks()
```

Install via pip: `pip install aither-kvcache==0.9.1`

Measurements (RTX 5090, Nemotron-8B-AWQ)

| Metric | Before | After |
| --- | --- | --- |
| Single-request tok/s | 11.0 | 40.1 |
| 256-token wall time | 24.3 s | 6.4 s |
| L0 merged decode | 29.3 ms | 1.9 ms |

Full writeup: blog post
