v0.6.0 — Fused KV update kernel, CUDA graph compat

@wizzense released this 30 Mar 19:22 · 26 commits to main since this release

What's New

Fused Triton KV Update Kernel

A single kernel launch replaces the 3-step encode path. The launch grid is (num_tokens, num_kv_heads), i.e. one program per (token, head) pair. Each program does the rotation matmul, the searchsorted quantization, the nibble packing, and a direct scatter into the cache tensors, all in-kernel. This eliminates the intermediate tensor allocations and the extra kernel launches.
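To make the per-program work concrete, here is a minimal pure-Python reference for what one (token, head) program might compute. It is a sketch under assumptions, not the Triton kernel: the function names, the 4-bit codebook boundaries, the rotation matrix, and the nibble order (even index in the low nibble) are all hypothetical, and the scatter into the cache is left out.

```python
from bisect import bisect_left


def encode_head(x, rotation, boundaries):
    """Reference for one (token, head) program: rotate, quantize, pack.

    x          -- head vector (even length)
    rotation   -- orthogonal matrix as a list of rows (assumed, not the real one)
    boundaries -- 15 sorted bucket edges, giving 16 codes (assumed codebook)
    """
    d = len(x)
    # 1) Rotation matmul: y = R @ x
    y = [sum(rotation[i][j] * x[j] for j in range(d)) for i in range(d)]
    # 2) Searchsorted quantize: bucket index of each value.
    #    bisect_left matches torch.searchsorted(..., right=False).
    codes = [bisect_left(boundaries, v) for v in y]
    # 3) Nibble pack: two 4-bit codes per byte, even index in the low nibble.
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, d, 2))
    return codes, packed


def unpack_nibbles(packed):
    """Inverse of step 3: recover the 4-bit codes from packed bytes."""
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)  # low nibble holds the even-index code
        codes.append(byte >> 4)    # high nibble holds the odd-index code
    return codes


# Toy inputs: a 4-element head, an order-reversing permutation as the
# orthogonal matrix, and evenly spaced bucket edges at multiples of 1/8.
rotation = [[0, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 0]]
boundaries = [(k - 7) / 8 for k in range(15)]
codes, packed = encode_head([-1.0, -0.3, 0.3, 1.0], rotation, boundaries)
print(codes, packed.hex())  # [15, 10, 5, 0] af05
```

In the fused kernel these three steps would run inside a single program and write straight to the cache, which is where the saved allocations and launches come from.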

CUDA Graph Compatibility

A torch.cuda.is_current_stream_capturing() guard lets TQ run under torch.compile and CUDA graphs. Set AITHER_TQ_EAGER=0 to enable this path: at 5 concurrent requests, throughput rises from 26.1 tok/s to 64.0 tok/s (+145%).
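A guard of this kind can be sketched as follows. Only torch.cuda.is_current_stream_capturing() and torch.cuda.is_available() are real PyTorch calls here; the helper names and the two path labels are hypothetical, and the release does not say what the library does on each branch. The sketch assumes the guard is used to avoid work that is illegal during capture, such as host synchronization.

```python
try:
    import torch
except ImportError:  # keep the sketch importable without PyTorch
    torch = None


def stream_is_capturing() -> bool:
    """True only while the current CUDA stream is recording a CUDA graph."""
    if torch is None or not torch.cuda.is_available():
        return False
    return torch.cuda.is_current_stream_capturing()


def select_update_path() -> str:
    """Hypothetical dispatcher: pick a path that is legal in the current mode."""
    if stream_is_capturing():
        # During capture, host syncs such as .item() or .cpu() are not
        # allowed, so only launch kernels on pre-allocated buffers.
        return "capture_safe"
    # Outside capture, ordinary eager execution is fine.
    return "eager"


print(select_update_path())
```

Outside a capture region, including on a machine without CUDA, the dispatcher takes the eager branch.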

Performance

Metric                            v0.5.0        v0.6.0
5 concurrent (eager)              26.1 tok/s    26.1 tok/s
5 concurrent (compile+graph)      --            64.0 tok/s
10 concurrent (compile+graph)     --            138.2 tok/s
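The +145% figure quoted for AITHER_TQ_EAGER=0 can be reproduced from the two 5-concurrent rows. The short check below uses only the numbers in the table; the helper name is illustrative.

```python
def percent_gain(before: float, after: float) -> float:
    """Relative throughput gain of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


eager_5 = 26.1    # tok/s, 5 concurrent, eager
graph_5 = 64.0    # tok/s, 5 concurrent, compile+graph
graph_10 = 138.2  # tok/s, 10 concurrent, compile+graph

print(round(percent_gain(eager_5, graph_5)))  # 145
print(round(graph_10 / graph_5, 2))           # 2.16
```

The second line shows that doubling concurrency from 5 to 10 on the compile+graph path scales aggregate throughput by about 2.16x in these measurements.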

Env Vars

AITHER_TQ_BITS=4              # 2, 3, or 4
AITHER_TQ_FUSED=1             # fused Triton decode
AITHER_TQ_EAGER=0             # torch.compile + CUDA graphs (NEW — recommended)
AITHER_TQ_FORCE_TRITON=1      # required on Blackwell SM_100+
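If the server is launched from a Python script, the variables can be assembled and validated before the process starts. This is a launcher-side sketch only: build_tq_env is a hypothetical helper, not part of aither-kvcache, and it assumes that "1" and "0" are the on and off values for every flag and that the library reads the variables at startup.

```python
import os


def build_tq_env(bits=4, fused=True, eager=False, force_triton=False):
    """Return the AITHER_TQ_* variables as a dict of strings (hypothetical helper)."""
    if bits not in (2, 3, 4):
        raise ValueError(f"AITHER_TQ_BITS must be 2, 3, or 4, got {bits}")
    return {
        "AITHER_TQ_BITS": str(bits),
        "AITHER_TQ_FUSED": "1" if fused else "0",
        # 0 selects torch.compile + CUDA graphs, per the list above.
        "AITHER_TQ_EAGER": "1" if eager else "0",
        # Listed above as required on Blackwell SM_100+.
        "AITHER_TQ_FORCE_TRITON": "1" if force_triton else "0",
    }


# Merge into a copy of the current environment, e.g. to pass as
# subprocess.run(..., env=env) when starting the serving process.
env = {**os.environ, **build_tq_env(force_triton=True)}
print({k: v for k, v in env.items() if k.startswith("AITHER_TQ_")})
```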

Install

pip install aither-kvcache==0.6.0