v0.6.0 — Fused KV update kernel, CUDA graph compat
What's New
Fused Triton KV Update Kernel
A single kernel launch replaces the 3-step encode path. The launch grid is (num_tokens, num_kv_heads), so one program handles each (token, head) pair. Each program does the rotation matmul, searchsorted quantization, nibble packing, and a direct scatter into the cache tensors in-kernel. This eliminates the intermediate tensor allocations and the extra kernel launches.
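To make the per-program work concrete, here is a minimal NumPy reference of the same four steps written as a loop over the grid. It is a sketch under assumptions, not the Triton kernel: the function names, the cache layout `(num_slots, num_kv_heads, head_dim // 2)`, the `slots` argument, the `rotation @ x` orientation, and the low-nibble-first packing order are all illustrative choices, and it covers the 4-bit case only.

```python
import numpy as np

def pack_nibbles(codes):
    """Pack pairs of 4-bit codes into bytes (even index in the low nibble; assumed order)."""
    codes = codes.astype(np.uint8)
    return codes[0::2] | (codes[1::2] << 4)

def unpack_nibbles(packed):
    """Inverse of pack_nibbles, used here only to check the round trip."""
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed & 0x0F
    out[1::2] = packed >> 4
    return out

def kv_update_reference(x, rotation, boundaries, cache, slots):
    """Loop form of the (num_tokens, num_kv_heads) grid; hypothetical helper.

    x:          (num_tokens, num_kv_heads, head_dim) float
    rotation:   (head_dim, head_dim) rotation matrix
    boundaries: (15,) sorted bin edges, giving 16 codes for 4 bits
    cache:      (num_slots, num_kv_heads, head_dim // 2) uint8, updated in place
    slots:      (num_tokens,) destination slot for each token
    """
    num_tokens, num_kv_heads, _ = x.shape
    for t in range(num_tokens):          # grid axis 0: token
        for h in range(num_kv_heads):    # grid axis 1: KV head
            rotated = rotation @ x[t, h]                  # rotation matmul
            codes = np.searchsorted(boundaries, rotated)  # quantize to 0..15
            cache[slots[t], h] = pack_nibbles(codes)      # pack and scatter
    return cache
```

In the fused kernel the loop body runs as one program per grid cell, so `rotated` and `codes` never materialize as full tensors in global memory.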
CUDA Graph Compatibility
A torch.cuda.is_current_stream_capturing() guard lets TQ run under torch.compile and CUDA graphs. Set AITHER_TQ_EAGER=0 for +145% throughput (from 26.1 to 64.0 tok/s at 5 concurrent requests).
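These notes do not show how the guard is wired in. One plausible pattern, sketched below, is to skip host-side work that is not allowed during graph capture (for example a synchronizing call) whenever the current stream is capturing. `torch.cuda.is_current_stream_capturing()` is the real PyTorch call; `cuda_graph_capture_active`, `run_unless_capturing`, and the `capturing` override are hypothetical names introduced for illustration.

```python
def cuda_graph_capture_active():
    """Return True while the current CUDA stream is being captured into a graph."""
    try:
        import torch
    except ImportError:
        return False  # no PyTorch installed, so nothing can be capturing
    if not torch.cuda.is_available():
        return False  # CPU-only run
    return torch.cuda.is_current_stream_capturing()

def run_unless_capturing(host_side_fn, capturing=None):
    """Run host_side_fn only outside graph capture (hypothetical helper).

    Returns True if the function ran and False if it was skipped.
    `capturing` can be passed explicitly, which keeps the helper testable
    without a GPU.
    """
    if capturing is None:
        capturing = cuda_graph_capture_active()
    if capturing:
        return False  # capture-unsafe work is skipped during capture
    host_side_fn()
    return True
```

Keeping the check in one small helper means the eager path and the captured path share the same call sites, and only the capture-unsafe step changes behavior.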
Performance
| Metric | v0.5.0 | v0.6.0 |
|---|---|---|
| 5 concurrent (eager) | 26.1 tok/s | 26.1 tok/s |
| 5 concurrent (compile+graph) | -- | 64.0 tok/s |
| 10 concurrent (compile+graph) | -- | 138.2 tok/s |
Env Vars
AITHER_TQ_BITS=4 # 2, 3, or 4
AITHER_TQ_FUSED=1 # fused Triton decode
AITHER_TQ_EAGER=0 # torch.compile + CUDA graphs (NEW — recommended)
AITHER_TQ_FORCE_TRITON=1 # required on Blackwell SM_100+
Install
pip install aither-kvcache==0.6.0
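For reference, the variables listed under Env Vars could be validated before launch roughly as follows. This is an illustrative sketch, not the package's own parser: `load_tq_config` is a hypothetical helper, and the fallback values used when a variable is unset are assumptions, since the notes do not state the defaults.

```python
import os

def load_tq_config(env=None):
    """Parse the AITHER_TQ_* variables into a plain dict (illustrative only)."""
    env = os.environ if env is None else env

    bits = int(env.get("AITHER_TQ_BITS", "4"))  # assumed fallback: 4
    if bits not in (2, 3, 4):
        raise ValueError(f"AITHER_TQ_BITS must be 2, 3, or 4, got {bits}")

    def flag(name, fallback):
        # Treat "1" as on and anything else as off (assumed convention).
        return env.get(name, fallback) == "1"

    return {
        "bits": bits,
        "fused": flag("AITHER_TQ_FUSED", "1"),                # assumed fallback
        "eager": flag("AITHER_TQ_EAGER", "1"),                # assumed fallback
        "force_triton": flag("AITHER_TQ_FORCE_TRITON", "0"),  # assumed fallback
    }
```

Passing a dict instead of reading `os.environ` directly makes it easy to check a configuration without touching the process environment.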