Release v0.6.0 — Fused KV update kernel, CUDA graph compat · Aitherium/aitherkvcache

What's New

Fused Triton KV Update Kernel

A single kernel launch replaces the 3-step encode path. The launch grid is (num_tokens, num_kv_heads), so one program handles each (token, head) pair. Each program performs the rotation matmul, the searchsorted quantization, and the nibble packing in-kernel, then scatters the result directly into the cache tensors. This eliminates the intermediate tensor allocations and the extra kernel launches.
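The per-program work described above can be illustrated with a plain-Python reference. This is only a sketch of the rotate, quantize, pack, scatter sequence for one (token, head) pair, not the Triton kernel itself. The function names, the dict used as a cache, the low-nibble-first packing order, and the evenly spaced boundaries are all assumptions made for illustration; any per-vector scale or norm the real cache may store is omitted.

```python
from bisect import bisect_left

def rotate(vec, rot):
    # y = rot @ vec, with rot given as a list of rows (hypothetical layout).
    return [sum(r * v for r, v in zip(row, vec)) for row in rot]

def quantize(values, boundaries):
    # searchsorted with "left" semantics: index of the first boundary >= value.
    # 15 sorted boundaries give codes in 0..15, i.e. 4 bits per value.
    return [bisect_left(boundaries, v) for v in values]

def pack_nibbles(codes):
    # Two 4-bit codes per byte; low nibble first is an assumed convention.
    if len(codes) % 2:
        raise ValueError("need an even number of 4-bit codes")
    return bytes(
        (codes[i] & 0xF) | ((codes[i + 1] & 0xF) << 4)
        for i in range(0, len(codes), 2)
    )

def kv_update(cache, slot, head, vec, rot, boundaries):
    # One "program": rotate, quantize, pack, then scatter into the cache slot.
    cache[(slot, head)] = pack_nibbles(quantize(rotate(vec, rot), boundaries))

# Tiny example: identity rotation, 15 evenly spaced boundaries in [-0.875, 0.875].
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
boundaries = [(i - 7) / 8 for i in range(15)]
cache = {}
kv_update(cache, slot=0, head=0, vec=[-1.0, 0.0, 0.3, 1.0],
          rot=identity, boundaries=boundaries)
```

In the fused kernel these four steps happen inside one launch, so the rotated vector and the unpacked codes never exist as separate tensors in global memory.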

CUDA Graph Compatibility

A torch.cuda.is_current_stream_capturing() guard lets torch.compile and CUDA graphs run alongside TQ. Set AITHER_TQ_EAGER=0 to enable them: at 5 concurrent requests, throughput rises from 26.1 to 64.0 tok/s (+145%).
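A capture guard of this kind might look like the sketch below. The release notes do not say what the package skips while a graph is being captured, so `run_update`, `launch`, and `host_check` are hypothetical names; the only real API used is `torch.cuda.is_current_stream_capturing()`. The idea shown is that work needing a CPU-GPU synchronization is not allowed during capture, so it is bypassed while capture is active.

```python
def stream_is_capturing() -> bool:
    """Return True only while a CUDA graph capture is active on the current stream."""
    try:
        import torch
    except ImportError:
        # No PyTorch installed: nothing can be capturing.
        return False
    if not torch.cuda.is_available():
        return False
    return torch.cuda.is_current_stream_capturing()

def run_update(launch, host_check):
    """Enqueue GPU work, and run the host-side check only outside graph capture.

    launch:     callable that only enqueues GPU work (safe to capture).
    host_check: callable that would synchronize with the host (unsafe to capture).
    """
    launch()
    if not stream_is_capturing():
        host_check()

# Outside of capture both callables run, in order.
calls = []
run_update(lambda: calls.append("launch"), lambda: calls.append("check"))
```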

Performance

Metric	v0.5.0	v0.6.0
5 concurrent (eager)	26.1 tok/s	26.1 tok/s
5 concurrent (compile+graph)	--	64.0 tok/s
10 concurrent (compile+graph)	--	138.2 tok/s
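The +145% figure follows directly from the 5-concurrent rows of the table. A short check of the arithmetic, using only the numbers above (the helper name `pct_gain` is illustrative):

```python
def pct_gain(before: float, after: float) -> float:
    # Relative improvement in percent.
    return (after - before) / before * 100.0

eager_5 = 26.1    # tok/s, 5 concurrent, eager
graph_5 = 64.0    # tok/s, 5 concurrent, compile+graph
graph_10 = 138.2  # tok/s, 10 concurrent, compile+graph

gain_5 = pct_gain(eager_5, graph_5)  # about 145.2
scaling_5_to_10 = graph_10 / graph_5  # about 2.16x when concurrency doubles
```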

Env Vars

AITHER_TQ_BITS=4              # 2, 3, or 4
AITHER_TQ_FUSED=1             # fused Triton decode
AITHER_TQ_EAGER=0             # torch.compile + CUDA graphs (NEW — recommended)
AITHER_TQ_FORCE_TRITON=1      # required on Blackwell SM_100+
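For scripts that want to validate these variables before starting a server, a minimal reader could look like the following. This is a sketch, not the package's own parsing: `read_tq_settings` is a hypothetical helper, and the fallback values used when a variable is unset are assumptions rather than documented defaults.

```python
import os

def read_tq_settings(env=None):
    """Parse the AITHER_TQ_* variables from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env

    bits = int(env.get("AITHER_TQ_BITS", "4"))  # fallback is an assumption
    if bits not in (2, 3, 4):
        raise ValueError(f"AITHER_TQ_BITS must be 2, 3, or 4, got {bits}")

    def flag(name, fallback):
        # Treat "1" as on and anything else as off.
        return env.get(name, fallback) == "1"

    return {
        "bits": bits,
        "fused": flag("AITHER_TQ_FUSED", "1"),
        # EAGER=0 means torch.compile + CUDA graphs are used.
        "eager": flag("AITHER_TQ_EAGER", "0"),
        "force_triton": flag("AITHER_TQ_FORCE_TRITON", "0"),
    }

settings = read_tq_settings({"AITHER_TQ_BITS": "3", "AITHER_TQ_EAGER": "0"})
```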

Install

pip install aither-kvcache==0.6.0
