v0.8.0 — Graphable fused decode, CUDA graph capture

wizzense released this 31 Mar 15:31


6df6765

What's New

CUDA graph capture: The fused TQ attention path is now graphable, so both torch.compile and CUDA graphs can capture the decode kernel
87.9 tok/s aggregate throughput at 5 concurrent sequences (up from 51 tok/s pre-graph)
Algorithm files synced from AitherOS production
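
To put the throughput figures in perspective, here is a small arithmetic sketch. It assumes the 51 tok/s baseline was measured at the same concurrency of 5 sequences and that throughput is shared evenly across sequences; the release notes confirm neither. The variable names are illustrative only.

```python
# Figures quoted in the release notes.
baseline_tok_s = 51.0   # aggregate throughput before graph capture
graphed_tok_s = 87.9    # aggregate throughput with graph capture
concurrent_seqs = 5     # concurrent sequences for the 87.9 tok/s figure

# Relative speedup of the graphed path over the pre-graph path.
speedup = graphed_tok_s / baseline_tok_s
gain_pct = (speedup - 1.0) * 100.0

# Assumption: aggregate throughput is split evenly across sequences.
per_seq = graphed_tok_s / concurrent_seqs

print(f"speedup: {speedup:.2f}x (+{gain_pct:.0f}%)")
print(f"per-sequence: {per_seq:.1f} tok/s")
```

Under those assumptions, the graphed path is roughly 1.7x faster in aggregate, or about 17.6 tok/s per sequence.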

pip install aither-kvcache==0.8.0
