v0.8.0 — Graphable fused decode, CUDA graph capture
What's New
- CUDA graph capture: The fused TQ attention path is now graphable, so torch.compile and CUDA graphs can capture the decode kernel
- 87.9 tok/s aggregate throughput at 5 concurrent sequences (up from 51 tok/s pre-graph)
- Algorithm file sync from AitherOS production
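
For readers unfamiliar with what "graphable" buys you, here is a minimal sketch of CUDA graph capture and replay using PyTorch's public torch.cuda.CUDAGraph API. It does not use aither-kvcache at all: toy_step, capture_step, and cuda_graphs_available are hypothetical names invented for this illustration, and the matmul is a placeholder for a decode step, not the fused TQ attention kernel. The 5-row batch only mirrors the 5 concurrent sequences mentioned above.

```python
try:
    import torch
except ImportError:  # keep the sketch importable on machines without PyTorch
    torch = None


def cuda_graphs_available() -> bool:
    """Return True only when PyTorch is installed and a CUDA device is visible."""
    return torch is not None and torch.cuda.is_available()


def capture_step(step_fn, static_input, warmup_iters=3):
    """Capture one call of step_fn into a CUDA graph.

    step_fn is a hypothetical stand-in for a decode step. It must be
    capture-safe: fixed shapes, no host synchronisation, no CPU branching
    on GPU values.
    """
    # Warm up on a side stream so lazy initialisation happens before capture.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side), torch.no_grad():
        for _ in range(warmup_iters):
            step_fn(static_input)
    torch.cuda.current_stream().wait_stream(side)

    # Record the kernels launched by one step into the graph.
    graph = torch.cuda.CUDAGraph()
    with torch.no_grad(), torch.cuda.graph(graph):
        static_output = step_fn(static_input)
    return graph, static_output


if __name__ == "__main__" and cuda_graphs_available():
    weight = torch.randn(64, 64, device="cuda")

    def toy_step(x):
        # Placeholder compute, not the fused TQ attention kernel.
        return torch.relu(x @ weight)

    static_in = torch.zeros(5, 64, device="cuda")  # one row per sequence
    graph, static_out = capture_step(toy_step, static_in)

    # Replay: write new data into the captured input buffer, then relaunch.
    static_in.copy_(torch.randn(5, 64, device="cuda"))
    graph.replay()
    torch.cuda.synchronize()
    assert torch.allclose(static_out, toy_step(static_in), atol=1e-4)
```

Replay reuses the captured input and output buffers, which is why new data is copied into static_in rather than passed as a fresh tensor. The benefit in a decode loop generally comes from replacing many per-step kernel launches with a single graph launch.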
Install
pip install aither-kvcache==0.8.0
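
To confirm which version ended up in the active environment, a small check against the distribution name from the pip command can help. This is a sketch using only the standard library; installed_version is a hypothetical helper name, and it deliberately avoids importing the package itself because the import name is not stated in these notes.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str = "aither-kvcache"):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


found = installed_version()
print(f"aither-kvcache: {found or 'not installed'}")
```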