v0.5.0 — Blackwell SM_120 validated

@wizzense wizzense released this 30 Mar 17:09

What's New

Fused Triton decode validated on NVIDIA Blackwell (RTX 5090, SM_120)

  • Triton kernels generate valid PTX for SM_120 without modification
  • AITHER_TQ_FORCE_TRITON=1 overrides the conservative guard on SM_100+ GPUs
  • 26.1 tok/s decode throughput at 5 concurrent requests
  • TQGPUCache: 36L x 856B @ TQ4, 512 MB VRAM + 512 MB DDR5
  • Zero runtime errors under concurrent load

Effective Context Budget (RTX 5090, 32 GiB)

Tier          Format   Capacity
Hot (VRAM)    TQ4      ~280K tokens
Cold (DDR5)   TQ4      ~3.9M tokens

Env Vars

AITHER_TQ_BITS=4           # 2, 3, or 4
AITHER_TQ_FUSED=1          # fused Triton decode
AITHER_TQ_FORCE_TRITON=1   # required on Blackwell SM_100+
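
The variables above can also be set from Python before the cache is created. The following is a minimal sketch, not part of aither-kvcache: the helper name `build_tq_env` is hypothetical, and it assumes that "SM_100+" means a CUDA compute capability with major version 10 or higher, and that the library reads these variables from the process environment.

```python
import os

# Bit widths listed in the release notes for AITHER_TQ_BITS.
VALID_BITS = (2, 3, 4)


def build_tq_env(bits=4, fused=True, compute_capability=(12, 0)):
    """Build the AITHER_TQ_* settings for a GPU (hypothetical helper).

    compute_capability is a (major, minor) tuple, e.g. (12, 0) for SM_120.
    """
    if bits not in VALID_BITS:
        raise ValueError(f"AITHER_TQ_BITS must be one of {VALID_BITS}, got {bits}")

    env = {"AITHER_TQ_BITS": str(bits)}
    if fused:
        env["AITHER_TQ_FUSED"] = "1"
        # Assumption: the SM_100+ guard applies when the major version is >= 10,
        # so the override is only needed on those GPUs.
        if compute_capability[0] >= 10:
            env["AITHER_TQ_FORCE_TRITON"] = "1"
    return env


if __name__ == "__main__":
    # RTX 5090 reports SM_120, i.e. compute capability (12, 0).
    settings = build_tq_env(bits=4, fused=True, compute_capability=(12, 0))
    os.environ.update(settings)
    for name, value in settings.items():
        print(f"{name}={value}")
```

If PyTorch is available, `torch.cuda.get_device_capability()` returns the same kind of `(major, minor)` tuple and could supply the `compute_capability` argument.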

Install

pip install aither-kvcache==0.5.0
pip install "aither-kvcache[vllm]==0.5.0"
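
To confirm that the pinned release is the one actually installed, the version can be read back through the standard library. This is a small sketch using `importlib.metadata`. The distribution name `aither-kvcache` is taken from the install command above, and the helper name `installed_version` is only illustrative.

```python
from importlib.metadata import PackageNotFoundError, version

EXPECTED_VERSION = "0.5.0"


def installed_version(dist_name="aither-kvcache"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("aither-kvcache is not installed")
    elif found != EXPECTED_VERSION:
        print(f"found {found}, expected {EXPECTED_VERSION}")
    else:
        print(f"aither-kvcache {found} is installed")
```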

Full blog post: https://demo.aitherium.com/blog/turboquant-sub-byte-kv-cache-from-paper-to-production