Skip to content

v2.0.0 — TriAttention spectral KV compression

Choose a tag to compare

@wizzense wizzense released this 07 Apr 18:55
· 4 commits to main since this release

What's New in v2.0.0

TriAttention — Spectral KV Cache Compression (NEW)

Trigonometric series attention for ~10× KV cache compression. Exploits the fact that RoPE attention is naturally a trig series in position difference — store only the dominant frequency pairs per token instead of full vectors.

Modules:

  • \�ither_kvcache.triattention.config\ — \TriAttentionConfig\
  • \�ither_kvcache.triattention.encoder\ — \SpectralKVEncoder\ (encode/decode)
  • \�ither_kvcache.triattention.scorer\ — \TrigSeriesScorer\ (pre-RoPE trig series scoring)
  • \�ither_kvcache.triattention.cache\ — \SpectralKVCache\ (paged block storage)
  • \�ither_kvcache.triattention.attention\ — \TriAttention\ (full pipeline)
  • \�ither_kvcache.triattention.calibration\ — Qwen3.5 family profiles (0.6B–32B)

Compression:

Mode Bytes/Token vs FP16
F=12, int4 26 9.85×
F=12, int8 38 6.74×
F=8, int4 18 14.2×

\\python
from aither_kvcache.triattention import TriAttention, TriAttentionConfig

config = TriAttentionConfig(head_dim=128, num_freqs=12, coeff_bits=4,
num_kv_heads=8, num_query_heads=32)
tri = TriAttention(config, device='cuda')
k_enc, v_enc = tri.encode_kv(keys, values)
output = tri.decode_step(query, k_enc, v_enc, query_pos, key_positions)
\\

TurboQuant Core Sync

  • Updated quantizer, hybrid_quantizer, fused_attention, fused_kv_update, triton_ops, block_selector from monorepo

Tests

  • 184 passed, 5 skipped (GPU-only tests)
  • 32 new TriAttention tests covering spectral math, encoder roundtrip, trig series scoring, paged cache, Qwen3.5 calibration

Full Changelog: v1.3.1...v2.0.0