v0.7.0 — Hybrid bit-width quantizer (tq35/tq25)

@wizzense wizzense released this 30 Mar 19:45
· 25 commits to main since this release

What's New

HybridTurboQuant — Split-Group Quantization with QJL Residual

Dimensions are split into an outlier (high-variance) group and a regular group, each with its own bit budget. Each group is quantized with an MSE codebook, and a 1-bit QJL sketch of the residual captures the remaining quantization error.

from turboquant import HybridTurboQuant

htq = HybridTurboQuant(head_dim=128, mode='tq35', device='cuda')
htq.calibrate_uniform()  # or htq.calibrate(sample_data)
packed = htq.encode(kv_vectors)
decoded = htq.decode(packed)
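
To make the split-group idea concrete, here is a minimal pure-Python sketch of ranking dimensions by variance and computing the average bit budget. It is an illustration, not the library's implementation: split_outlier_dims and average_bits are hypothetical helpers, and the 64/64 split at 4 and 3 bits is an assumed configuration that happens to average 3.5 bits. The release does not state the actual group sizes or per-group bit widths.

```python
from statistics import pvariance


def split_outlier_dims(samples, n_outlier):
    """Split dimension indices into (outlier, regular) by per-dimension variance.

    Hypothetical helper: the n_outlier highest-variance dimensions form the
    outlier group, the rest form the regular group.
    """
    dim = len(samples[0])
    variances = [pvariance([row[j] for row in samples]) for j in range(dim)]
    order = sorted(range(dim), key=lambda j: variances[j], reverse=True)
    return sorted(order[:n_outlier]), sorted(order[n_outlier:])


def average_bits(n_outlier, n_regular, outlier_bits, regular_bits):
    """Average bits per dimension across both groups."""
    total_bits = n_outlier * outlier_bits + n_regular * regular_bits
    return total_bits / (n_outlier + n_regular)


# Toy calibration data: dimension 1 swings far more than the others.
samples = [
    [0.1, 5.0, -0.2, 0.0],
    [0.0, -4.0, 0.1, 0.1],
    [-0.1, 3.0, 0.0, -0.1],
]
outlier, regular = split_outlier_dims(samples, n_outlier=1)
print(outlier, regular)

# Assumed split for illustration only: 64 dims at 4 bits, 64 dims at 3 bits.
print(average_bits(64, 64, 4, 3))
```

With calibrate_uniform() no sample data is available, so a data-oblivious deployment would have to pick the groups without variance statistics; how the library does that is not described here.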

Modes

Mode   Avg Bits   Bytes/Vector   vs FP16   Description
tq35   3.5        64             4.0x      Better quality than TQ4 at same size
tq25   2.5        44             5.8x      Maximum compression

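
The compression ratios in the table can be checked with a few lines of arithmetic. The sketch below assumes head_dim=128 (the value used in the usage example above) and 2 bytes per FP16 element; the helper names are hypothetical.

```python
def fp16_bytes(head_dim):
    """Size of one uncompressed FP16 vector (2 bytes per element)."""
    return head_dim * 2


def payload_bytes(head_dim, avg_bits):
    """Bytes needed for the quantized values alone, excluding any metadata."""
    return head_dim * avg_bits / 8


def compression_ratio(head_dim, packed_bytes):
    """How many times smaller the packed vector is than FP16."""
    return fp16_bytes(head_dim) / packed_bytes


HEAD_DIM = 128  # assumption: matches head_dim in the usage example
for mode, avg_bits, packed in [("tq35", 3.5, 64), ("tq25", 2.5, 44)]:
    ratio = compression_ratio(HEAD_DIM, packed)
    payload = payload_bytes(HEAD_DIM, avg_bits)
    print(f"{mode}: payload {payload:.0f} B, packed {packed} B, {ratio:.1f}x vs FP16")
```

The quantized payload comes to 56 and 40 bytes, so the listed sizes include 8 and 4 extra bytes per vector. These are presumably per-vector metadata such as norms, but the release does not break them down.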
Key Components

  • Fast Walsh-Hadamard Transform: O(d log d) rotation, non-power-of-2 support
  • Dimension-Aware Lloyd-Max Codebook: Exact Beta distribution (not Gaussian approx)
  • QJL Residual Encoding: 1-bit sign sketch captures quantization error
  • Variance-Based Calibration: Identifies outlier dimensions per head
  • Data-Oblivious Fallback: calibrate_uniform() for zero-calibration deployment
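
Two of the components above can be sketched in plain Python. This is a teaching sketch, not the package's code. The fwht function handles only power-of-two lengths, whereas the release advertises non-power-of-2 support. The qjl_encode and qjl_inner_product helpers are hypothetical names for the standard sign-sketch estimator, with a dense Gaussian projection and an oversized sketch (m=2000) chosen only to make the estimate visibly accurate.

```python
import math
import random


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform in O(d log d).

    Power-of-two lengths only. With the 1/sqrt(d) scaling the transform
    is its own inverse.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(x)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def qjl_encode(residual, projection):
    """Keep one sign bit per random projection, plus the residual norm."""
    signs = [
        1 if sum(p * r for p, r in zip(row, residual)) >= 0 else -1
        for row in projection
    ]
    norm = math.sqrt(sum(r * r for r in residual))
    return signs, norm


def qjl_inner_product(query, signs, norm, projection):
    """Estimate <query, residual> from the sign sketch.

    Relies on E[sign(g.r) * (g.q)] = sqrt(2/pi) * <r, q> / ||r||
    for a standard Gaussian vector g.
    """
    m = len(projection)
    total = sum(
        s * sum(p * q for p, q in zip(row, query))
        for s, row in zip(signs, projection)
    )
    return math.sqrt(math.pi / 2) / m * norm * total


rotated = fwht([1.0, 0.0, 0.0, 0.0])  # a spike spreads evenly across all dims
restored = fwht(rotated)              # applying it again undoes the rotation

rng = random.Random(0)
dim, m = 8, 2000  # m is oversized here for illustration
projection = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(m)]
residual = [0.3, -0.1, 0.2, 0.0, -0.4, 0.1, 0.05, -0.2]
signs, norm = qjl_encode(residual, projection)
estimate = qjl_inner_product(residual, signs, norm, projection)
exact = sum(r * r for r in residual)
print(rotated, estimate, exact)
```

The rotation spreads energy evenly across dimensions before codebook quantization. The sign sketch then stores only one bit per projection of the leftover error, yet still supports an unbiased inner-product estimate.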

Install

pip install aither-kvcache==0.7.0

Inspired by wizzense/vllm-turboquant.