v0.7.0 — Hybrid bit-width quantizer (tq35/tq25)
What's New
HybridTurboQuant — Split-Group Quantization with QJL Residual
Dimensions are split into outlier (high-variance) and regular groups with different bit budgets. Each group gets MSE codebook quantization plus 1-bit QJL residual encoding that captures the quantization error.
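To make the residual step concrete, here is a minimal pure-Python sketch of a QJL-style 1-bit sign sketch. It is not the library's implementation: the helper names (`make_projection`, `encode_residual`, `estimate_inner`), the coarse rounding that stands in for the codebook quantizer, and the sketch size `m=2048` are all assumptions chosen for illustration. It stores one sign bit per random Gaussian projection of the quantization error, plus the error's norm, and uses them to correct an inner product.

```python
import math
import random

def make_projection(m, dim, seed=0):
    """Draw an m x dim matrix of i.i.d. standard normal entries."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(m)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def encode_residual(residual, proj):
    """Keep one sign bit per projection row, plus the residual's norm."""
    signs = [1 if dot(row, residual) >= 0.0 else -1 for row in proj]
    return signs, math.sqrt(dot(residual, residual))

def estimate_inner(query, signs, norm, proj):
    """Estimate <query, residual> from the stored sign bits.

    For a Gaussian row s, E[sign(s.r) * (s.q)] = sqrt(2/pi) * <q, r> / ||r||,
    so rescaling the sample mean by sqrt(pi/2) * ||r|| removes the bias.
    """
    acc = sum(sgn * dot(row, query) for sgn, row in zip(signs, proj))
    return math.sqrt(math.pi / 2.0) * norm * acc / len(proj)

# Demo: rounding to a 0.5 grid stands in for the codebook quantizer (assumption).
rng = random.Random(1)
dim = 16
x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
q = [rng.gauss(0.0, 1.0) for _ in range(dim)]
x_hat = [round(v * 2) / 2 for v in x]
residual = [a - b for a, b in zip(x, x_hat)]

proj = make_projection(m=2048, dim=dim)
signs, norm = encode_residual(residual, proj)

exact = dot(q, x)
approx = dot(q, x_hat) + estimate_inner(q, signs, norm, proj)
```

The large `m` is only there so the estimate visibly converges in a toy setting; these notes do not state how many sign bits the library stores per vector.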
```python
from turboquant import HybridTurboQuant
htq = HybridTurboQuant(head_dim=128, mode='tq35', device='cuda')
htq.calibrate_uniform()  # or htq.calibrate(sample_data)
packed = htq.encode(kv_vectors)
decoded = htq.decode(packed)
```
Modes
| Mode | Avg Bits | Bytes/Vector (head_dim=128) | Compression vs FP16 | Description |
|---|---|---|---|---|
| tq35 | 3.5 | 64 | 4.0x | Better quality than TQ4 at same size |
| tq25 | 2.5 | 44 | 5.8x | Maximum compression |
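The ratios in the table can be checked with a few lines of arithmetic. This sketch assumes `head_dim=128` (the value used in the usage example) and 2 bytes per FP16 element; the function names are illustrative and not part of the package.

```python
def fp16_bytes(head_dim):
    """Size of one uncompressed FP16 vector: 2 bytes per element."""
    return 2 * head_dim

def payload_bytes(head_dim, avg_bits):
    """Bytes needed for the quantized values alone at a given average bit-width."""
    return head_dim * avg_bits / 8

def compression_ratio(head_dim, packed_bytes):
    """FP16 size divided by the packed size listed in the table."""
    return fp16_bytes(head_dim) / packed_bytes

HEAD_DIM = 128  # assumption: matches the usage example
for mode, avg_bits, listed in [("tq35", 3.5, 64), ("tq25", 2.5, 44)]:
    print(mode,
          payload_bytes(HEAD_DIM, avg_bits),
          round(compression_ratio(HEAD_DIM, listed), 1))
# tq35 56.0 4.0
# tq25 40.0 5.8
```

The listed sizes exceed the raw payload by 8 and 4 bytes respectively. These notes do not break down that difference, so the sketch leaves it unexplained rather than guessing at the packed layout.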
Key Components
- Fast Walsh-Hadamard Transform: O(d log d) rotation, non-power-of-2 support
- Dimension-Aware Lloyd-Max Codebook: Exact Beta distribution (not Gaussian approx)
- QJL Residual Encoding: 1-bit sign sketch captures quantization error
- Variance-Based Calibration: Identifies outlier dimensions per head
- Data-Oblivious Fallback: calibrate_uniform() for zero-calibration deployment
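For readers unfamiliar with the first component, the following is a textbook fast Walsh-Hadamard transform in pure Python, offered as a rough sketch and not as the package's kernel. It handles only power-of-two lengths; the non-power-of-2 support mentioned above is not reproduced here, and the function name `fwht` is an assumption.

```python
import math

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform of a length-2^k sequence.

    Runs log2(n) butterfly passes of n/2 add/subtract pairs each,
    which gives the O(d log d) cost.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch only supports power-of-two lengths")
    out = list(x)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform orthonormal
    return [v * scale for v in out]

v = [1.0, -2.0, 0.5, 3.0, 0.0, 4.0, -1.5, 2.0]
rotated = fwht(v)
restored = fwht(rotated)  # the orthonormal transform is its own inverse
```

With the `1/sqrt(n)` scaling the transform preserves vector norms, so it acts as a rotation that spreads energy across coordinates without changing inner products.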
Install
```bash
pip install aither-kvcache==0.7.0
```
Inspired by wizzense/vllm-turboquant.