Blockwise 4D rotation quantization for LLM weights. Per-group SO(4) rotation parameterized by two unit quaternions, applied as a decorrelating transform before scalar quantization. Pure PyTorch, MIT-licensed.
For each contiguous group of 64 weight elements:
- Norm/direction split: `ρ = ‖x‖₂`, `x̄ = x / ρ`. Store `ρ` separately.
- Block partition: reshape `x̄` into 16 blocks of 4 coordinates.
- Per-tensor seed: sample 16 pairs of unit quaternions `(q_L⁽ⁱ⁾, q_R⁽ⁱ⁾)` from a deterministic Haar-on-S³ draw seeded by SHA-256 of the tensor's name (truncated to 32 bits).
- SO(4) sandwich rotation: apply `T(v) = q_L · v · q̄_R` to each block.
- Affine quantize: group-wise asymmetric N-bit on the rotated coordinates (4-bit in this iteration).
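The first four steps can be sketched in plain Python for a single group. This is a minimal illustration, not the repo's PyTorch implementation: the function names are hypothetical, taking the first 4 digest bytes in big-endian order is an assumption about how the hash is truncated, and `random.Random` stands in for whatever generator the library actually uses. The input group is assumed to be non-zero.

```python
import hashlib
import math
import random


def seed_from_name(name):
    """32-bit seed from SHA-256 of the tensor name (first 4 bytes, big-endian: assumed)."""
    return int.from_bytes(hashlib.sha256(name.encode("utf-8")).digest()[:4], "big")


def random_unit_quaternion(rng):
    """Haar-uniform point on S^3: normalize four i.i.d. standard normals."""
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    n = math.sqrt(sum(c * c for c in q))
    return tuple(c / n for c in q)


def sample_pairs(name, n_blocks=16):
    """Deterministic per-tensor list of (q_L, q_R) pairs, one pair per block."""
    rng = random.Random(seed_from_name(name))
    return [(random_unit_quaternion(rng), random_unit_quaternion(rng))
            for _ in range(n_blocks)]


def qmul(a, b):
    """Hamilton product with components ordered (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )


def qconj(q):
    w, x, y, z = q
    return (w, -x, -y, -z)


def rotate(v, q_l, q_r):
    """T(v) = q_L * v * conj(q_R), treating the 4-vector v as a quaternion."""
    return qmul(qmul(q_l, v), qconj(q_r))


def unrotate(w, q_l, q_r):
    """Inverse of rotate for unit quaternions: conj(q_L) * w * q_R."""
    return qmul(qmul(qconj(q_l), w), q_r)


def encode_group(x, pairs):
    """Norm/direction split, then blockwise rotation of the unit direction."""
    rho = math.sqrt(sum(c * c for c in x))
    x_bar = [c / rho for c in x]
    rotated = []
    for i, (q_l, q_r) in enumerate(pairs):
        rotated.extend(rotate(tuple(x_bar[4 * i:4 * i + 4]), q_l, q_r))
    return rho, rotated
```

Because both quaternions have unit norm, each block keeps its Euclidean length, so the rotated group still has norm 1 and `unrotate` recovers the direction exactly up to floating-point error.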
The rotation is intended to spread within-block correlations and outliers more evenly across coordinates so the per-group dynamic range — which sets affine quantization quality — is more uniform across blocks.
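To make the link between dynamic range and quantization error concrete, here is a minimal sketch of asymmetric affine quantization for one group. The function names are hypothetical, and storing the group minimum as a float offset (rather than an integer zero-point) is a simplifying assumption; the library's own layout may differ.

```python
def affine_quantize(values, bits=4):
    """Asymmetric affine quantization of one group; returns (codes, scale, lo)."""
    levels = (1 << bits) - 1                        # 15 for 4 bits
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0  # step size is set by the group range
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def affine_dequantize(codes, scale, lo):
    """Map integer codes back to approximate real values."""
    return [lo + c * scale for c in codes]


# A well-behaved group versus the same group with one outlier.
smooth = [0.1 * i / 63 for i in range(64)]
outlier = smooth[:-1] + [1.0]

for group in (smooth, outlier):
    codes, scale, lo = affine_quantize(group)
    recon = affine_dequantize(codes, scale, lo)
    worst = max(abs(a - b) for a, b in zip(group, recon))
    print(f"scale={scale:.5f}  max abs error={worst:.5f}")
```

The rounding error is bounded by half the step size, and the step size grows with the group's range, so a single outlier coarsens the grid for the other 63 values in the group.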
The per-tensor seed is a 32-bit value derived by hashing the layer name, so it
adds no storage to the artifact. The per-group ρ adds one BF16 value per group
(~5.5 % overhead at 4 bits). Quantized codes are stored as uint8 (1 byte per
code; bit-packing 2 nibbles per byte is a future 2x optimization).
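The overhead arithmetic can be checked in a few lines. The source does not state which baseline the ~5.5 % figure is measured against; the sketch below assumes the plain affine baseline stores a BF16 scale and a BF16 zero-point per group, which is one reading that reproduces the quoted number. All variable names are illustrative.

```python
GROUP_SIZE = 64
BITS = 4
BF16_BITS = 16

code_bits = GROUP_SIZE * BITS        # 256 bits of 4-bit codes per group
affine_param_bits = 2 * BF16_BITS    # ASSUMED: BF16 scale + BF16 zero-point per group
rho_bits = BF16_BITS                 # the extra per-group norm

# Overhead of rho relative to the codes alone.
overhead_vs_codes = rho_bits / code_bits
# Overhead of rho relative to codes plus the assumed affine parameters.
overhead_vs_affine = rho_bits / (code_bits + affine_param_bits)

# Code storage per group: one uint8 per code today, two codes per byte if nibble-packed.
unpacked_bytes = GROUP_SIZE
packed_bytes = GROUP_SIZE * BITS // 8

print(f"rho overhead vs codes only:       {overhead_vs_codes:.2%}")
print(f"rho overhead vs affine baseline:  {overhead_vs_affine:.2%}")
print(f"code bytes per group: {unpacked_bytes} unpacked, {packed_bytes} packed")
```

Under these assumptions the overhead is 16/288, about 5.6 %, against the affine baseline and 16/256 = 6.25 % against the codes alone. Both figures describe the ideal packed layout; with one uint8 per code the relative overhead of ρ is smaller, because the codes themselves take twice the space.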
This iteration ships only QuadRotor-Full at 4 bits, evaluated on two small models. Fast and 2D variants, more bit widths, and KV-cache compression are out of scope.
10 000 unit-norm isotropic Gaussian vectors per row, group_size=64, 4 bits, seed 0. Plain affine 4-bit is the baseline.
| d | bits | affine MSE | QuadRotor-Full MSE | ratio |
|---|---|---|---|---|
| 128 | 4 | 0.008028 | 0.008025 | 1.000 |
| 256 | 4 | 0.008040 | 0.008049 | 1.001 |
| 512 | 4 | 0.008023 | 0.008029 | 1.001 |
On isotropic synthetic data the rotation is a wash — random rotations of isotropic noise stay isotropic. The interesting comparison is on real (weakly-anisotropic, heavy-tailed) LLM weights, below.
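The null result on synthetic data follows from a one-line covariance argument. As a sketch, write the blockwise rotation of one group as a block-diagonal matrix $R = \operatorname{diag}(R_1, \dots, R_{16})$ with each $R_i \in SO(4)$ (this matrix notation is introduced here for illustration), and let $x \sim \mathcal{N}(0, \sigma^2 I_{64})$:

$$
\operatorname{Cov}(Rx) = R \,\operatorname{Cov}(x)\, R^\top = \sigma^2 R R^\top = \sigma^2 I_{64},
\qquad
\lVert Rx \rVert_2 = \lVert x \rVert_2 .
$$

A zero-mean Gaussian is determined by its covariance, so $Rx$ has the same distribution as $x$, and because $R$ preserves the norm, the normalized direction $R\bar{x}$ has the same distribution as $\bar{x}$. The quantizer therefore sees statistically identical inputs with or without the rotation, which matches the ratios of about 1.000 in the table.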
| Model | BF16 PPL | Affine 4-bit PPL | QuadRotor-Full 4-bit PPL | Δ (rotor − affine) |
|---|---|---|---|---|
| TinyLlama-1.1B-Chat-v1.0 | 7.9722 | 8.5230 | 8.5660 | +0.0431 |
| SmolLM2-1.7B-Instruct | 8.9391 | 11.0134 | 11.0197 | +0.0063 |
Headline takeaway: random Haar quaternions don't help. Across two architectures and very different degradation regimes (TinyLlama: ~7 % PPL increase from BF16; SmolLM2: ~23 %), the SO(4) rotation step applied with random per-tensor quaternions is approximately neutral vs. plain group-wise affine 4-bit quantization. On TinyLlama it's slightly worse (Δ +0.04); on SmolLM2 it's a wash (Δ +0.01).
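The percentages quoted above can be recomputed from the table. The snippet below is only a convenience check using the rounded table values; the dictionary layout and function name are illustrative.

```python
# Perplexities copied from the results table (rounded to 4 decimals).
ppl = {
    "TinyLlama-1.1B-Chat-v1.0": {"bf16": 7.9722, "affine": 8.5230, "rotor": 8.5660},
    "SmolLM2-1.7B-Instruct": {"bf16": 8.9391, "affine": 11.0134, "rotor": 11.0197},
}


def relative_increase(new, base):
    """Fractional increase of `new` over `base`."""
    return (new - base) / base


for model, r in ppl.items():
    affine_pct = 100 * relative_increase(r["affine"], r["bf16"])
    rotor_pct = 100 * relative_increase(r["rotor"], r["bf16"])
    delta = r["rotor"] - r["affine"]
    print(f"{model}: affine +{affine_pct:.1f} %, rotor +{rotor_pct:.1f} % vs BF16, "
          f"delta rotor-affine = {delta:+.4f}")
```

Computed from the rounded table entries, the TinyLlama delta comes out at +0.0430 rather than the tabulated +0.0431; the small gap is presumably rounding of the underlying unrounded perplexities.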
This is consistent with the underlying intuition: random rotation preserves the distribution of weight magnitudes, so per-group affine quantization sees similar dynamic ranges. To extract a real benefit, the rotation needs to be calibrated — e.g., learned from data to minimize quantization error on representative inputs (TurboQuant-style optimisation on the manifold). This iteration deliberately uses the lightweight fixed-randomized variant from the paper §5.5 ("yielding random block rotations analogous to the randomized transform used by TurboQuant"). A calibration-based variant is left for future work.
Three fixed prompts, 64 tokens, greedy decoding. Full JSON in `benchmarks/results/tinyllama_gen.json` and `benchmarks/results/smollm2_gen.json`.
For an eyeball check: TinyLlama and SmolLM2 4-bit completions look substantively similar to the BF16 baseline at this prompt scale. Both 4-bit variants exhibit the typical short-prompt repetition artifact ("Paris is the capital of France. Paris is …") that affects greedy decoding from quantized small models — present in both affine and QuadRotor variants alike. No visible additional degradation from the rotation step.
Weight-quantized HF artifacts produced by this repo:
- `majentik/TinyLlama-1.1B-Chat-v1.0-QuadRotor-Full-4bit`
- `majentik/SmolLM2-1.7B-Instruct-QuadRotor-Full-4bit`

These are loadable via `quadrotor.state_dict.decode_state_dict` (see
`src/quadrotor/state_dict.py`). A `transformers`-compatible loader that
monkeypatches `from_pretrained` is future work.
Install:

    pip install -e .
    # Plus benchmark deps to reproduce the numbers above:
    pip install -e ".[benchmarks]"

Quantize a model:

    quadrotor-quantize \
      --src ./tinyllama-bf16 \
      --out ./tinyllama-quadrotor-full-4bit \
      --bits 4 --group-size 64

Run the tests:

    pytest -q

25 tests covering quaternion algebra, SO(4) sandwich round-trip, affine round-trip error bounds, per-tensor and state-dict pipelines, synthetic-MSE reproduction, HF directory round-trip.
Smoke iteration complete. Pinned scope: QuadRotor-Full only, 4 bits, two ≤2 B models, MPS-only inference. Out of scope for this iteration:
- QuadRotor-Fast (single quaternion sandwich) and -2D variants
- More bit widths (2, 3, 5, 6, 8)
- KV-cache compression mode (the original paper's primary use case)
- Calibration-based quaternion learning (the most promising next step given the smoke null result)
- MLX / Metal kernels for speed
- Bit-packed nibble storage (currently uint8, one code per byte)
- A `transformers` loader that auto-decodes on `from_pretrained`
MIT — see LICENSE.