ajentik/quadrotor

QuadRotor

Blockwise 4D rotation quantization for LLM weights. Per-group SO(4) rotation parameterized by two unit quaternions, applied as a decorrelating transform before scalar quantization. Pure PyTorch, MIT-licensed.

What it does

For each contiguous group of 64 weight elements:

  1. Norm/direction split: ρ = ‖x‖₂, x̄ = x / ρ. Store ρ separately.
  2. Block partition: reshape into 16 blocks of 4 coordinates.
  3. Per-tensor seed: sample 16 pairs of unit quaternions (q_L⁽ⁱ⁾, q_R⁽ⁱ⁾) from a deterministic Haar-on-S³ draw seeded by SHA-256 of the tensor's name (truncated to 32 bits).
  4. SO(4) sandwich rotation: apply T(v) = q_L · v · q̄_R to each block.
  5. Affine quantize: group-wise asymmetric N-bit on the rotated coordinates (4-bit in this iteration).
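Steps 1 to 4 can be sketched as follows. This is a minimal NumPy illustration rather than the repo's PyTorch code. The function names, the (w, x, y, z) component order, the choice of the first four big-endian digest bytes for the seed, the reuse of one set of 16 quaternion pairs for every group in a tensor, and the example tensor name are all assumptions made for the sketch.

```python
import hashlib
import numpy as np

def seed_from_name(name):
    # SHA-256 of the tensor name truncated to 32 bits.
    # Assumption: the first 4 digest bytes, read big-endian.
    return int.from_bytes(hashlib.sha256(name.encode("utf-8")).digest()[:4], "big")

def haar_unit_quaternions(rng, n):
    # A normalized 4-D standard Gaussian is uniform (Haar) on the unit sphere S^3.
    q = rng.standard_normal((n, 4))
    return q / np.linalg.norm(q, axis=-1, keepdims=True)

def qmul(a, b):
    # Hamilton product of quaternions stored as (..., 4) arrays in (w, x, y, z) order.
    aw, ax, ay, az = np.moveaxis(a, -1, 0)
    bw, bx, by, bz = np.moveaxis(b, -1, 0)
    return np.stack([
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    ], axis=-1)

def qconj(q):
    # Quaternion conjugate: negate the vector part.
    return q * np.array([1.0, -1.0, -1.0, -1.0])

def rotate_group(x, q_l, q_r):
    # Steps 1, 2 and 4 for one group of 64 values (assumes a nonzero group).
    rho = np.linalg.norm(x)
    blocks = (x / rho).reshape(16, 4)
    return rho, qmul(qmul(q_l, blocks), qconj(q_r))

def unrotate_group(rho, rotated, q_l, q_r):
    # Inverse of T(v) = q_L * v * conj(q_R) for unit quaternions.
    blocks = qmul(qmul(qconj(q_l), rotated), q_r)
    return rho * blocks.reshape(-1)

# Hypothetical tensor name, used only to derive a deterministic seed.
rng = np.random.default_rng(seed_from_name("model.layers.0.self_attn.q_proj.weight"))
q_l = haar_unit_quaternions(rng, 16)
q_r = haar_unit_quaternions(rng, 16)

x = np.random.default_rng(0).standard_normal(64)
rho, rotated = rotate_group(x, q_l, q_r)
x_back = unrotate_group(rho, rotated, q_l, q_r)
print(np.allclose(x_back, x))
```

Because the quaternions are regenerated from the tensor name, a decoder only needs the name, the stored ρ values and the quantized codes to invert the transform.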

The rotation is intended to spread within-block correlations and outliers more evenly across coordinates so the per-group dynamic range — which sets affine quantization quality — is more uniform across blocks.
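To illustrate why per-group dynamic range matters, here is a minimal sketch of group-wise asymmetric min/max quantization in NumPy. The function names and the min/max calibration rule are assumptions; the repo's quantizer may choose scale and zero-point differently. With round-to-nearest, the per-element error is bounded by half a quantization step, so a single outlier that stretches the range coarsens every other value in the group.

```python
import numpy as np

def affine_quantize(x, bits=4):
    # Asymmetric min/max quantization of one group to integer codes in [0, 2**bits - 1].
    levels = 2**bits - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.clip(np.round((x - lo) / scale), 0, levels).astype(np.uint8)
    return codes, scale, lo

def affine_dequantize(codes, scale, lo):
    # Map integer codes back to real values.
    return codes.astype(np.float64) * scale + lo

rng = np.random.default_rng(0)
group = rng.standard_normal(64)
outlier_group = group.copy()
outlier_group[0] = 20.0  # one outlier stretches the group's dynamic range

results = {}
for name, g in [("plain", group), ("outlier", outlier_group)]:
    codes, scale, lo = affine_quantize(g)
    err = np.abs(g - affine_dequantize(codes, scale, lo))
    results[name] = {"step": scale, "max_err": float(err.max())}
    print(name, results[name])
```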

The per-tensor seed is derived by hashing the layer name (SHA-256 truncated to 4 bytes), so it needs no storage in the artifact. The per-group ρ adds one BF16 value per group (~5.5 % overhead at 4 bits). Quantized codes are currently stored as uint8, one byte per code; bit-packing two nibbles per byte is a future 2x storage optimization.
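The overhead figure can be checked with a few lines of arithmetic. The text does not say what the 16 bits of ρ are measured against, so the sketch below shows two readings. The second, which assumes a 16-bit scale and a 16-bit zero-point per group in the affine baseline, is the one that lands near 5.5 %, but that storage layout is an assumption.

```python
GROUP_SIZE = 64
BITS = 4
RHO_BITS = 16  # one BF16 norm per group

code_bits = GROUP_SIZE * BITS  # 256 bits of 4-bit codes per group

# Assumption: the affine baseline stores a 16-bit scale and a 16-bit zero-point per group.
affine_meta_bits = 2 * 16

overhead_vs_codes = RHO_BITS / code_bits
overhead_vs_affine = RHO_BITS / (code_bits + affine_meta_bits)

print(f"relative to codes only:        {overhead_vs_codes:.2%}")   # 6.25%
print(f"relative to codes + scale/zero: {overhead_vs_affine:.2%}")  # 5.56%
```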

Smoke results (2026-05-08)

This iteration ships only QuadRotor-Full at 4 bits, evaluated on two small models. Fast and 2D variants, more bit widths, and KV-cache compression are out of scope.

Synthetic stage-1 reconstruction MSE

Each row uses 10 000 unit-norm isotropic Gaussian vectors of dimension d, with group_size=64, 4 bits, and seed 0. Plain affine 4-bit quantization is the baseline; ratio is QuadRotor-Full MSE divided by affine MSE.

| d | bits | affine MSE | QuadRotor-Full MSE | ratio |
| --- | --- | --- | --- | --- |
| 128 | 4 | 0.008028 | 0.008025 | 1.000 |
| 256 | 4 | 0.008040 | 0.008049 | 1.001 |
| 512 | 4 | 0.008023 | 0.008029 | 1.001 |

On isotropic synthetic data the rotation is a wash — random rotations of isotropic noise stay isotropic. The interesting comparison is on real (weakly-anisotropic, heavy-tailed) LLM weights, below.
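The isotropic null result is easy to reproduce in spirit. The sketch below is not the repo's benchmark script: it is a NumPy approximation in which the MSE definition (squared error summed per unit-norm group, then averaged), the min/max affine rule, and all function names are assumptions, so absolute numbers need not match the table. It should still show the two errors landing very close to each other.

```python
import numpy as np

def qmul(a, b):
    # Hamilton product on (..., 4) arrays in (w, x, y, z) order.
    aw, ax, ay, az = np.moveaxis(a, -1, 0)
    bw, bx, by, bz = np.moveaxis(b, -1, 0)
    return np.stack([
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    ], axis=-1)

def haar(rng, n):
    # Normalized 4-D Gaussians are uniform on S^3.
    q = rng.standard_normal((n, 4))
    return q / np.linalg.norm(q, axis=-1, keepdims=True)

def affine_roundtrip(g, bits=4):
    # Quantize and dequantize along the last axis with a per-group min/max affine grid.
    levels = 2**bits - 1
    lo = g.min(axis=-1, keepdims=True)
    scale = (g.max(axis=-1, keepdims=True) - lo) / levels
    return np.round((g - lo) / scale) * scale + lo

def compare(d=128, n=2000, group_size=64, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, d))
    x /= np.linalg.norm(x, axis=-1, keepdims=True)
    groups = x.reshape(n, d // group_size, group_size)
    unit = groups / np.linalg.norm(groups, axis=-1, keepdims=True)

    # Baseline: affine quantization applied directly to each unit-norm group.
    base = affine_roundtrip(unit)
    mse_affine = float(np.mean(np.sum((base - unit) ** 2, axis=-1)))

    # Rotated: sandwich-rotate 4-D blocks, quantize, then rotate back.
    n_blocks = group_size // 4
    q_l, q_r = haar(rng, n_blocks), haar(rng, n_blocks)
    conj = np.array([1.0, -1.0, -1.0, -1.0])
    blocks = unit.reshape(n, -1, n_blocks, 4)
    rot = qmul(qmul(q_l, blocks), q_r * conj)
    rot_hat = affine_roundtrip(rot.reshape(unit.shape)).reshape(rot.shape)
    back = qmul(qmul(q_l * conj, rot_hat), q_r).reshape(unit.shape)
    mse_rot = float(np.mean(np.sum((back - unit) ** 2, axis=-1)))
    return mse_affine, mse_rot

mse_affine, mse_rot = compare()
print(f"affine {mse_affine:.6f}  rotated {mse_rot:.6f}  ratio {mse_rot / mse_affine:.3f}")
```

Since the rotation is orthogonal, the error measured after rotating back equals the error in the rotated domain, which is why the comparison reduces to how the affine grid fits rotated versus unrotated coordinates.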

WikiText-2 perplexity (seq_len 2048, MPS)

| Model | BF16 | Affine 4-bit | QuadRotor-Full 4-bit | Δ rotor−affine |
| --- | --- | --- | --- | --- |
| TinyLlama-1.1B-Chat-v1.0 | 7.9722 | 8.5230 | 8.5660 | +0.0431 |
| SmolLM2-1.7B-Instruct | 8.9391 | 11.0134 | 11.0197 | +0.0063 |

Headline takeaway: random Haar quaternions don't help. Across two architectures and very different degradation regimes (TinyLlama: ~7 % PPL increase from BF16; SmolLM2: ~23 %), the SO(4) rotation step applied with random per-tensor quaternions is approximately neutral vs. plain group-wise affine 4-bit quantization. On TinyLlama it's slightly worse (Δ +0.04); on SmolLM2 it's a wash (Δ +0.01).
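The percentages quoted above follow directly from the perplexity table. A short check, using only the rounded values reported there (so the rotor minus affine difference for TinyLlama comes out as 0.0430 rather than the table's +0.0431, which was presumably computed before rounding):

```python
# (BF16, affine 4-bit, QuadRotor-Full 4-bit) perplexities copied from the table.
results = {
    "TinyLlama-1.1B-Chat-v1.0": (7.9722, 8.5230, 8.5660),
    "SmolLM2-1.7B-Instruct": (8.9391, 11.0134, 11.0197),
}

summary = {}
for model, (bf16, affine, rotor) in results.items():
    summary[model] = {
        "affine_vs_bf16_pct": 100 * (affine / bf16 - 1),
        "rotor_vs_bf16_pct": 100 * (rotor / bf16 - 1),
        "rotor_minus_affine": rotor - affine,
    }
    s = summary[model]
    print(f"{model}: affine +{s['affine_vs_bf16_pct']:.1f}%, "
          f"rotor +{s['rotor_vs_bf16_pct']:.1f}%, "
          f"delta {s['rotor_minus_affine']:+.4f}")
```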

This is consistent with the underlying intuition: a random rotation preserves block norms and leaves the overall distribution of weight magnitudes essentially unchanged, so per-group affine quantization sees similar dynamic ranges. To extract a real benefit, the rotation likely needs to be calibrated, e.g. learned from data to minimize quantization error on representative inputs (optimisation over the rotation manifold). This iteration deliberately uses the lightweight fixed-randomized variant from §5.5 of the paper ("yielding random block rotations analogous to the randomized transform used by TurboQuant"). A calibration-based variant is left for future work.

Generation samples

Three fixed prompts, 64 tokens, greedy decoding. Full JSON in benchmarks/results/tinyllama_gen.json and benchmarks/results/smollm2_gen.json.

For an eyeball check: TinyLlama and SmolLM2 4-bit completions look substantively similar to the BF16 baseline at this prompt scale. Both 4-bit variants exhibit the typical short-prompt repetition artifact ("Paris is the capital of France. Paris is …") that affects greedy decoding from quantized small models — present in both affine and QuadRotor variants alike. No visible additional degradation from the rotation step.

Models

Weight-quantized HF artifacts produced by this repo:

These are loadable via quadrotor.state_dict.decode_state_dict (see src/quadrotor/state_dict.py). A transformers-compatible loader that monkeypatches from_pretrained is future work.

Install

pip install -e .
# Plus benchmark deps to reproduce the numbers above:
pip install -e ".[benchmarks]"

CLI

quadrotor-quantize \
    --src ./tinyllama-bf16 \
    --out ./tinyllama-quadrotor-full-4bit \
    --bits 4 --group-size 64

Tests

pytest -q

25 tests covering quaternion algebra, SO(4) sandwich round-trip, affine round-trip error bounds, per-tensor and state-dict pipelines, synthetic-MSE reproduction, and HF directory round-trip.

Status

Smoke iteration complete. Pinned scope: QuadRotor-Full only, 4 bits, two ≤2 B models, MPS-only inference. Out of scope for this iteration:

  • QuadRotor-Fast (single quaternion sandwich) and -2D variants
  • More bit widths (2, 3, 5, 6, 8)
  • KV-cache compression mode (the original paper's primary use case)
  • Calibration-based quaternion learning (the most promising next step given the smoke null result)
  • MLX / Metal kernels for speed
  • Bit-packed nibble storage (currently uint8, one code per byte)
  • A transformers loader that auto-decodes on from_pretrained
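For the bit-packed nibble storage item, a possible shape of the packing step is sketched below. This is a hypothetical illustration, not code from the repo: the function names and the low-nibble-first byte layout are assumptions.

```python
import numpy as np

def pack_nibbles(codes):
    # Pack an even-length 1-D array of 4-bit codes (0..15) into half as many bytes.
    # Assumed layout: even-indexed code in the low nibble, odd-indexed code in the high nibble.
    codes = np.asarray(codes, dtype=np.uint8)
    if codes.ndim != 1 or codes.size % 2:
        raise ValueError("expected an even-length 1-D array")
    if codes.size and codes.max() > 15:
        raise ValueError("codes must fit in 4 bits")
    return (codes[0::2] | (codes[1::2] << 4)).astype(np.uint8)

def unpack_nibbles(packed):
    # Inverse of pack_nibbles: recover one uint8 code per 4-bit nibble.
    packed = np.asarray(packed, dtype=np.uint8)
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed & 0x0F
    out[1::2] = packed >> 4
    return out

codes = np.random.default_rng(0).integers(0, 16, size=64, dtype=np.uint8)
packed = pack_nibbles(codes)
print(codes.nbytes, "->", packed.nbytes)  # 64 -> 32
```

One group of 64 codes shrinks from 64 bytes to 32, which is the 2x saving mentioned for the uint8 storage format.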

License

MIT — see LICENSE.
