QuadRotor

Blockwise 4D rotation quantization for LLM weights. Within each quantization group, every 4-coordinate block gets an SO(4) rotation parameterized by two unit quaternions, applied as a decorrelating transform before scalar quantization. Pure PyTorch, MIT-licensed.

What it does

For each contiguous group of 64 weight elements:

1. Norm/direction split: ρ = ‖x‖₂, x̄ = x / ρ. Store ρ separately.
2. Block partition: reshape x̄ into 16 blocks of 4 coordinates.
3. Per-tensor seed: sample 16 pairs of unit quaternions (q_L⁽ⁱ⁾, q_R⁽ⁱ⁾) from a deterministic Haar-on-S³ draw seeded by SHA-256 of the tensor's name (truncated to 32 bits).
4. SO(4) sandwich rotation: apply T(v) = q_L · v · q̄_R to each block.
5. Affine quantize: group-wise asymmetric N-bit on the rotated coordinates (4-bit in this iteration).
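The seeding and sandwich-rotation steps above can be sketched in plain Python. This is a minimal illustration, not the repository's code: the helper names, the big-endian reading of the truncated hash, the use of `random.Random` as the generator, and the example tensor name are all assumptions made here. The only facts it relies on are that a normalised 4D standard Gaussian is uniform on the unit 3-sphere and that the sandwich with unit quaternions is inverted by conjugating on the opposite sides.

```python
import hashlib
import math
import random

def seed_from_name(name):
    """First 32 bits of SHA-256(name). Big-endian byte order is an assumption."""
    return int.from_bytes(hashlib.sha256(name.encode("utf-8")).digest()[:4], "big")

def haar_unit_quaternion(rng):
    """Normalised 4D standard Gaussian, i.e. a uniform draw on the unit 3-sphere."""
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    n = math.sqrt(sum(c * c for c in q))
    return tuple(c / n for c in q)

def qmul(a, b):
    """Hamilton product of quaternions stored as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )

def qconj(q):
    """Quaternion conjugate (the inverse, for unit quaternions)."""
    w, x, y, z = q
    return (w, -x, -y, -z)

def rotate(block, q_l, q_r):
    """T(v) = q_L * v * conj(q_R), with the 4-coordinate block read as a quaternion."""
    return qmul(qmul(q_l, block), qconj(q_r))

def unrotate(block, q_l, q_r):
    """Inverse of rotate: v = conj(q_L) * T(v) * q_R."""
    return qmul(qmul(qconj(q_l), block), q_r)

# Hypothetical tensor name, used only to derive a reproducible seed.
rng = random.Random(seed_from_name("model.layers.0.mlp.up_proj.weight"))
pairs = [(haar_unit_quaternion(rng), haar_unit_quaternion(rng)) for _ in range(16)]

v = (0.5, -1.0, 2.0, 0.25)
q_l, q_r = pairs[0]
w = rotate(v, q_l, q_r)
back = unrotate(w, q_l, q_r)
print("norm before:", math.sqrt(sum(c * c for c in v)))
print("norm after: ", math.sqrt(sum(c * c for c in w)))
print("round trip: ", back)
```

Because the rotation is an isometry, the block norm is unchanged and decoding only needs the same seed, which is why the quaternions themselves never have to be stored.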

The rotation is intended to spread within-block correlations and outliers more evenly across coordinates so the per-group dynamic range — which sets affine quantization quality — is more uniform across blocks.
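To make the dynamic-range argument concrete, here is a small sketch of group-wise asymmetric affine quantization. It assumes a simple min/max scheme that stores the group minimum as a floating-point offset; the repository may store an integer zero-point or clip differently, and the function names are illustrative only. A single outlier in a 64-element group stretches the range from 2 to 9, so the 4-bit step grows from 2/15 to 9/15 and every other element in the group is quantized more coarsely.

```python
def affine_quantize(values, bits=4):
    """Asymmetric affine quantisation of one group to codes in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo

def affine_dequantize(codes, scale, lo):
    """Map integer codes back to real values."""
    return [lo + c * scale for c in codes]

# A well-behaved group spanning [-1, 1], and the same group with one outlier.
calm = [-1.0 + 2.0 * i / 63 for i in range(64)]
spiky = list(calm)
spiky[10] = 8.0

for name, group in (("calm", calm), ("spiky", spiky)):
    codes, scale, lo = affine_quantize(group)
    recon = affine_dequantize(codes, scale, lo)
    worst = max(abs(a - b) for a, b in zip(group, recon))
    print(f"{name}: step={scale:.4f} worst_error={worst:.4f}")
```

The worst-case reconstruction error is bounded by half a step, so anything that narrows the per-group range directly tightens that bound.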

The per-tensor seed is recomputed from the layer name (a 32-bit hash), so nothing is stored for it in the artifact. The per-group ρ adds one BF16 value per group (~5.5 % overhead at 4 bits). Quantized codes are currently stored as uint8, one code per byte; packing two 4-bit codes per byte is a future 2x storage optimization.
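The text does not say which baseline the ~5.5 % figure is measured against. The arithmetic below works through three candidate readings; the 16-bit scale and 16-bit zero-point per group are an assumption introduced here, not something the README states. Under that assumption the second reading comes out near the quoted figure.

```python
GROUP_SIZE = 64
BITS = 4
RHO_BITS = 16  # one BF16 norm per group

code_bits = GROUP_SIZE * BITS  # 256 bits of packed 4-bit codes per group
affine_meta_bits = 2 * 16      # ASSUMPTION: 16-bit scale plus 16-bit zero-point

# Reading 1: rho relative to the packed codes alone (about 6.3 %).
overhead_vs_codes = RHO_BITS / code_bits

# Reading 2: rho relative to packed codes plus affine metadata (about 5.6 %).
overhead_vs_codes_and_meta = RHO_BITS / (code_bits + affine_meta_bits)

# Reading 3: rho relative to the current one-byte-per-code storage (about 3.1 %).
overhead_vs_uint8 = RHO_BITS / (GROUP_SIZE * 8)

for label, value in (
    ("vs packed codes", overhead_vs_codes),
    ("vs packed codes + scale/zero-point", overhead_vs_codes_and_meta),
    ("vs uint8 codes", overhead_vs_uint8),
):
    print(f"{label}: {value * 100:.3f} %")
```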

Smoke results (2026-05-08)

This iteration ships only QuadRotor-Full at 4 bits, evaluated on two small models. Fast and 2D variants, more bit widths, and KV-cache compression are out of scope.

Synthetic stage-1 reconstruction MSE

10 000 unit-norm isotropic Gaussian vectors per row, group_size=64, 4 bits, seed 0. Plain affine 4-bit is the baseline.
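A scaled-down sketch of the affine baseline in this experiment follows. It assumes that "MSE" means the squared reconstruction error summed over each unit-norm vector and averaged over vectors; the README does not spell out the normalisation. It also uses far fewer vectors than the 10 000 above and plain min/max affine quantization, so it should only land in the same ballpark as the tabulated values, not reproduce them digit for digit.

```python
import math
import random

def unit_gaussian_vector(d, rng):
    """Isotropic Gaussian vector rescaled to unit L2 norm."""
    x = [rng.gauss(0.0, 1.0) for _ in range(d)]
    n = math.sqrt(sum(v * v for v in x))
    return [v / n for v in x]

def affine_roundtrip(group, bits=4):
    """Quantise one group with min/max asymmetric affine, then dequantise."""
    levels = (1 << bits) - 1
    lo, hi = min(group), max(group)
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((v - lo) / scale) * scale for v in group]

def baseline_mse(d=128, n_vectors=200, group_size=64, bits=4, seed=0):
    """Mean over vectors of the summed squared error (ASSUMED normalisation)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_vectors):
        x = unit_gaussian_vector(d, rng)
        for start in range(0, d, group_size):
            g = x[start:start + group_size]
            g_hat = affine_roundtrip(g, bits)
            total += sum((a - b) ** 2 for a, b in zip(g, g_hat))
    return total / n_vectors

mse = baseline_mse()
print(f"affine 4-bit MSE (d=128, 200 vectors): {mse:.6f}")
```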

d	bits	affine MSE	QuadRotor-Full MSE	ratio (QuadRotor / affine)
128	4	0.008028	0.008025	1.000
256	4	0.008040	0.008049	1.001
512	4	0.008023	0.008029	1.001

On isotropic synthetic data the rotation is a wash — random rotations of isotropic noise stay isotropic. The interesting comparison is on real (weakly-anisotropic, heavy-tailed) LLM weights, below.

WikiText-2 perplexity (seq_len 2048, MPS)

Model	BF16	Affine 4-bit	QuadRotor-Full 4-bit	Δ rotor−affine
TinyLlama-1.1B-Chat-v1.0	7.9722	8.5230	8.5660	+0.0431
SmolLM2-1.7B-Instruct	8.9391	11.0134	11.0197	+0.0063

Headline takeaway: random Haar quaternions don't help. Across two architectures and very different degradation regimes (TinyLlama: ~7 % PPL increase from BF16; SmolLM2: ~23 %), the SO(4) rotation step applied with random per-tensor quaternions is approximately neutral vs. plain group-wise affine 4-bit quantization. On TinyLlama it's slightly worse (Δ +0.04); on SmolLM2 it's a wash (Δ +0.01).
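The percentages quoted above can be rechecked from the table with a few lines of arithmetic. The snippet uses only the rounded table entries, so the recomputed TinyLlama delta comes out at 0.0430 rather than the tabulated +0.0431, which presumably was computed from unrounded perplexities. Relative to the affine baseline, the rotation costs roughly 0.5 % perplexity on TinyLlama and under 0.1 % on SmolLM2.

```python
# (BF16, affine 4-bit, QuadRotor-Full 4-bit) perplexities from the table above.
results = {
    "TinyLlama-1.1B-Chat-v1.0": (7.9722, 8.5230, 8.5660),
    "SmolLM2-1.7B-Instruct": (8.9391, 11.0134, 11.0197),
}

summary = {}
for model, (bf16, affine, rotor) in results.items():
    summary[model] = {
        "affine_vs_bf16_pct": (affine / bf16 - 1.0) * 100.0,
        "rotor_minus_affine": rotor - affine,
        "rotor_vs_affine_pct": (rotor / affine - 1.0) * 100.0,
    }
    print(model, {k: round(v, 4) for k, v in summary[model].items()})
```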

This is consistent with the underlying intuition: a random rotation preserves each block's norm and, when the weights are already close to isotropic within a block, leaves the distribution of coordinate magnitudes largely unchanged, so per-group affine quantization sees similar dynamic ranges. To extract a real benefit, the rotation likely needs to be calibrated, e.g. learned from data to minimize quantization error on representative inputs (TurboQuant-style optimisation on the manifold). This iteration deliberately uses the lightweight fixed-randomized variant from §5.5 of the paper ("yielding random block rotations analogous to the randomized transform used by TurboQuant"). A calibration-based variant is left for future work.

Generation samples

Three fixed prompts, 64 tokens, greedy decoding. Full JSON in benchmarks/results/tinyllama_gen.json and benchmarks/results/smollm2_gen.json.

As an eyeball check, the 4-bit completions from TinyLlama and SmolLM2 look substantively similar to the BF16 baseline at this prompt scale. Both the affine and QuadRotor variants show the typical short-prompt repetition artifact ("Paris is the capital of France. Paris is …") that affects greedy decoding from quantized small models. The rotation step adds no visible degradation beyond that.

Models

Weight-quantized HF artifacts produced by this repo:

These are loadable via quadrotor.state_dict.decode_state_dict (see src/quadrotor/state_dict.py). A transformers-compatible loader that monkeypatches from_pretrained is future work.

Install

pip install -e .
# Plus benchmark deps to reproduce the numbers above:
pip install -e ".[benchmarks]"

CLI

quadrotor-quantize \
    --src ./tinyllama-bf16 \
    --out ./tinyllama-quadrotor-full-4bit \
    --bits 4 --group-size 64

Tests

pytest -q

25 tests covering quaternion algebra, the SO(4) sandwich round-trip, affine round-trip error bounds, the per-tensor and state-dict pipelines, synthetic-MSE reproduction, and the HF directory round-trip.

Status

Smoke iteration complete. Pinned scope: QuadRotor-Full only, 4 bits, two ≤2 B models, MPS-only inference. Out of scope for this iteration:

- QuadRotor-Fast (single quaternion sandwich) and -2D variants
- More bit widths (2, 3, 5, 6, 8)
- KV-cache compression mode (the original paper's primary use case)
- Calibration-based quaternion learning (the most promising next step given the smoke null result)
- MLX / Metal kernels for speed
- Bit-packed nibble storage (currently uint8, one code per byte)
- A transformers loader that auto-decodes on from_pretrained

License

MIT — see LICENSE.
