OSCAR captures Q/K/V activations on a small calibration set, estimates attention-aware K/V covariance structures offline, and derives per-layer rotations + clipping thresholds that align KV quantization with the directions attention actually consumes. The result is INT2 storage for the bulk of the KV cache plus a small BF16 sink + recent window — ~7× compression of the KV-cache memory footprint vs BF16, with single-digit pp accuracy drop on GPQA for the dense reasoning models we validated.
OSCAR is built directly into the open-source SGLang framework: clone the repo, set up the single environment, and run the dump, rotation, and evaluation scripts end to end. It works out of the box, and we also provide a rotation zoo so users can download calibrated rotations directly instead of recomputing them.
- [Upcoming] OSCAR is being tested with MiniMax-M2.7 and GLM on 200K long-horizon agentic tasks. Happy to see OSCAR being used in the wild!
- [2026-05-18] Full release: paper, code, website, and RotationZoo are all live — runs out of the box on SGLang.
- Main results
- Layout
- Setup
- Quick start (Qwen3-8B example)
- All configured models
- How the rotation is fit (spectral covariance)
- Serving with the rotation
- Calibration knobs
- Citation
- License & acknowledgements
Multi-Modal & LongBench
Use the rotation and run script in the zhongzhu/VL branch.

OCRBench comparison
| Method | Qwen3-VL-8B | Qwen3-VL-4B |
|---|---|---|
| 16-bit Baseline | 858 | 852 |
| QuaRot (INT2) | 722 | 773 |
| RotateKV (INT2) | 754 | 638 |
| KIVI (INT2) | 851 | 813 |
| OTT (INT2) | 850 | 831 |
| TurboQuant+ (2.5-bit) | 847 | 828 |
| OSCAR (Lloyd-Max) | 854 | 848 |
Omni-Modal LLMs: MMAU-Pro
| Method (Qwen3-Omni-30B-A3B) | Open-ended | Good Rate | AIF |
|---|---|---|---|
| 16-bit Baseline | 66.2 | 27.8 | 87.4 |
| KIVI (INT2) | 65.8 | 27.0 | 78.2 |
| OTT (INT2) | 65.8 | 26.9 | 83.9 |
| TurboQuant+ (2.5-bit) | 66.6 | 27.0 | 79.3 |
| OSCAR | 67.4 | 33.8 | 89.7 |
LongBench-E comparison
| Method | Qwen3-8B |
|---|---|
| 16-bit Baseline | 49.56 |
| QuaRot (INT2) | 40.13 |
| RotateKV (INT2) | 42.95 |
| KIVI (INT2) | 47.95 |
| OTT (INT2) | 48.21 |
| TurboQuant+ (2.5-bit) | 47.56 |
| OSCAR | 50.25 |
Setup. Each cell is the MEAN across 5 reasoning / coding benchmarks — GPQA, HumanEval, LiveCodeBench v6, AIME 25, MATH-500. To control single-seed variance, every benchmark is evaluated 5 times per (model, method) cell (3 times for GLM-4.7-FP8) and the per-seed scores are averaged before being averaged across benchmarks. TurboQuant rows are single-run (*) because its vLLM path is too slow for repeated 32K-context evaluations under our compute budget. All runs use 32K-token max generation length. BPE = effective bits per KV element at 128K context length. Higher is better; the BF16 row is the upper bound.
| Method | BPE | Qwen3-4B Thinking | Qwen3-8B | Qwen3-32B | GLM-4.7-FP8 (358B) |
|---|---|---|---|---|---|
| BF16 (upper bound) | 16.00 | 75.64 | 70.84 | 74.19 | 77.89 |
| Saw-INT4 | 4.25 | 73.11 | 69.97 | 74.43 | 77.95 |
| TurboQuant K3V3 * | 3.25 | 31.74 | 56.88 | 71.99 | 78.15 |
| QuaRot-INT2 | 2.25 | 1.40 | 10.14 | 7.90 | 75.14 |
| Naive INT2 | 2.25 | 0.00 | 0.00 | 0.00 | 60.49 |
| OSCAR (ours) | 2.28 | 71.86 | 69.42 | 74.17 | 78.16 |
| Gap of OSCAR vs BF16 | | −3.78 | −1.42 | −0.02 | +0.27 |
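
The BPE column can be sanity-checked with a little arithmetic. The sketch below is not taken from the OSCAR code; it assumes that each 128-element INT2 group carries exactly one float32 scale and no zero point, and that the BF16 portion is the 64-token sink plus 256-token recent window listed among the serving flags in this README. The function name `effective_bpe` and its parameters are hypothetical. Under those assumptions the result lands on 2.28 at 128K context and 2.38 at 32K.

```python
def effective_bpe(context_len, prefix=64, recent=256, code_bits=2,
                  scale_bits=32, group_size=128, hp_bits=16):
    """Average bits per KV element for a mixed INT2/BF16 cache (illustrative)."""
    # Tokens kept in high precision: sink + recent window (assumed values).
    hp_tokens = min(context_len, prefix + recent)
    lp_tokens = context_len - hp_tokens
    # Low-precision cost: 2-bit codes plus one scale amortised over a group.
    lp_bits = code_bits + scale_bits / group_size
    return (hp_tokens * hp_bits + lp_tokens * lp_bits) / context_len


bpe_128k = effective_bpe(128 * 1024)
bpe_32k = effective_bpe(32 * 1024)
print(f"{bpe_128k:.2f} {bpe_32k:.2f} {16 / bpe_128k:.1f}x")
```

The ratio 16 / 2.28 is where the roughly 7× compression figure comes from; the BF16 window is a fixed number of tokens, so its share of the budget shrinks as the context grows.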
Baseline notes — TurboQuant / QuaRot / Saw-INT4 / Naive INT2 configurations
For a fair comparison at a comparable bit-budget, TurboQuant results use vLLM's implementation (docs) modified so that all layers are quantized (no mixed precision); the original TurboQuant keeps the first, last, and selected middle layers in full precision. We run it in its K3V3 configuration (3-bit K, 3-bit V) to land near the OSCAR bit-budget.
QuaRot-INT2 is the standard 2-bit KV-quant recipe (data-free Hadamard rotation per layer). Saw-INT4 is an INT4 reference for context. Naive INT2 is per-token symmetric INT2 with no rotation.
* TurboQuant entries are single-run results because its vLLM path is too slow for repeated 32K-context evaluations under our compute budget.
Comparison with other INT2 KV-cache methods on AIME25
Most prior INT2 KV-cache methods do not provide framework-level support for efficient long-context generation, so 32K-generation evaluations are extremely slow and their papers do not report the full benchmark suite above. For this reason, we compare against the reported AIME25 setting where public numbers are available.
| Method | BPE | Qwen3-8B | Qwen3-32B |
|---|---|---|---|
| Original BF16 | 16.00 | 66.00 +/- 7.33 | 72.59 +/- 7.41 |
| KIVI-KV2 | 2.25 | 52.33 +/- 9.00 | 57.41 +/- 9.26 |
| KIVI-KV2* | 2.26 | 57.67 +/- 9.00 | 59.05 +/- 12.38 |
| Kitty | 2.39 | 59.67 +/- 10.33 | 69.26 +/- 9.26 |
| OSCAR (ours) | 2.38 | 66.67 +/- 3.33 | 74.00 +/- 5.48 |
OSCAR is the only INT2 method in this comparison that reaches BF16-level AIME25 accuracy at 32K generation while staying near a 2-bit KV-cache budget.
rotation/
  eval_oscar_gpqa.sh            generic GPQA eval driver
  eval_oscar_lcb.sh             generic LiveCodeBench v6 (128K) eval driver
  compute_kv_rotation.py        eigendecomposition + R·H·P_br composition
  _dump_compat/                 sgl_kernel compat shim for dump
  <model>/
    save_qkv_<model>.sh         phase 1: dump
    compute_rotation.sh         phase 2: rotation
    eval_gpqa.sh                phase 3: GPQA eval
    eval_lcb.sh                 phase 3: LCB v6 (128K) eval (where applicable)
    GPQA/
      seq<T>_prompt<N>_group<G>/
        qkv_dumps/              dump output
        rotations/              rotation .pt files
        _eval_gpqa_oscar/       eval results from this rotation
        _eval_lcb_v6_128k/      ...
sglang-research/                submodule: INT2 KV eval
sglang-dump-qkv/                vendored older sglang-fork: QKV dump (loaded via shim)
- 1 × H100 80 GB (for 4B/8B), 4 × H100 (for 32B / MiniMax-M2.7), 8 × H100 (for GLM-4.7-FP8)
- CUDA 12.8 or 12.9 (nvcc on `$PATH`)
- Python 3.12 + Conda
- HuggingFace access for the relevant model weights
git clone --recursive https://github.com/FutureMLS-Lab/OSCAR.git
cd OSCAR

OSCAR uses one conda env for both dump and eval. The dump-side sglang
(vendored as sglang-dump-qkv/) was originally built against an older
sgl_kernel; OSCAR ships a thin rotation/_dump_compat/ shim that stubs
the dropped legacy symbols at import time and falls back to PyTorch for
the runtime sampling kernels it references, so a single eval-side env
suffices.
conda create -n oscar python=3.12 -y
conda activate oscar
# Eval-side sglang (editable so future patches stick)
pip install -e sglang-research/python
# CUDA-12.8/12.9 compatible flashinfer + sgl_kernel build
# (see https://github.com/sgl-project/sglang for matching wheels)

If nvcc and PyTorch's CUDA versions diverge (e.g. nvcc 12.6 but torch
built for 12.8), the JIT kernels in flashinfer may fail to compile. Pin
CUDA_HOME to the matching cuda-12.x directory before launching.
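
A quick way to spot such a mismatch before launching is to compare the toolkit release with the version PyTorch reports. The helper below is a minimal sketch, not part of OSCAR; the function names are hypothetical, and it assumes the usual `release X.Y` wording in the `nvcc --version` banner.

```python
import re


def parse_nvcc_release(nvcc_output):
    """Extract 'major.minor' from the text printed by `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return f"{match.group(1)}.{match.group(2)}" if match else None


def cuda_versions_match(nvcc_release, torch_cuda):
    """Compare major.minor of the toolkit with the version torch was built for."""
    if nvcc_release is None or torch_cuda is None:
        return False
    return nvcc_release.split(".")[:2] == torch_cuda.split(".")[:2]


# Sample banner line; on a real machine take the text from
# subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout
# and the second argument from torch.version.cuda.
sample = "Cuda compilation tools, release 12.6, V12.6.85"
release = parse_nvcc_release(sample)
print(release, cuda_versions_match(release, "12.8"))
```

If the check reports a mismatch, pointing CUDA_HOME at the toolkit that matches the PyTorch build is the fix described above.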
End-to-end on a single H100, ~20 minutes total.
cd OSCAR
# Phase 1 — dump Q/K/V (TP=1, default DUMP_KVCACHE_TOKENS=30000)
bash rotation/qwen3-8B/save_qkv_8b.sh
# → writes rotation/qwen3-8B/GPQA/seq30000_prompt<N>_group128/qkv_dumps/
# Phase 2 — fit the calibrated rotation
bash rotation/qwen3-8B/compute_rotation.sh
# → writes rotation/qwen3-8B/GPQA/seq30000_prompt<N>_group128/rotations/{k,v}_rotation_qqt_r_h_pbr.pt
# Phase 3 — GPQA eval against the rotation we just produced
ROT_DIR=rotation/qwen3-8B/GPQA/seq30000_prompt<N>_group128/rotations \
bash rotation/qwen3-8B/eval_gpqa.sh
# → writes results to rotation/qwen3-8B/GPQA/seq30000_prompt<N>_group128/_eval_gpqa_oscar/

Pick the actual seq..._prompt..._group... tag printed by phase 1, or:
ROT_DIR=$(ls -1d rotation/qwen3-8B/GPQA/seq*_prompt*_group*/rotations | tail -1) \
bash rotation/qwen3-8B/eval_gpqa.sh

| Folder | HF model | TP (dump) | TP (eval) | Notes |
|---|---|---|---|---|
| `rotation/qwen3-4B-thinking-2507/` | `Qwen/Qwen3-4B-Thinking-2507` | 1 | 1 | thinking model |
| `rotation/qwen3-8B/` | `Qwen/Qwen3-8B` | 1 | 1 | |
| `rotation/qwen3-32B/` | `Qwen/Qwen3-32B` | 2-4 | 4 | |
| `rotation/MiniMax-M2.7/` | `MiniMaxAI/MiniMax-M2.7` | 4 | 4 | FP8 weights, --reasoning-parser minimax-append-think |
| `rotation/GLM-4.7/` | `zai-org/GLM-4.7-FP8` | 8 | 8 | FP8 weights, 92 layers |
For each transformer layer, given calibration (Q, K, V) activations, OSCAR estimates two attention-aware covariance matrices and uses their eigenspectra to derive rotations:
- K covariance (`qqt`): average attention-query covariance seen by K: `Σ_K = (1/H_kv) · Σ_h Q_h^T Q_h / n_tokens` (GQA-aware: query heads grouped under the matching KV head)
- V covariance (`sst`): score-weighted V-side covariance: `Σ_V = (1/H_kv) · Σ_h V_h^T diag(w_h) V_h / n_tokens`, where `w_h[t] = K_h[t] · (Q^T Q) · K_h[t]^T` is the per-token attention-score weight derived from K and the Q covariance
- `torch.linalg.eigh(Σ)` → orthogonal eigenvectors `R` plus the eigenvalues (used for ordering, not for scaling)
- Composition `r_h_pbr`: `R_loaded = R · H_d · P_br`
  - `H_d`: head-dim Hadamard
  - `P_br`: bit-reversal permutation, sorted by eigenvalue magnitude; this interleaves high-variance directions evenly across quant groups so no single group concentrates outliers
Saved as fp32 per-layer (head_dim, head_dim) orthogonal matrices in
<calib_dir>/rotations/{k,v}_rotation_qqt_r_h_pbr.pt.
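
The fitting steps above can be sketched in a few lines. This is an illustration written with NumPy rather than the repository's `compute_kv_rotation.py`, and several details are assumptions: the tensor layouts, the use of Σ_K as the Q covariance inside the V weights, and the way eigenvalue ordering enters the composition (here the eigenvectors are sorted by descending eigenvalue before applying `H_d` and a plain bit-reversal permutation). All function names are hypothetical.

```python
import numpy as np


def k_covariance(Q, n_kv_heads):
    """Q: (n_q_heads, n_tokens, head_dim) -> (head_dim, head_dim)."""
    n_tokens = Q.shape[1]
    return np.einsum("htd,hte->de", Q, Q) / (n_kv_heads * n_tokens)


def v_covariance(K, V, q_cov):
    """K, V: (n_kv_heads, n_tokens, head_dim); q_cov stands in for Q^T Q."""
    n_kv_heads, n_tokens, _ = V.shape
    # w[h, t] = K_h[t] · (Q^T Q) · K_h[t]^T
    w = np.einsum("htd,de,hte->ht", K, q_cov, K)
    return np.einsum("htd,ht,hte->de", V, w, V) / (n_kv_heads * n_tokens)


def hadamard(d):
    """Orthonormal Sylvester Hadamard matrix; d must be a power of two."""
    assert d >= 2 and d & (d - 1) == 0
    H = np.ones((1, 1))
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(d)


def bit_reversal_permutation(d):
    """Index i maps to the integer whose log2(d)-bit pattern is reversed."""
    bits = d.bit_length() - 1
    return np.array([int(format(i, f"0{bits}b")[::-1], 2) for i in range(d)])


def compose_rotation(sigma):
    """R · H_d · P_br with eigenvectors ordered by descending eigenvalue."""
    d = sigma.shape[0]
    eigvals, R = np.linalg.eigh(sigma)          # eigenvalues come back ascending
    R = R[:, np.argsort(eigvals)[::-1]]         # used for ordering only
    P_br = np.eye(d)[:, bit_reversal_permutation(d)]
    return R @ hadamard(d) @ P_br


rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 256, 8))            # toy sizes: 4 query heads, head_dim 8
K = rng.standard_normal((2, 256, 8))            # 2 KV heads
V = rng.standard_normal((2, 256, 8))
sigma_k = k_covariance(Q, n_kv_heads=2)
R_k = compose_rotation(sigma_k)
R_v = compose_rotation(v_covariance(K, V, sigma_k))
print(np.allclose(R_k.T @ R_k, np.eye(8)), np.allclose(R_v.T @ R_v, np.eye(8)))
```

Because every factor is orthogonal, the composed matrix is orthogonal too, which is the property a rotation file should satisfy when it is loaded.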
The eval drivers eval_oscar_gpqa.sh and eval_oscar_lcb.sh set everything up for you. The underlying sglang environment variables and server flags are:
SGLANG_ENABLE_MIXED_KV_WINDOWS=1 \
SGLANG_OSCAR_K_ROTATION_PATH=.../k_rotation_qqt_r_h_pbr.pt \
SGLANG_OSCAR_V_ROTATION_PATH=.../v_rotation_sst_r_h_pbr.pt \
SGLANG_OSCAR_K_CLIP_RATIO=0.96 \
SGLANG_OSCAR_V_CLIP_RATIO=0.92 \
SGLANG_OSCAR_ABSORB_V_ROTATION=1 \
SGLANG_MIXED_KV_PREFIX_TOKENS=64 \
SGLANG_MIXED_KV_RECENT_TOKENS=256 \
SGLANG_MIXED_KV_HP_MAX_SPLITS=8 \
SGLANG_MIXED_KV_HP_DTYPE=bfloat16 \
SGLANG_MIXED_KV_SCALE_DTYPE=float32 \
python -m sglang.launch_server \
--model-path <model> \
--tensor-parallel-size <tp> \
--kv-cache-dtype int2 \
--kv-cache-quant-group-size 128 \
--prefill-attention-backend fa3 \
--decode-attention-backend triton \
--trust-remote-code

Sink (PREFIX_TOKENS) and recent window (RECENT_TOKENS) stay in BF16; the rest of the cache is INT2-quantized into 128-element groups along head-dim.
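
To make the mixed layout concrete, here is a small NumPy sketch of which tokens stay in high precision and how a 128-element group could be reduced to 2-bit codes. It is not the SGLang kernel: the quantizer shown is a plain symmetric four-level scheme with one scale per group, and the way `clip_ratio` shrinks the range is an assumption made for illustration. All function names are hypothetical.

```python
import numpy as np


def bf16_token_mask(n_tokens, prefix=64, recent=256):
    """True for tokens kept in BF16: the first `prefix` and last `recent`."""
    idx = np.arange(n_tokens)
    return (idx < prefix) | (idx >= n_tokens - recent)


def quantize_int2(x, group_size=128, clip_ratio=0.96):
    """Symmetric 4-level quantization per group (assumed scheme)."""
    g = x.reshape(-1, group_size).astype(np.float32)
    absmax = np.abs(g).max(axis=1, keepdims=True)
    # Four levels at ±0.5·scale and ±1.5·scale cover [-2·scale, 2·scale).
    scale = np.maximum(clip_ratio * absmax / 2.0, 1e-12)
    codes = np.clip(np.floor(g / scale), -2, 1).astype(np.int8) + 2
    return codes.astype(np.uint8), scale


def dequantize_int2(codes, scale, shape):
    """Map codes 0..3 back to the level centres."""
    return ((codes.astype(np.float32) - 1.5) * scale).reshape(shape)


rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 128)).astype(np.float32)   # (tokens, head_dim)
mask = bf16_token_mask(len(x))
low_precision = x[~mask]                                   # tokens that get quantized
codes, scale = quantize_int2(low_precision, clip_ratio=1.0)
x_hat = dequantize_int2(codes, scale, low_precision.shape)
print(int(mask.sum()), int(codes.min()), int(codes.max()))
```

With `clip_ratio=1.0` the reconstruction error of every element is at most half a scale step; a ratio below 1 trades a larger error on the few largest values for finer resolution on the rest.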
Override any knob by setting it as an environment variable (ENV=val) for bash rotation/<model>/save_qkv_<model>.sh:
| Env | Default | Effect |
|---|---|---|
| `DUMP_KVCACHE_TOKENS` | 30000 | Total token budget for calibration |
| `GROUP_SIZE` | 128 | KV quant group size, encoded in output dir name |
| `DATASET` | GPQA | Calibration dataset name |
| `MODEL` | per-model HF id | HuggingFace model id |
| `TP_SIZE` | per-model | Tensor parallel size for dump |
| `GPU` | per-model | CUDA_VISIBLE_DEVICES |
| `HF_HOME` | `/shared/huggingface` | HF cache (set to `$HOME/.cache/huggingface` on a fresh machine) |
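
When the dump is driven from Python instead of a shell, the same overrides can be passed through the process environment. The helper below is a hypothetical sketch that assumes the dump script reads these knobs from its environment, as the "Env" column suggests; the value 60000 is an arbitrary illustration, not a recommended setting.

```python
import os


def dump_invocation(script, overrides, base_env=None):
    """Return (argv, env) for launching a dump script with knob overrides."""
    env = dict(os.environ if base_env is None else base_env)
    # Environment values must be strings.
    env.update({name: str(value) for name, value in overrides.items()})
    return ["bash", script], env


cmd, env = dump_invocation(
    "rotation/qwen3-8B/save_qkv_8b.sh",
    {
        "DUMP_KVCACHE_TOKENS": 60000,   # illustrative value only
        "GROUP_SIZE": 128,
        "HF_HOME": os.path.expanduser("~/.cache/huggingface"),
    },
    base_env={},
)
print(cmd, sorted(env))
# To actually launch: subprocess.run(cmd, env={**os.environ, **env}, check=True)
```

Since GROUP_SIZE is encoded in the output directory name, changing it also changes the calibration directory that later phases need to be pointed at.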
@misc{zhou2026oscarofflinespectralcovarianceaware,
title={OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization},
author={Zhongzhu Zhou and Donglin Zhuang and Jisen Li and Ziyan Chen and Shuaiwen Leon Song and Ben Athiwaratkun and Xiaoxia Wu},
year={2026},
eprint={2605.17757},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2605.17757},
}

- Released under the MIT License.
- Built on top of sglang.


