FutureMLS-Lab/OSCAR

OSCAR

Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

Paper   Website   HuggingFace RotationZoo

OSCAR captures Q/K/V activations on a small calibration set, estimates attention-aware K/V covariance structures offline, and derives per-layer rotations + clipping thresholds that align KV quantization with the directions attention actually consumes. The result is INT2 storage for the bulk of the KV cache plus a small BF16 sink + recent window — ~7× compression of the KV-cache memory footprint vs BF16, with single-digit pp accuracy drop on GPQA for the dense reasoning models we validated.

[Figure: OSCAR pipeline]

OSCAR is built directly into the open-source SGLang framework: clone the repo, set up the single environment, and run the dump, rotation, and evaluation scripts end to end. It works out of the box, and we also provide a rotation zoo so users can download calibrated rotations directly instead of recomputing them.

🔥 Latest News

  • [Upcoming] OSCAR is being tested with MiniMax-M2.7 and GLM on 200K-token long-horizon agentic tasks. Happy to see OSCAR being used in the wild!
  • [2026-05-18] Full release: paper, code, website, and RotationZoo are all live — runs out of the box on SGLang.

Main results

For the multi-modal and LongBench results below, use the rotations and run scripts in the zhongzhu/VL branch.

OCRBench comparison

| Method | Qwen3-VL-8B | Qwen3-VL-4B |
|---|---|---|
| 16-bit Baseline | 858 | 852 |
| QuaRot (INT2) | 722 | 773 |
| RotateKV (INT2) | 754 | 638 |
| KIVI (INT2) | 851 | 813 |
| OTT (INT2) | 850 | 831 |
| TurboQuant+ (2.5-bit) | 847 | 828 |
| OSCAR (Lloyd-Max) | 854 | 848 |

Omni-Modal LLMs: MMAU-Pro

| Method (Qwen3-Omni-30B-A3B) | Open-ended | Good Rate | AIF |
|---|---|---|---|
| 16-bit Baseline | 66.2 | 27.8 | 87.4 |
| KIVI (INT2) | 65.8 | 27.0 | 78.2 |
| OTT (INT2) | 65.8 | 26.9 | 83.9 |
| TurboQuant+ (2.5-bit) | 66.6 | 27.0 | 79.3 |
| OSCAR | 67.4 | 33.8 | 89.7 |

LongBench-E comparison

| Method | Qwen3-8B |
|---|---|
| 16-bit Baseline | 49.56 |
| QuaRot (INT2) | 40.13 |
| RotateKV (INT2) | 42.95 |
| KIVI (INT2) | 47.95 |
| OTT (INT2) | 48.21 |
| TurboQuant+ (2.5-bit) | 47.56 |
| OSCAR | 50.25 |

Setup. Each cell is the mean across 5 reasoning / coding benchmarks: GPQA, HumanEval, LiveCodeBench v6, AIME25, and MATH-500. To control single-seed variance, every benchmark is evaluated 5 times per (model, method) cell (3 times for GLM-4.7-FP8), and the per-seed scores are averaged within each benchmark before the benchmark means are averaged. TurboQuant rows are single-run (*) because its vLLM path is too slow for repeated 32K-context evaluations under our compute budget. All runs use a 32K-token maximum generation length. BPE = effective bits per KV element at 128K context length. Higher is better; the BF16 row is the upper bound.
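The aggregation described in the setup can be sketched as a mean of per-benchmark means. This is only an illustration, not the repository's evaluation code: `cell_score` is a hypothetical helper, and the scores are made-up placeholders rather than reported results.

```python
from statistics import mean


def cell_score(per_benchmark_runs):
    """Average repeated runs within each benchmark, then across benchmarks."""
    per_benchmark_means = {
        name: mean(runs) for name, runs in per_benchmark_runs.items()
    }
    overall = mean(per_benchmark_means.values())
    return overall, per_benchmark_means


# Placeholder numbers only (NOT results from the paper).
runs = {
    "bench_a": [60.0, 62.0, 61.0],
    "bench_b": [80.0, 82.0],
}
overall, by_benchmark = cell_score(runs)
print(by_benchmark, overall)
```

Averaging within each benchmark first means a benchmark with fewer repeats still carries the same weight in the final cell as one with more.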

| Method | BPE | Qwen3-4B Thinking | Qwen3-8B | Qwen3-32B | GLM-4.7-FP8 (358B) |
|---|---|---|---|---|---|
| BF16 (upper bound) | 16.00 | 75.64 | 70.84 | 74.19 | 77.89 |
| Saw-INT4 | 4.25 | 73.11 | 69.97 | 74.43 | 77.95 |
| TurboQuant K3V3 * | 3.25 | 31.74 | 56.88 | 71.99 | 78.15 |
| QuaRot-INT2 | 2.25 | 1.40 | 10.14 | 7.90 | 75.14 |
| Naive INT2 | 2.25 | 0.00 | 0.00 | 0.00 | 60.49 |
| OSCAR (ours) | 2.28 | 71.86 | 69.42 | 74.17 | 78.16 |
| Gap of OSCAR vs BF16 | | −3.78 | −1.42 | −0.02 | +0.27 |
Baseline notes — TurboQuant / QuaRot / Saw-INT4 / Naive INT2 configurations

For a fair comparison at a comparable bit-budget, TurboQuant results use vLLM's implementation (docs) modified so that all layers are quantized (no mixed precision); the original TurboQuant keeps the first, last, and selected middle layers in full precision. We run it in its K3V3 configuration (3-bit K, 3-bit V) to land near the OSCAR bit-budget.

QuaRot-INT2 is the standard 2-bit KV-quant recipe (data-free Hadamard rotation per layer). Saw-INT4 is an INT4 reference for context. Naive INT2 is per-token symmetric INT2 with no rotation.

* TurboQuant entries are single-run results because its vLLM path is too slow for repeated 32K-context evaluations under our compute budget.
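To make the "per-token symmetric INT2" baseline from the notes above concrete, here is a minimal sketch of one way such a quantizer can be written. The four-level grid (±0.5 and ±1.5 steps), the function names, and the `clip_ratio` parameter are all assumptions for illustration; the repository's actual kernels may use a different grid or scale convention.

```python
import math


def quantize_int2_symmetric(row, clip_ratio=1.0):
    """Quantize one token's values to 2-bit codes (0..3) with a shared scale.

    Assumed grid: levels at (code - 1.5) * step, so the largest level equals
    clip_ratio * max|x|. Values beyond the clipped range saturate.
    """
    max_abs = max(abs(x) for x in row)
    step = (clip_ratio * max_abs) / 1.5 or 1.0  # avoid a zero step on all-zero rows
    codes = [min(3, max(0, math.floor(x / step + 2))) for x in row]
    return codes, step


def dequantize_int2(codes, step):
    """Map 2-bit codes back to their reconstruction levels."""
    return [(c - 1.5) * step for c in codes]


# A well-spread row is reconstructed exactly on this grid.
codes, step = quantize_int2_symmetric([-3.0, -1.0, 1.0, 3.0])
print(codes, dequantize_int2(codes, step))

# One outlier stretches the scale, so the small values lose almost all detail.
outlier_codes, outlier_step = quantize_int2_symmetric([0.1, -0.2, 0.3, 100.0])
print(outlier_codes, dequantize_int2(outlier_codes, outlier_step))
```

The second call shows the failure mode that rotations are meant to reduce: with only four levels, a single large coordinate determines the step for the whole token.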

Comparison with other INT2 KV-cache methods on AIME25

Most prior INT2 KV-cache methods do not provide framework-level support for efficient long-context generation, so 32K-generation evaluations are extremely slow and their papers do not report the full benchmark suite above. For this reason, we compare against the reported AIME25 setting where public numbers are available.

| Method | BPE | Qwen3-8B | Qwen3-32B |
|---|---|---|---|
| Original BF16 | 16.00 | 66.00 +/- 7.33 | 72.59 +/- 7.41 |
| KIVI-KV2 | 2.25 | 52.33 +/- 9.00 | 57.41 +/- 9.26 |
| KIVI-KV2* | 2.26 | 57.67 +/- 9.00 | 59.05 +/- 12.38 |
| Kitty | 2.39 | 59.67 +/- 10.33 | 69.26 +/- 9.26 |
| OSCAR (ours) | 2.38 | 66.67 +/- 3.33 | 74.00 +/- 5.48 |

OSCAR is the only INT2 method in this comparison that reaches BF16-level AIME25 accuracy at 32K generation while staying near a 2-bit KV-cache budget.

In the main results table, OSCAR is the only INT2 method that stays within a few pp of BF16 on every model, while QuaRot-INT2 and naive INT2 collapse on the Qwen3 models. Saw-INT4 is a strong INT4 reference; OSCAR stays within about 1.3 pp of it on the Qwen3 models and edges past it on GLM-4.7-FP8 **at roughly half the storage** (≈2 bits per KV element).

Layout

rotation/
  eval_oscar_gpqa.sh        generic GPQA eval driver
  eval_oscar_lcb.sh         generic LiveCodeBench v6 (128K) eval driver
  compute_kv_rotation.py    eigendecomposition + R·H·P_br composition
  _dump_compat/             sgl_kernel compat shim for dump
  <model>/
    save_qkv_<model>.sh     phase 1 — dump
    compute_rotation.sh     phase 2 — rotation
    eval_gpqa.sh            phase 3 — GPQA eval
    eval_lcb.sh             phase 3 — LCB v6 (128K) eval (where applicable)
    GPQA/
      seq<T>_prompt<N>_group<G>/
        qkv_dumps/          dump output
        rotations/          rotation .pt files
        _eval_gpqa_oscar/   eval results from this rotation
        _eval_lcb_v6_128k/  ...

sglang-research/            submodule — INT2 KV eval
sglang-dump-qkv/            vendored older sglang-fork — QKV dump (loaded via shim)

Setup

Requirements

  • 1 × H100 80 GB (for 4B/8B), 4 × H100 (for 32B / MiniMax-M2.7), 8 × H100 (for GLM-4.7-FP8)
  • CUDA 12.8 or 12.9 (nvcc on $PATH)
  • Python 3.12 + Conda
  • HuggingFace access for the relevant model weights

Clone

git clone --recursive https://github.com/FutureMLS-Lab/OSCAR.git
cd OSCAR

Conda env (single env, dump + eval)

OSCAR uses one conda env for both dump and eval. The dump-side sglang (vendored as sglang-dump-qkv/) was originally built against an older sgl_kernel; OSCAR ships a thin rotation/_dump_compat/ shim that stubs the dropped legacy symbols at import time and falls back to PyTorch for the runtime sampling kernels it references, so a single eval-side env suffices.

conda create -n oscar python=3.12 -y
conda activate oscar

# Eval-side sglang (editable so future patches stick)
pip install -e sglang-research/python

# CUDA-12.8/12.9 compatible flashinfer + sgl_kernel build
# (see https://github.com/sgl-project/sglang for matching wheels)

If nvcc and PyTorch's CUDA versions diverge (e.g. nvcc 12.6 but torch built for 12.8), the JIT kernels in flashinfer may fail to compile. Pin CUDA_HOME to the matching cuda-12.x directory before launching.
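A quick way to spot the mismatch described above is to compare the release reported by `nvcc --version` with `torch.version.cuda`. The sketch below is a hypothetical helper, not part of the repository; it assumes the usual "release X.Y" wording in nvcc's output.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(nvcc_output):
    """Extract 'major.minor' from `nvcc --version` output, or None if absent."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def cuda_versions_match(nvcc_release, torch_cuda):
    """True only when both versions are known and identical."""
    return nvcc_release is not None and nvcc_release == torch_cuda


def main():
    if shutil.which("nvcc") is None:
        print("nvcc not found on PATH")
        return
    out = subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout
    nvcc_release = parse_nvcc_release(out)
    try:
        import torch  # only needed for the comparison
    except ImportError:
        print(f"nvcc release {nvcc_release}; torch is not installed")
        return
    torch_cuda = torch.version.cuda
    status = "OK" if cuda_versions_match(nvcc_release, torch_cuda) else "MISMATCH"
    print(f"nvcc {nvcc_release} vs torch CUDA {torch_cuda}: {status}")


if __name__ == "__main__":
    main()
```

If it reports a mismatch, point CUDA_HOME at the toolkit directory that matches the PyTorch build before launching.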

Quick start (Qwen3-8B example)

End-to-end on a single H100, ~20 minutes total.

cd OSCAR

# Phase 1 — dump Q/K/V (TP=1, default DUMP_KVCACHE_TOKENS=30000)
bash rotation/qwen3-8B/save_qkv_8b.sh
# → writes rotation/qwen3-8B/GPQA/seq30000_prompt<N>_group128/qkv_dumps/

# Phase 2 — fit the calibrated rotation
bash rotation/qwen3-8B/compute_rotation.sh
# → writes rotation/qwen3-8B/GPQA/seq30000_prompt<N>_group128/rotations/{k,v}_rotation_qqt_r_h_pbr.pt

# Phase 3 — GPQA eval against the rotation we just produced
ROT_DIR=rotation/qwen3-8B/GPQA/seq30000_prompt<N>_group128/rotations \
  bash rotation/qwen3-8B/eval_gpqa.sh
# → writes results to rotation/qwen3-8B/GPQA/seq30000_prompt<N>_group128/_eval_gpqa_oscar/

Replace the seq..._prompt..._group... placeholder with the actual tag printed by phase 1, or pick the last matching directory automatically:

ROT_DIR=$(ls -1d rotation/qwen3-8B/GPQA/seq*_prompt*_group*/rotations | tail -1) \
  bash rotation/qwen3-8B/eval_gpqa.sh
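Note that `ls | tail -1` selects the lexicographically last directory, which is not necessarily the most recent calibration run. If several runs exist, a small Python helper can select by modification time instead. `latest_rotation_dir` is a hypothetical name; the directory pattern follows the layout described earlier.

```python
from pathlib import Path


def latest_rotation_dir(model_dir, dataset="GPQA"):
    """Return the most recently modified rotations/ directory for a model."""
    pattern = "seq*_prompt*_group*/rotations"
    candidates = [p for p in Path(model_dir, dataset).glob(pattern) if p.is_dir()]
    if not candidates:
        raise FileNotFoundError(f"no rotations found under {model_dir}/{dataset}")
    return max(candidates, key=lambda p: p.stat().st_mtime)


# Example usage (path is the Qwen3-8B folder from the quick start):
#   rot_dir = latest_rotation_dir("rotation/qwen3-8B")
#   then export ROT_DIR=<rot_dir> before running eval_gpqa.sh
```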

All configured models

| Folder | HF model | TP (dump) | TP (eval) | Notes |
|---|---|---|---|---|
| rotation/qwen3-4B-thinking-2507/ | Qwen/Qwen3-4B-Thinking-2507 | 1 | 1 | thinking model |
| rotation/qwen3-8B/ | Qwen/Qwen3-8B | 1 | 1 | |
| rotation/qwen3-32B/ | Qwen/Qwen3-32B | 2-4 | 4 | |
| rotation/MiniMax-M2.7/ | MiniMaxAI/MiniMax-M2.7 | 4 | 4 | FP8 weights, --reasoning-parser minimax-append-think |
| rotation/GLM-4.7/ | zai-org/GLM-4.7-FP8 | 8 | 8 | FP8 weights, 92 layers |

How the rotation is fit (spectral covariance)

For each transformer layer, given calibration (Q, K, V) activations, OSCAR estimates two attention-aware covariance matrices and uses their eigenspectra to derive rotations:

  • K covariance (qqt) — average attention-query covariance seen by K: Σ_K = (1/H_kv) · Σ_h Q_h^T Q_h / n_tokens (GQA-aware: query heads grouped under the matching KV head)
  • V covariance (sst) — score-weighted V-side covariance: Σ_V = (1/H_kv) · Σ_h V_h^T diag(w_h) V_h / n_tokens where w_h[t] = K_h[t] · (Q^T Q) · K_h[t]^T is the per-token attention-score weight derived from K and the Q covariance
  • torch.linalg.eigh(Σ) → orthogonal eigenvectors R plus the eigenvalues (used for ordering, not for scaling)
  • Composition r_h_pbr: R_loaded = R · H_d · P_br
    • H_d — head-dim Hadamard
    • P_br — bit-reversal permutation, sorted by eigenvalue magnitude; this interleaves high-variance directions evenly across quant groups so no single group concentrates outliers

Saved as fp32 per-layer (head_dim, head_dim) orthogonal matrices in <calib_dir>/rotations/{k,v}_rotation_qqt_r_h_pbr.pt.
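The K-side steps above can be sketched in NumPy as follows. This is a simplified reading of the description, not the code in compute_kv_rotation.py: the function names are hypothetical, the head pooling is one interpretation of the GQA-aware average, and the way the eigenvalue ordering interacts with P_br (columns sorted by descending eigenvalue, then Hadamard, then bit reversal) is an assumption. The V side would follow the same pattern with the score-weighted covariance.

```python
import numpy as np


def k_covariance(q, n_kv_heads):
    """Sigma_K from queries q of shape (n_tokens, n_q_heads, head_dim).

    All query heads are pooled and normalized by H_kv * n_tokens
    (assumed reading of the formula in the text).
    """
    n_tokens = q.shape[0]
    sigma = np.einsum("thi,thj->ij", q, q)
    return sigma / (n_kv_heads * n_tokens)


def hadamard(d):
    """Normalized Sylvester Hadamard matrix; d must be a power of two."""
    assert d >= 2 and d & (d - 1) == 0, "head_dim must be a power of two"
    h = np.array([[1.0]])
    while h.shape[0] < d:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(d)


def bit_reversal_indices(d):
    """Index i maps to the integer whose log2(d)-bit pattern is reversed."""
    bits = d.bit_length() - 1
    return [int(format(i, f"0{bits}b")[::-1], 2) for i in range(d)]


def fit_rotation(sigma):
    """Compose R @ H_d @ P_br from a symmetric covariance matrix."""
    d = sigma.shape[0]
    eigvals, eigvecs = np.linalg.eigh(sigma)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]         # used for ordering only, not scaling
    r = eigvecs[:, order]
    p_br = np.eye(d)[:, bit_reversal_indices(d)]
    return r @ hadamard(d) @ p_br


# Toy shapes only: 64 tokens, 4 query heads, 2 KV heads, head_dim 8.
rng = np.random.default_rng(0)
q = rng.standard_normal((64, 4, 8))
k = rng.standard_normal((64, 8))
rot = fit_rotation(k_covariance(q, n_kv_heads=2))

# The composed matrix is orthogonal, so rotating Q and K together
# leaves the attention logits Q K^T unchanged.
print(np.allclose(rot.T @ rot, np.eye(8)))
print(np.allclose((q[:, 0] @ rot) @ (k @ rot).T, q[:, 0] @ k.T))
```

Because every factor is orthogonal, the rotation changes only the basis in which K is quantized, not the full-precision attention scores.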

Serving with the rotation

The eval drivers eval_oscar_gpqa.sh and eval_oscar_lcb.sh set everything up for you. The underlying sglang server flags are:

SGLANG_ENABLE_MIXED_KV_WINDOWS=1 \
SGLANG_OSCAR_K_ROTATION_PATH=.../k_rotation_qqt_r_h_pbr.pt \
SGLANG_OSCAR_V_ROTATION_PATH=.../v_rotation_sst_r_h_pbr.pt \
SGLANG_OSCAR_K_CLIP_RATIO=0.96 \
SGLANG_OSCAR_V_CLIP_RATIO=0.92 \
SGLANG_OSCAR_ABSORB_V_ROTATION=1 \
SGLANG_MIXED_KV_PREFIX_TOKENS=64 \
SGLANG_MIXED_KV_RECENT_TOKENS=256 \
SGLANG_MIXED_KV_HP_MAX_SPLITS=8 \
SGLANG_MIXED_KV_HP_DTYPE=bfloat16 \
SGLANG_MIXED_KV_SCALE_DTYPE=float32 \
python -m sglang.launch_server \
  --model-path <model> \
  --tensor-parallel-size <tp> \
  --kv-cache-dtype int2 \
  --kv-cache-quant-group-size 128 \
  --prefill-attention-backend fa3 \
  --decode-attention-backend triton \
  --trust-remote-code

Sink (PREFIX_TOKENS) and recent window (RECENT_TOKENS) stay in BF16; the rest of the cache is INT2-quantized into 128-element groups along head-dim.
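The effective bits per element (BPE) implied by this layout can be estimated with a few lines of arithmetic. The sketch below assumes the only quantization overhead is one float32 scale per 128-element group and that exactly PREFIX_TOKENS + RECENT_TOKENS tokens stay in BF16; `effective_bpe` is a hypothetical helper, not a function from the repository.

```python
def effective_bpe(context_len, prefix_tokens=64, recent_tokens=256,
                  group_size=128, quant_bits=2, scale_bits=32, hp_bits=16):
    """Average bits per KV element for a mixed BF16 / INT2 cache."""
    hp_tokens = min(context_len, prefix_tokens + recent_tokens)
    quant_tokens = context_len - hp_tokens
    # Each quantized element pays its own bits plus a share of the group scale.
    quant_bpe = quant_bits + scale_bits / group_size
    return (hp_tokens * hp_bits + quant_tokens * quant_bpe) / context_len


bpe_128k = effective_bpe(128 * 1024)
bpe_32k = effective_bpe(32 * 1024)
print(round(bpe_128k, 2), round(bpe_32k, 2), round(16 / bpe_128k, 1))
```

Under these assumptions the estimate is about 2.28 bits at 128K context (roughly 7× smaller than BF16) and about 2.38 bits at 32K, which lines up with the BPE values listed for OSCAR in the results tables. The BF16 windows are a fixed cost, so their share shrinks as the context grows.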

Calibration knobs

Override any of these per run, e.g. bash rotation/<model>/save_qkv_<model>.sh ENV=val:

| Env | Default | Effect |
|---|---|---|
| DUMP_KVCACHE_TOKENS | 30000 | Total token budget for calibration |
| GROUP_SIZE | 128 | KV quant group size, encoded in output dir name |
| DATASET | GPQA | Calibration dataset name |
| MODEL | per-model HF id | HuggingFace model id |
| TP_SIZE | per-model | Tensor parallel size for dump |
| GPU | per-model | CUDA_VISIBLE_DEVICES |
| HF_HOME | /shared/huggingface | HF cache (set to $HOME/.cache/huggingface on a fresh machine) |
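When driving the dump from Python, the knobs above can be passed through the process environment. The sketch below assumes the dump scripts read these names as environment variables; `build_dump_env` and `KNOWN_KNOBS` are hypothetical names introduced here for illustration.

```python
import os

# Names taken from the calibration-knob table above.
KNOWN_KNOBS = {
    "DUMP_KVCACHE_TOKENS", "GROUP_SIZE", "DATASET",
    "MODEL", "TP_SIZE", "GPU", "HF_HOME",
}


def build_dump_env(overrides, base_env=None):
    """Return a copy of the environment with validated knob overrides applied."""
    unknown = set(overrides) - KNOWN_KNOBS
    if unknown:
        raise ValueError(f"unknown calibration knobs: {sorted(unknown)}")
    env = dict(os.environ if base_env is None else base_env)
    env.update({name: str(value) for name, value in overrides.items()})
    return env


env = build_dump_env(
    {"DUMP_KVCACHE_TOKENS": 10000,
     "HF_HOME": os.path.expanduser("~/.cache/huggingface")},
    base_env={},
)
print(env["DUMP_KVCACHE_TOKENS"])

# Example launch (requires the repository checkout):
#   import subprocess
#   subprocess.run(["bash", "rotation/qwen3-8B/save_qkv_8b.sh"],
#                  env=build_dump_env({"DUMP_KVCACHE_TOKENS": 10000}), check=True)
```

Rejecting unknown names catches typos early, since a misspelled variable would otherwise be ignored and the default used without warning.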

Citation

@misc{zhou2026oscarofflinespectralcovarianceaware,
      title={OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization},
      author={Zhongzhu Zhou and Donglin Zhuang and Jisen Li and Ziyan Chen and Shuaiwen Leon Song and Ben Athiwaratkun and Xiaoxia Wu},
      year={2026},
      eprint={2605.17757},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.17757},
}

License & acknowledgements

  • Released under the MIT License.
  • Built on top of sglang.
