
OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond
Data-free · Training & Calibration-free · Plug-and-Play for X-LLMs
- [Upcoming] vLLM & SGLang backend integration: under active development, official support will be announced in future releases.
- [2026-05-20] Our paper "OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond" is now available on arXiv! [Link]
- [2026-05-19] Codebase and evaluation suite publicly released.
The rapid advancement toward long-context reasoning and multi-modal intelligence has made the KV cache memory footprint a dominant bottleneck. We revisit the inherent limitations of the established per-channel quantization paradigm and identify Token Norm Imbalance (TNI) as the primary bottleneck to quantization fidelity.
Rather than relying on intricate pipelines, we follow the principle of Occam's Razor. We propose OScaR (Omni-Scaled Canalized Rotation), an accurate and lightweight KV cache compression framework for X-LLMs (text-only, multi-modal, and omni-modal LLMs).
[Figure: TNI visualizations in three panels. Text-Only LLMs: low-norm outlier tokens (Attention Sink tokens). Multi-Modal LLMs: large-norm outliers. Multi-Modal LLMs: inter-modality disparities.]
TNI is pervasive across X-LLMs. In text-only models, it manifests as low-norm outlier tokens, also known as Attention Sink tokens. In multi-modal settings, TNI exhibits more diverse forms, including large-norm outliers, significant inter-modality disparities, and broader norm variations. Additional visualizations and detailed experimental configurations are provided in the paper.
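To make the TNI argument concrete, here is a minimal pure-Python sketch; it is not OScaR's implementation. It assumes an asymmetric 2-bit quantizer with one range per channel shared across all tokens, a synthetic Gaussian key matrix, and a single token scaled by 0.01 to mimic a low-norm outlier. The helper names (`quantize_per_channel`, `token_norm_demo`) and the plain divide-by-L2-norm step are illustrative assumptions that stand in for the general idea of token-level scaling.

```python
import math
import random


def quantize_per_channel(rows, bits=2):
    """Quantize and dequantize with one (min, scale) pair per channel.

    `rows` is a list of token vectors. Every channel (column) shares its
    quantization range across all tokens.
    """
    levels = (1 << bits) - 1
    n_ch = len(rows[0])
    out = [[0.0] * n_ch for _ in rows]
    for c in range(n_ch):
        col = [r[c] for r in rows]
        lo, hi = min(col), max(col)
        scale = (hi - lo) / levels or 1.0  # guard against a constant channel
        for t, v in enumerate(col):
            q = round((v - lo) / scale)
            out[t][c] = lo + q * scale
    return out


def l2(vec):
    return math.sqrt(sum(v * v for v in vec))


def rel_error(x, x_hat):
    """Relative L2 reconstruction error of one token."""
    return l2([a - b for a, b in zip(x, x_hat)]) / l2(x)


def token_norm_demo(seed=0, n_tokens=64, n_ch=32):
    """Return (error without scaling, error with scaling) for the low-norm token."""
    rng = random.Random(seed)
    rows = [[rng.gauss(0.0, 1.0) for _ in range(n_ch)] for _ in range(n_tokens)]
    rows[0] = [0.01 * v for v in rows[0]]  # synthetic low-norm outlier token

    # Baseline: quantize the raw tokens per channel.
    plain = quantize_per_channel(rows)

    # Assumed stand-in for token scaling: normalize each token to unit norm,
    # quantize, then multiply the stored norm back in.
    norms = [l2(r) for r in rows]
    unit = [[v / n for v in r] for r, n in zip(rows, norms)]
    deq = quantize_per_channel(unit)
    rescaled = [[v * n for v in r] for r, n in zip(deq, norms)]

    return rel_error(rows[0], plain[0]), rel_error(rows[0], rescaled[0])


if __name__ == "__main__":
    plain_err, scaled_err = token_norm_demo()
    print(f"low-norm token, no scaling:   {plain_err:.3f}")
    print(f"low-norm token, with scaling: {scaled_err:.3f}")
```

In this toy setup the quantization step of each channel is set by the large-norm tokens, so the low-norm token is rounded to levels far from its true values; equalizing token norms first puts it on the same footing as the others.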
- Unveils TNI as the structural bottleneck of per-channel quantization through both empirical and theoretical analysis.
- Streamlined OScaR framework guided by Occam's Razor, requiring only two essential operations, Canalized Rotation and Omni-Token Scaling, with no training or calibration overhead.
- Redefines the Pareto front for X-LLMs KV quantization, delivering near-lossless INT2 quantization across diverse benchmarks while maintaining low computational complexity.
- Optimized System Design and CUDA kernels built on BitDecoding and HadaCore with Tensor Core acceleration, achieving 3.0× decoding speedup, 5.3× memory reduction, and 4.1× throughput increase vs. BF16 FlashDecoding-v2.
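The example script exposes `KV_ROTATION=hadamard` and the kernels build on HadaCore, so a Hadamard-style rotation is part of the pipeline. The sketch below is a generic orthonormal fast Walsh-Hadamard transform in pure Python; it is not the Canalized Rotation itself and not the CUDA kernel, and the function name `fwht` and the toy input are assumptions made for illustration. It shows two properties that make such rotations useful before low-bit quantization: the vector norm is preserved, and a single outlier channel is spread across all channels.

```python
import math


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform (Sylvester ordering).

    The length of `vec` must be a power of two. Applying the transform
    twice returns the original vector.
    """
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < n:
        # Butterfly step: combine pairs of entries that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in out]


if __name__ == "__main__":
    x = [0.1] * 64
    x[5] = 8.0  # one outlier channel
    y = fwht(x)
    print("peak before rotation:", max(abs(v) for v in x))
    print("peak after rotation: ", max(abs(v) for v in y))
    print("squared norm before: ", sum(v * v for v in x))
    print("squared norm after:  ", sum(v * v for v in y))
```

Because the transform is orthonormal, attention scores computed from rotated queries and keys are unchanged in exact arithmetic, while the flatter value range is easier to cover with four INT2 levels.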
OScaR achieves the highest average accuracy among all compared low-bit methods on LongBench-E, outperforming QuaRot, RotateKV, KIVI, OTT, and TurboQuant+ on both Llama-3.1-8B and Qwen3-8B.
| Method | Llama-3.1-8B | Qwen3-8B |
|---|---|---|
| 16-bit Baseline | 41.70 | 49.56 |
| QuaRot (INT2) | 37.94 | 40.13 |
| RotateKV (INT2) | 37.98 | 42.95 |
| KIVI (INT2) | 39.84 | 47.95 |
| OTT (INT2) | 40.74 | 48.21 |
| TurboQuant+ (2.5-bit) | 40.03 | 47.56 |
| OScaR (INT2) | 41.75 | 48.74 |
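
As a reading aid for the table above, this small Python sketch recomputes each method's average gap to the 16-bit baseline across the two models. The numbers are copied from the table; the dictionary layout and the function name `average_gap_to_baseline` are illustrative choices and not part of the repository.

```python
# LongBench-E average scores copied from the table above.
BASELINE = {"Llama-3.1-8B": 41.70, "Qwen3-8B": 49.56}

SCORES = {
    "QuaRot (INT2)": {"Llama-3.1-8B": 37.94, "Qwen3-8B": 40.13},
    "RotateKV (INT2)": {"Llama-3.1-8B": 37.98, "Qwen3-8B": 42.95},
    "KIVI (INT2)": {"Llama-3.1-8B": 39.84, "Qwen3-8B": 47.95},
    "OTT (INT2)": {"Llama-3.1-8B": 40.74, "Qwen3-8B": 48.21},
    "TurboQuant+ (2.5-bit)": {"Llama-3.1-8B": 40.03, "Qwen3-8B": 47.56},
    "OScaR (INT2)": {"Llama-3.1-8B": 41.75, "Qwen3-8B": 48.74},
}


def average_gap_to_baseline(scores, baseline):
    """Return {method: mean(baseline) - mean(method)} over the listed models."""
    models = list(baseline)
    base_mean = sum(baseline[m] for m in models) / len(models)
    gaps = {}
    for method, per_model in scores.items():
        method_mean = sum(per_model[m] for m in models) / len(models)
        gaps[method] = base_mean - method_mean
    return gaps


if __name__ == "__main__":
    gaps = average_gap_to_baseline(SCORES, BASELINE)
    # Smallest gap first, i.e. closest to the 16-bit baseline.
    for method, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
        print(f"{method:24s} {gap:6.3f}")
```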
On OCRBench, OScaR consistently outperforms other 2-bit methods across LLaVA-v1.6-vicuna-7B, Qwen3-VL-8B, and Qwen3-VL-4B.
| Method | LLaVA-v1.6-7B | Qwen3-VL-8B | Qwen3-VL-4B |
|---|---|---|---|
| 16-bit Baseline | 536 | 858 | 852 |
| QuaRot (INT2) | 481 | 722 | 773 |
| RotateKV (INT2) | 473 | 754 | 638 |
| KIVI (INT2) | 488 | 851 | 813 |
| OTT (INT2) | 513 | 850 | 831 |
| TurboQuant+ (2.5-bit) | 501 | 847 | 828 |
| OScaR (INT2) | 519 | 856 | 838 |
On the challenging MMAU-Pro benchmark for omni-modal understanding, OScaR surpasses both the 16-bit baseline and all quantized methods across open-ended QA, Good Rate, and Audio Instruction Following (AIF).
| Method (Qwen3-Omni-30B-A3B) | Open-ended | Good Rate | AIF |
|---|---|---|---|
| 16-bit Baseline | 66.2 | 27.8 | 87.4 |
| KIVI (INT2) | 65.8 | 27.0 | 78.2 |
| OTT (INT2) | 65.8 | 26.9 | 83.9 |
| TurboQuant+ (2.5-bit) | 66.6 | 27.0 | 79.3 |
| OScaR (INT2) | 67.4 | 29.8 | 88.5 |
Note: Detailed experimental setups and TurboQuant+ implementation details are available in the original paper.

    git clone https://github.com/ZunhaiSu/OScaR-KV-Quant.git OScaR
    cd OScaR

    # Prerequisite: install `uv` and ensure it is available on PATH.
    uv venv --python 3.10 --seed oscar-env
    source oscar-env/bin/activate

    # Required for CUTLASS headers used by oscar_cuda.
    git submodule update --init --recursive

    # flash-attn imports torch and psutil during its build, so they must exist first.
    uv pip install "torch==2.6.0+cu124" psutil --index https://download.pytorch.org/whl/cu124

    # Install dependencies declared in pyproject.toml, then install the project itself.
    uv sync --active --no-install-project
    uv pip install --no-build-isolation -e .

If you clone with `--recursive`, you should still run `git submodule update --init --recursive` before building to ensure `libs/cutlass` is present.

The Python dependency source of truth is `pyproject.toml`. `tool.uv.sources` pins `torch==2.6.0` to the `cu124` PyTorch index, and `tool.uv.no-build-isolation-package` disables build isolation for `flash-attn`. The explicit torch/psutil bootstrap step is still required because `flash-attn` imports them while building but does not declare them as build dependencies. The editable install uses `--no-build-isolation` because this repository's CUDA extension build imports PyTorch from the active environment.

Tested Environment:
- Python `3.10.17`
- PyTorch `2.6.0+cu124`
- `flash-attn 2.8.3`
- `transformers 5.8.1` for a fresh installation from the current `pyproject.toml`

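If a build or import fails, comparing the installed package versions against the tested ones is a quick first check. The following is a small sketch using only the standard library; the version strings are taken from the list above, while the function name `compare_versions` and the assumption that the distributions are named `torch`, `flash-attn`, and `transformers` are mine.

```python
from importlib import metadata

# Versions from the "Tested Environment" list above.
TESTED = {
    "torch": "2.6.0+cu124",
    "flash-attn": "2.8.3",
    "transformers": "5.8.1",
}


def compare_versions(expected):
    """Return {package: (expected, installed or None)} for every mismatch."""
    mismatches = {}
    for name, want in expected.items():
        try:
            have = metadata.version(name)
        except metadata.PackageNotFoundError:
            have = None  # package is not installed in this environment
        if have != want:
            mismatches[name] = (want, have)
    return mismatches


if __name__ == "__main__":
    problems = compare_versions(TESTED)
    if not problems:
        print("Installed versions match the tested environment.")
    for name, (want, have) in problems.items():
        print(f"{name}: tested with {want}, found {have or 'not installed'}")
```

A mismatch does not necessarily mean the setup is broken; it only flags where the active environment differs from the one the authors tested.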
Set the model path:

    export MODEL_PATH=/path/to/Qwen3-8B

Quick end-to-end accuracy validation using the Qasper-E benchmark:

    CUDA_VISIBLE_DEVICES=0 $(which python) eval_longbench.py \
      --model_path "$MODEL_PATH" \
      --datasets qasper_e \
      --max_input_len 32768 \
      --dtype bfloat16 \
      --device cuda:0 \
      --offline_v_hadamard \
      --output_dir pred_e/oscar-qasper \
      --log_every 1 \
      --resume

Note: This requires the following data files:
- `longbench_data/data/qasper_e.jsonl`
- `longbench_config/dataset2prompt.json`
- `longbench_config/dataset2maxlen.json`

The metric helper `longbench_metrics.py` is part of this repository, and its Python dependencies are included in `pyproject.toml`.

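Before launching the evaluation, it can save time to confirm that the three data files are in place. This is a minimal sketch, assuming the paths listed above are relative to the repository root; the function name `missing_inputs` is hypothetical and the script is not part of the repository.

```python
from pathlib import Path

# Files the Qasper-E run needs, as listed in the note above.
REQUIRED = [
    "longbench_data/data/qasper_e.jsonl",
    "longbench_config/dataset2prompt.json",
    "longbench_config/dataset2maxlen.json",
]


def missing_inputs(repo_root="."):
    """Return the required files that do not exist under `repo_root`."""
    root = Path(repo_root)
    return [rel for rel in REQUIRED if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_inputs()
    if missing:
        print("Missing input files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All Qasper-E input files are present.")
```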
Run a single inference example with explicit configuration:

    MODEL_PATH="${MODEL_PATH}" \
    DTYPE=bfloat16 \
    NUM_BITS=2 \
    QUANT_MODE=k-channel \
    GROUP_SIZE=32 \
    KV_ROTATION=hadamard \
    KV_NORM=1 \
    ATTN_BACKEND=oscar \
    bash evaluation/scripts/example.sh

If you find OScaR useful for your research or production, please cite our paper:

    @article{su2026oscar,
      title={OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond},
      author={Su, Zunhai and Yang, Rui and Zhang, Chao and Liu, Yaxiu and Zhang, Yifan and Wu, Wei and Xiong, Jing and Du, Dayou and Zhuang, Xialie and Qian, Yulei and Xie, Yuchen and Wu, Yik-Chung and Yang, Hongxia and Wong, Ngai},
      journal={arXiv preprint arXiv:2605.19660},
      year={2026}
    }

OScaR is inspired by many open-source libraries, including but not limited to BitDecoding, HadaCore, KIVI, and SGLang-FluentLLM.




