OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond
⚡ Data-free · Training & Calibration-free · Plug-and-Play for X-LLMs

🔥 Latest News

[Upcoming] 🔧 vLLM & SGLang backend integration is under active development; official support will be announced in a future release.
[2026-05-20] 🎉 Our paper "OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond" is now available on arXiv! [Link]
[2026-05-19] 🚀 Codebase and evaluation suite publicly released.

📖 Overview

The rapid advancement toward long-context reasoning and multi-modal intelligence has made the KV cache memory footprint a dominant bottleneck. We revisit the inherent limitations of the established per-channel quantization paradigm and identify Token Norm Imbalance (TNI) as the primary obstacle to quantization fidelity.

Rather than relying on intricate pipelines, we follow the principle of Occam's Razor. We propose OScaR (Omni-Scaled Canalized Rotation), an accurate and lightweight KV cache compression framework for X-LLMs (text-only, multi-modal, and omni-modal LLMs).

TNI in X-LLMs

[Figure panels: Text-Only LLMs, low-norm outlier tokens (Attention Sink tokens) · Multi-Modal LLMs, large-norm outliers · Multi-Modal LLMs, inter-modality disparities]

TNI is pervasive across X-LLMs. In text-only models, it manifests as low-norm outlier tokens, also known as Attention Sink tokens. In multi-modal settings, TNI exhibits more diverse forms, including large-norm outliers, significant inter-modality disparities, and broader norm variations. Additional visualizations and detailed experimental configurations are provided in the paper.
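
The effect of a single large-norm token on per-channel quantization can be reproduced on synthetic data. The sketch below is a toy illustration, not the paper's analysis or the OScaR implementation: the Gaussian keys, the 50× outlier, the min/max 2-bit quantizer, and the unit-norm rescaling used in the third case are all assumptions chosen for clarity.

```python
import math
import random


def quantize_per_channel(matrix, num_bits=2):
    """Asymmetric min/max quantization with one (scale, offset) per column.

    The result is returned already dequantized so the error can be measured.
    """
    levels = (1 << num_bits) - 1
    rows, cols = len(matrix), len(matrix[0])
    out = [[0.0] * cols for _ in range(rows)]
    for c in range(cols):
        column = [row[c] for row in matrix]
        lo, hi = min(column), max(column)
        scale = (hi - lo) / levels if hi > lo else 1.0
        for r in range(rows):
            out[r][c] = round((column[r] - lo) / scale) * scale + lo
    return out


def mean_abs_error(original, restored, token_ids):
    """Mean absolute error over the selected token rows."""
    total = sum(abs(a - b)
                for t in token_ids
                for a, b in zip(original[t], restored[t]))
    return total / (len(token_ids) * len(original[0]))


rng = random.Random(0)
num_tokens, head_dim = 64, 16
keys = [[rng.gauss(0.0, 1.0) for _ in range(head_dim)] for _ in range(num_tokens)]
normal_tokens = list(range(1, num_tokens))  # every token except token 0

# Case 1: all tokens have comparable norms.
err_clean = mean_abs_error(keys, quantize_per_channel(keys), normal_tokens)

# Case 2: token 0 becomes a large-norm outlier (50x is an arbitrary choice).
skewed = [row[:] for row in keys]
skewed[0] = [50.0 * x for x in skewed[0]]
err_outlier = mean_abs_error(skewed, quantize_per_channel(skewed), normal_tokens)

# Case 3: rescale every token to unit L2 norm before quantizing, then undo it.
norms = [math.sqrt(sum(x * x for x in row)) for row in skewed]
unit = [[x / n for x in row] for row, n in zip(skewed, norms)]
restored = [[x * n for x in row]
            for row, n in zip(quantize_per_channel(unit), norms)]
err_rescaled = mean_abs_error(skewed, restored, normal_tokens)

print(f"clean: {err_clean:.3f}  outlier: {err_outlier:.3f}  rescaled: {err_rescaled:.3f}")
```

The outlier stretches the min/max range of most channels, so the ordinary tokens collapse onto one or two quantization levels. Equalizing token norms before quantizing removes that stretch in this toy setting; whether OScaR's Omni-Token Scaling works the same way is described in the paper, not here.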

✨ Key Features

🔍 Unveils TNI as the structural bottleneck of per-channel quantization through both empirical and theoretical analysis.
🪒 Streamlined OScaR framework guided by Occam's Razor: it requires only two essential operations, Canalized Rotation and Omni-Token Scaling, with no training or calibration overhead.
📈 Redefines the Pareto front for X-LLM KV cache quantization, delivering near-lossless INT2 quantization across diverse benchmarks while maintaining low computational complexity.
⚡ Optimized System Design and CUDA kernels built on BitDecoding and HadaCore with Tensor Core acceleration, achieving 3.0× decoding speedup, 5.3× memory reduction, and 4.1× throughput increase vs. BF16 FlashDecoding-v2.
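
The rotation step builds on a general property of Hadamard transforms, which the Quick Start exposes as KV_ROTATION=hadamard. As a minimal sketch, assuming a plain orthonormal fast Walsh-Hadamard transform (not OScaR's Canalized Rotation and not the HadaCore kernel), the code below shows that rotating both query and key leaves their dot product unchanged while spreading a channel outlier across all channels. The function names and the example vectors are hypothetical.

```python
import math


def hadamard_transform(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < n:
        # Butterfly step: combine pairs that are h positions apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [x / norm for x in out]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


query = [0.5, -1.0, 0.25, 2.0, -0.75, 1.5, 0.0, -0.5]
key = [0.1, 0.2, 40.0, -0.3, 0.0, 0.4, -0.2, 0.1]  # channel 2 is an outlier

rotated_query = hadamard_transform(query)
rotated_key = hadamard_transform(key)

# Orthogonality: the attention logit q.k is unchanged by rotating both sides.
print(dot(query, key), dot(rotated_query, rotated_key))
# The outlier's magnitude is spread over all channels after rotation.
print(max(abs(x) for x in key), max(abs(x) for x in rotated_key))
```

Because the rotation is orthogonal, it can be applied before quantization without changing the attention scores that the unquantized model would compute; only the quantization error changes.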

📊 Main Results

Text-Only LLMs: LongBench-E

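OScaR achieves the highest average accuracy among all quantized methods compared on LongBench-E, outperforming QuaRot, RotateKV, KIVI, OTT, and TurboQuant+ (2.5-bit) on both Llama-3.1-8B and Qwen3-8B. On Llama-3.1-8B it also edges past the 16-bit baseline (41.75 vs. 41.70).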

| Method | Llama-3.1-8B | Qwen3-8B |
| --- | --- | --- |
| 16-bit Baseline | 41.70 | 49.56 |
| QuaRot (INT2) | 37.94 | 40.13 |
| RotateKV (INT2) | 37.98 | 42.95 |
| KIVI (INT2) | 39.84 | 47.95 |
| OTT (INT2) | 40.74 | 48.21 |
| TurboQuant+ (2.5-bit) | 40.03 | 47.56 |
| OScaR (INT2) | 41.75 | 48.74 |

Multi-Modal LLMs: OCRBench

On OCRBench, OScaR consistently outperforms the other low-bit methods (INT2 and 2.5-bit) across LLaVA-v1.6-vicuna-7B, Qwen3-VL-8B, and Qwen3-VL-4B.

| Method | LLaVA-v1.6-7B | Qwen3-VL-8B | Qwen3-VL-4B |
| --- | --- | --- | --- |
| 16-bit Baseline | 536 | 858 | 852 |
| QuaRot (INT2) | 481 | 722 | 773 |
| RotateKV (INT2) | 473 | 754 | 638 |
| KIVI (INT2) | 488 | 851 | 813 |
| OTT (INT2) | 513 | 850 | 831 |
| TurboQuant+ (2.5-bit) | 501 | 847 | 828 |
| OScaR (INT2) | 519 | 856 | 838 |

Omni-Modal LLMs: MMAU-Pro

On the challenging MMAU-Pro benchmark for omni-modal understanding, OScaR surpasses both the 16-bit baseline and all quantized methods across open-ended QA, Good Rate, and Audio Instruction Following (AIF).

| Method (Qwen3-Omni-30B-A3B) | Open-ended | Good Rate | AIF |
| --- | --- | --- | --- |
| 16-bit Baseline | 66.2 | 27.8 | 87.4 |
| KIVI (INT2) | 65.8 | 27.0 | 78.2 |
| OTT (INT2) | 65.8 | 26.9 | 83.9 |
| TurboQuant+ (2.5-bit) | 66.6 | 27.0 | 79.3 |
| OScaR (INT2) | 67.4 | 29.8 | 88.5 |

Note: Detailed experimental setups and TurboQuant+ implementation details are available in the original paper.

🛠️ Installation

git clone https://github.com/ZunhaiSu/OScaR-KV-Quant.git OScaR
cd OScaR

# Prerequisite: install `uv` and ensure it is available on PATH.
uv venv --python 3.10 --seed oscar-env
source oscar-env/bin/activate

# Required for CUTLASS headers used by oscar_cuda.
git submodule update --init --recursive

# flash-attn imports torch and psutil during its build, so they must exist first.
uv pip install "torch==2.6.0+cu124" psutil --index https://download.pytorch.org/whl/cu124

# Install dependencies declared in pyproject.toml, then install the project itself.
uv sync --active --no-install-project
uv pip install --no-build-isolation -e .

Even if you cloned with --recursive, run git submodule update --init --recursive before building to ensure libs/cutlass is present.

The Python dependency source of truth is pyproject.toml. tool.uv.sources pins torch==2.6.0 to the cu124 PyTorch index, and tool.uv.no-build-isolation-package disables build isolation for flash-attn. The explicit torch/psutil bootstrap step is still required because flash-attn imports them while building but does not declare them as build dependencies. The editable install uses --no-build-isolation because this repository's CUDA extension build imports PyTorch from the active environment.

Tested Environment:

Python 3.10.17

PyTorch 2.6.0+cu124

flash-attn 2.8.3

transformers 5.8.1 for a fresh installation from the current pyproject.toml
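
To compare a local environment against the versions listed above, a small check along the following lines may help. It is a sketch that uses only the standard library's importlib.metadata; the helper names are hypothetical and not part of this repository, and the version strings are copied from the Tested Environment list.

```python
import platform
from importlib import metadata

# Versions copied from the "Tested Environment" list above.
TESTED = {
    "torch": "2.6.0+cu124",
    "flash-attn": "2.8.3",
    "transformers": "5.8.1",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def environment_report(tested=TESTED):
    """Map each package to (installed, tested, matches)."""
    report = {}
    for package, expected in tested.items():
        found = installed_version(package)
        report[package] = (found, expected, found == expected)
    return report


print(f"python {platform.python_version()} (tested with 3.10.17)")
for package, (found, expected, ok) in environment_report().items():
    status = "ok" if ok else "differs"
    print(f"{package}: installed={found} tested={expected} [{status}]")
```

A "differs" line is not necessarily an error; it only signals that the environment departs from the tested configuration.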

🚀 Quick Start

Set the model path:

export MODEL_PATH=/path/to/Qwen3-8B

Accuracy Evaluation (Qasper-E)

Quick end-to-end accuracy validation using the Qasper-E benchmark:

CUDA_VISIBLE_DEVICES=0 $(which python) eval_longbench.py \
  --model_path "$MODEL_PATH" \
  --datasets qasper_e \
  --max_input_len 32768 \
  --dtype bfloat16 \
  --device cuda:0 \
  --offline_v_hadamard \
  --output_dir pred_e/oscar-qasper \
  --log_every 1 \
  --resume

Note: This requires the following data files:

longbench_data/data/qasper_e.jsonl

longbench_config/dataset2prompt.json

longbench_config/dataset2maxlen.json

The metric helper longbench_metrics.py is part of this repository, and its Python dependencies are included in pyproject.toml.
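
Before launching the evaluation, it can be useful to confirm that these files are in place. The following pre-flight check is a sketch, not a script shipped with the repository: the function name is hypothetical, and the paths are copied from the note above and assumed to be relative to the repository root.

```python
from pathlib import Path

# Paths copied from the note above, relative to the repository root.
REQUIRED_FILES = [
    "longbench_data/data/qasper_e.jsonl",
    "longbench_config/dataset2prompt.json",
    "longbench_config/dataset2maxlen.json",
]


def missing_files(root=".", required=REQUIRED_FILES):
    """Return the required paths that do not exist under root."""
    base = Path(root)
    return [rel for rel in required if not (base / rel).is_file()]


missing = missing_files()
if missing:
    print("Missing data files:")
    for rel in missing:
        print(f"  {rel}")
else:
    print("All Qasper-E data files are present.")
```

Run it from the repository root, or pass the root directory explicitly, for example missing_files("/path/to/OScaR").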

Single Example

Run a single inference example with explicit configuration:

MODEL_PATH="${MODEL_PATH}" \
DTYPE=bfloat16 \
NUM_BITS=2 \
QUANT_MODE=k-channel \
GROUP_SIZE=32 \
KV_ROTATION=hadamard \
KV_NORM=1 \
ATTN_BACKEND=oscar \
bash evaluation/scripts/example.sh
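
NUM_BITS and GROUP_SIZE together determine how many bits each cached element costs once per-group metadata is counted. The arithmetic below is a back-of-the-envelope sketch under a stated assumption: each group stores a 16-bit scale and a 16-bit zero-point (32 metadata bits per group). This README does not specify OScaR's actual storage layout, and the estimate ignores any additional metadata such as per-token norms.

```python
def effective_bits(num_bits, group_size, metadata_bits_per_group=32):
    """Average stored bits per cache element, including per-group metadata."""
    return num_bits + metadata_bits_per_group / group_size


def compression_ratio(num_bits, group_size, baseline_bits=16,
                      metadata_bits_per_group=32):
    """Size of a baseline cache divided by the size of the quantized cache."""
    return baseline_bits / effective_bits(num_bits, group_size,
                                          metadata_bits_per_group)


bits = effective_bits(num_bits=2, group_size=32)
ratio = compression_ratio(num_bits=2, group_size=32)
print(f"{bits:.2f} bits per element, {ratio:.2f}x smaller than a 16-bit cache")
# prints "3.00 bits per element, 5.33x smaller than a 16-bit cache"
```

Smaller groups track local value ranges more closely but raise the metadata share: under the same assumption, GROUP_SIZE=16 would cost 4 bits per element.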

Citation

If you find OScaR useful for your research or production, please cite our paper:

@article{su2026oscar,
  title={OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond},
  author={Su, Zunhai and Yang, Rui and Zhang, Chao and Liu, Yaxiu and Zhang, Yifan and Wu, Wei and Xiong, Jing and Du, Dayou and Zhuang, Xialie and Qian, Yulei and Xie, Yuchen and Wu, Yik-Chung and Yang, Hongxia and Wong, Ngai},
  journal={arXiv preprint arXiv:2605.19660},
  year={2026}
}

Acknowledgement

OScaR is inspired by many open-source libraries, including but not limited to BitDecoding, HadaCore, KIVI, and SGLang-FluentLLM.
