ZunhaiSu/OScaR-KV-Quant


OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond
⚡ Data-free · Training & Calibration-free · Plug-and-Play for X-LLMs

THU HKU Team UoE arXiv Website

🔥 Latest News

  • [Upcoming] 🔧 vLLM & SGLang backend integration is under active development; official support will be announced in future releases.

  • [2026-05-20] 🎉 Our paper "OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond" is now available on arXiv (arXiv:2605.19660)!

  • [2026-05-19] 🚀 Codebase and evaluation suite publicly released.

📖 Overview

The rapid advancement toward long-context reasoning and multi-modal intelligence has made the KV cache memory footprint a dominant bottleneck. We revisit the inherent limitations of the established per-channel quantization paradigm and identify Token Norm Imbalance (TNI) as the primary obstacle to quantization fidelity.

Rather than relying on intricate pipelines, we follow the principle of Occam's Razor. We propose OScaR (Omni-Scaled Canalized Rotation), an accurate and lightweight KV cache compression framework for X-LLMs (text-only, multi-modal, and omni-modal LLMs).

Figure: TNI in X-LLMs. Panels: text-only LLMs, low-norm outlier tokens (Attention Sink tokens); multi-modal LLMs, large-norm outliers; multi-modal LLMs, inter-modality disparities.
TNI is pervasive across X-LLMs. In text-only models, it manifests as low-norm outlier tokens, also known as Attention Sink tokens. In multi-modal settings, TNI exhibits more diverse forms, including large-norm outliers, significant inter-modality disparities, and broader norm variations. Additional visualizations and detailed experimental configurations are provided in the paper.
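To make the effect of norm imbalance on per-channel quantization concrete, here is a toy NumPy sketch. It is not OScaR's implementation: the data are random, the function names are hypothetical, and the per-token rescaling at the end is only a generic illustration of why balancing token norms helps, not a description of Omni-Token Scaling. Because per-channel quantization shares one range across all tokens in a channel, a single large-norm token stretches that range and the ordinary tokens lose resolution.

```python
import numpy as np

def quantize_per_channel(x, num_bits=2):
    """Asymmetric uniform quantization with one (scale, zero-point) per channel.

    x has shape (tokens, channels); min/max are taken over the token axis.
    Returns the dequantized values so the error can be measured directly.
    """
    levels = 2 ** num_bits - 1
    lo = x.min(axis=0, keepdims=True)
    hi = x.max(axis=0, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / levels
    q = np.clip(np.round((x - lo) / scale), 0, levels)
    return q * scale + lo

def relative_error(x, x_hat):
    return float(np.linalg.norm(x - x_hat) / np.linalg.norm(x))

rng = np.random.default_rng(0)
keys = rng.standard_normal((32, 64))        # 32 tokens, 64 channels (toy sizes)
balanced_err = relative_error(keys, quantize_per_channel(keys))

imbalanced = keys.copy()
imbalanced[0] *= 50.0                       # one large-norm outlier token
deq = quantize_per_channel(imbalanced)
# Measure the error on the 31 ordinary tokens only.
imbalanced_err = relative_error(imbalanced[1:], deq[1:])

# Generic remedy for illustration: bring every token to unit norm, quantize,
# then multiply the stored norms back in.
norms = np.linalg.norm(imbalanced, axis=1, keepdims=True)
deq_scaled = quantize_per_channel(imbalanced / norms) * norms
scaled_err = relative_error(imbalanced[1:], deq_scaled[1:])

print(f"balanced={balanced_err:.3f} imbalanced={imbalanced_err:.3f} rescaled={scaled_err:.3f}")
```

In this toy setting the error on ordinary tokens grows sharply once the outlier is present and drops again after the tokens are rescaled to a common norm.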

✨ Key Features

  • 🔍 Unveils TNI as the structural bottleneck of per-channel quantization through both empirical and theoretical analysis.

  • 🪒 Streamlined OScaR framework guided by Occam's Razor, requiring only two essential operations, Canalized Rotation and Omni-Token Scaling, with no training or calibration overhead.

  • 📈 Redefines the Pareto front for X-LLMs KV quantization, delivering near-lossless INT2 quantization across diverse benchmarks while maintaining low computational complexity.

  • ⚡ Optimized System Design and CUDA kernels built on BitDecoding and HadaCore with Tensor Core acceleration, achieving 3.0× decoding speedup, 5.3× memory reduction, and 4.1× throughput increase vs. BF16 FlashDecoding-v2.
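As a rough guide to how much storage group-wise INT2 quantization can save, the following back-of-envelope sketch counts bits per cached element. The group size of 32 is taken from the Single Example configuration in this README. The 16-bit scale and 16-bit zero-point per group are assumptions, not confirmed by the source, and the helper names are hypothetical. The memory reduction reported above is a measurement and may be accounted for differently, for example by including other metadata or activations.

```python
def effective_bits(num_bits, group_size, scale_bits=16, zero_point_bits=16):
    """Average stored bits per cached element, including per-group metadata."""
    return num_bits + (scale_bits + zero_point_bits) / group_size

def compression_ratio(num_bits, group_size, baseline_bits=16, **metadata):
    """Storage ratio of a baseline_bits cache to the quantized cache."""
    return baseline_bits / effective_bits(num_bits, group_size, **metadata)

bits = effective_bits(num_bits=2, group_size=32)        # 2 + 32/32
ratio = compression_ratio(num_bits=2, group_size=32)    # 16 / bits
print(f"{bits:.2f} bits per element, {ratio:.2f}x smaller than a 16-bit cache")
```

Under these assumptions the per-group metadata adds one bit per element on top of the two payload bits, which is why smaller groups cost noticeably more storage at very low bit widths.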

📊 Main Results

Text-Only LLMs: LongBench-E

OScaR achieves the highest average accuracy among all low-bit methods on LongBench-E, outperforming QuaRot, RotateKV, KIVI, OTT, and TurboQuant+ on both Llama-3.1-8B and Qwen3-8B.

Method                   Llama-3.1-8B   Qwen3-8B
16-bit Baseline          41.70          49.56
QuaRot (INT2)            37.94          40.13
RotateKV (INT2)          37.98          42.95
KIVI (INT2)              39.84          47.95
OTT (INT2)               40.74          48.21
TurboQuant+ (2.5-bit)    40.03          47.56
OScaR (INT2)             41.75          48.74
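The gap to the 16-bit baseline is often easier to read than the raw averages. The short sketch below recomputes it from the numbers in the table above; nothing here is new data, and the variable names are only illustrative.

```python
# Average LongBench-E scores copied from the table: (Llama-3.1-8B, Qwen3-8B).
scores = {
    "16-bit Baseline": (41.70, 49.56),
    "QuaRot (INT2)": (37.94, 40.13),
    "RotateKV (INT2)": (37.98, 42.95),
    "KIVI (INT2)": (39.84, 47.95),
    "OTT (INT2)": (40.74, 48.21),
    "TurboQuant+ (2.5-bit)": (40.03, 47.56),
    "OScaR (INT2)": (41.75, 48.74),
}

baseline = scores["16-bit Baseline"]
# Positive values mean the quantized model scored above the 16-bit baseline.
gaps = {
    method: tuple(round(s - b, 2) for s, b in zip(vals, baseline))
    for method, vals in scores.items()
    if method != "16-bit Baseline"
}

for method, (llama_gap, qwen_gap) in gaps.items():
    print(f"{method:<24} {llama_gap:+.2f}  {qwen_gap:+.2f}")
```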

Multi-Modal LLMs: OCRBench

On OCRBench, OScaR consistently outperforms other 2-bit methods across LLaVA-v1.6-vicuna-7B, Qwen3-VL-8B, and Qwen3-VL-4B.

Method                   LLaVA-v1.6-7B   Qwen3-VL-8B   Qwen3-VL-4B
16-bit Baseline          536             858           852
QuaRot (INT2)            481             722           773
RotateKV (INT2)          473             754           638
KIVI (INT2)              488             851           813
OTT (INT2)               513             850           831
TurboQuant+ (2.5-bit)    501             847           828
OScaR (INT2)             519             856           838

Omni-Modal LLMs: MMAU-Pro

On the challenging MMAU-Pro benchmark for omni-modal understanding, OScaR surpasses both the 16-bit baseline and all quantized methods across open-ended QA, Good Rate, and Audio Instruction Following (AIF).

Method (Qwen3-Omni-30B-A3B)   Open-ended   Good Rate   AIF
16-bit Baseline               66.2         27.8        87.4
KIVI (INT2)                   65.8         27.0        78.2
OTT (INT2)                    65.8         26.9        83.9
TurboQuant+ (2.5-bit)         66.6         27.0        79.3
OScaR (INT2)                  67.4         29.8        88.5

Note: Detailed experimental setups and TurboQuant+ implementation details are available in the original paper.

🛠️ Installation

git clone https://github.com/ZunhaiSu/OScaR-KV-Quant.git OScaR
cd OScaR

# Prerequisite: install `uv` and ensure it is available on PATH.
uv venv --python 3.10 --seed oscar-env
source oscar-env/bin/activate

# Required for CUTLASS headers used by oscar_cuda.
git submodule update --init --recursive

# flash-attn imports torch and psutil during its build, so they must exist first.
uv pip install "torch==2.6.0+cu124" psutil --index https://download.pytorch.org/whl/cu124

# Install dependencies declared in pyproject.toml, then install the project itself.
uv sync --active --no-install-project
uv pip install --no-build-isolation -e .

If you clone with --recursive, you should still run git submodule update --init --recursive before building to ensure libs/cutlass is present.

The Python dependency source of truth is pyproject.toml. tool.uv.sources pins torch==2.6.0 to the cu124 PyTorch index, and tool.uv.no-build-isolation-package disables build isolation for flash-attn. The explicit torch/psutil bootstrap step is still required because flash-attn imports them while building but does not declare them as build dependencies. The editable install uses --no-build-isolation because this repository's CUDA extension build imports PyTorch from the active environment.

Tested Environment:

  • Python 3.10.17
  • PyTorch 2.6.0+cu124
  • flash-attn 2.8.3
  • transformers 5.8.1 for a fresh installation from the current pyproject.toml
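After installing, it can be useful to compare the active environment against the tested versions listed above. The sketch below uses only the standard library and reads installed distribution metadata without importing the packages. The helper name `compare_versions` is hypothetical and not part of this repository, and the distribution names are assumed to match what `pip` reports.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the Tested Environment list.
TESTED = {
    "torch": "2.6.0+cu124",
    "flash-attn": "2.8.3",
    "transformers": "5.8.1",
}

def compare_versions(expected):
    """Return {package: (installed_version_or_None, matches_expected)}."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # distribution is not installed in this environment
        report[name] = (installed, installed == wanted)
    return report

for name, (installed, ok) in compare_versions(TESTED).items():
    status = "ok" if ok else "differs"
    print(f"{name:<14} installed={installed} tested={TESTED[name]} [{status}]")
```

A mismatch is not necessarily an error; it only indicates that the environment differs from the one the authors report testing.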

🚀 Quick Start

Set the model path:

export MODEL_PATH=/path/to/Qwen3-8B

Accuracy Evaluation (Qasper-E)

Quick end-to-end accuracy validation using the Qasper-E benchmark:

CUDA_VISIBLE_DEVICES=0 $(which python) eval_longbench.py \
  --model_path "$MODEL_PATH" \
  --datasets qasper_e \
  --max_input_len 32768 \
  --dtype bfloat16 \
  --device cuda:0 \
  --offline_v_hadamard \
  --output_dir pred_e/oscar-qasper \
  --log_every 1 \
  --resume

Note: This requires the following data files:

  • longbench_data/data/qasper_e.jsonl
  • longbench_config/dataset2prompt.json
  • longbench_config/dataset2maxlen.json

The metric helper longbench_metrics.py is part of this repository, and its Python dependencies are included in pyproject.toml.
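Before launching the evaluation, a quick pre-flight check can confirm that the data files listed above are in place. This is a minimal sketch, assuming the paths are relative to the repository root; `missing_files` is a hypothetical helper and is not provided by this repository.

```python
from pathlib import Path

# Paths copied from the note above, assumed relative to the repository root.
REQUIRED_FILES = (
    "longbench_data/data/qasper_e.jsonl",
    "longbench_config/dataset2prompt.json",
    "longbench_config/dataset2maxlen.json",
)

def missing_files(root="."):
    """Return the required paths that do not exist as files under root."""
    base = Path(root)
    return [rel for rel in REQUIRED_FILES if not (base / rel).is_file()]

missing = missing_files()
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("All Qasper-E inputs found.")
```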

Single Example

Run a single inference example with explicit configuration:

MODEL_PATH="${MODEL_PATH}" \
DTYPE=bfloat16 \
NUM_BITS=2 \
QUANT_MODE=k-channel \
GROUP_SIZE=32 \
KV_ROTATION=hadamard \
KV_NORM=1 \
ATTN_BACKEND=oscar \
bash evaluation/scripts/example.sh
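The settings above suggest a pipeline of Hadamard rotation, token-norm handling, and 2-bit per-channel quantization of keys in groups of 32 tokens. The NumPy sketch below illustrates that general recipe under stated assumptions and is not the code run by `example.sh`: reading `QUANT_MODE=k-channel` as per-channel key quantization and `KV_NORM=1` as storing per-token norms are guesses, a plain Hadamard matrix stands in for Canalized Rotation (whose details are in the paper), and all function names are hypothetical.

```python
import numpy as np

NUM_BITS, GROUP_SIZE = 2, 32  # values from the example configuration

def hadamard(n):
    """Orthonormal Hadamard matrix (Sylvester construction); n must be a power of two."""
    assert n > 0 and n & (n - 1) == 0
    h = np.ones((1, 1))
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def quantize_k_channel(k, num_bits=NUM_BITS, group_size=GROUP_SIZE):
    """Per-channel asymmetric quantization over groups of group_size tokens.

    Returns dequantized values with the same shape as k.
    """
    tokens, channels = k.shape
    assert tokens % group_size == 0
    g = k.reshape(tokens // group_size, group_size, channels)
    levels = 2 ** num_bits - 1
    lo = g.min(axis=1, keepdims=True)
    scale = np.maximum(g.max(axis=1, keepdims=True) - lo, 1e-8) / levels
    q = np.clip(np.round((g - lo) / scale), 0, levels)
    return (q * scale + lo).reshape(tokens, channels)

rng = np.random.default_rng(0)
q_vec = rng.standard_normal(128)            # one query, head_dim = 128 (toy size)
keys = rng.standard_normal((64, 128))       # 64 cached tokens
H = hadamard(128)

# The rotation is orthonormal, so attention logits are unchanged before quantization.
exact = keys @ q_vec
rotated_exact = (keys @ H) @ (H.T @ q_vec)

# Assumed reading of KV_NORM=1: normalize each token, quantize, restore the norm.
k_rot = keys @ H
norms = np.linalg.norm(k_rot, axis=1, keepdims=True)
k_hat = quantize_k_channel(k_rot / norms) * norms
approx = k_hat @ (H.T @ q_vec)

print("max logit drift from rotation alone:", float(np.abs(exact - rotated_exact).max()))
print("logit correlation after INT2:", float(np.corrcoef(exact, approx)[0, 1]))
```

Because the query is rotated with the same matrix, the rotation itself costs no accuracy; only the quantization step introduces error.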

Citation

If you find OScaR useful for your research or production, please cite our paper:

@article{su2026oscar,
  title={OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond},
  author={Su, Zunhai and Yang, Rui and Zhang, Chao and Liu, Yaxiu and Zhang, Yifan and Wu, Wei and Xiong, Jing and Du, Dayou and Zhuang, Xialie and Qian, Yulei and Xie, Yuchen and Wu, Yik-Chung and Yang, Hongxia and Wong, Ngai},
  journal={arXiv preprint arXiv:2605.19660},
  year={2026}
}

Acknowledgement

OScaR is inspired by many open-source libraries, including but not limited to BitDecoding, HadaCore, KIVI, and SGLang-FluentLLM.
