OSCAR: Attention-Aware 2-bit KV Cache Quantization #24112

rankaiyx · 2026-06-04T11:00:48Z

Core Idea
OSCAR compresses the KV cache to ~2.28 effective bits while preserving attention quality for long-context LLM serving. Unlike naive methods, it uses an attention-aware covariance rotation (aligned with QᵀQ for Keys and VᵀSᵀSV for Values) to direct quantization error away from critical attention paths.
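
As a rough illustration of the rotation idea for Keys, the NumPy sketch below builds an orthogonal rotation from the eigenvectors of QᵀQ over calibration queries, rotates the keys, and quantizes them to 4 levels with per-channel clipping. This is not OSCAR's actual implementation: the function names (`calibrate_rotation`, `quantize_2bit`, `dequantize_2bit`), the use of plain eigenvectors, the max-abs clipping thresholds, and the random data are all assumptions made for illustration.

```python
import numpy as np

def calibrate_rotation(Q):
    """Return an orthogonal matrix whose columns are eigenvectors of Q^T Q,
    ordered from largest to smallest eigenvalue (hypothetical helper)."""
    cov = Q.T @ Q                             # (d, d), symmetric
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order]

def quantize_2bit(x, clip):
    """Uniform 4-level quantization on [-clip, clip], per channel; codes in 0..3."""
    step = 2.0 * clip / 3.0
    return np.clip(np.round((x + clip) / step), 0, 3).astype(np.uint8)

def dequantize_2bit(codes, clip):
    """Map 2-bit codes back to values on the [-clip, clip] grid."""
    step = 2.0 * clip / 3.0
    return codes * step - clip

# Toy data standing in for calibration queries and cached keys (assumption).
rng = np.random.default_rng(0)
Q = rng.standard_normal((256, 8))
K = rng.standard_normal((64, 8))

R = calibrate_rotation(Q)                     # offline: fixed rotation
K_rot = K @ R                                 # rotate keys before quantizing
clip = np.abs(K_rot).max(axis=0) + 1e-8       # simple max-abs threshold per channel
codes = quantize_2bit(K_rot, clip)
K_hat = dequantize_2bit(codes, clip)

scores_exact = Q @ K.T                        # reference attention logits
scores_quant = (Q @ R) @ K_hat.T              # queries rotated the same way
```

Because the rotation is orthogonal, `(Q @ R) @ (K @ R).T` equals `Q @ K.T` before quantization; the rotation only changes the basis in which the quantization error lands.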

Key Features
• Offline Calibration: Precomputes fixed per-layer rotation matrices and clipping thresholds using calibration data.
• Hybrid Serving: Employs a three-segment cache (BF16 Sink/Recent + INT2 History) to balance accuracy and memory efficiency.
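
The three-segment layout can be pictured with a small bookkeeping sketch: the first few tokens stay in a full-precision sink, the newest tokens stay in a full-precision recent window, and anything that ages out of the window is quantized into history. The class name `HybridKVCache`, the segment sizes, the eviction rule, and the `quantize` callback are hypothetical; the source does not spell out OSCAR's actual policy.

```python
from collections import deque

class HybridKVCache:
    """Toy three-segment cache: sink (full precision), history (quantized),
    recent (full precision). Sizes and eviction rule are assumptions."""

    def __init__(self, sink_size, recent_size, quantize):
        self.sink_size = sink_size
        self.recent_size = recent_size
        self.quantize = quantize      # callback that compresses one entry
        self.sink = []                # first tokens, kept as-is
        self.history = []             # older tokens, stored quantized
        self.recent = deque()         # newest tokens, kept as-is

    def append(self, kv):
        # Fill the sink first, then use the sliding recent window.
        if len(self.sink) < self.sink_size:
            self.sink.append(kv)
            return
        self.recent.append(kv)
        # When the window overflows, quantize its oldest entry into history.
        if len(self.recent) > self.recent_size:
            self.history.append(self.quantize(self.recent.popleft()))

    def segment_lengths(self):
        return len(self.sink), len(self.history), len(self.recent)

# Usage: token indices stand in for per-token K/V tensors.
cache = HybridKVCache(sink_size=2, recent_size=3,
                      quantize=lambda kv: ("int2", kv))
for t in range(10):
    cache.append(t)
```

After ten appends with these sizes, tokens 0 and 1 sit in the sink, tokens 2 through 6 are in quantized history, and tokens 7 through 9 remain in the recent window.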

Performance
• Accuracy: On Qwen3-4B-Thinking, achieves a mean score of 71.86 (vs. BF16 75.64), drastically outperforming TurboQuant (31.74) and naive INT2 (0.00).
• System Gains: Reduces KV memory by ~8× and increases large-batch throughput by up to ~7×.
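
For a sense of scale, the back-of-the-envelope calculation below compares per-token KV cache size at different bit widths. The model dimensions are placeholders chosen for round numbers, not the configuration of Qwen3-4B-Thinking, and `kv_bytes_per_token` is a hypothetical helper. A ratio of 8× is what a pure 16-bit to 2-bit payload comparison gives; counting the ~2.28 effective bits quoted above gives roughly 7×, before any BF16 sink or recent tokens are added.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bits_per_value):
    """Bytes of KV cache stored per token: keys plus values across all layers."""
    values = 2 * num_layers * num_kv_heads * head_dim   # factor 2 = K and V
    return values * bits_per_value / 8

# Placeholder dimensions (assumption, not a real model config).
dims = dict(num_layers=32, num_kv_heads=8, head_dim=128)

bf16_bytes = kv_bytes_per_token(**dims, bits_per_value=16)
int2_bytes = kv_bytes_per_token(**dims, bits_per_value=2)
eff_bytes = kv_bytes_per_token(**dims, bits_per_value=2.28)

ratio_payload = bf16_bytes / int2_bytes      # 16 / 2
ratio_effective = bf16_bytes / eff_bytes     # 16 / 2.28
```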

Significance
OSCAR shifts the goal of KV quantization from minimizing tensor reconstruction error to protecting the attention mechanism, enabling practical long-context inference under tight memory constraints.

Links
• Paper: https://arxiv.org/abs/2605.17757
• Code: https://github.com/FutureMLS-Lab/OSCAR
