Core Idea
OSCAR compresses the KV cache to ~2.28 effective bits while preserving attention quality for long-context LLM serving. Unlike naive methods, it uses attention-aware covariance rotation (aligning with QᵀQ for Keys and VᵀSᵀSV for Values) to direct quantization error away from critical attention paths.
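To make the Key-side idea concrete, here is a minimal NumPy sketch, not OSCAR's actual procedure. It assumes the rotation is taken from the eigenvectors of QᵀQ over some calibration queries, and it swaps in a plain per-channel min/max 2-bit quantizer. The helper names (`key_rotation`, `fake_quantize`, `attention_logit_error`) and the synthetic data are hypothetical. The point it illustrates: an orthogonal rotation leaves the attention logits QKᵀ unchanged, so the quantizer can work in the rotated coordinates, and the error that matters is the one measured on the logits rather than on the raw tensor.

```python
import numpy as np

def key_rotation(Q):
    """Orthogonal basis from the eigenvectors of Q^T Q (hypothetical helper)."""
    cov = Q.T @ Q / Q.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # reorder to descending
    return eigvecs[:, order], eigvals[order]

def fake_quantize(X, bits=2):
    """Per-channel uniform min/max quantize-dequantize (an assumed stand-in)."""
    levels = 2 ** bits - 1
    lo = X.min(axis=0, keepdims=True)
    hi = X.max(axis=0, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    codes = np.round((X - lo) / scale)       # integer codes in [0, levels]
    return codes * scale + lo

def attention_logit_error(Q, K, K_hat):
    """Mean squared error of the attention logits Q K^T, not of K itself."""
    return float(np.mean((Q @ (K - K_hat).T) ** 2))

# Synthetic calibration data (assumption): queries with uneven channel energy.
rng = np.random.default_rng(0)
d = 16
Q = rng.normal(size=(256, d)) * np.linspace(3.0, 0.1, d)
K = rng.normal(size=(512, d))

R, eigvals = key_rotation(Q)
K_rot = K @ R                                # rotate keys into the Q^T Q eigenbasis
K_hat = fake_quantize(K_rot) @ R.T           # quantize there, rotate back
err = attention_logit_error(Q, K, K_hat)
```

Because `R` is orthogonal, `(Q @ R) @ (K @ R).T` equals `Q @ K.T` up to floating-point error, which is what lets the rotation be precomputed offline and applied without changing the model's outputs before quantization.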
Key Features
• Offline Calibration: Precomputes fixed per-layer rotation matrices and clipping thresholds using calibration data.
• Hybrid Serving: Employs a three-segment cache (BF16 Sink/Recent + INT2 History) to balance accuracy and memory efficiency.
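The three-segment layout can be sketched as a simple partition of token positions. This is an illustration under stated assumptions, not the serving implementation: the sink and recent-window sizes (`n_sink=4`, `n_recent=128`) are made-up defaults, and `split_segments` and `average_bits` are hypothetical helpers. The bit count ignores per-layer scale and clipping metadata, so it is not expected to reproduce the ~2.28 figure.

```python
def split_segments(seq_len, n_sink=4, n_recent=128):
    """Return (sink, history, recent) slices over token positions.

    Sink and recent tokens stay in BF16; history is the INT2 segment.
    Segment sizes are illustrative assumptions.
    """
    n_sink = min(n_sink, seq_len)
    n_recent = min(n_recent, seq_len - n_sink)
    hist_end = seq_len - n_recent
    return slice(0, n_sink), slice(n_sink, hist_end), slice(hist_end, seq_len)

def average_bits(seq_len, n_sink=4, n_recent=128, hi_bits=16, lo_bits=2):
    """Average payload bits per cached element, excluding quantizer metadata."""
    sink, hist, recent = split_segments(seq_len, n_sink, n_recent)
    n_hi = (sink.stop - sink.start) + (recent.stop - recent.start)
    n_lo = hist.stop - hist.start
    return (n_hi * hi_bits + n_lo * lo_bits) / seq_len

sink, hist, recent = split_segments(1000)
short_ctx_bits = average_bits(1000)
long_ctx_bits = average_bits(32768)
```

With fixed-size BF16 segments, the average approaches the INT2 payload width as the context grows, which is why a layout like this mainly pays off for long sequences.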
Performance
• Accuracy: On Qwen3-4B-Thinking, achieves a mean score of 71.86 (vs. 75.64 for BF16), far ahead of TurboQuant (31.74) and naive INT2 (0.00).
• System Gains: Reduces KV memory by ~8× and increases large-batch throughput by up to ~7×.
Significance
OSCAR shifts the goal of KV quantization from minimizing tensor reconstruction error to protecting the attention mechanism, enabling practical long-context inference under tight memory constraints.
Links
• Paper: https://arxiv.org/abs/2605.17757