4-bit key cache that scales each channel on its own (q4-size memory without the quality cliff plain q4_0 falls off on some models) #24518

mverrilli · 2026-06-12T13:36:37Z

The goal here was q4-level compression with q8-class PPL across model families.

At ctx 4096, kpc lands at ~q4_0 size (3.3–3.4× under f16) while its PPL sits in the q8_0 neighborhood:

model	kpc PPL	q8_0 PPL	kpc vs q8	KV size (MiB): kpc / q4_0
qwen2-0.5b	14.151	13.819	+2.4%	14 / 13
llama-1b	12.795	12.583	+1.7%	38 / 36
deepseek-1.3b	19.924	19.759	+0.8%	231 / 216
gemma2-2b	9.404	9.392	~0%	125 / 117
qwen3.5-4b	9.279	9.298	−0.2%	88 / 86
glm4-9b	9.644	9.844	−2.0%	48 / 45

(qwen3.5-4b is a hybrid, so its 88 MiB includes ~49 MiB of fixed linear-attn state that does not quantize and is not really comparable to the others.)
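
As a sanity check on the summary numbers, the "kpc vs q8" column and the 3.3–3.4× figure can be recomputed from the tables in this post. This is a small sketch, not part of the branch: the f16 KV sizes are taken from the detailed results further down, and the hybrid adjustment assumes the ~49 MiB fixed state is the same size in the f16 and kpc runs.

```python
# Figures copied from the tables in this post:
# (model, kpc PPL, q8_0 PPL, f16 KV MiB, kpc KV MiB)
ROWS = [
    ("qwen2-0.5b", 14.151, 13.819, 48, 14),
    ("llama-1b", 12.795, 12.583, 128, 38),
    ("deepseek-1.3b", 19.924, 19.759, 768, 231),
    ("gemma2-2b", 9.404, 9.392, 416, 125),
    ("qwen3.5-4b", 9.279, 9.298, 178, 88),
    ("glm4-9b", 9.644, 9.844, 160, 48),
]

# Assumption: the fixed linear-attn state (~49 MiB) is identical for f16 and kpc.
HYBRID_FIXED_MIB = {"qwen3.5-4b": 49}


def ppl_delta_pct(kpc_ppl, q8_ppl):
    """Relative PPL change of kpc against q8_0, in percent."""
    return 100.0 * (kpc_ppl - q8_ppl) / q8_ppl


def kv_ratio(f16_mib, kpc_mib, fixed_mib=0):
    """How many times smaller the quantizable part of the KV cache is vs f16."""
    return (f16_mib - fixed_mib) / (kpc_mib - fixed_mib)


for model, kpc_ppl, q8_ppl, f16_mib, kpc_mib in ROWS:
    fixed = HYBRID_FIXED_MIB.get(model, 0)
    print(f"{model:14s} {ppl_delta_pct(kpc_ppl, q8_ppl):+.1f}%  "
          f"{kv_ratio(f16_mib, kpc_mib, fixed):.2f}x")
```

Without the adjustment, qwen3.5-4b comes out at roughly 2× (178 / 88), which is why it is flagged as not comparable to the others.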

Every K channel gets its own scale and zero-point per 32-token group, so an outlier channel keeps its own range instead of getting flattened by one shared scale.
The scales would normally cost real memory as fp16, so they're int8 double-quantized too.
Context shift / RoPE works by dequant -> rope -> requant and regrouping the cells, and parallel sequences each carry per-token seq/pos so they don't clobber each other's scales.
The kernel is reasonably tuned: it re-quantizes Q to int8 once per group for the dot product, lets GQA sibling heads share a single KV pass (most of the decode win), and is wired as real ggml ops rather than a custom-op shortcut, so it can move to GPU later.
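
To make the per-channel idea concrete, here is a rough Python sketch of quantizing one 32-token group of K with a 4-bit code per value and an 8-bit scale and zero-point per channel. It is only an illustration under assumptions, not the kpc4_1 layout or the ggml kernel: the function names, the linear 8-bit encoding of the metadata, and the per-row "shared scale" baseline (which is not q4_0 itself) are all made up for this example.

```python
import math

GROUP = 32  # tokens per quantization group, as described in the post


def quantize_group(k_group):
    """Quantize GROUP rows of K (each a list of channels), one range per channel.

    Returns 4-bit codes, 8-bit scale and zero codes, and the two float steps
    used to decode that metadata (the "double quantization" part).
    """
    n_ch = len(k_group[0])
    cols = [[row[c] for row in k_group] for c in range(n_ch)]
    lows = [min(col) for col in cols]
    # Zero-points: signed 8-bit, rounded down so the stored zero-point
    # never sits above the channel minimum.
    z_step = (max(abs(v) for v in lows) or 1.0) / 127.0
    z_codes = [max(-128, math.floor(v / z_step)) for v in lows]
    # Step each channel needs to reach its maximum in 15 increments.
    steps = [(max(col) - z * z_step) / 15.0 for col, z in zip(cols, z_codes)]
    # Scales: unsigned 8-bit, rounded up so the range still covers the maximum.
    s_step = (max(steps) or 1.0) / 255.0
    s_codes = [min(255, max(1, math.ceil(s / s_step))) for s in steps]
    codes = [
        [min(15, max(0, round((row[c] - z_codes[c] * z_step) / (s_codes[c] * s_step))))
         for c in range(n_ch)]
        for row in k_group
    ]
    return codes, s_codes, z_codes, s_step, z_step


def dequantize_group(codes, s_codes, z_codes, s_step, z_step):
    """Rebuild approximate K values from codes and per-channel metadata."""
    return [[q * (s_codes[c] * s_step) + z_codes[c] * z_step
             for c, q in enumerate(row)] for row in codes]


def shared_scale_roundtrip(k_group):
    """Baseline for comparison: one 4-bit range shared by all channels of a row."""
    out = []
    for row in k_group:
        lo, hi = min(row), max(row)
        step = (hi - lo) / 15.0 or 1.0
        out.append([round((v - lo) / step) * step + lo for v in row])
    return out


def max_abs_error(a, b, channels):
    """Largest absolute difference over the selected channels."""
    return max(abs(a[t][c] - b[t][c]) for t in range(len(a)) for c in channels)


# Synthetic K group: 8 channels, channel 3 is a 40x outlier.
OUTLIER = 3
k = [[(40.0 if c == OUTLIER else 1.0) * math.sin(0.3 * t + c) for c in range(8)]
     for t in range(GROUP)]
normal = [c for c in range(8) if c != OUTLIER]

per_channel = dequantize_group(*quantize_group(k))
shared = shared_scale_roundtrip(k)
print("per-channel max error on normal channels:", max_abs_error(k, per_channel, normal))
print("shared-scale max error on normal channels:", max_abs_error(k, shared, normal))
```

In this sketch's layout each value costs 4 bits plus (8 + 8) / 32 bits of per-channel metadata, so about 4.5 bits per element before the two per-group float steps. The linear 8-bit metadata is the simplest choice that keeps the example short; the branch may well encode it differently.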

This is basically similar to KIVI and KVQuant (per-channel quantization for K), with double-quantized scale metadata similar to QLoRA. I'm not claiming the ideas are novel (although I think the deltas from the papers are defensible). If you think it is worth a PR, let me know and I'll work on CUDA. AI disclosure: I use AI for coding assistance, test generation, and reorganizing commits, not design.

A couple of updates since I first posted this. The q4_0 collapse (PPL 139.750 on qwen2-0.5b in the results below) looks specific to the Qwen2/2.5 family rather than to small models in general: it also shows up on qwen2.5-1.5b and an R1 distill, but StarCoder2 (same 2 KV heads plus a K bias) stays fine. kpc has also held up under more scrutiny (KL-divergence tails, a couple of 16k long-context runs, and a clean run on real Apple M1/NEON), and there is now a 3-bit V that takes total KV below q4_0 at the same quality.

Here are detailed results.

CPU (i5-8600, 6 threads), -fa on, wikitext-2; "kpc" = kpc4_1 K + q4_1 V.

Model	KV	PPL	KV (MiB)	Total RAM (MiB)	pp4096 t/s	tg @ d4096 t/s
qwen2-0.5b	f16	13.665	48	811	264.7	38.0
	q8_0	13.819	25	790	134.3	36.2
	q4_0	139.750	13	778	129.0	35.8
	kpc	14.151	14	779	177.7	40.2
llama-1b	f16	12.575	128	1149	178.6	17.9
	q8_0	12.583	68	1093	76.7	18.6
	q4_0	13.178	36	1061	78.1	19.6
	kpc	12.795	38	1064	113.0	22.9
deepseek-1.3b	f16	19.762	768	2203	107.7	8.5
	q8_0	19.759	408	1847	56.2	9.3
	q4_0	20.197	216	1655	58.9	10.5
	kpc	19.924	231	1671	68.1	11.8
gemma2-2b	f16	9.388	416	3579	67.4	6.3
	q8_0	9.392	221	3379	45.5	3.7
	q4_0	9.443	117	3275	46.6	7.4
	kpc	9.404	125	3288	53.5	7.5
qwen3.5-4b	f16	9.284	178	3285	43.7	6.9
	q8_0	9.298	118	3226	37.9	7.4
	q4_0	9.281	86	3194	38.4	7.9
	kpc	9.279	88	3196	41.5	8.0
glm4-9b	f16	9.863	160	6427	19.8	3.7
	q8_0	9.844	85	6352	14.0	3.6
	q4_0	10.259	45	6312	14.0	3.6
	kpc	9.644	48	6316	16.9	3.9

https://github.com/mverrilli/llama.cpp/tree/kpc-cpu-kv

mverrilli · 2026-06-16T16:25:13Z

Detailed benchmarks: https://gist.github.com/mverrilli/527cb6163795b84d0e04779f94b0c690

mverrilli · 2026-06-17T16:46:35Z

Speculative Decoding

Pair 1: Qwen2.5-1.5B target + Qwen2.5-0.5B draft (q4_0 collapses)

Target KV	KV vs f16	Quality	Collapse risk	Runs clean	Coherent	Distinct ratio	Draft accept	Diverge (n_max 1↔8)	Mean div token
f16	100%	reference	none	16/16	16/16	0.68	39%	6/6	32
q8_0	53%	≈ f16	none	16/16	16/16	0.64	41%	6/6	34
kpc	30%	≈ q8_0	none	16/16	16/16	0.64	44%	6/6	32
q4_0	28%	broken	high (Qwen2/2.5)	16/16	3/15 ✗	0.28 ✗	18%	6/6	13

Pair 2: Llama-3.2-3B target + Llama-3.2-1B draft (q4_0 does not collapse)

Target KV	KV vs f16	Quality	Collapse risk	Runs clean	Coherent	Distinct ratio	Draft accept	Diverge (n_max 1↔8)	Mean div token
f16	100%	reference	none	16/16	16/16	0.68	54%	5/6	44
q8_0	53%	≈ f16	none	16/16	16/16	0.68	53%	5/6	57
kpc	30%	≈ q8_0	none	16/16	16/16	0.69	51%	5/6	45
q4_0	28%	≈ f16	none	16/16	16/16	0.69	55%	5/6	48

This was just a check that kpc is safe and small for speculative decoding (clean, coherent, acceptance-neutral vs f16/q8, q4-size target cache, no collapse).

mverrilli · 2026-06-19T12:50:33Z

CUDA branch: https://github.com/mverrilli/llama.cpp/tree/kpc-cuda-kv
