4-bit key cache that scales each channel on its own (q4-size memory without the quality cliff plain q4_0 falls off on some models) #24518
mverrilli
started this conversation in
Show and tell
Replies: 3 comments
-
Detailed benchmarks: https://gist.github.com/mverrilli/527cb6163795b84d0e04779f94b0c690
-
Speculative Decoding
Pair 1: Qwen2.5-1.5B target + Qwen2.5-0.5B draft (q4_0 collapses)
Pair 2: Llama-3.2-3B target + Llama-3.2-1B draft (q4_0 does not collapse)
This was just a check that kpc is safe and small for speculative decoding: output is clean and coherent, acceptance is neutral vs f16/q8, the target cache is q4-size, and there is no collapse.
-
CUDA: https://github.com/mverrilli/llama.cpp/tree/kpc-cuda-kv
-
The goal here was q4-level compression with q8-class PPL across model families.
At ctx 4096, kpc lands at ~q4_0 size (3.3–3.4× under f16) while its PPL sits in the q8_0 neighborhood:
(qwen3.5-4b is a hybrid, so its 88 MiB includes ~49 MiB of fixed linear-attn state that does not quantize and is not really comparable to the others.)
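For a rough sense of where a 3.3–3.4× ratio sits, the arithmetic below works out bits per element for the stock ggml block formats (32 values per block, an f16 scale, plus an f16 min for q4_1) and then inverts the reported ratio. It is only a sketch: kpc's own block layout is not spelled out in this post, so nothing here computes it directly, and the helper names are made up for illustration.

```python
# Bits per stored element for ggml block formats, metadata included.
# Assumption: the standard 32-value block layouts; kpc's layout is NOT modeled.
BLOCK = 32

def bits_per_element(payload_bits, header_bytes):
    """Average bits per value for one block, counting its header."""
    block_bytes = BLOCK * payload_bits / 8 + header_bytes
    return block_bytes * 8 / BLOCK

def ratio_vs_f16(bpe):
    """How many times smaller than an f16 cache this format is."""
    return 16.0 / bpe

formats = {
    "f16": 16.0,
    "q8_0": bits_per_element(8, 2),  # 8-bit values + f16 scale
    "q4_1": bits_per_element(4, 4),  # 4-bit values + f16 scale + f16 min
    "q4_0": bits_per_element(4, 2),  # 4-bit values + f16 scale
}

for name, bpe in formats.items():
    print(f"{name}: {bpe:.2f} bits/elem, {ratio_vs_f16(bpe):.2f}x under f16")

# Average bits/elem across K and V implied by the reported 3.3-3.4x figure.
implied = [16.0 / r for r in (3.4, 3.3)]
print(f"implied average: {implied[0]:.2f} to {implied[1]:.2f} bits/elem")
```

Read this way, the reported ratio corresponds to an average a little under 5 bits per element across K and V, which is the q4_0/q4_1 range rather than the 8.5 bits of q8_0.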
This is close to KIVI and KVQuant (per-channel quantization for K), with double-quantized scale metadata along the lines of QLoRA. I'm not claiming the ideas are novel (although I think the deltas from those papers are defensible). If you think it is worth a PR, let me know and I'll work on CUDA. AI disclosure: I use AI for coding assistance, test generation, and reorganizing commits, not for design.
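To make the per-channel idea concrete, here is a toy round-trip comparison in plain Python. It is not the kpc kernel and not the exact KIVI or KVQuant scheme: it assumes a synthetic K matrix with one large, nearly constant channel (the kind of thing a K bias can produce), uses a simple 4-bit scale-plus-min quantizer, and leaves out the double-quantized metadata entirely. All names are illustrative.

```python
import random

def quantize_group(values, bits=4):
    """Round-trip one group through a shared scale+min quantizer (q4_1-like)."""
    levels = 2 ** bits - 1                      # 15 steps for 4-bit
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)                     # constant group: nothing lost
    scale = (hi - lo) / levels
    return [lo + round((v - lo) / scale) * scale for v in values]

def per_token(K):
    """One scale/min per token row, i.e. a block spanning the channels."""
    return [quantize_group(row) for row in K]

def per_channel(K):
    """One scale/min per channel column, shared across tokens."""
    cols = [quantize_group(list(col)) for col in zip(*K)]
    return [list(row) for row in zip(*cols)]

def mse(A, B):
    """Mean squared error between two same-shaped matrices."""
    n = sum(len(r) for r in A)
    return sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb)) / n

random.seed(0)
TOKENS, CHANNELS, OUTLIER = 64, 32, 5
K = [[random.gauss(0, 1) for _ in range(CHANNELS)] for _ in range(TOKENS)]
for row in K:
    row[OUTLIER] = 50.0 + random.gauss(0, 1)    # assumed outlier channel

err_token = mse(K, per_token(K))
err_channel = mse(K, per_channel(K))
print(f"per-token MSE:   {err_token:.4f}")
print(f"per-channel MSE: {err_channel:.4f}")
```

With per-token grouping the outlier channel stretches every row's range, so the ordinary channels are left with only a couple of usable levels. Per-channel grouping gives the outlier its own scale and min, and the remaining channels keep all 16 levels.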
A couple of updates since I first posted this. The collapse looks specific to the Qwen2/2.5 family rather than to small models in general: it also shows up on qwen2.5-1.5b and an R1 distill, but StarCoder2 (same 2 KV heads plus a K bias) stays fine. kpc has also held up under more scrutiny since then (KL-divergence tails, a couple of 16k long-context runs, and a clean run on real Apple M1/NEON), and there is now a 3-bit V that takes total KV below q4_0 at the same quality.
Here are detailed results.
CPU (i5-8600, 6 threads), -fa on, wikitext-2; "kpc" = kpc4_1 K + q4_1 V.
https://github.com/mverrilli/llama.cpp/tree/kpc-cpu-kv