Quantized KV cache + Flash Attention on Adreno (OpenCL) — an on-device experiment with measurements #24109
hyeoktae started this conversation in Show and tell
(Discussion draft — GitHub Discussions → Show and tell)
Hi all — sharing an experiment, not a finished contribution. Code review and corrections are very welcome.
Setup
What I did
I added quantized KV cache support to the OpenCL backend: new quantized set_rows kernels (q8_0, q4_0, q5_0, iq4_nl) plus a dequant path inside the existing Flash Attention kernel, so K/V can be stored quantized on Adreno. (The FA kernel itself was already there; I added the quantized-KV path through it.)
Why
I'm an iOS developer studying LLMs. I had run llama.cpp on iPhone where KV-cache quantization worked, but on Android (OpenCL/Adreno) there was no q4/q8 option. I started reading the code to understand why, and since other backends (CUDA, Vulkan, …) already support it, I figured I could implement something similar for OpenCL.
My main goal was longer context on-device. Other on-device NPU runtimes I tried (e.g. LiteRT, Qualcomm Hexagon/HTP) were limited to around 32K context in my experience. Quantizing the KV cache is what unlocks long context on larger models — for the 9B model below, F16 KV alone OOMs at 128K while quantized fits. (Smaller models are fine in F16: gemma-4-E4B (4B) loads 128K with F16 KV on 16 GB in my testing.)
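To show why the KV type decides whether a long context fits, here is a rough sketch of the size arithmetic. The per-block byte counts follow ggml's 32-element block layouts; the model shape (layer count, KV heads, head dimension) is a hypothetical placeholder, not taken from the gemma or Qwen models above, and the helper name kv_cache_bytes is made up for this example.

```python
# Bytes per 32-element block in ggml's block formats (f16 shown as 32 x 2 bytes).
BLOCK_ELEMS = 32
BLOCK_BYTES = {
    "f16": 64,     # 32 x fp16
    "q8_0": 34,    # fp16 scale + 32 x int8
    "q5_0": 22,    # fp16 scale + 4 bytes of high bits + 16 bytes of low nibbles
    "q4_0": 18,    # fp16 scale + 16 bytes of nibbles
    "iq4_nl": 18,  # fp16 scale + 16 bytes of 4-bit codebook indices
}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, kv_type):
    """Size of K plus V for n_ctx tokens, ignoring padding and runtime overhead."""
    elems_per_token = 2 * n_layers * n_kv_heads * head_dim  # factor 2: K and V
    total_elems = n_ctx * elems_per_token
    assert total_elems % BLOCK_ELEMS == 0, "rows must be a multiple of the block size"
    return total_elems // BLOCK_ELEMS * BLOCK_BYTES[kv_type]


# Hypothetical shape, NOT read from any real model file.
shape = dict(n_ctx=128 * 1024, n_layers=32, n_kv_heads=8, head_dim=128)
for kv_type in BLOCK_BYTES:
    gib = kv_cache_bytes(kv_type=kv_type, **shape) / 2**30
    print(f"{kv_type:7s} {gib:6.2f} GiB")
```

For this made-up shape the F16 cache comes to 16 GiB at 128K, against 8.5 GiB for q8_0 and 4.5 GiB for q4_0 or iq4_nl. Real models differ (hybrid models only keep a KV cache for their full-attention layers), so treat the numbers as an illustration of the ratio, not as a prediction for a specific model.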
Implementation (facts)
- set_rows kernels: q8_0 / q4_0 / q5_0 / iq4_nl (i64 & i32 index variants)
- … view_src detection + view_offs), which fixes a SIGSEGV in llama_state_load_file on the 2D k_stream view
- … ggml-quants.c); they match.
Measurements
Output sanity check (simple prompts only) — gemma-4-E4B-it-Q4_0, 8K ctx, OpenCL GPU, temp = 0:
KV memory — gemma-4-E4B-it-Q4_0, 64K ctx, Flash Attention on:
On these simple prompts the outputs were coherent and matched F16. I've only tested simple queries so far — this is a sanity check, not a rigorous quality evaluation (no long-reasoning, coding, or long-context accuracy tests yet). tok/s is nearly independent of KV type (dequant cost and memory-bandwidth savings roughly cancel out).
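For readers who have not looked at the block formats, the sketch below shows what storing and reading back one q8_0 block involves: 32 values share one fp16 scale and are kept as int8. It is written in plain Python to mirror the arithmetic of ggml's CPU reference for q8_0 as I understand it; it is not the OpenCL kernel from this branch, and the function names are made up for the example.

```python
import math
import struct

QK8_0 = 32  # values per q8_0 block in ggml


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_half_away(v):
    """Round to nearest, ties away from zero (C roundf behaviour)."""
    return int(math.copysign(math.floor(abs(v) + 0.5), v))


def quantize_block_q8_0(xs):
    """Return (scale, int8 values) for one block of 32 floats."""
    assert len(xs) == QK8_0
    amax = max(abs(x) for x in xs)
    d = amax / 127.0                   # largest magnitude maps to +/-127
    inv_d = 1.0 / d if d else 0.0      # an all-zero block keeps scale 0
    qs = [round_half_away(x * inv_d) for x in xs]
    return to_fp16(d), qs              # the scale is stored as fp16


def dequantize_block_q8_0(d, qs):
    """Rebuild approximate floats: this multiply is the per-read dequant cost."""
    return [d * q for q in qs]


# Small demo on synthetic values in [-2, 2).
xs = [(i - 16) / 8.0 for i in range(QK8_0)]
d, qs = quantize_block_q8_0(xs)
ys = dequantize_block_q8_0(d, qs)
print("scale:", d, "max abs error:", max(abs(a - b) for a, b in zip(xs, ys)))
```

The round-trip error per value is bounded by roughly half the scale, which is one reason q8_0 is usually the safest KV type to try first. The 4-bit and 5-bit types pack values more tightly and trade more error for less memory.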
Larger model & context — Qwen3.5-9B-Q4_K_M, OpenCL GPU:
So for a 9B model, quantized KV is what lets it reach long context on-device at all (F16 OOMs), while prefill speed itself is bounded by the GPU.
A note on prefill
Prefill is slow, and I wanted to understand why. As I studied it, the model architecture seemed to be the deciding factor: gemma runs attention across (almost) all layers during prefill (~3.8 tok/s), while Qwen3.5-9B — despite being larger — reached ~21 tok/s because it's a hybrid with only ~8 full-attention layers (the rest use lighter linear attention). I swept batch / ubatch sizes to try to speed it up, but my conclusion is that it's GPU-compute bound on this device — the software knobs barely moved it.
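One way to sanity-check that reasoning is a back-of-the-envelope cost model for the attention matmuls alone. The sketch below counts only the QK^T and PV products under a causal mask; it ignores projections, MLP layers, the cheaper linear-attention layers, and kernel efficiency, and the layer and head counts are hypothetical placeholders rather than the real gemma or Qwen configurations.

```python
def attention_matmul_flops(n_tokens, n_full_attn_layers, n_heads, head_dim):
    """Rough FLOPs for QK^T and PV over a causal prefill of n_tokens.

    2 matmuls x 2 FLOPs per multiply-add x n_tokens^2 / 2 (causal mask),
    per head and per head_dim element, summed over full-attention layers.
    """
    per_layer = 2 * 2 * n_heads * head_dim * n_tokens * n_tokens // 2
    return n_full_attn_layers * per_layer


# Hypothetical configurations: same heads and head_dim, different layer counts.
dense = attention_matmul_flops(8192, n_full_attn_layers=32, n_heads=16, head_dim=128)
hybrid = attention_matmul_flops(8192, n_full_attn_layers=8, n_heads=16, head_dim=128)
print("dense / hybrid attention FLOPs:", dense / hybrid)
```

In this model the attention cost scales linearly with the number of full-attention layers and quadratically with prompt length, which is consistent with a hybrid model prefilling faster at long context. It does not prove the device is compute bound; that conclusion still rests on the batch / ubatch sweep described above.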
Code
… 26c0831): git diff 9e58d4d69..feat/opencl-quantized-kv locally. (GitHub's online compare-vs-master view timed out when I tried it.)
Disclosure
I used Claude (4.8) to analyze the existing code and to write the kernels. The design decisions and the on-device measurements are mine. I understand the code itself, but I'm still learning how LLMs work end-to-end, so I didn't feel it was ready for a PR — I'm sharing it as a discussion instead. Corrections and code review are very welcome.
Why I'm sharing
If anyone else wants to run LLMs on mobile like I do, I hope this helps even a little. 🙏