Quantized KV cache + Flash Attention on Adreno (OpenCL) — an on-device experiment with measurements #24109

hyeoktae · 2026-06-04T10:24:47Z

(Discussion draft — GitHub Discussions → Show and tell)

Suggested title: Quantized KV cache + Flash Attention on Adreno (OpenCL) — an on-device experiment with measurements

Note: I'm not confident in English, so I wrote this in my native language and used Claude to translate it.

Hi all — sharing an experiment, not a finished contribution. Code review and corrections are very welcome.

Setup

RedMagic 10 Pro: Snapdragon 8 Elite, Adreno 830, 16 GB RAM
llama.cpp OpenCL backend (GPU)

What I did

I added quantized KV cache support to the OpenCL backend: new quantized set_rows kernels (q8_0, q4_0, q5_0, iq4_nl) plus a dequant path inside the existing Flash Attention kernel, so K/V can be stored quantized on Adreno. (The FA kernel itself was already there — I added the quantized-KV path through it.)
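To make the idea of a "dequant path inside Flash Attention" concrete, here is a minimal Python sketch, not the OpenCL kernel from the fork. It assumes a single query vector, q8_0 storage for both K and V, equal K and V head sizes, and row lengths that are multiples of 32. The function names are hypothetical. Each K/V row is dequantized on the fly and folded into a running (online) softmax, which is the general technique Flash Attention kernels use.

```python
import math
import struct

QK8_0 = 32  # elements per q8_0 block in ggml


def to_f16(x):
    """Round to the nearest fp16 value, since q8_0 stores its scale as fp16."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_half_away(x):
    """Mimic C roundf(): halves round away from zero (Python's round() does not)."""
    return int(math.copysign(math.floor(abs(x) + 0.5), x))


def quantize_row_q8_0(row):
    """Split a row into blocks of 32 and return [(scale, [int8, ...]), ...]."""
    blocks = []
    for i in range(0, len(row), QK8_0):
        chunk = row[i:i + QK8_0]
        amax = max(abs(v) for v in chunk)
        d = amax / 127.0
        inv = 1.0 / d if d else 0.0
        blocks.append((to_f16(d), [round_half_away(v * inv) for v in chunk]))
    return blocks


def dequantize_row_q8_0(blocks):
    """Inverse of the above: each value is scale * int8."""
    return [d * q for d, qs in blocks for q in qs]


def attention_quantized_kv(query, k_rows, v_rows):
    """Attention for one query over quantized K/V rows using an online softmax."""
    scale = 1.0 / math.sqrt(len(query))
    running_max = -math.inf      # largest score seen so far
    running_sum = 0.0            # sum of exp(score - running_max)
    acc = [0.0] * len(query)     # weighted sum of V rows (same head size assumed)
    for k_blocks, v_blocks in zip(k_rows, v_rows):
        k = dequantize_row_q8_0(k_blocks)          # dequantize K on the fly
        score = scale * sum(a * b for a, b in zip(query, k))
        new_max = max(running_max, score)
        rescale = math.exp(running_max - new_max)  # exp(-inf) == 0.0 on the first row
        weight = math.exp(score - new_max)
        v = dequantize_row_q8_0(v_blocks)          # dequantize V on the fly
        acc = [rescale * a + weight * x for a, x in zip(acc, v)]
        running_sum = rescale * running_sum + weight
        running_max = new_max
    return [a / running_sum for a in acc]
```

The point of the sketch is that only one dequantized row is alive at a time, so the cache itself can stay in the compact block format.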

Why

I'm an iOS developer studying LLMs. I had run llama.cpp on iPhone, where KV-cache quantization worked, but on Android (OpenCL/Adreno) there was no q4/q8 KV-cache option. I started reading the code to understand why, and since other backends (CUDA, Vulkan, …) already support it, I figured I could implement something similar for OpenCL.

My main goal was longer context on-device. Other on-device NPU runtimes I tried (e.g. LiteRT, Qualcomm Hexagon/HTP) were limited to around 32K context in my experience. Quantizing the KV cache is what unlocks long context on larger models — for the 9B model below, F16 KV alone OOMs at 128K while quantized fits. (Smaller models are fine in F16: gemma-4-E4B (4B) loads 128K with F16 KV on 16 GB in my testing.)

Implementation (facts)

New quantized set_rows kernels: q8_0 / q4_0 / q5_0 / iq4_nl (i64 & i32 index variants)
Flash Attention dequant path for quantized K/V
KV-view state-load fix (view_src detection + view_offs) — fixes a SIGSEGV in llama_state_load_file on the 2D k_stream view
I cross-checked the quant math against the ggml reference (ggml-quants.c); they match.
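As an illustration of what a quantized set_rows does, the following Python sketch quantizes source rows to q4_0 and scatters them into a cache by row index. It follows the q4_0 reference math in ggml-quants.c as far as I understand it (scale = signed max / -8, two 4-bit values packed per byte, low nibble from the first half of the block). It runs in Python floats rather than fp32, so it is a sketch of the logic, not a bit-exact port. The names `set_rows_q4_0` and `cache` are hypothetical, and row lengths are assumed to be multiples of 32.

```python
import struct

QK4_0 = 32  # elements per q4_0 block in ggml


def to_f16(x):
    """Round to the nearest fp16 value, since q4_0 stores its scale as fp16."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize_block_q4_0(chunk):
    """32 floats -> (fp16 scale, 16 packed bytes)."""
    signed_max = max(chunk, key=abs)   # largest magnitude, sign kept
    d = signed_max / -8.0
    inv = 1.0 / d if d else 0.0
    half = QK4_0 // 2
    packed = bytearray(half)
    for j in range(half):
        # int() truncates toward zero, like the C cast in the reference code
        q0 = min(15, int(chunk[j] * inv + 8.5))
        q1 = min(15, int(chunk[half + j] * inv + 8.5))
        packed[j] = q0 | (q1 << 4)
    return to_f16(d), bytes(packed)


def dequantize_block_q4_0(block):
    """(scale, 16 packed bytes) -> 32 floats."""
    d, packed = block
    half = QK4_0 // 2
    out = [0.0] * QK4_0
    for j in range(half):
        out[j] = ((packed[j] & 0x0F) - 8) * d
        out[half + j] = ((packed[j] >> 4) - 8) * d
    return out


def set_rows_q4_0(cache, src_rows, row_ids):
    """cache[row_ids[i]] = quantize(src_rows[i]); row_ids plays the role of the i32/i64 index tensor."""
    for row, dst in zip(src_rows, row_ids):
        cache[dst] = [quantize_block_q4_0(row[i:i + QK4_0])
                      for i in range(0, len(row), QK4_0)]


# Usage: write two new rows into slots 3 and 1 of a 4-slot cache.
cache = [None] * 4
ramp = [(i - 16) / 16 for i in range(32)]
set_rows_q4_0(cache, [ramp, [0.0] * 32], [3, 1])
restored = dequantize_block_q4_0(cache[3][0])
print(max(abs(a - b) for a, b in zip(ramp, restored)))
```

In a KV cache the index tensor carries the cache slot for each new token, which is why the op is a scatter rather than a plain copy.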

The set_rows infrastructure (f32/f16 kernels, calling convention) is upstream. The quantized kernels, FA dequant, KV integration, and state fix are what I added. Related: #14661 tracks set_rows type coverage and lists OpenCL quantized types as not-yet-implemented — this fills that gap for Adreno.

Measurements

Output sanity check (simple prompts only) — gemma-4-E4B-it-Q4_0, 8K ctx, OpenCL GPU, temp = 0:

KV type	"17 × 3"	"3 primary colors"	finish	tok/s
Q8_0	51	(correct)	stop	5.0
Q5_0	51	(correct)	stop	4.8
Q4_0	51	(correct)	stop	5.0
IQ4_NL	51	(correct)	stop	4.9

KV memory — gemma-4-E4B-it-Q4_0, 64K ctx, Flash Attention on:

KV type	KV size	vs F16	tok/s	output
F16	1054 MiB	100%	4.75	baseline
Q8_0	560 MiB	53%	4.60	≈ F16
Q5_0	362 MiB	34%	4.64	≈ F16
IQ4_NL	296 MiB	28%	4.53	≈ F16
Q4_0	296 MiB	28%	4.57	≈ F16

On these simple prompts the outputs were coherent and matched F16. I've only tested simple queries so far, so this is a sanity check, not a rigorous quality evaluation (no long-reasoning, coding, or long-context accuracy tests yet). tok/s is nearly independent of KV type (4.53 to 4.75 across all five); my guess is that the dequant cost and the memory-bandwidth savings roughly cancel out.
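The "vs F16" column can be sanity-checked from the ggml block layouts alone. The sketch below assumes the standard 32-element blocks (an fp16 scale plus the payload, and for q5_0 an extra 4 bytes of high bits) and compares the predicted size against the measured MiB values from the table above.

```python
# Bytes per 32-element block, assuming the standard ggml layouts.
BLOCK_BYTES = {
    "F16": 32 * 2,         # 32 half-precision values, no scale
    "Q8_0": 2 + 32,        # fp16 scale + 32 int8
    "Q5_0": 2 + 4 + 16,    # fp16 scale + 32 high bits + 32 low nibbles
    "IQ4_NL": 2 + 16,      # fp16 scale + 32 4-bit codebook indices
    "Q4_0": 2 + 16,        # fp16 scale + 32 nibbles
}

# Measured KV sizes from the 64K-context table above (MiB).
MEASURED_MIB = {"F16": 1054, "Q8_0": 560, "Q5_0": 362, "IQ4_NL": 296, "Q4_0": 296}


def ratio_vs_f16(kv_type):
    """Storage cost of kv_type relative to F16 for the same number of elements."""
    return BLOCK_BYTES[kv_type] / BLOCK_BYTES["F16"]


def predicted_mib(kv_type, f16_mib=MEASURED_MIB["F16"]):
    """Predicted KV size if the F16 cache were re-stored in kv_type."""
    return f16_mib * ratio_vs_f16(kv_type)


for kv_type, measured in MEASURED_MIB.items():
    print(f"{kv_type:7s} ratio={ratio_vs_f16(kv_type):.3f} "
          f"predicted={predicted_mib(kv_type):7.1f} MiB measured={measured} MiB")
```

The predictions land within 1 MiB of every measured value, which suggests the quantized cache really is stored in the compact block format with no hidden padding.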

Larger model & context — Qwen3.5-9B-Q4_K_M, OpenCL GPU:

metric	result
Largest context allocated	256K with Q4_0 KV (~12.5 GB RAM) — loads + runs short prompts (not full 256K-token inference; prefill too slow)
F16 KV at 128K	~4 GB KV alone → OOM (quantized KV is what makes it fit)
prefill @ 8K (7,292-token prompt)	~21 tok/s
prefill, F16 vs Q4 KV	20.8 vs 21.0 tok/s (nearly identical)
prefill across batch / ubatch / context	all ~21 tok/s → GPU-compute bound, not KV-format bound

So for a 9B model, quantized KV is what lets it reach long context on-device at all (F16 OOMs), while prefill speed itself is bounded by the GPU.
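A rough way to see why quantization changes what fits: scale the reported ~4 GB F16 figure at 128K by context length and by bits per element. This is a back-of-the-envelope sketch that assumes the KV cache grows linearly with context and that the ~4 GB figure is exact, neither of which was measured at this precision. The function name is hypothetical.

```python
# Effective bits per element, assuming standard ggml 32-element blocks.
BITS_PER_ELEMENT = {"F16": 16.0, "Q8_0": 8.5, "Q5_0": 5.5, "Q4_0": 4.5, "IQ4_NL": 4.5}

REPORTED_F16_GB = 4.0          # "~4 GB KV alone" at 128K, from the table above
REPORTED_CTX = 128 * 1024


def estimate_kv_gb(kv_type, n_ctx):
    """Scale the reported F16 size linearly in context and in bits per element."""
    ctx_factor = n_ctx / REPORTED_CTX
    type_factor = BITS_PER_ELEMENT[kv_type] / BITS_PER_ELEMENT["F16"]
    return REPORTED_F16_GB * ctx_factor * type_factor


for n_ctx in (128 * 1024, 256 * 1024):
    for kv_type in ("F16", "Q8_0", "Q4_0"):
        print(f"{n_ctx // 1024}K {kv_type:5s} ~{estimate_kv_gb(kv_type, n_ctx):.2f} GB")
```

Under these assumptions a Q4_0 cache at 256K is smaller than the F16 cache at 128K, which is consistent with the 256K allocation succeeding where F16 at 128K did not.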

A note on prefill

Prefill is slow, and I wanted to understand why. As I studied it, the model architecture seemed to be the deciding factor: gemma runs attention in (almost) all layers and prefilled at ~3.8 tok/s, while Qwen3.5-9B, despite being larger, reached ~21 tok/s because it's a hybrid with only ~8 full-attention layers (the rest use lighter linear attention). I swept batch / ubatch sizes to try to speed it up, but the software knobs barely moved it, so my conclusion is that prefill is GPU-compute bound on this device.
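To put the prefill rate in perspective, this small sketch converts a prompt length and a prefill speed into wall-clock time. It assumes the rate stays constant as the context grows, which is optimistic because attention cost rises with context, so the results are best read as lower bounds.

```python
def prefill_seconds(n_tokens, tok_per_s):
    """Wall-clock prefill time at a constant rate (an optimistic assumption)."""
    return n_tokens / tok_per_s


def describe(n_tokens, tok_per_s):
    """Format the estimate in minutes or hours, whichever reads better."""
    seconds = prefill_seconds(n_tokens, tok_per_s)
    if seconds < 3600:
        return f"{n_tokens} tokens at {tok_per_s} tok/s: ~{seconds / 60:.1f} min"
    return f"{n_tokens} tokens at {tok_per_s} tok/s: ~{seconds / 3600:.1f} h"


# The measured 8K case, then full 128K and 256K prompts at the same ~21 tok/s.
for n_tokens in (7292, 128 * 1024, 256 * 1024):
    print(describe(n_tokens, 21.0))
```

At ~21 tok/s a full 256K-token prompt would take roughly three and a half hours even under this optimistic assumption, which is why the 256K run above only covers short prompts.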

Code

Fork: https://github.com/hyeoktae/llama.cpp
Commits (6): https://github.com/hyeoktae/llama.cpp/commits/feat/opencl-quantized-kv
Key files (permalinks at commit 26c0831):
- quantized set_rows kernels — https://github.com/hyeoktae/llama.cpp/blob/26c0831fc09bbb8a6cd18e66c115577438bad609/ggml/src/ggml-opencl/kernels/set_rows.cl
- Flash Attention dequant — https://github.com/hyeoktae/llama.cpp/blob/26c0831fc09bbb8a6cd18e66c115577438bad609/ggml/src/ggml-opencl/kernels/flash_attn_f32_f16.cl
- op support / dispatch — https://github.com/hyeoktae/llama.cpp/blob/26c0831fc09bbb8a6cd18e66c115577438bad609/ggml/src/ggml-opencl/ggml-opencl.cpp
For the full diff: browse the per-commit links above, or run git diff 9e58d4d69..feat/opencl-quantized-kv locally. (GitHub's online compare-vs-master view timed out when I tried it.)

Disclosure

I used Claude (4.8) to analyze the existing code and to write the kernels. The design decisions and the on-device measurements are mine. I understand the code itself, but I'm still learning how LLMs work end-to-end, so I didn't feel it was ready for a PR — I'm sharing it as a discussion instead. Corrections and code review are very welcome.

Why I'm sharing

If anyone else wants to run LLMs on mobile like I do, I hope this helps even a little. 🙏
