How to support KV Cache injection for DFlash speculative decoding #24904

ruixiang63 · 2026-06-22T13:15:52Z

ruixiang63
Jun 22, 2026
Collaborator

We opened a draft PR some time ago to support DFlash speculative decoding: #22105. Now that the Eagle3 PR has been merged, the issues related to the unified API should be resolved. At this point, the only blocker I see is how to support the KV-cache injection mechanism that DFlash requires in llama.cpp.

The workflow in the original implementation is as follows:

e = dflash_encoder(raw_target_features) // features of the accepted tokens, extracted from the target model
K_e = K_proj(e)
V_e = V_proj(e)

K_m = K_proj(mask_tokens)
V_m = V_proj(mask_tokens)
Q = Q_proj(mask_tokens)

K = concat(K_cached, K_e, K_m) 
V = concat(V_cached, V_e, V_m) 

attn(Q, K, V) // full attention build_attn_mha

// after the draft step, discard the mask-token K/V and keep only the encoder-derived entries
K_cached = concat(K_cached, K_e)
V_cached = concat(V_cached, V_e)
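
As a rough illustration of the bookkeeping above, here is a minimal C++ sketch. It is not llama.cpp code: `toy_kv_cache`, `project`, `concat` and `draft_step` are hypothetical names, and each token is reduced to a single float so that a "projection" is just a multiplication. It only shows that attention sees the cached, new, and mask-token rows for one step, while only the encoder-derived rows are committed.

```cpp
#include <cstddef>
#include <vector>

// Toy stand-in for K_proj / V_proj: one float per token instead of a vector.
// (Assumption for illustration; real projections are matrix multiplies.)
static std::vector<float> project(const std::vector<float> & x, float w) {
    std::vector<float> out;
    out.reserve(x.size());
    for (float v : x) {
        out.push_back(v * w);
    }
    return out;
}

static std::vector<float> concat(std::vector<float> a, const std::vector<float> & b) {
    a.insert(a.end(), b.begin(), b.end());
    return a;
}

// Hypothetical persistent cache owned by the DFlash decoder.
struct toy_kv_cache {
    std::vector<float> k;
    std::vector<float> v;
};

// One draft step. Returns the number of K rows that attention would see,
// and commits only K_e / V_e to the cache.
static size_t draft_step(toy_kv_cache & cache,
                         const std::vector<float> & e,
                         const std::vector<float> & mask_tokens,
                         float wk, float wv) {
    const std::vector<float> k_e = project(e, wk);
    const std::vector<float> v_e = project(e, wv);
    const std::vector<float> k_m = project(mask_tokens, wk);

    // K = concat(K_cached, K_e, K_m), used for this step only
    const std::vector<float> k_all = concat(concat(cache.k, k_e), k_m);

    // K_cached = concat(K_cached, K_e); the mask-token rows are dropped
    cache.k = concat(cache.k, k_e);
    cache.v = concat(cache.v, v_e);

    return k_all.size();
}
```

With two accepted tokens and a block of four mask tokens, the first call sees six K rows and leaves two rows in the cache; a second call with one more accepted token sees seven rows and leaves three.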

The current workaround in the draft PR is to accumulate the features extracted from the target model and feed all of them back into the DFlash decoder on every round. The accumulated target features therefore keep growing, and their K/V projections are recomputed from scratch each time, which is not ideal. See the example below:

ctx_cached = []          // accumulates the DFlash encoder outputs e

e_new      = dflash_encoder(raw_target_features_new) // features of the newly accepted tokens, extracted from the target model
ctx_cached = concat(ctx_cached, e_new)               // ctx_cached now holds the whole accumulated context (e_0 .. e_{n-1})

K_e = K_proj(ctx_cached)     // full recompute over all accumulated tokens
V_e = V_proj(ctx_cached)

K_m = K_proj(mask_tokens)          // mask_tokens = [id_last, mask×(block-1)]
V_m = V_proj(mask_tokens)
Q   = Q_proj(mask_tokens) 

K = concat(K_e, K_m) 
V = concat(V_e, V_m)

attn(Q, K, V) // full attention build_attn_mha

// end of draft step:
//   - noise K/V is discarded automatically (no cache, the whole graph is rebuilt next round)
//   - the only thing persisted is ctx_cached (encoder outputs)

As a result, the DFlash decoder rebuilds its graph every iteration (graphs reused = 0). The root cause is that cross.n_enc, the length of the accumulated ctx_cached, grows monotonically; this changes the shape of ctx_cached and invalidates all downstream tensor shapes.
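
To put a number on the recompute cost of the workaround, the following sketch counts how many rows go through K_proj / V_proj over several rounds. It is only an illustration: `projection_cost` and `count_projected_rows` are hypothetical names, and the count ignores the mask tokens, which are projected in both schemes.

```cpp
#include <cstddef>
#include <vector>

// Rows pushed through K_proj / V_proj, summed over all rounds.
struct projection_cost {
    size_t recompute; // workaround: project all of ctx_cached every round
    size_t cached;    // persistent cache: project only the new rows
};

static projection_cost count_projected_rows(const std::vector<size_t> & accepted_per_round) {
    projection_cost cost = {0, 0};
    size_t n_ctx = 0; // plays the role of cross.n_enc
    for (size_t n_new : accepted_per_round) {
        n_ctx += n_new;
        cost.recompute += n_ctx;
        cost.cached    += n_new;
    }
    return cost;
}
```

For three rounds that accept 3, 2 and 4 tokens, the accumulated length is 3, 5 and 9, so the workaround projects 17 rows in total while a persistent cache would project 9.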

The ideal solution mirrors the original implementation: maintain a persistent KV cache in the DFlash decoder (the K_cached / V_cached above) and reuse it every round. On each round we encode only the newly accepted tokens (e = dflash_encoder(accepted_target_features)) and write their K/V into that cache, so the per-round decode keeps a fixed shape and the graph can be reused.
The open question is how to support this in llama.cpp, since the DFlash decoder maintains no such cache and currently relies on the cache-less build_attn_mha.

ggerganov · 2026-06-23T08:00:31Z

ggerganov
Jun 23, 2026
Maintainer

For now, you can do it like this:

In the DFlash decode graph, check whether the batch contains only tokens or only embeddings:
- If only embeddings -> project and cpy_k/v into memory
- If only tokens -> regular decode graph with attention

During common_speculative_process, you also need to call llama_decode after llama_encode in order to inject the KV cache.
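
The batch check described above could look roughly like the sketch below. This is a standalone illustration rather than llama.cpp code: `toy_batch`, `dflash_graph_mode` and `select_graph_mode` are hypothetical names, and the handling of a mixed or empty batch is an assumption that the suggestion does not cover.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical simplified batch: exactly one of the two members is expected
// to be non-empty.
struct toy_batch {
    std::vector<int32_t> tokens;
    std::vector<float>   embd;
};

enum class dflash_graph_mode {
    inject_kv,   // embeddings only: project and copy K/V into the cache
    decode_attn, // tokens only: regular decode graph with attention
    invalid,     // mixed or empty batch (assumed to be rejected)
};

static dflash_graph_mode select_graph_mode(const toy_batch & batch) {
    const bool has_tokens = !batch.tokens.empty();
    const bool has_embd   = !batch.embd.empty();

    if (has_embd && !has_tokens) {
        return dflash_graph_mode::inject_kv;
    }
    if (has_tokens && !has_embd) {
        return dflash_graph_mode::decode_attn;
    }
    return dflash_graph_mode::invalid;
}
```

A graph builder would branch on the returned mode once per batch, so the injection call and the draft call each take a separate path.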

3 replies

ruixiang63 Jun 23, 2026
Collaborator Author

Thanks! I would like to make sure I understand it correctly. So we would need a dual-mode DFlash decode graph:

- embeddings batch (e) → project + cpy_k/v into cache
- tokens batch (noise) → full build_attn
- process() calls llama_decode after llama_encode to inject the committed K/V cache, and draft() makes one more llama_decode call to generate the draft tokens.

One more question:

The noise block (mask tokens) attn is non-causal. So we run the decode with causal_attn = false, right?

ggerganov Jun 24, 2026
Maintainer

> So we would need a dual-mode DFlash decode graph

Yes, this is not ideal, but I don't have a better idea atm. After we see the first working implementation, we can think about how to improve it. It might be possible to use ggml_build_forward_select() to make both paths work with a single graph.

> The noise block (mask tokens) attn is non-causal. So we run the decode with causal_attn = false, right?

Yes. I think the simplest approach is to set causal_attn = false in the DFlash model hparams during conversion. Alternatively, you can set it explicitly in speculative.cpp.
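
To show what non-causal attention means for the noise block, here is a small sketch that builds a visibility mask for the block's queries over the cached rows plus the block itself. It is not how llama.cpp constructs its masks; `build_block_mask` is a hypothetical helper, and the assumption is that every query may always see all cached rows.

```cpp
#include <cstddef>
#include <vector>

// mask[i][j] == true means query i of the block may attend to key j.
// Keys 0 .. n_cached-1 are cached rows; keys n_cached .. n_cached+n_block-1
// are the block's own rows.
static std::vector<std::vector<bool>> build_block_mask(size_t n_cached, size_t n_block, bool causal) {
    std::vector<std::vector<bool>> mask(n_block, std::vector<bool>(n_cached + n_block, true));
    if (causal) {
        // causal: hide block positions that come after the query
        for (size_t i = 0; i < n_block; ++i) {
            for (size_t j = i + 1; j < n_block; ++j) {
                mask[i][n_cached + j] = false;
            }
        }
    }
    return mask;
}
```

With causal set to false, every mask token sees the whole block, including later positions, which is what the question above asks for.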

ruixiang63 Jun 24, 2026
Collaborator Author

Got that. Thanks! I will share the working version ASAP.
