How to support KV Cache injection for DFlash speculative decoding #24904
ruixiang63
started this conversation in
Ideas
Replies: 1 comment 3 replies
For now you can do it like this:
We opened a draft PR some time ago to support DFlash speculative decoding: #22105. Now that the Eagle3 PR has been merged, the issues related to the unified API should be resolved. At this point, the only blocker I see is how to support the KV-cache injection mechanism that DFlash requires in llama.cpp.

The workflow from the original implementation is as follows:
The current workaround in the draft PR is to accumulate the features extracted from the target model and feed them back into the DFlash decoder each time. As a result, the accumulated target features keep growing, which is not an ideal solution. See the example below:
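
A minimal sketch of why this workaround defeats graph reuse, under stated assumptions: it is not the draft PR's code, and `encode_features`, `graph_key`, and `N_EMBD` are hypothetical names chosen for illustration. It only models the shapes involved. Each round appends the newly accepted features to `ctx_cached`, so the shape key that the decoder graph depends on never repeats.

```python
# Illustrative sketch only: encode_features, graph_key and N_EMBD are hypothetical
# names, not llama.cpp or DFlash APIs.

N_EMBD = 4  # assumed feature width


def encode_features(n_tokens):
    """Stand-in for target-feature extraction: one N_EMBD-wide row per token."""
    return [[0.0] * N_EMBD for _ in range(n_tokens)]


def graph_key(ctx_cached, n_draft):
    """Shapes the decoder graph depends on; reuse needs an identical key."""
    return (len(ctx_cached), N_EMBD, n_draft)


def run_rounds(accepted_per_round, n_draft=8):
    """Return (final context length, rounds that could reuse the previous graph)."""
    ctx_cached = []   # accumulated target features, re-fed in full every round
    prev_key = None
    reused = 0
    for n_accepted in accepted_per_round:
        ctx_cached.extend(encode_features(n_accepted))
        key = graph_key(ctx_cached, n_draft)
        if key == prev_key:
            reused += 1
        prev_key = key
    return len(ctx_cached), reused


if __name__ == "__main__":
    print(run_rounds([3, 1, 4, 2]))
```

With at least one accepted token per round, the context length strictly increases, so the reuse count in this model stays at zero.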
Because of this growth, the DFlash decoder rebuilds its graph every iteration (graphs reused = 0). The root cause is that `cross.n_enc`, the length of the accumulated `ctx_cached`, grows monotonically, changing the shape of `ctx_cached` and invalidating all downstream tensor shapes.

The ideal solution mirrors the original implementation: maintain a persistent KV cache in the DFlash decoder (the `K_cached`/`V_cached` above) and reuse it every round. On each round we encode only the newly accepted tokens (`e = dflash_encoder(accepted_target_features)`) and write their K/V into that cache, so the per-round decode keeps a fixed shape and the graph can be reused.

The open question is how to support this in llama.cpp, since the DFlash decoder currently maintains no such cache and relies on the cache-less `build_attn_mha` for now.
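
As a rough illustration of the persistent-cache idea, assuming a preallocated cache of fixed capacity, the sketch below writes only the K/V rows of newly accepted tokens at the current offset and masks the unused slots. Everything here is hypothetical (`DraftKVCache`, `project_kv`, `N_CTX`, `N_EMBD`). It models shapes only and says nothing about how llama.cpp's graph or memory code would implement it, which remains the open question.

```python
# Illustrative sketch only: DraftKVCache, project_kv, N_CTX and N_EMBD are
# hypothetical names, not llama.cpp or DFlash APIs.

N_CTX = 16   # assumed fixed cache capacity
N_EMBD = 4   # assumed K/V width


class DraftKVCache:
    """Preallocated K/V buffers whose shape never changes between rounds."""

    def __init__(self, n_ctx=N_CTX, n_embd=N_EMBD):
        self.k = [[0.0] * n_embd for _ in range(n_ctx)]
        self.v = [[0.0] * n_embd for _ in range(n_ctx)]
        self.n_used = 0

    def append(self, k_new, v_new):
        """Write K/V rows for newly accepted tokens at the current offset."""
        n = len(k_new)
        if len(v_new) != n:
            raise ValueError("K and V must have the same number of rows")
        if self.n_used + n > len(self.k):
            raise ValueError("cache capacity exceeded")
        # Equal-length slice assignment overwrites rows without resizing.
        self.k[self.n_used:self.n_used + n] = k_new
        self.v[self.n_used:self.n_used + n] = v_new
        self.n_used += n

    def shape(self):
        return (len(self.k), len(self.k[0]))

    def mask(self):
        """True for filled slots; attention would ignore the rest."""
        return [i < self.n_used for i in range(len(self.k))]


def project_kv(n_tokens):
    """Stand-in for encoding accepted features and projecting them to K/V."""
    k = [[1.0] * N_EMBD for _ in range(n_tokens)]
    v = [[2.0] * N_EMBD for _ in range(n_tokens)]
    return k, v


def run_rounds_cached(accepted_per_round, n_draft=8):
    """Return (filled slots, rounds that could reuse the previous graph)."""
    cache = DraftKVCache()
    prev_key = None
    reused = 0
    for n_accepted in accepted_per_round:
        k_new, v_new = project_kv(n_accepted)   # only the new tokens
        cache.append(k_new, v_new)
        key = (cache.shape(), n_draft)          # independent of n_used
        if key == prev_key:
            reused += 1
        prev_key = key
    return cache.n_used, reused


if __name__ == "__main__":
    print(run_rounds_cached([3, 1, 4, 2]))
```

In this model the shape key is the same in every round, so every round after the first could reuse the graph. The number of valid entries is carried by the mask and the write offset instead of by a tensor dimension.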