metal: support gated delta recurrent snapshots for MTP rollback #10
Open
i386 wants to merge 19 commits into
* MTP: clean-up
* review: use llama_context_type instead of llama_graph_type
* review: remove llama_model_has_mtp
* review: fix convert issues
* convert: fix pycheck
* review: formatting
* use `mtp-` for identifying mtp models
* convert: fix mtp conversion
Currently, when some draft tokens are not accepted, speculative decoding has to restart from a checkpoint, which wastes work by running the target again. This PR adds the ability to roll back by up to `draft_max` tokens by storing the GDN intermediates.
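The idea can be illustrated with a minimal sketch. It assumes a toy scalar recurrent state and a hypothetical `toy_step` update, neither of which is the real gated delta net math or a llama.cpp API. It only shows why keeping the state after each draft token lets a rollback select the state after the last accepted token instead of replaying from the checkpoint.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy recurrence standing in for the real GDN state update.
inline float toy_step(float state, int token) {
    return 0.5f * state + static_cast<float>(token);
}

// Run the draft tokens through the recurrence and keep every intermediate state.
// snapshots[0] is the checkpoint (input) state; snapshots[i] is the state after
// the first i draft tokens have been consumed.
inline std::vector<float> run_with_snapshots(float checkpoint, const std::vector<int> & draft) {
    std::vector<float> snapshots;
    snapshots.reserve(draft.size() + 1);
    snapshots.push_back(checkpoint);
    float s = checkpoint;
    for (int tok : draft) {
        s = toy_step(s, tok);
        snapshots.push_back(s);
    }
    return snapshots;
}

// Roll back to the state after n_accepted draft tokens without re-running anything.
inline float rollback(const std::vector<float> & snapshots, std::size_t n_accepted) {
    assert(n_accepted < snapshots.size());
    return snapshots[n_accepted];
}
```

With three draft tokens and two accepted, `rollback(snapshots, 2)` returns the same value as replaying the first two tokens from the checkpoint, at the cost of storing one extra state per draft token.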
Extend the gated delta net kernel to store intermediate states for partial rollback support on the Metal backend.

- Add K (snapshot slot count) as a function constant
- Read input state from slot 0 of the 3D state tensor
- Write intermediate states to different slots during token loop
- For K=1, maintain backward-compatible single-slot behavior

Ref: ggml-org@8c05923
Assisted-by: llama.cpp:local pi
5060c92 to 4ef8664
Overview
Adds Metal backend support for the `GGML_OP_GATED_DELTA_NET` recurrent snapshot layout used by the MTP rollback path in ggml-org#22673.

The MTP path widens the recurrent state input to `(S_v * S_v * H, K, n_seqs)`, where `K = 1 + n_rs_seq`. Slot 0 remains the current input state. For `K > 1`, the Metal kernel now writes the last `min(n_tokens, K)` per-token recurrent states into the trailing snapshot slots using the same slot mapping as the CPU/CUDA/Vulkan implementations:

For `K == 1`, behavior remains the existing final-state-only output.

This also moves the snapshot slot count into the Metal kernel args instead of specializing the pipeline by `K`, so the same gated-delta pipeline can handle different recurrent snapshot counts.

Additional information
This is intended to complete the Metal side of the hybrid Qwen recurrent rollback support in ggml-org#22673.
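As a rough illustration of the snapshot layout arithmetic described in the overview, here is a minimal sketch. The struct and helper names are hypothetical and are not ggml types or functions. The element offset assumes a contiguous tensor with the first dimension varying fastest, which may not hold for every view. The sketch deliberately does not reproduce the token-to-slot mapping, since that is defined by the reference implementations.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical description of the widened state tensor (S_v * S_v * H, K, n_seqs).
struct gdn_snapshot_layout {
    int64_t state_elems; // S_v * S_v * H: elements in one slot
    int64_t K;           // 1 + n_rs_seq: slot count per sequence
    int64_t n_seqs;
};

inline gdn_snapshot_layout make_layout(int64_t S_v, int64_t H, int64_t n_rs_seq, int64_t n_seqs) {
    assert(S_v > 0 && H > 0 && n_rs_seq >= 0 && n_seqs > 0);
    return { S_v * S_v * H, 1 + n_rs_seq, n_seqs };
}

// Number of per-token states kept: the last min(n_tokens, K).
// For K == 1 this is just the final state.
inline int64_t n_states_kept(int64_t n_tokens, int64_t K) {
    return std::min(n_tokens, K);
}

// Index of the first token whose resulting state is kept.
inline int64_t first_kept_token(int64_t n_tokens, int64_t K) {
    return n_tokens - n_states_kept(n_tokens, K);
}

// Element offset of slot `slot` in sequence `seq`, assuming a contiguous layout
// with the state dimension varying fastest (an assumption of this sketch).
inline int64_t slot_offset(const gdn_snapshot_layout & l, int64_t seq, int64_t slot) {
    assert(seq >= 0 && seq < l.n_seqs && slot >= 0 && slot < l.K);
    return (seq * l.K + slot) * l.state_elems;
}
```

For example, with `n_rs_seq = 3` the layout has `K = 4` slots per sequence, and a 10-token batch keeps the states produced by tokens 6 through 9.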
Validated locally on Apple Metal:
Result:
I also ran end-to-end `llama-server` MTP benchmarks on an Apple M1 Ultra using `unsloth/Qwen3.6-27B-MTP-GGUF` Q8_0, with the Metal backend, a 9-prompt corpus, 192 generated tokens per prompt, prompt cache disabled, and one fresh server state per condition.

No outliers over 5% from each mode's median were observed in that fresh-state run.
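One way to express the outlier criterion is sketched below. It assumes each mode yields a list of per-prompt measurements (for example throughput) and that an outlier is any sample whose relative deviation from that mode's median exceeds 5%. The function names are hypothetical and this is not the script used for the benchmark.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Median of a non-empty sample set (takes a copy so the caller's order is kept).
inline double median(std::vector<double> v) {
    assert(!v.empty());
    std::sort(v.begin(), v.end());
    const std::size_t n = v.size();
    return (n % 2 == 1) ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}

// Count samples deviating from the median by more than rel_tol (default 5%).
inline std::size_t count_outliers(const std::vector<double> & samples, double rel_tol = 0.05) {
    const double med = median(samples);
    std::size_t n_out = 0;
    for (double x : samples) {
        if (std::fabs(x - med) > rel_tol * std::fabs(med)) {
            ++n_out;
        }
    }
    return n_out;
}
```

Running `count_outliers` once per mode and checking that every result is zero corresponds to the statement above.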
Requirements