metal: support gated delta recurrent snapshots for MTP rollback#10

Open
i386 wants to merge 19 commits into am17an:mtp-clean from i386:mesh-llm/mtp-metal-snapshots

Conversation


@i386 i386 commented May 14, 2026

Overview

Adds Metal backend support for the GGML_OP_GATED_DELTA_NET recurrent snapshot layout used by the MTP rollback path in ggml-org#22673.

The MTP path widens the recurrent state input to (S_v * S_v * H, K, n_seqs), where K = 1 + n_rs_seq. Slot 0 remains the current input state. For K > 1, the Metal kernel now writes the last min(n_tokens, K) per-token recurrent states into the trailing snapshot slots using the same slot mapping as the CPU/CUDA/Vulkan implementations:

`target_slot = t - (n_tokens - K)`

For K == 1, behavior remains the existing final-state-only output.

This also moves the snapshot slot count into the Metal kernel args instead of specializing the pipeline by K, so the same gated-delta pipeline can handle different recurrent snapshot counts.
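The slot mapping described above can be sketched as follows (an illustrative Python sketch only; `n_tokens`, `K`, and the slot indices follow the PR description, not the actual Metal kernel source):

```python
def snapshot_slots(n_tokens: int, k: int) -> dict[int, int]:
    """Map token index -> snapshot slot via target_slot = t - (n_tokens - K).

    Only the last min(n_tokens, K) tokens land in a slot (negative targets
    are skipped); for K == 1 this degenerates to writing just the final state.
    """
    slots = {}
    for t in range(n_tokens):
        target_slot = t - (n_tokens - k)
        if target_slot >= 0:
            slots[t] = target_slot
    return slots

# With 5 tokens and K = 3, only tokens 2..4 get snapshot slots 0..2:
print(snapshot_slots(5, 3))  # {2: 0, 3: 1, 4: 2}
# K = 1 keeps the existing final-state-only behavior:
print(snapshot_slots(5, 1))  # {4: 0}
```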

Additional information

This is intended to complete the Metal side of the hybrid Qwen recurrent rollback support in ggml-org#22673.

Validated locally on Apple Metal:

```shell
cmake -S /tmp/llama-mtp-metal -B /tmp/llama-mtp-metal/build-metal \
  -DGGML_METAL=ON \
  -DLLAMA_BUILD_TESTS=ON \
  -DLLAMA_BUILD_EXAMPLES=ON \
  -DCMAKE_BUILD_TYPE=Release

cmake --build /tmp/llama-mtp-metal/build-metal --config Release --parallel 8 \
  --target llama-server test-backend-ops

/tmp/llama-mtp-metal/build-metal/bin/test-backend-ops test -b MTL0 -o GATED_DELTA_NET
```

Result:

```
GATED_DELTA_NET: 30/30 tests passed
Backend MTL0: OK
```

I also ran end-to-end llama-server MTP benchmarks on an Apple M1 Ultra using unsloth/Qwen3.6-27B-MTP-GGUF Q8_0 on the Metal backend, with a 9-prompt corpus, 192 generated tokens per prompt, prompt cache disabled, and a fresh server state per condition.

| Mode     | Runs | Mean wall (s) | Mean tok/s | Mean speedup | Accept |
|----------|------|---------------|------------|--------------|--------|
| baseline | 3    | 105.48        | 16.38      | 1.000x       | n/a    |
| mtp_w2   | 3    | 78.87         | 21.91      | 1.337x       | 98.50% |
| mtp_w3   | 3    | 74.11         | 23.32      | 1.423x       | 97.37% |

No run deviated more than 5% from its mode's median in this fresh-state benchmark.
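The speedup column follows directly from the mean wall-clock times. A quick arithmetic check, using the numbers from the table above:

```python
# Speedup = baseline mean wall time / mode mean wall time.
baseline = 105.48
for mode, wall in [("mtp_w2", 78.87), ("mtp_w3", 74.11)]:
    print(f"{mode}: {baseline / wall:.3f}x")
# mtp_w2: 1.337x
# mtp_w3: 1.423x
```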

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES - AI assistance was used to help prepare and validate this patch and to draft this PR text. The submitted changes and PR description have been reviewed and edited by the human contributor before submission.

am17an and others added 19 commits May 13, 2026 14:25
* MTP: clean-up

* review: use llama_context_type instead of llama_graph_type

* review: remove llama_model_has_mtp

* review: fix convert issues

* convert: fix pycheck

* review: formatting

* use `mtp-` for identifying mtp models

* convert: fix mtp conversion
Currently, speculative decoding has to restart from a checkpoint after draft tokens are rejected, which wastes work re-running the target model. This change adds the ability to roll back up to `draft_max` tokens by storing the GDN intermediates.
Extend the gated delta net kernel to store intermediate states for
partial rollback support on the Metal backend.

- Add K (snapshot slot count) as a function constant
- Read input state from slot 0 of the 3D state tensor
- Write intermediate states to different slots during token loop
- For K=1, maintain backward-compatible single-slot behavior

Ref: ggml-org@8c05923

Assisted-by: llama.cpp:local pi
@i386 i386 marked this pull request as ready for review May 14, 2026 11:57
@am17an am17an force-pushed the mtp-clean branch 2 times, most recently from 5060c92 to 4ef8664 Compare May 14, 2026 14:20
