metal: support gated delta recurrent snapshots for MTP rollback#10

Open
i386 wants to merge 19 commits into am17an:mtp-clean from i386:mesh-llm/mtp-metal-snapshots

Conversation


@i386 i386 commented May 14, 2026

Overview

Adds Metal backend support for the GGML_OP_GATED_DELTA_NET recurrent snapshot layout used by the MTP rollback path in ggml-org#22673.

The MTP path widens the recurrent state input to (S_v * S_v * H, K, n_seqs), where K = 1 + n_rs_seq. Slot 0 remains the current input state. For K > 1, the Metal kernel now writes the last min(n_tokens, K) per-token recurrent states into the trailing snapshot slots using the same slot mapping as the CPU/CUDA/Vulkan implementations:

`target_slot = t - (n_tokens - K)`

For K == 1, behavior remains the existing final-state-only output.

This also moves the snapshot slot count into the Metal kernel args instead of specializing the pipeline by K, so the same gated-delta pipeline can handle different recurrent snapshot counts.
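The slot mapping described above can be sketched as follows (an illustrative Python sketch only; `n_tokens`, `K`, and the slot indices follow the PR description, not the actual Metal kernel source):

```python
def snapshot_slots(n_tokens: int, k: int) -> dict[int, int]:
    """Map token index -> snapshot slot via target_slot = t - (n_tokens - K).

    Only the last min(n_tokens, K) tokens land in a slot (negative targets
    are skipped); for K == 1 this degenerates to writing just the final state.
    """
    slots = {}
    for t in range(n_tokens):
        target_slot = t - (n_tokens - k)
        if target_slot >= 0:
            slots[t] = target_slot
    return slots

# With 5 tokens and K = 3, only tokens 2..4 get snapshot slots 0..2:
print(snapshot_slots(5, 3))  # {2: 0, 3: 1, 4: 2}
# K = 1 keeps the existing final-state-only behavior:
print(snapshot_slots(5, 1))  # {4: 0}
```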

Additional information

This is intended to complete the Metal side of the hybrid Qwen recurrent rollback support in ggml-org#22673.

Validated locally on Apple Metal:

```shell
cmake -S /tmp/llama-mtp-metal -B /tmp/llama-mtp-metal/build-metal \
  -DGGML_METAL=ON \
  -DLLAMA_BUILD_TESTS=ON \
  -DLLAMA_BUILD_EXAMPLES=ON \
  -DCMAKE_BUILD_TYPE=Release

cmake --build /tmp/llama-mtp-metal/build-metal --config Release --parallel 8 \
  --target llama-server test-backend-ops

/tmp/llama-mtp-metal/build-metal/bin/test-backend-ops test -b MTL0 -o GATED_DELTA_NET
```

Result:

```
GATED_DELTA_NET: 30/30 tests passed
Backend MTL0: OK
```

I also ran end-to-end llama-server MTP benchmarks on an Apple M1 Ultra using unsloth/Qwen3.6-27B-MTP-GGUF Q8_0 on the Metal backend, with a 9-prompt corpus, 192 generated tokens per prompt, prompt cache disabled, and a fresh server state per condition.

| Mode     | Runs | Mean wall (s) | Mean tok/s | Mean speedup | Accept |
|----------|------|---------------|------------|--------------|--------|
| baseline | 3    | 105.48        | 16.38      | 1.000x       | n/a    |
| mtp_w2   | 3    | 78.87         | 21.91      | 1.337x       | 98.50% |
| mtp_w3   | 3    | 74.11         | 23.32      | 1.423x       | 97.37% |

No run deviated more than 5% from its mode's median in this fresh-state benchmark.
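The speedup column follows directly from the mean wall-clock times. A quick arithmetic check, using the numbers from the table above:

```python
# Speedup = baseline mean wall time / mode mean wall time.
baseline = 105.48
for mode, wall in [("mtp_w2", 78.87), ("mtp_w3", 74.11)]:
    print(f"{mode}: {baseline / wall:.3f}x")
# mtp_w2: 1.337x
# mtp_w3: 1.423x
```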

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES - AI assistance was used to help prepare and validate this patch and to draft this PR text. The submitted changes and PR description have been reviewed and edited by the human contributor before submission.

am17an and others added 19 commits May 13, 2026 14:25
* MTP: clean-up

* review: use llama_context_type instead of llama_graph_type

* review: remove llama_model_has_mtp

* review: fix convert issues

* convert: fix pycheck

* review: formatting

* use `mtp-` for identifying mtp models

* convert: fix mtp conversion
Currently, speculative decoding has to restart from a checkpoint after draft tokens are rejected, which wastes work re-running the target model. This change adds the ability to roll back up to `draft_max` tokens by storing the GDN intermediates.
Extend the gated delta net kernel to store intermediate states for
partial rollback support on the Metal backend.

- Add K (snapshot slot count) as a function constant
- Read input state from slot 0 of the 3D state tensor
- Write intermediate states to different slots during token loop
- For K=1, maintain backward-compatible single-slot behavior

Ref: ggml-org@8c05923

Assisted-by: llama.cpp:local pi
@i386 i386 marked this pull request as ready for review May 14, 2026 11:57
@am17an am17an force-pushed the mtp-clean branch 2 times, most recently from 5060c92 to 4ef8664 Compare May 14, 2026 14:20
