spec: support MTP #6

Open

am17an wants to merge 3 commits into gg/spec-parallel from gg-mtp-rebase

Conversation

@am17an commented May 11, 2026

I have removed the partial rollback changes and isolated the changes to just the Qwen models. Things still to work out:

  • generic MTP loading (support both a separate GGUF and MTP weights grafted onto the same GGUF?)
  • vision inputs
  • n_seq > 1
  • partial rollback
  • unaccounted memory
  • prefill speeds

Note that partial rollback is extremely important for the speed-up here; for the MoE model there is actually a slowdown with MTP on this branch.

@ggerganov

After the refactoring, all the state management of the draft context is performed outside of common/speculative - i.e. in the server_context. So all the logic for llama_memory_seq_rm can be removed - it is already taken into account in the server:

diff --git a/common/speculative.cpp b/common/speculative.cpp
index ef13edd34..95329b8a6 100644
--- a/common/speculative.cpp
+++ b/common/speculative.cpp
@@ -592,19 +592,6 @@ struct common_speculative_state_mtp : public common_speculative_impl {
         auto & draft_tokens = *dp.result;
         draft_tokens.clear();
 
-        if (last_n_drafted[seq_id] > 0) {
-            const int32_t n_to_drop = (int32_t) last_n_drafted[seq_id] - 1;
-            if (n_to_drop > 0) {
-                const llama_pos pos_max = llama_memory_seq_pos_max(llama_get_memory(ctx_dft), seq_id);
-                if (pos_max >= 0) {
-                    const llama_pos drop_from = pos_max - n_to_drop + 1;
-                    llama_memory_seq_rm(llama_get_memory(ctx_dft), seq_id, drop_from, -1);
-                }
-            }
-            last_n_drafted[seq_id]  = 0;
-            last_n_accepted[seq_id] = 0;
-        }
-
         // Effective draft length: min(global cap, per-sequence override).
         int32_t n_max = std::max(1, params.n_max);
         if (dp.n_max > 0) {
@@ -673,32 +660,9 @@ struct common_speculative_state_mtp : public common_speculative_impl {
             cond_tok = best;
             ++pos;
         }
-
-        last_n_drafted[seq_id] = (uint16_t) draft_tokens.size();
     }
 
     void accept(llama_seq_id seq_id, uint16_t n_accepted) override {
-        GGML_ASSERT(seq_id >= 0 && (size_t) seq_id < last_n_drafted.size());
-
-        auto * ctx_dft = this->params.ctx_dft;
-
-        const llama_pos pos_max = llama_memory_seq_pos_max(llama_get_memory(ctx_dft), seq_id);
-        const int32_t   n_drafted_last = (int32_t) last_n_drafted[seq_id];
-
-        const int32_t n_to_drop = std::max(0, n_drafted_last - (int32_t) n_accepted - 1);
-
-        if (pos_max < 0) {
-            last_n_accepted[seq_id] = (int32_t) n_accepted;
-            return;
-        }
-
-        if (n_to_drop > 0) {
-            const llama_pos drop_from = pos_max - n_to_drop + 1;
-            llama_memory_seq_rm(llama_get_memory(ctx_dft), seq_id, drop_from, -1);
-        }
-
-        last_n_drafted [seq_id] = 0;
-        last_n_accepted[seq_id] = (int32_t) n_accepted;
     }
 };
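
For illustration, a minimal sketch of the server-side equivalent (the function and variable names here are assumed for the example, not the actual server_context code): after the target verifies a round, one trim call on the draft memory replaces all of the per-sequence bookkeeping above.

#include "llama.h"

// Hypothetical helper, names assumed: keep the accepted prefix [0, pos_keep)
// in the draft sequence and drop everything after it.
static void server_trim_draft(llama_context * ctx_dft, llama_seq_id seq_id, llama_pos pos_keep) {
    // removes KV entries with pos in [pos_keep, +inf) for this sequence,
    // making the last_n_drafted/last_n_accepted tracking in
    // common/speculative redundant
    llama_memory_seq_rm(llama_get_memory(ctx_dft), seq_id, pos_keep, -1);
}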
 

@ggerganov

Give me ~1 hour and I'll open a PR here to simplify (WIP: https://github.com/ggml-org/llama.cpp/tree/gg/spec-mtp-experiments)

Comment thread: src/models/qwen35-mtp.cpp
@ggerganov

note that partial rollback is extremely important for the speed-up here

In the partial rollback implementation, the accepted batch is not re-evaluated with the draft context, correct? I think this will narrow the difference a bit, though I'm not sure by how much.

Here are the mtp-bench.py results on M2 Ultra:

Qwen3.6-27B:

  • No MTP: "wall_s_total": 71.63
  • MTP (ggml-org:gg/spec-mtp-experiments): "wall_s_total": 43.11
  • MTP (am17an:mtp-clean): "wall_s_total": 41.67

Qwen3.6-35B-A3B:

  • No MTP: "wall_s_total": 20.3
  • MTP (ggml-org:gg/spec-mtp-experiments): "wall_s_total": 16.15
  • MTP (am17an:mtp-clean): "wall_s_total": 16.22

@am17an commented May 11, 2026

On my DGX Spark (patched to add a draft acceptance loop):

Qwen3.5-35B-A3B (using spec-draft-n-max 2):
No-MTP: "wall_s_total": 27.68
gg-mtp-rebase: "wall_s_total": 26.05
mtp-clean: "wall_s_total": 22.19

Qwen3.5-27B (using spec-draft-n-max 3):
No-MTP: "wall_s_total": 201.10
gg-mtp-rebase: "wall_s_total": 97.83
mtp-clean: "wall_s_total": 81.23

@am17an commented May 11, 2026

Another thing: mtp-clean doesn't use pinned memory like this PR does, so its wall time might be slightly inflated.

@am17an commented May 11, 2026

Basically, at low acceptance rates (< 0.5) the speed difference is going to be much larger. From anecdotal usage, with this PR I sometimes drop to 9 tok/s when doing real coding work, whereas with partial rollback I never drop below 14 tok/s even when acceptance is low. You can perhaps try it yourself; I felt the difference is quite real.

@ggerganov

Did you use this branch or #7?

@am17an commented May 11, 2026

I used this branch; I just saw #7.

@am17an commented May 11, 2026

Just tried #7 as well:

Qwen3.6 27B - "wall_s_total": 100.33
Qwen3.6 35BA3B - "wall_s_total": 26.82

Somehow the acceptance rates are suspiciously high; maybe there's some accounting error:

  code_python        pred= 192 draft= 139 acc= 138 rate=0.993 tok/s=19.5
  code_cpp           pred= 192 draft= 129 acc= 127 rate=0.985 tok/s=16.7
  explain_concept    pred= 192 draft= 118 acc= 117 rate=0.992 tok/s=13.7
  summarize          pred=  55 draft=  35 acc=  35 rate=1.000 tok/s=16.0
  qa_factual         pred= 178 draft= 109 acc= 107 rate=0.982 tok/s=13.8
  translation        pred=  23 draft=  13 acc=  12 rate=0.923 tok/s=12.4
  creative_short     pred= 192 draft= 109 acc= 105 rate=0.963 tok/s=12.9
  stepwise_math      pred= 192 draft= 130 acc= 130 rate=1.000 tok/s=16.6
  long_code_review   pred= 192 draft= 119 acc= 115 rate=0.966 tok/s=13.1

For reference, in mtp-clean they are:

  code_python        pred= 192 draft= 153 acc= 140 rate=0.915 tok/s=21.0
  code_cpp           pred= 192 draft= 171 acc= 134 rate=0.784 tok/s=17.8
  explain_concept    pred= 192 draft= 180 acc= 130 rate=0.722 tok/s=17.3
  summarize          pred=  55 draft=  54 acc=  36 rate=0.667 tok/s=15.9
  qa_factual         pred= 177 draft= 180 acc= 116 rate=0.644 tok/s=16.5
  translation        pred=  22 draft=  24 acc=  13 rate=0.542 tok/s=15.3
  creative_short     pred= 192 draft= 195 acc= 126 rate=0.646 tok/s=16.5
  stepwise_math      pred= 192 draft= 162 acc= 137 rate=0.846 tok/s=19.9
  long_code_review   pred= 192 draft= 186 acc= 129 rate=0.694 tok/s=15.8

@ggerganov

Somehow acceptance rates are suspiciously high, maybe some accounting error

With the p_min logic that I added, we don't draft low-prob tokens, so the acceptance is very high.

@ggerganov

You can observe the accepted drafts with the LLAMA_TRACE=1 env variable.

@am17an commented May 11, 2026

I think the p_min logic will also sample at every step, causing a logit transfer D2H, so it may increase overall time (since the draft model is extremely lightweight). Not sure if this is right, but p_min does add some time.
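
To make the concern concrete, here is a rough sketch of the per-step check that p_min implies (the shape is assumed, not the actual common/speculative code); the llama_get_logits_ith() call is the device-to-host transfer in question:

#include "llama.h"
#include <cmath>
#include <vector>

// Assumed sketch of p_min gating: read the draft logits back to the host,
// compute the top token's probability, and stop drafting below the threshold.
static bool draft_one_gated(llama_context * ctx_dft, int32_t i_out, float p_min,
                            std::vector<llama_token> & drafts) {
    const llama_vocab * vocab = llama_model_get_vocab(llama_get_model(ctx_dft));
    const int n_vocab = llama_vocab_n_tokens(vocab);

    const float * logits = llama_get_logits_ith(ctx_dft, i_out); // D2H readback every step

    // find the argmax, then softmax just enough to get its probability
    int id_top = 0;
    for (int i = 1; i < n_vocab; ++i) {
        if (logits[i] > logits[id_top]) {
            id_top = i;
        }
    }
    double sum = 0.0;
    for (int i = 0; i < n_vocab; ++i) {
        sum += std::exp((double) (logits[i] - logits[id_top]));
    }
    const float p_top = (float) (1.0 / sum);

    if (p_top < p_min) {
        return false; // low-probability draft tokens would likely be rejected anyway
    }

    drafts.push_back((llama_token) id_top);
    return true;
}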

@ggerganov

not sure if this is right, but p_min does add some time

Yes, I'm also not sure. On Mac it is always useful for some reason; on CUDA it sometimes helps and sometimes doesn't. In any case, it can be adjusted with the --spec-draft-p-min argument.

Regarding the partial rollback: it does bring a noticeable benefit on CUDA, but I still don't see a good way to support it cleanly. Among other drawbacks, the compute graph is no longer static. The logic is also not compatible with ngram speculative decoding, because that uses long drafts of ~64 tokens which still need to be checkpointed. And for some reason that I still don't understand, it does not seem to help much on Mac.
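
For concreteness, a hedged sketch of the difference between the two strategies (variable names assumed for illustration, not actual code from either branch):

// Full rollback (this branch): drop the whole draft tail past the committed
// prefix, then re-decode the freshly accepted tokens into the draft KV cache
// before the next drafting round.
llama_memory_seq_rm(llama_get_memory(ctx_dft), seq_id, n_past, -1);
// ... followed by llama_decode(ctx_dft, batch_accepted) ...

// Partial rollback (mtp-clean): the accepted draft tokens are already in the
// draft KV cache, so only the rejected tail is removed and no re-decode is
// needed - but this requires per-sequence bookkeeping and checkpointing.
llama_memory_seq_rm(llama_get_memory(ctx_dft), seq_id, n_past + n_accepted, -1);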

Comment thread: common/speculative.cpp
Comment on lines +480 to +484
// TODO: how to make it work with vision tokens?
if (batch_in.token == nullptr || batch_in.embd != nullptr) {
    pending_pos[seq_id] = -1;
    return true;
}

I'm not really sure what the correct way is to process the image embeddings with the MTP context. In any case, vision MTP seems to already work to a good extent:

Here I ask it to OCR 100 random integers without speculative decoding and with MTP:

  • Without spec decoding: [image]
  • With MTP: [image]

With MTP it is ~2x faster, which means the MTP context "knows" about the integers in some way. But at the same time, I'm pretty sure that the current way of processing is not 100% correct, because the inp->tokens tensor in the MTP graph is used with stale data when the input batch has image embeddings and no tokens.

I think we will figure this out later - not super important atm.

