spec: support MTP #6

Open

am17an wants to merge 3 commits into gg/spec-parallel from gg-mtp-rebase

Conversation

@am17an commented May 11, 2026

I have removed the partial rollback changes and isolated the changes to just the Qwen models. Things still to work out:

  • generic MTP loading (support both a separate GGUF and MTP weights grafted onto the same GGUF?)
  • vision inputs
  • n_seq > 1
  • partial rollback
  • unaccounted memory
  • prefill speeds

Note that partial rollback is extremely important for the speed-up here; for the MoE model there is actually a slowdown with MTP on this branch.

@ggerganov

After the refactoring, all the state management of the draft context is performed outside of common/speculative - i.e. in the server_context. So all the logic for llama_memory_seq_rm can be removed - it is already taken into account in the server:

diff --git a/common/speculative.cpp b/common/speculative.cpp
index ef13edd34..95329b8a6 100644
--- a/common/speculative.cpp
+++ b/common/speculative.cpp
@@ -592,19 +592,6 @@ struct common_speculative_state_mtp : public common_speculative_impl {
         auto & draft_tokens = *dp.result;
         draft_tokens.clear();
 
-        if (last_n_drafted[seq_id] > 0) {
-            const int32_t n_to_drop = (int32_t) last_n_drafted[seq_id] - 1;
-            if (n_to_drop > 0) {
-                const llama_pos pos_max = llama_memory_seq_pos_max(llama_get_memory(ctx_dft), seq_id);
-                if (pos_max >= 0) {
-                    const llama_pos drop_from = pos_max - n_to_drop + 1;
-                    llama_memory_seq_rm(llama_get_memory(ctx_dft), seq_id, drop_from, -1);
-                }
-            }
-            last_n_drafted[seq_id]  = 0;
-            last_n_accepted[seq_id] = 0;
-        }
-
         // Effective draft length: min(global cap, per-sequence override).
         int32_t n_max = std::max(1, params.n_max);
         if (dp.n_max > 0) {
@@ -673,32 +660,9 @@ struct common_speculative_state_mtp : public common_speculative_impl {
             cond_tok = best;
             ++pos;
         }
-
-        last_n_drafted[seq_id] = (uint16_t) draft_tokens.size();
     }
 
     void accept(llama_seq_id seq_id, uint16_t n_accepted) override {
-        GGML_ASSERT(seq_id >= 0 && (size_t) seq_id < last_n_drafted.size());
-
-        auto * ctx_dft = this->params.ctx_dft;
-
-        const llama_pos pos_max = llama_memory_seq_pos_max(llama_get_memory(ctx_dft), seq_id);
-        const int32_t   n_drafted_last = (int32_t) last_n_drafted[seq_id];
-
-        const int32_t n_to_drop = std::max(0, n_drafted_last - (int32_t) n_accepted - 1);
-
-        if (pos_max < 0) {
-            last_n_accepted[seq_id] = (int32_t) n_accepted;
-            return;
-        }
-
-        if (n_to_drop > 0) {
-            const llama_pos drop_from = pos_max - n_to_drop + 1;
-            llama_memory_seq_rm(llama_get_memory(ctx_dft), seq_id, drop_from, -1);
-        }
-
-        last_n_drafted [seq_id] = 0;
-        last_n_accepted[seq_id] = (int32_t) n_accepted;
     }
 };
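
For illustration, a minimal sketch of the server-side equivalent (the function and variable names here are assumed for the example, not the actual server_context code): after the target verifies a round, one trim call on the draft memory replaces all of the per-sequence bookkeeping above.

#include "llama.h"

// Hypothetical helper, names assumed: keep the accepted prefix [0, pos_keep)
// in the draft sequence and drop everything after it.
static void server_trim_draft(llama_context * ctx_dft, llama_seq_id seq_id, llama_pos pos_keep) {
    // removes KV entries with pos in [pos_keep, +inf) for this sequence,
    // making the last_n_drafted/last_n_accepted tracking in
    // common/speculative redundant
    llama_memory_seq_rm(llama_get_memory(ctx_dft), seq_id, pos_keep, -1);
}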
 

@ggerganov

Give me ~1 hour and I'll open a PR here to simplify (WIP: https://github.com/ggml-org/llama.cpp/tree/gg/spec-mtp-experiments)

Comment thread: src/models/qwen35-mtp.cpp
@ggerganov

note that partial rollback is extremely important for the speed-up here

In the partial rollback implementation, the accepted batch is not re-evaluated with the draft context, correct? I think this will narrow the difference a bit, though I'm not sure by how much.

Here are the mtp-bench.py results on M2 Ultra:

Qwen3.6-27B:

  • No MTP: "wall_s_total": 71.63
  • MTP (ggml-org:gg/spec-mtp-experiments): "wall_s_total": 43.11
  • MTP (am17an:mtp-clean): "wall_s_total": 41.67

Qwen3.6-35B-A3B:

  • No MTP: "wall_s_total": 20.3
  • MTP (ggml-org:gg/spec-mtp-experiments): "wall_s_total": 16.15
  • MTP (am17an:mtp-clean): "wall_s_total": 16.22

@am17an commented May 11, 2026

On my DGX Spark (patched to add a draft acceptance loop):

Qwen3.5-35B-A3B (using spec-draft-n-max 2):
No-MTP: "wall_s_total": 27.68
gg-mtp-rebase: "wall_s_total": 26.05
mtp-clean: "wall_s_total": 22.19

Qwen3.5-27B (using spec-draft-n-max 3):
No-MTP: "wall_s_total": 201.10
gg-mtp-rebase: "wall_s_total": 97.83
mtp-clean: "wall_s_total": 81.23

@am17an commented May 11, 2026

Another thing: mtp-clean doesn't use pinned memory like this PR does, so its wall time might be slightly inflated.

@am17an commented May 11, 2026

Basically, at low acceptance rates (< 0.5) the speed difference is going to be much larger. From anecdotal usage, with this PR I sometimes drop to 9 tok/s when doing real coding work, whereas with partial rollback I never drop below 14 tok/s even when acceptance is low. You can perhaps try it yourself; I felt the difference is quite real.

@ggerganov

Did you use this branch or #7?

@am17an commented May 11, 2026

I used this branch; I just saw #7.

@am17an commented May 11, 2026

Just tried #7 as well:

Qwen3.6 27B - "wall_s_total": 100.33
Qwen3.6 35BA3B - "wall_s_total": 26.82

Somehow the acceptance rates are suspiciously high; maybe there's some accounting error:

  code_python        pred= 192 draft= 139 acc= 138 rate=0.993 tok/s=19.5
  code_cpp           pred= 192 draft= 129 acc= 127 rate=0.985 tok/s=16.7
  explain_concept    pred= 192 draft= 118 acc= 117 rate=0.992 tok/s=13.7
  summarize          pred=  55 draft=  35 acc=  35 rate=1.000 tok/s=16.0
  qa_factual         pred= 178 draft= 109 acc= 107 rate=0.982 tok/s=13.8
  translation        pred=  23 draft=  13 acc=  12 rate=0.923 tok/s=12.4
  creative_short     pred= 192 draft= 109 acc= 105 rate=0.963 tok/s=12.9
  stepwise_math      pred= 192 draft= 130 acc= 130 rate=1.000 tok/s=16.6
  long_code_review   pred= 192 draft= 119 acc= 115 rate=0.966 tok/s=13.1

For reference, in mtp-clean they are:

  code_python        pred= 192 draft= 153 acc= 140 rate=0.915 tok/s=21.0
  code_cpp           pred= 192 draft= 171 acc= 134 rate=0.784 tok/s=17.8
  explain_concept    pred= 192 draft= 180 acc= 130 rate=0.722 tok/s=17.3
  summarize          pred=  55 draft=  54 acc=  36 rate=0.667 tok/s=15.9
  qa_factual         pred= 177 draft= 180 acc= 116 rate=0.644 tok/s=16.5
  translation        pred=  22 draft=  24 acc=  13 rate=0.542 tok/s=15.3
  creative_short     pred= 192 draft= 195 acc= 126 rate=0.646 tok/s=16.5
  stepwise_math      pred= 192 draft= 162 acc= 137 rate=0.846 tok/s=19.9
  long_code_review   pred= 192 draft= 186 acc= 129 rate=0.694 tok/s=15.8

@ggerganov

Somehow acceptance rates are suspiciously high, maybe some accounting error

With the p_min logic that I added, we don't draft low-prob tokens, so the acceptance is very high.

@ggerganov

You can observe the accepted drafts with the LLAMA_TRACE=1 env variable.

@am17an commented May 11, 2026

I think the p_min logic will also sample at every step, causing a logit transfer D2H, so it may increase overall time (since the draft model is extremely lightweight). Not sure if this is right, but p_min does add some time.
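
To make the concern concrete, here is a rough sketch of the per-step check that p_min implies (the shape is assumed, not the actual common/speculative code); the llama_get_logits_ith() call is the device-to-host transfer in question:

#include "llama.h"
#include <cmath>
#include <vector>

// Assumed sketch of p_min gating: read the draft logits back to the host,
// compute the top token's probability, and stop drafting below the threshold.
static bool draft_one_gated(llama_context * ctx_dft, int32_t i_out, float p_min,
                            std::vector<llama_token> & drafts) {
    const llama_vocab * vocab = llama_model_get_vocab(llama_get_model(ctx_dft));
    const int n_vocab = llama_vocab_n_tokens(vocab);

    const float * logits = llama_get_logits_ith(ctx_dft, i_out); // D2H readback every step

    // find the argmax, then softmax just enough to get its probability
    int id_top = 0;
    for (int i = 1; i < n_vocab; ++i) {
        if (logits[i] > logits[id_top]) {
            id_top = i;
        }
    }
    double sum = 0.0;
    for (int i = 0; i < n_vocab; ++i) {
        sum += std::exp((double) (logits[i] - logits[id_top]));
    }
    const float p_top = (float) (1.0 / sum);

    if (p_top < p_min) {
        return false; // low-probability draft tokens would likely be rejected anyway
    }

    drafts.push_back((llama_token) id_top);
    return true;
}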

@ggerganov

not sure if this is right, but p_min does add some time

Yes, I'm also not sure. On Mac it is always useful for some reason; on CUDA it sometimes helps and sometimes doesn't. In any case, it can be adjusted with the --spec-draft-p-min argument.

Regarding the partial rollback: it does bring a noticeable benefit on CUDA, but I still don't see a good way to support it cleanly. Among other drawbacks, the compute graph is no longer static. The logic is also not compatible with ngram speculative decoding, because that uses long drafts of ~64 tokens which still need to be checkpointed. And for some reason that I still don't understand, it does not seem to help much on Mac.
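
For concreteness, a hedged sketch of the difference between the two strategies (variable names assumed for illustration, not actual code from either branch):

// Full rollback (this branch): drop the whole draft tail past the committed
// prefix, then re-decode the freshly accepted tokens into the draft KV cache
// before the next drafting round.
llama_memory_seq_rm(llama_get_memory(ctx_dft), seq_id, n_past, -1);
// ... followed by llama_decode(ctx_dft, batch_accepted) ...

// Partial rollback (mtp-clean): the accepted draft tokens are already in the
// draft KV cache, so only the rejected tail is removed and no re-decode is
// needed - but this requires per-sequence bookkeeping and checkpointing.
llama_memory_seq_rm(llama_get_memory(ctx_dft), seq_id, n_past + n_accepted, -1);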

Comment thread: common/speculative.cpp
Comment on lines +480 to +484
// TODO: how to make it work with vision tokens?
if (batch_in.token == nullptr || batch_in.embd != nullptr) {
    pending_pos[seq_id] = -1;
    return true;
}

I'm not really sure what the correct way is to process the image embeddings with the MTP context. In any case, vision MTP seems to already work to a good extent:

Here I ask it to OCR 100 random integers without speculative decoding and with MTP:

  • Without spec decoding: [image]
  • With MTP: [image]

With MTP it is ~2x faster, which means the MTP context "knows" about the integers in some way. But at the same time, I'm pretty sure that the current way of processing is not 100% correct, because the inp->tokens tensor in the MTP graph is used with stale data when the input batch has image embeddings and no tokens.

I think we will figure this out later - not super important atm.

