
MTP: clean-up #9

Merged
am17an merged 8 commits into mtp-clean from mtp-clean-up on May 13, 2026
Conversation

@am17an (Owner) commented May 12, 2026

This PR makes the following changes:

  • tries to look for an MTP model when -hf is specified; --spec-draft takes precedence
  • removes the hardcoded arch loading and instead allows loading via llm_graph_input, as suggested by @ngxson in llama : add MTP API ggml-org/llama.cpp#18886
  • convert gains a --no-mtp option to strip out the MTP layers; otherwise they are kept grafted on by default
  • the llama_memory handling needs another look

This saves about 2.7 GB of VRAM for the q8_0 MTP when it is loaded together with the base model, so the MTP is essentially just 400 MB of weights.
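Concretely, the --mtp / --no-mtp conversion behaviour described above boils down to a single keep/drop predicate per tensor. The following is a minimal sketch, not the actual convert_hf_to_gguf.py code: `filter_tensor` and the shared-tensor set are assumptions based on the diffs in this thread.

```python
# Sketch of the conversion-time tensor filtering (hypothetical helper).
MTP_PREFIX = "mtp."
# Shared tensors the standalone MTP graph still needs at inference time.
SHARED = {"model.embed_tokens.weight", "model.norm.weight", "lm_head.weight"}

def filter_tensor(name: str, mtp_only: bool, no_mtp: bool) -> bool:
    """Return True if the tensor should be written to the output GGUF."""
    if no_mtp:
        # --no-mtp: drop the MTP head, keep only the trunk.
        return not name.startswith(MTP_PREFIX)
    if mtp_only:
        # --mtp: keep only the MTP head plus the shared embeddings/output.
        return name.startswith(MTP_PREFIX) or name in SHARED
    # Default: keep everything, i.e. MTP stays grafted on.
    return True
```

Dropping trunk weights in --mtp mode is what makes the MTP-only GGUF so small relative to the full model.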

Comment thread tools/server/server-context.cpp Outdated
Comment thread include/llama.h Outdated
Comment thread common/download.cpp Outdated
Comment thread include/llama.h Outdated
Comment thread include/llama.h Outdated
Comment thread convert_hf_to_gguf.py Outdated
Comment on lines +5566 to +5572
# When true, `--mtp` was passed: filter out trunk weights so the resulting
# GGUF carries only the MTP head and the shared embeddings/output tensors.
mtp_only: bool = False

# When true, `--no-mtp` was passed: drop `mtp.*` tensors and report block_count
# as the trunk-only layer count, producing a GGUF with no MTP head.
no_mtp: bool = False
Needs to be added to ModelBase; you may need to use super() or add a getter to access these properly.
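One way to read that suggestion is the sketch below. Only `ModelBase`, `mtp_only`, and `no_mtp` come from the diff; `wants_mtp_tensors` and the subclass name are hypothetical illustrations of the getter idea.

```python
class ModelBase:
    # Flags set from the CLI; class-level so classmethods can see them too.
    mtp_only: bool = False
    no_mtp: bool = False

    @classmethod
    def wants_mtp_tensors(cls) -> bool:
        # Getter usable from classmethods (e.g. tensor-filtering hooks that
        # run before any instance exists), instead of poking attributes
        # directly on each subclass.
        return not cls.no_mtp

class Qwen3NextModel(ModelBase):
    # Hypothetical subclass standing in for the real model class.
    pass

# Setting the flag on one subclass does not disturb other model classes.
Qwen3NextModel.no_mtp = True
```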

Comment thread convert_hf_to_gguf.py Outdated
Comment on lines 5616 to 5633
# Multimodal Qwen3.5/3.6 wrap the text model under `model.language_model.*`.
if name.startswith("model.language_model."):
    name = "model." + name[len("model.language_model."):]
elif name.startswith("language_model."):
    name = name[len("language_model."):]

if self.mtp_only:
    # In --mtp mode keep only the MTP block plus the shared embedding/output tensors
    # that the standalone MTP graph references at inference time.
    keep = (
        name.startswith("mtp.") or
        name in ("model.embed_tokens.weight", "model.norm.weight", "lm_head.weight") or
        name in ("embed_tokens.weight", "norm.weight")
    )
    if not keep:
        return

# Remap MTP block tensors to llama.cpp's layer-indexed nextn naming.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
-# Multimodal Qwen3.5/3.6 wrap the text model under `model.language_model.*`.
-if name.startswith("model.language_model."):
-    name = "model." + name[len("model.language_model."):]
-elif name.startswith("language_model."):
-    name = name[len("language_model."):]
-if self.mtp_only:
-    # In --mtp mode keep only the MTP block plus the shared embedding/output tensors
-    # that the standalone MTP graph references at inference time.
-    keep = (
-        name.startswith("mtp.") or
-        name in ("model.embed_tokens.weight", "model.norm.weight", "lm_head.weight") or
-        name in ("embed_tokens.weight", "norm.weight")
-    )
-    if not keep:
-        return
-# Remap MTP block tensors to llama.cpp's layer-indexed nextn naming.
+# Remap MTP block tensors to llama.cpp's layer-indexed nextn naming.

The language_model stuff should be obsolete, and mtp_only can go in filter_tensors, no?
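The `filter_tensors` route could look roughly like this. A sketch only: the hook name follows the comment above, and its signature and the shared-tensor set are assumptions.

```python
# Tensors the standalone MTP graph still references besides the mtp.* block.
MTP_SHARED = {
    "model.embed_tokens.weight", "model.norm.weight", "lm_head.weight",
    "embed_tokens.weight", "norm.weight",
}

def filter_tensors(names, mtp_only: bool):
    """Yield only the tensor names that should reach the remapping stage.

    Doing the keep/drop decision in a dedicated filtering hook keeps the
    tensor-remapping path free of mode-specific branches.
    """
    for name in names:
        if mtp_only and not (name.startswith("mtp.") or name in MTP_SHARED):
            continue
        yield name
```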

Owner Author

Can you take a look again?


I'm drowning ATM, don't have time to look into the details, but preferably the mtp flags should go into ModelBase even though it may complicate access from a classmethod.

If it works as-is right now, we can flag it for a later refactor instead.

Owner Author

It is part of ModelBase now

Comment thread src/llama-context.cpp Outdated
Comment thread src/llama-context.cpp Outdated
Comment thread src/llama-model.cpp
Comment thread tools/server/server-context.cpp Outdated
Comment thread src/llama-cparams.h
Comment thread src/llama-context.cpp Outdated
Comment thread src/llama-context.cpp Outdated
Comment thread include/llama.h Outdated
am17an merged commit a421d66 into mtp-clean on May 13, 2026
38 of 52 checks passed
am17an added a commit that referenced this pull request May 13, 2026
* MTP: clean-up

* review: use llama_context_type instead of llama_graph_type

* review: remove llama_model_has_mtp

* review: fix convert issues

* convert: fix pycheck

* review: formatting

* use `mtp-` for identifying mtp models

* convert: fix mtp conversion
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

4 participants