
MTP: clean-up #9

Merged
am17an merged 8 commits into mtp-clean from mtp-clean-up on May 13, 2026
Conversation

@am17an (Owner) commented May 12, 2026

This PR makes the following changes:

  • tries to look for an MTP model when -hf is specified; --spec-draft takes precedence
  • removes the hardcoded arch loading and instead allows loading via llm_graph_input, as suggested by @ngxson in llama : add MTP API ggml-org/llama.cpp#18886
  • convert gains a --no-mtp option to strip out the MTP layers; otherwise they are kept grafted on by default
  • the llama_memory handling needs another look

This saves about 2.7 GB of VRAM for the q8_0 MTP when it is loaded together with the base model, so the MTP is essentially just 400 MB of weights.
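Concretely, the --mtp / --no-mtp conversion behaviour described above boils down to a single keep/drop predicate per tensor. The following is a minimal sketch, not the actual convert_hf_to_gguf.py code: `filter_tensor` and the shared-tensor set are assumptions based on the diffs in this thread.

```python
# Sketch of the conversion-time tensor filtering (hypothetical helper).
MTP_PREFIX = "mtp."
# Shared tensors the standalone MTP graph still needs at inference time.
SHARED = {"model.embed_tokens.weight", "model.norm.weight", "lm_head.weight"}

def filter_tensor(name: str, mtp_only: bool, no_mtp: bool) -> bool:
    """Return True if the tensor should be written to the output GGUF."""
    if no_mtp:
        # --no-mtp: drop the MTP head, keep only the trunk.
        return not name.startswith(MTP_PREFIX)
    if mtp_only:
        # --mtp: keep only the MTP head plus the shared embeddings/output.
        return name.startswith(MTP_PREFIX) or name in SHARED
    # Default: keep everything, i.e. MTP stays grafted on.
    return True
```

Dropping trunk weights in --mtp mode is what makes the MTP-only GGUF so small relative to the full model.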

Comment thread tools/server/server-context.cpp Outdated
Comment thread include/llama.h Outdated
Comment thread common/download.cpp Outdated
Comment thread include/llama.h Outdated
Comment thread include/llama.h Outdated
Comment thread convert_hf_to_gguf.py Outdated
Comment on lines +5566 to +5572
# When true, `--mtp` was passed: filter out trunk weights so the resulting
# GGUF carries only the MTP head and the shared embeddings/output tensors.
mtp_only: bool = False

# When true, `--no-mtp` was passed: drop `mtp.*` tensors and report block_count
# as the trunk-only layer count, producing a GGUF with no MTP head.
no_mtp: bool = False
Needs to be added to ModelBase; you may need to use super() or add a getter to access these properly.
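One way to read that suggestion is the sketch below. Only `ModelBase`, `mtp_only`, and `no_mtp` come from the diff; `wants_mtp_tensors` and the subclass name are hypothetical illustrations of the getter idea.

```python
class ModelBase:
    # Flags set from the CLI; class-level so classmethods can see them too.
    mtp_only: bool = False
    no_mtp: bool = False

    @classmethod
    def wants_mtp_tensors(cls) -> bool:
        # Getter usable from classmethods (e.g. tensor-filtering hooks that
        # run before any instance exists), instead of poking attributes
        # directly on each subclass.
        return not cls.no_mtp

class Qwen3NextModel(ModelBase):
    # Hypothetical subclass standing in for the real model class.
    pass

# Setting the flag on one subclass does not disturb other model classes.
Qwen3NextModel.no_mtp = True
```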

Comment thread convert_hf_to_gguf.py Outdated
Comment on lines 5616 to 5633
# Multimodal Qwen3.5/3.6 wrap the text model under `model.language_model.*`.
if name.startswith("model.language_model."):
    name = "model." + name[len("model.language_model."):]
elif name.startswith("language_model."):
    name = name[len("language_model."):]

if self.mtp_only:
    # In --mtp mode keep only the MTP block plus the shared embedding/output tensors
    # that the standalone MTP graph references at inference time.
    keep = (
        name.startswith("mtp.") or
        name in ("model.embed_tokens.weight", "model.norm.weight", "lm_head.weight") or
        name in ("embed_tokens.weight", "norm.weight")
    )
    if not keep:
        return

# Remap MTP block tensors to llama.cpp's layer-indexed nextn naming.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
-# Multimodal Qwen3.5/3.6 wrap the text model under `model.language_model.*`.
-if name.startswith("model.language_model."):
-    name = "model." + name[len("model.language_model."):]
-elif name.startswith("language_model."):
-    name = name[len("language_model."):]
-if self.mtp_only:
-    # In --mtp mode keep only the MTP block plus the shared embedding/output tensors
-    # that the standalone MTP graph references at inference time.
-    keep = (
-        name.startswith("mtp.") or
-        name in ("model.embed_tokens.weight", "model.norm.weight", "lm_head.weight") or
-        name in ("embed_tokens.weight", "norm.weight")
-    )
-    if not keep:
-        return
-# Remap MTP block tensors to llama.cpp's layer-indexed nextn naming.
+# Remap MTP block tensors to llama.cpp's layer-indexed nextn naming.

The language_model stuff should be obsolete, and mtp_only can go in filter_tensors, no?
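The `filter_tensors` route could look roughly like this. A sketch only: the hook name follows the comment above, and its signature and the shared-tensor set are assumptions.

```python
# Tensors the standalone MTP graph still references besides the mtp.* block.
MTP_SHARED = {
    "model.embed_tokens.weight", "model.norm.weight", "lm_head.weight",
    "embed_tokens.weight", "norm.weight",
}

def filter_tensors(names, mtp_only: bool):
    """Yield only the tensor names that should reach the remapping stage.

    Doing the keep/drop decision in a dedicated filtering hook keeps the
    tensor-remapping path free of mode-specific branches.
    """
    for name in names:
        if mtp_only and not (name.startswith("mtp.") or name in MTP_SHARED):
            continue
        yield name
```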

Owner Author

Can you take a look again?


I'm drowning ATM, don't have time to look into the details, but preferably the mtp flags should go into ModelBase even though it may complicate access from a classmethod.

If it works as-is right now, we can flag it for a later refactor instead.

Owner Author

It is part of ModelBase now

Comment thread src/llama-context.cpp Outdated
Comment thread src/llama-context.cpp Outdated
Comment thread src/llama-model.cpp
Comment thread tools/server/server-context.cpp Outdated
Comment thread src/llama-cparams.h
Comment thread src/llama-context.cpp Outdated
Comment thread src/llama-context.cpp Outdated
Comment thread include/llama.h Outdated
am17an merged commit a421d66 into mtp-clean on May 13, 2026
38 of 52 checks passed
am17an added a commit that referenced this pull request May 13, 2026
* MTP: clean-up

* review: use llama_context_type instead of llama_graph_type

* review: remove llama_model_has_mtp

* review: fix convert issues

* convert: fix pycheck

* review: formatting

* use `mtp-` for identifying mtp models

* convert: fix mtp conversion
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

4 participants