MTP: clean-up #9
Merged
Conversation
ggerganov
reviewed
May 12, 2026
2 tasks
ggerganov
reviewed
May 12, 2026
ggerganov
reviewed
May 12, 2026
ggerganov
reviewed
May 12, 2026
ggerganov
reviewed
May 12, 2026
CISC
reviewed
May 12, 2026
Comment on lines +5566 to +5572

```python
# When true, `--mtp` was passed: filter out trunk weights so the resulting
# GGUF carries only the MTP head and the shared embeddings/output tensors.
mtp_only: bool = False

# When true, `--no-mtp` was passed: drop `mtp.*` tensors and report block_count
# as the trunk-only layer count, producing a GGUF with no MTP head.
no_mtp: bool = False
```
Needs to be added to ModelBase, you may need to use super() or add a getter to properly access these.
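A minimal sketch of what "added to ModelBase" could look like, assuming a base class named `ModelBase` and a subclass `TextModel` as in convert_hf_to_gguf.py; the bodies here are illustrative only, not the PR's actual implementation:

```python
class ModelBase:
    # Class-level defaults so every subclass (and classmethods that only see
    # the class) can read the flags without an instance.
    mtp_only: bool = False
    no_mtp: bool = False

    def __init__(self, *, mtp_only: bool = False, no_mtp: bool = False):
        self.mtp_only = mtp_only
        self.no_mtp = no_mtp


class TextModel(ModelBase):
    def __init__(self, **kwargs):
        # Flags flow through super() so subclasses need no extra plumbing.
        super().__init__(**kwargs)


m = TextModel(mtp_only=True)
print(m.mtp_only, m.no_mtp)
```

With the flags on the base, a subclass accesses them as plain attributes; a getter would only be needed if the value must be computed per subclass.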
Comment on lines 5616 to 5633

```python
# Multimodal Qwen3.5/3.6 wrap the text model under `model.language_model.*`.
if name.startswith("model.language_model."):
    name = "model." + name[len("model.language_model."):]
elif name.startswith("language_model."):
    name = name[len("language_model."):]

if self.mtp_only:
    # In --mtp mode keep only the MTP block plus the shared embedding/output tensors
    # that the standalone MTP graph references at inference time.
    keep = (
        name.startswith("mtp.") or
        name in ("model.embed_tokens.weight", "model.norm.weight", "lm_head.weight") or
        name in ("embed_tokens.weight", "norm.weight")
    )
    if not keep:
        return

# Remap MTP block tensors to llama.cpp's layer-indexed nextn naming.
```
Suggested change

```python
# Multimodal Qwen3.5/3.6 wrap the text model under `model.language_model.*`.
if name.startswith("model.language_model."):
    name = "model." + name[len("model.language_model."):]
elif name.startswith("language_model."):
    name = name[len("language_model."):]
if self.mtp_only:
    # In --mtp mode keep only the MTP block plus the shared embedding/output tensors
    # that the standalone MTP graph references at inference time.
    keep = (
        name.startswith("mtp.") or
        name in ("model.embed_tokens.weight", "model.norm.weight", "lm_head.weight") or
        name in ("embed_tokens.weight", "norm.weight")
    )
    if not keep:
        return
# Remap MTP block tensors to llama.cpp's layer-indexed nextn naming.
```
The language_model stuff should be obsolete, and mtp_only can go in filter_tensors, no?
Owner
Author
Can you take a look again?
I'm drowning ATM, don't have time to look into the details, but preferably the mtp flags should go into ModelBase even though it may complicate access from a classmethod.
If it works as-is right now, we can flag it for a later refactor instead.
Owner
Author
It is part of ModelBase now
ggerganov
reviewed
May 12, 2026
ggerganov
reviewed
May 12, 2026
ggerganov
reviewed
May 12, 2026
am17an
commented
May 12, 2026
ggerganov
approved these changes
May 12, 2026
am17an
added a commit
that referenced
this pull request
May 13, 2026
* MTP: clean-up
* review: use llama_context_type instead of llama_graph_type
* review: remove llama_model_has_mtp
* review: fix convert issues
* convert: fix pycheck
* review: formatting
* use `mtp-` for identifying mtp models
* convert: fix mtp conversion
Following changes:
- `-hf`, `--spec-draft` takes precedence
- `llm_graph_input` as suggested by @ngxson in llama : add MTP API ggml-org/llama.cpp#18886
- `--no-mtp` to strip off the MTP layers, otherwise default to grafted on

This saves about 2.7 GB of VRAM for the q8_0 MTP when loading it together. So MTP is essentially just 400 MB of weights.