server: force non-mmap load for MTP head to avoid Metal full-model duplicate#22941
frozename wants to merge 12 commits into
Conversation
Hi @frozename, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
This should not be a PR; you made a one-line change. Why not just discuss/propose it on the MTP PR?
Ok I'll do that
I'd also like to stress that this kind of spam-by-AI will normally earn submitters a ban; the only reason you were not banned is because of the way you responded.
Thanks for the heads-up. I'll keep that in mind for future contributions.
Traced the M4 Pro doubled-allocation OOM to its source: the MTP arch registers tensors at both ends of the GGUF (`tok_embd` near the start, `output`/`nextn.*`/last layer at the end), so the mmap-backed buffer path covers `[first, last)` ≈ the full file. Apple Metal allocates that whole range, duplicating the main model in VRAM. The server log shows two identical `MTL0_Mapped model buffer` entries; the duplicate doesn't roll up into the target context's `self` in the breakdown, so it appears as unaccounted memory.

One-line fix: force `use_mmap=false` on the MTP load. The non-mmap allocator sizes the buffer to registered tensors only. Filed upstream as ggml-org/llama.cpp#22941. Patch preserved at tools/llama-cpp-mtp/0001-mtp-mmap-fix.patch.

Results on M4 Pro 48 GB:
- Q5_K_M Metal MTP buffer: 18760 -> 1425 MiB (13.2x reduction)
- Q8_0 Metal MTP buffer: 28213 -> 1719 MiB (16.4x reduction)
- Q8_0 decode: 7.4 -> 11.1 tok/s (1.49x vs vanilla, accept 0.725)
- Q4_K_M unchanged from pre-fix (no regression)
- Q5_K_M still 0.91x (bandwidth-bound at smaller quants)

The Slice-A results doc gets a 2026-05-11 addendum with the full mechanism walkthrough and code references.
Targets the MTP path introduced in #22673 (currently a draft). Branched off the PR's head (`5d5f1b46`). Should be cherry-picked into #22673 / squashed into the same series before that PR lands. Filing as a separate PR for review visibility; happy to close once it's incorporated into #22673.

Bug
The MTP head is loaded by reopening the same GGUF with `override_arch = qwen35_mtp[_moe]`, which registers tensors near both the start of the file (`tok_embd`) and the end (`output`, `nextn.*`, the last transformer block). With the default mmap-backed buffer path at `src/llama-model.cpp:1463-1483`, the backend allocates a buffer covering the full `[first_tensor_offset, last_tensor_offset)` range, which for Qwen 3.6 27B spans nearly the entire file.

On Apple Silicon the result is a Metal buffer the same size as the main model, duplicated. The server log shows two `MTL0_Mapped model buffer size` entries with identical byte counts: one for the main model, one for the MTP head.

This is observable as `unaccounted ≈ model_size` in `common_memory_breakdown_print` (the MTP context's allocations live in a sibling `llama_context` and don't roll up into the target context's `self`).
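To make the two sizing rules concrete, here is a self-contained schematic sketch (not llama.cpp code; the offsets and sizes are invented to mimic a tok_embd-at-the-start / output-at-the-end registration pattern) that computes both buffer sizes:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Schematic illustration only: contrast the buffer size an mmap-backed path
// needs (the full [first, last) file span) with what a non-mmap path needs
// (just the registered tensors themselves).
struct tensor_reg {
    uint64_t offset; // byte offset of the tensor data within the GGUF
    uint64_t size;   // byte size of the tensor data
};

int main() {
    const std::vector<tensor_reg> regs = {
        {     1ull << 20, 400ull << 20 }, // "tok_embd": early in the file
        { 18000ull << 20, 700ull << 20 }, // "nextn.* + output": at the end
    };
    const uint64_t first = regs.front().offset;
    const uint64_t last  = regs.back().offset + regs.back().size;

    uint64_t tensors_only = 0; // what the non-mmap allocator would size
    for (const auto & r : regs) {
        tensors_only += r.size;
    }

    std::printf("mmap span   : %llu MiB\n", (unsigned long long) ((last - first) >> 20));
    std::printf("tensors only: %llu MiB\n", (unsigned long long) (tensors_only >> 20));
    return 0;
}
```

With these invented numbers the mmap span comes out to ~18.3 GiB against ~1.1 GiB of actual tensor data, the same order of disparity as the measured 18760 vs 1425 MiB.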
Reproduction (M4 Pro 48 GB, Apple10 / pre-M5 Metal, PR head `5d5f1b46`)

Pre-fix, the server log shows two `MTL0_Mapped model buffer size` entries with the same byte count, roughly 18760 MiB each for Q5_K_M: one for the main model, one for the MTP head.
The server OOMs mid-decode against the 38 GB Metal cap (M4 Pro). Q8_0 is even worse (~28 GB per copy).
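Back-of-envelope, using the Q5_K_M numbers above (my arithmetic, not a figure from the log):

$$2 \times 18760\ \text{MiB} \approx 36.6\ \text{GiB},$$

which leaves almost nothing of the ~38 GB cap for KV cache and compute buffers, hence an OOM mid-decode rather than at load time.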
Fix
Force `use_mmap = false` on the MTP load only. The non-mmap allocator path (`llama-model.cpp:1492`) sizes the backend buffer to the registered tensors instead of the mmap range, which for the MTP arch is ~1.4-1.7 GB instead of ~19-28 GB.
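A minimal sketch of the shape of the change, for orientation. The surrounding function is a hypothetical stand-in, not the PR's actual diff (which is a one-line override in the server's MTP load path, gated as described in the Notes); only `llama_model_params`, its `use_mmap` field, `llama_model_default_params`, and `llama_model_load_from_file` are real llama.cpp API:

```cpp
#include "llama.h"

// Hedged sketch: load_mtp_head stands in for the server's MTP load path.
// The one behavioral change is the use_mmap override, which makes the
// allocator size the backend buffer to the registered tensors instead of
// the [first, last) mmap span.
static llama_model * load_mtp_head(const char * gguf_path) {
    llama_model_params mparams = llama_model_default_params();
    mparams.use_mmap = false; // MTP head only; vanilla / draft / mmproj loads untouched
    return llama_model_load_from_file(gguf_path, mparams);
}
```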
Before / after (M4 Pro 48 GB, Qwen 3.6 27B, froggeric's MTP GGUFs)

| Quant | Metal MTP buffer, pre-fix | Metal MTP buffer, post-fix | Reduction |
|---|---|---|---|
| Q5_K_M | 18760 MiB | 1425 MiB | 13.2x |
| Q8_0 | 28213 MiB | 1719 MiB | 16.4x |

Both quants were OOMing mid-decode pre-fix; both now run cleanly.
Aggregate decode tok/s with the OP's recipe (`-ctk q8_0 -ctv q8_0`, no flash-attn, default ub), 9-prompt suite (am17an's gist, `temperature=0 seed=42 n_predict=192`):

Q4 is unchanged (it fit pre-fix; no regression). Q5 and Q8 are new datapoints because they OOM'd pre-fix: Q5_K_M lands at 0.91x vs vanilla, while Q8_0 goes from 7.4 to 11.1 tok/s (1.49x vs vanilla, acceptance 0.725) and clears the speculative-decoding win threshold on this hardware. Speculative savings scale with main-pass cost, so MTP shines at the largest quant the box can hold.
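As a sanity check on that 1.49x (my back-of-envelope, assuming the MTP head drafts one token per target pass): the ideal speedup at acceptance rate $p = 0.725$, ignoring the draft head's own cost, would be

$$1 + p = 1.725\times,$$

so the observed 1.49x leaves a plausible margin for the head's compute and the restarts after rejected tokens.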
Testing scope
Built and tested only on M4 Pro 48 GB / Apple10 Metal (the only platform I have access to). The non-mmap allocator path is well-exercised on all backends (it's the fallback when mmap fails or is disabled by `--no-mmap`), so functional correctness should be unchanged everywhere. The performance impact at MTP load time on other backends is a slower direct read of ~1-2 GB of tensor data instead of mmap pointer aliasing: a one-time cost of a few seconds, with no inference-time effect.

Would appreciate someone running this on CUDA / Vulkan to confirm no regression there.
Notes
- The override is gated on `params_base.speculative.type == COMMON_SPECULATIVE_TYPE_MTP`; vanilla / draft / mmproj load paths are untouched.
- Interaction with `--no-mmap` and `--mlock` was considered: the override is a no-op if the user already disabled mmap, and `--mlock` still works via the existing `mlock_bufs` path (`llama-model.cpp:1497-1502`).
- An alternative, skipping the `tok_embd`/`output` registration in `qwen35_mtp.cpp::load_arch_tensors` when the `nextn.*` equivalents are present, was considered but rejected: it requires GGUF-metadata sniffing before arch instantiation (see the sketch below), and it breaks any future MTP GGUF that lacks the nextn variants. The non-mmap override keeps the arch logic simple and the failure mode contained.
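For reference, a hedged sketch of the pre-pass the rejected alternative would need. `gguf_init_from_file`, `gguf_get_n_tensors`, `gguf_get_tensor_name`, and `gguf_free` are real ggml API; the helper name and the `nextn.` prefix check are illustrative assumptions:

```cpp
#include <cstring>
#include "gguf.h"

// Illustrative only: the rejected alternative would have to sniff the GGUF
// before arch instantiation to decide whether the file carries nextn.*
// tensors and the tok_embd/output registration can be skipped.
static bool gguf_has_nextn_tensors(const char * path) {
    gguf_init_params ip = { /*no_alloc =*/ true, /*ctx =*/ nullptr };
    gguf_context * ctx = gguf_init_from_file(path, ip);
    if (!ctx) {
        return false;
    }
    bool found = false;
    for (int64_t i = 0; i < gguf_get_n_tensors(ctx); ++i) {
        if (strncmp(gguf_get_tensor_name(ctx, i), "nextn.", 6) == 0) {
            found = true;
            break;
        }
    }
    gguf_free(ctx);
    return found;
}
```

Even with such a helper, any MTP GGUF produced without the `nextn.*` naming would silently fall back to the duplicating path, which is the contained-failure-mode argument for the mmap override instead.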