
server: force non-mmap load for MTP head to avoid Metal full-model duplicate#22941

Closed
frozename wants to merge 12 commits into ggml-org:master from frozename:fix/mtp-server-non-mmap-load

Conversation

@frozename

Targets the MTP path introduced in #22673 (currently a draft). Branched off the PR's head (5d5f1b46). Should be cherry-picked into #22673 / squashed into the same series before that PR lands. Filing as a separate PR for review visibility — happy to close once it's incorporated into #22673.

Bug

The MTP head is loaded by reopening the same GGUF with override_arch = qwen35_mtp[_moe], which registers tensors near both the start of the file (tok_embd) and the end (output, nextn.*, last transformer block). With the default mmap-backed buffer path at src/llama-model.cpp:1463-1483, the backend allocates a buffer covering the full [first_tensor_offset, last_tensor_offset) range — which for Qwen 3.6 27B spans nearly the entire file.
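
A minimal sketch of the two sizing behaviors (illustrative only: `tensor_entry` and both helpers are invented for this example and are not actual llama.cpp code):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct tensor_entry {
    size_t file_offset; // offset of the tensor data within the GGUF
    size_t size;        // size of the tensor data in bytes
};

// mmap path: one contiguous mapping must cover [first, last) across all
// registered tensors, even if most of that range belongs to tensors the
// override_arch never registered.
size_t mmap_buffer_size(const std::vector<tensor_entry> & tensors) {
    size_t first = SIZE_MAX, last = 0;
    for (const auto & t : tensors) {
        first = std::min(first, t.file_offset);
        last  = std::max(last,  t.file_offset + t.size);
    }
    return last - first;
}

// non-mmap path: the backend buffer only needs to hold the tensors that
// were actually registered.
size_t alloc_buffer_size(const std::vector<tensor_entry> & tensors) {
    size_t total = 0;
    for (const auto & t : tensors) {
        total += t.size;
    }
    return total;
}
```

With tensors registered at both ends of the file, `mmap_buffer_size` returns nearly the file size while `alloc_buffer_size` returns only the few GB the MTP head actually uses.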

On Apple Silicon the result is a second Metal buffer the same size as the main model. The server log shows two MTL0_Mapped model buffer size entries with identical byte counts: one for the main model, one for the MTP head.

This is observable as an unaccounted entry of roughly the model size in common_memory_breakdown_print: the MTP context's allocations live in a sibling llama_context and don't roll up into the target context's self column.

Reproduction (M4 Pro 48 GB, Apple10 / pre-M5 Metal, PR head 5d5f1b46)

```
llama-server -m Qwen3.6-27B-Q5_K_M-mtp.gguf \
  --ctx-size 8192 --no-warmup -np 1 -ngl 99 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --spec-type mtp --spec-draft-n-max 3
```

Pre-fix server log:

```
load_tensors:  MTL0_Mapped model buffer size = 18760.13 MiB   <-- main model
load_tensors:  MTL0_Mapped model buffer size = 18760.13 MiB   <-- MTP head, full duplicate
```

The server then OOMs mid-decode against the ~38 GB Metal allocation cap on the M4 Pro. Q8_0 is even worse (~28 GB per copy).

Fix

Force use_mmap = false on the MTP load only. The non-mmap allocator path (llama-model.cpp:1492) sizes the backend buffer to the registered tensors instead of the mmap range — which for the MTP arch is ~1.4-1.7 GB instead of ~19-28 GB.
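
In sketch form (hedged: `params_base.speculative.type` and `COMMON_SPECULATIVE_TYPE_MTP` follow the PR text, but the exact call site below is illustrative, not the actual diff):

```cpp
// Illustrative sketch of the one-line override, not the actual diff.
llama_model_params mparams = common_model_params_to_llama(params);
if (params_base.speculative.type == COMMON_SPECULATIVE_TYPE_MTP) {
    // Force the non-mmap allocator so the backend buffer is sized to the
    // registered tensors (~1.4-1.7 GB) instead of the [first, last) mmap
    // range (~19-28 GB for these quants).
    mparams.use_mmap = false;
}
```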

Before / after (M4 Pro 48 GB, Qwen 3.6 27B, froggeric's MTP GGUFs)

| Quant | Metal MTP buffer (pre-fix) | Metal MTP buffer (post-fix) | Reduction |
|---|---|---|---|
| Q5_K_M-mtp (19 GB file) | 18 760 MiB | 1 425 MiB | 13.2× |
| Q8_0-mtp (29 GB file) | 28 213 MiB | 1 719 MiB | 16.4× |

Both quants were OOMing mid-decode pre-fix; both now run cleanly.

Aggregate decode tok/s with the OP's recipe (-ctk q8_0 -ctv q8_0, no flash-attn, default ub) on the 9-prompt suite (am17an's gist; temperature=0, seed=42, n_predict=192):

| Quant | Vanilla (tok/s) | MTP post-fix (tok/s) | Ratio | Accept |
|---|---|---|---|---|
| Q4_K_M | 11.9 | 10.0 | 0.85× | 0.701 |
| Q5_K_M | 9.5 | 8.6 | 0.91× | 0.713 |
| Q8_0 | 7.4 | 11.1 | 1.49× | 0.725 |

Q4 is unchanged (it fit pre-fix; no regression). Q5 and Q8 are new datapoints because they OOM'd pre-fix. Q8 clears the speculative-decoding win threshold on this hardware — speculative savings scale with main-pass cost, so MTP shines at the largest quant the box can hold.
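
For intuition on why the win grows with quant size, here is a standard speculative-decoding back-of-envelope (a hedged model, not a measurement from this PR; it assumes i.i.d. per-token acceptance, which MTP only approximates):

```cpp
#include <cmath>
#include <cstdio>

int main() {
    const double a = 0.725; // measured Q8_0 acceptance from the table above
    const int    n = 3;     // --spec-draft-n-max 3
    // Expected tokens produced per target pass under a geometric
    // acceptance model: sum_{k=0}^{n} a^k = (1 - a^{n+1}) / (1 - a)
    const double expected = (1.0 - std::pow(a, n + 1)) / (1.0 - a);
    std::printf("expected tokens per target pass: %.2f\n", expected); // ~2.63
    // The measured 1.49x is well below this ceiling because drafting and
    // verification are not free; the overhead matters less as the main
    // pass gets costlier, which is why the largest quant benefits most.
    return 0;
}
```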

Testing scope

Built and tested only on M4 Pro 48 GB / Apple10 Metal (the only platform I have access to). The non-mmap allocator path is well-exercised on all backends (it's the fallback when mmap fails or is disabled by --no-mmap), so functional correctness should be unchanged everywhere. The performance impact at MTP load time on other backends is a slower direct read of ~1-2 GB of tensor data instead of mmap pointer aliasing: a one-time cost of a few seconds, with no inference-time effect.

Would appreciate someone running this on CUDA / Vulkan to confirm no regression there.

Notes

  • The fix only triggers when params_base.speculative.type == COMMON_SPECULATIVE_TYPE_MTP — vanilla / draft / mmproj load paths are untouched.
  • Interactions with --no-mmap and --mlock were considered: the override is a no-op if the user already disabled mmap, and --mlock still works via the existing mlock_bufs path (llama-model.cpp:1497-1502).
  • A more surgical alternative — conditionally skipping tok_embd / output registration in qwen35_mtp.cpp::load_arch_tensors when the nextn.* equivalents are present — was considered but rejected: it requires GGUF-metadata sniffing before arch instantiation, and breaks any future MTP GGUF that lacks the nextn variants. The non-mmap override keeps the arch logic simple and the failure mode contained.

am17an and others added 12 commits May 4, 2026 20:15
Currently speculative decoding needs to restart from a checkpoint after
some draft tokens are not accepted, which wastes work re-running the
target. This PR adds the ability to roll back up to `draft_max` tokens
by storing the GDN intermediates.
server: force non-mmap load for MTP head to avoid Metal full-model duplicate

The MTP head is loaded by reopening the same GGUF with override_arch =
qwen35_mtp[_moe], which registers tensors near both the start of the
file (tok_embd) and the end (output, nextn.*, last transformer block).
With the default mmap-backed buffer path
(llama-model.cpp:1463-1483), the backend allocates a buffer covering
the full [first_tensor_offset, last_tensor_offset) range — which for
Qwen 3.6 27B spans nearly the entire file.

On Apple Silicon the result is a ~model-sized Metal duplicate. Server
log shows two MTL0_Mapped model buffer size entries of the same byte
count (one for the main model, one for the MTP head).

Before/after on M4 Pro 48 GB, Qwen 3.6 27B (Apple10 / pre-M5 Metal):

  Q5_K_M-mtp  Metal MTP buffer:  18760 MiB -> 1425 MiB (13.2x)
  Q8_0-mtp    Metal MTP buffer:  28213 MiB -> 1719 MiB (16.4x)

Q5 and Q8 were both OOMing mid-decode pre-fix; both now run cleanly.
Q8_0 produces a 1.49x aggregate decode win vs vanilla on M4 Pro.

Force use_mmap=false on the MTP load so the non-mmap allocator path
sizes the buffer to the registered tensors only. Vanilla, draft-model
and mmproj load paths are untouched.
@frozename frozename requested review from a team, CISC, JohannesGaessler and ggerganov as code owners May 11, 2026 10:16
@ggml-gh-bot

ggml-gh-bot Bot commented May 11, 2026

Hi @frozename, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • Multiple backend changes in one PR: When adding support for a new model or feature, focus on CPU support only in the initial PR. Add support for other backends like CUDA in follow-up PRs. If you have a good reason to modify multiple backends in one PR, please explain it.

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

  • Large PR: Large changes require prior discussion (e.g. an issue or RFC) and maintainers may not be able to review this PR as-is. Consider splitting it into smaller, focused PRs.


Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@0cc4m
Contributor

0cc4m commented May 11, 2026

This should not be a PR, you made a 1 line change. Why not just discuss/propose it on the MTP PR?

@0cc4m 0cc4m closed this May 11, 2026
@frozename frozename mentioned this pull request May 11, 2026
@frozename
Author

> This should not be a PR, you made a 1 line change. Why not just discuss/propose it on the MTP PR?

Ok I'll do that

@github-actions Bot added the model, testing, Nvidia GPU, Vulkan, and examples labels May 11, 2026
@github-actions Bot added the python, server, ggml, and Apple Metal labels May 11, 2026
@CISC
Member

CISC commented May 11, 2026

I'd also like to stress that this kind of spam-by-ai will normally earn submitters a ban, the only reason you were not is because of the way you responded.

@frozename
Author

> I'd also like to stress that this kind of spam-by-ai will normally earn submitters a ban, the only reason you were not is because of the way you responded.

Thanks for the heads up. I'll keep that in mind for future contributions

frozename added a commit to frozename/llamactl that referenced this pull request May 11, 2026
Traced the M4 Pro doubled-allocation OOM to its source: the MTP arch
registers tensors at both ends of the GGUF (tok_embd near the start,
output/nextn/last layer at the end), so the mmap-backed buffer path
covers [first, last) ~ full file. Apple Metal allocates that whole
range, duplicating the main model in VRAM. Server log shows two
identical MTL0_Mapped model buffer entries; the duplicate doesn't roll
up into the target context's self in the breakdown so it appears as
unaccounted memory.

One-line fix: force use_mmap=false on the MTP load. Non-mmap allocator
sizes the buffer to registered tensors only.

Filed upstream as ggml-org/llama.cpp#22941. Patch preserved at
tools/llama-cpp-mtp/0001-mtp-mmap-fix.patch.

Results on M4 Pro 48 GB:
- Q5_K_M Metal MTP buffer: 18760 -> 1425 MiB (13.2x reduction)
- Q8_0   Metal MTP buffer: 28213 -> 1719 MiB (16.4x reduction)
- Q8_0 decode: 7.4 -> 11.1 tok/s (1.49x vs vanilla, accept 0.725)
- Q4_K_M unchanged from pre-fix (no regression)
- Q5_K_M still 0.91x (bandwidth-bound at smaller quants)

Slice-A results doc gets a 2026-05-11 addendum with the full
mechanism walkthrough and code references.