server: force non-mmap load for MTP head to avoid Metal full-model duplicate#22941
frozename wants to merge 12 commits into
Conversation
Hi @frozename, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
This should not be a PR; you made a one-line change. Why not just discuss/propose it on the MTP PR?
Ok I'll do that
I'd also like to stress that this kind of spam-by-AI will normally earn submitters a ban; the only reason you were not banned is because of the way you responded.
Thanks for the heads-up. I'll keep that in mind for future contributions.
Traced the M4 Pro doubled-allocation OOM to its source: the MTP arch registers tensors at both ends of the GGUF (`tok_embd` near the start, `output`/`nextn.*`/last layer at the end), so the mmap-backed buffer path covers `[first, last)` ≈ the full file. Apple Metal allocates that whole range, duplicating the main model in VRAM. The server log shows two identical `MTL0_Mapped model buffer` entries; the duplicate doesn't roll up into the target context's `self` in the breakdown, so it appears as unaccounted memory.

One-line fix: force `use_mmap=false` on the MTP load. The non-mmap allocator sizes the buffer to registered tensors only. Filed upstream as ggml-org/llama.cpp#22941. Patch preserved at tools/llama-cpp-mtp/0001-mtp-mmap-fix.patch.

Results on M4 Pro 48 GB:
- Q5_K_M Metal MTP buffer: 18760 -> 1425 MiB (13.2x reduction)
- Q8_0 Metal MTP buffer: 28213 -> 1719 MiB (16.4x reduction)
- Q8_0 decode: 7.4 -> 11.1 tok/s (1.49x vs vanilla, accept 0.725)
- Q4_K_M unchanged from pre-fix (no regression)
- Q5_K_M still 0.91x (bandwidth-bound at smaller quants)

The Slice-A results doc gets a 2026-05-11 addendum with the full mechanism walkthrough and code references.
Targets the MTP path introduced in #22673 (currently a draft). Branched off the PR's head (`5d5f1b46`). Should be cherry-picked into #22673 / squashed into the same series before that PR lands. Filing as a separate PR for review visibility; happy to close once it's incorporated into #22673.

Bug
The MTP head is loaded by reopening the same GGUF with `override_arch = qwen35_mtp[_moe]`, which registers tensors near both the start of the file (`tok_embd`) and the end (`output`, `nextn.*`, the last transformer block). With the default mmap-backed buffer path at `src/llama-model.cpp:1463-1483`, the backend allocates a buffer covering the full `[first_tensor_offset, last_tensor_offset)` range, which for Qwen 3.6 27B spans nearly the entire file.

On Apple Silicon the result is a Metal buffer the same size as the main model, duplicated. The server log shows two `MTL0_Mapped model buffer size` entries with identical byte counts: one for the main model, one for the MTP head.

This is observable as `unaccounted ≈ model_size` in `common_memory_breakdown_print` (the MTP context's allocations live in a sibling `llama_context` and don't roll up into the target context's `self`).
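To make the two sizing rules concrete, here is a self-contained schematic sketch (not llama.cpp code; the offsets and sizes are invented to mimic a tok_embd-at-the-start / output-at-the-end registration pattern) that computes both buffer sizes:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Schematic illustration only: contrast the buffer size an mmap-backed path
// needs (the full [first, last) file span) with what a non-mmap path needs
// (just the registered tensors themselves).
struct tensor_reg {
    uint64_t offset; // byte offset of the tensor data within the GGUF
    uint64_t size;   // byte size of the tensor data
};

int main() {
    const std::vector<tensor_reg> regs = {
        {     1ull << 20, 400ull << 20 }, // "tok_embd": early in the file
        { 18000ull << 20, 700ull << 20 }, // "nextn.* + output": at the end
    };
    const uint64_t first = regs.front().offset;
    const uint64_t last  = regs.back().offset + regs.back().size;

    uint64_t tensors_only = 0; // what the non-mmap allocator would size
    for (const auto & r : regs) {
        tensors_only += r.size;
    }

    std::printf("mmap span   : %llu MiB\n", (unsigned long long) ((last - first) >> 20));
    std::printf("tensors only: %llu MiB\n", (unsigned long long) (tensors_only >> 20));
    return 0;
}
```

With these invented numbers the mmap span comes out to ~18.3 GiB against ~1.1 GiB of actual tensor data, the same order of disparity as the measured 18760 vs 1425 MiB.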
Reproduction (M4 Pro 48 GB, Apple10 / pre-M5 Metal, PR head `5d5f1b46`)

Pre-fix, the server log shows two `MTL0_Mapped model buffer size` entries with the same byte count, roughly 18760 MiB each for Q5_K_M: one for the main model, one for the MTP head.
The server OOMs mid-decode against the 38 GB Metal cap (M4 Pro). Q8_0 is even worse (~28 GB per copy).
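Back-of-envelope, using the Q5_K_M numbers above (my arithmetic, not a figure from the log):

$$2 \times 18760\ \text{MiB} \approx 36.6\ \text{GiB},$$

which leaves almost nothing of the ~38 GB cap for KV cache and compute buffers, hence an OOM mid-decode rather than at load time.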
Fix
Force `use_mmap = false` on the MTP load only. The non-mmap allocator path (`llama-model.cpp:1492`) sizes the backend buffer to the registered tensors instead of the mmap range, which for the MTP arch is ~1.4-1.7 GB instead of ~19-28 GB.
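A minimal sketch of the shape of the change, for orientation. The surrounding function is a hypothetical stand-in, not the PR's actual diff (which is a one-line override in the server's MTP load path, gated as described in the Notes); only `llama_model_params`, its `use_mmap` field, `llama_model_default_params`, and `llama_model_load_from_file` are real llama.cpp API:

```cpp
#include "llama.h"

// Hedged sketch: load_mtp_head stands in for the server's MTP load path.
// The one behavioral change is the use_mmap override, which makes the
// allocator size the backend buffer to the registered tensors instead of
// the [first, last) mmap span.
static llama_model * load_mtp_head(const char * gguf_path) {
    llama_model_params mparams = llama_model_default_params();
    mparams.use_mmap = false; // MTP head only; vanilla / draft / mmproj loads untouched
    return llama_model_load_from_file(gguf_path, mparams);
}
```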
Before / after (M4 Pro 48 GB, Qwen 3.6 27B, froggeric's MTP GGUFs)

| Quant | Metal MTP buffer, pre-fix | Metal MTP buffer, post-fix | Reduction |
|---|---|---|---|
| Q5_K_M | 18760 MiB | 1425 MiB | 13.2x |
| Q8_0 | 28213 MiB | 1719 MiB | 16.4x |

Both quants were OOMing mid-decode pre-fix; both now run cleanly.
Aggregate decode tok/s with the OP's recipe (`-ctk q8_0 -ctv q8_0`, no flash-attn, default ub), 9-prompt suite (am17an's gist, `temperature=0 seed=42 n_predict=192`):

Q4 is unchanged (it fit pre-fix; no regression). Q5 and Q8 are new datapoints because they OOM'd pre-fix: Q5_K_M lands at 0.91x vs vanilla, while Q8_0 goes from 7.4 to 11.1 tok/s (1.49x vs vanilla, acceptance 0.725) and clears the speculative-decoding win threshold on this hardware. Speculative savings scale with main-pass cost, so MTP shines at the largest quant the box can hold.
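As a sanity check on that 1.49x (my back-of-envelope, assuming the MTP head drafts one token per target pass): the ideal speedup at acceptance rate $p = 0.725$, ignoring the draft head's own cost, would be

$$1 + p = 1.725\times,$$

so the observed 1.49x leaves a plausible margin for the head's compute and the restarts after rejected tokens.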
Testing scope
Built and tested only on M4 Pro 48 GB / Apple10 Metal (the only platform I have access to). The non-mmap allocator path is well-exercised on all backends (it's the fallback when mmap fails or is disabled by `--no-mmap`), so functional correctness should be unchanged everywhere. The performance impact at MTP load time on other backends is a slower direct read of ~1-2 GB of tensor data instead of mmap pointer aliasing: a one-time cost of a few seconds, with no inference-time effect.

Would appreciate someone running this on CUDA / Vulkan to confirm no regression there.
Notes
- The override is gated on `params_base.speculative.type == COMMON_SPECULATIVE_TYPE_MTP`; vanilla / draft / mmproj load paths are untouched.
- Interaction with `--no-mmap` and `--mlock` was considered: the override is a no-op if the user already disabled mmap, and `--mlock` still works via the existing `mlock_bufs` path (`llama-model.cpp:1497-1502`).
- An alternative, skipping the `tok_embd`/`output` registration in `qwen35_mtp.cpp::load_arch_tensors` when the `nextn.*` equivalents are present, was considered but rejected: it requires GGUF-metadata sniffing before arch instantiation (see the sketch below), and it breaks any future MTP GGUF that lacks the nextn variants. The non-mmap override keeps the arch logic simple and the failure mode contained.
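For reference, a hedged sketch of the pre-pass the rejected alternative would need. `gguf_init_from_file`, `gguf_get_n_tensors`, `gguf_get_tensor_name`, and `gguf_free` are real ggml API; the helper name and the `nextn.` prefix check are illustrative assumptions:

```cpp
#include <cstring>
#include "gguf.h"

// Illustrative only: the rejected alternative would have to sniff the GGUF
// before arch instantiation to decide whether the file carries nextn.*
// tensors and the tok_embd/output registration can be skipped.
static bool gguf_has_nextn_tensors(const char * path) {
    gguf_init_params ip = { /*no_alloc =*/ true, /*ctx =*/ nullptr };
    gguf_context * ctx = gguf_init_from_file(path, ip);
    if (!ctx) {
        return false;
    }
    bool found = false;
    for (int64_t i = 0; i < gguf_get_n_tensors(ctx); ++i) {
        if (strncmp(gguf_get_tensor_name(ctx, i), "nextn.", 6) == 0) {
            found = true;
            break;
        }
    }
    gguf_free(ctx);
    return found;
}
```

Even with such a helper, any MTP GGUF produced without the `nextn.*` naming would silently fall back to the duplicating path, which is the contained-failure-mode argument for the mmap override instead.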