common: make --fit validate shared UMA memory budgets #22602
fl0rianr wants to merge 1 commit into ggml-org:master
Conversation
--fit now treats UMA systems as a shared-memory budget instead of assuming that device and host allocations are independent. This adds explicit budget groups for API-reported device memory, host memory, and shared/overlapping UMA memory.

The fit decision is verified against the selected candidate parameters before committing them, so context reduction is no longer accepted purely from an aggregate interpolation. The change also keeps fit transactional, avoids falling back from invalid 0/0 device-memory reports to host memory, aborts startup when fitting fails, and preserves the existing weight-placement fallback path for cases where the full model cannot remain on the device.

For integrated/UMA devices, the shared-memory reserve uses simple RAM-size tiers so large shared-memory systems do not get an unbounded percentage penalty while still leaving enough room for OS pressure and backend startup allocations.

Signed-off-by: Florian Reinle <f.reinle@otec.de>
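As a rough illustration of the budget-group idea (the struct, the `fits` helper, and all names below are hypothetical, not the actual llama.cpp code): on a UMA system, a device- or host-labeled allocation must also be charged against the shared pool.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of the three budget groups described above.
struct mem_budgets {
    int64_t device = 0;  // API-reported device memory budget
    int64_t host   = 0;  // host memory budget
    int64_t shared = 0;  // shared / overlapping UMA budget (unused on dGPU)
};

// On dGPU systems, device and host are independent pools. On UMA
// systems, both allocation kinds also pressure the shared pool.
bool fits(const mem_budgets &b, int64_t dev_alloc, int64_t host_alloc, bool uma) {
    if (dev_alloc > b.device || host_alloc > b.host) {
        return false;
    }
    if (uma && dev_alloc + host_alloc > b.shared) {
        return false;
    }
    return true;
}
```

The same allocation mix can pass on a dGPU and fail on a UMA machine, which is exactly the class of overcommit this PR targets.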
Hi @fl0rianr, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
TL;DR: 1000 LoC is completely overengineered for UMA device handling.
Thanks for taking a look. Fair point on the scope.
Closes #22592
Overview
This PR makes `--fit` validate memory against explicit budget groups before accepting fitted parameters.

The most important change is that UMA / integrated GPU systems are no longer treated like discrete GPU systems, where device memory and host memory are independent pools. On UMA systems, both device-labeled and host-labeled allocations can pressure the same physical memory. The fit logic now handles that.
The fit path tracks three budget groups: API-reported device memory, host memory, and shared / overlapping UMA memory.
Candidate fit results are also verified against the selected parameters before they are committed. In particular, context reduction is no longer accepted purely from aggregate interpolation.
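A minimal sketch of the verify-before-commit step (the `probe` callback, the halving search, and all names are illustrative; the real fit path is more involved): each candidate is re-probed directly instead of being accepted from interpolation alone.

```cpp
#include <cstdint>
#include <functional>
#include <optional>

// Returns true if a context of size n_ctx actually fits the tracked budgets.
using probe_fn = std::function<bool(int64_t n_ctx)>;

// Reduce the candidate context until a direct probe confirms it fits.
// The interpolated starting point is never committed without a re-probe.
std::optional<int64_t> fit_n_ctx(int64_t start, int64_t min_ctx, const probe_fn &probe) {
    for (int64_t n = start; n >= min_ctx; n /= 2) {
        if (probe(n)) {
            return n;  // committed only after direct verification
        }
    }
    return std::nullopt;  // the caller treats this as a fatal fit failure
}
```

The key property is that the committed value was itself probed, so an over-optimistic interpolation can no longer slip through.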
Main changes
This PR updates the --fit path to:
- abort startup when `common_fit_params()` returns FAILURE or ERROR

The UMA reserve is intentionally simple and bounded. It avoids an unbounded percentage penalty on large shared-memory systems while still leaving room for OS pressure and backend startup allocations.
Currently:
This should be easy to adjust later if more hardware data suggests better thresholds.
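For illustration only — the thresholds and reserve sizes below are made-up numbers, not the tiers from this patch — a tiered reserve is bounded by construction, unlike a fixed percentage of total RAM:

```cpp
#include <cassert>
#include <cstdint>

constexpr int64_t GiB = 1024ll * 1024 * 1024;

// Hypothetical RAM-size tiers; the actual values in the PR differ.
int64_t uma_reserve(int64_t total_ram) {
    if (total_ram <= 16 * GiB) return 2 * GiB;  // small systems
    if (total_ram <= 64 * GiB) return 4 * GiB;  // mid-size systems
    return 8 * GiB;  // capped: a 128 GiB system is not penalized further
}
```

By contrast, a percentage-based reserve (say, 10%) would grow to 12.8 GiB on a 128 GiB shared-memory system, which is the unbounded penalty the tiers avoid.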
Bugs / failure modes addressed
Confirmed by local logs
The following cases are covered by local logs and the main motivation for this PR:
UMA / APU shared-budget overcommit
On ROCm and Vulkan UMA systems, --fit could approve parameters based on the API-reported device budget while not correctly charging host/shared allocations against the same physical memory pool.
The new shared / overlapping budget group (group 2) fixes this class of failures by validating the UMA budget directly.
llama-fit-128k-times-10_rocm_gfx1151.log
llama-fit-128k-times-10-vulkan_gfx11.log
Weights/load pressure OOM before cache allocation
The updated accounting avoids approving configurations that fit only in the final-looking device budget but still overcommit the shared memory pool during load/startup pressure. The fixed log below can be contrasted with the one in the issue.
llama-fit-128k-times-10-vulkan_weights_loading_error.log
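One way to picture this fix (all names hypothetical, not the patch's code): validation must cover the transient peak during weight loading, not only the final-looking footprint after the cache is allocated.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Validate both the transient peak during load/startup and the final
// footprint against the shared UMA budget (illustrative sketch).
bool fits_shared_budget(int64_t weights, int64_t load_scratch,
                        int64_t kv_cache, int64_t shared_budget) {
    int64_t peak_during_load = weights + load_scratch;  // before the cache exists
    int64_t final_footprint  = weights + kv_cache;      // steady state
    return std::max(peak_during_load, final_footprint) <= shared_budget;
}
```

A configuration whose steady state fits can still OOM mid-load if the scratch buffers push the peak over the shared budget, which matches the failure mode in the log above.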
Direct code-level fixes
The following are direct control-flow / correctness fixes in the fit path:
`common_fit_params()` failure is no longer ignored by the caller

Improved but may need more backend coverage
The verified candidate re-probe also improves the broader class of cases where aggregate interpolation was too optimistic, especially on multi-device and lifecycle-sensitive paths.
I have not exhaustively tested every backend combination. The intent is to preserve existing dGPU behavior by keeping device and host budget groups separate unless shared/overlapping memory accounting is needed.
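The dGPU-preserving behavior can be sketched like this (hypothetical helper, not the patch's actual code): the shared group only participates when a device reports UMA/integrated memory.

```cpp
#include <cassert>

// Hypothetical: decide which budget groups a device contributes to.
// dGPU devices keep the device and host groups independent; only UMA /
// integrated devices also charge the shared/overlapping group.
struct group_mask {
    bool device;
    bool host;
    bool shared;
};

group_mask groups_for_device(bool is_uma) {
    return group_mask{true, true, is_uma};
}
```

With the shared group disabled on dGPU systems, the accounting reduces to the pre-existing independent device/host checks, which is how existing behavior is preserved.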
Testing
Tested locally on AMD Strix Halo / gfx1151.
ROCm and Vulkan, see logs above.
Model: GLM-4.5-Air-GGUF / UD-Q4_K_XL
I also tested that dGPU cases are still running fine on an RTX 5090 with a context that is deliberately "too huge":
`--override-kv glm4moe.context_length=int:1310720`

Model: Qwen3.6-27B-GGUF / UD-Q4_K_XL
Notes for reviewers
The key behavioral change is not that --fit simply adds a larger safety margin. The important part is that the fit decision now validates the correct memory budget on UMA systems.
For dGPU-style systems, device and host memory remain separate budget groups.
For UMA / integrated GPU systems, the shared / overlapping memory group prevents fit from treating host and device allocations as independent when they are backed by the same physical RAM.
AI usage disclosure
AI assistance was used during investigation and review to help compare logs, organize the bug analysis, and discuss possible fit-accounting approaches.
The submitted code was manually reviewed and tested locally, and I am prepared to explain the changes during review.