
common: make --fit validate shared UMA memory budgets #22602

Closed

fl0rianr wants to merge 1 commit into ggml-org:master from fl0rianr:fix/fit_behaviour

Conversation


@fl0rianr fl0rianr commented May 2, 2026

Closes #22592

Overview

This PR makes --fit validate memory usage against explicit budget groups before accepting fitted parameters.

The most important change is that UMA / integrated GPU systems are no longer treated like discrete GPU systems where device memory and host memory are independent pools. On UMA systems, both device-labeled and host-labeled allocations can pressure the same physical memory. The fit logic now handles that:

The fit path tracks:

budget group 0: physical or API-reported device budget
budget group 1: host budget
budget group 2: shared / overlapping UMA budget

Candidate fit results are also verified against the selected parameters before they are committed. In particular, context reduction is no longer accepted purely from aggregate interpolation.
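As a minimal sketch of the idea (the names and types here are illustrative, not the actual fit.cpp symbols), the per-group check can be thought of as:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch: charge a candidate's allocations against the three
// budget groups and accept only if every applicable group stays in budget.
// On a dGPU system the shared group is unused; on UMA it binds device- and
// host-labeled allocations to the same physical pool.
struct fit_budgets {
    int64_t device; // group 0: physical or API-reported device budget
    int64_t host;   // group 1: host budget
    int64_t shared; // group 2: shared/overlapping UMA budget (-1 if none)
};

static bool fits_budgets(const fit_budgets & b,
                         int64_t device_alloc, int64_t host_alloc) {
    if (device_alloc > b.device) return false;
    if (host_alloc   > b.host)   return false;
    // On UMA, both allocation classes pressure the same physical memory,
    // so their sum must also fit the shared budget.
    if (b.shared >= 0 && device_alloc + host_alloc > b.shared) return false;
    return true;
}
```

With shared set to -1 this degenerates to the independent dGPU-style check, which is what keeps existing discrete-GPU behavior unchanged.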

Main changes

This PR updates the --fit path to:

  • abort startup when common_fit_params() returns FAILURE or ERROR
  • keep fitting transactional and only commit parameter changes after success
  • avoid falling back from invalid 0/0 device-memory reports to host memory for device budgets
  • validate reduced-context candidates with a real fit probe before accepting them
  • preserve the existing weight-placement fallback path when the full model cannot remain on the device
  • account for UMA / APU shared-memory overlap instead of treating host and device memory as independent
  • add a simple tiered shared-memory reserve for UMA systems

The UMA reserve is intentionally simple and bounded. It avoids an unbounded percentage penalty on large shared-memory systems while still leaving room for OS pressure and backend startup allocations.

Currently:

~32 GiB shared memory class   -> 3 GiB reserve
~128 GiB shared memory class  -> 6 GiB reserve
~256 GiB shared memory class  -> 9 GiB reserve

This should be easy to adjust later if more hardware data suggests better thresholds.
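A sketch of the tier lookup (the bucket cutoffs chosen here are an assumption about how the memory classes are separated, not the actual fit.cpp constants):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative tiered UMA shared-memory reserve, in GiB, following the
// tiers described above. The threshold values separating the classes are
// assumptions for this sketch; only the 3/6/9 GiB reserves come from the PR.
static int64_t uma_reserve_gib(int64_t shared_gib) {
    if (shared_gib <= 64)  return 3; // ~32 GiB shared memory class
    if (shared_gib <= 192) return 6; // ~128 GiB shared memory class
    return 9;                        // ~256 GiB shared memory class and up
}
```

Because the reserve is a bounded step function rather than a percentage, a 256 GiB system loses 9 GiB instead of an ever-growing fraction of its shared memory.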

Bugs / failure modes addressed

Confirmed by local logs

The following cases are covered by local logs and are the main motivation for this PR:

  1. UMA / APU shared-budget overcommit

    On ROCm and Vulkan UMA systems, --fit could approve parameters based on the API-reported device budget while not correctly charging host/shared allocations against the same physical memory pool.

    The new budget group 2 fixes this class by validating the shared / overlapping UMA budget directly.
    llama-fit-128k-times-10_rocm_gfx1151.log
    llama-fit-128k-times-10-vulkan_gfx11.log

  2. Weights/load pressure OOM before cache allocation

    The updated accounting avoids approving configurations that appear to fit within the final device budget but still overcommit the shared memory pool under load/startup pressure. Here is the fixed log, in contrast to the issue.
    llama-fit-128k-times-10-vulkan_weights_loading_error.log

Direct code-level fixes

The following are direct control-flow / correctness fixes in the fit path:

  • common_fit_params() failure is no longer ignored by the caller
  • fit parameter updates are transactional instead of leaving partial mutations on failure
  • invalid 0/0 device-memory reports are not silently treated as host memory for device budgets
  • reduced context candidates are re-measured before being accepted
  • the weight-placement fallback path is preserved for cases where full device placement does not fit
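The transactional-commit point from the list above can be sketched as follows (hypothetical names; this is not the actual common_fit_params() signature, just the commit discipline it now follows):

```cpp
#include <cassert>

// Illustrative transactional fit update: run the fit against a copy of the
// parameters and commit only on success, so a FAILURE/ERROR result leaves
// no partial mutations behind and the caller can abort startup cleanly.
struct cli_params { int n_ctx; int n_gpu_layers; };
enum class fit_status { OK, FAILURE, ERROR };

template <typename FitFn>
static fit_status try_fit_transactional(cli_params & params, FitFn fit) {
    cli_params candidate = params;  // fit mutates the copy freely
    fit_status st = fit(candidate);
    if (st == fit_status::OK) {
        params = candidate;         // commit only after success
    }
    return st;                      // caller aborts on FAILURE or ERROR
}
```

The same pattern covers the "failure is no longer ignored by the caller" fix: the status is returned unconditionally, so the caller cannot observe a half-fitted parameter set.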

Improved but may need more backend coverage

The verified candidate re-probe also improves the broader class of cases where aggregate interpolation was too optimistic, especially on multi-device and lifecycle-sensitive paths.

I have not exhaustively tested every backend combination. The intent is to preserve existing dGPU behavior by keeping device and host budget groups separate unless shared/overlapping memory accounting is needed.

Testing

Tested locally on AMD Strix Halo / gfx1151.

ROCm and Vulkan, see logs above.

Model: GLM-4.5-Air-GGUF / UD-Q4_K_XL

I also tested that dGPU cases still run fine on an RTX 5090 with a deliberately oversized context:
--override-kv glm4moe.context_length=int:1310720

Model: Qwen3.6-27B-GGUF/UD-Q4_K_XL

Notes for reviewers

The key behavioral change is not that --fit simply adds a larger safety margin. The important part is that the fit decision now validates the correct memory budget on UMA systems.

For dGPU-style systems, device and host memory remain separate budget groups.

For UMA / integrated GPU systems, the shared / overlapping memory group prevents fit from treating host and device allocations as independent when they are backed by the same physical RAM.

AI usage disclosure

AI assistance was used during investigation and review to help compare logs, organize the bug analysis, and discuss possible fit-accounting approaches.

The submitted code was manually reviewed, tested locally, and I am prepared to explain the changes during review.

--fit now treats UMA systems as a shared-memory budget instead of assuming that device and host allocations are independent.

This adds explicit budget groups for API-reported device memory, host memory, and shared/overlapping UMA memory. The fit decision is verified against the selected candidate parameters before committing them, so context reduction is no longer accepted purely from an aggregate interpolation.

The change also keeps fit transactional, avoids falling back from invalid 0/0 device-memory reports to host memory, aborts startup when fitting fails, and preserves the existing weight-placement fallback path for cases where the full model cannot remain on the device.

For integrated/UMA devices, the shared-memory reserve uses simple RAM-size tiers so large shared-memory systems do not get an unbounded percentage penalty while still leaving enough room for OS pressure and backend startup allocations.

Signed-off-by: Florian Reinle <f.reinle@otec.de>
@fl0rianr fl0rianr requested review from a team and JohannesGaessler as code owners May 2, 2026 01:17
@fl0rianr fl0rianr marked this pull request as draft May 2, 2026 01:17
ggml-gh-bot Bot commented May 2, 2026

Hi @fl0rianr, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • Large PR: Large changes require prior discussion (e.g. an issue or RFC) and maintainers may not be able to review this PR as-is. Consider splitting it into smaller, focused PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@JohannesGaessler

TL;DR. 1000 LoC is completely overengineered for UMA device handling.


fl0rianr commented May 2, 2026

Thanks for taking a look. Fair point on the scope.
I agree that this PR is very large and hard to review as it stands.
I opened it after chasing several related --fit failure modes together; working on it over quite some time gave me a complete picture, but in hindsight the safer path is to reduce this to small mechanical fixes first.
Would you be open to addressing these points piece by piece? During my testing I found improvement potential beyond the UMA cases.
If you have a preferred order or scope for acceptable follow-up PRs, I would appreciate your guidance.


Successfully merging this pull request may close these issues.

Feature Request: --fit should be authoritative: fix fit.cpp failure handling, per-device validation, probe OOM risk, and UMA accounting