common: make --fit validate shared UMA memory budgets #22602
fl0rianr wants to merge 1 commit into ggml-org:master
Conversation
--fit now treats UMA systems as a shared-memory budget instead of assuming that device and host allocations are independent. This adds explicit budget groups for API-reported device memory, host memory, and shared/overlapping UMA memory.

The fit decision is verified against the selected candidate parameters before committing them, so context reduction is no longer accepted purely from an aggregate interpolation. The change also keeps fit transactional, avoids falling back from invalid 0/0 device-memory reports to host memory, aborts startup when fitting fails, and preserves the existing weight-placement fallback path for cases where the full model cannot remain on the device.

For integrated/UMA devices, the shared-memory reserve uses simple RAM-size tiers so large shared-memory systems do not get an unbounded percentage penalty while still leaving enough room for OS pressure and backend startup allocations.

Signed-off-by: Florian Reinle <f.reinle@otec.de>
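As a rough illustration of the budget-group idea (the struct, the `fits` helper, and all names below are hypothetical, not the actual llama.cpp code): on a UMA system, a device- or host-labeled allocation must also be charged against the shared pool.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of the three budget groups described above.
struct mem_budgets {
    int64_t device = 0;  // API-reported device memory budget
    int64_t host   = 0;  // host memory budget
    int64_t shared = 0;  // shared / overlapping UMA budget (unused on dGPU)
};

// On dGPU systems, device and host are independent pools. On UMA
// systems, both allocation kinds also pressure the shared pool.
bool fits(const mem_budgets &b, int64_t dev_alloc, int64_t host_alloc, bool uma) {
    if (dev_alloc > b.device || host_alloc > b.host) {
        return false;
    }
    if (uma && dev_alloc + host_alloc > b.shared) {
        return false;
    }
    return true;
}
```

The same allocation mix can pass on a dGPU and fail on a UMA machine, which is exactly the class of overcommit this PR targets.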
Hi @fl0rianr, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
TL;DR: 1000 LoC is completely overengineered for UMA device handling.
Thanks for taking a look. Fair point on the scope.
Closes #22592
Overview
This PR makes `--fit` validate memory against explicit budget groups before accepting fitted parameters.

The most important change is that UMA / integrated GPU systems are no longer treated like discrete GPU systems, where device memory and host memory are independent pools. On UMA systems, both device-labeled and host-labeled allocations can pressure the same physical memory. The fit logic now handles that.
The fit path tracks three budget groups: API-reported device memory, host memory, and shared / overlapping UMA memory.
Candidate fit results are also verified against the selected parameters before they are committed. In particular, context reduction is no longer accepted purely from aggregate interpolation.
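A minimal sketch of the verify-before-commit step (the `probe` callback, the halving search, and all names are illustrative; the real fit path is more involved): each candidate is re-probed directly instead of being accepted from interpolation alone.

```cpp
#include <cstdint>
#include <functional>
#include <optional>

// Returns true if a context of size n_ctx actually fits the tracked budgets.
using probe_fn = std::function<bool(int64_t n_ctx)>;

// Reduce the candidate context until a direct probe confirms it fits.
// The interpolated starting point is never committed without a re-probe.
std::optional<int64_t> fit_n_ctx(int64_t start, int64_t min_ctx, const probe_fn &probe) {
    for (int64_t n = start; n >= min_ctx; n /= 2) {
        if (probe(n)) {
            return n;  // committed only after direct verification
        }
    }
    return std::nullopt;  // the caller treats this as a fatal fit failure
}
```

The key property is that the committed value was itself probed, so an over-optimistic interpolation can no longer slip through.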
Main changes
This PR updates the --fit path to:
- abort startup when `common_fit_params()` returns FAILURE or ERROR

The UMA reserve is intentionally simple and bounded. It avoids an unbounded percentage penalty on large shared-memory systems while still leaving room for OS pressure and backend startup allocations.
Currently:
This should be easy to adjust later if more hardware data suggests better thresholds.
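For illustration only — the thresholds and reserve sizes below are made-up numbers, not the tiers from this patch — a tiered reserve is bounded by construction, unlike a fixed percentage of total RAM:

```cpp
#include <cassert>
#include <cstdint>

constexpr int64_t GiB = 1024ll * 1024 * 1024;

// Hypothetical RAM-size tiers; the actual values in the PR differ.
int64_t uma_reserve(int64_t total_ram) {
    if (total_ram <= 16 * GiB) return 2 * GiB;  // small systems
    if (total_ram <= 64 * GiB) return 4 * GiB;  // mid-size systems
    return 8 * GiB;  // capped: a 128 GiB system is not penalized further
}
```

By contrast, a percentage-based reserve (say, 10%) would grow to 12.8 GiB on a 128 GiB shared-memory system, which is the unbounded penalty the tiers avoid.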
Bugs / failure modes addressed
Confirmed by local logs
The following cases are covered by local logs and the main motivation for this PR:
UMA / APU shared-budget overcommit
On ROCm and Vulkan UMA systems, --fit could approve parameters based on the API-reported device budget while not correctly charging host/shared allocations against the same physical memory pool.
The new shared / overlapping budget group (group 2) fixes this class of failures by validating the UMA budget directly.
llama-fit-128k-times-10_rocm_gfx1151.log
llama-fit-128k-times-10-vulkan_gfx11.log
Weights/load pressure OOM before cache allocation
The updated accounting avoids approving configurations that fit only in the final-looking device budget but still overcommit the shared memory pool during load/startup pressure. The fixed log below can be contrasted with the one in the issue.
llama-fit-128k-times-10-vulkan_weights_loading_error.log
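One way to picture this fix (all names hypothetical, not the patch's code): validation must cover the transient peak during weight loading, not only the final-looking footprint after the cache is allocated.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Validate both the transient peak during load/startup and the final
// footprint against the shared UMA budget (illustrative sketch).
bool fits_shared_budget(int64_t weights, int64_t load_scratch,
                        int64_t kv_cache, int64_t shared_budget) {
    int64_t peak_during_load = weights + load_scratch;  // before the cache exists
    int64_t final_footprint  = weights + kv_cache;      // steady state
    return std::max(peak_during_load, final_footprint) <= shared_budget;
}
```

A configuration whose steady state fits can still OOM mid-load if the scratch buffers push the peak over the shared budget, which matches the failure mode in the log above.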
Direct code-level fixes
The following are direct control-flow / correctness fixes in the fit path:
`common_fit_params()` failure is no longer ignored by the caller

Improved but may need more backend coverage
The verified candidate re-probe also improves the broader class of cases where aggregate interpolation was too optimistic, especially on multi-device and lifecycle-sensitive paths.
I have not exhaustively tested every backend combination. The intent is to preserve existing dGPU behavior by keeping device and host budget groups separate unless shared/overlapping memory accounting is needed.
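The dGPU-preserving behavior can be sketched like this (hypothetical helper, not the patch's actual code): the shared group only participates when a device reports UMA/integrated memory.

```cpp
#include <cassert>

// Hypothetical: decide which budget groups a device contributes to.
// dGPU devices keep the device and host groups independent; only UMA /
// integrated devices also charge the shared/overlapping group.
struct group_mask {
    bool device;
    bool host;
    bool shared;
};

group_mask groups_for_device(bool is_uma) {
    return group_mask{true, true, is_uma};
}
```

With the shared group disabled on dGPU systems, the accounting reduces to the pre-existing independent device/host checks, which is how existing behavior is preserved.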
Testing
Tested locally on AMD Strix Halo / gfx1151.
ROCm and Vulkan, see logs above.
Model: GLM-4.5-Air-GGUF / UD-Q4_K_XL
I also tested that dGPU cases are still running fine on an RTX 5090 with a context that is deliberately "too huge":
`--override-kv glm4moe.context_length=int:1310720`

Model: Qwen3.6-27B-GGUF / UD-Q4_K_XL
Notes for reviewers
The key behavioral change is not that --fit simply adds a larger safety margin. The important part is that the fit decision now validates the correct memory budget on UMA systems.
For dGPU-style systems, device and host memory remain separate budget groups.
For UMA / integrated GPU systems, the shared / overlapping memory group prevents fit from treating host and device allocations as independent when they are backed by the same physical RAM.
AI usage disclosure
AI assistance was used during investigation and review to help compare logs, organize the bug analysis, and discuss possible fit-accounting approaches.
The submitted code was manually reviewed and tested locally, and I am prepared to explain the changes during review.