
@Acly (Collaborator) commented Sep 5, 2025

This PR makes ggml_gallocr distribute allocations to multiple backend buffers depending on the maximum allocation size reported by the backend. This allows e.g. the Vulkan backend to process graphs which require >4 GB of memory.

I tried to avoid risk and minimize changes/complexity:

  • No API changes
  • No change in existing behavior (buffer layout / tensor offsets stay exactly the same as on master)

Implementation:

  • ggml_gallocr: almost no changes here; it continues to operate with contiguous offsets in [0, SIZE_MAX). Instead of using ggml_backend_buffer directly it now uses vbuffer (sketched below)
  • vbuffer: small local abstraction which distributes a virtual memory range over one or more backend buffers ("chunks")
  • ggml_dyn_tallocr: is now aware of the backend's maximum buffer size, to ensure tensors are not allocated across multiple chunks. This is done by setting the size of the last free_block to the maximum buffer size, and introducing a new block at the end of the range when additional memory is required.

Vulkan: modified to report the actual maximum allocation size. This will change how weights are allocated. I'm not sure how important it is to keep the previous behavior there; happy to discuss alternatives.
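
To illustrate the vbuffer idea, here is a minimal sketch. The struct layout and function names are hypothetical (this is not the code from the PR); only ggml_backend_buffer_t is an existing ggml type.

    // Minimal illustrative sketch, not the PR's actual implementation.
    #include <stddef.h>
    #include "ggml-backend.h"

    #define VBUF_MAX_CHUNKS 16

    struct vbuffer {
        ggml_backend_buffer_t chunks[VBUF_MAX_CHUNKS]; // one backend buffer per chunk
        size_t                sizes[VBUF_MAX_CHUNKS];  // size of each chunk
        int                   n_chunks;
    };

    // Translate an offset in the contiguous virtual range [0, total size)
    // into a (chunk index, offset within that chunk) pair.
    // Assumes the caller passes an offset smaller than the total size.
    static void vbuffer_map(const struct vbuffer * buf, size_t offset,
                            int * chunk, size_t * local) {
        int i = 0;
        while (i + 1 < buf->n_chunks && offset >= buf->sizes[i]) {
            offset -= buf->sizes[i];
            i++;
        }
        *chunk = i;
        *local = offset;
    }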

* if the graph requires more memory than can fit into a single allocation, split it into multiple backend buffers
* vulkan: report the actual max  allocation size in buffer type  interface
@Acly requested review from ggerganov and slaren September 5, 2025 09:42
@Acly requested a review from 0cc4m as a code owner September 5, 2025 09:42

@0cc4m (Collaborator) commented Sep 5, 2025

I don't understand the change yet; what you describe is how it was already working, at least as I understood it. The graph allocator merges as many tensors into one allocation as possible, as long as it stays below the backend's max allocation size.

We use the suballocation size in the Vulkan backend to reduce allocation sizes for performance reasons, where possible. If a single tensor requires more than the actual max allocation size, it will currently try to allocate that anyway, and the driver will usually respond with an exception. I don't think you are addressing this issue (and I don't think that's really possible from the GGML side).

@github-actions bot added the testing (Everything test related), Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels Sep 5, 2025

@Acly (Collaborator, Author) commented Sep 5, 2025

@0cc4m It was working like that for allocating weights, but the allocator for the compute buffers always ignored the backend max size. It batched all tensors into one large buffer and tried to allocate it, failing for Vulkan if it's >4 GB. See e.g. #11332

It's uncommon to hit that limitation with LLMs, I think: they have huge weights but relatively small compute buffers. For images (and video) it becomes an issue as soon as you increase the resolution a little.

A single tensor beyond the maximum allocation size is not possible, no change there.

The reason "suballocation size" gets in the way here is that all allocations to be done are mapped out first, before trying to do the actual backend allocation. The algorithm needs to know the actual maximum here, not a "soft" maximum. I'd also argue that in this case you don't want to artificially reduce batching, as it will increase total memory required due to increased fragmentation (harder to reuse memory of previous computations).

I'm sure we can find a way to reintroduce the soft max for weight allocation though; I just wasn't sure why exactly it was there and how big a difference it makes.

@0cc4m (Collaborator) commented Sep 5, 2025

I understand, thank you for the explanation. But we do need to keep that suballocation limit recommendation in some way, IMO.

@Acly (Collaborator, Author) commented Sep 5, 2025

> But we do need to keep that suballocation limit recommendation in some way, IMO.

Okay, I read some of #11520, #12434 and related issues... in summary, smaller buffers help with host-visible memory and driver issues. I see two options:

  1. Add a separate backend function to return a recommended max batch size for buffer allocations and use that for weight allocation in ggml_backend_alloc_ctx_tensors_from_buft.
  2. Track sizes of individual buffers in ggml_dyn_tallocr. That would enable it to work with a smaller max size and use similar logic to weight allocation (batch tensors up to max size, but still support larger allocations if a single tensor requires it)

Option 2 is nicer, I guess, since it can also avoid those allocation problems for the compute buffers (a rough sketch follows below). It increases the complexity of ggml_dyn_tallocr a bit. Also, the maximum number of buffers is currently 8; we'd probably need to raise that if they're only ~1 GB each (or make it dynamic).
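
A rough sketch of the option-2 batching rule, assuming tensor sizes are known up front; the names are hypothetical and this is not the actual ggml_dyn_tallocr code:

    #include <stddef.h>

    // Hypothetical sketch: pack tensors into chunks no larger than max_size,
    // but let a single oversized tensor occupy its own chunk and leave it to
    // the backend to accept or reject that allocation.
    static int plan_chunks(const size_t * sizes, int n, size_t max_size,
                           int * chunk_of, size_t * offset_of) {
        int    chunk = 0;
        size_t used  = 0;
        for (int i = 0; i < n; i++) {
            if (used > 0 && used + sizes[i] > max_size) {
                chunk++;  // current chunk is full (or the tensor is oversized):
                used = 0; // start a fresh chunk for it
            }
            chunk_of[i]  = chunk;
            offset_of[i] = used;
            used += sizes[i];
        }
        return n > 0 ? chunk + 1 : 0; // number of backend buffers to allocate
    }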

I'll wait a bit to see if there are more opinions before implementing something.

@0cc4m (Collaborator) commented Sep 13, 2025

@Acly Sorry for the delay! I'm fine with both options; either is an improvement for the allocation-size problems we're having, but I'm not too familiar with the allocation code.

Any opinions @jeffbolznv @slaren @ggerganov ?

@Acly (Collaborator, Author) commented Sep 13, 2025

@0cc4m no worries, I didn't get around to working on it until yesterday, and also expect it needs some more eyes first

I tried option 2 and think it's generally an improvement, so I'm pushing it on this branch. I reverted the change in the Vulkan backend, so the 1 GB maximum will now be used for both weight and compute allocations. It may cost a bit more memory when the compute size is >1 GB, but it avoids the compatibility issues. A single compute tensor of >1 GB works too, with behavior similar to the weights allocation (attempt to allocate anyway and let the backend return an error; see the sketch below).
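
For the oversized-tensor case mentioned above, the behavior amounts to something like the following simplified sketch (not the actual code; the ggml_backend_buft_* calls are existing ggml API):

    #include <stdio.h>
    #include "ggml-backend.h"

    // Simplified sketch: a tensor larger than the backend's reported maximum
    // still gets a dedicated chunk; the allocation is attempted at full size
    // and a NULL result is surfaced as an error rather than rejected earlier.
    static ggml_backend_buffer_t alloc_chunk(ggml_backend_buffer_type_t buft, size_t size) {
        ggml_backend_buffer_t buf = ggml_backend_buft_alloc_buffer(buft, size);
        if (buf == NULL && size > ggml_backend_buft_get_max_size(buft)) {
            fprintf(stderr, "tensor of %zu bytes exceeds the backend's max allocation\n", size);
        }
        return buf;
    }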

Going to run a few more tests on this state when I get around to it.

@0cc4m (Collaborator) commented Sep 16, 2025

From my side this is fine, of course. Your test passes with the Vulkan backend on my hardware.

@0cc4m (Collaborator) commented Sep 16, 2025

@slaren Could you take a look?

@wbruna (Contributor) commented Sep 16, 2025

Just tested 44d3ee4 (RX 7600 XT and Ryzen 5 3400G, radv and amdvlk, Linux 6.12). Seems to be working fine, and it fixed leejet/stable-diffusion.cpp#791 for me.

@slaren (Member) commented Sep 24, 2025

Any idea why ubuntu-latest-cmake-sanitizer is failing? I cannot reproduce it locally.

@Acly (Collaborator, Author) commented Sep 24, 2025

Not sure, I couldn't reproduce it either so far; I'll try with a matching GCC version.

@slaren (Member) commented Sep 24, 2025

Looks like it was an issue with ccache; it passed now after deleting the cache.

@slaren merged commit f2a789e into ggml-org:master Sep 24, 2025
204 of 216 checks passed
pwilkin pushed a commit to pwilkin/llama.cpp that referenced this pull request Sep 25, 2025
…gml-org#15815)

* ggml : make gallocr respect the backend's max buffer size

* if the graph requires more memory than can fit into a single allocation, split it into multiple backend buffers
* vulkan: report the actual max  allocation size in buffer type  interface

* fix missing newline, apple-clang warning

* track size of individual chunks in ggml_dyn_tallocr and raise max chunks.
revert to use suballocation_block_size as max chunk size for vulkan.

* track (chunk, offset) pairs instead of "global" offsets through gallocr.

* simpler, don't need loops to map between local/global offsets
* touches more code

* fix dyn_tallocr_max_size and initialization

* fix memory leak when buffers are reused due to same buffer type appearing multiple times

* make vbuffer allocation follow the same logic as backend_buffer did before

* continue to use leftover unallocated space of previous chunks after a new one has been created

* treat free blocks of each chunk as separate list
* they're still allocated together, but start/end of each chunk is tracked, and allocate/free iterate over sub-ranges
* exhaust freed blocks of all chunks before considering their last blocks with unallocated space
* start with 0 chunks/blocks and create chunks as needed
* allow the last chunk to grow beyond max size

* refactor: move adding new free block and new chunk into separate functions

* allocate chunks individually with a separate free-blocks list for each one

* needs a bit more memory/allocations/indirections, but code is simpler

* fix warnings (missing static) & debug checks
struct pushed a commit to struct/llama.cpp that referenced this pull request Sep 26, 2025
kyano added a commit to kyano/llama.cpp that referenced this pull request Sep 27, 2025
kyano added a commit to kyano/llama.cpp that referenced this pull request Sep 27, 2025
kyano added a commit to kyano/llama.cpp that referenced this pull request Sep 28, 2025
kyano added a commit to kyano/llama.cpp that referenced this pull request Sep 28, 2025
kyano added a commit to kyano/llama.cpp that referenced this pull request Sep 29, 2025
kyano added a commit to kyano/llama.cpp that referenced this pull request Sep 30, 2025
kyano added a commit to kyano/llama.cpp that referenced this pull request Oct 1, 2025
kyano added a commit to kyano/llama.cpp that referenced this pull request Oct 2, 2025
kyano added a commit to kyano/llama.cpp that referenced this pull request Oct 2, 2025
    }

-   size_t cur_size = galloc->buffers[i] ? ggml_backend_buffer_get_size(galloc->buffers[i]) : 0;
+   size_t cur_size = galloc->buffers[i] ? ggml_vbuffer_size(galloc->buffers[i]) : 0;

A Member commented on this diff:

This may be the cause of #16383. It is necessary to check each sub-buffer to determine if it needs to be reallocated, not just the overall size.
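
A minimal sketch of the per-chunk check being suggested, reusing the hypothetical struct vbuffer fields from the earlier sketch (this is not the actual fix):

    #include <stdbool.h>
    #include <stddef.h>

    // Hypothetical: reallocation is needed as soon as any individual chunk
    // would have to grow, even if the summed size stays the same or shrinks.
    static bool vbuffer_needs_realloc(const struct vbuffer * buf,
                                      const size_t * new_sizes, int n_new) {
        if (buf == NULL || n_new > buf->n_chunks) {
            return true;
        }
        for (int i = 0; i < n_new; i++) {
            if (new_sizes[i] > buf->sizes[i]) {
                return true;
            }
        }
        return false;
    }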

@Acly (Collaborator, Author) commented Oct 2, 2025

I think you're right that this can be an issue; it's something I missed after changing chunks to have individual sizes.

It would still have to coincidentally end up with the same size; I can try to reproduce if that's what is happening.

(edit: never mind, it just needs a chunk to be larger while the total size is the same or smaller, which is not so unlikely)
