sycl: Add optional USM system allocations by ifdu · Pull Request #22526 · ggml-org/llama.cpp

ifdu · 2026-04-29T15:34:18Z

Overview

This introduces an optional feature that allocates large GPU buffers (≥ 1 GB) with USM system allocations when the device supports them. Such buffers come from the system allocator, and the system then manages memory migration between host and device as needed.

The feature is disabled by default and is enabled through the GGML_SYCL_USM_SYSTEM environment variable. If the device or the system does not support USM system allocations, we fall back to regular allocations.
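
The selection rule described above can be sketched in plain C++ as follows. This is only an illustration of the decision, not the code from this PR: `alloc_kind`, `usm_system_requested`, `choose_alloc` and `k_usm_system_min_size` are hypothetical names, the threshold is assumed to mean 1 GiB, and the variable is assumed to be parsed as an integer where any non-zero value enables the feature.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical names for illustration; they are not taken from the PR's diff.
enum class alloc_kind { device, usm_system };

// Assumption: the "≥ 1GB" threshold is interpreted as 1 GiB.
constexpr std::size_t k_usm_system_min_size = std::size_t(1) << 30;

// Assumption: the variable is read as an integer; unset, empty or "0" means disabled.
// Pass the result of std::getenv("GGML_SYCL_USM_SYSTEM") as `value`.
inline bool usm_system_requested(const char * value) {
    return value != nullptr && std::atoi(value) != 0;
}

// Use a USM system allocation only when the user opted in, the device reports
// support, and the buffer is large enough. Everything else takes the regular
// device allocation path, which is also the fallback.
inline alloc_kind choose_alloc(std::size_t size, bool requested, bool device_supports) {
    if (requested && device_supports && size >= k_usm_system_min_size) {
        return alloc_kind::usm_system;
    }
    return alloc_kind::device;
}
```

With this rule, a small buffer still uses a regular device allocation even when GGML_SYCL_USM_SYSTEM=1, and a device without support silently keeps the regular path for every size.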

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: NO

arthw · 2026-04-29T15:51:54Z

@ifdu
I will verify and review it after I come back from a long public holiday (5 days).

Thank you for your contribution!

arthw

@ifdu
I tested on a B580 with Ubuntu 26.04.
The B580 has 12 GB of VRAM, so I ran qwen2.5-32b-instruct-q4_0.gguf (18 GB), which does not fit in VRAM.

With USM system allocations disabled (export GGML_SYCL_USM_SYSTEM=0), allocation fails with this error:
ggml_backend_sycl_buffer_type_alloc_buffer: can't allocate 12745011200 Bytes of memory on device

This is expected.

With USM system allocations enabled (export GGML_SYCL_USM_SYSTEM=1), it crashes:

export GGML_SYCL_USM_SYSTEM=1
export ONEAPI_DEVICE_SELECTOR="level_zero:0"
./examples/sycl/test.sh -m ../models/qwen2.5-32b-instruct-q4_0.gguf -lv 4

...
llama_kv_cache: layer  61: dev = SYCL0
llama_kv_cache: layer  62: dev = SYCL0
llama_kv_cache: layer  63: dev = SYCL0
llama_kv_cache:      SYCL0 KV buffer size =  1024.00 MiB
level_zero backend failed with error: 20 (UR_RESULT_ERROR_DEVICE_LOST)
Exception caught at file:/home/zjy/ws/llama.cpp/ifdu_svm/ggml/src/ggml-sycl/ggml-sycl.cpp, line:586, func:operator()
SYCL error: CHECK_TRY_ERROR((*stream) .memset(ctx->dev_ptr, value, buffer->size) .wait()): Exception caught in this line of code.
  in function ggml_backend_sycl_buffer_clear at /home/zjy/ws/llama.cpp/ifdu_svm/ggml/src/ggml-sycl/ggml-sycl.cpp:586
/home/zjy/ws/llama.cpp/ifdu_svm/ggml/src/ggml-sycl/../ggml-sycl/common.hpp:134: SYCL error
[New LWP 28045]
[New LWP 27976]
⚠️ warning: File "/opt/intel/oneapi/compiler/2025.3/lib/libsycl.so.8.0.0-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
	add-auto-load-safe-path /opt/intel/oneapi/compiler/2025.3/lib/libsycl.so.8.0.0-gdb.py
line to your configuration file "/home/zjy/.config/gdb/gdbinit".
To completely disable this security protection add
	set auto-load safe-path /
line to your configuration file "/home/zjy/.config/gdb/gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
	info "(gdb)Auto-loading safe path"
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/x86_64-linux-gnu/libthread_db.so.1".
__syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
⚠️ warning: 56	../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S: No such file or directory
#0  __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
56	in ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S
#1  0x0000704778ca067c in __internal_syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=0, a6=0, nr=61) at ./nptl/cancellation.c:49
⚠️ warning: 49	./nptl/cancellation.c: No such file or directory
#2  __syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:75
75	in ./nptl/cancellation.c
#3  0x0000704778d1cdcf in __GI___wait4 (pid=<optimized out>, stat_loc=<optimized out>, options=<optimized out>, usage=<optimized out>) at ../sysdeps/unix/sysv/linux/wait4.c:30
⚠️ warning: 30	../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#4  0x000070477bb2c63a in ggml_print_backtrace () from /home/zjy/ws/llama.cpp/ifdu_svm/build/bin/libggml-base.so.0
#5  0x000070477bb2b7b9 in ggml_abort () from /home/zjy/ws/llama.cpp/ifdu_svm/build/bin/libggml-base.so.0
#6  0x00007047795036b8 in ggml_sycl_error(char const*, char const*, char const*, int, char const*) () from /home/zjy/ws/llama.cpp/ifdu_svm/build/bin/libggml-sycl.so.0
#7  0x000070477950d92c in ggml_backend_sycl_buffer_clear(ggml_backend_buffer*, unsigned char) () from /home/zjy/ws/llama.cpp/ifdu_svm/build/bin/libggml-sycl.so.0
#8  0x000070477bcdd5d7 in llama_kv_cache::llama_kv_cache(llama_model const&, ggml_type, ggml_type, bool, bool, bool, unsigned int, unsigned int, unsigned int, unsigned int, llama_swa_type, std::function<bool (int)> const&, std::function<int (int)> const&) () from /home/zjy/ws/llama.cpp/ifdu_svm/build/bin/libllama.so.0
#9  0x000070477bda6818 in llama_model::create_memory(llama_memory_params const&, llama_cparams const&) const () from /home/zjy/ws/llama.cpp/ifdu_svm/build/bin/libllama.so.0
#10 0x000070477bc92ade in llama_context::llama_context(llama_model const&, llama_context_params) () from /home/zjy/ws/llama.cpp/ifdu_svm/build/bin/libllama.so.0
#11 0x000070477bc9d4bd in llama_init_from_model () from /home/zjy/ws/llama.cpp/ifdu_svm/build/bin/libllama.so.0
#12 0x00000000004cac84 in common_init_result::common_init_result(common_params&) ()
#13 0x00000000004cb31f in common_init_from_params(common_params&) ()
#14 0x00000000004275d7 in main ()
[Inferior 1 (process 27960) detached]

arthw · 2026-05-11T13:37:09Z

When the GPU's VRAM is large enough to hold the whole model, the performance impact on MoE models is smaller:

It passes with the following command on one B580:

export GGML_SYCL_USM_SYSTEM=1
./examples/sycl/test.sh -m ../models/DeepSeek-MoE-16b-Chat.Q8_0.gguf -mg 0

...
common_perf_print: prompt eval time =   28854.07 ms /    19 tokens ( 1518.64 ms per token,     0.66 tokens per second)
common_perf_print:        eval time =  686496.92 ms /   399 runs   ( 1720.54 ms per token,     0.58 tokens per second)

Two B580s:

export GGML_SYCL_USM_SYSTEM=1
./examples/sycl/test.sh -m ../models/DeepSeek-MoE-16b-Chat.Q8_0.gguf
...
common_perf_print: prompt eval time =    1385.05 ms /    19 tokens (   72.90 ms per token,    13.72 tokens per second)
common_perf_print:        eval time =   16066.21 ms /   399 runs   (   40.27 ms per token,    24.83 tokens per second)

Two B580s without this feature:

export GGML_SYCL_USM_SYSTEM=0
./examples/sycl/test.sh -m ../models/DeepSeek-MoE-16b-Chat.Q8_0.gguf
common_perf_print: prompt eval time =    1123.95 ms /    19 tokens (   59.16 ms per token,    16.90 tokens per second)
common_perf_print:        eval time =   16053.11 ms /   399 runs   (   40.23 ms per token,    24.85 tokens per second)

arthw · 2026-05-16T09:32:45Z

@ifdu
Have you fixed the crash in the following case?

export GGML_SYCL_USM_SYSTEM=1
export ONEAPI_DEVICE_SELECTOR="level_zero:0"
./examples/sycl/test.sh -m ../models/qwen2.5-32b-instruct-q4_0.gguf -lv 4

arthw

Good job!

This PR lays the groundwork for supporting the USM feature of new Linux kernels.
It will be a good reference for further USM features in the SYCL backend.

Thank you!

arthw · 2026-06-13T01:11:13Z

@ifdu
The PR shouldn't break the build on Windows.
Please update it!

Thank you!

This introduces an optional feature that allocates large GPU buffers (≥ 1 GB) with USM system allocations when the device supports them. Such buffers come from the system allocator, and the system then manages memory migration between host and device as needed.

The feature is disabled by default and is enabled through the GGML_SYCL_USM_SYSTEM environment variable. If the device or the system does not support USM system allocations, we fall back to regular allocations.

This feature can allow VRAM overcommit. For example, the test below fails on B580 because there is not enough memory for the allocation, but it passes when USM system allocations are enabled:

./examples/sycl/test.sh -m Qwen3.5-27B-Q3_K_M.gguf -lv 4

Signed-off-by: Francois Dugast <francois.dugast@intel.com>

arthw

Good job!

This PR adds support for the USM feature provided by new Linux kernels.
It will be the basis for future USM features.

It is experimental for now.

Thank you!

ifdu requested a review from a team as a code owner April 29, 2026 15:34

github-actions Bot added the documentation, ggml, and SYCL labels Apr 29, 2026

NeoZhangJianyu mentioned this pull request Apr 30, 2026

SYCL: fix multi-GPU system RAM exhaustion by using Level Zero allocations #21597

Merged

7 tasks

arthw reviewed May 9, 2026

ifdu force-pushed the svm branch from 92d516e to b53bd0e Compare May 12, 2026 09:38

ifdu force-pushed the svm branch from b53bd0e to f8e0675 Compare June 5, 2026 09:02

ifdu force-pushed the svm branch from f8e0675 to 631a2b1 Compare June 12, 2026 10:47

arthw approved these changes Jun 13, 2026

ifdu force-pushed the svm branch from 631a2b1 to 5bc5e45 Compare June 15, 2026 14:54

arthw approved these changes Jun 16, 2026

arthw added the merge ready label Jun 16, 2026

ggerganov merged commit 9b260fc into ggml-org:master Jun 17, 2026
28 checks passed
