sycl: Add optional USM system allocations #22526

Merged
ggerganov merged 1 commit into ggml-org:master from ifdu:svm on Jun 17, 2026

@ifdu ifdu commented Apr 29, 2026

Contributor

Overview

This introduces an optional feature to allocate large GPU buffers (≥ 1 GB) using USM system allocations when the device supports them. Buffers then come from the system allocator, and the system manages memory migrations between host and device as necessary.

This feature is disabled by default; set the GGML_SYCL_USM_SYSTEM environment variable to enable it. If USM system allocations are not supported by the device or the system, we fall back to regular allocations.
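
The selection rule described above can be sketched in plain C++. This is a minimal illustration, not the PR's code: `AllocKind`, `kUsmSystemMinSize`, `usm_system_enabled`, `choose_alloc_kind`, and `choose_alloc_kind_from_env` are hypothetical names, the 1 GB threshold is assumed to mean 2^30 bytes, and the variable is assumed to be off when unset, empty, or `0`. In SYCL code the device capability would presumably come from a query such as `sycl::aspect::usm_system_allocations`; here it is passed in as a plain `bool` so the sketch has no SYCL dependency.

```cpp
#include <cstddef>
#include <cstdlib>

// Assumption: ">= 1 GB" from the description is taken to mean 2^30 bytes.
constexpr std::size_t kUsmSystemMinSize = std::size_t{1} << 30;

enum class AllocKind { Device, UsmSystem };

// Assumption: unset, empty, or "0" means disabled; any other value enables it.
constexpr bool usm_system_enabled(const char * env_value) {
    return env_value != nullptr && env_value[0] != '\0' &&
           !(env_value[0] == '0' && env_value[1] == '\0');
}

// All three conditions must hold; otherwise use a regular device allocation.
constexpr AllocKind choose_alloc_kind(std::size_t size,
                                      bool        device_supports_usm_system,
                                      const char * env_value) {
    return (usm_system_enabled(env_value) && device_supports_usm_system &&
            size >= kUsmSystemMinSize)
               ? AllocKind::UsmSystem
               : AllocKind::Device;
}

// Runtime entry point: reads the environment variable named in the description.
inline AllocKind choose_alloc_kind_from_env(std::size_t size,
                                            bool        device_supports_usm_system) {
    return choose_alloc_kind(size, device_supports_usm_system,
                             std::getenv("GGML_SYCL_USM_SYSTEM"));
}
```

Under these assumptions, a 2 GiB buffer on a supporting device with `GGML_SYCL_USM_SYSTEM=1` selects `AllocKind::UsmSystem`, while a small buffer, an unsupported device, or an unset variable each select `AllocKind::Device`.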

Requirements

@ifdu ifdu requested a review from a team as a code owner April 29, 2026 15:34
arthw commented Apr 29, 2026

Contributor

@ifdu
I will verify and review it after the upcoming long public holiday (5 days).

Thank you for your contribution!

@github-actions github-actions Bot added documentation Improvements or additions to documentation ggml changes relating to the ggml tensor library for machine learning SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language labels Apr 29, 2026

@arthw arthw left a comment

Contributor

@ifdu
I tested on a B580 with Ubuntu 26.04.
The B580 has 12 GB of VRAM, so I ran qwen2.5-32b-instruct-q4_0.gguf (18 GB), which does not fit.

  1. Disable USM system allocations with export GGML_SYCL_USM_SYSTEM=0.

This fails with an error:
ggml_backend_sycl_buffer_type_alloc_buffer: can't allocate 12745011200 Bytes of memory on device

This is expected.

  2. Enable USM system allocations with export GGML_SYCL_USM_SYSTEM=1.

This crashes:

export GGML_SYCL_USM_SYSTEM=1
export ONEAPI_DEVICE_SELECTOR="level_zero:0"
./examples/sycl/test.sh -m ../models/qwen2.5-32b-instruct-q4_0.gguf -lv 4

...
llama_kv_cache: layer  61: dev = SYCL0
llama_kv_cache: layer  62: dev = SYCL0
llama_kv_cache: layer  63: dev = SYCL0
llama_kv_cache:      SYCL0 KV buffer size =  1024.00 MiB
level_zero backend failed with error: 20 (UR_RESULT_ERROR_DEVICE_LOST)
Exception caught at file:/home/zjy/ws/llama.cpp/ifdu_svm/ggml/src/ggml-sycl/ggml-sycl.cpp, line:586, func:operator()
SYCL error: CHECK_TRY_ERROR((*stream) .memset(ctx->dev_ptr, value, buffer->size) .wait()): Exception caught in this line of code.
  in function ggml_backend_sycl_buffer_clear at /home/zjy/ws/llama.cpp/ifdu_svm/ggml/src/ggml-sycl/ggml-sycl.cpp:586
/home/zjy/ws/llama.cpp/ifdu_svm/ggml/src/ggml-sycl/../ggml-sycl/common.hpp:134: SYCL error
[New LWP 28045]
[New LWP 27976]
warning: File "/opt/intel/oneapi/compiler/2025.3/lib/libsycl.so.8.0.0-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
	add-auto-load-safe-path /opt/intel/oneapi/compiler/2025.3/lib/libsycl.so.8.0.0-gdb.py
line to your configuration file "/home/zjy/.config/gdb/gdbinit".
To completely disable this security protection add
	set auto-load safe-path /
line to your configuration file "/home/zjy/.config/gdb/gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
	info "(gdb)Auto-loading safe path"
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/x86_64-linux-gnu/libthread_db.so.1".
__syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
warning: 56	../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S: No such file or directory
#0  __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
56	in ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S
#1  0x0000704778ca067c in __internal_syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=0, a6=0, nr=61) at ./nptl/cancellation.c:49
warning: 49	./nptl/cancellation.c: No such file or directory
#2  __syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:75
75	in ./nptl/cancellation.c
#3  0x0000704778d1cdcf in __GI___wait4 (pid=<optimized out>, stat_loc=<optimized out>, options=<optimized out>, usage=<optimized out>) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30	../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#4  0x000070477bb2c63a in ggml_print_backtrace () from /home/zjy/ws/llama.cpp/ifdu_svm/build/bin/libggml-base.so.0
#5  0x000070477bb2b7b9 in ggml_abort () from /home/zjy/ws/llama.cpp/ifdu_svm/build/bin/libggml-base.so.0
#6  0x00007047795036b8 in ggml_sycl_error(char const*, char const*, char const*, int, char const*) () from /home/zjy/ws/llama.cpp/ifdu_svm/build/bin/libggml-sycl.so.0
#7  0x000070477950d92c in ggml_backend_sycl_buffer_clear(ggml_backend_buffer*, unsigned char) () from /home/zjy/ws/llama.cpp/ifdu_svm/build/bin/libggml-sycl.so.0
#8  0x000070477bcdd5d7 in llama_kv_cache::llama_kv_cache(llama_model const&, ggml_type, ggml_type, bool, bool, bool, unsigned int, unsigned int, unsigned int, unsigned int, llama_swa_type, std::function<bool (int)> const&, std::function<int (int)> const&) () from /home/zjy/ws/llama.cpp/ifdu_svm/build/bin/libllama.so.0
#9  0x000070477bda6818 in llama_model::create_memory(llama_memory_params const&, llama_cparams const&) const () from /home/zjy/ws/llama.cpp/ifdu_svm/build/bin/libllama.so.0
#10 0x000070477bc92ade in llama_context::llama_context(llama_model const&, llama_context_params) () from /home/zjy/ws/llama.cpp/ifdu_svm/build/bin/libllama.so.0
#11 0x000070477bc9d4bd in llama_init_from_model () from /home/zjy/ws/llama.cpp/ifdu_svm/build/bin/libllama.so.0
#12 0x00000000004cac84 in common_init_result::common_init_result(common_params&) ()
#13 0x00000000004cb31f in common_init_from_params(common_params&) ()
#14 0x00000000004275d7 in main ()
[Inferior 1 (process 27960) detached]
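
To put the failed allocation from case 1 in context, the byte count in that error message can be converted to GiB and compared with the card's VRAM. This is only a small arithmetic sketch: the helper `bytes_to_gib` and the constant names are hypothetical, and the B580's "12GB" is assumed to mean 12 GiB.

```cpp
#include <cstdint>

constexpr double kBytesPerGiB = 1024.0 * 1024.0 * 1024.0;

// Convert a byte count to GiB.
constexpr double bytes_to_gib(std::uint64_t bytes) {
    return static_cast<double>(bytes) / kBytesPerGiB;
}

// Size of the allocation that failed in case 1, copied from the error message.
constexpr std::uint64_t kFailedAllocBytes = 12745011200ULL;

// Assumption: the B580's "12GB" of VRAM is 12 GiB.
constexpr std::uint64_t kVramBytes = 12ULL << 30;

constexpr double kFailedAllocGiB = bytes_to_gib(kFailedAllocBytes);
constexpr double kFractionOfVram =
    static_cast<double>(kFailedAllocBytes) / static_cast<double>(kVramBytes);
```

Under that assumption, the single buffer is about 11.87 GiB, roughly 98.9% of the card's VRAM, before the KV cache or any other buffer is counted.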

arthw commented May 11, 2026

Contributor

When the GPU's VRAM is large enough to hold the whole model, the performance impact on MoE is smaller:

It passes with the following command on one B580:

export GGML_SYCL_USM_SYSTEM=1
./examples/sycl/test.sh -m ../models/DeepSeek-MoE-16b-Chat.Q8_0.gguf -mg 0

...
common_perf_print: prompt eval time =   28854.07 ms /    19 tokens ( 1518.64 ms per token,     0.66 tokens per second)
common_perf_print:        eval time =  686496.92 ms /   399 runs   ( 1720.54 ms per token,     0.58 tokens per second)

Two B580s:

export GGML_SYCL_USM_SYSTEM=1
./examples/sycl/test.sh -m ../models/DeepSeek-MoE-16b-Chat.Q8_0.gguf
...
common_perf_print: prompt eval time =    1385.05 ms /    19 tokens (   72.90 ms per token,    13.72 tokens per second)
common_perf_print:        eval time =   16066.21 ms /   399 runs   (   40.27 ms per token,    24.83 tokens per second)

Two B580s without this feature:

export GGML_SYCL_USM_SYSTEM=0
./examples/sycl/test.sh -m ../models/DeepSeek-MoE-16b-Chat.Q8_0.gguf
common_perf_print: prompt eval time =    1123.95 ms /    19 tokens (   59.16 ms per token,    16.90 tokens per second)
common_perf_print:        eval time =   16053.11 ms /   399 runs   (   40.23 ms per token,    24.85 tokens per second)
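
The three runs above can be compared with a little arithmetic. The sketch below only restates the logged times; `slowdown_percent` and the constant names are hypothetical, and no new measurements are implied.

```cpp
// Relative slowdown, in percent, of a measured time against a baseline time.
constexpr double slowdown_percent(double measured_ms, double baseline_ms) {
    return (measured_ms - baseline_ms) / baseline_ms * 100.0;
}

// Total times copied from the common_perf_print lines of the two-B580 runs.
constexpr double kPromptMsUsm      = 1385.05;   // GGML_SYCL_USM_SYSTEM=1
constexpr double kPromptMsBaseline = 1123.95;   // GGML_SYCL_USM_SYSTEM=0
constexpr double kEvalMsUsm        = 16066.21;  // GGML_SYCL_USM_SYSTEM=1
constexpr double kEvalMsBaseline   = 16053.11;  // GGML_SYCL_USM_SYSTEM=0

constexpr double kPromptSlowdown = slowdown_percent(kPromptMsUsm, kPromptMsBaseline);
constexpr double kEvalSlowdown   = slowdown_percent(kEvalMsUsm, kEvalMsBaseline);

// Per-token eval time: one B580 vs two B580s, both with the feature enabled.
constexpr double kSingleVsDualPerTokenRatio = 1720.54 / 40.27;
```

With these numbers, enabling the feature on two B580s costs about 23% in prompt eval time and under 0.1% in eval time, while the single-B580 run is roughly 43 times slower per generated token than the two-B580 run. The prompt figure comes from only 19 tokens, so it may be noisy.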

arthw commented May 16, 2026

Contributor

@ifdu
Have you fixed the crash in the following case?

export GGML_SYCL_USM_SYSTEM=1
export ONEAPI_DEVICE_SELECTOR="level_zero:0"
./examples/sycl/test.sh -m ../models/qwen2.5-32b-instruct-q4_0.gguf -lv 4

@arthw arthw left a comment

Contributor

Good job!

This PR builds the base for supporting the USM feature of newer Linux kernels.
It will be a good reference for adding more USM features to the SYCL backend.

Thank you!

arthw commented Jun 13, 2026

Contributor

@ifdu
This PR shouldn't break the build on Windows.
Please update it!

Thank you!

This introduces an optional feature to allocate large GPU buffers (≥ 1 GB)
using USM system allocations if supported by the device. It allows using
buffers from the system allocator, then letting the system manage memory
migrations between host and device as necessary.

This feature is disabled by default and requires the GGML_SYCL_USM_SYSTEM
environment variable to enable. If USM system allocations are not supported
by the device or the system, we fall back to regular allocations.

This feature can allow VRAM overcommit. For example, the test below fails
on B580 due to lack of memory for allocation, but it passes when enabling
USM system allocations:

  ./examples/sycl/test.sh -m Qwen3.5-27B-Q3_K_M.gguf -lv 4

Signed-off-by: Francois Dugast <francois.dugast@intel.com>

@arthw arthw left a comment

Contributor

Good job!

This PR adds support for USM as provided by newer Linux kernels.
It will be the base for future USM features.

It is experimental for now.

Thank you!

@arthw arthw added the merge ready A maintainer can use this label to indicate that they consider the changes final and ready to merge. label Jun 16, 2026
@ggerganov ggerganov merged commit 9b260fc into ggml-org:master Jun 17, 2026
28 checks passed
papamoose pushed a commit to papamoose/llama.cpp that referenced this pull request Jun 27, 2026