sycl: Add optional USM system allocations #22526
@ifdu Thank you for your contribution!
@ifdu
I tested on a B580 with Ubuntu 26.04.
The B580 has 12 GB of VRAM, so I ran an LLM that does not fit in it: qwen2.5-32b-instruct-q4_0.gguf (18 GB).
- With USM disabled (export GGML_SYCL_USM_SYSTEM=0), the allocation fails with this error:
ggml_backend_sycl_buffer_type_alloc_buffer: can't allocate 12745011200 Bytes of memory on device
This is expected.
- With USM enabled (export GGML_SYCL_USM_SYSTEM=1), it crashes:
export GGML_SYCL_USM_SYSTEM=1
export ONEAPI_DEVICE_SELECTOR="level_zero:0"
./examples/sycl/test.sh -m ../models/qwen2.5-32b-instruct-q4_0.gguf -lv 4
...
llama_kv_cache: layer 61: dev = SYCL0
llama_kv_cache: layer 62: dev = SYCL0
llama_kv_cache: layer 63: dev = SYCL0
llama_kv_cache: SYCL0 KV buffer size = 1024.00 MiB
level_zero backend failed with error: 20 (UR_RESULT_ERROR_DEVICE_LOST)
Exception caught at file:/home/zjy/ws/llama.cpp/ifdu_svm/ggml/src/ggml-sycl/ggml-sycl.cpp, line:586, func:operator()
SYCL error: CHECK_TRY_ERROR((*stream) .memset(ctx->dev_ptr, value, buffer->size) .wait()): Exception caught in this line of code.
in function ggml_backend_sycl_buffer_clear at /home/zjy/ws/llama.cpp/ifdu_svm/ggml/src/ggml-sycl/ggml-sycl.cpp:586
/home/zjy/ws/llama.cpp/ifdu_svm/ggml/src/ggml-sycl/../ggml-sycl/common.hpp:134: SYCL error
[New LWP 28045]
[New LWP 27976]
warning: File "/opt/intel/oneapi/compiler/2025.3/lib/libsycl.so.8.0.0-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
add-auto-load-safe-path /opt/intel/oneapi/compiler/2025.3/lib/libsycl.so.8.0.0-gdb.py
line to your configuration file "/home/zjy/.config/gdb/gdbinit".
To completely disable this security protection add
set auto-load safe-path /
line to your configuration file "/home/zjy/.config/gdb/gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual. E.g., run from the shell:
info "(gdb)Auto-loading safe path"
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/x86_64-linux-gnu/libthread_db.so.1".
__syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
warning: 56 ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S: No such file or directory
#0 __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
56 in ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S
#1 0x0000704778ca067c in __internal_syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=0, a6=0, nr=61) at ./nptl/cancellation.c:49
warning: 49 ./nptl/cancellation.c: No such file or directory
#2 __syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:75
75 in ./nptl/cancellation.c
#3 0x0000704778d1cdcf in __GI___wait4 (pid=<optimized out>, stat_loc=<optimized out>, options=<optimized out>, usage=<optimized out>) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#4 0x000070477bb2c63a in ggml_print_backtrace () from /home/zjy/ws/llama.cpp/ifdu_svm/build/bin/libggml-base.so.0
#5 0x000070477bb2b7b9 in ggml_abort () from /home/zjy/ws/llama.cpp/ifdu_svm/build/bin/libggml-base.so.0
#6 0x00007047795036b8 in ggml_sycl_error(char const*, char const*, char const*, int, char const*) () from /home/zjy/ws/llama.cpp/ifdu_svm/build/bin/libggml-sycl.so.0
#7 0x000070477950d92c in ggml_backend_sycl_buffer_clear(ggml_backend_buffer*, unsigned char) () from /home/zjy/ws/llama.cpp/ifdu_svm/build/bin/libggml-sycl.so.0
#8 0x000070477bcdd5d7 in llama_kv_cache::llama_kv_cache(llama_model const&, ggml_type, ggml_type, bool, bool, bool, unsigned int, unsigned int, unsigned int, unsigned int, llama_swa_type, std::function<bool (int)> const&, std::function<int (int)> const&) () from /home/zjy/ws/llama.cpp/ifdu_svm/build/bin/libllama.so.0
#9 0x000070477bda6818 in llama_model::create_memory(llama_memory_params const&, llama_cparams const&) const () from /home/zjy/ws/llama.cpp/ifdu_svm/build/bin/libllama.so.0
#10 0x000070477bc92ade in llama_context::llama_context(llama_model const&, llama_context_params) () from /home/zjy/ws/llama.cpp/ifdu_svm/build/bin/libllama.so.0
#11 0x000070477bc9d4bd in llama_init_from_model () from /home/zjy/ws/llama.cpp/ifdu_svm/build/bin/libllama.so.0
#12 0x00000000004cac84 in common_init_result::common_init_result(common_params&) ()
#13 0x00000000004cb31f in common_init_from_params(common_params&) ()
#14 0x00000000004275d7 in main ()
[Inferior 1 (process 27960) detached]
When the GPU's VRAM is large enough to load the whole LLM, the performance impact on MoE models is smaller.
It passed with the following command on one B580:
2 B580
2 B580 without this feature
@ifdu
arthw
left a comment
Good job!
This PR builds the base for supporting the USM feature of new Linux kernels.
It will be a good reference for further USM features in the SYCL backend.
Thank you!
@ifdu Thank you!
This introduces an optional feature to allocate large GPU buffers (≥ 1 GB) using USM system allocations if supported by the device. It allows using buffers from the system allocator, then letting the system manage memory migrations between host and device as necessary.
This feature is disabled by default and requires the GGML_SYCL_USM_SYSTEM environment variable to enable. If USM system allocations are not supported by the device or the system, we fall back to regular allocations.
This feature can allow VRAM overcommit. For example, the test below fails on B580 due to lack of memory for allocation, but it passes when enabling USM system allocations:
./examples/sycl/test.sh -m Qwen3.5-27B-Q3_K_M.gguf -lv 4
Signed-off-by: Francois Dugast <francois.dugast@intel.com>
arthw
left a comment
Good job!
This PR adds support for the USM provided by new Linux kernels.
It will be the base for future USM features.
It is experimental for now.
Thank you!
Overview
This introduces an optional feature to allocate large GPU buffers (≥ 1 GB) using USM system allocations if supported by the device. It allows using buffers from the system allocator, then letting the system manage memory migrations between host and device as necessary.
This feature is disabled by default and requires the GGML_SYCL_USM_SYSTEM environment variable to enable. If USM system allocations are not supported by the device or the system, we fall back to regular allocations.
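The selection policy described above can be sketched in plain C++. This is only an illustration under stated assumptions, not the code from this PR: the names `AllocKind`, `kLargeBufferBytes`, `usm_system_requested` and `choose_alloc_kind` are hypothetical, the threshold is read as 1 GiB, and any value of the variable other than "0" is treated as enabling the feature. The device-support flag is passed in as a plain bool; in SYCL 2020 such a flag could come from `device.has(sycl::aspect::usm_system_allocations)`, though whether the PR queries it that way is not stated here.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical names throughout; this is a sketch, not the PR's implementation.
enum class AllocKind { Device, UsmSystem };

// "Large" means >= 1 GB per the description; reading that as 1 GiB is an assumption.
constexpr std::size_t kLargeBufferBytes = std::size_t{1} << 30;

// Returns true when GGML_SYCL_USM_SYSTEM is set to anything other than "0".
// Assumption: the exact parsing rules used by the PR are not shown in the text.
bool usm_system_requested() {
    const char * value = std::getenv("GGML_SYCL_USM_SYSTEM");
    return value != nullptr && std::strcmp(value, "0") != 0;
}

// Picks USM system allocation only when the user opted in, the device supports
// it, and the buffer is large; every other case falls back to a regular
// device allocation.
AllocKind choose_alloc_kind(std::size_t size, bool requested, bool device_supports_usm_system) {
    if (requested && device_supports_usm_system && size >= kLargeBufferBytes) {
        return AllocKind::UsmSystem;
    }
    return AllocKind::Device;
}
```

A caller would evaluate `choose_alloc_kind(size, usm_system_requested(), supported)` once per buffer. Keeping the decision in a pure function separate from the environment lookup makes the fallback behaviour easy to check without a GPU.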
Requirements