Conversation

@am17an
Collaborator

@am17an am17an commented Dec 30, 2025

Couple of fixes for GGML_CUDA_GRAPH_OPT=1

  • Fixed a bug where we weren't clearing events in the case where the concurrent pattern is not valid memory-wise (see the sketch after this list)
  • To reduce the surface area, the forking node needs to be an attn_norm node, since that is the only pattern (QKV) I've tested extensively
  • Only works when CUDA graphs are enabled (so Ampere+, single-GPU at the moment)
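
For readers unfamiliar with the event bookkeeping, below is a minimal sketch of what "clearing events" means here, written against the CUDA runtime API; the `concurrent_fork` type and `clear_fork_events` helper are illustrative assumptions, not the actual ggml-cuda internals.

#include <cuda_runtime.h>
#include <vector>

// Illustrative sketch only (assumed names, not the actual ggml-cuda code).
// Events recorded for a fork point have to be released when the concurrency
// analysis decides the pattern is not valid memory-wise; otherwise stale
// events from a rejected fork leak into the next graph evaluation.
struct concurrent_fork {
    std::vector<cudaEvent_t> events; // one event per concurrent branch
};

static void clear_fork_events(concurrent_fork & fork) {
    for (cudaEvent_t ev : fork.events) {
        cudaEventDestroy(ev); // drop the handle instead of leaving it dangling
    }
    fork.events.clear();
}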

@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Dec 30, 2025
@am17an am17an force-pushed the graph-opt-fix branch 3 times, most recently from 918ebb9 to 0bb5294 on December 31, 2025 06:40
@ggerganov
Member

ggerganov commented Dec 31, 2025

Is this path stable enough now? I thought it still had some issues.

I will try to sync the latest codebase to the ggml and whisper.cpp repos today and maybe we can run some tests with whisper.cpp + GGML_CUDA_GRAPH_OPT to make sure there are not side-effects there, before merging this change?

@am17an
Collaborator Author

am17an commented Dec 31, 2025

Yes, it should be fine, although more tests are always welcome. I did the log-probs test on a couple of popular models with this branch vs. master and they were fine.

@am17an
Collaborator Author

am17an commented Dec 31, 2025

BTW we should use this PR as a test because it contains a fix + adds more checks as to when to enable this feature

This PR enables the concurrent streams introduced in ggml-org#16991 by default. To disable them, a new env flag `GGML_CUDA_DISABLE_GRAPH_OPT` is introduced.

@ggerganov
Member

BTW we should use this PR as a test because it contains a fix + adds more checks as to when to enable this feature

Can you extract the fix and the checks in a separate PR, before enabling GGML_CUDA_GRAPH_OPT by default?

@am17an
Collaborator Author

am17an commented Dec 31, 2025

@ggerganov the fix and checks are in this PR, so I just changed the flag to have the same behavior as before (use GGML_CUDA_GRAPH_OPT=1 to enable)
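
For context, opt-in gating on an environment variable usually boils down to something like the following; this is a generic sketch and not the exact check in ggml-cuda.

#include <cstdlib>
#include <cstring>

// Generic sketch of an opt-in flag check (not the exact ggml-cuda code):
// the stream concurrency optimization stays off unless the user exports
// GGML_CUDA_GRAPH_OPT=1, which matches the behavior before this PR.
static bool graph_opt_enabled() {
    const char * val = std::getenv("GGML_CUDA_GRAPH_OPT");
    return val != nullptr && std::strcmp(val, "1") == 0;
}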

@ggerganov
Member

Ok, will pick it up when this gets merged. I checked that with these changes whisper.cpp runs both with and without GGML_CUDA_GRAPH_OPT=1 (before this change it was crashing).

Comment on lines 3905 to 3909
// only optimize for attn_norm
if (!strstr(root_node->name, "attn_norm")) {
    continue;
}

Member


We should add TODOs for these name-based workarounds - ideally we want to avoid all special casing in the backends based on node names.

Collaborator Author


Added the TODO. At the moment it is the only way I can contain the change. With different types of models/architectures there can be many nodes which have a fan-out of exactly 3, which would be totally untested performance-wise.
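
As an illustration of why the extra restriction helps, a fan-out count over a ggml graph can be computed as below; `node_fan_out` is a hypothetical helper, not code from the PR, and it only shows why a "fan-out == 3" criterion alone would match many untested nodes.

#include "ggml.h"

// Hypothetical helper (not from the PR): count how many graph nodes consume
// `node` as a source. Several architectures contain nodes with a fan-out of
// exactly 3 that are not the attn_norm -> Q/K/V pattern, which is why the
// additional name check keeps the optimization limited to the tested case.
static int node_fan_out(struct ggml_cgraph * graph, const struct ggml_tensor * node) {
    int count = 0;
    for (int i = 0; i < ggml_graph_n_nodes(graph); ++i) {
        const struct ggml_tensor * cur = ggml_graph_node(graph, i);
        for (int s = 0; s < GGML_MAX_SRC; ++s) {
            if (cur->src[s] == node) {
                count++;
            }
        }
    }
    return count;
}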

Collaborator

@JohannesGaessler JohannesGaessler left a comment


Remember to also update the title of the PR.

Somewhat related: the deadline that I've set for myself to try and produce a working prototype for generic tensor parallelism is January 12. In principle it would also be possible to re-use the same code for splitting a compute graph for a single GPU. We should test how that would perform vs. this implementation.

@am17an am17an changed the title from "ggml-cuda: enable concurrent streams by default" to "ggml-cuda: fixes for concurrent streams" on Jan 3, 2026
@am17an
Collaborator Author

am17an commented Jan 3, 2026

produce a working prototype for generic tensor parallelism is January 12

Amazing, way to bury the lede! Excited to see what you put out

@am17an am17an merged commit e57f523 into ggml-org:master Jan 3, 2026
71 checks passed
@thomasjfox
Contributor

This commit causes a crash on master at cef1d23.

I bisected it down to this commit; dropping e57f523 from my local branch brings everything back to normal.

This is the backtrace I'm seeing with GPT-OSS 120B mxfp4 with CUDA using "sm120a-real" architecture (dual GPU):

Jan 04 12:17:44 /mnt/local/ai-models/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:94: CUDA error                                                                                                
Jan 04 12:17:44 CUDA error: an illegal memory access was encountered                          
Jan 04 12:17:44   current device: 0, in function ggml_backend_cuda_synchronize at /mnt/local/ai-models/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2837                                        
Jan 04 12:17:44   cudaStreamSynchronize(cuda_ctx->stream())                                   
Jan 04 12:17:44 [New LWP 833153]
Jan 04 12:17:44 [New LWP 833152]
Jan 04 12:17:44 [New LWP 833151]                                                                                                                                           
..
Jan 04 12:17:44 [New LWP 833088]                                                                                                                                                             
..
Jan 04 12:17:45 Using host libthread_db library "/lib64/libthread_db.so.1".                                                                                                                                                                                                                                                                                                               
Jan 04 12:17:45 0x00007fc7ce489422 in __syscall_cancel_arch () from /lib64/libc.so.6                                                                                                         
Jan 04 12:17:45 #0  0x00007fc7ce489422 in __syscall_cancel_arch () from /lib64/libc.so.6                                                                                                     
Jan 04 12:17:45 #1  0x00007fc7ce47d71c in __internal_syscall_cancel () from /lib64/libc.so.6                                                                                                 
Jan 04 12:17:45 #2  0x00007fc7ce47d764 in __syscall_cancel () from /lib64/libc.so.6                                                                                                          
Jan 04 12:17:45 #3  0x00007fc7ce4edbcf in wait4 () from /lib64/libc.so.6                      
Jan 04 12:17:45 #4  0x00007fc7d755cce3 in ggml_print_backtrace () from /mnt/local/ai-models/llama.cpp/build/bin/libggml-base.so.0                                                            
Jan 04 12:17:45 #5  0x00007fc7d755ce2c in ggml_abort () from /mnt/local/ai-models/llama.cpp/build/bin/libggml-base.so.0                                                                                                                                                                                                                                                                   
Jan 04 12:17:45 #6  0x00007fc7d49300e3 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /mnt/local/ai-models/llama.cpp/build/bin/libggml-cuda.so.0                                                                                                                                                                                                     
Jan 04 12:17:45 #7  0x00007fc7d49311d8 in ggml_backend_cuda_synchronize(ggml_backend*) () from /mnt/local/ai-models/llama.cpp/build/bin/libggml-cuda.so.0                                    
Jan 04 12:17:45 #8  0x00007fc7d75770ed in ggml_backend_sched_graph_compute_async () from /mnt/local/ai-models/llama.cpp/build/bin/libggml-base.so.0                                          
Jan 04 12:17:45 #9  0x00007fc7d72459d0 in llama_context::graph_compute(ggml_cgraph*, bool) () from /mnt/local/ai-models/llama.cpp/build/bin/libllama.so.0                                    
Jan 04 12:17:45 #10 0x00007fc7d72474b2 in llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) () from /mnt/local/ai-models/llama.cpp/build/bin/libllama.so.0                                                                                                                                                                        
Jan 04 12:17:45 #11 0x00007fc7d724ca7f in llama_context::decode(llama_batch const&) () from /mnt/local/ai-models/llama.cpp/build/bin/libllama.so.0                                           
Jan 04 12:17:45 #12 0x00007fc7d724d87e in llama_decode () from /mnt/local/ai-models/llama.cpp/build/bin/libllama.so.0                                                                        
Jan 04 12:17:45 #13 0x00000000004f6cee in server_context_impl::update_slots() ()                                                                                                             
Jan 04 12:17:45 #14 0x000000000053298f in server_queue::start_loop(long) ()                   
Jan 04 12:17:45 #15 0x000000000046950c in main ()

Machine is a dual-GPU setup; here's how llama-server is started:

llama.cpp/build/bin/llama-server \
    --model /mnt/local/ai-models/gpt-oss-120b-mxfp4-00001-of-00003.gguf \
    --temp 1.0 --top-p 1.0 --top-k 0 \
    -ngl 99 \
    --jinja --threads 16 --ctx-size 100000 \
    --cache-ram 32768 \
    -ot "blk\.[0-2]\.attn_.*=CUDA1" \
    -ot "blk\.[0-2]\.ffn_.*=CUDA1" \
    --cpu-moe \
    --chat-template-file gpt-oss-120b-chat_template.jinja \
    --chat-template-kwargs '{"reasoning_effort": "medium"}' \
    --flash-attn on \
    --slots \
    --timeout 3600

All I wanted was to give the improved regex parser from #18342 a spin 😆

@am17an
Collaborator Author

am17an commented Jan 4, 2026

@thomasjfox I'm not able to reproduce this crash, does it happen after you start the server or after you send a prompt?

@thomasjfox
Contributor

After sending a prompt. Crashes right on the first prompt.

Sorry, should have mentioned that.

@tccybo

tccybo commented Jan 4, 2026

I have the same issue in b7621 and b7622.
[49697] common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[49697] ggml_cuda_compute_forward: MUL failed
srv log_server_r: request: GET /v1/models 127.0.0.1 200
[49697] CUDA error: an illegal memory access was encountered
[49697] current device: 0, in function ggml_cuda_compute_forward at D:\a\llama.cpp\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:2738
[49697] err
[49697] D:\a\llama.cpp\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:94: CUDA error

reverted to b7620 and it's fixed.

@am17an
Collaborator Author

am17an commented Jan 4, 2026

Also just to be clear, this is without GGML_CUDA_GRAPH_OPT?

@am17an
Collaborator Author

am17an commented Jan 4, 2026

Ok I was able to reproduce the error, will put out a fix soon

@thomasjfox
Contributor

Also just to be clear, this is without GGML_CUDA_GRAPH_OPT?

yes, nothing specifically enabled. I also did a "make clean" to make sure it's not a borked build.

The moment I put the commit back in (just tried), it crashes on the first prompt. Just tried "hello world".

UPDATE: Ah, I just saw you reproduced it. Cool! I'll stay tuned.

@am17an
Collaborator Author

am17an commented Jan 4, 2026

Should be fixed by #18593

@am17an am17an deleted the graph-opt-fix branch January 4, 2026 12:48
@thomasjfox
Contributor

Should be fixed by #18593

still crashes with the fix applied 😬

Jan 04 13:51:52 /mnt/local/ai-models/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:94: CUDA error
Jan 04 13:51:52 CUDA error: an illegal memory access was encountered
Jan 04 13:51:52   current device: 1, in function ggml_backend_cuda_synchronize at /mnt/local/ai-models/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2837
Jan 04 13:51:52   cudaStreamSynchronize(cuda_ctx->stream())

Also ran "make clean" to be on the safe side.

@am17an
Collaborator Author

am17an commented Jan 4, 2026

@thomasjfox do you mind trying that PR again?

@thomasjfox
Contributor

@thomasjfox do you mind trying that PR again?

I was just about to leave the computer alone... that fixed it!

I tested GPT-OSS 120B (CPU MoE offloading) and MiniMax M2.1 (fully on GPU). Both work.

Thanks a lot!
