Conversation

@am17an
Collaborator

@am17an am17an commented Dec 30, 2025

Couple of fixes for GGML_CUDA_GRAPH_OPT=1

  • Fixed a bug where we weren't clearing events in the case where the concurrent pattern is not valid memory-wise (see the sketch after this list)
  • To reduce the surface area, the forking node needs to be an attn_norm node, since that is the only pattern (QKV) I've tested extensively
  • Only works when CUDA graphs are enabled (so Ampere+, single-GPU at the moment)
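
For readers unfamiliar with the event bookkeeping, below is a minimal sketch of what "clearing events" means here, written against the CUDA runtime API; the `concurrent_fork` type and `clear_fork_events` helper are illustrative assumptions, not the actual ggml-cuda internals.

#include <cuda_runtime.h>
#include <vector>

// Illustrative sketch only (assumed names, not the actual ggml-cuda code).
// Events recorded for a fork point have to be released when the concurrency
// analysis decides the pattern is not valid memory-wise; otherwise stale
// events from a rejected fork leak into the next graph evaluation.
struct concurrent_fork {
    std::vector<cudaEvent_t> events; // one event per concurrent branch
};

static void clear_fork_events(concurrent_fork & fork) {
    for (cudaEvent_t ev : fork.events) {
        cudaEventDestroy(ev); // drop the handle instead of leaving it dangling
    }
    fork.events.clear();
}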

@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Dec 30, 2025
@am17an am17an force-pushed the graph-opt-fix branch 3 times, most recently from 918ebb9 to 0bb5294 on December 31, 2025 06:40
@ggerganov
Member

ggerganov commented Dec 31, 2025

Is this path stable enough now? I thought it still had some issues.

I will try to sync the latest codebase to the ggml and whisper.cpp repos today and maybe we can run some tests with whisper.cpp + GGML_CUDA_GRAPH_OPT to make sure there are not side-effects there, before merging this change?

@am17an
Collaborator Author

am17an commented Dec 31, 2025

Yes, it should be fine, although more tests are always welcome. I did the log-probs test on a couple of popular models with this branch vs. master and they were fine.

@am17an
Collaborator Author

am17an commented Dec 31, 2025

BTW we should use this PR as a test because it contains a fix + adds more checks as to when to enable this feature

This PR enables the concurrent streams introduced in ggml-org#16991 by default. To disable them, a new env flag `GGML_CUDA_DISABLE_GRAPH_OPT` is introduced.

@ggerganov
Member

BTW we should use this PR as a test because it contains a fix + adds more checks as to when to enable this feature

Can you extract the fix and the checks in a separate PR, before enabling GGML_CUDA_GRAPH_OPT by default?

@am17an
Collaborator Author

am17an commented Dec 31, 2025

@ggerganov the fix and checks are in this PR, so I just changed the flag to have the same behavior as before (use GGML_CUDA_GRAPH_OPT=1 to enable)
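
For context, opt-in gating on an environment variable usually boils down to something like the following; this is a generic sketch and not the exact check in ggml-cuda.

#include <cstdlib>
#include <cstring>

// Generic sketch of an opt-in flag check (not the exact ggml-cuda code):
// the stream concurrency optimization stays off unless the user exports
// GGML_CUDA_GRAPH_OPT=1, which matches the behavior before this PR.
static bool graph_opt_enabled() {
    const char * val = std::getenv("GGML_CUDA_GRAPH_OPT");
    return val != nullptr && std::strcmp(val, "1") == 0;
}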

@ggerganov
Member

Ok, will pick it up when this gets merged. I checked that with these changes whisper.cpp runs both with and without GGML_CUDA_GRAPH_OPT=1 (before this change it was crashing).

Comment on lines 3905 to 3909
// only optimize for attn_norm
if (!strstr(root_node->name, "attn_norm")) {
    continue;
}

Member


We should add TODOs for these name-based workarounds - ideally we want to avoid all special casing in the backends based on node names.

Collaborator Author


Added the TODO. At the moment it is the only way I can contain the change. With different types of models/architectures there can be many nodes which have a fan-out of exactly 3, which would be totally untested performance-wise.
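
As an illustration of why the extra restriction helps, a fan-out count over a ggml graph can be computed as below; `node_fan_out` is a hypothetical helper, not code from the PR, and it only shows why a "fan-out == 3" criterion alone would match many untested nodes.

#include "ggml.h"

// Hypothetical helper (not from the PR): count how many graph nodes consume
// `node` as a source. Several architectures contain nodes with a fan-out of
// exactly 3 that are not the attn_norm -> Q/K/V pattern, which is why the
// additional name check keeps the optimization limited to the tested case.
static int node_fan_out(struct ggml_cgraph * graph, const struct ggml_tensor * node) {
    int count = 0;
    for (int i = 0; i < ggml_graph_n_nodes(graph); ++i) {
        const struct ggml_tensor * cur = ggml_graph_node(graph, i);
        for (int s = 0; s < GGML_MAX_SRC; ++s) {
            if (cur->src[s] == node) {
                count++;
            }
        }
    }
    return count;
}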

Collaborator

@JohannesGaessler JohannesGaessler left a comment


Remember to also update the title of the PR.

Somewhat related: the deadline that I've set for myself to try and produce a working prototype for generic tensor parallelism is January 12. In principle it would also be possible to re-use the same code for splitting a compute graph for a single GPU. We should test how that would perform vs. this implementation.

@am17an am17an changed the title from "ggml-cuda: enable concurrent streams by default" to "ggml-cuda: fixes for concurrent streams" on Jan 3, 2026
@am17an
Collaborator Author

am17an commented Jan 3, 2026

produce a working prototype for generic tensor parallelism is January 12

Amazing, way to bury the lede! Excited to see what you put out

@am17an am17an merged commit e57f523 into ggml-org:master Jan 3, 2026
71 checks passed
@thomasjfox
Contributor

This commit causes a crash on master at cef1d23.

I bisected it down to this commit; dropping e57f523 from my local branch brings everything back to normal.

This is the backtrace I'm seeing with GPT-OSS 120B mxfp4 with CUDA using "sm120a-real" architecture (dual GPU):

Jan 04 12:17:44 /mnt/local/ai-models/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:94: CUDA error                                                                                                
Jan 04 12:17:44 CUDA error: an illegal memory access was encountered                          
Jan 04 12:17:44   current device: 0, in function ggml_backend_cuda_synchronize at /mnt/local/ai-models/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2837                                        
Jan 04 12:17:44   cudaStreamSynchronize(cuda_ctx->stream())                                   
Jan 04 12:17:44 [New LWP 833153]
Jan 04 12:17:44 [New LWP 833152]
Jan 04 12:17:44 [New LWP 833151]                                                                                                                                           
..
Jan 04 12:17:44 [New LWP 833088]                                                                                                                                                             
..
Jan 04 12:17:45 Using host libthread_db library "/lib64/libthread_db.so.1".                                                                                                                                                                                                                                                                                                               
Jan 04 12:17:45 0x00007fc7ce489422 in __syscall_cancel_arch () from /lib64/libc.so.6                                                                                                         
Jan 04 12:17:45 #0  0x00007fc7ce489422 in __syscall_cancel_arch () from /lib64/libc.so.6                                                                                                     
Jan 04 12:17:45 #1  0x00007fc7ce47d71c in __internal_syscall_cancel () from /lib64/libc.so.6                                                                                                 
Jan 04 12:17:45 #2  0x00007fc7ce47d764 in __syscall_cancel () from /lib64/libc.so.6                                                                                                          
Jan 04 12:17:45 #3  0x00007fc7ce4edbcf in wait4 () from /lib64/libc.so.6                      
Jan 04 12:17:45 #4  0x00007fc7d755cce3 in ggml_print_backtrace () from /mnt/local/ai-models/llama.cpp/build/bin/libggml-base.so.0                                                            
Jan 04 12:17:45 #5  0x00007fc7d755ce2c in ggml_abort () from /mnt/local/ai-models/llama.cpp/build/bin/libggml-base.so.0                                                                                                                                                                                                                                                                   
Jan 04 12:17:45 #6  0x00007fc7d49300e3 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /mnt/local/ai-models/llama.cpp/build/bin/libggml-cuda.so.0                                                                                                                                                                                                     
Jan 04 12:17:45 #7  0x00007fc7d49311d8 in ggml_backend_cuda_synchronize(ggml_backend*) () from /mnt/local/ai-models/llama.cpp/build/bin/libggml-cuda.so.0                                    
Jan 04 12:17:45 #8  0x00007fc7d75770ed in ggml_backend_sched_graph_compute_async () from /mnt/local/ai-models/llama.cpp/build/bin/libggml-base.so.0                                          
Jan 04 12:17:45 #9  0x00007fc7d72459d0 in llama_context::graph_compute(ggml_cgraph*, bool) () from /mnt/local/ai-models/llama.cpp/build/bin/libllama.so.0                                    
Jan 04 12:17:45 #10 0x00007fc7d72474b2 in llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) () from /mnt/local/ai-models/llama.cpp/build/bin/libllama.so.0                                                                                                                                                                        
Jan 04 12:17:45 #11 0x00007fc7d724ca7f in llama_context::decode(llama_batch const&) () from /mnt/local/ai-models/llama.cpp/build/bin/libllama.so.0                                           
Jan 04 12:17:45 #12 0x00007fc7d724d87e in llama_decode () from /mnt/local/ai-models/llama.cpp/build/bin/libllama.so.0                                                                        
Jan 04 12:17:45 #13 0x00000000004f6cee in server_context_impl::update_slots() ()                                                                                                             
Jan 04 12:17:45 #14 0x000000000053298f in server_queue::start_loop(long) ()                   
Jan 04 12:17:45 #15 0x000000000046950c in main ()

Machine is a dual-GPU setup; here's how llama-server is started:

llama.cpp/build/bin/llama-server \
    --model /mnt/local/ai-models/gpt-oss-120b-mxfp4-00001-of-00003.gguf \
    --temp 1.0 --top-p 1.0 --top-k 0 \
    -ngl 99 \
    --jinja --threads 16 --ctx-size 100000 \
    --cache-ram 32768 \
    -ot "blk\.[0-2]\.attn_.*=CUDA1" \
    -ot "blk\.[0-2]\.ffn_.*=CUDA1" \
    --cpu-moe \
    --chat-template-file gpt-oss-120b-chat_template.jinja \
    --chat-template-kwargs '{"reasoning_effort": "medium"}' \
    --flash-attn on \
    --slots \
    --timeout 3600

All I wanted was to give the improved regex parser from #18342 a spin 😆

@am17an
Collaborator Author

am17an commented Jan 4, 2026

@thomasjfox I'm not able to reproduce this crash, does it happen after you start the server or after you send a prompt?

@thomasjfox
Contributor

After sending a prompt. Crashes right on the first prompt.

Sorry, should have mentioned that.

@tccybo

tccybo commented Jan 4, 2026

I have the same issue in b7621 and b7622.
[49697] common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[49697] ggml_cuda_compute_forward: MUL failed
srv log_server_r: request: GET /v1/models 127.0.0.1 200
[49697] CUDA error: an illegal memory access was encountered
[49697] current device: 0, in function ggml_cuda_compute_forward at D:\a\llama.cpp\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:2738
[49697] err
[49697] D:\a\llama.cpp\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:94: CUDA error

reverted to b7620 and it's fixed.

@am17an
Collaborator Author

am17an commented Jan 4, 2026

Also just to be clear, this is without GGML_CUDA_GRAPH_OPT?

@am17an
Collaborator Author

am17an commented Jan 4, 2026

Ok I was able to reproduce the error, will put out a fix soon

@thomasjfox
Contributor

Also just to be clear, this is without GGML_CUDA_GRAPH_OPT?

yes, nothing specifically enabled. I also did a "make clean" to make sure it's not a borked build.

The moment I put the commit back in (just tried), it crashes on the first prompt. Just tried "hello world".

UPDATE: Ah, I just saw you reproduced it. Cool! I'll stay tuned.

@am17an
Collaborator Author

am17an commented Jan 4, 2026

Should be fixed by #18593

@am17an am17an deleted the graph-opt-fix branch January 4, 2026 12:48
@thomasjfox
Contributor

Should be fixed by #18593

still crashes with the fix applied 😬

Jan 04 13:51:52 /mnt/local/ai-models/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:94: CUDA error
Jan 04 13:51:52 CUDA error: an illegal memory access was encountered
Jan 04 13:51:52   current device: 1, in function ggml_backend_cuda_synchronize at /mnt/local/ai-models/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2837
Jan 04 13:51:52   cudaStreamSynchronize(cuda_ctx->stream())

Also ran "make clean" to be on the safe side.

@am17an
Collaborator Author

am17an commented Jan 4, 2026

@thomasjfox do you mind trying that PR again?

@thomasjfox
Contributor

@thomasjfox do you mind trying that PR again?

I was just about to leave the computer alone... that fixed it!

I tested GPT-OSS 120B (CPU MoE offloading) and MiniMax M2.1 (fully on GPU). Both work.

Thanks a lot!
