ggml-cuda: fixes for concurrent streams #18496
Is this path stable enough now? I thought it still had some issues. I will try to sync the latest codebase to the ggml and whisper.cpp repos today, and maybe we can run some tests with whisper.cpp + GGML_CUDA_GRAPH_OPT to make sure there are no side effects there, before merging this change?
Yes, it should be fine, although more tests are always welcome. I ran the log-probs test on a couple of popular models with this branch vs. master and they were fine.
BTW, we should use this PR as a test, because it contains a fix and adds more checks for when to enable this feature.
This PR enables the concurrent streams introduced in ggml-org#16991 by default. To disable them, a new env flag `GGML_CUDA_DISABLE_GRAPH_OPT` is introduced.
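The opt-out flag described above can be pictured as a one-time environment check. This is a hedged sketch only — the helper name and the exact value parsing are illustrative, not the actual ggml-cuda code:

```cpp
#include <cstdlib>
#include <cstring>

// Sketch (not the real ggml-cuda implementation): a kill switch like
// GGML_CUDA_DISABLE_GRAPH_OPT is typically read from the environment once,
// treating any non-empty value other than "0" as "disabled".
static bool graph_opt_disabled() {
    const char * v = std::getenv("GGML_CUDA_DISABLE_GRAPH_OPT");
    return v != nullptr && v[0] != '\0' && std::strcmp(v, "0") != 0;
}
```

With a check like this, the graph optimization stays on by default and users only set the variable to fall back to the old behavior.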
Can you extract the fix and the checks into a separate PR, before enabling
@ggerganov the fix and checks are in this PR, so I just changed the flag to have the same behavior as before (use
Ok, will pick it up when this gets merged. I checked that with these changes
```cpp
// only optimize for attn_norm
if (!strstr(root_node->name, "attn_norm")) {
    continue;
}
```
We should add TODOs for these name-based workarounds - ideally we want to avoid all special-casing in the backends based on node names.
Added the TODO. At the moment it is the only way I can contain the change. With different types of models/architectures there can be many nodes which have a fan-out of exactly 3, which would be totally untested perf-wise.
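The gating discussed above combines two conditions: a fan-out of exactly 3 (the Q/K/V consumers) and the `attn_norm` name match. A minimal sketch, with an illustrative node type rather than the real `ggml_tensor` graph structures:

```cpp
#include <cstring>
#include <string>
#include <vector>

// Illustrative node type; the actual ggml graph uses ggml_tensor and
// backend-specific bookkeeping, not this struct.
struct node_t {
    std::string name;
    std::vector<const node_t *> consumers; // nodes reading this node's output
};

// Sketch of the gating: only a node whose output fans out to exactly three
// consumers AND whose name matches the tested attn_norm (QKV) pattern is
// considered for concurrent streams; everything else stays on the serial path.
static bool is_concurrency_candidate(const node_t & n) {
    if (n.consumers.size() != 3) {
        return false;
    }
    return std::strstr(n.name.c_str(), "attn_norm") != nullptr;
}
```

This shows why the name check matters: fan-out-of-3 alone would match many untested subgraphs across architectures, while the name restricts the optimization to the one pattern that has been benchmarked.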
JohannesGaessler left a comment:
Remember to also update the title of the PR.
Somewhat related: the deadline that I've set for myself to try and produce a working prototype for generic tensor parallelism is January 12. In principle it would also be possible to re-use the same code for splitting a compute graph for a single GPU. We should test how that would perform vs. this implementation.
Amazing, way to bury the lede! Excited to see what you put out.
This commit causes a crash on master at cef1d23. I've bisected it down and dropped commit e57f523 on my local branch, and everything is back to normal. This is the backtrace I'm seeing with GPT-OSS 120B mxfp4 with CUDA using the "sm120a-real" architecture (dual GPU): The machine is a dual-GPU setup; here's how llama-server is started: All I wanted was to give the improved regex parser from #18342 a spin 😆
@thomasjfox I'm not able to reproduce this crash. Does it happen after you start the server, or after you send a prompt?
After sending a prompt. Crashes right on the first prompt. Sorry, should have mentioned that.
I have the same issue in b7621 and b7622. Reverted to b7620 and it's fixed.
Also, just to be clear, this is without `GGML_CUDA_GRAPH_OPT`?
Ok, I was able to reproduce the error; will put out a fix soon.
Yes, nothing specifically enabled. I also did a `make clean` to make sure it's not a borked build. The moment I put the commit back in (just tried), it crashes on the first prompt. Just tried "hello world". UPDATE: Ah, I just saw you reproduced it. Cool! I'll stay tuned.
Should be fixed by #18593.
Still crashes with the fix applied 😬. Also ran `make clean` to be on the safe side.
@thomasjfox do you mind trying that PR again?
I was just about to leave the computer alone... that fixed it! I tested GPT-OSS 120B (CPU MoE offloading) and MiniMax M2.1 (fully on GPU). Both work. Thanks a lot!
Couple of fixes for `GGML_CUDA_GRAPH_OPT=1`: restrict the optimization to `attn_norm`, since that is the only pattern (QKV) I've tested extensively.