Fix race conditions in threadpool when dealing with dynamic/frequent n_threads changes #17748
Conversation
Hi @max-krasnyansky, may I ask how to determine whether the updated test succeeded or failed? I ran the updated test on a Mac M4-Pro without the fix, but saw no hang.
For me it is crashing (segfault) because one or more threads would eventually get out of sync and start processing the graph while the cplan is getting updated. The race window is small, so you need more threads and more rounds (I used 10 threads with 1K rounds on the M4-Pro). Here is an example on M4-Pro with the new test but without the fix (crash output not captured in this transcript).

The issue is kind of obvious when you consider that the worker thread can get preempted in between the atomic reads and end up acting on an inconsistent pair of values. It's possible to fix this by adding more logic (don't bump `n_graph` until the threads are in sync), but a single atomic update of both fields is simpler.
Thanks for your clarification.
I'd say there is no need to reproduce the issue with the new test. Just see if this PR fixes your original issue.
Now I can reproduce the crash with your regression test.
The original discussion started in #17515
The short summary is that we have a race condition when the number of active threads changes rapidly while the worker threads are still in their hybrid polling loops.
I updated `test_barrier` to test for this scenario. There is an additional test in there now that flip-flops between doing `graph_compute` with 1 and N threads. Without the fix this new test quickly and reliably fails on all platforms that I tested: Snapdragon Gen3/4/5 (Android), Mac M4-Pro, AMD Ryzen 9 (Linux).

See this comment for the original report and analysis of the end-to-end use cases that trigger this scenario:
#17515 (comment)
This PR combines `n_graph` and `n_threads_cur` (number of active threads) into a single atomic update. I played with a bunch of ideas and this seems to be the cleanest/simplest way to ensure all threads see a consistent state without adding extra logic. Also worth noting that adding additional memory-ordering restrictions (i.e. instead of doing relaxed reads) is not sufficient, because the threads can get preempted in between the atomic reads and still end up with an inconsistent state.
Here is a quick test report from various systems:
AMD Ryzen 9 3950X (16-Cores) -- tested with and without OpenMP, with and without TSAN
Galaxy S24 Ultra (Gen3) -- no OpenMP, also tested Galaxy S25 (Gen4) and Gen5 device
Mac M4-Pro -- no OpenMP, with and without TSAN
Also tested all the usual stuff: `llama-cli` and `llama-bench` with various models and backends with partial offloads.

@DamonFool
Please give this a shot on your setup.
@jeffbolznv @ggerganov