ggml-cpu: randomly hang forever in ggml_barrier on weak memory model systems #17515
base: master
Conversation
…systems

Signed-off-by: Jie Fu <jiefu@tencent.com>
This explanation/fix doesn't make sense to me. If n_threads_cur is atomic, then changing relaxed to release/acquire only affects the consistency of other values that are synchronized by the store/release pair. It doesn't affect the atomicity or ordering of n_threads_cur itself. Is the issue about the ordering of n_threads_cur vs n_graph?
@DamonFool I bet the issue is simply that you have too many threads. The OS/device scheduler probably offlined some of the CPU cores for power saving, and there are not enough active cores to run all the threads. Try reducing the number of threads. My guess would be that you're using the default 6 and maybe 4 is better on your device. I agree with @jeffbolznv that making those atomic reads more strict is not a "fix".
Before this patch, the 6 worker threads may read a stale value of n_threads_cur. After this patch, all worker threads will read n_threads_cur = 6 once they have read the updated n_graph.
No, I don't think so. Before this patch, we can reproduce the hang in less than five minutes. |
@max-krasnyansky You can get the model here https://huggingface.co/tencent/Hunyuan-1.8B-Instruct . The device is Snapdragon 8 Gen 3, with Hexagon Arch version v75. |
Thanks for the details, the change makes sense to me now. |
Thanks @jeffbolznv for your review. |
Hi @max-krasnyansky, may I ask whether you can reproduce the hang issue in your environment?
Thanks to #16547, we can enable Hexagon NPUs in llama.cpp.
While inferencing with CPU & NPU, llama.cpp was observed to randomly hang forever in ggml_barrier in our experiments. The situation is:

Imagine that graph split n was scheduled to the NPU with 1 active worker thread.
Then graph split n+1 was scheduled to the CPU with 6 active worker threads.
However, only 5 worker threads wake up; the remaining thread goes back to sleep because the ggml_graph_compute_thread_active check fails.

Why does the ggml_graph_compute_thread_active check fail? Because that thread reads the old value of n_threads_cur as 1 (it should be 6). The case can be simplified with the following code.
The worker thread may read the old value of n_threads_cur (not the value the main thread set) with memory_order_relaxed on weak memory model systems. The suggested fix is to use the store-release/load-acquire pattern.