ggml-cpu: randomly hang forever in ggml_barrier on weak memory model systems #17515
base: master
Conversation
…systems

Signed-off-by: Jie Fu <jiefu@tencent.com>
This explanation/fix doesn't make sense to me. If n_threads_cur is atomic, then changing relaxed to release/acquire only affects the consistency of other values that are synchronized by the store/release pair. It doesn't affect the atomicity or ordering of n_threads_cur itself. Is the issue about the ordering of n_threads_cur vs n_graph?
@DamonFool I bet the issue is simply that you have too many threads. The OS/device scheduler probably offlined some of the CPU cores for power saving, and there are not enough active cores to run all the threads. Try reducing the number of threads. My guess would be that you're using the default 6 and maybe 4 is better on your device. I agree with @jeffbolznv that making those atomic reads more strict is not a "fix".
Before this patch, the 6 worker threads may read a stale value of n_threads_cur. After this patch, all worker threads will read n_threads_cur = 6 once they have read the updated n_graph.
No, I don't think so. Before this patch, we can reproduce the hang in less than five minutes. |
@max-krasnyansky You can get the model here https://huggingface.co/tencent/Hunyuan-1.8B-Instruct . The device is Snapdragon 8 Gen 3, with Hexagon Arch version v75. |
Thanks for the details, the change makes sense to me now. |
Thanks @jeffbolznv for your review. |
Hi @max-krasnyansky, may I ask whether you can reproduce the hang issue in your environment?
Thanks to #16547, we can enable Hexagon NPUs in llama.cpp.
While inferencing with CPU & NPU, llama.cpp was observed to randomly hang forever in ggml_barrier in our experiments. The situation is:

Imagine that graph split n was scheduled to the NPU with 1 active worker thread.
Then graph split n+1 was scheduled to the CPU with 6 active worker threads.
However, only 5 worker threads wake up; the remaining thread goes back to sleep because the ggml_graph_compute_thread_active check fails.

Why does the ggml_graph_compute_thread_active check fail? Because that thread reads the old value of n_threads_cur as 1 (it should be 6). The case can be simplified with the following code.
The worker thread may read the old value of n_threads_cur (not the value the main thread set) with memory_order_relaxed on weak memory model systems. The suggested fix is to use the store-release/load-acquire pattern.