Conversation

@max-krasnyansky
Collaborator

The original discussion started in #17515

The short summary is that we have a race condition when the number of active threads changes rapidly while the worker threads are still in their hybrid polling loops.

I updated test_barrier to test for this scenario. There is an additional test in there now that flip-flops between doing graph_compute with 1 and N threads. Without the fix, this new test quickly and reliably fails on all platforms I tested: Snapdragon-Gen3/4/5 (Android), Mac M4-Pro, AMD Ryzen-9 (Linux).

See this comment for the original report and analysis of the end-to-end use-cases that trigger this scenario
#17515 (comment)

This PR combines n_graph and n_threads_cur (the number of active threads) into a single atomic update.
I played with a bunch of ideas and this seems to be the cleanest/simplest way to ensure all threads see a consistent state without adding extra logic. It is also worth noting that adding stricter memory ordering (i.e. instead of doing relaxed reads) is not sufficient, because threads can get preempted between the two atomic reads and still end up with an inconsistent state.

Here is a quick test report from various systems:

AMD Ryzen 9 3950X (16-Cores) -- tested with and without OpenMP, with and without TSAN

$ ./build-amd64-omp/bin/test-barrier 16 1000
graph-compute with
 n_threads: 16
   n_nodes: 2000
  n_rounds: 1000
graph-compute took 4176811 usec 
 4176.81 usec per-iter
 2088.41 nsec per-node

graph-compute with
 n_threads: 16
   n_nodes: 4
  n_rounds: 100000

$ ./build-amd64/bin/test-barrier 16 1000
graph-compute with
 n_threads: 16
   n_nodes: 2000
  n_rounds: 1000
graph-compute took 3982746 usec 
 3982.75 usec per-iter
 1991.37 nsec per-node
 
graph-compute with
 n_threads: 16
   n_nodes: 4
  n_rounds: 100000

Galaxy S24 Ultra (Gen3) -- no OpenMP; also tested Galaxy S25 (Gen4) and a Gen5 device

~/src/llama.cpp-hexagon$ ./scripts/snapdragon/adb/run-tool.sh test-barrier 6 1000
...
graph-compute with
 n_threads: 6
   n_nodes: 2000
  n_rounds: 1000
graph-compute took 1507086 usec 
 1507.09 usec per-iter
 753.543 nsec per-node

graph-compute with
 n_threads: 6
   n_nodes: 4
  n_rounds: 100000

Mac M4-Pro -- no OpenMP, with and without TSAN

$ ./build-macos/bin/test-barrier 10 1000
graph-compute with
 n_threads: 10
   n_nodes: 2000
  n_rounds: 1000
graph-compute took 3080797 usec 
 3080.8 usec per-iter
 1540.4 nsec per-node

graph-compute with
 n_threads: 10
   n_nodes: 4
  n_rounds: 100000

Also tested all the usual stuff: llama-cli and llama-bench with various models and backends with partial offloads.

@DamonFool
Please give this a shot on your setup.

@jeffbolznv @ggerganov

@DamonFool
Contributor

> Without the fix this new test quickly and reliably fails on all platforms I tested: Snapdragon-Gen3/4/5 (Android), Mac M4-Pro, AMD Ryzen-9 (Linux).

Hi @max-krasnyansky, may I ask how to determine whether the updated test succeeded or failed?
Does failure also mean a hang?

I've run the updated test on Mac M4 Pro without the fix, but saw no hang.

@max-krasnyansky
Collaborator Author

> > Without the fix this new test quickly and reliably fails on all platforms I tested: Snapdragon-Gen3/4/5 (Android), Mac M4-Pro, AMD Ryzen-9 (Linux).
>
> Hi @max-krasnyansky, may I ask how to determine whether the updated test succeeded or failed? Does failure also mean a hang?
>
> I've run the updated test on Mac M4 Pro without the fix, but saw no hang.

For me it is crashing (segfault) because one or more threads would eventually get out of sync and start processing the graph while the cplan is being updated. The race window is small, so you need more threads and more rounds (I used 10 threads with 1K rounds on M4-Pro).

Here is an example on M4-Pro with the new test but without the fix:

$ git log --oneline
41a2a6cfa (HEAD -> cpu-n_threads-race) tests: update barrier test to check for race condition in active threads
dea9ba27c (tag: b7261, upstream/master, upstream/HEAD, qcom/master, qcom/HEAD, origin/master, origin/HEAD) ggml-cpu: remove duplicate conditional check 'iid' (#17650)
c6d1a00aa Add a couple of file types to the text section (#17670)
424c57945 convert : support latest mistral-common (fix conversion with --mistral-format) (#17712)
...

$ cmake -D GGML_OPENMP=OFF -D GGML_SANITIZE_THREAD=OFF -D CMAKE_BUILD_TYPE=RelWithDebInfo -G Ninja -B build-macos

$ ./build-macos/bin/test-barrier 10 1000
graph-compute with
 n_threads: 10
   n_nodes: 2000
  n_rounds: 1000
graph-compute took 2753814 usec 
 2753.81 usec per-iter
 1376.91 nsec per-node
graph-compute with
 n_threads: 10
   n_nodes: 4
  n_rounds: 100000
Segmentation fault: 11  <<<<<<< 

The issue is kind of obvious when you consider that a worker thread can get preempted in between reading n_graph and n_threads_cur, and resume only after we have processed a new graph with a single thread and started prepping the next graph (i.e. already bumped n_threads_cur).

It's possible to fix this by adding more logic (don't bump n_graph for the single-threaded case, etc.), but the fix I added is simple and robust: graph N must be processed with M threads, and no matter when the worker threads get preempted, they always observe a consistent state.

@DamonFool
Contributor

> For me it is crashing (segfault) because one or more threads would eventually get out of sync and start processing the graph while the cplan is being updated. The race window is small, so you need more threads and more rounds (I used 10 threads with 1K rounds on M4-Pro).

Thanks for the clarification.
I'll try to reproduce the issue with your updated tests.

@max-krasnyansky
Collaborator Author

> > For me it is crashing (segfault) because one or more threads would eventually get out of sync and start processing the graph while the cplan is being updated. The race window is small, so you need more threads and more rounds (I used 10 threads with 1K rounds on M4-Pro).
>
> Thanks for the clarification. I'll try to reproduce the issue with your updated tests.

I'd say there is no need to reproduce the issue with the new test. Just see if this PR fixes your original issue.
As I mentioned, I tested it on Gen3 (S24 Ultra) with the new test and with the original Hexagon backend use-case.

@DamonFool
Contributor

> I'd say there is no need to reproduce the issue with the new test.

I can now reproduce the crash with your regression test.
The crash seems to be a different issue; I'll spend some time this weekend understanding the cause.


Labels

ggml (changes relating to the ggml tensor library for machine learning), testing (everything test related)


2 participants