Conversation

@max-krasnyansky
Collaborator

The original discussion started in #17515

The short summary is that we have a race condition when the number of active threads changes rapidly while the worker threads are still in their hybrid polling loops.

I updated test_barrier to test for this scenario. There is an additional test in there now that flip-flops between doing graph_compute with 1 and N threads. Without the fix, this new test quickly and reliably fails on all platforms I tested: Snapdragon-Gen3/4/5 (Android), Mac M4-Pro, AMD Ryzen-9 (Linux).

See this comment for the original report and analysis of the end-to-end use-cases that trigger this scenario
#17515 (comment)

This PR combines n_graph and n_threads_cur (the number of active threads) into a single atomic update.
I played with a bunch of ideas and this seems to be the cleanest/simplest way to ensure all threads see a consistent state without adding extra logic. It is also worth noting that adding stricter memory ordering (i.e. instead of doing relaxed reads) is not sufficient, because threads can get preempted between the two atomic reads and still end up with an inconsistent state.

Here is a quick test report from various systems:

AMD Ryzen 9 3950X (16-Cores) -- tested with and without OpenMP, with and without TSAN

$ ./build-amd64-omp/bin/test-barrier 16 1000
graph-compute with
 n_threads: 16
   n_nodes: 2000
  n_rounds: 1000
graph-compute took 4176811 usec 
 4176.81 usec per-iter
 2088.41 nsec per-node

graph-compute with
 n_threads: 16
   n_nodes: 4
  n_rounds: 100000

$ ./build-amd64/bin/test-barrier 16 1000
graph-compute with
 n_threads: 16
   n_nodes: 2000
  n_rounds: 1000
graph-compute took 3982746 usec 
 3982.75 usec per-iter
 1991.37 nsec per-node
 
graph-compute with
 n_threads: 16
   n_nodes: 4
  n_rounds: 100000

Galaxy S24 Ultra (Gen3) -- no OpenMP; also tested Galaxy S25 (Gen4) and a Gen5 device

~/src/llama.cpp-hexagon$ ./scripts/snapdragon/adb/run-tool.sh test-barrier 6 1000
...
graph-compute with
 n_threads: 6
   n_nodes: 2000
  n_rounds: 1000
graph-compute took 1507086 usec 
 1507.09 usec per-iter
 753.543 nsec per-node

graph-compute with
 n_threads: 6
   n_nodes: 4
  n_rounds: 100000

Mac M4-Pro -- no OpenMP, with and without TSAN

$ ./build-macos/bin/test-barrier 10 1000
graph-compute with
 n_threads: 10
   n_nodes: 2000
  n_rounds: 1000
graph-compute took 3080797 usec 
 3080.8 usec per-iter
 1540.4 nsec per-node

graph-compute with
 n_threads: 10
   n_nodes: 4
  n_rounds: 100000

Also tested all the usual stuff: llama-cli and llama-bench with various models and backends with partial offloads.

@DamonFool
Please give this a shot on your setup.

@jeffbolznv @ggerganov

@DamonFool
Contributor

> Without the fix this new test quickly and reliably fails on all platforms I tested: Snapdragon-Gen3/4/5 (Android), Mac M4-Pro, AMD Ryzen-9 (Linux).

Hi @max-krasnyansky, may I ask how to determine whether the updated test succeeded or failed?
Does failure also mean a hang?

I've run the updated test on Mac M4 Pro without the fix, but saw no hang.

@max-krasnyansky
Collaborator Author

> > Without the fix this new test quickly and reliably fails on all platforms I tested: Snapdragon-Gen3/4/5 (Android), Mac M4-Pro, AMD Ryzen-9 (Linux).
>
> Hi @max-krasnyansky, may I ask how to determine whether the updated test succeeded or failed? Does failure also mean a hang?
>
> I've run the updated test on Mac M4 Pro without the fix, but saw no hang.

For me it is crashing (segfault) because one or more threads would eventually get out of sync and start processing the graph while the cplan is being updated. The race window is small, so you need more threads and more rounds (I used 10 threads with 1K rounds on M4-Pro).

Here is an example on M4-Pro with the new test but without the fix:

$ git log --oneline
41a2a6cfa (HEAD -> cpu-n_threads-race) tests: update barrier test to check for race condition in active threads
dea9ba27c (tag: b7261, upstream/master, upstream/HEAD, qcom/master, qcom/HEAD, origin/master, origin/HEAD) ggml-cpu: remove duplicate conditional check 'iid' (#17650)
c6d1a00aa Add a couple of file types to the text section (#17670)
424c57945 convert : support latest mistral-common (fix conversion with --mistral-format) (#17712)
...

$ cmake -D GGML_OPENMP=OFF -D GGML_SANITIZE_THREAD=OFF -D CMAKE_BUILD_TYPE=RelWithDebInfo -G Ninja -B build-macos

$ ./build-macos/bin/test-barrier 10 1000
graph-compute with
 n_threads: 10
   n_nodes: 2000
  n_rounds: 1000
graph-compute took 2753814 usec 
 2753.81 usec per-iter
 1376.91 nsec per-node
graph-compute with
 n_threads: 10
   n_nodes: 4
  n_rounds: 100000
Segmentation fault: 11  <<<<<<< 

The issue is kind of obvious when you consider that a worker thread can get preempted in between reading n_graph and n_threads_cur, and resume only after we have processed a new graph with a single thread and started prepping the next graph (i.e. already bumped n_threads_cur).

It's possible to fix this by adding more logic (don't bump n_graph for the single-threaded case, etc.), but the fix I added is simple and robust: graph N must be processed with M threads, and no matter when the worker threads get preempted, they always observe a consistent state.

@DamonFool
Contributor

> For me it is crashing (segfault) because one or more threads would eventually get out of sync and start processing the graph while the cplan is being updated. The race window is small, so you need more threads and more rounds (I used 10 threads with 1K rounds on M4-Pro).

Thanks for the clarification.
I'll try to reproduce the issue with your updated tests.

@max-krasnyansky
Collaborator Author

> > For me it is crashing (segfault) because one or more threads would eventually get out of sync and start processing the graph while the cplan is being updated. The race window is small, so you need more threads and more rounds (I used 10 threads with 1K rounds on M4-Pro).
>
> Thanks for the clarification. I'll try to reproduce the issue with your updated tests.

I'd say there is no need to reproduce the issue with the new test. Just see if this PR fixes your original issue.
As I mentioned, I tested it on Gen3 (S24 Ultra) with the new test and with the original Hexagon backend use-case.

@DamonFool
Contributor

> I'd say there is no need to reproduce the issue with the new test.

I can now reproduce the crash with your regression test.
The crash seems to be a different issue; I'll spend some time this weekend understanding the cause.


Labels

ggml (changes relating to the ggml tensor library for machine learning), testing (everything test related)


2 participants