
Deadlock on AMD/Mesa/vk #4686

Closed
SludgePhD opened this issue Nov 14, 2023 · 5 comments · Fixed by #3626
Labels: api: vulkan (Issues with Vulkan), type: bug (Something isn't working)

Comments

SludgePhD (Contributor) commented Nov 14, 2023

Description
I wrote a library whose unit tests perform various wgpu operations, and those tests sometimes end up in what looks like a deadlock inside wgpu.

Repro steps
Running cargo t -p zaru-image on this commit reproduces the issue: https://github.com/SludgePhD/Zaru/commit/ac29836b0528a2e50c63c2a7ff68eb09b33a6cf3

Extra materials
I've tried using the parking_lot deadlock detection feature, but it turns out that it does not support RW locks.
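
For reference, a minimal sketch of how parking_lot's experimental deadlock detection is typically set up (it requires the deadlock_detection cargo feature); since it only tracks Mutex-style locks and not RwLock, it produces nothing useful for this particular deadlock:

use std::{thread, time::Duration};

// Background thread that periodically asks parking_lot for lock cycles.
// Only Mutex/ReentrantMutex acquisitions are tracked; RwLock is not.
fn spawn_deadlock_detector() {
    thread::spawn(|| loop {
        thread::sleep(Duration::from_secs(10));
        let deadlocks = parking_lot::deadlock::check_deadlock();
        if deadlocks.is_empty() {
            continue;
        }
        eprintln!("{} deadlock(s) detected", deadlocks.len());
        for (i, threads) in deadlocks.iter().enumerate() {
            eprintln!("Deadlock #{}", i);
            for t in threads {
                eprintln!("Thread Id {:#?}\n{:#?}", t.thread_id(), t.backtrace());
            }
        }
    });
}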

GDB output below.

Thread state when the deadlock happens:

(gdb) info threads
  Id   Target Id                                           Frame 
* 1    Thread 0x7ffff7c8ccc0 (LWP 11329) "zaru_image-5a5c" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  2    Thread 0x7ffff7c8b6c0 (LWP 11331) "blend::tests::b" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  3    Thread 0x7ffff7a8a6c0 (LWP 11332) "draw::tests::te" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  4    Thread 0x7ffff78896c0 (LWP 11333) "draw::tests::te" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  6    Thread 0x7ffff73c66c0 (LWP 11335) "image::tests::c" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  7    Thread 0x7ffff71c56c0 (LWP 11336) "image::tests::d" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  22   Thread 0x7fffddfff6c0 (LWP 11351) "shader::compute" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  25   Thread 0x7fffdd9fc6c0 (LWP 11354) "shader::compute" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  27   Thread 0x7ffff6fc46c0 (LWP 11356) "view::tests::vi" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  28   Thread 0x7fffcd5ff6c0 (LWP 11357) "zaru_im:disk$0"  0x00007ffff7d174ae in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7ffff009f150) at futex-internal.c:57
  29   Thread 0x7fff8f3fd6c0 (LWP 11358) "blend::tests::b" 0x00007ffff7d174ae in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7ffff00a091c) at futex-internal.c:57

The stacks of most of these threads look like this (though sometimes with a read lock instead of a write lock, and often for a variety of different resources instead of command encoder creation):

Thread 2 (Thread 0x7ffff7c8b6c0 (LWP 11331) "blend::tests::b"):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x0000555555c12eb4 in parking_lot::raw_rwlock::RawRwLock::lock_exclusive_slow::h7957d3e95355ce44 ()
#2  0x0000555555a41116 in wgpu_core::registry::FutureId<I,T>::assign::h2cbde308a5113f46 ()
#3  0x00005555559293a2 in wgpu_core::device::global::<impl wgpu_core::global::Global<G>>::device_create_command_encoder::hd4fbe81984d0da62 ()
#4  0x00005555559d674f in <wgpu::backend::direct::Context as wgpu::context::Context>::device_create_command_encoder::ha410a29667457a46 ()
#5  0x00005555559df130 in <T as wgpu::context::DynContext>::device_create_command_encoder::hd39f52a0286d846f ()
#6  0x0000555555a63273 in wgpu::Device::create_command_encoder::hb9a94e62fccd0e4e ()

The only threads that look significantly different are the one running the test harness and the following two Mesa/Vulkan-related threads:

Thread 29 (Thread 0x7fff8f3fd6c0 (LWP 11358) "blend::tests::b"):
#0  0x00007ffff7d174ae in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7ffff00a091c) at futex-internal.c:57
#1  __futex_abstimed_wait_common (futex_word=futex_word@entry=0x7ffff00a091c, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0, cancel=cancel@entry=true) at futex-internal.c:87
#2  0x00007ffff7d1752f in __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7ffff00a091c, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at futex-internal.c:139
#3  0x00007ffff7d19d40 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7ffff00a08c8, cond=0x7ffff00a08f0) at pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=0x7ffff00a08f0, mutex=0x7ffff00a08c8) at pthread_cond_wait.c:618
#5  0x00007ffff5ed9e11 in __gthread_cond_wait (__mutex=<optimized out>, __cond=0x7ffff00a08f0) at /usr/src/debug/gcc/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:865
#6  std::__condvar::wait (__m=..., this=0x7ffff00a08f0) at /usr/src/debug/gcc/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/std_mutex.h:171
#7  std::condition_variable::wait (this=this@entry=0x7ffff00a08f0, __lock=...) at /usr/src/debug/gcc/gcc/libstdc++-v3/src/c++11/condition_variable.cc:41
#8  0x00007fffcee28575 in QUEUE_STATE::NextSubmission (this=this@entry=0x7ffff00a0790) at /usr/src/debug/vulkan-validation-layers/Vulkan-ValidationLayers-vulkan-sdk-1.3.268.0/layers/state_tracker/queue_state.cpp:164
#9  0x00007fffcee2a0b8 in QUEUE_STATE::ThreadFunc (this=0x7ffff00a0790) at /usr/src/debug/vulkan-validation-layers/Vulkan-ValidationLayers-vulkan-sdk-1.3.268.0/layers/state_tracker/queue_state.cpp:200
#10 0x00007ffff5ee1943 in std::execute_native_thread_routine (__p=0x7ffff10602b0) at /usr/src/debug/gcc/gcc/libstdc++-v3/src/c++11/thread.cc:104
#11 0x00007ffff7d1a9eb in start_thread (arg=<optimized out>) at pthread_create.c:444
#12 0x00007ffff7d9e7cc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 28 (Thread 0x7fffcd5ff6c0 (LWP 11357) "zaru_im:disk$0"):
#0  0x00007ffff7d174ae in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7ffff009f150) at futex-internal.c:57
#1  __futex_abstimed_wait_common (futex_word=futex_word@entry=0x7ffff009f150, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0, cancel=cancel@entry=true) at futex-internal.c:87
#2  0x00007ffff7d1752f in __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7ffff009f150, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at futex-internal.c:139
#3  0x00007ffff7d19d40 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7ffff009f100, cond=0x7ffff009f128) at pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=0x7ffff009f128, mutex=0x7ffff009f100) at pthread_cond_wait.c:618
#5  0x00007fffdd0162fc in cnd_wait () at ../mesa-23.2.1/src/c11/impl/threads_posix.c:135
#6  util_queue_thread_func () at ../mesa-23.2.1/src/util/u_queue.c:290
#7  0x00007fffdd03861c in impl_thrd_routine () at ../mesa-23.2.1/src/c11/impl/threads_posix.c:67
#8  0x00007ffff7d1a9eb in start_thread (arg=<optimized out>) at pthread_create.c:444
#9  0x00007ffff7d9e7cc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
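
For illustration only: the parked lock_exclusive_slow frames above are consistent with a classic lock-ordering cycle like the sketch below. The lock names are hypothetical and are not wgpu's actual fields; running this hangs, which is the point:

use parking_lot::RwLock;
use std::{sync::Arc, thread, time::Duration};

fn main() {
    // Two shared locks standing in for two wgpu-internal registries/states.
    let registry = Arc::new(RwLock::new(()));
    let device_state = Arc::new(RwLock::new(()));

    let (r, d) = (registry.clone(), device_state.clone());
    let a = thread::spawn(move || {
        let _reg = r.write();                     // thread A: lock 1 first ...
        thread::sleep(Duration::from_millis(50));
        let _dev = d.write();                     // ... then lock 2
    });

    let b = thread::spawn(move || {
        let _dev = device_state.write();          // thread B: lock 2 first ...
        thread::sleep(Duration::from_millis(50));
        let _reg = registry.write();              // ... then lock 1 -> cycle
    });

    a.join().unwrap();
    b.join().unwrap();
}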

Platform
Arch Linux, wgpu 0.18, Mesa 23.2.1-arch1.2, Radeon RX 6700 XT

SludgePhD changed the title from "Deadlock on AMD/Mesa" to "Deadlock on AMD/Mesa/vk" on Nov 14, 2023
Wumpf added the "type: bug" label on Nov 18, 2023
Wumpf (Member) commented Nov 18, 2023

Can you try on #3626? This PR changes (and removes) a lot of the internal locking in wgpu. It's expected to land next week.

SludgePhD (Contributor, Author) commented:

It still happens on that branch, and seeing all the mutexes in Device and how arbitrarily they are locked from everywhere, I can see how this could happen. I've opened gents83#17 against that branch with a fix.
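
To illustrate the kind of change involved (hypothetical names; this is not the actual wgpu code or the exact contents of gents83#17): the problematic pattern is holding one lock while taking another, and the fix is narrowing the first guard's scope so it is dropped before the second lock is acquired.

use parking_lot::{Mutex, RwLock};

// Hypothetical stand-in for a device with several internal locks.
struct Device {
    registry: RwLock<Vec<u32>>,
    fence: Mutex<u64>,
}

impl Device {
    // "Too early" lock: the registry guard is held across the fence lock,
    // so this call can participate in a lock-ordering cycle.
    fn create_resource_bad(&self) {
        let mut reg = self.registry.write();
        reg.push(0);
        let _fence = self.fence.lock();
    }

    // Narrowed scope: the registry guard is dropped before the fence lock
    // is taken, so the two acquisitions can no longer form a cycle.
    fn create_resource_good(&self) {
        {
            let mut reg = self.registry.write();
            reg.push(0);
        }
        let _fence = self.fence.lock();
    }
}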

gents83 (Contributor) commented Nov 19, 2023

I'll take a look at this during the day to check it and integrate it if possible 👍🏼 It seems that there were still some "too early" locks.

gents83 (Contributor) commented Nov 19, 2023

Integrated in #3626 👍

Wumpf (Member) commented Nov 19, 2023

Awesome, thank you @gents83!
