Intermittent crashes in Vulkan tests on Linux #5084
Comments
Okay, this is pretty fun! I have a core dump with two threads from a run of Thread 1 is the main thread, which has returned from Thread 2 was created by the In fact, thread 2 is unmapping the very range of addresses containing thread 1's executing pc:
How unfriendly.
Famous last words, but: I think there is a GLIBC bug here. Both However, both functions also release I believe one reason we've run into this is KhronosGroup/Vulkan-ValidationLayers#7340, which ensures that the Vulkan validation layer's shared library has many slow destructors to run. I think our workaround is simply to have the test case join the thread, so that How exactly does When a shared library is closed, the dynamic linker is supposed to call all the functions listed in the shared library's
Join all threads before returning from the test case, to ensure that we don't return from `main` until all open `Device`s have been dropped. This avoids a race condition in glibc in which a thread calling `dlclose` can unmap a shared library's code even while the main thread is still running its finalization functions. (See gfx-rs#5084 for details.) Joining all threads before returning from the test ensures that the Vulkan loader has finished `dlclose`-ing the Vulkan validation layer shared library before `main` returns. Fixes gfx-rs#5084.
Join all threads before returning from the test case, to ensure that we don't return from `main` until all open `Device`s have been dropped. This avoids a race condition in glibc in which a thread calling `dlclose` can unmap a shared library's code even while the main thread is still running its finalization functions. (See gfx-rs#5084 for details.) Joining all threads before returning from the test ensures that the Vulkan loader has finished `dlclose`-ing the Vulkan validation layer shared library before `main` returns. Remove `skip` for this test on GL/llvmpipe. With this change, that has not been observed to crash. Without it, the test crashes within ten runs or so. Fixes gfx-rs#5084. Fixed gfx-rs#4285.
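The workaround described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the actual wgpu test code: `FakeDevice`, `worker`, and `run_and_join` are hypothetical names, and `FakeDevice` merely stands in for a real `Device` whose drop would eventually lead the Vulkan loader to `dlclose` the validation layer.

```rust
use std::thread;

/// Hypothetical stand-in for a `Device`; in the real test, dropping the
/// last device is what lets the loader close the validation layer.
struct FakeDevice(usize);

/// Hypothetical test body: create a device, use it, and drop it on this thread.
fn worker(id: usize) -> usize {
    let device = FakeDevice(id);
    let result = device.0 * 2;
    drop(device); // the device is gone before the thread finishes
    result
}

/// Spawn `n` workers and join every one of them before returning, so the
/// caller cannot proceed (or return from `main`) while a worker is still
/// tearing down its device.
fn run_and_join(n: usize) -> Vec<usize> {
    let handles: Vec<thread::JoinHandle<usize>> = (0..n)
        .map(|id| thread::spawn(move || worker(id)))
        .collect();

    handles
        .into_iter()
        .map(|handle| handle.join().expect("worker thread panicked"))
        .collect()
}
```

Calling `run_and_join(4)` from a test returns only after all four workers have finished, which is the property the fix relies on: no thread is still inside its teardown path when the test case returns.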
Filed bug against GLIBC: https://sourceware.org/bugzilla/show_bug.cgi?id=31285
Many wgpu tests crash intermittently at exit on Linux.
Comments copied from #4285, which I inadvertently crashed (as in "crashed a party"):
I'm looking at one interesting crash that occurs in exactly the scenario described by this GLIBC comment:
The comment is explaining why `__exit_funcs_lock` needs to be held and why the `flavor` field needs to be set to `ef_free`. The code does do these things, so I don't think that's the bug.

But I bring this up because I have some core dumps in which this is exactly what's happening: one thread has this stack frame:
indicating that the validation layer DSO is being closed down, while another thread has these stack frames:
indicating that this is the main thread, which has returned from `main` and is traversing the same list of cleanup handles as the other thread.

In another core dump, we have one thread in the midst of unmapping the Vulkan validation library (based on locals at frame #10):
while another thread is, again, running the destructor list: