
Intermittent crashes in Vulkan tests on Linux #5084

Closed
jimblandy opened this issue Jan 18, 2024 · 3 comments · Fixed by #5129
Labels: api: vulkan (Issues with Vulkan), type: bug (Something isn't working)

@jimblandy

Many wgpu tests crash intermittently at exit on Linux.

Comments copied from #4285, which I inadvertently crashed (as in "crashed a party"):


I'm looking at one interesting crash that occurs in exactly the scenario described by this GLIBC comment:

	    /* We don't want to run this cleanup more than once.  The Itanium
	       C++ ABI requires that multiple calls to __cxa_finalize not
	       result in calling termination functions more than once.  One
	       potential scenario where that could happen is with a concurrent
	       dlclose and exit, where the running dlclose must at some point
	       release the list lock, an exiting thread may acquire it, and
	       without setting flavor to ef_free, might re-run this destructor
	       which could result in undefined behaviour.  Therefore we must
	       set flavor to ef_free to avoid calling this destructor again.  */

The comment explains why __exit_funcs_lock must be held and why the flavor field must be set to ef_free. The code does both of these things, so I don't think that's the bug.

But I bring this up because I have some core dumps in which this is exactly what's happening: one thread has this stack frame:

#1  0x00007f26284e85e7 in __do_global_dtors_aux () from /lib64/libVkLayer_khronos_validation.so

indicating that the validation layer DSO is being closed down, while another thread has these stack frames:

#11 0x00007f268757dfd6 in __run_exit_handlers (status=0, listp=<optimized out>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:111
#12 0x00007f268757e11e in __GI_exit (status=<optimized out>) at exit.c:141
#13 0x00007f2687565151 in __libc_start_call_main (main=main@entry=0x55dd5c35ef80 <main>, argc=argc@entry=2, argv=argv@entry=0x7fff46fd4c18) at ../sysdeps/nptl/libc_start_call_main.h:74
#14 0x00007f268756520b in __libc_start_main_impl (main=0x55dd5c35ef80 <main>, argc=2, argv=0x7fff46fd4c18, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, 
    stack_end=0x7fff46fd4c08) at ../csu/libc-start.c:360
#15 0x000055dd5c34f655 in _start ()

indicating that this is the main thread, which has returned from main and is traversing the same list of cleanup handlers as the other thread.


In another core dump, we have one thread in the midst of unmapping the Vulkan validation library (based on locals at frame #10):

#0  0x00007f50a90350eb in __GI_munmap () at ../sysdeps/unix/syscall-template.S:117
#1  0x00007f50a902840a in _dl_unmap_segments (l=0x7f509c08fb00) at ./dl-unmap-segments.h:32
#2  _dl_unmap (map=map@entry=0x7f509c08fb00) at ../sysdeps/x86_64/tlsdesc.c:31
#3  0x00007f50a901503d in _dl_close_worker (map=<optimized out>, map@entry=0x7f509c08fb00, force=force@entry=false) at dl-close.c:628
#4  0x00007f50a901569b in _dl_close (_map=0x7f509c08fb00) at dl-close.c:793
#5  0x00007f50a9014523 in __GI__dl_catch_exception (exception=exception@entry=0x7f4f5c5f6090, operate=0x7f50a9015660 <_dl_close>, args=0x7f509c08fb00) at dl-catch.c:237
#6  0x00007f50a9014679 in _dl_catch_error (objname=0x7f4f5c5f60f8, errstring=0x7f4f5c5f6100, mallocedp=0x7f4f5c5f60f7, operate=<optimized out>, args=<optimized out>) at dl-catch.c:256
#7  0x00007f50a8aa8143 in _dlerror_run (operate=<optimized out>, args=<optimized out>) at dlerror.c:138
#8  0x00007f50a8aa7e76 in __dlclose (handle=<optimized out>) at dlclose.c:31
#9  0x00007f50a017565c in loader_platform_close_library (library=<optimized out>) at /usr/src/debug/vulkan-loader-1.3.268.0-1.fc39.x86_64/loader/vk_loader_platform.h:381
#10 loader_delete_layer_list_and_properties (inst=inst@entry=0x7f509c09f460, layer_list=layer_list@entry=0x7f509c0a0788)
    at /usr/src/debug/vulkan-loader-1.3.268.0-1.fc39.x86_64/loader/loader.c:493
#11 0x00007f50a017cc2d in loader_delete_layer_list_and_properties (layer_list=<optimized out>, inst=<optimized out>)
    at /usr/src/debug/vulkan-loader-1.3.268.0-1.fc39.x86_64/loader/loader.c:720
#12 vkDestroyInstance (pAllocator=0x0, instance=<optimized out>) at /usr/src/debug/vulkan-loader-1.3.268.0-1.fc39.x86_64/loader/trampoline.c:796
#13 vkDestroyInstance (instance=<optimized out>, pAllocator=0x0) at /usr/include/vulkan/vulkan_core.h:4124

while another thread is, again, running the destructor list:

#0  std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::~vector () at /usr/include/c++/13/bits/stl_vector.h:730
#1  0x00007f50a8a5efd6 in __run_exit_handlers (status=0, listp=<optimized out>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:111
#2  0x00007f50a8a5f11e in __GI_exit (status=<optimized out>) at exit.c:141
#3  0x00007f50a8a46151 in __libc_start_call_main (main=main@entry=0x55ae5a3bb340 <main>, argc=argc@entry=4, argv=argv@entry=0x7ffced56f6c8) at ../sysdeps/nptl/libc_start_call_main.h:74
@jimblandy jimblandy added type: bug Something isn't working api: vulkan Issues with Vulkan labels Jan 18, 2024
@jimblandy jimblandy self-assigned this Jan 18, 2024
@jimblandy

Okay, this is pretty fun!

I have a core dump with two threads from a run of [Vulkan/llvmpipe (LLVM 17.0.6, 256 bits)/1] wgpu_examples::hello_compute::tests::multithreaded_compute.

Thread 1 is the main thread, which has returned from main and is now in __run_exit_handlers, which traverses GLIBC's __exit_funcs list and calls cleanup functions. This includes destructors for global variables.

Thread 2 was created by the MULTITHREADED_COMPUTE test, and is exiting the thread's closure, dropping the last Arc to the TestingContext that owns the wgpu Device. This has wound its way through to vkDestroyInstance, which is unloading the shared library libVkLayer_khronos_validation.so; it is stopped in the call to munmap.

In fact, thread 2 is unmapping the very range of addresses containing thread 1's executing pc:

(gdb) thread 1
[Switching to thread 1 (Thread 0x7f87ad2a78c0 (LWP 697942))]
#0  std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::~vector () at /usr/include/c++/13/bits/stl_vector.h:730
730	      ~vector() _GLIBCXX_NOEXCEPT
(gdb) print/x $pc
$13 = 0x7f86968191a0
(gdb) thread 2
[Switching to thread 2 (Thread 0x7f86943fc6c0 (LWP 698084))]
#0  0x00007f87ad3f30eb in __GI_munmap () at ../sysdeps/unix/syscall-template.S:117
117	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS,
(gdb) up
#1  0x00007f87ad3e640a in _dl_unmap_segments (l=0x7f87a808fb00) at ./dl-unmap-segments.h:32
32	  __munmap ((void *) l->l_map_start, l->l_map_end - l->l_map_start);
(gdb) print l->l_map_start < $13 && $13 < l->l_map_end
$14 = 1
(gdb) 

How unfriendly.

@jimblandy

Famous last words, but: I think there is a GLIBC bug here.

Both __run_exit_handlers, called after returning from main, and __cxa_finalize, called from shared libraries' closing cleanup code, traverse the same list of finalization functions, __exit_funcs. Both functions hold __exit_funcs_lock while traversing the list, and set each entry's flavor to ef_free before calling its cleanup function. This ensures that, when __run_exit_handlers and __cxa_finalize run concurrently, each cleanup function is called by only one or the other.

However, both functions also release __exit_funcs_lock while calling each cleanup function. This creates a potential race condition: if __run_exit_handlers happens to pick up a long-running cleanup function, __cxa_finalize will skip that entry, and can run through the entire rest of the list and return, giving dlclose the impression that it can now unmap the shared library's code --- even though __run_exit_handlers is still executing it.
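
Here's a minimal Rust model of that pattern, just to make the window concrete. This is my own sketch, not glibc's code: ExitFn, walk_exit_funcs, and the claimed field are made-up stand-ins for __exit_funcs entries and the flavor = ef_free marking. Both walkers claim entries under the lock but call them unlocked, so the walker that draws the slow entry can still be running it after the other walker has finished the whole list:

```rust
use std::sync::Mutex;
use std::thread;
use std::time::Duration;

// Model of a glibc `exitfn` entry. `claimed` plays the role of setting
// `flavor` to `ef_free`, so each function runs at most once even with
// two concurrent walkers.
struct ExitFn {
    claimed: bool,
    func: fn(),
}

static EXIT_FUNCS: Mutex<Vec<ExitFn>> = Mutex::new(Vec::new());

// Stands in for a long-running C++ destructor in the validation layer.
fn slow_destructor() {
    thread::sleep(Duration::from_millis(200));
}

fn quick_destructor() {}

// The walking pattern shared by __run_exit_handlers and __cxa_finalize:
// claim the next entry while holding the lock, then release the lock to
// call it.
fn walk_exit_funcs(who: &str) {
    loop {
        let func = {
            let mut list = EXIT_FUNCS.lock().unwrap();
            match list.iter_mut().find(|e| !e.claimed) {
                Some(entry) => {
                    entry.claimed = true; // "flavor = ef_free"
                    entry.func
                }
                None => break, // nothing left to claim; this walker is done
            }
        };
        // The lock is NOT held here. If `func` is slow, the other walker
        // can claim and finish every remaining entry and return first.
        func();
        println!("{who}: ran an entry");
    }
    println!("{who}: done with the list");
}

fn main() {
    EXIT_FUNCS.lock().unwrap().extend([
        ExitFn { claimed: false, func: slow_destructor },
        ExitFn { claimed: false, func: quick_destructor },
    ]);

    // This thread plays dlclose/__cxa_finalize; it starts a beat later so
    // the main thread ("exit") gets stuck in the slow destructor first.
    let finalize = thread::spawn(|| {
        thread::sleep(Duration::from_millis(50));
        walk_exit_funcs("__cxa_finalize");
        // In real glibc, dlclose now believes it may unmap the library's
        // code, even though "exit" is still executing slow_destructor.
        println!("__cxa_finalize: returned; dlclose would unmap here");
    });

    walk_exit_funcs("__run_exit_handlers");
    finalize.join().unwrap();
}
```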

I believe one reason we've run into this is KhronosGroup/Vulkan-ValidationLayers#7340, which leaves the Vulkan validation layer's shared library with many slow destructors to run.

I think our workaround is simply to have the test case join its threads, so that Instance destruction can complete before main returns.
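
Concretely, the fix has roughly this shape (my own sketch, not the actual wgpu test code; the per-thread body is elided):

```rust
use std::thread;

// Hypothetical stand-in for the multithreaded compute test: each worker
// owns its own wgpu resources (TestingContext/Device) and drops them when
// its closure returns.
fn multithreaded_compute() {
    let handles: Vec<_> = (0..4)
        .map(|i| {
            thread::spawn(move || {
                // ... per-thread compute work owning a Device ...
                println!("thread {i}: done; dropping its Device");
            })
        })
        .collect();

    // The workaround: join every thread before the test returns, so no
    // worker can still be inside vkDestroyInstance/dlclose when main
    // returns and __run_exit_handlers starts walking __exit_funcs.
    for handle in handles {
        handle.join().expect("compute thread panicked");
    }
}

fn main() {
    multithreaded_compute();
    // Only now does main return and exit-handler traversal begin.
}
```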


How exactly does __cxa_finalize get called?

When a shared library is closed, the dynamic linker is supposed to call all the functions listed in the shared library's DT_FINI_ARRAY table. The dynamic linker's _dl_call_fini function traverses that array and calls the functions it points to.

DT_FINI_ARRAY generally includes a pointer to a toolchain-generated function named __do_global_dtors_aux, defined in libgcc.a, which gets statically linked into every shared library.

__do_global_dtors_aux implements the C++ ABI's requirement to call __cxa_finalize.
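
As an aside, the DT_FINI_ARRAY mechanism is easy to poke at from Rust on Linux. This is my own ELF-specific sketch (not anything wgpu or the loader does): placing a function pointer in the .fini_array section adds an entry to that table, and the same machinery runs it at teardown (at exit, or at dlclose for a shared library):

```rust
// Runs via DT_FINI_ARRAY when this object is torn down.
extern "C" fn finalizer() {
    println!("DT_FINI_ARRAY entry running");
}

// `#[used]` keeps the entry from being optimized away; `#[link_section]`
// places the function pointer in the ELF .fini_array section.
#[used]
#[link_section = ".fini_array"]
static FINI_ENTRY: extern "C" fn() = finalizer;

fn main() {
    println!("returning from main; the finalizer runs after this");
}
```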

jimblandy added a commit to jimblandy/wgpu that referenced this issue Jan 24, 2024
Join all threads before returning from the test case, to ensure that
we don't return from `main` until all open `Device`s have been
dropped.

This avoids a race condition in glibc in which a thread calling
`dlclose` can unmap a shared library's code even while the main thread
is still running its finalization functions. (See gfx-rs#5084 for details.)
Joining all threads before returning from the test ensures that the
Vulkan loader has finished `dlclose`-ing the Vulkan validation layer
shared library before `main` returns.

Fixes gfx-rs#5084.
jimblandy added a commit to jimblandy/wgpu that referenced this issue Jan 24, 2024
Join all threads before returning from the test case, to ensure that
we don't return from `main` until all open `Device`s have been
dropped.

This avoids a race condition in glibc in which a thread calling
`dlclose` can unmap a shared library's code even while the main thread
is still running its finalization functions. (See gfx-rs#5084 for details.)
Joining all threads before returning from the test ensures that the
Vulkan loader has finished `dlclose`-ing the Vulkan validation layer
shared library before `main` returns.

Remove `skip` for this test on GL/llvmpipe. With this change, that has
not been observed to crash. Without it, the test crashes within ten
runs or so.

Fixes gfx-rs#5084.
Fixed gfx-rs#4285.
cwfitzgerald pushed a commit that referenced this issue Jan 24, 2024
Join all threads before returning from the test case, to ensure that
we don't return from `main` until all open `Device`s have been
dropped.

This avoids a race condition in glibc in which a thread calling
`dlclose` can unmap a shared library's code even while the main thread
is still running its finalization functions. (See #5084 for details.)
Joining all threads before returning from the test ensures that the
Vulkan loader has finished `dlclose`-ing the Vulkan validation layer
shared library before `main` returns.

Remove `skip` for this test on GL/llvmpipe. With this change, that has
not been observed to crash. Without it, the test crashes within ten
runs or so.

Fixes #5084.
Fixed #4285.
@jimblandy
Copy link
Member Author

Filed bug against GLIBC: https://sourceware.org/bugzilla/show_bug.cgi?id=31285
