
Intermittent crashes in Vulkan tests on Linux #5084

Closed
jimblandy opened this issue Jan 18, 2024 · 3 comments · Fixed by #5129
Labels: api: vulkan (Issues with Vulkan), type: bug (Something isn't working)

@jimblandy

Many wgpu tests crash intermittently at exit on Linux.

Comments copied from #4285, which I inadvertently crashed (as in "crashed a party"):


I'm looking at one interesting crash that occurs in exactly the scenario described by this GLIBC comment:

	    /* We don't want to run this cleanup more than once.  The Itanium
	       C++ ABI requires that multiple calls to __cxa_finalize not
	       result in calling termination functions more than once.  One
	       potential scenario where that could happen is with a concurrent
	       dlclose and exit, where the running dlclose must at some point
	       release the list lock, an exiting thread may acquire it, and
	       without setting flavor to ef_free, might re-run this destructor
	       which could result in undefined behaviour.  Therefore we must
	       set flavor to ef_free to avoid calling this destructor again.  */

The comment explains why __exit_funcs_lock must be held and why the flavor field must be set to ef_free. The code does both of these things, so I don't think that's the bug.

But I bring this up because I have some core dumps in which this is exactly what's happening: one thread has this stack frame:

#1  0x00007f26284e85e7 in __do_global_dtors_aux () from /lib64/libVkLayer_khronos_validation.so

indicating that the validation layer DSO is being closed down, while another thread has these stack frames:

#11 0x00007f268757dfd6 in __run_exit_handlers (status=0, listp=<optimized out>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:111
#12 0x00007f268757e11e in __GI_exit (status=<optimized out>) at exit.c:141
#13 0x00007f2687565151 in __libc_start_call_main (main=main@entry=0x55dd5c35ef80 <main>, argc=argc@entry=2, argv=argv@entry=0x7fff46fd4c18) at ../sysdeps/nptl/libc_start_call_main.h:74
#14 0x00007f268756520b in __libc_start_main_impl (main=0x55dd5c35ef80 <main>, argc=2, argv=0x7fff46fd4c18, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, 
    stack_end=0x7fff46fd4c08) at ../csu/libc-start.c:360
#15 0x000055dd5c34f655 in _start ()

indicating that this is the main thread, which has returned from main and is traversing the same list of cleanup handlers as the other thread.


In another core dump, we have one thread in the midst of unmapping the Vulkan validation library (based on locals at frame #10):

#0  0x00007f50a90350eb in __GI_munmap () at ../sysdeps/unix/syscall-template.S:117
#1  0x00007f50a902840a in _dl_unmap_segments (l=0x7f509c08fb00) at ./dl-unmap-segments.h:32
#2  _dl_unmap (map=map@entry=0x7f509c08fb00) at ../sysdeps/x86_64/tlsdesc.c:31
#3  0x00007f50a901503d in _dl_close_worker (map=<optimized out>, map@entry=0x7f509c08fb00, force=force@entry=false) at dl-close.c:628
#4  0x00007f50a901569b in _dl_close (_map=0x7f509c08fb00) at dl-close.c:793
#5  0x00007f50a9014523 in __GI__dl_catch_exception (exception=exception@entry=0x7f4f5c5f6090, operate=0x7f50a9015660 <_dl_close>, args=0x7f509c08fb00) at dl-catch.c:237
#6  0x00007f50a9014679 in _dl_catch_error (objname=0x7f4f5c5f60f8, errstring=0x7f4f5c5f6100, mallocedp=0x7f4f5c5f60f7, operate=<optimized out>, args=<optimized out>) at dl-catch.c:256
#7  0x00007f50a8aa8143 in _dlerror_run (operate=<optimized out>, args=<optimized out>) at dlerror.c:138
#8  0x00007f50a8aa7e76 in __dlclose (handle=<optimized out>) at dlclose.c:31
#9  0x00007f50a017565c in loader_platform_close_library (library=<optimized out>) at /usr/src/debug/vulkan-loader-1.3.268.0-1.fc39.x86_64/loader/vk_loader_platform.h:381
#10 loader_delete_layer_list_and_properties (inst=inst@entry=0x7f509c09f460, layer_list=layer_list@entry=0x7f509c0a0788)
    at /usr/src/debug/vulkan-loader-1.3.268.0-1.fc39.x86_64/loader/loader.c:493
#11 0x00007f50a017cc2d in loader_delete_layer_list_and_properties (layer_list=<optimized out>, inst=<optimized out>)
    at /usr/src/debug/vulkan-loader-1.3.268.0-1.fc39.x86_64/loader/loader.c:720
#12 vkDestroyInstance (pAllocator=0x0, instance=<optimized out>) at /usr/src/debug/vulkan-loader-1.3.268.0-1.fc39.x86_64/loader/trampoline.c:796
#13 vkDestroyInstance (instance=<optimized out>, pAllocator=0x0) at /usr/include/vulkan/vulkan_core.h:4124

while another thread is, again, running the destructor list:

#0  std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::~vector () at /usr/include/c++/13/bits/stl_vector.h:730
#1  0x00007f50a8a5efd6 in __run_exit_handlers (status=0, listp=<optimized out>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:111
#2  0x00007f50a8a5f11e in __GI_exit (status=<optimized out>) at exit.c:141
#3  0x00007f50a8a46151 in __libc_start_call_main (main=main@entry=0x55ae5a3bb340 <main>, argc=argc@entry=4, argv=argv@entry=0x7ffced56f6c8) at ../sysdeps/nptl/libc_start_call_main.h:74
@jimblandy jimblandy added type: bug Something isn't working api: vulkan Issues with Vulkan labels Jan 18, 2024
@jimblandy jimblandy self-assigned this Jan 18, 2024
@jimblandy

Okay, this is pretty fun!

I have a core dump with two threads from a run of [Vulkan/llvmpipe (LLVM 17.0.6, 256 bits)/1] wgpu_examples::hello_compute::tests::multithreaded_compute.

Thread 1 is the main thread, which has returned from main and is now in __run_exit_handlers, which traverses GLIBC's __exit_funcs list and calls cleanup functions. This includes destructors for global variables.

Thread 2 was created by the MULTITHREADED_COMPUTE test, and is exiting the thread's closure, dropping the last Arc to the TestingContext that owns the wgpu Device. This has wound its way through to vkDestroyInstance, which is unloading the shared library libVkLayer_khronos_validation.so; it is stopped in the call to munmap.

In fact, thread 2 is unmapping the very range of addresses containing thread 1's executing pc:

(gdb) thread 1
[Switching to thread 1 (Thread 0x7f87ad2a78c0 (LWP 697942))]
#0  std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::~vector () at /usr/include/c++/13/bits/stl_vector.h:730
730	      ~vector() _GLIBCXX_NOEXCEPT
(gdb) print/x $pc
$13 = 0x7f86968191a0
(gdb) thread 2
[Switching to thread 2 (Thread 0x7f86943fc6c0 (LWP 698084))]
#0  0x00007f87ad3f30eb in __GI_munmap () at ../sysdeps/unix/syscall-template.S:117
117	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS,
(gdb) up
#1  0x00007f87ad3e640a in _dl_unmap_segments (l=0x7f87a808fb00) at ./dl-unmap-segments.h:32
32	  __munmap ((void *) l->l_map_start, l->l_map_end - l->l_map_start);
(gdb) print l->l_map_start < $13 && $13 < l->l_map_end
$14 = 1
(gdb) 

How unfriendly.

@jimblandy

Famous last words, but: I think there is a GLIBC bug here.

Both __run_exit_handlers, called after returning from main, and __cxa_finalize, called from shared libraries' closing cleanup code, traverse the same list of finalization functions, __exit_funcs. Both functions hold __exit_funcs_lock while traversing the list, and set each entry's flavor to ef_free before calling its cleanup function. This ensures that, when __run_exit_handlers and __cxa_finalize run concurrently, each cleanup function is called by only one or the other.

However, both functions also release __exit_funcs_lock while calling each cleanup function. This creates a potential race condition: if __run_exit_handlers happens to pick up a long-running cleanup function, __cxa_finalize will skip that entry, and can run through the entire rest of the list and return, giving dlclose the impression that it can now unmap the shared library's code --- even though __run_exit_handlers is still executing it.
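
Here's a minimal Rust model of that pattern, just to make the window concrete. This is my own sketch, not glibc's code: ExitFn, walk_exit_funcs, and the claimed field are made-up stand-ins for __exit_funcs entries and the flavor = ef_free marking. Both walkers claim entries under the lock but call them unlocked, so the walker that draws the slow entry can still be running it after the other walker has finished the whole list:

```rust
use std::sync::Mutex;
use std::thread;
use std::time::Duration;

// Model of a glibc `exitfn` entry. `claimed` plays the role of setting
// `flavor` to `ef_free`, so each function runs at most once even with
// two concurrent walkers.
struct ExitFn {
    claimed: bool,
    func: fn(),
}

static EXIT_FUNCS: Mutex<Vec<ExitFn>> = Mutex::new(Vec::new());

// Stands in for a long-running C++ destructor in the validation layer.
fn slow_destructor() {
    thread::sleep(Duration::from_millis(200));
}

fn quick_destructor() {}

// The walking pattern shared by __run_exit_handlers and __cxa_finalize:
// claim the next entry while holding the lock, then release the lock to
// call it.
fn walk_exit_funcs(who: &str) {
    loop {
        let func = {
            let mut list = EXIT_FUNCS.lock().unwrap();
            match list.iter_mut().find(|e| !e.claimed) {
                Some(entry) => {
                    entry.claimed = true; // "flavor = ef_free"
                    entry.func
                }
                None => break, // nothing left to claim; this walker is done
            }
        };
        // The lock is NOT held here. If `func` is slow, the other walker
        // can claim and finish every remaining entry and return first.
        func();
        println!("{who}: ran an entry");
    }
    println!("{who}: done with the list");
}

fn main() {
    EXIT_FUNCS.lock().unwrap().extend([
        ExitFn { claimed: false, func: slow_destructor },
        ExitFn { claimed: false, func: quick_destructor },
    ]);

    // This thread plays dlclose/__cxa_finalize; it starts a beat later so
    // the main thread ("exit") gets stuck in the slow destructor first.
    let finalize = thread::spawn(|| {
        thread::sleep(Duration::from_millis(50));
        walk_exit_funcs("__cxa_finalize");
        // In real glibc, dlclose now believes it may unmap the library's
        // code, even though "exit" is still executing slow_destructor.
        println!("__cxa_finalize: returned; dlclose would unmap here");
    });

    walk_exit_funcs("__run_exit_handlers");
    finalize.join().unwrap();
}
```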

I believe one reason we've run into this is KhronosGroup/Vulkan-ValidationLayers#7340, which leaves the Vulkan validation layer's shared library with many slow destructors to run.

I think our workaround is simply to have the test case join its threads, so that Instance destruction can complete before main returns.
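
Concretely, the fix has roughly this shape (my own sketch, not the actual wgpu test code; the per-thread body is elided):

```rust
use std::thread;

// Hypothetical stand-in for the multithreaded compute test: each worker
// owns its own wgpu resources (TestingContext/Device) and drops them when
// its closure returns.
fn multithreaded_compute() {
    let handles: Vec<_> = (0..4)
        .map(|i| {
            thread::spawn(move || {
                // ... per-thread compute work owning a Device ...
                println!("thread {i}: done; dropping its Device");
            })
        })
        .collect();

    // The workaround: join every thread before the test returns, so no
    // worker can still be inside vkDestroyInstance/dlclose when main
    // returns and __run_exit_handlers starts walking __exit_funcs.
    for handle in handles {
        handle.join().expect("compute thread panicked");
    }
}

fn main() {
    multithreaded_compute();
    // Only now does main return and exit-handler traversal begin.
}
```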


How exactly does __cxa_finalize get called?

When a shared library is closed, the dynamic linker is supposed to call all the functions listed in the shared library's DT_FINI_ARRAY table. The dynamic linker's _dl_call_fini function traverses that array and calls the functions it points to.

DT_FINI_ARRAY generally includes a pointer to a toolchain-generated function named __do_global_dtors_aux, defined in libgcc.a, which gets statically linked into every shared library.

__do_global_dtors_aux implements the C++ ABI's requirement to call __cxa_finalize.
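
As an aside, the DT_FINI_ARRAY mechanism is easy to poke at from Rust on Linux. This is my own ELF-specific sketch (not anything wgpu or the loader does): placing a function pointer in the .fini_array section adds an entry to that table, and the same machinery runs it at teardown (at exit, or at dlclose for a shared library):

```rust
// Runs via DT_FINI_ARRAY when this object is torn down.
extern "C" fn finalizer() {
    println!("DT_FINI_ARRAY entry running");
}

// `#[used]` keeps the entry from being optimized away; `#[link_section]`
// places the function pointer in the ELF .fini_array section.
#[used]
#[link_section = ".fini_array"]
static FINI_ENTRY: extern "C" fn() = finalizer;

fn main() {
    println!("returning from main; the finalizer runs after this");
}
```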

jimblandy added a commit to jimblandy/wgpu that referenced this issue Jan 24, 2024
Join all threads before returning from the test case, to ensure that
we don't return from `main` until all open `Device`s have been
dropped.

This avoids a race condition in glibc in which a thread calling
`dlclose` can unmap a shared library's code even while the main thread
is still running its finalization functions. (See gfx-rs#5084 for details.)
Joining all threads before returning from the test ensures that the
Vulkan loader has finished `dlclose`-ing the Vulkan validation layer
shared library before `main` returns.

Fixes gfx-rs#5084.
jimblandy added a commit to jimblandy/wgpu that referenced this issue Jan 24, 2024
Join all threads before returning from the test case, to ensure that
we don't return from `main` until all open `Device`s have been
dropped.

This avoids a race condition in glibc in which a thread calling
`dlclose` can unmap a shared library's code even while the main thread
is still running its finalization functions. (See gfx-rs#5084 for details.)
Joining all threads before returning from the test ensures that the
Vulkan loader has finished `dlclose`-ing the Vulkan validation layer
shared library before `main` returns.

Remove `skip` for this test on GL/llvmpipe. With this change, that has
not been observed to crash. Without it, the test crashes within ten
runs or so.

Fixes gfx-rs#5084.
Fixed gfx-rs#4285.
cwfitzgerald pushed a commit that referenced this issue Jan 24, 2024
Join all threads before returning from the test case, to ensure that
we don't return from `main` until all open `Device`s have been
dropped.

This avoids a race condition in glibc in which a thread calling
`dlclose` can unmap a shared library's code even while the main thread
is still running its finalization functions. (See #5084 for details.)
Joining all threads before returning from the test ensures that the
Vulkan loader has finished `dlclose`-ing the Vulkan validation layer
shared library before `main` returns.

Remove `skip` for this test on GL/llvmpipe. With this change, that has
not been observed to crash. Without it, the test crashes within ten
runs or so.

Fixes #5084.
Fixed #4285.
@jimblandy
Copy link
Member Author

Filed bug against GLIBC: https://sourceware.org/bugzilla/show_bug.cgi?id=31285
