broker crash in `content_cache_destroy` #5706

grondo · 2024-01-26T15:04:05Z

While running some tests on the dane system, a bunch of brokers crashed in content_cache_destroy().

At the time of the crash I was testing multiple brokers per node, which has been causing nodes on this system to crash due to a bug in the hfi driver. In this case I think this was a 2-node test with 23 brokers per node. The assert occurred after the first node of the allocation crashed, and all of the brokers on that node created a core (I actually didn't notice until the 23 corefiles appeared in my working directory). Backtrace:

#0  0x0000155553740acf in raise () from /lib64/libc.so.6
#1  0x0000155553713ea5 in abort () from /lib64/libc.so.6
#2  0x0000155553713d79 in __assert_fail_base.cold.0 () from /lib64/libc.so.6
#3  0x0000155553739426 in __assert_fail () from /lib64/libc.so.6
#4  0x0000155541f16f6f in cache_entry_destroy (e=0x155538023570)
    at content/cache.c:193
#5  0x0000155541f176b5 in cache_entry_destructor (item=0x15553801d460)
    at content/cache.c:209
#6  cache_entry_destructor (item=0x15553801d460) at content/cache.c:206
#7  0x0000155541f1db72 in s_item_destroy (hard=true, item=0x15553801d460, 
    self=0x155538003ce0) at zhashx.c:193
#8  s_item_destroy (self=0x155538003ce0, item=0x15553801d460, 
    hard=<optimized out>) at zhashx.c:177
#9  0x0000155541f1dc19 in s_purge (self=self@entry=0x155538003ce0)
    at zhashx.c:144
#10 0x0000155541f1dfb9 in fzhashx_destroy (self_p=0x155538003c58)
    at zhashx.c:161
#11 0x0000155541f178b9 in content_cache_destroy (cache=0x155538003c30)
    at content/cache.c:1099
#12 content_cache_destroy (cache=cache@entry=0x155538003c30)
    at content/cache.c:1092
#13 0x0000155541f15d3a in mod_main (h=<optimized out>, argc=<optimized out>, 
    argv=<optimized out>) at content/main.c:36
#14 0x000000000041247f in module_thread (arg=0x6db7d0) at module.c:208
#15 0x0000155554e821ca in start_thread () from /lib64/libpthread.so.0
#16 0x000015555372be73 in clone () from /lib64/libc.so.6
(gdb) bt full
#0  0x0000155553740acf in raise () from /lib64/libc.so.6
No symbol table info available.
#1  0x0000155553713ea5 in abort () from /lib64/libc.so.6
No symbol table info available.
#2  0x0000155553713d79 in __assert_fail_base.cold.0 () from /lib64/libc.so.6
No symbol table info available.
#3  0x0000155553739426 in __assert_fail () from /lib64/libc.so.6
No symbol table info available.
#4  0x0000155541f16f6f in cache_entry_destroy (e=0x155538023570)
    at content/cache.c:193
        saved_errno = 17
        __PRETTY_FUNCTION__ = "cache_entry_destroy"
        saved_errno = <optimized out>
#5  0x0000155541f176b5 in cache_entry_destructor (item=0x15553801d460)
    at content/cache.c:209
No locals.
#6  cache_entry_destructor (item=0x15553801d460) at content/cache.c:206
No locals.
#7  0x0000155541f1db72 in s_item_destroy (hard=true, item=0x15553801d460, 
    self=0x155538003ce0) at zhashx.c:193
        cur_item = <optimized out>
        prev_item = <optimized out>
        cur_item = <optimized out>
        prev_item = <optimized out>
        __PRETTY_FUNCTION__ = "s_item_destroy"
#8  s_item_destroy (self=0x155538003ce0, item=0x15553801d460, 
    hard=<optimized out>) at zhashx.c:177
        cur_item = <optimized out>
        prev_item = <optimized out>
        __PRETTY_FUNCTION__ = "s_item_destroy"
#9  0x0000155541f1dc19 in s_purge (self=self@entry=0x155538003ce0)
    at zhashx.c:144
        next_item = 0x0
        cur_item = <optimized out>
        index = 92
        limit = 103
#10 0x0000155541f1dfb9 in fzhashx_destroy (self_p=0x155538003c58)
    at zhashx.c:161
        self = 0x155538003ce0
        __PRETTY_FUNCTION__ = "fzhashx_destroy"
        self = <optimized out>
#11 0x0000155541f178b9 in content_cache_destroy (cache=0x155538003c30)
    at content/cache.c:1099
        saved_errno = 17
#12 content_cache_destroy (cache=cache@entry=0x155538003c30)
    at content/cache.c:1092
        saved_errno = <optimized out>
#13 0x0000155541f15d3a in mod_main (h=<optimized out>, argc=<optimized out>, 
    argv=<optimized out>) at content/main.c:36
        cache = 0x155538003c30
        rc = 0
#14 0x000000000041247f in module_thread (arg=0x6db7d0) at module.c:208
        p = <optimized out>
        signal_set = {__val = {18446744067267100671, 
            18446744073709551615 <repeats 15 times>}}
        errnum = 0
        uri = '\000' <repeats 127 times>
        av = 0x155538003220
        ac = <optimized out>
        mod_main_errno = 0
        msg = <optimized out>
        f = <optimized out>
#15 0x0000155554e821ca in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#16 0x000015555372be73 in clone () from /lib64/libc.so.6
No symbol table info available.

garlick · 2024-02-12T19:17:51Z

FWIW, that's the first assertion in this function:

/* Destroy a cache entry
 */
static void cache_entry_destroy (struct cache_entry *e)
{
    if (e) {
        int saved_errno = errno;
        assert (e->load_requests == NULL);
        assert (e->store_requests == NULL);
        if (e->mmapped)
            content_mmap_region_decref (e->data_container);
        else
            flux_msg_decref (e->data_container);
        free (e);
        errno = saved_errno;
    }
}

This means a load request was parked on a cache entry (e.g. waiting for an upstream load request to complete) when the module was unloaded and mod_main() called the cache destructor, content_cache_destroy().

A closer look at the code is needed to see if that really isn't supposed to be able to happen. Possibly the right answer is to simply destroy any pending requests rather than assert.

garlick · 2024-03-08T23:11:49Z

@grondo just observed this one again.

Oh, there certainly could be pending requests if pending requests are canceled in dependent modules in response to the shutdown. In any case, theoretically all dependent modules have unloaded before this one, so there is unlikely to be anybody blocked waiting for a response that will never come. I'll submit a PR to remove these assertions.

Problem: sometimes brokers hit an assertion when the content module is unloaded during shutdown.

The assertion is that there are no pending load/store operations waiting on a cache entry when it is destroyed. This is certainly possible, depending on the behavior of users of the load/store RPCs. In the case of unload in rc3, dependent modules would have already been unloaded, so it seems safe to simply destroy any pending RPCs without responding to them.

Destroy them, if present.

Fixes flux-framework#5706
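For illustration only, here is a rough, self-contained sketch of the "destroy pending requests instead of asserting" idea described above. It is not the code from the PR and does not use the flux-core API: `struct pending_request`, `struct sketch_entry`, `request_list_push()`, `request_list_drain()` and `sketch_entry_destroy()` are all hypothetical names, and the pending requests are modeled as plain singly linked lists rather than whatever container cache.c actually uses.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for a request message parked on a cache entry. */
struct pending_request {
    struct pending_request *next;
};

/* Hypothetical, simplified cache entry (not the real struct cache_entry). */
struct sketch_entry {
    struct pending_request *load_requests;
    struct pending_request *store_requests;
};

/* Park a new request on a list. Returns 0 on success, -1 on failure. */
static int request_list_push (struct pending_request **head)
{
    struct pending_request *req = calloc (1, sizeof (*req));
    if (!req)
        return -1;
    req->next = *head;
    *head = req;
    return 0;
}

/* Free every request on the list without responding to it.
 * Returns the number of requests dropped and leaves *head == NULL.
 */
static int request_list_drain (struct pending_request **head)
{
    int count = 0;
    while (*head) {
        struct pending_request *req = *head;
        *head = req->next;
        free (req);
        count++;
    }
    return count;
}

/* Destroy an entry, dropping any pending requests rather than
 * asserting that none exist. Returns the number of requests dropped.
 */
static int sketch_entry_destroy (struct sketch_entry *e)
{
    int dropped = 0;
    if (e) {
        int saved_errno = errno;
        dropped += request_list_drain (&e->load_requests);
        dropped += request_list_drain (&e->store_requests);
        /* After draining, the old preconditions hold by construction. */
        assert (e->load_requests == NULL);
        assert (e->store_requests == NULL);
        free (e);
        errno = saved_errno;
    }
    return dropped;
}
```

The sketch keeps the `saved_errno` pattern from the original function so that teardown does not clobber `errno`. Dropping requests without a response is only reasonable under the assumption stated above, namely that dependent modules have already been unloaded and nobody is still waiting on a reply.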

garlick mentioned this issue Mar 8, 2024

modules/content: drop incorrect assertion #5781

Merged

mergify bot closed this as completed in #5781 Mar 9, 2024
