broker crash in content_cache_destroy #5706

Closed
grondo opened this issue Jan 26, 2024 · 2 comments · Fixed by #5781

@grondo
Contributor

grondo commented Jan 26, 2024

While running some tests on the dane system, a bunch of brokers crashed in content_cache_destroy() (backtrace below).

At the time of the crash I was testing multiple brokers per node, which has been causing nodes on this system to crash due to a bug in the hfi driver. In this case I think it was a 2-node test with 23 brokers per node. The assert occurred after the first node of the allocation crashed, and all of the brokers on that node dumped core (I didn't actually notice until the 23 corefiles appeared in my working directory).

#0  0x0000155553740acf in raise () from /lib64/libc.so.6
#1  0x0000155553713ea5 in abort () from /lib64/libc.so.6
#2  0x0000155553713d79 in __assert_fail_base.cold.0 () from /lib64/libc.so.6
#3  0x0000155553739426 in __assert_fail () from /lib64/libc.so.6
#4  0x0000155541f16f6f in cache_entry_destroy (e=0x155538023570)
    at content/cache.c:193
#5  0x0000155541f176b5 in cache_entry_destructor (item=0x15553801d460)
    at content/cache.c:209
#6  cache_entry_destructor (item=0x15553801d460) at content/cache.c:206
#7  0x0000155541f1db72 in s_item_destroy (hard=true, item=0x15553801d460, 
    self=0x155538003ce0) at zhashx.c:193
#8  s_item_destroy (self=0x155538003ce0, item=0x15553801d460, 
    hard=<optimized out>) at zhashx.c:177
#9  0x0000155541f1dc19 in s_purge (self=self@entry=0x155538003ce0)
    at zhashx.c:144
#10 0x0000155541f1dfb9 in fzhashx_destroy (self_p=0x155538003c58)
    at zhashx.c:161
#11 0x0000155541f178b9 in content_cache_destroy (cache=0x155538003c30)
    at content/cache.c:1099
#12 content_cache_destroy (cache=cache@entry=0x155538003c30)
    at content/cache.c:1092
#13 0x0000155541f15d3a in mod_main (h=<optimized out>, argc=<optimized out>, 
    argv=<optimized out>) at content/main.c:36
#14 0x000000000041247f in module_thread (arg=0x6db7d0) at module.c:208
#15 0x0000155554e821ca in start_thread () from /lib64/libpthread.so.0
#16 0x000015555372be73 in clone () from /lib64/libc.so.6
(gdb) bt full
#0  0x0000155553740acf in raise () from /lib64/libc.so.6
No symbol table info available.
#1  0x0000155553713ea5 in abort () from /lib64/libc.so.6
No symbol table info available.
#2  0x0000155553713d79 in __assert_fail_base.cold.0 () from /lib64/libc.so.6
No symbol table info available.
#3  0x0000155553739426 in __assert_fail () from /lib64/libc.so.6
No symbol table info available.
#4  0x0000155541f16f6f in cache_entry_destroy (e=0x155538023570)
    at content/cache.c:193
        saved_errno = 17
        __PRETTY_FUNCTION__ = "cache_entry_destroy"
        saved_errno = <optimized out>
#5  0x0000155541f176b5 in cache_entry_destructor (item=0x15553801d460)
    at content/cache.c:209
No locals.
#6  cache_entry_destructor (item=0x15553801d460) at content/cache.c:206
No locals.
#7  0x0000155541f1db72 in s_item_destroy (hard=true, item=0x15553801d460, 
    self=0x155538003ce0) at zhashx.c:193
        cur_item = <optimized out>
        prev_item = <optimized out>
        cur_item = <optimized out>
        prev_item = <optimized out>
        __PRETTY_FUNCTION__ = "s_item_destroy"
#8  s_item_destroy (self=0x155538003ce0, item=0x15553801d460, 
    hard=<optimized out>) at zhashx.c:177
        cur_item = <optimized out>
        prev_item = <optimized out>
        __PRETTY_FUNCTION__ = "s_item_destroy"
#9  0x0000155541f1dc19 in s_purge (self=self@entry=0x155538003ce0)
    at zhashx.c:144
        next_item = 0x0
        cur_item = <optimized out>
        index = 92
        limit = 103
#10 0x0000155541f1dfb9 in fzhashx_destroy (self_p=0x155538003c58)
    at zhashx.c:161
        self = 0x155538003ce0
        __PRETTY_FUNCTION__ = "fzhashx_destroy"
        self = <optimized out>
#11 0x0000155541f178b9 in content_cache_destroy (cache=0x155538003c30)
    at content/cache.c:1099
        saved_errno = 17
#12 content_cache_destroy (cache=cache@entry=0x155538003c30)
    at content/cache.c:1092
        saved_errno = <optimized out>
#13 0x0000155541f15d3a in mod_main (h=<optimized out>, argc=<optimized out>, 
    argv=<optimized out>) at content/main.c:36
        cache = 0x155538003c30
        rc = 0
#14 0x000000000041247f in module_thread (arg=0x6db7d0) at module.c:208
        p = <optimized out>
        signal_set = {__val = {18446744067267100671, 
            18446744073709551615 <repeats 15 times>}}
        errnum = 0
        uri = '\000' <repeats 127 times>
        av = 0x155538003220
        ac = <optimized out>
        mod_main_errno = 0
        msg = <optimized out>
        f = <optimized out>
#15 0x0000155554e821ca in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#16 0x000015555372be73 in clone () from /lib64/libc.so.6
No symbol table info available.
@garlick
Member

garlick commented Feb 12, 2024

FWIW, that's the first assertion in this function:

/* Destroy a cache entry
 */
static void cache_entry_destroy (struct cache_entry *e)
{
    if (e) {
        int saved_errno = errno;
        assert (e->load_requests == NULL);
        assert (e->store_requests == NULL);
        if (e->mmapped)
            content_mmap_region_decref (e->data_container);
        else
            flux_msg_decref (e->data_container);
        free (e);
        errno = saved_errno;
    }
}

This means a load request was parked on a cache entry (e.g. waiting for an upstream load request to complete) when the module was unloaded and its main destructor, content_cache_destroy(), was called.

A closer look at the code is needed to see if that really isn't supposed to be able to happen. Possibly the right answer is to simply destroy any pending requests rather than assert.
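
For illustration only, here is a minimal standalone sketch of the "destroy rather than assert" idea. It assumes hypothetical stand-in types: `struct request`, the simplified `struct cache_entry`, and the helpers `request_list_push()` and `request_list_destroy()` are invented for this example and are not the flux-core API. It also returns a count of dropped requests purely so the behavior is observable, whereas the real function returns void.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for a parked load/store request.
 * The real module parks messages here; this is NOT the flux-core API.
 */
struct request {
    struct request *next;
};

/* Hypothetical, simplified cache entry (only the fields relevant here).
 */
struct cache_entry {
    struct request *load_requests;
    struct request *store_requests;
    void *data;
};

/* Push a new request onto a list.
 * Returns 0 on success, -1 on allocation failure.
 */
static int request_list_push (struct request **head)
{
    struct request *r = calloc (1, sizeof (*r));
    if (!r)
        return -1;
    r->next = *head;
    *head = r;
    return 0;
}

/* Free every request on a list without responding to it.
 * Returns the number of requests dropped.
 */
static int request_list_destroy (struct request *head)
{
    int count = 0;
    while (head) {
        struct request *next = head->next;
        free (head);
        head = next;
        count++;
    }
    return count;
}

/* Destroy a cache entry, dropping any pending requests instead of
 * asserting that none exist.  errno is preserved, as in the original.
 * Returns the number of dropped requests (illustration only).
 */
static int cache_entry_destroy (struct cache_entry *e)
{
    int dropped = 0;
    if (e) {
        int saved_errno = errno;
        dropped += request_list_destroy (e->load_requests);
        dropped += request_list_destroy (e->store_requests);
        free (e->data);
        free (e);
        errno = saved_errno;
    }
    return dropped;
}
```

With two parked loads and one parked store on an entry, this version frees all three and returns 3, where the assert-based version would abort. Whether silently dropping requests is acceptable depends on nobody still waiting for a response.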

@garlick
Member

garlick commented Mar 8, 2024

@grondo just observed this one again.

Oh, there certainly could be pending requests here if dependent modules cancel their own pending requests in response to the shutdown. In any case, in theory all dependent modules have unloaded before this one, so it is unlikely that anybody is blocked waiting for a response that will never come. I'll submit a PR to remove these assertions.

garlick added a commit to garlick/flux-core that referenced this issue Mar 8, 2024
Problem: sometimes brokers hit an assertion when the
content module is unloaded during shutdown.

The assertion is that there are no pending load/store operations
waiting on a cache entry when it is destroyed.  This is certainly
possible, depending on the behavior of users of the load/store
RPCs.  In the case of unload in rc3, dependent modules would have
already been unloaded, so it seems safe to simply destroy any
pending RPCs without responding to them.

Destroy them, if present.

Fixes flux-framework#5706
@mergify mergify bot closed this as completed in #5781 Mar 9, 2024