broker crash in content_cache_destroy
#5706
FWIW, that's the first assertion in this function:

/* Destroy a cache entry
 */
static void cache_entry_destroy (struct cache_entry *e)
{
    if (e) {
        int saved_errno = errno;
        assert (e->load_requests == NULL);
        assert (e->store_requests == NULL);
        if (e->mmapped)
            content_mmap_region_decref (e->data_container);
        else
            flux_msg_decref (e->data_container);
        free (e);
        errno = saved_errno;
    }
}

This means a load request was parked on a cache entry (e.g. waiting for an upstream load request to complete) when the module was unloaded, calling the module's main destructor. A closer look at the code is needed to see if that really isn't supposed to be able to happen. Possibly the right answer is to simply destroy any pending requests rather than assert.
@grondo just observed this one again. Oh, there certainly could be pending requests if they are canceled in dependent modules in response to the shutdown. In any case, theoretically all dependent modules have unloaded before this one, so there is unlikely to be anybody blocked waiting for a response that will never come. I'll submit a PR to remove these assertions.
Problem: sometimes brokers hit an assertion when the content module is unloaded during shutdown.

The assertion is that there are no pending load/store operations waiting on a cache entry when it is destroyed. This is certainly possible, depending on the behavior of users of the load/store RPCs. In the case of unload in rc3, dependent modules would have already been unloaded, so it seems safe to simply destroy any pending RPCs without responding to them.

Destroy them, if present.

Fixes flux-framework#5706
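For anyone skimming the thread, here is a rough, standalone sketch of the "destroy pending requests instead of asserting" idea. It is not the actual flux-core patch: `struct pending_request`, `pending_list_push()`, `pending_list_destroy()`, and the trimmed-down `struct cache_entry` are hypothetical stand-ins, and the real entry also releases its data container. The sketch also returns the number of dropped requests purely so the behavior is easy to check, whereas the original function returns void.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for a request parked on a cache entry.
 * This is NOT the flux-core request type.
 */
struct pending_request {
    struct pending_request *next;
};

/* Hypothetical, simplified cache entry holding only the two wait lists.
 */
struct cache_entry {
    struct pending_request *load_requests;
    struct pending_request *store_requests;
};

/* Park a new request at the head of a list.
 * Returns 0 on success, -1 if allocation fails.
 */
static int pending_list_push (struct pending_request **head)
{
    struct pending_request *req = calloc (1, sizeof (*req));
    if (!req)
        return -1;
    req->next = *head;
    *head = req;
    return 0;
}

/* Free every request on a list without responding to it.
 * Returns the number of requests freed and leaves *head == NULL.
 */
static int pending_list_destroy (struct pending_request **head)
{
    int count = 0;
    while (*head) {
        struct pending_request *req = *head;
        *head = req->next;
        free (req);
        count++;
    }
    return count;
}

/* Destroy a cache entry, dropping any pending requests rather than
 * asserting that none exist.  Returns the number of requests dropped.
 */
static int cache_entry_destroy (struct cache_entry *e)
{
    int dropped = 0;
    if (e) {
        int saved_errno = errno;
        dropped += pending_list_destroy (&e->load_requests);
        dropped += pending_list_destroy (&e->store_requests);
        /* The old invariant now holds by construction. */
        assert (e->load_requests == NULL && e->store_requests == NULL);
        free (e);
        errno = saved_errno;
    }
    return dropped;
}
```

Dropping requests silently is only reasonable under the assumption stated above, namely that dependent modules have already been unloaded and nobody is left waiting for a response.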
While running some tests on the dane system a bunch of brokers crashed in content_cache_destroy().

At the time of the crash I was testing multiple brokers per node, which has been causing nodes on this system to crash with a bug in the hfi driver. In this case I think this was a 2 node test with 23 brokers per node. The assert occurred after the first node of the allocation crashed, and all of the brokers on that node created a core (I actually didn't notice until the 23 corefiles appeared in my working directory).