improve reliability of module unloading #1017
This PR takes another step towards cleaning up the broker shutdown path.
rc3 unloads modules that were loaded in rc1. A reactor timer is set when rc3 is entered so that if it takes too long, the reactor is terminated and we can clean up and exit. Due to #1006, we actually always let this timer run out to completion.
In this post-reactor cleanup path is some code that sends a "shutdown" request to any modules still loaded, and then calls
A SIGALRM timer was set up to break out of this hang and move things along; however, it's a kludge, another source of timing-related behavior variability, and it caused some troubles noted in #1014 with atexit handlers being called from signal handler context.
This PR removes the cause of the hangs in
Since cancelling a thread causes it to terminate early, potentially missing some cleanup, we should try to avoid this case by making rc3 successful, if possible. Thus when it happens, we log to stderr (though still exit 0). The default shutdown grace period was increased somewhat (0.5s to 1.0s for <16 ranks).
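For illustration only, here is a minimal Python sketch of the "bounded wait, then complain on stderr" behavior described above. It is not the broker's code: `unload_module`, `well_behaved`, `stuck`, and the log wording are hypothetical names chosen for this example, and a `threading.Event` stands in for the "shutdown" request. Python has no equivalent of thread cancellation, so where the PR cancels the thread, this sketch only reports the failure and moves on.

```python
import sys
import threading


def unload_module(name, thread, stop_event, grace=1.0, log=sys.stderr):
    """Ask a module thread to stop, then wait at most `grace` seconds.

    Returns True if the thread exited in time, False otherwise.
    All names here are hypothetical; this is not the broker's API.
    """
    stop_event.set()            # stands in for sending the "shutdown" request
    thread.join(timeout=grace)  # bounded wait instead of an open-ended join
    if thread.is_alive():
        # Python cannot cancel a thread, so the sketch only logs and moves on.
        print(f'module "{name}" was not cleanly shutdown', file=log)
        return False
    return True


def well_behaved(stop_event):
    """Module body that exits as soon as shutdown is requested."""
    stop_event.wait()


def stuck(stop_event, release):
    """Module body that ignores the shutdown request until `release` is set."""
    release.wait()


if __name__ == "__main__":
    ev = threading.Event()
    t = threading.Thread(target=well_behaved, args=(ev,), daemon=True)
    t.start()
    print("clean exit:", unload_module("example", t, ev))
```

The point of the sketch is that the join is always bounded by the grace period, so a module that never answers the shutdown request cannot hang the exit path; it only produces a message on stderr.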
@@            Coverage Diff             @@
##           master    #1017      +/-   ##
==========================================
+ Coverage   77.63%   77.71%   +0.07%
==========================================
  Files         152      152
  Lines       25787    25752      -35
==========================================
- Hits        20021    20012       -9
+ Misses       5766     5740      -26
Mar 28, 2017
Ok, ran with this version on hype and all looks good (does slow down tests, but not outside the realm of acceptability)
I also tested running a session with
For the sake of a casual user, would it make sense to call out that, in this context, "live" is a module, e.g.
'module "live" was not cleanly shutdown'
Just a suggestion, and actually I think I'll go ahead and merge this now anyway, as this can easily be fixed up later.