Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve reliability of module unloading #1017

Merged
merged 9 commits into from Mar 28, 2017

Conversation

Projects
None yet
4 participants
@garlick
Copy link
Member

garlick commented Mar 28, 2017

This PR takes another step towards cleaning up the broker shutdown path.

rc3 unloads modules that were loaded in rc1. A reactor timer is set when rc3 is entered so that if it takes too long, the reactor is terminated and we can clean up and exit. Due to #1006, we actually always let this timer run out to completion.

In this post-reactor cleanup path is some code that sends a "shutdown" request to any modules still loaded, and then calls modhash_destroy() which tries to destroy each module starting with a pthread_join(). This is prone to hanging because for whatever reason, modules may not have processed the shutdown message (for example, they may be blocked in an RPC).

A SIGARLM timer was set up to break out of this hang and move things along, however it's a kludge, another source of timing related behavior variability, and it caused some troubles noted in #1014 with atexit handlers being called from signal handler context.

This PR removes the cause of the hangs in pthread_join() by replacing the shutdown message in the post-reactor cleanup path with a pthread_cancel(). The SIGALRM handler is then dropped.

Since cancelling a thread causes it to terminate early, potentially missing some cleanup, we should try to avoid this case by making rc3 successful, if possible. Thus when it happens, we log to stderr (though still exit 0). The default shutdown grace period was increased somewhat (0.5s to 1.0s for <16 ranks).

@codecov-io

This comment has been minimized.

Copy link

codecov-io commented Mar 28, 2017

Codecov Report

Merging #1017 into master will increase coverage by 0.07%.
The diff coverage is 81.69%.

@@            Coverage Diff             @@
##           master    #1017      +/-   ##
==========================================
+ Coverage   77.63%   77.71%   +0.07%     
==========================================
  Files         152      152              
  Lines       25787    25752      -35     
==========================================
- Hits        20021    20012       -9     
+ Misses       5766     5740      -26
Impacted Files Coverage Δ
src/broker/broker.c 72.97% <81.35%> (+2.55%) ⬆️
src/broker/module.c 82.22% <83.33%> (-2.08%) ⬇️
src/broker/content-cache.c 73.03% <0%> (-1.13%) ⬇️
src/cmd/flux-module.c 83.85% <0%> (-0.32%) ⬇️
src/common/libflux/handle.c 85.78% <0%> (+0.25%) ⬆️
src/common/libflux/message.c 83.89% <0%> (+0.37%) ⬆️
src/common/libflux/rpc.c 91.17% <0%> (+0.73%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5e85aa7...cf40a36. Read the comment docs.

@coveralls

This comment has been minimized.

Copy link

coveralls commented Mar 28, 2017

Coverage Status

Coverage decreased (-0.01%) to 77.905% when pulling b0a4bb4 on garlick:better_module_unload into de54940 on flux-framework:master.

@garlick garlick force-pushed the garlick:better_module_unload branch from b0a4bb4 to 8d85138 Mar 28, 2017

@coveralls

This comment has been minimized.

Copy link

coveralls commented Mar 28, 2017

Coverage Status

Coverage decreased (-0.01%) to 77.905% when pulling 8d85138 on garlick:better_module_unload into de54940 on flux-framework:master.

@coveralls

This comment has been minimized.

Copy link

coveralls commented Mar 28, 2017

Coverage Status

Coverage increased (+0.09%) to 78.011% when pulling b79c870 on garlick:better_module_unload into de54940 on flux-framework:master.

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Mar 28, 2017

I noticed some typos in commit messages, and some more "imbalanced" sharness tests (where modules that were loaded in the test script were not unloaded), so an update is coming. I'll hold off until #1016 is merged and rebase it first though.

garlick added some commits Mar 16, 2017

test/sharness: add rc3 "personalities"
Problem: rc1 files for different test environments
should have corresponding rc3 files for proper teardown.

Edit flux-sharness.sh to require a personality to have
both rc1 and rc3 scripts, and add rc3 scripts for the three
test personalities already configured with rc1 scripts.
broker/module: address broker hangs in pthread_join
Problem: broker teardown sometimes gets stuck in
pthread_join() on modules that don't complete
unloading in rc3.

Modules that are still loaded when the shutdown grace
timer expires are unloaded in the post-reactor teardown
code.  First a shutdown message is sent to each module,
then the modhash is destroyed.  During modhash destruction,
the hash element destructor is called on each remaining
module.  The destructor calls pthread_join() on the module
thread before attempting to free its resources.  This may
hang, for example if the module thread is blocked on an
RPC with the broker reactor stopped, and the module hasn't
yet processed the shutdown request.

Change the modhash destructor so that, as a prelude to
destroying the hash, it iterates over the remaining modules
in the hash, calling pthread_cancel() on them.  This
makes it much more likely that the pthread_join() will
be successful.

Since this means some teardown within the module thread
may have been skipped, log each module that is terminated
with a cancellation so we can run those down.

Drop dead code for stopping/starting all modules asynchronously.
test/sharness: specify bootstrap mode
Reduce some noise in the test output by explicitly selecting
--bootstrap-mode=selfpmi.
test/sharnesss: add module remove to tests
Problem: Some sharness test scripts load modules but
do not unload them.

Ensure that test scripts unload any modules that that load.
broker: drop SIGALRM timeout
Problem: calling exit() from a signal handler can lead to
segfaults in the libczmq atexit handler, and with changes to
the module unloading code, should no longer be necessary.

Remove the SIGALRM handler and alarm() call that was
active during post-reactor teardown.
broker: refactor module load/unload code
Problem: the load_modules() function contains dead code and
duplicates code in the cmb.insmod request handler.

Refactor that function and the duplicated code in  insmod/rmmod
request handlers into these functions:
  load_module_bypath()
  load_module_byname()
  unload_module_byname()
broker: double the default shutdown grace times
Problem: default shutdown grace period of 0.5s is
insufficient time to unload the default set of modules
in small (size < 16) "selfpmi" instances.  Warnings
are issued during instance shutdown.

Double all the default grace period times, increasing
the <16 node time to 1s, which seems to be sufficient.

This unfortunately does slow down testing, but the
wasted time letting the shutdown grace period expire
every time should be addressed when #1006 is fixed.
test/basic: increase broker option coverage
Problem: some broker options don't have test coverage.

Add a few tests of basic options to t0001-basic.t.
While these tests are not comprehensive, at least
the exercise the code paths and ensure no segfaults
are lurking there.

@garlick garlick force-pushed the garlick:better_module_unload branch from b79c870 to cf40a36 Mar 28, 2017

@coveralls

This comment has been minimized.

Copy link

coveralls commented Mar 28, 2017

Coverage Status

Coverage increased (+0.07%) to 78.018% when pulling cf40a36 on garlick:better_module_unload into 5e85aa7 on flux-framework:master.

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Mar 28, 2017

Updated, rebased, and ready for review.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Mar 28, 2017

Do we need to add a sharness or perhaps even just a specific travis test to check for unbalanced module load/unload in test scripts?

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Mar 28, 2017

Hmm, that's a good idea. I wonder if we could add something to sharness test_under_flux to make sure the same modules are loaded before and after the script executes, and fail the test if they are different?

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Mar 28, 2017

@garlick, thats a good idea and probably pretty easy to accomplish.

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Mar 28, 2017

I opened #1020 to track that. Think it would be OK to save that one for later?

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Mar 28, 2017

Yes. This one ready to merge then?

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Mar 28, 2017

Yep.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Mar 28, 2017

Ok, ran with this version on hype and all looks good (does slow down tests, but not outside the realm of acceptability)

I also tested running a session with flux start and loading an extra module, e.g. flux module load live. After exiting the initial program (shell), as expected I got:

(flux-177926) grondo@hype356:~/git/flux-core.git$ exit
exit
lt-flux-broker: live was not cleanly shutdown

For the sake of a casual user, would it make sense to call out that, in this context, "live" is a module, e.g.

'module "live" was not cleanly shutdown'
?

Just a suggestion and actually I think I'll go ahead and merge this now anyway as this can easily be fixed up later.

@grondo grondo merged commit 9685852 into flux-framework:master Mar 28, 2017

4 checks passed

codecov/patch 81.69% of diff hit (target 77.63%)
Details
codecov/project 77.71% (+0.07%) compared to 5e85aa7
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
coverage/coveralls Coverage increased (+0.07%) to 78.018%
Details
@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Mar 28, 2017

Thanks for the extra testing! I'll try to sneak that change in with my next PR.
(Incidentally live was removed so you've got an outdated .so there).

@garlick garlick deleted the garlick:better_module_unload branch Mar 28, 2017

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Mar 28, 2017

Incidentally live was removed so you've got an outdated .so there

oh, whoops! I guess that is a drawback with running ./autogen.sh && ./configure before make clean

@grondo grondo referenced this pull request Mar 28, 2017

Closed

0.7.0 Release Notes #1019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.