Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

call destructors in broker shutdown path #1005

Merged
merged 8 commits into from Mar 16, 2017

Conversation

Projects
None yet
4 participants
@garlick
Copy link
Member

garlick commented Mar 16, 2017

As discussed in #25, instance shutdown is far from clean. It's a bit better than before, now that we have the rc1, rc2, and rc3 run levels (modules loaded in rc1 are unloaded in rc3), but the "normal" path is to let the shutdown timer (started at the beginning of rc3) expire and call exit from its reactor callback.
Since the reactor never returns, we never get back to main() to call destructors.

Ideally we'd like to get to the point where all reactor watchers are stopped after their subsystems are torn down and the reactor would exit naturally, with the timeout acting as a fallback rather than the main path. That will require some design work to ensure the overlay is torn down in an orderly fashion without deadlocking distributed services, and is not attempted here.

This PR takes a step in the right direction by replacing the exit() call in the shutdown handler with flux_reactor_stop(). The destructors in main() are then called - some bug fixes were needed in these. Finally since deadlock is still possible when attempting to unload modules that failed to unload in rc3, a SIGALRM timer is set with an exit call in the handler just in case.

I think we could close #25 if this is merged and open new issues for the remaining problems.

Also this lays some groundwork for #974 (convert to latest czmq) as zsys gets sad if any sockets are open in its atexit() call.

garlick added some commits Mar 14, 2017

broker/log: fix logic error in teardown
Problem: destroying the broker flux_t handle causes an
assertion failure in log.c due to a stray magic cookie
de-assignment in the destroy code.

Remove the stray de-assignment line.
broker/module: add back module_stop_all()
Problem: any modules that are still loaded after rc3,
such as the hardwired connector-local module, need to
be sent a shutdown request, otherwise modhash_destroy()
will hang in pthread_join().

Add module_stop_all() back into module.c and call it
before modhash_destroy () during teardown.
broker: destroy content cache before flux_t handle
Problem: destroying the content cache after the flux_t
handle has been destroyed causes a segfault in
flux_event_unsubscribe().

Destroy the content cache earlier in the teardown
process.
broker: stop the reactor after shutdown timeout
When the shutdown timeout expires, instead of exiting
directly, tell the reactor to stop so main() can
exit through its stack of destructors.
broker: save/restore signal mask/actions
Problem: if the broker gets stuck while attempting to tear
down its various subsystems, it requires a SIGKILL to
terminate it.

Restore the original signal mask once the reactor terminates.
Also, restore SIG_DFL actions for SIGTERM and SIGINT (exit).
These are apparently being changed within czmq, despite
calling zsys_handler_set (NULL).
broker: fix flux_reactor_create flags argument
Problem: flux_reactor_create() is called with flags set to
SIGCHLD (17), not a valid flag.

Change to the proper flag of FLUX_REACTOR_SIGCHLD (1).

By luck, the internal flag test worked equally well for
FLUX_REACTOR_SIGCHLD or SIGCHLD so this won't change behavior.
broker: install SIGALRM timeout for teardown
Problem: at times shutdown hangs

The code for unloading modules and closing zeromq sockets
needs work, but in the mean time, set an alarm clock and
exit if post-reactor cleanup takes longer than 1s.  This
period is after the hueristically-determined shutdown
period in which the instance can still communicate because
the reactor is still running.  Only local stuff can happen
after the reactor has terminated, but some modules may be
still be waiting for messages that will never arrive if
they failed to unload during the shutdown period.

Preserve the exit code if the alarm clock fires.  E.g. timeout
is not handled as an error at this time.  Before we used to
just exit without teardown when the shutdown period was over,
so this is certainly no worse than before.
cmd/flux-start: increase timeout to 2s
Problem: occasionally the clean shutdown path deadlocks
unloading modules, then SIGALRM after 1s as a workaround
moves this forward.  However, the workaround races with
flux-start's "killer-timeout" which defaults to 1s.
This causes tests to fail with "killed by signal 9".

Raise the killer-timeout to 2s.
@coveralls

This comment has been minimized.

Copy link

coveralls commented Mar 16, 2017

Coverage Status

Coverage increased (+0.6%) to 77.644% when pulling ceb3b71 on garlick:clean_shutdown into 5ca1767 on flux-framework:master.

@codecov-io

This comment has been minimized.

Copy link

codecov-io commented Mar 16, 2017

Codecov Report

Merging #1005 into master will increase coverage by 0.58%.
The diff coverage is 79.48%.

@@            Coverage Diff             @@
##           master    #1005      +/-   ##
==========================================
+ Coverage   76.74%   77.32%   +0.58%     
==========================================
  Files         151      151              
  Lines       25724    25754      +30     
==========================================
+ Hits        19741    19915     +174     
+ Misses       5983     5839     -144
Impacted Files Coverage Δ
src/broker/log.c 67.41% <ø> (+6.32%)
src/cmd/flux-start.c 74% <ø> (ø)
src/broker/broker.c 70.25% <74.07%> (+3.42%)
src/broker/module.c 82.98% <91.66%> (+1.36%)
src/common/libflux/handle.c 85.78% <0%> (ø)
src/common/libflux/message.c 83.64% <0%> (+0.37%)
src/modules/aggregator/aggregator.c 77.17% <0%> (+0.82%)
src/common/libflux/response.c 85% <0%> (+0.83%)
src/broker/attr.c 85.5% <0%> (+0.96%)
... and 12 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5ca1767...ceb3b71. Read the comment docs.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Mar 16, 2017

This looks good to me, ok to merge it?

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Mar 16, 2017

Should be OK, thanks.

@grondo grondo merged commit f81ab5c into flux-framework:master Mar 16, 2017

4 checks passed

codecov/patch 79.48% of diff hit (target 76.74%)
Details
codecov/project 77.32% (+0.58%) compared to 5ca1767
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
coverage/coveralls Coverage increased (+0.6%) to 77.644%
Details

@garlick garlick deleted the garlick:clean_shutdown branch Mar 17, 2017

@grondo grondo referenced this pull request Mar 28, 2017

Closed

0.7.0 Release Notes #1019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.