Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cleanup: dispatch orphan brokers, fully clean up temp directories #1835

Merged
merged 15 commits into from Nov 14, 2018

Conversation

Projects
None yet
3 participants
@garlick
Copy link
Member

garlick commented Nov 13, 2018

This fixes flux-start failing to clean up orphan brokers after one dies; and the broker failing to clean up ipc sockets, causing /tmp/flux-XXXXX directories containing only sockets to accumulate over time.

I still need to add a test for the orphan broker cleanup.

garlick added some commits Nov 12, 2018

testsuite: verify that broker.rundir is cleaned up
Add tests to t0001-basic.t to run flux in
--bootstrap=pmi (singleton)
--bootstrap=selfpmi (size=1)
--bootstrap=selfpmi (size=2)
and validate that broker.rundir is removed on exit.
broker/overlay: cleanup 'child' ipc:// files
Problem: broker creates ipc:// sockets when instructed,
but doesn't clean them up on exit.

Register an atexit cleanup handler to do this
for the "child" socket.

Fixes #1831
cmd/flux-start: enable killer timeout
Problem: if one broker process dies, flux-start
doesn't attempt to signal remaining brokers.

The ctx.subprocesses list was always empty, so the killer()
timeout handler was effectively a no-op.  Change it to
a list of (struct client *), and populate it.

Fixes #1785
@codecov-io

This comment has been minimized.

Copy link

codecov-io commented Nov 13, 2018

Codecov Report

Merging #1835 into master will increase coverage by 0.1%.
The diff coverage is 71.42%.

@@            Coverage Diff            @@
##           master    #1835     +/-   ##
=========================================
+ Coverage   79.71%   79.82%   +0.1%     
=========================================
  Files         197      197             
  Lines       35203    35211      +8     
=========================================
+ Hits        28063    28107     +44     
+ Misses       7140     7104     -36
Impacted Files Coverage Δ
src/broker/broker.c 77.23% <0%> (+0.25%) ⬆️
src/common/libflux/panic.c 100% <100%> (+100%) ⬆️
src/broker/overlay.c 84.58% <100%> (+0.19%) ⬆️
src/cmd/flux-start.c 78.21% <70%> (+2.43%) ⬆️
src/cmd/flux-comms.c 73.52% <87.5%> (+18.98%) ⬆️
src/broker/modservice.c 78.84% <0%> (-0.97%) ⬇️
src/cmd/flux-module.c 85.58% <0%> (-0.31%) ⬇️
src/modules/kvs/kvs.c 65.57% <0%> (+0.15%) ⬆️
src/common/libflux/message.c 81.64% <0%> (+0.24%) ⬆️
... and 1 more

garlick added some commits Nov 13, 2018

cmd/flux-start: save/restore terminal mode
Problem: if the rank 0 broker dies, and the initial program
was interactive, sometimes tty is left in an unusable state.

If stdin is a tty, save the state and restore it in
an atexit handler.
cmd/flux-start: ignore SIGTTOU
Problem: when the initial program is an interactive shell,
the "current terminal process group ID" is switched to
the initial program.  If the broker that spawned it is
dispatched with a SIGKILL, flux-start receives a SIGTTOU.
The default action for SIGTTOU is STOP.

As discussed in #1831, flux-start can ignore SIGTTOU
to avoid being stopped right before exiting.
libflux/panic: clean up flux_panic()
Problem: flux_panic() triggers assertions in czmq atexit handler.

Change the cmb.panic message handler to call _exit(1) after
logging its message to stderr.

Cleanup:
- protocol now requires 'reason' string (renamed from 'msg')
- add flags to proto and function (reserved for future use)
- don't send response on protocol error, just log it
- take uint32_t nodeid not integer rank with -1 conversion
  to FLUX_NODEID_ANY

Update flux-comms panic command.
testsuite: flux-start cleans up after broker fail
Add a test to t0001-basic.t where one broker rank is
made to exit and verify that the toher one is killed
by flux-start after the --killer-timeout.

@garlick garlick force-pushed the garlick:fix_cleanup branch from 8535f2b to 2460dc3 Nov 13, 2018

garlick added some commits Nov 14, 2018

build: improve libflux test linkage
Problem: libflux/test/util.c is being compiled for
each test that uses it, when it only needs to be
compiled once.

Create a libtestutil.la library to avoid this and
add it to $(test_ldadd) so all libflux unit tests can use it
without explicitly linking against it.

@garlick garlick force-pushed the garlick:fix_cleanup branch from d717876 to 8ce20a7 Nov 14, 2018

testsuite: move module unit test to sharness area
Problem: module unit test in libflux requires kvs
module to be built.

Relocate this test to t/module/codec.t
@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Nov 14, 2018

The tests I added for flux start cleanup needed to be able to kill off one broker, so I ended up doing some cleanup of flux_panic() and adding a test for it.

Then when adding that test, I realized (in the context of @grondo's comments about rcalc.o yesterday) that I should put the test helper util.c in a library to avoid compiling it for each test that uses it, so I did that.

But then getting that dependency to be built automatically for the module.t unit test became tricky because its _DEPENDENCIES setting was being overridden to get the kvs.so module built first, and combining that with automake's logic to extract dependencies from _LDADD proved problematic. So rather than try to work through that, I just moved the module unit test to the sharness area where some other module-related stuff was being built for testing and dropped the kvs dependeny in libflux, since we said we needed to revisit the module loading stuff anyway.

Sorry for the lateral drift.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Nov 14, 2018

Nice "cleanup" (hah)!

I had over 2K /tmp/flux-* directories accumulated over time on my desktop.

Before this PR, a single make check results in 86 directories left over in /tmp.
After the PR there is just one. I wonder what it is from?

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Nov 14, 2018

After the PR there is just one. I wonder what it is from?

Oh probably the test I added above which calls panic and makes a broker _exit(), bypassing the atexit cleanup handlers. I should probably fix the test to clean up in that case.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Nov 14, 2018

ironically, the cleanup_test.XXX directory is also left over after make check. Orthogonal to this PR tho

garlick added some commits Nov 14, 2018

testsuite: fix cleanup test that didn't clean up
Problem: libutil/cleanup test calls unlink() on a
directory without checking return value.

Change that to a rmdir(), with BAIL_OUT on failure.
@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Nov 14, 2018

Note: the code coverage report shows that cmb_panic_cb() is not covered, but this is likely because _exit() is called and so gcov doesn't have a chance to dump its data.

Definitely not important, but we could consider adding a call to __gcov_flush() before _exit() if CODE_COVERAGE_ENABLED is defined.

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Nov 14, 2018

I puzzled over that and figured that might be possible.

I'll go ahead and add that!

garlick added some commits Nov 14, 2018

broker: call gcov_flush in panic handler
Problem: test coverage misses code in panic handler
because it calls _exit().

Call __gcov_flush() if built with coverage enabled.
build: define CODE_COVERAGE_ENABLED in config.h
Problem: there is no way to test whether
coverage support is enabled within C code.

Conditionally define CODE_COVERAGE_ENABLED in config.h.
@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Nov 14, 2018

Nice! Ready to merge once Travis is green?

I gave this a whirl again and make check now leaves nothing behind in /tmp

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Nov 14, 2018

Yes please.

@grondo grondo merged commit 4c5a8ae into flux-framework:master Nov 14, 2018

2 of 3 checks passed

codecov/patch 76.74% of diff hit (target 79.71%)
Details
codecov/project 79.88% (+0.16%) compared to 2fa5cd5
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Nov 14, 2018

Looks like the __gcov_flush() did fix the problem covering the panic handler. Thanks for that suggestion!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.