Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add unit tests for cleanup functions #1014

Merged
merged 6 commits into from Mar 28, 2017

Conversation

Projects
None yet
4 participants
@garlick
Copy link
Member

garlick commented Mar 25, 2017

This PR was originally in #1013 but wasn't quite on topic there, hence was broken out.

After the reactor exits, the broker sets up a SIGALRM handler in case its last ditch attempts to unload modules hangs. This signal handler calls exit() which causes any registered atexit() functions to run, including one to clean up temporary files and zsys_shutdown() to close dangling zeromq sockets. This latter one is installed automatically through the use of various libczmq classes.

zsys_shutdown() has an internal mutex that would appear to only protect against concurrent calls to itself, not against calls to other socket operations. Since zeromq sockets are not thread safe, this can lead to segfaults. This became evident in PR #1013 when converting from zsocket to zsock classes.

To avoid calling this function from the SIGALARM handler where comms modules may still be shutting down, replace the call to exit() with _exit(). We do however still want to clean up temporary files, so expose cleanup_run() and call it manually from the signal handler.

The unlink_recursive() function called by cleanup_run() is generally useful and has no tests, so break it out to its own libutil source module and add a test. Fix a bug discovered by the test.

@coveralls

This comment has been minimized.

Copy link

coveralls commented Mar 25, 2017

Coverage Status

Coverage decreased (-0.2%) to 77.672% when pulling d7bad53 on garlick:atexit_race into f78422e on flux-framework:master.

@codecov-io

This comment has been minimized.

Copy link

codecov-io commented Mar 25, 2017

Codecov Report

Merging #1014 into master will increase coverage by 0.01%.
The diff coverage is 92.85%.

@@            Coverage Diff            @@
##           master   #1014      +/-   ##
=========================================
+ Coverage   77.59%   77.6%   +0.01%     
=========================================
  Files         151     152       +1     
  Lines       25779   25787       +8     
=========================================
+ Hits        20002   20012      +10     
+ Misses       5777    5775       -2
Impacted Files Coverage Δ
src/common/libutil/cleanup.c 97.36% <100%> (+8.08%) ⬆️
src/common/libutil/unlink_recursive.c 92.3% <92.3%> (ø)
src/common/libflux/request.c 87.67% <0%> (-1.37%) ⬇️
src/common/libflux/tagpool.c 95.12% <0%> (-1.22%) ⬇️
src/common/libflux/message.c 83.64% <0%> (+0.12%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 02746a8...a6f36e5. Read the comment docs.

libutil/cleanup: export cleanup_run()
Export cleanup handler normally called from atexit()
to allow it to be used manually when _exit() is called.

@garlick garlick force-pushed the garlick:atexit_race branch from d7bad53 to 501ff3f Mar 26, 2017

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Mar 26, 2017

After you rebase this one on master, I want to doublecheck coverage. The current report strangely shows some unrelated functions are now missed (e.g. in aggregator and cron), which can sometimes mean testsuite wasn't fully run, or perhaps just a transient problem with codecov.

@coveralls

This comment has been minimized.

Copy link

coveralls commented Mar 26, 2017

Coverage Status

Coverage decreased (-0.4%) to 77.498% when pulling 501ff3f on garlick:atexit_race into 02746a8 on flux-framework:master.

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Mar 26, 2017

I've rebased for what it's worth. I'm looking too...

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Mar 26, 2017

Looking through the coverage travis run, I do see a fair number of warnings that the broker exited through the SIGALRM timeout. Hmm, if gcov collects its results using an atexit() function, then maybe it fails to collect results when the SIGALRM handler calls _exit()?

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Mar 26, 2017

Looking through the coverage travis run, I do see a fair number of warnings that the broker exited through the SIGALRM timeout. Hmm, if gcov collects its results using an atexit() function, then maybe it fails to collect results when the SIGALRM handler calls _exit()?

That is a very good theory... Hm, I don't have any good ideas at the moment.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Mar 26, 2017

Actually there might be a way to manually flush gcov data, a quick search turned up this starting point:

https://osadl.org/Dumping-gcov-data-at-runtime-simple-ex.online-coverage-analysis.0.html

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Mar 26, 2017

Yeah, I poked at calling __gcov_dump() in the signal handler, but the function isn't defined if not building with gcov and I don't see a preprocessor symbol that would allow it to be easily compiled conditionally. I guess I could add one.

I think in addition to that, I need to figure out how to reduce the SIGALRM timeouts, since that indicates that (probably) the module unloading in rc3 was cut short because the shutdown grace is too tight for the code when it's running under instrumentation. That also could affect coverage and since it's a timing thing, might contribute to coverage instability. The kludgey way to work around this would be to conditionally increase the timeouts, but a much better way would be to fix #1006...

@coveralls

This comment has been minimized.

Copy link

coveralls commented Mar 27, 2017

Coverage Status

Coverage decreased (-0.3%) to 77.594% when pulling 2ec9cb7 on garlick:atexit_race into 02746a8 on flux-framework:master.

garlick added some commits Mar 22, 2017

libutil/unlink_recursive: factor out of cleanup.c
Move unlink_recursive() to its own C module so that it can be
tested and used independently.
libutil/unlink_recursive: drop O_PATH
Problem: unlink_recursive won't descend more than 1
directory level deep.

O_PATH appears to have been misued here, as it prevents
the use of unlinkat using the file descriptor obtained
with it.
test/unlink_recursive: add new unit test
Problem: unlink_recursive() has no test coverage.

Add a small TAP test for unlink_recursive().
test/cleanup: add new unit test
Problem: libutil/cleanup has no test coverage.

Add basic unit test for cleanup.

@garlick garlick force-pushed the garlick:atexit_race branch from 486ba8a to a6f36e5 Mar 28, 2017

@garlick garlick changed the title avoid calling libczmq zsys_shutdown() in SIGLARM handler add unit tests for cleanup functions Mar 28, 2017

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Mar 28, 2017

I've got another PR going now which gets rid of the signal handler entirely, so this PR is being "demoted" to one that just adds tests for libutil/cleanup and libutil/unlink_recursive.

@coveralls

This comment has been minimized.

Copy link

coveralls commented Mar 28, 2017

Coverage Status

Coverage increased (+0.01%) to 77.916% when pulling a6f36e5 on garlick:atexit_race into 02746a8 on flux-framework:master.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Mar 28, 2017

Ok, seems straightforward. Ready to merge?

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Mar 28, 2017

Yep!

@grondo grondo merged commit de54940 into flux-framework:master Mar 28, 2017

4 checks passed

codecov/patch 92.85% of diff hit (target 77.59%)
Details
codecov/project 77.6% (+0.01%) compared to 02746a8
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
coverage/coveralls Coverage increased (+0.01%) to 77.916%
Details

@garlick garlick deleted the garlick:atexit_race branch Mar 28, 2017

@grondo grondo referenced this pull request Mar 28, 2017

Closed

0.7.0 Release Notes #1019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.