Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

avoid deadlock when unloading connector-local module #1027

Merged
merged 3 commits into from Apr 3, 2017

Conversation

Projects
None yet
4 participants
@garlick
Copy link
Member

garlick commented Apr 3, 2017

This PR fixes issue #1025, in which a sched test fails due to a flux-start timeout during post-reactor shutdown. It turns out there is a pretty obvious deadlock potential in the connector-local module.

The connector-local module is removed manually as a special case after reactor shutdown, thus it cannot make any RPCs as part of its shutdown path since they cannot be answered. However, it does attempt to call flux_event_unsubscribe() during shutdown if any clients are still connected at this point, and have active subscriptions.

The fix is to disable these unsubscribes in the tear-down path. Their state gets removed anyway when the module context is destroyed.

Add a regression test for this case.

Also, the back-to-back instances run in t0001-basic.t take a really long time due to #1006. I added a workaround to speed up the test at the expense of likely short circuiting teardown. Teardown gets plenty of coverage later on - there's no need to make ourselves suffer unduly in this one test.

test/t0001-basic.t: speed up test
Problem: t0001-basic.t takes a long time to run.

This test includes many back to back instance invocations,
which due to #1006, each run out the default "shutdown-grace"
timeout, which for size < 16 is 1.0s.

Speed things up by shortening the grace period to 0.1s.
This means module unloading in rc3 will certainly be interrupted,
however, that should not affect the outcome of any of the tests
in this script.
@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Apr 3, 2017

Nice. Thanks for the reproducer! Could the reproducer get a reduced grace period too, or does that not make much sense here?

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Apr 3, 2017

Ooops, good idea. Let me fix that.

@garlick garlick force-pushed the garlick:issue_1025 branch from f32799a to fee396a Apr 3, 2017

@codecov-io

This comment has been minimized.

Copy link

codecov-io commented Apr 3, 2017

Codecov Report

Merging #1027 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #1027      +/-   ##
==========================================
+ Coverage   77.87%   77.87%   +<.01%     
==========================================
  Files         150      150              
  Lines       25491    25494       +3     
==========================================
+ Hits        19850    19854       +4     
+ Misses       5641     5640       -1
Impacted Files Coverage Δ
src/modules/connector-local/local.c 70.14% <100%> (+0.18%) ⬆️
src/common/libflux/message.c 80.77% <0%> (-0.13%) ⬇️
src/common/libflux/rpc.c 91.17% <0%> (+0.73%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a028c54...fee396a. Read the comment docs.

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Apr 3, 2017

I added the reduced grace period and fixed a typo in the commit message for the fix, then rebased.

garlick added some commits Apr 3, 2017

modules/connector-local: avoid unload deadlock
Problem: if connector-local module is unloaded while a client is
still connected that has active event subscriptions, the
connector-local teardown will call flux_event_unsusbscribe(),
but in most cases connector-local is torn down as a special
case after the broker reactor has terminated, so this will deadlock.

Avoid calling flux_event_unsusbscribe() when this module is
unloaded.  Its subscriptions are part of the module context that
will be destroyed when the module unloads anyway, thus unsubscribing
to all subscribed-to events serves more of an aesthetic purpose
than a practical one.

Fixes #1025
@coveralls

This comment has been minimized.

Copy link

coveralls commented Apr 3, 2017

Coverage Status

Coverage increased (+0.01%) to 78.196% when pulling f32799a on garlick:issue_1025 into a028c54 on flux-framework:master.

@coveralls

This comment has been minimized.

Copy link

coveralls commented Apr 3, 2017

Coverage Status

Coverage increased (+0.006%) to 78.188% when pulling fee396a on garlick:issue_1025 into a028c54 on flux-framework:master.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Apr 3, 2017

Great. Ready to merge?

Do you think this warrants a 0.7.1?

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Apr 3, 2017

OK to merge.

Not sure on the tag. Maybe we can work around this bug in sched so that its not required in order to get sched tagged? What do you think?

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Apr 3, 2017

Not sure on the tag. Maybe we can work around this bug in sched so that its not required in order to get sched tagged? What do you think?

That seems reasonable.

Once this hits flux-core/master sched will at least be able to pass travis -- however, we would need to make a workaround for the failing test before building and testing flux-core against 0.7.0 specifically. I was also worried this issue might hit people experimenting with and developing on flux on llnl clusters, since it may be a long time until we tag and get an 0.8.0 installed out there. (Hope that makes sense)

@grondo grondo merged commit ac055a9 into flux-framework:master Apr 3, 2017

4 checks passed

codecov/patch 100% of diff hit (target 77.87%)
Details
codecov/project 77.87% (+<.01%) compared to a028c54
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
coverage/coveralls Coverage increased (+0.006%) to 78.188%
Details

@garlick garlick referenced this pull request Apr 3, 2017

Merged

avoid flux-core bug 1027 #244

@grondo grondo referenced this pull request Aug 23, 2017

Closed

0.8.0 Release #1160

@garlick garlick deleted the garlick:issue_1025 branch Sep 6, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.