Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

job-info: Fix racy cancel #2341

Merged
merged 2 commits into from Aug 27, 2019

Conversation

@chu11
Copy link
Contributor

commented Aug 27, 2019

Per discussion in #2331, I think there may be a race where a user attempts to cancel the watch, but as the cancellation is in flight, the namespace is deleted. Internally job-info gets a ENOTSUP, and therefore will never send ENODATA to the caller.

Labeling WIP for now - as we'll want to run this through travis many times to see if the issue hits.

@grondo

This comment has been minimized.

Copy link
Contributor

commented Aug 27, 2019

Heh, you are doing too much work in libsubprocess. Prefix label should be "job-info:" ?

@chu11 chu11 changed the title [WIP] libsubprocess: Fix racy cancel [WIP] job-info: Fix racy cancel Aug 27, 2019
@codecov-io

This comment has been minimized.

Copy link

commented Aug 27, 2019

Codecov Report

Merging #2341 into master will increase coverage by 0.03%.
The diff coverage is 0%.

@@            Coverage Diff             @@
##           master    #2341      +/-   ##
==========================================
+ Coverage   80.83%   80.86%   +0.03%     
==========================================
  Files         215      215              
  Lines       34240    34242       +2     
==========================================
+ Hits        27677    27689      +12     
+ Misses       6563     6553      -10
Impacted Files Coverage Δ
src/modules/job-info/guest_watch.c 71.59% <0%> (-0.42%) ⬇️
src/cmd/flux-job.c 85.84% <0%> (+0.3%) ⬆️
src/shell/shell.c 81.02% <0%> (+0.51%) ⬆️
src/broker/modservice.c 71.42% <0%> (+0.75%) ⬆️
src/modules/connector-local/local.c 74.56% <0%> (+1.15%) ⬆️
@chu11 chu11 force-pushed the chu11:issue2331_racy_cancel branch from 4b1a52a to 95f0dd6 Aug 27, 2019
@chu11 chu11 changed the title [WIP] job-info: Fix racy cancel job-info: Fix racy cancel Aug 27, 2019
@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Aug 27, 2019

I've run 8 times through travis I think, haven't reproduced issue #2331 (although I've gotten a few spurious valgrinds #2195). @grondo if you haven't hit either, I think we can declare victory. I'll let the travis run I just restarted be the last one.

Thanks for getting the debug info to set me on the right path. I thought for sure it was an issue in job-ingest.

@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Aug 27, 2019

got a ERROR: t1005-kvs-security.t - exited with status 137 (terminated by signal 9?), restarted that builder.

@grondo

This comment has been minimized.

Copy link
Contributor

commented Aug 27, 2019

This is ready now?

@grondo
grondo approved these changes Aug 27, 2019
On cancellation, code incorrectly went to the wrong goto statement.
@grondo

This comment has been minimized.

Copy link
Contributor

commented Aug 27, 2019

I approved this fix and set merge-when-passing. When you're happy with it you can remove the WIP prefix and mergify should take care of the rebase.

@chu11 chu11 force-pushed the chu11:issue2331_racy_cancel branch from 95f0dd6 to 1838a2f Aug 27, 2019
@grondo

This comment has been minimized.

Copy link
Contributor

commented Aug 27, 2019

Oh, my copy of the PR was stale, you already removed "WIP". So will merge when passing ;-)

Fix a race in which the guest eventlog watcher cancels just as the guest
namespace is being deleted.  There is a small chance the user will never
see a ENODATA response.

Fixes #2331
@chu11 chu11 force-pushed the chu11:issue2331_racy_cancel branch from 1838a2f to 5ca8443 Aug 27, 2019
@mergify mergify bot merged commit 0092161 into flux-framework:master Aug 27, 2019
2 checks passed
2 checks passed
Summary 1 rule matches
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@codecov-io

This comment has been minimized.

Copy link

commented Aug 27, 2019

Codecov Report

Merging #2341 into master will increase coverage by 0.01%.
The diff coverage is 0%.

@@            Coverage Diff             @@
##           master    #2341      +/-   ##
==========================================
+ Coverage   80.84%   80.85%   +0.01%     
==========================================
  Files         215      215              
  Lines       34252    34254       +2     
==========================================
+ Hits        27691    27697       +6     
+ Misses       6561     6557       -4
Impacted Files Coverage Δ
src/modules/job-info/guest_watch.c 71.59% <0%> (-0.42%) ⬇️
src/common/libflux/mrpc.c 87.79% <0%> (-1.19%) ⬇️
src/broker/module.c 74.94% <0%> (-0.24%) ⬇️
src/cmd/flux-module.c 83.96% <0%> (-0.24%) ⬇️
src/cmd/flux-job.c 85.84% <0%> (+0.3%) ⬆️
src/common/libutil/veb.c 98.95% <0%> (+0.52%) ⬆️
src/modules/connector-local/local.c 74.42% <0%> (+1.15%) ⬆️
@garlick garlick referenced this pull request Sep 30, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.