libsubprocess: demote assert to warning #5959
My plan is to put this patch into an RPM patch and make it available as v0.62.0-2. We can then continue to debug the issue and decide if this is the correct workaround for the next release.
    /* we're also waiting for the "complete" to come from the remote end */
    if (!p->local && !p->remote_completed)
        return;
    if (p->state == FLUX_SUBPROCESS_EXITED) {
Issue # is blank in commit message.
It would be easier to view the diff if the assert()
were changed to:
    if (p->state != FLUX_SUBPROCESS_EXITED) {
        log_err ("subprocess_check_completed: unexpected state %s",
                 flux_subprocess_state_string (p->state));
        return;
    }
However, that's just personal preference, and perhaps it is more correct to do it the way it is done here, so feel free to ignore that comment.
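For anyone reading along, the demotion under discussion can be sketched in isolation. This is a minimal standalone illustration, not the actual libsubprocess code: every sketch_* name, the struct layout, and the return value are assumptions made for the example, and fprintf stands in for log_err.

```c
#include <stdio.h>

/* Hypothetical stand-ins for the real subprocess types (not the Flux API). */
enum sketch_state { SKETCH_RUNNING, SKETCH_EXITED, SKETCH_FAILED };

struct sketch_proc {
    enum sketch_state state;
    int local;              /* nonzero if the process runs locally */
    int remote_completed;   /* nonzero once the remote end reported "complete" */
    int completed;          /* set when completion has been recorded */
};

const char *sketch_state_string (enum sketch_state s)
{
    switch (s) {
        case SKETCH_RUNNING: return "RUNNING";
        case SKETCH_EXITED:  return "EXITED";
        case SKETCH_FAILED:  return "FAILED";
    }
    return "UNKNOWN";
}

/* Returns 1 if the process was marked completed, 0 otherwise. */
int sketch_check_completed (struct sketch_proc *p)
{
    /* still waiting for "complete" from the remote end */
    if (!p->local && !p->remote_completed)
        return 0;
    /* This used to be an assert(); log and return instead of aborting. */
    if (p->state != SKETCH_EXITED) {
        fprintf (stderr,
                 "sketch_check_completed: unexpected state %s\n",
                 sketch_state_string (p->state));
        return 0;
    }
    p->completed = 1;
    return 1;
}
```

With the early return, an unexpected state leaves the process unmarked and produces a log line, where the assert() would have aborted the caller.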
Problem: A recent crash showed that it is possible for a process to not be in the EXITED state when subprocess_check_completed() is called. See issue flux-framework#5959. Solution: To temporarily remove the possibility for broker crashes, demote the assert to an error message.
e9504ee to cd6d5f3
re-pushed, going with that slight tweak. I agree it's cleaner.
cd6d5f3 to 403a954
The commit references this PR, not the broker crash issue (that may make sense, just making sure that's what you meant). Edit: That issue is #5956.
Oh crap! Cut & paste from wrong tab. Lemme fix.
403a954 to a29338c
re-pushed, fixing up the issue number in the commit message. Also, FWIW, I contemplated adding a state check here instead, like:
I think this may be the more "correct" fix, but it assumes that code is racy and
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #5959 +/- ##
==========================================
- Coverage 83.62% 83.29% -0.34%
==========================================
Files 506 515 +9
Lines 81531 83398 +1867
==========================================
+ Hits 68184 69469 +1285
- Misses 13347 13929 +582
Would it be better to change the assertion into a subprocess failure, e.g. that ends up calling
My worry was that this type of solution might cause a subsequent crash (because we can't reproduce the issue and thus can't really test a fix, and we don't know what would happen next). We're trying to put a bandaid here to prevent a crash and (hopefully) not just cause a different one. FYI, an RPM with this patch applied has already been generated to give to the admins to install if the crash recurs.
Reasonable! I'm fine with merging this so we have it in the history and following on with more work to harden the client against server misbehavior.
As Mark said, this was intentionally not the right fix, just a way to avoid the crash. Thinking about it a tad more this morning, perhaps setting EPROTO and doing a "goto error" in the calling function might be a good idea if we didn't see state EXITED/FAILED before getting ENODATA.
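A rough sketch of what that EPROTO idea could look like, purely for discussion. It is not taken from the Flux source: proto_state, proto_handle_stream_end, and the convention of passing the final response's errno as an argument are all assumptions made for the example.

```c
#include <errno.h>

/* Hypothetical state enum (not the Flux API). */
enum proto_state { PROTO_RUNNING, PROTO_EXITED, PROTO_FAILED };

/* Hypothetical handler for the end of the remote response stream.
 * stream_errno is the errno carried by the final response, with ENODATA
 * assumed to mean a normal end of stream.
 * Returns 0 on success, -1 with errno set on error.
 */
int proto_handle_stream_end (enum proto_state state, int stream_errno)
{
    if (stream_errno != ENODATA) {
        /* the stream ended with a real error; pass it through */
        errno = stream_errno;
        goto error;
    }
    /* The stream ended cleanly, but the server never reported a terminal
     * state: treat this as a protocol violation rather than asserting. */
    if (state != PROTO_EXITED && state != PROTO_FAILED) {
        errno = EPROTO;
        goto error;
    }
    return 0;
error:
    return -1;
}
```

The point of routing this through the error path is that the client reports a subprocess failure to its caller instead of relying on an invariant the server may not uphold.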
Since this change was deployed in the patched rpm 0.62.0-2, do we want to merge this PR so that patch exists in the history? The "right" fix could be a follow-on PR.
Good thought. Works for me!
OK then!
Problem: A recent crash showed that it is possible for a process to not be in the EXITED state when subprocess_check_completed() is called. See issue flux-framework#5956. Solution: To temporarily remove the possibility for broker crashes, demote the assert to an error message.
a29338c to 58e9ade
Problem: A recent crash showed that it is possible for a process to not be in the EXITED state when
subprocess_check_completed() is called. See issue
Solution: To temporarily remove the possibility for broker crashes, demote the assert to an error message.