Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libsubprocess: demote assert to warning #5959

Merged
merged 1 commit into from
May 16, 2024

Conversation

chu11
Copy link
Member

@chu11 chu11 commented May 10, 2024

Problem: A recent crash showed that it is possible for a process to not be in the EXITED state when
subprocess_check_completed() is called. See issue

Solution: To temporarily remove the possibility for broker crashes, demote the assert to an error message.

@grondo
Copy link
Contributor

grondo commented May 10, 2024

My plan is to put this patch into an RPM patch and make it available as v0.62.0-2. We can then continue to debug the issue and decide if this is the correct workaround for the next release.

/* we're also waiting for the "complete" to come from the remote end */
if (!p->local && !p->remote_completed)
return;
if (p->state == FLUX_SUBPROCESS_EXITED) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue # is blank in commit message.

It would be easier to view the diff if the assert() were changed to:

    if (p->state != FLUX_SUBPROCESS_EXITED) {
        log_err ("subprocess_check_completed: unexpected state %s",
                 flux_subprocess_state_string (p->state));
        return;
    }

However, that's just personal preference and perhaps it is more correct to do it the way is done here, so feel free to ignore that comment.

chu11 added a commit to chu11/flux-core that referenced this pull request May 10, 2024
Problem: A recent crash showed that it is possible for
a process to not be in the EXITED state when
subprocess_check_completed() is called.  See issue flux-framework#5959.

Solution: To temporarily remove the possibility for broker
crashes, demote the assert to an error message.
@chu11 chu11 force-pushed the issue5956_demote_assertion branch from e9504ee to cd6d5f3 Compare May 10, 2024 23:07
@chu11
Copy link
Member Author

chu11 commented May 10, 2024

re-pushed, going with that slightly tweak. I agree it's cleaner.

chu11 added a commit to chu11/flux-core that referenced this pull request May 10, 2024
Problem: A recent crash showed that it is possible for
a process to not be in the EXITED state when
subprocess_check_completed() is called.  See issue flux-framework#5959.

Solution: To temporarily remove the possibility for broker
crashes, demote the assert to an error message.
@chu11 chu11 force-pushed the issue5956_demote_assertion branch from cd6d5f3 to 403a954 Compare May 10, 2024 23:09
@grondo
Copy link
Contributor

grondo commented May 10, 2024

The commit references this PR not the broker crash issue (that may make sense, just making sure that's what you meant) Edit: That issue is #5956

@chu11
Copy link
Member Author

chu11 commented May 10, 2024

The commit references this PR not the broker crash issue (that may make sense, just making sure that's what you meant)

Oh crap! Cut & paste from wrong tab. Lemme fix.

@chu11 chu11 force-pushed the issue5956_demote_assertion branch from 403a954 to a29338c Compare May 10, 2024 23:15
@chu11
Copy link
Member Author

chu11 commented May 10, 2024

re-pushed fixing up the issue number in the commit message

Also, FWIW I contemplated adding a state check here instead, like:

static void remote_completion (flux_subprocess_t *p)
{
    p->remote_completed = true;
    /* TBON inorder delivery of messages should guarantee we received
     * FLUX_SUBPROCESS_EXITED before this.
     */
    if (p->state == FLUX_SUBPROCESS_EXITED)
        subprocess_check_completed (p);
}

I think this may be the more "correct" fix, but it assumes that code is racy and subprocess_check_completed() would be called elsewhere at appropriate points in time. Since I didn't know for sure what was happening, the proposed fix is the current "safer" one.

Copy link

codecov bot commented May 10, 2024

Codecov Report

Attention: Patch coverage is 33.33333% with 2 lines in your changes are missing coverage. Please review.

Project coverage is 83.29%. Comparing base (fb85782) to head (a29338c).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5959      +/-   ##
==========================================
- Coverage   83.62%   83.29%   -0.34%     
==========================================
  Files         506      515       +9     
  Lines       81531    83398    +1867     
==========================================
+ Hits        68184    69469    +1285     
- Misses      13347    13929     +582     
Files Coverage Δ
src/common/libsubprocess/subprocess.c 87.78% <33.33%> (-0.95%) ⬇️

... and 176 files with indirect coverage changes

@garlick
Copy link
Member

garlick commented May 11, 2024

Would it be better to change the assertion into a subprocess failure, e.g. that ends up calling ops->on_state_change (FLUX_SUBPROCESS_FAILED)? The other two proposed solutions seem like they could leave the subprocess object in a stuck state if no more responses are going to come from the server to move things forward?

@grondo
Copy link
Contributor

grondo commented May 11, 2024

My worry was that type of solution might cause a subsequent crash (because we can't reproduce the issue and thus can't really test a fix, and we don't know what would happen next) We're trying put a bandaid here to prevent a crash and (hopefully) not just cause a different one.

FYI, an RPM with this patch applied has already been generated to give to the admins to install if the crash recurs.

@garlick
Copy link
Member

garlick commented May 11, 2024

Reasonable! I'm fine with merging this so we have it in the history and following on with more work to harden the client against server misbehavior.

@chu11
Copy link
Member Author

chu11 commented May 11, 2024

Would it be better to change the assertion into a subprocess failure, e.g. that ends up calling ops->on_state_change (FLUX_SUBPROCESS_FAILED)?

As Mark said, this was intentionally not right to just avoid the crash. Thinking about it a tad more this morning, perhaps a setting EPROTO and "goto error" in the calling function might be a good idea if we didn't see state EXIT/FAILED before getting ENODATA.

@garlick
Copy link
Member

garlick commented May 16, 2024

Since this change was deployed in the patched rpm 0.62.0-2, do we want to merge this PR so that patch exists in the history?

The "right" fix could be a follow-on PR.

@grondo
Copy link
Contributor

grondo commented May 16, 2024

Good thought. Works for me!

Copy link
Member

@garlick garlick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK then!

Problem: A recent crash showed that it is possible for
a process to not be in the EXITED state when
subprocess_check_completed() is called.  See issue flux-framework#5956.

Solution: To temporarily remove the possibility for broker
crashes, demote the assert to an error message.
@chu11 chu11 force-pushed the issue5956_demote_assertion branch from a29338c to 58e9ade Compare May 16, 2024 18:01
@mergify mergify bot merged commit a3cfb3b into flux-framework:master May 16, 2024
33 checks passed
@chu11 chu11 deleted the issue5956_demote_assertion branch May 16, 2024 19:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants