Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race condition #3212

Merged
merged 1 commit into from May 10, 2021
Merged

Fix race condition #3212

merged 1 commit into from May 10, 2021

Conversation

rhatdan
Copy link
Member

@rhatdan rhatdan commented May 7, 2021

Ed has found situations where the container exits, before we can check
the state causing a failure, where I think we can complete successfully.

Fixes: https://github.com/containers/buildah/issues/3113

[NO TESTS NEEDED] since I have no way to generate this race condition.

Signed-off-by: Daniel J Walsh dwalsh@redhat.com

What type of PR is this?

/kind api-change
/kind bug
/kind cleanup
/kind deprecation
/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake
/kind other

What this PR does / why we need it:

How to verify it

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?


@openshift-ci-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rhatdan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@rhatdan
Copy link
Member Author

rhatdan commented May 7, 2021

@nalind PTAL, does this make sense to you?

@nalind
Copy link
Member

nalind commented May 7, 2021

Have we gotten errors that match it ("no such process")? I'd expect that perhaps if the runtime's `state' command exited out from under us and was reaped by someone else, though the waitpid(2) man page suggests we'd get an ECHILD ("no child processes") for cases like that.
The errors in #3113 look like they're being relayed by us when printed by the runtime, which is exiting with status 1, so I don't think it'd hit the code that's being added. Browsing the library sources, the "exit status %d" text looks more like what we'd get from formatting an os.ExitError for display.

@rhatdan rhatdan changed the title Fix race condition [WIP] Fix race condition May 8, 2021
@rhatdan rhatdan changed the title [WIP] Fix race condition Fix race condition May 9, 2021
@rhatdan
Copy link
Member Author

rhatdan commented May 9, 2021

@nalind @giuseppe @edsantiago PTAL. I am now setting the stopped flag when pid1 of the container exits. Would this solve the problem?

@edsantiago
Copy link
Collaborator

LGTM in principle. You'll need to re-push to get [NO TESTS NEEDED] to trigger.

Ed has found situations where the container exits, before we can check
the state causing a failure, where I think we can complete successfully.

Fixes: https://github.com/containers/buildah/issues/3113

[NO TESTS NEEDED] since I have no way to generate this race condition.

Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
@giuseppe
Copy link
Member

I'd wait to see if the fix in crun addresses the race we've seen. It might take some time between sending the KILL signal and the process being really terminated by the kernel, so I am afraid this change introduces a new race condition.

@rhatdan
Copy link
Member Author

rhatdan commented May 10, 2021

@giuseppe I don't see this as being any worse then what we currently have, and quite possibly better

@giuseppe
Copy link
Member

the issue could be in the sequence:

kill -9 $CONTAINER_PID; $RUNTIME delete $CONTAINER

The container could still be alive by the time we call $RUNTIME delete as the KILL signal was not handled yet. In this case the delete would fail.

At least runc and crun wait for the process to exit before the kill operation is terminated, so in this case we are sure we are not introducing a new race condition, but in general a OCI runtime could handle the kill command in a different way and don't wait that the container terminates

Copy link
Member

@giuseppe giuseppe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry nevermind my last comment, I got confused.

LGTM

@openshift-ci
Copy link
Contributor

openshift-ci bot commented May 10, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: giuseppe, rhatdan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@rhatdan rhatdan added the lgtm label May 10, 2021
@openshift-merge-robot openshift-merge-robot merged commit 135d63d into containers:master May 10, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 20, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants