
imagebuildah: fix an attempt to write to a nil map #3533

Merged
merged 1 commit into containers:main from nalind:mid-failure on Sep 23, 2021

Conversation

nalind
Member

@nalind nalind commented Sep 22, 2021

What type of PR is this?

/kind bug

What this PR does / why we need it:

If the build of a single stage fails, we break out of the loop that iterates through all of the stages in its own goroutine, and start cleaning up after the stages that have already completed.

Because the function that launched that goroutine also calls its cleanup function in non-error cases, the cleanup function sets the map it uses to track what needs to be cleaned up to nil after it finishes iterating through the map, so that we never try to clean up the same thing more than once.
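
For illustration, here is a rough Go sketch of that pattern; the identifiers (executor, terminatedStage, stageResource) are made up for the example and are not Buildah's actual names.

```go
// Rough sketch of the cleanup pattern described above; all names are
// hypothetical, not Buildah's actual identifiers.
package sketch

import (
	"fmt"
	"sync"
)

type stageResource struct{ name string }

func (r *stageResource) Delete() error { return nil }

type executor struct {
	lock sync.Mutex
	// terminatedStage records, per stage name, what still needs deleting.
	terminatedStage map[string]*stageResource
	// lastError remembers the first stage failure, if any.
	lastError error
}

// cleanup deletes everything recorded so far, then clears the map so that a
// second call finds nothing left to do.  Once the map is nil, any goroutine
// that is still running and writes to it panics with
// "assignment to entry in nil map", which is the crash this PR fixes.
func (e *executor) cleanup() {
	e.lock.Lock()
	defer e.lock.Unlock()
	for name, res := range e.terminatedStage {
		if err := res.Delete(); err != nil {
			fmt.Printf("error deleting stage %q: %v\n", name, err)
		}
	}
	e.terminatedStage = nil
}
```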

Because the loop that iterates through all of the stages runs in its own goroutine, it doesn't stop when the function that started it returns on error, so it would keep attempting to build subsequent stages. Have it check whether the map variable has already been cleared, or whether one of the stages it has already run returned an error, and stop if so. If the function it calls to build a stage, which takes the map variable as a parameter, is already running at that point, it will have a non-nil map, so it won't crash, but it might not be cleaned up correctly, either.
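
Continuing the same hypothetical sketch (same package and types as above), the check the stage-walking goroutine performs before starting each stage might look roughly like this:

```go
// Continuing the sketch: before each stage, stop if the caller has already
// cleaned up (terminatedStage is nil) or an earlier stage failed.
type stage struct{ name string }

func (e *executor) buildStages(stages []stage, results chan<- error) {
	for _, s := range stages {
		e.lock.Lock()
		cancel := e.terminatedStage == nil || e.lastError != nil
		e.lock.Unlock()
		if cancel {
			// The parent already cleaned up, or a stage failed: stop here.
			break
		}
		// buildStage takes the map as a parameter, so if it was already
		// running when cleanup happened it still holds a non-nil map and
		// won't crash, but whatever it records may never be cleaned up.
		err := e.buildStage(s, e.terminatedStage)
		e.lock.Lock()
		if err != nil && e.lastError == nil {
			e.lastError = err
		}
		e.lock.Unlock()
		results <- err
	}
}

func (e *executor) buildStage(s stage, resources map[string]*stageResource) error {
	// ... build the stage, recording anything that needs cleanup in resources ...
	return nil
}
```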

If such a stage finishes, either successfully or with an error, the goroutine would try to pass the result back to the goroutine that spawned it over a channel that was no longer being read from, and it would stall there, never releasing the jobs semaphore. Because we started sharing that semaphore across multiple-platform builds, the builds for the other platforms would stall as well, and the whole build would hang. Make the results channel a buffered channel so that the send doesn't block there.
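
The stall and the fix can also be sketched in isolation; the names are again hypothetical, and golang.org/x/sync/semaphore stands in here for the shared jobs semaphore:

```go
// Standalone sketch of the stall and the fix; names are made up for the
// example, and the capacity choice (one slot per stage) is just one way to
// guarantee that sends never block.
package sketch

import (
	"context"

	"golang.org/x/sync/semaphore"
)

type stageResult struct {
	index int
	err   error
}

// runStage holds one slot of the shared jobs semaphore while it builds a
// stage, then reports its result.  If ch is unbuffered and the parent
// goroutine has already returned (so nothing is receiving), the send blocks
// forever, the deferred Release never runs, and builds for the other
// platforms are starved of semaphore slots.
func runStage(ctx context.Context, jobs *semaphore.Weighted, index int, ch chan<- stageResult) error {
	if err := jobs.Acquire(ctx, 1); err != nil {
		return err
	}
	defer jobs.Release(1)
	// ... build the stage ...
	ch <- stageResult{index: index} // guaranteed not to block only if ch has spare capacity
	return nil
}

// newResultsChannel gives the channel one slot per stage, so every send can
// complete even if the receiver has stopped reading.
func newResultsChannel(numStages int) chan stageResult {
	return make(chan stageResult, numStages)
}
```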

How to verify it

New integration test!

Which issue(s) this PR fixes:

None

Special notes for your reviewer:

Does this PR introduce a user-facing change?

@openshift-ci openshift-ci bot added kind/bug (Categorizes issue or PR as related to a bug.) and approved labels Sep 22, 2021
Member

@vrothberg vrothberg left a comment

LGTM

@rhatdan
Member

rhatdan commented Sep 23, 2021

/approve
/lgtm

@openshift-ci
Contributor

openshift-ci bot commented Sep 23, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nalind, rhatdan

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 018e6f1 into containers:main Sep 23, 2021
@nalind nalind deleted the mid-failure branch September 23, 2021 11:59
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 15, 2023
Labels
approved, kind/bug (Categorizes issue or PR as related to a bug.), lgtm, locked - please file new issue/PR

Projects
None yet

Development
Successfully merging this pull request may close these issues: None yet

4 participants