
imagebuildah: fix an attempt to write to a nil map #3533

Merged
merged 1 commit into containers:main from nalind:mid-failure on Sep 23, 2021

Conversation

nalind
Member

@nalind nalind commented Sep 22, 2021

What type of PR is this?

/kind bug

What this PR does / why we need it:

If the build of a single stage fails, we break out of the loop that iterates through all of the stages in its own goroutine, and start cleaning up after the stages that have already completed.

Because the function that launched that goroutine also calls its cleanup function in non-error cases, the cleanup function sets the map it uses to track what needs to be cleaned up to nil after it finishes iterating through the map, so that we never try to clean up the same thing more than once.
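
For illustration, here is a rough Go sketch of that pattern; the identifiers (executor, terminatedStage, stageResource) are made up for the example and are not Buildah's actual names.

```go
// Rough sketch of the cleanup pattern described above; all names are
// hypothetical, not Buildah's actual identifiers.
package sketch

import (
	"fmt"
	"sync"
)

type stageResource struct{ name string }

func (r *stageResource) Delete() error { return nil }

type executor struct {
	lock sync.Mutex
	// terminatedStage records, per stage name, what still needs deleting.
	terminatedStage map[string]*stageResource
	// lastError remembers the first stage failure, if any.
	lastError error
}

// cleanup deletes everything recorded so far, then clears the map so that a
// second call finds nothing left to do.  Once the map is nil, any goroutine
// that is still running and writes to it panics with
// "assignment to entry in nil map", which is the crash this PR fixes.
func (e *executor) cleanup() {
	e.lock.Lock()
	defer e.lock.Unlock()
	for name, res := range e.terminatedStage {
		if err := res.Delete(); err != nil {
			fmt.Printf("error deleting stage %q: %v\n", name, err)
		}
	}
	e.terminatedStage = nil
}
```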

Because the loop that iterates through all of the stages runs in its own goroutine, it doesn't stop when the function that started it returns on error, so it would keep attempting to build subsequent stages. Have it check whether the map variable has already been cleared, or whether one of the stages it has already run returned an error, and stop if so. If the function it calls to build a stage, which takes the map variable as a parameter, is already running at that point, it will have a non-nil map, so it won't crash, but it might not be cleaned up correctly, either.
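
Continuing the same hypothetical sketch (same package and types as above), the check the stage-walking goroutine performs before starting each stage might look roughly like this:

```go
// Continuing the sketch: before each stage, stop if the caller has already
// cleaned up (terminatedStage is nil) or an earlier stage failed.
type stage struct{ name string }

func (e *executor) buildStages(stages []stage, results chan<- error) {
	for _, s := range stages {
		e.lock.Lock()
		cancel := e.terminatedStage == nil || e.lastError != nil
		e.lock.Unlock()
		if cancel {
			// The parent already cleaned up, or a stage failed: stop here.
			break
		}
		// buildStage takes the map as a parameter, so if it was already
		// running when cleanup happened it still holds a non-nil map and
		// won't crash, but whatever it records may never be cleaned up.
		err := e.buildStage(s, e.terminatedStage)
		e.lock.Lock()
		if err != nil && e.lastError == nil {
			e.lastError = err
		}
		e.lock.Unlock()
		results <- err
	}
}

func (e *executor) buildStage(s stage, resources map[string]*stageResource) error {
	// ... build the stage, recording anything that needs cleanup in resources ...
	return nil
}
```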

If such a stage finishes, either successfully or with an error, the goroutine would try to pass the result back to the goroutine that spawned it over a channel that was no longer being read from, and it would stall there, never releasing the jobs semaphore. Because we started sharing that semaphore across multiple-platform builds, the builds for the other platforms would stall as well, and the whole build would hang. Make the results channel a buffered channel so that the send doesn't block there.
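
The stall and the fix can also be sketched in isolation; the names are again hypothetical, and golang.org/x/sync/semaphore stands in here for the shared jobs semaphore:

```go
// Standalone sketch of the stall and the fix; names are made up for the
// example, and the capacity choice (one slot per stage) is just one way to
// guarantee that sends never block.
package sketch

import (
	"context"

	"golang.org/x/sync/semaphore"
)

type stageResult struct {
	index int
	err   error
}

// runStage holds one slot of the shared jobs semaphore while it builds a
// stage, then reports its result.  If ch is unbuffered and the parent
// goroutine has already returned (so nothing is receiving), the send blocks
// forever, the deferred Release never runs, and builds for the other
// platforms are starved of semaphore slots.
func runStage(ctx context.Context, jobs *semaphore.Weighted, index int, ch chan<- stageResult) error {
	if err := jobs.Acquire(ctx, 1); err != nil {
		return err
	}
	defer jobs.Release(1)
	// ... build the stage ...
	ch <- stageResult{index: index} // guaranteed not to block only if ch has spare capacity
	return nil
}

// newResultsChannel gives the channel one slot per stage, so every send can
// complete even if the receiver has stopped reading.
func newResultsChannel(numStages int) chan stageResult {
	return make(chan stageResult, numStages)
}
```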

How to verify it

New integration test!

Which issue(s) this PR fixes:

None

Special notes for your reviewer:

Does this PR introduce a user-facing change?

@openshift-ci openshift-ci bot added kind/bug (Categorizes issue or PR as related to a bug.) and approved labels Sep 22, 2021
Member

@vrothberg vrothberg left a comment

LGTM

@rhatdan
Member

rhatdan commented Sep 23, 2021

/approve
/lgtm

@openshift-ci
Contributor

openshift-ci bot commented Sep 23, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nalind, rhatdan

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 018e6f1 into containers:main Sep 23, 2021
@nalind nalind deleted the mid-failure branch September 23, 2021 11:59
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 15, 2023
Labels
approved, kind/bug (Categorizes issue or PR as related to a bug.), lgtm, locked - please file new issue/PR

Projects
None yet

Development
Successfully merging this pull request may close these issues: None yet

4 participants