Skip to content
This repository has been archived by the owner on Oct 2, 2023. It is now read-only.

Add option to push images sequentially in push-all #1658

Merged
merged 1 commit into from
Nov 3, 2020

Conversation

imjasonh
Copy link
Contributor

@imjasonh imjasonh commented Oct 21, 2020

There seems to be a bug with pushing many images (50+) related to the
async+wait behavior previously used, and this new option will give users
who have lots of images to push a safer option.

Fixes #525

@imjasonh
Copy link
Contributor Author

cc @nlopezgi @smukherj1

@eytankidron
Copy link
Collaborator

I am worried that this change might negatively affect existing use cases (i.e. make them slower).
I'd like to suggest one of two alternate approaches:
(1) A combination of sequential and parallel: Group the images into groups of, say, 50, and push the images within a group in parallel. The groups will be pushed sequentially.
(2) Add an optional sequential boolean flag to container_push and have it default to false.

@imjasonh
Copy link
Contributor Author

I am worried that this change might negatively affect existing use cases (i.e. make them slower).
I'd like to suggest one of two alternate approaches:
(1) A combination of sequential and parallel: Group the images into groups of, say, 50, and push the images within a group in parallel. The groups will be pushed sequentially.
(2) Add an optional sequential boolean flag to container_push and have it default to false.

That's a reasonable concern. In my experience, pushing 178 relatively small images sequentially, from source uploaded to and built in GCB to GCR, took only a minute or two longer than pushing in parallel.

I talked with @jonjohnsonjr and he mentioned it could actually be somewhat faster to push sequentially, since blob existence checks could get more hits. This is especially true if the images tend to share layers, which in my case they do, and even moreso if the layers are large (mine generally aren't).

Weigh this against the not-entirely-well-known issue that manifests when "too many" (How many? Not empirically known) images are pushed, caused by the async function and wait $PID. Ideally we'd not rely on bash for async fan-out/fan-in behavior, since it's not well suited for that. Rewriting parallel pushing efficiently in Go seems useful.

@eytankidron
Copy link
Collaborator

You might be right about sequential pushes being faster in some cases. That sounds reasonable, but honestly, I don't know, and I'd rather not make this change which might affect existing users in unforeseeable ways.
Just to be on the safe side, we can go with the sequential (or parallel) flag for now.
I see that you started adding something like that in 5e7547e but I don't see it in this PR. Did you revert that commit?

@imjasonh
Copy link
Contributor Author

You might be right about sequential pushes being faster in some cases. That sounds reasonable, but honestly, I don't know, and I'd rather not make this change which might affect existing users in unforeseeable ways.
Just to be on the safe side, we can go with the sequential (or parallel) flag for now.
I see that you started adding something like that in 5e7547e but I don't see it in this PR. Did you revert that commit?

I did revert the commit, since I wanted to plead my case for sequential-by-default first.

Existing users are already affected in unforeseeable ways -- at some point they can add too many images, and push-all will fail in mysterious and hard to debug ways. This change removes that unspecified hard-to-debug behavior at the cost of some latency.

Some users might be negatively affected by that increase in latency, but push-all never specified parallel pushes to users, and since it's turning out to be unsafe, I think it's safe to do from a user-surprise perspective.

I'm separately working on some changes in go-containerregistry to efficiently (and safely!) push multiple images in parallel, and if that lands I expect it could be very useful in this scenario.

@eytankidron
Copy link
Collaborator

eytankidron commented Oct 23, 2020

I'm honestly not sure how most users use push-all. Do they mostly use it to push <50 images and may be negatively affected by slowness? Maybe, maybe not.
Having said that, you do have a point that being broken in a way that's hard to debug is a scenario we want to avoid.

So how about adding a sequential/parallel flag and have it default to sequential. Then if anyone notices a hit in performance, they can always switch back to parallel.

Additionally, once your changes to container registry land, we can switch to using that.

@imjasonh
Copy link
Contributor Author

The main reason I don't want to add a flag, whatever its default, is that we fully expect to remove it in the future, and that seems like a hairy breaking change to manage. Meanwhile I just want to successfully push my 100+ images using Bazel 😄

If the behavior of push-all remains "unspecified", we have flexibility to say that, sure, it was in parallel, but now it's sequential for safety. If a user is unsatsified with the additional latency after this change (which we're not even sure they'll notice), they can use an older version of rules_docker until some future change supports safe parallel pushing without unspecified random failures.

@eytankidron
Copy link
Collaborator

I'm okay with paying the price of the eventual deprecation of the sequential/parallel flag. I don't think it needs to be a breaking change. The way I see it, once the efficient and correct parallel push is rolled out, we just declare this flag as deprecated and allow users to set whatever value for this flag, while documenting that we are now ignoring it.

I think this is a reasonable price to pay for the ability to have users be able to choose the best solution that works for them until this better push lands. I am saying this with the full understanding that we have no way of knowing (nor do we plan to try and figure out) what the best solution is for users or what the users think the best solution for them is. I just don't want to be in a situation where a customer wants to go back to the the old way of pushing but for some other reason can't use the older version.

@imjasonh
Copy link
Contributor Author

imjasonh commented Nov 2, 2020

I'm okay with paying the price of the eventual deprecation of the sequential/parallel flag. I don't think it needs to be a breaking change. The way I see it, once the efficient and correct parallel push is rolled out, we just declare this flag as deprecated and allow users to set whatever value for this flag, while documenting that we are now ignoring it.

I think this is a reasonable price to pay for the ability to have users be able to choose the best solution that works for them until this better push lands. I am saying this with the full understanding that we have no way of knowing (nor do we plan to try and figure out) what the best solution is for users or what the users think the best solution for them is. I just don't want to be in a situation where a customer wants to go back to the the old way of pushing but for some other reason can't use the older version.

If you're comfortable managing the removal of the flag that's fine with me. I'll ping this PR when I have the sequential=True flag landed (the default will remain False to maintain current unsafe parallel push behavior)

FYI I just landed remote.MultiWrite in go-containerregistry which a tool in this repo could leverage to make fast and safe parallel uploads a reality.

@imjasonh imjasonh force-pushed the push-all branch 2 times, most recently from 153d199 to 1ce0288 Compare November 3, 2020 02:37
@imjasonh imjasonh changed the title Push images sequentially in push-all Add option to push images sequentially in push-all Nov 3, 2020
@imjasonh
Copy link
Contributor Author

imjasonh commented Nov 3, 2020

Confirmed that this PR works to push distroless images sequentially, and that removing sequential = False successfully pushed images, then resulted in the error we saw before.

GoogleContainerTools/distroless#624

Copy link
Collaborator

@eytankidron eytankidron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this Jason.

@eytankidron
Copy link
Collaborator

Looks like buidifier is failing because of lexicographical order.

contrib/push-all.bzl:72: unsorted-dict-items: Dictionary items are out of their lexicographical order.
contrib/push-all.bzl:74: unsorted-dict-items: Dictionary items are out of their lexicographical order.

There seems to be a bug with pushing many images (50+) related to the
async+wait behavior previously used, and this new option will give users
who have lots of images to push a safer option.
@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: eytankidron, ImJasonH

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@eytankidron
Copy link
Collaborator

/gcbrun

@eytankidron eytankidron merged commit feaaebd into bazelbuild:master Nov 3, 2020
@wyattanderson
Copy link

FYI I just landed remote.MultiWrite in go-containerregistry which a tool in this repo could leverage to make fast and safe parallel uploads a reality.

Is there interest in migrating push behavior into Go code from Bash? We're hitting some rate limits with AWS when pushing hundreds of images in parallel, and we may be able to remedy that by pushing sequentially, but if there's interest/alignment in moving the parallelism into a Go binary, I could explore the concept in a pull request.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Failed while running push-all
5 participants