Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remotes/docker: Only return "already exists" on push when the upload was sucessful #5276

Merged
merged 1 commit into from
Apr 7, 2021

Conversation

aaronlehmann
Copy link
Contributor

The (dockerPusher).Push method uses a StatusTracker to check if an
upload already happened, before repeating the upload. However, there is
no provision for failure handling. If a PUT request returns an error,
the StatusTracker will still see the upload as if it happened
successfully. Add a status boolean so that only successful uploads
short-circuit Push.

Signed-off-by: Aaron Lehmann alehmann@netflix.com

@k8s-ci-robot
Copy link

Hi @aaronlehmann. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@aaronlehmann
Copy link
Contributor Author

Somewhat related to this PR: Push uses doWithRetries, but because the body comes from a pipe that isn't seekable, an attempt to retry the upload will fail. I'm not sure what purpose doWithRetries serves in this case, since the architecture of Push doesn't seem like it really supports retries. It may make sense to redesign this (or stop attempting retries in Push). I ran into the issue in this PR because I originally wanted to adjust the retry logic for Push, but found I had to go one level up to the library user to get usable retries.

aaronlehmann added a commit to aaronlehmann/buildkit that referenced this pull request Mar 27, 2021
Some registries can be flaky and return intermittent 5xx errors. This
change allows those errors to be retried, similarly to network-level
errors.

Note that this needs the upstream containerd fix
containerd/containerd#5276 to work reliably.

This was tested with a registry that was modified to return 504 on every
other manifest PUT. Without the change, exports to the registry fail
every other attempt.  With the change and the related containerd change,
exports to the registry always succeed.

Signed-off-by: Aaron Lehmann <alehmann@netflix.com>
@theopenlab-ci
Copy link

theopenlab-ci bot commented Mar 27, 2021

Build succeeded.

@theopenlab-ci
Copy link

theopenlab-ci bot commented Mar 27, 2021

Build succeeded.

@thaJeztah
Copy link
Member

Hi 👋 @aaronlehmann long time no see 🤗

@dmcgowan
Copy link
Member

@aaronlehmann related to doWithRetries, I agree that part could be improved, especially the returned errors. It is designed right now to just fail on a subsequent request after a body has already been requested from a previous request, I'm not sure there is a case today where that would end up doing anything though. There have probably been some improvements in Go that allow us to clean this up further too.

What has been your experience with push here though? This code flow hasn't been tested extensively outside of buildkit use cases

@aaronlehmann
Copy link
Contributor Author

What has been your experience with push here though? This code flow hasn't been tested extensively outside of buildkit use cases

This came up in a buildkit context.

@aaronlehmann
Copy link
Contributor Author

@dmcgowan: Would you happen to be the right person to review this? It would be great to get it upstream!

@thaJeztah
Copy link
Member

@aaronlehmann I saw your PR in BuildKit linking to this one (moby/buildkit#2043) was to handle 5xx errors; is this fix related to / trying to solve the same issue as #5288 ?

/cc @kzys

@k8s-ci-robot k8s-ci-robot requested a review from kzys April 2, 2021 08:32
@aaronlehmann
Copy link
Contributor Author

No, I wasn't aware of that issue. The problem I was trying to solve was flaky infrastructure causing sporadic 5xx errors which caused pushes to fail.

@dmcgowan
Copy link
Member

dmcgowan commented Apr 2, 2021

@dmcgowan: Would you happen to be the right person to review this? It would be great to get it upstream!

Yes, but I am still trying to understand this state a little better. Completed seems like it might share some state with committed, these states don't transfer well from the registry API to content API

@aaronlehmann
Copy link
Contributor Author

The issue here was that the check near the beginning of Push was only checking the number of bytes sent. If we had sent N bytes of an N byte blob to the registry, we assume the registry has that blob, but that's a bad assumption to make. What if we sent those bytes but the registry returned an error and didn't commit the blob? That's the condition I'm trying to handle.

I'm fine with renaming it to Committed if you think that's a better description.

@dmcgowan
Copy link
Member

dmcgowan commented Apr 5, 2021

Committed would be a better field name.

Why I think it still doesn't transfer well is because the content API works a little differently than this. The Status type is intended to represent an active ingest. The "more correct" use of the content API would mean the upload should not check the size at all, but rather continue the process as long as there is a status object returned. After the upload is successful, the status should be removed and the "already exists" would happen via calling Info on the content.Manager interface.

My only reservation is related to making any changes to content.Status. For this particular change, I do think the change should occur inside the Commit method (there is one related TODO there). I also think it may make sense to just keep a separate record of the committed Info objects to the registry. This would be a little more work than just updating the status, but long term I think it is the right way to integrate with the interface. So the changes would mean the tracker would support removing status and keeping an Info cache (just a cache since this could also be checked via Head to the registry).

@aaronlehmann
Copy link
Contributor Author

Thanks, that makes sense. I've modified this to rename the field to Committed, move it from content.Status to docker.Status, and make the update in the Commit method.

I'm not completely clear on how the Info cache would fit into this. It looks to me like the pusher code is making HTTP requests directly rather than using the content.Manager abstraction. Are you suggesting that dockerPusher should be refactored to use content.Manager?

@theopenlab-ci
Copy link

theopenlab-ci bot commented Apr 5, 2021

Build succeeded.

@dmcgowan
Copy link
Member

dmcgowan commented Apr 5, 2021

@aaronlehmann actually this works great, thanks!

LGTM

merge might be on hold still until CI issues resolved

@theopenlab-ci
Copy link

theopenlab-ci bot commented Apr 5, 2021

Build succeeded.

@theopenlab-ci
Copy link

theopenlab-ci bot commented Apr 6, 2021

Build succeeded.

@aaronlehmann
Copy link
Contributor Author

Is there a known issue with CI? I've tried a few times since yesterday and continue to see these failures.

Copy link
Member

@mikebrow mikebrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mikebrow
Copy link
Member

mikebrow commented Apr 7, 2021

Is there a known issue with CI? I've tried a few times since yesterday and continue to see these failures.

yeah rebase and you should be good now

@theopenlab-ci
Copy link

theopenlab-ci bot commented Apr 7, 2021

Build succeeded.

…was successful

The `(dockerPusher).Push` method uses a `StatusTracker` to check if an
upload already happened, before repeating the upload. However, there is
no provision for failure handling. If a PUT request returns an error,
the `StatusTracker` will still see the upload as if it happened
successfully. Add a status boolean so that only successful uploads
short-circuit `Push`.

Signed-off-by: Aaron Lehmann <alehmann@netflix.com>
@theopenlab-ci
Copy link

theopenlab-ci bot commented Apr 7, 2021

Build succeeded.

Copy link
Member

@fuweid fuweid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@fuweid fuweid merged commit 85041ff into containerd:master Apr 7, 2021
tonistiigi pushed a commit to tonistiigi/buildkit that referenced this pull request Apr 22, 2021
Some registries can be flaky and return intermittent 5xx errors. This
change allows those errors to be retried, similarly to network-level
errors.

Note that this needs the upstream containerd fix
containerd/containerd#5276 to work reliably.

This was tested with a registry that was modified to return 504 on every
other manifest PUT. Without the change, exports to the registry fail
every other attempt.  With the change and the related containerd change,
exports to the registry always succeed.

Signed-off-by: Aaron Lehmann <alehmann@netflix.com>
(cherry picked from commit d3b96f4)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants