Transition from gcr.io to a modern artifact repository #15199

Closed
sanmai-NL opened this issue Jan 30, 2023 · 31 comments

Comments

@sanmai-NL

What would you like to be added?

The Google Container Registry is deprecated. Transitioning within the Google ecosystem, to their Artifact Registry, is described on https://cloud.google.com/artifact-registry/docs/transition/transition-from-gcr.

Alternatively, only use Quay.

Why is this needed?

A pressing problem this would solve is that the Artifact Registry is reachable over IPv6, whereas the Container Registry isn't.

@sanmai-NL
Author

Having studied the release script, I see that pushing to both registries substantially increases the duration and resource usage of the release pipeline. The advantage is unclear to me.

@serathius
Member

Would be good to consult what K8s is doing about this. @BenTheElder

@sanmai-NL
Author

@BenTheElder

registry.k8s.io is a multi-cloud hybrid system for funding reasons (that's a whole complicated topic ...), but we've also used the opportunity to move to basing Kubernetes's future image hosting on Artifact Registry; we hope to adopt some AR features, like immutable tags, at some point.

registry.k8s.io basically sits in front of AR and redirects some content download traffic to other hosts. The source code is not fully reusable at the moment (shipping reliably ASAP >>> flexible configuration), but the approach is hopefully well enough documented and relatively simple.

I'm not sure what is most appropriate for etcd overall, other than that I would recommend GCR => AR. It's mostly a drop-in upgrade.

@justinsb
Contributor

justinsb commented Feb 4, 2023

I know that technically etcd isn't a kubernetes sig (right?), but it is CNCF, so maybe it should just use the kubernetes release pipeline, rather than creating a whole new one. I'd much rather we redefine the kubernetes release pipeline as the CNCF release pipeline, than require every CNCF project to stand up their own.

There is a pipeline for etcd already:

https://github.com/kubernetes/k8s.io/tree/main/k8s.gcr.io/images/k8s-staging-etcd

The process is described here (along with a background of why etcd is there - TLDR because it is bundled with k8s)

This then becomes a shared problem (aka not etcd's problem), though of course anyone would be welcome to work on it. With artifacts.k8s.io, our dependency on gcr.io is pretty light anyway, and if the etcd project wants to maintain their own read-only mirror (e.g. if you have some money burning a hole in your pocket) then it's relatively easy to stand up a S3 / GCS / whatever bucket to do that.

@serathius
Member

@justinsb I agree that it's inefficient for each project to build their own pipeline, however I don't think it's as simple as just taking the K8s pipeline. The etcd image released by the K8s pipeline is totally different from what etcd users would expect: it includes additional old etcd binaries and wrapper scripts for the purpose of running etcd in K8s.

It would be great if CNCF gave us ready release tooling and maintained it for us, however the reality is that we mostly depend on contributions and the etcd community is not large enough to support it on our own. I have escalated the problem of etcd release pipelines multiple times to both CNCF representatives and Kubernetes release people, but no luck. I'm stuck building etcd on my own laptop.

@BenTheElder

GHCR + github actions might be worth exploring as a potentially no-cost, automated, low-maintenance option. I think some SIG subprojects in Kubernetes have done so, but I don't have first hand experience yet.

I'm not sure Kubernetes is in a position to be offering to host the entire CNCF (considering our existing budget overruns...), but for etcd in particular there is probably an argument to be made; we'd need to bring that to SIG K8s Infra and SIG Release.

Otherwise if Kubernetes is not actively hosting the infrastructure for you, I wouldn't recommend replicating all of it, especially if you're already understaffed. The approaches used are not without benefits but also not free.

https://docs.github.com/en/packages/working-with-a-github-packages-registry/working-with-the-container-registry
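
For illustration, a minimal sketch of what pushing an image to GHCR looks like per the linked documentation; the ghcr.io/etcd-io/etcd repository name, the local etcd:v3.6.0 tag, and the CR_PAT token variable are placeholders rather than an existing etcd setup (inside a GitHub Actions workflow the built-in GITHUB_TOKEN can be used instead of a personal access token):

# Authenticate to GHCR with a personal access token (or GITHUB_TOKEN inside a workflow).
echo "$CR_PAT" | docker login ghcr.io -u USERNAME --password-stdin
# Tag a locally built image for GHCR and push it.
docker tag etcd:v3.6.0 ghcr.io/etcd-io/etcd:v3.6.0
docker push ghcr.io/etcd-io/etcd:v3.6.0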

@sanmai-NL
Author

GHCR + github actions might be worth exploring as a potentially no-cost, automated, low-maintenance option.

GitHub Container Registry isn't configured for IPv6 either.

@jeefy

jeefy commented Feb 6, 2023

It would be great if CNCF gave us ready release tooling and maintained it for us, however the reality is that we mostly depend on contributions and the etcd community is not large enough to support it on our own.

Tell me more @serathius and I might be able to make that monkey paw finger curl 🙃

@stale

stale bot commented May 21, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label May 21, 2023
@ahrtr ahrtr added stage/tracked and removed stale labels May 21, 2023
@hakman

hakman commented May 18, 2024

@justinsb @serathius @ahrtr Is this something that can be revisited these days?

@ahrtr
Member

ahrtr commented May 18, 2024

@justinsb @serathius @ahrtr Is this something that can be revisited these days?

etcd has already become a Kubernetes SIG. How do other SIGs maintain their images? Can we just follow a similar approach? We need someone to drive this effort.

@hakman

hakman commented May 18, 2024

Let's chat separately and see if I can help with this.

@ahrtr
Member

ahrtr commented May 18, 2024

/assign @hakman

Thanks. Please feel free to let me or @jmhbnz know if you need any assistance from the etcd side.

@serathius
Member

Going back to solving the immediate problem of gcr.io disappearing in March. Can we just follow https://cloud.google.com/artifact-registry/docs/transition/auto-migrate-gcr-ar#migrate-gcrio-hosted-ar?

Run gcloud artifacts docker upgrade migrate --projects=PROJECTS for the etcd image hosting project. It should migrate all images to AR, create a gcr.io repository in AR, and route the traffic to it. That should be it; in the end, neither the etcd release process nor users should notice any difference. Am I right?

It would be good to confirm this assumption and have someone test whether the command works as we expect. So: create a Docker registry in a GCP project, push a random image there, make it public, check that it can be downloaded, then migrate it, check that it can still be downloaded, and push new images the same way. Anyone interested in helping?
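
For illustration, a minimal sketch of that test procedure; "my-test-project" and the image name are placeholders, and this assumes the test project still serves the legacy gcr.io path (which, as noted later in the thread, is no longer the case for newly created projects):

# Let Docker authenticate to gcr.io via gcloud, then push a throwaway image.
gcloud auth configure-docker
docker pull busybox:latest
docker tag busybox:latest gcr.io/my-test-project/migration-test:v1
docker push gcr.io/my-test-project/migration-test:v1

# Make the registry publicly readable (Container Registry permissions are backed by its GCS bucket).
gsutil iam ch allUsers:objectViewer gs://artifacts.my-test-project.appspot.com

# Confirm the image can be pulled, then migrate the project to Artifact Registry.
docker pull gcr.io/my-test-project/migration-test:v1
gcloud artifacts docker upgrade migrate --projects=my-test-project

# Confirm pulls and pushes still work after the redirection is in place.
docker pull gcr.io/my-test-project/migration-test:v1
docker tag busybox:latest gcr.io/my-test-project/migration-test:v2
docker push gcr.io/my-test-project/migration-test:v2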

@BenTheElder

BenTheElder commented Jan 21, 2025

There is a pipeline for etcd already: https://github.com/kubernetes/k8s.io/tree/main/k8s.gcr.io/images/k8s-staging-etcd [...] With artifacts.k8s.io, our dependency on gcr.io is pretty light anyway, and if the etcd project wants to maintain their own read-only mirror (e.g. if you have some money burning a hole in your pocket) then it's relatively easy to stand up a S3 / GCS / whatever bucket to do that.

[NOTE: registry.k8s.io has not depended on gcr.io for a long time; it is on Artifact Registry + other hosts. Long term, this is a more cost-effective option, in line with other SIG projects.]

@jmhbnz
Member

jmhbnz commented Jan 21, 2025

I've raised #19250 to propose a migration to Artifact Registry.

We would still need to migrate existing images, but there are helper utilities, as mentioned by @serathius above, for copying existing images across to the new repository.

@ivanvc
Member

ivanvc commented Feb 4, 2025

/assign

Based on the consensus in pull request #19250 and the 2025-01-21 community meeting, we lean towards following @serathius's suggestion.

Would be good to confirm the assumption and have someone to test the command if it works as we expect. So, create a docker registry in GCP project, push a random image there, make it public, check if they can download it, then migrate it, check if they can download it after and push new images the same way. Anyone interested in helping?

I'm on it.

@ivanvc
Member

ivanvc commented Feb 5, 2025

Uh, I don't think I'll be able to test. I created a new GCR repository, but GCP has already implemented the Artifact Registry mirroring for new repositories:

$ gcloud artifacts docker upgrade migrate --projects=ivan-tests-449923 
Artifact Registry is already handling all requests for *gcr.io repos for the provided projects. If there are images you still need to copy, use the --copy-only flag.

If we want to test this scenario, we'd need someone with an old and not yet migrated GCR repository.

@serathius
Member

An alternative would be to do a canary rollout. The migration command supports a --canary-reads flag that allows a gradual rollout, and AR has the command gcloud artifacts settings disable-upgrade-redirection --project=$PROJECT_ID to roll back in case things go sour.

We could do the following rollout process:

  1. Redirect 1% of traffic by running gcloud artifacts docker upgrade migrate --projects=$PROJECT_ID --canary-reads=1
     - Validate that everything works, rolling back with gcloud artifacts settings disable-upgrade-redirection --project=$PROJECT_ID if things go wrong.
  2. Redirect 10% of traffic by running gcloud artifacts docker upgrade migrate --projects=$PROJECT_ID --canary-reads=10
  3. Redirect 100% of traffic by running gcloud artifacts docker upgrade migrate --projects=$PROJECT_ID --canary-reads=100

One thing I'm not sure about yet is how to recognize whether an image was served by GCR or AR. I expect that it should be visible in the response headers.

What do you think about redirecting 1% of traffic now? In the worst case, a user will need to retry a request. I don't think we can test it otherwise. cc @BenTheElder @ahrtr @jmhbnz @ivanvc, what are your thoughts?
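
For illustration, the rollout above consolidated into the corresponding commands; $PROJECT_ID stands for the etcd image hosting project, and the percentages follow the steps listed in the comment:

# Step 1: redirect 1% of read traffic to Artifact Registry and validate.
gcloud artifacts docker upgrade migrate --projects=$PROJECT_ID --canary-reads=1
# If anything goes wrong, roll the redirection back.
gcloud artifacts settings disable-upgrade-redirection --project=$PROJECT_ID
# Steps 2 and 3: widen the canary, then route all reads through Artifact Registry.
gcloud artifacts docker upgrade migrate --projects=$PROJECT_ID --canary-reads=10
gcloud artifacts docker upgrade migrate --projects=$PROJECT_ID --canary-reads=100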

@ahrtr
Member

ahrtr commented Feb 5, 2025

  1. Redirect 1% of traffic by running gcloud artifacts docker upgrade migrate --projects=$PROJECT_ID --canary-reads=1

Makes sense. Probably we can even set --canary-reads=0, and gradually increase it when the migration is done.

One thing I'm not sure about yet is how to recognize whether an image was served by GCR or AR. I expect that it should be visible in the response headers.

I am not 100% sure about it either. Currently I see a warning on the console: "You have gcr.io repositories in Container Registry. Use "gcloud artifacts docker upgrade migrate" to migrate to Artifact Registry." The warning will probably be gone once the migration is done. Let's watch and see.

Another thing which needs to be handled super carefully is the removal of the legacy Container Registry storage. Based on this doc:

  • When redirection is enabled, commands to delete images in gcr.io paths delete images in the corresponding Artifact Registry gcr.io repository
  • To safely remove all Container Registry images, delete the Cloud Storage buckets for each Container Registry hostname.

It seems that the recommended way is to remove the Cloud Storage buckets for the Container Registry directly. Probably we should do the storage cleanup after the legacy Container Registry is completely out of support (May 22, 2025), but we still need to double-confirm that it won't affect the already-migrated gcr.io repositories hosted on Artifact Registry.
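
For illustration, a hedged sketch of that cleanup step, to be run only after confirming the migration is complete; the bucket names follow the documented Container Registry naming scheme for the etcd-development project, and only buckets that actually exist need to be removed:

# Bucket backing gcr.io.
gsutil rm -r gs://artifacts.etcd-development.appspot.com
# Buckets backing the regional hostnames (us.gcr.io, eu.gcr.io, asia.gcr.io), if present.
gsutil rm -r gs://us.artifacts.etcd-development.appspot.com
gsutil rm -r gs://eu.artifacts.etcd-development.appspot.com
gsutil rm -r gs://asia.artifacts.etcd-development.appspot.com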

@BenTheElder

BenTheElder commented Feb 5, 2025

I think doing a 1% trial is fine, but I think you should be free to just move it; eventually the migration will be forced anyhow.

If someone complains at 50%, are you actually going to turn it off? Then what?

GCR turndown was announced a long time ago and will actually begin on March 18th, so there's not a lot of room to delay. https://cloud.google.com/artifact-registry/docs/transition/prepare-gcr-shutdown

We've been very explicit with registry.k8s.io that we won't be beholden to users depending on implementation details of the host versus a public OCI registry, and that there can be no SLA as a free, volunteer-staffed content host: https://registry.k8s.io#stability

If users are really serious about uptime they need to use a mirror/pull-through-cache or distro provided mirror (which we provide docs/guidance for).

I think the situation is similar here.

It seems that the recommended way is to remove the Cloud Storage buckets for the Container Registry directly. Probably we should do the storage cleanup after the legacy Container Registry is completely out of support (May 22, 2025), but we still need to double-confirm that it won't affect the already-migrated gcr.io repositories hosted on Artifact Registry.

Artifact Registry doesn't depend on your GCR GCS bucket once the content is migrated.

@BenTheElder

BenTheElder commented Feb 5, 2025

One thing I'm not sure about yet is how to recognize whether an image was served by GCR or AR. I expect that it should be visible in the response headers.

This is slightly leaky, so you can do crane pull --verbose $image /dev/null and see whether blobs are redirected to the GCS bucket (https://storage.googleapis.com/) or to https://gcr.io/artifacts-downloads/[...] (you can just check the last request at the end).

The traffic to the GCS bucket is also observable in cloud console IIRC.

To be clear: this is using https://github.com/google/go-containerregistry/tree/main/cmd/crane. It's not the only option, but I find it helpful for this sort of debugging, since you can see the requests, etc.
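
For illustration, a sketch of that check against the image tag mentioned later in the thread; this assumes crane's --verbose output (written to stderr) dumps each HTTP request and response, so grepping for the two candidate hosts shows where blob downloads were redirected:

crane pull --verbose gcr.io/etcd-development/etcd:v3.5.18 /dev/null 2>&1 \
  | grep -oE 'https://(storage\.googleapis\.com|gcr\.io/artifacts-downloads)[^" ]*' \
  | sort -u
# storage.googleapis.com URLs mean the legacy GCR GCS bucket served the blobs;
# gcr.io/artifacts-downloads URLs mean Artifact Registry did.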

@ahrtr
Member

ahrtr commented Feb 5, 2025

I think doing a 1% trial is fine, but I think you should be free to just move it

Either way works. To keep it simple, let's just move it (without using --canary-reads)?

Artifact Registry doesn't depend on your GCR GCS bucket once the content is migrated.

Thanks for the confirmation.

@ahrtr
Member

ahrtr commented Feb 5, 2025

@ivanvc let's just issue the command below today and check the status tomorrow? We have another container registry at quay.io, which guarantees that we still have an alternative in the worst case.

gcloud artifacts docker upgrade migrate --projects=etcd-development 

@ivanvc
Member

ivanvc commented Feb 5, 2025

Hey all, I had a busy morning and couldn't reply earlier. By now, it's probably late in the evening for @ahrtr and @serathius.

I agree with @BenTheElder. Even if things fail, we don't have an alternate plan, as the shutdown is coming soon, and there's nothing else we can do.

However, I still feel like the safest course of action would be to first do the 1% canary just to triple-check that everything works fine (and that we don't have configuration issues). Then, move forward with it.

@ahrtr (hopefully you see this today), are you okay with this?

@ivanvc
Member

ivanvc commented Feb 6, 2025

I could finally reproduce the steps to migrate an old GCR repo into AR.

I tried with a 1% and 10% canary and later ran the whole migration in one of my repositories, and it worked fine.

I tested getting the images using crane and curl to inspect the redirection and see if the canary worked.

It doesn't hurt to do a canary deployment in etcd-development while I verify the location and confirm that the images work as expected. So, I'm enabling it in the meantime.

@ivanvc
Member

ivanvc commented Feb 6, 2025

I enabled it with 1%. I'm seeing that some of my requests are being redirected to AR. I'm testing with a blob from our latest release (v3.5.18): https://gcr.io/v2/etcd-development/etcd/blobs/sha256:b9e6889272c9e672fa749795344385882b2696b0f302c6430a427a4377044a7a

Following the redirect, it returns a 200. So, there aren't any permission issues.

I'll open the floodgates 😄 (route all traffic as asked by @ahrtr).
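
For illustration, a sketch of this spot check; it assumes anonymous access to the public repository works (as the 200 above suggests), and with a 1% canary only an occasional request should be redirected to the Artifact Registry download path instead of the legacy GCS bucket:

# Sample the Location header of repeated blob requests and count the distinct redirect targets.
for i in $(seq 1 20); do
  curl -sI "https://gcr.io/v2/etcd-development/etcd/blobs/sha256:b9e6889272c9e672fa749795344385882b2696b0f302c6430a427a4377044a7a" \
    | grep -i '^location:'
done | sort | uniq -c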

@ivanvc
Member

ivanvc commented Feb 6, 2025

And it works as expected. I tested several 3.5 and 3.4 images, which run fine in Docker. Also, I checked with crane, and none are using Appspot anymore, as expected. 🎉 🎉 🎉

@ahrtr
Member

ahrtr commented Feb 6, 2025

Great news. Thanks @ivanvc

@ivanvc
Member

ivanvc commented Feb 14, 2025

Yesterday, we published the v3.6.0-rc.0 release. It was the first time we pushed to the AR-backed gcr.io registry, and it worked as expected with no issues. So, the download and upload parts are now thoroughly tested.

I believe we can close this issue now. If you think otherwise, we can reopen it.

Happy valentines 💟 ✌

@ivanvc ivanvc closed this as completed Feb 14, 2025