
[ECR] [request]: support cache manifest #876

Open
lifeofguenter opened this issue May 5, 2020 · 76 comments
Labels
ECR Amazon Elastic Container Registry Proposed Community submitted issue Under consideration


@lifeofguenter

Would be great if ECR could support cache-manifest (see: https://medium.com/titansoft-engineering/docker-build-cache-sharing-on-multi-hosts-with-buildkit-and-buildx-eb8f7005918e)
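For context, the operation being requested is BuildKit's registry cache export, which writes a separate cache manifest to the registry alongside the image. A minimal sketch with hypothetical registry and repository names (the docker command is echoed rather than executed, so the sketch is side-effect-free; drop the echo to actually build):

```shell
#!/bin/sh
# Hypothetical ECR registry and repository, for illustration only.
REGISTRY="123456789012.dkr.ecr.us-east-1.amazonaws.com"
REPO="$REGISTRY/my-app"

# BuildKit stores the build cache as an extra manifest under its own tag;
# pushing that cache manifest is the part ECR rejects.
echo docker buildx build \
  --cache-from "type=registry,ref=$REPO:buildcache" \
  --cache-to "type=registry,ref=$REPO:buildcache,mode=max" \
  --push -t "$REPO:latest" .
```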

@TBBle

TBBle commented Oct 27, 2020

BuildKit 0.8 will default to using an OCI media type for its caches (see moby/buildkit#1746), which I assume should make this work, but I haven't tested it myself.

@aleks-fofanov

It still doesn't work with the recently released BuildKit 0.8.0.
It can write the layers and config, but it is unable to upload the manifest to ECR:

=> ERROR exporting cache                                                                                                     5.4s
 => => preparing build cache for export                                                                                       0.2s
 => => writing layer sha256:0d48cc65d93fe2ee9877959ff98ebc98b95fe4b2fc467ff50f27103c1c5d6973                                  0.3s
 => => writing layer sha256:2ade286d53f2e045413601ca0e3790de3792ea34abd3d025cd2cd9c3cb5231de                                  0.3s
 => => writing layer sha256:64befcf53942ba04c144cde468548885d497e238001e965e983e39eb947860c2                                  0.3s
 => => writing layer sha256:7415f0cbea8739c1bf353568b16ac74a9cfbc0b36327602e3a025abf919a38a6                                  0.3s
 => => writing layer sha256:76a1f73c618c30eb1b1d90cf043fe3f855a1cce922d1fb47458defd3dbe1c783                                  0.3s
 => => writing layer sha256:8674739c0ada3e834b816667d26dd185aa5ea089f33701f11a05b7be03f43026                                  0.3s
 => => writing layer sha256:9dc80bcd2805b2a441bd69bc9468df2e81994239e34879567bed7bdef6cb605d                                  0.3s
 => => writing layer sha256:cbdbe7a5bc2a134ca8ec91be58565ec07d037386d1f1d8385412d224deafca08                                  0.3s
 => => writing layer sha256:ce4e6de84945ab498f65d16920c9b801dfea3792871e44f89e6438e232a690b3                                  0.3s
 => => writing layer sha256:d46583c5d4c69b34cb46866838d68f53a38686dc7f2d1347ae0f252e8eb0ed4c                                  0.2s
 => => writing config sha256:33c76a0f8a74a06e461926d8a8d1845371c0cf9e86753db2483a4873aede8889                                 2.0s
 => => writing manifest sha256:0f69a7e6626f6a24a0a95ed915613ebdf9459280d4986879480d87e34849aea8                               0.6s
------
 > importing cache manifest from XXXXXXXXXXXX.dkr.ecr.us-west-2.amazonaws.com/test-repo:buildcache:
------
------
 > exporting cache:
------
error: failed to solve: rpc error: code = Unknown desc = error writing manifest blob: failed commit on ref "sha256:0f69a7e6626f6a24a0a95ed915613ebdf9459280d4986879480d87e34849aea8": unexpected status: 400 Bad Request

@errm

errm commented Dec 9, 2020

I am seeing the same error on buildkit 0.8.0

even when setting oci-mediatypes explicitly to true: --export-cache type=registry,ref=${REPO}:buildcache,oci-mediatypes=true

 => ERROR exporting cache                                                                                                                                                                                                                                                                                                                                                                                                                                                                1.4s
 => => preparing build cache for export                                                                                                                                                                                                                                                                                                                                                                                                                                                  0.0s
 => => writing layer sha256:757d39990544d20fbebf7a88e29a5dd2bb6a4fdb116d67df9fe8056843da794d                                                                                                                                                                                                                                                                                                                                                                                             0.1s
 => => writing layer sha256:7597eaba0060104f2bd4f3c46f0050fcf6df83066870767af41c2d7696bb33b2                                                                                                                                                                                                                                                                                                                                                                                             0.1s
 => => writing config sha256:0e308fd4eee4cae672eee133cbd77ef7c197fa5d587110b59350a99b289f7000                                                                                                                                                                                                                                                                                                                                                                                            0.8s
 => => writing manifest sha256:8eb142b16e0ec25db4517f2aecff795cca2b1adbe07c32f5c571efc5c808cbcd                                                                                                                                                                                                                                                                                                                                                                                          0.3s
------
 > importing cache manifest from xxx.dkr.ecr.us-east-1.amazonaws.com/errm/test:buildcache:
------
------
 > exporting cache:
------
error: failed to solve: rpc error: code = Unknown desc = error writing manifest blob: failed commit on ref "sha256:8eb142b16e0ec25db4517f2aecff795cca2b1adbe07c32f5c571efc5c808cbcd": unexpected status: 400 Bad Request

Daemon logs:

time="2020-12-09T13:42:48Z" level=info msg="running server on /run/buildkit/buildkitd.sock"
time="2020-12-09T13:44:09Z" level=warning msg="reference for unknown type: application/vnd.buildkit.cacheconfig.v0"
time="2020-12-09T13:44:10Z" level=error msg="/moby.buildkit.v1.Control/Solve returned error: rpc error: code = Unknown desc = error writing manifest blob: failed commit on ref \"sha256:8eb142b16e0ec25db4517f2aecff795cca2b1adbe07c32f5c571efc5c808cbcd\": unexpected status: 400 Bad Request\n"

@AlexLast

AlexLast commented Jan 4, 2021

Also seeing this for private repos, although it doesn't seem to be an issue with public ECR repos.

@n1ru4l

n1ru4l commented Feb 8, 2021

Is there a timeframe available for this feature request? It could tremendously speed up CI builds.

@jellevanhees

We have been experimenting with this BuildKit feature for some time now and it works wonders.
Currently, we are still dependent upon Dockerhub, so having this functionality in private ECR would greatly benefit our CI/CD workflow.

@davidfm

davidfm commented Mar 16, 2021

Any indication as to if/when this will ever be available? Using BuildKit would really improve our CI build times.

@dsaydon90

One year passed and still nothing 😔

@pieterza

pieterza commented Jun 1, 2021

We'd really like to see support of this with ECR private repos 🙏

As of today, it still does not work:

error: failed to solve: rpc error: code = Unknown desc = error writing manifest blob: failed commit on ref "sha256:75f32e1bb4df7c6333dc352ea3ea9d04d1e04e4a14ba79b59daa019074166519": unexpected status: 400 Bad Request

@hf

hf commented Jun 13, 2021

Yes please!

@abatilo

abatilo commented Aug 21, 2021

Can we get any kind of communications on this?

@srrengar srrengar added this to Researching in containers-roadmap via automation Aug 25, 2021
@renannprado

Is there any workaround available?

@ynouri

ynouri commented Nov 2, 2021

For the teams using Github but wishing to keep images in ECR, it is possible to leverage the cache manifest support from Github Container Registry (GHCR) and push the image to ECR at the same time. When pushing to ECR, only new layers get pushed.

Github Actions workflow example:

jobs:

  docker_build:
    strategy:
      matrix:
        name:
          - my-image
        include:
          - name: my-image
            registry_ecr: my-aws-account-id.dkr.ecr.us-east-1.amazonaws.com
            registry_ghcr: ghcr.io/my-github-org-name
            dockerfile: ./path/to/Dockerfile
            context: .
            extra_args: ''

    steps:
      - uses: actions/checkout@v2

      - name: Install Buildkit
        uses: docker/setup-buildx-action@v1
        id: buildx
        with:
          install: true

      - name: Login to GitHub Container Registry
        uses: docker/login-action@v1
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
          role-skip-session-tagging: true
          role-duration-seconds: 1800
          role-session-name: GithubActionsBuildDockerImages

      - name: Login to Amazon ECR
        uses: aws-actions/amazon-ecr-login@v1

      - name: Build & Push (ECR)
        # - https://docs.docker.com/engine/reference/commandline/buildx_build/
        # - https://github.com/moby/buildkit#export-cache
        run: |
          docker buildx build \
            --cache-from=type=registry,ref=${{ matrix.registry_ghcr }}/${{ matrix.name }}:cache \
            --cache-to=type=registry,ref=${{ matrix.registry_ghcr }}/${{ matrix.name }}:cache,mode=max \
            --push \
            ${{ matrix.extra_args }} \
            -f ${{ matrix.dockerfile }} \
            -t ${{ matrix.registry_ecr }}/${{ matrix.name }}:${{ github.sha }} \
            ${{ matrix.context }}

@abatilo

abatilo commented Nov 3, 2021

@ynouri Just be careful of your storage costs in GHCR. It's oddly expensive. I found https://github.com/snok/container-retention-policy to help solve that use case for me.

@kgns

kgns commented Nov 3, 2021

ECR still not supporting this is unbelievably amateurish, it doesn't suit AWS...

@pieterza

pieterza commented Nov 5, 2021

Is there any workaround available?

Use another Docker registry. Dockerhub, or perhaps your own tiny EC2 with some fat storage.
Sucks, but AWS doesn't seem interested.

@poldridge

This seems to have started working unannounced, at least when using docker 20.10.11 to build

@ramosbugs

This seems to have started working unannounced, at least when using docker 20.10.11 to build

I'm still seeing error writing manifest blob with 400 Bad Request on Docker 5:20.10.12~3-0~ubuntu-focal, at least in us-west-2.

@kgns

kgns commented Dec 19, 2021

This seems to have started working unannounced, at least when using docker 20.10.11 to build

is this confirmed?

@BeyondEvil

This seems to have started working unannounced, at least when using docker 20.10.11 to build

is this confirmed?

I'm wondering the same thing.

Could you share some more info @poldridge ?

@eduard-malakhov

I've just faced the same issue with Docker version 20.10.12, build e91ed57. Would appreciate any hints or workarounds.

@sherifabdlnaby

This seems to have started working unannounced, at least when using docker 20.10.11 to build

Did not work for me using docker:20.10.11-dind and ECR us-west-2.

@sherifabdlnaby

Can we get any kind of communication on this? Being able to use remote cache is a major benefit to all our build pipelines.

@tavlima

tavlima commented Oct 23, 2022 via email

@automartin5000

FWIW, my team uses buildctl directly, instead of the buildx wrapper.

Do you have an example of that?

@tavlima

tavlima commented Nov 2, 2022

Do you have an example of that?

Sure. Please see the examples here.
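Since the linked examples aren't reproduced in this thread, here is a sketch of a `buildctl` invocation that mirrors the buildx cache flags discussed above. Repository names are hypothetical and this is not necessarily what those examples show; the command is echoed rather than executed so the sketch is side-effect-free:

```shell
#!/bin/sh
# Hypothetical repository name, for illustration only.
REPO="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app"

# buildctl talks to buildkitd directly; --import-cache/--export-cache
# correspond to buildx's --cache-from/--cache-to.
echo buildctl build \
  --frontend dockerfile.v0 \
  --local context=. \
  --local dockerfile=. \
  --output "type=image,name=$REPO:latest,push=true" \
  --import-cache "type=registry,ref=$REPO:buildcache" \
  --export-cache "type=registry,ref=$REPO:buildcache,mode=max"
```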

@dmarkey

dmarkey commented Nov 11, 2022

For those looking for guidance on the S3 build cache with buildx, this worked for me:

    - uses: actions/checkout@v3
    - name: configure aws credentials
    - name: Set up QEMU
      uses: docker/setup-qemu-action@v2
    - name: Set up Docker Buildx
      uses: docker/setup-buildx-action@v2
    - name: Buildx master
      run: docker buildx create --bootstrap --driver docker-container --driver-opt image=moby/buildkit:master --use

and

docker buildx build . --cache-from=type=s3,region=eu-west-1,bucket=bucket-name,name=docker-cache/myapp,access_key_id=$(AWS_ACCESS_KEY_ID),secret_access_key=$(AWS_SECRET_ACCESS_KEY),session_token=$(AWS_SESSION_TOKEN) --cache-to=type=s3,region=eu-west-1,bucket=bucket-name,name=docker-cache/myapp,access_key_id=$(AWS_ACCESS_KEY_ID),secret_access_key=$(AWS_SECRET_ACCESS_KEY),session_token=$(AWS_SESSION_TOKEN) 

Will probably wait until S3 cache is released before using in production.

@kgns

kgns commented Nov 15, 2022

I feel like we are going off-topic with all the workarounds being mentioned. All this information is helpful, but in the end we still need ECR to support cache manifests so we can use it to store our cache layers. What's the current progress on this?

@rstanevich

I feel like we are going off-topic with all the workarounds being mentioned. All this information is helpful, but in the end we still need ECR to support cache manifests so we can use it to store our cache layers. What's the current progress on this?

It seems that the comments above are related to this post, which also contains a detailed explanation and questions for users.

Is anyone in this thread using or intending to use other tools than Docker buildx to build their images using cache manifest?

From my observation, only the BuildKit engine uses the cache manifest. Other tools like Kaniko and Buildah have a different remote cache implementation; roughly speaking, they publish each layer under a dedicated tag.

Would also love to hear how people are working around this limitation today, as it provides additional context about the underlying customer need.

I know customers have asked JFrog to support a cache manifest in their Artifactory, and it now works, though with a size limit of 2 GB per layer.

@damienrj

Thanks so much to everyone that provided more info here and on Twitter.

We are investigating the best way to proceed. Buildkit is relying on a relatively unorthodox use of the OCI specification to enable this feature (see here), and we want to make sure we do our best to continue adhering to the standard while we support customers. OCI Artifacts (see here) seem to be the more adequate tool for the cache manifest. I've asked for more information on the Buildkit repo to see the progress being made there, and we are also looking at the effort on our side.

Is anyone in this thread using or intending to use other tools than Docker buildx to build their images using cache manifest? Want to make sure we are as thorough as possible and support customers across the board.

Would also love to hear how people are working around this limitation today, as it provides additional context about the underlying customer need.

Our current workaround is that we are still using Google GCR, since inline caching has been working there, but it would be nice to keep image layers from having to go cloud to cloud.

@sherifabdlnaby

Really wish this was supported by now, everything else seems like a workaround, and it's confusing to look at in the build pipelines.

@rafavallina

rafavallina commented Nov 18, 2022

Hi all,

Providing a quick update here. I can't comment on dates in a public roadmap, but note that we are still working on this :)

I can provide an overview of our thinking but please note that this is a snapshot of our current plans, and not set in stone or a firm commitment.

We are closely looking at the release of the OCI 1.1 specification, which includes OCI Artifacts and the Referrers API. We think that OCI Artifact manifests are perfect for this. The 1.1 specification is in a release candidate state, and the OCI has set a 1/31 date for the final release.

Our intention is to see if we can work with Buildkit to have them move over to OCI Artifacts once the specification is official and ECR supports it, which would immediately make this work for everyone using it today without having to make changes. We do want to make sure that other registries also support this implementation, as it's not our intention to break anyone's existing workflows.

I apologize for the wait - I know this has been open for a long time. It's very high on our priority list, but even then it is taking a bit to get over the finish line.

@Supesharisuto

Any updates please? This feature alone keeps me from promoting GHA build/deploy to my org. We currently use CodeBuild, which sucks.

@georgeseifada

I am also looking forward to this feature! 🙏

@AndrewFarley

Note: This bit me and my team in the behind just now trying to use buildkit and cache things in ECR. It's 3 years later and this still isn't possible. Can this please get some serious attention?

@osterman

Note: This bit me and my team in the behind just now trying to use buildkit and cache things in ECR. It's 3 years later and this still isn't possible. Can this please get some serious attention?

For now, you can leverage the newly announced support in buildkit for S3 caching.

Speed up your docker builds on AWS! Docker Now Supports Native S3 Caching. Build command supports new flags --attest and shorthands --sbom and --provenance for adding attestations for your current build. When creating OCI images a minimal provenance attestation is included with the image by default. This feature requires BuildKit v0.11.0+ and Dockerfile 1.5.0+.
https://lnkd.in/gCYReAHn

@AndrewFarley

@osterman We tried; it segfaults hard on the latest master version as of this morning. S3 support is in (very early) beta at best on BuildKit. Can't get it to work personally.

@rchernobelskiy

We've been using S3 as a backend to a local instance of the registry, running together with the build.
The issue there is cleaning the cache, but you can work around it by giving the local registry a specific S3 path, for example with the name of the month in it, and then deleting the paths for months that have passed.

@orf

orf commented Jan 19, 2023

for example with the name of the month in it and then deleting the paths for a month that passed

Lifecycle policies?

@rchernobelskiy

rchernobelskiy commented Jan 19, 2023

for example with the name of the month in it and then deleting the paths for a month that passed

Lifecycle policies?

Lifecycle policies alone will corrupt the registry store, as they will delete parts of the cache that are still referenced elsewhere.
But if used together with the month-in-the-path approach, a lifecycle rule that deletes everything older than 32 days can be used safely.
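The month-in-the-path scheme can be sketched as follows (the `docker-cache/` prefix is a hypothetical name):

```shell
#!/bin/sh
# Each month's cache lives under its own S3 prefix, so a lifecycle rule
# expiring objects older than ~32 days only ever removes prefixes that no
# current build references, avoiding a corrupted registry store.
MONTH="$(date -u +%Y-%m)"
CACHE_PREFIX="docker-cache/$MONTH"
echo "$CACHE_PREFIX"
```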

@rafavallina

Hi all,

I'm sorry that we are taking so long with this. I unfortunately don't have too many updates to report: as of right now, we are working on supporting the OCI 1.1 specification (which continues to be target for a 1/31), and have also reached out to BuildKit to drive a change on how they store cache manifests.

I appreciate everyone's patience with this.

@AndrewFarley

@rchernobelskiy Ah, if I understand it, you're using S3 as the backend instead of using local, and you weren't using their caching feature at all then, eh? I didn't think to try that route; could you perhaps share the commands/settings/config/setup you're using to get this to work? I'd like to explore it more. Thanks!

(sorry folks for this comment not being directly related to this issue)

@rchernobelskiy

rchernobelskiy commented Jan 20, 2023

@AndrewFarley we're using a temporarily spun up registry (https://github.com/distribution/distribution) as the cache for each build, and for that registry we use S3 as the storage backend to persist the cache across builds. This is the only reliable and maintenance-free persistent-cache solution we've found after a lot of experimentation about a year ago.
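A sketch of that setup, with hypothetical bucket and path names: the distribution registry (`registry:2`) can be configured entirely through `REGISTRY_*` environment variables, and builds then point their cache at `localhost:5000`. The command is echoed rather than executed, so the sketch is side-effect-free:

```shell
#!/bin/sh
# Hypothetical bucket and month-stamped root directory; S3 credentials are
# assumed to come from the instance role.
BUCKET="my-build-cache-bucket"
PREFIX="/docker-cache/$(date -u +%Y-%m)"

# Run a throwaway registry backed by S3 for the duration of the build.
echo docker run -d --name cache-registry -p 5000:5000 \
  -e REGISTRY_STORAGE=s3 \
  -e REGISTRY_STORAGE_S3_BUCKET="$BUCKET" \
  -e REGISTRY_STORAGE_S3_REGION=us-east-1 \
  -e REGISTRY_STORAGE_S3_ROOTDIRECTORY="$PREFIX" \
  registry:2

# Builds would then use e.g.:
#   --cache-to type=registry,ref=localhost:5000/my-app:buildcache,mode=max
#   --cache-from type=registry,ref=localhost:5000/my-app:buildcache
```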

@AndrewFarley

@rchernobelskiy We are on AWS, and the solution we landed on after some experimentation is using an FSx disk as our cache, which is a multi-mount capable, super-fast shared disk solution; we just tell BuildKit to use a local disk cache on all our runners pointed at this disk. Even at very low provisioned speeds it's lightning fast.

@marcesengel

@rchernobelskiy We are on AWS and the solution we landed on with some experimentation is using an FSX disk as our cache, which is a multi-mount capable super-fast shared disk solution, so we just tell buildkit to use a local disk cache on all our runners pointed to this disk. Even at very low provisioned speeds its lightning fast.

Sounds nice! Do you have any comparisons of this vs. using S3 cache + the local disc?

@AndrewFarley

AndrewFarley commented Jan 26, 2023

@marcesengel I didn't get S3 cache working, so I can't say, and I don't have the time to experiment with the "distribution" registry proxy idea that @rchernobelskiy shared above.

I can say that our builds are crazy fast. Previously, doing builds with Kaniko with AWS ECR registry layer caching, I was seeing 1-3 minute build times on some of our test repos, whereas with BuildKit and FSx we are seeing 10-20 second build times regularly. This may be more indicative of BuildKit than of S3; however, I would guess that there's far more overhead in talking to a remote registry a bunch of times to grab a bunch of cached layers (opening a TCP connection, doing an HTTPS handshake, SSL encryption/decryption, etc.) rather than just reading from a local disk.

Try it out, I bet you won't be disappointed. If you get both working and compare speeds I'd be eager to hear what the numbers look like.

@roy-ht

roy-ht commented Jan 26, 2023

Here is an example for S3 cache:

docker buildx build \
  --output 'type=registry,name=xxx.dkr.ecr.xxx.amazonaws.com/xxx:latest' \
  --cache-from 'type=s3,region=<your region>,bucket=<your bucket>,prefix=<your s3 key>,name=<some name>' \
  --cache-to 'type=s3,region=<your region>,bucket=<your bucket>,mode=max,prefix=<your s3 key>,name=<some name>' \
  -f '<your Dockerfile>' \
  .

See also: https://github.com/moby/buildkit#s3-cache-experimental

Our settings are working successfully on GitHub self-hosted runner.

One problem is that the S3 cache does not remove old layer caches automatically,
so if you are sensitive to storage cost, you have to remove old cache entries by hand, with a scheduled job, a lifecycle policy, etc.

@AndrewFarley

AndrewFarley commented Jan 26, 2023

@roy-ht I'm aware of those examples. Those examples hard-crash for us with the latest release and on master. Good to hear that someone has had some success though, might be something we were doing. Maybe will have another go with it when I have some time see if I can figure out why, or file a bug in BuildKit.

@marcesengel

marcesengel commented Jan 26, 2023

@AndrewFarley I got it working with S3 by using the local cache target of buildx and configuring CodeBuild to do S3 caching on the resulting directory (instead of using buildx's S3 target).
Cache size is about 750 MB for our API, which adds about 35 s each of upload and download, so I think cutting out S3 and instead mounting an EFS will lower our build times by about a minute per image. Thanks for the great idea; we'll be switching to EFS once the initial iteration of our pipeline is up and running.

Edit: Also thanks a lot for your in-depth answer, much appreciated.
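For reference, the EFS/FSx approach discussed here boils down to pointing buildx's local cache backend at the shared mount. A sketch with a hypothetical mount path (command echoed rather than executed, so the sketch is side-effect-free):

```shell
#!/bin/sh
# Hypothetical shared mount (EFS or FSx) visible to every build runner.
CACHE_DIR="/mnt/shared/buildkit-cache/my-app"

# Every runner reads and writes the same on-disk cache directory.
echo docker buildx build \
  --cache-from "type=local,src=$CACHE_DIR" \
  --cache-to "type=local,dest=$CACHE_DIR,mode=max" \
  -t my-app:latest .
```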

@marcesengel

I reckon that even when ECR supports cache manifests, EFS will most likely stay the fastest option because of better transfer rates than ECR, plus one could even store the cache uncompressed on EFS. What do you think? 🤔

@pieterza

Let's keep pressure on getting ECR to support this instead of workarounds 🙏

I definitely do not look forward to manually cleaning up docker registry cache, should it for example go to S3..

@le0m

le0m commented Jan 26, 2023

Using a local cache, which is what you guys are doing by mounting an EFS/FSX volume on the CI machine, will always be faster than using a remote one.

Could you now stop spamming this issue about support for cache manifest in ECR, please? I'd say the workarounds have been discussed at length.

@rchernobelskiy

When we tried EFS about a year ago it was a disaster, I really don't recommend it. FSX seems a different beast though.
