
[Fargate/ECS] [Image caching]: provide image caching for Fargate. #696

Open
matthewcummings opened this issue Jan 14, 2020 · 116 comments
Labels: ECR (Amazon Elastic Container Registry), ECS (Amazon Elastic Container Service), Fargate (AWS Fargate), Work in Progress

Comments


matthewcummings commented Jan 14, 2020

EDIT: as @ronkorving mentioned, image caching is available for EC2 backed ECS. I've updated this request to be specifically for Fargate.

What do you want us to build?
I've deployed scheduled Fargate tasks and been clobbered with high data transfer fees pulling down the image from ECR. Additionally, configuring a VPC endpoint for ECR is not for the faint of heart. The doc is a bit confusing.

It would be a big improvement if there were a resource (network/host) local to the instance where my containers run which could be used to load my docker images.

Which service(s) is this request for?
Fargate and ECR.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
I don't want to be charged for pulling a Docker image every time my scheduled Fargate task runs.
On that note the VPC endpoint doc should be better too.

Are you currently working around this issue?
This was for a personal project; instead, I deployed an EC2 instance running a cron job, which is not my preference. I would prefer using Docker and the ECS/Fargate ecosystem.

@matthewcummings matthewcummings added the Proposed Community submitted issue label Jan 14, 2020
@pavneeta pavneeta added ECS Amazon Elastic Container Service Fargate AWS Fargate labels Jan 14, 2020
@jtoberon jtoberon added the ECR Amazon Elastic Container Registry label Jan 15, 2020
@jtoberon

@matthewcummings can you clarify which doc you're talking about ("The doc is horrific")? Can you also clarify which regions your Fargate tasks and your ECR images are in?


matthewcummings commented Jan 15, 2020

@jtoberon can we have these kinds of things in every region? I generally use us-east-1 and us-west-2 these days.


matthewcummings commented Jan 15, 2020

It seems better now: https://docs.aws.amazon.com/AmazonECR/latest/userguide/vpc-endpoints.html. It has been updated, from what I can see.

However, it still feels like a leaky abstraction. I'd argue that I shouldn't need to know/think about S3 here. Nowhere else in the ECS/EKS/ECR ecosystem do we really see mention of S3.

It would be great if the S3 details could be "abstracted away".

@jtoberon

Regarding regions, I'm really asking whether you're doing cross-region pulls.

You're right: this is a leaky abstraction. The client (e.g. docker) doesn't care, but from a networking perspective you need to poke a hole to S3 right now.

Regarding making all of this easier, we plan to build cross-region replication, and we plan to simplify the registry URL so that you don't have to think as much about which region you're pulling from. #140 has more details and some discussion.

@matthewcummings (Author)

Ha ha, thanks. Excuse my snarkiness... I am not doing cross-region pulls right now, but that is something I may need to do.
Thank you!

@matthewcummings (Author)

@jtoberon your call on whether this should be a separate request or folded into the other one.


ronkorving commented Jan 17, 2020

Wait, aren't you really asking for ECS_IMAGE_PULL_BEHAVIOR control?

This was added (it seems) to ECS EC2 in 2018:
https://aws.amazon.com/about-aws/whats-new/2018/05/amazon-ecs-adds-options-to-speed-up-container-launch-times/

Agent config docs.

I get the impression Fargate does not give control over that, and does not have it set to prefer-cached or once. This is what we really need, isn't it?
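For reference, the EC2-backed behavior mentioned above is an ECS agent setting. A minimal sketch of `/etc/ecs/ecs.config` on a container instance might look like this (the cluster name is a hypothetical placeholder; see the agent config docs for the full set of values):

```
# /etc/ecs/ecs.config
ECS_CLUSTER=my-cluster                 # hypothetical cluster name
# Pull once, then prefer the cached image on subsequent task launches
ECS_IMAGE_PULL_BEHAVIOR=prefer-cached
```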

@matthewcummings matthewcummings changed the title [Fargate/ECS] [Image caching]: provide image caching for ECS. [Fargate/ECS] [Image caching]: provide image caching for Fargate. Jan 18, 2020
@matthewcummings (Author)

@ronkorving yes, that's exactly what I've requested. I wasn't aware of the ECS/EC2 feature... thanks for pointing me to that. However, a Fargate option would be great. I'm going to update the request.


koxon commented Jan 24, 2020

This caching option for Fargate is indeed much needed.


rametta commented Jan 30, 2020

I would like to upvote this feature too.
I'm using Fargate at work and our images are ~1GB, and it takes a very long time to start the task because it needs to re-download the image from ECR every time. If there were some way to cache the image, as is possible for ECS on EC2, it would be extremely beneficial.

@andrestone

How's this evolving?

There are many use cases where what you need is just a Lambda with unrestricted access to a kernel / filesystem. Having Fargate with cached / hot images perfectly fits this use case.


fitzn commented Feb 21, 2020

@jtoberon @samuelkarp I realize that this is a more involved feature to build than it was on ECS with EC2 since the instances are changing underneath across AWS accounts, but are you able to provide any timeline on if and when this image caching would be available in Fargate? Lambda eventually fixed this same cold start issue with the short-term cache. This request is for the direct analog in Fargate.

Our use case: we run containers on-demand when our customers initiate an action and connect them to the container that we spin up. So, it's a real-time use case. Right now, we run these containers on ECS with EC2 and the launch times are perfectly acceptable (~1-3 seconds) because we cache the image on the EC2 box with PULL_BEHAVIOR.

We'd really like to move to Fargate but our testing shows our Fargate containers spend ~70 seconds in the PENDING state before moving to the RUNNING state. ECR reports our container at just under 900MB. Both ECR and the ECS cluster are in the same region, us-east-1.

We have to make some investments in the area soon so I am trying to get a sense for how much we should invest into optimizing our current EC2-based setup because we absolutely want to move to Fargate as soon as this cold start issue is resolved. As always, thank you for your communication.

@Brother-Andy

I wish Fargate had some sort of caching. Due to a missing environment variable, my task kept failing all weekend, and every restart meant a new image was downloaded from Docker Hub. In the end I was faced with horrible traffic charges, since Fargate had been deployed within a private VPC.
Of course there are VPC endpoints (Fargate requires both ECR and S3 endpoints, as I understand it), but some sort of caching would still be a much cheaper and more predictable option.


pgarbe commented Mar 17, 2020

@Brother-Andy For this use-case, I built cdk-ecr-sync which syncs specific images from DockerHub to ECR. Doesn't solve the caching part but might reduce your bill.


pyx7b commented Apr 5, 2020

Ditto on the feature. We use containers to spin up cyber ranges for students. Usage can fluctuate from zero to thousands; Fargate is the best solution for ease of management, but the launch time is a challenge even with ECR. Caching is a much-needed feature.


narzero commented Apr 25, 2020

+1

1 similar comment

klatu201 commented May 5, 2020

+1

@rouralberto

Same here. I need to run multiple Fargate tasks cross-region, and it takes around a minute to pull the image. Once pulled, the task takes only 4 seconds to run. This completely stops us from using Fargate.


nmqanh commented May 29, 2020

We had the same problem: the Fargate task should take only 10 seconds to run, but it takes about a minute to pull the image :(

@congthang1

Is it possible to use an EFS file system to store the image and have the task run from it? Or is that the same problem, just pulling from EFS instead to the host that runs the container?


amunhoz commented Jul 4, 2020

Azure is solving this problem on their platform:
https://stevelasker.blog/2019/10/29/azure-container-registry-teleportation/

@nakulpathak3

+1. We run a very large number of tasks with a ~1GB image. This would significantly speed up our deploys and would be a super helpful feature. We're considering moving to EC2 due to Fargate deployment slowness, and this is one of the factors.

@MattBred

MattBred commented Aug 5, 2020

Currently using the GitLab Runner Fargate driver, which is great except for the spin-up time: ~1-2 minutes for our image (>1 GB) because it has to pull it from ECR for every job. Not super great.

Would really like to see some sort of image caching.

@SpyMachine

> Caching of Docker images in AWS ECS Fargate to reduce data transfer costs is a must. We have huge costs associated with it. Why has AWS not provided this feature for years? I think there is no other reason than that it makes a good case for high billing.

@trivedisorabh I believe cross-region replication is what you're looking for.

https://aws.amazon.com/blogs/containers/cross-region-replication-in-amazon-ecr-has-landed/

@mmarinaccio

Would love an update on this feature! I'm very excited to reduce my organization's Fargate cold start times. The impact will be significant.


lrajca commented Sep 6, 2022

Would also love an update, have been refreshing this issue once a week for 2 years hoping to hear of some progress. Would be a massive game changer.

@vmx-kbabenko

Joining the waiting room for this one.
It would help tremendously in our case to reduce the autoscaling time.


thereforsunrise commented Sep 7, 2022

I am here because our NAT gateway costs went up. Use Fargate they said, it will be easier they said.


maxgashkov commented Sep 7, 2022

@thereforsunrise if that's your only pain with the lack of caching, you may benefit from defining VPC endpoints for ECR & S3 and hosting your Docker images in ECR. That way you won't pay for traffic passing through NAT gateways for image pulls on task launch.
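For anyone setting this up, the endpoints in question are two interface endpoints for ECR plus a gateway endpoint for S3. A rough sketch with the AWS CLI, assuming us-east-1 and placeholder VPC, subnet, security group, and route table IDs:

```shell
# Interface endpoints for the ECR API and the Docker registry endpoint
aws ec2 create-vpc-endpoint --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.ecr.api \
  --subnet-ids subnet-0123456789abcdef0 --security-group-ids sg-0123456789abcdef0

aws ec2 create-vpc-endpoint --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.ecr.dkr \
  --subnet-ids subnet-0123456789abcdef0 --security-group-ids sg-0123456789abcdef0

# Gateway endpoint for S3 (ECR stores image layers in S3)
aws ec2 create-vpc-endpoint --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Gateway \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-0123456789abcdef0
```

Note that Fargate platform version 1.4.0 needs all three (ecr.api, ecr.dkr, and the S3 gateway), per the ECR VPC endpoint docs.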


@thereforsunrise

@maxgashkov Yes we're enabling the VPC endpoints and looking at the pull-through caching options for DockerHub too.

@Maxwell2022

If you are using cross-region replication, cost should not be a problem. The problem is the time it takes to pull the image, which is not acceptable compared with the Kubernetes products from their competitors.

Cross region replication: https://aws.amazon.com/blogs/containers/cross-region-replication-in-amazon-ecr-has-landed/

@mreferre

Thanks for your continued interest, we hear you and are working on a couple of specific areas to address the requirements discussed in this issue. We wanted to take the opportunity to share how we are thinking about the problem and what to expect. In the context of this specific feature request, our ultimate goal is to provide mechanisms that reduce launch times. Caching is one approach to this problem, but as pointed out in a previous update, this is not an easy problem to solve in Fargate as every new task is run on a freshly booted and patched EC2 instance. While caching remains an area of investigation, we are also working towards some alternative approaches to achieve the same goal of reducing launch times.

The first is an easy grab but it involves a specific build workflow to compress images using zstd. Nothing is required server-side (e.g. Fargate), and pull time improvements vary depending on the type of image. This mechanism is available today for use with Fargate and we have recently published a blog post that gets into the details.
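For those wanting to try the zstd route, the approach described in the blog post boils down to a buildx invocation that emits zstd-compressed layers. A sketch, with a placeholder ECR registry URL and tag (assumes a recent buildx; the compression level shown is just an example):

```shell
docker buildx build \
  --output type=image,name=123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp:latest,oci-mediatypes=true,compression=zstd,compression-level=3,force-compression=true,push=true \
  .
```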

Another approach we are working on to reduce pull times is to use the concept of lazy loading. The idea here, in a nutshell, is to keep pulling the image at every launch in the background but to start the container as early as possible. Loading only the essential elements needed to start means that the container can be started before the pull is completed. You may have seen we recently launched the soci-snapshotter open source project which is at the core of this idea. You can also read more about it in this what’s new post. One of our goals is to make this technique available to Fargate customers transparently without changing your current workflows and thus making it work for all existing container images. We don’t yet have timelines to share but we expect this will be made available before an image caching feature specific to Fargate. Just like with the zstd technique, we expect improvements in pull times to vary depending on the image size and type.
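For the curious, the soci-snapshotter project ships a CLI for building and pushing the index out of band. Roughly, per the project README at the time of writing (the image reference is a placeholder, and the commands operate against the local containerd content store):

```shell
# Build a SOCI index for an image already present in containerd
sudo soci create 123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp:latest

# Push the index artifacts to the registry alongside the image
sudo soci push 123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp:latest
```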

As others have noted, ECR does offer features to help minimize transfer costs, which is a bit unrelated to improving Fargate launch times but important to call out. One thing to consider for private networks is the use of VPC endpoints to avoid unnecessary NAT Gateway charges, as discussed by @magoun and @alexjeen. For many use cases, ECR Replication can be used to minimize cross-region transfer costs. For images that are stored upstream in public registries, ECR Pull Through Cache may work for you. We have ongoing work to increase the use cases for PTC through authentication for more upstreams, and would be happy to hear from you.
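For the pull-through cache option mentioned above, a rule is a one-liner. A sketch for caching ECR Public (the repository prefix is a placeholder you choose):

```shell
aws ecr create-pull-through-cache-rule \
  --ecr-repository-prefix ecr-public \
  --upstream-registry-url public.ecr.aws

# Images are then pulled via the caching prefix, e.g.:
#   <account>.dkr.ecr.<region>.amazonaws.com/ecr-public/<repo>:<tag>
```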

@trivedisorabh

Is there any update on this, or am I asking too early?

@MattFellows

Has anyone had much luck with zstd? I haven't found the time to try it out yet, but if anyone else has, I'd be interested to know how much difference it made.

@AffiTheCreator

Hello @MattFellows

I have been on a quest to optimize my company's ECS Fargate containers to try to bring down the startup time.

@mreferre told me about this blog post where he uses buildx to build and push the container image to ECR using the new zstd compression.

I found a 25% reduction in startup time in ECS, and the image size decreased by 50%.
In real numbers, I had a 2.4 GB container that went down to 1.16 GB after compression, and the startup time went from ~90 seconds to ~60.
I'm using a task with 4 vCPU and 8 GB RAM.
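For what it's worth, those raw numbers can be sanity-checked with a quick back-of-the-envelope calculation (figures taken from the comment above):

```python
# Figures reported above: 2.4 GB -> 1.16 GB image, ~90 s -> ~60 s startup
original_gb, compressed_gb = 2.4, 1.16
startup_before_s, startup_after_s = 90, 60

size_reduction = (original_gb - compressed_gb) / original_gb
startup_reduction = (startup_before_s - startup_after_s) / startup_before_s

print(f"image size reduced by {size_reduction:.0%}")       # -> image size reduced by 52%
print(f"startup time reduced by {startup_reduction:.0%}")  # -> startup time reduced by 33%
```

So the size saving is a bit over 50%, and the startup saving works out closer to a third than a quarter, broadly consistent with the rough figures quoted.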

Also take a look at this blog

Also, as a final tip, follow Docker best practices by reducing the number of instructions in your Dockerfile: each instruction creates another layer in the final image, which increases its size and therefore the startup time.


AffiTheCreator commented Nov 25, 2022

ps: the Docker instructions I'm talking about are:

  • FROM creates a layer from the ubuntu:18.04 Docker image.
  • COPY adds files from your Docker client’s current directory.
  • RUN builds your application with make.
  • CMD specifies what command to run within the container.

see this for more information
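As an illustration of that tip, chaining shell steps into a single RUN keeps the layer count down. A hypothetical Dockerfile sketch (not taken from the blog above):

```dockerfile
FROM ubuntu:18.04
COPY . /app
# One RUN (one layer) instead of three separate RUN instructions;
# clean up the apt cache in the same layer so it never gets baked into the image
RUN apt-get update && \
    apt-get install -y --no-install-recommends build-essential && \
    make -C /app && \
    rm -rf /var/lib/apt/lists/*
CMD ["/app/run"]
```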

And I forgot to mention: the buildx setup broke my WSL, and it took a while to fix.

@cb-salaikumar

Is there any update on this thread?


MaazDev commented Feb 6, 2023

It's 2023; any roadmap or news?

@Maxwell2022

> I found a 25% reduction in startup time in ECS and the image size decreased by 50%. Using real values I had a container with 2.4 GB and after compression it went down to 1.16 GB; the startup time went from ~90 seconds to ~60. I'm using a task with 4vCPU and 8GB RAM

I found out I cannot pull these images on my M1 Mac, so I disabled the compression; I prefer being able to pull production images to having them 10-20% smaller. Plus, out of the 2+ minutes for the pod to come online, downloading the image is not the worst part: the worst is registering the Fargate pod in the VPC (at least 60 seconds).


smailc commented Feb 6, 2023

Podman supports pulling zstd images.

@henry118

> I found out I cannot pull these images on my M1 mac so I disabled the compression

@Maxwell2022 which tool did you use to pull zstd images on Mac? If you used docker, it didn't support zstd until the most recent release v23.0.0. Another option on Mac (apart from podman) is finch, you may give it a try.

@Maxwell2022

From v23.0.0? I'm on the latest Docker for Mac and the engine is version 20.10.22.
I didn't use any tool; I was expecting it to work with the docker CLI commands. I verified I had the zstd lib installed, but apparently that was not enough.

@henry118

The v23.0.0 docker engine (open source name is Moby) was released 2 weeks ago. I guess it hasn't been integrated into Docker Desktop yet. Once the engine is upgraded to v23, the cli command will work for zstd images.

@ValtsAusmanis

Is there any update on this? It is surely a solid downside to using Fargate!

@otavioribeiromedeiros

This limitation makes me want to switch to ECS + EC2.


acu192 commented Mar 1, 2023

Yep, I've already given up on Fargate because of this issue, and am using ECS + EC2 for everything now. With EC2 you can tell it to cache your images, so after the first one starts, all subsequent ones start within ~2 seconds. Of course that relies on keeping your EC2 instances running and not letting them cycle, but in our case that's easy to do.

Shame though, because I used to be excited about Fargate.


Maxwell2022 commented Mar 1, 2023

> all subsequent ones start within ~2 seconds

I guess that's assuming your pool doesn't have to add another node, because if it does, you'll probably see the same latency as Fargate, which takes at least a minute to register in the VPC.

@vaibhavkhunger

Hi folks, the Fargate team is invested in solving this issue for our customers. We are actively working on integrating SOCI with Fargate and conducting the due diligence to deliver a frictionless customer experience with minimal onboarding effort. Please give us some more time and we will keep you posted with updates here.

@MattFellows

MattFellows commented Mar 1, 2023 via email


bearrito commented May 3, 2023

@vaibhavkhunger Any further update on this? I have very large image sizes (~5 GB); Fargate is a non-starter for me because of this.

Image sizes cannot be reduced due to the sizable robotics/ai libraries required.

@mmarinaccio

@bearrito My guess is that your robotics/ai libraries install large binaries / artifacts as dependencies. Is that right? If so, you might consider hosting those static assets in S3 or something comparable. This way, your much slimmer container can download the asset during its boot sequence. I've never actually done this, but my understanding is that it's usually best to avoid embedding large assets in docker images.


bearrito commented May 3, 2023

@mmarinaccio It doesn't work that way in my case. The image I receive is produced by an upstream team. There are no large artifacts embedded in the image, e.g. model files.

Pile on a bunch of robotics libraries and you are going to get a big image...

Projects
containers-roadmap: We're Working On It