[Fargate] [docs?]: Clarify Fargate limits #768
Comments
Support discussed this with the Fargate team (case
As per @nathanpeck's tweet, the limits were updated:
Not closing this issue as the docs still refer to EC2 specifically.
Ditto for EKS Fargate. Is it OK if we add an eks label, @tabern? We hit a rate limit that isn't documented in https://docs.aws.amazon.com/eks/latest/userguide/service-quotas.html: you may encounter rate limits imposed by Fargate. It shows up as a pod stuck in Pending:
The Pending pod can be described like so:
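As a rough sketch (pod and namespace names are placeholders, not the original output), inspecting a Pending pod on EKS Fargate generally looks like this:

```bash
# Placeholder names; adjust for your namespace/pod.
kubectl get pods -n demo-ns                 # the affected pod shows STATUS=Pending
kubectl describe pod demo-pod -n demo-ns    # the Events section explains why it has not been scheduled
```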
Good thing we caught this before going to prod! What are the launch rate limits in EKS Fargate?
@sandan this sounds like a TPS (tasks per second) throughput limit. This is a soft limit that the team can lift on an as-needed basis. Did it eventually come up? The EKS Fargate scheduler should retry and eventually succeed. Note that this limit is orthogonal to the limit on the number of concurrent Fargate tasks/pods you can have in an account/region at any point in time (also a soft limit).
@mreferre No, it stayed Pending for 56 minutes, then I just scaled it back down. I was testing some autoscaling policies with the Horizontal Pod Autoscaler and trying to get a sense of pod cold-start latencies. There was at least a minute between scaling events, though. It would be great to document throughput limits in the EKS user guide. I ran the same tests the day before and was able to scale past 256 with no quota issues. It is good to know it is a soft limit, but without knowing the TPS throughput limit, how would we know what to ask for in our service quota increase request?
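A minimal sketch of the kind of HPA scale-out test described above; the deployment name, CPU target, and replica counts are assumptions:

```bash
# Hypothetical deployment name and thresholds.
kubectl autoscale deployment demo-app --cpu-percent=50 --min=1 --max=300
# or push the scale-out directly to observe pod cold-start latency:
kubectl scale deployment demo-app --replicas=300
kubectl get pods --watch
```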
We don't publish these numbers, for various reasons. Mostly because knowing the rate limits of a single service may not be enough, as an operation may depend on other limits (the EC2 page that discusses this is a good example of the concept). As I said, these limits can be raised, but that usually requires a hand-holding process to define scope, use case, and other details. For now, the best approach is to assume you can hit those limits and build retry logic (better if it includes back-off between retries). I am also intrigued that the (single?) pod was pending for so long. How many pods were you deploying, and/or were there other tests going on in the same account/region? I am asking because the EKS Fargate scheduler does implement retries (with back-off, etc.), so unless there was a lot of deployment activity going on it should have succeeded. There is a limit to these retries, but again, if this was one pod (or a few pods) that shouldn't have been a problem. If you have time to invest in this and are able to replicate the problem, I would be eager to better understand your setup. @Vlaaaaaaad has done some interesting testing on scaling that you can watch here; I thought you might be interested.
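A minimal sketch of the suggested client-side retry with exponential back-off around a Fargate task launch; the cluster, task definition, and subnet are placeholders, and ECS RunTask is used only as the example API:

```bash
# Retry a Fargate task launch with exponential back-off; all identifiers are placeholders.
for attempt in 1 2 3 4 5; do
  if aws ecs run-task \
      --cluster demo-cluster \
      --launch-type FARGATE \
      --task-definition demo-task:1 \
      --network-configuration 'awsvpcConfiguration={subnets=[subnet-abc123],assignPublicIp=ENABLED}'; then
    break
  fi
  sleep $(( 2 ** attempt ))   # back off 2s, 4s, 8s, 16s, 32s before retrying
done
```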
It would be good to know the Fargate pod launch limit, even if this parameter has a complex relationship with other services. Having even approximate values would give us an understanding of how aggressive a scaling approach we can implement for our tasks.
@bothra90 do the pending tasks show
@mreferre: Yes, it did. The pods eventually ran after being stuck in the Pending state for ~60 minutes, which makes me think the launch rate is enforced at an hourly level. As @Lefthander said, having documentation for what these limits are would be tremendously helpful.
I am wondering if there is a pattern here. I just ran the following experiment. This is the content of it:
It took 7 minutes and a bunch of seconds to transition all of them into
@mreferre: Yes, the problem is fairly reproducible for us - instead of running 300 replicas of one job, we were trying to schedule 300 separate jobs. Could you try that and see if you hit the same issue?
@bothra90 to be clear, I launched a deployment with 300 replicas.
And all 300 standalone pods came up in about the same amount of time (roughly 8 minutes). Given you are mentioning separate jobs, I will try that as well. Otherwise, if you can provide a yaml/script that I could use to replicate the problem, that would help. Thanks.
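A rough sketch of the two launch patterns being compared (one Deployment with 300 replicas vs. 300 standalone Jobs); the image and names are placeholders:

```bash
# (a) one Deployment with 300 replicas
kubectl create deployment sleeper --image=public.ecr.aws/docker/library/busybox --replicas=300 -- sleep 3600
# (b) 300 separate Jobs
for i in $(seq 1 300); do
  kubectl create job sleeper-$i --image=public.ecr.aws/docker/library/busybox -- sleep 3600
done
kubectl get pods --watch   # count how many reach Running and how long it takes
```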
In my case, it never even recovered from Pending. Basically, the cluster is not usable for me, with no clear path for overcoming these issues. I almost gave up on the whole idea of using EKS, because I had zero chance of launching even a single pod.
Very good |
Community Note
Tell us about your request
I'd like some clarification around the limits for ECS on Fargate. In the documented limits we see a clear limit for ECS on EC2:
Even though no relevant Fargate limits are specified, I can't seem to scale one service using Fargate Spot above 1,000 tasks, no matter what I do. Adding a second service works fine, but that service is also limited to 1,000 tasks.
This is the error I get:
Error: InvalidParameterException: Desired count exceeds limit.
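For reference, a sketch of the kind of call that returns this error; the cluster and service names are made up:

```bash
aws ecs update-service \
  --cluster bench-cluster \
  --service spot-bench \
  --desired-count 1500
# => InvalidParameterException: Desired count exceeds limit.
```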
Does the limit apply to Fargate too (not just EC2)? Is this a limit that can be increased? I'd love some more details about what the Fargate limits are. Are the limits for Fargate Spot different? What about ECS on Fargate vs EKS on Fargate?
Which service(s) is this request for?
ECS on Fargate
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
I am doing some benchmarks :)
Are you currently working around this issue?
Multiple services with the same definition seem to work fine.
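A sketch of that workaround, splitting the workload across several services that share one task definition, each kept under the per-service cap (all names and the subnet are placeholders):

```bash
for i in 1 2 3; do
  aws ecs create-service \
    --cluster bench-cluster \
    --service-name spot-bench-$i \
    --task-definition bench-task:1 \
    --desired-count 1000 \
    --capacity-provider-strategy capacityProvider=FARGATE_SPOT,weight=1 \
    --network-configuration 'awsvpcConfiguration={subnets=[subnet-abc123],assignPublicIp=ENABLED}'
done
```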