
[Fargate] [docs?]: Clarify Fargate limits #768

Open
Vlaaaaaaad opened this issue Feb 24, 2020 · 15 comments
Labels
Docs Fargate Proposed

Comments


@Vlaaaaaaad Vlaaaaaaad commented Feb 24, 2020

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request
I'd like some clarification around the limits for ECS on Fargate. In the documented limits we see a clear limit for ECS on EC2:

Tasks using the EC2 launch type per service (the desired count): 1,000

Even though no relevant Fargate limits are documented, I can't scale a single service using Fargate Spot above 1,000 tasks, no matter what I do. Adding a second service works fine, but that service is also limited to 1,000 tasks.

This is the error I get: Error: InvalidParameterException: Desired count exceeds limit.

Does the limit apply to Fargate too (not just EC2)? Is this a limit that can be increased? I'd love some more details about what the Fargate limits are. Are the limits for Fargate Spot different? What about ECS on Fargate vs EKS on Fargate?

Which service(s) is this request for?
ECS on Fargate

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
I am doing some benchmarks :)

Are you currently working around this issue?
Yes: multiple services with the same task definition seem to work fine.
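The workaround can be sketched as a small script that spreads a total desired count across several services, each kept at or below the per-service task limit (1,000 at the time this issue was opened). The cluster and service names below are hypothetical, and the `aws` calls are echoed as a dry run rather than executed:

```shell
TOTAL=2500
PER_SERVICE_LIMIT=1000
CLUSTER=benchmark-cluster

# Ceiling division: how many services are needed for the total count
SERVICES=$(( (TOTAL + PER_SERVICE_LIMIT - 1) / PER_SERVICE_LIMIT ))

for i in $(seq 1 "$SERVICES"); do
  # Give each service up to the limit, and the remainder to the last one
  COUNT=$(( TOTAL > PER_SERVICE_LIMIT ? PER_SERVICE_LIMIT : TOTAL ))
  TOTAL=$(( TOTAL - COUNT ))
  echo aws ecs update-service --cluster "$CLUSTER" \
    --service "benchmark-$i" --desired-count "$COUNT"
done
```

For 2,500 tasks this would print three `update-service` dry-run lines: two services at 1,000 and one at 500.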

@Vlaaaaaaad Vlaaaaaaad added the Proposed label Feb 24, 2020
@tabern tabern added Docs Fargate labels Mar 2, 2020

@Vlaaaaaaad Vlaaaaaaad commented Mar 23, 2020

Support discussed this with the Fargate team (case 6755894161) and confirmed the limitation:

[...] In continuation, they have mentioned that the number of tasks which can be running in each service is limited to 1,000, which is why the number of tasks did not increase. If you want to run 10,000 tasks concurrently, they have suggested making use of 10 ECS services


@Vlaaaaaaad Vlaaaaaaad commented Jul 7, 2020

As per @nathanpeck's tweet, the limits were updated:

ECS default limits are raising again:

Old: 1000 tasks per service, 1000 services per cluster
New: 2000 tasks per service, 2000 services per cluster

As always these are adjustable quotas so feel free to ask if you need more! We tend to raise these defaults based on frequent asks

Not closing this issue as the docs still refer to EC2 specifically.


@sandan sandan commented Feb 5, 2021

Ditto for EKS Fargate. Is it ok if we add an eks label @tabern ? We hit a rate limit that isn't documented in https://docs.aws.amazon.com/eks/latest/userguide/service-quotas.html.

You may encounter rate limits imposed by Fargate. This looks like a pod stuck in pending:

$ kubectl get deploy
NAME         READY     UP-TO-DATE   AVAILABLE   AGE
echoserver   255/256   256          255         4d20h

$ kubectl get pods |grep -i pending
echoserver-c776f8bdd-8ftkx   0/1     Pending   0          56m

The Pending pod can be described like so:

Events:
  Type     Reason            Age        From               Message
  ----     ------            ----       ----               -------
  Warning  FailedScheduling  <unknown>  fargate-scheduler  Fargate pod launch rate exceeded

Good thing we caught this before going to prod! What are the launch rate limits in EKS Fargate?


@mreferre mreferre commented Feb 5, 2021

@sandan this sounds like a TPS (tasks per second) throughput limit. This is a soft limit that the team can lift on an as-needed basis. Did the pod eventually come up? The EKS Fargate scheduler should keep retrying and eventually succeed. Note this limit is orthogonal to the limit on concurrent Fargate tasks/pods you can have in an account/region at any point in time (also a soft limit).


@sandan sandan commented Feb 6, 2021

@mreferre No it stayed pending for 56 minutes then I just scaled it back down. I was testing some autoscaling policies with the Horizontal Pod Autoscaler and trying to get a sense for pod cold start latencies. There was at least a minute between scaling events though. It would be great to document throughput limits in the EKS user guide.

I ran the same tests the day before and was able to scale past 256 with no quota issues. It is good to know it is a soft limit but without knowing the TPS throughput limit, how would we know what to increase it to in our service quota request?


@mreferre mreferre commented Feb 12, 2021

We don't publish these numbers, for various reasons. Mostly because knowing the rate limits of a single service may not be enough, as an operation may depend on other limits (the EC2 page that discusses this is a good example of the concept). As I said, these limits can be softened, but it usually requires a hand-holding process to define scope, use case, and other things. For now, the best approach is to assume you can hit these limits and build retry logic (ideally with exponential back-off).
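The retry-with-back-off logic suggested above could be sketched as a small shell helper. The helper itself is hypothetical, not part of any AWS tooling, and the `aws` usage line at the bottom is purely illustrative:

```shell
# Retry a command, doubling the wait between attempts (exponential back-off).
retry_with_backoff() {
  max_attempts=$1; shift
  delay=1
  attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    sleep "$delay"
    delay=$(( delay * 2 ))        # double the wait on each failure
    attempt=$(( attempt + 1 ))
  done
}

# Usage (illustrative; cluster/task names are placeholders):
#   retry_with_backoff 5 aws ecs run-task --cluster my-cluster --task-definition my-task
```

A real caller would also want to treat throttling errors differently from hard failures, but the shape of the loop is the same.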

I am also intrigued by the fact that the (single?) pod was pending for so long. How many pods were you deploying, and/or were there other tests going on in the same account/region? I am asking because the EKS Fargate scheduler does implement retries (with back-offs etc.), so unless there was a lot of deployment activity going on it should have succeeded. There is a limit to these retries, but again, if this was one pod (or a few pods) this shouldn't have been a problem. If you have time to invest in this and are able to reproduce the problem, I would be eager to better understand your setup.

@Vlaaaaaaad has done some interesting testing on scaling that you can watch here. I thought you may be interested.


@bothra90 bothra90 commented Mar 14, 2021

@mreferre: We have faced the same problem as @sandan mentions above - pods in EKS are stuck in pending forever and the fargate-scheduler doesn't retry scheduling them after we hit the launch rate limit.


@Lefthander Lefthander commented Mar 14, 2021

It would be good to know the Fargate pod launch rate limit, even if this parameter has a complex relationship with other services. Having even approximate values would give us an understanding of how aggressive a scaling approach we can implement for our tasks.


@mreferre mreferre commented Mar 14, 2021

@bothra90 do the pending tasks show Fargate pod launch rate exceeded as the message?


@bothra90 bothra90 commented Mar 15, 2021

@mreferre: Yes it did. The pods eventually ran after being stuck in pending state for ~60 mins, which makes me think the launch rate is enforced at an hourly level. As @Lefthander said, having documentation for what these limits are will be tremendously helpful.


@mreferre mreferre commented Mar 17, 2021

I am wondering if there is a pattern here. I just ran the following experiment: kubectl apply -f myweb.yaml.

This is the content of it:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myweb
spec:
  selector:
    matchLabels:
      run: myweb
  replicas: 300
  template:
    metadata:
      labels:
        run: myweb
    spec:
      containers:
        - name: my-nginx
          image: nginx
          ports:
            - containerPort: 80

It took 7 minutes and change to transition all of them into Running. As I said, I am wondering if there is some race condition of sorts that triggers these long back-offs. It's hard for me to speculate. @bothra90 are you in a position to open a ticket with support to debug this or, EVEN BETTER, do you have a way to reproduce the problem you are seeing?


@bothra90 bothra90 commented Mar 28, 2021

@mreferre: Yes, the problem is fairly reproducible for us - instead of running 300 replicas of one job, we were trying to schedule 300 separate jobs. Could you try that and see if you hit the same issue?


@mreferre mreferre commented Mar 29, 2021

@bothra90 to be clear, I launched a deployment with 300 replica pods (I did not use Kubernetes jobs). I also launched 300 independent pods with the following:

for i in {1..300}; do kubectl run nginx$i --image=nginx; done

And all 300 standalone pods came up in about the same amount of time (roughly 8 minutes).

Given you are mentioning jobs, is it possible you are actually running Kubernetes Jobs and they are not cleaned up? (Reference: #255.) When this happens, Fargate capacity is NOT released because, while the Job is no longer running, the pod object remains around by default. In that case the issue would not be about throttling but rather the limit on concurrent Fargate tasks/pods you can have in the account/region.
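If lingering completed Jobs are indeed the cause, one mitigation (assuming a Kubernetes version where the TTL-after-finished controller is available) is to set `ttlSecondsAfterFinished` so the Job and its pod are deleted shortly after completion. The manifest below is an illustrative sketch, not the reporter's actual workload:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job            # illustrative name
spec:
  ttlSecondsAfterFinished: 60  # delete the Job (and its pod) 60s after it finishes
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: busybox
          command: ["sh", "-c", "echo done"]
```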

Otherwise, if you can provide a yaml/script that I could use to replicate the problem that would help.

Thanks.


@sneerin sneerin commented Sep 7, 2021

in my case, it never even recovered from pending. Basically, the cluster is not usable for me, with no clear path on how to overcome these issues. I almost gave up on the overall idea of using EKS due to 0 chances to launch a single pod.


@mreferre mreferre commented Sep 9, 2021

> in my case, it never even recovered from pending. Basically, the cluster is not usable for me, with no clear path on how to overcome these issues. I almost gave up on the overall idea of using EKS due to 0 chances to launch a single pod.

The almost is promising and I am happy to try to help if I can. I am not sure how your cluster is configured (Fargate only? EC2 only? Mixed?) but perhaps the best starting point would be to describe the pod and check what the reason is for it to be pending.


7 participants