[Fargate] [request]: Add higher vCPU / Memory options #164


Closed
mbj opened this issue Feb 13, 2019 · 54 comments
Assignees
Labels
Fargate AWS Fargate Proposed Community submitted issue

Comments

@mbj

mbj commented Feb 13, 2019

Tell us about your request

Increase maximum allowed vCPU / Memory resources of Fargate tasks.

Which service(s) is this request for?

Fargate

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

I want to offload computationally heavy tasks that can only be parallelized locally to Fargate, without having to boot an EC2 instance and take on its associated maintenance overhead.

An example of such a task is the compilation of GHC (the Haskell compiler). Its build system allows parallel computation, but no distribution.

Are you currently working around this issue?

Considering scripting the use of a bigger EC2 instance, with its associated maintenance overhead.

Additional context

None.

@mbj mbj added the Proposed Community submitted issue label Feb 13, 2019
@mbj mbj changed the title [service] [request]: describe request here [Fargate] [request]: describe request here Feb 13, 2019
@mbj mbj changed the title [Fargate] [request]: describe request here [Fargate] [request]: Add higher vCPU / Memory options Feb 13, 2019
@abby-fuller abby-fuller added the Fargate AWS Fargate label Feb 14, 2019
@fengyj

fengyj commented May 15, 2019

Yes, please support more CPU and more EBS. @abby-fuller, does AWS have any plan for this?

@srinivaspype

Any SLA for this? The current Fargate implementation provides general-purpose CPU clock speeds (2.2–2.3 GHz) and is not capable of running CPU/GPU-critical applications.

@mavericknavs

Please extend the current vCPUs limit. It will really help a lot of our Customers.

@Djolivald

Any idea whether other CPU options are even being considered? This has been 'proposed' for a while without feedback from AWS.
Would love to have 8/16/32 vCPU options, without having to take care of EC2 maintenance.

@luxaritas

Recently discovered Fargate, and the idea of not having to manually provision instances, keep the underlying instance up to date, etc. is very attractive. However, a particular workload we have, which involves some ML model evaluation, requires somewhere between 60 and 100 GB of RAM. We will likely fall back to a dedicated ECS/EC2 cluster with autoscaling, but would rather let Fargate handle the process...

@sreeninair

I had a similar CPU spikes issue when I migrated to Fargate. In the task definition I initially specified 4 vCPUs (the maximum), which was not sufficient for my app. After setting the container-level cpu field (container >> Environment >> cpu) to 4096, the CPU spikes are comparatively lower. This cpu field is optional for the Fargate launch type, but without it I suspect only 4 vCPUs are allocated across the total number of tasks running, though I am not sure. @luxaritas Similarly, setting container >> Environment >> memory to 32 GB might fix your issue.

AWS definitely needs to raise the limit to a minimum of 8 vCPUs.

@andreasunterhuber

@abby-fuller any update on this request? 8 vcpu would be a start!

@ostwalprasad

Fargate Compute Optimized CPUs ftw!

@harshdubey-sb

Please increase the maximum allowed vCPU/memory for Fargate tasks to at least 8 vCPUs.

@sanjeevpande26

Any update on this?

@ostwalprasad

Can the newly launched AWS App Runner run on more CPUs? Is that an alternative to Fargate?

@mreferre

mreferre commented Jun 7, 2021

Can the newly launched AWS App Runner run on more CPUs? Is that an alternative to Fargate?

No. App Runner exposes a subset of the native Fargate options: it can run applications with up to 2 vCPU and 4 GB of memory per instance.

@mbj
Author

mbj commented Jun 7, 2021

I actually suspect that App Runner is internally implemented on Fargate, much like Fargate is implemented on EC2.

@mreferre

App Runner is indeed built on top of Fargate but (today) it cannot be configured to take advantage of all the CPU/mem configurations "raw" Fargate offers.

@sumitverma

At least 8 vCPUs would be helpful. This is holding us back from migrating fully to Fargate.

@rokopi-byte

I second that, more vCPU would be very useful

@tiivik

tiivik commented Jul 14, 2021

Second that, more vCPU on Fargate would be very useful

@jojofeng

I see that this issue is categorized in the "Researching" project; is there any estimate of when we might see some additional progress on this task? Would love to deploy one of our core services (with high CPU demand) to Fargate, but will likely only be able to do so once Fargate supports up to 12 vCPUs :)

@peegee123

The compute limits are far too restrictive for many applications (e.g. custom video/audio processing).
Until this is addressed, Fargate is not a viable solution, which is a real shame.

@mreferre

Thanks @peegee123 for the feedback. We heard this loud and clear and we want to lift this limitation. Stay tuned.

@omieomye omieomye self-assigned this Nov 10, 2021
@alexjeen

alexjeen commented Dec 8, 2021

We are indeed also heavily reliant on Fargate and are constantly maxing out our containers now; we would throw more money your way if we could get more vCPUs (16, maybe) and up to 64 GB of RAM :) !

Some people have asked how we solve it now:
https://gist.github.com/alexjeen/984dd2b092ffa49e1c3bf4f6505d0ebe

Basically, we have an ECS ASG set to a desired capacity of 0; if we then add a task with a placement constraint of XXL or XXXL, it will create a new EC2 instance and place the task there.

When the task is done running, the EC2 instance is destroyed, so you basically only pay for what you use.

When Fargate gets higher vCPU and memory options, we would drop this approach.
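A minimal sketch of the run-task side of this pattern (the cluster name, task definition, and the `size` instance attribute below are hypothetical placeholders; the gist above has the full setup):

```python
def build_run_task_request(cluster, task_definition, size_attribute):
    """Build an ECS RunTask request that pins the task to container
    instances carrying a custom `size` attribute (e.g. XXL), so the
    capacity provider scales the ASG from 0 to 1 to host it."""
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "launchType": "EC2",
        "placementConstraints": [
            {
                "type": "memberOf",
                "expression": f"attribute:size == {size_attribute}",
            },
        ],
    }

# Usage (requires boto3 and AWS credentials):
#   import boto3
#   ecs = boto3.client("ecs")
#   ecs.run_task(**build_run_task_request("my-cluster", "heavy-task", "XXL"))
```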

@nwsparks

The ability to be more flexible with the combinations would be nice as well. Currently it is not possible to choose something like 2 vCPUs with 2 GB of memory.

@sourav-crossml

sourav-crossml commented Jan 20, 2022

I am using 4 GB of RAM and 4 vCPUs on Fargate. Is it possible to dynamically allocate hardware for every request?

@mreferre

mreferre commented Feb 9, 2022

@jbidinger we are actively working on it.

@tf401

tf401 commented Feb 17, 2022

We are indeed also heavily reliant on Fargate and are constantly maxing out our containers now; we would throw more money your way if we could get more vCPUs (16, maybe) and up to 64 GB of RAM :) !

Some people have asked how we solve it now: https://gist.github.com/alexjeen/984dd2b092ffa49e1c3bf4f6505d0ebe

Basically, we have an ECS ASG set to a desired capacity of 0; if we then add a task with a placement constraint of XXL or XXXL, it will create a new EC2 instance and place the task there.

When the task is done running, the EC2 instance is destroyed, so you basically only pay for what you use.

When Fargate gets higher vCPU and memory options, we would drop this approach.

This is great, I've begun to implement this since I had similar problems when running a task on FARGATE.
However, I'm using AWS CDK, and my current problem is that when I run a task via the AWS CLI it spins up an EC2 instance and runs the task, but the instance never gets stopped/terminated. I noticed the Auto Scaling group changing its desired capacity from 0 to 1, and I reckon this is the problem. Did you encounter a similar problem, and how did you solve it?

For reference, here is a snippet of my AWS CDK code.

# Imports assumed for this snippet (CDK v2 style):
import aws_cdk as cdk
from aws_cdk import aws_ec2 as ec2, aws_ecs as ecs

auto_scaling_group = cdk.aws_autoscaling.AutoScalingGroup(self, "MyAsg",
    vpc=vpc,
    instance_type=ec2.InstanceType("t2.xlarge"),
    machine_image=ecs.EcsOptimizedImage.amazon_linux(),
    # Or use the Amazon ECS-Optimized Amazon Linux 2 AMI:
    # machine_image=ecs.EcsOptimizedImage.amazon_linux2(),
    desired_capacity=0,
    max_capacity=1,
    min_capacity=0,
    new_instances_protected_from_scale_in=False, # unsure?
    cooldown=cdk.Duration.seconds(30)
)

capacity_provider = ecs.AsgCapacityProvider(self,
    "AsgCapacityProvider",
    auto_scaling_group=auto_scaling_group,
    capacity_provider_name='AsgCapacityProvider', # if this is not specified, the capacity provider fails. Seems like a bug for name ref via id
)
cluster.add_asg_capacity_provider(capacity_provider)

EDIT:
After some research I found this article:
https://aws.amazon.com/blogs/containers/deep-dive-on-amazon-ecs-cluster-auto-scaling/

In particular, step 4, which states that after 15 minutes (or 15 datapoints) it scales in (down). So after waiting 15 minutes I achieved the desired outcome. What I haven't found is how to configure this interval. Ideally I would like a scale-in directly after a task finishes, but if I could configure this to 1–5 minutes I would be happy.

@omieomye

Quick update: this is a top priority for us, we're actively developing on it, and will move it to the next roadmap phase within a few weeks.

@ghomem

ghomem commented Apr 25, 2022

Quick update: this is a top priority for us, we're actively developing on it, and will move it to the next roadmap phase within a few weeks.

Would it be possible to provide a rough estimation of when this would be ready and what would be the new memory and vCPU limits that you are aiming for?

It would be very useful if this information could be shared (the answer would not be taken as a commitment but rather as an indication).

Thanks in advance.

@luiszimmermann

Any updates about that? We are starting to consider the change to EC2 because of this limitation.

@mreferre

@luiszimmermann it's coming soon. If you have an AWS employee you work with (AM/TAM/SA/etc), can you ask them to ping me referencing you and this thread? If you don't have AWS contacts you work with, can you send me an email to mreferre at amazon dot com?

Thanks.

@ascrookes

As @ghomem asked, can you release any information about how high the memory and vCPU limits might be set? It would be helpful to have even a ballpark estimate to know if these changes could make Fargate a viable option for us.

@mreferre

@ascrookes we can't release more information publicly before the launch. Sorry about that.

@mattaltberg

Following

@sammcj

sammcj commented Jul 1, 2022

We're half way through 2022 and ECS Fargate still has a maximum of 4 vCPUs - this really doesn't play nice with enterprise clients running Java in containers 🤣

@omieomye

omieomye commented Jul 1, 2022

Thanks for the patience on this. We'll announce increased vCPU and memory options for Fargate-based tasks and pods within a few weeks.

@dcalde

dcalde commented Jul 25, 2022

Would love to be able to run 60G tasks to spin up ephemeral databases for adhoc data processing.

@ghost

ghost commented Jul 26, 2022

This limitation is holding us back too, as we're relying on node worker threads to parallelise work which can't realistically be spread across network communications.

@chrisempson-kmt

chrisempson-kmt commented Jul 26, 2022

Hi @omieomye - it has been 25 days since you mentioned that the increased vCPU and memory options will be available in a few weeks. Can you share an update on the timing of the launch of these new options? Is it likely to be a few days, a few more weeks or a few more months? I'm asking because one of my Fargate workloads is hitting the 30GB memory limitation, and I'm trying to decide whether to abandon Fargate and run it on an ECS cluster or EC2 instance, or wait until the increased limits launch. Thanks. [Edit: grammar]

@gabrielkoo

Looking forward to increased vCPU options; it will definitely be one of the greatest milestones for the Fargate service since it launched! (After ECS Exec.)

@raniel90

raniel90 commented Aug 4, 2022

Hi, guys. To work around this, I've built two callbacks in Apache Airflow that create an ECS cluster using an EC2 container instance. You can choose any desired hardware, but you must use an AMI with ECS capabilities. As an alternative, you can use boto3 directly to integrate with AWS.

Declare these constants:

ECS_EC2_AMI = "ami-09ce6553a7f2ae75d"
ECS_EC2_IAM_ROLE = "ecsInstanceRole"
ECS_ARGS = {
    "region_name": "YOUR-REGION",
    "aws_conn_id": "aws_default",
    "task_definition": "taskdef",
    "cluster": "YOUR-CLUSTER-NAME",
    "launch_type": "FARGATE",
    "platform_version": "1.4.0",
    "network_configuration": {
        "awsvpcConfiguration": {
            "securityGroups": ["YOUR-SECURITY-GROUPS"],
            "subnets": ["YOUR-SUBNETS"],
            "assignPublicIp": "ENABLED",
        }
    },
    "retries": 0,
}

Function to create the cluster (create the cluster, launch the EC2 instance, register it as a container instance, and wait for it to be ready):

import pprint
import traceback
import time
import boto3

def on_success_callback_start_ecs_cluster_container_instance(kwargs, instance_type):
    try:
        has_container_instance = False

        if instance_type:
            ## Force delete before create to not create many EC2 Instances
            on_callback_stop_ecs_cluster_container_instance(kwargs)

            region_name = ECS_ARGS.get("region_name")
            ecs_client = boto3.client("ecs", region_name=region_name)
            ec2_client = boto3.client("ec2", region_name=region_name)
            ec2_resource = boto3.resource("ec2", region_name=region_name)

            cluster_name = kwargs["dag_run"].conf.get("cluster_name")
            ecs_client.create_cluster(
                clusterName=cluster_name,
                tags=[
                    {"key": "PROJECT", "value": str(cluster_name).upper()},
                ],
            )

            pprint.pprint(
                f"Cluster {cluster_name} created with success on {ECS_ARGS.get('region_name')}"
            )

            security_group_ids = (
                ECS_ARGS.get("network_configuration")
                .get("awsvpcConfiguration")
                .get("securityGroups")
            )
            subnet_id = (
                ECS_ARGS.get("network_configuration")
                .get("awsvpcConfiguration")
                .get("subnets")[1]
            )

            ec2_response = ec2_client.run_instances(
                ImageId=ECS_EC2_AMI,
                MinCount=1,
                MaxCount=1,
                SecurityGroupIds=security_group_ids,
                SubnetId=subnet_id,
                InstanceType=instance_type,
                IamInstanceProfile={"Name": ECS_EC2_IAM_ROLE},
                TagSpecifications=[
                    {
                        "ResourceType": "instance",
                        "Tags": [
                            {
                                "Key": "PROJECT",
                                "Value": str(cluster_name).upper(),
                            },
                        ],
                    },
                ],
                UserData=(
                    # Leading whitespace stripped: the shebang must be the
                    # very first bytes of the user-data script.
                    "#!/bin/bash\n"
                    f"echo ECS_CLUSTER={cluster_name} >> /etc/ecs/ecs.config\n"
                    "echo ECS_AVAILABLE_LOGGING_DRIVERS='[\"json-file\",\"awslogs\"]' >> /etc/ecs/ecs.config\n"
                    "echo ECS_ENABLE_AWSLOGS_EXECUTIONROLE_OVERRIDE=true >> /etc/ecs/ecs.config\n"
                ),
            )

            pprint.pprint(f"EC2 create response: {ec2_response}")

            instance_id = ec2_response.get("Instances")[0].get("InstanceId")
            instance = ec2_resource.Instance(instance_id)

            pprint.pprint(f"Waiting for instance {instance_id} to start in {region_name}...")

            instance.wait_until_running()

            pprint.pprint(
                f"Instance {instance_id} is running on {ECS_ARGS.get('region_name')}"
            )

            while not has_container_instance:
                response = ecs_client.list_container_instances(cluster=cluster_name)

                if response["containerInstanceArns"]:
                    has_container_instance = True
                    print("Container instance registered on the cluster.")
                else:
                    print(
                        "No container instance registered on the cluster yet. Retrying..."
                    )
                    time.sleep(20)
    except Exception:
        traceback.print_exc()
        raise Exception("Internal error")

Function to stop the cluster (terminate the EC2 container instances, wait for them to be terminated, then delete the cluster):

import pprint
import traceback
import time
import boto3

def on_callback_stop_ecs_cluster_container_instance(kwargs):
    response = None
    region_name = ECS_ARGS.get("region_name")
    cluster_name = kwargs["dag_run"].conf.get("cluster_name")
    ecs_client = boto3.client("ecs", region_name=region_name)
    ec2_client = boto3.client("ec2", region_name=region_name)
    ec2_resource = boto3.resource("ec2", region_name=region_name)

    try:
        response = ecs_client.list_container_instances(cluster=cluster_name)
    except Exception:
        # Narrowed from a bare except: the cluster may simply not exist.
        print(f"Cluster {cluster_name} not found. Skipping cluster delete.")

    if response and response["containerInstanceArns"]:
        container_instance_resp = ecs_client.describe_container_instances(
            cluster=cluster_name, containerInstances=response["containerInstanceArns"]
        )
        for ec2_instance in container_instance_resp["containerInstances"]:
            ec2_client.terminate_instances(
                DryRun=False,
                InstanceIds=[
                    ec2_instance["ec2InstanceId"],
                ],
            )

            instance_id = ec2_instance["ec2InstanceId"]
            instance = ec2_resource.Instance(instance_id)

            pprint.pprint(
                f"Waiting for instance {instance_id} to be terminated in {region_name}..."
            )

            instance.wait_until_terminated()

            pprint.pprint(f"Instance {instance_id} was terminated on {region_name}")

        ecs_client.delete_cluster(cluster=cluster_name)

        pprint.pprint(f"Cluster {cluster_name} deleted with success on {region_name}")

@sanderv32

When can we expect the vCPU increase? 4 vCPUs is not enough for a workload we are running.

@yogendratamang48

Our testing and staging setups run on FARGATE, but production needs higher limits. We added EC2 for prod, but this is something we wanted to avoid. I hope AWS announces this within the month.

@neer0089

neer0089 commented Aug 31, 2022

Hitting the vCPU limit doing some video transcoding tasks every now and then. Badly need the vCPU increase.

@innix

innix commented Sep 10, 2022

Seems like the delay is because they first need to migrate the old Fargate quota system to the new vCPU-based quota system:
https://aws.amazon.com/blogs/containers/migrating-fargate-service-quotas-to-vcpu-based-quotas/

The article explains good reasons for this:

Over the past five years, quotas have been based on the total number of concurrent Amazon ECS tasks and Amazon EKS pods running at a given time. However, Fargate offers various task and pod sizes, from 0.25 vCPU per task/pod up to four vCPUs per task/pod. Additionally, many customers have asked for even larger task sizes for Fargate on our public roadmap. A quota that enforces an absolute task and pod count of 1,000 tasks and pods no longer makes sense given the wide range of task and pod sizes. If you are launching the 0.25 vCPU task size, then the task count limit of 1,000 tasks only allows you to launch 250 vCPUs. Meanwhile, someone launching 4 vCPU tasks can launch up to 4,000 vCPUs. So, the real amount of computing power available to you varies widely based on how you choose to size your tasks and pods.


If I'm right, then I would guess that the higher-vCPU feature will be rolled out in either October or November, based on the transitional rollout dates for the new quota system (dates are in the linked blog post).

Hopefully someone from AWS can give us an official update since we haven't had one since the start of July. @omieomye , do you have any new information you can share?
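The arithmetic in the quoted passage is easy to check: under the old 1,000-task quota, usable vCPU capacity depends entirely on the task size you pick.

```python
# Under a fixed task-count quota, total launchable vCPUs scale linearly
# with per-task size, which is the blog post's argument for vCPU quotas.
TASK_QUOTA = 1000  # old absolute task/pod count limit

def max_vcpus(task_size_vcpu, task_quota=TASK_QUOTA):
    # Total vCPUs launchable = per-task vCPU size x task-count quota.
    return task_size_vcpu * task_quota

# 0.25 vCPU tasks -> 250 vCPUs; 4 vCPU tasks -> 4,000 vCPUs.
```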

@ziXet

ziXet commented Sep 12, 2022

Seems like the delay is because they first need to migrate the old Fargate quota system to the new vCPU-based quota system: https://aws.amazon.com/blogs/containers/migrating-fargate-service-quotas-to-vcpu-based-quotas/ [...]

Yeah, this is promising...

@LachlanMarnham

Following.

@nsht

nsht commented Sep 15, 2022

I now have an option to use up to 16 vCPUs on Fargate. I can see this option in us-west-1, but us-west-2 still has a max of 4.

[screenshot: Fargate task size options showing up to 16 vCPU]

@mreferre

The team (that has been working hard on this) appreciates the enthusiasm in this thread. Hang in there.

@omieomye

As many will have noticed, we've begun rolling out higher resource configurations (more vCPU and memory options). We'll make a formal announcement soon. Ensure that you first opt in your accounts to vCPU-based quotas, which we announced last week (post, FAQs), before using 8 and 16 vCPU tasks on Fargate.

Opting in to vCPU-based quotas is simple for an ECS customer; we recommend you use the ECS PutAccountSettingDefault API call. For EKS customers, or if you don't want to use the API to move to vCPU-based quotas, file a request in the AWS Support Center console.

Echoing what @mreferre said, thanks for the patience!
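For reference, a sketch of the PutAccountSettingDefault call mentioned above. The setting name below is an assumption based on the vCPU-quota migration announcement; verify it against the ECS API documentation before relying on it.

```python
def build_optin_request():
    """Build the PutAccountSettingDefault parameters for opting an ECS
    account into vCPU-based Fargate quotas. The setting name
    "fargateVCPULimit" is an assumption; check the ECS API docs."""
    return {"name": "fargateVCPULimit", "value": "enabled"}

# Usage (requires boto3 and ecs:PutAccountSettingDefault permission):
#   import boto3
#   boto3.client("ecs").put_account_setting_default(**build_optin_request())
```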

@gabrielkoo

It’s fascinating!!!


@shandrew

Thanks, that's all working well for me. Congrats on the launch.

@omieomye

Announcement. Thanks for the engagement, all!

Please note that to use this, you must first opt in your accounts to vCPU-based quotas. ECS Fargate customers can easily opt in using the PutAccountSettingDefault API before their accounts run larger tasks. EKS Fargate customers can cut us a ticket.

Closing.

@epinzur

epinzur commented Oct 1, 2022

Has anyone been successful in opting in to vCPU-based quotas on EKS? I created a customer support ticket for this with AWS, and they had no idea how to do it.
