
ECS ENI Density Increases #7

Closed
abby-fuller opened this issue Nov 28, 2018 · 53 comments
Labels
ECS Amazon Elastic Container Service

Comments

@abby-fuller
Contributor

abby-fuller commented Nov 28, 2018

Instances running in awsvpc networking mode will have greater allotments of ENIs allowing for greater Task densities.

@abby-fuller abby-fuller created this issue from a note in containers-roadmap (Coming Soon) Nov 28, 2018
@abby-fuller abby-fuller added the ECS Amazon Elastic Container Service label Nov 28, 2018
@abby-fuller abby-fuller moved this from Coming Soon to We're Working On It in containers-roadmap Nov 28, 2018
@Bensign Bensign changed the title ENI Trunking Support for ECS ENI Density Increases for ECS Dec 5, 2018
@Akramio Akramio changed the title ENI Density Increases for ECS ECS ENI Density Increases Dec 5, 2018
@MarcusNoble

Will this benefit EKS workers also?

@FernandoMiguel

@MarcusNoble EKS uses secondary IPs, so it already allows for much higher densities on each node.
If the new density limits come to EC2 in general, maybe EKS can benefit too, but right now this is a much smaller issue for EKS than it is for ECS.

@ofiliz

ofiliz commented Dec 13, 2018

@MarcusNoble Can you please tell us more about your EKS pods-per-node density requirements?

@MarcusNoble

It'd be great if we could make use of some of the smaller instance types (in terms of CPU and memory) but still benefit from being able to run a large number of pods. When we were picking the right instance type, we had to choose far more resources than we need because of the IP limitation, once that was balanced against the cost of running more, smaller instances.

@jpoley

jpoley commented Dec 15, 2018

Yes please, this would be valuable: increasing container density on ECS/EKS (no matter if IP or port based). Also, having a one-pager listing max containers per instance flavor would be useful too.

@jespersoderlund

An acceptable level of ENI density would be about 1 ENI per 0.5 vCPU, scaling linearly with instance size rather than every other size as it is today.

@mancej

mancej commented Jan 5, 2019

An acceptable level of ENI density would be about 1 ENI per 0.5 vCPU, scaling linearly with instance size rather than every other size as it is today.

I would say 1 ENI per 0.5 vCPU would be on the low end. Honestly, at that rate we probably still wouldn't bother with awsvpc networking mode. We regularly run 10-16 tasks on hosts with as few as 2 vCPUs.

@geekgonecrazy

I would point out that on other providers this limit is not in place. So, coming in with purely a Kubernetes background, I expected the only limit to be the hard-coded default of 110 pods per node.

This one caught us a bit off guard. We started migrating from GCP and chose machines in AWS as close in size as we could to what we had. We started the migration and suddenly pods weren't starting.

It was only because we happened to remember reading about IPs per ENI that we were able to figure this out.

I can definitely understand the context switching for the CPU and other factors being an issue with traditional EC2. But with much smaller jobs running, it would be nice to at least be able to acknowledge those risks and do it anyway.

Especially with EKS, where we are responsible for setting resource requests so Kubernetes can best schedule across our node capacity.

@emanuelecasadio

emanuelecasadio commented Mar 29, 2019

I can explain a good use case for this. We currently have an EKS cluster on AWS and an AKS cluster on Azure.

On the Azure cluster we run many small pods (approx. 80 pods per node): they are so small that they can easily fit on the equivalent of an m5.xlarge. Unfortunately, the m5.xlarge allows only 59 pods per node (of which at least 2 pods are needed by the system itself).

So we are basically using the Azure cluster for cost optimization.
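
For context, a minimal sketch of the pods-per-node ceiling the EKS VPC CNI imposes, using the usual formula ENIs x (IPv4 addresses per ENI - 1) + 2 and assuming the published m5.xlarge limits of 4 ENIs and 15 private IPv4 addresses per ENI:

    # Rough max-pods estimate for an m5.xlarge EKS node using the VPC CNI.
    awk 'BEGIN { enis = 4; ips_per_eni = 15; print enis * (ips_per_eni - 1) + 2 }'   # prints 58

That lands right around the ~59 pods per node quoted above, and well below the ~80 pods per node the same workload gets on the other provider.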

@peterjuras

Any news on when we can expect an update? We are planning to move workloads to ECS using awsvpc but are currently blocked by this issue. We could use the bridge networking mode for now, but it would be good to know whether an update to this issue is imminent or rather something for next year (both are fine, but information on this would be great).

@ofiliz

ofiliz commented Apr 7, 2019

@peterjuras We are currently actively working on this feature. Targeting a release soon, this year.

@emanuelecasadio Please note this issue tracks the progress of ENI density increases for ECS. We are also working on networking improvements for EKS, just not as part of this issue.

@joshuabaird

@ofiliz Does this mean "calendar year" (i.e., 2019)? We were initially under the impression this feature would be shipping months ago. Until it does ship, awsvpc (and thus App Mesh) is not usable for us.

@mancej

mancej commented Apr 8, 2019

@ofiliz Does this mean "calendar year" (i.e., 2019)? We were initially under the impression this feature would be shipping months ago. Until it does ship, awsvpc (and thus App Mesh) is not usable for us.

I second this; I struggle to see AppMesh working for the majority of ECS use cases given the current ENI limitations and the sole support for awsvpc networking mode. It's a shame there is so much focus on EKS support when Kubernetes already has tons of community support and tooling around service-mesh architectures. Meanwhile, for ECS today, all service-mesh deployments have to be more or less home-rolled due to limited support.

I've been patiently waiting, but I'm about to just roll Linkerd out across all of our clusters because the feature set of AppMesh as it is right now is still very limited, and this ENI density issue is a non-starter for us. It seems AppMesh was prematurely announced, since it's only just now GA, 6 months after the announcement, and is still effectively unusable for any reasonably sophisticated ECS deployment.

@tomelliff

AWS tend to release services as soon as they are useful for some subset of their intended customer base. If you are running reasonably memory-heavy containers then, depending on the instance type you use, you won't hit the ENI limits when using awsvpc networking.

While this is a problem for you (and for me), there are clearly going to be some people for whom this is useful, so it's obviously good to release it to those people before solving a much harder issue around ENI density, or reworking awsvpc networking on ECS to use secondary IPs as EKS does, with network policies on top of security groups.

There's certainly a nice level of simplicity in awsvpc networking: each task gets its own ENI, and thus you can use AWS networking primitives such as security groups natively. EKS's use of secondary IPs for pods sits on top of the already well-established network policies used by overlay networks in Kubernetes, but for a lot of people this is way more complexity than necessary.

I personally prefer the simplicity of ECS over Kubernetes for exactly these types of decisions.

@FernandoMiguel

I've said this before in multiple places:
having a native SG per ENI is a huge benefit for any org.
With Nitro technology, it should be possible to create a new instance family that removes the ENI-per-vCPU/core limit that currently constrains EC2.

@tomelliff

That's pretty outrageous speculation there.

Whatever you do, you're still restricted by the physical limitations of the actual tin, and part of that ENI-per-core thing is just how instances are divided up as part of the physical kit. Even if the networking is entirely virtualised or offloaded, there's still some cost to it, and AWS needs to be able to portion that out to every user of the tin as fairly as possible.

@FernandoMiguel

True, @tomelliff, but it would lift this entire problem to a different scale.

@ofiliz

ofiliz commented Apr 17, 2019

@joshuabaird @mancej Yes, this calendar year, coming soon. We appreciate your feedback. We are aware that this issue impacts App Mesh on ECS and are working hard to increase task density without requiring our customers to make any disruptive changes or lose any functionality that they expect from VPC networking using awsvpc mode.

@Bensign

Bensign commented Apr 17, 2019

Hi everyone: I'm on the product management team for ECS. We're going to be doing an early access period soon for this feature prior to being generally available.

In the event you're interested in participating: can you please email me at bsheck [at] amazon with your AWS account ID(s). I'll ensure your accounts get access and follow up with more specific instructions when the early access period is opened up.

@Bensign Bensign moved this from We're Working On It to Coming Soon in containers-roadmap Apr 19, 2019
@mfortin

mfortin commented May 16, 2019

With the Amazon ECS agent v1.28.0 released today, support for high-density awsvpc tasks was announced. What's the new limit? Is it more ENIs per EC2 instance? More IP addresses per ENI?
We have instances running as many as 120 tasks, so we're wondering where the limit is now.

Thanks!

@Bensign

Bensign commented May 16, 2019

@mfortin The agent release today is staged in anticipation of opening the feature up for general availability relatively soon. At that point, we'll publish the documentation with the various ENI increases on a per-instance basis, and I'll report back here.

@mfortin

mfortin commented May 16, 2019

@Bensign I sent you an email from my corporate address last month asking to be part of the beta test; we love being guinea pigs ;) If you prefer, I can make this request more official through our TAM.

@abby-fuller
Contributor Author

shipped: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-account-settings.html

@abby-fuller abby-fuller moved this from Coming Soon to Just Shipped in containers-roadmap Jun 6, 2019
@FernandoMiguel

@abby-fuller is this limited to the specific families listed in the docs, or does it also include sub-families like c5d?

@coultn

coultn commented Jun 6, 2019

@abby-fuller is this limited to the specific families listed in the docs, or does it also include sub-families like c5d?

It is currently limited to the specific instance types listed in the docs. We are working on adding additional instance types.

@joshuabaird

https://docs.aws.amazon.com/AmazonECS/latest/developerguide/container-instance-eni.html

@sargun

sargun commented Jun 8, 2019

How does this work? Is there any reason why we wouldn't opt into this mode? Are there any limitations?

@joshuabaird

Is this actually working for anyone? I have the account setting defined and am running the newest ECS AMI (with the 1.28.1 ECS agent, etc.), but I can still only run 3 tasks on an m5.2x. I don't see that the trunk interface is being provisioned. Talking to support now, but I think they may be stumped as well.

@joshuabaird

An update: I enabled awsvpcTrunking for the account using a non-root account/role. This role was also used to provision the ECS container instance and the ECS service, but ENI trunking was still not working/available. We then logged into the ECS console using the root account and enabled the setting (which sets the default setting for the entire account). After doing this, ENI trunking started working as expected.

@iwarshak

@joshuabaird Yup. I had the same issue. You need to enable awsvpcTrunking as the root user. It's not obvious.
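
For reference, a minimal sketch of the workaround described above, assuming credentials with account-level permissions (both are standard AWS CLI ECS commands):

    # Set the account-wide default so container instances launched from now on opt in to ENI trunking.
    aws ecs put-account-setting-default --name awsvpcTrunking --value enabled

    # Confirm the setting that is actually in effect for the calling principal.
    aws ecs list-account-settings --effective-settings --name awsvpcTrunking

Per the linked docs, only container instances registered after the opt-in receive a trunk ENI, so existing instances need to be relaunched.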

@geekgonecrazy

geekgonecrazy commented Jun 11, 2019

Does this apply just to ECS, or also to EKS? I was directed here by a couple of AWS solutions architects before this was closed and was under the impression it would be usable by EKS as well. The announcement doesn't mention it, though.

@ofiliz

ofiliz commented Jun 11, 2019

Hi @geekgonecrazy, this feature is currently only for ECS. Do you want more pods per node in EKS? Or do you want VPC security groups for each EKS pod? If you can tell us more about your requirements, we can suggest solutions or consider adding such a feature in our roadmap.

@geekgonecrazy

@ofiliz

I would point out that on other providers this limit is not in place. So, coming in with purely a Kubernetes background, I expected the only limit to be the hard-coded default of 110 pods per node.

This one caught us a bit off guard. We started migrating from GCP and chose machines in AWS as close in size as we could to what we had. We started the migration and suddenly pods weren't starting.

It was only because we happened to remember reading about IPs per ENI that we were able to figure this out.

I can definitely understand the context switching for the CPU and other factors being an issue with traditional EC2. But with much smaller jobs running, it would be nice to at least be able to acknowledge those risks and do it anyway.

Especially with EKS, where we are responsible for setting resource requests so Kubernetes can best schedule across our node capacity.

To quote my initial comment here 4 months ago.

On every other provider we can use the Kubernetes default of 110 pods per node. With EKS we have to get a machine with more interfaces and far more specs than we need just to get 110 pods per node.

@peterjuras

Are there any plans to also bring this to the smallest instance types (e.g. t2/t3.micro)? I would mainly plan on using this feature for dev environments, where we would bin-pack as much as possible; in production environments I don't see as much need.

@emanuelecasadio

emanuelecasadio commented Jul 10, 2019

@ofiliz we have a workload running on a different cloud provider that we would like to move to EKS, but the fact that we cannot allocate 110 pods on a t3.medium or t3.large node is a no-go for us.

@ofiliz

ofiliz commented Jul 18, 2019

@geekgonecrazy @emanuelecasadio Thanks for your feedback. We are working on significantly improving the EKS pods-per-node density, as well as adding other exciting new networking features. We have created a new item in our EKS roadmap: #398

@mailjunze

ENI trunking doesn't work when opting in via the console as a non-root user. You would need to opt in as the root user via the console, or run the following command as a root or non-root user:
aws ecs put-account-setting-default --name awsvpcTrunking --value enabled --region
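
As a hedged follow-up, one way to check whether a given container instance actually received a trunk ENI is to look at its recorded attachments (the cluster name and instance ID below are placeholders):

    # With trunking active, a trunk ENI attachment should show up for the instance.
    aws ecs describe-container-instances \
        --cluster my-cluster \
        --container-instances <container-instance-id> \
        --query 'containerInstances[].attachments'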

@pradeepkamath007

ENI trunking doesn't work for instances launched in a shared VPC subnet: https://docs.aws.amazon.com/vpc/latest/userguide/vpc-sharing.html

The instances fail to register to the cluster when they are launched in a shared VPC with the ENI trunking feature enabled.

@tomaszdudek7

Bumping @peterjuras's question: will you ever support the t2/t3 family?

Running at least the c5 family on dev/qa/preprod environments costs way too much.

@coultn

coultn commented Oct 22, 2019

Due to technical constraints with how ENI trunking works, we do not currently have plans to support t2 and t3.

@bcluap

bcluap commented Jan 21, 2021

15 months later and ENI density remains a major issue for ECS users. Microservice architectures fundamentally need a high density of tasks per CPU to make sense, or else one ends up with ridiculous costs. We need 1- or 2-vCPU instances that can host, say, 10 tasks. Please come up with a way of enabling this, even if it has a performance impact. Customers need to be able to run light microservices at high density.

@joshuabaird

I know this probably isn't helpful, but what about an m5.large? 2 vCPUs, supports 10 ENIs with ENI trunking, but it does cost slightly more than a t3.small.

@bcluap

bcluap commented Jan 22, 2021

We looked at that but the commercials do not make sense (EU West 1):

(Columns: instance | vCPUs | ECU | memory | storage | on-demand price | max awsvpc ECS tasks)

t3.small | 2 | Variable | 2 GiB | EBS Only | $0.0228 per Hour | Max 2 ECS Tasks

m5.large | 2 | 10 | 8 GiB | EBS Only | $0.107 per Hour | Max 10 ECS Tasks

So t3.small is $0.0114 per task per hour, where each task gets 1 vCPU and 1 GB RAM.
So m5.large is $0.0107 per task per hour, where each task gets 0.2 vCPU and 0.8 GB RAM.

So I'd go with t3.small and pay $0.50 extra a month per task and get 5X the CPU. Hence the m5.large is too expensive for what it is compared to using lots of smaller nodes.

@coultn

coultn commented Jan 22, 2021

15 months later and ENI density remains a major issue for ECS users. Microservice architectures fundamentally need a high density of tasks per CPU to make sense, or else one ends up with ridiculous costs. We need 1- or 2-vCPU instances that can host, say, 10 tasks. Please come up with a way of enabling this, even if it has a performance impact. Customers need to be able to run light microservices at high density.

Have you considered Fargate? You can go as small as 0.25 vCPU per task and there are no limits on task density per se because you are not selecting or managing EC2 instances at all with Fargate.
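
For anyone weighing that option, a minimal, hypothetical sketch of registering a quarter-vCPU Fargate task (the family name, image, and sizes are illustrative, not taken from this thread):

    # Register a Fargate-compatible task definition sized at 0.25 vCPU (256 CPU units) and 512 MiB.
    aws ecs register-task-definition \
        --family tiny-service \
        --requires-compatibilities FARGATE \
        --network-mode awsvpc \
        --cpu 256 --memory 512 \
        --container-definitions '[{"name":"app","image":"nginx:alpine","essential":true}]'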

@bcluap

bcluap commented Jan 22, 2021

We actually used to run everything on Fargate but found it did not perform nearly as well as plain EC2 instances (there are lots of discussions on this topic), and the pricing was horrific.

Also, adding fargate to the analysis:

t3.small | 2 | Variable | 2 GiB | EBS Only | $0.0228 per Hour | Max 2 ECS Tasks
m5.large | 2 | 10 | 8 GiB | EBS Only | $0.107 per Hour | Max 10 ECS Tasks
Fargate | $0.04048 per vCPU-hour and $0.004445 per GB of RAM per hour

So t3.small is $0.0114 per task per hour, where each task gets 1 vCPU and 1 GB RAM.
So m5.large is $0.0107 per task per hour, where each task gets 0.2 vCPU and 0.8 GB RAM.
So Fargate is $0.044925 per task per hour, where each task gets 1 vCPU and 1 GB RAM.

... Fargate is 4X more expensive than t3.small!!!
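
A quick back-of-the-envelope check of those per-task figures, using the eu-west-1 on-demand rates quoted in this thread (verify against current pricing):

    # Per-task hourly cost, assuming each instance is packed to its max awsvpc task count.
    awk 'BEGIN {
      t3 = 0.0228 / 2          # t3.small, 2 tasks max without trunking
      m5 = 0.107  / 10         # m5.large, 10 tasks max with ENI trunking
      fg = 0.04048 + 0.004445  # Fargate: 1 vCPU + 1 GB RAM per hour
      printf "t3.small $%.4f  m5.large $%.4f  Fargate $%.6f (%.1fx t3.small)\n", t3, m5, fg, fg / t3
    }'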

@waynerobinson

You also have to remember that a t2.small only gives you about 20% of 1 vCPU, not a whole vCPU. So when you factor that into your calculation, it's actually slightly more expensive than an m5.large.

Plus you probably need to assign less EBS storage to one m5.large instance running 10 tasks than you do for 10 t2.small instances.

The T*-series instance types make a lot of sense for fractional usage of a whole instance, but when you start factoring in being able to run multiple concurrent tasks with ECS, this becomes less important as you're already able to use fractional computing by assigning more tasks.

@waynerobinson

Also, if these are microservice instances that could stand to be replaced within 2 minutes, have you considered Spot instances? They're often 5x+ cheaper than on-demand.

Even though these can go away with a 2-minute warning, we've found capacity to be remarkably stable.

In practice, if you mix instance types and availability zones, you are very unlikely to be left without capacity when one instance type becomes unavailable, and the Spot allocation strategy uses extra knowledge to try to avoid instance types that are likely to be low on capacity.

Also, capacity in EC2 is set on a per-instance-type, per-AZ basis, so even if they run out of m5.xlarge capacity, there'll still be m5.large (for example).

And if you're worried about complete instance-type exhaustion, you could use something like https://github.com/AutoSpotting/AutoSpotting, which will start everything as on-demand and swap it for Spot. So even if Spot instances do get exhausted, it would replace capacity with on-demand instances.

@mreferre

@bcluap I came here to say that your pricing analysis needs some additional considerations, but @waynerobinson already hinted at that. As far as Fargate is concerned, comparing a t3.small 1 vCPU to a Fargate 1 vCPU is not apples to apples unless you factor in the price of t3 bursting. All in all, Fargate raw capacity cost is usually around 20% more expensive than similarly spec'ed EC2 costs. Of course, this does not take into account the operational savings Fargate can allow customers to achieve (this is a blog post centered on EKS/Fargate, but many of the considerations are similar for ECS/Fargate as well).

Don't get me wrong: if your workload pattern is such that you can take advantage of the bursting characteristics of T3, bursting on an as-needed basis without having to pay for the ENTIRE set of discrete resources you have available, then T3 is the way to go. However, if you are sizing your tasks for full utilization of discrete resources, then M5 (and possibly Fargate, depending on your specific workload patterns) may be cheaper, along with giving you more ENI flexibility.

As usual, it depends.

@kgns

kgns commented Sep 15, 2021

For 24/7 stable workloads, using Fargate is just throwing money away. The operational savings of Fargate only shine if you have to provision capacity dynamically or only for a period of time. Otherwise, you configure your EC2 infrastructure once and then run your tasks 24/7 for a much lower cost than Fargate.

It has been said before that support for T-family instances is not in the plans, but as @mreferre also said, the specific use case for these instance types is more common than use cases where we occupy 100% of resources all the time. Any workload aiming for 24/7 availability but with real usage only from 8am to 6pm in a specific time zone should use burstable instance types, and that should cover a big portion (maybe more than half) of all cloud usage.

In the end, I don't think anyone should dictate which instance types are used for anyone else's workload. It's their own choice, and they will be billed for their choices.

What I want to know is whether there really is an unsolvable technical problem preventing Amazon from enabling ENI trunking on T2/T3/T4g instances, or whether it is just a business decision forcing customers to use non-burstable instances at higher cost. If there really is a technical blocker, I would like to know what it is, if possible.

I have asked a similar question on a related open issue, #1094, but have received no response yet.
