Karpenter choose costly spot instances. #3044

liorfranko · 2022-12-15T19:12:30Z

Version

Karpenter Version: v0.19.2

Kubernetes Version: v1.20.15

Expected Behavior

Deploying 13 pods, each with CPU request=10 and memory request=53Gi, with anti-affinity (each pod should be scheduled on a different node), we expect to get spot instances with roughly 16 cores each.

Actual Behavior

We got:
4 - r5n.4xlarge
5 - r5n.8xlarge
1 - c5n.9xlarge
2 - r5n.12xlarge
I know that Karpenter uses the "price and capacity optimized" strategy, but this doesn't seem right.
In the above situation, we requested 130 cores and could have gotten 13 instances from the many available 4xlarge types, which is 208 cores, but we got 420 cores instead.
This happens often, and our clusters are significantly underutilized without "support node replacement for spot nodes" (kubernetes-sigs/karpenter#763).
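
The core arithmetic above can be checked with a short sketch. The vCPU sizes per instance type are the standard AWS values, assumed here rather than stated in the thread, and the function name `total_vcpus` is illustrative. The node counts are taken exactly as listed (12 nodes), so the provisioned total will not necessarily match the quoted figure; treat it as a rough check rather than a correction.

```python
# vCPU counts per instance type (standard AWS sizes; assumed, not from the thread).
VCPUS = {
    "r5n.4xlarge": 16,
    "r5n.8xlarge": 32,
    "c5n.9xlarge": 36,
    "r5n.12xlarge": 48,
}


def total_vcpus(nodes: dict[str, int]) -> int:
    """Sum the vCPUs across a {instance_type: node_count} mapping."""
    return sum(VCPUS[itype] * count for itype, count in nodes.items())


pods = 13
requested = pods * 10                        # 10 CPU requested per pod
ideal = total_vcpus({"r5n.4xlarge": pods})   # one 16-vCPU node per pod
listed = total_vcpus({                       # node counts as listed in the report
    "r5n.4xlarge": 4,
    "r5n.8xlarge": 5,
    "c5n.9xlarge": 1,
    "r5n.12xlarge": 2,
})

print(f"requested={requested} ideal={ideal} listed={listed}")
print(f"utilization with one 4xlarge per pod: {requested / ideal:.0%}")
print(f"utilization with the listed nodes:    {requested / listed:.0%}")
```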

Steps to Reproduce the Problem

Use a deployment with anti-affinity, CPU request = 10.

Resource Specs and Logs

2022-12-13T23:51:10.383Z        INFO    controller.provisioning launching node with 1 pods requesting {"cpu":"10370m","memory":"53028Mi","pods":"10"} from types m6id.4xlarge, m6i.4xlarge, m5.4xlarge, m6a.4xlarge, m5d.4xlarge and 50 other(s)        {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot"}
2022-12-13T23:51:10.384Z        INFO    controller.provisioning launching node with 1 pods requesting {"cpu":"10370m","memory":"53028Mi","pods":"10"} from types m6id.4xlarge, m6i.4xlarge, m5.4xlarge, m6a.4xlarge, m5d.4xlarge and 50 other(s)        {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot"}
2022-12-13T23:51:10.385Z        INFO    controller.provisioning launching node with 1 pods requesting {"cpu":"10370m","memory":"53028Mi","pods":"10"} from types m6id.4xlarge, m6i.4xlarge, m5.4xlarge, m6a.4xlarge, m5d.4xlarge and 50 other(s)        {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot"}
2022-12-13T23:51:10.385Z        INFO    controller.provisioning launching node with 1 pods requesting {"cpu":"10370m","memory":"53028Mi","pods":"10"} from types m6id.4xlarge, m6i.4xlarge, m5.4xlarge, m6a.4xlarge, m5d.4xlarge and 50 other(s)        {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot"}
2022-12-13T23:51:10.386Z        INFO    controller.provisioning launching node with 1 pods requesting {"cpu":"10370m","memory":"53028Mi","pods":"10"} from types m6id.4xlarge, m6i.4xlarge, m5.4xlarge, m6a.4xlarge, m5d.4xlarge and 50 other(s)        {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot"}
2022-12-13T23:51:10.387Z        INFO    controller.provisioning launching node with 1 pods requesting {"cpu":"10370m","memory":"53028Mi","pods":"10"} from types m6id.4xlarge, m6i.4xlarge, m5.4xlarge, m6a.4xlarge, m5d.4xlarge and 50 other(s)        {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot"}
2022-12-13T23:51:10.383Z        INFO    controller.provisioning launching node with 1 pods requesting {"cpu":"10370m","memory":"53028Mi","pods":"10"} from types m6id.4xlarge, m6i.4xlarge, m5.4xlarge, m6a.4xlarge, m5d.4xlarge and 50 other(s)        {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot"}
2022-12-13T23:51:10.391Z        INFO    controller.provisioning launching node with 1 pods requesting {"cpu":"10370m","memory":"53028Mi","pods":"10"} from types m6id.4xlarge, m6i.4xlarge, m5.4xlarge, m6a.4xlarge, m5d.4xlarge and 50 other(s)        {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot"}
2022-12-13T23:51:10.392Z        INFO    controller.provisioning launching node with 1 pods requesting {"cpu":"10370m","memory":"53028Mi","pods":"10"} from types m6id.4xlarge, m6i.4xlarge, m5.4xlarge, m6a.4xlarge, m5d.4xlarge and 50 other(s)        {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot"}
2022-12-13T23:51:10.394Z        INFO    controller.provisioning launching node with 1 pods requesting {"cpu":"10370m","memory":"53028Mi","pods":"10"} from types m6id.4xlarge, m6i.4xlarge, m5.4xlarge, m6a.4xlarge, m5d.4xlarge and 50 other(s)        {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot"}
2022-12-13T23:51:10.395Z        INFO    controller.provisioning launching node with 1 pods requesting {"cpu":"10370m","memory":"53028Mi","pods":"10"} from types m6id.4xlarge, m6i.4xlarge, m5.4xlarge, m6a.4xlarge, m5d.4xlarge and 50 other(s)        {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot"}
2022-12-13T23:51:10.396Z        INFO    controller.provisioning launching node with 1 pods requesting {"cpu":"10370m","memory":"53028Mi","pods":"10"} from types m6id.4xlarge, m6i.4xlarge, m5.4xlarge, m6a.4xlarge, m5d.4xlarge and 50 other(s)        {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot"}
2022-12-13T23:51:10.397Z        INFO    controller.provisioning launching node with 1 pods requesting {"cpu":"10370m","memory":"53028Mi","pods":"10"} from types m6id.4xlarge, m6i.4xlarge, m5.4xlarge, m6a.4xlarge, m5d.4xlarge and 50 other(s)        {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot"}
2022-12-13T23:51:18.331Z        INFO    controller.provisioning.cloudprovider   launched new instance   {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot", "launched-instance": "i-0f89de32faf6ff8e9", "hostname": "ip-10-208-18-83.ec2.internal", "type": "r5n.8xlarge", "zone": "us-east-1b", "capacity-type": "spot"}
2022-12-13T23:51:18.334Z        INFO    controller.provisioning.cloudprovider   launched new instance   {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot", "launched-instance": "i-067cf5e99bc4f4fda", "hostname": "ip-10-208-25-138.ec2.internal", "type": "r5n.12xlarge", "zone": "us-east-1b", "capacity-type": "spot"}
2022-12-13T23:51:18.338Z        INFO    controller.provisioning.cloudprovider   launched new instance   {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot", "launched-instance": "i-033caafd1ff6305f7", "hostname": "ip-10-208-8-140.ec2.internal", "type": "r5n.12xlarge", "zone": "us-east-1b", "capacity-type": "spot"}
2022-12-13T23:51:18.338Z        INFO    controller.provisioning.cloudprovider   launched new instance   {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot", "launched-instance": "i-059ba620a24f2d91f", "hostname": "ip-10-208-26-128.ec2.internal", "type": "r5n.4xlarge", "zone": "us-east-1b", "capacity-type": "spot"}
2022-12-13T23:51:18.341Z        INFO    controller.provisioning.cloudprovider   launched new instance   {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot", "launched-instance": "i-0f9e339cd71d7fd2d", "hostname": "ip-10-208-65-11.ec2.internal", "type": "c5n.9xlarge", "zone": "us-east-1e", "capacity-type": "spot"}
2022-12-13T23:51:18.341Z        INFO    controller.provisioning.cloudprovider   launched new instance   {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot", "launched-instance": "i-0bd52684f99a5a74a", "hostname": "ip-10-208-22-84.ec2.internal", "type": "r5n.8xlarge", "zone": "us-east-1b", "capacity-type": "spot"}
2022-12-13T23:51:18.342Z        INFO    controller.provisioning.cloudprovider   launched new instance   {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot", "launched-instance": "i-0ca8cc5e8e877b8c1", "hostname": "ip-10-208-8-205.ec2.internal", "type": "r5n.4xlarge", "zone": "us-east-1b", "capacity-type": "spot"}
2022-12-13T23:51:18.343Z        INFO    controller.provisioning.cloudprovider   launched new instance   {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot", "launched-instance": "i-088bae61aacf23733", "hostname": "ip-10-208-20-93.ec2.internal", "type": "r5n.4xlarge", "zone": "us-east-1b", "capacity-type": "spot"}
2022-12-13T23:51:18.347Z        INFO    controller.provisioning.cloudprovider   launched new instance   {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot", "launched-instance": "i-061e2bc70fbb2f15b", "hostname": "ip-10-208-19-65.ec2.internal", "type": "r5n.8xlarge", "zone": "us-east-1b", "capacity-type": "spot"}
2022-12-13T23:51:18.347Z        INFO    controller.provisioning.cloudprovider   launched new instance   {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot", "launched-instance": "i-0d0ef3a2d6f469b1c", "hostname": "ip-10-208-14-240.ec2.internal", "type": "r5n.8xlarge", "zone": "us-east-1b", "capacity-type": "spot"}
2022-12-13T23:51:18.351Z        INFO    controller.provisioning.cloudprovider   launched new instance   {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot", "launched-instance": "i-0e3430f9fa7d99ac6", "hostname": "ip-10-208-23-31.ec2.internal", "type": "r5n.4xlarge", "zone": "us-east-1b", "capacity-type": "spot"}
2022-12-13T23:51:18.371Z        INFO    controller.provisioning.cloudprovider   launched new instance   {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot", "launched-instance": "i-0bebe831961d35295", "hostname": "ip-10-208-20-74.ec2.internal", "type": "r5n.8xlarge", "zone": "us-east-1b", "capacity-type": "spot"}
2022-12-13T23:51:18.375Z        INFO    controller.provisioning.cloudprovider   launched new instance   {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot", "launched-instance": "i-034f3107f97ae5b8b", "hostname": "ip-10-208-18-115.ec2.internal", "type": "r5n.4xlarge", "zone": "us-east-1b", "capacity-type": "spot"}
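
To cross-check which instance types were actually launched, the `launched new instance` lines above can be tallied with a small script. This is only a sketch: it assumes each such line ends with a single JSON object containing a `type` field, as in the lines shown, and the function name `tally_launched_types` and the two-line sample are illustrative.

```python
import json
from collections import Counter


def tally_launched_types(log_text: str) -> Counter:
    """Count launched instances per type from Karpenter controller log text."""
    counts: Counter = Counter()
    for line in log_text.splitlines():
        # Skip "launching node" lines, which carry two JSON objects.
        if "launched new instance" not in line:
            continue
        # The structured fields start at the first "{" on the line.
        payload = json.loads(line[line.index("{"):])
        counts[payload["type"]] += 1
    return counts


# Two lines copied from the logs above; paste the full log to tally everything.
SAMPLE = """\
2022-12-13T23:51:18.331Z        INFO    controller.provisioning.cloudprovider   launched new instance   {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot", "launched-instance": "i-0f89de32faf6ff8e9", "hostname": "ip-10-208-18-83.ec2.internal", "type": "r5n.8xlarge", "zone": "us-east-1b", "capacity-type": "spot"}
2022-12-13T23:51:18.338Z        INFO    controller.provisioning.cloudprovider   launched new instance   {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot", "launched-instance": "i-059ba620a24f2d91f", "hostname": "ip-10-208-26-128.ec2.internal", "type": "r5n.4xlarge", "zone": "us-east-1b", "capacity-type": "spot"}
"""

for instance_type, count in sorted(tally_launched_types(SAMPLE).items()):
    print(f"{count} - {instance_type}")
```

Running it over the full log is a quick way to confirm that the per-type counts in a report match what the controller recorded.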

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

FernandoMiguel · 2022-12-16T14:36:17Z

Can you use ec2-instance-selector to see the spot pricing for those instances in your account?
Or check the spot pricing in the AWS console?
Maybe they were the cheapest option at the time?

liorfranko · 2022-12-16T18:05:11Z

Here is the spot price history:

4 - r5n.4xlarge - $2.9016
5 - r5n.8xlarge - $6.1935
1 - c5n.9xlarge - $0.8366
2 - r5n.12xlarge - $4.123
That's $14.0547 per hour, instead of ~$9 per hour.
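
The comparison above can be reproduced in a few lines. This sketch assumes each listed price is the hourly total for that group of nodes (which is how the figures add up to 14.0547), and that the ~9 baseline means running all 13 nodes at the per-node r5n.4xlarge price; the variable names are illustrative.

```python
# (node_count, hourly cost in USD for the whole group), as listed above.
groups = {
    "r5n.4xlarge": (4, 2.9016),
    "r5n.8xlarge": (5, 6.1935),
    "c5n.9xlarge": (1, 0.8366),
    "r5n.12xlarge": (2, 4.123),
}

# What the launched mix costs per hour.
actual_hourly = sum(cost for _, cost in groups.values())

# Assumed baseline: every one of the 13 pods lands on an r5n.4xlarge.
count_4xl, cost_4xl = groups["r5n.4xlarge"]
per_node_4xl = cost_4xl / count_4xl
baseline_hourly = 13 * per_node_4xl

print(f"actual:   ${actual_hourly:.4f}/h")
print(f"baseline: ${baseline_hourly:.4f}/h")
print(f"overhead: {actual_hourly / baseline_hourly - 1:.0%}")
```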

Here is the current average frequency of interruption in our region (Taken from https://aws.amazon.com/ec2/spot/instance-advisor/):

In my experience with Spot, r5n.4xlarge is the least-interrupted instance type in the above list.

spring1843 · 2022-12-19T21:48:09Z

We are looking into this internally to see if we can find a better answer for you.

ellistarn · 2022-12-20T17:40:04Z

One thing to note about the spot instance advisor:

Frequency of interruption represents the rate at which Spot has reclaimed capacity during the trailing month. They are in ranges of <5%, 5-10%, 10-15%, 15-20%, and >20%.

This does a good job of representing monthly history, but a bad job of representing short-lived spikes. From our investigation, things are working as intended, and your instance type flexibility enabled you to find capacity with a lower interruption rate during this scale-out. Karpenter uses the price-capacity-optimized (PCO) strategy; you may be interested to learn more here: https://aws.amazon.com/blogs/compute/introducing-price-capacity-optimized-allocation-strategy-for-ec2-spot-instances/

liorfranko · 2022-12-22T07:00:09Z

The behavior of the different strategies described in the linked doc is not even close to what we see in reality.
Its examples show a split between c5 and c6 or between c5 and m5, but all the instances are xlarge, which is great.
That's not our case: we see a split between 4xlarge and 12xlarge.

The fact is that once we moved from CAS (Cluster Autoscaler) to Karpenter, our monthly costs increased by 15%.

tzneal · 2022-12-22T16:27:46Z

Going to re-open this and investigate deeper.

mercuriete · 2023-01-18T15:50:15Z

@liorfranko
I had the same problem. I asked on another issue, and somebody gave me this advice:
Use weighted provisioners.
I have a provisioner for cheap instances and then a default provisioner with all types.

In my use case all containers fit in t3.small, so I created a provisioner with t3.small and a high priority, and then another provisioner with all types and a low or 0 priority.

My problem now is that I do not know how to debug the configuration, and consolidation is not working for me, but that is another story.

I hope it works for you, because for me it is almost good.

liorfranko · 2023-01-18T18:36:12Z

Well, it will work, but operation-wise, it will be tough to manage.

liorfranko · 2023-01-26T06:58:01Z

Recently, the spot market has been more stable, and we don't see this behavior anymore.
In addition, we will replace the anti-affinity with topology spread constraints.

tzneal · 2023-01-27T00:09:49Z

Hey @liorfranko I did want to let you know that we merged in #3292 which should also help prevent this.

liorfranko · 2023-01-27T06:19:22Z

Looks great; we'll upgrade next week.
Thanks!

felipewnp · 2023-12-07T00:01:29Z

Looks great; we'll upgrade next week. Thanks!

Hi @liorfranko

Did the update solve your issue?

liorfranko · 2023-12-18T19:24:29Z

Hi @felipewnp
Yes, it solved my issue.

liorfranko added the bug label Dec 15, 2022

ellistarn closed this as completed Dec 20, 2022

tzneal reopened this Dec 22, 2022

ellistarn mentioned this issue Jan 25, 2023

Spot -> Spot Consolidation kubernetes-sigs/karpenter#763

Closed

liorfranko closed this as completed Jan 26, 2023
