Karpenter choose costly spot instances. #3044

Closed
liorfranko opened this issue Dec 15, 2022 · 13 comments
Labels
bug Something isn't working

Comments

@liorfranko

Version

Karpenter Version: v0.19.2

Kubernetes Version: v1.20.15

Expected Behavior

Deploying 13 pods, each with a CPU request of 10 and a memory request of 53Gi, with anti-affinity (each pod should land on a different node), we expect to get spot instances with roughly 16 cores each.

Actual Behavior

We got:
4 - r5n.4xlarge
5 - r5n.8xlarge
1 - c5n.9xlarge
2 - r5n.12xlarge
I know that Karpenter uses the "price and capacity optimized" strategy, but this result doesn't seem reasonable.
In the above situation, we requested 130 cores. Thirteen 4xlarge instances would have provided 208 cores, but we got 420 cores instead.
This often happens, and without "support node replacement for spot nodes" (kubernetes-sigs/karpenter#763) our clusters are significantly underutilized.

Steps to Reproduce the Problem

Use a deployment with anti-affinity, CPU request = 10.

Resource Specs and Logs

2022-12-13T23:51:10.383Z        INFO    controller.provisioning launching node with 1 pods requesting {"cpu":"10370m","memory":"53028Mi","pods":"10"} from types m6id.4xlarge, m6i.4xlarge, m5.4xlarge, m6a.4xlarge, m5d.4xlarge and 50 other(s)        {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot"}
2022-12-13T23:51:10.384Z        INFO    controller.provisioning launching node with 1 pods requesting {"cpu":"10370m","memory":"53028Mi","pods":"10"} from types m6id.4xlarge, m6i.4xlarge, m5.4xlarge, m6a.4xlarge, m5d.4xlarge and 50 other(s)        {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot"}
2022-12-13T23:51:10.385Z        INFO    controller.provisioning launching node with 1 pods requesting {"cpu":"10370m","memory":"53028Mi","pods":"10"} from types m6id.4xlarge, m6i.4xlarge, m5.4xlarge, m6a.4xlarge, m5d.4xlarge and 50 other(s)        {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot"}
2022-12-13T23:51:10.385Z        INFO    controller.provisioning launching node with 1 pods requesting {"cpu":"10370m","memory":"53028Mi","pods":"10"} from types m6id.4xlarge, m6i.4xlarge, m5.4xlarge, m6a.4xlarge, m5d.4xlarge and 50 other(s)        {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot"}
2022-12-13T23:51:10.386Z        INFO    controller.provisioning launching node with 1 pods requesting {"cpu":"10370m","memory":"53028Mi","pods":"10"} from types m6id.4xlarge, m6i.4xlarge, m5.4xlarge, m6a.4xlarge, m5d.4xlarge and 50 other(s)        {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot"}
2022-12-13T23:51:10.387Z        INFO    controller.provisioning launching node with 1 pods requesting {"cpu":"10370m","memory":"53028Mi","pods":"10"} from types m6id.4xlarge, m6i.4xlarge, m5.4xlarge, m6a.4xlarge, m5d.4xlarge and 50 other(s)        {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot"}
2022-12-13T23:51:10.383Z        INFO    controller.provisioning launching node with 1 pods requesting {"cpu":"10370m","memory":"53028Mi","pods":"10"} from types m6id.4xlarge, m6i.4xlarge, m5.4xlarge, m6a.4xlarge, m5d.4xlarge and 50 other(s)        {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot"}
2022-12-13T23:51:10.391Z        INFO    controller.provisioning launching node with 1 pods requesting {"cpu":"10370m","memory":"53028Mi","pods":"10"} from types m6id.4xlarge, m6i.4xlarge, m5.4xlarge, m6a.4xlarge, m5d.4xlarge and 50 other(s)        {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot"}
2022-12-13T23:51:10.392Z        INFO    controller.provisioning launching node with 1 pods requesting {"cpu":"10370m","memory":"53028Mi","pods":"10"} from types m6id.4xlarge, m6i.4xlarge, m5.4xlarge, m6a.4xlarge, m5d.4xlarge and 50 other(s)        {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot"}
2022-12-13T23:51:10.394Z        INFO    controller.provisioning launching node with 1 pods requesting {"cpu":"10370m","memory":"53028Mi","pods":"10"} from types m6id.4xlarge, m6i.4xlarge, m5.4xlarge, m6a.4xlarge, m5d.4xlarge and 50 other(s)        {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot"}
2022-12-13T23:51:10.395Z        INFO    controller.provisioning launching node with 1 pods requesting {"cpu":"10370m","memory":"53028Mi","pods":"10"} from types m6id.4xlarge, m6i.4xlarge, m5.4xlarge, m6a.4xlarge, m5d.4xlarge and 50 other(s)        {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot"}
2022-12-13T23:51:10.396Z        INFO    controller.provisioning launching node with 1 pods requesting {"cpu":"10370m","memory":"53028Mi","pods":"10"} from types m6id.4xlarge, m6i.4xlarge, m5.4xlarge, m6a.4xlarge, m5d.4xlarge and 50 other(s)        {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot"}
2022-12-13T23:51:10.397Z        INFO    controller.provisioning launching node with 1 pods requesting {"cpu":"10370m","memory":"53028Mi","pods":"10"} from types m6id.4xlarge, m6i.4xlarge, m5.4xlarge, m6a.4xlarge, m5d.4xlarge and 50 other(s)        {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot"}
2022-12-13T23:51:18.331Z        INFO    controller.provisioning.cloudprovider   launched new instance   {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot", "launched-instance": "i-0f89de32faf6ff8e9", "hostname": "ip-10-208-18-83.ec2.internal", "type": "r5n.8xlarge", "zone": "us-east-1b", "capacity-type": "spot"}
2022-12-13T23:51:18.334Z        INFO    controller.provisioning.cloudprovider   launched new instance   {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot", "launched-instance": "i-067cf5e99bc4f4fda", "hostname": "ip-10-208-25-138.ec2.internal", "type": "r5n.12xlarge", "zone": "us-east-1b", "capacity-type": "spot"}
2022-12-13T23:51:18.338Z        INFO    controller.provisioning.cloudprovider   launched new instance   {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot", "launched-instance": "i-033caafd1ff6305f7", "hostname": "ip-10-208-8-140.ec2.internal", "type": "r5n.12xlarge", "zone": "us-east-1b", "capacity-type": "spot"}
2022-12-13T23:51:18.338Z        INFO    controller.provisioning.cloudprovider   launched new instance   {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot", "launched-instance": "i-059ba620a24f2d91f", "hostname": "ip-10-208-26-128.ec2.internal", "type": "r5n.4xlarge", "zone": "us-east-1b", "capacity-type": "spot"}
2022-12-13T23:51:18.341Z        INFO    controller.provisioning.cloudprovider   launched new instance   {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot", "launched-instance": "i-0f9e339cd71d7fd2d", "hostname": "ip-10-208-65-11.ec2.internal", "type": "c5n.9xlarge", "zone": "us-east-1e", "capacity-type": "spot"}
2022-12-13T23:51:18.341Z        INFO    controller.provisioning.cloudprovider   launched new instance   {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot", "launched-instance": "i-0bd52684f99a5a74a", "hostname": "ip-10-208-22-84.ec2.internal", "type": "r5n.8xlarge", "zone": "us-east-1b", "capacity-type": "spot"}
2022-12-13T23:51:18.342Z        INFO    controller.provisioning.cloudprovider   launched new instance   {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot", "launched-instance": "i-0ca8cc5e8e877b8c1", "hostname": "ip-10-208-8-205.ec2.internal", "type": "r5n.4xlarge", "zone": "us-east-1b", "capacity-type": "spot"}
2022-12-13T23:51:18.343Z        INFO    controller.provisioning.cloudprovider   launched new instance   {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot", "launched-instance": "i-088bae61aacf23733", "hostname": "ip-10-208-20-93.ec2.internal", "type": "r5n.4xlarge", "zone": "us-east-1b", "capacity-type": "spot"}
2022-12-13T23:51:18.347Z        INFO    controller.provisioning.cloudprovider   launched new instance   {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot", "launched-instance": "i-061e2bc70fbb2f15b", "hostname": "ip-10-208-19-65.ec2.internal", "type": "r5n.8xlarge", "zone": "us-east-1b", "capacity-type": "spot"}
2022-12-13T23:51:18.347Z        INFO    controller.provisioning.cloudprovider   launched new instance   {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot", "launched-instance": "i-0d0ef3a2d6f469b1c", "hostname": "ip-10-208-14-240.ec2.internal", "type": "r5n.8xlarge", "zone": "us-east-1b", "capacity-type": "spot"}
2022-12-13T23:51:18.351Z        INFO    controller.provisioning.cloudprovider   launched new instance   {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot", "launched-instance": "i-0e3430f9fa7d99ac6", "hostname": "ip-10-208-23-31.ec2.internal", "type": "r5n.4xlarge", "zone": "us-east-1b", "capacity-type": "spot"}
2022-12-13T23:51:18.371Z        INFO    controller.provisioning.cloudprovider   launched new instance   {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot", "launched-instance": "i-0bebe831961d35295", "hostname": "ip-10-208-20-74.ec2.internal", "type": "r5n.8xlarge", "zone": "us-east-1b", "capacity-type": "spot"}
2022-12-13T23:51:18.375Z        INFO    controller.provisioning.cloudprovider   launched new instance   {"commit": "470aa83", "provisioner": "mediation-events-pipelines-spot", "launched-instance": "i-034f3107f97ae5b8b", "hostname": "ip-10-208-18-115.ec2.internal", "type": "r5n.4xlarge", "zone": "us-east-1b", "capacity-type": "spot"}
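
As a rough cross-check of the numbers, here is a minimal sketch that tallies the vCPUs of the instance types in the "launched new instance" log lines above. The vCPU counts per type are taken from the public EC2 instance specifications, not from the logs, so treat them as an assumption. Note that the log lists five r5n.4xlarge launches, so the counts here come from the log lines rather than from the summary list in the report, and the total it computes may differ from the figure quoted there.

```python
from collections import Counter

# vCPUs per instance type (assumption: public EC2 instance specifications).
VCPUS = {
    "r5n.4xlarge": 16,
    "r5n.8xlarge": 32,
    "c5n.9xlarge": 36,
    "r5n.12xlarge": 48,
}

# Instance types in the order of the "launched new instance" log lines.
launched = [
    "r5n.8xlarge", "r5n.12xlarge", "r5n.12xlarge", "r5n.4xlarge",
    "c5n.9xlarge", "r5n.8xlarge", "r5n.4xlarge", "r5n.4xlarge",
    "r5n.8xlarge", "r5n.8xlarge", "r5n.4xlarge", "r5n.8xlarge",
    "r5n.4xlarge",
]

counts = Counter(launched)
launched_vcpus = sum(VCPUS[t] for t in launched)

pods = 13
requested_vcpus = pods * 10                   # 13 pods, CPU request of 10 each
baseline_vcpus = pods * VCPUS["r5n.4xlarge"]  # one 4xlarge per pod

for instance_type, n in sorted(counts.items()):
    print(f"{n} x {instance_type} = {n * VCPUS[instance_type]} vCPUs")
print(f"requested={requested_vcpus} baseline={baseline_vcpus} launched={launched_vcpus}")
```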

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@liorfranko liorfranko added the bug Something isn't working label Dec 15, 2022
@FernandoMiguel
Contributor

Can you use ec2-instance-selector to see the spot pricing for those instances in your account, or check the spot pricing in the AWS console?
Maybe these were the cheapest options at the time?

@liorfranko
Author

Here is the spot price history:
[Four spot price history screenshots]

4 x r5n.4xlarge: $2.9016
5 x r5n.8xlarge: $6.1935
1 x c5n.9xlarge: $0.8366
2 x r5n.12xlarge: $4.123
That's $14.0547 per hour, instead of ~$9 per hour.
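
The arithmetic behind these two figures can be sketched as follows. This assumes the amounts listed above are hourly totals per instance group (which is consistent with the stated sum), and that the ~9$ estimate corresponds to running all 13 pods on r5n.4xlarge at the same per-instance price; the second point is an assumption, since the comment does not say how the estimate was derived.

```python
# (instance count, hourly cost of the whole group in USD), as quoted above.
groups = {
    "r5n.4xlarge": (4, 2.9016),
    "r5n.8xlarge": (5, 6.1935),
    "c5n.9xlarge": (1, 0.8366),
    "r5n.12xlarge": (2, 4.123),
}

# Total hourly cost of the fleet that was actually launched.
actual_total = sum(cost for _, cost in groups.values())

# Assumption: the alternative fleet is 13 r5n.4xlarge at the same unit price.
count_4xl, cost_4xl = groups["r5n.4xlarge"]
unit_price_4xl = cost_4xl / count_4xl
all_4xlarge_total = 13 * unit_price_4xl

print(f"actual fleet:    ${actual_total:.4f}/hour")
print(f"13 x r5n.4xlarge: ${all_4xlarge_total:.4f}/hour")
print(f"overspend:       {actual_total / all_4xlarge_total - 1:.0%}")
```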

Here is the current average frequency of interruption in our region (Taken from https://aws.amazon.com/ec2/spot/instance-advisor/):
[Four Spot Instance Advisor screenshots, taken 2022-12-16 around 20:00]

From my experience with spot instances, r5n.4xlarge is the least interrupted instance type in the above list.

@spring1843
Contributor

We are looking into this internally to see if we can find a better answer for you.

@ellistarn
Contributor

One thing to note about the spot instance advisor:

Frequency of interruption represents the rate at which Spot has reclaimed capacity during the trailing month. They are in ranges of < 5%, 5-10%, 10-15%, 15-20% and > 20%.

This does a good job of representing monthly history, but a bad job of representing short-lived spikes. From our investigation, things are working as intended, and your instance type flexibility enabled you to find capacity with a lower interruption rate during this scale-out. Karpenter uses the price-capacity-optimized (PCO) strategy; you may be interested to learn more here: https://aws.amazon.com/blogs/compute/introducing-price-capacity-optimized-allocation-strategy-for-ec2-spot-instances/

@liorfranko
Author

The strategies described in the linked doc are not even close to what we see in reality.
Its examples show a split between c5 and c6 or c5 and m5, but all the instances are xlarge, which is great.
That is not our case: we see a split between 4xlarge and 12xlarge.

The fact is that since we moved from CAS to Karpenter, our monthly costs have increased by 15%.

@tzneal
Contributor

tzneal commented Dec 22, 2022

Going to re-open this and investigate deeper.

@tzneal tzneal reopened this Dec 22, 2022
@mercuriete

mercuriete commented Jan 18, 2023

@liorfranko
I had the same problem, asked about it on another issue, and somebody gave me this advice:
use weighted provisioners.
I have a provisioner for cheap instances and then a default provisioner with all types.

In my use case all containers fit in t3.small, so I created a provisioner with t3.small with high priority and then another provisioner with all types with low or 0 priority.

My problem now is that I do not know how to debug the configuration, and consolidation is not working for me, but that is another story.

I hope it works for you, because for me it is almost good.
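
For readers unfamiliar with the weighted-provisioner setup described in this comment, here is a minimal sketch that builds two Provisioner manifests as plain dicts and prints them as JSON. It assumes the `karpenter.sh/v1alpha5` Provisioner API with its `spec.weight` field (higher weight is tried first); the provisioner names, the `providerRef` name, and the weight values are hypothetical, so check the field names against the documentation for your Karpenter version.

```python
import json


def provisioner(name, weight, instance_types=None):
    """Build a Provisioner manifest as a plain dict (assumed v1alpha5 schema)."""
    # "default" is a hypothetical node template name.
    spec = {"weight": weight, "providerRef": {"name": "default"}}
    if instance_types:
        # Restrict this provisioner to an explicit list of instance types.
        spec["requirements"] = [{
            "key": "node.kubernetes.io/instance-type",
            "operator": "In",
            "values": instance_types,
        }]
    return {
        "apiVersion": "karpenter.sh/v1alpha5",
        "kind": "Provisioner",
        "metadata": {"name": name},
        "spec": spec,
    }


# High-priority provisioner limited to the cheap type from the comment.
cheap = provisioner("cheap-first", weight=100, instance_types=["t3.small"])
# Low-priority fallback with no instance-type restriction.
fallback = provisioner("all-types", weight=1)

manifests = json.dumps([cheap, fallback], indent=2)
print(manifests)
```

The trade-off raised in the thread still applies: each extra provisioner is one more object to keep in sync.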

@liorfranko
Author

Well, it will work, but operation-wise, it will be tough to manage.

@liorfranko
Author

Recently, the spot market has been more stable, and we don't see this behavior anymore.
Likewise, we will replace the anti-affinity with topology-spread.
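
As an illustration of the topology-spread approach mentioned here, the following sketch builds the relevant pod spec fragment as a dict and prints it as JSON. The field names come from the Kubernetes `topologySpreadConstraints` API; the `app` label value and the choice of `maxSkew: 1` with `DoNotSchedule` are assumptions, not the configuration actually used. Unlike required anti-affinity on the hostname, this permits more than one matching pod per node as long as the per-node counts differ by at most `maxSkew`.

```python
import json

# Hypothetical label; use whatever labels the deployment's pods carry.
POD_LABELS = {"app": "events-pipeline"}

# Pod spec fragment that spreads matching pods across nodes.
topology_spread = {
    "topologySpreadConstraints": [{
        "maxSkew": 1,                             # max allowed count difference
        "topologyKey": "kubernetes.io/hostname",  # spread per node
        "whenUnsatisfiable": "DoNotSchedule",     # hard constraint
        "labelSelector": {"matchLabels": POD_LABELS},
    }]
}

print(json.dumps(topology_spread, indent=2))
```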

@tzneal
Contributor

tzneal commented Jan 27, 2023

Hey @liorfranko I did want to let you know that we merged in #3292 which should also help prevent this.

@liorfranko
Author

Looks great; we'll upgrade next week.
Thanks!

@felipewnp

Looks great; we'll upgrade next week. Thanks!

Hi @liorfranko

Did the update solve your issue?

@liorfranko
Author

Hi @felipewnp
Yes, it solved my issue.

7 participants