
Better log message for quota-related failures #3267

Closed

sourcedelica opened this issue Jan 24, 2023 · 4 comments
Labels
feature New feature or request

Comments

@sourcedelica

Tell us about your request

When a node fails to be provisioned due to a service quota, make that clear in the log. Right now the relevant portion of the log message looks like this:

incompatible with provisioner "gitlab-runner-ifi-bazel-instance-store", no instance type satisfied resources {"memory":"172Gi","pods":"1"} and requirements gitlab-runner.tradeweb.com/network-zone In [dev], kubernetes.io/arch In [amd64], node.kubernetes.io/instance-type In [c5ad.24xlarge], gitlab-runner.tradeweb.com/scope In [ifi-bazel-instance-store], tradeweb.cloud/karpenter-profile In [gitlab-runner-ifi-bazel], karpenter.sh/provisioner-name In [gitlab-runner-ifi-bazel-instance-store], karpenter.sh/capacity-type In [on-demand], gitlab-runner.tradeweb.com/region In [us]

"No instance type satisfied" isn't the best wording for this situation.

When trying to launch the same instance type manually in that account, there is a clear error message:

Instance launch failed
You have requested more vCPU capacity than your current vCPU limit of 960 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

We're trying to debug pod provisioning issues, but the logs don't make it clear what the underlying problem is.

Are you currently working around this issue?

Users are coming to us wondering why their CI jobs aren't running. We have to work out on their behalf that it's a quota-related issue for their account.
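Right now the manual check per account is roughly the following with the AWS CLI (a sketch; L-1216C47A is the standard On-Demand vCPU quota bucket, and we're assuming the c5ad family falls under it):

# Find the On-Demand vCPU quota buckets and their current limits
aws service-quotas list-service-quotas --service-code ec2 \
  --query "Quotas[?contains(QuotaName, 'On-Demand')].[QuotaCode,QuotaName,Value]" --output table

# Check the standard bucket's limit directly
aws service-quotas get-service-quota --service-code ec2 --quota-code L-1216C47A \
  --query 'Quota.Value'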

Additional Context

No response

Attachments

No response

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
sourcedelica added the feature (New feature or request) label on Jan 24, 2023
@spring1843
Contributor

Are you 100% sure the capacity error did not show up in the logs?
The API responses are typically reflected in the logs; I want to make sure these aren't two different issues. Did you have only one provisioner?

@sourcedelica
Author

sourcedelica commented Jan 28, 2023

There is only one Karpenter deployment running, though there are three different Provisioner resources that provision different scenarios. They are set up with taints to be mutually exclusive.

I tried grepping the log for the terms quota, limit, capacity, vcpu, and bucket, but didn't find anything except one unrelated issue (see below). There are tons of log messages, so I can't go through them individually, unfortunately.
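For reference, the grep was roughly this (the karpenter namespace, deployment, and controller container names are from our install and may differ):

# Search the Karpenter controller logs for quota-related terms
kubectl logs -n karpenter deploy/karpenter -c controller --since=72h \
  | grep -iE 'quota|limit|capacity|vcpu|bucket'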

The grep on capacity did turn up a handful of entries, though they don't match up with the times when I ran into the quota issue. It's interesting because I'm not requesting a specific AZ in my pods; the message is about EC2 temporarily not being able to provision the instance type I needed. Here is the message:

2023-01-26T23:42:51.358Z        ERROR   controller.provisioning Provisioning failed, launching node, creating cloud provider instance, with fleet error(s), InsufficientInstanceCapacity: We currently do not have sufficient c5ad.24xlarge capacity in the Availability Zone you requested (us-east-1b). Our system will be working on provisioning additional capacity. You can currently get c5ad.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1c, us-east-1d, us-east-1f.; InsufficientInstanceCapacity: We currently do not have sufficient c5ad.24xlarge capacity in the Availability Zone you requested (us-east-1d). Our system will be working on provisioning additional capacity. You can currently get c5ad.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b, us-east-1c, us-east-1f.; InsufficientInstanceCapacity: We currently do not have sufficient c5ad.24xlarge capacity in the Availability Zone you requested (us-east-1a). Our system will be working on provisioning additional capacity. You can currently get c5ad.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1b, us-east-1c, us-east-1d, us-east-1f.     {"commit": "5d4ae35-dirty"}

@bwmetcalf

We are seeing a similar issue: #3426.

@njtran
Contributor

njtran commented May 31, 2023

Closing this, as we've made a lot of scheduling log improvements recently. One notable one is here: kubernetes-sigs/karpenter#317. Please upgrade to v0.27.5 to pick up some of these changes.
