
Better log message for quota-related failures #3267

Closed

sourcedelica opened this issue Jan 24, 2023 · 4 comments
Labels
feature New feature or request

Comments

@sourcedelica

Tell us about your request

When a node fails to be provisioned due to a service quota, make that clear in the log. Right now the relevant portion of the log message looks like this:

incompatible with provisioner "gitlab-runner-ifi-bazel-instance-store", no instance type satisfied resources {"memory":"172Gi","pods":"1"} and requirements gitlab-runner.tradeweb.com/network-zone In [dev], kubernetes.io/arch In [amd64], node.kubernetes.io/instance-type In [c5ad.24xlarge], gitlab-runner.tradeweb.com/scope In [ifi-bazel-instance-store], tradeweb.cloud/karpenter-profile In [gitlab-runner-ifi-bazel], karpenter.sh/provisioner-name In [gitlab-runner-ifi-bazel-instance-store], karpenter.sh/capacity-type In [on-demand], gitlab-runner.tradeweb.com/region In [us]

"No instance type satisfied" isn't the best wording for this situation.

When trying to launch the same instance type manually in that account, there is a clear error message:

Instance launch failed
You have requested more vCPU capacity than your current vCPU limit of 960 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

We're trying to debug pod provisioning issues, but the logs don't make it clear what the underlying problem is.

Are you currently working around this issue?

Users are coming to us wondering why their CI jobs aren't running. We have to work out on their behalf that it's a quota-related issue for their account.
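Right now the manual check per account is roughly the following with the AWS CLI (a sketch; L-1216C47A is the standard On-Demand vCPU quota bucket, and we're assuming the c5ad family falls under it):

# Find the On-Demand vCPU quota buckets and their current limits
aws service-quotas list-service-quotas --service-code ec2 \
  --query "Quotas[?contains(QuotaName, 'On-Demand')].[QuotaCode,QuotaName,Value]" --output table

# Check the standard bucket's limit directly
aws service-quotas get-service-quota --service-code ec2 --quota-code L-1216C47A \
  --query 'Quota.Value'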

Additional Context

No response

Attachments

No response

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
sourcedelica added the feature (New feature or request) label on Jan 24, 2023
@spring1843
Contributor

Are you 100% sure the capacity error did not show up in the logs?
The API responses are typically reflected in the logs; I want to make sure these aren't two different issues. Did you have only one provisioner?

@sourcedelica
Author

sourcedelica commented Jan 28, 2023

There is only one Karpenter deployment running, though there are three different Provisioner resources that provision different scenarios. They are set up with taints to be mutually exclusive.

I tried grepping the log for the terms quota, limit, capacity, vcpu, and bucket, but didn't find anything except one unrelated issue (see below). There are tons of log messages, so I can't go through them individually, unfortunately.
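For reference, the grep was roughly this (the karpenter namespace, deployment, and controller container names are from our install and may differ):

# Search the Karpenter controller logs for quota-related terms
kubectl logs -n karpenter deploy/karpenter -c controller --since=72h \
  | grep -iE 'quota|limit|capacity|vcpu|bucket'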

The grep on capacity did turn up a handful of entries, though they don't match up with the times when I ran into the quota issue. It's interesting because I'm not requesting a specific AZ in my pods; the message is about EC2 temporarily not being able to provision the instance type I needed. Here is the message:

2023-01-26T23:42:51.358Z        ERROR   controller.provisioning Provisioning failed, launching node, creating cloud provider instance, with fleet error(s), InsufficientInstanceCapacity: We currently do not have sufficient c5ad.24xlarge capacity in the Availability Zone you requested (us-east-1b). Our system will be working on provisioning additional capacity. You can currently get c5ad.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1c, us-east-1d, us-east-1f.; InsufficientInstanceCapacity: We currently do not have sufficient c5ad.24xlarge capacity in the Availability Zone you requested (us-east-1d). Our system will be working on provisioning additional capacity. You can currently get c5ad.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b, us-east-1c, us-east-1f.; InsufficientInstanceCapacity: We currently do not have sufficient c5ad.24xlarge capacity in the Availability Zone you requested (us-east-1a). Our system will be working on provisioning additional capacity. You can currently get c5ad.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1b, us-east-1c, us-east-1d, us-east-1f.     {"commit": "5d4ae35-dirty"}

@bwmetcalf

We are seeing a similar issue: #3426.

@njtran
Contributor

njtran commented May 31, 2023

Closing this, as we've made a lot of scheduling log improvements recently. One notable one is here: kubernetes-sigs/karpenter#317. Please upgrade to v0.27.5 to pick up some of these changes.
