AWS Cloud Provider now uses accurate pod/memory values when constructing node objects #572

Merged: 1 commit merged into aws:main on Jul 29, 2021

Conversation

@ellistarn (Contributor) commented on Jul 29, 2021

Issue, if available:

Description of changes:

v1.ResourcePods:   *instanceType.Pods(),
v1.ResourceCPU:    *instanceType.CPU(),
v1.ResourceMemory: *instanceType.Memory(),
  1. Resolves an issue where the scheduler would place additional pods onto nodes before they come online, because the reported capacity values were far too high (see the sketch below).
  2. Deleted the NodeAPI abstraction and folded it into the InstanceProvider.
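
For context, a minimal sketch of how the capacity values above might be wired into the node object. It assumes the usual corev1 (imported as v1) and metav1 imports; nodeName is a placeholder, while instanceType.Pods()/CPU()/Memory() mirror the snippet in the description:

node := &v1.Node{
	ObjectMeta: metav1.ObjectMeta{Name: nodeName}, // nodeName is illustrative
	Status: v1.NodeStatus{
		Capacity: v1.ResourceList{
			// Values come from the EC2 instance type rather than defaults, so the
			// scheduler sees realistic capacity before the kubelet registers.
			v1.ResourcePods:   *instanceType.Pods(),   // max pods for the instance type
			v1.ResourceCPU:    *instanceType.CPU(),    // vCPUs as a resource.Quantity
			v1.ResourceMemory: *instanceType.Memory(), // memory as a resource.Quantity
		},
	},
}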

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

netlify bot commented on Jul 29, 2021

✔️ Deploy Preview for karpenter-docs-prod canceled.

🔨 Explore the source changes: a8b75d1

🔍 Inspect the deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/610316ff33af5d0008aa79f7

@ellistarn changed the title from "AWS Cloud Provider now uses accurate pod/memory values when construct…" to "AWS Cloud Provider now uses accurate pod/memory values when constructing node objects" on Jul 29, 2021
@njtran (Contributor) left a comment

Nice work! Just a couple of comments.

if err := retry.Do(
	func() (err error) { return p.getInstance(ctx, id, instance) },
	retry.Delay(1*time.Second),
	retry.Attempts(3),
Contributor:

Is there a reason you picked 3 attempts of 1 second each? It seems pretty likely for EC2 to keep throttling for longer than 3 seconds if you're making a lot of requests.

Can we make this a larger interval with more retries? If instance creation fails, we'll have to redo binpacking and attempt to create the instance all over again anyway.
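
For illustration only, a larger interval with more retries and backoff might look like the sketch below. It assumes the retry package in use is avast/retry-go (whose Delay, Attempts, and DelayType options are shown); the durations and the error-wrapping line are placeholders, not values this PR settled on:

if err := retry.Do(
	func() (err error) { return p.getInstance(ctx, id, instance) },
	retry.Delay(5*time.Second),          // longer base delay between attempts
	retry.Attempts(6),                   // more attempts before giving up
	retry.DelayType(retry.BackOffDelay), // back off instead of retrying at a fixed interval
); err != nil {
	return nil, fmt.Errorf("getting instance %s, %w", id, err) // illustrative error handling
}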

Contributor (author):

I haven't seen any throttling failures here, even in some preliminary 1k-node scale tests; the describe limits are actually quite high. I'm open to increasing these, but I'm a bit wary of the retries themselves causing cascading effects on other threads. Regardless, we will recover gracefully with the reallocator if these retries do fail.

Contributor:

Cool. We should definitely keep track of this somewhere, especially when we get to larger cluster sizes.

pkg/cloudprovider/aws/instance.go (review comment resolved; now outdated)
errs := make([]error, len(pods))
workqueue.ParallelizeUntil(ctx, len(pods), len(pods), func(index int) {
	errs[index] = b.bind(ctx, node, pods[index])
})
logging.FromContext(ctx).Infof("Bound %d pod(s) to node %s", len(pods), node.Name)
Contributor:

Let's say we had to bind 10 pods and only 5 succeeded. This would still log that we bound 10 pods, right? Can we change this to be more accurate?

errs := make([]error, len(pods))
workqueue.ParallelizeUntil(ctx, len(pods), len(pods), func(index int) {
	errs[index] = b.bind(ctx, node, pods[index])
})
err := multierr.Combine(errs...)
logging.FromContext(ctx).Infof("Bound %d/%d pod(s) to node %s", len(pods)-len(multierr.Errors(err)), len(pods), node.Name)
return err
Contributor:

I don't know if we need fractions here. I think it'd be sufficient to just say how many did bind.
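
A minimal sketch of that simpler variant, reusing the errs slice and go.uber.org/multierr from the suggestion above:

err := multierr.Combine(errs...)
bound := len(pods) - len(multierr.Errors(err)) // count only the binds that actually succeeded
logging.FromContext(ctx).Infof("Bound %d pod(s) to node %s", bound, node.Name)
return err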

@ellistarn merged commit 7ac2ea6 into aws:main on Jul 29, 2021
@ellistarn deleted the leak branch on July 29, 2021 22:19
gfcroft pushed a commit to gfcroft/karpenter-provider-aws that referenced this pull request Nov 25, 2023