-
Notifications
You must be signed in to change notification settings - Fork 832
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perform min instance bin-packing to short circuit unneeded packings #962
Conversation
✔️ Deploy Preview for karpenter-docs-prod ready! 🔨 Explore the source changes: 86bfe14 🔍 Inspect the deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/61b7cea8cf53290007527b32 😎 Browse the preview: https://deploy-preview-962--karpenter-docs-prod.netlify.app |
} | ||
|
||
type InstanceType struct { | ||
InstanceTypeOptions | ||
options InstanceTypeOptions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Too fancy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was necessary because the funcs defined in the Interface now match the struct fields of InstanceTypeOptions, since they are now exported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm -- I actually kind of did this intentionally to encourage not making them public. Do you need to modify instance types in the test? Would it be simpler to defined them all in this package so we can keep them private?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making them public allows for more isolated tests with cloud provider specific data. For example, in my testing of the binpacking pkg, I loaded all the real AWS InstanceTypes into fake instance types and ran that thru the packing logic for lightweight, real-world testing.
I'm more in favor of keeping them public since it makes loading arbitrary instance types easier.
@@ -39,10 +41,36 @@ type Result struct { | |||
unpacked []*v1.Pod | |||
} | |||
|
|||
type Packables []*Packable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this sort implementation pattern to be vastly inferior to https://pkg.go.dev/sort#Slice. You can separate out the comparator as necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah you're right, I'll refactor those sorts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you forget to push?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no? I might have done an intermediate push when you looked, should be okay now.
return nil | ||
} | ||
} | ||
needsGpu := p.needsResource(pods, resources.NvidiaGPU) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you collapse this into
if p.requiresResource(pods, resources.NvidiaGPU) == p.InstanceType.NvidiaGPUs().IsZero() {
return fmt.Errorf("%s is only used if required", resources.NvidiaGPU)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this error message:
return errors.New("pod requirements do not match Nvidia GPUs")
"not required" could be understood as "is allowed" which is confusing to get in an error message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear my suggestion was to not explicitly reference anything in the string and instead use the resource type. This will allow us to abstract in the future.
// Pods is a set of pods that may schedule to the node; used for binpacking. | ||
Pods []*v1.Pod | ||
} | ||
|
||
type RuntimeConstraints struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thoughts on calling this object scheduling.Constraints for parity with v1alpha5.Constraints. I don't think that "Runtime" differentiates more meaningfully than the local context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also -- I'm not understanding why this object is useful.
pkg/utils/pod/sortable.go
Outdated
type SortablePods []*v1.Pod | ||
type ByResourcesRequested struct{ SortablePods } | ||
|
||
func (pods SortablePods) Len() int { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thoughts on switching to the sort.Slice(pods, pod.CompareResources) approach mentioned above?
} | ||
return packables[a].AMDGPUs().Cmp(*packables[b].AMDGPUs()) == -1 || | ||
packables[a].NvidiaGPUs().Cmp(*packables[b].NvidiaGPUs()) == -1 || | ||
packables[a].AWSNeurons().Cmp(*packables[b].AWSNeurons()) == -1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do the number of AWSNeurons, AMDGPUs, and NvidiaGPUs on each instance increase with instance size at the same rate? If not, wouldn't it be possible for this to have non-deterministic behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They increase, not necessarily at the same rate by gpu but by vcpu and memory. I'm not sure how that would introduce non-deterministic behavior though?
For example:
$ ec2-instance-selector -o table-wide --allow-list g4ad
Instance Type VCPUs Mem (GiB) Hypervisor Current Gen Hibernation Support CPU Arch Network Performance ENIs GPUs GPU Mem (GiB) GPU Info On-Demand Price/Hr
------------- ----- --------- ---------- ----------- ------------------- -------- ------------------- ---- ---- ------------- -------- ------------------
g4ad.xlarge 4 16 nitro true false x86_64 Up to 10 Gigabit 2 1 8 AMD Radeon Pro V520 -No Price Filter Specified-
g4ad.2xlarge 8 32 nitro true false x86_64 Up to 10 Gigabit 2 1 8 AMD Radeon Pro V520 -No Price Filter Specified-
g4ad.4xlarge 16 64 nitro true false x86_64 Up to 10 Gigabit 3 1 8 AMD Radeon Pro V520 -No Price Filter Specified-
g4ad.8xlarge 32 128 nitro true false x86_64 15 Gigabit 4 2 16 AMD Radeon Pro V520 -No Price Filter Specified-
g4ad.16xlarge 64 256 nitro true false x86_64 25 Gigabit 8 4 32 AMD Radeon Pro V520 -No Price Filter Specified-
return nil | ||
} | ||
} | ||
needsGpu := p.needsResource(pods, resources.NvidiaGPU) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this error message:
return errors.New("pod requirements do not match Nvidia GPUs")
"not required" could be understood as "is allowed" which is confusing to get in an error message.
if needsGpu && p.InstanceType.AWSNeurons().IsZero() { | ||
return errors.New("aws neuron is required") | ||
} else if !needsGpu && !p.InstanceType.AWSNeurons().IsZero() { | ||
return errors.New("aws neuron is not required") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"not required" could be understood as "is allowed"
59ff79d
to
b566b14
Compare
a103eeb
to
86bfe14
Compare
1. Issue, if available:
N/A
2. Description of changes:
PackablesFor(..
outside of theremainingPods
loop and instead DeepCopy the packables on each iteration to short circuit validation that has already occurred.Some Numbers:
Worst case (meaning min-packing didn't optimize anything):
Best Case (min-packing engaged on smallest instance-type):
3. Does this change impact docs?
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.