Resolves race conditions exposed at high scale #578
Conversation
Nice work so far.
@@ -25,6 +25,7 @@ import (
	"github.com/awslabs/karpenter/pkg/controllers"
	"github.com/awslabs/karpenter/pkg/controllers/allocation"
	"github.com/awslabs/karpenter/pkg/controllers/expiration"
+	"github.com/awslabs/karpenter/pkg/controllers/node"
What do you think of naming this something like a "regulation" controller? I imagine the name node might end up clashing with a lot of packages/code in the future that reference the Kubernetes object itself.
I was thinking of "compatibility controller", "node state controller". I actually think that because we're inside the package node, you can refer to other packages named node without colliding. I'd love to revisit this name as we explore how this controller will evolve.
sess := withUserAgent(session.Must(session.NewSession(
	request.WithRetryer(
		&aws.Config{STSRegionalEndpoint: endpoints.RegionalSTSEndpoint},
		client.DefaultRetryer{NumMaxRetries: 3},
So does this make all cloud provider calls fail after 4 failures? Is it possible to specify the interval in between retries as well? What's the default for that?
I'm actually just restoring this logic which was accidentally removed here: https://github.com/awslabs/karpenter/pull/564/files#diff-281f3c570b5d09346495d0c9b5f5e2a625bbf9efbe8337ff51f5c914add8c916L34. We used to have the base retries + custom retry, and I ripped both out. Now I'm adding the base retry back in.
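To the retry-interval question: client.DefaultRetryer also exposes MinRetryDelay/MaxRetryDelay fields, so the backoff window can be pinned explicitly if we ever need to. A minimal sketch of what that could look like (not part of this PR; the delay values and function name are illustrative):

package cloudprovider

import (
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/client"
	"github.com/aws/aws-sdk-go/aws/endpoints"
	"github.com/aws/aws-sdk-go/aws/request"
	"github.com/aws/aws-sdk-go/aws/session"
)

// newSession wires the base retryer with explicit delay bounds in addition
// to the retry count. With only NumMaxRetries set (as in this PR), the SDK
// falls back to its own default delays.
func newSession() *session.Session {
	return session.Must(session.NewSession(
		request.WithRetryer(
			&aws.Config{STSRegionalEndpoint: endpoints.RegionalSTSEndpoint},
			client.DefaultRetryer{
				NumMaxRetries: 3,                     // 1 initial attempt + up to 3 retries
				MinRetryDelay: 30 * time.Millisecond, // lower bound on backoff between retries
				MaxRetryDelay: 300 * time.Second,     // upper bound on backoff between retries
			},
		),
	))
}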
pkg/controllers/allocation/bind.go
Outdated
// the node by the kube scheduler, causing OutOfCPU errors when the
// binpacked pods race to bind to the same node. The system eventually
// heals, but causes delays from additional provisioning (thrash). This
// taint will be removed when the node is marked as ready.
Would you be able to reference the node controller here in the comment so they know where it's being removed?
pkg/controllers/node/controller.go
Outdated
}
return reconcile.Result{}, err
}
if len(stored.Labels[v1alpha3.ProvisionerNameLabelKey]) == 0 {
Although we can't have an empty-named provisioner, this label could still be set with an empty value.
This is a weird edge case, but it means such a node could sneak past and have the work done to it. I don't know what we'd want to do in that case, but WDYT about just making sure the key isn't in the map at all?
This is cleaner anyways.
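For reference, a minimal sketch of the map-presence check being suggested (names taken from the snippet above; not the PR's exact code):

// Skip nodes that don't carry the provisioner label at all, rather than
// only skipping nodes where the value happens to be empty.
if _, ok := stored.Labels[v1alpha3.ProvisionerNameLabelKey]; !ok {
	return reconcile.Result{}, nil // not provisioned by Karpenter; ignore
}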
// Reconcile adds the termination finalizer if the node is not deleting
func (r *Finalizer) Reconcile(n *v1.Node) error {
	if !n.DeletionTimestamp.IsZero() {
It's possible a node could be deleting (deletionTimestamp already set) without carrying the TerminationFinalizer. In that case, if the other finalizers don't implement instance deletion logic, the instance might leak.
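To make the concern concrete, here is a rough sketch of the reconcile flow being discussed (the finalizer constant and exact control flow are assumptions, not this PR's code). The early return means a node that is already deleting never gets the finalizer added, so instance cleanup then relies entirely on whichever finalizers are already present:

// Reconcile adds the termination finalizer if the node is not deleting.
func (r *Finalizer) Reconcile(n *v1.Node) error {
	if !n.DeletionTimestamp.IsZero() {
		// Already deleting: don't mutate finalizers. If the termination
		// finalizer isn't present here, the instance may leak unless some
		// other finalizer handles instance deletion.
		return nil
	}
	for _, f := range n.Finalizers {
		if f == v1alpha3.TerminationFinalizer { // assumed constant name
			return nil // already tracked for termination
		}
	}
	n.Finalizers = append(n.Finalizers, v1alpha3.TerminationFinalizer)
	return nil
}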
Potentially some daemonsets that only tolerate specific taints, but most tolerate all of them. We should stay vigilant on this one. Worst case, we alter the behavior to set node.spec.unschedulable instead, but it's hard to differentiate that from cordon behavior.
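For context, an illustrative toleration fragment (not code from this PR): a catch-all toleration like the first is what most daemonsets ship with, so they still schedule onto a node carrying the not-ready taint, while the second form only tolerates one specific taint.

// Tolerates every taint, including karpenter.sh/not-ready=NoSchedule.
tolerateAll := v1.Toleration{Operator: v1.TolerationOpExists}

// Tolerates only a single, specific taint.
tolerateSpecific := v1.Toleration{
	Key:      "node.kubernetes.io/not-ready",
	Operator: v1.TolerationOpExists,
	Effect:   v1.TaintEffectNoExecute,
}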
Signed-off-by: sadath-12 <sadathsadu2002@gmail.com> Signed-off-by: syedsadath-17 <90619459+sadath-12@users.noreply.github.com>
Issue, if available:
Description of changes:
Built a lightweight controller that enforces (over time) invariants that may be violated under scalability bottlenecks.
Case 1: Node Missing Finalizer
If kube-apiserver QPS is backed up, nodes may come online before we're able to create the node object. This means that the node object would be created by the kubelet, rather than created by Karpenter and then patched by the kubelet. The result is that the node would not have the finalizer, and instances could leak. See: #549.
Now, we add the finalizer if it doesn't exist, unless the node is terminating.
Case 2: Scheduler Racing against Provisioner
We taint nodes with karpenter.sh/not-ready=NoSchedule to prevent the kube scheduler from scheduling pods before we're able to bind them ourselves. The kube scheduler has an eventually consistent cache of nodes and pods, so it's possible for it to see a provisioned node before it sees the pods bound to it. This creates an edge case where other pending pods may be bound to the node by the kube scheduler, causing OutOfCPU errors when the binpacked pods race to bind to the same node. The system eventually heals, but this causes delays from additional provisioning (thrash). This taint is removed by the node controller once the node is marked ready.
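A rough sketch of the two halves of this fix (helper names and exact code are assumptions; the taint key is the one named above):

notReadyTaint := v1.Taint{Key: "karpenter.sh/not-ready", Effect: v1.TaintEffectNoSchedule}

// At bind time, the allocator taints the node it creates so the kube
// scheduler won't place additional pods on it before the binpacked pods
// are bound.
node.Spec.Taints = append(node.Spec.Taints, notReadyTaint)

// Later, the node controller strips the taint once the node reports Ready
// (the readiness check is illustrative).
if isReady(node) {
	remaining := []v1.Taint{}
	for _, t := range node.Spec.Taints {
		if t.Key != notReadyTaint.Key {
			remaining = append(remaining, t)
		}
	}
	node.Spec.Taints = remaining
}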
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.