
Resolves race conditions exposed at high scale #578

Merged

merged 2 commits into aws:main from race on Jul 30, 2021

Conversation

@ellistarn (Contributor) commented on Jul 30, 2021

Issue, if available:

Description of changes:
Built a lightweight controller that helps enforce, over time, invariants that may be violated due to scalability bottlenecks.

Case 1: Node Missing Finalizer

If kubeapi QPS is backed up, nodes may come online before we're able to create the node object. In that case the node object is created by the kubelet rather than created by Karpenter and then patched by the kubelet. The result is that the node lacks the termination finalizer, and instances can leak. See: #549.

Now, we add the finalizer if it doesn't exist, unless the node is terminating.
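Roughly, the new behavior looks like the sketch below: a minimal illustration assuming a controller-runtime client, with the function name and finalizer constant as placeholders rather than the exact code in this PR.

```go
package node

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Placeholder for the termination finalizer string defined elsewhere in the repo.
const terminationFinalizer = "karpenter.sh/termination"

// reconcileFinalizer adds the termination finalizer to nodes that the kubelet
// created before Karpenter could, unless the node is already terminating.
func reconcileFinalizer(ctx context.Context, kubeClient client.Client, node *v1.Node) error {
	if !node.DeletionTimestamp.IsZero() {
		return nil // never add a finalizer to a node that is already terminating
	}
	for _, finalizer := range node.Finalizers {
		if finalizer == terminationFinalizer {
			return nil // already present; nothing to do
		}
	}
	stored := node.DeepCopy()
	node.Finalizers = append(node.Finalizers, terminationFinalizer)
	// Patch (rather than Update) to avoid clobbering concurrent kubelet writes.
	return kubeClient.Patch(ctx, node, client.MergeFrom(stored))
}
```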

Case 2: Scheduler Racing against Provisioner

Taint nodes with karpenter.sh/not-ready:NoSchedule to prevent the kube scheduler from scheduling pods before we're able to bind them ourselves. The kube scheduler has an eventually consistent cache of nodes and pods, so it's possible for it to see a provisioned node before it sees the pods we've bound to it. This creates an edge case where other pending pods may be bound to the node by the kube scheduler, causing OutOfCPU errors when the binpacked pods race to bind to the same node. The system eventually heals, but the additional provisioning causes delays (thrash). This taint is removed by the node controller when the node is marked ready.
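The removal side can be pictured with the sketch below, assuming a controller-runtime client and a readiness check against the NodeReady condition; the taint key matches the description above, but the function and variable names are illustrative.

```go
package node

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// The startup taint applied to nodes Karpenter provisions (key as described above).
var notReadyTaint = v1.Taint{Key: "karpenter.sh/not-ready", Effect: v1.TaintEffectNoSchedule}

// removeNotReadyTaint strips the startup taint once the node reports Ready, so
// the kube scheduler can begin placing pods on it normally.
func removeNotReadyTaint(ctx context.Context, kubeClient client.Client, node *v1.Node) error {
	if !isReady(node) {
		return nil
	}
	stored := node.DeepCopy()
	taints := []v1.Taint{}
	for _, taint := range node.Spec.Taints {
		if taint.Key != notReadyTaint.Key {
			taints = append(taints, taint)
		}
	}
	node.Spec.Taints = taints
	return kubeClient.Patch(ctx, node, client.MergeFrom(stored))
}

// isReady returns true if the node's Ready condition is True.
func isReady(node *v1.Node) bool {
	for _, condition := range node.Status.Conditions {
		if condition.Type == v1.NodeReady {
			return condition.Status == v1.ConditionTrue
		}
	}
	return false
}
```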

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

netlify bot commented on Jul 30, 2021

✔️ Deploy Preview for karpenter-docs-prod canceled.

🔨 Explore the source changes: cd1d840

🔍 Inspect the deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/61038cc3222a970007ac6839

@ellistarn force-pushed the race branch 4 times, most recently from 2666041 to 902f8b2, on July 30, 2021 01:47
@njtran (Contributor) left a comment

Nice work so far.

@@ -25,6 +25,7 @@ import (
"github.com/awslabs/karpenter/pkg/controllers"
"github.com/awslabs/karpenter/pkg/controllers/allocation"
"github.com/awslabs/karpenter/pkg/controllers/expiration"
"github.com/awslabs/karpenter/pkg/controllers/node"
njtran (Contributor) commented:

What do you think of naming this something like a regulation controller? I imagine this might end up clashing with a lot of packages/code in the future that reference the Kubernetes object itself.

ellistarn (Author) replied:

I was thinking of "compatibility controller" or "node state controller". I actually think that because we're inside the package node, you can refer to other packages named node without colliding. I'd love to revisit this name as we explore how this controller will evolve.

pkg/apis/provisioning/v1alpha3/provisioner.go (resolved thread)
sess := withUserAgent(session.Must(session.NewSession(
	request.WithRetryer(
		&aws.Config{STSRegionalEndpoint: endpoints.RegionalSTSEndpoint},
		client.DefaultRetryer{NumMaxRetries: 3},
	),
)))
njtran (Contributor) commented:

So does this make all cloud provider calls fail after 4 failures? Is it possible to specify the interval between retries as well? What's the default for that?

ellistarn (Author) replied:

I'm actually just restoring logic that was accidentally removed here: https://github.com/awslabs/karpenter/pull/564/files#diff-281f3c570b5d09346495d0c9b5f5e2a625bbf9efbe8337ff51f5c914add8c916L34. We used to have the base retries plus a custom retry, and I ripped both out. Now I'm adding the base retry back in.
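On the interval question: the SDK's client.DefaultRetryer uses jittered exponential backoff and also exposes min/max delay fields, so the intervals could be pinned down explicitly if we ever need to. A sketch with illustrative values (this PR only restores NumMaxRetries: 3):

```go
package cloudprovider

import (
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/client"
	"github.com/aws/aws-sdk-go/aws/endpoints"
	"github.com/aws/aws-sdk-go/aws/request"
	"github.com/aws/aws-sdk-go/aws/session"
)

// newSession shows the retryer knobs available on client.DefaultRetryer. The
// delay values below are illustrative, not what this PR configures.
func newSession() *session.Session {
	return session.Must(session.NewSession(
		request.WithRetryer(
			&aws.Config{STSRegionalEndpoint: endpoints.RegionalSTSEndpoint},
			client.DefaultRetryer{
				NumMaxRetries:    3,
				MinRetryDelay:    30 * time.Millisecond,
				MaxRetryDelay:    30 * time.Second,
				MinThrottleDelay: 500 * time.Millisecond,
				MaxThrottleDelay: 30 * time.Second,
			},
		),
	))
}
```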


// the node by the kube scheduler, causing OutOfCPU errors when the
// binpacked pods race to bind to the same node. The system eventually
// heals, but causes delays from additional provisioning (thrash). This
// taint will be removed when the node is marked as ready.
njtran (Contributor) commented:

Would you be able to reference the node controller here in the comment so they know where it's being removed?

pkg/controllers/node/controller.go (resolved thread)
pkg/controllers/node/controller.go (resolved thread)
}
return reconcile.Result{}, err
}
if len(stored.Labels[v1alpha3.ProvisionerNameLabelKey]) == 0 {
njtran (Contributor) commented:

Although we can't have a provisioner with an empty name, this label could still be set with an empty value.

This is a weird edge case, but it means a node in this state could sneak past and have the work done to it. I don't know what we'd want to do in that case, but WDYT if we just make sure the key isn't in the map at all?

ellistarn (Author) replied:

This is cleaner anyway.
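For the record, the presence check reads roughly like the sketch below (illustrative helper, not the exact diff); the comma-ok form distinguishes "label missing" from "label present but empty".

```go
package node

import (
	"github.com/awslabs/karpenter/pkg/apis/provisioning/v1alpha3"
	v1 "k8s.io/api/core/v1"
)

// isProvisioned reports whether the node carries the provisioner label at all,
// regardless of the label's value.
func isProvisioned(node *v1.Node) bool {
	_, ok := node.Labels[v1alpha3.ProvisionerNameLabelKey]
	return ok
}
```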

pkg/controllers/node/controller.go (resolved thread)
@ellistarn changed the title from "[WIP] Resolves race conditions exposed at high scale" to "Resolves race conditions exposed at high scale" on Jul 30, 2021

// Reconcile adds the termination finalizer if the node is not deleting
func (r *Finalizer) Reconcile(n *v1.Node) error {
if !n.DeletionTimestamp.IsZero() {
njtran (Contributor) commented:

It's possible a node could be deleting because something else set its deletionTimestamp, without it ever having the TerminationFinalizer. In that case, if the other finalizers don't implement instance deletion logic, the instance might leak.

ellistarn (Author) replied:

Potentially some daemonsets only tolerate specific taints, but most tolerate all. We should stay vigilant on this one. Worst case, we alter the behavior to set node.spec.unschedulable, but it's hard to differentiate that from the cordon behavior.
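For context, the distinction being drawn here, expressed as core/v1 tolerations (illustrative only; the specific taint key is hypothetical):

```go
package node

import v1 "k8s.io/api/core/v1"

var (
	// Tolerates every taint, including karpenter.sh/not-ready; this is how many
	// daemonsets are configured, so the startup taint doesn't block them.
	tolerateAll = v1.Toleration{Operator: v1.TolerationOpExists}

	// Tolerates only one specific taint; a daemonset with just this toleration
	// would not schedule onto a node that still carries the startup taint.
	tolerateSpecific = v1.Toleration{
		Key:      "example.com/some-taint", // hypothetical key
		Operator: v1.TolerationOpExists,
		Effect:   v1.TaintEffectNoSchedule,
	}
)
```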

@ellistarn ellistarn merged commit 7eab29b into aws:main Jul 30, 2021
@ellistarn ellistarn deleted the race branch July 30, 2021 05:59
gfcroft pushed a commit to gfcroft/karpenter-provider-aws that referenced this pull request Nov 25, 2023