Resolves race conditions exposed at high scale #578
Conversation
Nice work so far.
@@ -25,6 +25,7 @@ import (
	"github.com/awslabs/karpenter/pkg/controllers"
	"github.com/awslabs/karpenter/pkg/controllers/allocation"
	"github.com/awslabs/karpenter/pkg/controllers/expiration"
+	"github.com/awslabs/karpenter/pkg/controllers/node"
What do you think of naming this something like a "regulation" controller? I imagine the name node might end up clashing with a lot of packages/code in the future that reference the Kubernetes object itself.
I was thinking of "compatibility controller", "node state controller". I actually think that because we're inside the package node, you can refer to other packages named node without colliding. I'd love to revisit this name as we explore how this controller will evolve.
sess := withUserAgent(session.Must(session.NewSession(
	request.WithRetryer(
		&aws.Config{STSRegionalEndpoint: endpoints.RegionalSTSEndpoint},
		client.DefaultRetryer{NumMaxRetries: 3},
So does this make all cloud provider calls fail after 4 failures? Is it possible to specify the interval in between retries as well? What's the default for that?
I'm actually just restoring this logic which was accidentally removed here: https://github.com/awslabs/karpenter/pull/564/files#diff-281f3c570b5d09346495d0c9b5f5e2a625bbf9efbe8337ff51f5c914add8c916L34. We used to have the base retries + custom retry, and I ripped both out. Now I'm adding the base retry back in.
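To the retry-interval question: client.DefaultRetryer also exposes MinRetryDelay/MaxRetryDelay fields, so the backoff window can be pinned explicitly if we ever need to. A minimal sketch of what that could look like (not part of this PR; the delay values and function name are illustrative):

package cloudprovider

import (
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/client"
	"github.com/aws/aws-sdk-go/aws/endpoints"
	"github.com/aws/aws-sdk-go/aws/request"
	"github.com/aws/aws-sdk-go/aws/session"
)

// newSession wires the base retryer with explicit delay bounds in addition
// to the retry count. With only NumMaxRetries set (as in this PR), the SDK
// falls back to its own default delays.
func newSession() *session.Session {
	return session.Must(session.NewSession(
		request.WithRetryer(
			&aws.Config{STSRegionalEndpoint: endpoints.RegionalSTSEndpoint},
			client.DefaultRetryer{
				NumMaxRetries: 3,                     // 1 initial attempt + up to 3 retries
				MinRetryDelay: 30 * time.Millisecond, // lower bound on backoff between retries
				MaxRetryDelay: 300 * time.Second,     // upper bound on backoff between retries
			},
		),
	))
}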
pkg/controllers/allocation/bind.go
Outdated
// the node by the kube scheduler, causing OutOfCPU errors when the
// binpacked pods race to bind to the same node. The system eventually
// heals, but causes delays from additional provisioning (thrash). This
// taint will be removed when the node is marked as ready.
Would you be able to reference the node controller here in the comment so they know where it's being removed?
pkg/controllers/node/controller.go
Outdated
}
return reconcile.Result{}, err
}
if len(stored.Labels[v1alpha3.ProvisionerNameLabelKey]) == 0 {
Although we can't have an empty-named provisioner, this label could still be set with an empty value.
This is a weird edge case, but it means such a node could sneak past and have the work done to it. I don't know what we'd want to do in that case, but WDYT about just making sure the key isn't in the map at all?
This is cleaner anyways.
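For reference, a minimal sketch of the map-presence check being suggested (names taken from the snippet above; not the PR's exact code):

// Skip nodes that don't carry the provisioner label at all, rather than
// only skipping nodes where the value happens to be empty.
if _, ok := stored.Labels[v1alpha3.ProvisionerNameLabelKey]; !ok {
	return reconcile.Result{}, nil // not provisioned by Karpenter; ignore
}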
// Reconcile adds the termination finalizer if the node is not deleting
func (r *Finalizer) Reconcile(n *v1.Node) error {
	if !n.DeletionTimestamp.IsZero() {
It's possible a node could be deleting (deletionTimestamp already set) without carrying the TerminationFinalizer. In that case, if the other finalizers don't implement instance deletion logic, the instance might leak.
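To make the concern concrete, here is a rough sketch of the reconcile flow being discussed (the finalizer constant and exact control flow are assumptions, not this PR's code). The early return means a node that is already deleting never gets the finalizer added, so instance cleanup then relies entirely on whichever finalizers are already present:

// Reconcile adds the termination finalizer if the node is not deleting.
func (r *Finalizer) Reconcile(n *v1.Node) error {
	if !n.DeletionTimestamp.IsZero() {
		// Already deleting: don't mutate finalizers. If the termination
		// finalizer isn't present here, the instance may leak unless some
		// other finalizer handles instance deletion.
		return nil
	}
	for _, f := range n.Finalizers {
		if f == v1alpha3.TerminationFinalizer { // assumed constant name
			return nil // already tracked for termination
		}
	}
	n.Finalizers = append(n.Finalizers, v1alpha3.TerminationFinalizer)
	return nil
}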
Potentially some daemonsets that only tolerate specific taints, but most tolerate all of them. We should stay vigilant on this one. Worst case, we alter the behavior to set node.spec.unschedulable instead, but it's hard to differentiate that from cordon behavior.
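For context, an illustrative toleration fragment (not code from this PR): a catch-all toleration like the first is what most daemonsets ship with, so they still schedule onto a node carrying the not-ready taint, while the second form only tolerates one specific taint.

// Tolerates every taint, including karpenter.sh/not-ready=NoSchedule.
tolerateAll := v1.Toleration{Operator: v1.TolerationOpExists}

// Tolerates only a single, specific taint.
tolerateSpecific := v1.Toleration{
	Key:      "node.kubernetes.io/not-ready",
	Operator: v1.TolerationOpExists,
	Effect:   v1.TaintEffectNoExecute,
}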
Signed-off-by: sadath-12 <sadathsadu2002@gmail.com> Signed-off-by: syedsadath-17 <90619459+sadath-12@users.noreply.github.com>
Issue, if available:
Description of changes:
Built a lightweight controller that enforces (over time) invariants that may be violated under scalability bottlenecks.
Case 1: Node Missing Finalizer
If kube-apiserver QPS is backed up, nodes may come online before we're able to create the node object. This means that the node object would be created by the kubelet, rather than created by Karpenter and then patched by the kubelet. The result is that the node would not have the finalizer, and instances could leak. See: #549.
Now, we add the finalizer if it doesn't exist, unless the node is terminating.
Case 2: Scheduler Racing against Provisioner
We taint nodes with karpenter.sh/not-ready=NoSchedule to prevent the kube scheduler from scheduling pods before we're able to bind them ourselves. The kube scheduler has an eventually consistent cache of nodes and pods, so it's possible for it to see a provisioned node before it sees the pods bound to it. This creates an edge case where other pending pods may be bound to the node by the kube scheduler, causing OutOfCPU errors when the binpacked pods race to bind to the same node. The system eventually heals, but this causes delays from additional provisioning (thrash). This taint is removed by the node controller once the node is marked ready.
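A rough sketch of the two halves of this fix (helper names and exact code are assumptions; the taint key is the one named above):

notReadyTaint := v1.Taint{Key: "karpenter.sh/not-ready", Effect: v1.TaintEffectNoSchedule}

// At bind time, the allocator taints the node it creates so the kube
// scheduler won't place additional pods on it before the binpacked pods
// are bound.
node.Spec.Taints = append(node.Spec.Taints, notReadyTaint)

// Later, the node controller strips the taint once the node reports Ready
// (the readiness check is illustrative).
if isReady(node) {
	remaining := []v1.Taint{}
	for _, t := range node.Spec.Taints {
		if t.Key != notReadyTaint.Key {
			remaining = append(remaining, t)
		}
	}
	node.Spec.Taints = remaining
}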
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.