Readjusted reallocation controller for nodes that fail to come up #534

Merged: 6 commits merged into aws:main from the notReady branch on Jul 29, 2021
Conversation

@njtran (author) commented Jul 23, 2021

Description of changes:
This PR changes how the reallocation controller handles node underutilization based on node status.

Previously, we marked nodes as underutilized and set a TTL (as specified in the Provisioner) regardless of their status. This created a race condition: a node could be brought up and given a TTL before any pods had bound to it. With this change, we only do so for Ready nodes.

In addition, this PR adds the ability to delete nodes that fail to become Ready within 15 minutes, as designed here. Since there are many reasons a node can become NotReady after it has been Ready, we only consider nodes that never came up, not nodes that came up and later had connectivity issues such as KubeletNotReady.
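For reference, a minimal sketch of the readiness gate described above, distilled from the diff snippets quoted later in this review (the constant name and package layout are illustrative, not the PR's exact code):

    package node

    import (
        "time"

        v1 "k8s.io/api/core/v1"
    )

    // failedToJoinTimeout is the window a node gets to become Ready before it is
    // considered to have failed to come up (15 minutes in this PR).
    const failedToJoinTimeout = 15 * time.Minute

    // FailedToBecomeReady reports whether a node is old enough that it should have
    // joined, yet its Ready condition is still Unknown and the kubelet has never
    // reported a heartbeat (i.e. it never came up, as opposed to coming up and
    // later losing connectivity).
    func FailedToBecomeReady(node *v1.Node) bool {
        if time.Since(node.GetCreationTimestamp().Time) < failedToJoinTimeout {
            return false
        }
        for _, condition := range node.Status.Conditions {
            if condition.Type == v1.NodeReady {
                return condition.Status == v1.ConditionUnknown && condition.LastHeartbeatTime.IsZero()
            }
        }
        return false
    }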

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@bwagner5 left a comment:

A few nits and questions. Overall, looks great!

Resolved (outdated) review threads on: pkg/utils/node/predicates.go, pkg/controllers/reallocation/utilization.go (two threads)
if time.Since(node.GetCreationTimestamp().Time) < (15 * time.Minute) {
return false
}
for _, condition := range node.Status.Conditions {
Contributor:

Maybe we should make a helper func that puts these conditions in a map or something, so we don't have to repeat this ugly for-loop everywhere.

njtran (author):

I agree. I added a helper to reduce some lines; let me know what you think.
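For illustration, one shape such a helper could take; this is only a sketch, since the helper actually added in the PR appears later in this review as getNodeCondition, and the names here (GetNodeCondition, package node) are assumptions:

    package node

    import v1 "k8s.io/api/core/v1"

    // GetNodeCondition returns the condition of the requested type, or a zero
    // NodeCondition if the node does not report one, so callers don't repeat the
    // for-loop over node.Status.Conditions.
    func GetNodeCondition(conditions []v1.NodeCondition, conditionType v1.NodeConditionType) v1.NodeCondition {
        for _, condition := range conditions {
            if condition.Type == conditionType {
                return condition
            }
        }
        return v1.NodeCondition{}
    }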

@ellistarn commented Jul 26, 2021:

If only we could use https://github.com/knative/pkg/blob/main/apis/condition_set.go. However, given how weird node conditions are, this may not be worth it.

@@ -78,6 +78,11 @@ func (c *Controller) Reconcile(ctx context.Context, req reconcile.Request) (reco
if err := c.utilization.terminateExpired(ctx, provisioner); err != nil {
return reconcile.Result{}, fmt.Errorf("marking nodes terminable, %w", err)
}

// 5. Delete any node that has been unable to come up for 15 minutes.
if err := c.utilization.terminateFailedToBecomeReady(ctx, provisioner); err != nil {
Contributor:

This will be skipped if provisioner.Spec.TTLSecondsAfterEmpty is not specified. I think we'd want to do this no matter what?

njtran (author):

You're right, although I don't know if we want to make a whole new controller for this. I could make this function run regardless of whether TTLSecondsAfterEmpty is specified, but that might confuse readers of the code base. WDYT?
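One possible shape for this, sketched with simplified stand-in types rather than the PR's actual controller (the step type and function names here are assumptions): keep the failed-to-become-ready cleanup unconditional and gate only the empty-node TTL path on the field being set.

    package reallocation

    import (
        "context"
        "fmt"
    )

    // step is a simplified stand-in for one phase of the utilization controller.
    type step func(ctx context.Context) error

    // reconcileUtilization runs the failed-to-become-ready cleanup regardless of
    // provisioner configuration, while empty-node expiration stays opt-in via
    // TTLSecondsAfterEmpty.
    func reconcileUtilization(ctx context.Context, ttlSecondsAfterEmpty *int64, terminateExpired, terminateFailedToBecomeReady step) error {
        // Always clean up nodes that never joined.
        if err := terminateFailedToBecomeReady(ctx); err != nil {
            return fmt.Errorf("terminating nodes that failed to become ready, %w", err)
        }
        // Empty-node expiration only applies when the user opted in.
        if ttlSecondsAfterEmpty != nil {
            if err := terminateExpired(ctx); err != nil {
                return fmt.Errorf("marking nodes terminable, %w", err)
            }
        }
        return nil
    }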

Contributor:

I think we need to step back and try to come up with the core logical concepts behind this, and how the behaviors relate to each other.

}

func FailsToBecomeReady(node v1.Node) bool {
if time.Since(node.GetCreationTimestamp().Time) < (15 * time.Minute) {
Contributor:

Didn't we test this to be 5 minutes before pods are rescheduled?

Contributor:

It would be great if we could find the line of code upstream that does this (maybe in replicaset?)

Contributor:

I'm noticing

  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300

on my pods.

njtran (author):

It looks like even if we rely on this value, these tolerations are configurable here, meaning the behavior could differ significantly depending on user configuration.

While 15 minutes doesn't capture all cases, leaving no leeway beyond the default timeout could create race conditions while pod state propagates. A hard-coded default like this may not be the right answer for many users.

njtran (author):

Talked offline; I'll also check whether all the pods on the node are terminating.

@@ -78,6 +78,11 @@ func (c *Controller) Reconcile(ctx context.Context, req reconcile.Request) (reco
if err := c.utilization.terminateExpired(ctx, provisioner); err != nil {
return reconcile.Result{}, fmt.Errorf("marking nodes terminable, %w", err)
}

// 5. Delete any node that has been unable to come up for 15 minutes.
if err := c.utilization.terminateFailedToBecomeReady(ctx, provisioner); err != nil {
Contributor:

This logic flow feels wrong at the top level. If something is continually returning an error in step 4 or earlier, we will never even attempt this step. This doesn't need to be tackled now, but can you cut an issue to fix this logic?

njtran (author):

Will do.

njtran (author):

Made an issue here: #555.

}
for _, condition := range node.Status.Conditions {
if condition.Type == v1.NodeReady {
return condition.Status == v1.ConditionUnknown && condition.LastTransitionTime.IsZero() && condition.LastHeartbeatTime.IsZero()
Contributor:

I think checking ConditionUnknown should be sufficient.

njtran (author):

If it goes from Unknown -> False -> True -> Unknown for some reason (example: https://kubernetes.io/docs/concepts/architecture/nodes/#node-controller), then these times would be non-zero. If we check that they're zero, then we know it has never advanced from Unknown to False.

Contributor:

How about we just check heartbeat? If we've never heard a heartbeat, then we know it was never alive.

njtran (author):

I wonder if it's possible for that to get erased. Can we ensure that the status will never be nulled? What if there's some issue with data retention?


netlify bot commented Jul 26, 2021

✔️ Deploy Preview for karpenter-docs-prod canceled.

🔨 Explore the source changes: b47cda1

🔍 Inspect the deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/6102fe72b65b580008b2a6d4

@@ -24,8 +24,8 @@ import (
"golang.org/x/time/rate"
"knative.dev/pkg/logging"

"github.com/benbjohnson/clock"
Contributor:

We absolutely don't want a fake clock in our real code. Make sure this is only included in the test imports; that way Go's tree shaking (dead-code elimination) won't build this package into our binary.
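A common way to keep the fake clock out of the production binary is to inject a small clock interface, so that only the test package imports github.com/benbjohnson/clock. A sketch, not the PR's final approach; the Clock interface name and realClock type are assumptions:

    package reallocation

    import "time"

    // Clock is the minimal surface the controller needs; production code uses the
    // standard library, and only tests import a fake implementation.
    type Clock interface {
        Now() time.Time
    }

    // realClock satisfies Clock using the standard library.
    type realClock struct{}

    func (realClock) Now() time.Time { return time.Now() }

Tests could then construct clock.NewMock() from github.com/benbjohnson/clock and pass it in; its Add method is what mock.Add(500 * time.Second) below is calling.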

Expect(updatedNode.DeletionTimestamp.IsZero()).To(BeTrue())

// Simulate 500 seconds passing and do another reconcile
mock.Add(500 * time.Second)
@ellistarn commented Jul 29, 2021:

How about setting this duration to the same constant we use in our implementation to measure "failed to join"?

njtran (author):

With a sleep we would actually wait 500 seconds; this just advances the mock clock's time by 500 seconds.


// AreAllTerminating returns true if the set of pods were all unable to join
func AreAllTerminating(pods []*v1.Pod) bool {
for _, p := range pods {
Contributor:

I think you want to reuse your IgnoredForUnderutilization helper, since this code should also take HasFailed into account. It does make the naming a bit weird, so maybe we should just check HasFailed here.

I think there must be a way to factor this out to share the concept. It is the same concept after all; we just need a name for it.

njtran (author):

These are a little different. We decided that we only want to terminate the node if all of the pods are terminating. This means we need to check the deletionTimestamp, whereas IgnoredForUnderutilization also checks whether they failed, which doesn't necessarily mean they were requested to be deleted yet.

Contributor:

What if a pod has failed (e.g. scheduling misbehavior)? I think we should still terminate the node in this case. This is for the exact same reason you want to consider failed pods in the other function.

This similarity leads me to believe that there is common behavior that should be factored out.

}

// AreAllTerminating returns true if the set of pods were all unable to join
func AreAllTerminating(pods []*v1.Pod) bool {
Contributor:

Is this function the same as IgnoredForTermination or something different?

njtran (author):

IgnoredForTermination feels like we're saying we want to terminate these pods, while in reality they are already terminating. I like this name more.

Contributor:

Sorry, my typo was misleading. I meant:

Is this function the same as IgnoredForUnderutilization or something different?

What I mean is, the logic looks very similar to me. The only difference is that they behave differently with respect to the deletion timestamp. They both should do the same thing for daemonsets and failed pods.
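If the shared concept were factored out, it might look something like this sketch (the IsInactive name and package are hypothetical): a single predicate for pods that no longer represent active work, which both checks could build on, differing only in whether a bare Failed phase counts.

    package pod

    import v1 "k8s.io/api/core/v1"

    // IsInactive reports whether a pod no longer represents running work on the
    // node: it has already failed, or it is being deleted.
    func IsInactive(pod *v1.Pod) bool {
        return pod.Status.Phase == v1.PodFailed || pod.DeletionTimestamp != nil
    }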


future := time.Now().Add(500 * time.Second)
// Monkey patch time.Now() so it returns 500 seconds in the future
monkey.Patch(time.Now, func() time.Time {
@ellistarn commented Jul 29, 2021:

consider monkey.Patch(time.Now, func() time.Time { return time.Now().Add(500 * time.Second) })

njtran (author):

Can't reference time.Now while we patch time.Now's function address: the replacement ends up calling itself and recursing until the stack overflows.
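A sketch of the workaround, assuming the bou.ke/monkey import path (the repo's actual import may differ): capture the real time before patching, so the replacement never calls the patched time.Now.

    package reallocation_test

    import (
        "testing"
        "time"

        "bou.ke/monkey"
    )

    func TestClockSkippedForward(t *testing.T) {
        // Capture the real time first; referencing time.Now inside the replacement
        // would call the patched function and recurse until the stack overflows.
        realNow := time.Now()
        patch := monkey.Patch(time.Now, func() time.Time {
            return realNow.Add(500 * time.Second)
        })
        defer patch.Unpatch()

        if got := time.Now().Sub(realNow); got != 500*time.Second {
            t.Fatalf("expected the patched clock to be 500s ahead, got %v", got)
        }
    }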

Contributor:

Interesting! Still, you could inline 204-206.

for _, node := range nodes {
pods, err := u.getPods(ctx, node)
if err != nil {
logging.FromContext(ctx).Debugf("Continuing after failing to get pods for node %s", node.Name)
Contributor:

I don't think we want to continue here.

njtran (author):

I added a continue.

}
return false
condition := getNodeCondition(node.Status.Conditions, v1.NodeReady)
return condition.Status == v1.ConditionUnknown && condition.LastHeartbeatTime.IsZero()
Contributor:

Thoughts on getNodeCondition(node.Status.Conditions, v1.NodeReady).LastHeartbeatTime.IsZero()?

// 2. Trigger termination workflow if past TTLAfterEmpty
for _, node := range nodes {
if utilsnode.IsPastEmptyTTL(node) {
logging.FromContext(ctx).Infof("Triggering termination for empty node %s", node.Name)
if err := u.kubeClient.Delete(ctx, node); err != nil {
if err := u.KubeClient.Delete(ctx, node); err != nil {
return fmt.Errorf("sending delete for node %s, %w", node.Name, err)
Contributor:

Consider "deleting node". "Sending delete for" is a bit of a nod and a wink to finalizers, which may or may not exist for any given Kubernetes object. I'd hate to be forced into "sending" language everywhere because of this. Easier to just use the REST concept, delete, regardless of finalization hooks.

@njtran njtran merged commit 0abf03e into aws:main Jul 29, 2021
@njtran njtran deleted the notReady branch September 8, 2021 19:21
gfcroft pushed a commit to gfcroft/karpenter-provider-aws that referenced this pull request Nov 25, 2023