Readjusted reallocation controller for nodes that fail to come up #534

Merged: 6 commits merged into aws:main from the notReady branch on Jul 29, 2021
Conversation

@njtran (author) commented Jul 23, 2021

Description of changes:
This PR changes how the reallocation controller handles node underutilization based on node status.

Previously, we marked nodes as underutilized and set a TTL (as specified in the Provisioner) regardless of their status. This created a race condition: a node could be brought up and given a TTL before any pods had bound to it. With this change, we only do so for Ready nodes.

In addition, this PR adds the ability to delete nodes that fail to become Ready within 15 minutes, as designed here. Since there are many reasons a node can become NotReady after it has been Ready, we only consider nodes that never came up, not nodes that came up and later had connectivity issues such as KubeletNotReady.
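For reference, a minimal sketch of the readiness gate described above, distilled from the diff snippets quoted later in this review (the constant name and package layout are illustrative, not the PR's exact code):

    package node

    import (
        "time"

        v1 "k8s.io/api/core/v1"
    )

    // failedToJoinTimeout is the window a node gets to become Ready before it is
    // considered to have failed to come up (15 minutes in this PR).
    const failedToJoinTimeout = 15 * time.Minute

    // FailedToBecomeReady reports whether a node is old enough that it should have
    // joined, yet its Ready condition is still Unknown and the kubelet has never
    // reported a heartbeat (i.e. it never came up, as opposed to coming up and
    // later losing connectivity).
    func FailedToBecomeReady(node *v1.Node) bool {
        if time.Since(node.GetCreationTimestamp().Time) < failedToJoinTimeout {
            return false
        }
        for _, condition := range node.Status.Conditions {
            if condition.Type == v1.NodeReady {
                return condition.Status == v1.ConditionUnknown && condition.LastHeartbeatTime.IsZero()
            }
        }
        return false
    }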

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@bwagner5 left a comment:

A few nits and questions. Overall, looks great!

Resolved (outdated) review threads on: pkg/utils/node/predicates.go, pkg/controllers/reallocation/utilization.go (two threads)
if time.Since(node.GetCreationTimestamp().Time) < (15 * time.Minute) {
return false
}
for _, condition := range node.Status.Conditions {
Contributor:

Maybe we should make a helper func that puts these conditions in a map or something, so we don't have to repeat this ugly for-loop everywhere.

njtran (author):

I agree. I added a helper to reduce some lines; let me know what you think.
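For illustration, one shape such a helper could take; this is only a sketch, since the helper actually added in the PR appears later in this review as getNodeCondition, and the names here (GetNodeCondition, package node) are assumptions:

    package node

    import v1 "k8s.io/api/core/v1"

    // GetNodeCondition returns the condition of the requested type, or a zero
    // NodeCondition if the node does not report one, so callers don't repeat the
    // for-loop over node.Status.Conditions.
    func GetNodeCondition(conditions []v1.NodeCondition, conditionType v1.NodeConditionType) v1.NodeCondition {
        for _, condition := range conditions {
            if condition.Type == conditionType {
                return condition
            }
        }
        return v1.NodeCondition{}
    }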

@ellistarn commented Jul 26, 2021:

If only we could use https://github.com/knative/pkg/blob/main/apis/condition_set.go. However, given how weird node conditions are, this may not be worth it.

@@ -78,6 +78,11 @@ func (c *Controller) Reconcile(ctx context.Context, req reconcile.Request) (reco
if err := c.utilization.terminateExpired(ctx, provisioner); err != nil {
return reconcile.Result{}, fmt.Errorf("marking nodes terminable, %w", err)
}

// 5. Delete any node that has been unable to come up for 15 minutes.
if err := c.utilization.terminateFailedToBecomeReady(ctx, provisioner); err != nil {
Contributor:

This will be skipped if provisioner.Spec.TTLSecondsAfterEmpty is not specified. I think we'd want to do this no matter what?

njtran (author):

You're right, although I don't know if we want to make a whole new controller for this. I could make this function run regardless of whether TTLSecondsAfterEmpty is specified, but that might confuse readers of the code base. WDYT?
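One possible shape for this, sketched with simplified stand-in types rather than the PR's actual controller (the step type and function names here are assumptions): keep the failed-to-become-ready cleanup unconditional and gate only the empty-node TTL path on the field being set.

    package reallocation

    import (
        "context"
        "fmt"
    )

    // step is a simplified stand-in for one phase of the utilization controller.
    type step func(ctx context.Context) error

    // reconcileUtilization runs the failed-to-become-ready cleanup regardless of
    // provisioner configuration, while empty-node expiration stays opt-in via
    // TTLSecondsAfterEmpty.
    func reconcileUtilization(ctx context.Context, ttlSecondsAfterEmpty *int64, terminateExpired, terminateFailedToBecomeReady step) error {
        // Always clean up nodes that never joined.
        if err := terminateFailedToBecomeReady(ctx); err != nil {
            return fmt.Errorf("terminating nodes that failed to become ready, %w", err)
        }
        // Empty-node expiration only applies when the user opted in.
        if ttlSecondsAfterEmpty != nil {
            if err := terminateExpired(ctx); err != nil {
                return fmt.Errorf("marking nodes terminable, %w", err)
            }
        }
        return nil
    }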

Contributor:

I think we need to step back and try to come up with the core logical concepts behind this, and how the behaviors relate to each other.

}

func FailsToBecomeReady(node v1.Node) bool {
if time.Since(node.GetCreationTimestamp().Time) < (15 * time.Minute) {
Contributor:

Didn't we test this to be 5 minutes before pods are rescheduled?

Contributor:

It would be great if we could find the line of code upstream that does this (maybe in replicaset?)

Contributor:

I'm noticing

  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300

on my pods.

njtran (author):

It looks like even if we rely on this value, these tolerations are configurable here, meaning the behavior could differ significantly depending on user configuration.

While 15 minutes doesn't capture all cases, leaving no leeway beyond the default timeout could create race conditions while pod state propagates. A hard-coded default like this may not be the right answer for many users.

njtran (author):

Talked offline; I'll also check whether all the pods on the node are terminating.

@@ -78,6 +78,11 @@ func (c *Controller) Reconcile(ctx context.Context, req reconcile.Request) (reco
if err := c.utilization.terminateExpired(ctx, provisioner); err != nil {
return reconcile.Result{}, fmt.Errorf("marking nodes terminable, %w", err)
}

// 5. Delete any node that has been unable to come up for 15 minutes.
if err := c.utilization.terminateFailedToBecomeReady(ctx, provisioner); err != nil {
Contributor:

This logic flow feels wrong at the top level. If something is continually returning an error in step 4 or earlier, we will never even attempt this step. This doesn't need to be tackled now, but can you cut an issue to fix this logic?

njtran (author):

Will do.

njtran (author):

Made an issue here: #555.

}
for _, condition := range node.Status.Conditions {
if condition.Type == v1.NodeReady {
return condition.Status == v1.ConditionUnknown && condition.LastTransitionTime.IsZero() && condition.LastHeartbeatTime.IsZero()
Contributor:

I think checking ConditionUnknown should be sufficient.

njtran (author):

If it goes from Unknown -> False -> True -> Unknown for some reason (example: https://kubernetes.io/docs/concepts/architecture/nodes/#node-controller), then these times would be non-zero. If we check that they're zero, then we know it has never advanced from Unknown to False.

Contributor:

How about we just check heartbeat? If we've never heard a heartbeat, then we know it was never alive.

njtran (author):

I wonder if it's possible for that to get erased. Can we ensure that the status will never be nulled? What if there's some issue with data retention?


netlify bot commented Jul 26, 2021

✔️ Deploy Preview for karpenter-docs-prod canceled.

🔨 Explore the source changes: b47cda1

🔍 Inspect the deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/6102fe72b65b580008b2a6d4

@@ -24,8 +24,8 @@ import (
"golang.org/x/time/rate"
"knative.dev/pkg/logging"

"github.com/benbjohnson/clock"
Contributor:

We absolutely don't want a fake clock in our real code. Make sure this is only included in the test imports; that way Go's tree shaking (dead-code elimination) won't build this package into our binary.
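A common way to keep the fake clock out of the production binary is to inject a small clock interface, so that only the test package imports github.com/benbjohnson/clock. A sketch, not the PR's final approach; the Clock interface name and realClock type are assumptions:

    package reallocation

    import "time"

    // Clock is the minimal surface the controller needs; production code uses the
    // standard library, and only tests import a fake implementation.
    type Clock interface {
        Now() time.Time
    }

    // realClock satisfies Clock using the standard library.
    type realClock struct{}

    func (realClock) Now() time.Time { return time.Now() }

Tests could then construct clock.NewMock() from github.com/benbjohnson/clock and pass it in; its Add method is what mock.Add(500 * time.Second) below is calling.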

Expect(updatedNode.DeletionTimestamp.IsZero()).To(BeTrue())

// Simulate 500 seconds passing and do another reconcile
mock.Add(500 * time.Second)
@ellistarn commented Jul 29, 2021:

How about setting this duration to the same constant we use in our implementation to measure "failed to join"?

njtran (author):

With a sleep we would actually wait 500 seconds; this just advances the mock clock's time by 500 seconds.


// AreAllTerminating returns true if the set of pods were all unable to join
func AreAllTerminating(pods []*v1.Pod) bool {
for _, p := range pods {
Contributor:

I think you want to reuse your IgnoredForUnderutilization helper, since this code should also take HasFailed into account. It does make the naming a bit weird, so maybe we should just check HasFailed here.

I think there must be a way to factor this out to share the concept. It is the same concept after all; we just need a name for it.

njtran (author):

These are a little different. We decided that we only want to terminate the node if all of the pods are terminating. This means we need to check the deletionTimestamp, whereas IgnoredForUnderutilization also checks whether they failed, which doesn't necessarily mean they were requested to be deleted yet.

Contributor:

What if a pod has failed (e.g. scheduling misbehavior)? I think we should still terminate the node in this case. This is for the exact same reason you want to consider failed pods in the other function.

This similarity leads me to believe that there is common behavior that should be factored out.

}

// AreAllTerminating returns true if the set of pods were all unable to join
func AreAllTerminating(pods []*v1.Pod) bool {
Contributor:

Is this function the same as IgnoredForTermination or something different?

njtran (author):

IgnoredForTermination feels like we're saying we want to terminate these pods, while in reality they are already terminating. I like this name more.

Contributor:

Sorry, my typo was misleading. I meant:

Is this function the same as IgnoredForUnderutilization or something different?

What I mean is, the logic looks very similar to me. The only difference is that they behave differently with respect to the deletion timestamp. They both should do the same thing for daemonsets and failed pods.
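If the shared concept were factored out, it might look something like this sketch (the IsInactive name and package are hypothetical): a single predicate for pods that no longer represent active work, which both checks could build on, differing only in whether a bare Failed phase counts.

    package pod

    import v1 "k8s.io/api/core/v1"

    // IsInactive reports whether a pod no longer represents running work on the
    // node: it has already failed, or it is being deleted.
    func IsInactive(pod *v1.Pod) bool {
        return pod.Status.Phase == v1.PodFailed || pod.DeletionTimestamp != nil
    }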


future := time.Now().Add(500 * time.Second)
// Monkey patch time.Now() so it returns 500 seconds in the future
monkey.Patch(time.Now, func() time.Time {
@ellistarn commented Jul 29, 2021:

consider monkey.Patch(time.Now, func() time.Time { return time.Now().Add(500 * time.Second) })

njtran (author):

Can't reference time.Now while we patch time.Now's function address: the replacement ends up calling itself and recursing until the stack overflows.
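A sketch of the workaround, assuming the bou.ke/monkey import path (the repo's actual import may differ): capture the real time before patching, so the replacement never calls the patched time.Now.

    package reallocation_test

    import (
        "testing"
        "time"

        "bou.ke/monkey"
    )

    func TestClockSkippedForward(t *testing.T) {
        // Capture the real time first; referencing time.Now inside the replacement
        // would call the patched function and recurse until the stack overflows.
        realNow := time.Now()
        patch := monkey.Patch(time.Now, func() time.Time {
            return realNow.Add(500 * time.Second)
        })
        defer patch.Unpatch()

        if got := time.Now().Sub(realNow); got != 500*time.Second {
            t.Fatalf("expected the patched clock to be 500s ahead, got %v", got)
        }
    }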

Contributor:

Interesting! Still, you could inline 204-206.

for _, node := range nodes {
pods, err := u.getPods(ctx, node)
if err != nil {
logging.FromContext(ctx).Debugf("Continuing after failing to get pods for node %s", node.Name)
Contributor:

I don't think we want to continue here.

njtran (author):

I added a continue.

}
return false
condition := getNodeCondition(node.Status.Conditions, v1.NodeReady)
return condition.Status == v1.ConditionUnknown && condition.LastHeartbeatTime.IsZero()
Contributor:

Thoughts on getNodeCondition(node.Status.Conditions, v1.NodeReady).LastHeartbeatTime.IsZero()?

// 2. Trigger termination workflow if past TTLAfterEmpty
for _, node := range nodes {
if utilsnode.IsPastEmptyTTL(node) {
logging.FromContext(ctx).Infof("Triggering termination for empty node %s", node.Name)
if err := u.kubeClient.Delete(ctx, node); err != nil {
if err := u.KubeClient.Delete(ctx, node); err != nil {
return fmt.Errorf("sending delete for node %s, %w", node.Name, err)
Contributor:

Consider "deleting node". "Sending delete for" is a bit of a nod and a wink to finalizers, which may or may not exist for any given Kubernetes object. I'd hate to be forced into "sending" language everywhere because of this. Easier to just use the REST concept, delete, regardless of finalization hooks.

@njtran njtran merged commit 0abf03e into aws:main Jul 29, 2021
@njtran njtran deleted the notReady branch September 8, 2021 19:21
gfcroft pushed a commit to gfcroft/karpenter-provider-aws that referenced this pull request Nov 25, 2023