Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactored expiration/utilization controller into node controller #594

Merged
merged 3 commits into from
Aug 4, 2021

Conversation

ellistarn
Copy link
Contributor

Issue, if available:

Description of changes:

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@netlify
Copy link

netlify bot commented Aug 2, 2021

✔️ Deploy Preview for karpenter-docs-prod canceled.

🔨 Explore the source changes: 6a94249

🔍 Inspect the deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/610ad8984b1cdb000711227c

Copy link
Contributor

@njtran njtran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice work. some comments on your comments

pkg/controllers/node/controller.go Show resolved Hide resolved
pkg/controllers/node/controller.go Outdated Show resolved Hide resolved
}
return requests
}),
).
WithOptions(controller.Options{MaxConcurrentReconciles: 10}).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize this was included in a prior PR, but why did we pick 10 here? Are there any detriments to bumping this number up?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could consider setting this to kubeclient qps (default 20). I found that 10 was reasonably high parallelism without exhausting default qps. Perhaps we can evaluate this number as we do more scale testing.

pkg/controllers/node/expiration.go Outdated Show resolved Hide resolved
pkg/controllers/node/expiration.go Outdated Show resolved Hide resolved
pkg/controllers/node/expiration.go Outdated Show resolved Hide resolved
node = ExpectNodeExists(env.Client, node.Name)
Expect(node.DeletionTimestamp.IsZero()).To(BeTrue())
})
It("should terminate nodes after expiry", func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have the ability to simulate time going by in these tests? I worry a bit that there might be some code paths not quite exercised here ... compared to having a TTL set (to a non-zero value like 30), prove that (say) 15 seconds in the node is not marked for termination, then (say) 30+ seconds in it actually is so marked?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests are just copied over since it's a refactor, but I'll look into mocking time for these tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@ellistarn ellistarn changed the title Refactored expiration controller into node controller Refactored expiration/utilization controller into node controller Aug 3, 2021
@ellistarn ellistarn force-pushed the nodecontroller branch 5 times, most recently from 1eccfd9 to bfb2e60 Compare August 3, 2021 21:58
pkg/controllers/node/controller.go Show resolved Hide resolved
pkg/controllers/node/controller.go Show resolved Hide resolved
pkg/controllers/node/emptiness.go Show resolved Hide resolved
}{
c.readiness,
c.liveness,
c.expiration,
c.emptiness,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a node doesn't have a finalizer but will be expired for emptiness/expiration, we could choose to delete this node before we add the finalizer since the expiration and emptiness controllers send a delete request before we patch the node.

In addition, since we're running these in parallel, is it possible that the order in which we execute these controllers change?

Copy link
Contributor Author

@ellistarn ellistarn Aug 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Semantically, I want the order of these to not matter. I know that in theory, there are some edge cases (i.e, node comes online, instantly ready, we can't reconcile, and emptyttl is 0), but I think they're contrived enough that I'd much rather define an invariant where the order of these reconcilers doesn't matter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that these cases probably won't happen naturally. I also think our logging is in a good enough state that we would be able to debug these cases if they do happen naturally. From then, I'd love to reconsider this approach if necessary.

pkg/controllers/node/emptiness.go Outdated Show resolved Hide resolved
pkg/controllers/node/emptiness.go Show resolved Hide resolved
pkg/controllers/node/expiration.go Show resolved Hide resolved

const LivenessTimeout = 5 * time.Minute

// Liveness is a subreconciler that deletes nodes if its determined to be unrecoverable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment might be a little misleading. From reading this, I would assume that a node that has joined, but then has become in an unrecoverable state would then be subject to the workings of this controller.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in theory, that's what this controller should do. I think we do want to implement something like "auto repair" in here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. Yet, while we don't have that functionality in place, I think the keeping the comment in line with the actual state of the code makes more sense to me, since we don't know if we want to or will implement "auto-repair"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds good to me. I'd like to keep the comments roughly the same irrespective of the content.

pkg/utils/functional/functional.go Outdated Show resolved Hide resolved
Comment on lines 52 to 62
if !empty {
if _, ok := n.Annotations[v1alpha3.EmptinessTimestampAnnotationKey]; ok {
delete(n.Annotations, v1alpha3.EmptinessTimestampAnnotationKey)
logging.FromContext(ctx).Infof("Removed emptiness TTL from node %s", n.Name)
}
return reconcile.Result{}, nil
}
// 3. Set TTL if not set
n.Annotations = functional.UnionStringMaps(n.Annotations)
ttl := time.Duration(ptr.Int64Value(provisioner.Spec.TTLSecondsAfterEmpty)) * time.Second
emptinessTimestamp, ok := n.Annotations[v1alpha3.EmptinessTimestampAnnotationKey]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WDYT?

Suggested change
if !empty {
if _, ok := n.Annotations[v1alpha3.EmptinessTimestampAnnotationKey]; ok {
delete(n.Annotations, v1alpha3.EmptinessTimestampAnnotationKey)
logging.FromContext(ctx).Infof("Removed emptiness TTL from node %s", n.Name)
}
return reconcile.Result{}, nil
}
// 3. Set TTL if not set
n.Annotations = functional.UnionStringMaps(n.Annotations)
ttl := time.Duration(ptr.Int64Value(provisioner.Spec.TTLSecondsAfterEmpty)) * time.Second
emptinessTimestamp, ok := n.Annotations[v1alpha3.EmptinessTimestampAnnotationKey]
emptinessTimestamp, ok := n.Annotations[v1alpha3.EmptinessTimestampAnnotationKey]
if !empty {
if ok {
delete(n.Annotations, v1alpha3.EmptinessTimestampAnnotationKey)
logging.FromContext(ctx).Infof("Removed emptiness TTL from node %s", n.Name)
}
return reconcile.Result{}, nil
}
// 3. Set TTL if not set
n.Annotations = functional.UnionStringMaps(n.Annotations)
ttl := time.Duration(ptr.Int64Value(provisioner.Spec.TTLSecondsAfterEmpty)) * time.Second

Copy link
Contributor

@njtran njtran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Super nice work. Excited to see this work.

@ellistarn ellistarn merged commit a53687e into aws:main Aug 4, 2021
@ellistarn ellistarn deleted the nodecontroller branch August 4, 2021 18:42
gfcroft pushed a commit to gfcroft/karpenter-provider-aws that referenced this pull request Nov 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants