Ignores stuck pods rather than deleting them to avoid stateful set edge cases #678

Merged Sep 15, 2021 (1 commit)

@ellistarn ellistarn commented Sep 14, 2021

1. Issue, if available:

2. Description of changes:
@anguslees pointed out that we must let the kubelet delete terminating pods itself, since deleting them on its behalf could violate stateful set guarantees if a kubelet is partitioned. Instead, we simply ignore pods that are past their grace window and delete the node. This ensures the guarantee is met, since the pod will be deleted by the kube-controller-manager (KCM) once the node no longer exists.

Testing

  1. Modified my provisioner to have a bad endpoint.
  2. Created pod
  3. Node came online, didn't connect
  4. Pod tolerations ran out after 5 minutes, pod evicted (w/ default grace period of 30 sec)
  5. Node deleted by liveness controller (hanging)
  6. Node waits for pod to terminate (grace period)
  7. Node deleted by termination controller
  8. Pod cleaned up by KCM (2 minutes later)

3. Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: link to issue
  • No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

netlify bot commented Sep 14, 2021

✔️ Deploy Preview for karpenter-docs-prod canceled.

🔨 Explore the source changes: 87ce0e4

🔍 Inspect the deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/61413727d55dfb0007f5425c

@@ -108,7 +109,7 @@ var _ = Describe("Controller", func() {
 		Expect(n.DeletionTimestamp.IsZero()).To(BeTrue())

 		// Simulate time passing
-		node.Now = func() time.Time {
+		injectabletime.Now = func() time.Time {
 			return time.Now().Add(time.Duration(*provisioner.Spec.TTLSecondsUntilExpired) * time.Second)

Contributor

This does not seem accurate: it will potentially add more than that amount of time (granted, this may not matter for a particular test). Presumably a plain time.Now() was called earlier in the test, then some time went by executing stuff, and then this is called?

Contributor Author

I'm not sure I understand. This effectively forces time to be 30 seconds into the future (which is the explicit behavior we're trying to test).

Contributor

What if the previous steps take a few seconds to execute? Won't it now be (say) 32 seconds in the future, since we are calling the "live" time.Now?

@@ -66,7 +67,7 @@ var _ = Describe("Controller", func() {
})

AfterEach(func() {
node.Now = time.Now

Contributor

Do we really want to use a plain time.Now here, or something more deterministic (one that always returns the same value, for example)?

Contributor Author

IMO time should work normally unless we explicitly need to control it.

Contributor

I doubt it's causing a problem right now, so this is not a blocking issue. In general, though, it adds variability: if the unit test process itself takes longer to schedule (on a heavily loaded machine running a lot of GitHub Actions, for example), some of these times might stretch longer. That's probably not a big deal here, since the code should generally do the right thing if things take "at least N seconds", but it could lead to flakiness.

pkg/controllers/termination/terminate.go (outdated, resolved)
njtran previously approved these changes Sep 14, 2021

@njtran njtran left a comment

Nice work and nice niche catches!

@ellistarn
Contributor Author

> Nice work and nice niche catches!

niche caches!

@JacobGabrielson JacobGabrielson merged commit 3a77491 into aws:main Sep 15, 2021
@ellistarn ellistarn deleted the safestateful branch September 15, 2021 15:45
3 participants