feature: add inflight checks to detect some configuration issues #64

Merged: 1 commit, Nov 16, 2022

Conversation

@tzneal (Contributor) commented Nov 10, 2022

Report, as events and logs, the reasons why a node is stuck terminating or failing to initialize.

Fixes aws/karpenter-provider-aws#2829

Description

How was this change tested?
Unit tests and a live deployment.

Sample log:

karpenter-74bdb86d9c-kc77c controller 2022-11-10T15:56:42.861Z	INFO	controller.inflightchecks	Inflight check failed for node ip-192-168-85-164.us-west-2.compute.internal, Can't drain node, pod default/my-shell is not owned	{"commit": "f691533-dirty", "node": "ip-192-168-85-164.us-west-2.compute.internal"}

Sample event:

51s         Warning   FailedInflightCheck             node/ip-192-168-85-164.us-west-2.compute.internal    Can't drain node, pod default/my-shell is not owned
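
For context on how the pieces fit together: each file under pkg/controllers/inflightchecks/ (termination.go, failedinit.go, nodeshape.go) implements a small check, and the controller fans out over the checks, surfacing each issue as an info-level log line plus a Warning event like the samples above. The sketch below is an illustrative assumption of that pattern, not the PR's actual code; the Issue type, Check interface shape, Controller fields, and reconcile wiring are invented names.

```go
package inflightchecks

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
)

// Issue is a human-readable description of a problem detected on a node.
// (Hypothetical type; all names here are illustrative, not the PR's code.)
type Issue string

// Check inspects a node and returns any issues found. Checks such as the
// termination, failed-init, and node-shape checks would implement this.
type Check interface {
	Check(ctx context.Context, node *v1.Node) ([]Issue, error)
}

// Controller runs every registered check against a node and surfaces each
// issue as both a log line and a Warning event.
type Controller struct {
	checks   []Check
	recorder record.EventRecorder
}

func (c *Controller) reconcile(ctx context.Context, node *v1.Node) error {
	for _, check := range c.checks {
		issues, err := check.Check(ctx, node)
		if err != nil {
			return fmt.Errorf("running inflight check, %w", err)
		}
		for _, issue := range issues {
			// Stand-in for the structured logger; mirrors the shape of the
			// sample log line above.
			fmt.Printf("Inflight check failed for node %s, %s\n", node.Name, issue)
			// Produces an event like the FailedInflightCheck sample above.
			c.recorder.Event(node, v1.EventTypeWarning, "FailedInflightCheck", string(issue))
		}
	}
	return nil
}
```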

There is some duplication here, as we also report failures to evict. I haven't settled on whether that's OK; I'm leaning towards keeping the duplication, since the inflight check is rate limited to once per 10 minutes per node and logs at the info level in addition to creating the event (a sketch of that rate limiting follows the example events below):

  Warning  FailedInflightCheck  2m34s (x2 over 12m)  karpenter  Can't drain node, pod default/my-shell is not owned
  Warning  FailedDraining       32s (x7 over 12m)    karpenter  Failed to drain node, pod default/my-shell does not have any owner references
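
The once-per-10-minutes-per-node behavior could be as simple as a per-node timestamp map; the sketch below is an assumption for illustration (the actual change may use a TTL cache or another mechanism), and the rateLimiter/shouldReport names are invented.

```go
package inflightchecks

import (
	"sync"
	"time"
)

// rateLimiter deduplicates reports so each node is reported on at most once
// per interval. Hypothetical helper, not the PR's actual code.
type rateLimiter struct {
	mu       sync.Mutex
	interval time.Duration        // e.g. 10 * time.Minute
	lastSeen map[string]time.Time // node name -> last report time
}

func newRateLimiter(interval time.Duration) *rateLimiter {
	return &rateLimiter{interval: interval, lastSeen: map[string]time.Time{}}
}

// shouldReport returns true (and records the current time) only if the node
// has not been reported on within the configured interval.
func (r *rateLimiter) shouldReport(nodeName string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if last, ok := r.lastSeen[nodeName]; ok && time.Since(last) < r.interval {
		return false
	}
	r.lastSeen[nodeName] = time.Now()
	return true
}
```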

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@tzneal tzneal requested a review from a team as a code owner November 10, 2022 16:29
@coveralls commented Nov 10, 2022

Pull Request Test Coverage Report for Build 3481295853

  • 174 of 230 (75.65%) changed or added relevant lines in 10 files are covered.
  • 23 unchanged lines in 3 files lost coverage.
  • Overall coverage increased (+0.2%) to 74.647%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pkg/controllers/controllers.go | 0 | 1 | 0.0% |
| pkg/controllers/deprovisioning/consolidation.go | 12 | 14 | 85.71% |
| pkg/controllers/inflightchecks/termination.go | 21 | 28 | 75.0% |
| pkg/controllers/inflightchecks/failedinit.go | 36 | 46 | 78.26% |
| pkg/controllers/inflightchecks/nodeshape.go | 30 | 42 | 71.43% |
| pkg/controllers/inflightchecks/controller.go | 39 | 63 | 61.9% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| pkg/test/environment.go | 6 | 73.91% |
| pkg/controllers/provisioning/scheduling/preferences.go | 7 | 86.81% |
| pkg/controllers/provisioning/scheduling/topology.go | 10 | 84.71% |

Totals (Coverage Status):
  • Change from base Build 3480855207: 0.2%
  • Covered Lines: 5073
  • Relevant Lines: 6796

💛 - Coveralls

Review comment threads (all resolved) on: pkg/controllers/inflightchecks/suite_test.go, pkg/controllers/inflightchecks/controller.go, pkg/controllers/node/initialization.go, pkg/controllers/inflightchecks/nodeshape.go, pkg/controllers/inflightchecks/failedinit.go, pkg/controllers/deprovisioning/pdblimits.go
@jonathan-innis (Member) commented:

Overall, love this feature for observability! Really cool work! 🎉

@ellistarn (Contributor) commented:

> karpenter-74bdb86d9c-kc77c controller 2022-11-10T15:56:42.861Z INFO controller.inflightchecks Inflight check failed for node ip-192-168-85-164.us-west-2.compute.internal, Can't drain node, pod default/my-shell is not owned {"commit": "f691533-dirty", "node": "ip-192-168-85-164.us-west-2.compute.internal"}

The latest from @dewjam is that we shouldn't block on unowned pods. Worth sequencing these PRs? I'm happy either way.

@tzneal (Contributor, Author) commented Nov 13, 2022:

> The latest from @dewjam is that we shouldn't block on unowned pods. Worth sequencing these PRs? I'm happy either way.

There's no PR for that, is there? No issues removing the controllerless checks if that code goes in, but I think this check is worthwhile until it does.

@tzneal force-pushed the inflight-checks branch 2 times, most recently from d64d620 to d314193, November 15, 2022 13:40
@tzneal (Contributor, Author) commented Nov 15, 2022:

> There's no PR for that, is there? No issues removing the controllerless checks if that code goes in, but I think this check is worthwhile until it does.

Removed the controller-less pod check since @dewjam's change went in.

@ellistarn previously approved these changes Nov 16, 2022

Commit message: "Report as events and logs the reasons why a node is stuck terminating or failing to initialize."
@tzneal (Contributor, Author) commented Nov 16, 2022:

Merge conflict, rebased.


Successfully merging this pull request may close this issue: Improve visibility of why a node isn't being drained/terminating
4 participants