Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance resource-manager health controller #6770

Merged

Conversation

timebertt
Copy link
Member

@timebertt timebertt commented Oct 4, 2022

How to categorize this PR?

/area ops-productivity scalability
/kind enhancement

What this PR does / why we need it:

Enhance resource-manager health controller in several ways:

  • use watches for managed resource and execute the health checks as soon as the health status of some managed resource changes
  • eliminate use of unstructured objects and only use typed or metadata-only objects to make full use of the target cluster cache
  • fix the progressing controller watches (they were wrongly using the source cluster cache – which was only working by coincidence)
  • make health controller logs less noisy

Which issue(s) this PR fixes:

Special notes for your reviewer:

The best part is, that resource-manager reconciles ManagedResources as soon as their health status changes to false.
Hence, if you delete some managed resource or break its health, it gets reconciled back immediately now 🚀

Release note:

The `ManagedResource` health status for objects on the seed cluster is now updated immediately on health status changes (switched from periodic checks to proper watching).

Start additional watches for all GVKs encountered in `ManagedResource.status.resources`.
Objects are mapped to the owning ManagedResource to trigger the health controller as soon
as a change to the health status of any managed object happens.

Fetching events is split from health.CheckService so that we can call CheckHealth
without fetching events in the healthStatusChanged predicate. In the predicate,
we are not interested in the details but only in changes to the health status.
Events are now fetched separately after a failed health check in the reconciler.
We now either use typed objects or metadata-only objects (for GVKs not registered in target scheme).
With this, we can fully leverage the cache of the target cluster (if enabled), even for objects we
don't know. For them, we only care about their presence, so metadata-only is enough.
@gardener-prow gardener-prow bot added area/ops-productivity Operator productivity related (how to improve operations) area/scalability Scalability related kind/enhancement Enhancement, improvement, extension cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. labels Oct 4, 2022
@gardener-prow gardener-prow bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Oct 4, 2022
@rfranzke
Copy link
Member

rfranzke commented Oct 4, 2022

/assign

pkg/resourcemanager/controller/health/add.go Outdated Show resolved Hide resolved
pkg/resourcemanager/controller/health/add.go Outdated Show resolved Hide resolved
pkg/resourcemanager/controller/health/add.go Outdated Show resolved Hide resolved
pkg/resourcemanager/controller/health/add.go Outdated Show resolved Hide resolved
pkg/resourcemanager/controller/health/add_test.go Outdated Show resolved Hide resolved
pkg/resourcemanager/controller/health/health_reconciler.go Outdated Show resolved Hide resolved
CheckHealth can now properly handle metadata-only objects, no need
to treat them specifically in the predicate.
@timebertt
Copy link
Member Author

@rfranzke thanks for your quick review! PTAL :)

Copy link
Member

@rfranzke rfranzke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, no further comments!
/lgtm
/approve

@gardener-prow gardener-prow bot added the lgtm Indicates that a PR is ready to be merged. label Oct 5, 2022
@gardener-prow
Copy link
Contributor

gardener-prow bot commented Oct 5, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rfranzke

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gardener-prow gardener-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 5, 2022
@acumino acumino added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Oct 6, 2022
@gardener-prow
Copy link
Contributor

gardener-prow bot commented Oct 6, 2022

@timebertt: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-gardener-check-vulnerabilities 87c9341 link false /test pull-gardener-check-vulnerabilities

Full PR test history. Your PR dashboard. Command help for this repository.
Please help us cut down on flakes by linking this test failure to an open flake report or filing a new flake report if you can't find an existing one. Also see our testing guideline for how to avoid and hunt flakes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@gardener-prow gardener-prow bot merged commit b08522d into gardener:master Oct 6, 2022
@timebertt timebertt deleted the resource-manager-health-watches branch October 19, 2022 07:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/ops-productivity Operator productivity related (how to improve operations) area/scalability Scalability related cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. kind/enhancement Enhancement, improvement, extension lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants