Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Makes extensions health checks use a Lease resource #6626

Merged

Conversation

plkokanov
Copy link
Contributor

@plkokanov plkokanov commented Sep 5, 2022

How to categorize this PR?

/area quality
/kind enhancement

What this PR does / why we need it:

  • Adds a heartbeat extension controller that will maintain a heart beat lease which can be used during extension health checks.
  • Extension controllers health check actuator no longer updates the extensions' status.conditions[].lastUpdateTime on each reconciliation.
  • When the shoot care controller performs health checks for extension resources it will first check the last time when the extension controller's heart beat lease was renewed. This information will be used to determine whether the conditions in the extension resource's status have expired or not.

Which issue(s) this PR fixes:
Fixes #4798

Special notes for your reviewer:

Release note:

When `gardenlet` checks the conditions of extension resources as part of the shoot health check, it checks if the `gardener-extensions-heartbeat` `Lease` maintained by the extension controllers has been renewed within the `ShootCare` controller's `staleExtensionHealthChecks.thresholds[]` settings and sets the corresponding `Shoot` condition to `Unknown` if that is not the case. If the `Lease` is not found, the `status.conditions[].LastUpdateTime` of the extension resource is checked as well for backwards compatibility.
Health checks performed by the `healthcheck` library no longer update the extensions resources' `status.conditions[].LastUpdateTime` on each reconciliation. Instead, a new heartbeat controller was added to the extensions library that will renew a dedicated `Lease` resource named `gardener-extensions-heartbeat` every 30 seconds by default. Extension controllers have to enable this controller as the `gardener-extensions-heartbeat` `Lease` will be used when `gardenlet` checks whether the extension resources' conditions are stale or not. `gardenlet` expects to find this `Lease` inside the namespace where the extension controller is installed by the corresponding `ControllerInstallation`.

@gardener-prow
Copy link
Contributor

gardener-prow bot commented Sep 5, 2022

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@gardener-prow gardener-prow bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. area/quality Output qualification (tests, checks, scans, automation in general, etc.) related kind/enhancement Enhancement, improvement, extension cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. labels Sep 5, 2022
@gardener-prow gardener-prow bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 5, 2022
@rfranzke rfranzke changed the title [WIP] Makes extensions health checks use a lease reasource [WIP] Makes extensions health checks use a Lease resource Sep 5, 2022
@rfranzke
Copy link
Member

rfranzke commented Sep 7, 2022

I feel that having a dedicated Lease resource makes more sense. This is also what we do for the health checks of gardenlets (they report to a Lease object in the gardener-seed-system-lease namespace in the garden cluster which is a different Lease object they use for their leader election).

@plkokanov
Copy link
Contributor Author

I feel that having a dedicated Lease resource makes more sense. This is also what we do for the health checks of gardenlets (they report to a Lease object in the gardener-seed-system-lease namespace in the garden cluster which is a different Lease object they use for their leader election).

Yep sure, just wondering how would the new Lease resource be created/managed. For gardenlet - there's a controller that reconciles Seed objects and takes care of creating/updating the Lease.

@rfranzke
Copy link
Member

rfranzke commented Sep 7, 2022

WDYM? Can't extensions controller also register such controller?

@plkokanov
Copy link
Contributor Author

I was mostly concerned about what resource would be reconciled - the Seed can't be used since it is in the garden cluster. One option would be to directly reconcile Leases, but then something has to create the lease (maybe it can be deployed as part of the installation chart). Another option would be to reconcile the extension controller's Namespace and a third could be to have a mapping between extension CustomResources and the Lease resource.

@rfranzke
Copy link
Member

rfranzke commented Sep 7, 2022

I'm confused, maybe I miss something or don't have the full picture.
From a high-level, my understanding is that an approach could be that extension controllers regularly report a heartbeat in a dedicated Lease resource (via a dedicated controller) just like we do it in the gardenlet. The controllerinstallation-care controller of gardenlet could maintain this information in a condition in the respective ControllerInstallation. Then the gardenlet can consider this information to become aware whether extensions are still alive (and perform their health checks) or not.

@plkokanov
Copy link
Contributor Author

@rfranzke gave the idea to use something similar to https://github.com/gardener/gardener/blob/master/pkg/resourcemanager/controller/garbagecollector/add.go#L75-L81 to initiate the reconciliations for the new controller that would be responsible for the new heartbeat Lease object.
It seems like a good idea, so that's how I will proceed with the PR.

@plkokanov plkokanov force-pushed the enh/extensions-last-hearbeat-time branch from b55fa15 to 6ecf503 Compare September 22, 2022 05:02
@gardener-prow gardener-prow bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 22, 2022
@plkokanov plkokanov changed the title [WIP] Makes extensions health checks use a Lease resource Makes extensions health checks use a Lease resource Sep 22, 2022
@plkokanov plkokanov marked this pull request as ready for review September 22, 2022 05:52
@gardener-prow gardener-prow bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 22, 2022
@plkokanov plkokanov changed the title Makes extensions health checks use a Lease resource [WIP] Makes extensions health checks use a Lease resource Sep 22, 2022
@gardener-prow gardener-prow bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 22, 2022
@plkokanov plkokanov changed the title [WIP] Makes extensions health checks use a Lease resource Makes extensions health checks use a Lease resource Sep 22, 2022
@gardener-prow gardener-prow bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 22, 2022
@plkokanov
Copy link
Contributor Author

The controllerinstallation-care controller of gardenlet could maintain this information in a condition in the respective ControllerInstallation. Then the gardenlet can consider this information to become aware whether extensions are still alive (and perform their health checks) or not.

I've skipped this part and instead went did what I had the impression was discussed in the issue - gardenlet will directly retrieve the Lease from the namespace where the extension controller is installed during the health checks. This is so that it can check its RenewTime directly and whether it is within the ShootCare controller's staleExtensionHealthChecks.thresholds[]

@rfranzke
Copy link
Member

/assign

@ialidzhikov
Copy link
Member

/assign

pkg/operation/care/health.go Outdated Show resolved Hide resolved
pkg/operation/care/checker.go Outdated Show resolved Hide resolved
docs/extensions/heartbeat.md Outdated Show resolved Hide resolved
@gardener-prow gardener-prow bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 23, 2022
Copy link
Member

@rfranzke rfranzke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice PR, well structured!

extensions/pkg/controller/heartbeat/reconciler.go Outdated Show resolved Hide resolved
extensions/pkg/controller/heartbeat/cmd/options.go Outdated Show resolved Hide resolved
pkg/utils/gardener/controllerinstallation.go Outdated Show resolved Hide resolved
pkg/utils/gardener/controllerinstallation.go Outdated Show resolved Hide resolved
docs/extensions/heartbeat.md Show resolved Hide resolved
docs/extensions/heartbeat.md Outdated Show resolved Hide resolved
@plkokanov plkokanov force-pushed the enh/extensions-last-hearbeat-time branch from 41b23ea to e255c84 Compare September 29, 2022 11:44
@gardener-prow gardener-prow bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 29, 2022
@plkokanov plkokanov force-pushed the enh/extensions-last-hearbeat-time branch from e255c84 to 420940b Compare October 4, 2022 07:20
Copy link
Member

@rfranzke rfranzke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Joint review together with @ialidzhikov

docs/extensions/heartbeat.md Outdated Show resolved Hide resolved
extensions/pkg/controller/heartbeat/add.go Outdated Show resolved Hide resolved
extensions/pkg/controller/heartbeat/cmd/options.go Outdated Show resolved Hide resolved
cmd/gardener-extension-provider-local/app/app.go Outdated Show resolved Hide resolved
extensions/pkg/controller/heartbeat/controller.go Outdated Show resolved Hide resolved
hack/local-development/start-extension-provider-local Outdated Show resolved Hide resolved
@plkokanov plkokanov force-pushed the enh/extensions-last-hearbeat-time branch from 420940b to bc58ef1 Compare October 5, 2022 07:37
@plkokanov
Copy link
Contributor Author

Thanks for the review. I've addressed the comments and also added proper default and validation for the --heartbeat-renew-interval-seconds

Copy link
Member

@rfranzke rfranzke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice @plkokanov, let's continue now :)

/lgtm
/approve

@gardener-prow gardener-prow bot added the lgtm Indicates that a PR is ready to be merged. label Oct 5, 2022
@gardener-prow
Copy link
Contributor

gardener-prow bot commented Oct 5, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rfranzke

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gardener-prow gardener-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 5, 2022
@acumino acumino added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Oct 6, 2022
@gardener-prow
Copy link
Contributor

gardener-prow bot commented Oct 6, 2022

@plkokanov: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-gardener-check-vulnerabilities bc58ef1 link false /test pull-gardener-check-vulnerabilities

Full PR test history. Your PR dashboard. Command help for this repository.
Please help us cut down on flakes by linking this test failure to an open flake report or filing a new flake report if you can't find an existing one. Also see our testing guideline for how to avoid and hunt flakes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/quality Output qualification (tests, checks, scans, automation in general, etc.) related cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. kind/enhancement Enhancement, improvement, extension lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Health checks of extension resource to use dedicated Lease resource for heart beats.
4 participants