[Feature] Introduce BackupReady condition #251

Closed · Tracked by #107 · Fixed by #271
timuthy opened this issue Nov 3, 2021 · 1 comment
timuthy (Member) commented Nov 3, 2021

Feature (What you would like to be added):
The multi-node etcd proposal includes the introduction of a BackupReady condition (ref).

Motivation (Why is this needed?):
The BackupReady condition ~~will influence the decision if client traffic to the etcd cluster is cut off~~ will serve as an indicator of whether the backup functionality is working as expected.

Approach/Hint to implement the solution (optional):
This condition is supposed to deduce the health of backups from the snapshot Lease object maintained by the etcd-backup-restore sidecar. Hence, the backup is considered unhealthy if the time since the last Lease renewal exceeds a configured threshold.
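
A minimal sketch of such a check, assuming the condition is derived from the Lease's `renewTime` (the function and parameter names, e.g. `backupReady` and `threshold`, are illustrative, not the actual etcd-druid implementation):

```go
package main

import (
	"fmt"
	"time"

	coordinationv1 "k8s.io/api/coordination/v1"
)

// backupReady reports whether the snapshot Lease maintained by the
// etcd-backup-restore sidecar was renewed recently enough. The backup
// is considered unhealthy once the time since the last renewal exceeds
// the configured threshold.
func backupReady(lease *coordinationv1.Lease, threshold time.Duration, now time.Time) bool {
	if lease == nil || lease.Spec.RenewTime == nil {
		// Missing Lease or no renewal yet: backup health cannot be confirmed.
		return false
	}
	return now.Sub(lease.Spec.RenewTime.Time) <= threshold
}

func main() {
	// Hypothetical usage: with no Lease present, the condition is not ready.
	fmt.Println(backupReady(nil, 24*time.Hour, time.Now())) // false
}
```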

@timuthy timuthy added the kind/enhancement Enhancement, improvement, extension label Nov 3, 2021
vlerenc (Member) commented Nov 3, 2021

> The BackupReady condition will influence the decision if client traffic to the etcd cluster is cut-off.

See also #252 (comment):

@timuthy Why? Please don't. We have already had an outage when we lost blob store access and needlessly caused cluster outages. Back then we agreed that we should not do that (though it hasn't been changed yet). We should keep running, raise an alert, and possibly add another shoot state condition (for backup), but the one thing we should absolutely not do is shut down the cluster. It's our main job/responsibility. If ETCD is healthy, we keep running.

We had a similar discussion with @stoyanr when we discussed the "bad case scenario". Shutting down a healthy control plane definitely causes an outage, while not taking backups or not reaching Gardener for some time is very bad but not a neck-breaker, if recovered.

Please do not terminate healthy ETCDs/control planes. No managed service out there will ever do that. Being available is the single most important SLI/KPI that we have.

Besides, treating backup readiness differently from pod readiness should hopefully also make things simpler/less tightly coupled and thereby less complex (clustered ETCD is already complex enough).

@aaronfern aaronfern self-assigned this Nov 29, 2021
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label May 29, 2022
@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Jun 23, 2022