[Feature] Introduce BackupReady condition #251

Closed · Tracked by #107 · Fixed by #271
timuthy opened this issue Nov 3, 2021 · 1 comment
timuthy (Member) commented Nov 3, 2021

Feature (What you would like to be added):
The multi-node etcd proposal includes the introduction of a BackupReady condition (ref).

Motivation (Why is this needed?):
The BackupReady condition ~~will influence the decision if client traffic to the etcd cluster is cut off~~ will serve as an indicator of whether the backup functionality is working as expected.

Approach/Hint to implement the solution (optional):
This condition is supposed to deduce the health of backups from the snapshot Lease object maintained by the etcd-backup-restore sidecar. Hence, the backup is considered unhealthy if the time since the last Lease renewal exceeds a configured threshold.
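
A minimal sketch of such a check, assuming the condition is derived from the Lease's `renewTime` (the function and parameter names, e.g. `backupReady` and `threshold`, are illustrative, not the actual etcd-druid implementation):

```go
package main

import (
	"fmt"
	"time"

	coordinationv1 "k8s.io/api/coordination/v1"
)

// backupReady reports whether the snapshot Lease maintained by the
// etcd-backup-restore sidecar was renewed recently enough. The backup
// is considered unhealthy once the time since the last renewal exceeds
// the configured threshold.
func backupReady(lease *coordinationv1.Lease, threshold time.Duration, now time.Time) bool {
	if lease == nil || lease.Spec.RenewTime == nil {
		// Missing Lease or no renewal yet: backup health cannot be confirmed.
		return false
	}
	return now.Sub(lease.Spec.RenewTime.Time) <= threshold
}

func main() {
	// Hypothetical usage: with no Lease present, the condition is not ready.
	fmt.Println(backupReady(nil, 24*time.Hour, time.Now())) // false
}
```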

@timuthy timuthy added the kind/enhancement Enhancement, improvement, extension label Nov 3, 2021
vlerenc (Member) commented Nov 3, 2021

> The BackupReady condition will influence the decision if client traffic to the etcd cluster is cut-off.

See also #252 (comment):

@timuthy Why? Please don't. We have already had an outage when we lost blob store access and needlessly caused cluster outages. Back then we agreed that we should not do that (though it hasn't been changed yet). We should keep running, raise an alert, and possibly add another shoot state condition (for backup), but the one thing we should absolutely not do is shut down the cluster. It's our main job/responsibility. If ETCD is healthy, we keep running.

We had a similar discussion with @stoyanr when we discussed the "bad case scenario". Shutting down a healthy control plane definitely causes an outage, while not taking backups or not reaching Gardener for some time is very bad but not a neck-breaker, if recovered.

Please do not terminate healthy ETCDs/control planes. No managed service out there will ever do that. Being available is the single most important SLI/KPI that we have.

Besides, treating backup readiness differently from pod readiness should hopefully also make things simpler/less tightly coupled and thereby less complex (clustered ETCD is already complex enough).

@aaronfern aaronfern self-assigned this Nov 29, 2021
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label May 29, 2022
@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Jun 23, 2022