[Feature] Cut-off client traffic when backups are not healthy #252

timuthy · 2021-11-03T11:20:20Z

Feature (What you would like to be added):
Concerning the multi-node etcd proposal (ref), client traffic should be cut off as soon as a failure in the backup procedure is detected.

Motivation (Why is this needed?):

Client traffic should be cut off when the backup upload fails, rather than continuing to serve requests while the gap between the last backup and the current data grows, which might lead to an unacceptably large amount of data loss if disaster strikes.

Approach/Hint to implement the solution (optional):
Consult the BackupReady condition to check whether backups are healthy. If backups are unhealthy, client traffic is supposed to be cut off by changing the spec.selector of the client service object.
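
To make the proposed approach concrete, here is a minimal sketch of the selector switch, modelling the Etcd resource and the client Service as plain dictionaries. The label `backup-gate: blocked`, the function names, the `"True"` status string, and the choice to treat a missing BackupReady condition as unhealthy are all assumptions for illustration, not taken from the proposal or from any existing implementation.

```python
from copy import deepcopy

# Hypothetical selector that no etcd pod is labelled with. A Service whose
# selector matches no pods has no endpoints, so client traffic is cut off.
NO_TRAFFIC_SELECTOR = {"backup-gate": "blocked"}


def backups_healthy(etcd: dict) -> bool:
    """Return True only if a BackupReady condition exists with status "True"."""
    for cond in etcd.get("status", {}).get("conditions", []):
        if cond.get("type") == "BackupReady":
            return cond.get("status") == "True"
    # Assumption: a missing condition counts as unhealthy.
    return False


def desired_client_service(service: dict, etcd: dict, normal_selector: dict) -> dict:
    """Return a copy of the client Service whose selector reflects backup health."""
    desired = deepcopy(service)  # never mutate the caller's object
    selector = normal_selector if backups_healthy(etcd) else NO_TRAFFIC_SELECTOR
    desired.setdefault("spec", {})["selector"] = dict(selector)
    return desired
```

A controller following this idea would compute the desired Service on every reconciliation and apply it, so the normal selector is restored once BackupReady reports healthy again.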

Depends on #158, #251


vlerenc · 2021-11-03T13:08:03Z

Concerning the multi-node etcd proposal (ref), client traffic should be cut off as soon as a failure in the backup procedure is detected.

@timuthy Why? Please don't. We already had an outage when we lost blob store access and needlessly caused cluster outages. Back then we agreed that we should not do that (though it hasn't been changed yet). We should keep running, raise an alert, and possibly add another shoot state condition (for backup), but the one thing we should absolutely not do is shut down the cluster. Keeping it running is our main job/responsibility. If ETCD is healthy, we keep running.

We had a similar discussion with @stoyanr when we discussed the "bad case scenario". Shutting down a healthy control plane definitely causes an outage, while not taking backups or not reaching Gardener for some time is very bad, but not a neck-breaker if recovered.

Please do not terminate healthy ETCDs/control planes. No managed service out there will ever do that. Being available is the single most important SLI/KPI that we have.

Besides, treating backup readiness differently from pod readiness should hopefully also make things simpler and less tightly coupled, and thereby less complex (clustered ETCD is already complex enough).

ishan16696 · 2021-11-03T14:02:15Z

There is also an open issue that was opened precisely so that client traffic is not blocked and the etcd readiness probe is decoupled from the snapshotter. It also suggests enhancing the multi-node etcd proposal to address this new requirement.
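
As a rough illustration of that decoupling, the sketch below keeps readiness tied to etcd health alone and surfaces backup problems as a separate alert. The `Health` type, the function names, and the alert name `BackupNotReady` are hypothetical and only serve the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Health:
    etcd_healthy: bool    # e.g. the etcd member answers its health check
    backup_healthy: bool  # e.g. the most recent snapshot upload succeeded


def ready_for_traffic(h: Health) -> bool:
    """Readiness depends on etcd only, so a backup failure never blocks clients."""
    return h.etcd_healthy


def alerts(h: Health) -> List[str]:
    """Backup problems are reported on a separate channel instead of via readiness."""
    return [] if h.backup_healthy else ["BackupNotReady"]
```

With this split, a lost blob store connection would yield `ready_for_traffic(...) == True` together with a raised alert, rather than an outage of a healthy etcd.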

timuthy · 2021-11-03T15:01:59Z

@vlerenc As discussed, this issue was opened solely with the information extracted from the multi-node etcd proposal document and under the assumption that not taking backups while continuing to process data in etcd becomes a real issue.

Let's close this for now and instead continue with #147.
/close

timuthy added the kind/enhancement Enhancement, improvement, extension label Nov 3, 2021

timuthy mentioned this issue Nov 3, 2021

Multi-Node/Clustered ETCD #107

Closed

34 tasks

vlerenc mentioned this issue Nov 3, 2021

[Feature] Introduce BackupReady condition #251

Closed

gardener-robot closed this as completed Nov 3, 2021
