[Feature] Cut-off client traffic when backups are not healthy #252

Closed
timuthy opened this issue Nov 3, 2021 · 3 comments
Labels
kind/enhancement Enhancement, improvement, extension

Comments

@timuthy
Member

timuthy commented Nov 3, 2021

Feature (What you would like to be added):
Per the multi-node etcd proposal (ref), client traffic should be cut off as soon as a failure in the backup procedure is detected.

Motivation (Why is this needed?):

Client traffic should be cut off when the backup upload fails, rather than continuing to serve requests while the gap between the last backup and the current data grows, which might lead to an unacceptably large amount of data loss if disaster strikes.

Approach/Hint to implement the solution (optional):
Consult the BackupReady condition to check whether backups are healthy. If backups are unhealthy, client traffic is cut off by changing the spec.selector of the client service object.
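
A minimal sketch of the approach described above, not an actual implementation. It assumes simplified stand-in types (`Condition`, `Service`) instead of the real Kubernetes or etcd-druid APIs, and the `cutOffLabel` key and both helper functions are hypothetical names chosen for illustration. It also assumes that a missing or non-`True` `BackupReady` condition counts as unhealthy.

```go
package main

import "fmt"

// Condition is a simplified stand-in for a status condition on the etcd resource.
type Condition struct {
	Type   string
	Status string // "True", "False" or "Unknown"
}

// Service is a simplified stand-in for a Kubernetes Service; only the selector matters here.
type Service struct {
	Selector map[string]string
}

// cutOffLabel is a hypothetical label that no pod carries, so adding it to the
// selector leaves the service without endpoints.
const cutOffLabel = "example.invalid/traffic-cut-off"

// backupsHealthy reports whether a BackupReady condition exists and is "True".
// Treating a missing condition as unhealthy is an assumption of this sketch.
func backupsHealthy(conds []Condition) bool {
	for _, c := range conds {
		if c.Type == "BackupReady" {
			return c.Status == "True"
		}
	}
	return false
}

// desiredSelector returns a copy of the base selector, extended with the
// cut-off label when backups are unhealthy. The base map is not modified.
func desiredSelector(base map[string]string, healthy bool) map[string]string {
	sel := make(map[string]string, len(base)+1)
	for k, v := range base {
		sel[k] = v
	}
	if !healthy {
		sel[cutOffLabel] = "true"
	}
	return sel
}

func main() {
	base := map[string]string{"app": "etcd"}
	svc := &Service{Selector: base}
	conds := []Condition{{Type: "BackupReady", Status: "False"}}

	// With unhealthy backups the selector no longer matches any etcd pod.
	svc.Selector = desiredSelector(base, backupsHealthy(conds))
	fmt.Println(svc.Selector)
}
```

Computing the selector from an unmodified base each time keeps the change reversible: once `BackupReady` is `True` again, the same call yields the original selector and traffic resumes.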

Depends on #158, #251

@timuthy timuthy added the kind/enhancement Enhancement, improvement, extension label Nov 3, 2021
@vlerenc
Member

vlerenc commented Nov 3, 2021

Per the multi-node etcd proposal (ref), client traffic should be cut off as soon as a failure in the backup procedure is detected.

@timuthy Why? Please don't. We already had an outage when we lost blob store access and needlessly caused cluster outages. Back then we agreed that we should not do that (though it hasn't been changed yet). We should keep running, raise an alert, possibly add another shoot state condition (for backup), but the one thing we should absolutely not do is shut down the cluster. It's our main job/responsibility. If ETCD is healthy, we keep running.

We had a similar discussion with @stoyanr when we discussed the "bad case scenario". Shutting down a healthy control plane definitely causes an outage, while not taking backups or not reaching Gardener for some time is very bad, but not a neck-breaker if recovered.

Please do not terminate healthy ETCDs/control planes. No managed service out there will ever do that. Being available is the single most important SLI/KPI that we have.

Besides, treating backup readiness differently from pod readiness should hopefully also make things simpler and less tightly coupled, and thereby less complex (clustered ETCD is already complex enough).

@ishan16696
Member

There is also an open issue that was opened precisely to avoid blocking client traffic and to decouple the etcd readiness probe from the snapshotter. It also suggested enhancing the multi-node etcd proposal to address this new requirement.

@timuthy
Member Author

timuthy commented Nov 3, 2021

@vlerenc as discussed, this issue was opened solely with the information extracted from the multi-node etcd proposal document and under the assumption that not taking backups while continuing to process data in etcd becomes a real issue.

Let's close this for now and instead continue with #147.
/close
