
Garden Cluster Disaster Recovery (a.k.a. Gardener Ring) #233

Closed
vlerenc opened this issue Jun 22, 2018 · 1 comment
Labels
area/disaster-recovery (disaster recovery related), component/gardener (Gardener), kind/epic (large multi-story topic), lifecycle/stale (open with no activity, has become stale)


vlerenc commented Jun 22, 2018

Story

  • As an operator, I want my garden cluster (functionally) to recover from a disaster so that I can ensure business continuity.

Motivation

Availability SLO, Business continuity.

Acceptance Criteria

  • Garden cluster functionality recovers from a single-zone loss (a partial disaster) with a Recovery Time Objective (RTO) of 5 minutes.

Implementation Proposal

Form a highly available, distributed, self-healing Gardener ring that watches its hosting garden clusters (forming a ring of seeds/shoots) and repairs lost clusters underneath itself autonomously and automatically. This is not so much pure disaster recovery as a mix of:

  • HA Gardener
  • Self-healing of the affected garden cluster rather than complete relocation to a different one (works only if the affected zone becomes available again)

Idea how this could be set up:

  1. Use Minikube (on an IaaS VM with IaaS LBs) to bootstrap a Gardener.
  2. Create a ring of seeds (actually garden clusters) and transfer the first seed's control plane into the last seed cluster.
  3. Deploy an etcd cluster across the garden clusters, and an API server as well, forming a virtual, distributed, node-less Kubernetes cluster.
  4. Transfer the Gardener control plane into that ring and shut down the Minikube cluster again.

The result should be 3 seeds (actually garden clusters), one watching the other, with a distributed Gardener using a distributed etcd on a distributed virtual node-less Kubernetes cluster consisting only of API servers. The whole thing becomes a self-healing Gardener ring.
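To make the watch-and-repair part of the ring more concrete, here is a minimal, hypothetical sketch of the loop each ring member could run: it probes the health of the next garden cluster in the ring and triggers a restore when that cluster is lost. The member names, health endpoints, and the restore hook are illustrative assumptions, not existing Gardener APIs.

```go
// Hypothetical sketch of a "Gardener ring" watch loop (assumed design, not an
// existing Gardener component): each ring member watches the next garden
// cluster and triggers a restore when it becomes unhealthy.
package main

import (
	"fmt"
	"net/http"
	"time"
)

// ringMember describes one garden cluster in the ring (illustrative shape).
type ringMember struct {
	name      string
	healthURL string // e.g. the /healthz endpoint of its kube-apiserver
}

// healthy probes the member's API server health endpoint.
func healthy(m ringMember) bool {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(m.healthURL)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	// Three garden clusters, each one watching the next (i -> (i+1) % 3).
	ring := []ringMember{
		{"garden-a", "https://api.garden-a.example/healthz"},
		{"garden-b", "https://api.garden-b.example/healthz"},
		{"garden-c", "https://api.garden-c.example/healthz"},
	}
	self := 0 // index of the member this process runs in (assumed to be known)

	for range time.Tick(30 * time.Second) {
		next := ring[(self+1)%len(ring)]
		if !healthy(next) {
			// In a real implementation this would re-create the lost etcd
			// member, API server, and Gardener components in the surviving
			// zones/clusters; here we only log the decision.
			fmt.Printf("ring member %s unhealthy, triggering restore\n", next.name)
		}
	}
}
```

A real restore step would presumably look much like today's shoot reconciliation (re-deploying the lost control-plane components on top of the surviving etcd data), which is exactly why running the garden clusters with Gardener itself is attractive.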

Positive side effect: We can reduce our investment in Kubify and focus even more on Gardener. Also, we can leverage Gardener functionality for the clusters that run Gardener itself, which is an immense improvement since day-2 operations are much stronger in Gardener than in Kubify, for obvious reasons. In addition, Gardener is more thoroughly quality-tested and production-hardened than Kubify, which in comparison runs only a few (but nonetheless critical garden) clusters.

Prerequisites/Requirements

Resources

Release Notes

- In case of a partial garden cluster outage or loss (partial disaster limited to one zone), Gardener can now self-heal autonomously and automatically to ensure business continuity with an RTO of 5 minutes.

Definition of Done

  • Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
  • Unit tests are provided: Have you written automated unit tests?
  • Integration tests are provided: Have you written automated integration tests?
  • Minimum API exposure: If you have added/changed public API, was it really necessary/is it minimal?
  • Operations guide: Have you updated the operations guide about ops-relevant changes?
  • User documentation: Have you updated the READMEs/documentation about user-relevant changes?
@vlerenc added the component/gardener, area/disaster-recovery, and kind/epic labels and removed the component/gardener label on Jun 27, 2018
@gardener-robot-ci-1 repeatedly added and removed the lifecycle/stale label between Oct 2018 and Feb 2019
richardyuwen pushed a commit to richardyuwen/gardener that referenced this issue Mar 26, 2019
@gardener-robot-ci-1 and @ghost continued to add and remove the lifecycle/stale label between Apr 2019 and Apr 2020
@ghost added the lifecycle/rotten label and removed the lifecycle/stale label on Jun 1, 2020
@vlerenc added the roadmap/external label and removed the lifecycle/rotten label on Sep 24, 2020
@vlerenc added this to the 2021-Q4 milestone on Sep 24, 2020
@gardener-robot removed this from the 2021-Q4 milestone on Sep 30, 2020
@gardener-robot added the lifecycle/stale label on Dec 17, 2020
rfranzke commented Apr 8, 2021

/close as it's unlikely that we will implement this ring due to complexity concerns

@gardener-robot added the roadmap/standalone label (roadmap for the on-prem standalone delivery, e.g. CDC, NS2) on May 21, 2021
@vlerenc removed the roadmap/standalone and lifecycle/icebox labels on Jun 8, 2021