[Feature] Harmonize scaling operations of the etcd cluster #589

Open
shreyas-s-rao opened this issue May 3, 2023 · 0 comments
Labels
area/control-plane: Control plane related; area/high-availability: High availability related; area/quality: Output qualification (tests, checks, scans, automation in general, etc.) related; kind/enhancement: Enhancement, improvement, extension; lifecycle/stale: Nobody worked on this for 6 months (will further age); priority/1: Priority (lower number equals higher priority)

Feature (What you would like to be added):
Harmonize scaling operations of the etcd cluster through deterministic steps, controlled by etcd-druid.

Motivation (Why is this needed?):
The current approach to scaling out an etcd cluster from a single member to three members has two steps. First, the existing replica must be healthy, i.e., the starting point must be a healthy single-member etcd cluster. Second, the etcd statefulset is scaled to 3 replicas, which introduces two new etcd pods at once. The two new pods compete with each other to add themselves as learners to the existing cluster. One of them succeeds; the other keeps retrying and keeps failing, because the etcd cluster supports only one learner at any given point in time. The pod that succeeded waits for its data to catch up with the leader of the etcd cluster and, once its data is in sync, promotes itself to a voting member of the cluster. Only then does the other pod succeed in adding itself as a learner, after which it follows the same procedure of syncing its data with the leader and promoting itself to a voting member.

This entire process relies on a race between the two new members, and has led to multiple issues with scale-out scenarios of etcd clusters, all of which are detailed in #584. The current approach also leads to many edge cases that need to be specially handled by either druid or backup-restore.

Druid needs to use a more deterministic approach to scale etcd clusters. This can be achieved by harmonizing the scaling out of the etcd cluster in deterministic steps or states, i.e., 0 -> 1 -> 2 -> 3 -> 0. This ensures that druid can handle scaling from either 0 or 1 replicas up to 3 replicas using the same steps/sub-operations, i.e., no special handling of different cases is necessary. This provides two main advantages:

  1. Makes the code leaner - deterministic steps to achieve well-defined states result in fewer edge cases and eliminate artificial races between different etcd members, thus drastically reducing the chances of failed or hanging scale-out operations
  2. Moving between well-defined states allows druid to control the scale-out operation with precision. Since druid is the operator of the etcd cluster, it becomes druid's duty to handle cluster-wide operations such as scale-out, scale-in, etc.

In addition to scaling out the etcd cluster, the above approach can also be used for scaling in the etcd cluster - not in one step, but in two steps: 3 -> 0 and then 0 -> 1. This introduces downtime for etcd, but it is a trade-off made to achieve determinism, and since the etcd cluster is not scaled out/in frequently during its lifetime, the downtime may be acceptable.

Note: to achieve scale-ins, druid needs to be able to manage volumes, as defined in #481.

Approach/Hint to implement the solution (optional):
To be discussed and finalized shortly. A possible solution utilizes the concept of EtcdMember[State] to act as a communication layer between druid and individual etcd pods.

@shreyas-s-rao shreyas-s-rao added area/control-plane Control plane related area/high-availability High availability related area/quality Output qualification (tests, checks, scans, automation in general, etc.) related kind/enhancement Enhancement, improvement, extension priority/1 Priority (lower number equals higher priority) labels May 3, 2023
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Feb 2, 2024