[Feature] Harmonize scaling operations of the etcd cluster #589

shreyas-s-rao · 2023-05-03T08:52:00Z

Feature (What you would like to be added):
Harmonize scaling operations of the etcd cluster through deterministic steps, controlled by etcd-druid.

Motivation (Why is this needed?):
The current approach to scaling out an etcd cluster from a single member to three members is two-fold. First, the existing replica must be healthy, i.e., the starting point is a healthy single-member etcd cluster. Second, the etcd statefulset is scaled to 3 replicas, which introduces two new etcd pods at once. The two new pods compete with each other to add themselves as learners to the existing cluster. One of them succeeds, while the other keeps retrying and keeps failing, because the etcd cluster supports only one learner at any given point in time. The pod that succeeds waits for its data to catch up with the leader of the etcd cluster and, once in sync, promotes itself to a voting member of the cluster. Only then does the other pod succeed in adding itself as a learner, after which it follows the same procedure of syncing its data with the leader and promoting itself to a voting member.

This entire process relies on a race between the two new members, and has led to multiple issues with scale-out scenarios of etcd clusters, all of which are detailed in #584. The current approach also leads to many edge cases that need to be specially handled by either druid or backup-restore.

Druid needs to use a more deterministic approach to scale etcd clusters. This can be achieved by harmonizing the scaling out of the etcd cluster in deterministic steps or states, i.e., 0 -> 1 -> 2 -> 3 -> 0. This ensures that druid is able to handle scaling from either 0 or 1 replicas up to 3 replicas using the same steps/sub-operations, i.e., no special handling of different cases is necessary. This provides two main advantages:

1. Makes the code leaner - deterministic steps to achieve well-defined states result in fewer edge cases and eliminate artificial races between different etcd members, thus drastically reducing the chances of failed or hanging scale-out operations.
2. Moving between well-defined states allows druid to control the scale-out operation with precision. Since druid is the operator of the etcd cluster, it becomes druid's duty to handle cluster-wide operations such as scale-out, scale-in, etc.

In addition to scaling out the etcd cluster, the above approach can also be used for scaling in the etcd cluster, not necessarily in one step but in two: 3 -> 0 and then 0 -> 1. This introduces downtime for the etcd cluster, but it is a trade-off made to achieve determinism, and since an etcd cluster is rarely scaled out or in during its lifetime, the downtime may be acceptable.

Note: to achieve scale-ins, druid needs to be able to manage volumes, as defined in #481.

Approach/Hint to implement the solution (optional):
To be discussed and finalized shortly. A possible solution utilizes the concept of EtcdMember[State] to act as a communication layer between druid and individual etcd pods.

ishan16696 mentioned this issue Jun 20, 2023

[BUG] etcd stuck during restore after redeploying the etcd instance CR #621

Closed

shreyas-s-rao mentioned this issue Jun 29, 2023

Fixes for etcd status fields #594

Merged

gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Feb 2, 2024
