[Feature] Druid-controlled updates to the pods in the etcd cluster #588
Comments
There is an issue when using |
this scenario can also be avoided by setting spec.updateStrategy.rollingUpdate.maxUnavailable to |
We decided that as it's an alpha feature we won't be using this and will be directly moving to |
Rough Discussion notes: |
After an offline meeting with @ashwani2k, @ishan16696, and @renormalize, we discussed three scenarios related to the `OnDelete` strategy:

1. Safe-to-Evict Flag and Voluntary Disruptions: If we set `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"`, consider a scenario where the Vertical Pod Autoscaler (VPA) is evicting a pod during voluntary disruptions. Simultaneously, there may be no unhealthy pods, compelling us to select a candidate and trigger pod deletion to apply the latest updates. This scenario could potentially lead to transient quorum loss, because Pod Disruption Budgets (PDBs) are not respected by the direct deletion calls made by the `OnDelete` pod updater.
2. Simultaneous Reconciliation and Node Update: Suppose etcd reconciliation and node reconciliation (a rolling update of a node) occur concurrently, leading to a node drain that attempts to evict a pod. Simultaneously, the `OnDelete` update component might also select this candidate and trigger deletion. This can cause transient quorum loss, because PDBs are not respected by the deletion calls from the `OnDelete` pod updater.
3. Etcd Reconciliation with Node-Pressure Eviction: Consider a scenario where, during etcd reconciliation, we select a candidate and trigger pod deletion. At the same time, due to high utilization of an already running pod, the kubelet may initiate a node-pressure eviction, an involuntary disruption that does not respect PDBs.

From our brainstorming session, we concluded that we should use the eviction API whenever we are deleting a healthy pod during voluntary disruptions. This approach ensures that any simultaneous involuntary disruptions can be mitigated by PDBs.

Key Takeaway - Evict Healthy Pods: By adopting this strategy, we safeguard against scenarios 1 and 2 and, to some extent, can also prevent scenario 3 if the node-pressure eviction occurs before our deletion process begins. This method ensures greater stability and reliability in managing our Kubernetes resources. For a further understanding of voluntary and involuntary disruptions, you can read more here.
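To make the conclusion concrete, here is a minimal sketch of evicting (rather than directly deleting) a healthy pod with client-go, so that any PDB covering the pod is honoured. This is only an illustration of the eviction API, not code from druid; the function name `evictPodRespectingPDB` and its parameters are assumptions made for the example.

```go
package druidupdate

import (
	"context"
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// evictPodRespectingPDB asks the API server to evict a pod through the
// eviction subresource instead of deleting it directly. The eviction is only
// admitted if it does not violate a PodDisruptionBudget covering the pod,
// which is what protects etcd quorum against a concurrent disruption.
func evictPodRespectingPDB(ctx context.Context, cs kubernetes.Interface, namespace, podName string) error {
	eviction := &policyv1.Eviction{
		ObjectMeta: metav1.ObjectMeta{
			Name:      podName,
			Namespace: namespace,
		},
	}
	err := cs.PolicyV1().Evictions(namespace).Evict(ctx, eviction)
	if apierrors.IsTooManyRequests(err) {
		// A 429 means a PDB currently forbids this disruption; the caller can
		// requeue and retry later instead of forcing a deletion that risks quorum.
		return fmt.Errorf("eviction of %s/%s currently blocked by a PDB: %w", namespace, podName, err)
	}
	return err
}
```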
Feature (What you would like to be added):
Druid-controlled updates to the pods in the etcd cluster.
Motivation (Why is this needed?):
Currently, druid deploys the etcd cluster as a statefulset with the number of replicas set to the desired number of members in the etcd cluster. The `spec.updateStrategy` of this statefulset is set to `RollingUpdate`, which lets the statefulset controller roll the etcd pods one after the other in a rolling fashion. The order in which the pods are updated is deterministic - from the largest ordinal to the smallest, as per the documentation. This works fine for a perfectly healthy etcd cluster, but poses a risk for a multi-node etcd cluster with an unhealthy pod.

Consider a 3-member etcd cluster `etcd-main`, with pods `etcd-main-0`, `etcd-main-1` and `etcd-main-2` running, in the pink of health. At this point, if `etcd-main-0` (or `etcd-main-1`) becomes unhealthy for any of several reasons (network connectivity issues, zone outages, node failure, or simply an etcd issue that might be resolvable by restarting the pod), the etcd cluster still maintains quorum with the other two healthy members, but is now only one step away from losing quorum. What happens now if there is an update to the etcd statefulset spec, such as a change in the `etcd-backup-restore` image version or a configuration change to the `etcd` or `etcdbrctl` processes? The statefulset controller starts rolling the pods, beginning with `etcd-main-2`. As soon as it deletes this pod to make room for the updated pod, the cluster loses quorum. This leads to a downtime of etcd, and consequently a downtime of the kube-apiserver that etcd is backing, until the updated `etcd-main-2` pod comes back up.

This is an artificially introduced quorum loss scenario, which can be entirely avoided if druid takes control of the order in which the etcd pods are updated.
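For clarity, the quorum arithmetic behind "only one step away from losing quorum" can be written out as a tiny, self-contained sketch; the numbers simply mirror the `etcd-main` example above, and nothing here is druid code:

```go
package main

import "fmt"

// faultTolerance returns how many members an etcd cluster of the given size
// can lose while still retaining quorum (a majority of members).
func faultTolerance(clusterSize int) int {
	quorum := clusterSize/2 + 1
	return clusterSize - quorum
}

func main() {
	const members = 3
	unhealthy := 1          // e.g. etcd-main-0 is already down
	rolledByController := 1 // the statefulset controller deletes etcd-main-2 first
	unavailable := unhealthy + rolledByController
	fmt.Printf("tolerable failures: %d, currently unavailable: %d, quorum lost: %t\n",
		faultTolerance(members), unavailable, unavailable > faultTolerance(members))
	// Output: tolerable failures: 1, currently unavailable: 2, quorum lost: true
}
```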
Approach/Hint to implement the solution (optional):
Setting the etcd statefulset's `spec.updateStrategy` to `OnDelete` essentially disables automatic rollouts of pods upon statefulset spec updates, and instead tells the statefulset controller to wait until a pod is deleted before recreating it with the updated pod spec. This gives druid the freedom to check which pods are healthy and which are not, and to make a careful decision on the order in which the pods are updated. In the above case where `etcd-main-0` became unhealthy, druid can first update `etcd-main-0`, ensuring that quorum is still maintained by the other two members. The pod spec update can potentially fix the problem with `etcd-main-0`, such as an internal error, or reschedule it to a different node which might not be suffering from the same network connectivity issues. Druid can then proceed with updating the rest of the etcd pods. In essence, this method reduces the likelihood of an artificially induced quorum loss caused by a badly ordered update of the etcd pods in the cluster.

Changing the updateStrategy of the statefulset to `OnDelete` is also beneficial for rolling the volumes backing the etcd pods, as explained by @unmarshall in gardener/gardener-extension-provider-aws#646 and further discussed in #481.
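A rough sketch of what this approach could look like with the standard Kubernetes API types, assuming druid builds the statefulset spec itself and lists the member pods; the helper names `configureOnDelete`, `updateOrder`, and `isReady` are illustrative and not taken from the druid codebase:

```go
package druidupdate

import (
	"sort"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// configureOnDelete switches the statefulset to the OnDelete update strategy,
// so that the statefulset controller only recreates pods that druid itself
// deletes (or evicts), instead of rolling them on its own.
func configureOnDelete(sts *appsv1.StatefulSet) {
	sts.Spec.UpdateStrategy = appsv1.StatefulSetUpdateStrategy{
		Type: appsv1.OnDeleteStatefulSetStrategyType,
	}
}

// updateOrder returns the pods in the order druid could roll them: unhealthy
// pods first, since recreating them cannot reduce the number of healthy
// members and may even fix them, followed by the healthy pods one at a time.
func updateOrder(pods []corev1.Pod) []corev1.Pod {
	ordered := append([]corev1.Pod(nil), pods...)
	sort.SliceStable(ordered, func(i, j int) bool {
		return !isReady(ordered[i]) && isReady(ordered[j])
	})
	return ordered
}

// isReady reports whether the pod's Ready condition is true.
func isReady(pod corev1.Pod) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodReady {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}
```

In practice druid would also wait for each recreated member to rejoin the cluster and become ready before moving on to the next pod; that readiness gate is omitted from the sketch for brevity.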