[Feature] Druid-controlled updates to the pods in the etcd cluster #588
Comments
There is an issue when using |
this scenario can also be avoided by setting spec.updateStrategy.rollingUpdate.maxUnavailable to |
We decided that as it's an alpha feature we won't be using this and will be directly moving to |
Rough Discussion notes: |
After an offline meeting with @ashwani2k, @ishan16696, and @renormalize, we discussed three scenarios related to the `OnDelete` strategy:

1. Safe-to-Evict Flag and Voluntary Disruptions: If we set `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"`, consider a scenario where the Vertical Pod Autoscaler (VPA) is evicting a pod during voluntary disruptions. Simultaneously, there may be no unhealthy pods, compelling us to select a candidate and trigger pod deletion to apply the latest updates. This scenario could potentially lead to transient quorum loss, because Pod Disruption Budgets (PDBs) are not respected by the direct deletion calls made by the `OnDelete` pod updater.
2. Simultaneous Reconciliation and Node Update: Suppose etcd reconciliation and node reconciliation (a rolling update of a node) occur concurrently, leading to a node drain that attempts to evict a pod. Simultaneously, the `OnDelete` update component might also select this candidate and trigger deletion. This can cause transient quorum loss, because PDBs are not respected by the deletion calls from the `OnDelete` pod updater.
3. Etcd Reconciliation with Node-Pressure Eviction: Consider a scenario where, during etcd reconciliation, we select a candidate and trigger pod deletion. At the same time, due to high utilization of an already running pod, the kubelet may initiate a node-pressure eviction, an involuntary disruption that does not respect PDBs.

From our brainstorming session, we concluded that we should use the eviction API whenever we are deleting a healthy pod during voluntary disruptions. This approach ensures that any simultaneous involuntary disruptions can be mitigated by PDBs.

Key Takeaway - Evict Healthy Pods: By adopting this strategy, we safeguard against scenarios 1 and 2 and, to some extent, can also prevent scenario 3 if the node-pressure eviction occurs before our deletion process begins. This method ensures greater stability and reliability in managing our Kubernetes resources. For a further understanding of voluntary and involuntary disruptions, you can read more here.
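To make the conclusion concrete, here is a minimal sketch of evicting (rather than directly deleting) a healthy pod with client-go, so that any PDB covering the pod is honoured. This is only an illustration of the eviction API, not code from druid; the function name `evictPodRespectingPDB` and its parameters are assumptions made for the example.

```go
package druidupdate

import (
	"context"
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// evictPodRespectingPDB asks the API server to evict a pod through the
// eviction subresource instead of deleting it directly. The eviction is only
// admitted if it does not violate a PodDisruptionBudget covering the pod,
// which is what protects etcd quorum against a concurrent disruption.
func evictPodRespectingPDB(ctx context.Context, cs kubernetes.Interface, namespace, podName string) error {
	eviction := &policyv1.Eviction{
		ObjectMeta: metav1.ObjectMeta{
			Name:      podName,
			Namespace: namespace,
		},
	}
	err := cs.PolicyV1().Evictions(namespace).Evict(ctx, eviction)
	if apierrors.IsTooManyRequests(err) {
		// A 429 means a PDB currently forbids this disruption; the caller can
		// requeue and retry later instead of forcing a deletion that risks quorum.
		return fmt.Errorf("eviction of %s/%s currently blocked by a PDB: %w", namespace, podName, err)
	}
	return err
}
```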
Feature (What you would like to be added):
Druid-controlled updates to the pods in the etcd cluster.
Motivation (Why is this needed?):
Currently, druid deploys the etcd cluster as a statefulset with the number of replicas set to the desired number of members in the etcd cluster. The `spec.updateStrategy` of this statefulset is set to `RollingUpdate`, which lets the statefulset controller roll the etcd pods one after the other in a rolling fashion. The order in which the pods are updated is deterministic - from the largest ordinal to the smallest, as per the documentation. This works fine for a perfectly healthy etcd cluster, but poses a risk for a multi-node etcd cluster with an unhealthy pod.

Consider a 3-member etcd cluster `etcd-main`, with pods `etcd-main-0`, `etcd-main-1` and `etcd-main-2` running, in the pink of health. At this point, if `etcd-main-0` (or `etcd-main-1`) becomes unhealthy for any of several reasons (network connectivity issues, zone outages, node failure, or simply an etcd issue that might be resolvable by restarting the pod), the etcd cluster still maintains quorum with the other two healthy members, but is now only one step away from losing quorum. What happens now if there is an update to the etcd statefulset spec, such as a change in the `etcd-backup-restore` image version or a configuration change to the `etcd` or `etcdbrctl` processes? The statefulset controller starts rolling the pods, beginning with `etcd-main-2`. As soon as it deletes this pod to make room for the updated pod, the cluster loses quorum. This leads to a downtime of etcd, and consequently a downtime of the kube-apiserver that etcd is backing, until the updated `etcd-main-2` pod comes back up.

This is an artificially introduced quorum loss scenario, which can be entirely avoided if druid takes control of the order in which the etcd pods are updated.
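For clarity, the quorum arithmetic behind "only one step away from losing quorum" can be written out as a tiny, self-contained sketch; the numbers simply mirror the `etcd-main` example above, and nothing here is druid code:

```go
package main

import "fmt"

// faultTolerance returns how many members an etcd cluster of the given size
// can lose while still retaining quorum (a majority of members).
func faultTolerance(clusterSize int) int {
	quorum := clusterSize/2 + 1
	return clusterSize - quorum
}

func main() {
	const members = 3
	unhealthy := 1          // e.g. etcd-main-0 is already down
	rolledByController := 1 // the statefulset controller deletes etcd-main-2 first
	unavailable := unhealthy + rolledByController
	fmt.Printf("tolerable failures: %d, currently unavailable: %d, quorum lost: %t\n",
		faultTolerance(members), unavailable, unavailable > faultTolerance(members))
	// Output: tolerable failures: 1, currently unavailable: 2, quorum lost: true
}
```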
Approach/Hint to implement the solution (optional):
Setting the etcd statefulset's `spec.updateStrategy` to `OnDelete` essentially disables automatic rollouts of pods upon statefulset spec updates, and instead tells the statefulset controller to wait until a pod is deleted before recreating it with the updated pod spec. This gives druid the freedom to check which pods are healthy and which are not, and to make a careful decision on the order in which the pods are updated. In the above case where `etcd-main-0` became unhealthy, druid can first update `etcd-main-0`, ensuring that quorum is still maintained by the other two members. The pod spec update can potentially fix the problem with `etcd-main-0`, such as an internal error, or reschedule it to a different node which might not be suffering from the same network connectivity issues. Druid can then proceed with updating the rest of the etcd pods. In essence, this method reduces the likelihood of an artificially induced quorum loss caused by a badly ordered update of the etcd pods in the cluster.

Changing the updateStrategy of the statefulset to `OnDelete` is also beneficial for rolling the volumes backing the etcd pods, as explained by @unmarshall in gardener/gardener-extension-provider-aws#646 and further discussed in #481.
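A rough sketch of what this approach could look like with the standard Kubernetes API types, assuming druid builds the statefulset spec itself and lists the member pods; the helper names `configureOnDelete`, `updateOrder`, and `isReady` are illustrative and not taken from the druid codebase:

```go
package druidupdate

import (
	"sort"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// configureOnDelete switches the statefulset to the OnDelete update strategy,
// so that the statefulset controller only recreates pods that druid itself
// deletes (or evicts), instead of rolling them on its own.
func configureOnDelete(sts *appsv1.StatefulSet) {
	sts.Spec.UpdateStrategy = appsv1.StatefulSetUpdateStrategy{
		Type: appsv1.OnDeleteStatefulSetStrategyType,
	}
}

// updateOrder returns the pods in the order druid could roll them: unhealthy
// pods first, since recreating them cannot reduce the number of healthy
// members and may even fix them, followed by the healthy pods one at a time.
func updateOrder(pods []corev1.Pod) []corev1.Pod {
	ordered := append([]corev1.Pod(nil), pods...)
	sort.SliceStable(ordered, func(i, j int) bool {
		return !isReady(ordered[i]) && isReady(ordered[j])
	})
	return ordered
}

// isReady reports whether the pod's Ready condition is true.
func isReady(pod corev1.Pod) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodReady {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}
```

In practice druid would also wait for each recreated member to rejoin the cluster and become ready before moving on to the next pod; that readiness gate is omitted from the sketch for brevity.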