
Serialize the eviction of pods with volumes #262

Closed
PadmaB opened this issue May 8, 2019 · 9 comments · Fixed by #275 or #288
Labels
area/control-plane (Control plane related) · component/mcm (Machine Controller Manager, including Node Problem Detector, Cluster Auto Scaler, etc.) · kind/enhancement (Enhancement, improvement, extension) · platform/all
Milestone

Comments

@PadmaB
Contributor

PadmaB commented May 8, 2019

Feature (What you would like to be added):
MCM to intelligently handle the eviction of pods with volumes

Motivation (Why is this needed?):
Rolling updates of nodes take longer when more volumes are attached to a node. On average, 10+ volumes are attached to the nodes in the seed cluster, and longer rolling updates directly increase the downtime of etcd.
A similar impact on the shoot cluster workloads is possible.

Approach/Hint to the implement solution (optional):
One option could be to implement a drain controller in MCM; the details are yet to be worked out.

@vlerenc
Member

vlerenc commented May 8, 2019

From: gardener/gardener#993 (comment)

Maybe it could have two queues: one that works in parallel and deals with all pods that have no PVs, and one that works serially and evicts pods with PVs one by one. Maybe it is sufficient to wait for the pod to be evicted and the PV to be successfully detached before the next one is processed, but maybe the controller would have to wait for the attach to complete, too. That, however, would be a very ugly dependency, as there is no guarantee that this will even work (pod scheduling or attach may be blocked for unrelated reasons). Hopefully, waiting for the detach operation to complete puts a sufficient brake on the "flow of pods" so that it leads to quasi-serialised attach operations (we put a brake on the pods that leave a node, so the inflow of pods on the other/new nodes will be slowed down).
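A minimal sketch of that split, assuming pods are classified simply by whether they reference a PVC; the function name is illustrative and not part of MCM's actual drain code:

```go
package drain

import (
	corev1 "k8s.io/api/core/v1"
)

// splitPodsByVolume partitions the pods on a node into those that reference
// persistent volume claims (candidates for the serial queue) and those that
// do not (candidates for the parallel queue). Illustrative only.
func splitPodsByVolume(pods []corev1.Pod) (withPVs, withoutPVs []corev1.Pod) {
	for _, pod := range pods {
		hasPV := false
		for _, vol := range pod.Spec.Volumes {
			if vol.PersistentVolumeClaim != nil {
				hasPV = true
				break
			}
		}
		if hasPV {
			withPVs = append(withPVs, pod)
		} else {
			withoutPVs = append(withoutPVs, pod)
		}
	}
	return withPVs, withoutPVs
}
```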

@vlerenc
Member

vlerenc commented May 8, 2019

Hopefully, the separate-queue approach for pods with and without PVs is not that complex to implement. In the end, you would basically throttle the eviction of pods with PVs, that's all. Checking whether a volume is detached can (hopefully) be done without infrastructure-specific code by looking at the node status (volumesAttached and volumesInUse). If it's done in the small piece of eviction code (originally copied from kubectl), it shouldn't be a lot of work and remains rather local, which is good.

A separate controller is more effort, especially if it is to serve more use cases, but I don't think it would be widely accepted; for that, apps/services use dedicated operators, as they have more goals anyway. From our PoV, MCM is the right spot for this kind of optimisation, because we actually also solve a machine-specific problem: machines with many volumes cannot be drained quickly, or rather this has negative side effects on the whole machine rolling update/replacement topic. That's why we in MCM can, and maybe should (?), improve the experience with a machine-aware, or rather "mass-volume-detach-problem-aware", eviction logic. WDYT?
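A hedged sketch of such a status check, assuming the unique volume names of the evicted pod's volumes have already been resolved (PVC → PV → plugin-specific unique name); volumesDetached is an illustrative helper, not existing MCM code:

```go
package drain

import (
	corev1 "k8s.io/api/core/v1"
)

// volumesDetached reports whether none of the given unique volume names still
// appear in the node status fields volumesAttached and volumesInUse.
// Illustrative only; the caller must resolve PVCs to unique volume names.
func volumesDetached(node *corev1.Node, volumeNames []corev1.UniqueVolumeName) bool {
	stillPresent := map[corev1.UniqueVolumeName]bool{}
	for _, av := range node.Status.VolumesAttached {
		stillPresent[av.Name] = true
	}
	for _, vn := range node.Status.VolumesInUse {
		stillPresent[vn] = true
	}
	for _, name := range volumeNames {
		if stillPresent[name] {
			return false
		}
	}
	return true
}
```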

@vlerenc
Member

vlerenc commented May 8, 2019

@hardikdr brought up the point: "How [would we] prioritise the pods to be evicted? Chances are we pick a wrong pod first, which technically can't be evicted (maybe due to PDB violations). The 'not violating the PDB' logic is, I think, implemented on the server side. Though there should be a way out." That's a very good point, and here the simple solution I hoped for gets more complicated. Can we check whether the pod received its SIGTERM, or rather check after the grace period whether it's still around? If it is, it most likely could not be evicted right away, so we continue with the next pod in our queue. It still puts a brake on the eviction of pods with volumes, and even though it's not perfect (pods that get evicted fast are handled one by one, while the other ones may occasionally end up being evicted in parallel), overall, in many cases, the situation will improve. WDYT?
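A rough sketch of the "check after the grace period" idea, assuming a getPod lookup function is injected (a hypothetical parameter for illustration, not a specific MCM or client-go API):

```go
package drain

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// evictionLikelyBlocked waits for the pod's termination grace period (plus a
// small buffer) and then checks whether the pod still exists. If it does, the
// eviction was probably blocked (e.g. by a PDB), so the caller can move on to
// the next pod in the queue. getPod is an injected lookup, illustrative only.
func evictionLikelyBlocked(pod *corev1.Pod, getPod func(namespace, name string) (*corev1.Pod, error)) bool {
	grace := 30 * time.Second // Kubernetes default when none is set on the pod
	if pod.Spec.TerminationGracePeriodSeconds != nil {
		grace = time.Duration(*pod.Spec.TerminationGracePeriodSeconds) * time.Second
	}
	time.Sleep(grace + 5*time.Second) // small buffer on top of the grace period

	_, err := getPod(pod.Namespace, pod.Name)
	if apierrors.IsNotFound(err) {
		return false // pod is gone, the eviction went through
	}
	return true // still around (or lookup failed): treat as blocked and continue
}
```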

@PadmaB PadmaB added this to the 1905a milestone May 13, 2019
@amshuman-kr
Collaborator

With the above approach, we will still need to make sure the volumes are detached (for pods with volumes) before proceeding to evict the next pod with volumes.

@vlerenc
Member

vlerenc commented May 15, 2019

...unless a timeout kicks in (e.g. something like 60-120 seconds). Then we would continue with the next pod regardless; otherwise we might hang for too long or in an uncontrolled way.
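A sketch of such a bounded wait, assuming a configurable timeout in the suggested 60-120 second range; getNode is an injected, illustrative lookup, and the volume names are assumed to be resolved as in the earlier sketch:

```go
package drain

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForDetachOrTimeout polls the node status until the given volumes no
// longer show up as attached or in use, or until the timeout expires. On
// timeout the caller simply continues with the next pod. Illustrative only.
func waitForDetachOrTimeout(nodeName string, volumeNames []corev1.UniqueVolumeName,
	getNode func(name string) (*corev1.Node, error), timeout time.Duration) {

	_ = wait.PollImmediate(5*time.Second, timeout, func() (bool, error) {
		node, err := getNode(nodeName)
		if err != nil {
			return false, nil // transient error: keep polling until the timeout
		}
		stillPresent := map[corev1.UniqueVolumeName]bool{}
		for _, av := range node.Status.VolumesAttached {
			stillPresent[av.Name] = true
		}
		for _, vn := range node.Status.VolumesInUse {
			stillPresent[vn] = true
		}
		for _, name := range volumeNames {
			if stillPresent[name] {
				return false, nil // still attached or in use, keep waiting
			}
		}
		return true, nil
	})
	// A timeout error is deliberately ignored: we continue with the next pod
	// regardless, as suggested above.
}
```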

@ggaurav10
Contributor

ggaurav10 commented Jun 3, 2019

Should we also consider the (rare?) cases in which deployments/pods etc. share PVCs? Otherwise, in these cases, each eviction of such pods will end up waiting until the configured timeout for a single pod eviction. There could also be circular dependencies, in which case we would need to evict such pods in parallel.

@ggaurav10
Contributor

Also, in the special case in which pods with a PDB share a volume, the serial eviction of all such pods fails. The pods that are evicted early can't start on other nodes due to the unavailability of the volume: the volume is not detached from the previous node because the other pods using that volume can't be evicted due to the PDB. This creates a deadlock.
This situation can happen even with the current implementation of the drain logic, and we are saved by the global machine drain timeout. We can continue relying on that. WDYT?

@vlerenc
Member

vlerenc commented Jun 4, 2019

Rare case

I guess we have to live with it, or what do you suggest?

Shared PVs, no anti-affinity, a PDB of 1: that's a bug on the end user's side, I would think, regardless of MCM.

@ggaurav10
Contributor

After discussions with @amshuman-kr, it was decided that, for pods that share PVs, drain will not wait for the shared PVs to detach.
For such pods with PDBs, we can live with relying on the global drain timeout. I can't think of an ideal way to respect PDBs for such pods while draining a machine.
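A minimal sketch of that exclusion, assuming the drain code already has the list of pods on the node: PVCs referenced by more than one pod are dropped from the set of volumes whose detachment the drain waits for (the function name is illustrative):

```go
package drain

import (
	corev1 "k8s.io/api/core/v1"
)

// exclusivePVCsForPod returns the PVC names referenced by pod that are not
// shared with any other pod on the same node. Drain would wait only for the
// detachment of these exclusive volumes. Assumes pod itself is contained in
// allPodsOnNode. Illustrative only.
func exclusivePVCsForPod(pod corev1.Pod, allPodsOnNode []corev1.Pod) []string {
	// Count how many pods on the node reference each PVC.
	refCount := map[string]int{}
	for _, p := range allPodsOnNode {
		for _, vol := range p.Spec.Volumes {
			if vol.PersistentVolumeClaim != nil {
				refCount[vol.PersistentVolumeClaim.ClaimName]++
			}
		}
	}

	var exclusive []string
	for _, vol := range pod.Spec.Volumes {
		if vol.PersistentVolumeClaim == nil {
			continue
		}
		// A count of 1 means only this pod references the claim.
		if refCount[vol.PersistentVolumeClaim.ClaimName] == 1 {
			exclusive = append(exclusive, vol.PersistentVolumeClaim.ClaimName)
		}
	}
	return exclusive
}
```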

@PadmaB PadmaB modified the milestones: 1905a, 1906a Jun 10, 2019
@PadmaB PadmaB modified the milestones: 1906a, 1906b Jun 24, 2019
@ghost ghost added the component/mcm Machine Controller Manager (including Node Problem Detector, Cluster Auto Scaler, etc.) label Mar 7, 2020