
Serialize the eviction of pods with volumes #262

Closed
PadmaB opened this issue May 8, 2019 · 9 comments · Fixed by #275 or #288
Labels
area/control-plane (Control plane related) · component/mcm (Machine Controller Manager, including Node Problem Detector, Cluster Auto Scaler, etc.) · kind/enhancement (Enhancement, improvement, extension) · platform/all
Milestone

Comments

@PadmaB
Contributor

PadmaB commented May 8, 2019

Feature (What you would like to be added):
MCM to intelligently handle the eviction of pods with volumes

Motivation (Why is this needed?):
Rolling updates of nodes take longer when more volumes are attached to a node. On average, 10+ volumes are attached to the nodes in the seed cluster, and longer rolling updates directly increase the downtime of etcd.
A similar impact on the shoot cluster workloads is possible.

Approach/Hint to the implement solution (optional):
One option could be to implement a drain controller in MCM; the details are yet to be worked out.

@vlerenc
Member

vlerenc commented May 8, 2019

From: gardener/gardener#993 (comment)

Maybe it could have two queues: one that works in parallel and deals with all pods that have no PVs, and one that works serially and evicts pods with PVs one by one. Maybe it is sufficient to wait for the pod to be evicted and the PV to be successfully detached before the next one is processed, but maybe the controller would have to wait for the attach to complete, too. That, however, would be a very ugly dependency, as there is no guarantee that this will even work (pod scheduling or attach may be blocked for unrelated reasons). Hopefully, waiting for the detach operation to complete puts a sufficient brake on the "flow of pods" so that it leads to quasi-serialised attach operations (we put a brake on the pods that leave a node, so the inflow of pods on the other/new nodes will be slowed down).
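A minimal sketch of that split, assuming pods are classified simply by whether they reference a PVC; the function name is illustrative and not part of MCM's actual drain code:

```go
package drain

import (
	corev1 "k8s.io/api/core/v1"
)

// splitPodsByVolume partitions the pods on a node into those that reference
// persistent volume claims (candidates for the serial queue) and those that
// do not (candidates for the parallel queue). Illustrative only.
func splitPodsByVolume(pods []corev1.Pod) (withPVs, withoutPVs []corev1.Pod) {
	for _, pod := range pods {
		hasPV := false
		for _, vol := range pod.Spec.Volumes {
			if vol.PersistentVolumeClaim != nil {
				hasPV = true
				break
			}
		}
		if hasPV {
			withPVs = append(withPVs, pod)
		} else {
			withoutPVs = append(withoutPVs, pod)
		}
	}
	return withPVs, withoutPVs
}
```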

@vlerenc
Member

vlerenc commented May 8, 2019

Hopefully, the separate-queue approach for pods with and without PVs is not that complex to implement. In the end, you would basically throttle the eviction of pods with PVs, that's all. Checking whether a volume is detached can (hopefully) be done without infrastructure-specific code by looking at the node status (volumesAttached and volumesInUse). If it's done in the small piece of eviction code (originally copied from kubectl), it shouldn't be a lot of work and remains rather local, which is good.

A separate controller is more effort, especially if it is to serve more use cases, but I don't think it would be widely accepted; for that, apps/services use dedicated operators, as they have more goals anyway. From our PoV, MCM is the right spot for this kind of optimisation, because we actually also solve a machine-specific problem: machines with many volumes cannot be drained quickly, or rather this has negative side effects on the whole machine rolling update/replacement topic. That's why we in MCM can, and maybe should (?), improve the experience with a machine-aware, or rather "mass-volume-detach-problem-aware", eviction logic. WDYT?
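A hedged sketch of such a status check, assuming the unique volume names of the evicted pod's volumes have already been resolved (PVC → PV → plugin-specific unique name); volumesDetached is an illustrative helper, not existing MCM code:

```go
package drain

import (
	corev1 "k8s.io/api/core/v1"
)

// volumesDetached reports whether none of the given unique volume names still
// appear in the node status fields volumesAttached and volumesInUse.
// Illustrative only; the caller must resolve PVCs to unique volume names.
func volumesDetached(node *corev1.Node, volumeNames []corev1.UniqueVolumeName) bool {
	stillPresent := map[corev1.UniqueVolumeName]bool{}
	for _, av := range node.Status.VolumesAttached {
		stillPresent[av.Name] = true
	}
	for _, vn := range node.Status.VolumesInUse {
		stillPresent[vn] = true
	}
	for _, name := range volumeNames {
		if stillPresent[name] {
			return false
		}
	}
	return true
}
```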

@vlerenc
Member

vlerenc commented May 8, 2019

@hardikdr brought up the point: "How [would we] prioritise the pods to be evicted? Chances are we pick a wrong pod first, which technically can't be evicted (maybe due to PDB violations). The 'not violating the PDB' logic is, I think, implemented on the server side. Though there should be a way out." That's a very good point, and here the simple solution I hoped for gets more complicated. Can we check whether the pod received its SIGTERM, or rather check after the grace period whether it's still around? If it is, it most likely could not be evicted right away, so we continue with the next pod in our queue. It still puts a brake on the eviction of pods with volumes, and even though it's not perfect (pods that get evicted fast are handled one by one, while the other ones may occasionally end up being evicted in parallel), overall, in many cases, the situation will improve. WDYT?
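A rough sketch of the "check after the grace period" idea, assuming a getPod lookup function is injected (a hypothetical parameter for illustration, not a specific MCM or client-go API):

```go
package drain

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// evictionLikelyBlocked waits for the pod's termination grace period (plus a
// small buffer) and then checks whether the pod still exists. If it does, the
// eviction was probably blocked (e.g. by a PDB), so the caller can move on to
// the next pod in the queue. getPod is an injected lookup, illustrative only.
func evictionLikelyBlocked(pod *corev1.Pod, getPod func(namespace, name string) (*corev1.Pod, error)) bool {
	grace := 30 * time.Second // Kubernetes default when none is set on the pod
	if pod.Spec.TerminationGracePeriodSeconds != nil {
		grace = time.Duration(*pod.Spec.TerminationGracePeriodSeconds) * time.Second
	}
	time.Sleep(grace + 5*time.Second) // small buffer on top of the grace period

	_, err := getPod(pod.Namespace, pod.Name)
	if apierrors.IsNotFound(err) {
		return false // pod is gone, the eviction went through
	}
	return true // still around (or lookup failed): treat as blocked and continue
}
```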

@PadmaB PadmaB added this to the 1905a milestone May 13, 2019
@amshuman-kr
Collaborator

With the above approach, we will still need to make sure the volumes are detached (for pods with volumes) before proceeding to evict the next pod with volumes.

@vlerenc
Member

vlerenc commented May 15, 2019

...unless a timeout kicks in (e.g. something like 60-120 seconds). Then we would continue with the next pod regardless; otherwise we might hang for too long or in an uncontrolled way.
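A sketch of such a bounded wait, assuming a configurable timeout in the suggested 60-120 second range; getNode is an injected, illustrative lookup, and the volume names are assumed to be resolved as in the earlier sketch:

```go
package drain

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForDetachOrTimeout polls the node status until the given volumes no
// longer show up as attached or in use, or until the timeout expires. On
// timeout the caller simply continues with the next pod. Illustrative only.
func waitForDetachOrTimeout(nodeName string, volumeNames []corev1.UniqueVolumeName,
	getNode func(name string) (*corev1.Node, error), timeout time.Duration) {

	_ = wait.PollImmediate(5*time.Second, timeout, func() (bool, error) {
		node, err := getNode(nodeName)
		if err != nil {
			return false, nil // transient error: keep polling until the timeout
		}
		stillPresent := map[corev1.UniqueVolumeName]bool{}
		for _, av := range node.Status.VolumesAttached {
			stillPresent[av.Name] = true
		}
		for _, vn := range node.Status.VolumesInUse {
			stillPresent[vn] = true
		}
		for _, name := range volumeNames {
			if stillPresent[name] {
				return false, nil // still attached or in use, keep waiting
			}
		}
		return true, nil
	})
	// A timeout error is deliberately ignored: we continue with the next pod
	// regardless, as suggested above.
}
```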

@ggaurav10
Contributor

ggaurav10 commented Jun 3, 2019

Should we also consider the (rare?) cases in which deployments/pods etc. share PVCs? Otherwise, in these cases, each eviction of such pods will end up waiting until the configured timeout for a single pod eviction. There could also be circular dependencies, in which case we would need to evict such pods in parallel.

@ggaurav10
Contributor

Also, in the special case in which pods with a PDB share a volume, the serial eviction of all such pods fails. The pods that are evicted early can't start on other nodes due to the unavailability of the volume: the volume is not detached from the previous node because the other pods using that volume can't be evicted due to the PDB. This creates a deadlock.
This situation can happen even with the current implementation of the drain logic, and we are saved by the global machine drain timeout. We can continue relying on that. WDYT?

@vlerenc
Member

vlerenc commented Jun 4, 2019

Rare case

I guess we have to live with it, or what do you suggest?

Shared PVs, no anti-affinity, a PDB of 1: that's a bug on the end user's side, I would think, regardless of MCM.

@ggaurav10
Contributor

After discussions with @amshuman-kr, it was decided that, for pods that share PVs, drain will not wait for the shared PVs to detach.
For such pods with PDBs, we can live with relying on the global drain timeout. I can't think of an ideal way to respect PDBs for such pods while draining a machine.
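A minimal sketch of that exclusion, assuming the drain code already has the list of pods on the node: PVCs referenced by more than one pod are dropped from the set of volumes whose detachment the drain waits for (the function name is illustrative):

```go
package drain

import (
	corev1 "k8s.io/api/core/v1"
)

// exclusivePVCsForPod returns the PVC names referenced by pod that are not
// shared with any other pod on the same node. Drain would wait only for the
// detachment of these exclusive volumes. Assumes pod itself is contained in
// allPodsOnNode. Illustrative only.
func exclusivePVCsForPod(pod corev1.Pod, allPodsOnNode []corev1.Pod) []string {
	// Count how many pods on the node reference each PVC.
	refCount := map[string]int{}
	for _, p := range allPodsOnNode {
		for _, vol := range p.Spec.Volumes {
			if vol.PersistentVolumeClaim != nil {
				refCount[vol.PersistentVolumeClaim.ClaimName]++
			}
		}
	}

	var exclusive []string
	for _, vol := range pod.Spec.Volumes {
		if vol.PersistentVolumeClaim == nil {
			continue
		}
		// A count of 1 means only this pod references the claim.
		if refCount[vol.PersistentVolumeClaim.ClaimName] == 1 {
			exclusive = append(exclusive, vol.PersistentVolumeClaim.ClaimName)
		}
	}
	return exclusive
}
```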

@PadmaB PadmaB modified the milestones: 1905a, 1906a Jun 10, 2019
@PadmaB PadmaB modified the milestones: 1906a, 1906b Jun 24, 2019
@ghost ghost added the component/mcm Machine Controller Manager (including Node Problem Detector, Cluster Auto Scaler, etc.) label Mar 7, 2020