Serialize the eviction of pods with volumes #262
From: gardener/gardener#993 (comment) Maybe it could have 2 queues: one that works in parallel and handles all pods without PVs, and one that works serially and evicts pods with PVs one by one. It may be sufficient to wait for the pod to be evicted and the PV to be successfully detached before the next one is processed, but maybe the controller would have to wait for the attach to complete as well. That, however, would be a very ugly dependency, as there is no guarantee this would even work (pod scheduling or attach may be blocked for unrelated reasons). Hopefully, waiting for the detach operation to complete puts a sufficient brake on the "flow of pods" to lead to quasi-serialised attach operations (we slow down the pods leaving a node, so the inflow of pods onto the other/new nodes is slowed down as well).
Hopefully, the separate-queue approach for pods with and without PVs is not that complex to implement. In the end, you would basically throttle the eviction of pods with PVs, that's all. Checking whether a volume is detached can (hopefully) be done without infrastructure-specific code by looking at the node status.
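The two-queue idea above can be sketched as a simple partition step. This is a minimal illustration, not MCM's actual drain code: the `Pod` type and `partitionPods` function are hypothetical stand-ins for the real Kubernetes objects.

```go
package main

import "fmt"

// Pod is a minimal stand-in for corev1.Pod, carrying only the field
// needed to decide which eviction queue a pod belongs to.
type Pod struct {
	Name string
	PVCs []string // names of PersistentVolumeClaims the pod mounts
}

// partitionPods splits pods into a batch that can be evicted in
// parallel (no volumes) and a batch that is evicted serially,
// one by one (has volumes), per the two-queue idea above.
func partitionPods(pods []Pod) (parallel, serial []Pod) {
	for _, p := range pods {
		if len(p.PVCs) == 0 {
			parallel = append(parallel, p)
		} else {
			serial = append(serial, p)
		}
	}
	return parallel, serial
}

func main() {
	pods := []Pod{
		{Name: "web-1"},
		{Name: "db-1", PVCs: []string{"data-db-1"}},
		{Name: "web-2"},
	}
	par, ser := partitionPods(pods)
	fmt.Println(len(par), len(ser)) // 2 1
}
```

The eviction throttling then happens only on the serial batch; the parallel batch proceeds as before.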
@hardikdr brought up the point: "How [would we] prioritise the pods to be evicted? Chances are we pick a wrong pod first, one that technically can't be evicted [e.g. due to PDB violations]. The 'not violating PDB' logic is, I think, implemented on the server side. Still, there should be a way out." That's a very good point, and here the simple solution I hoped for gets more complicated. Can we check whether the pod received its SIGTERM, or rather check after the grace period whether it is still around? If it is, it most likely could not be evicted right away, so we continue with the next pod in our queue. This still puts a brake on the eviction of pods with volumes. Even though it's not perfect, pods that can be evicted fast get evicted one by one, while the others may occasionally be evicted in parallel; overall, in many cases the situation will improve. WDYT?
With the above approach, we will still need to make sure the volumes are detached (for the pods with volumes) before proceeding to evict the next pod with volumes.
...unless a timeout kicks in (e.g. something like 60-120 seconds). Then we would continue with the next pod regardless; otherwise we might hang for too long or in an uncontrolled way.
Should we also consider the rare(?) cases in which deployments/pods share PVCs? Otherwise, each eviction of such pods would end up waiting for the configured per-pod eviction timeout. There could also be circular dependencies, in which case we would need to evict such pods in parallel.
Also, in the special case in which pods with a PDB share a volume, serial eviction of all such pods fails: the pods evicted early cannot start on other nodes because the volume is unavailable, and the volume is not detached from the previous node because the remaining pods using it cannot be evicted due to the PDB. This creates a deadlock.
I guess we have to live with it, or what do you suggest?
After discussions with @amshuman-kr, it was decided that for pods that share PVs, drain will not wait for shared PVs to detach. |
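Implementing that decision requires knowing which PVCs are shared. A minimal sketch, assuming the same hypothetical `Pod` stand-in as above: count PVC references across the pods on the node, and treat any PVC referenced by more than one pod as shared, so its pods skip the detach wait.

```go
package main

import "fmt"

// Pod is a minimal, hypothetical stand-in for corev1.Pod.
type Pod struct {
	Name string
	PVCs []string
}

// sharedPVCs returns the set of PVC names referenced by more than one
// pod. Per the decision above, the drain does not wait for these
// volumes to detach, avoiding the PDB/shared-volume deadlock
// described earlier.
func sharedPVCs(pods []Pod) map[string]bool {
	count := map[string]int{}
	for _, p := range pods {
		for _, c := range p.PVCs {
			count[c]++
		}
	}
	shared := map[string]bool{}
	for c, n := range count {
		if n > 1 {
			shared[c] = true
		}
	}
	return shared
}

func main() {
	pods := []Pod{
		{Name: "a", PVCs: []string{"v1"}},
		{Name: "b", PVCs: []string{"v1", "v2"}},
	}
	shared := sharedPVCs(pods)
	fmt.Println(shared["v1"], shared["v2"]) // true false
}
```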
Feature (What you would like to be added):
MCM to intelligently handle the eviction of pods with volumes
Motivation (Why is this needed?):
Rolling updates of nodes take longer when more volumes are attached to the node. On average, 10+ volumes are attached to the nodes in the seed cluster, and rolling updates taking longer directly increases the downtime of etcd.
A similar impact on shoot cluster workloads is possible.
Approach/Hint to the implement solution (optional):
One option could be to implement a drain controller in MCM (yet to be detailed).