CNI DEL not called on node reboot #4727
Comments
@haircommander this was another behavior I spotted while investigating #4719. I haven't looked into the root cause yet, though.
Ah see this is difficult though. When we reboot a node, the CNI pod is terminated. Here's a reboot flow:
Fixing this issue would require:
Can you think of any other way to work around this?
I think I'd need to look into
Not all CNI plugins have an associated pod that is required to be running in order to do a tearDown, but yeah for those that do we'd probably need a way to make sure the
The way I would have expected this to work would be for CRI-O to leave some sort of breadcrumbs about sandbox containers that it had launched. On restart, CRI-O would compare those breadcrumbs to the set of valid, running sandboxes and run teardown logic on each one that is stopped. To handle the reboot case, the teardown logic would be retried with some sort of exponential backoff in order to give time for any dependencies to become ready (e.g., a network pod).
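Here's a minimal sketch of the retry-with-backoff part of that idea. Everything in it is hypothetical (the `staleSandbox` breadcrumb record, `tearDownWithBackoff`, and the injected `tearDown` callback are not CRI-O code); it only illustrates giving dependencies such as a network pod time to come back after a reboot:

```go
package main

import (
	"context"
	"log"
	"time"
)

// staleSandbox is a hypothetical breadcrumb record that the runtime would
// persist before launching a sandbox.
type staleSandbox struct {
	ID    string
	Name  string
	NetNS string
}

// tearDownWithBackoff retries the teardown (ultimately a CNI DEL) for a stale
// sandbox, waiting longer between attempts so that dependencies such as a
// network pod have time to become ready again after a reboot.
func tearDownWithBackoff(ctx context.Context, sb staleSandbox, tearDown func(context.Context, staleSandbox) error) error {
	delay := time.Second
	const maxDelay = 2 * time.Minute
	for {
		err := tearDown(ctx, sb)
		if err == nil {
			return nil
		}
		log.Printf("teardown of sandbox %s failed, retrying in %s: %v", sb.ID, delay, err)
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay):
		}
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
}

func main() {
	sb := staleSandbox{ID: "abc123", Name: "demo", NetNS: "/var/run/netns/demo"}
	_ = tearDownWithBackoff(context.Background(), sb, func(ctx context.Context, s staleSandbox) error {
		return nil // real code would call the CNI DEL path here
	})
}
```

Injecting the teardown callback keeps the retry policy independent of how the runtime actually reaches the CNI DEL.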
Obviously this might not be relevant, but here's a similar discussion from a long time ago regarding Kubernetes' "dockershim" implementation, which has since been fixed (and is now about to be removed): kubernetes/kubernetes#14940
What does the CNI plugin need to clean up the resources? I see we create an object:
which is passed into our ocicni helper TearDownPodWithContext, which eventually calls the CNI DEL. We could pretty easily save the sandbox state (I believe we have most of those fields already recoverable), but if the node rebooted, the netns will have been cleaned up (it lives in /var/run/ right now). If the netns isn't there, can CNI plugins do any tearing down?
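As a rough sketch (the field names are recalled from ocicni's `PodNetwork` struct and may not match every version exactly, and `tearDownStaleSandbox` is a hypothetical helper rather than anything in CRI-O), rebuilding that object from persisted sandbox state and handing it to the teardown helper might look like this:

```go
package cleanup

import (
	"context"

	"github.com/cri-o/ocicni/pkg/ocicni"
)

// tearDownStaleSandbox rebuilds the pod network object from whatever sandbox
// state was saved before the reboot and asks ocicni to run the CNI DEL.
// After a reboot the NetNS path will usually no longer exist.
func tearDownStaleSandbox(ctx context.Context, plugin ocicni.CNIPlugin, id, name, namespace, netnsPath string) error {
	podNetwork := ocicni.PodNetwork{
		Name:      name,
		Namespace: namespace,
		ID:        id,
		NetNS:     netnsPath, // lives under /var/run, so likely gone after a reboot
	}
	return plugin.TearDownPodWithContext(ctx, podNetwork)
}
```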
Yep, the CNI plugin can still clean up other resources not tied to the netns. This is a relevant comment from the issue I linked above:
Basically, you can call DEL on a CNI plugin as many times as you like, even if most or all of its resources have already been cleaned up. The CNI plugin shouldn't care if the resources it's being asked to clean up are already gone. Calico, for example, will just skip anything that doesn't exist, but proceed to clean up anything that still does.
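To illustrate what that means for a plugin, here is a hedged sketch of an idempotent DEL handler built on the CNI plugin skeleton. `releaseExternalResources` is a hypothetical placeholder for plugin-specific cleanup (IPAM reservations, datastore entries, and so on), and the exact `skel.PluginMain` signature varies between CNI library versions:

```go
package main

import (
	"errors"
	"os"

	"github.com/containernetworking/cni/pkg/skel"
	"github.com/containernetworking/cni/pkg/version"
)

func cmdAdd(args *skel.CmdArgs) error   { return nil } // set up networking (omitted)
func cmdCheck(args *skel.CmdArgs) error { return nil }

// cmdDel is idempotent: a missing network namespace or an already-released
// allocation is treated as "nothing left to do here", not as an error, so the
// runtime can safely call DEL again after a reboot.
func cmdDel(args *skel.CmdArgs) error {
	netnsExists := false
	if args.Netns != "" {
		if _, err := os.Stat(args.Netns); err == nil {
			netnsExists = true
		} else if !errors.Is(err, os.ErrNotExist) {
			return err
		}
	}
	if netnsExists {
		// Tear down the interface inside the namespace (omitted).
	}
	// Always release resources that outlive the netns; this placeholder is
	// hypothetical and should itself treat "not found" as success.
	return releaseExternalResources(args.ContainerID)
}

func releaseExternalResources(containerID string) error {
	return nil // plugin-specific cleanup goes here
}

func main() {
	skel.PluginMain(cmdAdd, cmdCheck, cmdDel, version.All, "example plugin")
}
```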
Here's the CNI SPEC's wording: https://github.com/containernetworking/cni/blob/master/SPEC.md#del-remove-container-from-network-or-un-apply-modifications
Okay, this will require a fairly substantial rework, but it's definitely worth it to not leak state.
For posterity, I'll define some places where we potentially leak CNI resources, and eventually (hopefully) link PRs fixing those spots:
cc: @saschagrunert
Case 1 is solved here: #4767
As another note: we may have trouble covering case 2 (force reboot) in a safe way. We currently clear the container and image directories completely if CRI-O detects that the node didn't shut down gracefully. Specifically, it attempts to sync changes to disk; this is needed to prevent storage corruption. As a result, we don't know about the containers that existed before the forced reboot, and cannot call CNI DEL on them. I am not sure how we would remediate that.
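For context, the general shape of that graceful-shutdown detection is a marker file that only exists if the previous run synced its state and exited cleanly. This is an illustrative sketch of the pattern, not CRI-O's actual implementation, and the path used here is made up:

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// markerPath is illustrative; CRI-O keeps a comparable "clean shutdown" file
// under its own state directory.
const markerPath = "/var/lib/example-runtime/clean.shutdown"

// recordCleanShutdown runs at the end of a graceful shutdown, after container
// and image state has been synced to disk.
func recordCleanShutdown() error {
	if err := os.MkdirAll(filepath.Dir(markerPath), 0o755); err != nil {
		return err
	}
	f, err := os.Create(markerPath)
	if err != nil {
		return err
	}
	if err := f.Sync(); err != nil {
		f.Close()
		return err
	}
	return f.Close()
}

// wasCleanShutdown is checked on startup. If the marker is missing, the
// previous run ended abruptly (forced reboot, power loss), so the on-disk
// container state cannot be trusted and gets wiped; that is also why the
// runtime no longer knows which sandboxes would need a CNI DEL.
func wasCleanShutdown() (bool, error) {
	_, err := os.Stat(markerPath)
	if err == nil {
		return true, nil
	}
	if errors.Is(err, os.ErrNotExist) {
		return false, nil
	}
	return false, err
}

func main() {
	if clean, err := wasCleanShutdown(); err != nil {
		fmt.Println("check failed:", err)
	} else {
		fmt.Println("previous shutdown was clean:", clean)
	}
}
```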
@haircommander could we also wipe out
The sandbox network cleanup will remove related resources. Other container runtimes may have active networks in there. I thought about using the CNI result (pod annotation) to clean up the networks and results in
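Whichever directory that cleanup ends up targeting, a quick way to see the problem is to list the libcni results cache and flag entries that don't reference any live sandbox. This sketch assumes the sandbox ID appears in the cache file name, as observed in this issue, and is purely diagnostic:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// findStaleResultFiles lists the libcni results cache and keeps the entries
// that don't reference any currently known sandbox ID. It is a diagnostic
// sketch, not something CRI-O does today.
func findStaleResultFiles(cacheDir string, liveSandboxIDs []string) ([]string, error) {
	entries, err := os.ReadDir(cacheDir)
	if err != nil {
		return nil, err
	}
	var stale []string
	for _, e := range entries {
		referenced := false
		for _, id := range liveSandboxIDs {
			if id != "" && strings.Contains(e.Name(), id) {
				referenced = true
				break
			}
		}
		if !referenced {
			stale = append(stale, e.Name())
		}
	}
	return stale, nil
}

func main() {
	// Usage: pass the live sandbox IDs (e.g. from `crictl pods -q`) as arguments.
	stale, err := findStaleResultFiles("/var/lib/cni/results", os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, name := range stale {
		fmt.Println(name)
	}
}
```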
Alright, #4767 is now covering cases 1, 3, and 5. Unsure if we'll support 2, and I am not convinced 4 represents any meaningful number of cases.
It's worth mentioning that not all state managed by CNI plugins will be written to disk locally. Calico, for example, stores some of its state off-box, which makes actually calling CNI DEL more important (rather than, say, writing state to a volatile local volume).
@caseydavenport btw, have you gotten the chance to test against #4767 yet?
@haircommander I haven't yet - I'll do that soon!
This has been fixed in the 1.20-1.22 branches!
Description
I'm seeing a behavior where, upon rebooting an OpenShift node using CRI-O, the CNI plugin does not appear to be called to tear down resources for the previously running pods on that node.
Basically, I am making note of the running pod sandbox IDs, rebooting the node with `sudo reboot now` (without draining the pods, effectively simulating a failure scenario), and letting the node come back up. When it does, I see the pods get launched again with new sandboxes. However, the old sandboxes never seem to get a CNI DEL. Similarly, I can see cache files in `/var/lib/cni/results` for both the new and old sandboxes. Does CRI-O make any attempt to release these resources on node reboot?
Steps to reproduce the issue:
Describe the results you received:
CNI DEL is not called.
Describe the results you expected:
I expect CNI DEL to be called on the old sandbox IDs, providing a chance for CNI plugins to release any associated resources.
Additional information you deem important (e.g. issue happens only occasionally):
I haven't looked at the code yet, but this seems very reproducible in my environment, and from the logs it looks like no effort is even made to release the resources. It seems counter-intuitive to need this given that a node reboot will tear down namespaces, etc., but there are other resources which persist across node reboots and thus need to be released even if the original sandbox is no longer running.
Output of `crio --version`:
Additional environment details (AWS, VirtualBox, physical, etc.):
OpenShift, IPI in AWS using Calico CNI.