CNI DEL not called on node reboot #4727

Closed
caseydavenport opened this issue Apr 7, 2021 · 19 comments

@caseydavenport

caseydavenport commented Apr 7, 2021

Description

I'm seeing a behavior where, upon rebooting an OpenShift node using CRI-O, the CNI plugin does not appear to be called to tear down resources for the previously running pods on that node.

Basically, I am making note of the running pod sandbox IDs, rebooting the node with sudo reboot now (without draining the pods, effectively simulating a failure scenario) and letting the node come back up. When it does, I see the pods get launched again with new sandboxes. However, the old sandboxes never seem to get a CNI DEL.

Similarly, I can see cache files in /var/lib/cni/results for both the new and old sandboxes. Does CRI-O make any attempt to release these resources on node reboot?

Steps to reproduce the issue:

  1. Run some pods on a node. Note the pod sandbox container IDs.
  2. Reboot the node, see that the pods are re-launched with new pod sandboxes
  3. See no evidence in the logs that the old sandbox IDs were released.

Describe the results you received:

CNI DEL is not called on the old sandbox IDs.

Describe the results you expected:

I expect CNI DEL to be called on the old sandbox IDs, providing a chance for CNI plugins to release any associated resources.

Additional information you deem important (e.g. issue happens only occasionally):

I haven't looked at the code yet, but this seems very reproducible in my environment, and from the logs it looks like no attempt is even made to release the resources. It seems counter-intuitive to need this given that a node reboot will tear down namespaces, etc., but there are other resources which persist across a node reboot and thus need to be released even if the original sandbox is no longer running.

Output of crio --version:

crio version 1.18.2-18.rhaos4.5.git754d46b.el8
Version:    1.18.2-18.rhaos4.5.git754d46b.el8
GoVersion:  go1.13.4
Compiler:   gc
Platform:   linux/amd64
Linkmode:   dynamic

Additional environment details (AWS, VirtualBox, physical, etc.):

OpenShift, IPI in AWS using Calico CNI.

@caseydavenport
Author

@haircommander this was another behavior I spotted while investigating #4719

I haven't looked into the root cause yet, though.

@haircommander
Member

Ah, see, this is difficult though. When we reboot a node, the CNI pod is terminated. Here's the reboot flow:

  • crio wipe is called, which wipes all stale containers (to allow resources like held ports to be released, which allows the kube-apiserver to start after a non-graceful shutdown)
  • the CNI pod is deleted by crio wipe
  • the server does attempt to call networkStop on any failed pods, but it doesn't find any because crio wipe has already wiped them.

Even if crio wipe hadn't run, the networkStop would fail because the CNI pod is down, so we'd get some vague error about the plugin being inaccessible (I think).

Fixing this issue would require:

  • cri-o know which pod is a networking pod
  • cri-o restart the networking pod (before the kubelet has started) and call networkStop when running crio wipe (so each container's networking resources are cleaned up)

Can you think of any other way to work around this?

@caseydavenport
Author

I think I'd need to look into crio wipe a bit more; I'm not familiar with it. I would have expected that whatever wipes the stale containers would also call networkStop in order to free resources.

cri-o restart the networking pod

Not all CNI plugins have an associated pod that is required to be running in order to do a tearDown, but yeah for those that do we'd probably need a way to make sure the networkStop calls happen while that pod is running.

Can you think of any other way to work around this?

The way I would have expected this to work would be for CRI-O to leave some sort of breadcrumbs about sandbox containers that it had launched. On restart, CRI-O would compare those breadcrumbs to the set of valid, running sandboxes and run teardown logic on each one that is stopped.

To handle the reboot case, the tearDown logic would be retried with some sort of exponential backoff in order to give time for any dependencies to become ready (e.g., a network pod).
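
A minimal sketch of the retry-with-backoff idea described above, in Go. Nothing here is CRI-O's actual code: the package name and the tearDown callback are hypothetical stand-ins for whatever would end up issuing the CNI DEL for one stale sandbox.

    package networkcleanup

    import (
        "context"
        "time"
    )

    // tearDownWithBackoff retries a best-effort teardown of one stale sandbox,
    // backing off exponentially so that dependencies (e.g. a networking pod)
    // have time to come back up after a reboot.
    func tearDownWithBackoff(ctx context.Context, sandboxID string, tearDown func(context.Context, string) error) error {
        delay := time.Second
        var err error
        for attempt := 0; attempt < 8; attempt++ {
            if err = tearDown(ctx, sandboxID); err == nil {
                return nil
            }
            select {
            case <-ctx.Done():
                return ctx.Err()
            case <-time.After(delay):
            }
            delay *= 2 // exponential backoff between attempts
        }
        return err
    }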

@caseydavenport
Author

caseydavenport commented Apr 7, 2021

Obviously this might not be relevant, but here's similar discussion from a long time ago regarding Kubernetes' "dockershim" implementation, which has since been fixed (and is now about to be removed): kubernetes/kubernetes#14940

@haircommander
Member

What does the CNI plugin need to clean up the resources? I see we create an object:

    ocicni.PodNetwork{
        Name:      sb.KubeName(),
        Namespace: sb.Namespace(),
        Networks:  []ocicni.NetAttachment{},
        ID:        sb.ID(),
        NetNS:     sb.NetNsPath(),
        RuntimeConfig: map[string]ocicni.RuntimeConfig{
            network: {Bandwidth: bwConfig},
        },
    }

which is passed into our ocicni helper TearDownPodWithContext, which eventually issues the CNI DEL.

We could pretty easily save the sandbox state (I believe we have most of those fields already recoverable), but if the node rebooted, the netns will have been cleaned up (it lives in /var/run/ right now). If the netns isn't there, can CNI plugins do any tearing down?
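
For illustration, a rough sketch of how saved sandbox state could be replayed into that teardown path. The savedSandboxState record and its fields are hypothetical, and the exact ocicni.CNIPlugin interface and TearDownPodWithContext signature are assumed from the comment above rather than quoted from CRI-O.

    package networkcleanup

    import (
        "context"

        "github.com/cri-o/ocicni/pkg/ocicni"
    )

    // savedSandboxState is a hypothetical on-disk record of the fields needed
    // to rebuild the PodNetwork after a restart.
    type savedSandboxState struct {
        KubeName  string
        Namespace string
        ID        string
        NetNSPath string
    }

    // tearDownStaleSandbox replays the CNI DEL for a sandbox whose netns may no
    // longer exist, relying on plugins treating DEL as best-effort.
    func tearDownStaleSandbox(ctx context.Context, plugin ocicni.CNIPlugin, st savedSandboxState) error {
        podNetwork := ocicni.PodNetwork{
            Name:      st.KubeName,
            Namespace: st.Namespace,
            Networks:  []ocicni.NetAttachment{},
            ID:        st.ID,
            NetNS:     st.NetNSPath, // likely stale after a reboot
        }
        return plugin.TearDownPodWithContext(ctx, podNetwork)
    }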

@caseydavenport
Author

If the netns isn't there, can CNI plugins do any tearing down?

Yep, the CNI plugin can still clean up other resources not tied to the netns.

This is a relevant comment from the issue I linked above:

Also, in CNI upstream we're discussing adding language that DEL should be best-effort and that even if the netns isn't present, the plugin should still clean up whatever it can including IPAM. That would get rid of a couple of blocks here, and I think is appropriate.

Basically, you can call DEL on a CNI plugin as many times as you like, even if most or all of its resources have already been cleaned up. The CNI plugin shouldn't care if the resources it's being asked to clean up are already gone.

Calico, for example, will just skip anything that doesn't exist, but proceed to clean up anything that still does.
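
To make the "DEL is idempotent" contract concrete, here is a minimal plugin-side sketch using the CNI skel package. releaseIPAM and releaseHostResources are hypothetical placeholders for whatever state a plugin such as Calico keeps on disk or off-box; only skel.CmdArgs comes from the real library.

    package cniexample

    import (
        "errors"
        "os"

        "github.com/containernetworking/cni/pkg/skel"
    )

    // cmdDel treats already-missing resources as success: a missing netns must
    // not turn the DEL into a failure.
    func cmdDel(args *skel.CmdArgs) error {
        if args.Netns != "" {
            // Only propagate unexpected errors; a vanished netns is fine.
            if _, err := os.Stat(args.Netns); err != nil && !errors.Is(err, os.ErrNotExist) {
                return err
            }
        }
        if err := releaseIPAM(args.ContainerID); err != nil {
            return err
        }
        return releaseHostResources(args.ContainerID)
    }

    // Placeholders for plugin-specific cleanup; a real plugin would release its
    // IPAM allocation and any other resources keyed by the container ID.
    func releaseIPAM(containerID string) error          { return nil }
    func releaseHostResources(containerID string) error { return nil }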

@caseydavenport
Author

caseydavenport commented Apr 7, 2021

Here's the CNI SPEC's wording: https://github.com/containernetworking/cni/blob/master/SPEC.md#del-remove-container-from-network-or-un-apply-modifications

Plugins should generally complete a DEL action without error even if some resources are missing. For example, an IPAM plugin should generally release an IP allocation and return success even if the container network namespace no longer exists, unless that network namespace is critical for IPAM management.

and

For another example, the bridge plugin should delegate the DEL action to the IPAM plugin and clean up its own resources even if the container network namespace and/or container network interface no longer exist.

@haircommander
Member

Okay, this will require a fairly substantial rework, but it's definitely worth it to not leak state.
CRI-O's server already has the ability to restore pods and containers; I think we should best-effort attempt to restore them, and delete them if we detect the node rebooted, rather than just remove the storage and ignore all other content.
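
One generic way to detect the "node rebooted" case is sketched below: compare the kernel boot ID against a persisted copy. This is a common technique, not necessarily how crio wipe actually makes the decision; the state-file location is an assumption.

    package networkcleanup

    import (
        "bytes"
        "os"
        "path/filepath"
    )

    // rebootedSinceLastRun reports whether the kernel boot ID changed since the
    // last run, and refreshes the persisted marker when it did.
    func rebootedSinceLastRun(stateDir string) (bool, error) {
        current, err := os.ReadFile("/proc/sys/kernel/random/boot_id")
        if err != nil {
            return false, err
        }
        marker := filepath.Join(stateDir, "boot-id")
        previous, err := os.ReadFile(marker)
        if err != nil && !os.IsNotExist(err) {
            return false, err
        }
        if bytes.Equal(bytes.TrimSpace(current), bytes.TrimSpace(previous)) {
            return false, nil
        }
        return true, os.WriteFile(marker, current, 0o644)
    }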

@haircommander haircommander self-assigned this Apr 7, 2021
@haircommander
Member

haircommander commented Apr 7, 2021

For posterity, I'll list some places where we potentially leak CNI resources, and eventually (hopefully) link the PRs that fix them:

  • crio wipe no forced reboot
  • crio wipe forced reboot
  • resource store cleaning stale resources
  • server unsuccessful restore of pod
  • server successful restore of a pod?

@mrunalp
Member

mrunalp commented Apr 15, 2021

cc: @saschagrunert

@haircommander
Member

haircommander commented Apr 15, 2021

case 1 is solved here #4767

@haircommander
Member

As another note: we may have trouble covering case 2 in a safe way (force reboot)

We currently clear the container and image directories completely if CRI-O detects that the node didn't shut down gracefully (specifically, that it didn't get the chance to sync changes to disk). This is needed to prevent storage corruption.

As a result, we don't know about the containers that existed before the forced reboot, and cannot call CNI DEL on them. I am not sure how we would remediate that.

cc @saschagrunert @mrunalp @giuseppe

@giuseppe
Member

@haircommander could we also wipe out /var/lib/cni?

@saschagrunert
Member

@haircommander could we also wipe out /var/lib/cni?

The sandbox network cleanup will remove related resources. Other container runtimes may have active networks in there.

I thought about using the CNI result (pod annotation) to clean up the networks and results in /var/lib/cni if there are still stale resources left after a network removal, but I was not able to reproduce such an error path at all.
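
A rough sketch of that cleanup idea, assuming stale libcni cache files under /var/lib/cni/results can be matched by the sandbox ID embedded in their file names; that naming convention is an assumption here, and a real implementation would parse the cached results instead.

    package networkcleanup

    import (
        "os"
        "path/filepath"
        "strings"
    )

    // pruneStaleCNIResults removes cached CNI result files under resultsDir
    // (typically /var/lib/cni/results) that reference a sandbox ID which no
    // longer has a live counterpart.
    func pruneStaleCNIResults(resultsDir, staleSandboxID string) error {
        entries, err := os.ReadDir(resultsDir)
        if err != nil {
            return err
        }
        for _, entry := range entries {
            if entry.IsDir() {
                continue
            }
            if strings.Contains(entry.Name(), staleSandboxID) {
                if err := os.Remove(filepath.Join(resultsDir, entry.Name())); err != nil {
                    return err
                }
            }
        }
        return nil
    }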

@haircommander
Member

Alright, #4767 now covers cases 1, 3, and 5. Unsure if we'll support 2, and I am not convinced 4 represents any meaningful number of cases.

@caseydavenport
Author

could we also wipe out /var/lib/cni ?

It's worth mentioning that not all state managed by CNI plugins is written to disk locally. Calico, for example, stores some of its state off-box rather than, say, on a volatile local volume, which makes actually calling CNI DEL more important.

@haircommander
Member

@caseydavenport btw, have you gotten the chance to test against #4767 yet?

@caseydavenport
Author

@haircommander I haven't yet - I'll do that soon!

@haircommander
Member

This has been fixed in the 1.20-1.22 branches!
