CNI DEL not called on node reboot #4727

Closed
caseydavenport opened this issue Apr 7, 2021 · 19 comments

@caseydavenport

caseydavenport commented Apr 7, 2021

Description

I'm seeing a behavior where, upon rebooting an OpenShift node using CRI-O, the CNI plugin does not appear to be called to tear down resources for the previously running pods on that node.

Basically, I am making note of the running pod sandbox IDs, rebooting the node with sudo reboot now (without draining the pods, effectively simulating a failure scenario) and letting the node come back up. When it does, I see the pods get launched again with new sandboxes. However, the old sandboxes never seem to get a CNI DEL.

Similarly, I can see cache files in /var/lib/cni/results for both the new and old sandboxes. Does CRI-O make any attempt to release these resources on node reboot?

Steps to reproduce the issue:

  1. Run some pods on a node. Note the pod sandbox container IDs.
  2. Reboot the node, see that the pods are re-launched with new pod sandboxes
  3. See no evidence in the logs that the old sandbox IDs were released.

Describe the results you received:

CNI DEL is not called on the old sandbox IDs.

Describe the results you expected:

I expect CNI DEL to be called on the old sandbox IDs, providing a chance for CNI plugins to release any associated resources.

Additional information you deem important (e.g. issue happens only occasionally):

I haven't looked at the code yet, but this seems very reproducible in my environment, and from the logs it looks like no attempt is even made to release the resources. It seems counter-intuitive to need this given that a node reboot will tear down namespaces, etc., but there are other resources which persist across a node reboot and thus need to be released even if the original sandbox is no longer running.

Output of crio --version:

crio version 1.18.2-18.rhaos4.5.git754d46b.el8
Version:    1.18.2-18.rhaos4.5.git754d46b.el8
GoVersion:  go1.13.4
Compiler:   gc
Platform:   linux/amd64
Linkmode:   dynamic

Additional environment details (AWS, VirtualBox, physical, etc.):

OpenShift, IPI in AWS using Calico CNI.

@caseydavenport
Author

@haircommander this was another behavior I spotted while investigating #4719

I haven't looked into the root cause yet, though.

@haircommander
Member

Ah, see, this is difficult though. When we reboot a node, the CNI pod is terminated. Here's the reboot flow:

  • crio wipe is called, which wipes all stale containers (to allow resources like held ports to be released, which allows the kube-apiserver to start after a non-graceful shutdown)
  • the CNI pod is deleted by crio wipe
  • the server does attempt to call networkStop on any failed pods, but it doesn't find any because crio wipe has already wiped them.

Even if crio wipe hadn't run, the networkStop would fail because the CNI pod is down, so we'd get some vague error about the plugin being inaccessible (I think).

Fixing this issue would require:

  • cri-o know which pod is a networking pod
  • cri-o restart the networking pod (before the kubelet has started) and call networkStop when running crio wipe (so each container's networking resources are cleaned up)

Can you think of any other way to work around this?

@caseydavenport
Author

I think I'd need to look into crio wipe a bit more; I'm not familiar with it. I would have expected that whatever wipes the stale containers would also call networkStop in order to free resources.

cri-o restart the networking pod

Not all CNI plugins have an associated pod that is required to be running in order to do a tearDown, but yeah for those that do we'd probably need a way to make sure the networkStop calls happen while that pod is running.

Can you think of any other way to work around this?

The way I would have expected this to work would be for CRI-O to leave some sort of breadcrumbs about sandbox containers that it had launched. On restart, CRI-O would compare those breadcrumbs to the set of valid, running sandboxes and run teardown logic on each one that is stopped.

To handle the reboot case, the tearDown logic would be retried with some sort of exponential backoff in order to give time for any dependencies to become ready (e.g., a network pod).
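
A minimal sketch of the retry-with-backoff idea described above, in Go. Nothing here is CRI-O's actual code: the package name and the tearDown callback are hypothetical stand-ins for whatever would end up issuing the CNI DEL for one stale sandbox.

    package networkcleanup

    import (
        "context"
        "time"
    )

    // tearDownWithBackoff retries a best-effort teardown of one stale sandbox,
    // backing off exponentially so that dependencies (e.g. a networking pod)
    // have time to come back up after a reboot.
    func tearDownWithBackoff(ctx context.Context, sandboxID string, tearDown func(context.Context, string) error) error {
        delay := time.Second
        var err error
        for attempt := 0; attempt < 8; attempt++ {
            if err = tearDown(ctx, sandboxID); err == nil {
                return nil
            }
            select {
            case <-ctx.Done():
                return ctx.Err()
            case <-time.After(delay):
            }
            delay *= 2 // exponential backoff between attempts
        }
        return err
    }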

@caseydavenport
Author

caseydavenport commented Apr 7, 2021

Obviously this might not be relevant, but here's similar discussion from a long time ago regarding Kubernetes' "dockershim" implementation, which has since been fixed (and is now about to be removed): kubernetes/kubernetes#14940

@haircommander
Member

What does the CNI plugin need to clean up the resources? I see we create an object:

    ocicni.PodNetwork{
        Name:      sb.KubeName(),
        Namespace: sb.Namespace(),
        Networks:  []ocicni.NetAttachment{},
        ID:        sb.ID(),
        NetNS:     sb.NetNsPath(),
        RuntimeConfig: map[string]ocicni.RuntimeConfig{
            network: {Bandwidth: bwConfig},
        },
    }

which is passed into our ocicni helper TearDownPodWithContext, which eventually issues the CNI DEL.

We could pretty easily save the sandbox state (I believe we have most of those fields already recoverable), but if the node rebooted, the netns will have been cleaned up (it lives in /var/run/ right now). If the netns isn't there, can CNI plugins do any tearing down?
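
For illustration, a rough sketch of how saved sandbox state could be replayed into that teardown path. The savedSandboxState record and its fields are hypothetical, and the exact ocicni.CNIPlugin interface and TearDownPodWithContext signature are assumed from the comment above rather than quoted from CRI-O.

    package networkcleanup

    import (
        "context"

        "github.com/cri-o/ocicni/pkg/ocicni"
    )

    // savedSandboxState is a hypothetical on-disk record of the fields needed
    // to rebuild the PodNetwork after a restart.
    type savedSandboxState struct {
        KubeName  string
        Namespace string
        ID        string
        NetNSPath string
    }

    // tearDownStaleSandbox replays the CNI DEL for a sandbox whose netns may no
    // longer exist, relying on plugins treating DEL as best-effort.
    func tearDownStaleSandbox(ctx context.Context, plugin ocicni.CNIPlugin, st savedSandboxState) error {
        podNetwork := ocicni.PodNetwork{
            Name:      st.KubeName,
            Namespace: st.Namespace,
            Networks:  []ocicni.NetAttachment{},
            ID:        st.ID,
            NetNS:     st.NetNSPath, // likely stale after a reboot
        }
        return plugin.TearDownPodWithContext(ctx, podNetwork)
    }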

@caseydavenport
Author

If the netns isn't there, can CNI plugins do any tearing down?

Yep, the CNI plugin can still clean up other resources not tied to the netns.

This is a relevant comment from the issue I linked above:

Also, in CNI upstream we're discussing adding language that DEL should be best-effort and that even if the netns isn't present, the plugin should still clean up whatever it can including IPAM. That would get rid of a couple of blocks here, and I think is appropriate.

Basically, you can call DEL on a CNI plugin as many times as you like, even if most or all of its resources have already been cleaned up. The CNI plugin shouldn't care if the resources it's being asked to clean up are already gone.

Calico, for example, will just skip anything that doesn't exist, but proceed to clean up anything that still does.
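
To make the "DEL is idempotent" contract concrete, here is a minimal plugin-side sketch using the CNI skel package. releaseIPAM and releaseHostResources are hypothetical placeholders for whatever state a plugin such as Calico keeps on disk or off-box; only skel.CmdArgs comes from the real library.

    package cniexample

    import (
        "errors"
        "os"

        "github.com/containernetworking/cni/pkg/skel"
    )

    // cmdDel treats already-missing resources as success: a missing netns must
    // not turn the DEL into a failure.
    func cmdDel(args *skel.CmdArgs) error {
        if args.Netns != "" {
            // Only propagate unexpected errors; a vanished netns is fine.
            if _, err := os.Stat(args.Netns); err != nil && !errors.Is(err, os.ErrNotExist) {
                return err
            }
        }
        if err := releaseIPAM(args.ContainerID); err != nil {
            return err
        }
        return releaseHostResources(args.ContainerID)
    }

    // Placeholders for plugin-specific cleanup; a real plugin would release its
    // IPAM allocation and any other resources keyed by the container ID.
    func releaseIPAM(containerID string) error          { return nil }
    func releaseHostResources(containerID string) error { return nil }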

@caseydavenport
Author

caseydavenport commented Apr 7, 2021

Here's the CNI SPEC's wording: https://github.com/containernetworking/cni/blob/master/SPEC.md#del-remove-container-from-network-or-un-apply-modifications

Plugins should generally complete a DEL action without error even if some resources are missing. For example, an IPAM plugin should generally release an IP allocation and return success even if the container network namespace no longer exists, unless that network namespace is critical for IPAM management.

and

For another example, the bridge plugin should delegate the DEL action to the IPAM plugin and clean up its own resources even if the container network namespace and/or container network interface no longer exist.

@haircommander
Member

Okay, this will require a fairly substantial rework, but it's definitely worth it to not leak state.
CRI-O's server already has the ability to restore pods and containers; I think we should best-effort attempt to restore them, and delete them if we detect the node rebooted, rather than just remove the storage and ignore all other content.
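
One generic way to detect the "node rebooted" case is sketched below: compare the kernel boot ID against a persisted copy. This is a common technique, not necessarily how crio wipe actually makes the decision; the state-file location is an assumption.

    package networkcleanup

    import (
        "bytes"
        "os"
        "path/filepath"
    )

    // rebootedSinceLastRun reports whether the kernel boot ID changed since the
    // last run, and refreshes the persisted marker when it did.
    func rebootedSinceLastRun(stateDir string) (bool, error) {
        current, err := os.ReadFile("/proc/sys/kernel/random/boot_id")
        if err != nil {
            return false, err
        }
        marker := filepath.Join(stateDir, "boot-id")
        previous, err := os.ReadFile(marker)
        if err != nil && !os.IsNotExist(err) {
            return false, err
        }
        if bytes.Equal(bytes.TrimSpace(current), bytes.TrimSpace(previous)) {
            return false, nil
        }
        return true, os.WriteFile(marker, current, 0o644)
    }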

@haircommander haircommander self-assigned this Apr 7, 2021
@haircommander
Member

haircommander commented Apr 7, 2021

For posterity, I'll list some places where we potentially leak CNI resources, and eventually (hopefully) link the PRs that fix them:

  • crio wipe no forced reboot
  • crio wipe forced reboot
  • resource store cleaning stale resources
  • server unsuccessful restore of pod
  • server successful restore of a pod?

@mrunalp
Member

mrunalp commented Apr 15, 2021

cc: @saschagrunert

@haircommander
Member

haircommander commented Apr 15, 2021

case 1 is solved here #4767

@haircommander
Member

As another note: we may have trouble covering case 2 in a safe way (force reboot)

We currently clear the container and image directories completely if CRI-O detects that the node didn't shut down gracefully (specifically, that it didn't get the chance to sync changes to disk). This is needed to prevent storage corruption.

As a result, we don't know about the containers that existed before the forced reboot, and cannot call CNI DEL on them. I am not sure how we would remediate that.

cc @saschagrunert @mrunalp @giuseppe

@giuseppe
Member

@haircommander could we also wipe out /var/lib/cni?

@saschagrunert
Member

@haircommander could we also wipe out /var/lib/cni?

The sandbox network cleanup will remove related resources. Other container runtimes may have active networks in there.

I thought about using the CNI result (pod annotation) to clean up the networks and results in /var/lib/cni if there are still stale resources left after a network removal, but I was not able to reproduce such an error path at all.
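
A rough sketch of that cleanup idea, assuming stale libcni cache files under /var/lib/cni/results can be matched by the sandbox ID embedded in their file names; that naming convention is an assumption here, and a real implementation would parse the cached results instead.

    package networkcleanup

    import (
        "os"
        "path/filepath"
        "strings"
    )

    // pruneStaleCNIResults removes cached CNI result files under resultsDir
    // (typically /var/lib/cni/results) that reference a sandbox ID which no
    // longer has a live counterpart.
    func pruneStaleCNIResults(resultsDir, staleSandboxID string) error {
        entries, err := os.ReadDir(resultsDir)
        if err != nil {
            return err
        }
        for _, entry := range entries {
            if entry.IsDir() {
                continue
            }
            if strings.Contains(entry.Name(), staleSandboxID) {
                if err := os.Remove(filepath.Join(resultsDir, entry.Name())); err != nil {
                    return err
                }
            }
        }
        return nil
    }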

@haircommander
Member

Alright, #4767 now covers cases 1, 3, and 5. Unsure if we'll support 2, and I am not convinced 4 represents any meaningful number of cases.

@caseydavenport
Author

could we also wipe out /var/lib/cni ?

It's worth mentioning that not all state managed by CNI plugins is written to disk locally. Calico, for example, stores some of its state off-box rather than, say, on a volatile local volume, which makes actually calling CNI DEL more important.

@haircommander
Member

@caseydavenport btw, have you gotten the chance to test against #4767 yet?

@caseydavenport
Author

@haircommander I haven't yet - I'll do that soon!

@haircommander
Member

This has been fixed in the 1.20-1.22 branches!
