
antrea-agent kept crashing on iptables error after restarting a Node with many Pods running on it #1499

Closed
tnqn opened this issue Nov 5, 2020 · 1 comment · Fixed by #1497 or #1534
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@tnqn
Member

tnqn commented Nov 5, 2020

Describe the bug
In large-scale clusters, the xtables lock may be held by kubelet, kube-proxy, or portmap for a long time, especially when there are many Service rules in the nat table, so antrea-agent may not be able to acquire the lock within a short time. If the agent blocks on the lock or exits, the CNI server won't be running, causing all CNI requests to fail.
If the Pods' restart policy is Always and there are dead Pods, the container runtime keeps retrying the CNI calls, each of which invokes portmap first, adding even more competitors for the xtables lock.
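For context, all iptables clients on a host serialize their updates through the xtables lock, and iptables-restore reports failing to grab it within the wait period as exit status 4. Below is a minimal Go sketch of that pattern (a hypothetical helper, not Antrea's actual code, and it assumes an iptables release new enough to support --wait/--wait-interval on iptables-restore); the 10s / 200000us values match the countdown visible in the log further down.

package main

import (
	"bytes"
	"fmt"
	"os/exec"
)

// restoreRules feeds a rule set to iptables-restore with a bounded wait on
// the xtables lock, and distinguishes lock contention from other failures.
func restoreRules(rules string) error {
	cmd := exec.Command("iptables-restore", "--noflush", "--wait=10", "--wait-interval=200000")
	cmd.Stdin = bytes.NewBufferString(rules)
	var stderr bytes.Buffer
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		if ee, ok := err.(*exec.ExitError); ok && ee.ExitCode() == 4 {
			// Exit status 4 is a resource problem; here it means another
			// process held the xtables lock for the entire wait period.
			return fmt.Errorf("xtables lock is held by another process: %s", stderr.String())
		}
		return fmt.Errorf("iptables-restore failed: %v: %s", err, stderr.String())
	}
	return nil
}

func main() {
	// Example input similar to the rules shown in the log below.
	rules := "*filter\n:ANTREA-FORWARD - [0:0]\nCOMMIT\n"
	if err := restoreRules(rules); err != nil {
		fmt.Println(err)
	}
}

Whether a caller treats the lock-contention case as fatal (exit the agent) or as retryable is exactly the behavior discussed in this issue.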

For example, after rebooting a Node that had been running many Pods, antrea-agent kept crashing:

I1104 10:57:03.639575       1 log_file.go:99] Set log file max size to 104857600
I1104 10:57:03.640447       1 agent.go:63] Starting Antrea agent (version v0.10.0-unknown)
I1104 10:57:03.640470       1 client.go:34] No kubeconfig file was specified. Falling back to in-cluster config
I1104 10:57:03.643157       1 ovs_client.go:67] Connecting to OVSDB at address /var/run/openvswitch/db.sock
I1104 10:57:04.483437       1 agent.go:197] Setting up node network
I1104 10:57:04.513559       1 agent.go:584] Setting Node MTU=1450
I1104 10:57:04.516233       1 ovs_client.go:110] Bridge exists: 243df4d8-4016-4f88-9592-639e47b9d131
I1104 10:57:04.517418       1 agent.go:675] Using round number 24
I1104 10:57:04.517475       1 ofctrl.go:170] Initialize connection or re-connect to /var/run/openvswitch/br-int.mgmt.
I1104 10:57:05.517794       1 ofctrl.go:185] Connected to socket /var/run/openvswitch/br-int.mgmt
I1104 10:57:05.517987       1 ofctrl.go:247] New connection..
I1104 10:57:05.518010       1 ofctrl.go:254] Send hello with OF version: 4
I1104 10:57:05.518116       1 ofctrl.go:268] Received Openflow 1.3 Hello message
I1104 10:57:05.518508       1 ofctrl.go:285] Received ofp1.3 Switch feature response: {Header:{Version:4 Type:6 Length:32 Xid:3} DPID:00:00:da:f4:3d:24:88:4f Buffers:0 NumTables:254 AuxilaryId:0 pad:[0 0] Capabilities:79 Actions:0 Ports:[]}
I1104 10:57:05.518545       1 ofSwitch.go:85] Openflow Connection for new switch: 00:00:da:f4:3d:24:88:4f
I1104 10:57:05.518817       1 ofctrl_bridge.go:220] OFSwitch is connected: 00:00:da:f4:3d:24:88:4f
I1104 10:57:15.531445       1 agent.go:343] Deleting stale flows from previous round if any
I1104 10:57:15.531499       1 agent.go:233] Persisting round number 24 to OVSDB
I1104 10:57:15.532106       1 agent.go:238] Round number 24 was persisted to OVSDB
E1104 10:57:23.837039       1 iptables.go:134] Failed to execute iptables-restore: exit status 4
stdin:
*mangle
:ANTREA-MANGLE - [0:0]
COMMIT
*filter
:ANTREA-FORWARD - [0:0]
-A ANTREA-FORWARD -m comment --comment "Antrea: accept packets from local pods" -i antrea-gw0 -j ACCEPT
-A ANTREA-FORWARD -m comment --comment "Antrea: accept packets to local pods" -o antrea-gw0 -j ACCEPT
COMMIT
*nat
:ANTREA-POSTROUTING - [0:0]
-A ANTREA-POSTROUTING -m comment --comment "Antrea: masquerade pod to external packets" -s 10.132.32.0/24 -m set ! --match-set ANTREA-POD-IP dst -j MASQUERADE
COMMIT

stderr:
Another app is currently holding the xtables lock; still 8s 200000us time ahead to have a chance to grab the lock...
Another app is currently holding the xtables lock; still 6s 200000us time ahead to have a chance to grab the lock...
Another app is currently holding the xtables lock; still 4s 200000us time ahead to have a chance to grab the lock...
Another app is currently holding the xtables lock; still 2s 200000us time ahead to have a chance to grab the lock...
Another app is currently holding the xtables lock; still 0s 200000us time ahead to have a chance to grab the lock...
Another app is currently holding the xtables lock. Stopped waiting after 10s.
F1104 10:57:23.838521       1 main.go:58] Error running agent: error initializing agent: failed to initialize iptables: error executing iptables-restore: exit status 4
goroutine 1 [running]:
k8s.io/klog.stacks(0xc000489900, 0xc0000be100, 0xa8, 0xfd)
        /tmp/gopath/pkg/mod/k8s.io/klog@v1.0.0/klog.go:875 +0xb8
k8s.io/klog.(*loggingT).output(0x30da940, 0xc000000003, 0xc000126380, 0x30130c9, 0x7, 0x3a, 0x0)
        /tmp/gopath/pkg/mod/k8s.io/klog@v1.0.0/klog.go:826 +0x330
k8s.io/klog.(*loggingT).printf(0x30da940, 0x3, 0x1e43bdf, 0x17, 0xc000669d08, 0x1, 0x1)
        /tmp/gopath/pkg/mod/k8s.io/klog@v1.0.0/klog.go:707 +0x14b
k8s.io/klog.Fatalf(...)
        /tmp/gopath/pkg/mod/k8s.io/klog@v1.0.0/klog.go:1276
main.newAgentCommand.func1(0xc000142f00, 0xc0002dec80, 0x0, 0x8)
        /usr/src/github.com/vmware-tanzu/antrea/cmd/antrea-agent/main.go:58 +0x211
github.com/spf13/cobra.(*Command).execute(0xc000142f00, 0xc00004e0a0, 0x8, 0x8, 0xc000142f00, 0xc00004e0a0)
        /tmp/gopath/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830 +0x2aa
github.com/spf13/cobra.(*Command).ExecuteC(0xc000142f00, 0x3054f00, 0xc000000180, 0xc0002b3f50)
        /tmp/gopath/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914 +0x2fb
github.com/spf13/cobra.(*Command).Execute(...)
        /tmp/gopath/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864
main.main()
        /usr/src/github.com/vmware-tanzu/antrea/cmd/antrea-agent/main.go:37 +0x56

Many iptables commands are running:

# ps -ef|grep iptables |head
root     13273     1  0 13:44 ?        00:00:00 /usr/sbin/iptables -t nat -X CNI-SN-163b6ccdafc1dd98d7e49 --wait
root     15718     1  0 13:45 ?        00:00:00 /usr/sbin/iptables -t nat -S OUTPUT --wait
root     16878     1  0 13:46 ?        00:00:00 /usr/sbin/iptables -t nat -X CNI-DN-27bde851b8afb80efcc37 --wait
root     17062     1  0 13:46 ?        00:00:00 /usr/sbin/iptables -t nat -X CNI-SN-282ceca1b9bf11c96c25f --wait
root     17125     1  0 13:46 ?        00:00:00 /usr/sbin/iptables -t nat -S CNI-HOSTPORT-DNAT --wait
root     17526     1  0 13:46 ?        00:00:00 /usr/sbin/iptables -t nat -N CNI-SN-503d55cea56b9fb0be499 --wait
root     17701     1  0 13:46 ?        00:00:00 /usr/sbin/iptables -t nat -X CNI-DN-772a1811fe6cff027994e --wait
root     17914     1  0 13:46 ?        00:00:00 /usr/sbin/iptables -t nat -F CNI-SN-6b7d4e8740afb4cd6b6b9 --wait
root     17985     1  0 13:46 ?        00:00:00 /usr/sbin/iptables -t nat -F CNI-SN-1e7edff655d7a9784be7d --wait
root     18019 18011  0 13:46 ?        00:00:00 /usr/sbin/iptables -t nat -S OUTPUT --wait

They are spawned by portmap processes:

# ps -ef|grep portmap |head
root     17921   613  0 13:46 ?        00:00:00 /opt/cni/bin/portmap
root     17968   613  0 13:46 ?        00:00:00 /opt/cni/bin/portmap
root     17998   613  0 13:46 ?        00:00:00 /opt/cni/bin/portmap
root     18011   613  0 13:46 ?        00:00:00 /opt/cni/bin/portmap
root     18078   613  0 13:46 ?        00:00:00 /opt/cni/bin/portmap
root     18079   613  0 13:46 ?        00:00:00 /opt/cni/bin/portmap
root     18189   613  0 13:47 ?        00:00:00 /opt/cni/bin/portmap
root     18202   613  0 13:47 ?        00:00:00 /opt/cni/bin/portmap
root     18207   613  0 13:47 ?        00:00:00 /opt/cni/bin/portmap
root     18219   613  0 13:47 ?        00:00:00 /opt/cni/bin/portmap

The parent process 613 is containerd:

# ps -ef|grep 613
root       613     1  2 09:00 ?        00:06:13 /usr/local/bin/containerd

To Reproduce

  1. Create thousands of Services (a sketch follows this list)
  2. Create ~100 Pods on a Node
  3. After all Pods are running, restart the Node
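A rough client-go sketch of step 1 is below; the namespace, Service names, and label selector are hypothetical, and the same can be done with kubectl in a loop. The Services should select the running Pods so that kube-proxy programs nat rules for their endpoints.

package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig and create a clientset.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Create a few thousand ClusterIP Services pointing at the test Pods.
	for i := 0; i < 2000; i++ {
		svc := &corev1.Service{
			ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("scale-svc-%d", i)},
			Spec: corev1.ServiceSpec{
				Selector: map[string]string{"app": "scale-test"},
				Ports: []corev1.ServicePort{{
					Port:       80,
					TargetPort: intstr.FromInt(8080),
				}},
			},
		}
		if _, err := client.CoreV1().Services("default").Create(context.TODO(), svc, metav1.CreateOptions{}); err != nil {
			panic(err)
		}
	}
}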

Expected
antrea-agent shouldn't keep crashing.

Actual behavior
antrea-agent kept crashing, no CNI requests succeeded, and kubelet kept retrying StopPodSandbox.

Versions:
Please provide the following information:

  • Antrea version (Docker image tag): 0.6.0 (which introduced the portmap plugin) and later releases
@tnqn tnqn added the kind/bug Categorizes issue or PR as related to a bug. label Nov 5, 2020
@tnqn tnqn added this to the Antrea v0.11.0 release milestone Nov 5, 2020
@tnqn tnqn self-assigned this Nov 5, 2020
@tnqn
Member Author

tnqn commented Nov 11, 2020

The issue was reproduced even when antrea-agent was running and able to process CNI Del requests.

Confirmed this is due to competition between the portmap processes and kube-proxy. There are mainly three iptables clients involved: portmap, antrea-agent, and kube-proxy:
After the Node restarted, all containers were dead, so kubelet/containerd tried to recreate them; since the old sandbox containers have to be deleted first, hundreds of portmap processes were invoked to execute CNI Del. In the meantime, kube-proxy started and installed more than 200,000 iptables rules, which made every iptables operation very expensive. As a result, hundreds of portmap processes and kube-proxy competed for the xtables lock. Even when a portmap process acquired the lock, it needed to execute 4 iptables operations to finish its job, each taking tens of seconds, so the 4 operations alone could consume most of the 2 minutes kubelet allows before the CNI call is cancelled and counted as a failure. Since a portmap process was unlikely to win the lock 4 times in a row within that window, it had a great chance to fail, and kubelet then retried it repeatedly.

#1497 made an improvement in Antrea so that antrea-agent no longer exits when it can't acquire the xtables lock; if a CNI request gets past the portmap process, antrea-agent can process it and the whole CNI request succeeds. However, in this case only a few Pods got past portmap because of the competition and timeout described above (antrea-agent received only a few CNI Del requests; most of them failed in portmap).
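The general shape of that change, sketched below under simplifying assumptions (a plain retry loop with made-up names, not the actual #1497 diff), is to keep the agent alive and retry iptables initialization in the background so the CNI server can start serving immediately.

package main

import (
	"errors"
	"time"

	"k8s.io/klog"
)

// initIPTablesAsync retries initRules in a background goroutine instead of
// treating a failed attempt as fatal, so the rest of the agent (including
// the CNI server) keeps running while the xtables lock is contended.
func initIPTablesAsync(initRules func() error, stopCh <-chan struct{}) {
	go func() {
		for {
			if err := initRules(); err == nil {
				klog.Info("Initialized iptables rules")
				return
			} else {
				klog.Errorf("Failed to initialize iptables rules, will retry: %v", err)
			}
			select {
			case <-time.After(10 * time.Second):
			case <-stopCh:
				return
			}
		}
	}()
}

func main() {
	stopCh := make(chan struct{})
	attempts := 0
	initIPTablesAsync(func() error {
		attempts++
		if attempts < 3 {
			return errors.New("xtables lock is held by another process")
		}
		return nil
	}, stopCh)
	// The CNI server would be started here and keep serving requests while
	// the iptables rules are still being initialized.
	time.Sleep(40 * time.Second)
}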

This is not an issue specific to Antrea: antrea-agent plays no role in the xtables lock competition here and doesn't fail the CNI Del requests that manage to reach it. Any CNI plugin at this scale would hit the same issue.

While I was working on a fix on the portmap plugin side, I found that an optimization in the latest release can already fix this issue: containernetworking/plugins@877602d.

So we can simply bump the portmap binary packaged into the Antrea image to fix it; I have created a PR for this: #1534
