
bridge: host ARP entries are not updated via arp_notify sysctl #756

Closed
xh4n3 opened this issue Jun 24, 2022 · 10 comments

xh4n3 commented Jun 24, 2022

Hi 👋, we recently upgraded the bridge plugin from 0.8.x to 1.1.1 and found that some newly created pods are not accessible from the host.

The downtime is about ten seconds, and it only appears when the new pod reuses a previously used IP (one already released back to the pool). We searched for the IP with ip neighbor and first saw a stale entry with the old MAC address; the entry then changed to a failed state and was finally replaced by the new MAC address.
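
For reference, the transitions can also be watched live with ip monitor; a rough sketch (grep for whichever pod IP is being reused):

ip monitor neigh    # streams neighbour table changes as they happen (STALE, FAILED, REACHABLE, ...)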

After downgrading back to 0.8.x, I can no longer reproduce this issue. A suspicious cause might be #687

We ran a tcpdump and found that the arp_notify sysctl and gratuitous arping generate different ARP requests. That seems to be the reason the old entry isn't replaced right away.

The ARP request sent via arp_notify, in which case the old ARP entry doesn't go away for about 10 seconds:
[screenshot]

The ARP request sent by gratuitous arping, in which case the old ARP entry is replaced right away:
[screenshot]

There is a slight difference in the Target MAC address. My guess is that the arp_notify ARP doesn't get propagated through the bridge interface, so the host never catches it?
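
For anyone who wants to reproduce the capture, this is roughly what we ran (the bridge name cni0 and the host-side veth name are assumptions here):

tcpdump -e -n -i cni0 arp         # does the announcement ever show up on the bridge itself?
tcpdump -e -n -i <host-veth> arp  # compare with what leaves the pod's host-side veth port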

Do you have any ideas? Thanks in advance. 😃

mccv1r0 (Member) commented Jun 24, 2022

A Target MAC address of 00:00:00:00:00:00 looks legit. See "The 'target hardware address' field is ignored and SHOULD be set to all zeroes." (RFC 5227) and "The target hardware address is included for completeness and network monitoring. It has no meaning in the request form, since it is this number that the machine is requesting." (RFC 826)

Is there any difference in the Destination MAC between the two?

I'd expect the L2 Destination MAC to be the broadcast address ff:ff:ff:ff:ff:ff. A quick test with cnitool doesn't show any destination address set in what I capture. I'll play around some more.
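
Roughly the kind of test I mean, as a sketch only (the netconf name mynet and the paths are assumptions):

ip netns add testing
tcpdump -e -n -i cni0 arp &    # watch for the announcement and its L2 destination MAC
CNI_PATH=/opt/cni/bin NETCONFPATH=/etc/cni/net.d cnitool add mynet /var/run/netns/testing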

Also, what system, kernel, etc. are you using?

xh4n3 (Author) commented Jun 27, 2022

@mccv1r0 Thanks!

The Destination MAC addresses are the same, ff:ff:ff:ff:ff:ff, in my case. The only difference is the Target MAC address.

arping on the left, arp_notify on the right:
[screenshot]

In fact, this difference comes from their code: the kernel leaves target_hw blank, while arping sets target_hw to the broadcast address (target_hw is then copied to the destination address later); see kernel and arping.

I am using a customized OS with cloud-kernel; the kernel has some modifications on top of the 4.19 mainline. The ARP-sending arguments in the inetdev_send_gratuitous_arp function are identical to the mainline ones, see cloud-kernel devinet.
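
For what it's worth, the same kernel path can be exercised by hand; a minimal sketch (the netns name and eth0 are assumptions), since inetdev_send_gratuitous_arp runs on link-up when arp_notify is set:

ip netns exec <pod-netns> sysctl -w net.ipv4.conf.eth0.arp_notify=1
ip netns exec <pod-netns> ip link set eth0 down
ip netns exec <pod-netns> ip link set eth0 up    # link-up with arp_notify=1 makes the kernel announce each IPv4 address on eth0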

We traced it a bit and found that the arp_notify ARP announcement packets always get dropped at the br_handle_frame symbol. I'll try to dig further.

xh4n3 (Author) commented Jun 29, 2022

I used a bpftrace script to track why the ARP announcement gets dropped.

It turns out that br_handle_frame was called while the bridge port's state was still disabled, so the announcement never made it to br_handle_frame_finish, where ARP entries get updated.
State check here.
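
Roughly, the script boils down to something like this (a trimmed-down sketch, not the full script):

bpftrace -e '
kprobe:br_set_state    { time("%H:%M:%S "); printf("br_set_state    new_state=%d\n", arg1); }
kprobe:br_handle_frame { time("%H:%M:%S "); printf("br_handle_frame\n"); }
'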

There are some conditions for this to happen:

  1. the node must have no other pods using the container network; in other words, the newly created pod is the first pod on this node that uses the container network
  2. the newly created pod must not send out any requests on startup, otherwise the pod announces its ARP by itself

BTW, I can reproduce this issue on CentOS 3.10.0-1160.53.1.el7.x86_64 as well, but I haven't tried a newer kernel.

Briefly, there seems to be a race condition in which the arp_notify ARP announcement is handled before the bridge port is ready, so the announcement gets ignored.

MikeZappa87 (Contributor) commented

> I used a bpftrace script to track why the ARP announcement gets dropped. […]

Can you share that script?

dcbw (Member) commented Jul 6, 2022

What if we set forward_delay=0? Does that "fix" the race?
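
For a quick test on an existing bridge, something like this should do (a sketch; the bridge name cni0 is an assumption):

brctl setfd cni0 0                             # forward delay in seconds (bridge-utils)
cat /sys/class/net/cni0/bridge/forward_delay   # verify; 0 means ports skip the listening/learning delay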

xh4n3 (Author) commented Jul 7, 2022

> Can you share that script?

Uploaded to https://gist.github.com/xh4n3/0a848898064d9a1587528d41e69c5803 with script outputs.

xh4n3 (Author) commented Jul 7, 2022

> What if we set forward_delay=0? Does that "fix" the race?

I tried it; the 10-second delay didn't go away.

# ip neigh add 172.20.0.139 dev cni0 lladdr 86:03:17:7a:eb:6e nud stale // add a stale entry to mock `reusing IP` 
# while true; do date; ip -s neigh  | grep 172.20.0.139; sleep 1; done
Thu Jul  7 10:55:45 CST 2022
172.20.0.139 dev cni0 lladdr 86:03:17:7a:eb:6e used 78/138/78 probes 0 STALE
Thu Jul  7 10:55:46 CST 2022
172.20.0.139 dev cni0 lladdr 86:03:17:7a:eb:6e used 79/139/79 probes 0 STALE
Thu Jul  7 10:55:47 CST 2022
172.20.0.139 dev cni0 lladdr 86:03:17:7a:eb:6e used 80/140/80 probes 0 STALE
Thu Jul  7 10:55:48 CST 2022 // at this time, I start to ping 172.20.0.139, the ping hangs
172.20.0.139 dev cni0 lladdr 86:03:17:7a:eb:6e ref 1 used 0/141/0 probes 0 DELAY
Thu Jul  7 10:55:49 CST 2022
172.20.0.139 dev cni0 lladdr 86:03:17:7a:eb:6e ref 1 used 0/142/1 probes 0 DELAY
Thu Jul  7 10:55:50 CST 2022
172.20.0.139 dev cni0 lladdr 86:03:17:7a:eb:6e ref 1 used 0/143/2 probes 0 DELAY
Thu Jul  7 10:55:51 CST 2022
172.20.0.139 dev cni0 lladdr 86:03:17:7a:eb:6e ref 1 used 0/144/3 probes 0 DELAY
Thu Jul  7 10:55:52 CST 2022
172.20.0.139 dev cni0 lladdr 86:03:17:7a:eb:6e ref 1 used 0/145/4 probes 0 DELAY
Thu Jul  7 10:55:53 CST 2022
172.20.0.139 dev cni0 lladdr 86:03:17:7a:eb:6e ref 1 used 0/146/0 probes 1 PROBE
Thu Jul  7 10:55:54 CST 2022
172.20.0.139 dev cni0 lladdr 86:03:17:7a:eb:6e ref 1 used 0/147/1 probes 2 PROBE
Thu Jul  7 10:55:55 CST 2022
172.20.0.139 dev cni0 lladdr 86:03:17:7a:eb:6e ref 1 used 0/148/2 probes 3 PROBE
Thu Jul  7 10:55:56 CST 2022 // pong received
172.20.0.139 dev cni0 lladdr 06:1c:62:4b:0d:1e ref 1 used 0/0/0 probes 4 REACHABLE
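
For what it's worth, the timeline above looks consistent with the neighbour state machine defaults rather than the bridge forward delay: about 5 seconds in DELAY, then unicast probes to the stale MAC, then a broadcast probe that the new pod finally answers. These are the knobs involved (defaults noted in the comments; per-device values may differ):

sysctl net.ipv4.neigh.default.delay_first_probe_time   # seconds spent in DELAY before probing (default 5)
sysctl net.ipv4.neigh.default.retrans_time_ms          # gap between probes in milliseconds (default 1000)
sysctl net.ipv4.neigh.default.ucast_solicit            # unicast probes before falling back to broadcast (default 3)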

MikeZappa87 (Contributor) commented

With dmesg I only see

[142421.463071] cni0: port 1(veth0) entered blocking state
[142421.463073] cni0: port 1(veth0) entered forwarding state

The transition from blocking to forwarding is quite fast, and it looks like, in the trace above, the port never entered a forwarding state?
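
To see every port state change instead of only what dmesg prints, something like this could help (a sketch; veth0 is an assumption):

bridge monitor link              # streams port state changes (disabled/blocking/forwarding) over netlink
bridge -d link show dev veth0    # current state of a single port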

xh4n3 (Author) commented Jul 8, 2022

> The transition from blocking to forwarding is quite fast, and it looks like, in the trace above, the port never entered a forwarding state?

I don't think dmesg can provide a detailed transition log.

In the gist above, the arp_notify ARP packet was handled in br_handle_frame before the host veth interface (ifindex 60) was set to state 3 (BR_STATE_FORWARDING).

...
10:03:16 br_set_state    triggered: dev_ifindex 60	new_state 0
...
10:03:16 br_handle_frame triggered: dev_ifindex 60	source_mac \xce\xb1}N\xda\x12	dest_mac \xff\xff\xff\xff\xff\xff	bridge_port_state 0	bridge_ifindex 11
...
10:03:16 br_set_state    triggered: dev_ifindex 60	new_state 3
...

DrmagicE commented

Any update on this? I'm facing the same issue with CNI v1.2.0 on CentOS 7.6 (3.10.0-1160.el7.x86_64).
It takes about 8-10 seconds for the first ping from the host to reach a newly created pod when a pod IP is reused.

It seems the MAC address of the new pod is not populated in the host's ARP table:

# 1. The pod with mac 22:47:9a:ce:70:88 is deleted.
# 2. The new pod reusing the same IP 10.244.0.13 is created, but we still get the old mac address.
[root@localhost ~]# ip n | grep 10.244.0.13
10.244.0.13 dev cni0 lladdr 22:47:9a:ce:70:88 STALE
[root@localhost ~]# ping 10.244.0.13
PING 10.244.0.13 (10.244.0.13) 56(84) bytes of data.
64 bytes from 10.244.0.13: icmp_seq=10 ttl=64 time=0.210 ms
64 bytes from 10.244.0.13: icmp_seq=11 ttl=64 time=0.116 ms
64 bytes from 10.244.0.13: icmp_seq=12 ttl=64 time=0.090 ms
^C
--- 10.244.0.13 ping statistics ---
12 packets transmitted, 3 received, 75% packet loss, time 11018ms
rtt min/avg/max/mdev = 0.090/0.138/0.210/0.053 ms
# After the slow ping, the host can get the right mac address for the new pod.
[root@localhost ~]# ip n | grep 10.244.0.13
10.244.0.13 dev cni0 lladdr 2e:db:fb:ac:d1:d6 REACHABLE

The issue goes away after downgrading the bridge plugin back to 0.9.1.
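
As a possible stop-gap until the plugin is fixed (just a sketch; the IP, bridge name, and pod netns are assumptions), clearing the stale entry or announcing from inside the new pod's netns avoids the slow first ping:

ip neigh del 10.244.0.13 dev cni0                              # drop the stale entry so the next ping resolves fresh
ip netns exec <pod-netns> arping -c 1 -U -I eth0 10.244.0.13   # or send a gratuitous ARP from the pod (iputils arping)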
