
bridge: host ARP entries are not updated via arp_notify sysctl #756

Closed
xh4n3 opened this issue Jun 24, 2022 · 10 comments

xh4n3 commented Jun 24, 2022

Hi 👋, we recently upgraded the bridge plugin from 0.8.x to 1.1.1 and found that some newly created pods are not accessible from the host.

The downtime is about ten seconds, and it only appears when the new pod reuses a previously used IP (one already released back to the pool). We searched for the IP with ip neighbor and first saw a stale entry with the old MAC address; the entry then changed to a failed state and was finally replaced by the new MAC address.
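
For reference, the transitions can also be watched live with ip monitor; a rough sketch (grep for whichever pod IP is being reused):

ip monitor neigh    # streams neighbour table changes as they happen (STALE, FAILED, REACHABLE, ...)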

After downgrading back to 0.8.x, I can no longer reproduce this issue. A suspicious cause might be #687

We ran a tcpdump and found that the arp_notify sysctl and gratuitous arping generate different ARP requests. That seems to be the reason the old entry isn't replaced right away.

The ARP request sent via arp_notify, in which case the old ARP entry doesn't go away for about 10 seconds:
[screenshot]

The ARP request sent by gratuitous arping, in which case the old ARP entry is replaced right away:
[screenshot]

There is a slight difference in the Target MAC address. My guess is that the arp_notify ARP doesn't get propagated through the bridge interface, so the host never catches it?
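
For anyone who wants to reproduce the capture, this is roughly what we ran (the bridge name cni0 and the host-side veth name are assumptions here):

tcpdump -e -n -i cni0 arp         # does the announcement ever show up on the bridge itself?
tcpdump -e -n -i <host-veth> arp  # compare with what leaves the pod's host-side veth port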

Do you have any ideas? Thanks in advance. 😃

mccv1r0 (Member) commented Jun 24, 2022

A Target MAC address of 00:00:00:00:00:00 looks legit. See "The 'target hardware address' field is ignored and SHOULD be set to all zeroes." (RFC 5227) and "The target hardware address is included for completeness and network monitoring. It has no meaning in the request form, since it is this number that the machine is requesting." (RFC 826)

Is there any difference in the Destination MAC between the two?

I'd expect the L2 Destination MAC to be the broadcast address ff:ff:ff:ff:ff:ff. A quick test with cnitool doesn't show any destination address set in what I capture. I'll play around some more.
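
Roughly the kind of test I mean, as a sketch only (the netconf name mynet and the paths are assumptions):

ip netns add testing
tcpdump -e -n -i cni0 arp &    # watch for the announcement and its L2 destination MAC
CNI_PATH=/opt/cni/bin NETCONFPATH=/etc/cni/net.d cnitool add mynet /var/run/netns/testing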

Also, what system, kernel, etc. are you using?

xh4n3 (Author) commented Jun 27, 2022

@mccv1r0 Thanks!

The Destination MAC addresses are the same, ff:ff:ff:ff:ff:ff, in my case. The only difference is the Target MAC address.

arping on the left, arp_notify on the right:
[screenshot]

In fact, this difference comes from their code: the kernel leaves target_hw blank, while arping sets target_hw to the broadcast address (target_hw is then copied to the destination address later); see kernel and arping.

I am using a customized OS with cloud-kernel; the kernel has some modifications on top of the 4.19 mainline. The ARP-sending arguments in the inetdev_send_gratuitous_arp function are identical to the mainline ones, see cloud-kernel devinet.
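
For what it's worth, the same kernel path can be exercised by hand; a minimal sketch (the netns name and eth0 are assumptions), since inetdev_send_gratuitous_arp runs on link-up when arp_notify is set:

ip netns exec <pod-netns> sysctl -w net.ipv4.conf.eth0.arp_notify=1
ip netns exec <pod-netns> ip link set eth0 down
ip netns exec <pod-netns> ip link set eth0 up    # link-up with arp_notify=1 makes the kernel announce each IPv4 address on eth0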

We traced it a bit and found that the arp_notify ARP announcement packets always get dropped at the br_handle_frame symbol. I'll try to dig further.

xh4n3 (Author) commented Jun 29, 2022

I used a bpftrace script to track why the ARP announcement gets dropped.

It turns out that br_handle_frame was called while the bridge port's state was still disabled, so the announcement never made it to br_handle_frame_finish, where ARP entries get updated.
State check here.
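
Roughly, the script boils down to something like this (a trimmed-down sketch, not the full script):

bpftrace -e '
kprobe:br_set_state    { time("%H:%M:%S "); printf("br_set_state    new_state=%d\n", arg1); }
kprobe:br_handle_frame { time("%H:%M:%S "); printf("br_handle_frame\n"); }
'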

There are some conditions for this to happen:

  1. the node must have no other pods using the container network; in other words, the newly created pod is the first pod on this node that uses the container network
  2. the newly created pod must not send out any requests on startup, otherwise the pod announces its ARP by itself

BTW, I can reproduce this issue on CentOS 3.10.0-1160.53.1.el7.x86_64 as well, but I haven't tried a newer kernel.

Briefly, there seems to be a race condition in which the arp_notify ARP announcement is handled before the bridge port is ready, so the announcement gets ignored.

MikeZappa87 (Contributor) commented

> I used a bpftrace script to track why the ARP announcement gets dropped. […]

Can you share that script?

dcbw (Member) commented Jul 6, 2022

What if we set forward_delay=0? Does that "fix" the race?
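
For a quick test on an existing bridge, something like this should do (a sketch; the bridge name cni0 is an assumption):

brctl setfd cni0 0                             # forward delay in seconds (bridge-utils)
cat /sys/class/net/cni0/bridge/forward_delay   # verify; 0 means ports skip the listening/learning delay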

xh4n3 (Author) commented Jul 7, 2022

> Can you share that script?

Uploaded to https://gist.github.com/xh4n3/0a848898064d9a1587528d41e69c5803 with script outputs.

xh4n3 (Author) commented Jul 7, 2022

> What if we set forward_delay=0? Does that "fix" the race?

I tried it; the 10-second delay didn't go away.

# ip neigh add 172.20.0.139 dev cni0 lladdr 86:03:17:7a:eb:6e nud stale // add a stale entry to mock `reusing IP` 
# while true; do date; ip -s neigh  | grep 172.20.0.139; sleep 1; done
Thu Jul  7 10:55:45 CST 2022
172.20.0.139 dev cni0 lladdr 86:03:17:7a:eb:6e used 78/138/78 probes 0 STALE
Thu Jul  7 10:55:46 CST 2022
172.20.0.139 dev cni0 lladdr 86:03:17:7a:eb:6e used 79/139/79 probes 0 STALE
Thu Jul  7 10:55:47 CST 2022
172.20.0.139 dev cni0 lladdr 86:03:17:7a:eb:6e used 80/140/80 probes 0 STALE
Thu Jul  7 10:55:48 CST 2022 // at this time, I start to ping 172.20.0.139, the ping hangs
172.20.0.139 dev cni0 lladdr 86:03:17:7a:eb:6e ref 1 used 0/141/0 probes 0 DELAY
Thu Jul  7 10:55:49 CST 2022
172.20.0.139 dev cni0 lladdr 86:03:17:7a:eb:6e ref 1 used 0/142/1 probes 0 DELAY
Thu Jul  7 10:55:50 CST 2022
172.20.0.139 dev cni0 lladdr 86:03:17:7a:eb:6e ref 1 used 0/143/2 probes 0 DELAY
Thu Jul  7 10:55:51 CST 2022
172.20.0.139 dev cni0 lladdr 86:03:17:7a:eb:6e ref 1 used 0/144/3 probes 0 DELAY
Thu Jul  7 10:55:52 CST 2022
172.20.0.139 dev cni0 lladdr 86:03:17:7a:eb:6e ref 1 used 0/145/4 probes 0 DELAY
Thu Jul  7 10:55:53 CST 2022
172.20.0.139 dev cni0 lladdr 86:03:17:7a:eb:6e ref 1 used 0/146/0 probes 1 PROBE
Thu Jul  7 10:55:54 CST 2022
172.20.0.139 dev cni0 lladdr 86:03:17:7a:eb:6e ref 1 used 0/147/1 probes 2 PROBE
Thu Jul  7 10:55:55 CST 2022
172.20.0.139 dev cni0 lladdr 86:03:17:7a:eb:6e ref 1 used 0/148/2 probes 3 PROBE
Thu Jul  7 10:55:56 CST 2022 // pong received
172.20.0.139 dev cni0 lladdr 06:1c:62:4b:0d:1e ref 1 used 0/0/0 probes 4 REACHABLE
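
For what it's worth, the timeline above looks consistent with the neighbour state machine defaults rather than the bridge forward delay: about 5 seconds in DELAY, then unicast probes to the stale MAC, then a broadcast probe that the new pod finally answers. These are the knobs involved (defaults noted in the comments; per-device values may differ):

sysctl net.ipv4.neigh.default.delay_first_probe_time   # seconds spent in DELAY before probing (default 5)
sysctl net.ipv4.neigh.default.retrans_time_ms          # gap between probes in milliseconds (default 1000)
sysctl net.ipv4.neigh.default.ucast_solicit            # unicast probes before falling back to broadcast (default 3)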

MikeZappa87 (Contributor) commented

With dmesg I only see

[142421.463071] cni0: port 1(veth0) entered blocking state
[142421.463073] cni0: port 1(veth0) entered forwarding state

The transition from blocking to forwarding is quite fast, and it looks like, in the trace above, the port never entered a forwarding state?
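
To see every port state change instead of only what dmesg prints, something like this could help (a sketch; veth0 is an assumption):

bridge monitor link              # streams port state changes (disabled/blocking/forwarding) over netlink
bridge -d link show dev veth0    # current state of a single port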

xh4n3 (Author) commented Jul 8, 2022

> The transition from blocking to forwarding is quite fast, and it looks like, in the trace above, the port never entered a forwarding state?

I don't think dmesg can provide a detailed transition log.

In the gist above, the arp_notify ARP packet was handled in br_handle_frame before the host veth interface (ifindex 60) was set to state 3 (BR_STATE_FORWARDING).

...
10:03:16 br_set_state    triggered: dev_ifindex 60	new_state 0
...
10:03:16 br_handle_frame triggered: dev_ifindex 60	source_mac \xce\xb1}N\xda\x12	dest_mac \xff\xff\xff\xff\xff\xff	bridge_port_state 0	bridge_ifindex 11
...
10:03:16 br_set_state    triggered: dev_ifindex 60	new_state 3
...

DrmagicE commented

Any update on this? I'm facing the same issue with CNI v1.2.0 on CentOS 7.6 (3.10.0-1160.el7.x86_64).
It takes about 8-10 seconds for the first ping from the host to reach a newly created pod when a pod IP is reused.

It seems the MAC address of the new pod is not populated in the host's ARP table:

# 1. The pod with mac 22:47:9a:ce:70:88 is deleted.
# 2. The new pod reusing the same IP 10.244.0.13 is created, but we still get the old mac address.
[root@localhost ~]# ip n | grep 10.244.0.13
10.244.0.13 dev cni0 lladdr 22:47:9a:ce:70:88 STALE
[root@localhost ~]# ping 10.244.0.13
PING 10.244.0.13 (10.244.0.13) 56(84) bytes of data.
64 bytes from 10.244.0.13: icmp_seq=10 ttl=64 time=0.210 ms
64 bytes from 10.244.0.13: icmp_seq=11 ttl=64 time=0.116 ms
64 bytes from 10.244.0.13: icmp_seq=12 ttl=64 time=0.090 ms
^C
--- 10.244.0.13 ping statistics ---
12 packets transmitted, 3 received, 75% packet loss, time 11018ms
rtt min/avg/max/mdev = 0.090/0.138/0.210/0.053 ms
# After the slow ping, the host can get the right mac address for the new pod.
[root@localhost ~]# ip n | grep 10.244.0.13
10.244.0.13 dev cni0 lladdr 2e:db:fb:ac:d1:d6 REACHABLE

The issue goes away after downgrading the bridge plugin back to 0.9.1.
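
As a possible stop-gap until the plugin is fixed (just a sketch; the IP, bridge name, and pod netns are assumptions), clearing the stale entry or announcing from inside the new pod's netns avoids the slow first ping:

ip neigh del 10.244.0.13 dev cni0                              # drop the stale entry so the next ping resolves fresh
ip netns exec <pod-netns> arping -c 1 -U -I eth0 10.244.0.13   # or send a gratuitous ARP from the pod (iputils arping)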
