
Egress UDP VXLAN packet dropped by kernel/netfilter ? #17528

Closed
vincentmli opened this issue Oct 4, 2021 · 23 comments
Labels
kind/bug This is a bug in the Cilium logic.

Comments

@vincentmli
Contributor

Bug report

While testing PR #17370 (comment), the ping test works for a couple of minutes and then stops. I noticed that the egress UDP VXLAN packets with destination port 8472 get dropped: tcpdump on the k8s node interface shows no packets with destination port 8472 going to the remote external VXLAN device. I used an iptables TRACE to follow the VXLAN packet.
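
For reference, a trace like the one below can be set up roughly as follows (a sketch; the exact nf_log backend setup may differ by distribution):

# trace outgoing VXLAN packets (UDP destination port 8472) through the netfilter chains
iptables -t raw -A OUTPUT -p udp --dport 8472 -j TRACE
# send the trace events to the kernel log
modprobe nf_log_ipv4
sysctl -w net.netfilter.nf_log.2=nf_log_ipv4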

Trace for the failing ping test:

Oct  3 23:52:30 centos-dev kernel: TRACE: raw:OUTPUT:policy:1 IN= OUT=ens192 SRC=10.169.72.233 DST=10.169.72.236 LEN=134 TOS=0x00 PREC=0x00 TTL=64 ID=43679 PROTO=UDP SPT=57346 DPT=8472 LEN=114
Oct  3 23:52:30 centos-dev kernel: TRACE: mangle:OUTPUT:policy:1 IN= OUT=ens192 SRC=10.169.72.233 DST=10.169.72.236 LEN=134 TOS=0x00 PREC=0x00 TTL=64 ID=43679 PROTO=UDP SPT=57346 DPT=8472 LEN=114
Oct  3 23:52:30 centos-dev kernel: TRACE: filter:OUTPUT:policy:1 IN= OUT=ens192 SRC=10.169.72.233 DST=10.169.72.236 LEN=134 TOS=0x00 PREC=0x00 TTL=64 ID=43679 PROTO=UDP SPT=57346 DPT=8472 LEN=114
Oct  3 23:52:30 centos-dev kernel: TRACE: mangle:POSTROUTING:policy:1 IN= OUT=ens192 SRC=10.169.72.233 DST=10.169.72.236 LEN=134 TOS=0x00 PREC=0x00 TTL=64 ID=43679 PROTO=UDP SPT=57346 DPT=8472 LEN=114 MARK=0xc00

Trace for the working ping test:

Oct  3 23:59:51 centos-dev kernel: TRACE: raw:OUTPUT:policy:1 IN= OUT=ens192 SRC=10.169.72.233 DST=10.169.72.236 LEN=134 TOS=0x00 PREC=0x00 TTL=64 ID=7890 PROTO=UDP SPT=57346 DPT=8472 LEN=114
Oct  3 23:59:51 centos-dev kernel: TRACE: mangle:OUTPUT:policy:1 IN= OUT=ens192 SRC=10.169.72.233 DST=10.169.72.236 LEN=134 TOS=0x00 PREC=0x00 TTL=64 ID=7890 PROTO=UDP SPT=57346 DPT=8472 LEN=114
Oct  3 23:59:51 centos-dev kernel: TRACE: nat:OUTPUT:policy:1 IN= OUT=ens192 SRC=10.169.72.233 DST=10.169.72.236 LEN=134 TOS=0x00 PREC=0x00 TTL=64 ID=7890 PROTO=UDP SPT=57346 DPT=8472 LEN=114
Oct  3 23:59:51 centos-dev kernel: TRACE: filter:OUTPUT:policy:1 IN= OUT=ens192 SRC=10.169.72.233 DST=10.169.72.236 LEN=134 TOS=0x00 PREC=0x00 TTL=64 ID=7890 PROTO=UDP SPT=57346 DPT=8472 LEN=114
Oct  3 23:59:51 centos-dev kernel: TRACE: mangle:POSTROUTING:policy:1 IN= OUT=ens192 SRC=10.169.72.233 DST=10.169.72.236 LEN=134 TOS=0x00 PREC=0x00 TTL=64 ID=7890 PROTO=UDP SPT=57346 DPT=8472 LEN=114 MARK=0xc00
Oct  3 23:59:51 centos-dev kernel: TRACE: nat:POSTROUTING:policy:1 IN= OUT=ens192 SRC=10.169.72.233 DST=10.169.72.236 LEN=134 TOS=0x00 PREC=0x00 TTL=64 ID=7890 PROTO=UDP SPT=57346 DPT=8472 LEN=114 MARK=0xc00

Note that the working trace has TRACE: nat:OUTPUT:policy:1 IN= OUT=ens192 and TRACE: nat:POSTROUTING:policy:1 IN= OUT=ens192 entries; the failing trace is missing those nat chain traversals.
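
As a cross-check, the nat OUTPUT/POSTROUTING chains are normally only traversed by packets that create a new conntrack entry, so it can help to check whether a conntrack entry already exists for the tunnel flow, for example:

conntrack -L -p udp 2>/dev/null | grep 8472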

Also, there is no test failure on a bpf-next kernel; the test failed on the 5.2 and 5.4 kernels so far. I am going to test more kernel versions.

The test starts working again after deleting the cilium agent pod so that it restarts, but after a couple of minutes it fails again.
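
For reference, restarting the agent amounts to something like the following (the DaemonSet recreates the pod):

kubectl -n kube-system delete pod -l k8s-app=cilium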

General Information

  • Cilium version (run cilium version)
    Tested on the cilium 1.10.4 stable release and the master branch; the stable release failed less often, the master branch more often.
  • Kernel version (run uname -a)
    Tested 5.2, 5.4, and bpf-next so far; 5.2 and 5.4 failed.
  • Orchestration system version in use (e.g. kubectl version, ...)
    v1.21.1
  • Link to relevant artifacts (policies, deployment scripts, ...)
  • Generate and upload a system zip:
curl -sLO https://git.io/cilium-sysdump-latest.zip && python cilium-sysdump-latest.zip

cilium-sysdump-20211004-122357.zip

How to reproduce the issue

  1. Deploy the container image vli39/cilium:ipcache, which has #17370 built in and the feature enabled, with the attached cilium-ipcache.yaml.txt manifest.
  2. Change the API server IP and the external Linux machine IP accordingly, then deploy a busybox pod and ping any IP address within the external VXLAN CIDR (a sketch of this step follows the list).
  3. Run the attached VXLAN responder scapy script fake.py.txt on the external Linux machine.

    fake.py.txt
    cilium-ipcache.yaml.txt
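
A minimal sketch of the busybox deployment and ping step (the deployment name and target IP are just examples; the target must fall within the external VXLAN CIDR):

kubectl create deployment busybox-233 --image=busybox -- sleep 3600
kubectl exec -it deploy/busybox-233 -- ping 10.1.5.236
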
@vincentmli vincentmli added the kind/bug This is a bug in the Cilium logic. label Oct 4, 2021
@vincentmli
Contributor Author

Note that the working trace has TRACE: nat:OUTPUT:policy:1 IN= OUT=ens192 and TRACE: nat:POSTROUTING:policy:1 IN= OUT=ens192 entries; the failing trace is missing those nat chain traversals.

Also, there is no test failure on a bpf-next kernel; the test failed on the 5.2 and 5.4 kernels so far. I am going to test more kernel versions.

While testing kernel 5.10 now, I noticed the iptables trace is also missing the nat chains, but the ping still works:

[root@centos-dev ~]# grep 'ID=33207' /var/log/kern.log 
Oct  4 12:44:24 centos-dev kernel: TRACE: raw:OUTPUT:policy:1 IN= OUT=ens192 SRC=10.169.72.233 DST=10.169.72.236 LEN=134 TOS=0x00 PREC=0x00 TTL=64 ID=33207 PROTO=UDP SPT=58269 DPT=8472 LEN=114 
Oct  4 12:44:24 centos-dev kernel: TRACE: mangle:OUTPUT:policy:1 IN= OUT=ens192 SRC=10.169.72.233 DST=10.169.72.236 LEN=134 TOS=0x00 PREC=0x00 TTL=64 ID=33207 PROTO=UDP SPT=58269 DPT=8472 LEN=114 
Oct  4 12:44:24 centos-dev kernel: TRACE: filter:OUTPUT:policy:1 IN= OUT=ens192 SRC=10.169.72.233 DST=10.169.72.236 LEN=134 TOS=0x00 PREC=0x00 TTL=64 ID=33207 PROTO=UDP SPT=58269 DPT=8472 LEN=114 
Oct  4 12:44:24 centos-dev kernel: TRACE: mangle:POSTROUTING:policy:1 IN=cilium_vxlan OUT=ens192 MAC=82:36:4c:98:2e:56:4a:c9:5f:12:19:2f:08:00 SRC=10.169.72.233 DST=10.169.72.236 LEN=134 TOS=0x00 PREC=0x00 TTL=64 ID=33207 PROTO=UDP SPT=58269 DPT=8472 LEN=114 MARK=0xc00 

Busybox ping output below; ignore the huge time values, which are likely caused by the scapy VXLAN responder script not setting the timestamp correctly. Pinging a real Linux vxlan tunnel device shows the same ping test failure.

64 bytes from 10.1.2.236: seq=320 ttl=63 time=2757243.864 ms
64 bytes from 10.1.2.236: seq=321 ttl=63 time=2758246.001 ms

@vincentmli
Contributor Author

From https://upload.wikimedia.org/wikipedia/commons/3/37/Netfilter-packet-flow.svg, the netfilter output path is: mangle POSTROUTING -> nat POSTROUTING -> egress qdisc -> interface output.

The failing test hits mangle POSTROUTING but not nat POSTROUTING. My understanding is that the packet is unlikely to be dropped by the egress qdisc, because to-netdev is not attached to the device (I don't have the host firewall or node port enabled), and bpftool confirms that:

[root@cilium-dev home]# kubectl exec -it cilium-q84fb -n kube-system -- bpftool net show
Defaulted container "cilium-agent" out of: cilium-agent, clean-cilium-state (init)
xdp:

tc:
ens192(2) clsact/ingress bpf_netdev_ens192.o:[from-netdev] id 675
cilium_net(6) clsact/ingress bpf_host_cilium_net.o:[to-host] id 668
cilium_host(7) clsact/ingress bpf_host.o:[to-host] id 654
cilium_host(7) clsact/egress bpf_host.o:[from-host] id 661
cilium_vxlan(8) clsact/ingress bpf_overlay.o:[from-overlay] id 640
cilium_vxlan(8) clsact/egress bpf_overlay.o:[to-overlay] id 647
lxcfc3f9169921e(10) clsact/ingress bpf_lxc.o:[from-container] id 682

flow_dissector:
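
This can also be cross-checked directly with tc, which should show no egress filter on the native device, for example:

tc filter show dev ens192 ingress
tc filter show dev ens192 egress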

@vincentmli
Contributor Author

I have run the test on the 5.4.150 stable kernel release for a couple of hours with no failure. Maybe there is a kernel bug before the 5.4 stable release.

@vincentmli
Contributor Author

Running the test on stable kernel release 4.9.282, PR #17370 just doesn't work at all. It is unlikely to be a PR #17370 issue though; it just exposed some problem with Cilium on an old kernel.

@vincentmli
Contributor Author

Closing this since the 5.1/5.2/5.3 kernels are end of life and seem unlikely to be widely distributed.

@pchaigno
Member

pchaigno commented Oct 19, 2021

In the bpftool net output above, why was there a BPF program attached to the native device, ens192? You said host firewall and BPF NodePort are disabled, so I would expect the native device to be free of any program. And if a program was attached, I would expect one on egress too.

@vincentmli
Contributor Author

In the bpftool net output above, why was there a BPF program attached to the native device, ens192? You said host firewall and BPF NodePort are disabled, so I would expect the native device to be free of any program. And if a program was attached, I would expect one on egress too.

Yes, both the host firewall and BPF NodePort are disabled, and both the 5.2 and 5.12 kernels show the same ingress BPF program attached to ens192.

# kubectl  get no -o wide 
NAME                      STATUS   ROLES                  AGE    VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION   CONTAINER-RUNTIME
centos-dev.localdomain    Ready    <none>                 126d   v1.21.3   10.169.72.233   <none>        CentOS Linux 8          5.2.0            docker://20.10.7
centos7-dev.localdomain   Ready    <none>                 83d    v1.21.3   10.169.72.234   <none>        CentOS Linux 7 (Core)   4.9.282          docker://20.10.7
cilium-dev                Ready    control-plane,master   126d   v1.21.1   10.169.72.239   <none>        CentOS Linux 8          5.12.0           docker://20.10.7

cilium agent pod

# kubectl get po -l k8s-app=cilium -A -o wide
NAMESPACE     NAME           READY   STATUS             RESTARTS   AGE   IP              NODE                      NOMINATED NODE   READINESS GATES
kube-system   cilium-b7rmx   1/1     Running            0          14m   10.169.72.239   cilium-dev                <none>           <none>
kube-system   cilium-gw2n5   0/1     CrashLoopBackOff   7          14m   10.169.72.234   centos7-dev.localdomain   <none>           <none>
kube-system   cilium-jvm45   1/1     Running            1          14m   10.169.72.233   centos-dev.localdomain    <none>           <none>

5.2 bpftool net show

# kubectl exec -it  cilium-jvm45  -n kube-system bpftool net show
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
Defaulted container "cilium-agent" out of: cilium-agent, clean-cilium-state (init)
xdp:

tc:
ens192(2) clsact/ingress bpf_netdev_ens192.o:[from-netdev] id 483
cilium_net(6) clsact/ingress bpf_host_cilium_net.o:[to-host] id 476
cilium_host(7) clsact/ingress bpf_host.o:[to-host] id 462
cilium_host(7) clsact/egress bpf_host.o:[from-host] id 469
cilium_vxlan(8) clsact/ingress bpf_overlay.o:[from-overlay] id 448
cilium_vxlan(8) clsact/egress bpf_overlay.o:[to-overlay] id 455
lxce02f77b564b4(10) clsact/ingress bpf_lxc.o:[from-container] id 513
lxced61b020143c(12) clsact/ingress bpf_lxc.o:[from-container] id 509
lxcaf9d2c5248c3(14) clsact/ingress bpf_lxc.o:[from-container] id 505
lxc61a6907679d1(16) clsact/ingress bpf_lxc.o:[from-container] id 502

flow_dissector:

5.12 bpftool net show

# kubectl exec -it   cilium-b7rmx   -n kube-system bpftool net show
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
Defaulted container "cilium-agent" out of: cilium-agent, clean-cilium-state (init)
xdp:

tc:
ens192(2) clsact/ingress bpf_netdev_ens192.o:[from-netdev] id 35744
cilium_net(5) clsact/ingress bpf_host_cilium_net.o:[to-host] id 35737
cilium_host(6) clsact/ingress bpf_host.o:[to-host] id 35723
cilium_host(6) clsact/egress bpf_host.o:[from-host] id 35730
cilium_vxlan(7) clsact/ingress bpf_overlay.o:[from-overlay] id 35709
cilium_vxlan(7) clsact/egress bpf_overlay.o:[to-overlay] id 35716
lxc1835fc22517b(9) clsact/ingress bpf_lxc.o:[from-container] id 35751
lxcaa7a3ce9ff20(11) clsact/ingress bpf_lxc.o:[from-container] id 35760
cilium_geneve(12) clsact/ingress bpf_overlay.o:[from-overlay] id 32484
cilium_geneve(12) clsact/egress bpf_overlay.o:[to-overlay] id 32491

flow_dissector:

On the 5.2 kernel, the ping from busybox-233 works for a couple of minutes, then stops working.

# kubectl  get po -o wide
NAME                          READY   STATUS    RESTARTS   AGE     IP           NODE                     NOMINATED NODE   READINESS GATES
busybox-233-ccb4ffc6c-cfvdr   1/1     Running   1          3m52s   10.0.1.71    centos-dev.localdomain   <none>           <none>
busybox-239-f97ff47c5-52sw6   1/1     Running   350        14d     10.0.0.224   cilium-dev               <none>           <none>


On the 5.2 kernel, the busybox-233 pod ping test works for about 160 seconds, then stops working:

# kubectl exec -it busybox-233-ccb4ffc6c-cfvdr  -- ping 10.1.5.236
PING 10.1.5.236 (10.1.5.236): 56 data bytes
64 bytes from 10.1.5.236: seq=0 ttl=63 time=1.813 ms
.......................
64 bytes from 10.1.5.236: seq=159 ttl=63 time=1.097 ms
64 bytes from 10.1.5.236: seq=160 ttl=63 time=1.291 ms

^C
--- 10.1.5.236 ping statistics ---
183 packets transmitted, 161 packets received, 12% packet loss
round-trip min/avg/max = 0.991/1.306/2.246 ms

The 5.4 stable kernel release and above do not have this problem.

@vincentmli vincentmli reopened this Oct 19, 2021
@pchaigno
Member

Yes, both the host firewall and BPF NodePort are disabled, and both the 5.2 and 5.12 kernels show the same ingress BPF program attached to ens192.

Ah, got it. That program is configured at

interfaceNames = append(interfaceNames, device)
symbols = append(symbols, symbolFromHostNetdevEp)
directions = append(directions, dirIngress)

and is loaded probably because you have defined devices (?). Could you try without devices set?
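
For example, assuming devices was set through the cilium-config ConfigMap, removing it would look roughly like this (the agent pods need a restart to pick up the change):

kubectl -n kube-system edit configmap cilium-config    # remove the "devices" entry
kubectl -n kube-system delete pod -l k8s-app=cilium    # restart the agents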

@vincentmli
Contributor Author

I removed the devices setting; the same ingress program is still attached to the physical devices, and the problem persists.

# kubectl  exec -it  cilium-7p77j  -n kube-system -- bpftool net show
Defaulted container "cilium-agent" out of: cilium-agent, clean-cilium-state (init)
xdp:

tc:
ens192(2) clsact/ingress bpf_netdev_ens192.o:[from-netdev] id 57525
ens224(3) clsact/ingress bpf_netdev_ens224.o:[from-netdev] id 57532
cilium_net(5) clsact/ingress bpf_host_cilium_net.o:[to-host] id 57518
cilium_host(6) clsact/ingress bpf_host.o:[to-host] id 57504
cilium_host(6) clsact/egress bpf_host.o:[from-host] id 57511
cilium_vxlan(7) clsact/ingress bpf_overlay.o:[from-overlay] id 57490
cilium_vxlan(7) clsact/egress bpf_overlay.o:[to-overlay] id 57497
lxcaa7a3ce9ff20(11) clsact/ingress bpf_lxc.o:[from-container] id 58383
cilium_geneve(12) clsact/ingress bpf_overlay.o:[from-overlay] id 56806
cilium_geneve(12) clsact/egress bpf_overlay.o:[to-overlay] id 56813

flow_dissector:

@vincentmli
Contributor Author

This issue seems like a good test candidate for https://github.com/cilium/pwru; I am going to try pwru.

@vincentmli
Contributor Author

What is weird now is that the 5.4.155 stable kernel release completely stops working for the same ping test, the upstream 5.4 kernel works for a few minutes and then stops working, and kernel 5.5 has no problem. It looks like we need to get to the bottom of this issue, since the 5.4 stable kernel release is also affected.

@pchaigno
Member

I removed the devices setting; the same ingress program is still attached to the physical devices, and the problem persists.

I still see the from-netdev BPF program attached to two native devices so something is not working as expected here. Could you share a sysdump of that setup?

@vincentmli
Contributor Author

Here is the pwru tuple output for the working and non-working cases.

ping working:

[root@cilium-dev home]# kubectl exec -it busybox-233-ccb4ffc6c-q5fjl   -- ping  -c 1 10.1.5.236
PING 10.1.5.236 (10.1.5.236): 56 data bytes
64 bytes from 10.1.5.236: seq=0 ttl=63 time=2.869 ms

[root@centos-dev pwru]# ./pwru --filter-dst-ip=10.169.72.236 --filter-dst-port=8472 --filter-proto=udp --output-tuple

2021/10/21 09:57:37 Attaching kprobes...
1059 / 1059 [--------------------------------------------------------------------------------------------------] 100.00% 33 p/s
Attached (ignored 0)
2021/10/21 09:58:10 Listening for events..
               SKB         PROCESS                     FUNC        TIMESTAMP
0xffff91c1c5ed8700          [ping]             ip_local_out     233850850321 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c5ed8700          [ping]           __ip_local_out     233850857264 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c5ed8700          [ping]             nf_hook_slow     233850861643 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c5ed8700          [ping]                ip_output     233850889746 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c5ed8700          [ping]             nf_hook_slow     233850893202 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c5ed8700          [ping]         ip_finish_output     233850907189 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c5ed8700          [ping]       __ip_finish_output     233850910615 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c5ed8700          [ping]        ip_finish_output2     233850914052 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c5ed8700          [ping]     neigh_resolve_output     233850918029 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c5ed8700          [ping]       __neigh_event_send     233850921075 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c5ed8700          [ping]                skb_clone     233850924081 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c5ed9800          [ping]              arp_solicit     233850927477 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c5ed9800          [ping]              consume_skb     233850999223 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c5ed9800          [ping]          skb_release_all     233851001838 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c5ed9800          [ping]   skb_release_head_state     233851003922 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c5ed9800          [ping]         skb_release_data     233851006066 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c5ed9800          [ping]             kfree_skbmem     233851008140 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c5ed8700       [<empty>]        ipv4_neigh_lookup     233851592075 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c5ed8700       [<empty>]     neigh_resolve_output     233851600321 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c5ed8700       [<empty>]               eth_header     233851603787 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c5ed8700       [<empty>]                 skb_push     233851607194 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c5ed8700       [<empty>]           dev_queue_xmit     233851610149 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c5ed8700       [<empty>]         __dev_queue_xmit     233851612444 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c5ed8700       [<empty>]      netdev_core_pick_tx     233851615349 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c5ed8700       [<empty>]           netdev_pick_tx     233851617744 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c5ed8700       [<empty>]           __skb_get_hash     233851621922 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c5ed8700       [<empty>]          sch_direct_xmit     233851628454 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c5ed8700       [<empty>]   validate_xmit_skb_list     233851630939 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c5ed8700          [pwru]      __dev_kfree_skb_any     233851684220 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c5ed8700          [pwru]              consume_skb     233851692445 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c5ed8700          [pwru]          skb_release_all     233851695761 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c5ed8700          [pwru]   skb_release_head_state     233851698657 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c5ed8700          [pwru]               sock_wfree     233851701482 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c5ed8700          [pwru]         skb_release_data     233851705730 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c5ed8700          [pwru]            skb_free_head     233851709307 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c5ed8700          [pwru]             kfree_skbmem     233851712643 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c5ed8700       [<empty>]        validate_xmit_skb     233851637962 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c5ed8700       [<empty>]       netif_skb_features     233851640818 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c5ed8700       [<empty>]     skb_network_protocol     233851643623 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c5ed8700       [<empty>]       validate_xmit_xfrm     233851646718 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c5ed8700       [<empty>]      dev_hard_start_xmit     233851649314 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c5ed9800   [ksoftirqd/4]             ip_local_out     233852771838 10.169.72.233:37878->10.169.72.236:8472(udp)
0xffff91c1c5ed9800   [ksoftirqd/4]           __ip_local_out     233852775565 10.169.72.233:37878->10.169.72.236:8472(udp)
0xffff91c1c5ed9800   [ksoftirqd/4]             nf_hook_slow     233852778311 10.169.72.233:37878->10.169.72.236:8472(udp)
0xffff91c1c5ed9800   [ksoftirqd/4]                ip_output     233852784071 10.169.72.233:37878->10.169.72.236:8472(udp)
0xffff91c1c5ed9800   [ksoftirqd/4]             nf_hook_slow     233852786867 10.169.72.233:37878->10.169.72.236:8472(udp)
0xffff91c1c5ed9800   [ksoftirqd/4]         ip_finish_output     233852791666 10.169.72.233:37878->10.169.72.236:8472(udp)
0xffff91c1c5ed9800   [ksoftirqd/4]       __ip_finish_output     233852794281 10.169.72.233:37878->10.169.72.236:8472(udp)
0xffff91c1c5ed9800   [ksoftirqd/4]        ip_finish_output2     233852796936 10.169.72.233:37878->10.169.72.236:8472(udp)
0xffff91c1c5ed9800   [ksoftirqd/4]           dev_queue_xmit     233852800182 10.169.72.233:37878->10.169.72.236:8472(udp)
0xffff91c1c5ed9800   [ksoftirqd/4]         __dev_queue_xmit     233852802837 10.169.72.233:37878->10.169.72.236:8472(udp)
0xffff91c1c5ed9800   [ksoftirqd/4]      netdev_core_pick_tx     233852805752 10.169.72.233:37878->10.169.72.236:8472(udp)
0xffff91c1c5ed9800   [ksoftirqd/4]           netdev_pick_tx     233852808397 10.169.72.233:37878->10.169.72.236:8472(udp)
0xffff91c1c5ed9800   [ksoftirqd/4]           __skb_get_hash     233852811343 10.169.72.233:37878->10.169.72.236:8472(udp)
0xffff91c1c5ed9800   [ksoftirqd/4]          sch_direct_xmit     233852815781 10.169.72.233:37878->10.169.72.236:8472(udp)
0xffff91c1c5ed9800   [ksoftirqd/4]   validate_xmit_skb_list     233852818527 10.169.72.233:37878->10.169.72.236:8472(udp)
0xffff91c1c5ed9800   [ksoftirqd/4]        validate_xmit_skb     233852821101 10.169.72.233:37878->10.169.72.236:8472(udp)
0xffff91c1c5ed9800   [ksoftirqd/4]       netif_skb_features     233852823766 10.169.72.233:37878->10.169.72.236:8472(udp)
0xffff91c1c5ed9800   [ksoftirqd/4]     skb_network_protocol     233852826291 10.169.72.233:37878->10.169.72.236:8472(udp)
0xffff91c1c5ed9800   [ksoftirqd/4]       validate_xmit_xfrm     233852828636 10.169.72.233:37878->10.169.72.236:8472(udp)
0xffff91c1c5ed9800   [ksoftirqd/4]      dev_hard_start_xmit     233852830810 10.169.72.233:37878->10.169.72.236:8472(udp)
0xffff91c1c5ed9800       [<empty>]      __dev_kfree_skb_any     233852854444 10.169.72.233:37878->10.169.72.236:8472(udp)
0xffff91c1c5ed9800       [<empty>]              consume_skb     233852858793 10.169.72.233:37878->10.169.72.236:8472(udp)
0xffff91c1c5ed9800       [<empty>]          skb_release_all     233852861458 10.169.72.233:37878->10.169.72.236:8472(udp)
0xffff91c1c5ed9800       [<empty>]   skb_release_head_state     233852863762 10.169.72.233:37878->10.169.72.236:8472(udp)
0xffff91c1c5ed9800       [<empty>]         skb_release_data     233852866167 10.169.72.233:37878->10.169.72.236:8472(udp)
0xffff91c1c5ed9800       [<empty>]            skb_free_head     233852868812 10.169.72.233:37878->10.169.72.236:8472(udp)
0xffff91c1c5ed9800       [<empty>]             kfree_skbmem     233852871126 10.169.72.233:37878->10.169.72.236:8472(udp)
^C2021/10/21 09:59:21 Received signal, exiting program..

ping not working:

[root@cilium-dev home]# kubectl exec -it busybox-233-ccb4ffc6c-q5fjl   -- ping -c 1  10.1.5.236
PING 10.1.5.236 (10.1.5.236): 56 data bytes

[root@centos-dev pwru]# ./pwru --filter-dst-ip=10.169.72.236 --filter-dst-port=8472 --filter-proto=udp --output-tuple

2021/10/21 10:02:16 Attaching kprobes...
1059 / 1059 [--------------------------------------------------------------------------------------------------] 100.00% 33 p/s
Attached (ignored 0)
2021/10/21 10:02:48 Listening for events..
               SKB         PROCESS                     FUNC        TIMESTAMP
0xffff91c1c792a000          [ping]             ip_local_out     471836971739 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c792a000          [ping]           __ip_local_out     471836978643 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c792a000          [ping]             nf_hook_slow     471836982099 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c792a000          [ping]                ip_output     471837008469 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c792a000          [ping]             nf_hook_slow     471837011685 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c792a000          [ping]         ip_finish_output     471837023047 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c792a000          [ping]       __ip_finish_output     471837025451 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c792a000          [ping]        ip_finish_output2     471837027916 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c792a000          [ping]     neigh_resolve_output     471837030531 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c792a000          [ping]       __neigh_event_send     471837032945 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c792a000          [ping]               eth_header     471837035400 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c792a000          [ping]                 skb_push     471837037474 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c792a000          [ping]           dev_queue_xmit     471837039688 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c792a000          [ping]         __dev_queue_xmit     471837041662 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c792a000          [ping]      netdev_core_pick_tx     471837043916 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c792a000          [ping]           netdev_pick_tx     471837048715 10.169.72.233:51644->10.169.72.236:8472(udp)
0xffff91c1c792a000          [ping]           __skb_get_hash     471837050989 10.169.72.233:51644->10.169.72.236:8472(udp)

@vincentmli
Contributor Author

Looking at the non-working pwru trace above, it seems the packet stops before reaching __dev_xmit_skb:

static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)      
{                                                                                
        struct net_device *dev = skb->dev;                                       
        struct netdev_queue *txq;                                                
        struct Qdisc *q;                                                         
        int rc = -ENOMEM;                                                        
        bool again = false;                                                      
                                                                                 
        skb_reset_mac_header(skb);                                               
                                                                                 
        if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP))            
                __skb_tstamp_tx(skb, NULL, NULL, skb->sk, SCM_TSTAMP_SCHED);     
                                                                                 
        /* Disable soft irqs for various locks below. Also                       
         * stops preemption for RCU.                                             
         */                                                                      
        rcu_read_lock_bh();                                                      
                                                                                 
        skb_update_prio(skb);                                                    
                                                                                 
        qdisc_pkt_len_init(skb);                                                 
#ifdef CONFIG_NET_CLS_ACT                                                        
        skb->tc_at_ingress = 0;                                                  
# ifdef CONFIG_NET_EGRESS                                                        
        if (static_branch_unlikely(&egress_needed_key)) {                        
                skb = sch_handle_egress(skb, &rc, dev);                          
                if (!skb)                                                        
                        goto out;                                                
        }                                                                        
# endif                                                                          
#endif                                                                           
        /* If device/qdisc don't need skb->dst, release it right now while       
         * its hot in this cpu cache.                                            
         */                                                                      
        if (dev->priv_flags & IFF_XMIT_DST_RELEASE)                              
                skb_dst_drop(skb);                                               
        else                                                                     
                skb_dst_force(skb);                                              
                                                                                 
        txq = netdev_core_pick_tx(dev, skb, sb_dev);                             
        q = rcu_dereference_bh(txq->qdisc);                                      
                                                                                 
        trace_net_dev_queue(skb);                                                
        if (q->enqueue) {                                                        
                rc = __dev_xmit_skb(skb, q, dev, txq);                           
                goto out;                                                        
        }                                                                        
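
One way to check whether the device qdisc itself is accounting any drops while the ping is failing (a debugging sketch):

tc -s qdisc show dev ens192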

@vincentmli
Contributor Author

I removed the devices setting; the same ingress program is still attached to the physical devices, and the problem persists.

I still see the from-netdev BPF program attached to two native devices so something is not working as expected here. Could you share a sysdump of that setup?

I may have missed something; is there any reason why from-netdev should not be attached to the native devices? I attached a sysdump with the devices setting added back, since removing the devices setting did not make a difference.
cilium-sysdump-20211021-103609.zip

@vincentmli
Contributor Author

I noticed the VXLAN packets from the ping always use the same source port, 10.169.72.233:51644; maybe something to do with netdev_pick_tx and __skb_get_hash?
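
During the periods when the ping works, the outer source port can also be watched on the wire to confirm this, for example:

tcpdump -ni ens192 udp and dst port 8472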

@vincentmli
Contributor Author

Another pwru trace for the non-working 5.4 kernel on CentOS 8:

[root@centos-dev pwru]# ./pwru --filter-dst-ip=10.169.72.236 --filter-dst-port=8472 --filter-proto=udp --output-tuple


2021/10/21 14:32:23 Attaching kprobes...
1059 / 1059 [-----------------------------------------------------------------------------------------] 100.00% 33 p/s
Attached (ignored 0)
2021/10/21 14:32:56 Listening for events..
               SKB         PROCESS                     FUNC        TIMESTAMP

0xffff9235473d4600          [ping]             ip_local_out     953026790621 10.169.72.233:37728->10.169.72.236:8472(udp)
0xffff9235473d4600          [ping]           __ip_local_out     953026796953 10.169.72.233:37728->10.169.72.236:8472(udp)
0xffff9235473d4600          [ping]             nf_hook_slow     953026800760 10.169.72.233:37728->10.169.72.236:8472(udp)
0xffff9235473d4600          [ping]                ip_output     953026823503 10.169.72.233:37728->10.169.72.236:8472(udp)
0xffff9235473d4600          [ping]             nf_hook_slow     953026825958 10.169.72.233:37728->10.169.72.236:8472(udp)
0xffff9235473d4600          [ping]         ip_finish_output     953026833923 10.169.72.233:37728->10.169.72.236:8472(udp)
0xffff9235473d4600          [ping]       __ip_finish_output     953026836287 10.169.72.233:37728->10.169.72.236:8472(udp)
0xffff9235473d4600          [ping]        ip_finish_output2     953026838522 10.169.72.233:37728->10.169.72.236:8472(udp)
0xffff9235473d4600          [ping]           dev_queue_xmit     953026840976 10.169.72.233:37728->10.169.72.236:8472(udp)
0xffff9235473d4600          [ping]         __dev_queue_xmit     953026843030 10.169.72.233:37728->10.169.72.236:8472(udp)
0xffff9235473d4600          [ping]             tcf_classify     953026845194 10.169.72.233:37728->10.169.72.236:8472(udp)
0xffff9235473d4600          [ping]      netdev_core_pick_tx     953026861675 10.169.72.233:37728->10.169.72.236:8472(udp)
0xffff9235473d4600          [ping]           netdev_pick_tx     953026865553 10.169.72.233:37728->10.169.72.236:8472(udp)
0xffff9235473d4600          [ping]           __skb_get_hash     953026868558 10.169.72.233:37728->10.169.72.236:8472(udp)
0xffff9235473d4600          [ping]           kfree_skb_list     953026874820 10.169.72.233:37728->10.169.72.236:8472(udp)
0xffff9235473d4600          [ping]                kfree_skb     953026877445 10.169.72.233:37728->10.169.72.236:8472(udp)
0xffff9235473d4600          [ping]          skb_release_all     953026880180 10.169.72.233:37728->10.169.72.236:8472(udp)
0xffff9235473d4600          [ping]   skb_release_head_state     953026882755 10.169.72.233:37728->10.169.72.236:8472(udp)
0xffff9235473d4600          [ping]               sock_wfree     953026885410 10.169.72.233:37728->10.169.72.236:8472(udp)
0xffff9235473d4600          [ping]         skb_release_data     953026888326 10.169.72.233:37728->10.169.72.236:8472(udp)
0xffff9235473d4600          [ping]            skb_free_head     953026890390 10.169.72.233:37728->10.169.72.236:8472(udp)
0xffff9235473d4600          [ping]             kfree_skbmem     953026892504 10.169.72.233:37728->10.169.72.236:8472(udp)

I also added the pr_info calls below to double-check whether the packet is dropped with NET_XMIT_DROP in __dev_xmit_skb, but I don't get any kernel log from pr_info, so it matches the pwru trace showing that __dev_xmit_skb is not called.

# git diff net/core/dev.c
diff --git a/net/core/dev.c b/net/core/dev.c
index 99ac84ff398f..7982a2ae3662 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3392,6 +3392,8 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
                                              &q->state))) {
                                __qdisc_drop(skb, &to_free);
                                rc = NET_XMIT_DROP;
+                               pr_info("single core TCQ_F_NOLOCK dev name %s  NET_XMIT_DROP %d\n",
+                                                                        skb->dev->name, NET_XMIT_DROP);
                                goto end_run;
                        }
                        qdisc_bstats_cpu_update(q, skb);
@@ -3425,6 +3427,8 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
        spin_lock(root_lock);
        if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
                __qdisc_drop(skb, &to_free);
+               pr_info("muti core __QDISC_STATE_DEACTIVATED dev name %s  NET_XMIT_DROP %d\n",
+                                                        skb->dev->name, NET_XMIT_DROP);
                rc = NET_XMIT_DROP;
        } else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
                   qdisc_run_begin(q)) {

One good thing: when I run the 5.4.150 stable kernel release on CentOS 8, the ping works, the test runs for over an hour with no problem, and it keeps working; but the 5.4.150 stable kernel release on CentOS 7 does not work at all.

@vincentmli
Contributor Author

One good thing: when I run the 5.4.150 stable kernel release on CentOS 8, the ping works, the test runs for over an hour with no problem, and it keeps working; but the 5.4.150 stable kernel release on CentOS 7 does not work at all.

I found out that 5.4.150 on CentOS 7 is missing CONFIG_CGROUP_BPF=y; I think that is the problem. The 5.4 stable release should have no problem.
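
For reference, the option can be checked on a running system with something like:

grep CONFIG_CGROUP_BPF /boot/config-$(uname -r)
# or, if the kernel exposes its config via procfs:
zgrep CONFIG_CGROUP_BPF /proc/config.gz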

@vincentmli
Contributor Author

Since kernel 5.5 works and 5.4 works only intermittently, I diffed 5.4 against 5.5. I suspect the commit below might have somehow fixed the issue (just a guess), and the 5.4.150 code looks different in this area.

commit 90b2be27bb0e56483f335cc10fb59ec66882b949
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Nov 8 08:45:23 2019 -0800

    net/sched: annotate lockless accesses to qdisc->empty
    
    KCSAN reported the following race [1]
    
    BUG: KCSAN: data-race in __dev_queue_xmit / net_tx_action
    

@vincentmli
Contributor Author

vincentmli commented Oct 21, 2021

I found out that 5.4.150 on CentOS 7 is missing CONFIG_CGROUP_BPF=y; I think that is the problem. The 5.4 stable release should have no problem.

Actually, 5.4.150 on CentOS 7 is still not working completely. I tried to use the exact same kernel config between CentOS 7 and CentOS 8 for 5.4.150, but got an error with BTF:

BTF: .tmp_vmlinux.btf: pahole (pahole) is not available
Failed to generate BTF for vmlinux
Try to disable CONFIG_DEBUG_INFO_BTF
make[1]: *** [vmlinux] Error 1
make: *** [sub-make] Error 2

It seems I need to manually compile and install pahole on CentOS 7. I am less worried now about stable kernel release 5.4.150 breaking #17370, since 5.4.150 works fine on CentOS 8.
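
A rough sketch of the two ways around the BTF error, either building pahole from source or disabling BTF for this test kernel (exact steps may differ):

# option 1: build and install pahole from source
git clone https://git.kernel.org/pub/scm/devel/pahole/pahole.git
cd pahole && mkdir build && cd build
cmake -D__LIB=lib .. && make && make install

# option 2: from the kernel source tree, disable BTF generation for this test build
scripts/config --disable DEBUG_INFO_BTF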

@brb
Member

brb commented Nov 8, 2021

@vincentmli Any TLDR why it was broken?

@vincentmli
Contributor Author

@vincentmli Any TLDR why it was broken?

@brb I did not dig further, since the 5.4.150 stable release on CentOS 8 works for my PR, and that is the minimum kernel requirement.

@vincentmli
Contributor Author

@brb I suspect kernel commit 90b2be27bb0 fixed the issue, but I did not run a git bisect between 5.4 and 5.5 to confirm, since the bisect may take too much time.
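
For completeness, the bisect that would confirm it looks roughly like this; each step requires building and booting a kernel, which is why it is time-consuming:

git bisect start --term-old=broken --term-new=fixed
git bisect fixed v5.5
git bisect broken v5.4
# build and boot each kernel git checks out, run the ping test, then mark it:
git bisect fixed      # or: git bisect broken
# repeat until git reports the first fixed commit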
