
IPv6 TCP connections broken / ACK packet not forwarded in case of service cluster-ip node/remote-pod #17941

Closed
mdaur opened this issue Nov 19, 2021 · 7 comments

Labels
kind/bug: This is a bug in the Cilium logic.
kind/community-report: This was reported by a user in the Cilium community, e.g. via Slack.
sig/datapath: Impacts bpf/ or low-level forwarding details, including map management and monitor messages.

Comments


mdaur commented Nov 19, 2021

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

I am facing an issue in a dual-stack Cilium setup (v1.10.5-b0836e8) with full kube-proxy replacement (strict). The nodes make use of a BGP setup and all routes are exchanged (cluster/node CIDR); the service CIDR is advertised for ECMP.
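For context, the setup roughly corresponds to Helm values like the following (a sketch reconstructed from the status output further below, not the exact install command; the BGP route exchange itself happens outside of Cilium):

# Sketch only; value names as in the Cilium 1.10 Helm chart.
helm upgrade cilium cilium/cilium --namespace kube-system \
  --version 1.10.5 \
  --set kubeProxyReplacement=strict \
  --set ipv4.enabled=true \
  --set ipv6.enabled=true \
  --set tunnel=disabled \
  --set bpf.lbExternalClusterIP=true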

Issue: after SYN and SYN/ACK all further packets are dropped (IPv6 only), so the TCP session cannot be established:

"hubble observe --since 1m --pod delete/prometheus-msteams-758bcbb47b-9bjg6" | grep 2003:XXXX:611:9600::21
Nov 18 20:26:55.315: [2003:XXXX:611:9600::21]:40524 -> delete/prometheus-msteams-758bcbb47b-9bjg6:2000 to-endpoint FORWARDED (TCP Flags: SYN)
Nov 18 20:26:55.315: [2003:XXXX:611:9600::21]:40524 <- delete/prometheus-msteams-758bcbb47b-9bjg6:2000 to-stack FORWARDED (TCP Flags: SYN, ACK)
-> no further entries; subsequent packets, e.g. the ACK, are missing

IPv4 (same service, same pod) works well:

Nov 18 20:26:48.127: 192.168.1.22:58182 -> delete/prometheus-msteams-758bcbb47b-9bjg6:2000 to-endpoint FORWARDED (TCP Flags: SYN)
Nov 18 20:26:48.127: 192.168.1.22:58182 <- delete/prometheus-msteams-758bcbb47b-9bjg6:2000 to-stack FORWARDED (TCP Flags: SYN, ACK)
Nov 18 20:26:48.127: 192.168.1.22:58182 -> delete/prometheus-msteams-758bcbb47b-9bjg6:2000 to-endpoint FORWARDED (TCP Flags: ACK)
Nov 18 20:26:48.127: 192.168.1.22:58182 -> delete/prometheus-msteams-758bcbb47b-9bjg6:2000 to-endpoint FORWARDED (TCP Flags: ACK, PSH)
Nov 18 20:26:48.128: 192.168.1.22:58182 <- delete/prometheus-msteams-758bcbb47b-9bjg6:2000 to-stack FORWARDED (TCP Flags: ACK, PSH)
Nov 18 20:26:48.128: 192.168.1.22:58182 -> delete/prometheus-msteams-758bcbb47b-9bjg6:2000 to-endpoint FORWARDED (TCP Flags: ACK, FIN)
Nov 18 20:26:48.128: 192.168.1.22:58182 -> delete/prometheus-msteams-758bcbb47b-9bjg6:2000 to-endpoint FORWARDED (TCP Flags: ACK)
Nov 18 20:26:48.128: 192.168.1.22:58182 <- delete/prometheus-msteams-758bcbb47b-9bjg6:2000 to-stack FORWARDED (TCP Flags: ACK, FIN)
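Hubble shows no explicit drop verdict for the IPv6 flow either. One way to check for drops directly (a sketch, using the same pod selector as above):

hubble observe --since 1m --verdict DROPPED --pod delete/prometheus-msteams-758bcbb47b-9bjg6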

tcpdump on node1 (ingress; eBPF SNAT works well: 2003:a:611:9600::76 (ingress client) -> 2003:a:611:9600::21 (node1) -> 2003:a:611:9611::2 (remote pod)):

22:17:29.637078 IP6 2003:a:611:9600::76.48762 > 2003:a:611:9605::9abf.2000: Flags [S], seq 2457750650, win 64800, options [mss 1440,sackOK,TS val 3809735993 ecr 0,nop,wscale 7], length 0
22:17:29.637142 IP6 2003:a:611:9600::21.48762 > 2003:a:611:9611::2.2000: Flags [S], seq 2457750650, win 64800, options [mss 1440,sackOK,TS val 3809735993 ecr 0,nop,wscale 7], length 0
22:17:29.638585 IP6 2003:a:611:9611::2.2000 > 2003:a:611:9600::21.48762: Flags [S.], seq 2723963965, ack 2457750651, win 64260, options [mss 1440,sackOK,TS val 1868375260 ecr 3809735993,nop,wscale 7], length 0
22:17:29.638636 IP6 2003:a:611:9605::9abf.2000 > 2003:a:611:9600::76.48762: Flags [S.], seq 2723963965, ack 2457750651, win 64260, options [mss 1440,sackOK,TS val 1868375260 ecr 3809735993,nop,wscale 7], length 0
22:17:29.639831 IP6 2003:a:611:9600::76.48762 > 2003:a:611:9605::9abf.2000: Flags [.], ack 1, win 507, options [nop,nop,TS val 3809735996 ecr 1868375260], length 0

<- next packets do not arrive at node2 where pod 2003:a:611:9611::2.2000 resides ->

22:17:29.639870 IP6 2003:a:611:9600::21.48762 > 2003:a:611:9611::2.2000:    Flags [.], ack 1, win 507, options [nop,nop,TS val 3809735996 ecr 1868375260], length 0
22:17:29.640220 IP6 2003:a:611:9600::76.48762 > 2003:a:611:9605::9abf.2000: Flags [P.], seq 1:93, ack 1, win 507, options [nop,nop,TS val 3809735996 ecr 1868375260], length 92
22:17:29.640254 IP6 2003:a:611:9600::21.48762 > 2003:a:611:9611::2.2000:    Flags [P.], seq 1:93, ack 1, win 507, options [nop,nop,TS val 3809735996 ecr 1868375260], length 92
22:17:29.859473 IP6 2003:a:611:9600::76.48762 > 2003:a:611:9605::9abf.2000: Flags [P.], seq 1:93, ack 1, win 507, options [nop,nop,TS val 3809736216 ecr 1868375260], length 92
22:17:29.859519 IP6 2003:a:611:9600::21.48762 > 2003:a:611:9611::2.2000:    Flags [P.], seq 1:93, ack 1, win 507, options [nop,nop,TS val 3809736216 ecr 1868375260], length 92
22:17:30.079468 IP6 2003:a:611:9600::76.48762 > 2003:a:611:9605::9abf.2000: Flags [P.], seq 1:93, ack 1, win 507, options [nop,nop,TS val 3809736436 ecr 1868375260], length 92
22:17:30.079524 IP6 2003:a:611:9600::21.48762 > 2003:a:611:9611::2.2000:    Flags [P.], seq 1:93, ack 1, win 507, options [nop,nop,TS val 3809736436 ecr 1868375260], length 92
22:17:30.519421 IP6 2003:a:611:9600::76.48762 > 2003:a:611:9605::9abf.2000: Flags [P.], seq 1:93, ack 1, win 507, options [nop,nop,TS val 3809736876 ecr 1868375260], length 92

tcpdump on node2 (remote pod 2003:a:611:9611::2 retransmits the SYN/ACK due to the missing ACK response):

22:17:29.638206 IP6 2003:a:611:9600::21.48762 > 2003:a:611:9611::2.2000: Flags [S], seq 2457750650, win 64800, options [mss 1440,sackOK,TS val 3809735993 ecr 0,nop,wscale 7], length 0
22:17:29.638409 IP6 2003:a:611:9611::2.2000 > 2003:a:611:9600::21.48762: Flags [S.], seq 2723963965, ack 2457750651, win 64260, options [mss 1440,sackOK,TS val 1868375260 ecr 3809735993,nop,wscale 7], length 0
22:17:30.650827 IP6 2003:a:611:9611::2.2000 > 2003:a:611:9600::21.48762: Flags [S.], seq 2723963965, ack 2457750651, win 64260, options [mss 1440,sackOK,TS val 1868376272 ecr 3809735993,nop,wscale 7], length 0
22:17:32.666809 IP6 2003:a:611:9611::2.2000 > 2003:a:611:9600::21.48762: Flags [S.], seq 2723963965, ack 2457750651, win 64260, options [mss 1440,sackOK,TS val 1868378288 ecr 3809735993,nop,wscale 7], length 0
22:17:36.794820 IP6 2003:a:611:9611::2.2000 > 2003:a:611:9600::21.48762: Flags [S.], seq 2723963965, ack 2457750651, win 64260, options [mss 1440,sackOK,TS val 1868382416 ecr 3809735993,nop,wscale 7], length 0
22:17:44.986811 IP6 2003:a:611:9611::2.2000 > 2003:a:611:9600::21.48762: Flags [S.], seq 2723963965, ack 2457750651, win 64260, options [mss 1440,sackOK,TS val 1868390608 ecr 3809735993,nop,wscale 7], length 0
22:18:01.115181 IP6 2003:a:611:9611::2.2000 > 2003:a:611:9600::21.48762: Flags [S.], seq 2723963965, ack 2457750651, win 64260, options [mss 1440,sackOK,TS val 1868406736 ecr 3809735993,nop,wscale 7], length 0
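To rule out the switch in between, a capture pinned to node1's physical egress device shows whether the ACKs hit the wire at all (a sketch; enp2s0 is the direct-routing device from the status output below):

tcpdump -ni enp2s0 'ip6 and tcp port 2000'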

Summary:

  • IPv6 connectivity fails for ingress via cluster-ip to a remote pod (TCP)
  • IPv6 connectivity works for ingress via cluster-ip to a remote pod (UDP)
  • IPv6 connectivity works pod/pod (same node, different nodes), node/node, node/pod, node/remote-pod, ingress-cluster-ip/local-pod, pod/egress, node/egress
  • For IPv4 the same ingress cluster-ip/remote-pod scenario with the same service/pod works well (as do all other IPv4 scenarios).
  • No network policies are applied.

Cilium Version

v1.10.5-b0836e8

Kernel Version

Linux node11 5.4.0-90-generic #101-Ubuntu SMP Fri Oct 15 20:00:55 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Kubernetes Version

Rancher K3s v1.21.5+k3s2

Sysdump

No response

Relevant log output

Are there any known issues in an ingress to node/remote-pod scenario for IPv6 TCP services (a default service making use of a cluster IP with PreferDualStack)?
- Ubuntu focal, default kernel 5.4.0-90-generic, K3S v1.21.5+k3s2
- KubeProxyReplacement:   Strict   [enp2s0 192.168.1.21 2003:XXXX:611:9600::21 (Direct Routing)]
- Cilium:                 Ok   1.10.5 (v1.10.5-b0836e8)
- NodeMonitor:            Listening for events on 4 CPUs with 64x4096 of shared memory
- Cilium health daemon:   Ok
- IPAM:                   IPv4: 13/30 allocated, IPv6: 13/30 allocated
- ClusterMesh:            0/0 clusters ready, 0 global-services
- BandwidthManager:       Disabled
- Host Routing:           Legacy
- Masquerading:           Disabled
- Controller Status:      71/71 healthy
- Proxy Status:           OK, ip 10.10.0.12, 0 redirects active on ports 10000-20000
- Hubble:                 Ok   Current/Max Flows: 4095/4095 (100.00%), Flows/s: 9.24   Metrics: Disabled
- Encryption:             Disabled
- Cluster health:         2/2 reachable   (2021-11-18T21:05:37Z)

KubeProxyReplacement Details:
  Status:                Strict
  Socket LB Protocols:   TCP, UDP
  Devices:               enp2s0 192.168.1.22 2003:a:611:9600::22 (Direct Routing)
  Mode:                  SNAT
  Backend Selection:     Random
  Session Affinity:      Enabled
  XDP Acceleration:      Disabled
  Services:
  - ClusterIP:      Enabled
  - NodePort:       Enabled (Range: 30000-32767)
  - LoadBalancer:   Enabled
  - externalIPs:    Enabled
  - HostPort:       Enabled

Anything else?

For sure I can share all the debug logs/traces from cilium-bugtool, but for a first touchpoint I think that is a bit too much. How can I figure out where the ACK packets, e.g. in the trace above, get lost?
On the other hand, IPv6 UDP services work well (service cluster-ip / host -> remote pod).
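One avenue to surface datapath drops directly is the agent's monitor (a sketch; the cilium-xxxxx pod name is a placeholder):

# Watch for drop notifications emitted by the datapath:
kubectl -n kube-system exec -it cilium-xxxxx -- cilium monitor --type drop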

Any thoughts on how to debug this further are highly appreciated.
/martin

Code of Conduct

  • I agree to follow this project's Code of Conduct
@mdaur mdaur added kind/bug This is a bug in the Cilium logic. needs/triage This issue requires triaging to establish severity and next steps. labels Nov 19, 2021

borkmann commented Nov 22, 2021

Thanks for the report, @mdaur !

<- next packets do not arrive at node2 where pod 2003:a :611:9611::2.2000 resides ->

Do you happen to know if they actually leave the node, or whether they are dropped somewhere on the network before reaching node2?

IPv6 connectivity works ingress using cluster-ip/remote-pod (UDP)

Is it correct to assume that you did a similar packet-by-packet analysis with tcpdump as above, tracing a few subsequent packets to confirm they make it to node2's backend? Meaning, with exactly the same SNAT config, service VIP, etc., they pass through just fine?

Also, could you attach the pcap from the tcpdump node1 session for further analysis?

One other question on 'cluster-ip node/remote-pod': I presume you run the agent with bpf-lb-external-clusterip (#15650), right?
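One way to double-check, as a sketch, is to grep the agent ConfigMap for the flag:

kubectl -n kube-system get configmap cilium-config -o yaml | grep bpf-lb-external-clusterip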


mdaur commented Nov 22, 2021

Hello borkmann,
Thank you very much for your reply.

Is it correct to assume that you did similar packet by packet analysis with tcpdump above to trace a few subsequent packets that they make it to the node2's backend, right? Meaning, exactly the same SNAT config, service VIP, etc, and they pass through just fine?

Yes, I did the captures for multiple services and, of course, multiple times, with the same result every time: the ACK packet always got lost. In the scenario above, node1 and node2 were connected by an unmanaged switch without any filters, so it is very unlikely that the packets were dropped on the L2 switch. But yes, I missed the opportunity to verify with a mirror port on another switch that the packets really do not leave the node. Still, I am fairly sure they did not, because in the meantime I was able to make it work by switching to kernel 5.10.80 (Ubuntu mainline).

Also, could you attach the pcap from the tcpdump node1 session for further analysis?

Yes, see the attached pcap ipv6.zip. Packets 9 and 10 (stream 0 inbound, stream 1 NATed) show the lost ACK packet, or at least the packet that did not make it to node2.

One other question on 'cluster-ip node/remote-pod': I presume you run the agent with bpf-lb-external-clusterip (#15650), right?

Yes, I ran the agent with the bpf-lb-external-clusterip option, i.e. the Helm value bpf.lbExternalClusterIP: true.

@borkmann

Also, could you attach the pcap from the tcpdump node1 session for further analysis?

Yes, see the attached pcap ipv6.zip. Packets 9 and 10 (stream 0 inbound, stream 1 NATed) show the lost ACK packet, or at least the packet that did not make it to node2.

That helped, I think I can see the issue. See the ICMPv6 redirect at packet 6: it looks like the router is doing the forwarding for us, but not for subsequent packets.

Could you try with the latest image or with v1.11.0-rc2 to see if it is fixed there?

Both quay.io/cilium/cilium:v1.11.0-rc2 and quay.io/cilium/cilium-ci:latest should be okay to test.

We reworked the neighbor cache there and it should address it.
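A sketch of one way to switch, assuming the pre-release chart is published on the Helm repo (otherwise overriding image.repository/image.tag on the existing chart works too):

helm upgrade cilium cilium/cilium --namespace kube-system \
  --version 1.11.0-rc2 --reuse-values
# or keep the current chart and only swap the agent image:
#   helm upgrade cilium cilium/cilium --namespace kube-system --reuse-values \
#     --set image.repository=quay.io/cilium/cilium-ci --set image.tag=latest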

@borkmann

@mdaur any progress wrt the above? Thx

@qmonnet qmonnet added kind/community-report This was reported by a user in the Cilium community, eg via Slack. and removed needs/triage This issue requires triaging to establish severity and next steps. labels Nov 25, 2021

mdaur commented Nov 25, 2021

I will share my outcome by the end of tomorrow. Sorry for the delay.


mdaur commented Dec 1, 2021

@borkmann I can now confirm that the ICMPv6 redirects are no longer there and TCP sessions work well for IPv6 when traffic needs to be forwarded to a pod on another node, e.g. origin 2003:a:611:9600:c9fc:8043:ad2:4c5d, forwarding node 2003:a:611:9600::21, pod on remote node 2003:a:611:9612::11, service IP 2003:a:611:9605::9abf tcp/2000. Tested with quay.io/cilium/cilium:v1.11.0-rc3 on kernel 5.4.0-90-generic (default Ubuntu focal, amd64).

22:31:06.064025 IP6 2003:a:611:9600:c9fc:8043:ad2:4c5d.33148 > 2003:a:611:9605::9abf.2000: Flags [S], seq 339812777, win 28800, options [mss 1440,sackOK,TS val 3699472835 ecr 0,nop,wscale 6], length 0
22:31:06.064093 IP6 2003:a:611:9600::21.33148 > 2003:a:611:9612::11.2000: Flags [S], seq 339812777, win 28800, options [mss 1440,sackOK,TS val 3699472835 ecr 0,nop,wscale 6], length 0
22:31:06.064457 IP6 2003:a:611:9612::11.2000 > 2003:a:611:9600::21.33148: Flags [S.], seq 2484932767, ack 339812778, win 64260, options [mss 1440,sackOK,TS val 1775479634 ecr 3699472835,nop,wscale 7], length 0
22:31:06.064497 IP6 2003:a:611:9605::9abf.2000 > 2003:a:611:9600:c9fc:8043:ad2:4c5d.33148: Flags [S.], seq 2484932767, ack 339812778, win 64260, options [mss 1440,sackOK,TS val 1775479634 ecr 3699472835,nop,wscale 7], length 0
22:31:06.065251 IP6 2003:a:611:9600:c9fc:8043:ad2:4c5d.33148 > 2003:a:611:9605::9abf.2000: Flags [.], ack 1, win 450, options [nop,nop,TS val 3699472837 ecr 1775479634], length 0
22:31:06.065310 IP6 2003:a:611:9600::21.33148 > 2003:a:611:9612::11.2000: Flags [.], ack 1, win 450, options [nop,nop,TS val 3699472837 ecr 1775479634], length 0
22:31:06.066044 IP6 2003:a:611:9600:c9fc:8043:ad2:4c5d.33148 > 2003:a:611:9605::9abf.2000: Flags [P.], seq 1:93, ack 1, win 450, options [nop,nop,TS val 3699472837 ecr 1775479634], length 92
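One way to confirm the redirects are gone (a sketch; ip6[40] is the ICMPv6 type byte, assuming no extension headers; type 137 is a redirect):

tcpdump -ni enp2s0 'icmp6 and ip6[40] == 137'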

@aanm aanm added the sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. label Jan 6, 2022
@borkmann

@borkmann I can now confirm that the ICMPv6 redirects are no longer there and TCP sessions work well for IPv6 when traffic needs to be forwarded to a pod on another node.

Thanks a lot @mdaur. In other words, would it be okay to close the issue?

@mdaur mdaur closed this as completed Apr 9, 2022