
IPv6 TCP connections broken / ACK packet not forwarded in case of service cluster-ip node/remote-pod #17941

Closed
mdaur opened this issue Nov 19, 2021 · 7 comments

Labels
kind/bug: This is a bug in the Cilium logic.
kind/community-report: This was reported by a user in the Cilium community, e.g. via Slack.
sig/datapath: Impacts bpf/ or low-level forwarding details, including map management and monitor messages.

Comments


mdaur commented Nov 19, 2021

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

I am facing an issue in a dual-stack Cilium setup (v1.10.5-b0836e8) with full kube-proxy replacement (strict). The nodes make use of a BGP setup and all routes are exchanged (cluster/node CIDR); the service CIDR is advertised for ECMP.
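For context, the setup roughly corresponds to Helm values like the following (a sketch reconstructed from the status output further below, not the exact install command; the BGP route exchange itself happens outside of Cilium):

# Sketch only; value names as in the Cilium 1.10 Helm chart.
helm upgrade cilium cilium/cilium --namespace kube-system \
  --version 1.10.5 \
  --set kubeProxyReplacement=strict \
  --set ipv4.enabled=true \
  --set ipv6.enabled=true \
  --set tunnel=disabled \
  --set bpf.lbExternalClusterIP=true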

Issue: after SYN and SYN/ACK all further packets are dropped (IPv6 only), so the TCP session cannot be established:

"hubble observe --since 1m --pod delete/prometheus-msteams-758bcbb47b-9bjg6" | grep 2003:XXXX:611:9600::21
Nov 18 20:26:55.315: [2003:XXXX:611:9600::21]:40524 -> delete/prometheus-msteams-758bcbb47b-9bjg6:2000 to-endpoint FORWARDED (TCP Flags: SYN)
Nov 18 20:26:55.315: [2003:XXXX:611:9600::21]:40524 <- delete/prometheus-msteams-758bcbb47b-9bjg6:2000 to-stack FORWARDED (TCP Flags: SYN, ACK)
-> no further entries; subsequent packets, e.g. the ACK, are missing

IPv4 (same service, same pod) works well:

Nov 18 20:26:48.127: 192.168.1.22:58182 -> delete/prometheus-msteams-758bcbb47b-9bjg6:2000 to-endpoint FORWARDED (TCP Flags: SYN)
Nov 18 20:26:48.127: 192.168.1.22:58182 <- delete/prometheus-msteams-758bcbb47b-9bjg6:2000 to-stack FORWARDED (TCP Flags: SYN, ACK)
Nov 18 20:26:48.127: 192.168.1.22:58182 -> delete/prometheus-msteams-758bcbb47b-9bjg6:2000 to-endpoint FORWARDED (TCP Flags: ACK)
Nov 18 20:26:48.127: 192.168.1.22:58182 -> delete/prometheus-msteams-758bcbb47b-9bjg6:2000 to-endpoint FORWARDED (TCP Flags: ACK, PSH)
Nov 18 20:26:48.128: 192.168.1.22:58182 <- delete/prometheus-msteams-758bcbb47b-9bjg6:2000 to-stack FORWARDED (TCP Flags: ACK, PSH)
Nov 18 20:26:48.128: 192.168.1.22:58182 -> delete/prometheus-msteams-758bcbb47b-9bjg6:2000 to-endpoint FORWARDED (TCP Flags: ACK, FIN)
Nov 18 20:26:48.128: 192.168.1.22:58182 -> delete/prometheus-msteams-758bcbb47b-9bjg6:2000 to-endpoint FORWARDED (TCP Flags: ACK)
Nov 18 20:26:48.128: 192.168.1.22:58182 <- delete/prometheus-msteams-758bcbb47b-9bjg6:2000 to-stack FORWARDED (TCP Flags: ACK, FIN)
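Hubble shows no explicit drop verdict for the IPv6 flow either. One way to check for drops directly (a sketch, using the same pod selector as above):

hubble observe --since 1m --verdict DROPPED --pod delete/prometheus-msteams-758bcbb47b-9bjg6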

tcpdump on node1 (ingress; eBPF SNAT works well: 2003:a:611:9600::76 (ingress client) -> 2003:a:611:9600::21 (node1) -> 2003:a:611:9611::2 (remote pod)):

22:17:29.637078 IP6 2003:a:611:9600::76.48762 > 2003:a:611:9605::9abf.2000: Flags [S], seq 2457750650, win 64800, options [mss 1440,sackOK,TS val 3809735993 ecr 0,nop,wscale 7], length 0
22:17:29.637142 IP6 2003:a:611:9600::21.48762 > 2003:a:611:9611::2.2000: Flags [S], seq 2457750650, win 64800, options [mss 1440,sackOK,TS val 3809735993 ecr 0,nop,wscale 7], length 0
22:17:29.638585 IP6 2003:a:611:9611::2.2000 > 2003:a:611:9600::21.48762: Flags [S.], seq 2723963965, ack 2457750651, win 64260, options [mss 1440,sackOK,TS val 1868375260 ecr 3809735993,nop,wscale 7], length 0
22:17:29.638636 IP6 2003:a:611:9605::9abf.2000 > 2003:a:611:9600::76.48762: Flags [S.], seq 2723963965, ack 2457750651, win 64260, options [mss 1440,sackOK,TS val 1868375260 ecr 3809735993,nop,wscale 7], length 0
22:17:29.639831 IP6 2003:a:611:9600::76.48762 > 2003:a:611:9605::9abf.2000: Flags [.], ack 1, win 507, options [nop,nop,TS val 3809735996 ecr 1868375260], length 0

<- next packets do not arrive at node2 where pod 2003:a:611:9611::2.2000 resides ->

22:17:29.639870 IP6 2003:a:611:9600::21.48762 > 2003:a:611:9611::2.2000:    Flags [.], ack 1, win 507, options [nop,nop,TS val 3809735996 ecr 1868375260], length 0
22:17:29.640220 IP6 2003:a:611:9600::76.48762 > 2003:a:611:9605::9abf.2000: Flags [P.], seq 1:93, ack 1, win 507, options [nop,nop,TS val 3809735996 ecr 1868375260], length 92
22:17:29.640254 IP6 2003:a:611:9600::21.48762 > 2003:a:611:9611::2.2000:    Flags [P.], seq 1:93, ack 1, win 507, options [nop,nop,TS val 3809735996 ecr 1868375260], length 92
22:17:29.859473 IP6 2003:a:611:9600::76.48762 > 2003:a:611:9605::9abf.2000: Flags [P.], seq 1:93, ack 1, win 507, options [nop,nop,TS val 3809736216 ecr 1868375260], length 92
22:17:29.859519 IP6 2003:a:611:9600::21.48762 > 2003:a:611:9611::2.2000:    Flags [P.], seq 1:93, ack 1, win 507, options [nop,nop,TS val 3809736216 ecr 1868375260], length 92
22:17:30.079468 IP6 2003:a:611:9600::76.48762 > 2003:a:611:9605::9abf.2000: Flags [P.], seq 1:93, ack 1, win 507, options [nop,nop,TS val 3809736436 ecr 1868375260], length 92
22:17:30.079524 IP6 2003:a:611:9600::21.48762 > 2003:a:611:9611::2.2000:    Flags [P.], seq 1:93, ack 1, win 507, options [nop,nop,TS val 3809736436 ecr 1868375260], length 92
22:17:30.519421 IP6 2003:a:611:9600::76.48762 > 2003:a:611:9605::9abf.2000: Flags [P.], seq 1:93, ack 1, win 507, options [nop,nop,TS val 3809736876 ecr 1868375260], length 92

tcpdump on node2 (remote pod 2003:a:611:9611::2 retransmits the SYN/ACK due to the missing ACK response):

22:17:29.638206 IP6 2003:a:611:9600::21.48762 > 2003:a:611:9611::2.2000: Flags [S], seq 2457750650, win 64800, options [mss 1440,sackOK,TS val 3809735993 ecr 0,nop,wscale 7], length 0
22:17:29.638409 IP6 2003:a:611:9611::2.2000 > 2003:a:611:9600::21.48762: Flags [S.], seq 2723963965, ack 2457750651, win 64260, options [mss 1440,sackOK,TS val 1868375260 ecr 3809735993,nop,wscale 7], length 0
22:17:30.650827 IP6 2003:a:611:9611::2.2000 > 2003:a:611:9600::21.48762: Flags [S.], seq 2723963965, ack 2457750651, win 64260, options [mss 1440,sackOK,TS val 1868376272 ecr 3809735993,nop,wscale 7], length 0
22:17:32.666809 IP6 2003:a:611:9611::2.2000 > 2003:a:611:9600::21.48762: Flags [S.], seq 2723963965, ack 2457750651, win 64260, options [mss 1440,sackOK,TS val 1868378288 ecr 3809735993,nop,wscale 7], length 0
22:17:36.794820 IP6 2003:a:611:9611::2.2000 > 2003:a:611:9600::21.48762: Flags [S.], seq 2723963965, ack 2457750651, win 64260, options [mss 1440,sackOK,TS val 1868382416 ecr 3809735993,nop,wscale 7], length 0
22:17:44.986811 IP6 2003:a:611:9611::2.2000 > 2003:a:611:9600::21.48762: Flags [S.], seq 2723963965, ack 2457750651, win 64260, options [mss 1440,sackOK,TS val 1868390608 ecr 3809735993,nop,wscale 7], length 0
22:18:01.115181 IP6 2003:a:611:9611::2.2000 > 2003:a:611:9600::21.48762: Flags [S.], seq 2723963965, ack 2457750651, win 64260, options [mss 1440,sackOK,TS val 1868406736 ecr 3809735993,nop,wscale 7], length 0
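To rule out the switch in between, a capture pinned to node1's physical egress device shows whether the ACKs hit the wire at all (a sketch; enp2s0 is the direct-routing device from the status output below):

tcpdump -ni enp2s0 'ip6 and tcp port 2000'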

Summary:

  • IPv6 connectivity fails for ingress via cluster-ip to a remote pod (TCP)
  • IPv6 connectivity works for ingress via cluster-ip to a remote pod (UDP)
  • IPv6 connectivity works pod/pod (same node, different nodes), node/node, node/pod, node/remote-pod, ingress-cluster-ip/local-pod, pod/egress, node/egress
  • For IPv4 the same ingress cluster-ip/remote-pod scenario with the same service/pod works well (as do all other IPv4 scenarios).
  • No network policies are applied.

Cilium Version

v1.10.5-b0836e8

Kernel Version

Linux node11 5.4.0-90-generic #101-Ubuntu SMP Fri Oct 15 20:00:55 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Kubernetes Version

Rancher K3s v1.21.5+k3s2

Sysdump

No response

Relevant log output

Are there any known issues in an ingress to node/remote-pod scenario for IPv6 TCP services (a default service making use of a cluster IP with PreferDualStack)?
- Ubuntu focal, default kernel 5.4.0-90-generic, K3S v1.21.5+k3s2
- KubeProxyReplacement:   Strict   [enp2s0 192.168.1.21 2003:XXXX:611:9600::21 (Direct Routing)]
- Cilium:                 Ok   1.10.5 (v1.10.5-b0836e8)
- NodeMonitor:            Listening for events on 4 CPUs with 64x4096 of shared memory
- Cilium health daemon:   Ok
- IPAM:                   IPv4: 13/30 allocated, IPv6: 13/30 allocated
- ClusterMesh:            0/0 clusters ready, 0 global-services
- BandwidthManager:       Disabled
- Host Routing:           Legacy
- Masquerading:           Disabled
- Controller Status:      71/71 healthy
- Proxy Status:           OK, ip 10.10.0.12, 0 redirects active on ports 10000-20000
- Hubble:                 Ok   Current/Max Flows: 4095/4095 (100.00%), Flows/s: 9.24   Metrics: Disabled
- Encryption:             Disabled
- Cluster health:         2/2 reachable   (2021-11-18T21:05:37Z)

KubeProxyReplacement Details:
  Status:                Strict
  Socket LB Protocols:   TCP, UDP
  Devices:               enp2s0 192.168.1.22 2003:a:611:9600::22 (Direct Routing)
  Mode:                  SNAT
  Backend Selection:     Random
  Session Affinity:      Enabled
  XDP Acceleration:      Disabled
  Services:
  - ClusterIP:      Enabled
  - NodePort:       Enabled (Range: 30000-32767)
  - LoadBalancer:   Enabled
  - externalIPs:    Enabled
  - HostPort:       Enabled

Anything else?

For sure I can share all the debug logs/traces from cilium-bugtool, but for a first touchpoint I think that is a bit too much. How can I figure out where the ACK packets, e.g. in the trace above, get lost?
On the other hand, IPv6 UDP services work well (service cluster-ip / host -> remote pod).
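One avenue to surface datapath drops directly is the agent's monitor (a sketch; the cilium-xxxxx pod name is a placeholder):

# Watch for drop notifications emitted by the datapath:
kubectl -n kube-system exec -it cilium-xxxxx -- cilium monitor --type drop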

Any thoughts on how to debug this further are highly appreciated.
/martin

Code of Conduct

  • I agree to follow this project's Code of Conduct
@mdaur mdaur added kind/bug This is a bug in the Cilium logic. needs/triage This issue requires triaging to establish severity and next steps. labels Nov 19, 2021

borkmann commented Nov 22, 2021

Thanks for the report, @mdaur !

<- next packets do not arrive at node2 where pod 2003:a :611:9611::2.2000 resides ->

Do you happen to know if they actually leave the node, or whether they are dropped somewhere on the network before reaching node2?

IPv6 connectivity works ingress using cluster-ip/remote-pod (UDP)

Is it correct to assume that you did a similar packet-by-packet analysis with tcpdump as above, tracing a few subsequent packets to confirm they make it to node2's backend? Meaning, with exactly the same SNAT config, service VIP, etc., they pass through just fine?

Also, could you attach the pcap from the tcpdump node1 session for further analysis?

One other question on 'cluster-ip node/remote-pod': I presume you run the agent with bpf-lb-external-clusterip (#15650), right?
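One way to double-check, as a sketch, is to grep the agent ConfigMap for the flag:

kubectl -n kube-system get configmap cilium-config -o yaml | grep bpf-lb-external-clusterip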


mdaur commented Nov 22, 2021

Hello borkmann,
Thank you very much for your reply.

Is it correct to assume that you did similar packet by packet analysis with tcpdump above to trace a few subsequent packets that they make it to the node2's backend, right? Meaning, exactly the same SNAT config, service VIP, etc, and they pass through just fine?

Yes, I did the captures for multiple services and, of course, multiple times, with the same result every time: the ACK packet always got lost. In the scenario above, node1 and node2 were connected by an unmanaged switch without any filters, so it is very unlikely that the packets were dropped on the L2 switch. But yes, I missed the opportunity to verify with a mirror port on another switch that the packets really do not leave the node. Still, I am fairly sure they did not, because in the meantime I was able to make it work by switching to kernel 5.10.80 (Ubuntu mainline).

Also, could you attach the pcap from the tcpdump node1 session for further analysis?

Yes, see the attached pcap ipv6.zip. Packets 9 and 10 (stream 0 inbound, stream 1 NATed) show the lost ACK packet, or at least the packet that did not make it to node2.

One other question on 'cluster-ip node/remote-pod': I presume you run the agent with bpf-lb-external-clusterip (#15650), right?

Yes, I ran the agent with the bpf-lb-external-clusterip option, i.e. the Helm value bpf.lbExternalClusterIP: true.

@borkmann

Also, could you attach the pcap from the tcpdump node1 session for further analysis?

Yes, see the attached pcap ipv6.zip. Packets 9 and 10 (stream 0 inbound, stream 1 NATed) show the lost ACK packet, or at least the packet that did not make it to node2.

That helped, I think I can see the issue. See the ICMPv6 redirect at packet 6: it looks like the router is doing the forwarding for us, but not for subsequent packets.

Could you try with the latest image or with v1.11.0-rc2 to see if it is fixed there?

Both quay.io/cilium/cilium:v1.11.0-rc2 and quay.io/cilium/cilium-ci:latest should be okay to test.

We reworked the neighbor cache there and it should address it.
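A sketch of one way to switch, assuming the pre-release chart is published on the Helm repo (otherwise overriding image.repository/image.tag on the existing chart works too):

helm upgrade cilium cilium/cilium --namespace kube-system \
  --version 1.11.0-rc2 --reuse-values
# or keep the current chart and only swap the agent image:
#   helm upgrade cilium cilium/cilium --namespace kube-system --reuse-values \
#     --set image.repository=quay.io/cilium/cilium-ci --set image.tag=latest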

@borkmann

@mdaur any progress wrt the above? Thx

@qmonnet qmonnet added kind/community-report This was reported by a user in the Cilium community, eg via Slack. and removed needs/triage This issue requires triaging to establish severity and next steps. labels Nov 25, 2021

mdaur commented Nov 25, 2021

I will share my outcome by the end of tomorrow. Sorry for the delay.


mdaur commented Dec 1, 2021

@borkmann I can now confirm that the ICMPv6 redirects are no longer there and TCP sessions work well for IPv6 when traffic needs to be forwarded to a pod on another node, e.g. origin 2003:a:611:9600:c9fc:8043:ad2:4c5d, forwarding node 2003:a:611:9600::21, pod on remote node 2003:a:611:9612::11, service IP 2003:a:611:9605::9abf tcp/2000. Tested with quay.io/cilium/cilium:v1.11.0-rc3 on kernel 5.4.0-90-generic (default Ubuntu focal, amd64).

22:31:06.064025 IP6 2003:a:611:9600:c9fc:8043:ad2:4c5d.33148 > 2003:a:611:9605::9abf.2000: Flags [S], seq 339812777, win 28800, options [mss 1440,sackOK,TS val 3699472835 ecr 0,nop,wscale 6], length 0
22:31:06.064093 IP6 2003:a:611:9600::21.33148 > 2003:a:611:9612::11.2000: Flags [S], seq 339812777, win 28800, options [mss 1440,sackOK,TS val 3699472835 ecr 0,nop,wscale 6], length 0
22:31:06.064457 IP6 2003:a:611:9612::11.2000 > 2003:a:611:9600::21.33148: Flags [S.], seq 2484932767, ack 339812778, win 64260, options [mss 1440,sackOK,TS val 1775479634 ecr 3699472835,nop,wscale 7], length 0
22:31:06.064497 IP6 2003:a:611:9605::9abf.2000 > 2003:a:611:9600:c9fc:8043:ad2:4c5d.33148: Flags [S.], seq 2484932767, ack 339812778, win 64260, options [mss 1440,sackOK,TS val 1775479634 ecr 3699472835,nop,wscale 7], length 0
22:31:06.065251 IP6 2003:a:611:9600:c9fc:8043:ad2:4c5d.33148 > 2003:a:611:9605::9abf.2000: Flags [.], ack 1, win 450, options [nop,nop,TS val 3699472837 ecr 1775479634], length 0
22:31:06.065310 IP6 2003:a:611:9600::21.33148 > 2003:a:611:9612::11.2000: Flags [.], ack 1, win 450, options [nop,nop,TS val 3699472837 ecr 1775479634], length 0
22:31:06.066044 IP6 2003:a:611:9600:c9fc:8043:ad2:4c5d.33148 > 2003:a:611:9605::9abf.2000: Flags [P.], seq 1:93, ack 1, win 450, options [nop,nop,TS val 3699472837 ecr 1775479634], length 92
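One way to confirm the redirects are gone (a sketch; ip6[40] is the ICMPv6 type byte, assuming no extension headers; type 137 is a redirect):

tcpdump -ni enp2s0 'icmp6 and ip6[40] == 137'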

@aanm aanm added the sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. label Jan 6, 2022
@borkmann

@borkmann I can now confirm that the ICMPv6 redirects are no longer there and TCP sessions work well for IPv6 when traffic needs to be forwarded to a pod on another node.

Thanks a lot @mdaur. In other words, would it be okay to close the issue?

@mdaur mdaur closed this as completed Apr 9, 2022