LoadBalancer service unable to reach external Endpoint #26515

Closed
2 tasks done
PKizzle opened this issue Jun 27, 2023 · 9 comments
Labels
area/loadbalancing Impacts load-balancing and Kubernetes service implementations kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages.

Comments

@PKizzle

PKizzle commented Jun 27, 2023

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

I point a custom Endpoint to an external IP address and use a LoadBalancer service that is linked to this Endpoint. I expect to be able to reach the external service using the LoadBalancer IP address but am unable to connect. Strangely, I can connect to the external service using the LoadBalancer IP address from any node but not from outside the cluster.

Cilium Version

Client: 1.14.0-snapshot.4 6c8db75 2023-06-16T12:17:20-07:00 go version go1.20.5 linux/arm64
Daemon: 1.14.0-snapshot.4 6c8db75 2023-06-16T12:17:20-07:00 go version go1.20.5 linux/arm64

Kernel Version

6.1.21-v8+ aarch64 GNU/Linux

Kubernetes Version

v1.27.2+k3s1

Metallb Version

v0.13.10

Sysdump

cilium-sysdump-20230627-234741.tar.gz

The sysdump file extension had to be modified in order to upload to GitHub. Change it back to .tar.zst before attempting to unpack.

Relevant log output

msg="Kubernetes service definition changed" action=service-updated endpoints="${EXT_IPV4}:6690/TCP,${EXT_IPV6}:6690/TCP" k8sNamespace=network k8sSvcName=drive-server old-endpoints= old-service=nil service="frontends:[10.43.41.69 fd99::187]/ports=[drive-server]/selector=map[]" subsys=k8s-watcher
msg="Upserting service" backends="[${EXT_IPV4}:6690]" l7LBFrontendPorts="[]" l7LBProxyPort=0 loadBalancerSourceRanges="[]" serviceIP="{10.43.41.69 {TCP 6690} 0}" serviceName=drive-server serviceNamespace=network sessionAffinity=false sessionAffinityTimeout=0 subsys=service svcExtTrafficPolicy=Cluster svcHealthCheckNodePort=0 svcIntTrafficPolicy=Cluster svcType=ClusterIP
msg="Acquired service ID" backends="[${EXT_IPV4}:6690]" l7LBFrontendPorts="[]" l7LBProxyPort=0 loadBalancerSourceRanges="[]" serviceID=56 serviceIP="{10.43.41.69 {TCP 6690} 0}" serviceName=drive-server serviceNamespace=network sessionAffinity=false sessionAffinityTimeout=0 subsys=service svcExtTrafficPolicy=Cluster svcHealthCheckNodePort=0 svcIntTrafficPolicy=Cluster svcType=ClusterIP

Anything else?

I switched the CNI plugin from tigera/calico, where this same configuration worked fine.

Follow these steps to re-create the issue:

  1. Create a custom Endpoint pointing to an external IP address
  2. Create a LoadBalancer service with the same name as the endpoint
  3. Try to connect to external service using the LoadBalancer IP address
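
The steps above can be sketched as a pair of manifests. This is a minimal sketch, not the reporter's actual configuration: the name `drive-server`, the namespace `network`, and port 6690 come from the log output above, the address 192.168.178.55 is the external Endpoint IP named later in the thread, and the helper `build_manifests` is hypothetical. MetalLB address-pool settings are omitted. The script emits JSON, which `kubectl apply -f -` accepts.

```python
import json


def build_manifests(name, namespace, external_ip, port):
    """Return (Endpoints, Service) manifests for a selector-less LoadBalancer service."""
    endpoints = {
        "apiVersion": "v1",
        "kind": "Endpoints",
        # The Endpoints object must share the Service's name and namespace.
        "metadata": {"name": name, "namespace": namespace},
        "subsets": [
            {
                "addresses": [{"ip": external_ip}],
                "ports": [{"name": name, "port": port, "protocol": "TCP"}],
            }
        ],
    }
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "LoadBalancer",
            # Deliberately no "selector": the manually created Endpoints
            # object above supplies the backend address.
            "ports": [
                {"name": name, "port": port, "targetPort": port, "protocol": "TCP"}
            ],
        },
    }
    return endpoints, service


if __name__ == "__main__":
    eps, svc = build_manifests("drive-server", "network", "192.168.178.55", 6690)
    manifest_list = {"apiVersion": "v1", "kind": "List", "items": [eps, svc]}
    print(json.dumps(manifest_list, indent=2))
```

Assuming the script is saved as `manifests.py` (a hypothetical file name), it could be applied with `python3 manifests.py | kubectl apply -f -`.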

Code of Conduct

  • I agree to follow this project's Code of Conduct
@PKizzle PKizzle added kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. needs/triage This issue requires triaging to establish severity and next steps. labels Jun 27, 2023
@ldelossa ldelossa added sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. and removed needs/triage This issue requires triaging to establish severity and next steps. labels Jun 28, 2023
@PKizzle
Author

PKizzle commented Jun 29, 2023

The same behavior occurs with version 1.14.0-rc.0.

@brb
Member

brb commented Jun 29, 2023

> Strangely, I can connect to the external service using the LoadBalancer IP address from any node.

Thanks for the issue. Are you saying that you are able to connect to the service from outside the cluster, but not from inside the cluster?

@brb brb added the area/loadbalancing Impacts load-balancing and Kubernetes service implementations label Jun 29, 2023
@PKizzle
Author

PKizzle commented Jun 29, 2023

I am able to connect from within the cluster and from the host network of a Kubernetes node, but not from any device outside the cluster.

@PKizzle
Author

PKizzle commented Jul 5, 2023

This is the tcpdump when using telnet to connect to the port from one of the cluster's nodes:

listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
20:17:08.464752 IP6 ${NODE_IPV6}.43216 > ${DEST_IPV6}.6690: Flags [S], seq 3098569786, win 42700, options [mss 1220,sackOK,TS val 2067486859 ecr 0,nop,wscale 11], length 0
20:17:08.465545 IP6 ${DEST_IPV6}.6690 > ${NODE_IPV6}.43216: Flags [S.], seq 168912877, ack 3098569787, win 24160, options [mss 1220,sackOK,TS val 873284922 ecr 2067486859,nop,wscale 7], length 0
20:17:08.465814 IP6 ${NODE_IPV6}.43216 > ${DEST_IPV6}.6690: Flags [.], ack 1, win 21, options [nop,nop,TS val 2067486860 ecr 873284922], length 0
20:17:09.202896 IP6 ${NODE_IPV6}.43216 > ${DEST_IPV6}.6690: Flags [P.], seq 1:6, ack 1, win 21, options [nop,nop,TS val 2067487597 ecr 873284922], length 5
20:17:09.203044 IP6 ${DEST_IPV6}.6690 > ${NODE_IPV6}.43216: Flags [.], ack 6, win 189, options [nop,nop,TS val 873285660 ecr 2067487597], length 0
20:17:09.492395 IP6 ${NODE_IPV6}.43216 > ${DEST_IPV6}.6690: Flags [P.], seq 6:11, ack 1, win 21, options [nop,nop,TS val 2067487886 ecr 873285660], length 5
20:17:09.492491 IP6 ${DEST_IPV6}.6690 > ${NODE_IPV6}.43216: Flags [.], ack 11, win 189, options [nop,nop,TS val 873285949 ecr 2067487886], length 0
20:17:09.492852 IP6 ${DEST_IPV6}.6690 > ${NODE_IPV6}.43216: Flags [F.], seq 1, ack 11, win 189, options [nop,nop,TS val 873285949 ecr 2067487886], length 0
20:17:09.493026 IP6 ${NODE_IPV6}.43216 > ${DEST_IPV6}.6690: Flags [F.], seq 11, ack 2, win 21, options [nop,nop,TS val 2067487887 ecr 873285949], length 0
20:17:09.493193 IP6 ${DEST_IPV6}.6690 > ${NODE_IPV6}.43216: Flags [.], ack 12, win 189, options [nop,nop,TS val 873285950 ecr 2067487887], length 0
20:17:10.589406 IP6 ${NODE_IPV6}.43222 > ${DEST_IPV6}.6690: Flags [S], seq 193202518, win 42700, options [mss 1220,sackOK,TS val 2067488983 ecr 0,nop,wscale 11], length 0
20:17:10.589939 IP6 ${DEST_IPV6}.6690 > ${NODE_IPV6}.43222: Flags [S.], seq 4198144842, ack 193202519, win 24160, options [mss 1220,sackOK,TS val 873287046 ecr 2067488983,nop,wscale 7], length 0
20:17:10.590157 IP6 ${NODE_IPV6}.43222 > ${DEST_IPV6}.6690: Flags [.], ack 1, win 21, options [nop,nop,TS val 2067488984 ecr 873287046], length 0
20:17:11.639823 IP6 ${NODE_IPV6}.43222 > ${DEST_IPV6}.6690: Flags [P.], seq 1:6, ack 1, win 21, options [nop,nop,TS val 2067490034 ecr 873287046], length 5
20:17:11.639936 IP6 ${DEST_IPV6}.6690 > ${NODE_IPV6}.43222: Flags [.], ack 6, win 189, options [nop,nop,TS val 873288097 ecr 2067490034], length 0
20:17:11.750522 IP6 ${NODE_IPV6}.43222 > ${DEST_IPV6}.6690: Flags [P.], seq 6:11, ack 1, win 21, options [nop,nop,TS val 2067490144 ecr 873288097], length 5
20:17:11.750623 IP6 ${DEST_IPV6}.6690 > ${NODE_IPV6}.43222: Flags [.], ack 11, win 189, options [nop,nop,TS val 873288207 ecr 2067490144], length 0
20:17:11.751281 IP6 ${DEST_IPV6}.6690 > ${NODE_IPV6}.43222: Flags [F.], seq 1, ack 11, win 189, options [nop,nop,TS val 873288207 ecr 2067490144], length 0
20:17:11.751479 IP6 ${NODE_IPV6}.43222 > ${DEST_IPV6}.6690: Flags [F.], seq 11, ack 2, win 21, options [nop,nop,TS val 2067490145 ecr 873288207], length 0
20:17:11.751614 IP6 ${DEST_IPV6}.6690 > ${NODE_IPV6}.43222: Flags [.], ack 12, win 189, options [nop,nop,TS val 873288208 ecr 2067490145], length 0
^C
20 packets captured
20 packets received by filter
0 packets dropped by kernel

When I try to telnet from a device outside the cluster this is what Wireshark captures:
[Screenshot: Wireshark capture, "CleanShot 2023-07-05 at 20 33 22"]

192.168.178.13 is the external device running telnet
192.168.178.239 is the LoadBalancer IP
192.168.178.55 is the IP specified in the external Endpoint

@PKizzle
Author

PKizzle commented Jul 6, 2023

I think I found the issue. As can be seen from the Wireshark capture above, the external endpoint's reply is sent directly to the telnet client without its source address being changed to the LoadBalancer's IP address. Compared with another external endpoint that is exposed via an ingress (not directly via a LoadBalancer IP), this is the only difference between the two.
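
Why an unrewritten source address breaks the connection can be illustrated with a simplified model of how a TCP client matches replies to a connection. This is a sketch, not real kernel logic: the `Packet` type, the `reply_matches` helper, and the ephemeral port 50000 are assumptions made for illustration, and the addresses are the ones listed for the Wireshark capture.

```python
from typing import NamedTuple


class Packet(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


def reply_matches(syn: Packet, reply: Packet) -> bool:
    """True if `reply` mirrors `syn`, i.e. it belongs to the same connection."""
    return (
        reply.src_ip == syn.dst_ip
        and reply.src_port == syn.dst_port
        and reply.dst_ip == syn.src_ip
        and reply.dst_port == syn.src_port
    )


CLIENT = "192.168.178.13"   # external device running telnet
LB_IP = "192.168.178.239"   # LoadBalancer IP
BACKEND = "192.168.178.55"  # IP in the external Endpoint

# The client opens a connection to the LoadBalancer IP.
syn = Packet(CLIENT, 50000, LB_IP, 6690)  # 50000: arbitrary ephemeral port

# Reply arriving straight from the external endpoint, source not rewritten.
reply_from_backend = Packet(BACKEND, 6690, CLIENT, 50000)
# Reply whose source was rewritten back to the LoadBalancer IP.
reply_from_lb = Packet(LB_IP, 6690, CLIENT, 50000)

print(reply_matches(syn, reply_from_backend))  # False
print(reply_matches(syn, reply_from_lb))       # True
```

The client only knows about a connection to 192.168.178.239, so a SYN-ACK sourced from 192.168.178.55 matches no connection and the handshake never completes.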

@PKizzle
Author

PKizzle commented Jul 8, 2023

OK, so this is related to DSR. When using SNAT mode, everything works as expected. This issue also seems very similar to #26407, just for a LoadBalancer service rather than a NodePort one. @brb Is there any further data you need from my side to better understand the issue?

@brb
Member

brb commented Jul 17, 2023

> OK, so this is related to DSR. When using SNAT mode, everything works as expected.

Aha, so I think the failure is expected. The owner of the external IP (the host behind the custom Endpoint) does not run any Cilium BPF program that could handle the DSR reply. In particular, the reply needs to be rev-DNATed, that is, its source rewritten back to the LoadBalancer IP, before it is sent to the client.
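
The difference between the two modes can be sketched as a toy model of which source address the client ends up seeing. This is an illustration under stated assumptions, not Cilium's datapath: the function `reply_source_at_client` and its parameters are hypothetical, and the model ignores ports, connection tracking, and how DSR conveys the original service address to the backend.

```python
def reply_source_at_client(mode, service_ip, backend_ip, backend_has_cilium):
    """Return the source IP of the reply as it arrives at the client (toy model)."""
    if mode == "snat":
        # The load-balancer node rewrites the client's source to its own
        # address, so the backend replies to the node. The node then undoes
        # both translations and the client sees the service IP.
        return service_ip
    if mode == "dsr":
        # The backend replies directly to the client. Only a Cilium datapath
        # on the backend can rev-DNAT the reply to the service IP; a plain
        # external host answers from its own address.
        return service_ip if backend_has_cilium else backend_ip
    raise ValueError(f"unknown mode: {mode!r}")


LB_IP = "192.168.178.239"   # LoadBalancer IP from the thread
BACKEND = "192.168.178.55"  # external Endpoint IP from the thread

for mode, has_cilium in [("snat", False), ("dsr", True), ("dsr", False)]:
    src = reply_source_at_client(mode, LB_IP, BACKEND, has_cilium)
    status = "accepted" if src == LB_IP else "no matching connection"
    print(f"{mode:4} cilium-on-backend={has_cilium!s:5} -> reply from {src} ({status})")
```

Only the last combination, DSR with a backend that has no Cilium datapath, delivers a reply from 192.168.178.55, which matches the behavior reported in this issue.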

@PKizzle
Author

PKizzle commented Jul 17, 2023

Would it be possible to add this special case to the hybrid mode and let the endpoint use SNAT?

@brb
Member

brb commented Jul 17, 2023

Today, the hybrid mode is per L4 protocol (i.e., TCP / UDP), and not per service type / configuration. Feel free to create a CFP for the latter.
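
The per-protocol selection described above can be sketched as follows. This is a minimal model, assuming the behavior described in the Cilium documentation for its `loadBalancer.mode` Helm value (`snat`, `dsr`, or `hybrid`, where hybrid uses DSR for TCP and SNAT for UDP). The function `forwarding_mode` is hypothetical and is not part of Cilium.

```python
def forwarding_mode(lb_mode: str, protocol: str) -> str:
    """Return 'dsr' or 'snat' for a load-balancer mode and an L4 protocol."""
    protocol = protocol.upper()
    if lb_mode in ("snat", "dsr"):
        # A fixed mode applies to every service regardless of protocol.
        return lb_mode
    if lb_mode == "hybrid":
        # The choice depends only on the L4 protocol, not on the service
        # type or on whether its backends are external.
        return "dsr" if protocol == "TCP" else "snat"
    raise ValueError(f"unknown load-balancer mode: {lb_mode!r}")


# The service in this issue listens on TCP 6690.
print(forwarding_mode("hybrid", "TCP"))  # dsr
print(forwarding_mode("hybrid", "UDP"))  # snat
print(forwarding_mode("snat", "TCP"))    # snat
```

Because the service in this issue is TCP, hybrid mode would still pick DSR for it. Within this model, only the cluster-wide `snat` mode routes the external endpoint's replies back through the load-balancer node.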

3 participants