Cilium LoadBalancer stops working if there is no other active session to the node IP (direct routing, kube-proxy-less, BGP) #22782
Hi @eripa, thanks so much for the detailed bug report! One question: if you remove the hybrid SNAT/DSR mode and just keep it to SNAT only ... essentially this line ... would it start to work then?
Thanks @borkmann, unfortunately this doesn't make any difference for this issue. I've tried SNAT, DSR and hybrid with the same results.
OK, regarding your description of the neighbour issue: you observe this happening on the worker node which contains the backend (so not the LB node), is that correct? Could you run …
In this case the backend is the LB, since the backend pod is hosted on the node and the node is announcing the LoadBalancer IP. Below is the router's routing table, where you can see the LoadBalancer IP. Finally, I don't see any ARP lookups when hitting the LoadBalancer, but I do see them when I hit the backend Pod IP directly.
Thanks for the detailed report. I think we were able to figure out what is going on. If we first look at the PCAPs of the scenario where LB traffic doesn't work, I see the following:
If we then look at the PCAPs for the Pod IP, where it does work, I see the following:
Now that the node has an ARP record for the client, the LB works, since the node will at that point start using that ARP record for the return traffic of the LB VIP as well. So we have three issues here:
My suggestion to resolve this issue is to make a dedicated CIDR for clients, let's say 192.168.6.0/24; this should force traffic in both directions to go via the router. Then make sure return traffic can reach the clients, which should be easier to diagnose in that setup, since you won't have packets bypassing your router. Let's keep this issue as a bug report for the incorrect handling of return traffic.

Some technical details: after discussing it with @borkmann, the conclusion is that the return traffic in this case will be processed by rev_nodeport_lb{4,6}. Here we perform a FIB lookup, and we suspect that this lookup fails when the node has no usable neighbour entry for the client.
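To illustrate why a dedicated client CIDR helps, here is a minimal sketch (not from the issue itself) using Python's `ipaddress` module. The point: a client inside the node's own subnet can exchange traffic with the node at L2 and bypass the router, while a client in a dedicated CIDR such as 192.168.6.0/24 forces both directions through the gateway.

```python
# Sketch: why a dedicated client CIDR forces traffic via the router.
# The subnet below is the one from the report; the helper name is ours.
import ipaddress

node_subnet = ipaddress.ip_network("192.168.5.0/24")

def needs_router(client_ip: str) -> bool:
    """True if traffic between this client and the node must be routed (L3),
    i.e. the client is outside the node's directly-attached subnet."""
    return ipaddress.ip_address(client_ip) not in node_subnet

# Client on the same subnet: ARP directly, return traffic may bypass the router.
print(needs_router("192.168.5.157"))  # False
# Client on a dedicated CIDR: both directions are forced through the router.
print(needs_router("192.168.6.10"))   # True
```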
Thanks a lot for looking into this and providing all the details and help @dylandreimerink, I really appreciate it. I moved the Kubernetes nodes over to a different subnet, and my problem is solved, but please feel free to keep the issue open as needed. Here's a quick capture from the node:
and here is the same request on the router:
and finally the client:
Thanks for testing. Yes, this clears it up, given that traffic now goes via the default gateway. At the beginning of January, I'm planning to consolidate most of the FIB functionality to address issues such as this one.
This issue has been automatically marked as stale because it has not had recent activity.
Is there an existing issue for this?
What happened?
Hello,
I have an issue with Cilium running with direct routing and the BGP control plane in kube-proxy-free mode. Please note that I'm using a pre-release version of Cilium (quay.io/cilium/cilium-ci:dab8723c01c94998fd082ae1630a58f62f19658f) in order to be able to use the BGP control plane to announce the LoadBalancer IP, which isn't part of v1.13-rc3. I did experience the same issue with the MetalLB-based approach, and upgraded in the hope that it would improve my situation. Once v1.13-rc4 is out I can switch to that version.
I have a suspicion that it's related to ARP and neighbour management (see `ip neigh show` below), but I'm not sure. If I remove the neighbour (`sudo ip neigh del 192.168.5.157 lladdr 64:4b:f0:02:00:b3 dev enp2s0`), it stops working in a seemingly identical way. When the client isn't in the node's neighbour list and traffic to the LoadBalancer is not working, any query to the Pod IP will work (and make the LoadBalancer work again, adding the client to the neighbour list). Please advise. Thank you! 🙏
Edit: a similar setup, but using k3s, MetalLB and kube-proxy with iptables, works fine, with LoadBalancer IPs always being reachable.
Problem statement
Overall the setup seems to work, but LoadBalancer access stops working after a period of inactivity (seemingly as sessions expire), until a new session is established with the node in some other way, such as SSH to the node or curl/dig to a Pod IP on the same Kubernetes node. If any session has been established outside of hitting the LoadBalancer directly, then traffic works fine; but if the client goes offline for a while and then comes back, it can no longer reach the LoadBalancer IP. It only starts working again after that client has initiated another session with the node's internal IP or a Pod IP. There are no Network Policies involved.
From what I can tell, the main issues are: a) clients fail to establish a fresh session with a LoadBalancer IP, and b) if a session is established in some other way (such as accessing the node or Pod IP), it will eventually expire and LoadBalancer traffic breaks again.
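The two failure modes above can be modelled with a tiny illustrative sketch (this is our own model, not Cilium code): LB return traffic is only deliverable while the node holds a fresh neighbour entry for the client, and that entry is only created by traffic the client initiates to the node or Pod IP directly.

```python
# Illustrative model of the reported symptom: SYN-ACKs for the LB VIP are
# only delivered while the node has a fresh neighbour entry for the client.
# All names here are ours; TTL and addresses are placeholders.
import time

class NeighbourCache:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.entries = {}  # ip -> (mac, learned_at)

    def learn(self, ip: str, mac: str) -> None:
        """Entry added when the client talks to the node/Pod IP directly
        (e.g. SSH to the node, curl/dig to a Pod IP)."""
        self.entries[ip] = (mac, time.monotonic())

    def lookup(self, ip: str):
        """Return the MAC if an entry exists and is fresh, else None."""
        entry = self.entries.get(ip)
        if entry is None:
            return None
        mac, learned_at = entry
        if time.monotonic() - learned_at > self.ttl:
            del self.entries[ip]  # session expired: LB traffic breaks again
            return None
        return mac

def can_send_lb_return_traffic(cache: NeighbourCache, client_ip: str) -> bool:
    return cache.lookup(client_ip) is not None

cache = NeighbourCache(ttl_seconds=60.0)
print(can_send_lb_return_traffic(cache, "192.168.5.157"))  # False: no entry yet
cache.learn("192.168.5.157", "64:4b:f0:02:00:b3")          # e.g. after SSH to node
print(can_send_lb_return_traffic(cache, "192.168.5.157"))  # True: entry is fresh
```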
Worth noting is that other clients with established sessions still work fine.
Details
Both clients and Kubernetes workers live on the same network subnet, 192.168.5.0/24. The Kubernetes workers announce the Pod CIDR and LoadBalancer CIDR using the BGP Control Plane (not MetalLB). No tunnel; a direct-routing, kube-proxy-free setup using BGP.
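The reporter's exact Helm values appear further below; for orientation, a kube-proxy-free, native-routing, BGP-control-plane setup of this kind is typically enabled with values along these lines. This is a hypothetical sketch, not the reporter's configuration; field names follow Cilium's Helm chart around v1.13, and the CIDR is a placeholder.

```yaml
# Hypothetical sketch of a comparable setup, not the reporter's actual values.
kubeProxyReplacement: strict        # kube-proxy-free datapath
tunnel: disabled                    # direct (native) routing, no encapsulation
autoDirectNodeRoutes: true
ipv4NativeRoutingCIDR: 10.0.0.0/8   # placeholder; set to your Pod CIDR
bgpControlPlane:
  enabled: true                     # BGP control plane (replaces MetalLB-based BGP)
```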
Non working session
If I try to access the service from a client that has been offline for a while:
I can observe the following using `hubble observe`, where the TCP handshake arrives, but the client never sees the SYN-ACK and thus retransmits the SYN. Using `tcpdump` on the Kubernetes worker (the router shows the same), I can see that the request comes in and the SYN-ACK goes out. But the client never receives it, and thus retransmits the packet (Wireshark excerpt from pcap).

Edit: captured new dumps.
TCP dump pcap: pcap-not-working.zip

pcap of a "fresh" (not working) DNS lookup attempt to the LoadBalancer: the client never receives the SYN-ACK, i.e. I cannot see this packet in `tcpdump` on the client.

pcap of the same request (working), but made directly to the Pod IP (behind the same LoadBalancer): in this capture, we can see that there's an ARP request, which populates the neighbour list on the Kubernetes worker.
I can see the routes being properly announced via BGP to the router:
Working session
If I initiate a session with the node itself, 192.168.5.20 (such as ssh), or directly to a Pod IP, it starts working and keeps working until the session expires.

Client:

and `hubble observe`:
:Cilium Version
Please note that I'm using a pre-release version of Cilium (quay.io/cilium/cilium-ci:dab8723c01c94998fd082ae1630a58f62f19658f) in order to be able to use the BGP control plane to announce the LoadBalancer IP, which isn't part of v1.13-rc3. I needed the improvement in #22397 to be able to fully replace the MetalLB-based BGP implementation.
Kernel Version
Debian 11 with backports kernel:
Kubernetes Version
Sysdump
The original zip file was too large, so I recompressed it using `xz` and then created a `zip` out of that, because GitHub doesn't accept `xz` files. I hope this is OK.

cilium-sysdump-20221217-131223.tar.xz.zip
Relevant log output
Anything else?
Linux neighbours

While having the problem on a client with IP 192.168.5.157, I cannot see it in the `ip neigh` output on the worker node. Once I initiate a session to the node, I can observe the client showing up on the host's neighbour list:
Cilium configuration
configuration options
Using Helm, relevant values:
status
BGP configuration
Code of Conduct