Return-traffic for LB hairpin using loopback address as destination #23913
The problem is the missing
The 3rd entry's
A good example looks like this:
while a bad example looks like this: It seems that after LB translation, connection tracking for the new packet didn't correctly relate it to the original service with the right rev_nat index. Not exactly sure how this happens, but we know that when it does, the stale ct entry will be there, and every connection trying to reuse that source port will time out.
I have a turn-key reproduction for GKE, but while trying to reproduce on OSS master/v1.13 using kind/MetalLB I can't get the agent configured to use the loopback IP. What is a minimal agent configuration to ensure
The problem is that the loopback conntrack entries (
The conntrack entry pair now has two different lifetimes. When the
This is because we only create this entry in the special loopback case (ref: https://sourcegraph.com/github.com/cilium/cilium/-/blob/bpf/lib/conntrack.h?L961-989). Any connection that reuses the 5-tuple will perform a ct lookup for the
One workaround for this is to drop the
This reduces the window in which this issue can be encountered.
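The failure mode described above can be sketched as a toy model (all names and timeout values here are illustrative, not Cilium's actual structures or lifetimes): the entry that observes the closing FIN gets a short lifetime, while its loopback twin never sees the FIN, keeps the long "syn seen" lifetime, and goes stale.

```python
# Hypothetical model of two paired conntrack entries whose lifetimes
# diverge because only one of them observes the closing FIN.
# Timeout values are illustrative, not Cilium's real defaults.

TIMEOUT_SYN_SEEN = 60   # long lifetime for an open connection
TIMEOUT_CLOSING = 10    # short lifetime once a FIN is seen

class CtEntry:
    def __init__(self, tuple_):
        self.tuple = tuple_
        self.rx_closing = False
        self.lifetime = TIMEOUT_SYN_SEEN

    def observe_fin(self):
        # Seeing the closing FIN shortens the entry's remaining lifetime.
        self.rx_closing = True
        self.lifetime = TIMEOUT_CLOSING

# The service-facing entry sees the FIN and will expire quickly ...
service_entry = CtEntry(("pod", "service-vip", 40000, 80))
service_entry.observe_fin()

# ... but the loopback entry never sees it and keeps the long lifetime.
loopback_entry = CtEntry(("loopback-ip", "pod", 40000, 80))

# The pair now has two different lifetimes; a new connection reusing the
# same source port inside the long window matches the stale loopback
# entry and hangs.
assert service_entry.lifetime < loopback_entry.lifetime
```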
Unrelated note: found that the
/cc
It seems this was solved with #22972. I cannot reproduce it on master, but I think I can with the previous commit 80af06e.
1. Install one Pod and expose it with one Service.
2. Once the Pod is running, get the Service ClusterIP and exec into the Pod to run a curl against the ClusterIP.
3. On the anetd Pod we should see the asymmetry in the timeouts for the link-local address.
If I run with the patch from #22972, the timers seem correct now.
Confirmed that fixes the issue. Note: #22972 does change the ct states -- the Rx closing FIN(/ACK) is missed by these conntrack entries, and they therefore remain in the ct map for the TCP (syn seen) lifetime:
I think we can close it then. Thanks! /close
/reopen @aditighag can you please reopen? It seems that with a certain configuration the issue can be reproduced; we are trying to figure out exactly what that combination is. Thanks
This is easy to repro in the affected scenarios with the reproducer in #23913 (comment): exec into the Pod a while loop that curls the Service IP; it eventually hangs, which seems to match the explanations in #23913 (comment).
The SYN arrives, but the SYN-ACK never gets back because the ct entry found is not able to reverse the NAT.
I see that this is the expected behavior for Services, to set
@joestringer is this expected, that "loopback" services don't set the
It's not clear to me whether it is deliberate that we miss updating some known state on the TCP close handshake.
We've identified the trigger. When
helm:
agent flags:
With the above settings, the TCP FIN/ACK sets the
else:
Reproduction: this state diff is reproducible on
Very broadly, without looking at the details, I'd expect the transmit / receive closing bits to be set based on the perspective of the local entry upon egress or ingress of the corresponding packet. So when the pod transmits a TCP FIN A->B, I'd expect TxClosing to be set for the A->B connection at A; then when B receives it, it'd set RxClosing at B; then B should respond with the FIN-ACK, which would set TxClosing at B; the packet is routed back and received at A, where it would set RxClosing. When it comes to LB hairpin, I would posit that this should happen the same way, but for connections A->C1 and C->B, then B->C and C->A.
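The expected transitions described above can be modeled in a few lines (a hypothetical sketch; the flag and helper names are illustrative, not Cilium's actual conntrack fields): each FIN sets TxClosing on the sender's entry at egress and RxClosing on the receiver's entry at ingress, so after FIN and FIN-ACK both sides have both bits set.

```python
# Toy model of the expected TxClosing/RxClosing transitions for a
# normal TCP close handshake between endpoints A and B.

class ConnEntry:
    """Per-endpoint conntrack entry for one direction of a connection."""
    def __init__(self):
        self.tx_closing = False
        self.rx_closing = False

def send_fin(sender: ConnEntry, receiver: ConnEntry):
    # Egress at the sender sets TxClosing on its local entry;
    # ingress at the receiver sets RxClosing on its local entry.
    sender.tx_closing = True
    receiver.rx_closing = True

a, b = ConnEntry(), ConnEntry()
send_fin(a, b)  # A transmits FIN: TxClosing at A, RxClosing at B
send_fin(b, a)  # B replies FIN-ACK: TxClosing at B, RxClosing at A

# After the full close handshake, both entries carry both closing bits.
assert a.tx_closing and a.rx_closing
assert b.tx_closing and b.rx_closing
```

In the LB hairpin case the same propagation should apply per leg of the hairpinned connection; the bug discussed in this thread is effectively one leg never receiving its RxClosing update.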
Sweet, @sypakine, this clarifies everything; the bug is clear now. In LB hairpin, during all the tests and permutations we did, we never saw A setting RxClosing. It seems that depending on which features are enabled, for LB hairpin the conntrack states can be the same or different. The conntrack states drive the entry timeout, and if they differ you'll eventually hit this bug.
Yes, this is the observation I made @ #23913 (comment):
It seems that this flow dodged a wider issue because all ct entries fail to set the
Nice find! Even if this doesn't always trigger a connectivity failure, the additional conntrack entries hanging around are not ideal.
Not sure I follow all the debugging results so far, but it looks like you're on the right track 👍. For the loopback case, we should end up having three CT entries:
The underlying problem is that for the loopback case, the CT_INGRESS entry already gets created while creating the CT_EGRESS entry (to set
FYI, I've parked some WIP cleanups in https://github.com/julianwiedmann/cilium/tree/1.15-bpf-loopback-cleanups that I hope to land in
👋 hi @aojea, is there still interest in resolving this particular issue (missing
If so, #27602 would benefit from testing / contributing tests that reproduce the issue.
I have inconsistent results with different versions; I need to find a good way to consistently repro this.
The missing RevNAT should be fixed now with #27602. |
Thanks for all the work on this @julianwiedmann, @aditighag, @aojea, and @anfernee!
Is there an existing issue for this?
What happened?
tl;dr: for LB hairpin traffic, return traffic may not have its connection state restored (neither SNATed to the service IP nor DNATed to the backend address).
Observations
Example working flow:
Example flow which encountered the issue:
The failure signature appears exclusively for flows that are initially detected as `new` rather than `established` (see signature screenshot below).

I am unable to reproduce this issue on demand. I've tried inducing it by regenerating endpoints, flushing the conntrack table, etc. Looking for insight into the issue or next steps for debugging.
Context
The setup is an nginx ingress LB pod communicating via an internal LB address that is served by the same nginx pod. In this scenario, the source address is changed to the loopback address (ref: https://github.com/cilium/cilium/blob/master/bpf/lib/lb.h#L1705-L1721).
For this LB loopback scenario, the outgoing flow should:
When the reverse translation does not occur for the return traffic, the return traffic does not reach the client.
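A minimal sketch of that translation pair, assuming the hairpin behavior described above (the loopback address, helper names, and IP addresses here are hypothetical illustrations, not Cilium's implementation): the egress path DNATs the service VIP to the backend and SNATs the client to a loopback address, and the return path must reverse both steps for the reply to reach the client.

```python
# Hypothetical sketch of the LB loopback-hairpin translation and the
# reverse translation the return traffic depends on.

LOOPBACK_IP = "169.254.42.1"  # illustrative loopback SNAT address

def forward_translate(src, dst, svc_vip, backend):
    """Egress: DNAT the VIP to the backend; since the client IS the
    backend (hairpin), SNAT the source to the loopback address."""
    new_dst = backend if dst == svc_vip else dst
    new_src = LOOPBACK_IP if src == backend else src
    return new_src, new_dst

def reverse_translate(src, dst, svc_vip, backend, orig_client):
    """Return path: rev-NAT the backend source back to the service VIP
    and restore the original client as the destination."""
    new_src = svc_vip if src == backend else src
    new_dst = orig_client if dst == LOOPBACK_IP else dst
    return new_src, new_dst

# The pod talks to itself through its own Service VIP.
client = backend = "10.0.0.5"
vip = "10.96.0.10"

fwd_src, fwd_dst = forward_translate(client, vip, vip, backend)
rev_src, rev_dst = reverse_translate(backend, fwd_src, vip, backend, client)

# Reverse translation must fully restore the client's view of the flow;
# if it doesn't (the bug in this issue), the reply never arrives.
assert (rev_src, rev_dst) == (vip, client)
```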
Cilium Version
Client: 1.11.4 c18daa70f5 2022-03-10T13:43:39-08:00 go version go1.17.9 linux/amd64
Daemon: 1.11.4 c18daa70f5 2022-03-10T13:43:39-08:00 go version go1.17.9 linux/amd64
Kernel Version
Linux gke-<cluster_name>-default-pool- 5.10.147+ #1 SMP Thu Nov 10 10:14:30 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Kubernetes Version
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.14-gke.1800", GitCommit:"1eab5b8da4acab130c72aea21eb7ed3e96523ca2", GitTreeState:"clean", BuildDate:"2022-12-07T09:32:46Z", GoVersion:"go1.17.13b7", Compiler:"gc", Platform:"linux/amd64"}
Sysdump
Can acquire if required.
Relevant log output
Anything else?
`cilium monitor` output from a successful case (return traffic is DNATed to backend):

`cilium monitor` output from a failure case (return traffic is NOT NATed):

Signature:
Code of Conduct