L2 Missing Routes in native routing #26588
I see this error in your log (this is a log for a single node, but all nodes have a similar error).
This happens when your nodes are not L2-reachable from each other. In this case, the logs say your node
Those log entries are a red herring (I think?). I don't need routes to the external IP from pods, so those routes aren't the issue. What is missing are the actual L2 routes to the other nodes (which don't involve the external IP at all). (In fact, creating a route to the external IP from the internal VLAN will totally fubar the cluster, so I'm happy that it is failing to do that.)
This log message is a bit confusing, but it doesn't try to insert that route (see cilium/pkg/datapath/linux/node.go, lines 222 to 226 in 043ff5b).
I think there's a bug in the
That was a good hint (I'll have to see if it creates routes), but the error message goes away when adding onlink routes to the other two nodes:
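The exact commands weren't preserved above. For illustration, onlink routes of this shape might look as follows; the addresses and interface name here are hypothetical, not taken from the actual cluster:

```shell
# Hypothetical sketch: make the two peer nodes reachable over the
# internal VLAN by marking their management IPs as onlink. The "onlink"
# flag tells the kernel the next hop is directly reachable on this
# interface, even without a matching subnet route.
ip route add 65.0.0.2/32 dev vlan4000 onlink
ip route add 65.0.0.3/32 dev vlan4000 onlink
```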
I think you did the right workaround. Now
For the moment, I recommend you go with this workaround. I can work on modifying Cilium to take
Thanks for the quick response. Indeed, this workaround works (most other things on the main network are using ipv6, so are unaffected by any of this -- monitoring/ssh/etc), from what I can tell. I'm very curious how it is working, so I'll probably be breaking out tcpdump in the near future just to understand what is really going on.
Really appreciate this and all the hard work that has gone into this project. It truly is a very powerful tool.
@withinboredom BTW, after you did this #26588 (comment), did the pods become reachable to each other? I'm wondering how your external network forwards the traffic to the right node without knowing the PodCIDR => NodeIP mapping. More specifically, who is
The pods were previously reachable after adding the manual L2 routes. I had a script run on boot that called

Once I added the onlink route to the other nodes' "management" IP (which is public fwiw, the 65.x IPs you were asking about) and attached it to the internal VLAN, traffic going between nodes moved off of the 'real' physical network and into the VLAN. Linux appears to be fine with this and will happily respond on the VLAN with the 65.x IPs. So, full connectivity is now in place, without the node-local policy. Cilium sees this as a completed route and now creates the "appropriate" routes without my script.

The bug (I think it's a bug) is that Cilium didn't see that it was on a VLAN and seemed to only see the physical NIC, so it thought there had to be a direct route or something and didn't create the routes. If it just blindly created the routes, like my script did, it would have been fine. That's what led me to report this in the first place.

FWIW, this is the script
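The script itself wasn't captured here. A minimal sketch of such a boot script might look like this, assuming a made-up VLAN interface name, peer management IPs, and PodCIDRs (none of these values come from the actual cluster):

```shell
#!/bin/sh
# Hypothetical sketch of a boot script that adds onlink routes to peer
# nodes over the internal VLAN. All addresses, interface names, and
# PodCIDRs below are illustrative assumptions.

VLAN_IF="vlan4000"   # internal VLAN interface (assumption)

add_peer() {
    peer_ip="$1"
    pod_cidr="$2"
    # Mark the peer's management IP as directly reachable on the VLAN,
    # then route its PodCIDR via that IP. "replace" makes reruns safe.
    ip route replace "$peer_ip/32" dev "$VLAN_IF" onlink
    ip route replace "$pod_cidr" via "$peer_ip" dev "$VLAN_IF" onlink
}

add_peer 65.0.0.2 10.0.1.0/24
add_peer 65.0.0.3 10.0.2.0/24
```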
Might be worth showing the network topology a bit:
There's only a single physical NIC, but several networks.
Is there an existing issue for this?
What happened?
When starting cilium on a multi-node cluster, and native routing, I expected it to 'just work' as long as the L2 network is configured.
How cilium was installed:
However, once cilium starts up, it only configures the route on the existing node. Example of the routing table:
Working routes (manually adding the routes from other nodes):
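The actual routing table wasn't preserved here. A working setup of that shape could be produced with commands like the following; the PodCIDRs, node IPs, and interface name are hypothetical, for illustration only:

```shell
# Manually adding per-node PodCIDR routes from each other node;
# addresses and interface are illustrative assumptions.
ip route add 10.0.1.0/24 via 65.0.0.2 dev vlan4000 onlink
ip route add 10.0.2.0/24 via 65.0.0.3 dev vlan4000 onlink
# Afterwards, `ip route show` would include entries like:
#   10.0.1.0/24 via 65.0.0.2 dev vlan4000 onlink
```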
Scripting this to handle node reboots and changes is not all that difficult, but it seems like Cilium itself is better equipped to handle this.
Cilium Version
Kernel Version
Linux capital 5.15.0-76-generic #83-Ubuntu SMP Thu Jun 15 19:16:32 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Kubernetes Version
Sysdump
cilium-sysdump-20230702-072346.zip
Relevant log output
No response
Anything else?
No response
Code of Conduct