L2 Missing Routes in native routing #26588

Open
withinboredom opened this issue Jul 2, 2023 · 9 comments
Labels
kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages.

Comments

@withinboredom

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

When starting Cilium on a multi-node cluster with native routing, I expected it to 'just work' as long as the L2 network is configured.

How Cilium was installed:

cilium upgrade --set kubeProxyReplacement=strict --set k8sServiceHost=cluster.bottled.codes --set k8sServicePort=6443 --datapath-mode native --set tunnel=disabled --set ipv4NativeRoutingCIDR=10.0.0.0/8 --set autoDirectNodeRoutes=true --set bpf.masquerade=true --set endpointRoutes.enabled=false --set loadBalancer.mode=dsr --set bpf.hostLegacyRouting=false

However, once Cilium starts up, it only configures the route for the local node. Example of the routing table:

default via 65.108.75.193 dev enp8s0 proto static onlink
10.0.0.0/24 via 10.0.0.227 dev cilium_host src 10.0.0.227 <-- cilium added
10.0.0.0/8 dev internal proto kernel scope link src 10.3.0.0
10.0.0.227 dev cilium_host scope link
192.168.0.0/16 dev external proto kernel scope link src 192.168.100.3

Working routes (manually adding the routes from other nodes):

default via 65.108.75.193 dev enp8s0 proto static onlink
10.0.0.0/24 via 10.0.0.227 dev cilium_host src 10.0.0.227 <-- cilium added
10.0.0.0/8 dev internal proto kernel scope link src 10.3.0.0
10.0.0.227 dev cilium_host scope link
10.0.1.0/24 via 10.0.1.118 dev internal <-- manual
10.0.2.0/24 via 10.0.2.36 dev internal <-- manual
192.168.0.0/16 dev external proto kernel scope link src 192.168.100.3

Scripting this to handle node reboots and changes is not all that difficult, but it seems like Cilium itself is better equipped to handle this.
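Roughly, such a script just re-adds the per-node routes marked "manual" above, with each node's pod CIDR gateway discovered at boot (a minimal sketch, not the exact script):

ip route replace 10.0.1.0/24 via 10.0.1.118 dev internal
ip route replace 10.0.2.0/24 via 10.0.2.36 dev internal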

Cilium Version

cilium-cli: v0.15.0 compiled with go1.20.4 on linux/amd64
cilium image (default): v1.13.4
cilium image (stable): v1.13.4
cilium image (running): 1.13.4

Kernel Version

Linux capital 5.15.0-76-generic #83-Ubuntu SMP Thu Jun 15 19:16:32 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Kubernetes Version

{
  "clientVersion": {
    "major": "1",
    "minor": "26",
    "gitVersion": "v1.26.6+k3s1",
    "gitCommit": "3b1919b0d55811707bd1168f0abf11cccc656c26",
    "gitTreeState": "clean",
    "buildDate": "2023-06-26T17:51:14Z",
    "goVersion": "go1.19.10",
    "compiler": "gc",
    "platform": "linux/amd64"
  },
  "kustomizeVersion": "v4.5.7",
  "serverVersion": {
    "major": "1",
    "minor": "26",
    "gitVersion": "v1.26.6+k3s1",
    "gitCommit": "3b1919b0d55811707bd1168f0abf11cccc656c26",
    "gitTreeState": "clean",
    "buildDate": "2023-06-26T17:51:14Z",
    "goVersion": "go1.19.10",
    "compiler": "gc",
    "platform": "linux/amd64"
  }
}

Sysdump

cilium-sysdump-20230702-072346.zip

Relevant log output

No response

Anything else?

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@withinboredom withinboredom added kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. needs/triage This issue requires triaging to establish severity and next steps. labels Jul 2, 2023
@YutaroHayakawa YutaroHayakawa added sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. and removed needs/triage This issue requires triaging to establish severity and next steps. labels Jul 3, 2023
@YutaroHayakawa
Member

I see this error in your log (this is a log for a single node, but all nodes have a similar error).

2023-07-02T05:05:22.840850290Z level=warning msg="Unable to install direct node route {Ifindex: 0 Dst: 10.0.1.0/24 Src: <nil> Gw: 65.108.66.126 Flags: [] Table: 0 Realm: 0}" error="route to destination 65.108.66.126 contains gateway 65.108.6.193, must be directly reachable" subsys=linux-datapath
2023-07-02T05:05:22.840877850Z level=warning msg="Unable to install direct node route {Ifindex: 0 Dst: 10.0.0.0/24 Src: <nil> Gw: 65.108.75.198 Flags: [] Table: 0 Realm: 0}" error="route to destination 65.108.75.198 contains gateway 65.108.6.193, must be directly reachable" subsys=linux-datapath

This happens when your nodes are not L2 reachable from each other. In this case, the logs say your node capital (65.108.66.126) is not directly (L2) reachable from cantor. Looking at the address assignment of cantor, I see a /32 address assigned to the interface.

2: enp9s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc fq state UP group default qlen 1000
    link/ether a8:a1:59:8e:72:b5 brd ff:ff:ff:ff:ff:ff
    inet 65.108.6.254/32 scope global enp9s0
       valid_lft forever preferred_lft forever
    inet6 2a01:4f9:6a:4297::2/128 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::aaa1:59ff:fe8e:72b5/64 scope link
       valid_lft forever preferred_lft forever

With a /32, you cannot directly reach any other node over L2.

@withinboredom
Author

withinboredom commented Jul 4, 2023

Those log entries are a red herring (I think?). I don't need routes to the external IPs from pods. Those routes are onlink and don't need any special help. They are public IPs, so the routing for them lives outside the cluster.

What is missing are the actual L2 routes to the other nodes (which don't involve the external IPs at all).

(In fact, creating a route to the external IP from the internal VLAN will totally fubar the cluster, so I'm happy that it is failing to do that)

@YutaroHayakawa
Member

This log message is a bit confusing, but it doesn't try to insert routes to the external IP from pods. In this case, what it tries to insert is <podCIDR of capital> via <capital's nodeIP>. The error message is generated from here.

if routes[0].Gw != nil && !routes[0].Gw.IsUnspecified() && !routes[0].Gw.Equal(nodeIP) {
	err = fmt.Errorf("route to destination %s contains gateway %s, must be directly reachable",
		nodeIP, routes[0].Gw.String())
	return
}

I think there's a bug in the --set autoDirectNodeRoutes=true implementation. Basically, it does an ip route get <capital's nodeIP> and checks the next hop. If there is a next hop, it fails. However, this doesn't take onlink routes into account.
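For illustration, on cantor that check effectively sees something like this (output approximated from your log):

$ ip route get 65.108.66.126
65.108.66.126 via 65.108.6.193 dev enp9s0 src 65.108.6.254

Because a next hop (65.108.6.193) is present, the snippet above returns the "must be directly reachable" error, even though that route is an onlink one.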

@withinboredom
Author

withinboredom commented Jul 5, 2023

That was a good hint (I'll have to see if it creates routes), but the error message goes away when adding onlink routes to the other two nodes:

ip route
default via 65.108.66.65 dev enp9s0 proto static onlink
10.0.0.0/24 via 65.108.75.198 dev internal <-- I think cilium did this one
10.0.0.0/8 dev internal proto kernel scope link src 10.1.0.0
10.0.1.0/24 via 10.0.1.48 dev cilium_host src 10.0.1.48
10.0.1.48 dev cilium_host scope link
10.0.2.0/24 via 65.108.6.254 dev internal
65.108.6.254 dev internal scope link <-- added
65.108.75.198 dev internal scope link <-- added
167.235.212.72/29 dev external proto kernel scope link src 167.235.212.72
192.168.0.0/16 dev external proto kernel scope link src 192.168.100.4

@YutaroHayakawa
Member

YutaroHayakawa commented Jul 5, 2023

I think you did the right workaround. Now 65.108.6.254 and 65.108.75.198 are in the link scope, so the check I mentioned above succeeds because the NodeIPs of the other nodes are L2 reachable. That's why Cilium added

10.0.0.0/24 via 65.108.75.198 dev internal
10.0.2.0/24 via 65.108.6.254 dev internal

For the moment, I recommend you go with this workaround. I can work on modifying Cilium to take onlink into account.
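For reference, on each node the workaround boils down to adding a link-scope route toward each of the other nodes' management IPs over the internal VLAN device, i.e. the two routes marked "added" in your table (the addresses differ per node):

ip route add 65.108.6.254 dev internal scope link
ip route add 65.108.75.198 dev internal scope link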

@YutaroHayakawa YutaroHayakawa self-assigned this Jul 5, 2023
@withinboredom
Author

Thanks for the quick response. Indeed, this workaround works, from what I can tell (most other things on the main network use IPv6, so monitoring/SSH/etc. are unaffected by any of this). I'm very curious how it is working, so I'll probably be breaking out tcpdump in the near future just to understand what is really going on.

I can work on modifying Cilium to take onlink into account.

Really appreciate this and all the hard work that has gone into this project. It truly is a very powerful tool.

@YutaroHayakawa
Member

YutaroHayakawa commented Jul 6, 2023

@withinboredom BTW after you did this #26588 (comment), did the pods become reachable to each other? I'm wondering how your external network forwards the traffic to the right node without knowing the PodCIDR => NodeIP mapping. More specifically, who is 65.108.66.65 and what does it do?

@withinboredom
Author

The pods were previously reachable after adding the manual L2 routes. I had a script run on boot that called cilium status on the other two nodes to get the 10.0.0.0/8 gateway for each node. They're all on the same L2 network, so it just works™️. The only issue I had was contacting other nodes' IPs (aka the kube-apiserver); the local one was fine. I worked around that by creating a node-local policy in Cilium.

Once I added the onlink routes to the other nodes' "management" IPs (which are public FWIW, the 65.x IPs you were asking about) and attached them to the internal VLAN, traffic going between nodes moved off of the 'real' physical network and into the VLAN. Linux appears to be fine with this and will happily respond on the VLAN with the 65.x IPs. So, full connectivity is now in place, without the node-local policy. Cilium sees this as a completed route and now creates the "appropriate" routes without my script.

The bug (I think it's a bug) is that Cilium didn't see that it was on a VLAN and seemed to only see the physical NIC, so it thought there had to be a direct route or something and didn't create the routes. If it had just blindly created the routes, like my script did, it would have been fine. That's what led me to report this in the first place.

FWIW, this is the script

@withinboredom
Author

Might be worth showing the network topology a bit:

  |--------internet------|-------- [ssh/management/etc]
  |           |          |
  |--external-VLAN-------|-------- [load balancer : external ip pool : MetalLB]
  |           |          |
  |--internal-VLAN-------|-------- [workers]
  |           |          |    
Node1       Node2      Node3

There's only a single physical NIC, but several networks.
