Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

datapath: Fix ENI egress routing table for cilium_host IP #29335

Merged
merged 2 commits into from
Nov 27, 2023

Conversation

gandro
Copy link
Member

@gandro gandro commented Nov 22, 2023

On ENI, we install source-based egress routing rules that steer traffic from Cilium-mananged IPs (i.e. pods, but also the health IP, ingress IP and router IP) to the correct egress interface.

For pod IPs, this is done in the CNI plugin:

if err := routingInfo.Configure(
ipConfig.Address.IP,
int(conf.DeviceMTU),
conf.EgressMultiHomeIPRuleCompat,
); err != nil {

For ingress and health IP, this is done from cilium-agent:

cilium/daemon/cmd/ipam.go

Lines 401 to 405 in ed20c8a

if err := ingressRouting.Configure(
result.IP,
d.mtuConfig.GetDeviceMTU(),
option.Config.EgressMultiHomeIPRuleCompat,
); err != nil {
if err := routingConfig.Configure(
healthIP,
mtuConfig.GetDeviceMTU(),
option.Config.EgressMultiHomeIPRuleCompat,
); err != nil {

For the router (aka cilium_host) IP however, this was done differently. Commit f34371c added a new routing.SetupRules function that duplicated parts of the routing.RoutingInfo.Configure logic, but missed a crucial part: Namely the creation of the per-ENI routing table that the source-based egress rule points towards.

This means that if the cilium_host IP address was allocated from a different ENI than the pod, health and ingress IP addresses, that the routing table for that ENI was never created. This led to connectivity issues, in particular in combination with IPSec.

This commit addresses that issue by having the router IP use the same code path as the other IP users: Using RoutingInfo.Configure. This not only fixes the bug, but removes some code that was otherwise only used for the router IP.

There is one major difference between other users of RoutingInfo.Configure and the newly introduced use for the cilium_host IP: For the cilium_host IP, we skip the creation of the ingress rule (by passing in host=true), as otherwise the cilium_host IP would not be considered a local address of the host network namespace. This is consistent with the old SetupRules function did also not create such an ingress rule.

Long-term, it remains questionable if the setup of egress rules in ENI mode should be left to IPAM clients, as every client seems to do it slightly differently. Maybe this is better done by either the IPAM subsystem or a separate device manager.

Fixes: f34371c ("ipam: Add routes for cilium_host ENI address")

@gandro gandro added release-note/bug This PR fixes an issue in a previous release of Cilium. area/eni Impacts ENI based IPAM. needs-backport/1.12 needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch area/ipam Impacts IP address management functionality. needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Nov 22, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.14.5 Nov 22, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.13.10 Nov 22, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.12.17 Nov 22, 2023
@gandro
Copy link
Member Author

gandro commented Nov 22, 2023

/ci-eks

This commit extends the `Configure` method of `RoutingInfo` with a flag
to skip the creation of the ingress rule. The ingress rule is needed for
endpoints such that those are forwarded via the `main` routing table.
But for the `cilium_host` (aka. router) IP, we want to route it via the
`local` table (which would be skipped by the ingress rule). Without a
lookup in the `local` routing table, Linux will not consider
`cilium_host` to be an address of the local host, and for example not
respond to ICMP requests.

Note that this commit does not yet use `RoutingInfo.Configure` to set up
the `cilium_host` IP, this will be done in the next commit. This commit
here merely prepares the method for that and does not contain any
functional changes by itself (which can be observed by the fact that all
callers pass in `host=false`).

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
On ENI, we install source-based egress routing rules that steer traffic
from Cilium-mananged IPs (i.e. pods, but also the health IP, ingress IP
and router IP) to the correct egress interface.

For pod IPs, this is done in the CNI plugin:

https://github.com/cilium/cilium/blob/7875f6acb5a2fd2b0e3e6c993c9995c0d322e55d/plugins/cilium-cni/interface.go#L59-L63

For ingress and health IP, this is done from cilium-agent:

https://github.com/cilium/cilium/blob/ed20c8acde8c76d405d6c9fac3c9de44aa3bb403/daemon/cmd/ipam.go#L401-L405
https://github.com/cilium/cilium/blob/e49430286b5d63b00062758a10a2b37458f94525/cilium-health/launch/endpoint.go#L329-L333

For the `cilium_host` (aka router) IP however, this was done differently.
Commit f34371c added a new `routing.SetupRules` function that
duplicated parts of the `routing.RoutingInfo.Configure` logic, but
missed a crucial part: Namely the creation of the per-ENI routing table
that the source-based egress rule points towards.

This means that if the `cilium_host` IP address was allocated from a
different ENI than the pod, health and ingress IP addresses, that the
routing table for that ENI was never created. This led to connectivity
issues, in particular in combination with IPSec.

This commit addresses that issue by having the `cilium_host` IP use the same
code path as the other IP users: Using `RoutingInfo.Configure`. This not
only fixes the bug, but removes some code that was otherwise only used
for the router IP.

There is one major difference between other users of
`RoutingInfo.Configure` and the newly introduced use for the
`cilium_host` IP: For the `cilium_host` IP, we skip the creation of the
ingress rule (by passing in `host=true`), as otherwise the `cilium_host`
IP would not be considered a local address of the host network
namespace. This is consistent with the old `SetupRules` function did
also not create such an ingress rule.

Long-term, it remains questionable if the setup of egress rules in ENI
mode should be left to IPAM clients, as every client seems to do it
slightly differently. Maybe this is better done by either the IPAM subsystem
or a separate device manager.

Fixes: f34371c ("ipam: Add routes for cilium_host ENI address")

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@gandro
Copy link
Member Author

gandro commented Nov 23, 2023

I have locally tested this by installing Cilium with:

helm upgrade --install cilium ./install/kubernetes/cilium \
  --namespace kube-system \
  --set eni.enabled=true \
  --set ipam.mode=eni \
  --set egressMasqueradeInterfaces=eth0 \
  --set routingMode=native \
  --set endpointHealthChecking.enabled=false \
  --set image.override=quay.io/cilium/cilium-ci:f4b9743ac75ed66de00eb9d613ae2ae4cc85b9f9 \
  --set image.pullPolicy=IfNotPresent \
  --set=debug.enabled=true

And then adding a node to the cluster which does not have any pods scheduled on it (by scaling the node group).

Since the cilium_host is the only IP allocated on that new node (it's important to turn off endpointHealthChecking for that), we can see that before the PR, the ENI routing table was missing, where as after, the table is there as expected:

Before (running main):

root@ip-192-168-151-181:/home/cilium# ip addr show dev cilium_host
4: cilium_host@cilium_net: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP group default qlen 1000
    link/ether 8e:38:75:5f:f9:f6 brd ff:ff:ff:ff:ff:ff
    inet 192.168.155.214/32 scope global cilium_host
       valid_lft forever preferred_lft forever
    inet6 fe80::8c38:75ff:fe5f:f9f6/64 scope link 
       valid_lft forever preferred_lft forever
root@ip-192-168-151-181:/home/cilium# ip rule
9:	from all fwmark 0x200/0xf00 lookup 2004
100:	from all lookup local
109:	from all fwmark 0x80/0x80 lookup main
111:	from 192.168.155.214 to 192.168.0.0/16 lookup 10
1024:	from all fwmark 0x80/0x80 lookup main
32766:	from all lookup main
32767:	from all lookup default
root@ip-192-168-151-181:/home/cilium# ip route show table 10
Error: ipv4: FIB table does not exist.
Dump terminated

After (upgrade):

root@ip-192-168-151-181:/home/cilium# ip addr show dev cilium_host
4: cilium_host@cilium_net: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP group default qlen 1000
    link/ether 8e:38:75:5f:f9:f6 brd ff:ff:ff:ff:ff:ff
    inet 192.168.155.214/32 scope global cilium_host
       valid_lft forever preferred_lft forever
    inet6 fe80::8c38:75ff:fe5f:f9f6/64 scope link 
       valid_lft forever preferred_lft forever
root@ip-192-168-151-181:/home/cilium# ip rule
9:	from all fwmark 0x200/0xf00 lookup 2004
100:	from all lookup local
109:	from all fwmark 0x80/0x80 lookup main
111:	from 192.168.155.214 to 192.168.0.0/16 lookup 10
1024:	from all fwmark 0x80/0x80 lookup main
32766:	from all lookup main
32767:	from all lookup default
root@ip-192-168-151-181:/home/cilium# ip route show table 10
default via 192.168.128.1 dev eth0 proto kernel 
192.168.128.1 dev eth0 proto kernel scope link 

@gandro gandro marked this pull request as ready for review November 23, 2023 15:02
@gandro gandro requested review from a team as code owners November 23, 2023 15:02
@gandro gandro added the backport/author The backport will be carried out by the author of the PR. label Nov 23, 2023
@gandro
Copy link
Member Author

gandro commented Nov 23, 2023

/test

Copy link
Contributor

@rgo3 rgo3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Net negative LOC 🚀

Copy link
Member

@borkmann borkmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 🚀

@gandro gandro removed the needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch label Nov 27, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.14 in 1.14.5 Nov 27, 2023
@gandro gandro mentioned this pull request Nov 27, 2023
1 task
@gandro gandro added backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. and removed needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch labels Nov 27, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.13 in 1.13.10 Nov 27, 2023
@gandro gandro mentioned this pull request Nov 27, 2023
1 task
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.12 in 1.12.17 Nov 27, 2023
@github-actions github-actions bot added backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. and removed backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. labels Nov 28, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.14 to Backport done to v1.14 in 1.14.5 Nov 28, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed this from Backport done to v1.14 in 1.14.5 Nov 28, 2023
@github-actions github-actions bot added backport-done/1.12 The backport for Cilium 1.12.x for this PR is done. and removed backport-pending/1.12 labels Nov 28, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Backport done to v1.14 in 1.14.5 Nov 28, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed this from Backport pending to v1.12 in 1.12.17 Nov 28, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Backport done to v1.12 in 1.12.17 Nov 28, 2023
@github-actions github-actions bot added backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. and removed backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. labels Nov 28, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed this from Backport pending to v1.13 in 1.13.10 Nov 28, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Backport done to v1.13 in 1.13.10 Nov 28, 2023
@christarazi christarazi added the sig/ipam IP address management, including cloud IPAM label Nov 28, 2023
Copy link
Member

@aditighag aditighag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Appreciate the fix, and the detailed commit description. 🚀

My first thought after reading the commit description was thinking about the case(s) where users would hit such an error condition as cilium_host is one of the core interface that the agent manages.

I see that the fix is already marked for backports to older branches. 👍 Could you comment on how we could identify such issues? Was there a specific error message logged when the agent tried installing rules when the routing table wasn't created? (I suppose users could notice missing routing table(s) by running ip route commands.
On a related note, how can we prevent such cases in future for cilium managed devices? Creating a routing table in specific error conditions while setting up ENI rules is a potential option.

@gandro
Copy link
Member Author

gandro commented Nov 30, 2023

Could you comment on how we could identify such issues? Was there a specific error message logged when the agent tried installing rules when the routing table wasn't created?

The bug that was surfaced was that IPsec encrypted traffic was leaving on the wrong interface, and not on the egress interface the router IP was attached to. This lead to traffic being dropped by AWS network (which performs source IP validation). The wrong egress interface was expected, since the egress rule was defunct (without a routing table, it can't do anything). There was no log message, just broken IPsec encryption. Thus also the backport.

The main reason why we did not discover this broken egress rule earlier is that it only surfaces under the following circumstances: The router IP must be allocated from a different ENI than the pod (workload) IPs. As soon as there is at least one pod IP sharing the same ENI as the router IP, the CNI ADD call (that led to to the pod IP being allocated) would created the missing table. Because IP allocation is random, the bug basically only surfaces if one has lots of pre-allocated IPs (e.g. using min-allocate) distributed over multiple interfaces, but not many pods scheduled on the node yet.

On a related note, how can we prevent such cases in future for cilium managed devices? Creating a routing table in specific error conditions while setting up ENI rules is a potential option.

Yeah, the idea of an ENI specific device manager has been floated around. I have an open TODO on my plate to write down my thoughts on this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/eni Impacts ENI based IPAM. area/ipam Impacts IP address management functionality. backport/author The backport will be carried out by the author of the PR. backport-done/1.12 The backport for Cilium 1.12.x for this PR is done. backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium. sig/ipam IP address management, including cloud IPAM
Projects
No open projects
1.12.17
Backport done to v1.12
1.13.10
Backport done to v1.13
1.14.5
Backport done to v1.14
Status: Released
Status: Released
Status: Released
Development

Successfully merging this pull request may close these issues.

None yet

6 participants