Node clusterpool Endpoint CIDRs used in AWS ENI mode erroneously #22273

Closed · jkulesa opened this issue Nov 19, 2022 · 11 comments
Labels
area/eni: Impacts ENI based IPAM.
kind/bug: This is a bug in the Cilium logic.
kind/community-report: This was reported by a user in the Cilium community, e.g. via Slack.
needs/triage: This issue requires triaging to establish severity and next steps.
sig/agent: Cilium agent related.
sig/ipam: IP address management, including cloud IPAM.
stale: The stale bot thinks this issue is old. Add the "pinned" label to prevent this from becoming stale.

Comments

jkulesa commented Nov 19, 2022

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

When using Cilium 1.12.4 in native AWS ENI mode (not chaining), if you log in to a Cilium pod you can see the following:

$ cilium node list
Name                                           IPv4 Address     Endpoint CIDR   IPv6 Address   Endpoint CIDR
ip-10-196-160-103.us-west-2.compute.internal   10.196.160.103   10.103.0.0/16
ip-10-196-161-128.us-west-2.compute.internal   10.196.161.128   10.128.0.0/16

That endpoint CIDR is interesting, since it's not any range we've specified (10.196.x.x or 10.10.x.x). That range looks like it comes from here:

DefaultIPv4Prefix = "10.%d.0.1"

But since we are using ENI mode, there are no cluster pools to configure, nor can they be set via the ConfigMap (or Helm).
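The pattern is visible in the output above: the last octet of each node's IPv4 address shows up as the second octet of the fabricated Endpoint CIDR (10.196.160.103 becomes 10.103.0.0/16). A minimal Go sketch of that observed mapping (my reconstruction from the output, not the actual Cilium code path):

package main

import (
    "fmt"
    "net"
)

// Mirrors the constant quoted above; filling %d with the node IP's last
// octet is an assumption based on the `cilium node list` output.
const DefaultIPv4Prefix = "10.%d.0.1"

func fabricatedEndpointCIDR(nodeIP net.IP) string {
    v4 := nodeIP.To4()
    addr := fmt.Sprintf(DefaultIPv4Prefix, v4[3])
    // The agent then appears to use a /16 allocation range around this address.
    _, cidr, _ := net.ParseCIDR(addr + "/16")
    return cidr.String()
}

func main() {
    // 10.196.160.103 -> 10.103.0.0/16, matching the node list above.
    fmt.Println(fabricatedEndpointCIDR(net.ParseIP("10.196.160.103")))
}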

The problem is that this node data is used in various places. One of them is iptables masquerading:

node.GetIPv4AllocRange().String(),

...and so this inserts ranges into the masquerade iptables rules that don't route or participate in any way. That would make sense when using cluster pools, but not in ENI mode (BPF masquerade mode seems fine in this case).

Another place is here:

if routeSpec, err := installDirectRoute(cidr, newIP); err != nil {

In this case it doesn't break anything, but it tries to install routes that shouldn't exist, with those 10.x.0.0/16 ranges as the destination:

level=warning msg="Unable to install direct node route {Ifindex: 0 Dst: 10.194.0.0/16 Src: <nil> Gw: 10.196.164.194 Flags: [] Table: 0 Realm: 0}" error="route to destination 10.196.164.194 contains gateway 10.196.162.1, must be directly reachable" subsys=linux-datapath
level=warning msg="Unable to install direct node route {Ifindex: 0 Dst: 10.103.0.0/16 Src: <nil> Gw: 10.196.160.103 Flags: [] Table: 0 Realm: 0}" error="route to destination 10.196.160.103 contains gateway 10.196.162.1, must be directly reachable" subsys=linux-datapath
level=warning msg="Unable to install direct node route {Ifindex: 0 Dst: 10.128.0.0/16 Src: <nil> Gw: 10.196.161.128 Flags: [] Table: 0 Realm: 0}" error="route to destination 10.196.161.128 contains gateway 10.196.162.1, must be directly reachable" subsys=linux-datapath
level=warning msg="Unable to install direct node route {Ifindex: 0 Dst: 10.216.0.0/16 Src: <nil> Gw: 10.196.165.216 Flags: [] Table: 0 Realm: 0}" error="route to destination 10.196.165.216 contains gateway 10.196.162.1, must be directly reachable" subsys=linux-datapath
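For reference, the failing install can be reproduced outside Cilium with a small standalone program. This sketch assumes the vishvananda/netlink library and needs CAP_NET_ADMIN to actually reach the kernel:

package main

import (
    "fmt"
    "net"

    "github.com/vishvananda/netlink"
)

func main() {
    // Dst is the fabricated Endpoint CIDR; Gw is the peer node's IP. On the
    // affected nodes the gateway is itself only reachable via another hop
    // (10.196.162.1), i.e. not on-link, so the install fails; Cilium's own
    // reachability check produces the warnings quoted above.
    _, dst, _ := net.ParseCIDR("10.103.0.0/16")
    route := &netlink.Route{
        Dst: dst,
        Gw:  net.ParseIP("10.196.160.103"),
    }
    if err := netlink.RouteReplace(route); err != nil {
        fmt.Println("route install failed:", err)
    }
}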

There might be other code locations that I didn't find.

It seems there is a notable mismatch between the node IPAM pool of IP addresses in ClusterPool mode and ENI mode (where clusterpool CIDR addresses are still assigned to this node data structure). So either these particular code paths need to be updated to handle the AWS ENI use case, or the node CIDR should be fed something that will "just work".

I'd propose that in AWS ENI mode, the node Endpoint CIDR should be set to each node's podCIDR (IPv4 and IPv6) instead of these made-up 10.x.0.0 ranges that end up having odd side effects in random places.

In our case I'd expect to see the following, and if so, I'd expect iptables masquerading to work (and those route errors to go away):

$ cilium node list
Name                                           IPv4 Address     Endpoint CIDR   IPv6 Address   Endpoint CIDR
ip-10-196-160-103.us-west-2.compute.internal   10.196.160.103   10.10.0.0/16
ip-10-196-161-128.us-west-2.compute.internal   10.196.161.128   10.10.0.0/16

Cilium Version

1.12.4 (and 1.12.3)

Kernel Version

5.15.0-1022-aws

Kubernetes Version

1.22.15

Sysdump

cilium-sysdump-20221118-221157.zip

Used --quick, otherwise the sysdump wouldn't fit in 25 MB.

Relevant log output

level=warning msg="Unable to install direct node route {Ifindex: 0 Dst: 10.194.0.0/16 Src: <nil> Gw: 10.196.164.194 Flags: [] Table: 0 Realm: 0}" error="route to destination 10.196.164.194 contains gateway 10.196.162.1, must be directly reachable" subsys=linux-datapath
level=warning msg="Unable to install direct node route {Ifindex: 0 Dst: 10.103.0.0/16 Src: <nil> Gw: 10.196.160.103 Flags: [] Table: 0 Realm: 0}" error="route to destination 10.196.160.103 contains gateway 10.196.162.1, must be directly reachable" subsys=linux-datapath
level=warning msg="Unable to install direct node route {Ifindex: 0 Dst: 10.128.0.0/16 Src: <nil> Gw: 10.196.161.128 Flags: [] Table: 0 Realm: 0}" error="route to destination 10.196.161.128 contains gateway 10.196.162.1, must be directly reachable" subsys=linux-datapath
level=warning msg="Unable to install direct node route {Ifindex: 0 Dst: 10.216.0.0/16 Src: <nil> Gw: 10.196.165.216 Flags: [] Table: 0 Realm: 0}" error="route to destination 10.196.165.216 contains gateway 10.196.162.1, must be directly reachable" subsys=linux-datapath

Anything else?

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
jkulesa added the kind/bug, kind/community-report, and needs/triage labels Nov 19, 2022
jkulesa changed the title from "Cilium Endpoint CIDRs used in AWS ENI mode when ipv4NativeRoutingCIDR should be used instead" to "Cilium Endpoint CIDRs used in AWS ENI mode when podCIDR should be used instead" Nov 19, 2022
jkulesa changed the title from "Cilium Endpoint CIDRs used in AWS ENI mode when podCIDR should be used instead" to "Node Endpoint CIDRs used in AWS ENI mode when podCIDR should be used instead" Nov 19, 2022
jkulesa changed the title from "Node Endpoint CIDRs used in AWS ENI mode when podCIDR should be used instead" to "Node clusterpool Endpoint CIDRs used in AWS ENI mode erroneously" Nov 19, 2022
christarazi added the area/eni, sig/ipam, and sig/agent labels Nov 21, 2022
christarazi (Member) commented:

Is this causing connectivity issues in your environment, or is it just the warning log messages?

jkulesa (Author) commented Nov 21, 2022

The initial reason to start poking around was that IPv4 (iptables) masquerading wasn't working; the odd routes being attempted were then noticed while looking through the logs (caused by autoDirectNodeRoutes). They happen not to align with any legitimate routes for now, but I don't fully understand what will happen when they collide with ranges of the 10.0.0.0/8 space that we do use. The main direct breakage we've detected is that ENI mode* with iptables masquerading seems broken, because the NAT rules inserted reference these node ranges, which are entirely fictional.

As best we can tell so far, BPF masquerading does seem to work.

*ENI mode for us also means using separate pod and host subnets (same VPC), along with prefix delegation.

jkulesa (Author) commented Nov 21, 2022

Oh, and here is an example of the broken NAT iptables chain:

Chain CILIUM_POST_nat (1 references)
target     prot opt source               destination         
ACCEPT     all  --  10.243.0.0/16        0.0.0.0/0            match-set cilium_node_set_v4 dst /* exclude traffic to cluster nodes from masquerade */
MASQUERADE  all  --  10.243.0.0/16       !10.196.0.0/16        /* cilium masquerade non-cluster */
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0            mark match 0xa00/0xe00 /* exclude proxy return traffic from masquerade */
SNAT       all  --  127.0.0.1            0.0.0.0/0            /* cilium host->cluster from 127.0.0.1 masquerade */ to:10.10.78.66

You can see the 10.243.0.0/16 source address, which is neither our host network (10.196.0.0/16) nor our pod network (10.10.0.0/16), but instead appears to be a vestigial bit of the cluster pool endpoint CIDR range code.

$ cilium node list
Name                                           IPv4 Address     Endpoint CIDR   IPv6 Address   Endpoint CIDR
...        
ip-10-196-162-243.us-west-2.compute.internal   10.196.162.243   10.243.0.0/16                  
...

github-actions bot commented:

This issue has been automatically marked as stale because it has not
had recent activity. It will be closed if no further activity occurs.

github-actions bot added the stale label Jan 21, 2023
christarazi removed the stale label Jan 23, 2023
github-actions bot added the stale label Mar 25, 2023
christarazi removed the stale label Mar 27, 2023
github-actions bot added the stale label May 27, 2023
aanm (Member) commented May 30, 2023

Related to #9409

christarazi removed the stale label May 30, 2023
mathpl assigned and unassigned learnitall May 30, 2023
gandro (Member) commented Jun 21, 2023

Quick triage on this:

Overall, I agree that we suffer from the assumption that there is a single pod CIDR, which of course does not hold in AWS ENI mode.

However, we also deliberately don't support certain features in ENI mode that rely on the "there is a pod CIDR" assumption, namely:

  • installDirectRoute should not be an issue here. It is only used with autoDirectNodeRoutes=true, which we never supported in ENI mode and which is also not needed (since the EC2 fabric does the routing for us).
  • With iptables-based masquerading, our documentation requires egressMasqueradeInterfaces to be set. This will make the code not use the pod CIDR for the cilium masquerade non-cluster rule:

if option.Config.EgressMasqueradeInterfaces != "" {
    progArgs = append(
        progArgs,
        "-o", option.Config.EgressMasqueradeInterfaces)
} else {
    progArgs = append(
        progArgs,
        "-s", allocRange,
        "!", "-o", "cilium_+")
}

  • But this is poorly documented at the moment, and I think we should make that limitation clear in the ENI concepts page as well. However, it seems that the "exclude traffic to cluster nodes from masquerade" rule that we install is actually buggy, since it unconditionally uses the pod CIDR to create the masquerading rule. We should probably respect egressMasqueradeInterfaces there as well:

if err := prog.runProg([]string{
    "-t", "nat",
    "-A", ciliumPostNatChain,
    "-s", allocRange,
    "-m", "set", "--match-set", prog.getIpset(), "dst",
    "-m", "comment", "--comment", "exclude traffic to cluster nodes from masquerade",
    "-j", "ACCEPT"}); err != nil {
    return err
}
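
For concreteness, here is a sketch of what respecting egressMasqueradeInterfaces there could look like, reusing the names from the snippets above (a hypothetical restructuring, not the actual patch):

// Build the node-set exclusion rule with the same branch as the
// masquerade rule: match on the egress device when configured,
// otherwise fall back to the pod CIDR.
excludeArgs := []string{"-t", "nat", "-A", ciliumPostNatChain}
if option.Config.EgressMasqueradeInterfaces != "" {
    // Match on the egress device instead of a possibly fictional pod CIDR.
    excludeArgs = append(excludeArgs,
        "-o", option.Config.EgressMasqueradeInterfaces)
} else {
    excludeArgs = append(excludeArgs, "-s", allocRange)
}
excludeArgs = append(excludeArgs,
    "-m", "set", "--match-set", prog.getIpset(), "dst",
    "-m", "comment", "--comment", "exclude traffic to cluster nodes from masquerade",
    "-j", "ACCEPT")
if err := prog.runProg(excludeArgs); err != nil {
    return err
}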

gandro added a commit to gandro/cilium that referenced this issue Jun 21, 2023
By default, in the iptables-based masquerading mode, Cilium will only
masquerade traffic coming from the local pod CIDR (`allocRange` in
`installMasqueradeRules`). However, many IPAM modes such as ENI or
multi-pool IPAM do not have a single pod CIDR. Instead, those modes rely
on the `egress-masquerade-interfaces` setting, which masquerades all
traffic if it leaves one of the `egress-masquerade-interfaces` devices.

Therefore, the "exclude traffic to cluster nodes from masquerade"
`CILIUM_POST_nat` rule should also respect the
`egress-masquerade-interfaces` setting and not masquerade traffic
regardless of the value of `allocRange` (which will not be valid in
settings such as ENI mode).

This likely has not manifested in ENI mode as an issue, because in ENI
mode we derive the native routing CIDR (`snatDstExclusionCIDR` in
`installMasqueradeRules`) from the EC2 VPC CIDR, which usually contains
the node IPs too. However, we should not rely on that, since we are
adding additional non-podCIDR based IPAM modes such as multi-pool where
this will not be true.

Related: cilium#22273

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
borkmann pushed a commit that referenced this issue Jun 27, 2023
github-actions bot added the stale label Aug 21, 2023
christarazi removed the stale label Aug 21, 2023
github-actions bot added the stale label Oct 21, 2023
github-actions bot commented Nov 4, 2023

This issue has not seen any activity since it was marked stale.
Closing.

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Nov 4, 2023