
Pod to pod packets via egress + ingress proxy are MTU dropped when IPsec is enabled #33168

Open
jschwinger233 opened this issue Jun 16, 2024 · 1 comment

jschwinger233 commented Jun 16, 2024

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

Steps:

  1. Create a kind cluster and install Cilium:

  export IMAGE=kindest/node:v1.29.4@sha256:3abb816a5b1061fb15c6e9e60856ec40d56b7b52bcea5f5f1350bc6e2320b6f8
  ./contrib/scripts/kind.sh --xdp --secondary-network "" 3 "" "" none dual 0.0.0.0 6443
  kubectl patch node kind-worker3 --type=json -p='[{"op":"add","path":"/metadata/labels/cilium.io~1no-schedule","value":"true"}]'

  if [[ "gcm(aes)" == "gcm(aes)" ]]; then
    key="rfc4106(gcm(aes)) $(dd if=/dev/urandom count=20 bs=1 2> /dev/null | xxd -p -c 64) 128"
  elif [[ "gcm(aes)" == "cbc(aes)" ]]; then
    key="hmac(sha256) $(dd if=/dev/urandom count=32 bs=1 2> /dev/null| xxd -p -c 64) cbc(aes) $(dd if=/dev/urandom count=32 bs=1 2> /dev/null| xxd -p -c 64)"
  else
    echo "Invalid key type"; exit 1
  fi
  kubectl create -n kube-system secret generic cilium-ipsec-keys \
    --from-literal=keys="3+ ${key}"
  ./cilium-cli install --wait \
    --chart-directory=./install/kubernetes/cilium \
    --helm-set=debug.enabled=true \
    --helm-set=debug.verbose=envoy \
    --helm-set=hubble.eventBufferCapacity=65535 \
    --helm-set=bpf.monitorAggregation=none \
    --helm-set=cluster.name=default \
    --helm-set=authentication.mutual.spire.enabled=false \
    --nodes-without-cilium \
    --helm-set-string=kubeProxyReplacement=true \
    --set='' \
    --helm-set=image.repository=quay.io/cilium/cilium-ci \
    --helm-set=image.useDigest=false \
    --helm-set=image.tag=412a46c753eaeff229cba3d83332e2dea3e192f3 \
    --helm-set=operator.image.repository=quay.io/cilium/operator \
    --helm-set=operator.image.suffix=-ci \
    --helm-set=operator.image.tag=412a46c753eaeff229cba3d83332e2dea3e192f3 \
    --helm-set=operator.image.useDigest=false \
    --helm-set=hubble.relay.image.repository=quay.io/cilium/hubble-relay-ci \
    --helm-set=hubble.relay.image.tag=412a46c753eaeff229cba3d83332e2dea3e192f3 \
    --helm-set=hubble.relay.image.useDigest=false \
    --helm-set-string=routingMode=native \
    --helm-set-string=autoDirectNodeRoutes=true \
    --helm-set-string=ipv4NativeRoutingCIDR=10.244.0.0/16 \
    --helm-set-string=ipv6NativeRoutingCIDR=fd00:10:244::/56 \
    --helm-set-string=endpointRoutes.enabled=true \
    --helm-set=ipv6.enabled=true \
    --helm-set=bpf.masquerade=true \
    --helm-set=encryption.enabled=true \
    --helm-set=encryption.type=ipsec \
    --helm-set=encryption.nodeEncryption=false
  2. Install cilium-cli-next:

  cid=$(docker create quay.io/cilium/cilium-cli-ci:5401ce3551cc46052489b7153468b577830a63a4 ls)
  docker cp $cid:/usr/local/bin/cilium ./cilium-cli-next
  docker rm $cid
  3. Run the connectivity test and observe the failures:

  ./cilium-cli-next connectivity test --include-unsafe-tests --flush-ct --test "pod-to-pod-with-l7-policy-encryption/" -v -p

Please note this only happens when both the ingress policy and the egress policy are in place at the same time. Deleting either the ingress or the egress policy restores connectivity.

Cilium Version

$ ./cilium-cli version
cilium-cli: v0.16.7 compiled with go1.22.2 on linux/amd64
cilium image (default): v1.15.4
cilium image (stable): v1.15.6
cilium image (running): 1.16.0-dev

Kernel Version

$ uname -a
Linux liangzc-l-PF4RDLEQ 6.5.0-1024-oem #25-Ubuntu SMP PREEMPT_DYNAMIC Mon May 20 14:47:48 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Kubernetes Version

$ kubectl version
Client Version: v1.30.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.4

Regression

No response

Sysdump

No response

Relevant log output

No response

Anything else?

This is yet another MTU issue; IPsec xfrm makes it even trickier to reason about.

Fact one: IPsec xfrm reduces the effective MTU from 1500 to 1446.

Even if every network interface on the Cilium node is set to MTU 1500 (including the devices used by the routes in table 2005), xfrm quietly lowers the effective MTU to 1446.

This happens in ip_forward():

// https://elixir.bootlin.com/linux/v6.2/source/net/ipv4/ip_forward.c#L124
int ip_forward(struct sk_buff *skb)
{
[...]
	if (!xfrm4_route_forward(skb)) {
		SKB_DR_SET(reason, XFRM_POLICY);
		goto drop;
	}
	rt = skb_rtable(skb);

	if (opt->is_strictroute && rt->rt_uses_gateway)
		goto sr_failed;

	IPCB(skb)->flags |= IPSKB_FORWARDED;
	mtu = ip_dst_mtu_maybe_forward(&rt->dst, true);
	if (ip_exceeds_mtu(skb, mtu)) {
[...]
}

Before entering ip_forward(), ip_route_input_noref() has already set skb->_skb_refdst, which is normally what determines the MTU. However, the code above shows that the presence of xfrm can change the MTU inside ip_forward(): xfrm4_route_forward() may replace skb->_skb_refdst with the result of xfrm_lookup_with_ifid().

The xfrm MTU is computed by its own algorithm (https://elixir.bootlin.com/linux/v6.2/source/net/xfrm/xfrm_state.c#L2747). I didn't check all the details, but I did use bpftrace to fetch the new MTU from xfrm4_route_forward(); in our Cilium case the MTU is indeed reduced from 1500 to 1446.
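
A simpler way to watch that value is to probe xfrm_state_mtu() instead of xfrm4_route_forward() (illustrative one-liner, not the exact script I used; it assumes the symbol is probe-able on your kernel):

  # Print the payload MTU xfrm computes for a state, e.g. when the bundle for the
  # forwarded flow is built.
  sudo bpftrace -e 'kretprobe:xfrm_state_mtu { printf("xfrm_state_mtu -> %u\n", retval); }'

The 1446 also matches a back-of-the-envelope check of __xfrm_state_mtu() for this setup, assuming tunnel mode with rfc4106(gcm(aes)) and a 128-bit ICV: header_len = 20 (outer IPv4) + 8 (ESP header) + 8 (IV) = 36, authsize = 16, blksize = 4, so ((1500 - 36 - 16) & ~3) - 2 = 1446.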

Fact two: traffic from the (local) egress proxy to the (remote) ingress proxy uses a different MTU than pod-to-pod traffic.

Both pod-to-pod and pod-to-remote-ingress-proxy traffic uses MTU 1423, which is set on the route rather than on the interface:

$ nspod client2-ccd7b8bdf-nt4nn ip r
default via 10.244.1.12 dev eth0 mtu 1423 
10.244.1.12 dev eth0 scope link

However, proxy-to-proxy traffic uses MTU 1500. This subtle difference explains why we only see issues when ingress and egress policies are installed together, a scenario our tests never covered before.
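
The difference can be seen by comparing route lookups with and without the proxy mark (illustrative commands; the node name and 10.244.2.27 are placeholders, and nscontainer/nspod are the same helpers used elsewhere in this issue):

  # On the node: a lookup with the egress-proxy mark 0xa00 hits table 2005, whose
  # routes carry no mtu attribute, so the cilium_host MTU (1500) applies.
  nscontainer kind-worker ip route get 10.244.2.27 mark 0xa00

  # Inside the pod netns: the same destination resolves via the default route
  # carrying mtu 1423.
  nspod client2-ccd7b8bdf-nt4nn ip route get 10.244.2.27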

Proposed solution

It seems we can simply set an MTU on the routes in the 2005 table, because proxy traffic always ends up there due to the 0xa00/0xb00 marks.

For example, the current routes in the 2005 table are:

$ nscontainer kind-control-plane ip r s t 2005
default via 10.244.0.100 dev cilium_host proto kernel 
10.244.0.100 dev cilium_host proto kernel scope link 

We could change them to:

default via 10.244.0.100 dev cilium_host proto kernel mtu 1446
10.244.0.100 dev cilium_host proto kernel scope link mtu 1446

(Haven't checked IPv6)
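
As a quick manual experiment (IPv4 only, per the note above; addresses taken from the kind-control-plane routes shown earlier, and 10.244.1.27 is a placeholder remote pod IP), the change can be applied and verified with:

  # Re-add the table 2005 routes with an explicit mtu matching the xfrm MTU.
  nscontainer kind-control-plane ip route replace 10.244.0.100 dev cilium_host proto kernel scope link table 2005 mtu 1446
  nscontainer kind-control-plane ip route replace default via 10.244.0.100 dev cilium_host proto kernel table 2005 mtu 1446

  # A proxy-marked lookup should now report mtu 1446.
  nscontainer kind-control-plane ip route get 10.244.1.27 mark 0xa00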

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct
jschwinger233 added the kind/bug, needs/triage, kind/community-report, sig/datapath, area/proxy, area/encryption, feature/ipv6, feature/ipsec, affects/v1.13, affects/v1.14, and affects/v1.15 labels, and removed the kind/community-report label, on Jun 16, 2024
jschwinger233 (Member, Author) commented:
Update: IPv6 hits the same MTU issue: xfrm reduces the MTU from 1500 to 1426. However, merely changing the route MTU does not restore IPv6 connectivity, so something else must be going on. I therefore added the feature/ipv6 and needs/triage labels.

brb added a commit to cilium/cilium-cli that referenced this issue Jun 17, 2024
Add L7 policy checks. Only for WG, while IPsec is currently suffering
from cilium/cilium#33168.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Signed-off-by: Martynas Pumputis <m@lambda.lt>
michi-covalent pushed a commit to cilium/cilium-cli that referenced this issue Jun 18, 2024
Add L7 policy checks. Only for WG, while IPsec is currently suffering
from cilium/cilium#33168.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Signed-off-by: Martynas Pumputis <m@lambda.lt>
ti-mo removed the needs/triage label on Jun 20, 2024
yushoyamaguchi pushed a commit to yushoyamaguchi/cilium-cli that referenced this issue Jun 21, 2024
Add L7 policy checks. Only for WG, while IPsec is currently suffering
from cilium/cilium#33168.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Signed-off-by: Martynas Pumputis <m@lambda.lt>