Pod to pod packets via egress + ingress proxy are MTU dropped when IPsec is enabled #33168
Open
Labels: affects/v1.13, affects/v1.14, affects/v1.15, area/encryption, area/proxy, feature/ipsec, feature/ipv6, kind/bug, sig/datapath
What happened?
Steps:
Note that this only happens when both an ingress policy and an egress policy are in effect at the same time. Deleting either the ingress or the egress policy restores connectivity.
Cilium Version
Kernel Version
Kubernetes Version
Regression
No response
Sysdump
No response
Relevant log output
No response
Anything else?
This is yet another MTU issue, and IPsec xfrm makes it even harder to diagnose.
Fact one: IPsec xfrm reduces MTU from 1500 to 1446.
Even if every network interface on the Cilium node is set to MTU 1500 (including those used by route table 2005), xfrm silently reduces the MTU to 1446.
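As a rough sanity check on the 1500 to 1446 figure, here is a simplified Python sketch modeled on a reading of the kernel's xfrm_state_mtu(). The parameter values are assumptions, not taken from this issue: IPv4 ESP tunnel mode with an AEAD cipher using an 8-byte IV and a 16-byte ICV, which gives header_len = 20 (outer IP) + 8 (ESP header) + 8 (IV) = 36.

```python
def xfrm_state_mtu(link_mtu: int, header_len: int, authsize: int,
                   blksize: int, net_adj: int = 0) -> int:
    """Approximate the payload MTU left after ESP encapsulation.

    Simplified sketch of the kernel logic; net_adj is 0 in tunnel mode.
    """
    # The kernel aligns the cipher block size up to a multiple of 4.
    blksize = (blksize + 3) & ~3
    usable = link_mtu - header_len - authsize - net_adj
    # Round down to a block boundary, then subtract the 2-byte ESP
    # trailer (pad length + next header).
    return (usable & ~(blksize - 1)) + net_adj - 2


# Assumed parameters (hypothetical, see the paragraph above).
mtu = xfrm_state_mtu(1500, header_len=36, authsize=16, blksize=1)
print(mtu)
```

Under these assumptions the result matches the 1446 observed with bpftrace; other ciphers or IPv6 would give different numbers.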
The MTU reduction happens in ip_forward():
We know that before entering ip_forward(), in ip_route_input_noref(), the skb has had skb->_skb_refdst set, which is used to determine the MTU. However, the code above shows that the presence of xfrm can change the MTU inside ip_forward(): xfrm4_route_forward() can change skb->_skb_refdst to the result of xfrm_lookup_with_ifid().
The xfrm MTU is computed by a special algorithm at https://elixir.bootlin.com/linux/v6.2/source/net/xfrm/xfrm_state.c#L2747. I didn't check all the details, but I did use bpftrace to fetch the new MTU from xfrm4_route_forward(); in our Cilium case the MTU is indeed reduced from 1500 to 1446.
Fact two: traffic from the (local egress) proxy to the (remote ingress) proxy has a different MTU from pod-to-pod traffic.
Both pod-to-pod and pod-to-remote-ingress-proxy traffic have MTU 1423, which is set on the route rather than on the interface:
However, proxy-to-proxy traffic has MTU 1500. This subtle difference explains why we see the issue only when ingress and egress policies are installed together, a scenario our tests never covered before.
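The two facts can be combined into a small illustration. This is a sketch of the reasoning only, not of the datapath: it assumes that a packet sized up to its route MTU is dropped (or rejected as too big) on forwarding when that size exceeds the post-xfrm MTU of 1446.

```python
XFRM_MTU = 1446  # effective MTU after xfrm4_route_forward(), per this issue

# MTU each traffic class is sized against, per this issue.
ROUTE_MTU = {
    "pod to pod": 1423,
    "pod to remote ingress proxy": 1423,
    "proxy to proxy": 1500,
}


def fits(route_mtu: int, path_mtu: int = XFRM_MTU) -> bool:
    """Return True if a full-sized packet still fits after xfrm."""
    return route_mtu <= path_mtu


results = {flow: fits(mtu) for flow, mtu in ROUTE_MTU.items()}
for flow, ok in results.items():
    print(f"{flow}: {'fits' if ok else 'exceeds xfrm MTU'}")
```

Only the proxy-to-proxy class exceeds the xfrm MTU, which is consistent with the failure appearing only when both proxies are in the path.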
Proposed solution
It seems we can simply set the MTU on the routes in the 2005 table, because proxy traffic always ends up there due to the 0xa00/0xb00 mark.
For example, the current routes in the 2005 table are:
We could change them to:
(Haven't checked IPv6)
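To make the proposal concrete, the following sketch builds an `ip route replace` command that adds an `mtu` attribute to a route spec. Everything specific here is hypothetical: the helper names, the placeholder route (a documentation address on cilium_host), and the choice of 1423, which is assumed only because it matches the pod-to-pod route MTU mentioned above. It does not reproduce the actual routes in the 2005 table.

```python
def with_mtu(route: str, mtu: int) -> str:
    """Return the route spec with its mtu attribute set to `mtu`.

    Only the plain "mtu N" form is handled ("mtu lock N" is not).
    """
    tokens = route.split()
    if "mtu" in tokens:
        i = tokens.index("mtu")
        del tokens[i:i + 2]  # drop the existing "mtu N" pair
    return " ".join(tokens + ["mtu", str(mtu)])


def replace_cmd(route: str, mtu: int, table: int = 2005) -> str:
    """Build an `ip route replace` command for the given table."""
    return f"ip route replace {with_mtu(route, mtu)} table {table}"


# Hypothetical placeholder route, not taken from the issue.
cmd = replace_cmd("default via 192.0.2.1 dev cilium_host", 1423)
print(cmd)
```

The command is only printed, not executed, so the sketch can be run safely on any machine.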