-
Notifications
You must be signed in to change notification settings - Fork 2.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unencrypted traffic among nodes when IPsec is used with L7 egress proxy #31984
Comments
Otherwise the further patch for L7 proxy + IPsec fails pod-to-world testcases. To fix #31984, we are going to change the routing for traffic from egress proxy: when tunnel is disabled, the traffic will be routed to cilium_host instead of eth0. This leads to a consequence of changing packets' source IPs: from eth0's IP to cilium_host's IP. The implication requires additional masquerading for pod to world traffic, because these packets now have cilium_host's IP as source, rather than eth0. This patch ensures iptables masquerade rules are installed for pod to world traffic, even when KPR is enabled. Signed-off-by: gray <gray.liang@isovalent.com>
Otherwise the further patch for L7 proxy + IPsec fails pod-to-world testcases. To fix #31984, we are going to change the routing for traffic from egress proxy: when tunnel is disabled, the traffic will be routed to cilium_host instead of eth0. This leads to a consequence of changing packets' source IPs: from eth0's IP to cilium_host's IP. The implication requires additional masquerading for pod to world traffic, because these packets now have cilium_host's IP as source, rather than eth0. This patch ensures iptables masquerade rules are installed for pod to world traffic, even when KPR is enabled. Signed-off-by: gray <gray.liang@isovalent.com>
Otherwise the further patch for L7 proxy + IPsec fails pod-to-world testcases. To fix #31984, we are going to change the routing for traffic from egress proxy: when tunnel is disabled, the traffic will be routed to cilium_host instead of eth0. This leads to a consequence of changing packets' source IPs: from eth0's IP to cilium_host's IP. The implication requires additional masquerading for pod to world traffic, because these packets now have cilium_host's IP as source, rather than eth0. This patch ensures iptables masquerade rules are installed for pod to world traffic, even when KPR is enabled. Signed-off-by: gray <gray.liang@isovalent.com>
Otherwise the further patch for L7 proxy + IPsec fails pod-to-world testcases. To fix #31984, we are going to change the routing for traffic from egress proxy: when tunnel is disabled, the traffic will be routed to cilium_host instead of eth0. This leads to a consequence of changing packets' source IPs: from eth0's IP to cilium_host's IP. The implication requires additional masquerading for pod to world traffic, because these packets now have cilium_host's IP as source, rather than eth0. This patch ensures iptables masquerade rules are installed for pod to world traffic, even when KPR is enabled. Signed-off-by: gray <gray.liang@isovalent.com>
Otherwise the further patch for L7 proxy + IPsec fails pod-to-world testcases. To fix #31984, we are going to change the routing for traffic from egress proxy: when tunnel is disabled, the traffic will be routed to cilium_host instead of eth0. This leads to a consequence of changing packets' source IPs: from eth0's IP to cilium_host's IP. The implication requires additional masquerading for pod to world traffic, because these packets now have cilium_host's IP as source, rather than eth0. This patch ensures iptables masquerade rules are installed for pod to world traffic, even when KPR is enabled. Signed-off-by: gray <gray.liang@isovalent.com>
We have an iptables rule to set 0x200 mark for transparent socket: ``` *mangle -A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle -A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff ``` This rule is in the mangle PREROUTING which checks packets ingressed from a netdev. Let's then focus on the pod to world traffic when IPsec=on + proxy=on + tunnel=off. Currently, a pod-to-world packet will go through the path: 1. @lxc ingress: skb->mark is set to 0x200 and returned to stack 2. @stack iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy 3. @Proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world) 4. @stack routing: the new skb is routed to eth0 5. @stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain 6. @eth0 egress: the new skb is going to world Please note the new skb won't hit PREROUTING chain, where there is a rule setting skb->mark=0x200. To fix #31984, we are going to change the routing for packets from egress proxy: on the step 4 above, the new skb will be routed to cilium_host instead: 4. @stack routing: the new skb is routed to cilium_host 5. @cilium_host egress: the new skb is returned to stack 6. @cilium_net ingress: the new skb is returned to stack 7. @stack routing: the new skb is routed to eth0 8. @stack iptables: the new skb is traversing PREROUTING, FORWARD, POSTROUTING Look at step 8, we are hitting PREROUTING! Because of cilium/proxy#742, this to-world skb is also linked to a transparent socket, matching the "-m socket --transparent" condition. If we do nothing, this to-world skb will be set 0x200 mark, and then hit routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to local. It should have gone to the world. This patch fixes this future issue as a precaution (otherwise we'll break git-bisect). This patch provides a straightforward solution: at step 5 @cilium_host, we set a specical mark 0x800 (MARK_MAGIC_GOTO_WORLD), then iptables can exclude this mark using "-m mark ! --mark 0x800/0xf00". Signed-off-by: gray <gray.liang@isovalent.com>
After cilium/proxy#742, proxy traffic keeps original pod IP as source IP for to-world packets, which must be masqueraded to eth0 IP. there is no issue for now, but the new routing rule (0xb00 lookup 2005) to be added for #31984 will cause a side effect breaking masquerading. This patch fixes the that side effect as a precaution, otherwise git-bisect breaks. The new routing rule (0xb00 lookup 2005) will cause proxy packets going through POSTROUTING for twice: first time happens when proxy sends packets which are routed to cilium_host, these are hitting OUTPUT + **POSTROUTING**; the second time takes place after packets ingressed from cilium_net, these skbs will traverse PREROUTING + FORWARD + **POSTROUTING**. However, due to kernel's implementation details, an skb won't be processed by nat POSTROUTING for twice: after the first POSTROUTING traversal, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status IPS_SRC_NAT_DONE to skip the further traversal at all. To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first round iptables ct, just for proxy traffic which is characterized by 0xb00 mark. Signed-off-by: gray <gray.liang@isovalent.com>
We have an iptables rule to set 0x200 mark for transparent socket: ``` *mangle -A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle -A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff ``` This rule is in the mangle PREROUTING which checks packets ingressed from a netdev. Let's then focus on the pod to world traffic when IPsec=on + proxy=on + tunnel=off. Currently, a pod-to-world packet will go through the path: 1. @lxc ingress: skb->mark is set to 0x200 and returned to stack 2. @stack iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy 3. @Proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world) 4. @stack routing: the new skb is routed to eth0 5. @stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain 6. @eth0 egress: the new skb is going to world Please note the new skb won't hit PREROUTING chain, where there is a rule setting skb->mark=0x200. To fix #31984, we are going to change the routing for packets from egress proxy; consequently, on the step 4 above, the new skb will be routed to cilium_host instead: 4. @stack routing: the new skb is routed to cilium_host 5. @cilium_host egress: the new skb is returned to stack 6. @cilium_net ingress: the new skb is returned to stack 7. @stack routing: the new skb is routed to eth0 8. @stack iptables: the new skb is traversing PREROUTING, FORWARD, POSTROUTING Look at step 8, we are hitting PREROUTING! Because of cilium/proxy#742, this to-world skb is also linked to a transparent socket, matching the "-m socket --transparent" condition, the packet will fortunately have the 0x200 mark. If we do nothing, this to-world skb marked with 0x200 will then hit routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to local. It should have gone to the world. This patch fixes this future issue as a precaution (otherwise we'll break git-bisect). This patch provides a straightforward solution: at step 5 @cilium_host, we set a specical mark 0x800 (MARK_MAGIC_GOTO_WORLD), then iptables can exclude this mark using "-m mark ! --mark 0x800/0xf00". Signed-off-by: gray <gray.liang@isovalent.com>
After cilium/proxy#742, proxy traffic keeps original pod IP as source IP for to-world packets, which must be masqueraded to eth0 IP. there is no issue for now, but the new routing rule (0xb00 lookup 2005) to be added for #31984 will cause a side effect breaking masquerading. This patch fixes the that side effect as a precaution, otherwise git-bisect breaks. The new routing rule (0xb00 lookup 2005) will cause proxy packets going through POSTROUTING for twice: first time happens when proxy sends packets which are routed to cilium_host, these are hitting OUTPUT + **POSTROUTING**; the second time takes place after packets ingressed from cilium_net, these skbs will traverse PREROUTING + FORWARD + **POSTROUTING**. However, due to kernel's implementation details, an skb won't be processed by nat POSTROUTING for twice: after the first POSTROUTING traversal, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status IPS_SRC_NAT_DONE to skip the further traversal at all. To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first round iptables ct, just for proxy traffic which is characterized by 0xb00 mark. Signed-off-by: gray <gray.liang@isovalent.com>
We have an iptables rule to set 0x200 mark for transparent socket: ``` *mangle -A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle -A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff ``` This rule is in the mangle PREROUTING which checks packets ingressed from a netdev. Let's then focus on the pod to world traffic when IPsec=on + proxy=on + tunnel=off. Currently, a pod-to-world packet will go through the path: 1. @lxc ingress: skb->mark is set to 0x200 and returned to stack 2. @stack iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy 3. @Proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world) 4. @stack routing: the new skb is routed to eth0 5. @stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain 6. @eth0 egress: the new skb is going to world Please note the new skb won't hit PREROUTING chain, where there is a rule setting skb->mark=0x200. To fix #31984, we are going to change the routing for packets from egress proxy; consequently, on the step 4 above, the new skb will be routed to cilium_host instead: 4. @stack routing: the new skb is routed to cilium_host 5. @cilium_host egress: the new skb is returned to stack 6. @cilium_net ingress: the new skb is returned to stack 7. @stack routing: the new skb is routed to eth0 8. @stack iptables: the new skb is traversing PREROUTING, FORWARD, POSTROUTING Look at step 8, we are hitting PREROUTING! Because of cilium/proxy#742, this to-world skb is also linked to a transparent socket, matching the "-m socket --transparent" condition, the packet will fortunately have the 0x200 mark. If we do nothing, this to-world skb marked with 0x200 will then hit routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to local. It should have gone to the world. This patch fixes this future issue as a precaution (otherwise we'll break git-bisect). This patch provides a straightforward solution: at step 5 @cilium_host, we set a specical mark 0x800 (MARK_MAGIC_GOTO_WORLD), then iptables can exclude this mark using "-m mark ! --mark 0x800/0xf00". Signed-off-by: gray <gray.liang@isovalent.com>
After cilium/proxy#742, proxy traffic keeps original pod IP as source IP for to-world packets, which must be masqueraded to eth0 IP. there is no issue for now, but the new routing rule (0xb00 lookup 2005) to be added for #31984 will cause a side effect breaking masquerading. This patch fixes the that side effect as a precaution, otherwise git-bisect breaks. The new routing rule (0xb00 lookup 2005) will cause proxy packets going through POSTROUTING for twice: first time happens when proxy sends packets which are routed to cilium_host, these are hitting OUTPUT + **POSTROUTING**; the second time takes place after packets ingressed from cilium_net, these skbs will traverse PREROUTING + FORWARD + **POSTROUTING**. However, due to kernel's implementation details, an skb won't be processed by nat POSTROUTING for twice: after the first POSTROUTING traversal, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status IPS_SRC_NAT_DONE to skip the further traversal at all. To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first round iptables ct, just for proxy traffic which is characterized by 0xb00 mark. Signed-off-by: gray <gray.liang@isovalent.com>
We have an iptables rule to set 0x200 mark for transparent socket: ``` *mangle -A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle -A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff ``` This rule is in the mangle PREROUTING which checks packets ingressed from a netdev. Let's then focus on the pod to world traffic when IPsec=on + proxy=on + tunnel=off. Currently, a pod-to-world packet will go through the path: 1. @lxc ingress: skb->mark is set to 0x200 and returned to stack 2. @stack iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy 3. @Proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world) 4. @stack routing: the new skb is routed to eth0 5. @stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain 6. @eth0 egress: the new skb is going to world Please note the new skb won't hit PREROUTING chain, where there is a rule setting skb->mark=0x200. To fix #31984, we are going to change the routing for packets from egress proxy; consequently, on the step 4 above, the new skb will be routed to cilium_host instead: 4. @stack routing: the new skb is routed to cilium_host 5. @cilium_host egress: the new skb is returned to stack 6. @cilium_net ingress: the new skb is returned to stack 7. @stack routing: the new skb is routed to eth0 8. @stack iptables: the new skb is traversing PREROUTING, FORWARD, POSTROUTING Look at step 8, we are hitting PREROUTING! Because of cilium/proxy#742, this to-world skb is also linked to a transparent socket, matching the "-m socket --transparent" condition, the packet will fortunately have the 0x200 mark. If we do nothing, this to-world skb marked with 0x200 will then hit routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to local. It should have gone to the world. This patch fixes this future issue as a precaution (otherwise we'll break git-bisect). This patch provides a straightforward solution: at step 5 @cilium_host, we set a specical mark 0x800 (MARK_MAGIC_PROXY_TO_WORLD), then iptables can exclude this mark using "-m mark ! --mark 0x800/0xf00". Signed-off-by: gray <gray.liang@isovalent.com>
After cilium/proxy#742, proxy traffic keeps original pod IP as source IP for to-world packets, which must be masqueraded to eth0 IP. there is no issue for now, but the new routing rule (0xb00 lookup 2005) to be added for #31984 will cause a side effect breaking masquerading. This patch fixes the that side effect as a precaution, otherwise git-bisect breaks. The new routing rule (0xb00 lookup 2005) will cause proxy packets going through POSTROUTING for twice: first time happens when proxy sends packets which are routed to cilium_host, these are hitting OUTPUT + **POSTROUTING**; the second time takes place after packets ingressed from cilium_net, these skbs will traverse PREROUTING + FORWARD + **POSTROUTING**. However, due to kernel's implementation details, an skb won't be processed by nat POSTROUTING for twice: after the first POSTROUTING traversal, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status IPS_SRC_NAT_DONE to skip the further traversal at all. To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first round iptables ct, just for proxy traffic which is characterized by 0xb00 mark. Signed-off-by: gray <gray.liang@isovalent.com>
We have an iptables rule to set 0x200 mark for transparent socket: ``` *mangle -A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle -A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff ``` This rule is in the mangle PREROUTING which checks packets ingressed from a netdev. Let's then focus on the pod to world traffic when IPsec=on + proxy=on + tunnel=off. Currently, a pod-to-world packet will go through the path: 1. @lxc ingress: skb->mark is set to 0x200 and returned to stack 2. @stack iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy 3. @Proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world) 4. @stack routing: the new skb is routed to eth0 5. @stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain 6. @eth0 egress: the new skb is going to world Please note the new skb won't hit PREROUTING chain, where there is a rule setting skb->mark=0x200. To fix #31984, we are going to change the routing for packets from egress proxy; consequently, on the step 4 above, the new skb will be routed to cilium_host instead: 4. @stack routing: the new skb is routed to cilium_host 5. @cilium_host egress: the new skb is returned to stack 6. @cilium_net ingress: the new skb is returned to stack 7. @stack routing: the new skb is routed to eth0 8. @stack iptables: the new skb is traversing PREROUTING, FORWARD, POSTROUTING Look at step 8, we are hitting PREROUTING! Because of cilium/proxy#742, this to-world skb is also linked to a transparent socket, matching the "-m socket --transparent" condition, the packet will fortunately have the 0x200 mark. If we do nothing, this to-world skb marked with 0x200 will then hit routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to local. It should have gone to the world. This patch fixes this future issue as a precaution (otherwise we'll break git-bisect). This patch provides a straightforward solution: at step 5 @cilium_host, we set a specical mark 0x800 (MARK_MAGIC_PROXY_TO_WORLD), then iptables can exclude this mark using "-m mark ! --mark 0x800/0xf00". Signed-off-by: gray <gray.liang@isovalent.com>
After cilium/proxy#742, proxy traffic keeps original pod IP as source IP for to-world packets, which must be masqueraded to eth0 IP. there is no issue for now, but the new routing rule (0xb00 lookup 2005) to be added for #31984 will cause a side effect breaking masquerading. This patch fixes the that side effect as a precaution, otherwise git-bisect breaks. The new routing rule (0xb00 lookup 2005) will cause proxy packets going through POSTROUTING for twice: first time happens when proxy sends packets which are routed to cilium_host, these are hitting OUTPUT + **POSTROUTING**; the second time takes place after packets ingressed from cilium_net, these skbs will traverse PREROUTING + FORWARD + **POSTROUTING**. However, due to kernel's implementation details, an skb won't be processed by nat POSTROUTING for twice: after the first POSTROUTING traversal, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status IPS_SRC_NAT_DONE to skip the further traversal at all. To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first round iptables ct, just for proxy traffic which is characterized by 0xb00 mark. Signed-off-by: gray <gray.liang@isovalent.com>
After cilium/proxy#742, proxy traffic keeps original pod IP as source IP for to-world packets, which must be masqueraded to eth0 IP. There is no issue for now, but the new routing rule (0xb00 lookup 2005) to be added for #31984 will cause a side effect breaking masquerading. This patch fixes the that side effect as a precaution, otherwise git-bisect breaks. The new routing rule (0xb00 lookup 2005) will cause proxy packets going through POSTROUTING for twice: first time happens when proxy sends packets which are routed to cilium_host, these are hitting OUTPUT + **POSTROUTING**; the second time takes place after packets ingressed from cilium_net, these skbs will traverse PREROUTING + FORWARD + **POSTROUTING**. However, due to kernel's implementation details, an skb won't be processed by nat POSTROUTING for twice: after the first POSTROUTING check, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status IPS_SRC_NAT_DONE to skip the further traversal at all. To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first round iptables ct, just for proxy traffic which is characterized by 0xb00 mark. Signed-off-by: gray <gray.liang@isovalent.com>
After cilium/proxy#742, proxy traffic keeps original pod IP as source IP for to-world packets, which must be masqueraded to eth0 IP. There is no issue for now, but the new routing rule (0xb00 lookup 2005) to be added for #31984 will cause a side effect breaking masquerading. This patch fixes the that side effect as a precaution, otherwise git-bisect breaks. The new routing rule (0xb00 lookup 2005) will cause proxy packets going through POSTROUTING for twice: first time happens when proxy sends packets which are routed to cilium_host, these are hitting OUTPUT + **POSTROUTING**; the second time takes place after packets ingressed from cilium_net, these skbs will traverse PREROUTING + FORWARD + **POSTROUTING**. However, due to kernel's implementation details, an skb won't be processed by nat POSTROUTING for twice: after the first POSTROUTING check, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status IPS_SRC_NAT_DONE to skip the further traversal at all. [1] To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first round iptables ct, just for proxy traffic which is characterized by 0xb00 mark. [1] https://elixir.bootlin.com/linux/v6.6.2/source/net/netfilter/nf_nat_core.c#L825 [1] https://elixir.bootlin.com/linux/v6.6.2/source/include/net/netfilter/nf_nat.h#L111 Signed-off-by: gray <gray.liang@isovalent.com>
We have an iptables rule to set 0x200 mark for transparent socket: ``` *mangle -A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle -A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff ``` This rule is in the mangle PREROUTING which checks packets ingressed from a netdev. Let's then focus on the pod to world traffic when IPsec=on + proxy=on + tunnel=off. Currently, a pod-to-world packet will go through the path: 1. @lxc ingress: skb->mark is set to 0x200 and returned to stack 2. @stack iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy 3. @Proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world) 4. @stack routing: the new skb is routed to eth0 5. @stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain 6. @eth0 egress: the new skb is going to world Please note the new skb won't hit PREROUTING chain, where there is a rule setting skb->mark=0x200. To fix #31984, we are going to change the routing for packets from egress proxy; consequently, on the step 4 above, the new skb will be routed to cilium_host instead: 4. @stack routing: the new skb is routed to cilium_host 5. @cilium_host egress: the new skb is returned to stack 6. @cilium_net ingress: the new skb is returned to stack 7. @stack routing: the new skb is routed to eth0 8. @stack iptables: the new skb is traversing PREROUTING, FORWARD, POSTROUTING Look at step 8, we are hitting PREROUTING! Because of cilium/proxy#742, this to-world skb is also linked to a transparent socket, matching the "-m socket --transparent" condition, the packet will fortunately have the 0x200 mark. If we do nothing, this to-world skb marked with 0x200 will then hit routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to local. It should have gone to the world. This patch fixes this future issue as a precaution (otherwise we'll break git-bisect). This patch provides a straightforward solution: at step 5 @cilium_host, we set a specical mark 0x800 (MARK_MAGIC_PROXY_TO_WORLD), then iptables can exclude this mark using "-m mark ! --mark 0x800/0xf00". Signed-off-by: gray <gray.liang@isovalent.com>
After cilium/proxy#742, proxy traffic keeps original pod IP as source IP for to-world packets, which must be masqueraded to eth0 IP. There is no issue for now, but the new routing rule (0xb00 lookup 2005) to be added for #31984 will cause a side effect breaking masquerading. This patch fixes the that side effect as a precaution, otherwise git-bisect breaks. The new routing rule (0xb00 lookup 2005) will cause proxy packets going through POSTROUTING for twice: first time happens when proxy sends packets which are routed to cilium_host, these are hitting OUTPUT + **POSTROUTING**; the second time takes place after packets ingressed from cilium_net, these skbs will traverse PREROUTING + FORWARD + **POSTROUTING**. However, due to kernel's implementation details, an skb won't be processed by nat POSTROUTING for twice: after the first POSTROUTING check, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status IPS_SRC_NAT_DONE to skip the further traversal at all. [1] To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first round iptables ct, just for proxy traffic which is characterized by 0xb00 mark. [1] https://elixir.bootlin.com/linux/v6.6.2/source/net/netfilter/nf_nat_core.c#L825 [1] https://elixir.bootlin.com/linux/v6.6.2/source/include/net/netfilter/nf_nat.h#L111 Signed-off-by: gray <gray.liang@isovalent.com>
We have an iptables rule to set 0x200 mark for transparent socket: ``` *mangle -A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle -A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff ``` This rule is in the mangle PREROUTING which checks packets ingressed from a netdev. Let's then focus on the pod to world traffic when IPsec=on + proxy=on + tunnel=off. Currently, a pod-to-world packet will go through the path: 1. from_lxc@lxc: skb->mark is set to 0x200 and returned to stack 2. iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy 3. proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world) 4. stack routing: the new skb is routed to eth0 5. stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain 6. to_netdev@eth0: the new skb is going to world Please note the new skb won't hit PREROUTING chain, where there is a rule setting skb->mark=0x200. To fix #31984, we are going to change the routing for packets from egress proxy; consequently, on the step 4 above, the new skb will be routed to cilium_host instead: 4. stack routing: the new skb is routed to cilium_host 5. from_host@cilium_host: the new skb is returned to stack 6. to_host@cilium_net: the new skb is returned to stack 7. stack: PREROUTING, routing, FORWARD, POSTROUTING Look at step 7, we are hitting PREROUTING! Because of cilium/proxy#742, this to-world skb is also linked to a transparent socket, matching the "-m socket --transparent" condition, the packet will fortunately have the 0x200 mark. If we do nothing, this to-world skb marked with 0x200 will then hit routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to local. It should have gone to the world. This patch fixes this future issue as a precaution (otherwise we'll break git-bisect). This patch provides a straightforward solution: at step 5 from_host@cilium_host, we set a specical mark 0x800 (MARK_MAGIC_PROXY_TO_WORLD), then iptables can exclude this mark using "-m mark ! --mark 0x800/0xf00". Signed-off-by: gray <gray.liang@isovalent.com>
After cilium/proxy#742, proxy traffic keeps original pod IP as source IP for to-world packets, which must be masqueraded to eth0 IP. There is no issue for now, but the new routing rule (0xb00 lookup 2005) to be added for #31984 will cause a side effect breaking masquerading. This patch fixes the that side effect as a precaution, otherwise git-bisect breaks. The new routing rule (0xb00 lookup 2005) will cause proxy packets going through POSTROUTING for twice: first time happens when proxy sends packets which are routed to cilium_host, these are hitting OUTPUT + **POSTROUTING**; the second time takes place after packets ingressed from cilium_net, these skbs will traverse PREROUTING + FORWARD + **POSTROUTING**. However, due to kernel's implementation details, an skb won't be processed by nat POSTROUTING for twice: after the first POSTROUTING check, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status IPS_SRC_NAT_DONE to skip the further traversal at all. [1] To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first round iptables ct, just for proxy traffic which is characterized by 0xb00 mark. [1] https://elixir.bootlin.com/linux/v6.6.2/source/net/netfilter/nf_nat_core.c#L825 [1] https://elixir.bootlin.com/linux/v6.6.2/source/include/net/netfilter/nf_nat.h#L111 Signed-off-by: gray <gray.liang@isovalent.com>
Because of cilium/proxy#742 and #32331, the backport of #32683 needs extra care. The above two PRs avoids an issue while bringing about a new issue:
Cilium 1.15- doesn't have these two PRs, so we have to:
I think I'll just disable bpf masquerading in ci-ipsec-e2e for 1.15-, for now. Then open a separate issue for it. Background: bpf masquerading doesn't do SNAT for an skb whose src IP is cilium internal IP (cilium_host). edit: 1.15- doesn't have kpr enabled in ci-ipsec-e2e. |
We have an iptables rule to set 0x200 mark for transparent socket: ``` *mangle -A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle -A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff ``` This rule is in the mangle PREROUTING which checks packets ingressed from a netdev. Let's then focus on the pod to world traffic when IPsec=on + proxy=on + tunnel=off. Currently, a pod-to-world packet will go through the path: 1. from_lxc@lxc: skb->mark is set to 0x200 and returned to stack 2. iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy 3. proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world) 4. stack routing: the new skb is routed to eth0 5. stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain 6. to_netdev@eth0: the new skb is going to world Please note the new skb won't hit PREROUTING chain, where there is a rule setting skb->mark=0x200. To fix #31984, we are going to change the routing for packets from egress proxy; consequently, on the step 4 above, the new skb will be routed to cilium_host instead: 4. stack routing: the new skb is routed to cilium_host 5. from_host@cilium_host: the new skb is returned to stack 6. to_host@cilium_net: the new skb is returned to stack 7. stack: PREROUTING, routing, FORWARD, POSTROUTING Look at step 7, we are hitting PREROUTING! Because of cilium/proxy#742, this to-world skb is also linked to a transparent socket, matching the "-m socket --transparent" condition, the packet will fortunately have the 0x200 mark. If we do nothing, this to-world skb marked with 0x200 will then hit routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to local. It should have gone to the world. This patch fixes this future issue as a precaution (otherwise we'll break git-bisect). This patch provides a straightforward solution: at step 5 from_host@cilium_host, we set a specical mark 0x800 (MARK_MAGIC_PROXY_TO_WORLD), then iptables can exclude this mark using "-m mark ! --mark 0x800/0xf00". Signed-off-by: gray <gray.liang@isovalent.com>
After cilium/proxy#742, proxy traffic keeps original pod IP as source IP for to-world packets, which must be masqueraded to eth0 IP. There is no issue for now, but the new routing rule (0xb00 lookup 2005) to be added for #31984 will cause a side effect breaking masquerading. This patch fixes the that side effect as a precaution, otherwise git-bisect breaks. The new routing rule (0xb00 lookup 2005) will cause proxy packets going through POSTROUTING for twice: first time happens when proxy sends packets which are routed to cilium_host, these are hitting OUTPUT + **POSTROUTING**; the second time takes place after packets ingressed from cilium_net, these skbs will traverse PREROUTING + FORWARD + **POSTROUTING**. However, due to kernel's implementation details, an skb won't be processed by nat POSTROUTING for twice: after the first POSTROUTING check, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status IPS_SRC_NAT_DONE to skip the further traversal at all. [1] To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first round iptables ct, just for proxy traffic which is characterized by 0xb00 mark. [1] https://elixir.bootlin.com/linux/v6.6.2/source/net/netfilter/nf_nat_core.c#L825 [1] https://elixir.bootlin.com/linux/v6.6.2/source/include/net/netfilter/nf_nat.h#L111 Signed-off-by: gray <gray.liang@isovalent.com>
[ upstream commit: 3384d73 ] After cilium/proxy#742, proxy traffic keeps original pod IP as source IP for to-world packets, which must be masqueraded to eth0 IP. There is no issue for now, but the new routing rule (0xb00 lookup 2005) to be added for #31984 will cause a side effect breaking masquerading. This patch fixes the that side effect as a precaution, otherwise git-bisect breaks. The new routing rule (0xb00 lookup 2005) will cause proxy packets going through POSTROUTING for twice: first time happens when proxy sends packets which are routed to cilium_host, these are hitting OUTPUT + **POSTROUTING**; the second time takes place after packets ingressed from cilium_net, these skbs will traverse PREROUTING + FORWARD + **POSTROUTING**. However, due to kernel's implementation details, an skb won't be processed by nat POSTROUTING for twice: after the first POSTROUTING check, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status IPS_SRC_NAT_DONE to skip the further traversal at all. [1] To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first round iptables ct, just for proxy traffic which is characterized by 0xb00 mark. [1] https://elixir.bootlin.com/linux/v6.6.2/source/net/netfilter/nf_nat_core.c#L825 [1] https://elixir.bootlin.com/linux/v6.6.2/source/include/net/netfilter/nf_nat.h#L111 Signed-off-by: gray <gray.liang@isovalent.com>
[ upstream commit: f93a40c ] We have an iptables rule to set 0x200 mark for transparent socket: ``` *mangle -A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle -A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff ``` This rule is in the mangle PREROUTING which checks packets ingressed from a netdev. Let's then focus on the pod to world traffic when IPsec=on + proxy=on + tunnel=off. Currently, a pod-to-world packet will go through the path: 1. from_lxc@lxc: skb->mark is set to 0x200 and returned to stack 2. iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy 3. proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world) 4. stack routing: the new skb is routed to eth0 5. stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain 6. to_netdev@eth0: the new skb is going to world Please note the new skb won't hit PREROUTING chain, where there is a rule setting skb->mark=0x200. To fix #31984, we are going to change the routing for packets from egress proxy; consequently, on the step 4 above, the new skb will be routed to cilium_host instead: 4. stack routing: the new skb is routed to cilium_host 5. from_host@cilium_host: the new skb is returned to stack 6. to_host@cilium_net: the new skb is returned to stack 7. stack: PREROUTING, routing, FORWARD, POSTROUTING Look at step 7, we are hitting PREROUTING! Because of cilium/proxy#742, this to-world skb is also linked to a transparent socket, matching the "-m socket --transparent" condition, the packet will fortunately have the 0x200 mark. If we do nothing, this to-world skb marked with 0x200 will then hit routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to local. It should have gone to the world. This patch fixes this future issue as a precaution (otherwise we'll break git-bisect). This patch provides a straightforward solution: at step 5 from_host@cilium_host, we set a specical mark 0x800 (MARK_MAGIC_PROXY_TO_WORLD), then iptables can exclude this mark using "-m mark ! --mark 0x800/0xf00". Signed-off-by: gray <gray.liang@isovalent.com>
[ upstream commit: 3384d73 ] After cilium/proxy#742, proxy traffic keeps original pod IP as source IP for to-world packets, which must be masqueraded to eth0 IP. There is no issue for now, but the new routing rule (0xb00 lookup 2005) to be added for #31984 will cause a side effect breaking masquerading. This patch fixes the that side effect as a precaution, otherwise git-bisect breaks. The new routing rule (0xb00 lookup 2005) will cause proxy packets going through POSTROUTING for twice: first time happens when proxy sends packets which are routed to cilium_host, these are hitting OUTPUT + **POSTROUTING**; the second time takes place after packets ingressed from cilium_net, these skbs will traverse PREROUTING + FORWARD + **POSTROUTING**. However, due to kernel's implementation details, an skb won't be processed by nat POSTROUTING for twice: after the first POSTROUTING check, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status IPS_SRC_NAT_DONE to skip the further traversal at all. [1] To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first round iptables ct, just for proxy traffic which is characterized by 0xb00 mark. [1] https://elixir.bootlin.com/linux/v6.6.2/source/net/netfilter/nf_nat_core.c#L825 [1] https://elixir.bootlin.com/linux/v6.6.2/source/include/net/netfilter/nf_nat.h#L111 Signed-off-by: gray <gray.liang@isovalent.com>
[ upstream commit: f93a40c ] We have an iptables rule to set 0x200 mark for transparent socket: ``` *mangle -A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle -A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff ``` This rule is in the mangle PREROUTING which checks packets ingressed from a netdev. Let's then focus on the pod to world traffic when IPsec=on + proxy=on + tunnel=off. Currently, a pod-to-world packet will go through the path: 1. from_lxc@lxc: skb->mark is set to 0x200 and returned to stack 2. iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy 3. proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world) 4. stack routing: the new skb is routed to eth0 5. stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain 6. to_netdev@eth0: the new skb is going to world Please note the new skb won't hit PREROUTING chain, where there is a rule setting skb->mark=0x200. To fix #31984, we are going to change the routing for packets from egress proxy; consequently, on the step 4 above, the new skb will be routed to cilium_host instead: 4. stack routing: the new skb is routed to cilium_host 5. from_host@cilium_host: the new skb is returned to stack 6. to_host@cilium_net: the new skb is returned to stack 7. stack: PREROUTING, routing, FORWARD, POSTROUTING Look at step 7, we are hitting PREROUTING! Because of cilium/proxy#742, this to-world skb is also linked to a transparent socket, matching the "-m socket --transparent" condition, the packet will fortunately have the 0x200 mark. If we do nothing, this to-world skb marked with 0x200 will then hit routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to local. It should have gone to the world. This patch fixes this future issue as a precaution (otherwise we'll break git-bisect). This patch provides a straightforward solution: at step 5 from_host@cilium_host, we set a specical mark 0x800 (MARK_MAGIC_PROXY_TO_WORLD), then iptables can exclude this mark using "-m mark ! --mark 0x800/0xf00". Signed-off-by: gray <gray.liang@isovalent.com>
[ upstream commit: 3384d73 ] After cilium/proxy#742, proxy traffic keeps original pod IP as source IP for to-world packets, which must be masqueraded to eth0 IP. There is no issue for now, but the new routing rule (0xb00 lookup 2005) to be added for #31984 will cause a side effect breaking masquerading. This patch fixes the that side effect as a precaution, otherwise git-bisect breaks. The new routing rule (0xb00 lookup 2005) will cause proxy packets going through POSTROUTING for twice: first time happens when proxy sends packets which are routed to cilium_host, these are hitting OUTPUT + **POSTROUTING**; the second time takes place after packets ingressed from cilium_net, these skbs will traverse PREROUTING + FORWARD + **POSTROUTING**. However, due to kernel's implementation details, an skb won't be processed by nat POSTROUTING for twice: after the first POSTROUTING check, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status IPS_SRC_NAT_DONE to skip the further traversal at all. [1] To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first round iptables ct, just for proxy traffic which is characterized by 0xb00 mark. [1] https://elixir.bootlin.com/linux/v6.6.2/source/net/netfilter/nf_nat_core.c#L825 [1] https://elixir.bootlin.com/linux/v6.6.2/source/include/net/netfilter/nf_nat.h#L111 Signed-off-by: gray <gray.liang@isovalent.com>
[ upstream commit: f93a40c ] We have an iptables rule to set 0x200 mark for transparent socket: ``` *mangle -A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle -A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff ``` This rule is in the mangle PREROUTING which checks packets ingressed from a netdev. Let's then focus on the pod to world traffic when IPsec=on + proxy=on + tunnel=off. Currently, a pod-to-world packet will go through the path: 1. from_lxc@lxc: skb->mark is set to 0x200 and returned to stack 2. iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy 3. proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world) 4. stack routing: the new skb is routed to eth0 5. stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain 6. to_netdev@eth0: the new skb is going to world Please note the new skb won't hit PREROUTING chain, where there is a rule setting skb->mark=0x200. To fix #31984, we are going to change the routing for packets from egress proxy; consequently, on the step 4 above, the new skb will be routed to cilium_host instead: 4. stack routing: the new skb is routed to cilium_host 5. from_host@cilium_host: the new skb is returned to stack 6. to_host@cilium_net: the new skb is returned to stack 7. stack: PREROUTING, routing, FORWARD, POSTROUTING Look at step 7, we are hitting PREROUTING! Because of cilium/proxy#742, this to-world skb is also linked to a transparent socket, matching the "-m socket --transparent" condition, the packet will fortunately have the 0x200 mark. If we do nothing, this to-world skb marked with 0x200 will then hit routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to local. It should have gone to the world. This patch fixes this future issue as a precaution (otherwise we'll break git-bisect). This patch provides a straightforward solution: at step 5 from_host@cilium_host, we set a specical mark 0x800 (MARK_MAGIC_PROXY_TO_WORLD), then iptables can exclude this mark using "-m mark ! --mark 0x800/0xf00". Signed-off-by: gray <gray.liang@isovalent.com>
[ upstream commit: 3384d73 ] After cilium/proxy#742, proxy traffic keeps original pod IP as source IP for to-world packets, which must be masqueraded to eth0 IP. There is no issue for now, but the new routing rule (0xb00 lookup 2005) to be added for #31984 will cause a side effect breaking masquerading. This patch fixes the that side effect as a precaution, otherwise git-bisect breaks. The new routing rule (0xb00 lookup 2005) will cause proxy packets going through POSTROUTING for twice: first time happens when proxy sends packets which are routed to cilium_host, these are hitting OUTPUT + **POSTROUTING**; the second time takes place after packets ingressed from cilium_net, these skbs will traverse PREROUTING + FORWARD + **POSTROUTING**. However, due to kernel's implementation details, an skb won't be processed by nat POSTROUTING for twice: after the first POSTROUTING check, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status IPS_SRC_NAT_DONE to skip the further traversal at all. [1] To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first round iptables ct, just for proxy traffic which is characterized by 0xb00 mark. [1] https://elixir.bootlin.com/linux/v6.6.2/source/net/netfilter/nf_nat_core.c#L825 [1] https://elixir.bootlin.com/linux/v6.6.2/source/include/net/netfilter/nf_nat.h#L111 Signed-off-by: gray <gray.liang@isovalent.com>
[ upstream commit: f93a40c ] We have an iptables rule to set 0x200 mark for transparent socket: ``` *mangle -A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle -A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff ``` This rule is in the mangle PREROUTING which checks packets ingressed from a netdev. Let's then focus on the pod to world traffic when IPsec=on + proxy=on + tunnel=off. Currently, a pod-to-world packet will go through the path: 1. from_lxc@lxc: skb->mark is set to 0x200 and returned to stack 2. iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy 3. proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world) 4. stack routing: the new skb is routed to eth0 5. stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain 6. to_netdev@eth0: the new skb is going to world Please note the new skb won't hit PREROUTING chain, where there is a rule setting skb->mark=0x200. To fix #31984, we are going to change the routing for packets from egress proxy; consequently, on the step 4 above, the new skb will be routed to cilium_host instead: 4. stack routing: the new skb is routed to cilium_host 5. from_host@cilium_host: the new skb is returned to stack 6. to_host@cilium_net: the new skb is returned to stack 7. stack: PREROUTING, routing, FORWARD, POSTROUTING Look at step 7, we are hitting PREROUTING! Because of cilium/proxy#742, this to-world skb is also linked to a transparent socket, matching the "-m socket --transparent" condition, the packet will fortunately have the 0x200 mark. If we do nothing, this to-world skb marked with 0x200 will then hit routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to local. It should have gone to the world. This patch fixes this future issue as a precaution (otherwise we'll break git-bisect). This patch provides a straightforward solution: at step 5 from_host@cilium_host, we set a specical mark 0x800 (MARK_MAGIC_PROXY_TO_WORLD), then iptables can exclude this mark using "-m mark ! --mark 0x800/0xf00". Signed-off-by: gray <gray.liang@isovalent.com>
[ upstream commit: 3384d73 ] After cilium/proxy#742, proxy traffic keeps original pod IP as source IP for to-world packets, which must be masqueraded to eth0 IP. There is no issue for now, but the new routing rule (0xb00 lookup 2005) to be added for #31984 will cause a side effect breaking masquerading. This patch fixes the that side effect as a precaution, otherwise git-bisect breaks. The new routing rule (0xb00 lookup 2005) will cause proxy packets going through POSTROUTING for twice: first time happens when proxy sends packets which are routed to cilium_host, these are hitting OUTPUT + **POSTROUTING**; the second time takes place after packets ingressed from cilium_net, these skbs will traverse PREROUTING + FORWARD + **POSTROUTING**. However, due to kernel's implementation details, an skb won't be processed by nat POSTROUTING for twice: after the first POSTROUTING check, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status IPS_SRC_NAT_DONE to skip the further traversal at all. [1] To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first round iptables ct, just for proxy traffic which is characterized by 0xb00 mark. [1] https://elixir.bootlin.com/linux/v6.6.2/source/net/netfilter/nf_nat_core.c#L825 [1] https://elixir.bootlin.com/linux/v6.6.2/source/include/net/netfilter/nf_nat.h#L111 Signed-off-by: gray <gray.liang@isovalent.com>
[ upstream commit: f93a40c ] We have an iptables rule to set 0x200 mark for transparent socket: ``` *mangle -A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle -A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff ``` This rule is in the mangle PREROUTING which checks packets ingressed from a netdev. Let's then focus on the pod to world traffic when IPsec=on + proxy=on + tunnel=off. Currently, a pod-to-world packet will go through the path: 1. from_lxc@lxc: skb->mark is set to 0x200 and returned to stack 2. iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy 3. proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world) 4. stack routing: the new skb is routed to eth0 5. stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain 6. to_netdev@eth0: the new skb is going to world Please note the new skb won't hit PREROUTING chain, where there is a rule setting skb->mark=0x200. To fix #31984, we are going to change the routing for packets from egress proxy; consequently, on the step 4 above, the new skb will be routed to cilium_host instead: 4. stack routing: the new skb is routed to cilium_host 5. from_host@cilium_host: the new skb is returned to stack 6. to_host@cilium_net: the new skb is returned to stack 7. stack: PREROUTING, routing, FORWARD, POSTROUTING Look at step 7, we are hitting PREROUTING! Because of cilium/proxy#742, this to-world skb is also linked to a transparent socket, matching the "-m socket --transparent" condition, the packet will fortunately have the 0x200 mark. If we do nothing, this to-world skb marked with 0x200 will then hit routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to local. It should have gone to the world. This patch fixes this future issue as a precaution (otherwise we'll break git-bisect). This patch provides a straightforward solution: at step 5 from_host@cilium_host, we set a specical mark 0x800 (MARK_MAGIC_PROXY_TO_WORLD), then iptables can exclude this mark using "-m mark ! --mark 0x800/0xf00". Signed-off-by: gray <gray.liang@isovalent.com>
[ upstream commit: 3384d73 ] After cilium/proxy#742, proxy traffic keeps original pod IP as source IP for to-world packets, which must be masqueraded to eth0 IP. There is no issue for now, but the new routing rule (0xb00 lookup 2005) to be added for #31984 will cause a side effect breaking masquerading. This patch fixes the that side effect as a precaution, otherwise git-bisect breaks. The new routing rule (0xb00 lookup 2005) will cause proxy packets going through POSTROUTING for twice: first time happens when proxy sends packets which are routed to cilium_host, these are hitting OUTPUT + **POSTROUTING**; the second time takes place after packets ingressed from cilium_net, these skbs will traverse PREROUTING + FORWARD + **POSTROUTING**. However, due to kernel's implementation details, an skb won't be processed by nat POSTROUTING for twice: after the first POSTROUTING check, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status IPS_SRC_NAT_DONE to skip the further traversal at all. [1] To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first round iptables ct, just for proxy traffic which is characterized by 0xb00 mark. [1] https://elixir.bootlin.com/linux/v6.6.2/source/net/netfilter/nf_nat_core.c#L825 [1] https://elixir.bootlin.com/linux/v6.6.2/source/include/net/netfilter/nf_nat.h#L111 Signed-off-by: gray <gray.liang@isovalent.com>
[ upstream commit: f93a40c ] We have an iptables rule to set 0x200 mark for transparent socket: ``` *mangle -A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle -A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff ``` This rule is in the mangle PREROUTING which checks packets ingressed from a netdev. Let's then focus on the pod to world traffic when IPsec=on + proxy=on + tunnel=off. Currently, a pod-to-world packet will go through the path: 1. from_lxc@lxc: skb->mark is set to 0x200 and returned to stack 2. iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy 3. proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world) 4. stack routing: the new skb is routed to eth0 5. stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain 6. to_netdev@eth0: the new skb is going to world Please note the new skb won't hit PREROUTING chain, where there is a rule setting skb->mark=0x200. To fix #31984, we are going to change the routing for packets from egress proxy; consequently, on the step 4 above, the new skb will be routed to cilium_host instead: 4. stack routing: the new skb is routed to cilium_host 5. from_host@cilium_host: the new skb is returned to stack 6. to_host@cilium_net: the new skb is returned to stack 7. stack: PREROUTING, routing, FORWARD, POSTROUTING Look at step 7, we are hitting PREROUTING! Because of cilium/proxy#742, this to-world skb is also linked to a transparent socket, matching the "-m socket --transparent" condition, the packet will fortunately have the 0x200 mark. If we do nothing, this to-world skb marked with 0x200 will then hit routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to local. It should have gone to the world. This patch fixes this future issue as a precaution (otherwise we'll break git-bisect). This patch provides a straightforward solution: at step 5 from_host@cilium_host, we set a specical mark 0x800 (MARK_MAGIC_PROXY_TO_WORLD), then iptables can exclude this mark using "-m mark ! --mark 0x800/0xf00". Signed-off-by: gray <gray.liang@isovalent.com>
[ upstream commit: 3384d73 ] After cilium/proxy#742, proxy traffic keeps original pod IP as source IP for to-world packets, which must be masqueraded to eth0 IP. There is no issue for now, but the new routing rule (0xb00 lookup 2005) to be added for #31984 will cause a side effect breaking masquerading. This patch fixes the that side effect as a precaution, otherwise git-bisect breaks. The new routing rule (0xb00 lookup 2005) will cause proxy packets going through POSTROUTING for twice: first time happens when proxy sends packets which are routed to cilium_host, these are hitting OUTPUT + **POSTROUTING**; the second time takes place after packets ingressed from cilium_net, these skbs will traverse PREROUTING + FORWARD + **POSTROUTING**. However, due to kernel's implementation details, an skb won't be processed by nat POSTROUTING for twice: after the first POSTROUTING check, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status IPS_SRC_NAT_DONE to skip the further traversal at all. [1] To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first round iptables ct, just for proxy traffic which is characterized by 0xb00 mark. [1] https://elixir.bootlin.com/linux/v6.6.2/source/net/netfilter/nf_nat_core.c#L825 [1] https://elixir.bootlin.com/linux/v6.6.2/source/include/net/netfilter/nf_nat.h#L111 Signed-off-by: gray <gray.liang@isovalent.com>
[ upstream commit: f93a40c ] We have an iptables rule to set 0x200 mark for transparent socket: ``` *mangle -A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle -A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff ``` This rule is in the mangle PREROUTING which checks packets ingressed from a netdev. Let's then focus on the pod to world traffic when IPsec=on + proxy=on + tunnel=off. Currently, a pod-to-world packet will go through the path: 1. from_lxc@lxc: skb->mark is set to 0x200 and returned to stack 2. iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy 3. proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world) 4. stack routing: the new skb is routed to eth0 5. stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain 6. to_netdev@eth0: the new skb is going to world Please note the new skb won't hit PREROUTING chain, where there is a rule setting skb->mark=0x200. To fix #31984, we are going to change the routing for packets from egress proxy; consequently, on the step 4 above, the new skb will be routed to cilium_host instead: 4. stack routing: the new skb is routed to cilium_host 5. from_host@cilium_host: the new skb is returned to stack 6. to_host@cilium_net: the new skb is returned to stack 7. stack: PREROUTING, routing, FORWARD, POSTROUTING Look at step 7, we are hitting PREROUTING! Because of cilium/proxy#742, this to-world skb is also linked to a transparent socket, matching the "-m socket --transparent" condition, the packet will fortunately have the 0x200 mark. If we do nothing, this to-world skb marked with 0x200 will then hit routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to local. It should have gone to the world. This patch fixes this future issue as a precaution (otherwise we'll break git-bisect). This patch provides a straightforward solution: at step 5 from_host@cilium_host, we set a specical mark 0x800 (MARK_MAGIC_PROXY_TO_WORLD), then iptables can exclude this mark using "-m mark ! --mark 0x800/0xf00". Signed-off-by: gray <gray.liang@isovalent.com>
[ upstream commit: 3384d73 ] After cilium/proxy#742, proxy traffic keeps original pod IP as source IP for to-world packets, which must be masqueraded to eth0 IP. There is no issue for now, but the new routing rule (0xb00 lookup 2005) to be added for #31984 will cause a side effect breaking masquerading. This patch fixes the that side effect as a precaution, otherwise git-bisect breaks. The new routing rule (0xb00 lookup 2005) will cause proxy packets going through POSTROUTING for twice: first time happens when proxy sends packets which are routed to cilium_host, these are hitting OUTPUT + **POSTROUTING**; the second time takes place after packets ingressed from cilium_net, these skbs will traverse PREROUTING + FORWARD + **POSTROUTING**. However, due to kernel's implementation details, an skb won't be processed by nat POSTROUTING for twice: after the first POSTROUTING check, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status IPS_SRC_NAT_DONE to skip the further traversal at all. [1] To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first round iptables ct, just for proxy traffic which is characterized by 0xb00 mark. [1] https://elixir.bootlin.com/linux/v6.6.2/source/net/netfilter/nf_nat_core.c#L825 [1] https://elixir.bootlin.com/linux/v6.6.2/source/include/net/netfilter/nf_nat.h#L111 Signed-off-by: gray <gray.liang@isovalent.com>
[ upstream commit: f93a40c ] We have an iptables rule to set 0x200 mark for transparent socket: ``` *mangle -A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle -A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff ``` This rule is in the mangle PREROUTING which checks packets ingressed from a netdev. Let's then focus on the pod to world traffic when IPsec=on + proxy=on + tunnel=off. Currently, a pod-to-world packet will go through the path: 1. from_lxc@lxc: skb->mark is set to 0x200 and returned to stack 2. iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy 3. proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world) 4. stack routing: the new skb is routed to eth0 5. stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain 6. to_netdev@eth0: the new skb is going to world Please note the new skb won't hit PREROUTING chain, where there is a rule setting skb->mark=0x200. To fix #31984, we are going to change the routing for packets from egress proxy; consequently, on the step 4 above, the new skb will be routed to cilium_host instead: 4. stack routing: the new skb is routed to cilium_host 5. from_host@cilium_host: the new skb is returned to stack 6. to_host@cilium_net: the new skb is returned to stack 7. stack: PREROUTING, routing, FORWARD, POSTROUTING Look at step 7, we are hitting PREROUTING! Because of cilium/proxy#742, this to-world skb is also linked to a transparent socket, matching the "-m socket --transparent" condition, the packet will fortunately have the 0x200 mark. If we do nothing, this to-world skb marked with 0x200 will then hit routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to local. It should have gone to the world. This patch fixes this future issue as a precaution (otherwise we'll break git-bisect). This patch provides a straightforward solution: at step 5 from_host@cilium_host, we set a specical mark 0x800 (MARK_MAGIC_PROXY_TO_WORLD), then iptables can exclude this mark using "-m mark ! --mark 0x800/0xf00". Signed-off-by: gray <gray.liang@isovalent.com>
[ upstream commit: 3384d73 ] After cilium/proxy#742, proxy traffic keeps original pod IP as source IP for to-world packets, which must be masqueraded to eth0 IP. There is no issue for now, but the new routing rule (0xb00 lookup 2005) to be added for #31984 will cause a side effect breaking masquerading. This patch fixes the that side effect as a precaution, otherwise git-bisect breaks. The new routing rule (0xb00 lookup 2005) will cause proxy packets going through POSTROUTING for twice: first time happens when proxy sends packets which are routed to cilium_host, these are hitting OUTPUT + **POSTROUTING**; the second time takes place after packets ingressed from cilium_net, these skbs will traverse PREROUTING + FORWARD + **POSTROUTING**. However, due to kernel's implementation details, an skb won't be processed by nat POSTROUTING for twice: after the first POSTROUTING check, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status IPS_SRC_NAT_DONE to skip the further traversal at all. [1] To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first round iptables ct, just for proxy traffic which is characterized by 0xb00 mark. [1] https://elixir.bootlin.com/linux/v6.6.2/source/net/netfilter/nf_nat_core.c#L825 [1] https://elixir.bootlin.com/linux/v6.6.2/source/include/net/netfilter/nf_nat.h#L111 Signed-off-by: gray <gray.liang@isovalent.com>
Security advisory: GHSA-j89h-qrvr-xc36
Previous PRs (#29530, #29594, #30095) addressed the vulnerabilities for IPsec and L7 ingress proxy, this issue is opened for IPsec and L7 egress proxy.
Initial Attempts
In the first place we tried to simply add an ip rule for from-egress-proxy traffic, just like what we did at bca5f07, which added an ip rule for from-ingress-proxy traffic. This commit (1c97de7) implemented the idea,
However, it has turned out to be not enough. We realized iptables also needs to be taken care of, so the second commit (1b5e7d6) followed. Please see the commit message for reasons.
After above two patches, ci-ipsec-e2e seemed to be happy, even with leak detection, but ci-ipsec-upgrade turned broken badly. We found the failure happened during downgrade from, say, 1.15-patched to 1.14-tip (minor downgrade). 1.15-patches will install the new from-egress-proxy rule + new iptables rule; after downgrade to 1.14-tip, the new iptables rule is cleaned up, but the new from-egress-proxy rule is left over, because the old 1.14-tip doesn't recognise from-egress-rule at all. The stale ip rule messed up.
To handle rules carefully without breaking downgrade, I propose the two-stage patches as followed.
Proposed solution
Stage one
We will have to merge the following PRs:
These PRs are doing a simple thing: remove from-egress-proxy rule, even they doesn't install the rule. The rule is probably installed by higher version of cilium.
Stage two
After PRs in stage one are all merged into stable branches and released in weeks, we can proceed with the following PRs:
ipsec: Fix unencrypted traffic when IPsec is used with L7 egress proxy #31955Pr/gray/1.15/egress proxy ipsec fix2 #31975Right now they consists of two commits described in "Initial Attempts" section, plus "manually specify downgrade version", plus leak detection CI checks. I will clean up the PRs once ready, for now the leak detection patches and "manually specify" patch are temporarily added to demonstrate the solution works. We have noted that ci-ipsec-upgrade and ci-ipsec-e2e are mostly in green status.
Outstanding issues
Leak detection check sometimes failes at the very end of ci-ipsec-e2e. The latest speculation is it's caused by conn-disrupt pods deletion.
If conn-disrupt-server is deleted with its ipcache entry, but an skb holding conn-disrupt-client -> conn-disrupt-server already gets emitted, then this skb at from_host@cilium_host won't be marked as need-ipsec-encryption (0xe00) due to ipcache entry missing. bpftrace wil then catch it.
Right now there is no obvious solution (if the theory turns out to be correct). We will move forward and discuss it later.
The text was updated successfully, but these errors were encountered: