Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unencrypted traffic among nodes when IPsec is used with L7 egress proxy #31984

Closed
4 of 11 tasks
jschwinger233 opened this issue Apr 16, 2024 · 1 comment · Fixed by #32683
Closed
4 of 11 tasks

Unencrypted traffic among nodes when IPsec is used with L7 egress proxy #31984

jschwinger233 opened this issue Apr 16, 2024 · 1 comment · Fixed by #32683
Assignees
Labels
area/encryption Impacts encryption support such as IPSec, WireGuard, or kTLS. area/proxy Impacts proxy components, including DNS, Kafka, Envoy and/or XDS servers. feature/ipsec Relates to Cilium's IPsec feature kind/bug This is a bug in the Cilium logic. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages.

Comments

@jschwinger233
Copy link
Member

jschwinger233 commented Apr 16, 2024

Security advisory: GHSA-j89h-qrvr-xc36

Previous PRs (#29530, #29594, #30095) addressed the vulnerabilities for IPsec and L7 ingress proxy, this issue is opened for IPsec and L7 egress proxy.

Initial Attempts

In the first place we tried to simply add an ip rule for from-egress-proxy traffic, just like what we did at bca5f07, which added an ip rule for from-ingress-proxy traffic. This commit (1c97de7) implemented the idea,

However, it has turned out to be not enough. We realized iptables also needs to be taken care of, so the second commit (1b5e7d6) followed. Please see the commit message for reasons.

After above two patches, ci-ipsec-e2e seemed to be happy, even with leak detection, but ci-ipsec-upgrade turned broken badly. We found the failure happened during downgrade from, say, 1.15-patched to 1.14-tip (minor downgrade). 1.15-patches will install the new from-egress-proxy rule + new iptables rule; after downgrade to 1.14-tip, the new iptables rule is cleaned up, but the new from-egress-proxy rule is left over, because the old 1.14-tip doesn't recognise from-egress-rule at all. The stale ip rule messed up.

To handle rules carefully without breaking downgrade, I propose the two-stage patches as followed.

Proposed solution

Stage one

We will have to merge the following PRs:

These PRs are doing a simple thing: remove from-egress-proxy rule, even they doesn't install the rule. The rule is probably installed by higher version of cilium.

Stage two

After PRs in stage one are all merged into stable branches and released in weeks, we can proceed with the following PRs:

Right now they consists of two commits described in "Initial Attempts" section, plus "manually specify downgrade version", plus leak detection CI checks. I will clean up the PRs once ready, for now the leak detection patches and "manually specify" patch are temporarily added to demonstrate the solution works. We have noted that ci-ipsec-upgrade and ci-ipsec-e2e are mostly in green status.

Outstanding issues

Leak detection check sometimes failes at the very end of ci-ipsec-e2e. The latest speculation is it's caused by conn-disrupt pods deletion.

If conn-disrupt-server is deleted with its ipcache entry, but an skb holding conn-disrupt-client -> conn-disrupt-server already gets emitted, then this skb at from_host@cilium_host won't be marked as need-ipsec-encryption (0xe00) due to ipcache entry missing. bpftrace wil then catch it.

Right now there is no obvious solution (if the theory turns out to be correct). We will move forward and discuss it later.

@jschwinger233 jschwinger233 self-assigned this Apr 16, 2024
@jschwinger233 jschwinger233 added kind/bug This is a bug in the Cilium logic. area/encryption Impacts encryption support such as IPSec, WireGuard, or kTLS. feature/ipsec Relates to Cilium's IPsec feature area/proxy Impacts proxy components, including DNS, Kafka, Envoy and/or XDS servers. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. labels Apr 16, 2024
jschwinger233 added a commit that referenced this issue May 16, 2024
Otherwise the further patch for L7 proxy + IPsec fails pod-to-world
testcases.

To fix #31984, we are going to
change the routing for traffic from egress proxy: when tunnel is
disabled, the traffic  will be routed to cilium_host instead of eth0.
This leads to a consequence of changing packets' source IPs: from eth0's
IP to cilium_host's IP. The implication requires additional masquerading
for pod to world traffic, because these packets now have cilium_host's
IP as source, rather than eth0.

This patch ensures iptables masquerade rules are installed for pod to
world traffic, even when KPR is enabled.

Signed-off-by: gray <gray.liang@isovalent.com>
jschwinger233 added a commit that referenced this issue May 16, 2024
Otherwise the further patch for L7 proxy + IPsec fails pod-to-world
testcases.

To fix #31984, we are going to
change the routing for traffic from egress proxy: when tunnel is
disabled, the traffic  will be routed to cilium_host instead of eth0.
This leads to a consequence of changing packets' source IPs: from eth0's
IP to cilium_host's IP. The implication requires additional masquerading
for pod to world traffic, because these packets now have cilium_host's
IP as source, rather than eth0.

This patch ensures iptables masquerade rules are installed for pod to
world traffic, even when KPR is enabled.

Signed-off-by: gray <gray.liang@isovalent.com>
jschwinger233 added a commit that referenced this issue May 16, 2024
Otherwise the further patch for L7 proxy + IPsec fails pod-to-world
testcases.

To fix #31984, we are going to
change the routing for traffic from egress proxy: when tunnel is
disabled, the traffic  will be routed to cilium_host instead of eth0.
This leads to a consequence of changing packets' source IPs: from eth0's
IP to cilium_host's IP. The implication requires additional masquerading
for pod to world traffic, because these packets now have cilium_host's
IP as source, rather than eth0.

This patch ensures iptables masquerade rules are installed for pod to
world traffic, even when KPR is enabled.

Signed-off-by: gray <gray.liang@isovalent.com>
jschwinger233 added a commit that referenced this issue May 23, 2024
Otherwise the further patch for L7 proxy + IPsec fails pod-to-world
testcases.

To fix #31984, we are going to
change the routing for traffic from egress proxy: when tunnel is
disabled, the traffic  will be routed to cilium_host instead of eth0.
This leads to a consequence of changing packets' source IPs: from eth0's
IP to cilium_host's IP. The implication requires additional masquerading
for pod to world traffic, because these packets now have cilium_host's
IP as source, rather than eth0.

This patch ensures iptables masquerade rules are installed for pod to
world traffic, even when KPR is enabled.

Signed-off-by: gray <gray.liang@isovalent.com>
jschwinger233 added a commit that referenced this issue May 24, 2024
Otherwise the further patch for L7 proxy + IPsec fails pod-to-world
testcases.

To fix #31984, we are going to
change the routing for traffic from egress proxy: when tunnel is
disabled, the traffic  will be routed to cilium_host instead of eth0.
This leads to a consequence of changing packets' source IPs: from eth0's
IP to cilium_host's IP. The implication requires additional masquerading
for pod to world traffic, because these packets now have cilium_host's
IP as source, rather than eth0.

This patch ensures iptables masquerade rules are installed for pod to
world traffic, even when KPR is enabled.

Signed-off-by: gray <gray.liang@isovalent.com>
jschwinger233 added a commit that referenced this issue May 29, 2024
We have an iptables rule to set 0x200 mark for transparent socket:

```
*mangle
-A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle
-A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff
```

This rule is in the mangle PREROUTING which checks packets ingressed
from a netdev.

Let's then focus on the pod to world traffic when IPsec=on + proxy=on +
tunnel=off.

Currently, a pod-to-world packet will go through the path:
1. @lxc ingress: skb->mark is set to 0x200 and returned to stack
2. @stack iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy
3. @Proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world)
4. @stack routing: the new skb is routed to eth0
5. @stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain
6. @eth0 egress: the new skb is going to world

Please note the new skb won't hit PREROUTING chain, where there is a
rule setting skb->mark=0x200.

To fix #31984, we are going to
change the routing for packets from egress proxy: on the step 4 above,
the new skb will be routed to cilium_host instead:

4. @stack routing: the new skb is routed to cilium_host
5. @cilium_host egress: the new skb is returned to stack
6. @cilium_net ingress: the new skb is returned to stack
7. @stack routing: the new skb is routed to eth0
8. @stack iptables: the new skb is traversing PREROUTING, FORWARD, POSTROUTING

Look at step 8, we are hitting PREROUTING! Because of
cilium/proxy#742, this to-world skb is also
linked to a transparent socket, matching the "-m socket --transparent"
condition.

If we do nothing, this to-world skb will be set 0x200 mark, and then hit
routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to
local. It should have gone to the world.

This patch fixes this future issue as a precaution (otherwise we'll
break git-bisect).

This patch provides a straightforward solution: at step 5 @cilium_host,
we set a specical mark 0x800 (MARK_MAGIC_GOTO_WORLD), then iptables can
exclude this mark using "-m mark ! --mark 0x800/0xf00".

Signed-off-by: gray <gray.liang@isovalent.com>
jschwinger233 added a commit that referenced this issue May 29, 2024
After cilium/proxy#742, proxy traffic keeps
original pod IP as source IP for to-world packets, which must be
masqueraded to eth0 IP. there is no issue for now, but the new
routing rule (0xb00 lookup 2005) to be added for #31984
will cause a side effect breaking masquerading. This patch fixes the
that side effect as a precaution, otherwise git-bisect breaks.

The new routing rule (0xb00 lookup 2005) will cause proxy packets going
through POSTROUTING for twice: first time happens when proxy sends
packets which are routed to cilium_host, these are hitting OUTPUT +
**POSTROUTING**; the second time takes place after packets ingressed
from cilium_net, these skbs will traverse PREROUTING + FORWARD +
**POSTROUTING**.

However, due to kernel's implementation details, an skb won't be
processed by nat POSTROUTING for twice: after the first POSTROUTING
traversal, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status
IPS_SRC_NAT_DONE to skip the further traversal at all.

To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an
iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first
round iptables ct, just for proxy traffic which is characterized by
0xb00 mark.

Signed-off-by: gray <gray.liang@isovalent.com>
jschwinger233 added a commit that referenced this issue May 29, 2024
We have an iptables rule to set 0x200 mark for transparent socket:

```
*mangle
-A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle
-A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff
```

This rule is in the mangle PREROUTING which checks packets ingressed
from a netdev.

Let's then focus on the pod to world traffic when IPsec=on + proxy=on +
tunnel=off.

Currently, a pod-to-world packet will go through the path:
1. @lxc ingress: skb->mark is set to 0x200 and returned to stack
2. @stack iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy
3. @Proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world)
4. @stack routing: the new skb is routed to eth0
5. @stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain
6. @eth0 egress: the new skb is going to world

Please note the new skb won't hit PREROUTING chain, where there is a
rule setting skb->mark=0x200.

To fix #31984, we are going to
change the routing for packets from egress proxy; consequently, on the
step 4 above, the new skb will be routed to cilium_host instead:

4. @stack routing: the new skb is routed to cilium_host
5. @cilium_host egress: the new skb is returned to stack
6. @cilium_net ingress: the new skb is returned to stack
7. @stack routing: the new skb is routed to eth0
8. @stack iptables: the new skb is traversing PREROUTING, FORWARD, POSTROUTING

Look at step 8, we are hitting PREROUTING! Because of
cilium/proxy#742, this to-world skb is also
linked to a transparent socket, matching the "-m socket --transparent"
condition, the packet will fortunately have the 0x200 mark.

If we do nothing, this to-world skb marked with 0x200 will then hit
routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to
local. It should have gone to the world.

This patch fixes this future issue as a precaution (otherwise we'll
break git-bisect).

This patch provides a straightforward solution: at step 5 @cilium_host,
we set a specical mark 0x800 (MARK_MAGIC_GOTO_WORLD), then iptables can
exclude this mark using "-m mark ! --mark 0x800/0xf00".

Signed-off-by: gray <gray.liang@isovalent.com>
jschwinger233 added a commit that referenced this issue May 29, 2024
After cilium/proxy#742, proxy traffic keeps
original pod IP as source IP for to-world packets, which must be
masqueraded to eth0 IP. there is no issue for now, but the new
routing rule (0xb00 lookup 2005) to be added for #31984
will cause a side effect breaking masquerading. This patch fixes the
that side effect as a precaution, otherwise git-bisect breaks.

The new routing rule (0xb00 lookup 2005) will cause proxy packets going
through POSTROUTING for twice: first time happens when proxy sends
packets which are routed to cilium_host, these are hitting OUTPUT +
**POSTROUTING**; the second time takes place after packets ingressed
from cilium_net, these skbs will traverse PREROUTING + FORWARD +
**POSTROUTING**.

However, due to kernel's implementation details, an skb won't be
processed by nat POSTROUTING for twice: after the first POSTROUTING
traversal, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status
IPS_SRC_NAT_DONE to skip the further traversal at all.

To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an
iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first
round iptables ct, just for proxy traffic which is characterized by
0xb00 mark.

Signed-off-by: gray <gray.liang@isovalent.com>
jschwinger233 added a commit that referenced this issue May 29, 2024
We have an iptables rule to set 0x200 mark for transparent socket:

```
*mangle
-A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle
-A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff
```

This rule is in the mangle PREROUTING which checks packets ingressed
from a netdev.

Let's then focus on the pod to world traffic when IPsec=on + proxy=on +
tunnel=off.

Currently, a pod-to-world packet will go through the path:
1. @lxc ingress: skb->mark is set to 0x200 and returned to stack
2. @stack iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy
3. @Proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world)
4. @stack routing: the new skb is routed to eth0
5. @stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain
6. @eth0 egress: the new skb is going to world

Please note the new skb won't hit PREROUTING chain, where there is a
rule setting skb->mark=0x200.

To fix #31984, we are going to
change the routing for packets from egress proxy; consequently, on the
step 4 above, the new skb will be routed to cilium_host instead:

4. @stack routing: the new skb is routed to cilium_host
5. @cilium_host egress: the new skb is returned to stack
6. @cilium_net ingress: the new skb is returned to stack
7. @stack routing: the new skb is routed to eth0
8. @stack iptables: the new skb is traversing PREROUTING, FORWARD, POSTROUTING

Look at step 8, we are hitting PREROUTING! Because of
cilium/proxy#742, this to-world skb is also
linked to a transparent socket, matching the "-m socket --transparent"
condition, the packet will fortunately have the 0x200 mark.

If we do nothing, this to-world skb marked with 0x200 will then hit
routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to
local. It should have gone to the world.

This patch fixes this future issue as a precaution (otherwise we'll
break git-bisect).

This patch provides a straightforward solution: at step 5 @cilium_host,
we set a specical mark 0x800 (MARK_MAGIC_GOTO_WORLD), then iptables can
exclude this mark using "-m mark ! --mark 0x800/0xf00".

Signed-off-by: gray <gray.liang@isovalent.com>
jschwinger233 added a commit that referenced this issue May 29, 2024
After cilium/proxy#742, proxy traffic keeps
original pod IP as source IP for to-world packets, which must be
masqueraded to eth0 IP. there is no issue for now, but the new
routing rule (0xb00 lookup 2005) to be added for #31984
will cause a side effect breaking masquerading. This patch fixes the
that side effect as a precaution, otherwise git-bisect breaks.

The new routing rule (0xb00 lookup 2005) will cause proxy packets going
through POSTROUTING for twice: first time happens when proxy sends
packets which are routed to cilium_host, these are hitting OUTPUT +
**POSTROUTING**; the second time takes place after packets ingressed
from cilium_net, these skbs will traverse PREROUTING + FORWARD +
**POSTROUTING**.

However, due to kernel's implementation details, an skb won't be
processed by nat POSTROUTING for twice: after the first POSTROUTING
traversal, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status
IPS_SRC_NAT_DONE to skip the further traversal at all.

To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an
iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first
round iptables ct, just for proxy traffic which is characterized by
0xb00 mark.

Signed-off-by: gray <gray.liang@isovalent.com>
jschwinger233 added a commit that referenced this issue May 29, 2024
We have an iptables rule to set 0x200 mark for transparent socket:

```
*mangle
-A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle
-A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff
```

This rule is in the mangle PREROUTING which checks packets ingressed
from a netdev.

Let's then focus on the pod to world traffic when IPsec=on + proxy=on +
tunnel=off.

Currently, a pod-to-world packet will go through the path:
1. @lxc ingress: skb->mark is set to 0x200 and returned to stack
2. @stack iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy
3. @Proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world)
4. @stack routing: the new skb is routed to eth0
5. @stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain
6. @eth0 egress: the new skb is going to world

Please note the new skb won't hit PREROUTING chain, where there is a
rule setting skb->mark=0x200.

To fix #31984, we are going to
change the routing for packets from egress proxy; consequently, on the
step 4 above, the new skb will be routed to cilium_host instead:

4. @stack routing: the new skb is routed to cilium_host
5. @cilium_host egress: the new skb is returned to stack
6. @cilium_net ingress: the new skb is returned to stack
7. @stack routing: the new skb is routed to eth0
8. @stack iptables: the new skb is traversing PREROUTING, FORWARD, POSTROUTING

Look at step 8, we are hitting PREROUTING! Because of
cilium/proxy#742, this to-world skb is also
linked to a transparent socket, matching the "-m socket --transparent"
condition, the packet will fortunately have the 0x200 mark.

If we do nothing, this to-world skb marked with 0x200 will then hit
routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to
local. It should have gone to the world.

This patch fixes this future issue as a precaution (otherwise we'll
break git-bisect).

This patch provides a straightforward solution: at step 5 @cilium_host,
we set a specical mark 0x800 (MARK_MAGIC_PROXY_TO_WORLD), then iptables can
exclude this mark using "-m mark ! --mark 0x800/0xf00".

Signed-off-by: gray <gray.liang@isovalent.com>
jschwinger233 added a commit that referenced this issue May 29, 2024
After cilium/proxy#742, proxy traffic keeps
original pod IP as source IP for to-world packets, which must be
masqueraded to eth0 IP. there is no issue for now, but the new
routing rule (0xb00 lookup 2005) to be added for #31984
will cause a side effect breaking masquerading. This patch fixes the
that side effect as a precaution, otherwise git-bisect breaks.

The new routing rule (0xb00 lookup 2005) will cause proxy packets going
through POSTROUTING for twice: first time happens when proxy sends
packets which are routed to cilium_host, these are hitting OUTPUT +
**POSTROUTING**; the second time takes place after packets ingressed
from cilium_net, these skbs will traverse PREROUTING + FORWARD +
**POSTROUTING**.

However, due to kernel's implementation details, an skb won't be
processed by nat POSTROUTING for twice: after the first POSTROUTING
traversal, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status
IPS_SRC_NAT_DONE to skip the further traversal at all.

To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an
iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first
round iptables ct, just for proxy traffic which is characterized by
0xb00 mark.

Signed-off-by: gray <gray.liang@isovalent.com>
jschwinger233 added a commit that referenced this issue May 29, 2024
We have an iptables rule to set 0x200 mark for transparent socket:

```
*mangle
-A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle
-A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff
```

This rule is in the mangle PREROUTING which checks packets ingressed
from a netdev.

Let's then focus on the pod to world traffic when IPsec=on + proxy=on +
tunnel=off.

Currently, a pod-to-world packet will go through the path:
1. @lxc ingress: skb->mark is set to 0x200 and returned to stack
2. @stack iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy
3. @Proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world)
4. @stack routing: the new skb is routed to eth0
5. @stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain
6. @eth0 egress: the new skb is going to world

Please note the new skb won't hit PREROUTING chain, where there is a
rule setting skb->mark=0x200.

To fix #31984, we are going to
change the routing for packets from egress proxy; consequently, on the
step 4 above, the new skb will be routed to cilium_host instead:

4. @stack routing: the new skb is routed to cilium_host
5. @cilium_host egress: the new skb is returned to stack
6. @cilium_net ingress: the new skb is returned to stack
7. @stack routing: the new skb is routed to eth0
8. @stack iptables: the new skb is traversing PREROUTING, FORWARD, POSTROUTING

Look at step 8, we are hitting PREROUTING! Because of
cilium/proxy#742, this to-world skb is also
linked to a transparent socket, matching the "-m socket --transparent"
condition, the packet will fortunately have the 0x200 mark.

If we do nothing, this to-world skb marked with 0x200 will then hit
routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to
local. It should have gone to the world.

This patch fixes this future issue as a precaution (otherwise we'll
break git-bisect).

This patch provides a straightforward solution: at step 5 @cilium_host,
we set a specical mark 0x800 (MARK_MAGIC_PROXY_TO_WORLD), then iptables can
exclude this mark using "-m mark ! --mark 0x800/0xf00".

Signed-off-by: gray <gray.liang@isovalent.com>
jschwinger233 added a commit that referenced this issue May 29, 2024
After cilium/proxy#742, proxy traffic keeps
original pod IP as source IP for to-world packets, which must be
masqueraded to eth0 IP. there is no issue for now, but the new
routing rule (0xb00 lookup 2005) to be added for #31984
will cause a side effect breaking masquerading. This patch fixes the
that side effect as a precaution, otherwise git-bisect breaks.

The new routing rule (0xb00 lookup 2005) will cause proxy packets going
through POSTROUTING for twice: first time happens when proxy sends
packets which are routed to cilium_host, these are hitting OUTPUT +
**POSTROUTING**; the second time takes place after packets ingressed
from cilium_net, these skbs will traverse PREROUTING + FORWARD +
**POSTROUTING**.

However, due to kernel's implementation details, an skb won't be
processed by nat POSTROUTING for twice: after the first POSTROUTING
traversal, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status
IPS_SRC_NAT_DONE to skip the further traversal at all.

To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an
iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first
round iptables ct, just for proxy traffic which is characterized by
0xb00 mark.

Signed-off-by: gray <gray.liang@isovalent.com>
jschwinger233 added a commit that referenced this issue May 29, 2024
After cilium/proxy#742, proxy traffic keeps
original pod IP as source IP for to-world packets, which must be
masqueraded to eth0 IP. There is no issue for now, but the new
routing rule (0xb00 lookup 2005) to be added for #31984
will cause a side effect breaking masquerading. This patch fixes the
that side effect as a precaution, otherwise git-bisect breaks.

The new routing rule (0xb00 lookup 2005) will cause proxy packets going
through POSTROUTING for twice: first time happens when proxy sends
packets which are routed to cilium_host, these are hitting OUTPUT +
**POSTROUTING**; the second time takes place after packets ingressed
from cilium_net, these skbs will traverse PREROUTING + FORWARD +
**POSTROUTING**.

However, due to kernel's implementation details, an skb won't be
processed by nat POSTROUTING for twice: after the first POSTROUTING
check, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status
IPS_SRC_NAT_DONE to skip the further traversal at all.

To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an
iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first
round iptables ct, just for proxy traffic which is characterized by
0xb00 mark.

Signed-off-by: gray <gray.liang@isovalent.com>
jschwinger233 added a commit that referenced this issue May 29, 2024
After cilium/proxy#742, proxy traffic keeps
original pod IP as source IP for to-world packets, which must be
masqueraded to eth0 IP. There is no issue for now, but the new
routing rule (0xb00 lookup 2005) to be added for #31984
will cause a side effect breaking masquerading. This patch fixes the
that side effect as a precaution, otherwise git-bisect breaks.

The new routing rule (0xb00 lookup 2005) will cause proxy packets going
through POSTROUTING for twice: first time happens when proxy sends
packets which are routed to cilium_host, these are hitting OUTPUT +
**POSTROUTING**; the second time takes place after packets ingressed
from cilium_net, these skbs will traverse PREROUTING + FORWARD +
**POSTROUTING**.

However, due to kernel's implementation details, an skb won't be
processed by nat POSTROUTING for twice: after the first POSTROUTING
check, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status
IPS_SRC_NAT_DONE to skip the further traversal at all. [1]

To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an
iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first
round iptables ct, just for proxy traffic which is characterized by
0xb00 mark.

[1] https://elixir.bootlin.com/linux/v6.6.2/source/net/netfilter/nf_nat_core.c#L825
[1] https://elixir.bootlin.com/linux/v6.6.2/source/include/net/netfilter/nf_nat.h#L111

Signed-off-by: gray <gray.liang@isovalent.com>
jschwinger233 added a commit that referenced this issue Jun 3, 2024
We have an iptables rule to set 0x200 mark for transparent socket:

```
*mangle
-A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle
-A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff
```

This rule is in the mangle PREROUTING which checks packets ingressed
from a netdev.

Let's then focus on the pod to world traffic when IPsec=on + proxy=on +
tunnel=off.

Currently, a pod-to-world packet will go through the path:
1. @lxc ingress: skb->mark is set to 0x200 and returned to stack
2. @stack iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy
3. @Proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world)
4. @stack routing: the new skb is routed to eth0
5. @stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain
6. @eth0 egress: the new skb is going to world

Please note the new skb won't hit PREROUTING chain, where there is a
rule setting skb->mark=0x200.

To fix #31984, we are going to
change the routing for packets from egress proxy; consequently, on the
step 4 above, the new skb will be routed to cilium_host instead:

4. @stack routing: the new skb is routed to cilium_host
5. @cilium_host egress: the new skb is returned to stack
6. @cilium_net ingress: the new skb is returned to stack
7. @stack routing: the new skb is routed to eth0
8. @stack iptables: the new skb is traversing PREROUTING, FORWARD, POSTROUTING

Look at step 8, we are hitting PREROUTING! Because of
cilium/proxy#742, this to-world skb is also
linked to a transparent socket, matching the "-m socket --transparent"
condition, the packet will fortunately have the 0x200 mark.

If we do nothing, this to-world skb marked with 0x200 will then hit
routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to
local. It should have gone to the world.

This patch fixes this future issue as a precaution (otherwise we'll
break git-bisect).

This patch provides a straightforward solution: at step 5 @cilium_host,
we set a specical mark 0x800 (MARK_MAGIC_PROXY_TO_WORLD), then iptables can
exclude this mark using "-m mark ! --mark 0x800/0xf00".

Signed-off-by: gray <gray.liang@isovalent.com>
jschwinger233 added a commit that referenced this issue Jun 3, 2024
After cilium/proxy#742, proxy traffic keeps
original pod IP as source IP for to-world packets, which must be
masqueraded to eth0 IP. There is no issue for now, but the new
routing rule (0xb00 lookup 2005) to be added for #31984
will cause a side effect breaking masquerading. This patch fixes the
that side effect as a precaution, otherwise git-bisect breaks.

The new routing rule (0xb00 lookup 2005) will cause proxy packets going
through POSTROUTING for twice: first time happens when proxy sends
packets which are routed to cilium_host, these are hitting OUTPUT +
**POSTROUTING**; the second time takes place after packets ingressed
from cilium_net, these skbs will traverse PREROUTING + FORWARD +
**POSTROUTING**.

However, due to kernel's implementation details, an skb won't be
processed by nat POSTROUTING for twice: after the first POSTROUTING
check, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status
IPS_SRC_NAT_DONE to skip the further traversal at all. [1]

To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an
iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first
round iptables ct, just for proxy traffic which is characterized by
0xb00 mark.

[1] https://elixir.bootlin.com/linux/v6.6.2/source/net/netfilter/nf_nat_core.c#L825
[1] https://elixir.bootlin.com/linux/v6.6.2/source/include/net/netfilter/nf_nat.h#L111

Signed-off-by: gray <gray.liang@isovalent.com>
jschwinger233 added a commit that referenced this issue Jun 3, 2024
We have an iptables rule to set 0x200 mark for transparent socket:

```
*mangle
-A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle
-A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff
```

This rule is in the mangle PREROUTING which checks packets ingressed
from a netdev.

Let's then focus on the pod to world traffic when IPsec=on + proxy=on +
tunnel=off.

Currently, a pod-to-world packet will go through the path:
1. from_lxc@lxc: skb->mark is set to 0x200 and returned to stack
2. iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy
3. proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world)
4. stack routing: the new skb is routed to eth0
5. stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain
6. to_netdev@eth0: the new skb is going to world

Please note the new skb won't hit PREROUTING chain, where there is a
rule setting skb->mark=0x200.

To fix #31984, we are going to
change the routing for packets from egress proxy; consequently, on the
step 4 above, the new skb will be routed to cilium_host instead:

4. stack routing: the new skb is routed to cilium_host
5. from_host@cilium_host: the new skb is returned to stack
6. to_host@cilium_net: the new skb is returned to stack
7. stack: PREROUTING, routing, FORWARD, POSTROUTING

Look at step 7, we are hitting PREROUTING! Because of
cilium/proxy#742, this to-world skb is also
linked to a transparent socket, matching the "-m socket --transparent"
condition, the packet will fortunately have the 0x200 mark.

If we do nothing, this to-world skb marked with 0x200 will then hit
routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to
local. It should have gone to the world.

This patch fixes this future issue as a precaution (otherwise we'll
break git-bisect).

This patch provides a straightforward solution: at step 5
from_host@cilium_host, we set a specical mark 0x800
(MARK_MAGIC_PROXY_TO_WORLD), then iptables can exclude this mark using
"-m mark ! --mark 0x800/0xf00".

Signed-off-by: gray <gray.liang@isovalent.com>
jschwinger233 added a commit that referenced this issue Jun 3, 2024
After cilium/proxy#742, proxy traffic keeps
original pod IP as source IP for to-world packets, which must be
masqueraded to eth0 IP. There is no issue for now, but the new
routing rule (0xb00 lookup 2005) to be added for #31984
will cause a side effect breaking masquerading. This patch fixes the
that side effect as a precaution, otherwise git-bisect breaks.

The new routing rule (0xb00 lookup 2005) will cause proxy packets going
through POSTROUTING for twice: first time happens when proxy sends
packets which are routed to cilium_host, these are hitting OUTPUT +
**POSTROUTING**; the second time takes place after packets ingressed
from cilium_net, these skbs will traverse PREROUTING + FORWARD +
**POSTROUTING**.

However, due to kernel's implementation details, an skb won't be
processed by nat POSTROUTING for twice: after the first POSTROUTING
check, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status
IPS_SRC_NAT_DONE to skip the further traversal at all. [1]

To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an
iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first
round iptables ct, just for proxy traffic which is characterized by
0xb00 mark.

[1] https://elixir.bootlin.com/linux/v6.6.2/source/net/netfilter/nf_nat_core.c#L825
[1] https://elixir.bootlin.com/linux/v6.6.2/source/include/net/netfilter/nf_nat.h#L111

Signed-off-by: gray <gray.liang@isovalent.com>
@jschwinger233
Copy link
Member Author

jschwinger233 commented Jun 4, 2024

Because of cilium/proxy#742 and #32331, the backport of #32683 needs extra care.

The above two PRs avoids an issue while bringing about a new issue:

  • issue solved: With to-world packets from proxy having original src IP, bpf masquerading works for them
  • new issue: With to-world packets from proxy linked with a transparent socket, we introduced new mark MARK_MAGIC_PROXY_TO_WORLD to avoid iptables modifying mark

Cilium 1.15- doesn't have these two PRs, so we have to:

  • solve the bpf masquerading issue for to-world packets from proxy
  • drop the patch introducing new mark MARK_MAGIC_PROXY_TO_WORLD

I think I'll just disable bpf masquerading in ci-ipsec-e2e for 1.15-, for now. Then open a separate issue for it.


Background: bpf masquerading doesn't do SNAT for an skb whose src IP is cilium internal IP (cilium_host).


edit: 1.15- doesn't have kpr enabled in ci-ipsec-e2e.

github-merge-queue bot pushed a commit that referenced this issue Jun 5, 2024
We have an iptables rule to set 0x200 mark for transparent socket:

```
*mangle
-A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle
-A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff
```

This rule is in the mangle PREROUTING which checks packets ingressed
from a netdev.

Let's then focus on the pod to world traffic when IPsec=on + proxy=on +
tunnel=off.

Currently, a pod-to-world packet will go through the path:
1. from_lxc@lxc: skb->mark is set to 0x200 and returned to stack
2. iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy
3. proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world)
4. stack routing: the new skb is routed to eth0
5. stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain
6. to_netdev@eth0: the new skb is going to world

Please note the new skb won't hit PREROUTING chain, where there is a
rule setting skb->mark=0x200.

To fix #31984, we are going to
change the routing for packets from egress proxy; consequently, on the
step 4 above, the new skb will be routed to cilium_host instead:

4. stack routing: the new skb is routed to cilium_host
5. from_host@cilium_host: the new skb is returned to stack
6. to_host@cilium_net: the new skb is returned to stack
7. stack: PREROUTING, routing, FORWARD, POSTROUTING

Look at step 7, we are hitting PREROUTING! Because of
cilium/proxy#742, this to-world skb is also
linked to a transparent socket, matching the "-m socket --transparent"
condition, the packet will fortunately have the 0x200 mark.

If we do nothing, this to-world skb marked with 0x200 will then hit
routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to
local. It should have gone to the world.

This patch fixes this future issue as a precaution (otherwise we'll
break git-bisect).

This patch provides a straightforward solution: at step 5
from_host@cilium_host, we set a specical mark 0x800
(MARK_MAGIC_PROXY_TO_WORLD), then iptables can exclude this mark using
"-m mark ! --mark 0x800/0xf00".

Signed-off-by: gray <gray.liang@isovalent.com>
github-merge-queue bot pushed a commit that referenced this issue Jun 5, 2024
After cilium/proxy#742, proxy traffic keeps
original pod IP as source IP for to-world packets, which must be
masqueraded to eth0 IP. There is no issue for now, but the new
routing rule (0xb00 lookup 2005) to be added for #31984
will cause a side effect breaking masquerading. This patch fixes the
that side effect as a precaution, otherwise git-bisect breaks.

The new routing rule (0xb00 lookup 2005) will cause proxy packets going
through POSTROUTING for twice: first time happens when proxy sends
packets which are routed to cilium_host, these are hitting OUTPUT +
**POSTROUTING**; the second time takes place after packets ingressed
from cilium_net, these skbs will traverse PREROUTING + FORWARD +
**POSTROUTING**.

However, due to kernel's implementation details, an skb won't be
processed by nat POSTROUTING for twice: after the first POSTROUTING
check, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status
IPS_SRC_NAT_DONE to skip the further traversal at all. [1]

To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an
iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first
round iptables ct, just for proxy traffic which is characterized by
0xb00 mark.

[1] https://elixir.bootlin.com/linux/v6.6.2/source/net/netfilter/nf_nat_core.c#L825
[1] https://elixir.bootlin.com/linux/v6.6.2/source/include/net/netfilter/nf_nat.h#L111

Signed-off-by: gray <gray.liang@isovalent.com>
jschwinger233 added a commit that referenced this issue Jun 6, 2024
[ upstream commit: 3384d73 ]

After cilium/proxy#742, proxy traffic keeps
original pod IP as source IP for to-world packets, which must be
masqueraded to eth0 IP. There is no issue for now, but the new
routing rule (0xb00 lookup 2005) to be added for #31984
will cause a side effect breaking masquerading. This patch fixes the
that side effect as a precaution, otherwise git-bisect breaks.

The new routing rule (0xb00 lookup 2005) will cause proxy packets going
through POSTROUTING for twice: first time happens when proxy sends
packets which are routed to cilium_host, these are hitting OUTPUT +
**POSTROUTING**; the second time takes place after packets ingressed
from cilium_net, these skbs will traverse PREROUTING + FORWARD +
**POSTROUTING**.

However, due to kernel's implementation details, an skb won't be
processed by nat POSTROUTING for twice: after the first POSTROUTING
check, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status
IPS_SRC_NAT_DONE to skip the further traversal at all. [1]

To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an
iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first
round iptables ct, just for proxy traffic which is characterized by
0xb00 mark.

[1] https://elixir.bootlin.com/linux/v6.6.2/source/net/netfilter/nf_nat_core.c#L825
[1] https://elixir.bootlin.com/linux/v6.6.2/source/include/net/netfilter/nf_nat.h#L111

Signed-off-by: gray <gray.liang@isovalent.com>
jschwinger233 added a commit that referenced this issue Jun 6, 2024
[ upstream commit: f93a40c ]

We have an iptables rule to set 0x200 mark for transparent socket:

```
*mangle
-A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle
-A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff
```

This rule is in the mangle PREROUTING which checks packets ingressed
from a netdev.

Let's then focus on the pod to world traffic when IPsec=on + proxy=on +
tunnel=off.

Currently, a pod-to-world packet will go through the path:
1. from_lxc@lxc: skb->mark is set to 0x200 and returned to stack
2. iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy
3. proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world)
4. stack routing: the new skb is routed to eth0
5. stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain
6. to_netdev@eth0: the new skb is going to world

Please note the new skb won't hit PREROUTING chain, where there is a
rule setting skb->mark=0x200.

To fix #31984, we are going to
change the routing for packets from egress proxy; consequently, on the
step 4 above, the new skb will be routed to cilium_host instead:

4. stack routing: the new skb is routed to cilium_host
5. from_host@cilium_host: the new skb is returned to stack
6. to_host@cilium_net: the new skb is returned to stack
7. stack: PREROUTING, routing, FORWARD, POSTROUTING

Look at step 7, we are hitting PREROUTING! Because of
cilium/proxy#742, this to-world skb is also
linked to a transparent socket, matching the "-m socket --transparent"
condition, the packet will fortunately have the 0x200 mark.

If we do nothing, this to-world skb marked with 0x200 will then hit
routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to
local. It should have gone to the world.

This patch fixes this future issue as a precaution (otherwise we'll
break git-bisect).

This patch provides a straightforward solution: at step 5
from_host@cilium_host, we set a specical mark 0x800
(MARK_MAGIC_PROXY_TO_WORLD), then iptables can exclude this mark using
"-m mark ! --mark 0x800/0xf00".

Signed-off-by: gray <gray.liang@isovalent.com>
jschwinger233 added a commit that referenced this issue Jun 6, 2024
[ upstream commit: 3384d73 ]

After cilium/proxy#742, proxy traffic keeps
original pod IP as source IP for to-world packets, which must be
masqueraded to eth0 IP. There is no issue for now, but the new
routing rule (0xb00 lookup 2005) to be added for #31984
will cause a side effect breaking masquerading. This patch fixes the
that side effect as a precaution, otherwise git-bisect breaks.

The new routing rule (0xb00 lookup 2005) will cause proxy packets going
through POSTROUTING for twice: first time happens when proxy sends
packets which are routed to cilium_host, these are hitting OUTPUT +
**POSTROUTING**; the second time takes place after packets ingressed
from cilium_net, these skbs will traverse PREROUTING + FORWARD +
**POSTROUTING**.

However, due to kernel's implementation details, an skb won't be
processed by nat POSTROUTING for twice: after the first POSTROUTING
check, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status
IPS_SRC_NAT_DONE to skip the further traversal at all. [1]

To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an
iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first
round iptables ct, just for proxy traffic which is characterized by
0xb00 mark.

[1] https://elixir.bootlin.com/linux/v6.6.2/source/net/netfilter/nf_nat_core.c#L825
[1] https://elixir.bootlin.com/linux/v6.6.2/source/include/net/netfilter/nf_nat.h#L111

Signed-off-by: gray <gray.liang@isovalent.com>
jschwinger233 added a commit that referenced this issue Jun 6, 2024
[ upstream commit: f93a40c ]

We have an iptables rule to set 0x200 mark for transparent socket:

```
*mangle
-A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle
-A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff
```

This rule is in the mangle PREROUTING which checks packets ingressed
from a netdev.

Let's then focus on the pod to world traffic when IPsec=on + proxy=on +
tunnel=off.

Currently, a pod-to-world packet will go through the path:
1. from_lxc@lxc: skb->mark is set to 0x200 and returned to stack
2. iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy
3. proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world)
4. stack routing: the new skb is routed to eth0
5. stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain
6. to_netdev@eth0: the new skb is going to world

Please note the new skb won't hit PREROUTING chain, where there is a
rule setting skb->mark=0x200.

To fix #31984, we are going to
change the routing for packets from egress proxy; consequently, on the
step 4 above, the new skb will be routed to cilium_host instead:

4. stack routing: the new skb is routed to cilium_host
5. from_host@cilium_host: the new skb is returned to stack
6. to_host@cilium_net: the new skb is returned to stack
7. stack: PREROUTING, routing, FORWARD, POSTROUTING

Look at step 7, we are hitting PREROUTING! Because of
cilium/proxy#742, this to-world skb is also
linked to a transparent socket, matching the "-m socket --transparent"
condition, the packet will fortunately have the 0x200 mark.

If we do nothing, this to-world skb marked with 0x200 will then hit
routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to
local. It should have gone to the world.

This patch fixes this future issue as a precaution (otherwise we'll
break git-bisect).

This patch provides a straightforward solution: at step 5
from_host@cilium_host, we set a specical mark 0x800
(MARK_MAGIC_PROXY_TO_WORLD), then iptables can exclude this mark using
"-m mark ! --mark 0x800/0xf00".

Signed-off-by: gray <gray.liang@isovalent.com>
jschwinger233 added a commit that referenced this issue Jun 6, 2024
[ upstream commit: 3384d73 ]

After cilium/proxy#742, proxy traffic keeps
original pod IP as source IP for to-world packets, which must be
masqueraded to eth0 IP. There is no issue for now, but the new
routing rule (0xb00 lookup 2005) to be added for #31984
will cause a side effect breaking masquerading. This patch fixes the
that side effect as a precaution, otherwise git-bisect breaks.

The new routing rule (0xb00 lookup 2005) will cause proxy packets going
through POSTROUTING for twice: first time happens when proxy sends
packets which are routed to cilium_host, these are hitting OUTPUT +
**POSTROUTING**; the second time takes place after packets ingressed
from cilium_net, these skbs will traverse PREROUTING + FORWARD +
**POSTROUTING**.

However, due to kernel's implementation details, an skb won't be
processed by nat POSTROUTING for twice: after the first POSTROUTING
check, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status
IPS_SRC_NAT_DONE to skip the further traversal at all. [1]

To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an
iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first
round iptables ct, just for proxy traffic which is characterized by
0xb00 mark.

[1] https://elixir.bootlin.com/linux/v6.6.2/source/net/netfilter/nf_nat_core.c#L825
[1] https://elixir.bootlin.com/linux/v6.6.2/source/include/net/netfilter/nf_nat.h#L111

Signed-off-by: gray <gray.liang@isovalent.com>
jschwinger233 added a commit that referenced this issue Jun 7, 2024
[ upstream commit: f93a40c ]

We have an iptables rule to set 0x200 mark for transparent socket:

```
*mangle
-A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle
-A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff
```

This rule is in the mangle PREROUTING which checks packets ingressed
from a netdev.

Let's then focus on the pod to world traffic when IPsec=on + proxy=on +
tunnel=off.

Currently, a pod-to-world packet will go through the path:
1. from_lxc@lxc: skb->mark is set to 0x200 and returned to stack
2. iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy
3. proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world)
4. stack routing: the new skb is routed to eth0
5. stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain
6. to_netdev@eth0: the new skb is going to world

Please note the new skb won't hit PREROUTING chain, where there is a
rule setting skb->mark=0x200.

To fix #31984, we are going to
change the routing for packets from egress proxy; consequently, on the
step 4 above, the new skb will be routed to cilium_host instead:

4. stack routing: the new skb is routed to cilium_host
5. from_host@cilium_host: the new skb is returned to stack
6. to_host@cilium_net: the new skb is returned to stack
7. stack: PREROUTING, routing, FORWARD, POSTROUTING

Look at step 7, we are hitting PREROUTING! Because of
cilium/proxy#742, this to-world skb is also
linked to a transparent socket, matching the "-m socket --transparent"
condition, the packet will fortunately have the 0x200 mark.

If we do nothing, this to-world skb marked with 0x200 will then hit
routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to
local. It should have gone to the world.

This patch fixes this future issue as a precaution (otherwise we'll
break git-bisect).

This patch provides a straightforward solution: at step 5
from_host@cilium_host, we set a specical mark 0x800
(MARK_MAGIC_PROXY_TO_WORLD), then iptables can exclude this mark using
"-m mark ! --mark 0x800/0xf00".

Signed-off-by: gray <gray.liang@isovalent.com>
jschwinger233 added a commit that referenced this issue Jun 7, 2024
[ upstream commit: 3384d73 ]

After cilium/proxy#742, proxy traffic keeps
original pod IP as source IP for to-world packets, which must be
masqueraded to eth0 IP. There is no issue for now, but the new
routing rule (0xb00 lookup 2005) to be added for #31984
will cause a side effect breaking masquerading. This patch fixes the
that side effect as a precaution, otherwise git-bisect breaks.

The new routing rule (0xb00 lookup 2005) will cause proxy packets going
through POSTROUTING for twice: first time happens when proxy sends
packets which are routed to cilium_host, these are hitting OUTPUT +
**POSTROUTING**; the second time takes place after packets ingressed
from cilium_net, these skbs will traverse PREROUTING + FORWARD +
**POSTROUTING**.

However, due to kernel's implementation details, an skb won't be
processed by nat POSTROUTING for twice: after the first POSTROUTING
check, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status
IPS_SRC_NAT_DONE to skip the further traversal at all. [1]

To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an
iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first
round iptables ct, just for proxy traffic which is characterized by
0xb00 mark.

[1] https://elixir.bootlin.com/linux/v6.6.2/source/net/netfilter/nf_nat_core.c#L825
[1] https://elixir.bootlin.com/linux/v6.6.2/source/include/net/netfilter/nf_nat.h#L111

Signed-off-by: gray <gray.liang@isovalent.com>
jschwinger233 added a commit that referenced this issue Jun 7, 2024
[ upstream commit: f93a40c ]

We have an iptables rule to set 0x200 mark for transparent socket:

```
*mangle
-A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle
-A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff
```

This rule is in the mangle PREROUTING which checks packets ingressed
from a netdev.

Let's then focus on the pod to world traffic when IPsec=on + proxy=on +
tunnel=off.

Currently, a pod-to-world packet will go through the path:
1. from_lxc@lxc: skb->mark is set to 0x200 and returned to stack
2. iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy
3. proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world)
4. stack routing: the new skb is routed to eth0
5. stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain
6. to_netdev@eth0: the new skb is going to world

Please note the new skb won't hit PREROUTING chain, where there is a
rule setting skb->mark=0x200.

To fix #31984, we are going to
change the routing for packets from egress proxy; consequently, on the
step 4 above, the new skb will be routed to cilium_host instead:

4. stack routing: the new skb is routed to cilium_host
5. from_host@cilium_host: the new skb is returned to stack
6. to_host@cilium_net: the new skb is returned to stack
7. stack: PREROUTING, routing, FORWARD, POSTROUTING

Look at step 7, we are hitting PREROUTING! Because of
cilium/proxy#742, this to-world skb is also
linked to a transparent socket, matching the "-m socket --transparent"
condition, the packet will fortunately have the 0x200 mark.

If we do nothing, this to-world skb marked with 0x200 will then hit
routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to
local. It should have gone to the world.

This patch fixes this future issue as a precaution (otherwise we'll
break git-bisect).

This patch provides a straightforward solution: at step 5
from_host@cilium_host, we set a specical mark 0x800
(MARK_MAGIC_PROXY_TO_WORLD), then iptables can exclude this mark using
"-m mark ! --mark 0x800/0xf00".

Signed-off-by: gray <gray.liang@isovalent.com>
jschwinger233 added a commit that referenced this issue Jun 7, 2024
[ upstream commit: 3384d73 ]

After cilium/proxy#742, proxy traffic keeps
original pod IP as source IP for to-world packets, which must be
masqueraded to eth0 IP. There is no issue for now, but the new
routing rule (0xb00 lookup 2005) to be added for #31984
will cause a side effect breaking masquerading. This patch fixes the
that side effect as a precaution, otherwise git-bisect breaks.

The new routing rule (0xb00 lookup 2005) will cause proxy packets going
through POSTROUTING for twice: first time happens when proxy sends
packets which are routed to cilium_host, these are hitting OUTPUT +
**POSTROUTING**; the second time takes place after packets ingressed
from cilium_net, these skbs will traverse PREROUTING + FORWARD +
**POSTROUTING**.

However, due to kernel's implementation details, an skb won't be
processed by nat POSTROUTING for twice: after the first POSTROUTING
check, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status
IPS_SRC_NAT_DONE to skip the further traversal at all. [1]

To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an
iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first
round iptables ct, just for proxy traffic which is characterized by
0xb00 mark.

[1] https://elixir.bootlin.com/linux/v6.6.2/source/net/netfilter/nf_nat_core.c#L825
[1] https://elixir.bootlin.com/linux/v6.6.2/source/include/net/netfilter/nf_nat.h#L111

Signed-off-by: gray <gray.liang@isovalent.com>
qmonnet pushed a commit that referenced this issue Jun 7, 2024
[ upstream commit: f93a40c ]

We have an iptables rule to set 0x200 mark for transparent socket:

```
*mangle
-A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle
-A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff
```

This rule is in the mangle PREROUTING which checks packets ingressed
from a netdev.

Let's then focus on the pod to world traffic when IPsec=on + proxy=on +
tunnel=off.

Currently, a pod-to-world packet will go through the path:
1. from_lxc@lxc: skb->mark is set to 0x200 and returned to stack
2. iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy
3. proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world)
4. stack routing: the new skb is routed to eth0
5. stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain
6. to_netdev@eth0: the new skb is going to world

Please note the new skb won't hit PREROUTING chain, where there is a
rule setting skb->mark=0x200.

To fix #31984, we are going to
change the routing for packets from egress proxy; consequently, on the
step 4 above, the new skb will be routed to cilium_host instead:

4. stack routing: the new skb is routed to cilium_host
5. from_host@cilium_host: the new skb is returned to stack
6. to_host@cilium_net: the new skb is returned to stack
7. stack: PREROUTING, routing, FORWARD, POSTROUTING

Look at step 7, we are hitting PREROUTING! Because of
cilium/proxy#742, this to-world skb is also
linked to a transparent socket, matching the "-m socket --transparent"
condition, the packet will fortunately have the 0x200 mark.

If we do nothing, this to-world skb marked with 0x200 will then hit
routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to
local. It should have gone to the world.

This patch fixes this future issue as a precaution (otherwise we'll
break git-bisect).

This patch provides a straightforward solution: at step 5
from_host@cilium_host, we set a specical mark 0x800
(MARK_MAGIC_PROXY_TO_WORLD), then iptables can exclude this mark using
"-m mark ! --mark 0x800/0xf00".

Signed-off-by: gray <gray.liang@isovalent.com>
qmonnet pushed a commit that referenced this issue Jun 7, 2024
[ upstream commit: 3384d73 ]

After cilium/proxy#742, proxy traffic keeps
original pod IP as source IP for to-world packets, which must be
masqueraded to eth0 IP. There is no issue for now, but the new
routing rule (0xb00 lookup 2005) to be added for #31984
will cause a side effect breaking masquerading. This patch fixes the
that side effect as a precaution, otherwise git-bisect breaks.

The new routing rule (0xb00 lookup 2005) will cause proxy packets going
through POSTROUTING for twice: first time happens when proxy sends
packets which are routed to cilium_host, these are hitting OUTPUT +
**POSTROUTING**; the second time takes place after packets ingressed
from cilium_net, these skbs will traverse PREROUTING + FORWARD +
**POSTROUTING**.

However, due to kernel's implementation details, an skb won't be
processed by nat POSTROUTING for twice: after the first POSTROUTING
check, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status
IPS_SRC_NAT_DONE to skip the further traversal at all. [1]

To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an
iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first
round iptables ct, just for proxy traffic which is characterized by
0xb00 mark.

[1] https://elixir.bootlin.com/linux/v6.6.2/source/net/netfilter/nf_nat_core.c#L825
[1] https://elixir.bootlin.com/linux/v6.6.2/source/include/net/netfilter/nf_nat.h#L111

Signed-off-by: gray <gray.liang@isovalent.com>
jschwinger233 added a commit that referenced this issue Jun 7, 2024
[ upstream commit: f93a40c ]

We have an iptables rule to set 0x200 mark for transparent socket:

```
*mangle
-A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle
-A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff
```

This rule is in the mangle PREROUTING which checks packets ingressed
from a netdev.

Let's then focus on the pod to world traffic when IPsec=on + proxy=on +
tunnel=off.

Currently, a pod-to-world packet will go through the path:
1. from_lxc@lxc: skb->mark is set to 0x200 and returned to stack
2. iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy
3. proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world)
4. stack routing: the new skb is routed to eth0
5. stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain
6. to_netdev@eth0: the new skb is going to world

Please note the new skb won't hit PREROUTING chain, where there is a
rule setting skb->mark=0x200.

To fix #31984, we are going to
change the routing for packets from egress proxy; consequently, on the
step 4 above, the new skb will be routed to cilium_host instead:

4. stack routing: the new skb is routed to cilium_host
5. from_host@cilium_host: the new skb is returned to stack
6. to_host@cilium_net: the new skb is returned to stack
7. stack: PREROUTING, routing, FORWARD, POSTROUTING

Look at step 7, we are hitting PREROUTING! Because of
cilium/proxy#742, this to-world skb is also
linked to a transparent socket, matching the "-m socket --transparent"
condition, the packet will fortunately have the 0x200 mark.

If we do nothing, this to-world skb marked with 0x200 will then hit
routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to
local. It should have gone to the world.

This patch fixes this future issue as a precaution (otherwise we'll
break git-bisect).

This patch provides a straightforward solution: at step 5
from_host@cilium_host, we set a specical mark 0x800
(MARK_MAGIC_PROXY_TO_WORLD), then iptables can exclude this mark using
"-m mark ! --mark 0x800/0xf00".

Signed-off-by: gray <gray.liang@isovalent.com>
jschwinger233 added a commit that referenced this issue Jun 7, 2024
[ upstream commit: 3384d73 ]

After cilium/proxy#742, proxy traffic keeps
original pod IP as source IP for to-world packets, which must be
masqueraded to eth0 IP. There is no issue for now, but the new
routing rule (0xb00 lookup 2005) to be added for #31984
will cause a side effect breaking masquerading. This patch fixes the
that side effect as a precaution, otherwise git-bisect breaks.

The new routing rule (0xb00 lookup 2005) will cause proxy packets going
through POSTROUTING for twice: first time happens when proxy sends
packets which are routed to cilium_host, these are hitting OUTPUT +
**POSTROUTING**; the second time takes place after packets ingressed
from cilium_net, these skbs will traverse PREROUTING + FORWARD +
**POSTROUTING**.

However, due to kernel's implementation details, an skb won't be
processed by nat POSTROUTING for twice: after the first POSTROUTING
check, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status
IPS_SRC_NAT_DONE to skip the further traversal at all. [1]

To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an
iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first
round iptables ct, just for proxy traffic which is characterized by
0xb00 mark.

[1] https://elixir.bootlin.com/linux/v6.6.2/source/net/netfilter/nf_nat_core.c#L825
[1] https://elixir.bootlin.com/linux/v6.6.2/source/include/net/netfilter/nf_nat.h#L111

Signed-off-by: gray <gray.liang@isovalent.com>
julianwiedmann pushed a commit that referenced this issue Jun 7, 2024
[ upstream commit: f93a40c ]

We have an iptables rule to set 0x200 mark for transparent socket:

```
*mangle
-A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle
-A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff
```

This rule is in the mangle PREROUTING which checks packets ingressed
from a netdev.

Let's then focus on the pod to world traffic when IPsec=on + proxy=on +
tunnel=off.

Currently, a pod-to-world packet will go through the path:
1. from_lxc@lxc: skb->mark is set to 0x200 and returned to stack
2. iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy
3. proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world)
4. stack routing: the new skb is routed to eth0
5. stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain
6. to_netdev@eth0: the new skb is going to world

Please note the new skb won't hit PREROUTING chain, where there is a
rule setting skb->mark=0x200.

To fix #31984, we are going to
change the routing for packets from egress proxy; consequently, on the
step 4 above, the new skb will be routed to cilium_host instead:

4. stack routing: the new skb is routed to cilium_host
5. from_host@cilium_host: the new skb is returned to stack
6. to_host@cilium_net: the new skb is returned to stack
7. stack: PREROUTING, routing, FORWARD, POSTROUTING

Look at step 7, we are hitting PREROUTING! Because of
cilium/proxy#742, this to-world skb is also
linked to a transparent socket, matching the "-m socket --transparent"
condition, the packet will fortunately have the 0x200 mark.

If we do nothing, this to-world skb marked with 0x200 will then hit
routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to
local. It should have gone to the world.

This patch fixes this future issue as a precaution (otherwise we'll
break git-bisect).

This patch provides a straightforward solution: at step 5
from_host@cilium_host, we set a specical mark 0x800
(MARK_MAGIC_PROXY_TO_WORLD), then iptables can exclude this mark using
"-m mark ! --mark 0x800/0xf00".

Signed-off-by: gray <gray.liang@isovalent.com>
julianwiedmann pushed a commit that referenced this issue Jun 7, 2024
[ upstream commit: 3384d73 ]

After cilium/proxy#742, proxy traffic keeps
original pod IP as source IP for to-world packets, which must be
masqueraded to eth0 IP. There is no issue for now, but the new
routing rule (0xb00 lookup 2005) to be added for #31984
will cause a side effect breaking masquerading. This patch fixes the
that side effect as a precaution, otherwise git-bisect breaks.

The new routing rule (0xb00 lookup 2005) will cause proxy packets going
through POSTROUTING for twice: first time happens when proxy sends
packets which are routed to cilium_host, these are hitting OUTPUT +
**POSTROUTING**; the second time takes place after packets ingressed
from cilium_net, these skbs will traverse PREROUTING + FORWARD +
**POSTROUTING**.

However, due to kernel's implementation details, an skb won't be
processed by nat POSTROUTING for twice: after the first POSTROUTING
check, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status
IPS_SRC_NAT_DONE to skip the further traversal at all. [1]

To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an
iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first
round iptables ct, just for proxy traffic which is characterized by
0xb00 mark.

[1] https://elixir.bootlin.com/linux/v6.6.2/source/net/netfilter/nf_nat_core.c#L825
[1] https://elixir.bootlin.com/linux/v6.6.2/source/include/net/netfilter/nf_nat.h#L111

Signed-off-by: gray <gray.liang@isovalent.com>
dylandreimerink pushed a commit that referenced this issue Jun 11, 2024
[ upstream commit: f93a40c ]

We have an iptables rule to set 0x200 mark for transparent socket:

```
*mangle
-A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle
-A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff
```

This rule is in the mangle PREROUTING which checks packets ingressed
from a netdev.

Let's then focus on the pod to world traffic when IPsec=on + proxy=on +
tunnel=off.

Currently, a pod-to-world packet will go through the path:
1. from_lxc@lxc: skb->mark is set to 0x200 and returned to stack
2. iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy
3. proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world)
4. stack routing: the new skb is routed to eth0
5. stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain
6. to_netdev@eth0: the new skb is going to world

Please note the new skb won't hit PREROUTING chain, where there is a
rule setting skb->mark=0x200.

To fix #31984, we are going to
change the routing for packets from egress proxy; consequently, on the
step 4 above, the new skb will be routed to cilium_host instead:

4. stack routing: the new skb is routed to cilium_host
5. from_host@cilium_host: the new skb is returned to stack
6. to_host@cilium_net: the new skb is returned to stack
7. stack: PREROUTING, routing, FORWARD, POSTROUTING

Look at step 7, we are hitting PREROUTING! Because of
cilium/proxy#742, this to-world skb is also
linked to a transparent socket, matching the "-m socket --transparent"
condition, the packet will fortunately have the 0x200 mark.

If we do nothing, this to-world skb marked with 0x200 will then hit
routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to
local. It should have gone to the world.

This patch fixes this future issue as a precaution (otherwise we'll
break git-bisect).

This patch provides a straightforward solution: at step 5
from_host@cilium_host, we set a specical mark 0x800
(MARK_MAGIC_PROXY_TO_WORLD), then iptables can exclude this mark using
"-m mark ! --mark 0x800/0xf00".

Signed-off-by: gray <gray.liang@isovalent.com>
dylandreimerink pushed a commit that referenced this issue Jun 11, 2024
[ upstream commit: 3384d73 ]

After cilium/proxy#742, proxy traffic keeps
original pod IP as source IP for to-world packets, which must be
masqueraded to eth0 IP. There is no issue for now, but the new
routing rule (0xb00 lookup 2005) to be added for #31984
will cause a side effect breaking masquerading. This patch fixes the
that side effect as a precaution, otherwise git-bisect breaks.

The new routing rule (0xb00 lookup 2005) will cause proxy packets going
through POSTROUTING for twice: first time happens when proxy sends
packets which are routed to cilium_host, these are hitting OUTPUT +
**POSTROUTING**; the second time takes place after packets ingressed
from cilium_net, these skbs will traverse PREROUTING + FORWARD +
**POSTROUTING**.

However, due to kernel's implementation details, an skb won't be
processed by nat POSTROUTING for twice: after the first POSTROUTING
check, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status
IPS_SRC_NAT_DONE to skip the further traversal at all. [1]

To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an
iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first
round iptables ct, just for proxy traffic which is characterized by
0xb00 mark.

[1] https://elixir.bootlin.com/linux/v6.6.2/source/net/netfilter/nf_nat_core.c#L825
[1] https://elixir.bootlin.com/linux/v6.6.2/source/include/net/netfilter/nf_nat.h#L111

Signed-off-by: gray <gray.liang@isovalent.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/encryption Impacts encryption support such as IPSec, WireGuard, or kTLS. area/proxy Impacts proxy components, including DNS, Kafka, Envoy and/or XDS servers. feature/ipsec Relates to Cilium's IPsec feature kind/bug This is a bug in the Cilium logic. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
Projects
None yet
1 participant