Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NodePort BPF with L7 netpol is broken for N/S when request is sent to node running SVC backend #21955

Closed
brb opened this issue Nov 1, 2022 · 4 comments · Fixed by #22756
Closed
Assignees
Labels
kind/bug This is a bug in the Cilium logic. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. sig/policy Impacts whether traffic is allowed or denied based on user-defined policies.

Comments

@brb
Copy link
Member

brb commented Nov 1, 2022

First issue is #21954. The second issue is that the L7 proxy will do the TCP handshake on behalf of the pod meaning that all replies to an outside client won't be rev-DNAT-ed (bpf_lxc is doing rev-DNAT for NodePort BPF). To fix this we should move the rev-DNAT to bpf_host @ eth0, as it was suggested by #17504.

The bonus issue is DSR when an intermediate LB node is hit first. In this case, the IP opt/ext is dropped by the proxy. The fix is to handle the opt / ext in bpf_host (same as rev-DNAT).

@brb brb added kind/bug This is a bug in the Cilium logic. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. sig/policy Impacts whether traffic is allowed or denied based on user-defined policies. labels Nov 1, 2022
@aditighag
Copy link
Member

bpf_lxc is doing rev-DNAT for NodePort BPF

Based on the traces in my environment, I'm not sure rev xlate is being done by bpf_lxc. We tail call into CILIUM_CALL_IPV4_NODEPORT_REVNAT since the endpoint is local - https://github.com/aditighag/cilium/blob/3fdb434a71e14236edabeaad7dce4768278db513/bpf/bpf_lxc.c#L927-L927.

Here are the traces that confirm that rev xlate happens in to-network -

<- network flow 0x0 , identity unknown->unknown state unknown ifindex eth0 orig-ip 0.0.0.0: 73.158.81.201:50214 -> 10.168.0.58:30619 tcp SYN
<- stack flow 0x0 , identity world->unknown state unknown ifindex eth0 orig-ip 0.0.0.0: 73.158.81.201:50214 -> 10.36.1.10:80 tcp SYN
-> endpoint 511 flow 0x0 , identity world->13879 state new ifindex 0 orig-ip 73.158.81.201: 73.158.81.201:50214 -> 10.36.1.10:80 tcp SYN
<- endpoint 511 flow 0x815468e7 , identity 13879->unknown state unknown ifindex 0 orig-ip 0.0.0.0: 10.36.1.10:80 -> 73.158.81.201:50214 tcp SYN, ACK
-> network flow 0x815468e7 , identity 13879->world state reply ifindex 0 orig-ip 0.0.0.0: 10.36.1.10:80 -> 73.158.81.201:50214 tcp SYN, ACK
-> network flow 0x815468e7 , identity unknown->unknown state unknown ifindex 0 orig-ip 0.0.0.0: 10.168.0.58:30619 -> 73.158.81.201:50214 tcp SYN, ACK

We are likely missing the same tail call for the reply traffic when proxy is involved.

@brb
Copy link
Member Author

brb commented Nov 1, 2022

bpf_lxc is doing rev-DNAT for NodePort BPF

bpf_lxc is doing the rev-DNAT via the tailcall which you linked in your comment. The problem with the L7 proxy reply path is that it doesn't involve bpf_lxc, so nobody can do the rev-DNAT.

@pchaigno
Copy link
Member

pchaigno commented Nov 1, 2022

To fix this we should move the rev-DNAT to bpf_host @ eth0, as it was suggested by #17504.

I think that would be a good thing in any case:

  1. By doing this, we ensure the Linux stack sees consistent IPs & ports for forward and reply traffic. It's less likely to drop packets because the connection looks invalid (e.g., forward is DNATed but not reply).
  2. By moving DSR to bpf_host, it will also make it easier to pipe the DSR information from e.g. a tunnel (IPIP or GENEVE) that we decapsulate at the native device to the DSR logic. Currently, with DSR logic at bpf_lxc, we need a special way to pass that information :(

cc @julianwiedmann

@brb
Copy link
Member Author

brb commented Nov 1, 2022

I think that would be a good thing in any case:

One more:

  1. Fixes N/S for service endpoints running in a host netns.

julianwiedmann added a commit to julianwiedmann/cilium that referenced this issue Jan 18, 2023
Replies by local service backends currently get their revDNAT processing in
from-container. There's two problems with that:
1. we have cases where the packet doesn't reach this part of the
   from-container program (ie. redirect to host proxy), and
2. we're not handling replies by hostns backends at all.

So add an additional check for RevDNAT in handle_nat_fwd() at to-netdev,
before such an untranslated reply leaves the node. The same is needed for
to-overlay, see
3a83623 ("bpf: add support for local NodePort via tunnel").

Fixes: cilium#22659
Fixes: cilium#22838
Fixes: cilium#21955
Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
aanm pushed a commit that referenced this issue Jan 23, 2023
[ upstream commit 54a8631 ]

Replies by local service backends currently get their revDNAT processing in
from-container. There's two problems with that:
1. we have cases where the packet doesn't reach this part of the
   from-container program (ie. redirect to host proxy), and
2. we're not handling replies by hostns backends at all.

So add an additional check for RevDNAT in handle_nat_fwd() at to-netdev,
before such an untranslated reply leaves the node. The same is needed for
to-overlay, see
3a83623 ("bpf: add support for local NodePort via tunnel").

Fixes: #22659
Fixes: #22838
Fixes: #21955
Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: André Martins <andre@cilium.io>
aanm pushed a commit that referenced this issue Jan 23, 2023
[ upstream commit 54a8631 ]

Replies by local service backends currently get their revDNAT processing in
from-container. There's two problems with that:
1. we have cases where the packet doesn't reach this part of the
   from-container program (ie. redirect to host proxy), and
2. we're not handling replies by hostns backends at all.

So add an additional check for RevDNAT in handle_nat_fwd() at to-netdev,
before such an untranslated reply leaves the node. The same is needed for
to-overlay, see
3a83623 ("bpf: add support for local NodePort via tunnel").

Fixes: #22659
Fixes: #22838
Fixes: #21955
Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: André Martins <andre@cilium.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug This is a bug in the Cilium logic. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. sig/policy Impacts whether traffic is allowed or denied based on user-defined policies.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants