Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix L7 ingress proxy iptables redirects in direct routing mode #8864

Conversation

joestringer
Copy link
Member

@joestringer joestringer commented Aug 9, 2019

Fix an issue where packets after the first inbound packet for a local pod with ingress proxy are dropped in the Linux forwarding logic:

In direct routing mode, when multiple nodes are involved in an HTTP
request and there is an ingress L7 policy applied at the destination, the
ACK for a TCP handshake inbound would previously be dropped early
in the forwarding stack, after the iptables PREROUTING tables:

CILIUM_PRE_raw:     IN=eth1 OUT= ... SYN
CILIUM_PRE_mangle:  IN=eth1 OUT= ... SYN
CILIUM_PRE_nat:     IN=eth1 OUT= ... SYN
CILIUM_FORWARD:     IN=eth1 OUT=lxc61b3fb1d7642 ... SYN
CILIUM_POST_mangle: IN= OUT=lxc61b3fb1d7642 ... SYN
CILIUM_POST_nat:    IN= OUT=lxc61b3fb1d7642 ... SYN
CILIUM_PRE_raw:     IN=cilium_host OUT= ... SYN
CILIUM_PRE_mangle:  IN=cilium_host OUT= ... SYN
INPUT:              IN=cilium_host OUT= ... SYN
CILIUM_OUTPUT_raw:  IN= OUT=cilium_host ... SYN ACK
OUTPUT:             IN= OUT=cilium_host ... SYN ACK
CILIUM_POST_mangle: IN= OUT=cilium_host ... SYN ACK
CILIUM_PRE_raw:     IN=cilium_net OUT= ... SYN ACK
CILIUM_PRE_mangle:  IN=cilium_net OUT= ... SYN ACK
CILIUM_FORWARD:     IN=cilium_net OUT=eth1 ... SYN ACK
CILIUM_POST_mangle: IN= OUT=eth1 ... SYN ACK
CILIUM_PRE_raw:     IN=eth1 OUT= ... ACK
CILIUM_PRE_mangle:  IN=eth1 OUT= ... ACK

(No subsequent evidence of iptables hits for the ACK)

Further investigation by Martynas and Daniel revealed that the traffic
was being dropped in the Linux kernel ip_forward() function, in the case
where the socket is assigned to the skb. If one disables early demux
(`sysctl -w net.ipv4.ip_early_demux=0`), then this behaviour is not
observed. Previous discussion which introduced this check can be seen on
the kernel lists [0], [1]. This drop case was introduced to highlight
cases where policy routing is not being properly applied, to drop
traffic early rather than potentially misrouting it within the stack.

One mitigation that was observed is that if the commit 0f7617569171
("Revert "cilium: fix up source address selection for cluster ip"") is
reverted, then SNAT will be applied to traffic that is being forwarded
towards the proxy after initial BPF processing in context of the
destination endpoint (during the CILIUM_POST_nat in the list above). As
a result, the tuple associated with the socket accepted by the proxy
would be distinct from the tuple for any incoming traffic from the eth1
device. Unfortunately, that rule relies upon a destination pod prefix
which cannot be relied upon in some datapath modes such as with
--enable-endpoint-routes (eg, with ENI integration).

Jarno noted that the upstream kernel TPROXY documentation[2] also
contains an iptables rule in the PREROUTING table which diverts traffic
that matches a local socket to a separate chain where a mark is applied
for policy routing, and the traffic is then accepted. We could
previously get away without having such a rule in most circumstances
because typically Cilium in tunnel mode handles all traffic within BPF
programs, tail-calling betweeen programs and with minimal involvement of
the rest of the stack. In other words, the early demux was being skipped
if traffic arrived to the node on a tunnel and hence the ACK would not
be dropped in a similar case in tunnel mode; we could forward through
BPF programs then apply the mark for policy routing in BPF, then pass
traffic to the stack which would then be forwarded correctly. However
for direct routing mode, we rely upon the stack's routing to forward
traffic so in this case the socket match rule is not optional.

Regarding the cost of this new rule, we realise that since the traffic
will be subject to early demux socket lookup anyway at this point in the
stack, there should be minimal additional cost associated with the new
socket lookup in the rule, since the stack would look up the socket
anyway.

[0] https://patchwork.ozlabs.org/patch/459585/
[1] https://patchwork.ozlabs.org/patch/460836/
[2] https://www.kernel.org/doc/Documentation/networking/tproxy.txt

Related: #8813

Follow-up: #8930


This change is Reviewable

@joestringer joestringer added pending-review sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. labels Aug 9, 2019
@joestringer joestringer requested review from jrajahalme and a team August 9, 2019 22:28
@joestringer
Copy link
Member Author

test-me-please

@coveralls
Copy link

coveralls commented Aug 9, 2019

Coverage Status

Coverage decreased (-0.009%) to 44.151% when pulling b833876 on joestringer:submit/l7-proxy-ingress-endpoint-routes into 9c36f26 on cilium:master.

bpf/init.sh Outdated Show resolved Hide resolved
@joestringer joestringer force-pushed the submit/l7-proxy-ingress-endpoint-routes branch from d1924c5 to fd3fb65 Compare August 13, 2019 18:39
@joestringer joestringer changed the title Fix local pod-pod ingress L7 proxying with --enable-endpoint-routes mode Fix L7 ingress proxy in direct routing mode Aug 13, 2019
@joestringer
Copy link
Member Author

joestringer commented Aug 13, 2019

test-me-please

EDIT: The build never kicked off on Jenkins. Will retry.

@joestringer
Copy link
Member Author

joestringer commented Aug 13, 2019

test-me-please

EDIT: Jenkins ignored me again. Retrying again.

@joestringer
Copy link
Member Author

joestringer commented Aug 13, 2019

test-me-please

EDIT: Ran out of space on jenkins node:

https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated/14302/execution/node/7/log/

@joestringer
Copy link
Member Author

joestringer commented Aug 13, 2019

test-me-please

EDIT: Same problem, same node jenkins-fixed-new-0.

@joestringer
Copy link
Member Author

joestringer commented Aug 13, 2019

test-me-please

EDIT: The only failure is in the mode that this PR is trying to fix..

https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated/14308/execution/node/151/log/?consoleFull

18:06:38  Summarizing 1 Failure:
18:06:38  
18:06:38  [Fail] K8sDatapathConfig PerEndpointRoute [It] Check connectivity with per endpoint routes 
18:06:38  /home/jenkins/workspace/Cilium-PR-Ginkgo-Tests-Validated/k8s-1.15-gopath/src/github.com/cilium/cilium/test/k8sT/assertionHelpers.go:121
18:06:38  
18:06:38  Ran 52 of 365 Specs in 2918.250 seconds
18:06:38  FAIL! -- 51 Passed | 1 Failed | 0 Pending | 313 Skipped
18:06:38  --- FAIL: TestTest (2918.25s)
18:06:38  FAIL
18:06:38  
18:06:38  Ginkgo ran 1 suite in 48m45.638190146s
18:06:38  Test Suite Failed

However, it's not in the actual test, but in the bootstrapping / health endpoints setup:

17:55:12  ===================== TEST FAILED =====================
17:55:13  Cilium is not ready yet: connectivity health is failing: Cluster connectivity is unhealthy on 'cilium-5cvwn': Exitcode: 255 

I wonder if the per-endpoint routes are correctly configured for the health endpoints.

@tgraf
Copy link
Member

tgraf commented Aug 14, 2019

test-me-please

@joestringer joestringer force-pushed the submit/l7-proxy-ingress-endpoint-routes branch from fd3fb65 to 0ae4ee9 Compare August 14, 2019 18:39
@joestringer joestringer requested a review from a team August 14, 2019 18:39
@joestringer
Copy link
Member Author

New patch to fix cilium-health connectivity.

@joestringer
Copy link
Member Author

joestringer commented Aug 14, 2019

test-me-please

EDIT: Hit infra issue:

13:41:22  Cannot contact jenkins-node-3: java.lang.InterruptedException

@ianvernon ianvernon removed this from To Merge To Master in 1.6.0-rc6 Aug 16, 2019
joestringer added a commit to joestringer/cilium that referenced this pull request Aug 20, 2019
Refactor this into one place to make it easier and tidier to wrap
creation of these rules. For more detail on why they are necessary, see
cilium#8864.

Signed-off-by: Joe Stringer <joe@cilium.io>
tgraf pushed a commit that referenced this pull request Aug 21, 2019
Refactor this into one place to make it easier and tidier to wrap
creation of these rules. For more detail on why they are necessary, see
#8864.

Signed-off-by: Joe Stringer <joe@cilium.io>
jrajahalme pushed a commit that referenced this pull request Aug 21, 2019
[ upstream commit 4c3f84c ]

Refactor this into one place to make it easier and tidier to wrap
creation of these rules. For more detail on why they are necessary, see
#8864.

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Jarno Rajahalme <jarno@covalent.io>
ianvernon pushed a commit that referenced this pull request Aug 22, 2019
[ upstream commit 4c3f84c ]

Refactor this into one place to make it easier and tidier to wrap
creation of these rules. For more detail on why they are necessary, see
#8864.

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Jarno Rajahalme <jarno@covalent.io>
jrajahalme added a commit that referenced this pull request Jun 23, 2020
'--no-wildcard' allows the socket match to find zero-bound (listening)
sockets, which we do not want, as this may intercept (reply) traffic
intended for other nodes when an ephemeral source port number
allocated in one node happens to be the same as the allocated proxy
port number in 'this' node (the node doing the iptables socket match
changed here).

Fixes: #12241
Related: #8864
Signed-off-by: Jarno Rajahalme <jarno@covalent.io>
jrajahalme added a commit that referenced this pull request Jun 25, 2020
'--no-wildcard' allows the socket match to find zero-bound (listening)
sockets, which we do not want, as this may intercept (reply) traffic
intended for other nodes when an ephemeral source port number
allocated in one node happens to be the same as the allocated proxy
port number in 'this' node (the node doing the iptables socket match
changed here).

Fixes: #12241
Related: #8864
Signed-off-by: Jarno Rajahalme <jarno@covalent.io>
jrajahalme added a commit that referenced this pull request Jun 29, 2020
'--no-wildcard' allows the socket match to find zero-bound (listening)
sockets, which we do not want, as this may intercept (reply) traffic
intended for other nodes when an ephemeral source port number
allocated in one node happens to be the same as the allocated proxy
port number in 'this' node (the node doing the iptables socket match
changed here).

Fixes: #12241
Related: #8864
Signed-off-by: Jarno Rajahalme <jarno@covalent.io>
joestringer pushed a commit that referenced this pull request Jun 29, 2020
'--no-wildcard' allows the socket match to find zero-bound (listening)
sockets, which we do not want, as this may intercept (reply) traffic
intended for other nodes when an ephemeral source port number
allocated in one node happens to be the same as the allocated proxy
port number in 'this' node (the node doing the iptables socket match
changed here).

Fixes: #12241
Related: #8864
Signed-off-by: Jarno Rajahalme <jarno@covalent.io>
christarazi pushed a commit that referenced this pull request Jun 30, 2020
[ upstream commit ca767ee ]

'--no-wildcard' allows the socket match to find zero-bound (listening)
sockets, which we do not want, as this may intercept (reply) traffic
intended for other nodes when an ephemeral source port number
allocated in one node happens to be the same as the allocated proxy
port number in 'this' node (the node doing the iptables socket match
changed here).

Fixes: #12241
Related: #8864
Signed-off-by: Jarno Rajahalme <jarno@covalent.io>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
joestringer pushed a commit that referenced this pull request Jun 30, 2020
[ upstream commit ca767ee ]

'--no-wildcard' allows the socket match to find zero-bound (listening)
sockets, which we do not want, as this may intercept (reply) traffic
intended for other nodes when an ephemeral source port number
allocated in one node happens to be the same as the allocated proxy
port number in 'this' node (the node doing the iptables socket match
changed here).

Fixes: #12241
Related: #8864
Signed-off-by: Jarno Rajahalme <jarno@covalent.io>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
christarazi pushed a commit that referenced this pull request Jun 30, 2020
[ upstream commit ca767ee ]

'--no-wildcard' allows the socket match to find zero-bound (listening)
sockets, which we do not want, as this may intercept (reply) traffic
intended for other nodes when an ephemeral source port number
allocated in one node happens to be the same as the allocated proxy
port number in 'this' node (the node doing the iptables socket match
changed here).

Fixes: #12241
Related: #8864
Signed-off-by: Jarno Rajahalme <jarno@covalent.io>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
joestringer pushed a commit that referenced this pull request Jun 30, 2020
[ upstream commit ca767ee ]

'--no-wildcard' allows the socket match to find zero-bound (listening)
sockets, which we do not want, as this may intercept (reply) traffic
intended for other nodes when an ephemeral source port number
allocated in one node happens to be the same as the allocated proxy
port number in 'this' node (the node doing the iptables socket match
changed here).

Fixes: #12241
Related: #8864
Signed-off-by: Jarno Rajahalme <jarno@covalent.io>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
jrajahalme added a commit that referenced this pull request Jul 1, 2020
[ upstream commit ca767ee ]

'--no-wildcard' allows the socket match to find zero-bound (listening)
sockets, which we do not want, as this may intercept (reply) traffic
intended for other nodes when an ephemeral source port number
allocated in one node happens to be the same as the allocated proxy
port number in 'this' node (the node doing the iptables socket match
changed here).

Fixes: #12241
Related: #8864
Signed-off-by: Jarno Rajahalme <jarno@covalent.io>
joestringer pushed a commit that referenced this pull request Jul 1, 2020
[ upstream commit ca767ee ]

'--no-wildcard' allows the socket match to find zero-bound (listening)
sockets, which we do not want, as this may intercept (reply) traffic
intended for other nodes when an ephemeral source port number
allocated in one node happens to be the same as the allocated proxy
port number in 'this' node (the node doing the iptables socket match
changed here).

Fixes: #12241
Related: #8864
Signed-off-by: Jarno Rajahalme <jarno@covalent.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants