
Adds an ACCEPT rule for untracked pkts in filter:CILIUM_OUTPUT #17585

Merged
merged 1 commit into cilium:master on Nov 11, 2021

Conversation

Weil0ng
Contributor

@Weil0ng Weil0ng commented Oct 12, 2021

Currently, when no-track-port is specified for a pod (the only use
case for now is nodelocaldns), we insert several iptables rules to skip
conntrack for packets to and from the pod, to achieve parity with OSS
node-local-dns.

However, we need to add a specific ACCEPT rule in the filter:CILIUM_OUTPUT
chain to accept such packets. Otherwise, a DNS query pkt originating from
the hostns will skip conntrack and get dropped in the filter OUTPUT
chain. This rule is NOT needed for standard OSS node-local-dns, however,
because that relies on the loopback rule installed by the OS to allowlist
this traffic pattern. With Cilium, we DNAT such packets so that their
dst is the pod IP of the local node-cache pod; they will NOT hit the
loopback dev, hence we need to punch a specific hole to allowlist them.
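As a sketch, the pair of rules looks roughly like this (the port and the matches are illustrative; the exact rules Cilium installs differ):

# pre-existing: skip conntrack for hostns-originated queries to the no-track port
iptables -t raw -A CILIUM_OUTPUT -p udp --dport 53 -j NOTRACK
# added by this PR: explicitly accept those untracked pkts, which otherwise
# fail the stateful ACCEPT rule in the filter OUTPUT chain
iptables -t filter -A CILIUM_OUTPUT -p udp --dport 53 -j ACCEPT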

Fixes: #16694

Signed-off-by: Weilong Cui <cuiwl@google.com>

@Weil0ng Weil0ng added sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. release-note/bug This PR fixes an issue in a previous release of Cilium. labels Oct 12, 2021
@Weil0ng Weil0ng requested review from aditighag and a team October 12, 2021 19:54
@Weil0ng Weil0ng requested a review from jrfastab October 12, 2021 19:54
@aditighag
Member

Thanks for tracking this down. Any idea why we need these additional rules only for packets originating from hostns pods destined to the node-local dns pod? Traffic from regular pods destined to port 53 should also hit the NO-TRACK rules, no?

I was under the impression that intra-node pod-to-pod traffic is redirected by BPF code to the destination device only when endpoint routes are disabled, in which case we wouldn't hit the issue you described. OTOH, when endpoint routes are enabled (the default on GKE), traffic is handed over to the network stack for routing.

@Weil0ng
Contributor Author

Weil0ng commented Oct 13, 2021

I was under the impression that intra-node pod-to-pod traffic is redirected by BPF code to the destination device only when endpoint routes are disabled, in which case we wouldn't hit the issue you described. OTOH, when endpoint routes are enabled (the default on GKE), traffic is handed over to the network stack for routing.

Exactly. For non-hostns pods, the pkt never hits raw:OUTPUT because it takes the PREROUTING -> FORWARD -> POSTROUTING path, so it never hits the NOTRACK rule on the "egress" path (from the originating pod's perspective).
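For reference, the standard netfilter traversal for the two cases is roughly:

# locally generated (hostns) pkt:
#   raw:OUTPUT -> mangle:OUTPUT -> nat:OUTPUT -> filter:OUTPUT -> POSTROUTING
# forwarded pkt (e.g. arriving from a pod veth):
#   raw:PREROUTING -> mangle:PREROUTING -> nat:PREROUTING -> filter:FORWARD -> POSTROUTING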

@pchaigno
Member

Otherwise, a DNS query pkt originating from the hostns will skip conntrack and get dropped in the filter OUTPUT chain.

Why does it get dropped if conntrack is skipped? Is there a specific rule that drops it?

The change is trivial but the explanation is a bit hard to follow right now. Any chance you could illustrate with iptables rules?

@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.10.5 Oct 13, 2021
@Weil0ng
Contributor Author

Weil0ng commented Oct 13, 2021

Why does it get dropped if conntrack is skipped? Is there a specific rule that drops it?

The change is trivial but the explanation is a bit hard to follow right now. Any chance you could illustrate with iptables rules?

The detailed analysis is here: #16694 (comment) :) There I listed the iptables rules and why the pkt was dropped.

@aditighag aditighag added the area/lrp Impacts Local Redirect Policy. label Oct 13, 2021
@aditighag
Member

aditighag commented Oct 13, 2021

I was under the impression that intra-node pod-to-pod traffic is redirected by BPF code to the destination device only when endpoint routes are disabled, in which case we wouldn't hit the issue you described. OTOH, when endpoint routes are enabled (the default on GKE), traffic is handed over to the network stack for routing.

Exactly. For non-hostns pods, the pkt never hits raw:OUTPUT because it takes the PREROUTING -> FORWARD -> POSTROUTING path, so it never hits the NOTRACK rule on the "egress" path (from the originating pod's perspective).

I'm confused by this. The reason the NO_TRACK rule was added for traffic going to node-local dns pods is that we wanted to skip tracking DNS traffic, no?
Here is your comment from the original issue - #13686 (comment).

These rules will be hit when we are running in endpoint routing mode, where the pkts will be handed to the stack after exiting the pod namespace.

Am I missing something?

@joestringer joestringer added this to Needs backport from master in 1.10.6 Oct 13, 2021
@joestringer joestringer removed this from Needs backport from master in 1.10.5 Oct 13, 2021
@Weil0ng
Contributor Author

Weil0ng commented Oct 14, 2021

I'm confused by this. The reason the NO_TRACK rule was added for traffic going to node-local dns pods is that we wanted to skip tracking DNS traffic, no?
Here is your comment from the original issue - #13686 (comment).

These rules will be hit when we are running in endpoint routing mode, where the pkts will be handed to the stack after exiting the pod namespace.

There are 3 separate sets of NOTRACK rules.

The issue here is that for the first 2 sets we added ACCEPT rules to allowlist them, but for the last one we are missing that ACCEPT rule.
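A sketch inferred from the surrounding discussion (not the literal rule set):

# 1) raw PREROUTING: NOTRACK for pkts to the pod's no-track port   -> paired ACCEPT exists
# 2) raw PREROUTING: NOTRACK for reply pkts from that port         -> paired ACCEPT exists
# 3) raw CILIUM_OUTPUT: NOTRACK for hostns-originated pkts         -> ACCEPT missing; this PR adds it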

@Weil0ng
Contributor Author

Weil0ng commented Oct 15, 2021

Friendly ping :)

@aditighag
Member

aditighag commented Oct 15, 2021

Thanks for the context.

For non-hostns pods, the pkt never hits raw:OUTPUT because it takes the PREROUTING -> FORWARD -> POSTROUTING path, so it never hits the NOTRACK rule on the "egress" path (from the originating pod's perspective).

I was confused by your earlier statement. From my reading of the code, traffic from regular pods does hit the NOTRACK rule, but we've also added an ACCEPT rule to allow such traffic. If that's the case, the fix looks good to me.

@Weil0ng
Contributor Author

Weil0ng commented Oct 15, 2021

I was confused by your earlier statement. From my reading of the code, traffic from regular pods does hit the NOTRACK rule, but we've also added an ACCEPT rule to allow such traffic. If that's the case, the fix looks good to me.

Yes, sorry I wasn't being clear in my initial comment. When I said "it never hits the NOTRACK rule", I meant that pkts from regular pods do NOT hit the rule in the raw:CILIUM_OUTPUT chain; they DO hit the NOTRACK rule in the PREROUTING chain and, as you said, we have an ACCEPT rule to allow such pkts through.

@Weil0ng
Contributor Author

Weil0ng commented Oct 19, 2021

Friendly ping @pchaigno :)

@pchaigno
Member

pchaigno commented Nov 2, 2021

However, we need to add a specific ACCEPT rule in the filter:CILIUM_OUTPUT chain to accept such packets. Otherwise, a DNS query pkt originating from the hostns will skip conntrack and get dropped in the filter OUTPUT chain.

I might be missing something obvious, but what's the relation between skipping conntrack and not matching any rule in filter:output? You seem to suggest the first causes the second.

Issue here is that for the first 2 rules, we added ACCEPT rules to allowlist them, but for the last, we are missing that ACCEPT rule.

Why wasn't the ACCEPT rule added before? Was it simply an oversight?

@Weil0ng
Contributor Author

Weil0ng commented Nov 2, 2021

I might be missing something obvious, but what's the relation between skipping conntrack and not matching any rule in filter:output? You seem to suggest the first causes the second.

Yes, if you look at the filter:OUTPUT rules here #16694 (comment), specifically

3     300K  136M ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            state NEW,RELATED,ESTABLISHED
4        0     0 ACCEPT     all  --  *      lo      0.0.0.0/0            0.0.0.0/0

rule 3 only allows traffic that is conntracked, and rule 4 only allows traffic going out via the lo device.
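Untracked pkts carry conntrack state UNTRACKED, which matches neither rule. A generic allowlist would look like the line below (illustrative only; this PR instead adds an ACCEPT in filter:CILIUM_OUTPUT targeted at the no-track traffic):

iptables -A OUTPUT -m conntrack --ctstate UNTRACKED -j ACCEPT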

Why wasn't the ACCEPT rule added before? Was it simply an oversight?

When we added the rules, we followed the OSS node-local-dns implementation, which does not require such a rule because the same traffic flow would hit the lo dev in their deployment mode. So yes, we overlooked this subtlety.

@pchaigno
Member

pchaigno commented Nov 3, 2021

rule 3 only allows traffic that is conntracked, and rule 4 only allows traffic going out via the lo device.

Aah 🤦 Not sure how I missed that rule; that's exactly what I was looking for.

When we added the rules, we followed the OSS node-local-dns implementation, which does not require such a rule because the same traffic flow would hit the lo dev in their deployment mode.

Ah, makes sense 👍

@Weil0ng
Contributor Author

Weil0ng commented Nov 3, 2021

/test

Job 'Cilium-PR-K8s-GKE' failed and has not been observed before, so may be related to your PR:


Test Name

K8sServicesTest Checks service across nodes Tests NodePort BPF Tests with direct routing Tests NodePort with sessionAffinity

Failure Output

FAIL: Expected

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-GKE so I can create a new GitHub issue to track it.

@Weil0ng
Contributor Author

Weil0ng commented Nov 3, 2021

This change should not affect any of the failed tests, they are most likely flakes.

@pchaigno
Member

pchaigno commented Nov 3, 2021

This change should not affect any of the failed tests, they are most likely flakes.

Can we ensure we have issues tracking those flakes?

@Weil0ng
Contributor Author

Weil0ng commented Nov 10, 2021

/test

@pchaigno pchaigno added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Nov 11, 2021
@aanm aanm merged commit 2215c02 into cilium:master Nov 11, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport done to v1.10 in 1.10.6 Nov 23, 2021
Development

Successfully merging this pull request may close these issues.

DNS request from hostNetwork pods cannot be delivered to local backend with LRP