datapath: ICMP CT fixes #15275
test-me-please
4.19 is hitting complexity issues.
6ba3495 to 1cb41bd
test-me-please
test-4.9
Converted to draft until the complexity issue has been resolved.
The complexity issue should be resolved by #15217.
1cb41bd to 3edf2fa
test-1.19-4.19
7065466 to c8de6ba
[1] changed the ICMP ECHO/ECHO_REPLY ID placement in CT entries to fix a problem in which an egress NAT entry for ECHO_REPLY could not be found from the corresponding CT entry. This led to leaked NAT entries, because the CT GC could not find the NAT entries for the given CT entry.

The changed placement introduced an interesting problem, described below.

What happens when a pod (10.154.0.89) sends ICMP EchoRequest to 8.8.8.8?

A CT entry with the following key is created:

        dst         src     dport sport   TUPLE_F_OUT
         |           |        |     |       |
    0a 9a 00 59 08 08 08 08 00 00 08 00 01 00
    <-- dst=pod because of the reverse before the second __ct_lookup.

("ICMP OUT 10.154.0.89:2048 -> 8.8.8.8:0 [...]" in the "cilium bpf ct list global" output).

What happens when 8.8.8.8 sends ICMP EchoRequest to the pod?

The lookup is performed for the reverse flow first, with the following key:

        dst         src     dport sport   TUPLE_F_OUT  <-- dir is TUPLE_F_OUT
         |           |        |     |       |              because we do the
    0a 9a 00 59 08 08 08 08 00 00 08 00 01 00              lookup in reverse order first.

The key matches in the first __ct_lookup(), hence the return is CT_REPLY.

Previously, before the changed ID placement, the CT key for the 8.8.8.8 -> pod lookup was:

    0a 9a 00 59 08 08 08 08 08 00 00 00 01 00

This resulted in CT_NEW instead of CT_REPLY.

[1]: #12729

Signed-off-by: Martynas Pumputis <m@lambda.lt>
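The key comparison described in the commit message above can be sketched in a few lines of C. This is only an illustration: `struct ct_tuple4`, `make_key()`, `keys_equal()` and `enum id_slot` are hypothetical names, the field order is taken from the key dump (dst, src, dport, sport, next header, flags), and the `08 00` bytes are assumed to be the echo ID.

```c
#include <stdint.h>
#include <string.h>

#define TUPLE_F_OUT 0 /* flag value shown as the last byte of the dump */
#define NEXTHDR_ICMP 1 /* IP protocol number of ICMP */

/* Hypothetical byte-wise mirror of the 14-byte key shown in the dump. */
struct ct_tuple4 {
	uint8_t daddr[4];
	uint8_t saddr[4];
	uint8_t dport[2];
	uint8_t sport[2];
	uint8_t nexthdr;
	uint8_t flags;
};

/* Where the ICMP echo ID is stored in the key. */
enum id_slot { ID_IN_SPORT, ID_IN_DPORT };

/* Build a key with the echo ID placed in the chosen port field. */
static void make_key(struct ct_tuple4 *k, const uint8_t dst[4],
		     const uint8_t src[4], const uint8_t id[2],
		     enum id_slot slot)
{
	memset(k, 0, sizeof(*k));
	memcpy(k->daddr, dst, 4);
	memcpy(k->saddr, src, 4);
	memcpy(slot == ID_IN_SPORT ? k->sport : k->dport, id, 2);
	k->nexthdr = NEXTHDR_ICMP;
	k->flags = TUPLE_F_OUT;
}

/* Two keys hit the same map entry only if all 14 bytes are equal. */
static int keys_equal(const struct ct_tuple4 *a, const struct ct_tuple4 *b)
{
	return memcmp(a, b, sizeof(*a)) == 0;
}
```

Building the pod's entry and the reverse lookup key for the inbound request with `ID_IN_SPORT` yields identical bytes (the CT_REPLY case), while building the lookup key with `ID_IN_DPORT` yields the old, non-matching key (the CT_NEW case).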
Let's say that we have a pod sending an ICMP ECHO request to the outside. The handling of the request creates the following CT and NAT entries:

    CT
    | src        | dst       | dir |
    +------------+-----------+-----+
    | outside:ID | pod:0     | OUT |

    NAT
    | src        | dst       | dir |
    +------------+-----------+-----+
    | pod:ID     | outside:0 | OUT |
    +------------+-----------+-----+
    | outside:0  | host:ID   | IN  |

Now, let's say that the outside sends an ICMP ECHO request, with the same ID as above, to the host running the pod. The following NAT lookup is performed:

    outside:0 -> host:ID IN

The lookup will find the NAT entry from the pod->outside case. This will translate the request, causing it to be delivered to the pod instead of the host.

Fix this by making the ICMP ECHO ID placement in the NAT tuple depend on the ICMP type instead of the packet direction.

After this change, the NAT entries will be the same as above, but the lookup for the outside->host case is changed to the following:

    outside:ID -> host:0 IN

(doesn't match any NAT entry above).

Signed-off-by: Martynas Pumputis <m@lambda.lt>
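A minimal sketch of the two placement rules from the commit message above, assuming a toy NAT table. All names here (`struct nat_tuple`, `tuple_by_direction()`, `tuple_by_type()`, `nat_match()`, the `OUTSIDE`/`HOST`/`POD` labels) are hypothetical and do not mirror the actual datapath code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum nat_dir { NAT_DIR_OUT, NAT_DIR_IN };
enum { ICMP_TYPE_ECHOREPLY = 0, ICMP_TYPE_ECHO = 8 };
enum { OUTSIDE = 1, HOST = 2, POD = 3 }; /* stand-ins for IP addresses */

struct nat_tuple {
	int saddr;
	uint16_t sport;
	int daddr;
	uint16_t dport;
	enum nat_dir dir;
};

/* Old rule: the packet direction decides where the echo ID goes. */
static struct nat_tuple tuple_by_direction(int src, int dst, uint16_t id,
					   enum nat_dir dir)
{
	struct nat_tuple t = { .saddr = src, .daddr = dst, .dir = dir };

	if (dir == NAT_DIR_OUT)
		t.sport = id;
	else
		t.dport = id;
	return t;
}

/* New rule: the ICMP type decides (ECHO -> sport, ECHO_REPLY -> dport). */
static struct nat_tuple tuple_by_type(int src, int dst, uint16_t id,
				      uint8_t type, enum nat_dir dir)
{
	struct nat_tuple t = { .saddr = src, .daddr = dst, .dir = dir };

	if (type == ICMP_TYPE_ECHO)
		t.sport = id;
	else
		t.dport = id;
	return t;
}

/* Linear search standing in for the NAT map lookup. */
static bool nat_match(const struct nat_tuple *entries, size_t n,
		      struct nat_tuple key)
{
	for (size_t i = 0; i < n; i++) {
		const struct nat_tuple *e = &entries[i];

		if (e->saddr == key.saddr && e->sport == key.sport &&
		    e->daddr == key.daddr && e->dport == key.dport &&
		    e->dir == key.dir)
			return true;
	}
	return false;
}
```

With the two NAT entries from the table, an outside->host ECHO built by `tuple_by_direction()` matches the `outside:0 -> host:ID IN` entry, whereas the same packet built by `tuple_by_type()` does not. An ECHO_REPLY from the outside still matches, so replies to the pod's own requests keep being translated.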
Previously, when an ICMP ECHO was sent from outside to a host managed by Cilium, the handling of the reply to it (ICMP ECHO_REPLY) created the following entries:

    CT
    | src        | dst        | dir |
    +------------+------------+-----+
    | outside:0  | host:ID    | OUT |

    NAT
    | src        | dst        | dir |
    +------------+------------+-----+
    | host:0     | outside:ID | OUT | <-- ICMP ECHO_REPLY
    +------------+------------+-----+
    | outside:ID | host:ID    | IN  | <-- ICMP ECHO

The NAT IN entry was useful only to prevent pod->outside traffic from being SNAT-ed with the same ID, but this is no longer needed after the "datapath: Fix unintended SNAT of ICMP ECHO" commit.

Also, this removes the problematic CT GC case in which the existing GC logic could not find the NAT OUT entry corresponding to such a CT entry.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
c8de6ba to b696de3
test-me-please
Looks like net-next is hitting #15737, otherwise should be good for review.
This commit reduces the complexity of the 2/7 section of the bpf_host program by introducing a couple of state pruning points with the relax_verifier() helper.

These points have been determined by looking at the instructions on which the verifier spends the most passes.

We first obtain the verifier logs:

    tc filter replace dev cilium_host ingress prio 1 handle 1 bpf da obj bpf_host.o sec to-host verb

With these logs we can count how many times each instruction is examined by the verifier, and look for groups of sequential instructions with the highest complexity.

With that information we can then disassemble the bpf_host program and use the debug symbols to approximately match the line of code that may require an additional state pruning point.

Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
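The counting step described above could be sketched as follows. This is not the tooling used for the commit: `count_insn_visits()` is a hypothetical helper, and it assumes that each examined instruction appears in the verifier log as a line of the form `<index>: (<opcode>) ...`, while register-state lines and other output are ignored.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Count how often each instruction index shows up in a verifier log.
 * visits[] must have room for n_insns counters; indexes outside that
 * range are skipped.
 */
static void count_insn_visits(const char *log, unsigned int *visits,
			      size_t n_insns)
{
	const char *line = log;

	memset(visits, 0, n_insns * sizeof(*visits));
	while (line && *line) {
		if (isdigit((unsigned char)*line)) {
			char *end;
			unsigned long idx = strtoul(line, &end, 10);

			/* Only "<idx>: (" lines are instruction lines. */
			if (strncmp(end, ": (", 3) == 0 && idx < n_insns)
				visits[idx]++;
		}
		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
}
```

Runs of consecutive indexes with high counters are then the candidates to look up in the disassembly when deciding where a pruning point might help.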
b696de3 to 107080d
No need to rerun full CI, I just amended the latest commit message.
@@ -845,6 +846,7 @@ static __always_inline int ct_create4(const void *map_main,
 	entry.lb_loopback = ct_state->loopback;
 	entry.node_port = ct_state->node_port;
+	relax_verifier();
Btw, do we have a cheaper call for relax_verifier() internally with less overhead and which could potentially be inlined?
🤔 What do you have in mind? We also want something that has zero arguments to minimize impact on complexity.
See commit msgs.
The PR has been previously reviewed by @kkourt and @borkmann.