New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move ct_lookup in bpf_host.c to a separate tailcall #23831
Conversation
9ba08ee
to
a8ad172
Compare
/test |
b275b20
to
dc000df
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes look good to me and the commit history is nice! 🚀
I have a couple comments below. Most notably, I'd like to reduce the code movements in the last two commits. That will be painful during future backports and git blaming sessions.
7f261aa
to
6be447e
Compare
Pushed a new version. Julian's comment is not addressed yet. Did a cleanup and simplification of code, fixed a dozen minor bugs added in the previous changeset, deduplicated ct_lookup4 in whitelist_snated_egress_connections, reduced the number of revalidate_data calls, organized the code to avoid some flags like fill_trace, reordered handle_ipv{4,6}_cont after handle_ipv{4,6}. |
/test |
@gentoo-root I'm guessing, given recent assignments, you may be a bit low on cycles. If that's the case, I think it's fine to address Julian's comment in a follow up. If this PR stays open too long, it's likely to accumulate conflicts 😞 |
OT: is this going to fix the latest 4.19 complexity issues for bpf_host? |
f92ae82
to
0f9a087
Compare
Hmm, that's interesting. I didn't touch this file, and that line is pretty old (2020). How come CI started complaining only now? |
It reduces complexity of bpf_host, but I need to check specifically whether it's enough to fix the issues. |
/test |
/test-1.26-net-next Job 'Cilium-PR-K8s-1.26-kernel-net-next' failed: Click to show.Test Name
Failure Output
If it is a flake and a GitHub issue doesn't already exist to track it, comment |
k8s kernel tests seem to fail because I added new tail calls, and a "Missed tail call" appears when migrating to Cilium that doesn't have new tail calls. #20691 |
c2a5aa1
to
3962764
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, took way too long to review this 😞. Had a closer look, please check if my comments make sense ...
bpf/bpf_host.c
Outdated
if (ret == CTX_ACT_OK) { | ||
if (from_host) | ||
invoke_tailcall_if(is_defined(ENABLE_HOST_FIREWALL), | ||
CILIUM_CALL_IPV4_CONT_FROM_HOST, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably just a matter of taste. But I would prefer to have the ctx_store_meta(ctx, CB_SRC_LABEL, secctx)
here, and not hidden inside handle_ipv4()
. It's already difficult enough to keep track of all the games we play with CB_SRC_LABEL
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about CB_FROM_HOST? Shall I move it out of handle_ipv4 too? If I do, it will be cumbersome to return 2 booleans in addition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CB_FROM_HOST
should probably stay in it current place - as that's where we create its input data.
I'm just very sensitive to CB_SRC_LABEL
, burned too many times already 😄.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I moved it and rebased.
(Please note that there is one more assignment to CB_SRC_LABEL in handle_ipv4 in the NAT46x64 shortcut that existed before my changes, but this should be fine, we do an unconditional tailcall right away.)
3962764
to
0e60b70
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't spot anything else, and you nicely shot down all my earlier concerns 😅.
Feel free to resolve my nits in a follow-on PR if it's easier. But as this needs a rebase anyway ...
0e60b70
to
575689a
Compare
OMG, rebased 1 hour ago, and tons of conflicts again =/ |
d3ad1d4
to
468f40d
Compare
/test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a second, cursory look. One comment below, but I think it can be fixed up after the tests haven't passed (or after they are triaged) without needing to then rerun the full test suite (assuming you don't rebase in the process).
/* Only enforce host policies for packets from host IPs. */ | ||
/* Only enforce host policies for packets from host IPs. */ | ||
if (src_sec_identity != HOST_ID && | ||
(is_defined(ENABLE_MASQUERADE) || ipcache_srcid != HOST_ID)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment doesn't match the condition anymore. Maybe we need to move the comment in whitelist_snated_egress_connections
here? We should probably also rename whitelist_snated_egress_connections
to something like track_snated_egress_connections
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we need to move the comment in whitelist_snated_egress_connections here?
I'll update this comment to make it more meaningful.
We should probably also rename whitelist_snated_egress_connections to something like track_snated_egress_connections.
Why? I believe the meaning of the old name still holds. By creating a CT entry (the only thing this function does at the moment) we are whitelisting these connections from applying host policies to reply packets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, but that intent is hard to understand if the code comment is moved to where the condition is.
When handle_ipv4 returns an error, send_drop_notify_error_ext is called by tail_handle_ipv4, hence no need to call send_drop_notify_error from handle_ipv4 itself. Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
To prepare for splitting ipv4_host_policy_ingress, move the initialization of the is_untracked_fragment variable closer to its only usage. Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Split the conntrack lookup part of the host policy ingress handlers into a separate function to prepare for making it a separate tailcall. Use struct ct_buffer{4,6} for communication between two parts of the old function to prepare for passing the lookup result via a BPF map, similar to how it's done in bpf_lxc.c. Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Currently, ipv{4,6}_host_policy_egress calls revalidate_data to get the pointer to the IP header. However, if we analyze from which contexts it's called, we realize that in most cases this pointer is already available, and the call to revalidate_data can be optimized out: 1. handle_ipv{4,6} already have the pointer. 2. handle_to_netdev_ipv{4,6} also already have the pointer, moreover, handle_to_netdev_ipv4 calls revalidate_data and never uses its results - until this commit. 3. from_host_to_lxc doesn't have the pointer, but it always calls ipv{4,6}_host_policy_egress with src_id == HOST_ID, so this function will never return early before it would call revalidate_data. That means that the call to revalidate_data can be moved from ipv{4,6}_host_policy_egress to from_host_to_lxc without any performance penalty. This commit essentially drops a few revalidate_data calls, no strings attached. Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
ct_lookup4 is called by both ipv4_host_policy_egress_lookup and whitelist_snated_egress_connections. Both lookups are the same and can be replaced with ipv4_host_policy_egress_lookup. Remove the lookup part from whitelist_snated_egress_connections and reuse the lookup that is already present in ipv4_host_policy_egress. While at it, also replace the error check after ct_create4 from IS_ERR to ret < 0. ct_create4 only returns 0 or DROP_CT_CREATE_FAILED, and having a mix of different error checks confuses the verifier when it can't prove that CTX_ACT_DROP is not a possible outcome. Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Split the conntrack lookup part of the host policy egress handlers into a separate function to prepare for making it a separate tailcall. Use struct ct_buffer{4,6} for communication between two parts of the old function to prepare for passing the lookup result via a BPF map, similar to how it's done in bpf_lxc.c. Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
In order to reduce verifier complexity, split handle_ipv4 into two functions chained by a tailcall. The first part will stop after doing ct_lookup4 of the host firewall, and the second part will pick up the results from a BPF map and go on. The tailcall is saved, and the map is not used if host firewall is disabled, so the overhead is minimal in this case. Also move the "verifier workaround" call to revalidate_data in handle_ipv4 to the ENABLE_HOST_FIREWALL section, as its presence at the previous location now makes clang miscompile code and generate 32-bit zero extension on pkt and pkt_end pointers, which is not accepted by the verifier. Removing the workaround altogether leads to another verifier error: "R5 invalid mem access 'scalar'". Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
In order to reduce verifier complexity, split handle_ipv6 into two functions chained by a tailcall. The first part will stop after doing ct_lookup6 of the host firewall, and the second part will pick up the results from a BPF map and go on. The tailcall is saved, and the map is not used if host firewall is disabled, so the overhead is minimal in this case. Similar to the previous commit, also move the "verifier workaround" call to revalidate_data in handle_ipv6 to the ENABLE_HOST_FIREWALL section. Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
468f40d
to
92d83c4
Compare
/test Job 'Cilium-PR-K8s-1.26-kernel-net-next' hit: #24697 (90.61% similarity) |
/test-1.26-net-next |
Move ct_lookup in bpf_host.c to a separate tailcall to reduce verifier complexity.
Fixes: #17142