bpf: Fix maglev hash with hostServices.hostNamespaceOnly #18336
Thanks for the PR!
I have a couple of comments below, one for datapath and a minor comment on the CI test. The datapath comment should be validated with the other members of the datapath team (which I've CC'd).
bpf/lib/lb.h
Outdated
@@ -757,7 +780,8 @@ static __always_inline int lb6_local(const void *map, struct __ctx_buff *ctx,
 struct ipv6_ct_tuple *tuple,
 const struct lb6_service *svc,
 struct ct_state *state,
-const bool skip_l3_xlate)
+const bool skip_l3_xlate,
+const bool use_random_selection)
While this approach presumably works, my main concern is that it will further strain the complexity of the BPF progs, because it changes the prototype of such a fundamental, common function and all its callers.
Ultimately, the behavior we want is to force bpf_lxc to use the random backend selection AFAIU. In that case, I think there are a couple of alternatives:
- Hardcode LB_SELECTION to random only for bpf_lxc at compile-time in the agent. The code for the endpoint-specific BPF prog (bpf_lxc) starts here:
  cilium/pkg/datapath/linux/config/config.go, line 804 in 2581084:
  func (h *HeaderfileWriter) WriteEndpointConfig(w io.Writer, e datapath.EndpointConfiguration) error {
  LB_SELECTION is set inside the node variant WriteNodeConfig().
- Provide a bpf_lxc-specific lb{4,6}_xlate function which always uses random backend selection rather than relying on LB_SELECTION, and is only called from the bpf_lxc code.
In both scenarios, we only pay the "cost" once, rather than at each packet (albeit almost negligible; optimization will likely help), and more importantly, reduce program complexity (no conditionals in BPF code) for future changes.
IMO, the first option seems cleaner, as it's simple Go code mangling vs. C code changes which will very likely end up duplicating some code (I presume).
That's my 2 cents, but would love to hear what the others from the datapath team think (@cilium/bpf @brb).
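To make the trade-off concrete, here is a minimal sketch of the two function shapes being compared: a runtime flag threaded through a shared helper versus a dedicated wrapper in the spirit of the second alternative. All names (pick_random, pick_maglev, pick_with_flag, pick_for_lxc) are hypothetical stand-ins, not functions from lb.h, and the modulo arithmetic only imitates backend selection.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical selectors standing in for the real logic in lb.h. */
static inline uint32_t pick_random(uint32_t rnd, uint32_t count)
{
	assert(count > 0);
	return rnd % count;
}

static inline uint32_t pick_maglev(uint32_t hash, uint32_t count)
{
	assert(count > 0);
	return hash % count; /* a real Maglev lookup would index a LUT */
}

/* Shape of the approach under review: every caller passes a runtime flag,
 * so the shared helper gains a parameter and a branch.
 */
static inline uint32_t pick_with_flag(uint32_t hash, uint32_t rnd,
				      uint32_t count, bool use_random_selection)
{
	return use_random_selection ? pick_random(rnd, count)
				    : pick_maglev(hash, count);
}

/* Shape of the second alternative: a wrapper used by one program only,
 * with no flag and no branch.
 */
static inline uint32_t pick_for_lxc(uint32_t rnd, uint32_t count)
{
	return pick_random(rnd, count);
}
```

With the wrapper, the shared helper's prototype stays untouched and only the one program that needs random selection calls the new function.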
Thank you for your feedback!
The first option makes sense to me too because it's simple, as you mentioned, while my current approach changes the common function and makes things more complex.
How about this one? I could confirm that the e2e tests with this branch have passed in my environment.
ysksuzuki@0d2dbc4
With this change, LB_SELECTION, initially defined in the globally scoped node_config.h, is overridden in ep_config.h, so bpf_lxc sticks to LB_SELECTION_RANDOM.
test/k8sT/Services.go
Outdated
ExpectAllPodsTerminated(kubectl)
})

+It("Checks ClusterIP connectivity in combination with hostServices.hostNamespaceOnly and maglev", func() {
Usually the It block text shouldn't overlap with the Context text. The Context text can describe what the test is, and the It text can describe what the assertion is.
So, I'd probably do something like:
Context("hostServices.hostNamespaceOnly and Maglev enabled")
...
It("Checks ClusterIP connectivity")
Couldn't you just #define LB_SELECTION LB_SELECTION_RANDOM in bpf_lxc.c before including lb.h?
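A minimal, self-contained sketch of that idea follows. The constant values, the stand-in for the node-wide definition, and the helper select_backend_slot are assumptions made for illustration; the #undef is included because this sketch defines the node-wide value first, and redefining a macro to a different value without it is not valid C.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative values only; the real constants live in Cilium's BPF headers. */
#define LB_SELECTION_RANDOM 1
#define LB_SELECTION_MAGLEV 2

/* Stand-in for the node-wide setting generated into node_config.h. */
#define LB_SELECTION LB_SELECTION_MAGLEV

/* Per-program override: it must appear before the code that tests
 * LB_SELECTION, i.e. before the header with the selection logic is included.
 */
#undef LB_SELECTION
#define LB_SELECTION LB_SELECTION_RANDOM

/* Stand-in for the selection logic: the preprocessor resolves the branch,
 * so the compiled program contains only one of the two paths.
 */
static inline uint32_t select_backend_slot(uint32_t hash, uint32_t rnd,
					   uint32_t count)
{
	assert(count > 0);
#if LB_SELECTION == LB_SELECTION_MAGLEV
	(void)rnd;
	return hash % count;
#else
	(void)hash;
	return rnd % count;
#endif
}
```

Because the choice is made at compile time, no caller signature changes and no per-packet conditional is added.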
Yes, I could confirm that the e2e tests have passed with this commit that defines LB_SELECTION_RANDOM in bpf_lxc.c. Now I finally understand what you were saying.
Thanks Martynas. I think it would be good to add the motivation / why in the commit msg and to also add a comment explaining briefly why this is necessary inside the code itself (above the |
43fe453 to 31e711c
test/k8sT/Services.go
Outdated
@@ -566,6 +566,41 @@ var _ = SkipDescribeIf(helpers.RunsOn54Kernel, "K8sServicesTest", func() {

 monitorRes.ExpectContains(clusterIP, "Service VIP not seen in monitor trace, indicating socket lb still in effect")
 })
+
+Context("hostServices.hostNamespaceOnly and Maglev enabled", func() {
I'd suggest not adding another test case that redeploys Cilium, since that increases the test suite run time. Instead, you could reuse the "Checks connectivity when skipping socket lb in pod ns" test case.
fec4f28 to 556fb7f
I needed #18493 to pass the K8sServiceTest in my environment. Will rebase the branch after it's merged.
This test is failing in my environment even with the master branch.
Getting closer to merging 🚀
bpf/bpf_lxc.c
Outdated
@@ -30,6 +30,14 @@
 #include "lib/nat46.h"
 #include "lib/identity.h"
 #include "lib/policy.h"
+
+/* Override LB_SELECTION initially defined in node_config.h to force bpf_lxc to use the random backend selection
+ * algorithm for in-cluster traffic. It will fail with the Maglev hash algorithm because Cilium doesn't provision
Nit: "It will fail" => "Otherwise, it will fail".
test/k8sT/Services.go
Outdated
@@ -459,6 +459,7 @@ var _ = SkipDescribeIf(helpers.RunsOn54Kernel, "K8sServicesTest", func() {
 BeforeAll(func() {
 DeployCiliumOptionsAndDNS(kubectl, ciliumFilename, map[string]string{
 "bpf.lbExternalClusterIP": "true",
+"loadBalancer.algorithm": "maglev",
This needs a comment explaining why we enable Maglev in this test case.
test/k8sT/Services.go
Outdated
@@ -492,6 +493,7 @@ var _ = SkipDescribeIf(helpers.RunsOn54Kernel, "K8sServicesTest", func() {
 BeforeAll(func() {
 DeployCiliumOptionsAndDNS(kubectl, ciliumFilename, map[string]string{
 "hostServices.hostNamespaceOnly": "true",
+"loadBalancer.algorithm": "maglev",
Ditto.
This fixes the bug that Cilium drops packets destined to ClusterIP. Cilium tries to resolve ClusterIP to Pod IP using Maglev hash when hostServices.hostNamespaceOnly is enabled. However, it doesn't provision the Maglev LUT for ClusterIP. So it drops the packet.

This commit fixes this problem by forcing bpf_lxc to use the random backend selection for ClusterIP. Also, Cilium will populate the Maglev LUT for ClusterIP if bpf.lbExternalClusterIP is set to true so that the external ingress traffic connecting ClusterIP will be properly handled.

Fixes: cilium#17474
Signed-off-by: Yusuke Suzuki <yusuke-suzuki@cybozu.co.jp>
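The failure mode described in the commit message can be illustrated with a toy model. This is only a sketch under stated assumptions: struct toy_service, NO_BACKEND and both selector functions are hypothetical and do not mirror Cilium's map layout. It shows why a Maglev lookup has nothing to return when no table was provisioned, while random selection over the backend list still works.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "lookup failed" marker; backend IDs in this toy start at 1. */
#define NO_BACKEND 0u

struct toy_service {
	const uint32_t *backends;   /* backend IDs */
	uint32_t count;             /* number of backends */
	const uint32_t *maglev_lut; /* NULL when no table was provisioned */
	uint32_t lut_size;
};

/* Maglev-style lookup: without a provisioned table there is no answer,
 * and the caller would have to drop the packet.
 */
static uint32_t toy_select_maglev(const struct toy_service *svc, uint32_t hash)
{
	assert(svc != NULL);
	if (!svc->maglev_lut || svc->lut_size == 0)
		return NO_BACKEND;
	return svc->maglev_lut[hash % svc->lut_size];
}

/* Random selection needs only the backend list, so it works either way. */
static uint32_t toy_select_random(const struct toy_service *svc, uint32_t rnd)
{
	assert(svc != NULL);
	if (svc->count == 0)
		return NO_BACKEND;
	return svc->backends[rnd % svc->count];
}
```

In this model, a service whose maglev_lut is NULL corresponds to a ClusterIP without a provisioned table, and one with a populated table corresponds to the bpf.lbExternalClusterIP=true case.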
9b382ac to 2769f6f
Thanks!
This PR fixes the bug that Cilium drops packets destined to ClusterIP when configured in combination with loadBalancer.algorithm=maglev and hostServices.hostNamespaceOnly=true. See the commit message for details.
Fixes: #17474
Fixes: #16966