Navigation Menu

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpf: Fix maglev hash with hostServices.hostNamespaceOnly #18336

Merged
merged 1 commit into from Feb 9, 2022

Conversation

ysksuzuki
Copy link
Member

@ysksuzuki ysksuzuki commented Dec 26, 2021

This PR fixes the bug that Cilium drops packets destined to ClusterIP when configured in combination with loadBalancer.algorithm=maglev and hostServices.hostNamespaceOnly=true. See the commit message for details.

Fixes: #17474
Fixes: #16966

Please ensure your pull request adheres to the following guidelines:

  • For first time contributors, read Submitting a pull request
  • All code is covered by unit and/or runtime tests where feasible.
  • All commits contain a well written commit description including a title,
    description and a Fixes: #XXX line if the commit addresses a particular
    GitHub issue.
  • All commits are signed off. See the section Developer’s Certificate of Origin
  • Provide a title or release-note blurb suitable for the release notes.
  • Thanks for contributing!

@ysksuzuki ysksuzuki requested a review from a team December 26, 2021 08:20
@ysksuzuki ysksuzuki requested a review from a team as a code owner December 26, 2021 08:20
@ysksuzuki ysksuzuki requested a review from a team December 26, 2021 08:20
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Dec 26, 2021
@christarazi christarazi requested a review from brb December 27, 2021 19:06
Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

I have a couple of comments below, one for datapath and a minor comment on the CI test. The datapath comment should be validated with the other members of the datapath team (which I've CC'd).

bpf/lib/lb.h Outdated
@@ -757,7 +780,8 @@ static __always_inline int lb6_local(const void *map, struct __ctx_buff *ctx,
struct ipv6_ct_tuple *tuple,
const struct lb6_service *svc,
struct ct_state *state,
const bool skip_l3_xlate)
const bool skip_l3_xlate,
const bool use_random_selection)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this approach presumably works, I think the main concern (for me at least) is that it will cause further strain on the complexity of the BPF progs because of changing the protoype of such a fundamental, common function and all its callers.

Ultimately, the behavior we want is to force bpf_lxc to use the random backend selection AFAIU. In that case, I think there are a couple of alternatives:

  • Hardcode LB_SELECTION to random only for bpf_lxc at compile-time in the agent. The code for the endpoint-specific BPF prog (bpf_lxc) starts here:
    func (h *HeaderfileWriter) WriteEndpointConfig(w io.Writer, e datapath.EndpointConfiguration) error {
    . From my quick glance at the code, that seems like the right place given that LB_SELECTION is set inside the node variant WriteNodeConfig().
  • Provide a bpc_lxc-specific lb{4,6}_xlate function which will always use random backend selection, rather than rely on LB_SELECTION and is only called from the bpf_lxc code.

In both scenarios, we only pay the "cost" once, rather than at each packet (albeit almost negligible; optimization will likely help), and more importantly, reduce program complexity (no conditionals in BPF code) for future changes.

IMO, I think the first option seems cleaner to me, as it's simpler Go code mangling vs. C code changes which very likely will end up duplicating some code (I presume).

That's my 2 cents, but would love to hear what the others from the datapath team think (@cilium/bpf @brb).

Copy link
Member Author

@ysksuzuki ysksuzuki Dec 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your feedback!

The first option makes sense to me too because it's simple, as you mentioned, while my current approach changes the common function and makes things more complex.

How about this one? I could confirm that the e2e tests with this branch have passed in my environment.
ysksuzuki@0d2dbc4

With this change, LB_SELECTION, initially defined in globally scoped node_config.h will be overridden in ep_config.h, and then bpf_lxc will stick to LB_SELECTION_RANDOM.

ExpectAllPodsTerminated(kubectl)
})

It("Checks ClusterIP connectivity in combination with hostServices.hostNamespaceOnly and maglev", func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually the It block text shouldn't overlap with the Context text. The context text can be describing what the test is and the It text can be describing what the assertion is.

So, I'd probably do something like:

Context("hostServices.hostNamespaceOnly and Maglev enabled")
...
    It("Checks ClusterIP connectivity")

Copy link
Member

@brb brb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't you just #define LB_SELECTION LB_SELECTION_RANDOM in bpf_lxc.c before including lb.h?

@ysksuzuki
Copy link
Member Author

Couldn't you just #define LB_SELECTION LB_SELECTION_RANDOM in bpf_lxc.c before including lb.h?

Yes, I could confirm that the e2e tests have passed with this commit that defines LB_SELECTION_RANDOM in bpf_lxc.c. Now I finally understand what you were saying.

@christarazi
Copy link
Member

Thanks Martynas. I think it would be good to add the motivation / why in the commit msg and to also add a comment explaining briefly why this is necessary inside the code itself (above the undef).

@@ -566,6 +566,41 @@ var _ = SkipDescribeIf(helpers.RunsOn54Kernel, "K8sServicesTest", func() {

monitorRes.ExpectContains(clusterIP, "Service VIP not seen in monitor trace, indicating socket lb still in effect")
})

Context("hostServices.hostNamespaceOnly and Maglev enabled", func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest to avoid adding additional test case which redeploys cilium, and thus increases the test suite run time. Instead, you could reuse the "Checks connectivity when skipping socket lb in pod ns" test case.

@ysksuzuki
Copy link
Member Author

ysksuzuki commented Jan 16, 2022

I needed #18493 to pass the K8sServiceTest in my environment. Will rebase the branch after it's merged.

@ysksuzuki
Copy link
Member Author

This test is failing in my environment even with the master branch.

Summarizing 1 Failure:

[Fail] K8sPolicyTest Multi-node policy test validates fromEntities policies [It] Validates fromEntities all policy
/home/cybozu/go/src/github.com/cilium/cilium/test/k8sT/Policies.go:1686

Ran 89 of 411 Specs in 7176.412 seconds
FAIL! -- 88 Passed | 1 Failed | 0 Pending | 322 Skipped
--- FAIL: TestTest (7176.53s)
FAIL

Copy link
Member

@brb brb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting closer to merging 🚀

bpf/bpf_lxc.c Outdated
@@ -30,6 +30,14 @@
#include "lib/nat46.h"
#include "lib/identity.h"
#include "lib/policy.h"

/* Override LB_SELECTION initially defined in node_config.h to force bpf_lxc to use the random backend selection
* algorithm for in-cluster traffic. It will fail with the Maglev hash algorithm because Cilium doesn't provision
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: "It will fail" => "Otherwise, it will fail".

pkg/service/service.go Show resolved Hide resolved
@@ -459,6 +459,7 @@ var _ = SkipDescribeIf(helpers.RunsOn54Kernel, "K8sServicesTest", func() {
BeforeAll(func() {
DeployCiliumOptionsAndDNS(kubectl, ciliumFilename, map[string]string{
"bpf.lbExternalClusterIP": "true",
"loadBalancer.algorithm": "maglev",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a comment why we enable Maglev in this test case.

@@ -492,6 +493,7 @@ var _ = SkipDescribeIf(helpers.RunsOn54Kernel, "K8sServicesTest", func() {
BeforeAll(func() {
DeployCiliumOptionsAndDNS(kubectl, ciliumFilename, map[string]string{
"hostServices.hostNamespaceOnly": "true",
"loadBalancer.algorithm": "maglev",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

This fixes the bug that Cilium drops packets destined to ClusterIP.
Cilium tries to resolve ClusterIP to Pod IP using Maglev hash when
hostServices.hostNamespaceOnly is enabled. However, it doesn't provision
the Maglev LUT for ClusterIP. So it drops the packet.
This commit fixes this problem by forcing bpf_lxc to use the random
backend selection for ClusterIP. Also, Cilium will populate the Maglev
LUT for ClusterIP if bpf.lbExternalClusterIP is set to true so that
the external ingress traffic connecting ClusterIP will be properly
handled.

Fixes: cilium#17474

Signed-off-by: Yusuke Suzuki <yusuke-suzuki@cybozu.co.jp>
@ysksuzuki ysksuzuki force-pushed the fix-maglev-with-hostservice branch 2 times, most recently from 9b382ac to 2769f6f Compare January 28, 2022 02:34
Copy link
Member

@brb brb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@brb brb added release-note/bug This PR fixes an issue in a previous release of Cilium. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. labels Feb 8, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Feb 8, 2022
@brb brb added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Feb 8, 2022
@joamaki joamaki merged commit fc99c0c into cilium:master Feb 9, 2022
@qmonnet qmonnet mentioned this pull request Mar 30, 2022
18 tasks
@pchaigno pchaigno added backport-done/1.11 The backport for Cilium 1.11.x for this PR is done. and removed backport-pending/1.11 labels Apr 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-done/1.11 The backport for Cilium 1.11.x for this PR is done. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
Projects
None yet
8 participants