Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Load multiple programs for one CollectionSpec loading #22025

Merged
merged 1 commit into from
Dec 4, 2022
Merged

Load multiple programs for one CollectionSpec loading #22025

merged 1 commit into from
Dec 4, 2022

Conversation

alexkats
Copy link
Contributor

@alexkats alexkats commented Nov 7, 2022

This is a continuation of the effort towards reducing Pod Startup Latency for Kubernetes (#22023).

This change is a follow-up on #20702 (Parallel reloading of ingress and egress eBPF programs), it introduces the ability to load multiple programs with loading and verifying CollectionSpec just once per objPath. After the change the pod startup latency was reduced by ~300ms (3.6 to 3.3 seconds for 100 nodes).

Signed-off-by: Alex Katsman alexkats@google.com

@alexkats alexkats requested a review from a team as a code owner November 7, 2022 11:27
@alexkats alexkats requested a review from qmonnet November 7, 2022 11:27
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Nov 7, 2022
@qmonnet qmonnet requested a review from ti-mo November 7, 2022 12:41
@qmonnet qmonnet added the release-note/minor This PR changes functionality that users may find relevant to operating Cilium. label Nov 7, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Nov 7, 2022
@qmonnet qmonnet added sig/loader Impacts the loading of BPF programs into the kernel. dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Nov 7, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Nov 7, 2022
Copy link
Member

@qmonnet qmonnet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thank you!

@qmonnet
Copy link
Member

qmonnet commented Nov 7, 2022

/test

@alexkats
Copy link
Contributor Author

alexkats commented Nov 8, 2022

Hi @qmonnet,

Thanks for review! Could you re-run failed tests, please? From what I see k8s-1.16-kernel-4.9 (test-1.16-4.9) is a known flake. And looking into runtime (test-runtime) I can tell that is also a flake in this case, as the compilation of datapath failed, which happens before replacing datapath itself, which I changed.

@qmonnet
Copy link
Member

qmonnet commented Nov 8, 2022

Ok, didn't have to check the test results myself so I trust you. Just linking to the failed test results in case they repeat and we want to compare.

/test-1.16-4.9 (previous run: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.16-kernel-4.9/3062/)

@qmonnet
Copy link
Member

qmonnet commented Nov 8, 2022

/test-runtime (previous run: https://jenkins.cilium.io/job/Cilium-PR-Runtime-net-next/3882/)

@alexkats
Copy link
Contributor Author

alexkats commented Nov 8, 2022

test-runtime failed again, but the different test failed this time, here's a message: bpf_test.go:56: -bpf-test-path is a required flag. I'm not sure how it relates to the change, so I'd appreciate any help on this.

Update: I see multiple PRs failing because of the same issue, so I guess it's a common problem.

@alexkats
Copy link
Contributor Author

alexkats commented Nov 9, 2022

So, I rebased to the latest master, which includes the fix (#22043) for the latest failure. Could you please start tests once again?

@qmonnet
Copy link
Member

qmonnet commented Nov 9, 2022

/test

@ti-mo
Copy link
Contributor

ti-mo commented Nov 14, 2022

Please give me some time to review this, working through my queue.

Copy link
Contributor

@ti-mo ti-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I'd love to include this in 1.13. Pointed out some changes I'd like to see.

pkg/datapath/loader/netlink.go Outdated Show resolved Hide resolved
pkg/datapath/loader/netlink.go Outdated Show resolved Hide resolved
pkg/datapath/loader/netlink.go Outdated Show resolved Hide resolved
pkg/datapath/loader/netlink.go Outdated Show resolved Hide resolved
pkg/datapath/loader/netlink.go Outdated Show resolved Hide resolved
pkg/datapath/loader/loader.go Outdated Show resolved Hide resolved
pkg/datapath/loader/loader_test.go Outdated Show resolved Hide resolved
pkg/datapath/loader/netlink.go Outdated Show resolved Hide resolved
@ti-mo
Copy link
Contributor

ti-mo commented Nov 22, 2022

@alexkats What's the reason for the frequent pushes? Are they just rebases?

@alexkats
Copy link
Contributor Author

@ti-mo, thanks for the review. I resolved all your comments. Regarding frequent pushes you're right, they were just rebases, wanted to keep my branch up-to-date.

Copy link
Contributor

@ti-mo ti-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Few more nits to address, then we'll run the full tests.

pkg/datapath/loader/loader.go Outdated Show resolved Hide resolved
pkg/datapath/loader/loader_test.go Outdated Show resolved Hide resolved
@alexkats
Copy link
Contributor Author

build-commits fails everywhere with this message: clang: error while loading shared libraries: libtinfo.so.5: cannot open shared object file: No such file or directory and conformance-test-ipv6 fails not only here as well.

@alexkats
Copy link
Contributor Author

alexkats commented Nov 23, 2022

Took a look at all 4 failed tests, looks like a flake not related to these changes to me, but tell me if I'm wrong here.

@ti-mo
Copy link
Contributor

ti-mo commented Nov 24, 2022

@alexkats I'll kick off a round of tests now, no need to keep rebasing the branch.

@ti-mo
Copy link
Contributor

ti-mo commented Nov 24, 2022

/test

Job 'Cilium-PR-K8s-1.24-kernel-4.19' failed:

Click to show.

Test Name

K8sDatapathLRPTests Checks local redirect policy LRP connectivity

Failure Output

FAIL: Found 1 k8s-app=cilium logs matching list of errors that must be investigated:

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.24-kernel-4.19 so I can create one.

@alexkats
Copy link
Contributor Author

Thanks, I just wanted to include latest fixes for tests from master

@alexkats
Copy link
Contributor Author

I took a look at the failures, here's what I see:

@alexkats
Copy link
Contributor Author

Hi, please let me know if there's anything else I should do here.

@aanm aanm added the kind/community-contribution This was a contribution made by a community member. label Dec 1, 2022
This is a continuation of the effort towards reducing Pod Startup
Latency for Kubernetes. Loading CollectionSpec once for multiple
programs from the same object file helps to reduce the latency by
~300ms.

Signed-off-by: Alex Katsman <alexkats@google.com>
@joestringer
Copy link
Member

/test

@alexkats
Copy link
Contributor Author

alexkats commented Dec 2, 2022

Soooo, we're down to 2 failures, and those are different from the previous ones 😊

UPD: Here's an issue for ci-aks flake: #22520
UPD2: ci-aks flake is the same as in #22162

@alexkats
Copy link
Contributor Author

alexkats commented Dec 2, 2022

The question here is what's the plan? Can we merge it, as I believe we had enough signals that this change is not breaking anything, at least from the CI perspective (all tests passed with the latest changes if we combine 2 last runs, where the difference is just a rebase)?

@pchaigno
Copy link
Member

pchaigno commented Dec 4, 2022

Nice job on triaging all those flakes!

The question here is what's the plan? Can we merge it, as I believe we had enough signals that this change is not breaking anything, at least from the CI perspective (all tests passed with the latest changes if we combine 2 last runs, where the difference is just a rebase)?

I agree. Reviews are in and the two test failures have been identified. Marking as ready to merge.

@pchaigno pchaigno added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Dec 4, 2022
@joestringer joestringer merged commit cb8ea8c into cilium:master Dec 4, 2022
@alexkats alexkats deleted the optimize-replace-datapath branch December 4, 2022 21:21
@alexkats
Copy link
Contributor Author

alexkats commented Dec 5, 2022

And one last question here: can it be a candidate for a backport to 1.12? Or this is a good item for an open session (unfortunately can't take part in it this week)?

@ti-mo
Copy link
Contributor

ti-mo commented Dec 8, 2022

1.13 is the first version of Cilium that will ship with ebpf-go based ELF loader, so not really possible unless we backport all of #19159.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/community-contribution This was a contribution made by a community member. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. sig/loader Impacts the loading of BPF programs into the kernel.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants