Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

init.sh: move socketlb creation into own pkg #23557

Merged
merged 5 commits into from Apr 4, 2023

Conversation

rgo3
Copy link
Contributor

@rgo3 rgo3 commented Feb 3, 2023

Description taken from the main commit:

With this PR we translate code from bpf/init.sh that compiled and
loaded bpf code for the socketlb feature to a pure go implementation.

The main difference to the init.sh code is that now on a fresh install
of cilium, cilium will leverage bpf links to attach programs to cgroups
if bpf_link is available in the kernel.

Test cases focus on testing scenarios that occur on clean agent
startup, restart and upgrade for both attaching and detaching bpf
programs to/from cgroups.

Fixes: #20739

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Feb 3, 2023
@rgo3 rgo3 mentioned this pull request Feb 3, 2023
@rgo3 rgo3 force-pushed the move-socketlb-to-go branch 2 times, most recently from d8d5b96 to 0770269 Compare February 3, 2023 14:44
@ti-mo ti-mo added sig/loader Impacts the loading of BPF programs into the kernel. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. release-note/misc This PR makes changes that have no direct user impact. labels Feb 3, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Feb 3, 2023
@ti-mo
Copy link
Contributor

ti-mo commented Feb 3, 2023

/test

@rgo3
Copy link
Contributor Author

rgo3 commented Feb 4, 2023

/test

4 similar comments
@ti-mo
Copy link
Contributor

ti-mo commented Feb 7, 2023

/test

@ti-mo
Copy link
Contributor

ti-mo commented Feb 7, 2023

/test

@ti-mo
Copy link
Contributor

ti-mo commented Feb 8, 2023

/test

@ti-mo
Copy link
Contributor

ti-mo commented Feb 8, 2023

/test

@rgo3
Copy link
Contributor Author

rgo3 commented Feb 14, 2023

/test

@ti-mo
Copy link
Contributor

ti-mo commented Mar 1, 2023

/test

@ti-mo ti-mo added the release-blocker/1.14 This issue will prevent the release of the next version of Cilium. label Mar 1, 2023
@ti-mo
Copy link
Contributor

ti-mo commented Mar 3, 2023

/test-1.24-5.4

@ti-mo
Copy link
Contributor

ti-mo commented Mar 3, 2023

/test-1.25-4.19

@ti-mo
Copy link
Contributor

ti-mo commented Mar 3, 2023

/test-1.26-net-next

Copy link
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic looks much simpler and more reliable now 👍 (I didn't look super close at the details)

pkg/socketlb/cgroup.go Outdated Show resolved Hide resolved
Copy link
Member

@aditighag aditighag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the revisions and detailed comments: the code logic is easier to follow now, and adopting bpf_link approach only on fresh kernels looks like a sound approach.

I was looking at the ebpf library code, and I found a couple of instances of panic. They may or may not directly be in the link or PROG_ATTACH logic, but this could change. Can we recover from panic so that we don't end up crashing the agent?

While the code migration to Go code provides visibility and better error reporting, I've had to resort to cleaning things up (re: bpf_clear_cgroups) out of band a few times in the past. Any thoughts on providing back-up code that could be easily run as an independent binary? Could it be as simple as removing the pinned link for the new approach? This isn't necessarily a blocking comment for this PR, but it would be worthwhile to be considered as a follow-up.

pkg/socketlb/cgroup.go Outdated Show resolved Hide resolved
pkg/socketlb/cgroup.go Outdated Show resolved Hide resolved
pkg/socketlb/cgroup.go Outdated Show resolved Hide resolved
pkg/socketlb/cgroup.go Show resolved Hide resolved
@ti-mo
Copy link
Contributor

ti-mo commented Mar 24, 2023

I was looking at the ebpf library code, and I found a couple of instances of panic. They may or may not directly be in the link or PROG_ATTACH logic, but this could change. Can we recover from panic so that we don't end up crashing the agent?

I know you're just being cautious, but this feels nitpicky. panic is used very sparingly outside of tests and init(), and the caller is not expected to have to recover(). If there are particular cases that worry you, I'd suggest we improve this in the library, not sprinkle recover()s in the agent. If anything, this is something hive could do when launching modules. (cc @joamaki, @lmb)

The only code path that could potentially panic in a real-life scenario is the one in maskProfilerSignal(). We should probably make that return an error, but it's highly unexpected for Pthreadsigmask() to fail. The panic in unmaskProfilerSignal() is also extremely unlikely, though necessary. We need to restore per-thread state to be able to safely call UnlockOSThread(); failure to do so is unsafe.

While the code migration to Go code provides visibility and better error reporting, I've had to resort to cleaning things up (re: bpf_clear_cgroups) out of band a few times in the past. Any thoughts on providing back-up code that could be easily run as an independent binary?

I think that's just part of development, but that doesn't mean we can't maintain tooling to help. With the various ways of attaching programs, this may come in handy in the future.

Could it be as simple as removing the pinned link for the new approach?

Yep! Simply deleting the bpffs pin does the trick, as long as nothing else holds an open fd to the link.

This isn't necessarily a blocking comment for this PR, but it would be worthwhile to be considered as a follow-up.

Yes, let's not increase the scope of this PR and delay it further. Would this fit in cilium cleanup? I think this is part of a wider topic, but please drop an issue with the behaviour you'd like to see so the foundations team can potentially pick this up.

@lmb
Copy link
Contributor

lmb commented Mar 27, 2023

panic is used very sparingly outside of tests and init(), and the caller is not expected to have to recover(). If there are particular cases that worry you, I'd suggest we improve this in the library, not sprinkle recover()s in the agent.

Agreed, users of the lib are never meant to need to recover.

Copy link
Member

@aditighag aditighag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Barring couple of follow-ups, LGTM. Nice work!

While the code migration to Go code provides visibility and better error reporting, I've had to resort to cleaning things up (re: bpf_clear_cgroups) out of band a few times in the past. Any thoughts on providing back-up code that could be easily run as an independent binary?

I think that's just part of development, but that doesn't mean we can't maintain tooling to help. With the various ways of attaching programs, this may come in handy in the future.

Could it be as simple as removing the pinned link for the new approach?

Yep! Simply deleting the bpffs pin does the trick, as long as nothing else holds an open fd to the link.

It'll be good to document this.

This isn't necessarily a blocking comment for this PR, but it would be worthwhile to be considered as a follow-up.

Yes, let's not increase the scope of this PR and delay it further. Would this fit in cilium cleanup? I think this is part of a wider topic, but please drop an issue with the behaviour you'd like to see so the foundations team can potentially pick this up.

Here it is - #24585.

pkg/socketlb/cgroup.go Outdated Show resolved Hide resolved
Copy link
Member

@rolinh rolinh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Vendor changes lgtm

@ti-mo ti-mo force-pushed the move-socketlb-to-go branch 2 times, most recently from 1daf500 to e8296f9 Compare March 30, 2023 12:29
@ti-mo
Copy link
Contributor

ti-mo commented Mar 30, 2023

/test

@ti-mo
Copy link
Contributor

ti-mo commented Mar 31, 2023

/test

@ti-mo
Copy link
Contributor

ti-mo commented Mar 31, 2023

Had to add some extra code that removes defunct links. Link.Update() fails with ENOLINK if the link's cgroup went away. This happens in the l4lb tests, since they run under dind, and the agent attaches to the container's cgroup, not the system's root. On every test step, the agent's container (and cgroup) is blown away, but the link persists on the shared /sys/fs/bpf. Something to keep in mind. @brb

@ti-mo
Copy link
Contributor

ti-mo commented Apr 3, 2023

test-runtime hit #24653, all other required tests are green.

rgo3 and others added 5 commits April 3, 2023 10:18
Signed-off-by: Robin Gögge <r.goegge@isovalent.com>
bpf.LoadCollection() would fail on kernels not supporting getpeername
since it tries loading all programs in the ELF.

Signed-off-by: Timo Beckers <timo@isovalent.com>
With subsequent patches making use of bpffs directories other than tc/globals,
clear up the terminology and reserve the term 'root' for the equivalent of
/sys/fs/bpf. Use 'TCGlobals' for tc/globals and 'Cilium' for cilium/.

Signed-off-by: Timo Beckers <timo@isovalent.com>
Signed-off-by: Robin Gögge <r.goegge@isovalent.com>
With this commit we translate code from bpf/init.sh that compiled and
loaded bpf code for the socketlb feature to a pure go implementation.

The main difference to the init.sh code is that now on a fresh install
of cilium, cilium will leverage bpf links to attach programs to cgroups
if bpf_link is available in the kernel.

Test cases focus on testing scenarios that occur on clean agent
startup, restart and upgrade for both attaching and detaching bpf
programs to/from cgroups.

Co-authored-by: Timo Beckers <timo@isovalent.com>
Signed-off-by: Robin Gögge <r.goegge@isovalent.com>
@ti-mo
Copy link
Contributor

ti-mo commented Apr 3, 2023

/test

Job 'Cilium-PR-K8s-1.24-kernel-5.4' failed:

Click to show.

Test Name

K8sDatapathConfig Host firewall With native routing

Failure Output

FAIL: Error deleting resource /home/jenkins/workspace/Cilium-PR-K8s-1.24-kernel-5.4/src/github.com/cilium/cilium/test/k8s/manifests/host-policies.yaml: Cannot retrieve "cilium-zhkz7"'s policy revision: cannot get policy revision: ""

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.24-kernel-5.4 so I can create one.

@ti-mo
Copy link
Contributor

ti-mo commented Apr 3, 2023

/test-1.24-5.4

@ti-mo
Copy link
Contributor

ti-mo commented Apr 3, 2023

/test-1.26-net-next

@ti-mo ti-mo added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Apr 3, 2023
@squeed squeed merged commit 68fd9ee into cilium:master Apr 4, 2023
56 of 57 checks passed
@ti-mo ti-mo deleted the move-socketlb-to-go branch April 17, 2023 06:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-blocker/1.14 This issue will prevent the release of the next version of Cilium. release-note/misc This PR makes changes that have no direct user impact. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. sig/loader Impacts the loading of BPF programs into the kernel.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

init.sh: move socket-lb creation to Go package
8 participants