kpr: Enable/fix Cilium socket based load balancing in different environments #16259
Conversation
Thanks for the patch!
The patch in its current state does not seem to work for me [1]. It's probably because Cilium tries to mount in /var/run/cilium
and it uses this directory (which is not the global one). Passing a --cgroup-root
also does not seem to work, because then the detection will not run.
I'm confident that we can get the patch to work, but I would like to propose leaving the detection logic outside of the cilium-agent and just passing a proper argument with --cgroup-root. We could even pass two options, one for the root and one for where to attach the BPF programs, if we care about this distinction. We already had a special case for Kind, and now we are going to add another one; I think we should keep this complexity out of the agent.
[1] These are the logs:
level=info msg="Mounted cgroupv2 filesystem at /var/run/cilium/cgroupv2" subsys=cgroups
level=warning msg="Failed to determine cgroup v2 hierarchy for Kind node. Socket-based LB (--enable-host-reachable-services) will not work." error="cannot find \"kubelet\" in cgroup v2 path: \"/var/run/cilium/cgroupv2\"" subsys=cgroups
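For illustration, the flag-based approach suggested above might look roughly like the following; the mount point and exact agent invocation are assumptions, not taken from this thread:

```sh
# Hypothetical sketch: skip in-agent detection and point the agent at an
# explicitly prepared cgroup v2 root.
mount -t cgroup2 none /sys/fs/cgroup 2>/dev/null || true   # ensure a cgroup2 mount exists
cilium-agent --enable-host-reachable-services --cgroup-root=/sys/fs/cgroup
```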
Unfortunately we need to include the host itself, as otherwise an application running on the host won't be able to access a ClusterIP service via bpf_sock. So, finding a common root for all pods won't work.
Yeah, it was at the back of my mind, but I wasn't really sure how it fits into the 2nd scenario above (Virtualized cgroup root). 🤔 I also would like to keep the detection logic to only the specific cases that I've called out above, and leave the default as-is. @borkmann Do you have any explanation for the #15137-like environments where (cgroup root as seen by Cilium) != (host cgroup root) because of cgroup namespaces? We weren't quite able to figure that out in the Slack thread. It'll also be helpful to understand why the cgroup fs isn't mounted by the init container similarly to the bpf fs, and whether there are any implications of doing that.
Ah! I wonder if this unlocks the mystery of the 2nd scenario - https://docs.docker.com/config/containers/runmetrics/#running-docker-on-cgroup-v2. Discussed the overall PR with @borkmann offline; he also pointed out
@aditighag Aha, good find! I'm using Docker with the systemd cgroup driver (=cgroup v2), and I have completely disabled cgroupv1 on my host.
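To check which cgroup driver and version Docker is using, and which cgroup namespace mode a container was created with, something like the following should work on recent Docker releases (the format keys are assumptions on my part):

```sh
# Cgroup driver (cgroupfs vs systemd) and cgroup version (1 or 2) of the daemon
docker info --format 'driver={{.CgroupDriver}} version={{.CgroupVersion}}'

# Cgroup namespace mode of a running container (private vs host)
docker inspect --format '{{.HostConfig.CgroupnsMode}}' <container-name>
```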
I think it might be possible to run on Kind when cgroupns=private, as otherwise the detection of the common subhierarchy won't work. Anyway, let's discuss it during this week's sig-datapath.
Discussed the overall issue in the sig-datapath meeting (05/27); the suggestion was to also mount the host cgroupv2 fs (prior to the cilium-agent running) from the init script. This is similar to how we mount the BPF fs.
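A minimal sketch of what such an init-script mount could look like (the mount point and guard are assumptions, not the actual Cilium init logic):

```sh
#!/bin/sh
# Mount the host's cgroup v2 hierarchy at a well-known location for the agent,
# mirroring how the BPF fs is prepared before the agent starts.
CGROUP_ROOT=/run/cilium/cgroupv2   # assumed mount point for illustration

mkdir -p "$CGROUP_ROOT"
if ! mountpoint -q "$CGROUP_ROOT"; then
  mount -t cgroup2 none "$CGROUP_ROOT"
fi
```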
Some relevant examples:
As you can see, the
@brb I don't follow your comment. The cgroup hierarchies are relative to the top-level cgroup root. As long as we attach the BPF programs at every kind node's cgroup root, it should work, no? I mounted the cgroup fs as part of an init container, disabled the kind detection logic, and socket-lb worked fine. See details about cgroup paths - #16078 (comment). Can you elaborate more?
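To illustrate the relative-path point: `/proc/<pid>/cgroup` shows a process's cgroup path relative to the cgroup root of the reader's namespace, so the same process can show different paths on the host and inside a namespaced container (the example paths below are illustrative):

```sh
# On the host: full path under the host cgroup root, e.g.
# 0::/kubelet/kubepods/burstable/pod<uid>/<container-id>
cat /proc/$(pgrep -f cilium-agent | head -n1)/cgroup

# Inside a container running in its own cgroup namespace, the same kind of
# lookup typically shows a path relative to that namespace's root, e.g. 0::/
cat /proc/self/cgroup
```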
With cgroup namespaces on, the node-init container will mount its own (virtualized) cgroup root. With cgroup namespaces off, the node-init container will mount the host's cgroup root. So we would still need to find an appropriate hierarchy for each cilium-agent when running on Kind.
I think you can ignore what I wrote above, as it is no longer relevant. The latest findings are that even with cgroup NS on, on Kind we are going to run the cilium-agent pod in the same cgroup NS as the Kind node container. Meaning that if we mount the cgroupfs from the node-init and then propagate the mount via DaemonSet into the cilium-agent pod, then cilium-agent will attach the bpf_sock programs to the right cgroup root 🎉 Next step is to check in what cgroup NS the node-init is running.
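A quick way to check that "next step" — whether a container such as node-init shares the host's cgroup namespace — is to compare namespace inodes (this assumes hostPID so that /proc/1 refers to the host's PID 1):

```sh
# Identical links => same cgroup namespace as the host's PID 1
readlink /proc/self/ns/cgroup   # cgroup NS of this shell/container
readlink /proc/1/ns/cgroup      # cgroup NS of (host) PID 1, requires hostPID
```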
I've tried running the following Pod (on a regular k8s node, i.e. not on Kind), which should have the same privileges and configuration as the node-init:
Unfortunately, it's running in a different cgroup NS than the host 😞
That's expected, since you are only running in the network and pid namespaces of the host. In the node-init container, we use
But that's only for the net and mount ns? I think the node-init will still run in a container's cgroup ns, which we want to avoid.
We need to specify
We can confirm that node-init can run in the same cgroup ns as the host (with
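Assuming the truncated confirmation above refers to nsenter's cgroup flag, a hedged sketch of performing the mount from the host's cgroup (and mount) namespaces would be:

```sh
# Enter the host's cgroup and mount namespaces via the host's PID 1
# (requires hostPID and a privileged container; flags and paths are illustrative).
nsenter --target 1 --cgroup --mount -- \
  sh -c 'mkdir -p /run/cilium/cgroupv2 && mount -t cgroup2 none /run/cilium/cgroupv2'
```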
* Add init container to auto-mount /sys/fs/cgroup cgroup2 at /run/cilium/cgroupv2 for the Cilium agent
* Enable CNI exclusive mode, to disable other configs found in /etc/cni/net.d/
* cilium/cilium#16259
* On Fedora CoreOS, Cilium cross-node service IP load balancing stopped working for a time (first observable as CoreDNS pods located on worker nodes not being able to reach the kubernetes API service 10.3.0.1). This turned out to have two parts:
  * Fedora CoreOS switched to cgroups v2 by default. In our early testing with cgroups v2, Calico (default) was used. With the cgroups v2 change, SELinux policy denied some eBPF operations. Since fixed in all Fedora CoreOS channels
  * Cilium requires new mounts to support cgroups v2, which are added here
* coreos/fedora-coreos-tracker#292
* coreos/fedora-coreos-tracker#881
* cilium/cilium#16259
We need to mount cgroup2 filesystem on the underlying host in order to enable socket-based load-balancing in environments with container runtime cgroupv2 configurations. See issues for more details - cilium/cilium#16259 and cilium/cilium#16815.
For kube-proxy replacement (specifically, socket-based load-balancing) to work correctly in KIND clusters, the BPF cgroup programs need to be attached at the correct cgroup hierarchy. For this to happen, the KIND nodes need to have their own separate cgroup namespace. More details in PR - cilium#16259. While cgroup namespaces are supported across both cgroup v1 and v2 modes, container runtimes like Docker enable private cgroup namespace mode by default only with cgroup v2 [1]. With cgroup v1, the default is host cgroup namespace, whereby KIND node containers (and also cilium agent pods) are created in the same cgroup namespace as the underlying host. [1] https://docs.docker.com/config/containers/runmetrics/#running-docker-on-cgroup-v2 Signed-off-by: Aditi Ghag <aditi@cilium.io>
This PR aims to revisit some of the assumptions made around cgroup hierarchies in Cilium in order to enable socket-lb in different environments.
Context
Cilium attaches `BPF_CGROUP_*` type programs to provide socket-based load-balancing. The default cgroup root in the agent is set to a custom location (`/var/run/cilium/cgroupv2`), where the agent tries to mount the cgroup filesystem. The cgroup root is then passed to `init.sh` in order to attach the `BPF_CGROUP_*` programs at the relevant hook points. While we have some extended logic in place to accommodate environments like `kind`, the overall logic breaks in certain scenarios. The following list is not exhaustive, but it helps in identifying general patterns.
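For illustration only (this is not the actual `init.sh`), attaching a pinned socket-LB program at a cgroup root with bpftool looks roughly like this; the pin path is an assumed example:

```sh
CGROUP_ROOT=/var/run/cilium/cgroupv2   # cgroup root mounted by the agent

# Attach a (hypothetically pinned) connect4 program at the cgroup root so it
# applies to every socket created anywhere below that hierarchy.
bpftool cgroup attach "$CGROUP_ROOT" connect4 pinned /sys/fs/bpf/cilium_sock4_connect

# List what is attached below the root to verify placement.
bpftool cgroup tree "$CGROUP_ROOT"
```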
Scenarios where the current logic breaks

Virtualized cgroup root in the cgroup namespace mode (Cilium attaching to the wrong cgroup #15137)
If the container runtime runs with cgroup v2, the Cilium agent pod will be deployed in a separate cgroup namespace. For example, the Docker container runtime switched to private cgroup namespace mode as the default once it gained cgroup v2 support. Due to cgroup namespaces, the cgroup fs mounted by the Cilium pod points to a virtualized cgroup hierarchy instead of the host cgroup root. As a result, BPF programs are attached to the nested cgroup root, and socket-lb isn't effective for other pods.
Resolution:
Mount the cgroup fs on the host from init containers. We need to specify cgroup as the enterable namespace in the nsenter command. The Cilium agent will auto-mount the cgroup2 fs on the underlying host if it is not already mounted. This requires temporarily mounting the host's `/proc` inside an init container. As an alternative, users can disable the auto-mount and specify a mount point on the host where the cgroup2 fs is already mounted. The cgroup2 fs mount point is platform dependent, hence we introduce a new helm option for the host cgroup2 fs mount point. See this note in the cgroups man page:

> Note that on many modern systems, systemd(1) automatically mounts the cgroup2 filesystem at /sys/fs/cgroup/unified during the boot process.
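To see where (or whether) the cgroup2 fs is already mounted on a given host, and as a rough shape for the helm-based alternative (the option names below are assumptions, not necessarily the exact ones introduced by this PR):

```sh
# Locate existing cgroup2 mounts on the host; systemd hosts typically use
# /sys/fs/cgroup (unified) or /sys/fs/cgroup/unified (hybrid).
findmnt -t cgroup2

# Hypothetical helm invocation: disable auto-mount and point Cilium at the
# host's existing cgroup2 mount point (option names are illustrative).
helm upgrade cilium cilium/cilium --namespace kube-system \
  --set cgroup.autoMount.enabled=false \
  --set cgroup.hostRoot=/sys/fs/cgroup
```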
Commit 866969e hard-coded the `kubelet` string, which may not work on some platforms. Depending on the value of the kubelet config `--cgroup-root`, this string may or may not be present. Moreover, the logic to get the cgroup root is specific to `kind` environments only, so it won't take effect for minikube clusters.

Resolution:
The kind cgroup root detection logic can be removed, as kind nodes and the cilium-agent pod are deployed in the same cgroup ns (if they aren't, the above init container fix should be sufficient).
Deploying a kind cluster alongside "other" BPF cgroup programs (Kind with socket-lb doesn't work in dev VM + update kind docs #16078)
I ran into this issue while deploying a kind cluster on the dev VM. The Cilium pod inside the kind cluster fails to come up with the warning message `failed to attach program`, and is stuck in a CrashLoopBackOff state. We can have the `cilium/ebpf` loader print better error messages/hints in such cases (I'll file an issue).

I traced the error return code `255` (i.e., operation not permitted) to this check [1] in the kernel source, which disallows attaching in the presence of programs (no override/multi) in the parent cgroup. After I removed the BPF programs attached in the dev VM cgroup root, I was able to create a kind cluster with socket-lb successfully.

[1] https://elixir.bootlin.com/linux/latest/source/kernel/bpf/cgroup.c#L457
Resolution:
We document this case along with potential steps to resolve the issue.
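As a sketch of those documented steps (assuming the cgroup2 fs is mounted at /sys/fs/cgroup on the dev VM; the attach type and program ID are examples to be replaced with the actual output):

```sh
# Show every cgroup BPF program attached on the host; a program attached at the
# top-level root without the override/multi flags blocks attaches in child cgroups.
bpftool cgroup tree /sys/fs/cgroup

# Detach the offending program from the root before creating the kind cluster,
# using the attach type and program ID reported by the tree output.
bpftool cgroup detach /sys/fs/cgroup connect4 id 1234
```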
Testing
Tested the changes on kind, GKE and bare metal k8s cluster by verifying that BPF programs are correctly attached, and socket-lb works as expected.
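For reference, a rough verification recipe along these lines (the cgroup root path and the service address are placeholders):

```sh
# Confirm the socket-LB programs are attached under the expected cgroup root.
bpftool cgroup tree /run/cilium/cgroupv2

# From the host itself (not only from pods), a ClusterIP should now be reachable,
# since bpf_sock translates the address at connect() time.
curl -sk https://10.96.0.1:443/healthz
```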
Deferred to follow-ups -
Fixes: #16078
Fixes: #15769
Fixes: 866969e
Fixes: #15137
(Reported-by: @kkourt)
Release note