Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use runtime.slice and system.slice cgroup settings in k8s variants #1684

Merged
merged 6 commits into from Aug 9, 2021

Conversation

cyrus-mc
Copy link

@cyrus-mc cyrus-mc commented Aug 2, 2021

Issue number:

#1681

Description of changes:

Follow documented best practices (https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/node-allocatable.md#recommended-cgroups-setup) by running runtime components (containerd, kubelet) within their own cgroup. Separate from system components.

This allows setting of kubeReserved and systemReserved to be enforced.

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

@cyrus-mc
Copy link
Author

cyrus-mc commented Aug 2, 2021

The one outstanding "problem" is that the slice/group directory is not created under every controller. For example under cpuset and hugetlb the runtime.slice directory is not created as it is under other controllers such as cpu.

When using enforceNodeAllocatable kubelet configuration setting (which doesn't appear to be supported as of yet in Bottlerocket) kubelet will fail to start with error

Failed to start ContainerManager Failed to enforce Kube Reserved Cgroup Limits on "/runtime": ["runtime"] cgroup does not exist

This is due to the fact that it doesn't find the cgroup under cpuset or hugetlb. I believe this to be a systemd configuration setting/problem but as of yet haven't found out how to correct.

Copy link
Author

@cyrus-mc cyrus-mc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://www.freedesktop.org/software/systemd/man/systemd.special.html

slices.target
A special target unit that sets up all slice units (see systemd.slice(5) for details) that shall always be active after boot. By default the generic system.slice slice unit as well as the root slice unit -.slice are pulled in and ordered before this unit (see below).

Adding slice units to slices.target is generally not necessary. Instead, when some unit that uses Slice= is started, the specified slice will be started automatically. Adding WantedBy=slices.target lines to the [Install] section should only be done for units that need to be always active. In that case care needs to be taken to avoid creating a loop through the automatic dependencies on "parent" slices.

So I removed the slice from After and Wants.

packages/containerd/containerd.service Outdated Show resolved Hide resolved
@cyrus-mc
Copy link
Author

cyrus-mc commented Aug 4, 2021

The one outstanding "problem" is that the slice/group directory is not created under every controller. For example under cpuset and hugetlb the runtime.slice directory is not created as it is under other controllers such as cpu.

When using enforceNodeAllocatable kubelet configuration setting (which doesn't appear to be supported as of yet in Bottlerocket) kubelet will fail to start with error

Failed to start ContainerManager Failed to enforce Kube Reserved Cgroup Limits on "/runtime": ["runtime"] cgroup does not exist

This is due to the fact that it doesn't find the cgroup under cpuset or hugetlb. I believe this to be a systemd configuration setting/problem but as of yet haven't found out how to correct.

@bcressey This is the only outstanding issue I have. Currently I don't believe Bottlerocket allows setting (nor sets) the enforceNodeAllocatable setting. But once that is exposed and if kube-reserved is set the kubelet will fail for the reason stated above.

In my AMI for FedoraCoreOS I created a runtime.service that simply created the directories and that runtime.slice required. But I have to assume there is a more correct way to do that.

@jhaynes jhaynes linked an issue Aug 5, 2021 that may be closed by this pull request
@bcressey
Copy link
Contributor

bcressey commented Aug 5, 2021

@bcressey This is the only outstanding issue I have. Currently I don't believe Bottlerocket allows setting (nor sets) the enforceNodeAllocatable setting. But once that is exposed and if kube-reserved is set the kubelet will fail for the reason stated above.

In my AMI for FedoraCoreOS I created a runtime.service that simply created the directories and that runtime.slice required. But I have to assume there is a more correct way to do that.

systemd doesn't manage cpuset or hugetlb under cgroup v1, per their docs on controller support.

mkdir works to create the hierarchy but nothing would actually add processes from systemd's runtime.slice to cgroup.procs for those controllers. However, that might not matter since it doesn't appear that kube-reserved or system-reserved try to manage the corresponding resources.

@ghost
Copy link

ghost commented Aug 6, 2021

@bcressey This is the only outstanding issue I have. Currently I don't believe Bottlerocket allows setting (nor sets) the enforceNodeAllocatable setting. But once that is exposed and if kube-reserved is set the kubelet will fail for the reason stated above.
In my AMI for FedoraCoreOS I created a runtime.service that simply created the directories and that runtime.slice required. But I have to assume there is a more correct way to do that.

systemd doesn't manage cpuset or hugetlb under cgroup v1, per their docs on controller support.

mkdir works to create the hierarchy but nothing would actually add processes from systemd's runtime.slice to cgroup.procs for those controllers. However, that might not matter since it doesn't appear that kube-reserved or system-reserved try to manage the corresponding resources.

https://github.com/kubernetes/kubernetes/blob/9ff3b7e744b34c099c1405d9add192adbef0b6b1/pkg/kubelet/cm/cgroup_manager_linux.go#L282

Checks for the existence of the controllers whitelisted. I will add to release package creation of directories to satisfy cgroup hierarchy.

@cyrus-mc cyrus-mc requested a review from bcressey August 6, 2021 16:48
@cyrus-mc
Copy link
Author

cyrus-mc commented Aug 9, 2021

@samuelkarp Would like to get this reviewed so can be included in next release.

Copy link
Contributor

@samuelkarp samuelkarp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a small concern with the effects this change will have on the aws-ecs-1 variant, since Amazon ECS does not expect containerd to be in a separate cgroup. However, because none of the cgroup settings (other than the hierarchy) are changed, the only effect should be in passive hierarchy-related behavioral differences (for example, CPU share weights under contention). I think it's reasonable to take this change, and if we end up deciding that we should treat ECS differently later we can always do so.

LGTM

@samuelkarp samuelkarp merged commit 13d5ff1 into bottlerocket-os:develop Aug 9, 2021
@cyrus-mc cyrus-mc deleted the cgroup_hierarchy branch August 10, 2021 04:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use runtime.slice and system.slice cgroup settings in k8s variants
3 participants