
cephadm: use split cgroup strategy for podman #40025

Merged
merged 2 commits from the cg-split branch on Apr 5, 2021

Conversation

@huww98 (Contributor) commented Mar 11, 2021

Since systemd will create a cgroup for each service, we can instruct podman to just split the current cgroup into sub-cgroups. This enables system admins to use resource control features from systemd.

Signed-off-by: 胡玮文 <huww98@outlook.com>
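
For illustration only: a rough sketch of where this flag ends up. The exact command line cephadm generates differs (cephadm's unit files invoke a generated unit.run script), and the unit and container names below are placeholders. With --cgroups=split, podman creates its sub-cgroups (such as the `container` cgroup seen later in this thread) directly under the cgroup of the invoking systemd service rather than in a separate scope.

```
# Hypothetical excerpt, not the literal cephadm output: a podman invocation
# run from a systemd service using the split cgroup strategy.
ExecStart=/usr/bin/podman run --rm --cgroups=split --name ceph-<fsid>-osd.0 ... <image>
```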

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Available Jenkins commands:
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

@huww98 requested a review from a team as a code owner on March 11, 2021 06:27
@sebastian-philipp (Contributor) commented:

Never heard of cgroups=split before, thanks!

This PR relates to #39550. What is your use case? Which resource do you want to limit?

I'd expect this to be super cumbersome to do manually for a big cluster.

@huww98 (Contributor, Author) commented Mar 11, 2021

@sebastian-philipp We have a pretty small cluster (8 hosts, 19 OSDs), and we reuse our high-performance machine learning servers as ceph nodes. But sometimes machine learning jobs take too much memory or CPU, or even cause OOM, degrading the performance or data availability of the ceph cluster. So I want to use cgroups to reserve some memory and CPU resources for ceph and other important system daemons.

I would probably configure systemd manually on each host for this purpose. I would imagine admins can use tools like ansible for larger clusters. But I think big clusters usually use nodes dedicated to ceph, so they probably don't need this feature?

@sebastian-philipp (Contributor) commented Mar 11, 2021

> @sebastian-philipp We have a pretty small cluster (8 hosts, 19 OSDs), and we reuse our high-performance machine learning servers as ceph nodes.

Looks like cephadm is already in use for hyperconverged deployments! 😄

> But sometimes machine learning jobs take too much memory or CPU, or even cause OOM, degrading the performance or data availability of the ceph cluster. So I want to use cgroups to reserve some memory and CPU resources for ceph and other important system daemons.

Could you please give #39550 a review? Would be great to know if Sage's PR would fit your needs as well.

> I would probably configure systemd manually on each host for this purpose. I would imagine admins can use tools like ansible for larger clusters.

cephadm should not rely on external tools like ansible to manage ceph containers. Only as a last resort.

Have you tried to manually set osd_memory_target? Did it work for you?

> But I think big clusters usually use nodes dedicated to ceph, so they probably don't need this feature?

Might be. But hyperconverged is definitely in-scope for cephadm.

@huww98 (Contributor, Author) commented Mar 11, 2021

> Could you please give #39550 a review? Would be great to know if Sage's PR would fit your needs as well.

I took a look at #39550. From my understanding, only the memory limit will take effect. But I want to reserve memory (e.g. MemoryMin from systemd.resource-control(5)). Besides, it sets per-container memory limits, and I don't like such fine-grained resource control. I want to set up a cgroup that contains all ceph daemons, possibly together with some non-ceph daemons, then just set up controls on this cgroup and let the kernel distribute resources among them. I also want to reserve CPU resources (e.g. CPUWeight from systemd.resource-control(5)).
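
As a purely illustrative sketch of that kind of coarse-grained control (the slice name, drop-in path, and values here are hypothetical, not part of this PR): once the daemons run split inside systemd-managed cgroups, an admin could group the relevant units under a slice and set reservations on the slice itself, for example:

```
# Hypothetical drop-in, e.g. /etc/systemd/system/ceph.slice.d/resources.conf
[Slice]
# Reserve memory for everything in this slice (see systemd.resource-control(5))
MemoryMin=16G
# Favor this slice for CPU time under contention
CPUWeight=500
```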

> Have you tried to manually set osd_memory_target? Did it work for you?

Yes, it works as expected. But it does not prevent the OSD daemons from being swapped out due to unexpectedly high memory usage from ML jobs; then the daemons become extremely slow, or are even OOM-killed.
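
(For context: osd_memory_target is the Ceph option the OSD uses to autotune its caches toward a memory target, typically set with something like `ceph config set osd osd_memory_target 4294967296`; the value here is only an example. As noted above, it limits the OSD's own memory use but does not shield the daemon from memory pressure caused by other workloads.)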

> But hyperconverged is definitely in-scope for cephadm.

I think cephadm should give admins the ability to manage system resources at a higher level, which may involve non-ceph daemons, and systemd can do this.

Anyway, this PR is simple, and should not introduce any negative impact.

@huww98 (Contributor, Author) commented Mar 11, 2021

From the podman compatibility table, we still support podman 2.0, so we may need to detect the podman version before enabling this; --cgroups=split was introduced in 2.1.0.

What do you think? Or we could just drop support for pre-2.1.0 versions.
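
For context, a minimal Python sketch (cephadm itself is a Python script) of the kind of version gate being discussed; the function names and the way the version is probed here are assumptions for illustration, not the code this PR adds:

```python
import re
import subprocess

# --cgroups=split was introduced in podman 2.1.0
MIN_PODMAN_FOR_CGROUPS_SPLIT = (2, 1, 0)

def podman_version():
    # 'podman --version' prints e.g. "podman version 2.1.1"
    out = subprocess.check_output(['podman', '--version']).decode()
    m = re.search(r'(\d+)\.(\d+)\.(\d+)', out)
    return tuple(int(x) for x in m.groups()) if m else (0, 0, 0)

def cgroups_split_args():
    # Only pass the flag when the installed podman understands it,
    # so older, still-supported podman releases keep working.
    if podman_version() >= MIN_PODMAN_FOR_CGROUPS_SPLIT:
        return ['--cgroups=split']
    return []
```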

@sebastian-philipp (Contributor) commented:

> From the podman compatibility table, we still support podman 2.0, so we may need to detect the podman version before enabling this; --cgroups=split was introduced in 2.1.0.
>
> What do you think? Or we could just drop support for pre-2.1.0 versions.

Problematic. We need to enable users to upgrade to a podman version supported by Octopus in order to enable the upgrade to Pacific.

@huww98 force-pushed the cg-split branch 5 times, most recently from 34aa789 to abb80a7, on March 12, 2021 02:24
@sebastian-philipp (Contributor) commented:

I like your refactoring. Thanks!

@adk3798 (Contributor) left a comment:


I like the refactoring here and have no issue with using this flag to increase admins' ability to control resources. However, I did get an error when trying to redeploy some of my osd daemons with this change added in.

[screenshot: error output from the OSD redeploy attempt]

Checking journalctl, I saw this issue was caused by:

Error: writing file `/sys/fs/cgroup/system.slice/system-ceph\x2dab095f16\x2d8342\x2d11eb\x2db029\x2d5254004549fd.slice/ceph-ab095f16-8342-11eb-b029-5254004549fd@osd.5.service/cgroup.subtree_control`: Operation not supported: OCI runtime error

This was on Fedora 31 using podman 2.1.1. From what I've seen looking into this, it might just be system-specific. Perhaps there's some system setting that needs to be checked before adding this flag?
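
As a debugging hint (an aside, not something this PR adds): the error path above implies the unified cgroup v2 hierarchy, which can be confirmed on a given host with the usual filesystem-type check:

```
$ stat -fc %T /sys/fs/cgroup
cgroup2fs    # unified v2 hierarchy; 'tmpfs' would indicate the legacy v1 layout
```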

@huww98 have you seen anything similar when redeploying osds? Other than this issue I'm fine with this change.

@huww98 (Contributor, Author) commented Mar 13, 2021

@adk3798 Sorry, I don't know why, and I don't have a system with cgroups v2 for testing right now. May I ask how you deploy a test cluster with cephadm? Since this change only involves Python code, I would like to avoid compiling the whole project myself. I did the proof of concept by overriding the systemd file on Ubuntu 18.04 with podman 3.0.1.

@sebastian-philipp (Contributor) commented:

So - it did work in Sepia.

@sebastian-philipp removed the wip-swagner-testing (My Teuthology tests) label on Mar 17, 2021
@adk3798 (Contributor) commented Mar 17, 2021

> So - it did work in Sepia.

I've been trying this on some other distros since my original comment (CentOS and Ubuntu) and have seen no issues on either. It seems like it might be specific to (a certain version of? only with my settings?) Fedora. I was seeing other cgroup-related issues outside of this PR on Fedora as well.

This was on Fedora 33 right after a bootstrap.

[root@vm-00 ~]# podman stats --format {{.ID}},{{.MemUsage}} --no-stream
WARN[0000] Failed to retrieve cgroup stats: open /sys/fs/cgroup/system.slice/system-ceph\x2dcb4eeaf2\x2d8743\x2d11eb\x2d8f8c\x2d5254005b6c4f.slice/ceph-cb4eeaf2-8743-11eb-8f8c-5254005b6c4f@mon.vm-00.service/container/memory.current: no such file or directory 
WARN[0000] Failed to retrieve cgroup stats: open /sys/fs/cgroup/system.slice/system-ceph\x2dcb4eeaf2\x2d8743\x2d11eb\x2d8f8c\x2d5254005b6c4f.slice/ceph-cb4eeaf2-8743-11eb-8f8c-5254005b6c4f@mon.vm-00.service/container/pids.current: no such file or directory 
WARN[0000] Failed to retrieve cgroup stats: open /sys/fs/cgroup/system.slice/system-ceph\x2dcb4eeaf2\x2d8743\x2d11eb\x2d8f8c\x2d5254005b6c4f.slice/ceph-cb4eeaf2-8743-11eb-8f8c-5254005b6c4f@mgr.vm-00.ksymvv.service/container/memory.current: no such file or directory 
WARN[0000] Failed to retrieve cgroup stats: open /sys/fs/cgroup/system.slice/system-ceph\x2dcb4eeaf2\x2d8743\x2d11eb\x2d8f8c\x2d5254005b6c4f.slice/ceph-cb4eeaf2-8743-11eb-8f8c-5254005b6c4f@mgr.vm-00.ksymvv.service/container/pids.current: no such file or directory 

27f841182d50,20.32MB / 4.119GB
31936061766d,-- / --
44ef03138384,25.74MB / 4.119GB
5db7aeb47594,-- / --
73a4fbfa3ec0,7.926MB / 4.119GB
b260e4a669d7,28.2MB / 4.119GB
b52bb267032d,6.795MB / 4.119GB
d05ee16a635d,10.7MB / 4.119GB
efbcc0c8dd04,25.95MB / 4.119GB

That podman command is something we call to get container memory usage. If I'm seeing other cgroup issues outside of this PR on my setups, then I don't think the issues I've seen should stop this from being merged. It's probably something on my end.

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved


The 2 commits in this pull request:

  • This allows us to store additional information about the engine apart from its path.

    Signed-off-by: 胡玮文 <huww98@outlook.com>

  • Since systemd will create a cgroup for each service, we can instruct podman to just split the current cgroup into sub-cgroups. This enables system admins to use resource control features from systemd.

    Signed-off-by: 胡玮文 <huww98@outlook.com>
@huww98 (Contributor, Author) commented Mar 21, 2021

jenkins test make check

@huww98 mentioned this pull request on Apr 5, 2021
@tchaikov (Contributor) commented Apr 5, 2021

@sebastian-philipp @adk3798 hi folks, is this change good to merge?

@adk3798 (Contributor) commented Apr 5, 2021

> @sebastian-philipp @adk3798 hi folks, is this change good to merge?

I'm personally fine with these changes as long as teuthology passes.
