
cephadm: use split cgroup strategy for podman #40025

Merged
merged 2 commits from the cg-split branch on Apr 5, 2021

Conversation

@huww98 (Contributor) commented Mar 11, 2021

Since systemd will create a cgroup for each service, we can instruct podman to just split the current cgroup into sub-cgroups. This enables system admins to use resource control features from systemd.

Signed-off-by: 胡玮文 <huww98@outlook.com>
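
For illustration only: a rough sketch of where this flag ends up. The exact command line cephadm generates differs (cephadm's unit files invoke a generated unit.run script), and the unit and container names below are placeholders. With --cgroups=split, podman creates its sub-cgroups (such as the `container` cgroup seen later in this thread) directly under the cgroup of the invoking systemd service rather than in a separate scope.

```
# Hypothetical excerpt, not the literal cephadm output: a podman invocation
# run from a systemd service using the split cgroup strategy.
ExecStart=/usr/bin/podman run --rm --cgroups=split --name ceph-<fsid>-osd.0 ... <image>
```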

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Available Jenkins commands:
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

@huww98 requested a review from a team as a code owner on March 11, 2021 06:27
@sebastian-philipp (Contributor) commented:

Never heard of cgroups=split before, thanks!

This PR relates to #39550. What is your use case? Which resource do you want to limit?

I'd expect this to be super cumbersome to do manually for a big cluster.

@huww98 (Contributor, Author) commented Mar 11, 2021

@sebastian-philipp We have a pretty small cluster (8 hosts, 19 OSDs), and we reuse our high-performance machine learning servers as ceph nodes. But sometimes machine learning jobs take too much memory or CPU, or even cause OOM, degrading the performance or data availability of the ceph cluster. So I want to use cgroups to reserve some memory and CPU resources for ceph and other important system daemons.

I would probably configure systemd manually on each host for this purpose. I would imagine admins can use tools like ansible for larger clusters. But I think big clusters usually use nodes dedicated to ceph, so they probably don't need this feature?

@sebastian-philipp (Contributor) commented Mar 11, 2021

> @sebastian-philipp We have a pretty small cluster (8 hosts, 19 OSDs), and we reuse our high-performance machine learning servers as ceph nodes.

Looks like cephadm is already in use for hyperconverged deployments! 😄

> But sometimes machine learning jobs take too much memory or CPU, or even cause OOM, degrading the performance or data availability of the ceph cluster. So I want to use cgroups to reserve some memory and CPU resources for ceph and other important system daemons.

Could you please give #39550 a review? Would be great to know if Sage's PR would fit your needs as well.

> I would probably configure systemd manually on each host for this purpose. I would imagine admins can use tools like ansible for larger clusters.

cephadm should not rely on external tools like ansible to manage ceph containers. Only as a last resort.

Have you tried to manually set osd_memory_target? Did it work for you?

> But I think big clusters usually use nodes dedicated to ceph, so they probably don't need this feature?

Might be. But hyperconverged is definitely in-scope for cephadm.

@huww98 (Contributor, Author) commented Mar 11, 2021

> Could you please give #39550 a review? Would be great to know if Sage's PR would fit your needs as well.

I took a look at #39550. From my understanding, only the memory limit will take effect. But I want to reserve memory (e.g. MemoryMin from systemd.resource-control(5)). Besides, it sets per-container memory limits, and I don't like such fine-grained resource control. I want to set up a cgroup that contains all ceph daemons, possibly together with some non-ceph daemons, then just set up controls on this cgroup and let the kernel distribute resources among them. I also want to reserve CPU resources (e.g. CPUWeight from systemd.resource-control(5)).
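
As a purely illustrative sketch of that kind of coarse-grained control (the slice name, drop-in path, and values here are hypothetical, not part of this PR): once the daemons run split inside systemd-managed cgroups, an admin could group the relevant units under a slice and set reservations on the slice itself, for example:

```
# Hypothetical drop-in, e.g. /etc/systemd/system/ceph.slice.d/resources.conf
[Slice]
# Reserve memory for everything in this slice (see systemd.resource-control(5))
MemoryMin=16G
# Favor this slice for CPU time under contention
CPUWeight=500
```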

> Have you tried to manually set osd_memory_target? Did it work for you?

Yes, it works as expected. But it does not prevent the OSD daemons from being swapped out due to unexpectedly high memory usage from ML jobs; then the daemons become extremely slow, or are even OOM-killed.
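
(For context: osd_memory_target is the Ceph option the OSD uses to autotune its caches toward a memory target, typically set with something like `ceph config set osd osd_memory_target 4294967296`; the value here is only an example. As noted above, it limits the OSD's own memory use but does not shield the daemon from memory pressure caused by other workloads.)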

> But hyperconverged is definitely in-scope for cephadm.

I think cephadm should give admins the ability to manage system resources at a higher level, which may involve non-ceph daemons, and systemd can do this.

Anyway, this PR is simple, and should not introduce any negative impact.

@huww98 (Contributor, Author) commented Mar 11, 2021

From the podman compatibility table, we still support podman 2.0, so we may need to detect the podman version before enabling this; --cgroups=split was introduced in 2.1.0.

What do you think? Or we could just drop support for pre-2.1.0 versions.
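
For context, a minimal Python sketch (cephadm itself is a Python script) of the kind of version gate being discussed; the function names and the way the version is probed here are assumptions for illustration, not the code this PR adds:

```python
import re
import subprocess

# --cgroups=split was introduced in podman 2.1.0
MIN_PODMAN_FOR_CGROUPS_SPLIT = (2, 1, 0)

def podman_version():
    # 'podman --version' prints e.g. "podman version 2.1.1"
    out = subprocess.check_output(['podman', '--version']).decode()
    m = re.search(r'(\d+)\.(\d+)\.(\d+)', out)
    return tuple(int(x) for x in m.groups()) if m else (0, 0, 0)

def cgroups_split_args():
    # Only pass the flag when the installed podman understands it,
    # so older, still-supported podman releases keep working.
    if podman_version() >= MIN_PODMAN_FOR_CGROUPS_SPLIT:
        return ['--cgroups=split']
    return []
```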

@sebastian-philipp (Contributor) commented:

> From the podman compatibility table, we still support podman 2.0, so we may need to detect the podman version before enabling this; --cgroups=split was introduced in 2.1.0.
>
> What do you think? Or we could just drop support for pre-2.1.0 versions.

Problematic. We need to enable users to upgrade to a podman version supported by Octopus in order to enable the upgrade to Pacific.

@huww98 force-pushed the cg-split branch 5 times, most recently from 34aa789 to abb80a7, on March 12, 2021 02:24
@sebastian-philipp (Contributor) commented:

I like your refactoring. Thanks!

@adk3798 (Contributor) left a comment:


I like the refactoring here and have no issue with using this flag to increase admins' ability to control resources. However, I did get an error when trying to redeploy some of my osd daemons with this change added in.

[screenshot: error output from the OSD redeploy attempt]

Checking journalctl, I saw this issue was caused by:

Error: writing file `/sys/fs/cgroup/system.slice/system-ceph\x2dab095f16\x2d8342\x2d11eb\x2db029\x2d5254004549fd.slice/ceph-ab095f16-8342-11eb-b029-5254004549fd@osd.5.service/cgroup.subtree_control`: Operation not supported: OCI runtime error

This was on Fedora 31 using podman 2.1.1. From what I've seen looking into this, it might just be system-specific. Perhaps there's some system setting that needs to be checked before adding this flag?
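
As a debugging hint (an aside, not something this PR adds): the error path above implies the unified cgroup v2 hierarchy, which can be confirmed on a given host with the usual filesystem-type check:

```
$ stat -fc %T /sys/fs/cgroup
cgroup2fs    # unified v2 hierarchy; 'tmpfs' would indicate the legacy v1 layout
```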

@huww98 have you seen anything similar when redeploying osds? Other than this issue I'm fine with this change.

@huww98 (Contributor, Author) commented Mar 13, 2021

@adk3798 Sorry, I don't know why, and I don't have a system with cgroups v2 for testing right now. May I ask how you deploy a test cluster with cephadm? Since this change only involves Python code, I would like to avoid compiling the whole project myself. I did the proof of concept by overriding the systemd file on Ubuntu 18.04 with podman 3.0.1.

@sebastian-philipp (Contributor) commented:

So - it did work in Sepia.

@sebastian-philipp removed the wip-swagner-testing (My Teuthology tests) label on Mar 17, 2021
@adk3798 (Contributor) commented Mar 17, 2021

> So - it did work in Sepia.

I've been trying this on some other distros since my original comment (CentOS and Ubuntu) and have seen no issues on either. It seems like it might be specific to (a certain version of? only with my settings?) Fedora. I was seeing other cgroup-related issues outside of this PR on Fedora as well.

This was on Fedora 33 right after a bootstrap.

[root@vm-00 ~]# podman stats --format {{.ID}},{{.MemUsage}} --no-stream
WARN[0000] Failed to retrieve cgroup stats: open /sys/fs/cgroup/system.slice/system-ceph\x2dcb4eeaf2\x2d8743\x2d11eb\x2d8f8c\x2d5254005b6c4f.slice/ceph-cb4eeaf2-8743-11eb-8f8c-5254005b6c4f@mon.vm-00.service/container/memory.current: no such file or directory 
WARN[0000] Failed to retrieve cgroup stats: open /sys/fs/cgroup/system.slice/system-ceph\x2dcb4eeaf2\x2d8743\x2d11eb\x2d8f8c\x2d5254005b6c4f.slice/ceph-cb4eeaf2-8743-11eb-8f8c-5254005b6c4f@mon.vm-00.service/container/pids.current: no such file or directory 
WARN[0000] Failed to retrieve cgroup stats: open /sys/fs/cgroup/system.slice/system-ceph\x2dcb4eeaf2\x2d8743\x2d11eb\x2d8f8c\x2d5254005b6c4f.slice/ceph-cb4eeaf2-8743-11eb-8f8c-5254005b6c4f@mgr.vm-00.ksymvv.service/container/memory.current: no such file or directory 
WARN[0000] Failed to retrieve cgroup stats: open /sys/fs/cgroup/system.slice/system-ceph\x2dcb4eeaf2\x2d8743\x2d11eb\x2d8f8c\x2d5254005b6c4f.slice/ceph-cb4eeaf2-8743-11eb-8f8c-5254005b6c4f@mgr.vm-00.ksymvv.service/container/pids.current: no such file or directory 

27f841182d50,20.32MB / 4.119GB
31936061766d,-- / --
44ef03138384,25.74MB / 4.119GB
5db7aeb47594,-- / --
73a4fbfa3ec0,7.926MB / 4.119GB
b260e4a669d7,28.2MB / 4.119GB
b52bb267032d,6.795MB / 4.119GB
d05ee16a635d,10.7MB / 4.119GB
efbcc0c8dd04,25.95MB / 4.119GB

That podman command is something we call to get container memory usage. If I'm seeing other cgroup issues outside of this PR on my setups, then I don't think the issues I've seen should stop this from being merged. It's probably something on my end.

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved


The 2 commits in this pull request:

  • This allows us to store additional information about the engine apart from its path.

    Signed-off-by: 胡玮文 <huww98@outlook.com>

  • Since systemd will create a cgroup for each service, we can instruct podman to just split the current cgroup into sub-cgroups. This enables system admins to use resource control features from systemd.

    Signed-off-by: 胡玮文 <huww98@outlook.com>
@huww98 (Contributor, Author) commented Mar 21, 2021

jenkins test make check

@huww98 mentioned this pull request on Apr 5, 2021
@tchaikov (Contributor) commented Apr 5, 2021

@sebastian-philipp @adk3798 hi folks, is this change good to merge?

@adk3798 (Contributor) commented Apr 5, 2021

> @sebastian-philipp @adk3798 hi folks, is this change good to merge?

I'm personally fine with these changes as long as teuthology passes.
