cephadm: use split cgroup strategy for podman #40025
Conversation
Never heard of cgroups=split before, thanks! This PR relates to #39550. What is your use case? Which resource do you want to limit? I'd expect this to be super cumbersome to do manually for a big cluster.
@sebastian-philipp We have a pretty small cluster (8 hosts, 19 OSDs), and we reuse our high-performance machine learning servers as ceph nodes. But sometimes machine learning jobs take too much memory or CPU, or even cause OOM, degrading the performance or data availability of the ceph cluster. So I want to use cgroups to reserve some memory and CPU resources for ceph and other important system daemons. I would probably configure systemd manually on each host for this purpose. I imagine admins could use tools like ansible for larger clusters. But big clusters usually use nodes dedicated to ceph, so they probably don't need this feature?
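To make the intended workflow concrete: once podman runs with --cgroups=split, the container's cgroups live under the systemd service's cgroup, so ordinary systemd resource-control directives apply. The drop-in below is a hypothetical sketch of what an admin might write manually; the unit name (with a placeholder fsid) and all values are illustrative, not something cephadm generates.

```ini
# Hypothetical drop-in: /etc/systemd/system/ceph-<fsid>@osd.5.service.d/resources.conf
# Only effective once podman runs the container with --cgroups=split.
[Service]
MemoryMin=4G     # reserve memory for the OSD even under host memory pressure
MemoryHigh=8G    # start throttling the OSD before it hits any hard limit
CPUWeight=500    # prioritize over default-weight units (systemd default is 100)
```

After `systemctl daemon-reload` and a service restart, these limits would apply to the OSD container without touching podman's own resource flags.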
Looks like cephadm is already in use for hyperconverged deployments! 😄
Could you please give #39550 a review? Would be great to know if Sage's PR would fit your needs as well.
cephadm should not rely on external tools like ansible to manage ceph containers. Only as a last resort. Have you tried to manually set
Might be. But hyperconverged is definitely in-scope for cephadm.
I took a look at #39550. From my understanding, only the memory limit will take effect. But I want to reserve memory (e.g.
Yes, it works as expected. But it does not prevent the OSD daemons from being swapped out due to unexpected high memory usage from ML jobs. Then the daemons become extremely slow, or are even OOM killed.
I think cephadm should give admins the ability to manage system resources at a higher level. This may involve non-ceph daemons, and systemd can do this. Anyway, this PR is simple and should not introduce any negative impact.
According to the podman compatibility table, we still support podman 2.0, so we may need to detect the podman version before enabling this. What do you think? Or we could just drop pre-2.1.0 version support.
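A minimal sketch of what such version gating could look like (podman gained --cgroups=split in 2.1.0). The function name and the tuple-based version argument are illustrative, not cephadm's actual internals:

```python
# Sketch: only emit --cgroups=split when the detected podman version
# supports it. Assumes the version has already been parsed into a
# (major, minor, patch) tuple, e.g. by cephadm's engine detection.
def cgroups_split_args(podman_version):
    """Return extra `podman run` args for the given version tuple."""
    if podman_version >= (2, 1, 0):  # flag introduced in podman 2.1.0
        return ['--cgroups=split']
    return []  # older podman: fall back to the default cgroup strategy

print(cgroups_split_args((3, 0, 1)))  # ['--cgroups=split']
print(cgroups_split_args((2, 0, 5)))  # []
```

Tuple comparison handles the boundary correctly (2.1.0 itself qualifies, 2.0.x does not), so no extra version-parsing library is needed.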
Problematic: we need to enable users to upgrade to a podman version supported by Octopus to enable the upgrade to Pacific.
Force-pushed from 34aa789 to abb80a7.
I like your refactorization. Thanks!
I like the refactorization here and have no issue with using this flag to increase admin ability to control resources. However, I did get an error when trying to redeploy some of my osd daemons with this change added in.
Checking journalctl I saw this issue was caused by
Error: writing file `/sys/fs/cgroup/system.slice/system-ceph\x2dab095f16\x2d8342\x2d11eb\x2db029\x2d5254004549fd.slice/ceph-ab095f16-8342-11eb-b029-5254004549fd@osd.5.service/cgroup.subtree_control`: Operation not supported: OCI runtime error
This was on Fedora 31 using podman 2.1.1. From what I've seen looking into this, it might just be system specific. Perhaps there's some system setting that needs to be checked before adding this flag?
@huww98 have you seen anything similar when redeploying osds? Other than this issue I'm fine with this change.
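One host-side precondition admins could check before enabling the flag is whether the host runs the unified cgroup v2 hierarchy, which --cgroups=split relies on. This is only a guess at the kind of setting worth checking, not a confirmed diagnosis of the Fedora error above; the function name is made up for illustration:

```python
# Sketch: detect the unified cgroup v2 hierarchy. On a pure cgroup v2
# host, the marker file /sys/fs/cgroup/cgroup.controllers exists;
# on cgroup v1 or hybrid hosts it does not.
import os

def cgroup_v2_unified():
    return os.path.exists('/sys/fs/cgroup/cgroup.controllers')

print(cgroup_v2_unified())
```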
@adk3798 Sorry, I don't know why. And I don't have a system with cgroups v2 for testing right now. May I ask how you deploy a test cluster with cephadm? Since this change only involves Python code, I would like to avoid compiling the whole project myself. I did the proof of concept by overriding the systemd file on Ubuntu 18.04 with podman 3.0.1.
So, it did work in Sepia.
I've been trying this on some other distros since my original comment (CentOS and Ubuntu) and have seen no issues on either. It seems like it might be specific to (a certain version of? only with my settings?) Fedora. I was seeing other cgroup-related issues outside of this PR on Fedora as well. This was on Fedora 33 right after a bootstrap.
That podman command is something we call to get container memory usage. If I'm seeing other cgroup issues outside of this PR on my setups, then I don't think the issues I've seen should block this from being merged. It's probably something on my end.
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved |
This allows us to store additional information about the engine, apart from its path. Signed-off-by: 胡玮文 <huww98@outlook.com>
Since systemd will create a cgroup for each service, we can instruct podman to just split the current cgroup into sub-cgroups. This enables system admins to use resource control features from systemd. Signed-off-by: 胡玮文 <huww98@outlook.com>
jenkins test make check
@sebastian-philipp @adk3798 Hi folks, is this change good to merge?
I'm personally fine with these changes as long as teuthology passes.
Since systemd will create a cgroup for each service, we can instruct podman to just split the current cgroup into sub-cgroups. This enables system admins to use resource control features from systemd.
Signed-off-by: 胡玮文 huww98@outlook.com
Checklist
Show available Jenkins commands
jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test api
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox