Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mgr/cephadm: optimize the daemon ls process #36958

Closed
wants to merge 11 commits into from
Closed

mgr/cephadm: optimize the daemon ls process #36958

wants to merge 11 commits into from

Conversation

pcuzner
Copy link
Contributor

@pcuzner pcuzner commented Sep 3, 2020

The initial ls implementation relied on repeated inspect and exec's against
containers, which makes an the process overly long. This PR splits the process
by using the 'batch' form of ps -a and inspect, to reduce the command
invocations, and where possible avoids running commands inside
containers to determine version. For ceph daemons, the cephadm utility leaves
the version blank - but the mgr/cephadm code them uses it's own metadata to
insert the current version information - again reducing runtime.

Fixes: https://tracker.ceph.com/issues/44055
Signed-off-by: pcuzner@redhat.com

src/cephadm/cephadm Show resolved Hide resolved
src/cephadm/cephadm Show resolved Hide resolved
src/cephadm/cephadm Outdated Show resolved Hide resolved
src/cephadm/cephadm Outdated Show resolved Hide resolved
@sebastian-philipp sebastian-philipp added wip-swagner-testing My Teuthology tests and removed wip-swagner-testing My Teuthology tests labels Sep 7, 2020
src/cephadm/cephadm Show resolved Hide resolved
src/cephadm/cephadm Outdated Show resolved Hide resolved
src/cephadm/cephadm Outdated Show resolved Hide resolved
src/cephadm/cephadm Outdated Show resolved Hide resolved
src/cephadm/cephadm Outdated Show resolved Hide resolved
@jschmid1
Copy link
Contributor

jschmid1 commented Sep 8, 2020

This adds quite a bit of fragile code, I'd love to see some more unittests

@sebastian-philipp sebastian-philipp added the wip-swagner-testing My Teuthology tests label Sep 8, 2020
@pcuzner
Copy link
Contributor Author

pcuzner commented Sep 9, 2020

This adds quite a bit of fragile code, I'd love to see some more unittests

@jschmid1 In addition to the potential of more unittests - it would be helpful to understand where the fragility is so I can address those areas. Could you be more specific please?

@sebastian-philipp
Copy link
Contributor

needs rebase

list_daemons (ls) cephadm command queries the host for daemon
states, and returns metadata. In this patch the data is gathered in
batch mode to cut down on repeated command execution, and any
ceph versions are picked up in mgr/cephadm instead of running a
version command inside every ceph container.

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Updated tests to mock the _get_ceph_metadata call which
allows tests with list_daemons to work as intended

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
The get_ceph_metadata function provides a lookup map of
service to service metadata. In the context of list_daemons we're
only interested in the version info - but this function makes the
metadata available for other purposes too.

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
API calls are made to daemons like prometheus to determine the
version, but if the process hasn't fully initialized the version
info returned is null. This patch catches this condition and also
issues warnings if the version info returned doesn't match the
attributes we're expecting.

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Systemd states for disabled and/or stopped services was showing
Unknown...this patch uses the term 'dead' to align with the state
in systemctl.

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Testing requires the get_ceph_metadata to be mocked. normally,
this function would provide a dict to describe the service to
service metadata lookup. This patch mocks that functionality.

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
@pcuzner pcuzner requested a review from jmolmo September 15, 2020 00:40
Initial toleration patch to support daemons that cephadm sees that
are not container based

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
The deployment pattern should use a unit.run file, so if this is
present we can provide a deployed timestamp

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
@pcuzner
Copy link
Contributor Author

pcuzner commented Sep 18, 2020

@jschmid1 @sebastian-philipp Can you let me know what else needs to be done for this to be merged - or if it's not acceptable, just let me know so I can close it.

@pcuzner pcuzner added the DNM label Sep 22, 2020
Copy link
Member

@jmolmo jmolmo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It works like a charm!!!
(remove the cephadmc file.. it must be a intruder file included in the pr)

@tserong
Copy link
Contributor

tserong commented Sep 25, 2020

Sorry, I just came across this and noticed that the last two commits are starting to add support for daemons that are not containers. Why would we want cephadm to support daemons that aren't containers?

@tchaikov
Copy link
Contributor

@sebastian-philipp ping?

@sebastian-philipp
Copy link
Contributor

sebastian-philipp commented Sep 25, 2020

@sebastian-philipp ping?

being on parental leave right now.

Sorry, I just came across this and noticed that the last two commits are starting to add support for daemons that are not containers. Why would we want cephadm to support daemons that aren't containers?

keepalived and the cephadm daemon (aka "cephadm exporter" right now).

But we should keep non-contaierized daemons as an exception

@wjwithagen
Copy link
Contributor

Sorry, I just came across this and noticed that the last two commits are starting to add support for daemons that are not containers. Why would we want cephadm to support daemons that aren't containers?

For me is the fact that cephadm seems to depend on containers one of the reasons not to look into this further.
On FreeBSD there is little that would get containerisation (I assume that would look like a jail)
that is compatible with the linux variants.

So hence this caught my eye as a possible hint of usability of cephadm on FreeBSD.

Hopefully I'm not completely of track with this?

@sebastian-philipp
Copy link
Contributor

I don't see any Teuthology tests OR unit tests or anything in this PR. This would certainly speed up things, if we can validate that things are actually working

@pcuzner
Copy link
Contributor Author

pcuzner commented Sep 28, 2020

Sorry, I just came across this and noticed that the last two commits are starting to add support for daemons that are not containers. Why would we want cephadm to support daemons that aren't containers?

@tserong purely for the cephadm exporter mode, which runs as a native systemd unit so it can interact with the host os (systemctl commands and podman/docker commands)

@pcuzner
Copy link
Contributor Author

pcuzner commented Sep 28, 2020

I don't see any Teuthology tests OR unit tests or anything in this PR. This would certainly speed up things, if we can validate that things are actually working

@sebastian-philipp interesting. This is an existing command - so are you saying that we don't currently test list_daemons? "cephadm ls" itself s already part of the test_cephadm script, but I see that ceph orch ps is not. Would including ceph orch ps within the test_cephadm script suffice? If not, some guidance would be great!

@jschmid1 ^^

Copy link
Contributor

@tserong tserong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit 774a523 is described as "mgr/cephadm: updates to determining version of crash", but seems to include rather more changes than just those related to the crash daemon. Is it possible to split this up into separate commits, or change the description to more accurately reflect what's changed?

Also, src/cephadm/cephadmc needs to be removed (this might be an artifact of running 'tox', but in any case it doesn't belong in the source tree)

ls = []
def get_enabled_ceph_units(fsid: str) -> List[str]:
try:
units = os.listdir(f'/etc/systemd/system/ceph-{fsid}.target.wants')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes me slightly nervous, partly because it means we're not treating systemd as a black box, but are relying on us knowing how it works with symlinks inside that directory, and partly because it means that this is another place we're going to have to remember to update if we change how we're setting up unit files and targets in future.

What specific performance increase do we get by doing this, vs. running systemctl is-enabled for each unit?

(Also, if we do keep this change, it should use args.unit_dir rather than hard-coding /etc/systemd/system)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the current list_daemons takes 10s on a physical server - this version makes it <1s...we have to improve on dealing with scale.

The whole idea is to reduce the data gathering, so we can run it more often and have more current data for orch and UI flows. We already have to understand what systemd is doing since we're creating unit files directly, the the right place and format - so I don't see this as a big deal personally.

systemd isn't exactly new.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get the idea, and totally support speeding things up, and if this PR gives an order of magnitude (or more) improvement, then that's excellent!

My question above was meant to be specifically only around the benefit gained by switching from systemctl to os.listdir, vs. the other changes in this PR.

@sebastian-philipp
Copy link
Contributor

I don't see any Teuthology tests OR unit tests or anything in this PR. This would certainly speed up things, if we can validate that things are actually working

@sebastian-philipp interesting. This is an existing command - so are you saying that we don't currently test list_daemons? "cephadm ls" itself s already part of the test_cephadm script, but I see that ceph orch ps is not. Would including ceph orch ps within the test_cephadm script suffice? If not, some guidance would be great!

@jschmid1 ^^

There are plenty of places, wherecephadm ls is executed, but we don't verify that the output is correct and complete. Like e.g. https://tracker.ceph.com/issues/45628

Would be great to have a test that matches the output with what we expect

@tchaikov tchaikov removed the needs-qa label Oct 9, 2020
@tchaikov
Copy link
Contributor

tchaikov commented Oct 9, 2020

removing needs-qa, as it needs rebase.

@sebastian-philipp sebastian-philipp added the wip-swagner-testing My Teuthology tests label Dec 7, 2020
@sebastian-philipp
Copy link
Contributor

needs rebase

@sebastian-philipp sebastian-philipp removed the wip-swagner-testing My Teuthology tests label Dec 7, 2020
@sebastian-philipp
Copy link
Contributor

close? the cephadm daemon is IMO better suited for improving the performance.

@pcuzner
Copy link
Contributor Author

pcuzner commented Feb 2, 2021

Agree - will close. Thanks

@pcuzner pcuzner closed this Feb 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
7 participants