Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DaemonClient: Fix and homogenize use of timeout in client calls #5960

Merged
merged 1 commit into from
Apr 14, 2023

Conversation

sphuber
Copy link
Contributor

@sphuber sphuber commented Apr 12, 2023

Fixes #4016

The DaemonClient provides a number of methods that will attempt to communicate with the daemon process, e.g., stop_daemon and get_status. This is done through the CircusClient, as the main daemon process is the circus daemonizer, which takes a timeout as argument indicating the number of seconds after which the call should raise if the daemon did not respond in time.

The default timeout is set in the constructor of the DaemonClient based on the daemon.timeout config option. It was incorrectly getting the global value instead of the profile specific one, which is corrected by using config.get_option('daemon.timeout', scope=profile.name) to fetch the conifg option.

A timeout argument is added to all DaemonClient methods that call through to the CircusClient that can be used to override the default that is based on the daemon.timeout config option.

A default is added for wait in the restart_daemon method, to homogenize its interface with start_daemon and stop_daemon. The manual timeout cycle in stop_daemon by calling _await_condition has been removed as this functionality is already performed by the circus client itself and so is superfluous. The only place it is still used is in start_daemon because there the circus client is not used, since the circus process is not actually running yet, and a "manual" health check needs to be performed after the daemon process is launched.

The default for the daemon.timeout configuration option is decreased to 2 seconds, as this should be sufficient for most conditions for the circus daemon process to respond. Note that this daemon process is only charged with monitoring the daemon workers and so won't be under heavy load that will prevent it from responding in time. The timeout is set to 5 seconds for the CI workflow, however, since the runners are typically less performant machines where 2 seconds may be insufficient and we don't want the tests to fail just because the call to stop the daemon at the end of the suite timed out.

@sphuber sphuber force-pushed the fix/4016/daemon-client-timeout branch 5 times, most recently from 80fdbc5 to 6e20c49 Compare April 12, 2023 14:14
@sphuber
Copy link
Contributor Author

sphuber commented Apr 12, 2023

Blocked by #5961 . See #5822 (comment) for an explanation

@sphuber sphuber added the pr/blocked PR is blocked by another PR that should be merged first label Apr 12, 2023
@unkcpz
Copy link
Member

unkcpz commented Apr 12, 2023

@sphuber thanks a lot! You figure out the logic of daemon timeout! I guess this will allow us to come back to the PR #5928

The manual timeout cycle in stop_daemon by calling _await_condition has been removed as this functionality is already performed by the circus client itself and so is superfluous

This is one thing still not very clear to me. It means we use the same timeout for _await_condition and other call circus clients. Are they all on the same time scale? When running verdi daemon stop, I think 2 seconds is too short because we need to wait for ongoing processes to be finished (This is what I guess, I don't find where this is handled, maybe in circus which will wait until the process is done?).

P.S @superstar54 shows interest in the daemon part and wants to have a look, so I'll leave the doc part reviewing for him to have a close look.

@sphuber
Copy link
Contributor Author

sphuber commented Apr 13, 2023

This is one thing still not very clear to me. It means we use the same timeout for _await_condition and other call circus clients. Are they all on the same time scale?

The _await_condition is now only used for the start_daemon command (in normal usage) because there we cannot rely on the built-in timeout check of the circus client, because the client is not actually running yet when starting it. So we have a manual one. In this sense, I think it is fine that that uses the same timeout.

When running verdi daemon stop, I think 2 seconds is too short because we need to wait for ongoing processes to be finished (This is what I guess, I don't find where this is handled, maybe in circus which will wait until the process is done?).

This depends, I think, on whether wait = True. If that is False, I think the circus daemon just needs to acknowledge that it received the signal to stop. It will then send the signals to the workers to shut them down. If, however, the user wants to wait, then indeed I think it would wait until the circus daemon confirms that everything is shut down and it is shutting down itself. If the workers are under heavy load, this will probably timeout. But not just with 2 seconds, but probably also 5, 6 or even 10 seconds. In the end, it is not really a problem though as the daemon will still eventually shut down, we just don't get immediate confirmation. This has always been the case and is not something that is changing here, or something that we can easily prevent I think.

Ultimately, if users really don't like this, adding a --timeout flag on the verdi daemon stop command would allow them to specify a larger timeout on the fly. That is something that should now be easy to add.

Copy link
Member

@unkcpz unkcpz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sphuber thanks for the explanation, now everything is clear. The changes look all good.

The `DaemonClient` provides a number of methods that will attempt to
communicate with the daemon process, e.g., `stop_daemon` and `get_status`.
This is done through the `CircusClient`, as the main daemon process is
the circus daemonizer, which takes a timeout as argument indicating the
number of seconds after which the call should raise if the daemon did
not respond in time.

The default timeout is set in the constructor of the `DaemonClient`
based on the `daemon.timeout` config option. It was incorrectly getting
the global value instead of the profile specific one, which is corrected
by using `config.get_option('daemon.timeout', scope=profile.name)` to
fetch the conifg option.

A `timeout` argument is added to all `DaemonClient` methods that call
through to the `CircusClient` that can be used to override the default
that is based on the `daemon.timeout` config option.

A default is added for `wait` in the `restart_daemon` method, to
homogenize its interface with `start_daemon` and `stop_daemon`. The
manual timeout cycle in `stop_daemon` by calling `_await_condition` has
been removed as this functionality is already performed by the circus
client itself and so is superfluous. The only place it is still used is
in `start_daemon` because there the circus client is not used, since the
circus process is not actually running yet, and a "manual" health check
needs to be performed after the daemon process is launched. The manual
check is added to the `stop_daemon` fixture since the check in certain
unit test scenarios could give a false positive without an additional
manual grace period for `is_daemon_running` to start returning `False`.

The default for the `daemon.timeout` configuration option is decreased
to 2 seconds, as this should be sufficient for most conditions for the
circus daemon process to respond. Note that this daemon process is only
charged with monitoring the daemon workers and so won't be under heavy
load that will prevent it from responding in time.
@sphuber sphuber force-pushed the fix/4016/daemon-client-timeout branch from cd68572 to 907ef6a Compare April 13, 2023 21:28
@sphuber sphuber merged commit 30c7f44 into aiidateam:main Apr 14, 2023
@sphuber sphuber deleted the fix/4016/daemon-client-timeout branch April 14, 2023 07:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr/blocked PR is blocked by another PR that should be merged first
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Docs: Topic daemon
2 participants