`DaemonClient`: Fix and homogenize use of `timeout` in client calls #5960

sphuber · 2023-04-12T10:37:30Z

The DaemonClient provides a number of methods that will attempt to communicate with the daemon process, e.g., stop_daemon and get_status. This is done through the CircusClient, as the main daemon process is the circus daemonizer, which takes a timeout as argument indicating the number of seconds after which the call should raise if the daemon did not respond in time.

The default timeout is set in the constructor of the DaemonClient based on the daemon.timeout config option. It was incorrectly getting the global value instead of the profile specific one, which is corrected by using config.get_option('daemon.timeout', scope=profile.name) to fetch the conifg option.

A timeout argument is added to all DaemonClient methods that call through to the CircusClient that can be used to override the default that is based on the daemon.timeout config option.

A default is added for wait in the restart_daemon method, to homogenize its interface with start_daemon and stop_daemon. The manual timeout cycle in stop_daemon by calling _await_condition has been removed as this functionality is already performed by the circus client itself and so is superfluous. The only place it is still used is in start_daemon because there the circus client is not used, since the circus process is not actually running yet, and a "manual" health check needs to be performed after the daemon process is launched.

The default for the daemon.timeout configuration option is decreased to 2 seconds, as this should be sufficient for most conditions for the circus daemon process to respond. Note that this daemon process is only charged with monitoring the daemon workers and so won't be under heavy load that will prevent it from responding in time. The timeout is set to 5 seconds for the CI workflow, however, since the runners are typically less performant machines where 2 seconds may be insufficient and we don't want the tests to fail just because the call to stop the daemon at the end of the suite timed out.

sphuber · 2023-04-12T14:32:56Z

Blocked by #5961 . See #5822 (comment) for an explanation

unkcpz · 2023-04-12T22:18:59Z

@sphuber thanks a lot! You figure out the logic of daemon timeout! I guess this will allow us to come back to the PR #5928

The manual timeout cycle in stop_daemon by calling _await_condition has been removed as this functionality is already performed by the circus client itself and so is superfluous

This is one thing still not very clear to me. It means we use the same timeout for _await_condition and other call circus clients. Are they all on the same time scale? When running verdi daemon stop, I think 2 seconds is too short because we need to wait for ongoing processes to be finished (This is what I guess, I don't find where this is handled, maybe in circus which will wait until the process is done?).

P.S @superstar54 shows interest in the daemon part and wants to have a look, so I'll leave the doc part reviewing for him to have a close look.

sphuber · 2023-04-13T09:28:16Z

This is one thing still not very clear to me. It means we use the same timeout for _await_condition and other call circus clients. Are they all on the same time scale?

The _await_condition is now only used for the start_daemon command (in normal usage) because there we cannot rely on the built-in timeout check of the circus client, because the client is not actually running yet when starting it. So we have a manual one. In this sense, I think it is fine that that uses the same timeout.

When running verdi daemon stop, I think 2 seconds is too short because we need to wait for ongoing processes to be finished (This is what I guess, I don't find where this is handled, maybe in circus which will wait until the process is done?).

This depends, I think, on whether wait = True. If that is False, I think the circus daemon just needs to acknowledge that it received the signal to stop. It will then send the signals to the workers to shut them down. If, however, the user wants to wait, then indeed I think it would wait until the circus daemon confirms that everything is shut down and it is shutting down itself. If the workers are under heavy load, this will probably timeout. But not just with 2 seconds, but probably also 5, 6 or even 10 seconds. In the end, it is not really a problem though as the daemon will still eventually shut down, we just don't get immediate confirmation. This has always been the case and is not something that is changing here, or something that we can easily prevent I think.

Ultimately, if users really don't like this, adding a --timeout flag on the verdi daemon stop command would allow them to specify a larger timeout on the fly. That is something that should now be easy to add.

unkcpz

@sphuber thanks for the explanation, now everything is clear. The changes look all good.

The `DaemonClient` provides a number of methods that will attempt to communicate with the daemon process, e.g., `stop_daemon` and `get_status`. This is done through the `CircusClient`, as the main daemon process is the circus daemonizer, which takes a timeout as argument indicating the number of seconds after which the call should raise if the daemon did not respond in time. The default timeout is set in the constructor of the `DaemonClient` based on the `daemon.timeout` config option. It was incorrectly getting the global value instead of the profile specific one, which is corrected by using `config.get_option('daemon.timeout', scope=profile.name)` to fetch the conifg option. A `timeout` argument is added to all `DaemonClient` methods that call through to the `CircusClient` that can be used to override the default that is based on the `daemon.timeout` config option. A default is added for `wait` in the `restart_daemon` method, to homogenize its interface with `start_daemon` and `stop_daemon`. The manual timeout cycle in `stop_daemon` by calling `_await_condition` has been removed as this functionality is already performed by the circus client itself and so is superfluous. The only place it is still used is in `start_daemon` because there the circus client is not used, since the circus process is not actually running yet, and a "manual" health check needs to be performed after the daemon process is launched. The manual check is added to the `stop_daemon` fixture since the check in certain unit test scenarios could give a false positive without an additional manual grace period for `is_daemon_running` to start returning `False`. The default for the `daemon.timeout` configuration option is decreased to 2 seconds, as this should be sufficient for most conditions for the circus daemon process to respond. Note that this daemon process is only charged with monitoring the daemon workers and so won't be under heavy load that will prevent it from responding in time.

sphuber force-pushed the fix/4016/daemon-client-timeout branch 5 times, most recently from 80fdbc5 to 6e20c49 Compare April 12, 2023 14:14

sphuber added the pr/blocked PR is blocked by another PR that should be merged first label Apr 12, 2023

sphuber requested review from unkcpz and superstar54 April 12, 2023 14:33

unkcpz approved these changes Apr 13, 2023

View reviewed changes

sphuber force-pushed the fix/4016/daemon-client-timeout branch from cd68572 to 907ef6a Compare April 13, 2023 21:28

sphuber merged commit 30c7f44 into aiidateam:main Apr 14, 2023

sphuber deleted the fix/4016/daemon-client-timeout branch April 14, 2023 07:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`DaemonClient`: Fix and homogenize use of `timeout` in client calls #5960

`DaemonClient`: Fix and homogenize use of `timeout` in client calls #5960

sphuber commented Apr 12, 2023

sphuber commented Apr 12, 2023

unkcpz commented Apr 12, 2023 •

edited

Loading

sphuber commented Apr 13, 2023

unkcpz left a comment

DaemonClient: Fix and homogenize use of timeout in client calls #5960

DaemonClient: Fix and homogenize use of timeout in client calls #5960

Conversation

sphuber commented Apr 12, 2023

sphuber commented Apr 12, 2023

unkcpz commented Apr 12, 2023 • edited Loading

sphuber commented Apr 13, 2023

unkcpz left a comment

Choose a reason for hiding this comment

`DaemonClient`: Fix and homogenize use of `timeout` in client calls #5960

`DaemonClient`: Fix and homogenize use of `timeout` in client calls #5960

unkcpz commented Apr 12, 2023 •

edited

Loading