API: Add method to start the daemon to DaemonClient #5625

Merged
merged 1 commit into from
Sep 5, 2022

Conversation

sphuber
Contributor

@sphuber sphuber commented Sep 5, 2022

So far, the daemon could only be started through `verdi` and there was
no way to do it from the Python API. Here we add the `start_daemon`
method to the `aiida.engine.daemon.client.DaemonClient` class which will
start the daemon when called.

To start the daemon, the function will actually still invoke the `verdi`
command through a subprocess. The reason is that starting the daemon
will put the calling process in the background, which is not the desired
behavior when calling it from the API. The actual code that launches the
circus daemon is moved from the `verdi` command to the `_start_daemon`
method of the `DaemonClient`. In this way, the daemon functionality is
more self-contained. By making it a protected method, we signal that
users of the Python API should probably not use it, but use the public
`start_daemon` instead.

This implementation may seem to have some circularity, as `start_daemon`
will call a `verdi` command, which in turn will call the `_start_daemon`
method of the `DaemonClient`. The reason for this is that the `verdi`
command ensures that the correct profile is loaded before starting the
daemon. We could make a separate CLI end point independent of `verdi`
that just serves to load a profile and start the daemon, but that seems
unnecessarily complicated at this point.
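
The indirection described above can be illustrated with a minimal, self-contained sketch. Everything in it is an assumption made for illustration: `SketchDaemonClient` is a hypothetical stand-in rather than the actual `DaemonClient`, and a Python one-liner plays the role of the `verdi` command so that the example runs without AiiDA installed. It only shows the shape of the pattern: the public method spawns a child process, so whatever that child does to itself (such as moving into the background) does not affect the caller.

```python
import subprocess
import sys


class SketchDaemonClient:
    """Hypothetical stand-in for illustration; not the AiiDA ``DaemonClient``."""

    def __init__(self, profile_name):
        self.profile_name = profile_name

    def _cli_command(self):
        """Build the command line for the child process.

        In the design described above this would be a ``verdi`` invocation that
        loads the profile and then calls the protected start method. Here a
        Python one-liner stands in for it and simply reports the profile name.
        """
        code = "import sys; print('started daemon for ' + sys.argv[1])"
        return [sys.executable, '-c', code, self.profile_name]

    def start_daemon(self):
        """Run the stand-in CLI in a subprocess and return what it printed."""
        result = subprocess.run(
            self._cli_command(), capture_output=True, text=True, check=True
        )
        return result.stdout.strip()


client = SketchDaemonClient('profile-a')
message = client.start_daemon()
```

Passing the profile name on the command line mirrors the stated reason for going through the CLI: the child process is the one that decides which profile to load before doing any work.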

Besides the added function, which was the main goal, the code is also
refactored considerably. The implementation of the command line command
`verdi daemon start-circus` is now moved to the `_start_circus` method
of the `DaemonClient` class. The `verdi devel run_daemon` command is
moved to `verdi daemon worker`, which makes more sense as a location.
This command launches a single daemon worker, which is nothing more than
an AiiDA process that runs a `Runner` instance in blocking mode. The
`verdi daemon start` will run a circus daemon that manages instances of
these daemon workers. To better match the nomenclature, the module
`aiida.engine.daemon.runner` was renamed to `worker`.
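
The relationship between the supervising daemon, the workers, and the runner can be sketched roughly as follows. This is a toy model under stated assumptions, not AiiDA or circus code: the class names `SketchRunner`, `SketchWorker` and `SketchSupervisor` are hypothetical, and threads are used in place of the separate operating system processes that real daemon workers are, purely to keep the example short and runnable.

```python
import threading


class SketchRunner:
    """Hypothetical runner: loops in blocking mode until asked to stop."""

    def __init__(self):
        self._stop_event = threading.Event()
        self.iterations = 0

    def run_forever(self, poll_interval=0.01):
        # Blocks the calling thread; wait() returns True once stop() is called.
        while not self._stop_event.wait(poll_interval):
            self.iterations += 1

    def stop(self):
        self._stop_event.set()


class SketchWorker:
    """A worker is little more than a unit of execution that owns one runner."""

    def __init__(self, name):
        self.name = name
        self.runner = SketchRunner()
        self._thread = threading.Thread(target=self.runner.run_forever, name=name)

    def start(self):
        self._thread.start()

    def stop(self):
        self.runner.stop()
        self._thread.join()

    @property
    def is_alive(self):
        return self._thread.is_alive()


class SketchSupervisor:
    """Plays the part of the supervising daemon: it only manages workers."""

    def __init__(self, number_workers):
        self.workers = [SketchWorker(f'worker-{i}') for i in range(number_workers)]

    def start(self):
        for worker in self.workers:
            worker.start()

    def stop(self):
        for worker in self.workers:
            worker.stop()


supervisor = SketchSupervisor(number_workers=2)
supervisor.start()
alive_while_running = [worker.is_alive for worker in supervisor.workers]
supervisor.stop()
alive_after_stop = [worker.is_alive for worker in supervisor.workers]
```

Keeping the runner as a separate object inside the worker leaves room for a worker to hold more than a runner later on, without changing what the supervisor sees.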

@sphuber sphuber requested a review from ltalirz September 5, 2022 07:39
@ltalirz
Member

ltalirz commented Sep 5, 2022

Great, thanks a lot @sphuber !

Even just the refactoring alone makes the code easier to understand.

Before I go through it line by line, just one question:

> This implementation may seem to have some circularity, as start_daemon
> will call a verdi command, which in turn will call the _start_daemon
> method of the DaemonClient. The reason for this is that the verdi
> command ensures that the correct profile is loaded before starting the
> daemon. We could make a separate CLI end point independent of verdi
> that just serves to load a profile and start the daemon, but that seems
> unnecessarily complicated at this point.

Controlling the daemon of the "current" profile via the Python API is already a great step forward, but it seems to me that a fully-fledged daemon Python API should be able to control a daemon for a specific profile specified by the caller - see e.g.

https://github.com/microsoft/aiida-dynamic-workflows/blob/4d452ed3be4192dc5b2c8f40690f82c3afcaa7a8/aiida_dynamic_workflows/control.py#L79-L97

Would it be possible to make the profile an optional parameter in the daemon API that is pre-filled with the loaded profile when a profile is already loaded (e.g. when used via the verdi cli)?

@sphuber
Contributor Author

sphuber commented Sep 5, 2022

> Would it be possible to make the profile an optional parameter in the daemon API that is pre-filled with the loaded profile when a profile is already loaded (e.g. when used via the verdi cli)?

Well, in a sense, it already is. The DaemonClient is constructed with a particular profile and then operates on that. So if you need to operate on an arbitrary profile, simply construct a new client for that profile. There is the convenience method get_daemon_client that takes a profile, but will use the currently loaded one by default. So the typical usage in, say, a notebook or shell would be:

from aiida.engine.daemon.client import get_daemon_client
client = get_daemon_client()
client.start_daemon()

# or
from aiida.engine.daemon.client import DaemonClient
from aiida.manage import load_profile
profile_name = 'profile-a'
profile = load_profile(profile_name)
client = DaemonClient(profile)
client.start_daemon()

Does that work, do you think?
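
The "optional profile, defaulting to the loaded one" behavior discussed here can be sketched in isolation. The names below (`SketchProfile`, `sketch_load_profile`, `SketchClient`, `sketch_get_client`) and the in-memory registry are all hypothetical and exist only so the example runs without AiiDA; they are not the library's API and make no claim about how `get_daemon_client` is implemented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SketchProfile:
    """Hypothetical profile: just a name in this sketch."""

    name: str


# Hypothetical registry of configured profiles and the currently loaded one.
_PROFILES = {name: SketchProfile(name) for name in ('profile-a', 'profile-b')}
_loaded_profile: Optional[SketchProfile] = None


def sketch_load_profile(name: str) -> SketchProfile:
    """Mark the named profile as the currently loaded one and return it."""
    global _loaded_profile
    _loaded_profile = _PROFILES[name]
    return _loaded_profile


class SketchClient:
    """Hypothetical client that is bound to exactly one profile."""

    def __init__(self, profile: SketchProfile):
        self.profile = profile


def sketch_get_client(profile_name: Optional[str] = None) -> SketchClient:
    """Return a client for the named profile, or for the loaded one by default."""
    if profile_name is not None:
        return SketchClient(_PROFILES[profile_name])
    if _loaded_profile is None:
        raise RuntimeError('no profile is loaded and none was specified')
    return SketchClient(_loaded_profile)


sketch_load_profile('profile-a')
default_client = sketch_get_client()
explicit_client = sketch_get_client('profile-b')
```

Because each client is bound to its profile at construction time, controlling several daemons amounts to holding several clients, which is the point made in the comment above.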

By the way: I am also working on the process control API, but that required the daemon refactoring anyway for the tests, so I thought to add that in a separate PR.

@ltalirz
Member

ltalirz commented Sep 5, 2022

Sorry, I should have looked more closely. That makes sense!

> By the way: I am also working on the process control API, but that required the daemon refactoring anyway for the tests, so I thought to add that in a separate PR.

Great!

Then let me have a quick look through the code

Member

@ltalirz ltalirz left a comment

Thanks @sphuber !

Just some minor comments.

Finally, if you want to go through with the renaming of "runner" to "worker", you may want to do a full-text search on the repository - e.g. the term appears in the docs, in the verdi config, etc.

I also prefer the term "worker" but I'm also fine with keeping "runner".

arbiter = None
if pidfile is not None:
    pidfile.unlink()

@verdi_daemon.command('worker')
Member

I understand that you feel this makes more sense as a location - at the same time, have you ever encountered a use case for this that was not related to AiiDA development?

If so, perhaps this use case should be documented in the aiida docs. If not, I guess we could also leave this command in verdi devel

Contributor Author

It's not so much for developing, but can be useful for debugging if you want to run a single daemon worker in the foreground. I would keep it under verdi daemon as it centralizes all daemon code, especially since circus also calls verdi daemon worker to spawn a new worker. It is a bit weird to have the actual daemon call verdi devel run_daemon as a daemon worker.

"""
The protocol to use to for the controller of the Circus daemon
"""
"""The protocol to use to for the controller of the Circus daemon."""
Member

Suggested change
- """The protocol to use to for the controller of the Circus daemon."""
+ """The protocol to use for the controller of the Circus daemon."""

The endpoint is defined by the controller endpoint, which used the port that was written to the port file upon
starting of the daemon.

.. note:: This is quite slow the first time it is run due to the import of ``zmq.ssh`` in ``circus/utils.py`` in
Member

Do we make use of zmq.ssh or is this import effectively superfluous for us?

Contributor Author

The import is indirect: it happens because we import circus.client.CircusClient. Note that I did not add this comment; it has been there for a while (added in b39a9be). That commit simply moved the import from top level to within the function, probably so that it would not slow down the tab-completion of verdi. I just tested running from zmq import ssh in a shell and it doesn't seem that slow. I think we can remove the comment and even consider moving the import to top level. We have the verdi load time test to warn us if we exceed the load time.

Member

Thanks. Yes, no need to fix it in this PR.
If you believe it may be a valid candidate to speed things up (e.g. because we don't use the ssh thing), it is perhaps worth mentioning here, that's all.

Contributor Author

I moved the import top-level and it actually caused the verdi load time test to fail. Now this could have been coincidence, but best to keep the import in the method. No real reason not to.
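
The deferred-import pattern discussed in this thread can be sketched generically. In the example below, `json` merely stands in for a slow dependency and `serialize` is a hypothetical function; neither is taken from the AiiDA code base. The point is only that an import placed inside a function is paid for on the first call rather than when the enclosing module is loaded.

```python
def serialize(data):
    """Return ``data`` as a JSON string with sorted keys.

    The import is deferred to call time, so importing the module that defines
    this function does not pay for importing ``json``. Here ``json`` stands in
    for a slow dependency.
    """
    import json  # deferred on purpose; cached in sys.modules after first use

    return json.dumps(data, sort_keys=True)


result = serialize({'b': 1, 'a': 2})
```

The trade-off matches the note in the docstring under review: the first call is slower, while module import (and anything that depends on it, such as command-line tab-completion) stays fast.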

@sphuber
Contributor Author

sphuber commented Sep 5, 2022

> Finally, if you want to go through with the renaming of "runner" to "worker", you may want to do a full-text search on the repository - e.g. the term appears in the docs, in the verdi config, etc.
>
> I also prefer the term "worker" but I'm also fine with keeping "runner".

Note that we should not rename everything from runner to worker. Here I am merely renaming the code that launches an actual daemon worker, i.e., a Python process calling verdi daemon worker in the background. This process will then launch an instance of a Runner that will perform the functionality. I don't think that this should necessarily be renamed to worker though. The "worker" just consists of a Runner instance, but it could potentially consist of more.

The outward-facing language for AiiDA users should refer to daemon workers, but that doesn't mean that the Runner concept should also be renamed. If there are references that still say "runner" where "worker" is meant, we could change those, but I would not do so if it is backwards incompatible. I can make the changes here because they don't really touch user-facing code.

@ltalirz
Member

ltalirz commented Sep 5, 2022

Ok. I'm also in favor of not changing user-facing API if we don't need to.

Since the difference between a "runner" and a "worker" is not going to be intuitively obvious to most users, you may want to consider mentioning the difference at least somewhere in the docs.

Here are places in the docs that mention "runner" (maybe all fine; just wanted to make sure you agree):

$ git grep -i runner | grep \.rst
docs/source/howto/installation.rst:    runner.poll.interval                   profile   50
docs/source/internals/rest_api.rst:        # Invoke the runner
docs/source/internals/rest_api.rst:   If you want to add more options or modify the existing ones, create you custom runner taking inspiration from ``run_api``.
docs/source/intro/troubleshooting.rst:You might also be interested in reviewing the circus log messages (the ``circus`` library is the daemonizer that manages the daemon runners),
docs/source/topics/processes/concepts.rst:Typically the one responsible for running the processes is an instance of a :py:class:`~aiida.engine.runners.Runner`.
docs/source/topics/processes/concepts.rst:This can be a local runner or one of the daemon runners in case of the daemon running the process.
docs/source/topics/processes/concepts.rst:As soon as it is picked up by a runner and it is active, it will be in the ``Running`` state.
docs/source/topics/processes/concepts.rst:When a process is launched, an instance of the ``Process`` class is created in memory which will be propagated to completion by the responsible runner.
docs/source/topics/processes/concepts.rst:This 'process' instance only exists in the memory of the python interpreter that it is running in, for example that of a daemon runner, and so we cannot directly inspect its state.
docs/source/topics/processes/concepts.rst:All the daemon runners, when they are launched, subscribe to the process queue and RabbitMQ will distribute the continuation tasks to them as they come in, making sure that each task is only sent to one runner at a time.
docs/source/topics/processes/concepts.rst:The receiving daemon runner can restore the process instance in memory from the checkpoint that was stored in the database and continue the execution.
docs/source/topics/processes/concepts.rst:As soon as the process reaches a terminal state, the daemon runner will acknowledge to RabbitMQ that the task has been completed.
docs/source/topics/processes/concepts.rst:Until the runner has confirmed that a task is completed, RabbitMQ will consider the task as incomplete.
docs/source/topics/processes/concepts.rst:If a daemon runner is shut down or dies before it got the chance to finish running a process, the task will automatically be requeued by RabbitMQ and sent to another daemon runner.
docs/source/topics/processes/concepts.rst:Each daemon runner has a maximum number of tasks that it can run concurrently, which means that if there are more active tasks than available slots, some of the tasks will remain queued.
docs/source/topics/processes/concepts.rst:Processes, whose task is in the queue and not with any runner, though technically 'active' as they are not terminated, are not actually being run at the moment.
docs/source/topics/processes/concepts.rst:While a process is not actually being run, i.e. it is not in memory with a runner, one cannot interact with it.
docs/source/topics/processes/usage.rst:As the section on :ref:`the distinction between the process and the node<topics:processes:concepts:node_distinction>` explained, manipulating a process means interacting with the live process instance that lives in the memory of the runner that is running it.
docs/source/topics/processes/usage.rst:By definition, these runners will always run in a different system process than the one from which you want to interact, because otherwise, you would *be* the runner, given that there can only be a single runner in an interpreter and if it is running, the interpreter would be blocked from performing any other operations.
docs/source/topics/processes/usage.rst:When a runner starts to run a process, it will also add listeners for incoming messages that are being sent for that specific process over RabbitMQ.
docs/source/topics/processes/usage.rst:    This does not just apply to daemon runners, but also normal runners.
docs/source/topics/processes/usage.rst:    That is to say that if you were to launch a process in a local runner, that interpreter will be blocked, but it will still setup the listeners for that process on RabbitMQ.
docs/source/topics/processes/usage.rst:    This means that you can manipulate the process from another terminal, just as if you would do with a process that is being run by a daemon runner.
docs/source/topics/processes/usage.rst:The RPC will include the process identifier for which the action is intended and RabbitMQ will send it to whoever registered itself to be listening for that specific process, in this case the runner that is running the process.
docs/source/topics/processes/usage.rst:But even if the task *were* to be with a runner, it might be too busy to respond to the RPC and the process appears to be unreachable.
docs/source/topics/processes/usage.rst:Depending on the cause of the process being unreachable, the problem may resolve itself automatically over time and one can try again at a later time, as for example in the case of the runner being too busy to respond.
docs/source/topics/processes/usage.rst:However, to prevent this from happening, the runner has been designed to have the communication happen over a separate thread and to schedule callbacks for any necessary actions on the main thread, which performs all the heavy lifting.
docs/source/topics/processes/usage.rst:This should make occurrences of the runner being too busy to respond very rare.
docs/source/topics/processes/usage.rst:The problem will manifest itself identically if the runner just could not respond in time or if the task has accidentally been lost forever due to a bug, even though these are two completely separate situations.
docs/source/topics/processes/usage.rst:The previous paragraph already mentioned it in passing, but when a remote procedure call is sent, it first needs to be answered by the responsible runner, if applicable, but it will not *directly execute* the call.
docs/source/topics/processes/usage.rst:If the runner has successfully received the request and scheduled the callback, the command will therefore show something like the following:
docs/source/topics/processes/usage.rst:By default, the ``pause``, ``play`` and ``kill`` commands will only ask for the confirmation of the runner that the request has been scheduled and not actually wait for the command to have been executed.
docs/source/topics/processes/usage.rst:This is because, as explained, the actual action being performed might not be instantaneous as the runner may be busy working with other processes, which would mean that the command would block for a long time.
docs/source/topics/processes/usage.rst:If you know that your daemon runners may be experiencing a heavy load, you can also increase the time that the command waits before timing out, with the ``-t/--timeout`` flag.

@ltalirz ltalirz self-requested a review September 5, 2022 13:17
Member

@ltalirz ltalirz left a comment

In any case, the changes look good from my side; no need for me to re-review.

@sphuber
Contributor Author

sphuber commented Sep 5, 2022

> Here are places in the docs that mention "runner" (maybe all fine; just wanted to make sure you agree):

You are right, some of these should actually be changed to daemon worker. Wondering if I should still do this in this PR, or just revert the name change in this PR for now and leave that for another time.

@ltalirz
Member

ltalirz commented Sep 5, 2022

Up to you, feel free to go ahead - no need to slow down for this

@sphuber sphuber merged commit 2d30698 into aiidateam:main Sep 5, 2022
@sphuber sphuber deleted the feature/5258/daemon-api branch September 5, 2022 15:51