
Respect the minimum polling interval for scheduler updates #3096


sphuber
Contributor

@sphuber sphuber commented Jun 28, 2019

Fixes #2535

The `aiida.engine.processes.calcjobs.manager.JobsList` container was
introduced for two purposes, related to managing running calculation
jobs in high-throughput mode:

  • It bundles the scheduler update calls for all active calculation jobs
    for a given authentication info (the combination of a computer and user)
  • It ensures that consecutive scheduler update calls are separated by at
    least the time given by the `get_minimum_job_poll_interval`
    defined by the computer of the authentication info.
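
These two guarantees can be sketched as follows. This is a minimal illustration, not the actual AiiDA implementation; `BatchedJobsPoller` and `query_scheduler` are hypothetical names:

```python
import time


class BatchedJobsPoller:
    """Illustrative sketch of a container that batches scheduler update calls
    and enforces a minimum interval between consecutive polls."""

    def __init__(self, minimum_poll_interval):
        self._minimum_poll_interval = minimum_poll_interval
        self._last_updated = None  # timestamp of the last scheduler query

    def _seconds_until_next_poll(self):
        """Return how long to wait before the scheduler may be queried again."""
        if self._last_updated is None:
            return 0.0
        elapsed = time.monotonic() - self._last_updated
        return max(0.0, self._minimum_poll_interval - elapsed)

    def update_jobs(self, query_scheduler):
        """Query the scheduler once for all active jobs, waiting out the
        minimum poll interval if the previous query was too recent."""
        delay = self._seconds_until_next_poll()
        if delay > 0:
            time.sleep(delay)
        result = query_scheduler()  # a single batched call for every active job
        self._last_updated = time.monotonic()  # crucial: record the query time
        return result
```

For example, with a minimum interval of 30 seconds, a burst of update requests for hundreds of jobs still results in at most one scheduler call every 30 seconds.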

However, the second requirement was not being respected. The problem was
twofold. First, the internal attribute `_last_updated` that records the
timestamp of the last update was not being updated after the scheduler
was queried for a status update.

The use of the `RefObjectStore` to create and delete these `JobsList`
instances, however, was an even bigger problem. The reference store
would delete a `JobsList` instance as soon as no one held a reference
to it any more, since the instances are created through the
`request_job_info_request` context manager of the `JobManager` that each
runner has. As soon as no requests were active, the object was deleted,
and with it the record of the last time the scheduler was queried. The
next time a request came in, a new `JobsList` would be created that
straight away called the scheduler, since it had no recollection of the
last time it was called for the given authinfo, if at all. As a result,
the minimum poll interval was essentially never respected.
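
The problematic lifecycle can be illustrated with a simplified, hypothetical sketch of such a reference-counted store (`RefCountedStore` is a made-up name, not the actual `RefObjectStore` code):

```python
import contextlib


class RefCountedStore:
    """Sketch of a store that deletes an object once no request holds it."""

    def __init__(self):
        self._objects = {}  # key -> (object, reference count)

    @contextlib.contextmanager
    def request(self, key, factory):
        """Hand out the object for `key`, creating it on first request and
        deleting it again when the last active request exits."""
        obj, count = self._objects.get(key, (None, 0))
        if obj is None:
            obj = factory()
        self._objects[key] = (obj, count + 1)
        try:
            yield obj
        finally:
            obj, count = self._objects[key]
            if count == 1:
                # Last reference gone: the object, and any state it carried
                # (such as a "last updated" timestamp), is discarded.
                del self._objects[key]
            else:
                self._objects[key] = (obj, count - 1)
```

Once the only active request exits, any state stored on the object is gone; the next request gets a fresh instance with no memory of the last poll.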

Fundamentally, the problem lies in the fact that the data that needs to
be persistent, the "last updated" timestamp, was being stored in the
`JobsList` container that by implementation was made non-persistent.
The solution is simply to make the `JobsList` instances live for as long
as the Python interpreter is alive. Each daemon runner creates a single
`JobManager` instance at startup, which will now create a new
`JobsList` only once for each authinfo and will keep returning the same
instance for the rest of its lifetime.
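
The caching described above amounts to a simple per-authinfo memoization; a minimal sketch under assumed names, not the actual AiiDA code:

```python
class JobsList:
    """Stand-in for the real container; only its identity and the fact that
    it carries the poll timestamp matter here."""

    def __init__(self, authinfo):
        self.authinfo = authinfo
        self.last_updated = None  # persists as long as this instance lives


class JobManagerSketch:
    """Create each JobsList at most once per authinfo and keep returning the
    same instance, so the 'last updated' timestamp is never lost."""

    def __init__(self):
        # Never cleared: entries live for the lifetime of the manager,
        # and hence of the interpreter.
        self._jobs_lists = {}

    def get_jobs_list(self, authinfo):
        """Return the cached JobsList for `authinfo`, creating it only once."""
        if authinfo not in self._jobs_lists:
            self._jobs_lists[authinfo] = JobsList(authinfo)
        return self._jobs_lists[authinfo]
```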

@sphuber sphuber force-pushed the fix_2535_job_manager_minimum_poll_interval branch from 520421b to 9c05c45 Compare June 28, 2019 12:51

coveralls commented Jun 28, 2019

Coverage Status

Coverage increased (+0.05%) to 74.017% when pulling 3020dd8 on sphuber:fix_2535_job_manager_minimum_poll_interval into 1578227 on aiidateam:develop.

@sphuber sphuber force-pushed the fix_2535_job_manager_minimum_poll_interval branch from 9c05c45 to 6e76d43 Compare June 28, 2019 13:56
Member

@ltalirz ltalirz left a comment


@sphuber Thanks a lot for fixing this!

My suggestions and questions mainly concern the docstrings.
One thing I noticed after the review is that I'm not sure whether the single backticks work in sphinx as you intend them to.
If they do, it's fine.

aiida/engine/processes/calcjobs/manager.py
authorisation information.
"""A list of submitted jobs on a machine connected to by transport based on the authorisation information.

This container of active calculation jobs is used to update their status periodically in batches. This ensure that

Suggested change
This container of active calculation jobs is used to update their status periodically in batches. This ensure that
This container of active calculation jobs is used to update their status periodically in batches,
making sure not to poll the job scheduler on the `Computer` more often than configured in its minimum polling interval (even when many jobs are running).

Continue from Note that..

class will guarantee that the time between update calls to the scheduler is larger or equal to that minimum
interval.

Note that since each instance operates on a specific `AuthInfo`, the guarantees of batching scheduler update calls
Member

Note that there will be one JobsList instance per daemon worker and per AuthInfo.
I.e. when, hypothetically, configuring the same computer for multiple users, and running jobs for different users simultaneously, the minimum polling interval would be respected only per user.

Contributor Author

I explicitly did not reference the "daemon runner" here, because this applies to any runner. Also, when you "run" a job, it creates a runner, which has its own job manager and its own job lists for each authentication info. Since all these extra conditions have nothing to do with the `JobsList`, I did not include them here. This behavior should rather be documented in a more general place; this is already done in the documentation of `get_minimum_job_poll_interval`.

Member

In this case, it still makes sense to mention that it applies to any runner individually.
This is not obvious.

Contributor Author

But I don't think that is a limitation of this class and therefore it should not be in this docstring. This class is simply used by the `JobManager`, which then makes sure there is one per `AuthInfo`. Even then, however, we do not have a guarantee that this is unique per authinfo: a runner could simply instantiate two `JobManager` instances and then the condition written in the docstring of the `JobsList` would be broken. I think it is just the wrong place for this information, as it is not enforced here at all.

Member

@ltalirz ltalirz Jun 30, 2019

I get your point - I'm just saying that, from a user perspective, the use case of running jobs using multiple authinfos on the same computer is quite esoteric compared to the highly relevant case of having multiple runners.
So, as a user reading this docstring, that is what I would like it to tell me.

Perhaps one way to resolve this is to mention that the JobList is per JobManager and put a link to the JobManager's documentation :py:class:`...JobManager` (which then should somehow mention its relation to the runners and what this means for any per-computer limits).

Contributor Author

I have put a reference to the `JobManager` in the `JobsList` docstring and explained the bigger picture with the runners in the `JobManager` docstring.

Member

ltalirz commented Jun 29, 2019

Regarding backticks in Sphinx, I just checked and it simply prints the contents in italic:
https://aiida.readthedocs.io/projects/aiida-core/en/latest/apidoc/aiida.tools.data.html#module-aiida.tools.data.cif

I.e., if that is what you want, that's fine. Oftentimes, it's probably better to use something like :py:class:`~aiida.orm.Computer` (but it may not be necessary everywhere).
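
For illustration, here is a hypothetical docstring contrasting the two markups (the `submit` function is made up; only the Sphinx roles are the point):

```python
def submit(computer):
    """Submit a job to the given computer.

    Under Sphinx, single backticks like `Computer` render as italic by
    default, whereas :py:class:`~aiida.orm.Computer` produces a hyperlink
    to the class documentation (the `~` shortens the link text to just
    `Computer`).
    """
    raise NotImplementedError('illustrative only')
```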

@sphuber sphuber force-pushed the fix_2535_job_manager_minimum_poll_interval branch from 1a28ebd to fe4852a Compare July 1, 2019 07:20
Member

@ltalirz ltalirz left a comment


let's go!

Co-Authored-By: Leopold Talirz <leopold.talirz@gmail.com>
@sphuber sphuber force-pushed the fix_2535_job_manager_minimum_poll_interval branch from fe4852a to 3020dd8 Compare July 1, 2019 07:51
@sphuber sphuber merged commit a5b6477 into aiidateam:develop Jul 1, 2019
@sphuber sphuber deleted the fix_2535_job_manager_minimum_poll_interval branch July 1, 2019 08:18
Development

Successfully merging this pull request may close these issues.

Minimum scheduler polling interval not respected
3 participants