
Respect the minimum polling interval for scheduler updates #3096


sphuber
Contributor

@sphuber sphuber commented Jun 28, 2019

Fixes #2535

The `aiida.engine.processes.calcjobs.manager.JobsList` container was
introduced for two purposes, related to managing running calculation
jobs in high-throughput mode:

  • It bundles the scheduler update calls for all active calculation jobs
    for a given authentication info (the combination of a computer and user)
  • It ensures that consecutive scheduler update calls are separated by at
    least the time given by the `get_minimum_job_poll_interval`
    defined by the computer of the authentication info.
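
These two guarantees can be sketched as follows. This is a minimal illustration, not the actual AiiDA implementation; `BatchedJobsPoller` and `query_scheduler` are hypothetical names:

```python
import time


class BatchedJobsPoller:
    """Illustrative sketch of a container that batches scheduler update calls
    and enforces a minimum interval between consecutive polls."""

    def __init__(self, minimum_poll_interval):
        self._minimum_poll_interval = minimum_poll_interval
        self._last_updated = None  # timestamp of the last scheduler query

    def _seconds_until_next_poll(self):
        """Return how long to wait before the scheduler may be queried again."""
        if self._last_updated is None:
            return 0.0
        elapsed = time.monotonic() - self._last_updated
        return max(0.0, self._minimum_poll_interval - elapsed)

    def update_jobs(self, query_scheduler):
        """Query the scheduler once for all active jobs, waiting out the
        minimum poll interval if the previous query was too recent."""
        delay = self._seconds_until_next_poll()
        if delay > 0:
            time.sleep(delay)
        result = query_scheduler()  # a single batched call for every active job
        self._last_updated = time.monotonic()  # crucial: record the query time
        return result
```

For example, with a minimum interval of 30 seconds, a burst of update requests for hundreds of jobs still results in at most one scheduler call every 30 seconds.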

However, the second requirement was not being respected. The problem was
twofold. First, the internal attribute `_last_updated` that records the
timestamp of the last update was not being updated after the scheduler
was queried for a status update.

The use of the `RefObjectStore` to create and delete these `JobsList`
instances, however, was an even bigger problem. The reference store
would delete a `JobsList` instance as soon as no one held a reference
to it any more, since the instances are created through the
`request_job_info_request` context manager of the `JobManager` that each
runner has. As soon as no requests were active, the object was deleted,
and with it the record of the last time the scheduler was queried. The
next time a request came in, a new `JobsList` would be created that
straight away called the scheduler, since it had no recollection of the
last time it was called for the given authinfo, if at all. As a result,
the minimum poll interval was essentially never respected.
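
The problematic lifecycle can be illustrated with a simplified, hypothetical sketch of such a reference-counted store (`RefCountedStore` is a made-up name, not the actual `RefObjectStore` code):

```python
import contextlib


class RefCountedStore:
    """Sketch of a store that deletes an object once no request holds it."""

    def __init__(self):
        self._objects = {}  # key -> (object, reference count)

    @contextlib.contextmanager
    def request(self, key, factory):
        """Hand out the object for `key`, creating it on first request and
        deleting it again when the last active request exits."""
        obj, count = self._objects.get(key, (None, 0))
        if obj is None:
            obj = factory()
        self._objects[key] = (obj, count + 1)
        try:
            yield obj
        finally:
            obj, count = self._objects[key]
            if count == 1:
                # Last reference gone: the object, and any state it carried
                # (such as a "last updated" timestamp), is discarded.
                del self._objects[key]
            else:
                self._objects[key] = (obj, count - 1)
```

Once the only active request exits, any state stored on the object is gone; the next request gets a fresh instance with no memory of the last poll.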

Fundamentally, the problem lies in the fact that the data that needs to
be persistent, the "last updated" timestamp, was being stored in the
`JobsList` container that by implementation was made non-persistent.
The solution is simply to make the `JobsList` instances live for as long
as the Python interpreter is alive. Each daemon runner creates a single
`JobManager` instance at startup, which will now create a new
`JobsList` only once for each authinfo and will keep returning the same
instance for the rest of its lifetime.
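
The caching described above amounts to a simple per-authinfo memoization; a minimal sketch under assumed names, not the actual AiiDA code:

```python
class JobsList:
    """Stand-in for the real container; only its identity and the fact that
    it carries the poll timestamp matter here."""

    def __init__(self, authinfo):
        self.authinfo = authinfo
        self.last_updated = None  # persists as long as this instance lives


class JobManagerSketch:
    """Create each JobsList at most once per authinfo and keep returning the
    same instance, so the 'last updated' timestamp is never lost."""

    def __init__(self):
        # Never cleared: entries live for the lifetime of the manager,
        # and hence of the interpreter.
        self._jobs_lists = {}

    def get_jobs_list(self, authinfo):
        """Return the cached JobsList for `authinfo`, creating it only once."""
        if authinfo not in self._jobs_lists:
            self._jobs_lists[authinfo] = JobsList(authinfo)
        return self._jobs_lists[authinfo]
```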

@sphuber sphuber force-pushed the fix_2535_job_manager_minimum_poll_interval branch from 520421b to 9c05c45 Compare June 28, 2019 12:51

coveralls commented Jun 28, 2019

Coverage Status

Coverage increased (+0.05%) to 74.017% when pulling 3020dd8 on sphuber:fix_2535_job_manager_minimum_poll_interval into 1578227 on aiidateam:develop.

@sphuber sphuber force-pushed the fix_2535_job_manager_minimum_poll_interval branch from 9c05c45 to 6e76d43 Compare June 28, 2019 13:56
Member

@ltalirz ltalirz left a comment


@sphuber Thanks a lot for fixing this!

My suggestions and questions mainly concern the docstrings.
One thing I noticed after the review is that I'm not sure whether the single backticks work in sphinx as you intend them to.
If they do, it's fine.

aiida/engine/processes/calcjobs/manager.py
authorisation information.
"""A list of submitted jobs on a machine connected to by transport based on the authorisation information.

This container of active calculation jobs is used to update their status periodically in batches. This ensure that

Suggested change
This container of active calculation jobs is used to update their status periodically in batches. This ensure that
This container of active calculation jobs is used to update their status periodically in batches,
making sure not to poll the job scheduler on the `Computer` more often than configured in its minimum polling interval (even when many jobs are running).

Continue from Note that..

class will guarantee that the time between update calls to the scheduler is larger or equal to that minimum
interval.

Note that since each instance operates on a specific `AuthInfo`, the guarantees of batching scheduler update calls
Member

Note that there will be one JobsList instance per daemon worker and per AuthInfo.
I.e. when, hypothetically, configuring the same computer for multiple users, and running jobs for different users simultaneously, the minimum polling interval would be respected only per user.

Contributor Author

I explicitly did not reference the "daemon runner" here, because this applies to any runner. Also, when you "run" a job, it creates a runner, which has its own job manager and its own job lists for each authentication info. Since all these extra conditions have nothing to do with the `JobsList`, I did not include them here. This behavior should rather be documented in a more general place; this is already done in the documentation of `get_minimum_job_poll_interval`.

Member

In this case, it still makes sense to mention that it applies to any runner individually.
This is not obvious.

Contributor Author

But I don't think that is a limitation of this class and therefore it should not be in this docstring. This class is simply used by the `JobManager`, which then makes sure there is one per `AuthInfo`. Even then, however, we do not have a guarantee that this is unique per authinfo: a runner could simply instantiate two `JobManager` instances and then the condition written in the docstring of the `JobsList` would be broken. I think it is just the wrong place for this information, as it is not enforced here at all.

Member

@ltalirz ltalirz Jun 30, 2019

I get your point - I'm just saying that, from a user perspective, the use case of running jobs using multiple authinfos on the same computer is quite esoteric compared to the highly relevant case of having multiple runners.
So, as a user reading this docstring, that is what I would like it to tell me.

Perhaps one way to resolve this is to mention that the JobList is per JobManager and put a link to the JobManager's documentation :py:class:`...JobManager` (which then should somehow mention its relation to the runners and what this means for any per-computer limits).

Contributor Author

I have put a reference to the `JobManager` in the `JobsList` docstring and explained the bigger picture with the runners in the `JobManager` docstring.

Member

ltalirz commented Jun 29, 2019

Regarding backticks in Sphinx, I just checked and it simply prints the contents in italic:
https://aiida.readthedocs.io/projects/aiida-core/en/latest/apidoc/aiida.tools.data.html#module-aiida.tools.data.cif

I.e., if that is what you want, that's fine. Oftentimes, it's probably better to use something like :py:class:`~aiida.orm.Computer` (but it may not be necessary everywhere).
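
For illustration, here is a hypothetical docstring contrasting the two markups (the `submit` function is made up; only the Sphinx roles are the point):

```python
def submit(computer):
    """Submit a job to the given computer.

    Under Sphinx, single backticks like `Computer` render as italic by
    default, whereas :py:class:`~aiida.orm.Computer` produces a hyperlink
    to the class documentation (the `~` shortens the link text to just
    `Computer`).
    """
    raise NotImplementedError('illustrative only')
```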

@sphuber sphuber force-pushed the fix_2535_job_manager_minimum_poll_interval branch from 1a28ebd to fe4852a Compare July 1, 2019 07:20
Member

@ltalirz ltalirz left a comment


let's go!

Co-Authored-By: Leopold Talirz <leopold.talirz@gmail.com>
@sphuber sphuber force-pushed the fix_2535_job_manager_minimum_poll_interval branch from fe4852a to 3020dd8 Compare July 1, 2019 07:51
@sphuber sphuber merged commit a5b6477 into aiidateam:develop Jul 1, 2019
@sphuber sphuber deleted the fix_2535_job_manager_minimum_poll_interval branch July 1, 2019 08:18
Development

Successfully merging this pull request may close these issues.

Minimum scheduler polling interval not respected
3 participants