Respect the minimum polling interval for scheduler updates #3096
Conversation
Force-pushed from 520421b to 9c05c45.
Force-pushed from 9c05c45 to 6e76d43.
@sphuber Thanks a lot for fixing this!
My suggestions and questions mainly concern the docstrings.
One thing I noticed after the review is that I'm not sure whether the single backticks work in sphinx as you intend them to.
If they do, it's fine.
    authorisation information.
    """A list of submitted jobs on a machine connected to by transport based on the authorisation information.

    This container of active calculation jobs is used to update their status periodically in batches. This ensure that
Suggested change:
- This container of active calculation jobs is used to update their status periodically in batches. This ensure that
+ This container of active calculation jobs is used to update their status periodically in batches, making sure not to poll the job scheduler on the `Computer` more often than configured in its minimum polling interval (even when many jobs are running).
Continuing from "Note that ...":
    class will guarantee that the time between update calls to the scheduler is larger or equal to that minimum
    interval.

    Note that since each instance operates on a specific `AuthInfo`, the guarantees of batching scheduler update calls
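The interval guarantee the docstring describes can be sketched roughly as follows. This is a hypothetical, simplified illustration with invented names (`JobsListSketch`, `seconds_until_next_update`), not the actual aiida-core implementation:

```python
import time
from typing import Optional


class JobsListSketch:
    """Hypothetical sketch of throttling scheduler updates to a minimum interval."""

    def __init__(self, minimum_poll_interval: float):
        self._minimum_poll_interval = minimum_poll_interval
        self._last_updated: Optional[float] = None  # timestamp of the last scheduler query

    def seconds_until_next_update(self, now: Optional[float] = None) -> float:
        """Return how long to wait before the scheduler may be queried again."""
        now = time.monotonic() if now is None else now
        if self._last_updated is None:
            return 0.0
        return max(0.0, self._last_updated + self._minimum_poll_interval - now)

    def update(self, now: Optional[float] = None) -> bool:
        """Query the scheduler only if the minimum interval has elapsed.

        Returns True if the (simulated) query was performed, False if skipped.
        """
        now = time.monotonic() if now is None else now
        if self.seconds_until_next_update(now) > 0:
            return False  # too soon: respect the minimum polling interval
        # ... here the real class would batch-query the scheduler for all jobs ...
        self._last_updated = now  # record the timestamp *after* the query
        return True
```

The key point, relevant to the bug fixed in this PR, is that `_last_updated` must be set after every query and must survive between calls; losing that state defeats the throttle.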
Note that there will be one `JobsList` instance per daemon worker and per `AuthInfo`. I.e. when, hypothetically, configuring the same computer for multiple users and running jobs for different users simultaneously, the minimum polling interval would be respected only per user.
I explicitly did not reference the "daemon runner" here, because this applies to any runner. Also, when you "run" a job, it creates a runner, which has its own job manager and its own job lists for each authentication info. Since all these extra conditions have nothing to do with the `JobsList`, I did not include them here. This behavior should rather be documented in a more general place. This is already done in the documentation of `get_minimum_job_poll_interval`.
In this case, it still makes sense to mention that it applies to any runner individually.
This is not obvious.
But I don't think that is a limitation of this class, and therefore it should not be in this docstring. This class is simply used by the `JobManager`, which then makes sure there is one per `AuthInfo`. However, even then we do not have a guarantee that this is unique per authinfo. A runner could just instantiate two `JobManager` instances, and then the condition written in the docstring of the `JobsList` would be broken. I think it is just the wrong place for this information, as it is not enforced here at all.
I get your point - I'm just saying that, from a user perspective, the use case of running jobs using multiple authinfos on the same computer is quite esoteric compared to the highly relevant case of having multiple runners.
So, as a user reading this docstring, that is what I would like it to tell me.
Perhaps one way to resolve this is to mention that the `JobsList` is per `JobManager` and put a link to the `JobManager`'s documentation, :py:class:`...JobManager` (which should then somehow mention its relation to the runners and what this means for any per-computer limits).
I have put a reference to the `JobManager` in the `JobsList` docstring and explained the bigger picture with the runners in the `JobManager` docstring.
Regarding backticks in Sphinx, I just checked and it simply prints the contents in italics, i.e. where that is what you want, that's fine. Oftentimes, it's probably better to use something like
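To illustrate the backtick discussion: in Sphinx's reStructuredText, single backticks invoke the default role (typically rendered in italics), while double backticks produce an inline code literal. A small hypothetical docstring showing both:

```python
def example():
    """Demonstrate the two inline markups in a reST docstring.

    Single backticks use Sphinx's default role (often italics): `JobsList`.
    Double backticks render as an inline literal (monospace): ``JobsList``.
    """
```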
Force-pushed from 1a28ebd to fe4852a.
let's go!
The `aiida.engine.processes.calcjobs.manager.JobsList` container was introduced for two purposes, related to managing running calculation jobs in high-throughput mode:

* It bundles the scheduler update calls for all active calculation jobs for given authentication info (the combination of a computer and user).
* It ensures that consecutive scheduler update calls are separated by at least the time given by `get_minimum_job_poll_interval`, as defined by the computer of the authentication info.

However, the final requirement was not being respected. The problem was twofold. The internal attribute `_last_updated`, which records the timestamp of the last update, was not being updated after the scheduler was queried for a status update.

The use of the `RefObjectStore` to create and delete these `JobsList` instances, however, was an even bigger problem. The reference store would delete the `JobsList` instance as soon as no-one held a reference to it any more, since they are created through the `request_job_info_request` context manager of the `JobManager` that each runner has. As soon as no requests were active, the object was deleted, and with it the record of the last time the scheduler was queried. The next time a request came in, a new `JobsList` would be created that straight away called the scheduler, as it had no recollection of the last time it was called for the given authinfo, if at all. This would result in the minimum poll interval essentially never being respected.

Fundamentally, the problem lies in the fact that the data that needs to be persistent, the "last updated" timestamp, was being stored in the container `JobsList` that by implementation was made non-persistent. The solution is to simply make the `JobsList` instances live for as long as the Python interpreter is alive. Each daemon runner creates a single `JobManager` instance on startup, and this will now only create a new `JobsList` once for each authinfo and will keep returning the same instance for the rest of its lifetime.

Co-Authored-By: Leopold Talirz <leopold.talirz@gmail.com>
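The fix described above amounts to caching the container per authinfo for the lifetime of the manager, so the polling state is never discarded. A minimal sketch under that assumption, with invented names (`JobManagerSketch`, a plain dict standing in for the real `JobsList` state):

```python
class JobManagerSketch:
    """Hypothetical sketch: one long-lived jobs container per authinfo key."""

    def __init__(self):
        # Maps an authinfo key to its container; entries are never deleted,
        # unlike the old RefObjectStore-based approach described above.
        self._jobs_lists = {}

    def get_jobs_list(self, authinfo_key):
        """Create the container at most once per authinfo and always return
        the same instance, so its 'last updated' timestamp is never lost."""
        if authinfo_key not in self._jobs_lists:
            self._jobs_lists[authinfo_key] = {'last_updated': None}
        return self._jobs_lists[authinfo_key]
```

Because the same instance is returned for every request, a later poll still sees the timestamp recorded by an earlier one, which is what restores the minimum-interval guarantee.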
Force-pushed from fe4852a to 3020dd8.
Fixes #2535