
Fix #1913 Added minimum scheduler poll interval #1929

Merged
merged 2 commits into aiidateam:develop from fix_1913_job_update_batching on Oct 24, 2018

Conversation

muhrin
Contributor

@muhrin muhrin commented Aug 31, 2018

Fix #1913

The computer now has a property that can be set to regulate the interval at which the scheduler command to get jobs may be called. This acts as a minimum and may in practice be longer if the transport's minimum open interval is longer.

With these changes, all jobs in a single runner on the same computer also have their job-list requests batched, so there should be a significant decrease in the number of calls asking the scheduler to list the currently running jobs.
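For illustration, a minimal sketch of how the new property might be set from the Python API (the setter name mirrors the `get_minimum_job_poll_interval` getter added in this diff; the exact import path and loading call depend on the AiiDA version):

```python
from aiida.orm import Computer  # import path may vary between AiiDA versions

# Load an existing, configured computer ('localhost' is just an example name).
computer = Computer.get('localhost')

# Ask the scheduler for job updates at most once every 30 seconds. The
# effective interval can still be longer if the transport's minimum open
# interval exceeds this value.
computer.set_minimum_job_poll_interval(30.0)
print(computer.get_minimum_job_poll_interval())
```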

@muhrin muhrin requested a review from sphuber August 31, 2018 16:43
@@ -0,0 +1,235 @@
import contextlib
from future.utils import iteritems, itervalues
Contributor

FYI: In the Python 3 PR I dropped future completely in favor of six, which has an iteritems as well. But I dropped it in most cases in favor of just using d.items().
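For reference, the two equivalent spellings mentioned here (a small illustrative snippet, not code from this PR):

```python
import six

d = {'a': 1, 'b': 2}

# six helper, mirroring Python 2's dict.iteritems()
for key, value in six.iteritems(d):
    print(key, value)

# plain d.items() works on both Python 2 and 3
# (returns a list on 2, a view on 3)
for key, value in d.items():
    print(key, value)
```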

Contributor Author

Thanks Tiziano! I'll convert over to six.

@muhrin muhrin changed the title Fix #1913 Added minimum scheduler poll interval [WIP] Fix #1913 Added minimum scheduler poll interval Sep 2, 2018
@muhrin muhrin force-pushed the fix_1913_job_update_batching branch 6 times, most recently from 4563fda to 183c0a7 Compare September 3, 2018 10:26
@muhrin muhrin changed the title [WIP] Fix #1913 Added minimum scheduler poll interval Fix #1913 Added minimum scheduler poll interval Sep 3, 2018
@codecov-io

codecov-io commented Sep 3, 2018

Codecov Report

Merging #1929 into develop will decrease coverage by 0.47%.
The diff coverage is 39.23%.


@@             Coverage Diff             @@
##           develop    #1929      +/-   ##
===========================================
- Coverage    67.61%   67.14%   -0.48%     
===========================================
  Files          324      321       -3     
  Lines        33305    33291      -14     
===========================================
- Hits         22520    22354     -166     
- Misses       10785    10937     +152
| Impacted Files | Coverage Δ |
|---|---|
| aiida/transport/transport.py | 63.15% <ø> (ø) ⬆️ |
| aiida/daemon/execmanager.py | 9.42% <0%> (ø) ⬆️ |
| aiida/orm/implementation/sqlalchemy/authinfo.py | 88.57% <100%> (ø) ⬆️ |
| aiida/scheduler/__init__.py | 70.27% <100%> (ø) ⬆️ |
| aiida/work/transports.py | 96.36% <100%> (-0.19%) ⬇️ |
| aiida/orm/implementation/django/authinfo.py | 81.33% <100%> (ø) ⬆️ |
| aiida/work/job_calcs.py | 25.77% <25.77%> (ø) |
| aiida/orm/implementation/general/computer.py | 66.29% <66.66%> (ø) ⬆️ |
| aiida/work/runners.py | 91.51% <75%> (-0.42%) ⬇️ |
| aiida/orm/authinfo.py | 66.66% <75%> (ø) ⬆️ |

... and 35 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a94294f...f142eba.

@dev-zero
Contributor

dev-zero commented Sep 3, 2018

@muhrin GitHub does not allow me to add inline comments for some reason. Why do you exclude aiida/work/utils.py from the python-modernize run?

yield sleep(interval)

import traceback
traceback.print_exc()
Contributor

Do we always want the print here? It will be dumped once it hits the exception handling of the process, no?

@@ -513,6 +516,27 @@ def set_default_mpiprocs_per_machine(self, def_cpus_per_machine):
raise TypeError("def_cpus_per_machine must be an integer (or None)")
self._set_property("default_mpiprocs_per_machine", def_cpus_per_machine)

def get_minimum_job_poll_interval(self):
Contributor

When I saw you used metadata as the container for this info, I was worried that this would be useless: since that is a column of the DbComputer table, it should be immutable once stored, which would mean it could only be set once during setup. I then checked the code and saw that the is-stored check was commented out. So apparently this is now possible; however, I wonder if it is the right thing. I don't think there is another viable option, but I am just pointing it out here so we can discuss and make sure this is what we want.

Contributor Author

So I suspect this metadata is basically the extras of Computer, in the sense that it can be updated (intentionally) even after storing.

scheduler.set_transport(transport)

kwargs = {'as_dict': True}
if scheduler.get_feature('can_query_by_user'):
Contributor

What happens if the scheduler cannot query by user? It used to default to querying for a specific job id; I don't see how that is possible anymore.

Contributor Author

Querying by job id is no longer necessary at this point in the code, because it queries all jobs (and caches the results); in a way, that's the point of this class. It's still useful to query by user where possible, just to reduce the number of results.
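To illustrate the batching/caching idea described here (a simplified, synchronous sketch with hypothetical names, not the actual `JobsList` API; it assumes a scheduler object with a getJobs(as_dict=True) method as in AiiDA's scheduler interface):

```python
import time


class BatchedJobPoller(object):
    """Sketch: one scheduler query serves every job watched by a runner."""

    def __init__(self, scheduler, minimum_poll_interval):
        self._scheduler = scheduler
        self._minimum_poll_interval = minimum_poll_interval
        self._last_poll = None
        self._cache = {}  # job id -> job info from the most recent poll

    def get_job_info(self, job_id):
        """Return cached info for job_id, refreshing at most once per interval."""
        now = time.time()
        if self._last_poll is None or now - self._last_poll >= self._minimum_poll_interval:
            # A single call retrieves all jobs; every caller shares the result.
            self._cache = self._scheduler.getJobs(as_dict=True)
            self._last_poll = now
        return self._cache.get(job_id)
```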

Contributor

That I understood, but given that this scheduler.get_feature('can_query_by_user') check exists, I assumed that certain schedulers cannot query by user. If that is in fact the case, this user-wide polling will break. So either certain schedulers cannot query on a per-user basis and this will break, or all of them can, in which case this check is superfluous and should be removed.

Contributor Author
@muhrin muhrin Sep 24, 2018

Ah, I made an assumption here that if you don't specify a user then it will query jobs from all users - if that's not true then you're right that this could cause a problem. Let me check...

@muhrin
Contributor Author

muhrin commented Sep 18, 2018

> @muhrin GitHub does not allow me to add inline comments for some reason. Why do you exclude aiida/work/utils.py from the python-modernize run?

Thanks @dev-zero, I've added a comment to the pre-commit file.

@coveralls

coveralls commented Sep 18, 2018

Pull Request Test Coverage Report for Build 3837

  • 84 of 203 (41.38%) changed or added relevant lines in 11 files are covered.
  • 2836 unchanged lines in 64 files lost coverage.
  • Overall coverage decreased (-1.2%) to 66.736%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| aiida/daemon/execmanager.py | 0 | 1 | 0.0% |
| aiida/work/runners.py | 3 | 4 | 75.0% |
| aiida/orm/implementation/general/computer.py | 4 | 6 | 66.67% |
| aiida/orm/implementation/sqlalchemy/authinfo.py | 0 | 2 | 0.0% |
| aiida/orm/authinfo.py | 9 | 12 | 75.0% |
| aiida/work/utils.py | 35 | 44 | 79.55% |
| aiida/work/job_processes.py | 0 | 29 | 0.0% |
| aiida/work/job_calcs.py | 25 | 97 | 25.77% |
| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| aiida/common/datastructures.py | 1 | 95.74% |
| aiida/daemon/execmanager.py | 1 | 9.57% |
| aiida/control/computer.py | 1 | 83.72% |
| aiida/common/ipython/ipython_magics.py | 1 | 0.0% |
| aiida/orm/implementation/general/group.py | 1 | 61.62% |
| aiida/common/extendeddicts.py | 1 | 91.84% |
| aiida/restapi/common/utils.py | 1 | 66.76% |
| aiida/orm/implementation/django/group.py | 1 | 87.57% |
| aiida/common/additions/backup_script/backup_setup.py | 2 | 81.32% |
| aiida/orm/data/cif.py | 2 | 80.57% |
Totals:

  • Change from base Build 3832: -1.2%
  • Covered Lines: 23828
  • Relevant Lines: 35705

💛 - Coveralls

@muhrin muhrin changed the title Fix #1913 Added minimum scheduler poll interval [WIP] Fix #1913 Added minimum scheduler poll interval Oct 3, 2018
@muhrin muhrin mentioned this pull request Oct 3, 2018
@muhrin
Contributor Author

muhrin commented Oct 16, 2018

Ok, we are now looking up jobs by job id when we can't query by user. This is something we need to test in the field on each of these schedulers.
Current status:

  • slurm
  • lsf
  • pbspro
  • torque

The others (i.e. direct and sge) can query by user.

Does anyone have the ability to test any of these?

@coveralls

coveralls commented Oct 17, 2018

Coverage Status

Coverage increased (+0.01%) to 68.314% when pulling 597430f on muhrin:fix_1913_job_update_batching into 089f8d4 on aiidateam:develop.

@muhrin muhrin force-pushed the fix_1913_job_update_batching branch from 1bc56a8 to ee51ec1 Compare October 18, 2018 13:01
@muhrin muhrin changed the title [WIP] Fix #1913 Added minimum scheduler poll interval Fix #1913 Added minimum scheduler poll interval Oct 23, 2018
@muhrin
Contributor Author

muhrin commented Oct 23, 2018

According to the chat with Giovanni at the last meeting, the schedulers are indeed capable of (and designed for) taking a list of job ids, so this puppy should be good to go.

muhrin and others added 2 commits October 24, 2018 14:52
Now the computer has a property that can be set to regulate the interval at which the scheduler command to get jobs may be called. This acts as a minimum and may in practice be longer if the transport's minimum open interval is longer.

With these changes, all jobs in a single runner on the same computer also have their job-list requests batched, so there should be a significant decrease in the number of calls asking the scheduler to list the currently running jobs.
By default the `JobsList` will get the status of running jobs by requesting the status for the user associated with the auth info; however, some schedulers do not support this functionality. Instead, one has to query for a list of job ids, in which case the `JobsList` will simply get the job id list from the internal mapping it keeps.
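Schematically, the behaviour described in this commit message looks something like the following (a simplified, synchronous sketch; the real `JobsList` is asynchronous, and the exact getJobs signature may differ between AiiDA versions):

```python
def query_scheduler(scheduler, watched_job_ids):
    """Sketch: query by user when the scheduler supports it, else by job ids."""
    kwargs = {'as_dict': True}
    if scheduler.get_feature('can_query_by_user'):
        # One query covers all jobs of the user tied to the auth info.
        kwargs['user'] = '$USER'
    else:
        # Fall back to the job ids currently tracked in the internal mapping.
        kwargs['jobs'] = list(watched_job_ids)
    return scheduler.getJobs(**kwargs)
```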
@sphuber sphuber force-pushed the fix_1913_job_update_batching branch from ee51ec1 to 597430f Compare October 24, 2018 12:54
Contributor
@sphuber sphuber left a comment

A-mazing

@sphuber sphuber merged commit 7854680 into aiidateam:develop Oct 24, 2018
@sphuber sphuber deleted the fix_1913_job_update_batching branch October 24, 2018 13:51