Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CalcJob: move job resource validation to the Scheduler class #4192

Conversation

sphuber
Copy link
Contributor

@sphuber sphuber commented Jun 21, 2020

Fixes #3887

This is necessary because the resource validation is scheduler
dependent. Up till now, the CalcJob defined a validator on the entire
input namespace that validates the metadata.options.resources input.
Specifically, it demanded that the num_machines keyword was defined,
which is indeed a requirement for the NodeNumberJobResource, used for
schedulers like SLURM, however, this field doesn't even make sense for
schedulers like LSF and SGE that use the ParEnvJobResource type of job
resources.

The solution is to delegate the validation of the resources to the
scheduler which in turn delegates it to the JobResource class that it
uses. The validation used to happend on construction of the job resource
instance, but is factored out to the validate_resources classmethod.
This allows it to be called without having to construct an instance
which is more efficient.

Finally, the signature of the CalcJob input validators is changed and
no longer raise an InputValidationError but instead return an error
message as the interface of Port.validator requires. The port itself
will detect if an error message is returned and raise a ValueError
with all the relevant error messages.

@sphuber sphuber force-pushed the fix/3887/calcjob-scheduler-resources-validation branch from 45edcfe to 8699878 Compare June 21, 2020 18:59
@codecov
Copy link

codecov bot commented Jun 21, 2020

Codecov Report

Merging #4192 into develop will increase coverage by 0.08%.
The diff coverage is 90.91%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #4192      +/-   ##
===========================================
+ Coverage    78.96%   79.03%   +0.08%     
===========================================
  Files          467      467              
  Lines        34511    34492      -19     
===========================================
+ Hits         27248    27259      +11     
+ Misses        7263     7233      -30     
Flag Coverage Δ
#django 70.96% <90.91%> (+0.08%) ⬆️
#sqlalchemy 71.83% <90.91%> (+0.08%) ⬆️
Impacted Files Coverage Δ
aiida/engine/processes/calcjobs/calcjob.py 82.02% <76.93%> (-0.51%) ⬇️
aiida/schedulers/plugins/slurm.py 52.50% <83.34%> (ø)
aiida/schedulers/plugins/pbsbaseclasses.py 59.16% <90.00%> (ø)
aiida/schedulers/scheduler.py 76.29% <90.25%> (+5.70%) ⬆️
aiida/schedulers/datastructures.py 93.80% <98.47%> (+13.12%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f8be9de...80fef5e. Read the comment docs.

@sphuber
Copy link
Contributor Author

sphuber commented Jun 21, 2020

N.B.: this PR assumes that there are no schedulers out there in the wild that subclass their own JobResource. If that is the case, the code will break, because the validate_resources is an abstractclassmethod and therefore the custom subclasses won't be instantiable until they implement it. If we think this is a problem, we could make it concrete and just pass. We do need to be able to notify subclasses that they need to start implementing this method at some point in that case.

Second important point: the ArithmeticAddCalculation implementation specifies {'num_machines': 1, 'num_mpiprocs_per_machine': 1} as default for the metadata.options.resources. This was done because this class is used often for demonstrations and tutorials, and not having to specify the resources simplifies the examples. However, we now know that this default is incorrect as it only caters to schedulers with job resources of the type NodeNumberJobResource, which is not always the case. So really we have to remove this default, but this will impact the simplicity of the tutorial and documentation. What do we do here?

@sphuber
Copy link
Contributor Author

sphuber commented Jun 22, 2020

@giovannipizzi it would probably be good to ship this with 1.3.0 given that we have had multiple users face this problem. I have added a second commit on top of the original one, which contains just some cleaning of the scheduler base class and job resource implementations. Especially the validate_resources of the node number sub class I have significantly refactored in an attempt to make it more readable. I found the original nested if-blocks very difficult to grok.

@sphuber sphuber force-pushed the fix/3887/calcjob-scheduler-resources-validation branch from 229ea91 to db62c84 Compare June 22, 2020 10:45
Copy link
Member

@giovannipizzi giovannipizzi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor change in the docs, and a question

tot_num_mpiprocs = resources.get('tot_num_mpiprocs', None)
if num_mpiprocs_per_machine is None and tot_num_mpiprocs is None and default_mpiprocs_per_machine is not None:
# Only set the default value if tot_num_mpiprocs is not provided. Otherwise, it means that the user provided
# both `num_machines` and `tot_num_mpiprocs`, and we have to ignore the default value of `tot_num_mpiprocs`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# both `num_machines` and `tot_num_mpiprocs`, and we have to ignore the default value of `tot_num_mpiprocs`.
# both `num_machines` and `tot_num_mpiprocs`, and we have to ignore the default value of `num_mpiprocs_per_machine`.

def_cpus_machine = computer.get_default_mpiprocs_per_machine()
if def_cpus_machine is not None:
resources['default_mpiprocs_per_machine'] = def_cpus_machine
default_mpiprocs_per_machine = computer.get_default_mpiprocs_per_machine()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The general doubt I have here is if this is a general logic, or still specific to some scheduler types. Do we have tests for a parenv job resource, or can we ask someone (@zhubonan, @atztogo, ...) to check if this new logic now works also for them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are right and this will fail. Even if the user does not explicitly specify num_mpiprocs_per_machine and tot_num_mpiprocs, if the computer has default_mpiprocs_per_machine defined (which I think currently is almost always the case, or at least, verdi computer setup suggests to set it), then num_mpiprocs_per_machine will be added to the resources and since I changed ParEnvJobResource.validate_resources to raise when it gets unsupported keywords, this will fail.

I am not sure where and how to fix it. Easy hack would be to simply ignore this keyword in the ParEnvJobResource but is not the correct solution. If specifying a number of machines for a ParEnvJobResource based scheduler never makes sense, maybe we should not even have computers that are configured with such a scheduler, define a num_mpiprocs_per_machine.

This is necessary because the resource validation is scheduler
dependent. Up till now, the `CalcJob` defined a validator on the entire
input namespace that validates the `metadata.options.resources` input.
Specifically, it demanded that the `num_machines` keyword was defined,
which is indeed a requirement for the `NodeNumberJobResource`, used for
schedulers like SLURM, however, this field doesn't even make sense for
schedulers like LSF and SGE that use the `ParEnvJobResource` type of job
resources.

The solution is to delegate the validation of the resources to the
scheduler which in turn delegates it to the `JobResource` class that it
uses. The validation used to happend on construction of the job resource
instance, but is factored out to the `validate_resources` classmethod.
This allows it to be called without having to construct an instance
which is more efficient.

Finally, the signature of the `CalcJob` input validators is changed and
no longer raise an `InputValidationError` but instead return an error
message as the interface of `Port.validator` requires. The port itself
will detect if an error message is returned and raise a `ValueError`
with all the relevant error messages.
The `Scheduler.get_valid_schedulers` method is deprecated as the
`aiida.plugins` module is reserved to inquire about installed plugins.
Furthermore, it is properly marked as an abstract class which allows to
remove the explicit `raise NotImplementedError` from methods that are
abstract and have to be implemented by subclasses. Finally, there is
some minor cleaning up of the code and docstrings.

The implementation of the `validate_resources` method of the subclasses
of `JobResource` has been improved and significantly simplified in the
case of `NodeNumberJobResource` making the logic a lot more readable.
Tests are added for the `ParEnvJobResource` job resource class which was
completely untested.
@sphuber sphuber force-pushed the fix/3887/calcjob-scheduler-resources-validation branch from db62c84 to fb0d103 Compare June 23, 2020 11:06
@sphuber
Copy link
Contributor Author

sphuber commented Jun 23, 2020

@giovannipizzi I have a added a new commit that I think should address the problem you mentioned. I added a test that launches a calculation job using a computer with SGE scheduler, and I confirmed that it failed as I explained in my response to your question. The fix in the last commit addresses the problem. I am not sure if there are still other failure points hiding out there. @atztogo and @zhubonan if you have the time to give this branch a spin with a real calculation on a LSF or SGE cluster, that would be great!

The `CalcJob.presubmit` method was always adding the resource keyword
`num_mpiprocs_per_machine` if it was not already (implicitly) specified
and the computer defined a default. However, this keyword is not valid
for all schedulers, and for example, the SGE and LSF scheduler will
raise if it is passed. Here we add the `Scheduler.preprocess_resources`
method, which can be called and will properly take the type of job
resources into account when deciding whether `num_mpiprocs_per_machine`
should be defined.
@sphuber sphuber force-pushed the fix/3887/calcjob-scheduler-resources-validation branch from fb0d103 to 80fef5e Compare June 23, 2020 11:21
Copy link
Member

@giovannipizzi giovannipizzi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I'm approving it, but it would be great if @atztogo or @zhubonan or anybody else with an SGE scheduler could quickly try out if this branch works for them!

@atztogo
Copy link
Contributor

atztogo commented Jun 23, 2020

Thank you very much @sphuber . I ran some calculations using 80fef5e , and went OK without num_machines with sge.

@sphuber sphuber merged commit a97f725 into aiidateam:develop Jun 23, 2020
@sphuber sphuber deleted the fix/3887/calcjob-scheduler-resources-validation branch June 23, 2020 13:26
astamminger added a commit to astamminger/aiida-cusp that referenced this pull request Jul 9, 2020
Minimal aiida-core version has to be fixed due to the recently
introduced change in the aiida-core package moving the resources
validation to the scheduler plugin.

see aiidateam/aiida-core#4192
astamminger added a commit to aiida-cusp/aiida-cusp that referenced this pull request Jul 9, 2020
* Fix minimal aiida-core version to 1.3.0

Minimal aiida-core version has to be fixed due to the recently
introduced change in the aiida-core package moving the resources
validation to the scheduler plugin.

see aiidateam/aiida-core#4192

* Fix python version to 3.7.7 in tests

PyYAML causes issues only for tests running on python 3.7.
A suggested change to fix this issue is fixing the python 3.7
version to the 3.7.7 minor release since test start to fail
only for 3.7.8

see: actions/runner-images#1202
astamminger added a commit to astamminger/aiida-cusp that referenced this pull request Mar 9, 2022
* Fix minimal aiida-core version to 1.3.0

Minimal aiida-core version has to be fixed due to the recently
introduced change in the aiida-core package moving the resources
validation to the scheduler plugin.

see aiidateam/aiida-core#4192

* Fix python version to 3.7.7 in tests

PyYAML causes issues only for tests running on python 3.7.
A suggested change to fix this issue is fixing the python 3.7
version to the 3.7.7 minor release since test start to fail
only for 3.7.8

see: actions/runner-images#1202
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Lsf scheduler wrong with num_machines
3 participants