`CalcJob`: move job resource validation to the `Scheduler` class #4192

sphuber · 2020-06-21T18:49:45Z

This is necessary because the resource validation is scheduler
dependent. Up till now, the CalcJob defined a validator on the entire
input namespace that validates the metadata.options.resources input.
Specifically, it demanded that the num_machines keyword was defined,
which is indeed a requirement for the NodeNumberJobResource, used for
schedulers like SLURM, however, this field doesn't even make sense for
schedulers like LSF and SGE that use the ParEnvJobResource type of job
resources.

The solution is to delegate the validation of the resources to the
scheduler which in turn delegates it to the JobResource class that it
uses. The validation used to happend on construction of the job resource
instance, but is factored out to the validate_resources classmethod.
This allows it to be called without having to construct an instance
which is more efficient.

Finally, the signature of the CalcJob input validators is changed and
no longer raise an InputValidationError but instead return an error
message as the interface of Port.validator requires. The port itself
will detect if an error message is returned and raise a ValueError
with all the relevant error messages.

codecov · 2020-06-21T19:03:34Z

Codecov Report

Merging #4192 into develop will increase coverage by 0.08%.
The diff coverage is 90.91%.

@@             Coverage Diff             @@
##           develop    #4192      +/-   ##
===========================================
+ Coverage    78.96%   79.03%   +0.08%     
===========================================
  Files          467      467              
  Lines        34511    34492      -19     
===========================================
+ Hits         27248    27259      +11     
+ Misses        7263     7233      -30

Flag	Coverage Δ
#django	`70.96% <90.91%> (+0.08%)`	⬆️
#sqlalchemy	`71.83% <90.91%> (+0.08%)`	⬆️

Impacted Files	Coverage Δ
aiida/engine/processes/calcjobs/calcjob.py	`82.02% <76.93%> (-0.51%)`	⬇️
aiida/schedulers/plugins/slurm.py	`52.50% <83.34%> (ø)`
aiida/schedulers/plugins/pbsbaseclasses.py	`59.16% <90.00%> (ø)`
aiida/schedulers/scheduler.py	`76.29% <90.25%> (+5.70%)`	⬆️
aiida/schedulers/datastructures.py	`93.80% <98.47%> (+13.12%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f8be9de...80fef5e. Read the comment docs.

sphuber · 2020-06-21T19:09:58Z

N.B.: this PR assumes that there are no schedulers out there in the wild that subclass their own JobResource. If that is the case, the code will break, because the validate_resources is an abstractclassmethod and therefore the custom subclasses won't be instantiable until they implement it. If we think this is a problem, we could make it concrete and just pass. We do need to be able to notify subclasses that they need to start implementing this method at some point in that case.

Second important point: the ArithmeticAddCalculation implementation specifies {'num_machines': 1, 'num_mpiprocs_per_machine': 1} as default for the metadata.options.resources. This was done because this class is used often for demonstrations and tutorials, and not having to specify the resources simplifies the examples. However, we now know that this default is incorrect as it only caters to schedulers with job resources of the type NodeNumberJobResource, which is not always the case. So really we have to remove this default, but this will impact the simplicity of the tutorial and documentation. What do we do here?

sphuber · 2020-06-22T09:49:09Z

@giovannipizzi it would probably be good to ship this with 1.3.0 given that we have had multiple users face this problem. I have added a second commit on top of the original one, which contains just some cleaning of the scheduler base class and job resource implementations. Especially the validate_resources of the node number sub class I have significantly refactored in an attempt to make it more readable. I found the original nested if-blocks very difficult to grok.

giovannipizzi

Minor change in the docs, and a question

giovannipizzi · 2020-06-23T09:32:04Z

aiida/engine/processes/calcjobs/calcjob.py

+        tot_num_mpiprocs = resources.get('tot_num_mpiprocs', None)
+        if num_mpiprocs_per_machine is None and tot_num_mpiprocs is None and default_mpiprocs_per_machine is not None:
+            # Only set the default value if tot_num_mpiprocs is not provided. Otherwise, it means that the user provided
+            # both `num_machines` and `tot_num_mpiprocs`, and we have to ignore the default value of `tot_num_mpiprocs`.


Suggested change

# both `num_machines` and `tot_num_mpiprocs`, and we have to ignore the default value of `tot_num_mpiprocs`.

# both `num_machines` and `tot_num_mpiprocs`, and we have to ignore the default value of `num_mpiprocs_per_machine`.

giovannipizzi · 2020-06-23T09:35:24Z

aiida/engine/processes/calcjobs/calcjob.py

-        def_cpus_machine = computer.get_default_mpiprocs_per_machine()
-        if def_cpus_machine is not None:
-            resources['default_mpiprocs_per_machine'] = def_cpus_machine
+        default_mpiprocs_per_machine = computer.get_default_mpiprocs_per_machine()


The general doubt I have here is if this is a general logic, or still specific to some scheduler types. Do we have tests for a parenv job resource, or can we ask someone (@zhubonan, @atztogo, ...) to check if this new logic now works also for them?

I think you are right and this will fail. Even if the user does not explicitly specify num_mpiprocs_per_machine and tot_num_mpiprocs, if the computer has default_mpiprocs_per_machine defined (which I think currently is almost always the case, or at least, verdi computer setup suggests to set it), then num_mpiprocs_per_machine will be added to the resources and since I changed ParEnvJobResource.validate_resources to raise when it gets unsupported keywords, this will fail.

I am not sure where and how to fix it. Easy hack would be to simply ignore this keyword in the ParEnvJobResource but is not the correct solution. If specifying a number of machines for a ParEnvJobResource based scheduler never makes sense, maybe we should not even have computers that are configured with such a scheduler, define a num_mpiprocs_per_machine.

This is necessary because the resource validation is scheduler dependent. Up till now, the `CalcJob` defined a validator on the entire input namespace that validates the `metadata.options.resources` input. Specifically, it demanded that the `num_machines` keyword was defined, which is indeed a requirement for the `NodeNumberJobResource`, used for schedulers like SLURM, however, this field doesn't even make sense for schedulers like LSF and SGE that use the `ParEnvJobResource` type of job resources. The solution is to delegate the validation of the resources to the scheduler which in turn delegates it to the `JobResource` class that it uses. The validation used to happend on construction of the job resource instance, but is factored out to the `validate_resources` classmethod. This allows it to be called without having to construct an instance which is more efficient. Finally, the signature of the `CalcJob` input validators is changed and no longer raise an `InputValidationError` but instead return an error message as the interface of `Port.validator` requires. The port itself will detect if an error message is returned and raise a `ValueError` with all the relevant error messages.

The `Scheduler.get_valid_schedulers` method is deprecated as the `aiida.plugins` module is reserved to inquire about installed plugins. Furthermore, it is properly marked as an abstract class which allows to remove the explicit `raise NotImplementedError` from methods that are abstract and have to be implemented by subclasses. Finally, there is some minor cleaning up of the code and docstrings. The implementation of the `validate_resources` method of the subclasses of `JobResource` has been improved and significantly simplified in the case of `NodeNumberJobResource` making the logic a lot more readable. Tests are added for the `ParEnvJobResource` job resource class which was completely untested.

sphuber · 2020-06-23T11:08:26Z

@giovannipizzi I have a added a new commit that I think should address the problem you mentioned. I added a test that launches a calculation job using a computer with SGE scheduler, and I confirmed that it failed as I explained in my response to your question. The fix in the last commit addresses the problem. I am not sure if there are still other failure points hiding out there. @atztogo and @zhubonan if you have the time to give this branch a spin with a real calculation on a LSF or SGE cluster, that would be great!

The `CalcJob.presubmit` method was always adding the resource keyword `num_mpiprocs_per_machine` if it was not already (implicitly) specified and the computer defined a default. However, this keyword is not valid for all schedulers, and for example, the SGE and LSF scheduler will raise if it is passed. Here we add the `Scheduler.preprocess_resources` method, which can be called and will properly take the type of job resources into account when deciding whether `num_mpiprocs_per_machine` should be defined.

giovannipizzi

Thanks! I'm approving it, but it would be great if @atztogo or @zhubonan or anybody else with an SGE scheduler could quickly try out if this branch works for them!

atztogo · 2020-06-23T13:15:37Z

Thank you very much @sphuber . I ran some calculations using 80fef5e , and went OK without num_machines with sge.

Minimal aiida-core version has to be fixed due to the recently introduced change in the aiida-core package moving the resources validation to the scheduler plugin. see aiidateam/aiida-core#4192

* Fix minimal aiida-core version to 1.3.0 Minimal aiida-core version has to be fixed due to the recently introduced change in the aiida-core package moving the resources validation to the scheduler plugin. see aiidateam/aiida-core#4192 * Fix python version to 3.7.7 in tests PyYAML causes issues only for tests running on python 3.7. A suggested change to fix this issue is fixing the python 3.7 version to the 3.7.7 minor release since test start to fail only for 3.7.8 see: actions/runner-images#1202

sphuber requested a review from giovannipizzi June 21, 2020 18:49

sphuber force-pushed the fix/3887/calcjob-scheduler-resources-validation branch from 45edcfe to 8699878 Compare June 21, 2020 18:59

sphuber force-pushed the fix/3887/calcjob-scheduler-resources-validation branch from 229ea91 to db62c84 Compare June 22, 2020 10:45

giovannipizzi reviewed Jun 23, 2020

View reviewed changes

sphuber added 2 commits June 23, 2020 12:19

sphuber requested a review from giovannipizzi June 23, 2020 11:06

sphuber force-pushed the fix/3887/calcjob-scheduler-resources-validation branch from db62c84 to fb0d103 Compare June 23, 2020 11:06

sphuber force-pushed the fix/3887/calcjob-scheduler-resources-validation branch from fb0d103 to 80fef5e Compare June 23, 2020 11:21

giovannipizzi approved these changes Jun 23, 2020

View reviewed changes

sphuber merged commit a97f725 into aiidateam:develop Jun 23, 2020

sphuber deleted the fix/3887/calcjob-scheduler-resources-validation branch June 23, 2020 13:26

astamminger mentioned this pull request Jul 9, 2020

Fix minimal aiida-core version to 1.3.0 aiida-cusp/aiida-cusp#5

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`CalcJob`: move job resource validation to the `Scheduler` class #4192

`CalcJob`: move job resource validation to the `Scheduler` class #4192

sphuber commented Jun 21, 2020

codecov bot commented Jun 21, 2020 •

edited

sphuber commented Jun 21, 2020

sphuber commented Jun 22, 2020

giovannipizzi left a comment

giovannipizzi Jun 23, 2020

giovannipizzi Jun 23, 2020

sphuber Jun 23, 2020

sphuber commented Jun 23, 2020

giovannipizzi left a comment

atztogo commented Jun 23, 2020

	# both `num_machines` and `tot_num_mpiprocs`, and we have to ignore the default value of `tot_num_mpiprocs`.
	# both `num_machines` and `tot_num_mpiprocs`, and we have to ignore the default value of `num_mpiprocs_per_machine`.

CalcJob: move job resource validation to the Scheduler class #4192

CalcJob: move job resource validation to the Scheduler class #4192

Conversation

sphuber commented Jun 21, 2020

codecov bot commented Jun 21, 2020 • edited

Codecov Report

sphuber commented Jun 21, 2020

sphuber commented Jun 22, 2020

giovannipizzi left a comment

Choose a reason for hiding this comment

giovannipizzi Jun 23, 2020

Choose a reason for hiding this comment

giovannipizzi Jun 23, 2020

Choose a reason for hiding this comment

sphuber Jun 23, 2020

Choose a reason for hiding this comment

sphuber commented Jun 23, 2020

giovannipizzi left a comment

Choose a reason for hiding this comment

atztogo commented Jun 23, 2020

`CalcJob`: move job resource validation to the `Scheduler` class #4192

`CalcJob`: move job resource validation to the `Scheduler` class #4192

codecov bot commented Jun 21, 2020 •

edited