
Implement Scheduler.parse_output for the SlurmScheduler plugin #3931

Merged: 1 commit into aiidateam:develop from feature/2431/slurm-parse-output on Sep 10, 2020

Conversation

sphuber (Contributor) commented on Apr 14, 2020

This implements the `Scheduler.parse_output` method that allows parsing
the detailed job info that is retrieved from the scheduler when a job is
finished. For the time being, only an out-of-memory error is detected
and the corresponding exit code is returned.
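For readers unfamiliar with the mechanism, the sketch below illustrates the kind of check this adds. It is a simplified, standalone version: the actual method is `SlurmScheduler.parse_output` in `aiida/schedulers/plugins/slurm.py` and returns an AiiDA exit code, and the exact pipe-separated layout of the `sacct` output assumed here is an illustration, not a guarantee of the real format.

```python
def parse_slurm_detailed_job_info(detailed_job_info):
    """Check the captured `sacct` output of a finished job for an OOM kill.

    :param detailed_job_info: dict with keys `retval`, `stdout` and `stderr`
        holding the result of the `sacct` call made after job completion.
    :return: 'ERROR_SCHEDULER_OUT_OF_MEMORY' if the scheduler reported that
        the job was killed for exceeding its memory limit, None otherwise.
    """
    if detailed_job_info.get('retval') != 0:
        return None  # the `sacct` call itself failed, nothing to parse

    lines = detailed_job_info['stdout'].splitlines()
    if len(lines) < 2:
        return None  # expect a header row followed by at least one data row

    # Assume parsable output: a pipe-separated header row and value row.
    fields = dict(zip(lines[0].split('|'), lines[1].split('|')))

    if fields.get('State') == 'OUT_OF_MEMORY':
        return 'ERROR_SCHEDULER_OUT_OF_MEMORY'

    return None
```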

codecov bot commented on Apr 14, 2020

Codecov Report

Merging #3931 into develop will increase coverage by 0.02%.
The diff coverage is 95.84%.


@@             Coverage Diff             @@
##           develop    #3931      +/-   ##
===========================================
+ Coverage    79.32%   79.33%   +0.02%     
===========================================
  Files          468      468              
  Lines        34713    34737      +24     
===========================================
+ Hits         27532    27555      +23     
- Misses        7181     7182       +1     
Flag          Coverage Δ
#django       72.95% <95.84%> (+0.02%) ⬆️
#sqlalchemy   72.16% <95.84%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files                       Coverage Δ
aiida/schedulers/plugins/slurm.py    62.03% <95.84%> (+3.09%) ⬆️
aiida/engine/daemon/runner.py        79.32% <0.00%> (-3.44%) ⬇️
aiida/transports/plugins/local.py    81.54% <0.00%> (+0.26%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e974d3f...a1b5729.

sphuber force-pushed the feature/2431/slurm-parse-output branch 3 times, most recently from 959da78 to 05dd0ae, on April 15, 2020 21:24
sphuber added the pr/blocked label on Jun 4, 2020
sphuber force-pushed the feature/2431/slurm-parse-output branch 2 times, most recently from b5c030e to 9a4d224, on June 7, 2020 17:26
mbercx (Member) commented on Jun 17, 2020

@sphuber I've run some tests using the `PwBaseWorkChain` and observed the expected behaviour for calculations that ran out of walltime. For the out-of-memory issues, I haven't been able to produce a run that finishes with `State` equal to `OUT_OF_MEMORY`; often the job will only print an indication of memory issues to stderr (e.g. a `cudaErrorMemoryAllocation` error). However, such checks could of course easily be added to the `SlurmScheduler.parse_output` method.

One question I had was whether we should also allow the `parse_output` method to return parsed data: for example, the maximum memory used on the GPU (`maxmem`), which I currently extract after the runs have completed from the `_scheduler-stdout.txt` in the `FolderData` (a sketch of this follows below).

Regarding the discussion in #3906 on whether to exit when the scheduler returns a non-zero exit code: I'd also vote to maintain backwards compatibility and only add a warning to the log for now.
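On the `maxmem` point above: the post-hoc extraction can be done directly from the retrieved folder. A minimal sketch, assuming the AiiDA 1.x API and a machine-specific stdout line like `maxmem: 1234`; the node pk and the regular expression are hypothetical. Run it e.g. in a `verdi shell`:

```python
import re

from aiida.orm import load_node

node = load_node(1234)  # hypothetical pk of a finished CalcJobNode
stdout = node.outputs.retrieved.get_object_content('_scheduler-stdout.txt')

# The format of this line is specific to the machine's job epilogue.
match = re.search(r'maxmem:\s*(\d+)', stdout)
if match is not None:
    maxmem = int(match.group(1))
    print(f'maximum memory used: {maxmem}')
```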

sphuber force-pushed the feature/2431/slurm-parse-output branch from 9a4d224 to fe31e42 on June 23, 2020 16:02
sphuber (Contributor, Author) commented on Jun 23, 2020

> @sphuber I've run some tests using the `PwBaseWorkChain` and observed the expected behaviour for calculations that ran out of walltime. For the out-of-memory issues, I haven't been able to produce a run that finishes with `State` equal to `OUT_OF_MEMORY`; often the job will only print an indication of memory issues to stderr (e.g. a `cudaErrorMemoryAllocation` error). However, such checks could of course easily be added to the `SlurmScheduler.parse_output` method.

The problem is that this error does not come from the SLURM scheduler but from the GPU, specifically the Nvidia card used on the Piz Daint XC50 nodes. To me that means we should not be parsing for this in the `SlurmScheduler` plugin. It should only ever parse data that is specific to the SLURM scheduler, independent of the code or platform on which the job was run. Because if we don't, where do we stop? Do we start adding parsing for all sorts of architectures and codes in all scheduler plugins?

> One question I had was whether we should also allow the `parse_output` method to return parsed data: for example, the maximum memory used on the GPU (`maxmem`), which I currently extract after the runs have completed from the `_scheduler-stdout.txt` in the `FolderData`.

Same point as before: this information has nothing to do with the SLURM scheduler, so I don't think the logic to parse this data should be baked in there. Somebody running a job on a different machine that also uses a SLURM scheduler might not get this information.
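To make the design point concrete: machine-specific checks like the `cudaErrorMemoryAllocation` one could live in a separate scheduler plugin that subclasses the generic one. The class below is purely hypothetical and not part of this PR; the `parse_output` signature follows the one implemented here.

```python
from aiida.engine import CalcJob
from aiida.schedulers.plugins.slurm import SlurmScheduler


class PizDaintSlurmScheduler(SlurmScheduler):
    """Hypothetical scheduler plugin with checks specific to one machine."""

    def parse_output(self, detailed_job_info, stdout, stderr):
        # First apply the generic SLURM checks of the base plugin.
        exit_code = super().parse_output(detailed_job_info, stdout, stderr)
        if exit_code is not None:
            return exit_code

        # Machine specific: the CUDA runtime reports failed GPU allocations
        # on stderr rather than through the scheduler's job state.
        if 'cudaErrorMemoryAllocation' in stderr:
            return CalcJob.exit_codes.ERROR_SCHEDULER_OUT_OF_MEMORY

        return None
```

Registered under its own entry point, such a subclass keeps the platform knowledge with the people who run that platform, while the base plugin stays platform agnostic.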

sphuber force-pushed the feature/2431/slurm-parse-output branch from fe31e42 to 57f0918 on August 27, 2020 14:53
sphuber added the pr/ready-for-review label and removed the pr/blocked label on Aug 27, 2020
mbercx self-requested a review on September 9, 2020 07:49
mbercx (Member) commented on Sep 9, 2020

Sorry for taking so long to pick this up again!

> Same point as before: this information has nothing to do with the SLURM scheduler, so I don't think the logic to parse this data should be baked in there. Somebody running a job on a different machine that also uses a SLURM scheduler might not get this information.

Fair enough! That all makes sense to me (now 😅). I suppose at some point we can introduce "architecture-related" parsing, but I agree that this should not be done here.

I've gone through the code again and have no further comments. I also have several calcjobs with 110 and 120 exit codes in my database right now, so it's been field-tested quite extensively. :) The `ERROR_SCHEDULER_OUT_OF_MEMORY` exit code is picked up pretty consistently when I run on the CPUs of Piz Daint's hybrid nodes.
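As a usage note, jobs that ended with these codes can be listed with the QueryBuilder. A sketch assuming the AiiDA 1.x API, where 110 and 120 are the out-of-memory and out-of-walltime scheduler exit codes referenced above:

```python
from aiida.orm import CalcJobNode, QueryBuilder

builder = QueryBuilder()
builder.append(
    CalcJobNode,
    filters={'attributes.exit_status': {'in': [110, 120]}},
    project=['id', 'attributes.exit_status'],
)

for pk, exit_status in builder.all():
    print(f'CalcJob<{pk}> finished with exit status {exit_status}')
```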

sphuber force-pushed the feature/2431/slurm-parse-output branch from 57f0918 to a1b5729 on September 10, 2020 16:26
sphuber merged commit ec32fbb into aiidateam:develop on Sep 10, 2020
sphuber deleted the feature/2431/slurm-parse-output branch on September 10, 2020 17:18