SlurmScheduler: Parse the NODE_FAIL state #5866

Merged
merged 1 commit into from Jan 25, 2023

Contributor

@sphuber sphuber commented Jan 23, 2023

Fixes #5865

If a job fails due to a node failure, SLURM will set the job's state to `NODE_FAIL`. The `SlurmScheduler.parse_output` method is updated to check for this state, in which case the `ERROR_SCHEDULER_NODE_FAILURE` exit code is returned. This is a new exit code defined on the `CalcJob` base class.
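For readers unfamiliar with the mechanism, here is a minimal standalone sketch of the state check described above. It is not the actual `SlurmScheduler.parse_output` implementation: the function `exit_code_label_for`, the `STATE_TO_EXIT_CODE` mapping, and the pipe-separated input format with a `State` column are assumptions made for illustration. Only the state names and exit code labels come from this PR and the neighbouring test.

```python
from typing import Optional

# Hypothetical lookup table from a SLURM job state to an exit code label.
# The labels are the names mentioned in this PR; the table itself is an
# illustration, not how aiida-core stores its exit codes.
STATE_TO_EXIT_CODE = {
    'OUT_OF_MEMORY': 'ERROR_SCHEDULER_OUT_OF_MEMORY',
    'NODE_FAIL': 'ERROR_SCHEDULER_NODE_FAILURE',
}


def exit_code_label_for(accounting_stdout: str) -> Optional[str]:
    """Return the exit code label for the job state, or None if not recognised.

    Assumes (for this sketch only) that the input has a pipe-separated header
    line followed by one pipe-separated data line, and that one of the header
    fields is called ``State``.
    """
    lines = [line for line in accounting_stdout.splitlines() if line.strip()]
    if len(lines) < 2:
        # Not enough information to determine the state.
        return None

    header = lines[0].split('|')
    values = lines[1].split('|')
    record = dict(zip(header, values))

    state = record.get('State', '').strip()
    return STATE_TO_EXIT_CODE.get(state)
```

With this sketch, `exit_code_label_for("JobID|State\n123|NODE_FAIL\n")` yields the node failure label, while a state that is not in the table, such as `COMPLETED`, yields `None`.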

@sphuber sphuber requested a review from ltalirz January 23, 2023 16:46
Member

@ltalirz ltalirz left a comment

great, thanks a lot!

@@ -431,6 +431,20 @@ def test_parse_out_of_memory():
    assert exit_code == CalcJob.exit_codes.ERROR_SCHEDULER_OUT_OF_MEMORY  # pylint: disable=no-member


def test_parse_node_failure():
    """Test that `ERROR_SCHEDULER_NODE_FAILURE` code is returne if `STATE`."""
Member

Suggested change
-    """Test that `ERROR_SCHEDULER_NODE_FAILURE` code is returne if `STATE`."""
+    """Test that `ERROR_SCHEDULER_NODE_FAILURE` code is returned if `STATE == 'NODE_FAIL'`."""

@sphuber sphuber force-pushed the feature/5865/node-failure-parsing branch from e268e54 to 65ff640 Compare January 24, 2023 08:50
@sphuber sphuber merged commit 65c1b32 into aiidateam:main Jan 25, 2023
@sphuber sphuber deleted the feature/5865/node-failure-parsing branch January 25, 2023 22:07
Development

Successfully merging this pull request may close these issues.

Add NODE_FAILURE exit code for CalcJob and add parsing to SlurmScheduler