Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SlurmScheduler: Detect broken submission scripts for invalid account #5850

Merged
merged 1 commit into from
Dec 21, 2022

Conversation

sphuber
Copy link
Contributor

@sphuber sphuber commented Dec 20, 2022

Fixes #2955

If an invalid combination of the account and partition options are
provided the submission will fail. Currently the scheduler plugin will
raise a generic exception causing the expontential backoff mechanism to
kick in. This is pointless, however, as the problem is not transient and
the submission will always fail, unless the scheduler script is updated,
which is not possible, since it breaks provenance.

The solution is to make use of the recently introduced feature for the
_parse_submit_output method to return an instance of ExitCode which
will trigger the engine to terminate the process. If an invalid account
or combination of account and partition are defined, the error will be:

Invalid account or account/partition combination specified

This error is printed to the stderr. When detected, the new exit code
ERROR_SCHEDULER_INVALID_ACCOUNT is returned.

The ERROR_SCHEDULER_INVALID_ACCOUNT exit code uses status 131. The
idea is that the range 130 - 139 is reserved for errors that occur when
the job script submission fails. The status 130 is kept open for a more
general exit code that may be defined in the future.

@sphuber
Copy link
Contributor Author

sphuber commented Dec 20, 2022

This PR is a follow-up to #5849 which should be reviewed and merged first. This PR then adds a commit on top which implements the new feature for the SlurmScheduler.

If an invalid combination of the `account` and `partition` options are
provided the submission will fail. Currently the scheduler plugin will
raise a generic exception causing the expontential backoff mechanism to
kick in. This is pointless, however, as the problem is not transient and
the submission will always fail, unless the scheduler script is updated,
which is not possible, since it breaks provenance.

The solution is to make use of the recently introduced feature for the
`_parse_submit_output` method to return an instance of `ExitCode` which
will trigger the engine to terminate the process. If an invalid account
or combination of account and partition are defined, the error will be:

    Invalid account or account/partition combination specified

This error is printed to the `stderr`. When detected, the new exit code
`ERROR_SCHEDULER_INVALID_ACCOUNT` is returned.

The `ERROR_SCHEDULER_INVALID_ACCOUNT` exit code uses status `131`. The
idea is that the range 130 - 139 is reserved for errors that occur when
the job script submission fails. The status 130 is kept open for a more
general exit code that may be defined in the future.
@sphuber sphuber force-pushed the feature/2955/slurm-invalid-account branch from dfc4da7 to d0903ac Compare December 20, 2022 16:26
Copy link
Member

@ltalirz ltalirz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great, looks good to me as well


if 'Invalid account' in stderr:
return CalcJob.exit_codes.ERROR_SCHEDULER_INVALID_ACCOUNT

raise SchedulerError(f'Error during submission, retval={retval}\nstdout={stdout}\nstderr={stderr}')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, so an exception-based mechanism is actually in place as well...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but this gets trapped low-down in the expontential backoff mechanism. This was exactly the direct source of the linked issue. This exception was caught and the engine kept retrying until pausing the job. The point of the exit code is that we need to get out of that, and bubble all the way up.

@@ -422,8 +422,14 @@ def _parse_submit_output(self, retval, stdout, stderr):

Return a string with the JobID.
"""
from aiida.engine import CalcJob
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could move to top of file

Copy link
Contributor

@espenfl espenfl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great @sphuber. Same comment as already given. Thanks a lot for adding this.

@sphuber sphuber merged commit d5ebf0e into aiidateam:main Dec 21, 2022
@sphuber sphuber deleted the feature/2955/slurm-invalid-account branch December 21, 2022 07:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Scheduler plugins need to be able to distinguish recoverable from irrecoverable errors
3 participants