Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make the execmanager.submit_calculation idempotent'ish #3188

Merged

Conversation

sphuber
Copy link
Contributor

@sphuber sphuber commented Jul 18, 2019

Fixes #3183

The submit_calculation is responsible for submitting a job, that was
already uploaded to the remote machine, to the scheduler. There was a
risk of the daemon submitting the same job twice. If the first time
around it successfully submitted the job and set the job id as an
attribute on the node, but then was interrupted (for example due to a
shutdown) before it could transition the CalcJob to the UPDATE
transport task, once reloaded the command would be executed again. This
would not complain and simply submit the job again to the scheduler.
This is both a waste of resources and can cause various complications.

It is impossible to make the function fully idempotent, but by checking
if the job id attribute is already set for the node at the beginning and
returning it should minimize the risk of double submission.

The `submit_calculation` is responsible for submitting a job, that was
already uploaded to the remote machine, to the scheduler. There was a
risk of the daemon submitting the same job twice. If the first time
around it successfully submitted the job and set the job id as an
attribute on the node, but then was interrupted (for example due to a
shutdown) before it could transition the `CalcJob` to the `UPDATE`
transport task, once reloaded the command would be executed again. This
would not complain and simply submit the job again to the scheduler.
This is both a waste of resources and can cause various complications.

It is impossible to make the function fully idempotent, but by checking
if the job id attribute is already set for the node at the beginning and
returning it should minimize the risk of double submission.
@sphuber sphuber force-pushed the fix_3183_submit_calculation_idempotency branch from b44b0da to caceba0 Compare July 19, 2019 06:38
Copy link
Member

@giovannipizzi giovannipizzi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good!
Another thing to check when we address #2496 is that the set_job_id call is atomic and stores asap.

as you say this is not enough if the daemon gets shut down before that call but after submission.
The only option to make this really idempotent is to delegate the check to the scheduler, but this is beyond the scope of this PR here.

Eg have a wrapper for squeue that submits only once per folder (e. g. storing the jobid on a file in the folder, say _aiida_jobid.txt, and before submission it checks if it exists - if it's there it returns it, otherwise it proxies to squeue.

Not for now, but would it be possible/easy to get from slurm all the job id's you have submitted, say, in the past 1 week, and check if any of them is unknown to AiiDA? (e. g. meaning that even with this fix we would have run the simulation twice)?
aybe this can be tested also after this PR is merged.

@giovannipizzi
Copy link
Member

@sphuber sphuber merged commit b7a161d into aiidateam:develop Jul 19, 2019
@sphuber sphuber deleted the fix_3183_submit_calculation_idempotency branch July 19, 2019 07:57
@sphuber
Copy link
Contributor Author

sphuber commented Jul 19, 2019

Should be easy enough. To get a unique list of all job ids for given user since given data:

sacct -S 2019-06-18 -u sphuber --format=JobId -n | cut -d . -f1 | sort | uniq -u

this can then be compared with the job ids stored in the attributes.

d-tomerini pushed a commit to d-tomerini/aiida_core that referenced this pull request Sep 30, 2019
)

The `submit_calculation` is responsible for submitting a job, that was
already uploaded to the remote machine, to the scheduler. There was a
risk of the daemon submitting the same job twice. If the first time
around it successfully submitted the job and set the job id as an
attribute on the node, but then was interrupted (for example due to a
shutdown) before it could transition the `CalcJob` to the `UPDATE`
transport task, once reloaded the command would be executed again. This
would not complain and simply submit the job again to the scheduler.
This is both a waste of resources and can cause various complications.

It is impossible to make the function fully idempotent, but by checking
if the job id attribute is already set for the node at the beginning and
returning it should minimize the risk of double submission.
d-tomerini pushed a commit to d-tomerini/aiida_core that referenced this pull request Oct 16, 2019
)

The `submit_calculation` is responsible for submitting a job, that was
already uploaded to the remote machine, to the scheduler. There was a
risk of the daemon submitting the same job twice. If the first time
around it successfully submitted the job and set the job id as an
attribute on the node, but then was interrupted (for example due to a
shutdown) before it could transition the `CalcJob` to the `UPDATE`
transport task, once reloaded the command would be executed again. This
would not complain and simply submit the job again to the scheduler.
This is both a waste of resources and can cause various complications.

It is impossible to make the function fully idempotent, but by checking
if the job id attribute is already set for the node at the beginning and
returning it should minimize the risk of double submission.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make the execmanager.submit_calculation idempotent as best as possible
2 participants