Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use sbatch instead of srun to block execution #65

Merged
merged 1 commit into from
Apr 21, 2021
Merged

use sbatch instead of srun to block execution #65

merged 1 commit into from
Apr 21, 2021

Conversation

ChristopherBarrington
Copy link
Contributor

sbatch wraps the hostname command to run on the default partition, srun version may fail on clusters where the default partition does not accept interactive jobs

#64

sbatch wraps the hostname command to run on the default partition, srun version may fail on clusters where the default partition does not accept interactive jobs
@qdread
Copy link
Contributor

qdread commented Mar 1, 2021

Thanks for the PR! Currently I am getting errors when testing the PR on our cluster. I will try to debug.

@qdread
Copy link
Contributor

qdread commented Mar 1, 2021

I have been unable to debug this. The following command hangs indefinitely on our system:

sbatch --wait --wrap="hostname" --output=foo.txt

I will attempt to find a workaround but if you have any ideas that would be helpful.

@qdread
Copy link
Contributor

qdread commented Mar 1, 2021

@pmarchand1 if you have time to look at this, I was curious why wait_for_job submits a dummy job with a dependency to block the R process. Wouldn't it be more straightforward to do something like a while loop that would only break if the job is complete?

@ChristopherBarrington
Copy link
Contributor Author

ChristopherBarrington commented Mar 2, 2021

I have been unable to debug this. The following command hangs indefinitely on our system:

sbatch --wait --wrap="hostname" --output=foo.txt

I will attempt to find a workaround but if you have any ideas that would be helpful.

That is very strange. I'm not sure what different configurations our clusters have - the command posted above works as expected for me. Sorry I can't be of much help there.

Edit to add that we use Slurm 18:

> sinfo -V
slurm 18.08.8

@qdread
Copy link
Contributor

qdread commented Mar 2, 2021

@ChristopherBarrington thanks that's helpful. We are still on version 15 from 2016 so maybe updating it would help, I'll see if I can get that done.

edit: Yes, that seems to explain it because in Slurm 15.08, sbatch does not have a --wait option. So it was interpreting --wait as --wait-all-nodes. I will probably accept this PR once our sysadmins update Slurm and I can successfully test on our system.

@pmarchand1
Copy link
Collaborator

@qdread This functionality was added by @itcarroll after I left SESYNC.

@qdread
Copy link
Contributor

qdread commented Mar 17, 2021

Thanks @pmarchand1 ! We are getting pretty close to getting the new cluster online so it will be running the new version of Slurm. I am holding off on testing anything else out until then.

@itcarroll
Copy link
Contributor

I hope the --wait option to sbatch was not around when I added the wait argument, b/c it sure would have been the way to go. I won't look at this in detail, as it seems like you've got a solution, but the whole srun with dependency thing now appears outdated. (thanks @pmarchand1 for the ping!)

@qdread qdread merged commit 224f726 into earthlab:master Apr 21, 2021
@qdread
Copy link
Contributor

qdread commented Apr 21, 2021

I tested this on slurm 20.11.2, R 4.0.4 and it worked so the PR is merged. Thanks for the PR! @ChristopherBarrington please let me know if you would like to be listed as a package contributor.

@ChristopherBarrington
Copy link
Contributor Author

@qdread If its no effort for you, that would be fun! I am glad to have (hopefully) helped out in any case.

@qdread
Copy link
Contributor

qdread commented Apr 22, 2021

Done. I am in the process of releasing a new version to CRAN. Once that's accepted I will rebuild the pkgdown site and your name will appear! 😃

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants