[BUG] Scontrol Error when checkpointing / preemption on slurm #1601
Comments
Update: it seems that my last hypothesis was wrong. After logging So my question is whether it is possible to run |
Hi @YannDubs |
Thanks Jeremey, I found a way around it. In case someone has this issue in the future I was able to solve it by adding slurm to my path (in my case |
Hi @YannDubs , can you help me explain more what did you mean by adding |
I meant adding |
Hi,

For me, submitit works great when there is no need for checkpointing / preemption, but I get the following error when I need to checkpoint:

`FileNotFoundError: [Errno 2] No such file or directory: 'scontrol'`

Specifically, I can reproduce this error by running `docs/mnist.py`. I ran the following three versions of the mnist example to understand the issue:

- `docs/mnist.py` on slurm as is: I get the previous error. Full logs: stderr, stdout
- `docs/mnist.py` on the local executor (`cluster="local"`): everything works as it should, so submitit + checkpointing works fine.
- `docs/mnist.py` but without preemption (removing `timeout_min` and `job._interrupt()`): everything works fine, so slurm + submitit work fine.

Also, `scontrol` seems to work fine on my login node, so I don't understand why the `check_call(["scontrol", "requeue", jid])` does not work. That being said, `scontrol` does not work on the nodes I get allocated to (it only works from the login nodes), but from my understanding `check_call(["scontrol", "requeue", jid])` is called from where I call submitit, and thus not having `scontrol` on the allocated nodes shouldn't be an issue. Am I correct?

Thank you!
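
For anyone hitting the same error, one way to check whether `scontrol` can be resolved from the process that actually issues the requeue is a small PATH diagnostic. The following is a minimal sketch, not part of submitit: `find_executable` and `ensure_on_path` are hypothetical helper names, and the directory passed in `extra_dirs` is a placeholder that depends on where Slurm is installed on your cluster.

```python
import os
import shutil
import socket


def find_executable(name, extra_dirs=()):
    """Return the full path of `name`, searching extra_dirs first, then PATH."""
    entries = [d for d in (*extra_dirs, os.environ.get("PATH", "")) if d]
    return shutil.which(name, path=os.pathsep.join(entries))


def ensure_on_path(name, extra_dirs=()):
    """Make `name` resolvable for subprocess calls made by this process.

    Returns True if `name` is already on PATH, or if it was found in one of
    `extra_dirs` and that directory was prepended to PATH. Returns False if
    it could not be found anywhere.
    """
    if shutil.which(name) is not None:
        return True
    found = find_executable(name, extra_dirs)
    if found is None:
        return False
    current = os.environ.get("PATH", "")
    os.environ["PATH"] = os.path.dirname(found) + (os.pathsep + current if current else "")
    return True


if __name__ == "__main__":
    # The hostname shows whether this runs on a login node or an allocated node.
    print("host:", socket.gethostname())
    print("scontrol resolves to:", shutil.which("scontrol"))
    # "/path/to/slurm/bin" is a placeholder, not a real location.
    print("usable after fix:", ensure_on_path("scontrol", ["/path/to/slurm/bin"]))
```

Running this inside the submitted function (rather than on the login node) shows which host the code executes on and whether `scontrol` is visible there. If it prints `None` on the allocated node, `check_call(["scontrol", ...])` issued from that process would fail with the same `FileNotFoundError`.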