
[BUG] Scontrol Error when checkpointing / preemption on slurm #1601

Closed

YannDubs opened this issue Jan 24, 2021 · 5 comments

YannDubs commented Jan 24, 2021

Hi,

For me, submitit works great when there is no need for checkpointing / preemption, but I get the following error when I need to checkpoint:
FileNotFoundError: [Errno 2] No such file or directory: 'scontrol'

Specifically, I can reproduce this error by running docs/mnist.py. I ran the following three versions of the mnist example to understand the issue:

  • Running docs/mnist.py on slurm as is, I get the previous error. Full logs: stderr, stdout
  • If I ssh into some slurm node that I get allocated and run docs/mnist.py on the local executor (cluster="local"), everything works as it should: so submitit + checkpointing works fine.
  • Running docs/mnist.py without preemption (removing timeout_min and job._interrupt()), everything works fine: so slurm + submitit work fine.

Also, scontrol works fine on my login node, so I don't understand why the check_call(["scontrol", "requeue", jid]) call fails. That said, scontrol does not work on the nodes I get allocated (it only works from the login nodes), but from my understanding check_call(["scontrol", "requeue", jid]) is called from where I call submitit, and thus not having scontrol on the allocated nodes shouldn't be an issue. Am I correct?
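For reference, this is roughly the checkpointing pattern from the docs that goes through the requeue path (a minimal sketch rather than the full mnist example; the folder name and timeout value are placeholders):

    import submitit

    class Trainer:
        def __call__(self):
            # ... training loop that periodically saves its state ...
            pass

        def checkpoint(self) -> submitit.helpers.DelayedSubmission:
            # Called on timeout/preemption; resubmitting the same callable
            # is what triggers the `scontrol requeue` call inside the job.
            return submitit.helpers.DelayedSubmission(self)

    executor = submitit.AutoExecutor(folder="submitit_logs")  # placeholder folder
    executor.update_parameters(timeout_min=1)  # short timeout to force a requeue
    job = executor.submit(Trainer())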

Thank you!

YannDubs commented Jan 25, 2021

Update: it seems that my last hypothesis was wrong. After logging os.environ["HOSTNAME"] in _requeue, I realized that the code is actually running from the allocated node. The stderr of the subprocess.check_call is /bin/sh: 1: scontrol: not found.

So my question is whether it is possible to run _requeue from the node where the code was initially run? Is it very uncommon not to have scontrol on the allocated nodes?
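A quick way to check this from inside an allocated job (a small diagnostic sketch, not part of submitit; the folder name is a placeholder):

    import shutil
    import socket
    import submitit

    def probe():
        # Report which node this runs on and whether `scontrol` is on PATH there.
        return socket.gethostname(), shutil.which("scontrol")

    executor = submitit.AutoExecutor(folder="submitit_logs")
    executor.update_parameters(timeout_min=5)
    job = executor.submit(probe)
    print(job.result())  # e.g. ('node042', None) when scontrol is missing on the node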

jrapin commented Jan 25, 2021

Hi @YannDubs,
Yes, the requeueing is supposed to happen within the job itself. As far as I understand, though, scontrol is supposed to be available cluster-wide (it has always been the case so far), so I would expect a configuration issue on the cluster :(

YannDubs commented:

Thanks Jeremy, I found a way around it.

In case someone has this issue in the future: I was able to solve it by adding the Slurm binaries directory to my PATH (in my case /opt/slurm/bin).

huvunvidia commented:

Hi @YannDubs, could you explain in more detail what you meant by adding /opt/slurm/bin to your path?
Did you mean changing the call to be like this: check_call(["/opt/slurm/bin/scontrol", "/opt/slurm/bin/requeue", jid])?
Thank you very much.


YannDubs commented Jul 7, 2021

I meant adding export PATH="$PATH:/opt/slurm/bin", e.g. in your ~/.bashrc.
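For completeness, the same tweak can also be made from Python before submitting (a sketch, not a confirmed fix: whether the submission environment propagates to the job depends on your cluster's sbatch export settings, and /opt/slurm/bin is specific to this cluster):

    import os

    # Append the cluster-specific Slurm binaries directory so that
    # subprocess calls such as `scontrol requeue <jid>` can resolve it.
    os.environ["PATH"] = os.environ["PATH"] + os.pathsep + "/opt/slurm/bin"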
