
[BUG] Scontrol Error when checkpointing / preemption on slurm #1601

Closed

YannDubs opened this issue Jan 24, 2021 · 5 comments

YannDubs commented Jan 24, 2021

Hi,

For me, submitit works great when there is no need for checkpointing / preemption, but I get the following error when I need to checkpoint:
FileNotFoundError: [Errno 2] No such file or directory: 'scontrol'

Specifically, I can reproduce this error by running docs/mnist.py. I ran the following three versions of the mnist example to understand the issue:

  • Running docs/mnist.py on slurm as is, I get the previous error. Full logs: stderr, stdout
  • If I ssh into some slurm node that I get allocated and run docs/mnist.py on the local executor (cluster="local"), everything works as it should: so submitit + checkpointing works fine.
  • Running docs/mnist.py without preemption (removing timeout_min and job._interrupt()), everything works fine: so slurm + submitit work fine.

Also, scontrol works fine on my login node, so I don't understand why the check_call(["scontrol", "requeue", jid]) call fails. That said, scontrol does not work on the nodes I get allocated (it only works from the login nodes), but from my understanding check_call(["scontrol", "requeue", jid]) is called from where I call submitit, and thus not having scontrol on the allocated nodes shouldn't be an issue. Am I correct?
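For reference, this is roughly the checkpointing pattern from the docs that goes through the requeue path (a minimal sketch rather than the full mnist example; the folder name and timeout value are placeholders):

    import submitit

    class Trainer:
        def __call__(self):
            # ... training loop that periodically saves its state ...
            pass

        def checkpoint(self) -> submitit.helpers.DelayedSubmission:
            # Called on timeout/preemption; resubmitting the same callable
            # is what triggers the `scontrol requeue` call inside the job.
            return submitit.helpers.DelayedSubmission(self)

    executor = submitit.AutoExecutor(folder="submitit_logs")  # placeholder folder
    executor.update_parameters(timeout_min=1)  # short timeout to force a requeue
    job = executor.submit(Trainer())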

Thank you!

YannDubs commented Jan 25, 2021

Update: it seems that my last hypothesis was wrong. After logging os.environ["HOSTNAME"] in _requeue, I realized that the code is actually running from the allocated node. The stderr of the subprocess.check_call is /bin/sh: 1: scontrol: not found.

So my question is whether it is possible to run _requeue from the node where the code was initially run? Is it very uncommon not to have scontrol on the allocated nodes?
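A quick way to check this from inside an allocated job (a small diagnostic sketch, not part of submitit; the folder name is a placeholder):

    import shutil
    import socket
    import submitit

    def probe():
        # Report which node this runs on and whether `scontrol` is on PATH there.
        return socket.gethostname(), shutil.which("scontrol")

    executor = submitit.AutoExecutor(folder="submitit_logs")
    executor.update_parameters(timeout_min=5)
    job = executor.submit(probe)
    print(job.result())  # e.g. ('node042', None) when scontrol is missing on the node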

jrapin commented Jan 25, 2021

Hi @YannDubs,
Yes, the requeueing is supposed to happen within the job itself. As far as I understand, though, scontrol is supposed to be available cluster-wide (it has always been the case so far), so I would expect a configuration issue on the cluster :(

YannDubs commented:

Thanks Jeremy, I found a way around it.

In case someone has this issue in the future: I was able to solve it by adding the Slurm binaries directory to my PATH (in my case /opt/slurm/bin).

huvunvidia commented:

Hi @YannDubs, could you explain in more detail what you meant by adding /opt/slurm/bin to your path?
Did you mean changing the call to be like this: check_call(["/opt/slurm/bin/scontrol", "/opt/slurm/bin/requeue", jid])?
Thank you very much.


YannDubs commented Jul 7, 2021

I meant adding export PATH="$PATH:/opt/slurm/bin", e.g. in your ~/.bashrc.
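For completeness, the same tweak can also be made from Python before submitting (a sketch, not a confirmed fix: whether the submission environment propagates to the job depends on your cluster's sbatch export settings, and /opt/slurm/bin is specific to this cluster):

    import os

    # Append the cluster-specific Slurm binaries directory so that
    # subprocess calls such as `scontrol requeue <jid>` can resolve it.
    os.environ["PATH"] = os.environ["PATH"] + os.pathsep + "/opt/slurm/bin"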
