Some software tests fail when run inside a SLURM job, e.g. OpenMPI, whose `mpirun` picks up the SLURM job it is running in and fails because the resulting configuration doesn't match what the job is expecting.
I have 2 workarounds in my EB wrapper script:
```bash
if [[ "${SLURM_NODELIST:-}" != "" ]]; then
    ssh "$SLURM_NODELIST" bash -l "$0"
    exit $?
fi
```
This basically restarts the current script via `ssh` if it is run from inside a SLURM job, assuming only 1 node.
```bash
for i in $(env | grep ^SLURM_ | cut -f1 -d=); do
    unset "$i"
done
```
This removes all `SLURM_*` variables from the current environment.
As the issue is a common pitfall with EB, and given how easy the 2nd variant is to implement in EB via `os.environ`, I'd suggest doing this by default, possibly with a `--no-cleanup-slurm-env` option to opt out.
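A minimal sketch of what that 2nd variant could look like in Python (the helper name is hypothetical, not actual EasyBuild framework code):

```python
import os

def unset_slurm_env_vars():
    """Remove all SLURM_* variables from the environment (hypothetical helper).

    Equivalent to the shell loop above: it prevents tools like OpenMPI's
    mpirun from detecting the surrounding SLURM job during builds/tests.
    """
    for key in [k for k in os.environ if k.startswith('SLURM_')]:
        del os.environ[key]
```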
That was just an example. Both methods seem to work; I had the 2nd one in use while our nodes were being updated and didn't allow SSH yet. And only the 2nd can reasonably be done in EasyBuild.
Makes sense to me, and it probably makes sense to implement this in EasyBuild 5.0 (although it shouldn't actually break anything; it would only fix builds that break/hang because they're running in a Slurm environment, which may cause trouble with MPI).
boegel changed the title from *Cleanup SLURM/Batchsystem environment before doing builds* to *Clean up SLURM/Batchsystem environment before doing builds* on Jan 17, 2024
We briefly discussed this during the EasyBuild conf call today, and the general consensus seemed to be that this should be made opt-in rather than opt-out (which makes sense to me).
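If it does end up opt-in, the cleanup would simply be guarded by a configuration option; a rough sketch (option name purely hypothetical):

```python
import os

def maybe_cleanup_slurm_env(cleanup_enabled):
    # Opt-in semantics: only scrub SLURM_* variables when the user has
    # explicitly enabled it (e.g. via a hypothetical --cleanup-slurm-env option).
    if cleanup_enabled:
        for key in [k for k in os.environ if k.startswith('SLURM_')]:
            del os.environ[key]
```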