Closed
Description
Describe the bug
When deploying a new cluster, create_slurm_accounts.py intermittently fails with the errors below.
Rerunning the script succeeds.
2024-05-13 19:43:53,708 p=4087 u=root n=ansible | TASK [ParallelClusterHeadNode : Run /opt/slurm/config/bin/create_slurm_accounts.py to make sure it works] ***
2024-05-13 19:44:05,640 p=4087 u=root n=ansible | fatal: [local]: FAILED! => changed=true
cmd: |-
set -ex
export SLURM_ROOT=/opt/slurm
/opt/slurm/config/bin/create_slurm_accounts.py --accounts /opt/slurm/config/accounts.yml --users /opt/slurm/config/users_groups.json --default-account unassigned -d
DEBUG:2024-05-13 14:43:54,470: Checking account infrastructure existence and fairshare
INFO:2024-05-13 14:43:54,470: Creating account infrastructure with fairshare=10, parent=None
INFO:2024-05-13 14:43:59,127: Updating infrastructure account parent from None to root
ERROR:2024-05-13 14:43:59,334: Couldn't set ParentName for account infrastructure to root.
command: ['/opt/slurm/bin/sacctmgr', 'modify', '-i', 'account', 'infrastructure', 'set', 'Parent=root']
output:
Nothing modified
Traceback (most recent call last):
File "/opt/slurm/config/bin/create_slurm_accounts.py", line 152, in update_slurm
subprocess.check_output([self.sacctmgr, 'modify', '-i', 'account', account, 'set', f'Parent={exp_parent}'], encoding='UTF-8', stderr=self.devnull) # nosec
File "/usr/lib64/python3.9/subprocess.py", line 424, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "/usr/lib64/python3.9/subprocess.py", line 528, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/opt/slurm/bin/sacctmgr', 'modify', '-i', 'account', 'infrastructure', 'set', 'Parent=root']' returned non-zero exit status 1.
ERROR:root:Unhandled exception in /opt/slurm/config/bin/create_slurm_accounts.py
Traceback (most recent call last):
File "/opt/slurm/config/bin/create_slurm_accounts.py", line 354, in <module>
app = SlurmAccountManager(args.accounts, args.users, args.default_account)
File "/opt/slurm/config/bin/create_slurm_accounts.py", line 90, in __init__
number_of_changes = self.update_slurm()
File "/opt/slurm/config/bin/create_slurm_accounts.py", line 272, in update_slurm
raise RuntimeError("Some slurm updates failed")
RuntimeError: Some slurm updates failed
Traceback (most recent call last):
File "/opt/slurm/config/bin/create_slurm_accounts.py", line 354, in <module>
app = SlurmAccountManager(args.accounts, args.users, args.default_account)
File "/opt/slurm/config/bin/create_slurm_accounts.py", line 90, in __init__
number_of_changes = self.update_slurm()
File "/opt/slurm/config/bin/create_slurm_accounts.py", line 272, in update_slurm
raise RuntimeError("Some slurm updates failed")
RuntimeError: Some slurm updates failed
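Possible workaround
Since a rerun passes, the failure looks transient. The log shows sacctmgr returning "Nothing modified" with exit status 1 about five seconds after the account was created. The sketch below retries a failing command a few times with a growing delay. It assumes the failure is timing-related, which the log does not confirm. `run_with_retry`, its parameters, and the `runner` argument are hypothetical names for illustration and are not part of create_slurm_accounts.py.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=3, delay=2.0, runner=subprocess.check_output):
    """Run cmd and return its output, retrying when it exits non-zero.

    Hypothetical helper (assumption): not part of create_slurm_accounts.py.
    `runner` defaults to subprocess.check_output and can be swapped in tests.
    """
    for attempt in range(1, attempts + 1):
        try:
            return runner(cmd, encoding='UTF-8', stderr=subprocess.DEVNULL)
        except subprocess.CalledProcessError:
            if attempt == attempts:
                # Out of retries: let the caller see the original error.
                raise
            # Linear backoff: wait longer after each failed attempt.
            time.sleep(delay * attempt)


if __name__ == '__main__':
    # Same command as in the traceback above; the path is taken from the log.
    cmd = ['/opt/slurm/bin/sacctmgr', 'modify', '-i', 'account',
           'infrastructure', 'set', 'Parent=root']
    print(run_with_retry(cmd))
```

If the error persists after all attempts, the original CalledProcessError is still raised, so a real misconfiguration would not be hidden.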