[BUG] Cluster fails to deploy because create_slurm_accounts.py fails #231

@cartalla

Describe the bug

When deploying a new cluster, I intermittently get the following errors from create_slurm_accounts.py, which cause the Ansible task to fail.
If I rerun the script, it passes.

2024-05-13 19:43:53,708 p=4087 u=root n=ansible | TASK [ParallelClusterHeadNode : Run /opt/slurm/config/bin/create_slurm_accounts.py to make sure it works] ***
2024-05-13 19:44:05,640 p=4087 u=root n=ansible | fatal: [local]: FAILED! => changed=true

  cmd: |-
    set -ex

    export SLURM_ROOT=/opt/slurm
    /opt/slurm/config/bin/create_slurm_accounts.py --accounts /opt/slurm/config/accounts.yml --users /opt/slurm/config/users_groups.json --default-account unassigned -d

DEBUG:2024-05-13 14:43:54,470: Checking account infrastructure existence and fairshare
    INFO:2024-05-13 14:43:54,470:     Creating account infrastructure with fairshare=10, parent=None

INFO:2024-05-13 14:43:59,127: Updating infrastructure account parent from None to root
    ERROR:2024-05-13 14:43:59,334: Couldn't set ParentName for account infrastructure to root.
    command: ['/opt/slurm/bin/sacctmgr', 'modify', '-i', 'account', 'infrastructure', 'set', 'Parent=root']
    output:
     Nothing modified
    Traceback (most recent call last):
      File "/opt/slurm/config/bin/create_slurm_accounts.py", line 152, in update_slurm
        subprocess.check_output([self.sacctmgr, 'modify', '-i', 'account', account, 'set', f'Parent={exp_parent}'], encoding='UTF-8', stderr=self.devnull) # nosec
      File "/usr/lib64/python3.9/subprocess.py", line 424, in check_output
        return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
      File "/usr/lib64/python3.9/subprocess.py", line 528, in run
        raise CalledProcessError(retcode, process.args,
    subprocess.CalledProcessError: Command '['/opt/slurm/bin/sacctmgr', 'modify', '-i', 'account', 'infrastructure', 'set', 'Parent=root']' returned non-zero exit status 1.

ERROR:root:Unhandled exception in /opt/slurm/config/bin/create_slurm_accounts.py
    Traceback (most recent call last):
      File "/opt/slurm/config/bin/create_slurm_accounts.py", line 354, in <module>
        app = SlurmAccountManager(args.accounts, args.users, args.default_account)
      File "/opt/slurm/config/bin/create_slurm_accounts.py", line 90, in __init__
        number_of_changes = self.update_slurm()
      File "/opt/slurm/config/bin/create_slurm_accounts.py", line 272, in update_slurm
        raise RuntimeError("Some slurm updates failed")
    RuntimeError: Some slurm updates failed
    Traceback (most recent call last):
      File "/opt/slurm/config/bin/create_slurm_accounts.py", line 354, in <module>
        app = SlurmAccountManager(args.accounts, args.users, args.default_account)
      File "/opt/slurm/config/bin/create_slurm_accounts.py", line 90, in __init__
        number_of_changes = self.update_slurm()
      File "/opt/slurm/config/bin/create_slurm_accounts.py", line 272, in update_slurm
        raise RuntimeError("Some slurm updates failed")
    RuntimeError: Some slurm updates failed
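
Since rerunning the script passes, one possible mitigation is to retry the failing `sacctmgr` call a few times before giving up. The following is only a sketch under that assumption; it does not identify the root cause. The helper name `check_output_with_retries`, its `attempts`, `delay`, and `runner` parameters, and the choice to capture stderr with the output are all hypothetical and are not part of create_slurm_accounts.py.

```python
import subprocess
import time


def check_output_with_retries(cmd, attempts=3, delay=2.0,
                              runner=subprocess.check_output):
    """Run cmd and return its output, retrying on a non-zero exit status.

    Hypothetical helper (not in create_slurm_accounts.py). It assumes the
    failure is transient, as suggested by the script passing on a rerun.
    `runner` is injectable only so the helper can be tested without Slurm.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            # Capture stderr together with stdout so a final failure
            # carries the full command output.
            return runner(cmd, encoding='UTF-8', stderr=subprocess.STDOUT)
        except subprocess.CalledProcessError as e:
            last_error = e
            if attempt < attempts:
                # Wait before the next attempt.
                time.sleep(delay)
    # All attempts failed: re-raise the last error for the caller to log.
    raise last_error


# Example call, using the command from the log above:
# check_output_with_retries(
#     ['/opt/slurm/bin/sacctmgr', 'modify', '-i', 'account',
#      'infrastructure', 'set', 'Parent=root'])
```

If the command still fails after the last attempt, the original `CalledProcessError` is raised, so the existing error handling in `update_slurm` would behave as it does today.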
