Feature request for restarting slurmctld upon failure #6538

@gwolski

I had an issue with slurmctld dying on PC 3.11.1. It looks like a problem on my end with DNS and reverse lookup (#6529), but while thinking about how to mitigate it I modified the slurmctld systemd unit file to restart slurmctld on failure, so that I don't lose my cluster.

Here is my new 3.11.1 slurmctld.service file:

# /etc/systemd/system/slurmctld.service
[Unit]
Description=Slurm controller daemon
After=network-online.target munge.service remote-fs.target
Wants=network-online.target
ConditionPathExists=/opt/slurm/etc/slurm.conf
StartLimitIntervalSec=30
StartLimitBurst=2

[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmctld
ExecStart=/opt/slurm/sbin/slurmctld -D -s $SLURMCTLD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
LimitNOFILE=562930
LimitMEMLOCK=infinity
LimitSTACK=infinity
Restart=on-failure
RestartSec=10s

[Install]
WantedBy=multi-user.target

I added four new lines. In the [Unit] section:
StartLimitIntervalSec=30
StartLimitBurst=2

and in the [Service] section:
Restart=on-failure
RestartSec=10s
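
For anyone who wants to try this without editing the shipped unit file, the same four settings could probably be applied as a systemd drop-in. This is only a sketch: the drop-in file name `restart.conf` is my own choice (any `*.conf` file in the `slurmctld.service.d` directory should be picked up), and I have not tested it on a ParallelCluster head node.

```ini
# Hypothetical drop-in path (file name is an assumption):
# /etc/systemd/system/slurmctld.service.d/restart.conf

[Unit]
# Allow at most 2 start attempts within any 30-second window;
# after that systemd stops restarting and marks the unit failed.
StartLimitIntervalSec=30
StartLimitBurst=2

[Service]
# Restart only on unclean exits (non-zero exit code, signal, timeout),
# not after a clean stop such as "systemctl stop slurmctld".
Restart=on-failure
# Wait 10 seconds before each restart attempt.
RestartSec=10s
```

After creating the file, run `systemctl daemon-reload` so systemd rereads the unit; `systemctl cat slurmctld` should then show the original unit followed by the drop-in. Keeping the change in a drop-in means it does not have to be merged back in by hand if the main unit file is replaced.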

Is this something you would consider adding to the standard distribution?
