Feature request for restarting slurmctld upon failure #6538

I had an issue with slurmctld dying on PC 3.11.1. It looks like the cause is at my end, with DNS and reverse lookup (https://github.com/aws/aws-parallelcluster/issues/6529), but while reviewing it and thinking about how to mitigate it, I modified the slurmctld systemd unit file to restart slurmctld on failure so I don't lose my cluster.

Here is my new 3.11.1 slurmctld.service file:
```
# /etc/systemd/system/slurmctld.service
[Unit]
Description=Slurm controller daemon
After=network-online.target munge.service remote-fs.target
Wants=network-online.target
ConditionPathExists=/opt/slurm/etc/slurm.conf
StartLimitIntervalSec=30
StartLimitBurst=2

[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmctld
ExecStart=/opt/slurm/sbin/slurmctld -D -s $SLURMCTLD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
LimitNOFILE=562930
LimitMEMLOCK=infinity
LimitSTACK=infinity
Restart=on-failure
RestartSec=10s

[Install]
WantedBy=multi-user.target
```

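Four new lines were added.

In the [Unit] section:
StartLimitIntervalSec=30
StartLimitBurst=2

and in the [Service] section:
Restart=on-failure
RestartSec=10s

With these settings systemd restarts slurmctld 10 seconds after an unclean exit, and stops retrying if the unit is started more than twice within 30 seconds.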
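
The same four settings could probably also be applied without editing the shipped unit file, by placing them in a systemd drop-in. The following is only a sketch: the drop-in file name `restart.conf` is a hypothetical choice, and it assumes the unit is named `slurmctld.service` as above and that nothing else in the cluster tooling overrides these keys.

```
# /etc/systemd/system/slurmctld.service.d/restart.conf
# (hypothetical drop-in name; any *.conf file in this directory is merged into the unit)
[Unit]
# Allow at most 2 starts within a 30 second window before systemd gives up
StartLimitIntervalSec=30
StartLimitBurst=2

[Service]
# Restart only on unclean exits (non-zero exit code, fatal signal, timeout)
Restart=on-failure
# Wait 10 seconds before each restart attempt
RestartSec=10s
```

After creating the file, `systemctl daemon-reload` makes systemd pick up the change, and `systemctl cat slurmctld` shows the merged unit so the override can be checked.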

Maybe this is something you would consider adding to the standard distribution?
