
(3.2.0 - 3.5.1) GPU nodes not coming back online after scontrol reboot


The issue

After rebooting a GPU compute node via the scontrol reboot command available in the Slurm scheduler, the node may not come back online in the Slurm cluster because the slurmd daemon fails to start, leaving the node unavailable to jobs.

This wastes instance time: the EC2 instance backing the compute node keeps running until the timeout defined by ResumeTimeout (30 minutes by default) is reached, but it cannot be used by Slurm jobs. Once the timeout expires, the compute node is placed in the DOWN state and ParallelCluster reprovisions it.
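
To see the ResumeTimeout value currently in effect on your cluster, you can query the Slurm configuration from the head node. The value shown in the comment below is only the Slurm default and may differ on your cluster:

# On the head node: show the ResumeTimeout currently in effect
# Example output (default): ResumeTimeout = 1800 sec
scontrol show config | grep -i resumetimeout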

Please also be aware that versions between 3.2.0 and 3.4.1 were already affected by a bug in the Slurm reboot functionality, so performing Slurm-aware reboots of compute nodes is not advisable in those versions.

You can check whether the error has occurred by running the sinfo or scontrol show node command to inspect the state of the compute node. If the node does not come back after the reboot, after a while you will find it in the DOWN+CLOUD+REBOOT_ISSUED+NOT_RESPONDING state. If the node does not come back within the ResumeTimeout, Slurm will put it in the DOWN+CLOUD+NOT_RESPONDING state, which will cause ParallelCluster to terminate the backing EC2 instance and start a new one.
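
For example, the following commands can be run from the head node; <nodename> is a placeholder for the affected compute node:

# Summary view of the node and its state
sinfo -N -n <nodename>

# Detailed view, including the full state flags
# (e.g. DOWN+CLOUD+REBOOT_ISSUED+NOT_RESPONDING)
scontrol show node <nodename> | grep -i state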

The cause of this bug is that after an instance reboot, the NVIDIA GPU device files under /dev/ (e.g. /dev/nvidia0) are not immediately present. When the slurmd service starts, it cannot find these device files and therefore assumes that the expected GPU devices are missing. This causes the daemon to fail at startup and exit.
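
If you can still reach the affected instance (e.g. via SSH or SSM), a quick way to confirm this failure mode is to check for the device files and the slurmd status. The log path below is the usual ParallelCluster location on compute nodes, but it may differ on customized images:

# On the affected compute node: the GPU device files are expected to be missing
ls -l /dev/nvidia*

# slurmd should be in a failed state; its log typically reports the missing GPU devices
systemctl status slurmd
tail /var/log/slurmd.log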

To reproduce the bug, it is enough to use compute nodes with NVIDIA GPUs (e.g. g4dn.12xlarge) and reboot a running GPU compute node via the command:

sudo -i scontrol reboot <nodename>
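
You can then watch the node state from the head node while the reboot is in progress; <nodename> is again a placeholder:

# Poll the node state every 10 seconds (%T prints the extended node state)
watch -n 10 "sinfo -N -n <nodename> -o '%N %T'"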

Affected versions (OSes, schedulers)

  • ParallelCluster versions between 3.2.0 and 3.5.1 on all OSes.
  • Only the Slurm scheduler is affected.
  • Only compute nodes with NVIDIA GPUs are affected.

Mitigation

A fix for the issue is to make sure the NVIDIA GPU device files are present under /dev before the slurmd service starts.

One way to do this is to configure the NVIDIA Persistence Daemon on the GPU compute nodes and ensure it is started before slurmd (via appropriate systemd dependencies between the two daemons).

You can configure the nvidia-persistenced service on the GPU compute nodes by using the following script as an OnNodeConfigured custom action for the compute nodes (a sketch of how to attach it to the cluster configuration follows the script). Please refer to the Custom bootstrap actions documentation for more information on how to configure custom bootstrap scripts.

#!/bin/bash

# Exit if no NVIDIA PCIe devices are found
lspci | grep -i -q NVIDIA
rc=$?
[[ $rc -eq 0 ]] || exit 0

# Define default user based on OS ID
id=$(grep "^ID=" /etc/os-release | sed "s/ID=//")

case $id in

  *ubuntu*)
  default_user="ubuntu"
  ;;

  *centos*)
  default_user="centos"
  ;;

  *amzn*)
  default_user="ec2-user"
  ;;

esac


# Create nvidia-persistenced.service file
cat << EOF > /etc/systemd/system/nvidia-persistenced.service
# NVIDIA Persistence Daemon Init Script
#
# Copyright (c) 2013 NVIDIA Corporation
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.
#
# This is a sample systemd service file, designed to show how the NVIDIA
# Persistence Daemon can be started.
#

[Unit]
Description=NVIDIA Persistence Daemon
Wants=syslog.target

[Service]
Type=forking
ExecStart=/usr/bin/nvidia-persistenced --user ${default_user}
ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced

[Install]
WantedBy=multi-user.target
EOF

# Define systemd dependency between slurmd.service and nvidia-persistenced.service
mkdir -p /etc/systemd/system/slurmd.service.d 
cat << EOF > /etc/systemd/system/slurmd.service.d/nvidia-persistenced.conf
[Unit]
After=nvidia-persistenced.service
Requires=nvidia-persistenced.service
EOF

# Reload systemd configuration and enable nvidia-persistenced
systemctl daemon-reload 
systemctl enable nvidia-persistenced.service 
systemctl start nvidia-persistenced.service
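
As a rough sketch of how to deploy the workaround (the bucket and file names below are placeholders): upload the script to an S3 bucket the cluster can read, reference its S3 URL in the cluster configuration under Scheduling/SlurmQueues/<queue>/CustomActions/OnNodeConfigured/Script, and update or re-create the cluster so that new compute nodes run it.

# Upload the bootstrap script; reference its S3 URL as the OnNodeConfigured script
# in the cluster configuration
# (Scheduling/SlurmQueues/<queue>/CustomActions/OnNodeConfigured/Script)
aws s3 cp nvidia-persistenced-workaround.sh s3://<your-bucket>/nvidia-persistenced-workaround.sh

After a GPU compute node has bootstrapped, you can check on the node itself that the dependency and ordering took effect:

# Verify that slurmd now depends on (and is ordered after) nvidia-persistenced
systemctl show slurmd -p Requires -p After | grep nvidia-persistenced
systemctl status nvidia-persistenced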