
(2.8.1 and earlier) Slurm node list misconfiguration breaks termination of idle compute nodes


The issue

When using the Slurm job scheduler with ParallelCluster <2.9.0, if the Slurm node list configuration becomes corrupted, the compute fleet management daemons may become unable to terminate idle nodes. As a result, compute nodes can keep running, and incurring costs, even when they are no longer needed.

Corruption of the Slurm node list configuration can be caused by a user manually modifying the scheduler configuration, or by system errors that break the ParallelCluster management logic. For example, this can happen when the file system is full and the sqswatcher daemon is unable to properly update the scheduler configuration.

Affected ParallelCluster versions

This issue affects all versions of ParallelCluster from v2.3.1 to v2.8.1. In ParallelCluster 2.9.0, released in September 2020, we enhanced the Slurm scaling approach and addressed this corner case.

Error details

To determine whether a node is idle and can self-terminate, the nodewatcher (the daemon running on Slurm compute nodes in ParallelCluster <2.9.0) reads the output of the squeue command.

When the Slurm configuration is incorrect, the squeue command returns an error message in its output, which the nodewatcher does not parse correctly. As a result, the node is marked as active even if it has no running or pending jobs.
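
For illustration, here is a minimal Python sketch of this failure mode. It is not the actual nodewatcher implementation; the function name and the exact squeue options are assumptions made for the example.

import socket
import subprocess

def has_jobs(hostname):
    # List the jobs assigned to this node: -w filters by node name,
    # -h suppresses the header, -o "%i" prints only the job IDs.
    result = subprocess.run(
        ["squeue", "-w", hostname, "-h", "-o", "%i"],
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    output = result.stdout.strip()
    # Naive check: any non-empty output means "the node has jobs".
    # An error line such as "squeue: error: Duplicated NodeHostName ..."
    # is also non-empty, so the node is wrongly reported as busy and
    # never scales down.
    return bool(output)

if __name__ == "__main__":
    print(has_jobs(socket.gethostname().split(".")[0]))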

For example, a "no space left on device" system error might interfere with the commands executed to scale down the cluster, causing an incomplete cleanup of the hostnames associated with terminated instances. When this happens, the output of the squeue command contains the following message: squeue: error: Duplicated NodeHostName ip-10-31-18-157 in the config file.

You can identify the faulty behavior by looking at the /var/log/nodewatcher file on the compute nodes. When the problem occurs, the log contains a message saying Instance has active jobs even though the previous log line shows no running jobs, only the error message from the squeue command:

2021-12-01 09:39:45,389 INFO [slurm:has_jobs] Found the following running jobs:
squeue: error: Duplicated NodeHostName ip-10-31-18-157 in the config file
2021-12-01 09:39:45,389 INFO [nodewatcher:_poll_instance_status] Instance has active jobs.
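
As a quick check, the following Python sketch scans the nodewatcher log for this telltale pattern. It assumes the default log location shown above and is only an example, not part of ParallelCluster.

from pathlib import Path

log = Path("/var/log/nodewatcher")
lines = log.read_text(errors="replace").splitlines()
for previous, current in zip(lines, lines[1:]):
    # A squeue error line directly followed by "Instance has active jobs."
    # indicates the node is affected by this issue.
    if "squeue: error" in previous and "Instance has active jobs" in current:
        print(previous)
        print(current)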

The solution

We highly recommend deleting the cluster and creating a new one with an updated ParallelCluster version as soon as possible.

We also recommend reviewing the AWS ParallelCluster Support Policy for details on the support timelines and policies for ParallelCluster versions. For the versions affected by this issue (v2.3.1 to v2.8.1), the end of supported life date is 12/31/2021; after that date, these versions are eligible for troubleshooting support on a best-effort basis only.
