PCluster 3.10.1 and 3.11.0 Slurm compute daemon node configuration differs from hardware #6449

@stefan-vaisala

Hello,

We have been testing an upgrade from PCluster 3.8.0 to 3.11.0 and, after extensive testing of our applications, noticed some differences that impact performance. We run hybrid MPI-OpenMP applications on HPC6a.48xlarge instances, and on PCluster 3.10.1 and 3.11.0 all of our applications run ~40% slower than on 3.8.0, using the out-of-the-box PCluster AMIs associated with each version. We tried to narrow down the issue by downgrading or changing versions of performance-impacting software (such as the EFA installer, downgraded to v1.32.0 or v1.33.0), switching how the job is submitted and run in Slurm (Hydra bootstrap with mpiexec vs. PMIv2 with srun), and several other changes, none of which improved the degraded performance.
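
For context, the two submission paths we compared look roughly like the sketch below (simplified; the binary, rank counts, and thread counts are placeholders and vary per application):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8     # MPI ranks per node (placeholder)
#SBATCH --cpus-per-task=12      # OpenMP threads per rank (placeholder)
#SBATCH --exclusive

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

# Path 1: PMIv2 + srun
srun --mpi=pmi2 --cpus-per-task=${SLURM_CPUS_PER_TASK} ./hybrid_app

# Path 2: Hydra bootstrap + mpiexec (used instead of the srun line above)
# mpiexec -n $((SLURM_NNODES * SLURM_NTASKS_PER_NODE)) -ppn ${SLURM_NTASKS_PER_NODE} ./hybrid_app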

Upon investigation, we noticed that the slurmd compute daemon on the HPC6a.48xlarge instances incorrectly identifies the hardware configuration, resulting in improper job placement and degraded performance. Snapshots of the slurmd logs from the different PCluster versions are as follows:

HPC6a.48xlarge on PCluster 3.8.0 with Slurm 23.02.7 (correct when treating each NUMA node as a socket):

[2024-10-03T09:14:54.114] Considering each NUMA node as a socket
[2024-10-03T09:14:54.114] Node reconfigured socket/core boundaries SocketsPerBoard=96:4(hw) CoresPerSocket=1:24(hw)
[2024-10-03T09:14:54.116] Considering each NUMA node as a socket
[2024-10-03T09:14:54.124] CPU frequency setting not configured for this node
[2024-10-03T09:14:54.130] slurmd version 23.02.7 started
[2024-10-03T09:14:54.168] slurmd started on Thu, 03 Oct 2024 09:14:54 +0000
[2024-10-03T09:14:54.169] CPUs=96 Boards=1 Sockets=4 Cores=24 Threads=1 Memory=378805 TmpDisk=40947 Uptime=240 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

HPC6a.48xlarge on PCluster 3.10.1 with Slurm 23.11.7:

[2024-10-01T13:38:57.884] Considering each NUMA node as a socket
[2024-10-01T13:38:57.960] Considering each NUMA node as a socket
[2024-10-01T13:38:57.965] CPU frequency setting not configured for this node
[2024-10-01T13:38:58.142] slurmd version 23.11.7 started
[2024-10-01T13:38:58.221] slurmd started on Tue, 01 Oct 2024 13:38:58 +0000
[2024-10-01T13:38:58.221] CPUs=96 Boards=1 Sockets=96 Cores=1 Threads=1 Memory=378805 TmpDisk=40947 Uptime=123 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

HPC6a.48xlarge on PCluster 3.11.0 with Slurm 23.11.10:

[2024-10-03T13:56:38.733] Considering each NUMA node as a socket
[2024-10-03T13:56:38.735] Considering each NUMA node as a socket
[2024-10-03T13:56:38.740] CPU frequency setting not configured for this node
[2024-10-03T13:56:39.387] pyxis: version v0.20.0
[2024-10-03T13:56:39.388] slurmd version 23.11.10 started
[2024-10-03T13:56:39.830] slurmd started on Thu, 03 Oct 2024 13:56:39 +0000
[2024-10-03T13:56:39.831] CPUs=96 Boards=1 Sockets=96 Cores=1 Threads=1 Memory=378805 TmpDisk=40947 Uptime=377 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
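
In case it helps with reproducing this, the layout slurmd itself detects on a compute node can be compared against the node definition the controller registered (slurmd is under /opt/slurm/sbin on the PCluster AMIs if it is not on PATH; <compute-node-name> is a placeholder):

# On a compute node: print the hardware layout slurmd detects
# (NodeName, CPUs, Boards, SocketsPerBoard, CoresPerSocket, ThreadsPerCore, RealMemory)
slurmd -C

# On the head node: the node definition the controller is actually using
scontrol show node <compute-node-name> | grep -E 'CPUTot|Sockets|CoresPerSocket|Boards|ThreadsPerCore'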

lscpu output from an HPC6a.48xlarge instance:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              96
On-line CPU(s) list: 0-95
Thread(s) per core:  1
Core(s) per socket:  48
Socket(s):           2
NUMA node(s):        4
Vendor ID:           AuthenticAMD
CPU family:          25
Model:               1
Model name:          AMD EPYC 7R13 Processor
Stepping:            1
CPU MHz:             2420.130
BogoMIPS:            5299.99
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            16384K
NUMA node0 CPU(s):   0-23
NUMA node1 CPU(s):   24-47
NUMA node2 CPU(s):   48-71
NUMA node3 CPU(s):   72-95
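
Given that topology and the "Considering each NUMA node as a socket" messages, the layout we would expect slurmd to settle on can be derived directly from lscpu. A small sanity-check sketch (assumes the util-linux lscpu output shown above):

#!/bin/bash
# Derive the socket/core layout Slurm should report when each NUMA node
# is treated as a socket (HPC6a.48xlarge: 4 NUMA nodes x 24 cores, 1 thread/core).
cpus=$(lscpu -p=CPU | grep -vc '^#')   # total logical CPUs (96 here)
numa=$(lscpu | awk -F: '/^NUMA node\(s\)/ {gsub(/ /,"",$2); print $2}')
threads=$(lscpu | awk -F: '/^Thread\(s\) per core/ {gsub(/ /,"",$2); print $2}')
cores_per_numa=$(( cpus / numa / threads ))
echo "Expected: Sockets=${numa} CoresPerSocket=${cores_per_numa} ThreadsPerCore=${threads}"
# Expected output on this instance type: Sockets=4 CoresPerSocket=24 ThreadsPerCore=1

That matches what 3.8.0 ends up reporting (Sockets=4 Cores=24), but not what 3.10.1/3.11.0 report (Sockets=96 Cores=1).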

Is there a fix (or workaround) to properly reconfigure the node configuration in PCluster 3.11.0? It looks like some process/script that ran in 3.8.0 (e.g. the line [2024-10-03T09:14:54.114] Node reconfigured socket/core boundaries ...) is either not being run or not running properly. We'd prefer not to hard-code the proper node configuration in the PCluster compute resource YAML, as we dynamically spin clusters up and down and a given compute resource could use different instance types depending on capacity availability.
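
For completeness, the workaround we would rather avoid looks roughly like this: capture the layout slurmd detects on one instance of each type and hard-code those values into the compute resource definition in the cluster YAML (presumably via CustomSlurmSettings). The values below are specific to HPC6a.48xlarge and would have to be maintained for every instance type a compute resource might use, which is exactly the maintenance burden we'd like to avoid:

# Capture the detected layout once per instance type...
slurmd -C | tr ' ' '\n' | grep -E '^(Boards|SocketsPerBoard|CoresPerSocket|ThreadsPerCore)='
# ...then mirror those values into the compute resource in the cluster YAML,
# e.g. Sockets=4, CoresPerSocket=24, ThreadsPerCore=1 for HPC6a.48xlarge.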

Thanks for any help you can provide!
