From d334112298f1dd09752173d860285b821958f9f4 Mon Sep 17 00:00:00 2001
From: Xuanqi He
Date: Tue, 2 Sep 2025 11:30:02 -0400
Subject: [PATCH 1/4] [3.14.0][Changelog] Address formatting/wording issue in 3.14.0 cookbook changelog

---
 CHANGELOG.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index c08876779..c1128f82f 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -7,11 +7,14 @@ This file is used to list changes made in each version of the AWS ParallelCluste
 ------
 
 **ENHANCEMENTS**
-- Remove UnkillableStepTimeout from slurm.conf and let slurm set this value.
+- Add support for p6e-gb200 instances via capacity blocks.
+- Remove `UnkillableStepTimeout` from slurm.conf and let slurm set this value.
 - Add `build-image` support for kernel 6.12 of Amazon Linux 2023. The official ParallelCluster Amazon Linux 2023 AMIs use kernel 6.12.
 
 **CHANGES**
+- Install nvidia-imex for all OSs except AL2.
 - Ubuntu 20.04 is no longer supported.
+- Remove `berkshelf`. All cookbooks are local and do not need `berkshelf` dependency management.
 - Upgrade Slurm to version 24.11.6 (from 24.05.8).
 - Upgrade EFA installer to 1.43.2 (from 1.41.0).
   - Efa-driver: efa-2.17.2-1
@@ -20,21 +23,18 @@ This file is used to list changes made in each version of the AWS ParallelCluste
   - Libfabric-aws: libfabric-aws-2.1.0-5
   - Rdma-core: rdma-core-58.0-1
   - Open MPI: openmpi40-aws-4.1.7-2 and openmpi50-aws-5.0.6-11
-- Upgrade Cinc Client to version to 18.4.12 from 18.2.7.
+- Upgrade Cinc Client to version 18.4.12 (from 18.2.7).
 - Upgrade NVIDIA driver to version 570.172.08 (from 570.86.15) for all OSs except AL2.
 - Upgrade CUDA Toolkit to version 12.8.1 (from 12.8.0) for all OSs except AL2.
 - Upgrade DCGM to version 4.2.3 (from 3.3.6) for all OSs except AL2.
 - Upgrade Python to 3.12.11 (from 3.12.8) for all OSs except AL2.
 - Upgrade Python to 3.9.23 (from 3.9.20) for AL2.
 - Upgrade Intel MPI Library to 2021.16.0 (from 2021.13.1).
-- Addressed cluster id mismatch known issue by deleting the file `/var/spool/slurm.state/clustername` before configuring Slurm accounting.
 - Upgrade DCV to version 2024.0-19030.
-- Remove `berkshelf`. All cookbooks are local and do not need `berkshelf` dependency management.
-- Add support for GB200 instance types.
-- Install nvidia-imex for all OSs except AL2.
 
 **BUG FIXES**
 - Fix a race condition in CloudWatch Agent startup that could cause nodes bootstrap failures.
+- Fix cluster id mismatch known issue by deleting the file `/var/spool/slurm.state/clustername` before configuring Slurm accounting.
 
 3.13.2
 ------

From 165a7329547df079f0e07b585d28bb509a6d67cc Mon Sep 17 00:00:00 2001
From: Xuanqi He
Date: Tue, 2 Sep 2025 11:43:53 -0400
Subject: [PATCH 2/4] Some more details revises.

---
 CHANGELOG.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index c1128f82f..0513dcb11 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -8,13 +8,13 @@ This file is used to list changes made in each version of the AWS ParallelCluste
 
 **ENHANCEMENTS**
 - Add support for p6e-gb200 instances via capacity blocks.
-- Remove `UnkillableStepTimeout` from slurm.conf and let slurm set this value.
 - Add `build-image` support for kernel 6.12 of Amazon Linux 2023. The official ParallelCluster Amazon Linux 2023 AMIs use kernel 6.12.
 
 **CHANGES**
 - Install nvidia-imex for all OSs except AL2.
 - Ubuntu 20.04 is no longer supported.
 - Remove `berkshelf`. All cookbooks are local and do not need `berkshelf` dependency management.
+- Remove `UnkillableStepTimeout` from slurm.conf and let slurm set this value.
 - Upgrade Slurm to version 24.11.6 (from 24.05.8).
 - Upgrade EFA installer to 1.43.2 (from 1.41.0).
   - Efa-driver: efa-2.17.2-1
@@ -34,7 +34,7 @@ This file is used to list changes made in each version of the AWS ParallelCluste
 
 **BUG FIXES**
 - Fix a race condition in CloudWatch Agent startup that could cause nodes bootstrap failures.
-- Fix cluster id mismatch known issue by deleting the file `/var/spool/slurm.state/clustername` before configuring Slurm accounting.
+- Fix cluster id mismatch issue by deleting the file `/var/spool/slurm.state/clustername` before configuring Slurm accounting.
 
 3.13.2
 ------

From bddd469414cec7d5ff89d1c18a98f67b67bbbae8 Mon Sep 17 00:00:00 2001
From: Xuanqi He
Date: Fri, 5 Sep 2025 15:10:43 -0400
Subject: [PATCH 3/4] Address wording/format/version issues

---
 CHANGELOG.md | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 0513dcb11..3e2d50b8b 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -7,21 +7,24 @@ This file is used to list changes made in each version of the AWS ParallelCluste
 ------
 
 **ENHANCEMENTS**
-- Add support for p6e-gb200 instances via capacity blocks.
-- Add `build-image` support for kernel 6.12 of Amazon Linux 2023. The official ParallelCluster Amazon Linux 2023 AMIs use kernel 6.12.
+- Add support for P6e-GB200 instances. ParallelCluster sets up Slurm topology plugin to handle P6e-GB200 UltraServers. See limitations section for important additional setup requirements.
+- Add `build-image` support for Amazon Linux 2023 AMIs based on kernel 6.12 (in addition to 6.1).
+
+**LIMITATIONS**
+- P6e-GB200 instances are only tested on Amazon Linux 2023, Ubuntu 22.04 and Ubuntu 24.04.
+- Using IMEX on P6e-GB200 requires additional setup. Please refer to .
 
 **CHANGES**
 - Install nvidia-imex for all OSs except AL2.
-- Ubuntu 20.04 is no longer supported.
 - Remove `berkshelf`. All cookbooks are local and do not need `berkshelf` dependency management.
 - Remove `UnkillableStepTimeout` from slurm.conf and let slurm set this value.
 - Upgrade Slurm to version 24.11.6 (from 24.05.8).
-- Upgrade EFA installer to 1.43.2 (from 1.41.0).
-  - Efa-driver: efa-2.17.2-1
+- Upgrade EFA installer to 1.42.0 (from 1.41.0).
+  - Efa-driver: efa-2.15.3-1
   - Efa-config: efa-config-1.18-1
   - Efa-profile: efa-profile-1.7-1
-  - Libfabric-aws: libfabric-aws-2.1.0-5
-  - Rdma-core: rdma-core-58.0-1
+  - Libfabric-aws: libfabric-aws-2.1.0-3
+  - Rdma-core: rdma-core-57.0-1
   - Open MPI: openmpi40-aws-4.1.7-2 and openmpi50-aws-5.0.6-11
 - Upgrade Cinc Client to version 18.4.12 (from 18.2.7).
 - Upgrade NVIDIA driver to version 570.172.08 (from 570.86.15) for all OSs except AL2.
@@ -31,11 +34,15 @@ This file is used to list changes made in each version of the AWS ParallelCluste
 - Upgrade Python to 3.9.23 (from 3.9.20) for AL2.
 - Upgrade Intel MPI Library to 2021.16.0 (from 2021.13.1).
 - Upgrade DCV to version 2024.0-19030.
+- Upgrade the official ParallelCluster Amazon Linux 2023 AMIs to kernel 6.12 (from 6.1).
 
 **BUG FIXES**
 - Fix a race condition in CloudWatch Agent startup that could cause nodes bootstrap failures.
 - Fix cluster id mismatch issue by deleting the file `/var/spool/slurm.state/clustername` before configuring Slurm accounting.
 
+**DEPRECATIONS**
+- Ubuntu 20.04 is no longer supported.
+
 3.13.2
 ------
 

From 7fa27fdaa1eb8487b7e392ab0e3ab141cf91b35b Mon Sep 17 00:00:00 2001
From: Xuanqi He
Date: Fri, 5 Sep 2025 16:21:31 -0400
Subject: [PATCH 4/4] Use the latest efa version

---
 CHANGELOG.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 3e2d50b8b..e96577a34 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -19,12 +19,12 @@ This file is used to list changes made in each version of the AWS ParallelCluste
 - Remove `berkshelf`. All cookbooks are local and do not need `berkshelf` dependency management.
 - Remove `UnkillableStepTimeout` from slurm.conf and let slurm set this value.
 - Upgrade Slurm to version 24.11.6 (from 24.05.8).
-- Upgrade EFA installer to 1.42.0 (from 1.41.0).
-  - Efa-driver: efa-2.15.3-1
+- Upgrade EFA installer to 1.43.2 (from 1.41.0).
+  - Efa-driver: efa-2.17.2-1
   - Efa-config: efa-config-1.18-1
   - Efa-profile: efa-profile-1.7-1
-  - Libfabric-aws: libfabric-aws-2.1.0-3
-  - Rdma-core: rdma-core-57.0-1
+  - Libfabric-aws: libfabric-aws-2.1.0-5
+  - Rdma-core: rdma-core-58.0-1
   - Open MPI: openmpi40-aws-4.1.7-2 and openmpi50-aws-5.0.6-11
 - Upgrade Cinc Client to version 18.4.12 (from 18.2.7).
 - Upgrade NVIDIA driver to version 570.172.08 (from 570.86.15) for all OSs except AL2.