From f6f7728507e3e62392ce271a7a93faa65833aedd Mon Sep 17 00:00:00 2001
From: mauri-melato <1615209+mauri-melato@users.noreply.github.com>
Date: Mon, 18 Jul 2022 21:28:31 +0200
Subject: [PATCH 1/4] Update CHANGELOG.md

---
 CHANGELOG.md | 41 +++++++++++++++++++++--------------------
 1 file changed, 21 insertions(+), 20 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 165e70378f..d1d389a835 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -5,42 +5,43 @@ CHANGELOG
 ------
 
 **ENHANCEMENTS**
-- Add new configuration parameter `Scheduling/SlurmSettings/QueueUpdateStrategy` to allow cluster update when
-  `SlurmQueues` configuration changes don't impact Slurm scheduler configuration.
-- Add support for multiple Elastic File Systems.
-- Add support for multiple FSx File Systems.
-- Add support for attaching existing FSx for Ontap and FSx for OpenZFS File Systems.
-- Add support for FSx Lustre Persistent_2 deployment type.
-- Add support for memory-based scheduling in Slurm.
-  - Configure `RealMemory` on compute nodes by default as 95% of the EC2 memory.
-  - Add new configuration parameter `Scheduling/SlurmSettings/EnableMemoryBasedScheduling` to configure memory-based scheduling in Slurm.
+- Add support for memory-based job scheduling in Slurm.
+  - Configure compute nodes' real memory in the Slurm cluster configuration.
+  - Add new configuration parameter `Scheduling/SlurmSettings/EnableMemoryBasedScheduling` to enable memory-based scheduling in Slurm.
   - Add new configuration parameter `Scheduling/SlurmQueues/ComputeResources/SchedulableMemory` to override default value of the memory seen by the scheduler on compute nodes.
+- Improve flexibility on cluster configuration updates to avoid the stop and start of the entire cluster whenever possible.
+  - Add new configuration parameter `Scheduling/SlurmSettings/QueueUpdateStrategy` to set the preferred strategy to adopt for compute nodes needing a configuration update and replacement.
+- Add support to mount existing FSx for ONTAP and FSx for OpenZFS file systems.
+- Add support to mount multiple instances of existing EFS, FSx for Lustre / for ONTAP/ for OpenZFS file systems.
+- Add support for FSx Lustre Persistent_2 deployment type.
 - Prompt user to enable EFA for supported instance types when using `pcluster configure` wizard.
-- Change default EBS volume types from gp2 to gp3 in both the root and additional volumes.
 - Add support for rebooting compute nodes via Slurm.
 
 **CHANGES**
-- Remove support for Python 3.6.
 - Upgrade Slurm to version 21.08.8-2.
-- Do not require `PlacementGroup/Enabled` to be set to `true` when passing an existing `PlacementGroup/Id`.
+- Upgrade EFA installer to version 1.17.2 
+  - ---TBC---
+- Change default EBS volume types from gp2 to gp3 for both the root and additional volumes.
 - Changes to FSx for Lustre file systems created by ParallelCluster:
   - Change the default deployment type to `Scratch_2`.
   - Change the Lustre server version to `2.12`.
-- Add `lambda:ListTags` and `lambda:UntagResource` to `ParallelClusterUserRole` used by ParallelCluster API stack for cluster update.
+- Do not require `PlacementGroup/Enabled` to be set to `true` when passing an existing `PlacementGroup/Id`.
 - Add `parallelcluster:cluster-name` tag to all resources created by ParallelCluster.
 - Do not allow setting `PlacementGroup/Id` when `PlacementGroup/Enabled` is explicitly set to `false`.
-- Restrict IPv6 access to IMDS to root and cluster admin users only, when configuration parameter `HeadNode/Imds/Secured` is enabled.
-- Change the default root volume size from 35 GiB to the size of AMIs. The default can be overwritten in cluster configuration file.
+- Add `lambda:ListTags` and `lambda:UntagResource` to `ParallelClusterUserRole` used by ParallelCluster API stack for cluster update.
+- Restrict IPv6 access to IMDS to root and cluster admin users only, when configuration parameter `HeadNode/Imds/Secured` is true as by default.
+- With a custom AMI, use the AMI root volume size instead of the ParallelCluster default of 35 GiB. The value can be changed in the cluster configuration file.
 - Automatic disabling of the compute fleet when the configuration parameter `Scheduling/SlurmQueues/ComputeResources/SpotPrice`
   is lower than the minimum required Spot request fulfillment price.
-- Show `requested_value` and `current_value` values in the change set when adding or removing a section.
+- Show `requested_value` and `current_value` values in the change set when adding or removing a section during an update.
 - Do not replace dynamic node in POWER_DOWN as jobs may be still running.
+- Remove support for Python 3.6.
 
 **BUG FIXES**
-- Fix default for disable validate and test components when building custom AMI. The default was to disable those components, but it wasn't effective.
-- Handle corner case in the scaling logic when instance is just launched and the describe instances API doesn't report yet all the EC2 info.
-- Dropped validation that would prevent ARM instance type to be used when `DisableSimultaneousMultithreading` was set to true.
-- Add missing policies for EcrImageDeletionLambda and ImageBuilderInstance roles that were causing failure when upgrading ParallelCluster API from one version to another.
+- Fix the default behaviour to disable the validation and test components when building a custom AMI.
+- Handle corner case in the scaling logic when an instance is just launched and the describe instances API call doesn't report all the EC2 info yet.
+- Fixed support for `DisableSimultaneousMultithreading` parameter on instance types with ARM processors.
+- Add missing policies for `EcrImageDeletionLambda` and `ImageBuilderInstance` roles that were causing failure when upgrading ParallelCluster API from one version to another.
 
 3.1.4
 ------

From 626ea099fb071b7ee1660c02e880a8899c057f06 Mon Sep 17 00:00:00 2001
From: mauri-melato <1615209+mauri-melato@users.noreply.github.com>
Date: Tue, 19 Jul 2022 16:16:01 +0200
Subject: [PATCH 2/4] Update CHANGELOG.md

Added missing important software updates and features.
---
 CHANGELOG.md | 26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index d1d389a835..8f1aca45bd 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -11,22 +11,32 @@ CHANGELOG
   - Add new configuration parameter `Scheduling/SlurmQueues/ComputeResources/SchedulableMemory` to override default value of the memory seen by the scheduler on compute nodes.
 - Improve flexibility on cluster configuration updates to avoid the stop and start of the entire cluster whenever possible.
   - Add new configuration parameter `Scheduling/SlurmSettings/QueueUpdateStrategy` to set the preferred strategy to adopt for compute nodes needing a configuration update and replacement.
+- Improve the failover mechanism across available compute resources when hitting insufficient capacity issues with EC2 instances. Disable compute nodes for a configurable amount of time (default 10 min) when a node launch fails due to insufficient capacity.
 - Add support to mount existing FSx for ONTAP and FSx for OpenZFS file systems.
 - Add support to mount multiple instances of existing EFS, FSx for Lustre / for ONTAP/ for OpenZFS file systems.
-- Add support for FSx Lustre Persistent_2 deployment type.
+- Add support for FSx for Lustre Persistent_2 deployment type when creating a new file system.
 - Prompt user to enable EFA for supported instance types when using `pcluster configure` wizard.
 - Add support for rebooting compute nodes via Slurm.
+- Improve handling of Slurm power states to also account for manual powering down of nodes.
+- Add NVIDIA GDRCopy 2.3 into the product AMIs to enable low-latency GPU memory copy.
 
 **CHANGES**
-- Upgrade Slurm to version 21.08.8-2.
-- Upgrade EFA installer to version 1.17.2 
-  - ---TBC---
+- Upgrade EFA installer to version 1.17.2
+  - EFA driver: ``efa-1.16.0-1``
+  - EFA configuration: ``efa-config-1.10-1``
+  - EFA profile: ``efa-profile-1.5-1``
+  - Libfabric: ``libfabric-aws-1.16.0~amzn2.0-1``
+  - RDMA core: ``rdma-core-41.0-2``
+  - Open MPI: ``openmpi40-aws-4.1.4-2``
+- Upgrade NICE DCV to version 2022.0-12760.
+- Upgrade NVIDIA driver to version 470.129.06.
+- Upgrade NVIDIA Fabric Manager to version 470.129.06.
 - Change default EBS volume types from gp2 to gp3 for both the root and additional volumes.
 - Changes to FSx for Lustre file systems created by ParallelCluster:
   - Change the default deployment type to `Scratch_2`.
   - Change the Lustre server version to `2.12`.
 - Do not require `PlacementGroup/Enabled` to be set to `true` when passing an existing `PlacementGroup/Id`.
-- Add `parallelcluster:cluster-name` tag to all resources created by ParallelCluster.
+- Add `parallelcluster:cluster-name` tag to all the resources created by ParallelCluster.
 - Do not allow setting `PlacementGroup/Id` when `PlacementGroup/Enabled` is explicitly set to `false`.
 - Add `lambda:ListTags` and `lambda:UntagResource` to `ParallelClusterUserRole` used by ParallelCluster API stack for cluster update.
 - Restrict IPv6 access to IMDS to root and cluster admin users only, when configuration parameter `HeadNode/Imds/Secured` is true as by default.
@@ -34,13 +44,13 @@ CHANGELOG
 - Automatic disabling of the compute fleet when the configuration parameter `Scheduling/SlurmQueues/ComputeResources/SpotPrice`
   is lower than the minimum required Spot request fulfillment price.
 - Show `requested_value` and `current_value` values in the change set when adding or removing a section during an update.
-- Do not replace dynamic node in POWER_DOWN as jobs may be still running.
+- Disable `aws-ubuntu-eni-helper` service in DLAMI to avoid conflicts with `configure_nw_interface.sh` when configuring instances with multiple network cards.
 - Remove support for Python 3.6.
 
 **BUG FIXES**
-- Fix the default behaviour to disable the validation and test components when building a custom AMI.
+- Fix the default behavior to skip the validation and test steps when building a custom AMI.
 - Handle corner case in the scaling logic when an instance is just launched and the describe instances API call doesn't report all the EC2 info yet.
-- Fixed support for `DisableSimultaneousMultithreading` parameter on instance types with ARM processors.
+- Fixed support for `DisableSimultaneousMultithreading` parameter on instance types with Arm processors.
 - Add missing policies for `EcrImageDeletionLambda` and `ImageBuilderInstance` roles that were causing failure when upgrading ParallelCluster API from one version to another.
 
 3.1.4

From c2a9905ad4e9bafaaaf6c0fbf71f82f1bc894123 Mon Sep 17 00:00:00 2001
From: mauri-melato <1615209+mauri-melato@users.noreply.github.com>
Date: Tue, 19 Jul 2022 16:21:24 +0200
Subject: [PATCH 3/4] Update CHANGELOG.md

---
 CHANGELOG.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 8f1aca45bd..e977b80622 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -49,6 +49,7 @@ CHANGELOG
 
 **BUG FIXES**
 - Fix the default behavior to skip the validation and test steps when building a custom AMI.
+- Fix file handle leak in `computemgtd`.
 - Handle corner case in the scaling logic when an instance is just launched and the describe instances API call doesn't report all the EC2 info yet.
 - Fixed support for `DisableSimultaneousMultithreading` parameter on instance types with Arm processors.
 - Add missing policies for `EcrImageDeletionLambda` and `ImageBuilderInstance` roles that were causing failure when upgrading ParallelCluster API from one version to another.

From a5ecd426763dec269d3707be7b8b6f8dff081161 Mon Sep 17 00:00:00 2001
From: mauri-melato <1615209+mauri-melato@users.noreply.github.com>
Date: Tue, 19 Jul 2022 16:22:22 +0200
Subject: [PATCH 4/4] Update CHANGELOG.md

---
 CHANGELOG.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index e977b80622..fc377dfe7e 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -50,7 +50,7 @@ CHANGELOG
 **BUG FIXES**
 - Fix the default behavior to skip the validation and test steps when building a custom AMI.
 - Fix file handle leak in `computemgtd`.
-- Handle corner case in the scaling logic when an instance is just launched and the describe instances API call doesn't report all the EC2 info yet.
+- Fix race condition that sporadically caused launched instances to be terminated immediately because they were not yet available in the EC2 `DescribeInstances` response.
 - Fixed support for `DisableSimultaneousMultithreading` parameter on instance types with Arm processors.
 - Add missing policies for `EcrImageDeletionLambda` and `ImageBuilderInstance` roles that were causing failure when upgrading ParallelCluster API from one version to another.