Add chef attribute `cluster/in_place_update_on_fleet_enabled` to disable cfn-hup on compute and login nodes and improve performance at scale #3040

gmarciani · 2025-10-28T19:08:36Z

Description of changes

This PR mitigates the performance degradation reported in aws/aws-parallelcluster#6449.

Add a new Chef attribute `cluster/in_place_update_on_fleet_enabled` to disable cfn-hup on compute and login nodes in order to improve performance at scale.
When cfn-hup is disabled on compute and login nodes, the cluster readiness checks executed by the head node are disabled as well.
The attribute takes effect at config time (cluster creation or update), not at image build time.

Q&A

Why disable the cluster readiness checks?
Cluster readiness checks verify that all running compute/login nodes have deployed the expected config version. If cfn-hup is disabled, those nodes cannot apply the config version carried by the cluster update, so the check would always fail.
Why not call the attribute in_place_update_on_fleet_disabled?
We named it in_place_update_on_fleet_enabled rather than in_place_update_on_fleet_disabled for consistency (we use positive attributes in the rest of the cookbook) and maintainability (positive attributes are less error-prone; for example, they avoid double negations).
Why disable cfn-hup on login nodes if only compute nodes cause the performance degradation?
To provide a consistent user experience and implementation. If we kept cfn-hup on login nodes, login nodes would support in-place updates while compute nodes would not, which could lead to confusion and complexity.
Why not test updating the new attribute?
Updates to ExtraChefAttributes have never been supported, as per the update policy here.
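
To make the first answer above concrete, here is a minimal Ruby sketch of what a readiness check of this kind verifies. It is not the cookbook's implementation: the `Node` struct, the `stale_nodes` and `cluster_ready?` helpers, and the version strings are all hypothetical names introduced for illustration.

```ruby
# Hypothetical sketch: the Node record and its fields are illustrative,
# not taken from the cookbook.
Node = Struct.new(:id, :config_version)

# Returns the nodes that have not deployed the expected config version.
def stale_nodes(nodes, expected_version)
  nodes.reject { |node| node.config_version == expected_version }
end

# The cluster is considered ready only when no running node is stale.
def cluster_ready?(nodes, expected_version)
  stale_nodes(nodes, expected_version).empty?
end

# With cfn-hup disabled, running compute/login nodes keep the old version,
# so a check against the new version can never pass.
fleet = [Node.new('compute-1', 'v1'), Node.new('login-1', 'v1')]
puts cluster_ready?(fleet, 'v2')
```

Under these assumptions, any update that bumps the expected version leaves every running fleet node stale, which is why the check is skipped rather than left to fail.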

User Experience

By default the attribute is true, so cfn-hup is enabled on all cluster nodes.
When set to false, cfn-hup is disabled on both compute and login nodes. In that case the cluster readiness checks are also disabled: without cfn-hup, compute/login nodes cannot start an in-place update, so the checks would always fail.

[UseCase 1] in-place updates enabled

This is the default behavior, where cfn-hup is enabled on the head node, compute nodes and login nodes. Because cfn-hup is enabled, compute/login nodes are able to execute in-place updates, so the head node executes the usual cluster readiness check at the end of the update.

[UseCase 2] in-place updates disabled

cfn-hup is enabled on the head node, but disabled on both compute nodes and login nodes. Because cfn-hup is disabled, compute/login nodes are not able to execute in-place updates, so the head node does not execute the cluster readiness check at the end of the update.
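
The two use cases above reduce to a small decision table. The Ruby sketch below shows one way to express it; it is illustrative only and not the code in this PR. The helper names, the `FLEET_NODE_TYPES` constant, the node type strings, and the tolerance for string values such as `"false"` are assumptions of this sketch.

```ruby
# Hypothetical helpers: names and node type strings are illustrative.
FLEET_NODE_TYPES = %w[ComputeFleet LoginNode].freeze

# Assumption: the attribute may arrive as a boolean or as a string,
# and defaults to true when it is not set.
def in_place_update_on_fleet_enabled?(cluster_attrs)
  value = cluster_attrs.fetch('in_place_update_on_fleet_enabled', true)
  value.to_s.strip.downcase != 'false'
end

# cfn-hup always runs on nodes outside the fleet (the head node);
# on compute and login nodes it follows the attribute.
def cfn_hup_enabled?(node_type, cluster_attrs)
  return true unless FLEET_NODE_TYPES.include?(node_type)

  in_place_update_on_fleet_enabled?(cluster_attrs)
end

# The head node runs the readiness check only when the fleet
# is able to apply in-place updates.
def cluster_readiness_check_enabled?(cluster_attrs)
  in_place_update_on_fleet_enabled?(cluster_attrs)
end
```

Deriving both decisions from a single predicate keeps cfn-hup and the readiness check from drifting apart, which matches the consistency argument given in the Q&A.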

Tests

Unit tests (existing and new ones).
Manually validated all the use cases reported in User Experience.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

…abled` to disable in-place updates on compute and login nodes by disabling cfn-hup on those nodes. As a consequence, it also disables the cluster readiness checks executed by the head node on cluster update. Disabling cfn-hup mitigates a significant performance degradation that may occur with tightly coupled workloads at scale.

...ooks/aws-parallelcluster-platform/templates/supervisord/parallelcluster_supervisord.conf.erb

cookbooks/aws-parallelcluster-shared/attributes/cluster.rb

himani2411 · 2025-10-29T19:02:14Z

I would suggest that, when you describe the user experience, you refrain from mentioning [UseCase 1] cfn-hup enabled or [UseCase 1] cfn-hup disabled and instead explain what cfn-hup is used for, just as cluster_readiness_check is explained.

gmarciani · 2025-10-29T20:13:20Z

> I would suggest that, when you describe the user experience, you refrain from mentioning [UseCase 1] cfn-hup enabled or [UseCase 1] cfn-hup disabled and instead explain what cfn-hup is used for, just as cluster_readiness_check is explained.

Done, both here and in the PR for the CLI, aws/aws-parallelcluster#7071.

himani2411

LGTM!

gmarciani added the 3.x label Oct 28, 2025

gmarciani mentioned this pull request Oct 28, 2025

[Validators] Add validator ExtraChefAttributesValidator to validate format of ExtraChefAttributes and notify the user about downsides of in_place_update_on_fleet_enabled=False aws/aws-parallelcluster#7071

Merged

gmarciani marked this pull request as ready for review October 28, 2025 22:19

gmarciani requested review from a team as code owners October 28, 2025 22:19

gmarciani force-pushed the wip/mgiacomo/3150/performance-disable-cfnhup-on-compute-nodes-1024-1 branch from b4acceb to 7e49541 Compare October 29, 2025 17:07

gmarciani changed the title ~~Add chef attribute cluster/cfnhup_on_fleet_enabled to disable cfn-hup on compute and login nodes.~~ Add chef attribute cluster/in_place_update_on_fleet_enabled to disable cfn-hup on compute and login nodes and improve performance at scale Oct 29, 2025

gmarciani force-pushed the wip/mgiacomo/3150/performance-disable-cfnhup-on-compute-nodes-1024-1 branch 3 times, most recently from aed529e to dedb84e Compare October 29, 2025 18:28

gmarciani force-pushed the wip/mgiacomo/3150/performance-disable-cfnhup-on-compute-nodes-1024-1 branch from dedb84e to 61fab33 Compare October 29, 2025 18:30

himani2411 reviewed Oct 29, 2025

...ooks/aws-parallelcluster-platform/templates/supervisord/parallelcluster_supervisord.conf.erb

himani2411 reviewed Oct 29, 2025

cookbooks/aws-parallelcluster-shared/attributes/cluster.rb

gmarciani enabled auto-merge (rebase) October 29, 2025 21:35

himani2411 approved these changes Oct 30, 2025

gmarciani merged commit 6eda378 into aws:develop Oct 30, 2025
26 of 30 checks passed

gmarciani deleted the wip/mgiacomo/3150/performance-disable-cfnhup-on-compute-nodes-1024-1 branch October 30, 2025 16:20

gmarciani mentioned this pull request Nov 3, 2025

Add chef attribute cluster/in_place_update_on_fleet_enabled to disable cfn-hup on compute and login nodes and improve performance at scale #3045

Merged
