Add chef attribute cluster/in_place_update_on_fleet_enabled to disable cfn-hup on compute and login nodes and improve performance at scale
#3040
Description of changes
This PR mitigates the performance degradation reported in aws/aws-parallelcluster#6449
Add new chef attribute cluster/in_place_update_on_fleet_enabled to disable cfn-hup on compute and login nodes in order to improve performance at scale.
When cfn-hup is disabled on compute and login nodes, the cluster readiness checks executed by the head node are disabled.
The attribute takes effect at config time (cluster creation/update), not at image build time.
Q&A
Why disable the cluster readiness checks?
Cluster readiness checks verify that all running compute/login nodes deployed the expected config version. If cfn-hup is disabled, those nodes cannot apply the config version carried by the cluster update, so the check would always fail.
Why not call the attribute in_place_update_on_fleet_disabled?
We decided to name it in_place_update_on_fleet_enabled rather than in_place_update_on_fleet_disabled for consistency (we use positive attributes in the rest of the cookbook) and maintainability (positive attributes are less error prone, e.g. they avoid double negations).
Why disable cfn-hup on login nodes if the source of the performance degradation is only the compute nodes?
To provide a consistent user experience and implementation. If we kept cfn-hup on login nodes, login nodes would support in-place updates while compute nodes would not, which could lead to confusion and added complexity.
Why not test the update of the new attribute?
Updates to ExtraChefAttributes have never been supported, as per the update policy.
User Experience
By default the attribute is true, so cfn-hup is enabled on all cluster nodes.
When set to false, cfn-hup is disabled on both compute and login nodes. When this is the case, the cluster readiness checks are disabled because, without cfn-hup, compute/login nodes cannot start an in-place update, so such checks would always fail.
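As an illustration of how the attribute might be passed, the sketch below builds a JSON string suitable for ExtraChefAttributes. It assumes that the attribute path cluster/in_place_update_on_fleet_enabled maps to a nested JSON object and that a JSON boolean is accepted as the value; neither is confirmed by this description. The helper name extra_chef_attributes is hypothetical.

```python
import json


def extra_chef_attributes(in_place_update_on_fleet_enabled: bool) -> str:
    """Build a JSON string carrying the flag under the 'cluster' key.

    Assumption: the chef attribute path 'cluster/in_place_update_on_fleet_enabled'
    corresponds to nested JSON keys, and a JSON boolean is an accepted value.
    """
    attributes = {
        "cluster": {
            "in_place_update_on_fleet_enabled": in_place_update_on_fleet_enabled,
        }
    }
    return json.dumps(attributes)


# Disable in-place updates (and therefore cfn-hup) on compute and login nodes.
payload = extra_chef_attributes(False)
print(payload)
```

Because updates to ExtraChefAttributes are not supported, a value like this would have to be chosen at cluster creation.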
[UseCase 1] in-place updates enabled
This is the default behavior, where cfn-hup is enabled on the head node, compute nodes and login nodes. Because cfn-hup is enabled, compute/login nodes are able to execute in-place updates, so the head node executes the usual cluster readiness check at the end of the update.
[UseCase 2] in-place updates disabled
cfn-hup is enabled on the head node, but disabled on both compute nodes and login nodes. Because cfn-hup is disabled, compute/login nodes are not able to execute in-place updates, so the head node does not execute the cluster readiness check at the end of the update.
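The two use cases can be summarized as a small decision function. This is a plain Python model of the behavior described above, not the cookbook's implementation; the function name node_behavior and the node type strings are hypothetical.

```python
def node_behavior(node_type: str, in_place_update_on_fleet_enabled: bool = True) -> dict:
    """Model which node runs cfn-hup and whether the readiness check runs.

    node_type is one of "head", "compute", "login" (illustrative labels).
    The attribute defaults to True, matching the default described above.
    """
    if node_type not in ("head", "compute", "login"):
        raise ValueError(f"unknown node type: {node_type}")

    if node_type == "head":
        # cfn-hup always stays enabled on the head node; the readiness
        # check only runs when the fleet can perform in-place updates.
        cfn_hup_enabled = True
        runs_readiness_check = in_place_update_on_fleet_enabled
    else:
        # Compute and login nodes follow the attribute; they never run
        # the readiness check themselves.
        cfn_hup_enabled = in_place_update_on_fleet_enabled
        runs_readiness_check = False

    return {
        "cfn_hup_enabled": cfn_hup_enabled,
        "runs_readiness_check": runs_readiness_check,
    }


for node in ("head", "compute", "login"):
    print(node, node_behavior(node, in_place_update_on_fleet_enabled=False))
```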
Tests
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.