Conversation

@hanwen-cluster (Contributor)

The requirements set in these validators are minimums. Users should leave additional safety margin based on their workloads.

Tests

  • Manual testing with cluster configurations that do and do not trigger the validators is successful

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@hanwen-cluster hanwen-cluster requested review from a team as code owners December 13, 2024 17:10
@hanwen-cluster hanwen-cluster added the skip-changelog-update label (disables the check that enforces changelog updates in PRs) Dec 13, 2024
f"Head node instance type {head_node_instance_type} has {head_node_memory} GB of memory. "
f"Please choose a head node instance type with at least {required_memory} GB of memory"
f" to manage {total_max_compute_nodes} compute nodes.",
FailureLevel.ERROR,
Contributor

Why use FailureLevel.ERROR here rather than a Warning? I think customers have the right to decide what instance type they pay for, but we have a responsibility to warn them.

Contributor Author

Because the requirement is very relaxed.
For example, the validator allows a t3.micro to manage 10 nodes, a t3.medium to manage 85 nodes, and a t3.xlarge to manage an unlimited number of nodes. If the requirement is violated, it is very likely that scaling up would fail.
Also, the customer can suppress the validation anyway.
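
For readers skimming this later, here is a minimal sketch of the kind of check being discussed, assuming the usual pcluster Validator/FailureLevel base classes. The baseline and per-node constants are illustrative guesses chosen to roughly match the examples in this reply, not the values actually used in the PR:

```python
from pcluster.validators.common import FailureLevel, Validator  # assumed import path


class HeadNodeMemorySizeValidator(Validator):
    """Sketch: validate that the head node has enough memory to manage the compute fleet."""

    def _validate(self, head_node_instance_type: str, head_node_memory: float, total_max_compute_nodes: int):
        # Hypothetical sizing rule: a small fixed baseline plus a per-compute-node cost.
        # The constants below are back-of-the-envelope values that roughly reproduce the
        # examples above (t3.micro ~ 10 nodes, t3.medium ~ 85 nodes); they are NOT the PR's constants.
        base_memory_gib = 0.6
        memory_per_node_gib = 0.04
        required_memory = base_memory_gib + memory_per_node_gib * total_max_compute_nodes

        if head_node_memory < required_memory:
            self._add_failure(
                f"Head node instance type {head_node_instance_type} has {head_node_memory} GB of memory. "
                f"Please choose a head node instance type with at least {required_memory} GB of memory"
                f" to manage {total_max_compute_nodes} compute nodes.",
                FailureLevel.ERROR,
            )
```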

Contributor

That convinces me; I agree.

@hanwen-cluster hanwen-cluster changed the base branch from release-3.12 to develop December 13, 2024 17:23
if total_max_compute_nodes > 100:
    self._add_failure(
        "EBS shared storage is mounted on the head node and shared to the compute nodes. "
        "This is a performance bottle neck if the compute nodes rely on this shared storage. "
Contributor

It's not clear why this is a performance bottleneck. Performance of what components?

Contributor Author

Done

@hanwen-cluster hanwen-cluster marked this pull request as draft December 17, 2024 16:57
@hanwen-cluster hanwen-cluster force-pushed the release-3.12 branch 2 times, most recently from ec5c7db to 3378898 on December 18, 2024 19:29
@hanwen-cluster hanwen-cluster marked this pull request as ready for review December 18, 2024 19:30

class HeadNodeMemorySizeValidator(Validator):
    """
    Head Node Instance Type Validator.
Contributor

Probably need to change this docstring to match the new class name.

Contributor Author

Done

if total_max_compute_nodes > 100:
    self._add_failure(
        "EBS shared storage is mounted on the head node and shared to the compute nodes. "
        "Therefore, the head node network bandwidth is a performance bottle neck "
Contributor

Maybe call it network performance? It's still ambiguous like this.

Contributor Author

Done
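
To show how the quoted lines fit together, here is a minimal sketch of the surrounding validator, again assuming the usual pcluster Validator base class. The class name and the failure level are assumptions (neither appears in the quoted snippet), and the end of the message is inferred from the earlier version of the wording:

```python
from pcluster.validators.common import FailureLevel, Validator  # assumed import path


class SharedEbsPerformanceBottleNeckValidator(Validator):  # hypothetical name, for illustration only
    """Sketch: warn when shared EBS storage on the head node may throttle a large compute fleet."""

    def _validate(self, total_max_compute_nodes: int):
        # The 100-node threshold and the message text come from the diff quoted above.
        if total_max_compute_nodes > 100:
            self._add_failure(
                "EBS shared storage is mounted on the head node and shared to the compute nodes. "
                "Therefore, the head node network bandwidth is a performance bottle neck "
                "if the compute nodes rely on this shared storage.",  # tail inferred from the earlier snippet
                FailureLevel.WARNING,  # assumed; the level is not shown in the quoted diff
            )
```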

…pe w.r.t cluster size

The requirements set in these validators are minimums. Users should leave additional safety margin based on their workloads.

Signed-off-by: Hanwen <hanwenli@amazon.com>
@dreambeyondorange (Contributor) left a comment

Nice work!

@hanwen-cluster hanwen-cluster merged commit 69df4eb into aws:develop Dec 19, 2024
24 checks passed
hanwen-cluster added a commit to hanwen-cluster/aws-parallelcluster that referenced this pull request Dec 20, 2024
These tests started to fail after aws#6623. The tests didn't encounter scaling issues because they were not launching enough compute nodes to use the full capacity of the cluster.

Signed-off-by: Hanwen <hanwenli@amazon.com>
hanwen-cluster added a commit that referenced this pull request Dec 20, 2024
These tests started to fail after #6623. The tests didn't encounter scaling issues because they were not launching enough compute nodes to use the full capacity of the cluster.

Signed-off-by: Hanwen <hanwenli@amazon.com>
hgreebe pushed a commit to hgreebe/aws-parallelcluster that referenced this pull request Feb 26, 2025
These tests started to fail after aws#6623. The tests didn't encounter scaling issues because they were not launching enough compute nodes to use the full capacity of the cluster.

Signed-off-by: Hanwen <hanwenli@amazon.com>