Conversation

@hanwen-cluster (Contributor)

The requirements set in these validators are minimums. Users should leave additional safety margin based on their workloads.

Tests

  • Manual testing with cluster configurations that do and do not trigger the validators is successful

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@hanwen-cluster hanwen-cluster requested review from a team as code owners December 13, 2024 17:10
@hanwen-cluster hanwen-cluster added the skip-changelog-update label (disables the check that enforces changelog updates in PRs) Dec 13, 2024
f"Head node instance type {head_node_instance_type} has {head_node_memory} GB of memory. "
f"Please choose a head node instance type with at least {required_memory} GB of memory"
f" to manage {total_max_compute_nodes} compute nodes.",
FailureLevel.ERROR,
Contributor

Why use FailureLevel.ERROR here rather than a Warning? I think customers have the right to decide what instance type they pay for, but we have a responsibility to warn them.

Contributor Author

Because the requirement is very relaxed.
For example, the validator allows a t3.micro to manage 10 nodes, a t3.medium to manage 85 nodes, and a t3.xlarge to manage an unlimited number of nodes. If the requirement is violated, it is very likely that scaling up would fail.
Also, the customer can suppress the validation anyway.
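
For readers skimming this later, here is a minimal sketch of the kind of check being discussed, assuming the usual pcluster Validator/FailureLevel base classes. The baseline and per-node constants are illustrative guesses chosen to roughly match the examples in this reply, not the values actually used in the PR:

```python
from pcluster.validators.common import FailureLevel, Validator  # assumed import path


class HeadNodeMemorySizeValidator(Validator):
    """Sketch: validate that the head node has enough memory to manage the compute fleet."""

    def _validate(self, head_node_instance_type: str, head_node_memory: float, total_max_compute_nodes: int):
        # Hypothetical sizing rule: a small fixed baseline plus a per-compute-node cost.
        # The constants below are back-of-the-envelope values that roughly reproduce the
        # examples above (t3.micro ~ 10 nodes, t3.medium ~ 85 nodes); they are NOT the PR's constants.
        base_memory_gib = 0.6
        memory_per_node_gib = 0.04
        required_memory = base_memory_gib + memory_per_node_gib * total_max_compute_nodes

        if head_node_memory < required_memory:
            self._add_failure(
                f"Head node instance type {head_node_instance_type} has {head_node_memory} GB of memory. "
                f"Please choose a head node instance type with at least {required_memory} GB of memory"
                f" to manage {total_max_compute_nodes} compute nodes.",
                FailureLevel.ERROR,
            )
```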

Contributor

That convinces me; I agree.

@hanwen-cluster hanwen-cluster changed the base branch from release-3.12 to develop December 13, 2024 17:23
if total_max_compute_nodes > 100:
    self._add_failure(
        "EBS shared storage is mounted on the head node and shared to the compute nodes. "
        "This is a performance bottle neck if the compute nodes rely on this shared storage. "
Contributor

It's not clear why this is a performance bottleneck. Performance of what components?

Contributor Author

Done

@hanwen-cluster hanwen-cluster marked this pull request as draft December 17, 2024 16:57
@hanwen-cluster hanwen-cluster force-pushed the release-3.12 branch 2 times, most recently from ec5c7db to 3378898 on December 18, 2024 19:29
@hanwen-cluster hanwen-cluster marked this pull request as ready for review December 18, 2024 19:30

class HeadNodeMemorySizeValidator(Validator):
    """
    Head Node Instance Type Validator.
Contributor

Probably need to change this docstring to match the new class name.

Contributor Author

Done

if total_max_compute_nodes > 100:
    self._add_failure(
        "EBS shared storage is mounted on the head node and shared to the compute nodes. "
        "Therefore, the head node network bandwidth is a performance bottle neck "
Contributor

Maybe call it network performance? It's still ambiguous like this.

Contributor Author

Done
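
To show how the quoted lines fit together, here is a minimal sketch of the surrounding validator, again assuming the usual pcluster Validator base class. The class name and the failure level are assumptions (neither appears in the quoted snippet), and the end of the message is inferred from the earlier version of the wording:

```python
from pcluster.validators.common import FailureLevel, Validator  # assumed import path


class SharedEbsPerformanceBottleNeckValidator(Validator):  # hypothetical name, for illustration only
    """Sketch: warn when shared EBS storage on the head node may throttle a large compute fleet."""

    def _validate(self, total_max_compute_nodes: int):
        # The 100-node threshold and the message text come from the diff quoted above.
        if total_max_compute_nodes > 100:
            self._add_failure(
                "EBS shared storage is mounted on the head node and shared to the compute nodes. "
                "Therefore, the head node network bandwidth is a performance bottle neck "
                "if the compute nodes rely on this shared storage.",  # tail inferred from the earlier snippet
                FailureLevel.WARNING,  # assumed; the level is not shown in the quoted diff
            )
```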

…pe w.r.t cluster size

The requirements set in these validators are minimums. Users should leave additional safety margin based on their workloads.

Signed-off-by: Hanwen <hanwenli@amazon.com>
@dreambeyondorange (Contributor) left a comment

Nice work!

@hanwen-cluster hanwen-cluster merged commit 69df4eb into aws:develop Dec 19, 2024
24 checks passed
hanwen-cluster added a commit to hanwen-cluster/aws-parallelcluster that referenced this pull request Dec 20, 2024
These tests started to fail after aws#6623. The tests didn't encounter scaling issues because they were not launching enough compute nodes to use the full capacity of the cluster.

Signed-off-by: Hanwen <hanwenli@amazon.com>
hanwen-cluster added a commit that referenced this pull request Dec 20, 2024
These tests started to fail after #6623. The tests didn't encounter scaling issues because they were not launching enough compute nodes to use the full capacity of the cluster.

Signed-off-by: Hanwen <hanwenli@amazon.com>
hgreebe pushed a commit to hgreebe/aws-parallelcluster that referenced this pull request Feb 26, 2025
These tests started to fail after aws#6623. The tests didn't encounter scaling issues because they were not launching enough compute nodes to use the full capacity of the cluster.

Signed-off-by: Hanwen <hanwenli@amazon.com>