
Conversation

Contributor

@rexcsn rexcsn commented Feb 23, 2021

  • Add gpu_type to the internal config to specify the instance GPU model as the GRES GPU Type in gres.conf (see the illustrative gres.conf entry below)
  • Update the integration test to submit a job requesting a specific GPU model
  • Add a test checking Slurm health check behavior for CLOUD nodes. Previously Slurm would not perform a node health check before ResumeTimeout expired; this is fixed in 20.11.4
  • Modify the status check test to avoid the Slurm health check and use static capacity
  • Add an additional test to make sure nodeaddr and nodehostname are reset on power_down. This verifies the behavior of relying on the cloud_reg_addrs option
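
For context, the GRES Type ends up in gres.conf roughly like the entry below. This is only an illustrative sketch (the node name, count, and device file are hypothetical), not a line taken from this PR:

# gres.conf sketch: expose one NVIDIA V100 GPU with an explicit Type
NodeName=queue1-dy-p32xlarge-1 Name=gpu Type=v100 File=/dev/nvidia0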

Signed-off-by: Rex <shuningc@amazon.com>

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.


codecov bot commented Feb 23, 2021

Codecov Report

Merging #2459 (e69cda7) into develop (0fdcbc4) will increase coverage by 0.02%.
The diff coverage is 100.00%.


@@             Coverage Diff             @@
##           develop    #2459      +/-   ##
===========================================
+ Coverage    62.07%   62.09%   +0.02%     
===========================================
  Files           40       40              
  Lines         6220     6224       +4     
===========================================
+ Hits          3861     3865       +4     
  Misses        2359     2359              
Impacted Files                                  Coverage Δ
cli/src/pcluster/config/mappings.py             100.00% <ø> (ø)
cli/src/pcluster/config/json_param_types.py     99.36% <100.00%> (+<0.01%) ⬆️
cli/src/pcluster/utils.py                       67.79% <100.00%> (+0.15%) ⬆️

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0fdcbc4...e69cda7.

@rexcsn rexcsn force-pushed the slurm/gpu branch 2 times, most recently from b31ae8d to 354162a on February 23, 2021 20:10
@rexcsn rexcsn changed the title Specify instance GPU model as GRES GPU Type in gres.conf to Upgrade to Slurm 20.11.4 on Feb 23, 2021
@rexcsn rexcsn force-pushed the slurm/gpu branch 2 times, most recently from b377731 to 55143f1 on February 24, 2021 00:29
@rexcsn rexcsn marked this pull request as ready for review February 24, 2021 01:56
CHANGELOG.md Outdated
- Make `key_name` parameter optional to support cluster configurations without a key pair.
- Remove support for Python 3.4
- Root volume size increased from 25GB to 35GB on all AMIs. Minimum root volume size is now 35GB.
- Upgrade Slurm to version 20.11.4.
Contributor

nit: can you move this up in the list of changes?

Contributor

Also, I would only report items #2 and #4 from the list below in this changelog; the other changes are not really meaningful for the user.

Contributor Author

Done

# Set gpus and gpu_type according to instance features
gpus = instance_type_info.gpu_count()
compute_resource_section.get_param("gpus").value = gpus
compute_resource_section.get_param("gpu_type").value = instance_type_info.gpu_type()
Contributor

What if we just skip adding the gpu_type entry if there is no GPU, rather than using no_gpu_type?

Contributor Author

The parameter will still be there in the JSON because it is defined in mappings.py. Are you saying we should use an empty string "" instead of "no_gpu_type" as the default?

Contributor

Either an empty "" or None, so that it corresponds to a falsy value in Python. But feel free to leave it as is if you prefer.
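
For illustration only (not code from this PR), a falsy default makes it easy for the gres.conf rendering to drop the Type field with a plain truthiness check; the function and parameter names below are hypothetical:

# Hypothetical sketch: emit a gres.conf line, omitting Type when gpu_type is falsy.
def build_gres_line(node_name, gpu_count, gpu_type=None):
    if gpu_count == 0:
        return None  # no GRES entry at all for instances without GPUs
    line = f"NodeName={node_name} Name=gpu Count={gpu_count}"
    if gpu_type:  # "" or None means the GPU model is unknown, so skip Type
        line += f" Type={gpu_type}"
    return line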

_wait_for_node_reset(scheduler_commands, static_nodes=static_nodes, dynamic_nodes=[])
assert_num_instances_in_cluster(cluster_name, region, len(static_nodes))
# Reset SlurmdTimeout to 180s
_set_slurmd_timeout(remote_command_executor, timeout=180)
Contributor

If we change this value in the cookbook, the test will be misaligned. Can we restore the previous value without hardcoding it?

Contributor Author

I think that hardcoding the value is clearer. Since we are changing the Slurm configuration across test cases, the test could fail if the value is not reset correctly, or is set to some value we are not aware of. The hardcoded value will make updating the value in the cookbook a bit more complicated, but IMO it might help us spot issues with our tests more easily.
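
As a point of comparison, a non-hardcoded variant could read the current timeout before modifying it and restore that saved value afterwards. This is only a sketch, assuming remote_command_executor.run_remote_command returns a result object with stdout as it does elsewhere in the integration tests:

# Hypothetical sketch: capture SlurmdTimeout so it can be restored later.
def _get_slurmd_timeout(remote_command_executor):
    output = remote_command_executor.run_remote_command(
        "scontrol show config | grep SlurmdTimeout"
    ).stdout
    # Example output line: "SlurmdTimeout           = 180 sec"
    return int(output.split("=")[1].split()[0])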

# Can take up to 15 mins for status check to show


def _test_status_check_replacement(
Contributor

nit: test_ec2_status_check_replacement?

Contributor Author

Done

_wait_for_node_reset(scheduler_commands, static_nodes=[], dynamic_nodes=dynamic_nodes)
# Assert static nodes are reset
_wait_for_node_reset(scheduler_commands, static_nodes=static_nodes, dynamic_nodes=[])
assert_num_instances_in_cluster(cluster_name, region, len(static_nodes))
Contributor

Do we need to check the logs to make sure the replacement happened due to failing health checks?

Contributor Author

Yes, above we check that `Setting nodes failing health check type ec2_health_check to DRAIN` is in the clustermgtd log.
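
For illustration, such a log assertion could look like the sketch below; the log path and helper usage are assumptions for this example, not lines quoted from the PR:

# Hypothetical sketch: assert clustermgtd logged the health-check DRAIN action.
from assertpy import assert_that

clustermgtd_log = remote_command_executor.run_remote_command(
    "sudo cat /var/log/parallelcluster/clustermgtd"
).stdout
assert_that(clustermgtd_log).contains(
    "Setting nodes failing health check type ec2_health_check to DRAIN"
)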

@rexcsn rexcsn force-pushed the slurm/gpu branch 4 times, most recently from 2862f2f to 366300b on February 24, 2021 19:54