Upgrade to Slurm 20.11.4 #2459
Conversation
Codecov Report
@@            Coverage Diff            @@
##           develop    #2459    +/-   ##
===========================================
+ Coverage    62.07%   62.09%    +0.02%
===========================================
  Files           40       40
  Lines         6220     6224        +4
===========================================
+ Hits          3861     3865        +4
  Misses        2359     2359
Continue to review full report at Codecov.
Force-pushed b31ae8d to 354162a
Force-pushed b377731 to 55143f1
CHANGELOG.md (Outdated)
- Make `key_name` parameter optional to support cluster configurations without a key pair.
- Remove support for Python 3.4
- Root volume size increased from 25GB to 35GB on all AMIs. Minimum root volume size is now 35GB.
- Upgrade Slurm to version 20.11.4.
nit: can you move this up in the list of changes?
Done
# Set gpus according to instance features
gpus = instance_type_info.gpu_count()
compute_resource_section.get_param("gpus").value = gpus
compute_resource_section.get_param("gpu_type").value = instance_type_info.gpu_type()
What if we just skip adding the gpu_type entry when there is no GPU, rather than using no_gpu_type?
The parameter will still be there in the JSON because it is defined in mappings.py? Are you saying we should use an empty string "" instead of "no_gpu_type" as the default?
Either empty "" or None, so that it corresponds to a falsy value in Python. But feel free to leave it as is if you prefer.
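For illustration, a minimal sketch of the falsy-default idea, assuming an `instance_type_info` object with `gpu_count()`/`gpu_type()` as in the diff above; the `render_gres_line` helper and the gres.conf line format are assumptions for this sketch, not the actual ParallelCluster code:

```python
# Hypothetical sketch: treat a missing GPU type as None (falsy) instead of the
# "no_gpu_type" sentinel, and emit no GRES entry at all for GPU-less instances.
def render_gres_line(instance_type_info):
    """Return a gres.conf line for the node, or None if it has no GPU."""
    gpu_count = instance_type_info.gpu_count()
    gpu_type = instance_type_info.gpu_type() or None  # falsy default instead of "no_gpu_type"
    if not gpu_count or not gpu_type:
        return None  # skip the entry entirely when there is nothing to declare
    return f"Name=gpu Type={gpu_type} File=/dev/nvidia[0-{gpu_count - 1}]"
```

With a falsy default, callers can use a plain `if gpu_type:` check instead of comparing against a sentinel string.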
_wait_for_node_reset(scheduler_commands, static_nodes=static_nodes, dynamic_nodes=[])
assert_num_instances_in_cluster(cluster_name, region, len(static_nodes))
# Reset SlurmdTimeout to 180s
_set_slurmd_timeout(remote_command_executor, timeout=180)
If we change this value in the cookbook, the test will be misaligned. Can we restore the previous value without hardcoding it?
I think hardcoding the value is clearer. Since we are changing the Slurm configuration across test cases, the test could fail if the value is not reset correctly or is set to some value we are not aware of. The hardcoded value will make updating the value in the cookbook a bit more complicated, but IMO it might help us spot issues with our tests more easily.
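For context, a rough sketch of what a helper like `_set_slurmd_timeout` could look like; the `remote_command_executor.run_remote_command()` API and the `/opt/slurm/etc/slurm.conf` path are assumptions here, not confirmed by the diff:

```python
# Illustrative sketch, not the actual test helper: rewrite SlurmdTimeout in
# slurm.conf on the head node and ask slurmctld to pick up the change.
def _set_slurmd_timeout(remote_command_executor, timeout):
    remote_command_executor.run_remote_command(
        f"sudo sed -i '/^SlurmdTimeout/d' /opt/slurm/etc/slurm.conf && "
        f"echo 'SlurmdTimeout={timeout}' | sudo tee -a /opt/slurm/etc/slurm.conf"
    )
    # Reload the configuration so the new timeout takes effect
    remote_command_executor.run_remote_command("sudo /opt/slurm/bin/scontrol reconfigure")
```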
# Can take up to 15 mins for status check to show


def _test_status_check_replacement(
nit: test_ec2_status_check_replacement?
Done
_wait_for_node_reset(scheduler_commands, static_nodes=[], dynamic_nodes=dynamic_nodes)
# Assert static nodes are reset
_wait_for_node_reset(scheduler_commands, static_nodes=static_nodes, dynamic_nodes=[])
assert_num_instances_in_cluster(cluster_name, region, len(static_nodes))
do we need to check logs to make sure the replacement happened due to failing health checks?
Yes, we check above that `Setting nodes failing health check type ec2_health_check to DRAIN` appears in the clustermgtd log.
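As a sketch, the log check could look roughly like this; the `/var/log/parallelcluster/clustermgtd` log path and the `run_remote_command()` API are assumptions, and the helper name is made up for illustration:

```python
# Illustrative sketch of the clustermgtd log assertion described above.
def _assert_ec2_health_check_drain_logged(remote_command_executor):
    clustermgtd_log = remote_command_executor.run_remote_command(
        "sudo cat /var/log/parallelcluster/clustermgtd"
    ).stdout
    # The message logged when clustermgtd drains nodes failing EC2 status checks
    assert "Setting nodes failing health check type ec2_health_check to DRAIN" in clustermgtd_log
```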
Force-pushed 2862f2f to 366300b
* Add gpu_type to internal config to specify instance GPU model as GRES GPU Type in gres.conf
* Update integ test to submit a job requesting a specific GPU model
* Add test for checking Slurm health check behavior for CLOUD nodes. Previously Slurm would not perform the node health check before ResumeTimeout expired; this is fixed in 20.11.4
* Modify status check test to avoid the Slurm health check and use static capacity
* Add additional test to make sure nodeaddr and nodehostname are reset on power_down, verifying the behavior of relying on the cloud_reg_addrs option (see the sketch below)

Signed-off-by: Rex <shuningc@amazon.com>
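As a rough illustration of that last point, a check along these lines could verify the reset, assuming `scontrol show node` output and the `run_remote_command()` API; the helper itself is hypothetical:

```python
# Hypothetical sketch: after power_down, a cloud node's NodeAddr and
# NodeHostName should fall back to the node name when cloud_reg_addrs is used.
def _assert_node_addr_and_hostname_reset(remote_command_executor, node_name):
    output = remote_command_executor.run_remote_command(
        f"/opt/slurm/bin/scontrol show node {node_name}"
    ).stdout
    # On a powered-down cloud node both fields should equal the node name again
    assert f"NodeAddr={node_name}" in output
    assert f"NodeHostName={node_name}" in output
```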
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.