@hehe7318
Contributor

Description of changes

  • Use c5.xlarge instead of the hpc5a instance type, since the Slurm job only runs sleep.
  • Bootstrap time on Rocky and RHEL is noticeably longer than on other OSs: a c5.xlarge takes around 6 minutes to bootstrap on them. Set stop_max_delay_secs based on the OS and increase the wait_for_num_nodes_in_scheduler timeout to 7 minutes.
  • Ensure the job's run time exceeds the _wait_for_node_reset timeout plus the _assert_nodes_not_terminated waiting_time.
  • Modify the parameter name to improve readability.
  • Add a 45-second delay to accommodate the node replacement process (~45s between the node down status and its replacement).
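
A minimal sketch of the timing logic described above, not the code from this PR. The function names, the default timeout, the 420-second value for slow OSs, and the safety margin are all assumptions chosen for illustration; only the roughly 6-minute Rocky/RHEL bootstrap time comes from the description.

```python
# Illustrative sketch only: every name and numeric value here is hypothetical,
# except that Rocky/RHEL bootstrap takes around 6 minutes (from the PR text).
SLOW_BOOTSTRAP_OS_PREFIXES = ("rocky", "rhel")

DEFAULT_STOP_MAX_DELAY_SECS = 300  # assumed default for faster OSs
SLOW_OS_STOP_MAX_DELAY_SECS = 420  # assumed value, above the ~6-minute bootstrap


def stop_max_delay_secs_for(os_name: str) -> int:
    """Pick a retry deadline based on how slowly the OS bootstraps."""
    if os_name.lower().startswith(SLOW_BOOTSTRAP_OS_PREFIXES):
        return SLOW_OS_STOP_MAX_DELAY_SECS
    return DEFAULT_STOP_MAX_DELAY_SECS


def min_job_runtime_secs(
    node_reset_timeout_secs: int,
    not_terminated_wait_secs: int,
    margin_secs: int = 60,  # assumed safety margin
) -> int:
    """Return a job duration that outlasts both waiting phases of the test."""
    return node_reset_timeout_secs + not_terminated_wait_secs + margin_secs
```

Computing the sleep duration from the two waits, rather than hard-coding it, keeps the job alive for the whole check even if one of the timeouts is later raised.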

Tests

  • Test re-run passed

Reference

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop, add the branch name as a prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

…d setting stop_max_delay_secs based on the OS (aws#6737)

* Use `c5.xlarge` instead of the `hpc5a` instance type, since the Slurm job only runs sleep.
* Bootstrap time on Rocky and RHEL is noticeably longer than on other OSs: a `c5.xlarge` takes around 6 minutes to bootstrap on them. Set `stop_max_delay_secs` based on the OS and increase the `wait_for_num_nodes_in_scheduler` timeout to 7 minutes.
* Ensure the job's run time exceeds the `_wait_for_node_reset` timeout plus the `_assert_nodes_not_terminated` waiting_time.
* Modify the parameter name to improve readability.
* Add a 45-second delay to accommodate the node replacement process (~45s between the node down status and its replacement).
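
One way to picture the 45-second delay is as a grace period in front of a bounded poll. The sketch below is a hypothetical helper, not the test's actual implementation: `wait_until` and its parameters are assumed names, and only the 45-second figure comes from the commit message. The `sleep` and `clock` arguments are injectable so the helper can be exercised without real waiting.

```python
import time
from typing import Callable

# ~45s between node down status and replacement, per the commit message.
NODE_REPLACEMENT_GRACE_SECS = 45


def wait_until(
    predicate: Callable[[], bool],
    timeout_secs: float,
    interval_secs: float = 5.0,
    initial_delay_secs: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Wait out a grace period, then poll until predicate() is true or time runs out."""
    if initial_delay_secs > 0:
        # Give the scheduler time to swap the node before the first check.
        sleep(initial_delay_secs)
    deadline = clock() + timeout_secs
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_secs)
```

A call such as `wait_until(check, timeout_secs=420, initial_delay_secs=NODE_REPLACEMENT_GRACE_SECS)`, where `check` is any zero-argument callable, would skip the window in which the old node is down and its replacement has not yet appeared.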
@hehe7318 added the 3.x and skip-changelog-update labels Mar 19, 2025
@hehe7318 requested review from a team as code owners March 19, 2025 14:55
@hehe7318 enabled auto-merge (squash) March 19, 2025 15:24
@hehe7318 merged commit a0b68d2 into aws:develop Mar 19, 2025
24 checks passed
Labels

3.x
skip-changelog-update (Disables the check that enforces changelog updates in PRs)

2 participants