[Bug Fix] Fix clustermgtd to detect bootstrap timeout so protected mode can trigger by hehe7318 · Pull Request #702 · aws/aws-parallelcluster-node

hehe7318 · 2026-04-29T23:17:49Z

Description of changes

Compute nodes that hang during bootstrap (e.g. when a required VPC endpoint like DynamoDB is missing) loop indefinitely between ResumeTimeout and instance wake-up, because clustermgtd never counts them as bootstrap failures. Protected mode therefore never triggers and the cluster cannot stop the failing loop.

Root cause: Slurm clears the NOT_RESPONDING flag on ResumeTimeout, producing the state DOWN+CLOUD+POWERED_DOWN. The previous expected state set still included NOT_RESPONDING, so it never matched.

Fix: drop NOT_RESPONDING from SLURM_SCONTROL_RESUME_FAILED_STATE so failing nodes are counted and protected mode triggers at the configured threshold.
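
To illustrate the mismatch described above, here is a minimal sketch. It assumes the check is an exact comparison between the node's state flags and an expected set, and that failures are simply counted against a threshold. Only SLURM_SCONTROL_RESUME_FAILED_STATE is a name from this PR; `OLD_RESUME_FAILED_STATE`, `parse_state`, `is_resume_failed`, and `should_enter_protected_mode` are hypothetical, and the real clustermgtd logic may differ.

```python
# Hypothetical sketch, not the actual clustermgtd code.
# Assumption: a node counts as "resume failed" when its flags equal the expected set.

# Expected set before the fix (illustrative name): still contains NOT_RESPONDING.
OLD_RESUME_FAILED_STATE = frozenset({"DOWN", "CLOUD", "POWERED_DOWN", "NOT_RESPONDING"})

# Expected set after the fix: NOT_RESPONDING dropped.
SLURM_SCONTROL_RESUME_FAILED_STATE = frozenset({"DOWN", "CLOUD", "POWERED_DOWN"})


def parse_state(state_string):
    """Split a '+'-joined state string such as 'DOWN+CLOUD+POWERED_DOWN' into a set of flags."""
    return frozenset(flag for flag in state_string.split("+") if flag)


def is_resume_failed(state_string, expected=SLURM_SCONTROL_RESUME_FAILED_STATE):
    """Return True when the node's flags exactly match the expected failed-resume set."""
    return parse_state(state_string) == expected


def should_enter_protected_mode(node_states, threshold, expected=SLURM_SCONTROL_RESUME_FAILED_STATE):
    """Simplified threshold check: count matching nodes and compare against the threshold."""
    failures = sum(1 for state in node_states if is_resume_failed(state, expected))
    return failures >= threshold


# State reported after ResumeTimeout, as described in this PR.
state_after_timeout = "DOWN+CLOUD+POWERED_DOWN"

matched_before_fix = is_resume_failed(state_after_timeout, OLD_RESUME_FAILED_STATE)
matched_after_fix = is_resume_failed(state_after_timeout)
```

Under these assumptions, `matched_before_fix` is `False` and `matched_after_fix` is `True`, which is why the failure count never grew before the change.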

Tests

Unit tests passed.
Manual testing done: the cluster now enters protected mode as expected.

Checklist

Make sure you are pointing to the right branch.
If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
Check all commits' messages are clear, describing what and why vs how.
Make sure to have added unit tests or integration tests to cover the new/modified code.
Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.


…ap (#3175)

When a compute node cannot reach DynamoDB during bootstrap (for example, a private subnet is missing the DynamoDB VPC gateway endpoint), Chef's `ruby_block[retrieve compute node info]` retries silently with no useful diagnostic information. The default configuration (30 retries × ~5 min per attempt) also exceeds Slurm's default ResumeTimeout (2100s), so the instance is terminated before Chef can emit a final error.

This change:

- Tightens timeouts and retries so Chef fails fast within ResumeTimeout:
  - `retries`: 30 → 5
  - `aws_connection_timeout_seconds`: 30 → 10
  - `aws_read_timeout_seconds`: 60 → 30
  - `shell_timeout_seconds`: 300 → 60
  - Worst-case total time: ~5–6 minutes (under the 35-minute `ResumeTimeout` default).
- Improves the failure message to include the CLI exit code and stderr, plus a hint about the most common cause (missing DynamoDB VPC endpoint on a private subnet).
- Fixes a pre-existing typo: `shell_timout_seconds` → `shell_timeout_seconds`.

Related bug: aws/aws-parallelcluster-node#702
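
The timeout arithmetic in the cookbook change can be checked with a short sketch. It assumes each attempt is bounded by `shell_timeout_seconds` and ignores any delay between retries. Because it is not stated here whether the first attempt counts toward `retries`, the sketch reports both bounds. The function and variable names are illustrative and not taken from the cookbook.

```python
# Hypothetical sketch of the worst-case bootstrap retry time.
# Assumptions: every attempt runs until shell_timeout_seconds; no delay between retries.

RESUME_TIMEOUT_SECONDS = 2100  # Slurm ResumeTimeout default cited in the change (35 minutes)


def worst_case_bounds(retries, shell_timeout_seconds):
    """Return (low, high) worst-case seconds.

    low  assumes `retries` is the total number of attempts;
    high assumes one initial attempt plus `retries` retries.
    """
    low = retries * shell_timeout_seconds
    high = (retries + 1) * shell_timeout_seconds
    return low, high


def fits_within_resume_timeout(retries, shell_timeout_seconds):
    """True when even the higher bound finishes before ResumeTimeout."""
    _, high = worst_case_bounds(retries, shell_timeout_seconds)
    return high < RESUME_TIMEOUT_SECONDS


old_bounds = worst_case_bounds(retries=30, shell_timeout_seconds=300)
new_bounds = worst_case_bounds(retries=5, shell_timeout_seconds=60)

print("old worst case (s):", old_bounds, "fits:", fits_within_resume_timeout(30, 300))
print("new worst case (s):", new_bounds, "fits:", fits_within_resume_timeout(5, 60))
```

With the old values the bound is 9000 to 9300 seconds, far above 2100. With the new values it is 300 to 360 seconds, consistent with the ~5–6 minutes quoted above.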

hehe7318 requested review from a team as code owners April 29, 2026 23:17

hehe7318 added skip-changelog-update 3.x labels Apr 29, 2026

Add changelog

92704af

hehe7318 mentioned this pull request Apr 30, 2026

[Changelog] Add changelog for computenode bootstrap timeout detection fix in protected mode aws/aws-parallelcluster#7363

Merged

gmarciani approved these changes Apr 30, 2026


hehe7318 merged commit f836f6f into aws:develop Apr 30, 2026
12 checks passed

This was referenced Apr 30, 2026

Refer SLURM_SCONTROL_RESUME_FAILED_STATE in is_bootstrap_failure comment #703

Merged

Fix DynamoDB retry behavior and error message on compute node bootstrap aws/aws-parallelcluster-cookbook#3175

Merged

hehe7318 merged 2 commits into aws:develop from hehe7318:wip/fix-clustermgtd-detect-resume-timeout


2 participants
