[Bug Fix] Fix clustermgtd to detect bootstrap timeout so protected mode can trigger #702

Merged: hehe7318 merged 2 commits into aws:develop from hehe7318:wip/fix-clustermgtd-detect-resume-timeout on Apr 30, 2026

hehe7318 commented Apr 29, 2026

Description of changes

Compute nodes that hang during bootstrap (e.g. when a required VPC endpoint like DynamoDB is missing) loop indefinitely between ResumeTimeout and instance wake-up, because clustermgtd never counts them as bootstrap failures. Protected mode therefore never triggers and the cluster cannot stop the failing loop.

Root cause: Slurm clears the NOT_RESPONDING flag when ResumeTimeout expires, leaving the node in state DOWN+CLOUD+POWERED_DOWN. The previous expected set still included NOT_RESPONDING, so it never matched.

Fix: drop NOT_RESPONDING from SLURM_SCONTROL_RESUME_FAILED_STATE so failing nodes are counted and protected mode triggers at the configured threshold.
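
The mismatch can be illustrated with a minimal sketch. It assumes the check compares a set of expected flags against the node's `+`-joined state string; the helper names `parse_states` and `is_resume_failed` are hypothetical, and the actual matching logic in clustermgtd may differ.

```python
# Illustrative sketch only; helper names are hypothetical, not clustermgtd's API.
OLD_RESUME_FAILED_STATE = {"DOWN", "CLOUD", "POWERED_DOWN", "NOT_RESPONDING"}
NEW_RESUME_FAILED_STATE = {"DOWN", "CLOUD", "POWERED_DOWN"}


def parse_states(state_string):
    """Split a '+'-joined node state string into a set of flags."""
    return set(state_string.split("+"))


def is_resume_failed(state_string, expected):
    """Return True when every expected flag is present (assumed subset match)."""
    return expected <= parse_states(state_string)


# State reported after ResumeTimeout, as described in the root cause above.
observed = "DOWN+CLOUD+POWERED_DOWN"

print(is_resume_failed(observed, OLD_RESUME_FAILED_STATE))  # False
print(is_resume_failed(observed, NEW_RESUME_FAILED_STATE))  # True
```

The old set fails under either a subset or an exact-equality comparison, because NOT_RESPONDING is no longer part of the observed state.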

Tests

  • Unit tests passed.
  • Manual testing done: the cluster now enters protected mode as expected.

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

hehe7318 merged commit f836f6f into aws:develop Apr 30, 2026
12 checks passed
hehe7318 added a commit to aws/aws-parallelcluster-cookbook that referenced this pull request May 7, 2026
…ap (#3175)

When a compute node cannot reach DynamoDB during bootstrap (for example, a private subnet is missing the DynamoDB VPC gateway endpoint), Chef's `ruby_block[retrieve compute node info]` retries silently with no useful diagnostic information. The default configuration (30 retries × ~5 min per attempt) also exceeds Slurm's default ResumeTimeout (2100s), so the instance is terminated before Chef can emit a final error.

This change:

- Tightens timeouts and retries so Chef fails fast within ResumeTimeout:
  - `retries`: 30 → 5
  - `aws_connection_timeout_seconds`: 30 → 10
  - `aws_read_timeout_seconds`: 60 → 30
  - `shell_timeout_seconds`: 300 → 60
  - Worst-case total time: ~5–6 minutes (under the 35-minute `ResumeTimeout` default).
- Improves the failure message to include the CLI exit code and stderr, plus a hint about the most common cause (missing DynamoDB VPC endpoint on a private subnet).
- Fixes a pre-existing typo: `shell_timout_seconds` → `shell_timeout_seconds`.
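
The worst-case figures above can be checked with a rough calculation. This is a sketch under stated assumptions: each attempt is bounded by `shell_timeout_seconds`, the delay between retries is ignored, and the total attempt count is `retries + 1` (the initial attempt plus retries). The function name `worst_case_seconds` is hypothetical.

```python
# Rough upper bound on time spent retrying; ignores any delay between attempts.
RESUME_TIMEOUT_SECONDS = 2100  # 35 minutes, the default quoted above


def worst_case_seconds(retries, shell_timeout_seconds, count_initial_attempt=True):
    """Bound the total time as (number of attempts) x (per-attempt timeout)."""
    attempts = retries + 1 if count_initial_attempt else retries
    return attempts * shell_timeout_seconds


old = worst_case_seconds(retries=30, shell_timeout_seconds=300)
new = worst_case_seconds(retries=5, shell_timeout_seconds=60)

print(old, old < RESUME_TIMEOUT_SECONDS)  # 9300 False
print(new, new < RESUME_TIMEOUT_SECONDS)  # 360 True
```

Counting only the retries (`count_initial_attempt=False`) gives 300 seconds for the new settings, which matches the lower end of the ~5–6 minute estimate.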

Related bug: aws/aws-parallelcluster-node#702