
feat: avoid clustered restarts during upgrades #49

Merged
jacderida merged 4 commits into main from feat-avoid_clustered_restarts
Mar 30, 2026
@jacderida jacderida commented Mar 30, 2026

Summary

  • Increase the default staged rollout window from 1 hour to 24 hours, giving nodes more room to spread out their restarts
  • When a pending upgrade is detected, sleep for exactly the remaining rollout delay instead of waiting for the next check interval tick — this eliminates restart clustering caused by quantization to the check interval
  • Skip crates.io publish for pre-release versions (RC, alpha, beta) in the release workflow
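
The quantization effect described in the second bullet can be illustrated with a small sketch. This is not the code from `src/node.rs`; the function names and the 1-hour check interval are assumptions chosen for illustration. It compares when a node would restart if it only acts on check-interval ticks against sleeping for the exact remaining delay.

```rust
use std::time::Duration;

/// Wait until restart if the node only acts on check-interval ticks:
/// the remaining delay is rounded up to the next multiple of the interval.
/// (Hypothetical helper, not taken from the PR.)
fn quantized_wait(remaining: Duration, check_interval: Duration) -> Duration {
    let interval = check_interval.as_secs().max(1);
    let ticks = (remaining.as_secs() + interval - 1) / interval;
    Duration::from_secs(ticks * interval)
}

/// Wait until restart if the node sleeps for exactly the remaining delay.
fn exact_wait(remaining: Duration) -> Duration {
    remaining
}

fn main() {
    // Assumed check interval of 1 hour; the real value is not stated in the PR.
    let check_interval = Duration::from_secs(3600);
    for secs in [600u64, 1800, 3000] {
        let remaining = Duration::from_secs(secs);
        println!(
            "scheduled in {}s: tick-based restart after {}s, exact restart after {}s",
            secs,
            quantized_wait(remaining, check_interval).as_secs(),
            exact_wait(remaining).as_secs()
        );
    }
}
```

Under these assumptions, three nodes scheduled 600, 1800 and 3000 seconds out all restart together at the 3600-second tick, whereas exact sleeps keep them 20 minutes apart.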

Test plan

  • Validated with 100-node testnet using 2-hour rollout window
  • Scheduled restart times are uniformly distributed across the rollout window
  • Actual restart times match scheduled times (no burst clustering)
  • All existing unit tests pass
  • Clippy and cargo fmt pass

Test results

Auto-Upgrade Test Results: DEV-01

All 91 nodes (90 regular + 1 genesis) successfully upgraded from v0.7.0 to v0.7.10-rc.1.

| Check | Result |
|---|---|
| All nodes upgrade to v0.7.10-rc.1 | PASS (91/91) |
| Binary downloaded once per host | PASS (1 download per VM, rest cached) |
| Release info fetched once per host | PASS (2-3 fetches due to cache TTL over 2hr window, 34-53 cache hits each) |
| No upgrade errors | PASS (zero errors found) |
| Peer ID retention | PASS (identical before/after) |
| Port retention | N/A (nodes use random ports by design with 0.0.0.0:0) |
| Restart time distribution | PASS (spread across full 2-hour window, 2-11 per 10-min bucket, no clustering) |
| Graceful shutdown logged | PASS (91/91) |
| NRestarts = 1 | PASS (91/91) |
| No /proc/PID/exe (deleted) | PASS (91/91 - no stale binaries) |
| systemd stop/start cycle | PASS |

Key findings:

  • Binary caching works - only 1 download per host, subsequent nodes detect the binary was already replaced
  • Restart distribution is even - no clustered bursts, scheduled-to-actual accuracy within 5-11 seconds
  • No stale binaries - the (deleted) issue from previous tests is fully resolved
  • All process restarts verified via systemd NRestarts=1 and journalctl stop/start cycles
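
The "per 10-min bucket" check on restart distribution can be reproduced with a short sketch like the one below. The function name is hypothetical and the offsets in `main` are synthetic placeholders, not data from the DEV-01 run; the real analysis may have been done differently.

```rust
use std::time::Duration;

/// Count restarts per fixed-size bucket across the rollout window.
/// `offsets` are restart times measured from the start of the window.
/// (Hypothetical helper for illustration.)
fn bucket_counts(offsets: &[Duration], window: Duration, bucket: Duration) -> Vec<usize> {
    let bucket_secs = bucket.as_secs().max(1);
    let n_buckets = ((window.as_secs() + bucket_secs - 1) / bucket_secs).max(1) as usize;
    let mut counts = vec![0usize; n_buckets];
    for off in offsets {
        // Clamp late restarts into the final bucket rather than dropping them.
        let idx = ((off.as_secs() / bucket_secs) as usize).min(n_buckets - 1);
        counts[idx] += 1;
    }
    counts
}

fn main() {
    // Synthetic offsets (seconds), purely illustrative.
    let offsets: Vec<Duration> = [30u64, 700, 1250, 1900, 4000, 7100]
        .iter()
        .map(|s| Duration::from_secs(*s))
        .collect();
    let counts = bucket_counts(
        &offsets,
        Duration::from_secs(2 * 60 * 60), // 2-hour window, as in the test plan
        Duration::from_secs(10 * 60),     // 10-minute buckets
    );
    for (i, c) in counts.iter().enumerate() {
        println!("bucket {:2} ({:3}-{:3} min): {}", i, i * 10, (i + 1) * 10, c);
    }
}
```

A clustered rollout would show up here as one or two buckets holding most of the restarts while the rest stay empty.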

🤖 Generated with Claude Code

jacderida and others added 3 commits March 29, 2026 16:16
Prevents clustered restarts when a new release is published by spreading
node upgrades evenly across a 24-hour window instead of 1 hour.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
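
As a rough illustration of how a staged rollout window spreads restarts, the sketch below maps each node to a delay inside the window. The PR does not say how the per-node delay is chosen, so the hash-based scheme, the `rollout_delay` name and the node identifiers are all assumptions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::time::Duration;

/// Map a node identifier and release version to a delay in [0, window).
/// Hashing is an assumed strategy here; the actual implementation may
/// pick the delay differently (for example, with a random number generator).
fn rollout_delay(node_id: &str, release: &str, window: Duration) -> Duration {
    let mut hasher = DefaultHasher::new();
    node_id.hash(&mut hasher);
    release.hash(&mut hasher);
    let window_secs = window.as_secs().max(1);
    Duration::from_secs(hasher.finish() % window_secs)
}

fn main() {
    // 24-hour window, matching the new default described in the commit.
    let window = Duration::from_secs(24 * 60 * 60);
    for node in ["node-a", "node-b", "node-c"] {
        let delay = rollout_delay(node, "0.7.10-rc.1", window);
        println!("{node}: restart after {}s", delay.as_secs());
    }
}
```

With a fixed number of nodes, widening the window from 1 hour to 24 hours lowers the expected number of restarts in any given minute by a factor of 24.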
When an upgrade is pending, the monitor task now sleeps for precisely
the remaining rollout delay rather than waiting for the next check
interval tick. This eliminates restart clustering caused by quantization
to the check interval.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pre-release versions (alpha, beta, rc) should not be published to
crates.io. Also removes publish-crate from the release job dependency
chain so pre-release GitHub releases aren't blocked.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
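
The workflow change itself lives in `.github/workflows/release.yml`, and its exact condition is not shown here. As a language-neutral illustration of the underlying rule, the sketch below treats a version as a pre-release when it carries a SemVer pre-release suffix (a hyphen after the core version, ignoring build metadata). The `is_prerelease` name is hypothetical.

```rust
/// Returns true when `version` has a SemVer pre-release suffix,
/// e.g. "0.7.10-rc.1". Build metadata after '+' is ignored, so
/// "1.0.0+build-5" is not treated as a pre-release.
/// (Hypothetical helper; the workflow may test the tag differently.)
fn is_prerelease(version: &str) -> bool {
    let v = version.strip_prefix('v').unwrap_or(version);
    let without_build = v.split('+').next().unwrap_or(v);
    without_build.contains('-')
}

fn main() {
    for tag in ["v0.7.0", "v0.7.10-rc.1", "1.0.0-alpha", "1.0.0+build-5"] {
        let action = if is_prerelease(tag) { "skip" } else { "publish" };
        println!("{tag}: {action} crates.io");
    }
}
```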
Copilot AI review requested due to automatic review settings March 30, 2026 17:12
Copilot AI left a comment

Pull request overview

Reduces upgrade-induced restart clustering by widening the staged rollout window and aligning sleep timing with each node’s scheduled upgrade time; also updates the release workflow to skip crates.io publishing for prereleases.

Changes:

  • Increase default staged rollout window from 1 hour to 24 hours.
  • When an upgrade is pending, sleep for the exact remaining rollout delay rather than waiting for the next check interval tick.
  • Skip crates.io publish for prerelease tags in the GitHub Actions release workflow.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
|---|---|
| src/node.rs | Adjusts upgrade-monitor loop to sleep until scheduled upgrade time to avoid quantization/clustering. |
| src/config.rs | Changes default staged rollout window from 1h to 24h. |
| .github/workflows/release.yml | Skips crates.io publishing for prereleases and adjusts job dependencies for release creation. |

- Add backoff when rollout delay has elapsed but upgrade failed, to
  prevent a tight retry loop on Duration::ZERO
- Wrap upgrade sleep in tokio::select! with shutdown.cancelled() so
  shutdown can interrupt long rollout delay sleeps
- Restore publish-crate dependency on release job with conditional
  logic: release proceeds if publish-crate succeeds or was skipped
  (pre-release), but blocks if it fails (stable)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
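
The first two points of this commit can be sketched without any async runtime. The commit uses `tokio::select!` with `shutdown.cancelled()`; the version below swaps in a standard-library channel with `recv_timeout` as a stand-in so it runs with no dependencies. The helper names and the 60-second backoff are assumptions, not values from the PR.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Assumed retry backoff; the real value is not stated in the PR.
const RETRY_BACKOFF: Duration = Duration::from_secs(60);

/// Choose how long to wait before the next upgrade attempt.
/// If the rollout delay has already elapsed and the last attempt failed,
/// back off instead of retrying immediately on a zero-length sleep.
fn next_wait(remaining: Duration, last_attempt_failed: bool) -> Duration {
    if remaining.is_zero() && last_attempt_failed {
        RETRY_BACKOFF
    } else {
        remaining
    }
}

/// Sleep for `wait`, returning early if a shutdown signal arrives.
/// Returns true if the full wait elapsed, false if shutdown interrupted it.
fn sleep_or_shutdown(wait: Duration, shutdown: &mpsc::Receiver<()>) -> bool {
    match shutdown.recv_timeout(wait) {
        Err(RecvTimeoutError::Timeout) => true,
        // A message or a dropped sender both count as a shutdown request.
        Ok(()) | Err(RecvTimeoutError::Disconnected) => false,
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    // Request shutdown shortly after a long rollout sleep begins.
    let signaller = thread::spawn(move || {
        thread::sleep(Duration::from_millis(50));
        tx.send(()).unwrap();
    });
    let wait = next_wait(Duration::from_secs(3600), false);
    let completed = sleep_or_shutdown(wait, &rx);
    signaller.join().unwrap();
    println!("wait completed: {completed}");
}
```

Without the interruptible wait, a 24-hour rollout window would let a pending upgrade sleep block shutdown for up to a day, which is why the sleep has to race against the shutdown signal.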
@jacderida jacderida merged commit 8495d16 into main Mar 30, 2026
17 checks passed
@jacderida jacderida deleted the feat-avoid_clustered_restarts branch March 30, 2026 17:32

3 participants