feat: avoid clustered restarts during upgrades by jacderida · Pull Request #49 · WithAutonomi/ant-node

jacderida · 2026-03-30T17:12:04Z

Summary

Increase the default staged rollout window from 1 hour to 24 hours, giving nodes more room to spread out their restarts
When a pending upgrade is detected, sleep for exactly the remaining rollout delay instead of waiting for the next check interval tick — this eliminates restart clustering caused by quantization to the check interval
Skip crates.io publish for pre-release versions (RC, alpha, beta) in the release workflow
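
The quantization effect mentioned in the second point can be illustrated with a small sketch. The function names are hypothetical, and the model of the old behaviour (a node acts only on check-interval ticks counted from detection) is an assumption for illustration; the actual `src/node.rs` code is not reproduced here.

```rust
use std::time::Duration;

/// Time still to wait before this node's scheduled restart.
/// `rollout_delay` is the node's offset into the rollout window and
/// `elapsed` is the time since the upgrade was first detected.
/// Returns `Duration::ZERO` once the scheduled time has passed.
fn remaining_rollout_delay(rollout_delay: Duration, elapsed: Duration) -> Duration {
    rollout_delay.saturating_sub(elapsed)
}

/// Assumed model of the old behaviour: the restart can only happen on a
/// check-interval tick, so the effective restart time is the scheduled
/// delay rounded up to the next multiple of the interval.
fn quantized_restart_time(rollout_delay: Duration, check_interval: Duration) -> Duration {
    let interval = check_interval.as_secs().max(1);
    let ticks = (rollout_delay.as_secs() + interval - 1) / interval;
    Duration::from_secs(ticks * interval)
}
```

Under this model, every node whose scheduled delay falls between two ticks restarts on the same later tick, which is the clustering the exact sleep is meant to avoid.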

Test plan

Validated with 100-node testnet using 2-hour rollout window
Scheduled restart times are uniformly distributed across the rollout window
Actual restart times match scheduled times (no burst clustering)
All existing unit tests pass
Clippy and cargo fmt pass

Test results

Auto-Upgrade Test Results: DEV-01

All 91 nodes (90 regular + 1 genesis) successfully upgraded from v0.7.0 to v0.7.10-rc.1.

| Check | Result |
| --- | --- |
| All nodes upgrade to v0.7.10-rc.1 | PASS (91/91) |
| Binary downloaded once per host | PASS (1 download per VM, rest cached) |
| Release info fetched once per host | PASS (2-3 fetches due to cache TTL over 2hr window, 34-53 cache hits each) |
| No upgrade errors | PASS (zero errors found) |
| Peer ID retention | PASS (identical before/after) |
| Port retention | N/A (nodes use random ports by design with 0.0.0.0:0) |
| Restart time distribution | PASS (spread across full 2-hour window, 2-11 per 10-min bucket, no clustering) |
| Graceful shutdown logged | PASS (91/91) |
| NRestarts = 1 | PASS (91/91) |
| No /proc/PID/exe (deleted) | PASS (91/91 - no stale binaries) |
| systemd stop/start cycle | PASS |

Key findings:

Binary caching works - only 1 download per host, subsequent nodes detect the binary was already replaced
Restart distribution is even - no clustered bursts, scheduled-to-actual accuracy within 5-11 seconds
No stale binaries - the (deleted) issue from previous tests is fully resolved
All process restarts verified via systemd NRestarts=1 and journalctl stop/start cycles
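
A distribution check like the "per 10-min bucket" figure above can be reproduced with a simple histogram. This is a minimal sketch; the function name and the choice to count late restarts in the last bucket are assumptions, and the PR does not show how the test results were actually computed.

```rust
use std::time::Duration;

/// Count restarts per fixed-width bucket across the rollout window.
/// `offsets` are restart times measured from the start of the window.
/// Offsets at or beyond the end of the window are counted in the last bucket.
fn restarts_per_bucket(offsets: &[Duration], window: Duration, bucket: Duration) -> Vec<usize> {
    let bucket_secs = bucket.as_secs().max(1);
    let window_secs = window.as_secs().max(1);
    // Round up so a partial trailing bucket is still represented.
    let n = ((window_secs + bucket_secs - 1) / bucket_secs) as usize;
    let mut counts = vec![0usize; n];
    for off in offsets {
        let idx = ((off.as_secs() / bucket_secs) as usize).min(n - 1);
        counts[idx] += 1;
    }
    counts
}
```

For a 2-hour window and 10-minute buckets this yields 12 counts; a clustered rollout would show a few large counts and many zeros rather than a roughly even spread.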

🤖 Generated with Claude Code

Prevents clustered restarts when a new release is published by spreading node upgrades evenly across a 24-hour window instead of 1 hour. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
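
The PR does not show how each node picks its position inside the rollout window. One common approach, sketched here purely as an assumption, is to hash a stable node identifier together with the release version and reduce it modulo the window length. The function name, the parameters, and the use of `DefaultHasher` are all hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::time::Duration;

/// Hypothetical per-node delay: a deterministic offset in `[0, window)`.
/// Note: `DefaultHasher` output is not guaranteed to be stable across
/// Rust releases, so a real implementation would likely use a fixed hash.
fn rollout_delay(node_id: &str, release_version: &str, window: Duration) -> Duration {
    let window_secs = window.as_secs();
    if window_secs == 0 {
        return Duration::ZERO;
    }
    let mut hasher = DefaultHasher::new();
    node_id.hash(&mut hasher);
    release_version.hash(&mut hasher);
    Duration::from_secs(hasher.finish() % window_secs)
}
```

With a scheme like this, widening the window from 1 hour to 24 hours spreads the same set of nodes over 24 times as many possible restart seconds.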

When an upgrade is pending, the monitor task now sleeps for precisely the remaining rollout delay rather than waiting for the next check interval tick. This eliminates restart clustering caused by quantization to the check interval. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
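
The sleep decision described in this commit message can be sketched as a small pure function. `UpgradeState` and `next_sleep` are hypothetical names used for illustration, not the types in `src/node.rs`.

```rust
use std::time::Duration;

/// What the monitor knows after a release check (hypothetical type).
enum UpgradeState {
    /// No newer release was found.
    UpToDate,
    /// A newer release was found; the restart is due after `remaining`.
    Pending { remaining: Duration },
}

/// How long the monitor should sleep before acting again.
fn next_sleep(state: &UpgradeState, check_interval: Duration) -> Duration {
    match state {
        // Nothing to do: wait for the next regular check.
        UpgradeState::UpToDate => check_interval,
        // Upgrade scheduled: wake exactly when it is due, whether that is
        // sooner or later than the next check-interval tick.
        UpgradeState::Pending { remaining } => *remaining,
    }
}
```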

Pre-release versions (alpha, beta, rc) should not be published to crates.io. Also removes publish-crate from the release job dependency chain so pre-release GitHub releases aren't blocked. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
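
The actual check lives in the GitHub Actions workflow and is not shown here. As an illustration of the rule only, a pre-release tag can be recognised by the Semantic Versioning convention that pre-release identifiers follow a hyphen after the patch number. The function name below is hypothetical.

```rust
/// Returns true for tags such as `v0.7.10-rc.1` or `1.0.0-alpha`.
/// An optional leading `v` is ignored.
fn is_prerelease(tag: &str) -> bool {
    let version = tag.strip_prefix('v').unwrap_or(tag);
    // Build metadata follows '+' and may itself contain hyphens,
    // so it is dropped before looking for the pre-release hyphen.
    let core_and_pre = version.split('+').next().unwrap_or(version);
    core_and_pre.contains('-')
}
```

A release job following this rule would run the crates.io publish step only when `is_prerelease` is false.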

Copilot

Pull request overview

Reduces upgrade-induced restart clustering by widening the staged rollout window and aligning sleep timing with each node’s scheduled upgrade time; also updates the release workflow to skip crates.io publishing for prereleases.

Changes:

Increase default staged rollout window from 1 hour to 24 hours.
When an upgrade is pending, sleep until the exact remaining rollout delay rather than the next check interval tick.
Skip crates.io publish for prerelease tags in the GitHub Actions release workflow.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `src/node.rs` | Adjusts upgrade-monitor loop to sleep until scheduled upgrade time to avoid quantization/clustering. |
| `src/config.rs` | Changes default staged rollout window from 1h to 24h. |
| `.github/workflows/release.yml` | Skips crates.io publishing for prereleases and adjusts job dependencies for release creation. |

- Add backoff when rollout delay has elapsed but upgrade failed, to prevent a tight retry loop on Duration::ZERO
- Wrap upgrade sleep in tokio::select! with shutdown.cancelled() so shutdown can interrupt long rollout delay sleeps
- Restore publish-crate dependency on release job with conditional logic: release proceeds if publish-crate succeeds or was skipped (pre-release), but blocks if it fails (stable)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
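
The first two points of this commit can be sketched with the standard library alone. The commit itself uses `tokio::select!` with `shutdown.cancelled()`; the version below substitutes `std::sync::mpsc::Receiver::recv_timeout` as a stand-in so that it runs without an async runtime. All names here are hypothetical.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum WaitOutcome {
    Elapsed,
    Shutdown,
}

/// Delay before the next upgrade attempt. If the rollout delay has already
/// elapsed and the previous attempt failed, fall back to `backoff` instead
/// of retrying immediately with a zero-length sleep.
fn delay_before_attempt(remaining: Duration, last_attempt_failed: bool, backoff: Duration) -> Duration {
    if remaining.is_zero() && last_attempt_failed {
        backoff
    } else {
        remaining
    }
}

/// Wait for `delay` unless a shutdown signal arrives first.
fn wait_or_shutdown(delay: Duration, shutdown: &Receiver<()>) -> WaitOutcome {
    match shutdown.recv_timeout(delay) {
        Err(RecvTimeoutError::Timeout) => WaitOutcome::Elapsed,
        // A message, or every sender having been dropped, counts as shutdown.
        Ok(()) | Err(RecvTimeoutError::Disconnected) => WaitOutcome::Shutdown,
    }
}
```

The point of the second function is that a sleep of up to 24 hours must not delay shutdown: the wait returns as soon as the signal arrives rather than when the timer expires.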

jacderida and others added 3 commits March 29, 2026 16:16

feat: increase staged rollout window from 1 hour to 24 hours

37e00f7

Prevents clustered restarts when a new release is published by spreading node upgrades evenly across a 24-hour window instead of 1 hour. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings March 30, 2026 17:12

Copilot started reviewing on behalf of jacderida March 30, 2026 17:12

Copilot AI reviewed Mar 30, 2026

mickvandijke approved these changes Mar 30, 2026

jacderida merged commit 8495d16 into main Mar 30, 2026
17 checks passed

jacderida deleted the feat-avoid_clustered_restarts branch March 30, 2026 17:32

jacderida merged 4 commits into main from feat-avoid_clustered_restarts
