
feat: extend automatic upgrades with caching, cross-platform support, and clean restarts#22

Merged
jacderida merged 12 commits into WithAutonomi:main from jacderida:feat-extend_automatic_upgrades
Mar 19, 2026
Conversation


@jacderida jacderida commented Mar 10, 2026

Summary

Extends the automatic upgrade system with production-ready features for multi-node deployments across all platforms:

  • Always-on upgrades: Removed the --auto-upgrade flag; upgrades are now always enabled, simplifying deployment configuration
  • Release metadata cache: Shared disk cache with configurable TTL so multiple nodes on the same machine don't each poll the GitHub API independently
  • Binary cache: Downloaded and verified binaries are cached on disk (with SHA-256 integrity checks) so only the first node needs to download; subsequent nodes copy from cache
  • Clean process restart: Replaced Unix exec() with a spawn-and-exit model that works cross-platform. Added --stop-on-upgrade flag for service manager environments (systemd, launchd, Windows Service) where the manager handles restarts
  • Windows support: Added zip archive extraction alongside tar.gz, enabling auto-upgrades on Windows nodes
  • Staggered upgrade checks: Nodes randomise their check intervals (±5%) to avoid thundering-herd API calls
  • Robustness fixes: Handle deleted binary paths, improve error context in download/extraction, fix staged rollout check integration in the node loop
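The release metadata cache's TTL behaviour can be sketched as follows. This is an illustrative stand-in, assuming a simple age check against a fixed TTL; `CACHE_TTL` and `is_fresh` are hypothetical names, not the PR's actual API:

```rust
use std::time::{Duration, SystemTime};

// Assumed TTL for illustration; the PR describes the TTL as configurable.
const CACHE_TTL: Duration = Duration::from_secs(60 * 60);

/// Returns true if a cache entry written at `written_at` is still fresh,
/// in which case a node reads the cached releases instead of polling GitHub.
fn is_fresh(written_at: SystemTime, now: SystemTime, ttl: Duration) -> bool {
    match now.duration_since(written_at) {
        Ok(age) => age < ttl,
        // Clock went backwards: treat the entry as fresh rather than re-polling.
        Err(_) => true,
    }
}

fn main() {
    let now = SystemTime::now();
    let recent = now - Duration::from_secs(10);
    let stale = now - Duration::from_secs(2 * 60 * 60);
    assert!(is_fresh(recent, now, CACHE_TTL));
    assert!(!is_fresh(stale, now, CACHE_TTL));
}
```

With a check like this, only the first node to find a stale entry pays the GitHub API call; the others read the refreshed file.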

Validation

Three separate testnet deployments were used to validate the changes:

  • 100-node Linux testnet: All nodes successfully auto-upgraded to the fake 0.3.12-rc.1 release
  • 100-node Linux testnet + standalone Windows node: All Linux nodes upgraded successfully, and the standalone Windows node also upgraded correctly (zip extraction path)
  • 100-node Linux testnet + standalone macOS node: All Linux nodes upgraded successfully, and the standalone macOS node also upgraded correctly

Test plan

  • 100-node Linux-only testnet upgrade validated
  • 100-node Linux testnet + Windows standalone node upgrade validated
  • 100-node Linux testnet + macOS standalone node upgrade validated

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings March 10, 2026 18:24

Copilot AI left a comment


Pull request overview

This PR expands the node auto-upgrade subsystem for multi-node, cross-platform deployments by adding shared on-disk caching, Windows archive support, and a service-manager-friendly restart mode, while also making upgrades always enabled by default.

Changes:

  • Make auto-upgrades always enabled (remove --auto-upgrade / config enabled) and add --stop-on-upgrade for service-manager restarts.
  • Add shared disk caches for GitHub release metadata and extracted binaries (with locking and TTL support in the cache primitives).
  • Add cross-platform restart behavior and Windows zip extraction; update packaging/deploy/docs to match the new behavior and repo URLs.

Reviewed changes

Copilot reviewed 21 out of 22 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
wix/main.wxs Update ARP metadata links to new GitHub org/repo.
systemd/saorsa-node.service Add --stop-on-upgrade and switch to Restart=always for service-manager restarts.
src/upgrade/release_cache.rs New: disk-backed release metadata cache with TTL and file locking + tests.
src/upgrade/monitor.rs Integrate ReleaseCache; add upgrade timing logs; update repo references in tests.
src/upgrade/mod.rs Export new cache modules and restart exit code; update platform support flag/test.
src/upgrade/cache_dir.rs New: shared upgrade cache directory helper + test.
src/upgrade/binary_cache.rs New: extracted-binary disk cache with SHA-256 integrity metadata and a download lock helper + tests.
src/upgrade/apply.rs Add binary cache usage, zip extraction, deleted-path handling, and spawn-and-exit restart model with --stop-on-upgrade.
src/node.rs Make upgrades always on; wire in caches + stop-on-upgrade; use staged-rollout-aware monitor method; add startup jitter.
src/config.rs Remove upgrade enabled; add stop_on_upgrade; update default GitHub repo.
src/bin/saorsa-node/cli.rs Remove --auto-upgrade; add --stop-on-upgrade; update upgrade config wiring.
scripts/testnet/upgrade-test.sh Adjust testnet script output for always-on upgrades.
scripts/testnet/start-genesis.sh Remove --auto-upgrade from generated systemd unit.
scripts/testnet/spawn-nodes.sh Remove --auto-upgrade from generated systemd unit.
scripts/testnet/deploy-all.sh Update binary URL to new GitHub org/repo.
scripts/testnet/build-and-deploy.sh Update clone URL to new GitHub org/repo.
docs/DESIGN.md Update repo references and remove mention of --auto-upgrade flag.
deploy/terraform/cloud-init/worker.yml Update release download URL to new GitHub org/repo.
deploy/scripts/spawn-nodes.sh Update release download URL to new GitHub org/repo.
README.md Update CLI examples, repo references, and add “Running as a Service” guidance.
Cargo.toml Add zip, fs2, and Windows self-replace; install systemd unit in packages; update repository URL.
.github/workflows/release.yml Adjust Windows MSI version derivation to strip prerelease suffix.



greptile-apps bot commented Mar 10, 2026

Greptile Summary

This PR extends the auto-upgrade system from a basic Unix-only downloader into a production-grade, cross-platform upgrade pipeline. Key additions are a shared disk cache for release metadata (ReleaseCache) and verified binaries (BinaryCache), zip archive extraction for Windows, and a --stop-on-upgrade flag that lets service managers (systemd, launchd, WinSW) own the restart instead of exec(). The --auto-upgrade flag is removed in favour of always-on behaviour.

Highlights:

  • Release and binary caches are both well-guarded: release writes use an exclusive file lock + atomic rename; binary cache reads verify SHA-256 against stored metadata.
  • on_disk_version() correctly uses tokio::process::Command with a 5 s timeout (addressing the prior review concern).
  • Archive format detection by magic bytes is clean and the new test suite covers tar.gz, zip, unknown format, and missing-binary cases.
  • Repository URLs updated consistently across all scripts, config defaults, CI, and docs.
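The magic-byte detection praised above can be sketched like this. A minimal stand-in, assuming only the two standard signatures (gzip streams start with `1f 8b`, zip archives with `PK\x03\x04`); the enum and function names are illustrative, not the PR's actual API:

```rust
/// Illustrative archive-format detection by magic bytes.
#[derive(Debug, PartialEq)]
enum ArchiveFormat {
    TarGz,
    Zip,
    Unknown,
}

fn detect_format(header: &[u8]) -> ArchiveFormat {
    match header {
        // gzip magic bytes (tar.gz is a gzip stream)
        [0x1f, 0x8b, ..] => ArchiveFormat::TarGz,
        // zip local file header signature "PK\x03\x04"
        [b'P', b'K', 0x03, 0x04, ..] => ArchiveFormat::Zip,
        _ => ArchiveFormat::Unknown,
    }
}

fn main() {
    assert_eq!(detect_format(&[0x1f, 0x8b, 0x08]), ArchiveFormat::TarGz);
    assert_eq!(detect_format(b"PK\x03\x04rest"), ArchiveFormat::Zip);
    assert_eq!(detect_format(b"hello"), ArchiveFormat::Unknown);
}
```

Detecting by content rather than file extension keeps the same code path working for both the Linux tar.gz and Windows zip release assets.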

Issues found:

  • with_download_lock is called from the async fn resolve_upgrade_binary but uses fs2::FileExt::lock_exclusive(), a blocking syscall. When another process holds the lock the Tokio worker thread stalls — should use tokio::task::spawn_blocking.
  • Both extract_from_zip and extract_from_tar_gz read the entire binary into a Vec<u8> before writing; a streaming std::io::copy would avoid potentially large heap allocations.
  • current_binary_path() can return a relative argv[0] path, which could silently break the spawn in trigger_restart if the CWD changes.
  • BinaryCache::store() does not ensure the cache directory exists before writing, leading to a less-informative error if the directory is absent.
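The streaming suggestion in the second bullet can be sketched as below. This is a generic illustration, not the PR's code: `entry` and `dest` stand in for the archive-entry reader and destination file, and `std::io::copy` moves data through a small fixed-size buffer instead of buffering the whole binary in a `Vec<u8>`:

```rust
use std::io::{self, Cursor, Read, Write};

/// Copy an archive entry to its destination without a full in-memory buffer.
fn extract_streaming<R: Read, W: Write>(entry: &mut R, dest: &mut W) -> io::Result<u64> {
    // io::copy loops over a small internal buffer, so peak memory stays
    // constant regardless of the binary's size.
    io::copy(entry, dest)
}

fn main() -> io::Result<()> {
    let payload = vec![0xABu8; 1 << 20]; // pretend 1 MiB binary inside an archive
    let mut entry = Cursor::new(payload.clone());
    let mut dest = Vec::new(); // stands in for the destination file
    let n = extract_streaming(&mut entry, &mut dest)?;
    assert_eq!(n, 1 << 20);
    assert_eq!(dest, payload);
    Ok(())
}
```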

Confidence Score: 3/5

  • Functionally sound and validated on real testnets, but the blocking file lock in an async task is a correctness concern under contention that should be addressed before merging to a high-node-count fleet.
  • The core upgrade logic is well-tested and the security gate (ML-DSA signature verification before any extraction) is preserved. The testnet validation gives confidence in the happy path. The blocking with_download_lock call inside an async task is a real concern: while the lock is held only briefly (a single cache-file read), under contention it blocks a Tokio worker, and the issue compounds at scale with many nodes starting simultaneously. The relative-path and large-allocation issues are lower severity but worth fixing. No critical data-loss or security regressions were found.
  • Primary: src/upgrade/apply.rs (blocking lock in async, full-binary Vec allocation, relative argv[0] path). Secondary: src/upgrade/binary_cache.rs (missing cache dir creation in store()).

Important Files Changed

Filename Overview
src/upgrade/apply.rs Core upgrade logic refactored: adds binary cache integration, cross-platform restart (spawn-and-exit vs exec), zip extraction, and blocking file-lock in async context
src/upgrade/binary_cache.rs New file: SHA-256-verified binary cache with exclusive file locking; store() writes without holding the download lock (benign due to identical content) but could cause transient SHA-256 mismatches
src/upgrade/release_cache.rs New file: disk cache for GitHub release metadata with TTL, atomic writes, and exclusive file locking; well-implemented with thorough test coverage
src/upgrade/monitor.rs Adds release cache integration; falls back transparently to network on cache miss; clean implementation
src/node.rs Upgrade monitor always-on, adds jitter, uses check_for_ready_upgrade with richer branch handling; upgrade monitor no longer Arc-wrapped
systemd/saorsa-node.service Adds --stop-on-upgrade to ExecStart and upgrades Restart=on-failure to Restart=always for clean upgrade restarts
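The atomic-write pattern attributed to release_cache.rs above (write to a temp file, then rename) can be sketched like this. A simplified illustration that omits the exclusive-lock step; `atomic_write` is a hypothetical name, not the PR's API:

```rust
use std::fs;
use std::io::Write;
use std::path::Path;

/// Write `data` so that readers only ever see the old or the new contents,
/// never a partially written file.
fn atomic_write(path: &Path, data: &[u8]) -> std::io::Result<()> {
    let tmp = path.with_extension("tmp");
    let mut f = fs::File::create(&tmp)?;
    f.write_all(data)?;
    f.sync_all()?; // flush to disk before the rename makes it visible
    fs::rename(&tmp, path) // atomic on the same filesystem
}

fn main() -> std::io::Result<()> {
    let target = std::env::temp_dir().join("releases.json");
    atomic_write(&target, b"{\"releases\":[]}")?;
    assert_eq!(fs::read(&target)?, b"{\"releases\":[]}".to_vec());
    fs::remove_file(&target)?;
    Ok(())
}
```

Because `rename` is atomic on a single filesystem, a node reading the cache concurrently can never observe a torn JSON document.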

Sequence Diagram

sequenceDiagram
    participant N as saorsa-node (any instance)
    participant RC as ReleaseCache (disk)
    participant GH as GitHub API
    participant BC as BinaryCache (disk)
    participant FS as Filesystem (binary)

    N->>RC: read_if_valid(repo)
    alt cache hit (fresh)
        RC-->>N: Vec<GitHubRelease>
    else cache miss or expired
        N->>GH: GET /repos/{owner}/{repo}/releases
        GH-->>N: Vec<GitHubRelease>
        N->>RC: write(repo, releases) [exclusive lock]
        RC-->>N: Ok
    end

    N->>N: select_upgrade_from_releases()
    alt upgrade available
        N->>BC: get_verified(version)
        alt binary cache hit
            BC-->>N: PathBuf (cached binary)
        else cache miss
            N->>BC: with_download_lock { double-check }
            BC-->>N: still None
            N->>GH: download archive + signature
            GH-->>N: archive, .sig
            N->>N: verify ML-DSA signature
            N->>N: extract binary
            N->>BC: store(version, binary)
        end
        N->>FS: check on_disk_version()
        alt disk already upgraded
            N->>N: trigger_restart (skip replace)
        else needs replacement
            N->>FS: backup current binary
            N->>FS: replace_binary (atomic rename / self_replace)
            alt stop_on_upgrade=true
                N->>N: process::exit(0) [Unix] or exit(100) [Windows]
            else stop_on_upgrade=false
                N->>N: spawn new binary, then process::exit(0)
            end
        end
    end

Comments Outside Diff (2)

  1. src/upgrade/apply.rs, line 957-961 (link)

    Blocking file lock called from async context

    cache.with_download_lock(...) calls fs2::FileExt::lock_exclusive(), which is a blocking syscall. When another process holds the lock, the current Tokio worker thread stalls until the lock is released. Since resolve_upgrade_binary is async and runs inside a tokio::spawn task, blocking here starves other futures sharing that worker thread.

    While the closure body (a single get_verified file read) is fast, the lock acquisition itself can block for as long as the competing process holds it. This should be moved to a tokio::task::spawn_blocking block:

    ```rust
    let cached_under_lock = tokio::task::spawn_blocking({
        let cache = cache.clone(); // if BinaryCache is Clone, or Arc-wrap it
        let version_str = version_str.to_string();
        move || {
            cache.with_download_lock(|| {
                cache.get_verified(&version_str)
                    .map_or(Ok(None), |p| Ok(Some(p)))
            })
        }
    })
    .await
    .map_err(|e| Error::Upgrade(format!("Download lock task panicked: {e}")))??;
    ```

    Alternatively, BinaryCache could be made Clone (or wrapped in Arc) and with_download_lock could be a free async fn that uses tokio::task::spawn_blocking internally.

  2. src/upgrade/binary_cache.rs, line 1545-1549 (link)

    store() doesn't ensure the cache directory exists

    BinaryCache::new accepts any PathBuf without creating the directory. If upgrade_cache_dir() fails (or the directory is removed between creation and first use), the fs::copy call here will fail with an opaque I/O error. Callers currently suppress this with warn!, so it won't crash — but an explicit fs::create_dir_all(&self.cache_dir) at the top of store() would make the failure much easier to diagnose:

    ```rust
    pub fn store(&self, version: &str, source_path: &std::path::Path) -> Result<()> {
        fs::create_dir_all(&self.cache_dir)
            .map_err(|e| Error::Upgrade(format!("Failed to create binary cache dir: {e}")))?;
        let hash = sha256_file(source_path)?;
        // ...
    }
    ```

Last reviewed commit: 426bde1

@jacderida jacderida force-pushed the feat-extend_automatic_upgrades branch from e86bca9 to 09500a1 Compare March 10, 2026 18:35
Copilot AI review requested due to automatic review settings March 10, 2026 19:16
@jacderida
Copy link
Collaborator Author

@greptile review


Copilot AI left a comment


Pull request overview

Copilot reviewed 21 out of 22 changed files in this pull request and generated 5 comments.



@jacderida jacderida force-pushed the feat-extend_automatic_upgrades branch from 426bde1 to 99f6bdc Compare March 10, 2026 22:25
Copilot AI review requested due to automatic review settings March 10, 2026 22:31
@jacderida jacderida force-pushed the feat-extend_automatic_upgrades branch from 99f6bdc to 5f4545a Compare March 10, 2026 22:31
@jacderida jacderida force-pushed the feat-extend_automatic_upgrades branch from 5f4545a to ab3ecdb Compare March 10, 2026 22:35

Copilot AI left a comment


Pull request overview

Copilot reviewed 21 out of 22 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (2)

.github/workflows/release.yml:168

  • Stripping the pre-release suffix from the git tag when generating the MSI ProductVersion can cause version collisions (e.g. v1.2.3-rc.1 and v1.2.3 both become 1.2.3.0). MSI upgrades generally require monotonically increasing versions; this may prevent upgrading from an RC to the final release. Consider encoding pre-release information into the 4-part MSI version instead of dropping it (or ensure your WiX upgrade strategy accounts for this).
```powershell
          # Remove 'v' prefix and strip any pre-release suffix (e.g., -rc.1, -beta.2)
          $version = $tag -replace '^v', '' -replace '-.*$', ''
          # Ensure 4-part version for MSI (add .0 if needed)
          $parts = $version.Split('.')
          while ($parts.Length -lt 4) {
            $version = "$version.0"
            $parts = $version.Split('.')
          }
```

README.md:933

  • The CLI reference doesn’t mention the new --stop-on-upgrade flag, even though it’s documented earlier and implemented in cli.rs. Please add it here so operators can discover it via the README’s option list.
    --upgrade-channel <CHANNEL>
        Release channel: stable, beta
        [default: stable]



@jacderida jacderida force-pushed the feat-extend_automatic_upgrades branch from ab3ecdb to 0e0d136 Compare March 10, 2026 22:55
Copilot AI review requested due to automatic review settings March 11, 2026 00:18

Copilot AI left a comment


Pull request overview

Copilot reviewed 22 out of 23 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

.github/workflows/release.yml:168

  • The MSI version derivation now strips any pre-release suffix (e.g. v1.2.3-rc.1 → 1.2.3). Since this workflow runs on all v* tags and publishes pre-releases too, this can create MSI version collisions between pre-release and stable tags (and between multiple pre-releases), which can break Windows upgrade/uninstall behavior. Consider either skipping MSI generation for pre-release tags or mapping pre-release identifiers into a monotonically increasing 4-part MSI version (so each tag produces a unique, ordered ProductVersion).
```powershell
          $tag = "${{ github.ref_name }}"
          # Remove 'v' prefix and strip any pre-release suffix (e.g., -rc.1, -beta.2)
          $version = $tag -replace '^v', '' -replace '-.*$', ''
          # Ensure 4-part version for MSI (add .0 if needed)
          $parts = $version.Split('.')
          while ($parts.Length -lt 4) {
            $version = "$version.0"
            $parts = $version.Split('.')
          }
```


@jacderida jacderida force-pushed the feat-extend_automatic_upgrades branch from 4180c36 to 0e0d136 Compare March 11, 2026 21:50

@grumbach grumbach left a comment


PR #22 Review: Extend Automatic Upgrades

Overall Assessment: The architecture is sound — two-layer caching, ML-DSA signature verification before extraction, zip-slip protection via enclosed_name(), and the spawn-and-exit restart model are all well-designed. However, there are correctness issues that should be addressed before merge.

Must Fix

  1. std::process::exit() in async context — bypasses graceful shutdown, tokio runtime, open handles, and the CancellationToken flow (see inline comments on apply.rs)
  2. current_binary_path() (deleted) logic bug — the (deleted) stripping inside exists() guard is unreachable (see inline comment)
  3. use rand::Rng inline in function bodies — violates project coding standards (see inline comments on node.rs)

Should Fix

  1. Replace std::thread::sleep with tokio::time::sleep in the Windows retry loop (apply.rs:560)
  2. Log a warning when chrono::Duration::from_std() fails silently (node.rs)
  3. Extract jitter calculation into a helper — it's duplicated in node.rs
  4. Add migration note in release notes: users with enabled = false in config will now silently get auto-upgrades

Consider for Follow-up

  • rollout.rs:69,136 uses direct array indexing (self.node_id_hash[0] through [7]). Blake3 is always 32 bytes so it's safe, but violates the project's no-direct-indexing rule. Consider hash.as_bytes().first_chunk::<8>().
  • No cache eviction in cache_dir.rs — old versions accumulate forever
  • apply.rs:272 reads entire response into memory before writing. For 200 MiB max archive, consider streaming to disk.
  • Missing tests for resolve_upgrade_binary(), current_binary_path() with (deleted), concurrent cache access, and jitter calculation
  • check_interval_hours = 0 in config would cause a tight polling loop — consider validating
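The last follow-up item (validating `check_interval_hours = 0`) could be handled with a small clamp at config load. A minimal sketch; `MIN_CHECK_INTERVAL_HOURS` and `validated_check_interval` are illustrative names, not the PR's code:

```rust
// Assumed floor for the upgrade-check interval; a value of 0 in the config
// would otherwise produce a tight polling loop against the GitHub API.
const MIN_CHECK_INTERVAL_HOURS: u64 = 1;

fn validated_check_interval(hours: u64) -> u64 {
    hours.max(MIN_CHECK_INTERVAL_HOURS)
}

fn main() {
    assert_eq!(validated_check_interval(0), 1); // clamped instead of tight-looping
    assert_eq!(validated_check_interval(6), 6); // sane values pass through
}
```

Rejecting the value with a config error instead of clamping would also work; either way the tight loop is ruled out.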

```rust
let invoked_path = env::args().next().map(PathBuf::from);

if let Some(ref invoked) = invoked_path {
    if invoked.exists() {
```

Bug (M1): Logic error — this (deleted) check (lines 103-110) is inside the if invoked.exists() guard. If the binary was deleted (which is why the (deleted) suffix appears), then invoked.exists() returns false, and this code is unreachable.

The (deleted) stripping only works in the current_exe() fallback path (lines 124-132). Move the (deleted) check before the exists() check, or handle deleted paths as a special case:

```rust
// Check for (deleted) suffix first, before exists()
let cleaned = if invoked_str.ends_with(" (deleted)") {
    PathBuf::from(invoked_str.trim_end_matches(" (deleted)"))
} else {
    invoked.clone()
};
if cleaned.exists() {
    return Ok(cleaned);
}
```

```rust
info!("Exiting with code 0 for service manager restart");
std::thread::sleep(std::time::Duration::from_millis(100));
std::process::exit(0);
}
```

Critical (C1): std::process::exit() in async context terminates the process immediately without running destructors or flushing async resources. This is called from within a tokio::spawn'd task. The 100ms thread::sleep before exit is best-effort but:

  • Tokio runtime is not gracefully shut down
  • Open file handles, network connections, and LMDB transactions may not be flushed
  • The CancellationToken shutdown flow in RunningNode::run() is completely bypassed

Recommendation: Instead of calling exit() directly, cancel the shutdown token and let the main loop handle graceful shutdown, then exit from the main function. Or at minimum, spawn a dedicated OS thread for the exit sequence.
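The recommended shape can be sketched with a plain channel standing in for the real `CancellationToken` flow. This is an illustration of the pattern only, using threads and `std::sync::mpsc` rather than the project's tokio types; all names here are hypothetical:

```rust
use std::sync::mpsc;
use std::thread;

// The upgrade task requests a restart; only main exits the process.
enum ControlMsg {
    RestartForUpgrade,
}

fn main() {
    let (tx, rx) = mpsc::channel::<ControlMsg>();

    // Stand-in for the spawned upgrade task: it signals instead of exiting.
    let upgrade_task = thread::spawn(move || {
        tx.send(ControlMsg::RestartForUpgrade).expect("main loop gone");
    });

    // Stand-in for the main loop: receive the request, run graceful
    // shutdown (flush DB transactions, close sockets), then exit from main.
    let mut shutdown_ran = false;
    if let Ok(ControlMsg::RestartForUpgrade) = rx.recv() {
        // ... graceful shutdown work happens here ...
        shutdown_ran = true;
    }
    upgrade_task.join().unwrap();
    assert!(shutdown_ran);
    // std::process::exit(0) would only be called here, after cleanup.
}
```

The key property is that destructors and flushes run before the process terminates, which `std::process::exit()` inside the spawned task cannot guarantee.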

```rust
);
last_err = Some(e);
std::thread::sleep(std::time::Duration::from_millis(*delay_ms));
}
```

Major (M2): std::thread::sleep blocks the tokio worker thread. This Windows retry loop sleeps for 500ms, 1000ms, and 2000ms — up to 3.5 seconds of tokio worker blocking in the worst case.

Use tokio::time::sleep instead, or wrap in spawn_blocking.

src/node.rs Outdated
```rust
let variance = interval_secs / 20; // 5%
let jitter_secs = if variance > 0 {
    use rand::Rng;
    rand::thread_rng().gen_range(0..=variance * 2)
```

Major (M3): use rand::Rng; is inline in a function body. Project rules (CLAUDE.md) explicitly state: "always use top level use for dependencies instead of inlining them in the functions." Please move this to the top-level imports.

src/node.rs Outdated
```rust
let variance = interval_secs / 20; // 5%
let jittered_secs = if variance > 0 {
    use rand::Rng;
    let jitter = rand::thread_rng().gen_range(0..=variance * 2);
```

Major (M3, duplicate): Second occurrence of inline use rand::Rng;. Same as above — move to top-level imports.

Also, this entire jitter calculation block (lines 617-631) is duplicated from lines 541-558. Extract into a helper function.
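The requested helper could look like the sketch below. Taking the random sample as a parameter keeps the function pure and testable; the caller would supply `rand::thread_rng().gen_range(0..=2 * variance)`. The function name and signature are illustrative, not a proposal for the exact API:

```rust
/// Spread the effective check interval across roughly ±5% of the base
/// interval, given a random value sampled from [0, 2 * variance].
fn jittered_interval_secs(interval_secs: u64, random_in_0_to_2v: u64) -> u64 {
    let variance = interval_secs / 20; // 5% of the interval
    if variance == 0 {
        return interval_secs; // interval too small for meaningful jitter
    }
    // Map [0, 2*variance] onto [interval - variance, interval + variance].
    interval_secs - variance + random_in_0_to_2v.min(2 * variance)
}

fn main() {
    let base = 6 * 60 * 60; // 6-hour check interval
    let v = base / 20;
    assert_eq!(jittered_interval_secs(base, 0), base - v);
    assert_eq!(jittered_interval_secs(base, 2 * v), base + v);
    assert_eq!(jittered_interval_secs(10, 999), 10); // variance rounds to 0
}
```

Both call sites in node.rs could then share this function, and the boundary behaviour (including the `variance == 0` case) becomes unit-testable without an RNG.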

@grumbach grumbach dismissed their stale review March 12, 2026 05:19

Re-submitting as comment review


Copilot AI review requested due to automatic review settings March 12, 2026 22:18

Copilot AI left a comment


Pull request overview

Copilot reviewed 21 out of 22 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

.github/workflows/release.yml:166

  • The MSI version derivation strips the entire pre-release suffix (e.g. v0.3.12-rc.1 → 0.3.12.0). If you publish MSIs for both pre-releases and the eventual stable v0.3.12, they’ll share the same ProductVersion, which can break Windows Installer upgrade semantics (stable may not upgrade/replace the RC MSI). Consider mapping pre-release identifiers into the 4th version component (e.g. 0.3.12.1 for rc.1) so every tag produces a monotonically increasing MSI version.
          # Remove 'v' prefix and strip any pre-release suffix (e.g., -rc.1, -beta.2)
          $version = $tag -replace '^v', '' -replace '-.*$', ''
          # Ensure 4-part version for MSI (add .0 if needed)
          $parts = $version.Split('.')
          while ($parts.Length -lt 4) {
            $version = "$version.0"
            $parts = $version.Split('.')
          }


Comment on lines +637 to +651
// Standalone mode: spawn new process, then exit after graceful shutdown
let args: Vec<String> = env::args().skip(1).collect();

info!("Spawning new process: {} {:?}", binary_path.display(), args);

std::process::Command::new(binary_path)
    .args(&args)
    .stdin(std::process::Stdio::null())
    .stdout(std::process::Stdio::inherit())
    .stderr(std::process::Stdio::inherit())
    .spawn()
    .map_err(|e| Error::Upgrade(format!("Failed to spawn new binary: {e}")))?;

info!("New process spawned, will exit after graceful shutdown");
Ok(0)
Copilot AI Mar 12, 2026

In standalone mode (stop_on_upgrade = false), prepare_restart spawns the upgraded binary immediately, before the node has completed its graceful shutdown/cleanup. This can briefly run two instances concurrently and cause port/DB lock conflicts (the child may fail to start while the parent still holds resources). Consider deferring the spawn until after shutdown has completed (e.g., return a restart action/command to be executed by the main run loop after cleanup), or always use an external supervisor for restarts.

Comment on lines 358 to 363
/// Whether the current platform supports in-place auto-upgrade.
///
/// Supported on Unix (via `exec()`) and Windows (via `self-replace` crate).
const fn auto_upgrade_supported() -> bool {
    true
}
Copilot AI Mar 12, 2026

auto_upgrade_supported() now always returns true, but the surrounding docs/commentary indicate Windows support via self-replace/exec(). In this module, perform_upgrade still uses fs::rename for replacement and doesn’t use self-replace, so Windows in-place upgrades will likely fail at runtime. Either restore a real platform gate here (fail closed on Windows) or update Upgrader to use the same cross-platform replacement approach as AutoApplyUpgrader.

Comment on lines +39 to +43
/// Note: this creates a real directory under the platform data dir.
/// Modifying env vars to isolate this requires `unsafe` (due to
/// `deny(unsafe_code)`), so we accept the minor side-effect.
#[test]
fn test_upgrade_cache_dir_returns_path() {
Copilot AI Mar 12, 2026

This unit test creates a real directory under the user/platform data dir, which can leave behind state and can fail in restricted CI environments. Also, std::env::set_var isn’t gated by deny(unsafe_code), so the comment about needing unsafe to isolate env looks incorrect. Consider making the test hermetic (e.g., set XDG_DATA_HOME/APPDATA to a temp dir in a serialized test), or mark it #[ignore]/remove it if it can’t be made isolated.

Suggested change
/// Note: this creates a real directory under the platform data dir.
/// Modifying env vars to isolate this requires `unsafe` (due to
/// `deny(unsafe_code)`), so we accept the minor side-effect.
#[test]
fn test_upgrade_cache_dir_returns_path() {
/// This test uses a temporary directory as the platform data dir by
/// setting the appropriate environment variable (e.g. `XDG_DATA_HOME` or
/// `APPDATA`) before calling `upgrade_cache_dir()`, to avoid touching the
/// real user data directory.
#[test]
fn test_upgrade_cache_dir_returns_path() {
// Use a dedicated temp directory as the platform data directory root.
let temp_root = std::env::temp_dir().join("saorsa_upgrade_cache_test");
let _ = std::fs::remove_dir_all(&temp_root);
std::fs::create_dir_all(&temp_root).expect("failed to create temp data dir root");
// Point the platform data dir to our temp root so that ProjectDirs
// resolves paths under it instead of the real user data directory.
#[cfg(target_os = "windows")]
{
std::env::set_var("APPDATA", &temp_root);
}
#[cfg(not(target_os = "windows"))]
{
std::env::set_var("XDG_DATA_HOME", &temp_root);
}

Comment on lines +154 to +156
/// caller should fetch from the network. The returned
/// The returned lock guard must be held until after writing the fresh
/// data so that other nodes block rather than all hitting the API.
Copilot AI Mar 12, 2026

Doc comment has a duplicated/unfinished sentence (“caller should fetch from the network. The returned The returned…”). Please remove the stray fragment so the rustdoc for lock_and_recheck is readable.

Suggested change
/// caller should fetch from the network. The returned
/// The returned lock guard must be held until after writing the fresh
/// data so that other nodes block rather than all hitting the API.
/// caller should fetch from the network. The returned lock guard must be
/// held until after writing the fresh data so that other nodes block
/// rather than all hitting the API.

Comment on lines +27 to +28
/// On Windows, `trigger_restart` exits with this code instead of using
/// `exec()`. The wrapping service (e.g. NSSM or Windows Service) should be
Copilot AI Mar 12, 2026

The RESTART_EXIT_CODE docs mention trigger_restart/exec(), but the code now uses prepare_restart and no longer calls exec(). Update this comment to reflect the current restart mechanism (spawn-and-exit vs service-manager exit codes).

Suggested change
/// On Windows, `trigger_restart` exits with this code instead of using
/// `exec()`. The wrapping service (e.g. NSSM or Windows Service) should be
/// When using a service manager–based restart flow (instead of spawning the
/// new binary directly), the process should exit with this code and rely on
/// the wrapping service (e.g. systemd, NSSM, or Windows Service) being

Copilot AI review requested due to automatic review settings March 14, 2026 21:10
@jacderida jacderida force-pushed the feat-extend_automatic_upgrades branch from 96b1411 to 06909cb Compare March 14, 2026 21:10
Copilot AI left a comment

Pull request overview

Copilot reviewed 21 out of 22 changed files in this pull request and generated 3 comments.



Comment on lines +789 to +797
fn jittered_interval(base: std::time::Duration) -> std::time::Duration {
let secs = base.as_secs();
let variance = secs / 20; // 5%
if variance == 0 {
return base;
}
let jitter = rand::thread_rng().gen_range(0..=variance * 2);
std::time::Duration::from_secs(secs.saturating_sub(variance) + jitter)
}
@@ -504,54 +527,102 @@ impl RunningNode {
self.start_protocol_routing();

// Start upgrade monitor if enabled
Copilot AI review requested due to automatic review settings March 17, 2026 16:32
@jacderida jacderida force-pushed the feat-extend_automatic_upgrades branch from 37cc2a1 to 6e0ebae Compare March 17, 2026 16:32
Copilot AI left a comment

Pull request overview

Copilot reviewed 21 out of 24 changed files in this pull request and generated 5 comments.



jacderida and others added 10 commits March 18, 2026 22:22
Add shared disk caches so multiple saorsa-node instances on the same
machine avoid redundant GitHub API calls and binary downloads:

- Release cache (release_cache.rs): TTL-based JSON cache with fs2 file
  locking. Only one node per host fetches from the API; others reuse
  the cached response. TTL set to 1 hour to match the default check
  interval.
- Binary cache (binary_cache.rs): SHA-256 verified cache of extracted
  binaries with fs2 download locking. Only one node downloads and
  extracts; others copy the cached binary.
- Shared cache directory (cache_dir.rs): ~/.local/share/saorsa/upgrades/

Add Windows auto-upgrade support:
- Binary replacement via self-replace crate with retry/backoff
- RESTART_EXIT_CODE (100) for service manager restart signalling
- .exe suffix handling in extract_binary
- auto_upgrade_supported() now returns true on all platforms

Add log messages matching the antnode upgrade test verification pipeline:
- "Node is running on port: {port}"
- "Next upgrade check scheduled for {time}"
- "Node will stop/restart for upgrade at {time}"
- "Downloading saorsa-node binary..."
- "Cached binary verified for version {version}"
- "Using cached release info" / "No valid cache, fetching from API"
- "Error during upgrade process: {reason}"
- "Already running latest version"

Strip pre-release suffix from MSI version in the release workflow so
that WiX candle receives a valid 4-part numeric version.

Update GitHub repo references from dirvine to saorsa-labs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The upgrade loop was calling `check_for_updates()` which bypasses the
staged rollout delay calculation. Switch to `check_for_ready_upgrade()`
so that nodes respect their deterministic rollout delay before applying
upgrades, preventing mass simultaneous restarts.

The UpgradeMonitor no longer needs Arc wrapping since only the single
spawned upgrade task uses it; moving ownership in gives the required
`&mut self` access for the stateful rollout tracking.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Strip the " (deleted)" suffix that Linux appends to /proc/self/exe when
another node has already replaced the on-disk binary during an upgrade.
This prevents backup creation and binary replacement from targeting a
nonexistent path.

Add randomized jitter (0-10% of check interval) before the first upgrade
check so nodes started simultaneously don't all race to the GitHub API
on the first cycle.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Auto-upgrades are now unconditionally active rather than opt-in.
The `--auto-upgrade` CLI flag and `SAORSA_AUTO_UPGRADE` env var
are removed, along with the `enabled` field from `UpgradeConfig`.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ades

Add --stop-on-upgrade flag for service manager mode (systemd, launchd,
Windows Service) where the node exits cleanly and lets the service
manager restart it. Without the flag, standalone mode spawns the new
binary as a child process before exiting.

This fixes leaked inodes from exec(), ensures clean resource cleanup,
and makes systemd aware of the restart lifecycle.

- Add --stop-on-upgrade CLI flag and UpgradeConfig field
- Replace trigger_restart: exit(0) on Unix / exit(100) on Windows for
  service manager mode; spawn+exit for standalone mode
- Update systemd service: add --stop-on-upgrade, change Restart=always
- Add service file to RPM package assets
- Add "Running as a Service" README section (systemd/launchd/Windows)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
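Service-manager mode can be illustrated with a unit file like the following. The unit name, paths, and `RestartSec` value are assumptions for illustration, not taken from the repository's shipped service file:

```ini
# /etc/systemd/system/saorsa-node.service (illustrative)
[Unit]
Description=Saorsa node
After=network-online.target

[Service]
ExecStart=/usr/local/bin/saorsa-node --stop-on-upgrade
# With --stop-on-upgrade the node exits cleanly after staging an upgrade;
# systemd restarts it onto the new binary.
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```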
- Skip binary replacement when sibling service already upgraded (checks
  on-disk version via --version before backup/replace)
- Distinguish "no upgrade available" from "upgrade pending" in log
  messages using time_until_upgrade()
- Add 100ms sleep before process exit to flush tracing log buffer
- Prefer argv[0] over current_exe() for symlink-aware path resolution

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The extract_binary() function unconditionally used GzDecoder (tar.gz),
but Windows release assets are packaged as .zip files. This caused
"invalid gzip header" errors on Windows during auto-upgrade extraction.

Now detects archive format by magic bytes (gzip=1f8b, zip=504b) and
dispatches to the appropriate extractor, supporting both formats on
all platforms.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove sentinel `_failed_` path pattern; propagate errors properly
  from download_verify_extract and map to RolledBack in apply_upgrade
- Make on_disk_version async with 5s timeout to avoid blocking tokio
- Wire up BinaryCache::with_download_lock at the cache-miss call site
  to prevent concurrent downloads from multiple nodes
- Apply jitter to every upgrade check interval, not just the first, to
  prevent fleet re-alignment thundering herd
- Remove stale `enabled = true` from README upgrade config example
- Fix formatting in apply.rs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix unreachable `(deleted)` suffix check in `current_binary_path()` by
  moving it before the `exists()` guard
- Replace `std::process::exit()` in async context with graceful shutdown
  via CancellationToken, ensuring tokio runtime, LMDB, and file handles
  are properly cleaned up before exit
- Move inline `use rand::Rng` to top-level imports and extract duplicated
  jitter calculation into `jittered_interval()` helper
- Offload blocking `replace_binary` to `spawn_blocking` to avoid starving
  the tokio worker thread (especially on Windows with retry sleeps)
- Add warning logs when `chrono::Duration::from_std()` falls back to
  default values

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
macOS App Nap aggressively coalesces timers and reduces CPU priority
for background processes with no visible UI. This caused upgrade check
timers to be deferred by 3-15+ minutes, making the auto-upgrade feature
unreliable on macOS when running as a standalone process.

Call NSProcessInfo.beginActivity(NSActivityUserInitiated) at startup to
tell macOS this process is performing important user-initiated work.
This prevents timer coalescing and CPU throttling for the lifetime of
the process.

Note: IOKit power assertions (PreventUserIdleSystemSleep) do NOT prevent
App Nap — only the NSProcessInfo activity API works for this purpose.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jacderida jacderida force-pushed the feat-extend_automatic_upgrades branch from 6e0ebae to 4ab01f1 Compare March 19, 2026 01:02
saorsa-core v0.17 changed MultiAddr::port() to return Option<u16>.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 19, 2026 01:04
@jacderida jacderida force-pushed the feat-extend_automatic_upgrades branch from 4ab01f1 to a80a71e Compare March 19, 2026 01:04
Copilot AI left a comment

Pull request overview

Copilot reviewed 21 out of 24 changed files in this pull request and generated 3 comments.



- Replace redundant closure with method reference for MultiAddr::port
- Replace deprecated msg_send_id! with msg_send! (objc2 deprecation)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 19, 2026 16:51
@jacderida jacderida force-pushed the feat-extend_automatic_upgrades branch from 3127a30 to f6a24c8 Compare March 19, 2026 16:51
Copilot AI left a comment

Pull request overview

Copilot reviewed 21 out of 24 changed files in this pull request and generated 6 comments.



Comment on lines +129 to +131
let now = SystemTime::now().duration_since(UNIX_EPOCH).ok()?.as_secs();
let age_secs = now.saturating_sub(cached.fetched_at_epoch_secs);
if age_secs >= self.ttl.as_secs() {
Comment on lines +311 to +315
if let Ok(cache_dir) = upgrade_cache_dir() {
monitor = monitor.with_release_cache(ReleaseCache::new(
cache_dir,
std::time::Duration::from_secs(3600),
));
Comment on lines +108 to +113
let path_str = invoked.to_string_lossy();
let cleaned = if path_str.ends_with(" (deleted)") {
let stripped = path_str.trim_end_matches(" (deleted)");
debug!("Stripped '(deleted)' suffix from invoked path: {stripped}");
PathBuf::from(stripped)
} else {
Comment on lines +99 to +100
// Prefer the invoked path (argv[0]) if it exists, as it preserves symlinks.
// Fall back to current_exe() which resolves symlinks via /proc/self/exe.
@jacderida jacderida merged commit 1d3f7f3 into WithAutonomi:main Mar 19, 2026
14 checks passed