fix: subnet bootstrapping #1545

Merged
phutchins merged 10 commits into main from feature/subnet-bootstrapping
Mar 17, 2026
Conversation

@phutchins phutchins commented Mar 11, 2026

Note

Medium Risk
Medium risk because it changes CI build OS/container base and substantially rewires remote subnet initialization/bootstrapping, config generation, and node startup behavior, which could break deployments if assumptions about paths/ports/SSH differ.

Overview
Aligns build/runtime environments to Ubuntu 24.04 by moving GitHub Actions runners to ubuntu-24.04 and switching the fendermint runner image base to ubuntu:24.04 to match the newer glibc requirement.

Expands ipc-subnet-manager to support fresh-host bootstrapping and more resilient remote ops. Adds a bootstrap command (installs deps, clones/builds IPC), an init --resume mode, and a diagnose command; updates configs to support internal_ip for peer traffic, safer local-vs-remote path handling for ~/.ipc, and more robust genesis/config propagation across hosts.

Hardens node start/health/peering flows. Node startup now reliably sets resolver/subnet env vars via a generated start-node.sh, health checks gain optional --wait, log tailing avoids remote grep pipes, and multiple SSH/sed operations are reworked to avoid quoting/login-shell hangs (new exec_on_host_simple, keepalive options, and script-based remote edits).

Written by Cursor Bugbot for commit 7431ec8.

…figuration updates

- Added a new bootstrap command to install dependencies (Rust, Foundry, Node.js) on fresh validator hosts.
- Updated initialization process to support resuming from previous failures.
- Modified subnet configuration with new validator IPs and registry addresses.
- Improved health check and execution commands for better reliability.
- Enhanced documentation to reflect new bootstrap steps and usage instructions.
…rd metrics

- Introduced a new troubleshooting document for diagnosing issues with the IPLD Resolver not listening on port 26654.
- Enhanced the dashboard script to initialize additional metrics for better monitoring, including block production rates and finality tracking.
- Updated health check scripts to ensure proper environment variable handling and improve logging for resolver-related configurations.
- Introduced a new function `ssh_exec_long` to handle long-running commands with streaming output, preventing SSH timeouts during builds.
- Updated the `update_validator_binaries` function to utilize the new long-running command execution, improving build process logging and error handling.
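
A long-running remote build of the kind described above can be kept alive with OpenSSH keepalive options while its output is streamed line by line. The following is a minimal sketch under assumptions, not the PR's actual `ssh_exec_long`: the function name `ssh_exec_long_sketch`, the `SSH_BIN` override, and the specific interval values are all hypothetical.

```bash
#!/usr/bin/env bash

# Hypothetical sketch: run a long command over SSH with keepalives and
# prefix every streamed output line with the target host.
# SSH_BIN is an illustrative override so the function can be exercised
# without a real remote host.
ssh_exec_long_sketch() {
    local target="$1"
    shift
    "${SSH_BIN:-ssh}" \
        -o ServerAliveInterval=30 \
        -o ServerAliveCountMax=20 \
        -o BatchMode=yes \
        "$target" "$@" 2>&1 |
        while IFS= read -r line; do
            printf '[%s] %s\n' "$target" "$line"
        done
    # Report the exit status of the ssh command, not of the while loop.
    return "${PIPESTATUS[0]}"
}
```

With these assumed values the client sends a keepalive every 30 seconds and gives up after 20 unanswered probes, so a quiet compile step should not be dropped by an idle-connection timeout.
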
@phutchins phutchins requested a review from a team as a code owner March 11, 2026 14:03
@phutchins phutchins changed the title Feature/subnet bootstrapping fix: subnet bootstrapping Mar 11, 2026
Comment thread scripts/ipc-subnet-manager/lib/dashboard.sh Outdated
Comment thread scripts/ipc-subnet-manager/lib/dashboard.sh Outdated
…d script

- Updated the calculation of `blocks_per_min` to accurately reflect block production rates based on time differences.
- Adjusted timestamp formatting logic to ensure proper handling of time zone indicators.
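
A rate based on time differences can be computed from two samples of block height and timestamp. This is only an illustrative sketch; `calc_blocks_per_min` and its argument order are assumptions and not the dashboard's actual code.

```bash
#!/usr/bin/env bash

# Hypothetical helper: derive blocks per minute from two samples.
# Arguments: previous height, previous unix time, current height, current unix time.
calc_blocks_per_min() {
    local prev_height="$1" prev_time="$2" cur_height="$3" cur_time="$4"
    local dt=$((cur_time - prev_time))
    # Guard against a zero or negative interval (first sample, clock skew).
    if [ "$dt" -le 0 ]; then
        printf '0\n'
        return 0
    fi
    # Integer arithmetic: height delta scaled to a 60-second window.
    printf '%s\n' $(( (cur_height - prev_height) * 60 / dt ))
}
```

For example, 10 new blocks over a 30-second interval works out to 10 * 60 / 30 = 20 blocks per minute.
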
for arg in "$@"; do
case $arg in
--wait=*) wait_seconds="${arg#*=}" ;;
--wait) shift; wait_seconds="${1:-30}" ;;

--wait value not parsed in for loop

Medium Severity

In cmd_check, the --wait VALUE form (space-separated) calls shift inside a for arg in "$@" loop. shift modifies $@ but has no effect on the loop's already-captured iteration list. As a result, wait_seconds="${1:-30}" reads $1 — the original first argument (e.g., "--wait") — instead of the intended value (e.g., 45). Since "--wait" is not a number, the [ "$wait_seconds" -gt 0 ] check silently fails and no sleep occurs. The suggested usage ./ipc-manager check --wait 45 (advertised in the error output) will never work correctly.
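
One way to make both `--wait=45` and `--wait 45` work is to iterate with `while` and `shift`, so that consuming the value actually advances the argument list. This is a sketch under assumptions: `parse_wait_seconds` is a hypothetical helper, not a function from the PR, and it keeps the 30-second default from the snippet above.

```bash
#!/usr/bin/env bash

# Hypothetical helper: print the requested wait in seconds (0 if --wait is absent).
parse_wait_seconds() {
    local wait_seconds=0
    while [ "$#" -gt 0 ]; do
        case "$1" in
            --wait=*)
                wait_seconds="${1#*=}"
                ;;
            --wait)
                # Consume the next argument only when it is a plain number;
                # otherwise fall back to the 30-second default.
                case "${2:-}" in
                    '' | *[!0-9]*)
                        wait_seconds=30
                        ;;
                    *)
                        wait_seconds="$2"
                        shift
                        ;;
                esac
                ;;
        esac
        shift
    done
    printf '%s\n' "$wait_seconds"
}
```

Because the loop reads `$1` and `$2` directly, the `shift` inside the `--wait` branch removes the value from the remaining arguments instead of leaving the iteration list untouched.
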

Comment thread scripts/ipc-subnet-manager/ipc-subnet-config.yml
- Modified the `ipc-subnet-config.yml` to reflect new registry and gateway addresses for the parent subnet.
- Enhanced the `config.sh` script to retrieve parent addresses from a unified source, ensuring backward compatibility with existing configurations.
- Updated YAML config synchronization logic to maintain consistency between subnet and ipc_cli.parent sections.
METRICS[peers]=0
METRICS[mempool_size]=0
METRICS[mempool_bytes]=0
METRICS[mempool_max]=5000

Dashboard mempool_max initialization prevents config reading

Low Severity

METRICS[mempool_max] is now initialized to 5000 in initialize_dashboard, which causes the conditional check [ -z "${METRICS[mempool_max]:-}" ] in fetch_metrics to always evaluate to false. The actual CometBFT mempool config value is never read from the node. Previously this key was uninitialized, so the first fetch_metrics call would read the real value. The dashboard now always shows 5000 as capacity regardless of actual config, causing incorrect mempool percentage calculations.
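
The interaction can be sketched as follows: if the key is left unset at initialization, the `-z` check in the fetch step still fires once and the hard-coded 5000 becomes a fallback only. The function names with the `_sketch` suffix, `read_mempool_max_from_node`, and `FAKE_NODE_MEMPOOL_MAX` are hypothetical stand-ins, not the dashboard's real code.

```bash
#!/usr/bin/env bash

declare -A METRICS

# Hypothetical stand-in for reading the mempool limit from the node's
# CometBFT config; here it just echoes a test variable.
read_mempool_max_from_node() {
    printf '%s\n' "${FAKE_NODE_MEMPOOL_MAX:-}"
}

initialize_dashboard_sketch() {
    METRICS[mempool_size]=0
    METRICS[mempool_bytes]=0
    # METRICS[mempool_max] is deliberately left unset so the first fetch reads it.
}

fetch_metrics_sketch() {
    if [ -z "${METRICS[mempool_max]:-}" ]; then
        local configured
        configured=$(read_mempool_max_from_node)
        # Use 5000 only when the node value could not be read.
        METRICS[mempool_max]="${configured:-5000}"
    fi
}
```
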

-local peer_ip=$(get_config_value "validators[$peer_idx].ip")
-if echo "$static_addrs" | grep -q "/ip4/$peer_ip/tcp/$libp2p_port"; then
+local peer_ip=$(get_peer_ip "$peer_idx")
+if echo "$static_addrs" | grep -q "/ip4/$peer_ip/tcp/$v_resolver_port"; then

Info command checks wrong port for peer static_addresses

Low Severity

In cmd_info, the static_addresses peer check uses $v_resolver_port (the current validator's resolver port) to verify peer entries. But static_addresses contains each peer's own resolver port, not the current validator's port. In local mode where each validator has a different port offset, this check always fails, producing misleading diagnostic output. The peer's port via get_resolver_port_for_validator "$peer_idx" is needed instead.
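
The suggested change amounts to looking up the port per peer before matching. In the sketch below, `get_peer_ip` and `get_resolver_port_for_validator` are stubbed with made-up values purely for illustration (the real helpers live in the manager scripts), and `peer_in_static_addrs` is a hypothetical wrapper.

```bash
#!/usr/bin/env bash

# Stubs standing in for the manager's helpers; IP and port offsets are invented.
get_peer_ip() { printf '127.0.0.1\n'; }
get_resolver_port_for_validator() { printf '%s\n' $((26654 + $1 * 100)); }

# Returns 0 when static_addrs lists the peer at the peer's own resolver port.
# Like the original check, this is a plain substring match.
peer_in_static_addrs() {
    local static_addrs="$1" peer_idx="$2"
    local peer_ip peer_port
    peer_ip=$(get_peer_ip "$peer_idx")
    peer_port=$(get_resolver_port_for_validator "$peer_idx")
    printf '%s\n' "$static_addrs" | grep -qF "/ip4/$peer_ip/tcp/$peer_port"
}
```

In a local setup where every validator has a different port offset, the lookup by `peer_idx` is what lets the check succeed for peers other than the current validator.
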

@phutchins phutchins enabled auto-merge (squash) March 13, 2026 12:12
karlem commented Mar 13, 2026

@phutchins thanks for the PR, but can you please clarify why bootstrap provisions a full dev/build environment on each validator host (repo clone + Rust + Foundry + Node/pnpm) instead of using prebuilt/pinned ipc-cli and fendermint artifacts?
What specific runtime or operational requirement makes on-host source builds necessary here, and is a forked repo actually required for this flow?

- Enhanced the `get_chain_id` function to differentiate between local and remote modes for fetching the chain ID.
- Implemented direct curl requests to the validator's external IP in remote mode, ensuring better connectivity and reliability.
- Maintained existing functionality for local mode by using localhost for API calls.

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 3 potential issues.

There are 6 total unresolved issues (including 3 from previous reviews).

Autofix Details

Bugbot Autofix prepared fixes for all 3 issues found in the latest run.

  • ✅ Fixed: Private keys committed in configuration file
    • Replaced hardcoded private keys with environment variable placeholders (${IPC_VALIDATORS_N_PRIVATE_KEY}) so sensitive credentials are no longer committed to version control.
  • ✅ Fixed: Troubleshooting doc contains hardcoded infrastructure IPs
    • Deleted the TROUBLESHOOT-RESOLVER.md file which contained personal debugging notes with hardcoded IPs, usernames, and infrastructure details.
  • ✅ Fixed: Remote tilde expansion uses local HOME variable
    • Changed tilde expansion to use the remote ipc_user's home directory (/home/$ipc_user) instead of the local $HOME variable.

Or push these changes by commenting:

@cursor push 152a58a80a
Preview (152a58a80a)
diff --git a/scripts/ipc-subnet-manager/TROUBLESHOOT-RESOLVER.md b/scripts/ipc-subnet-manager/TROUBLESHOOT-RESOLVER.md
deleted file mode 100644
--- a/scripts/ipc-subnet-manager/TROUBLESHOOT-RESOLVER.md
+++ /dev/null
@@ -1,196 +1,0 @@
-# Systematic Troubleshooting: Port 26654 (IPLD Resolver) Not Listening
-
-## Diagnostic Results (from your run)
-
-| Check | Result |
-|-------|--------|
-| Config `listen_addr` | ✓ `/ip4/0.0.0.0/tcp/26654` |
-| Config `subnet_id` | ✓ `/r314159/t410fjzsmxroshdmvdq5bg4zwqxx5lznwxaga4h7zgqa` |
-| Config `[resolver] enabled` | ✓ `true` |
-| Start script | ✓ Has correct env vars |
-| Manual `env FM_... ipc-cli node start` | ✗ Port 26654 still not listening |
-| Logs: "IPLD Resolver disabled" or "starting..." | ✗ **Neither appears** |
-| Logs: "snapshots disabled" at node.rs | Line **142** (remote) vs **243** (current code) |
-
-**ROOT CAUSE:** The remote binary was built from a different branch (e.g. f3-lifecycle). Line numbers don't match current code; the resolver block may not exist or is structured differently in that binary. The config and env vars are correct—the binary simply doesn't have the resolver code.
-
----
-
-## Fix
-
-Rebuild the binary on validators from the branch that has the resolver code:
-
-```bash
-./ipc-manager update-binaries --branch feature/subnet-bootstrapping
-./ipc-manager restart --yes
-```
-
-Then verify:
-
-```bash
-./ipc-manager check
-ssh philip@34.16.93.183 "ss -tuln | grep 26654"
-```
-
----
-
-## Root Cause Logic (from fendermint)
-
-The resolver starts only when `resolver_enabled()` returns true:
-```rust
-// fendermint/app/settings/src/lib.rs:523-527
-pub fn resolver_enabled(&self) -> bool {
-    !self.resolver.connection.listen_addr.is_empty()
-        && self.ipc.subnet_id != *ipc_api::subnet_id::UNDEF
-}
-```
-
-**Both conditions must be true:**
-1. `resolver.connection.listen_addr` must be non-empty (e.g. `/ip4/0.0.0.0/tcp/26654`)
-2. `ipc.subnet_id` must not be UNDEF (root: 0, children: [])
-
-If disabled, logs show: `"IPLD Resolver disabled."`
-If enabled, logs show: `"starting the IPLD Resolver Service..."`
-
----
-
-## Step 1: Check Config on Remote
-
-SSH to validator-1 and inspect the fendermint config:
-
-```bash
-ssh philip@34.16.93.183 "sudo -u ipc cat /home/ipc/.ipc-node/fendermint/config/default.toml"
-```
-
-**Look for:**
-- `[resolver]` or `[resolver.connection]` section
-- `listen_addr = "/ip4/0.0.0.0/tcp/26654"` (or similar)
-- `[ipc]` section with `subnet_id = "/r314159/t410fjzsmxroshdmvdq5bg4zwqxx5lznwxaga4h7zgqa"`
-
-**Grep for key sections:**
-```bash
-ssh philip@34.16.93.183 "sudo -u ipc grep -A5 '\[resolver\]' /home/ipc/.ipc-node/fendermint/config/default.toml"
-ssh philip@34.16.93.183 "sudo -u ipc grep -A2 '\[ipc\]' /home/ipc/.ipc-node/fendermint/config/default.toml"
-ssh philip@34.16.93.183 "sudo -u ipc grep listen_addr /home/ipc/.ipc-node/fendermint/config/default.toml"
-```
-
----
-
-## Step 2: Check Logs for Resolver Decision (CRITICAL)
-
-```bash
-# Resolver decision
-ssh philip@34.16.93.183 "sudo -u ipc grep -E 'IPLD Resolver|resolver' /home/ipc/.ipc-node/logs/*.log 2>/dev/null | tail -20"
-
-# Also check startup logs
-ssh philip@34.16.93.183 "sudo -u ipc tail -100 /home/ipc/.ipc-node/logs/*.app.log 2>/dev/null | grep -E 'Resolver|resolver|listen|26654'"
-```
-
-**Interpretation:**
-- `"IPLD Resolver disabled."` → resolver_enabled() returned false (listen_addr empty and/or subnet_id UNDEF)
-- `"starting the IPLD Resolver Service..."` → resolver started (port issue may be elsewhere)
-
-**If logs show "disabled":** The binary is loading config but resolver_enabled() is false. Possible causes:
-- `validator.toml` or `local.toml` overrides and clears listen_addr
-- Config parsing bug (e.g. Multiaddr type)
-- Different binary (f3-lifecycle) with different logic
-
-**If logs show "starting...":** Resolver runs but port doesn't bind. Check for "IPLD Resolver Service failed" or bind errors.
-
----
-
-## Step 3: Check Start Script (What Actually Runs)
-
-```bash
-ssh philip@34.16.93.183 "sudo -u ipc cat /home/ipc/.ipc-node/start-node.sh 2>/dev/null || echo 'File not found'"
-```
-
-**Verify:** Does it contain `export FM_RESOLVER__CONNECTION__LISTEN_ADDR` and `export FM_IPC__SUBNET_ID`?
-
----
-
-## Step 4: Check How Node Is Currently Running
-
-```bash
-ssh philip@34.16.93.183 "ps aux | grep 'ipc-cli node start' | grep -v grep"
-```
-
-**Check:** Is the process started by start-node.sh or by a direct nohup command? (env vars only apply if set before the process starts)
-
----
-
-## Step 5: Manual Test – Run With Explicit Env Vars
-
-Stop the node, then run manually with env vars to isolate whether config or env is the issue:
-
-```bash
-# On validator-1 (34.16.93.183)
-ssh philip@34.16.93.183
-
-# Stop existing node
-sudo pkill -f "ipc-cli node start" || true
-sleep 3
-
-# Run as ipc user with explicit env vars (no wrapper script)
-sudo -u ipc env \
-  FM_RESOLVER__CONNECTION__LISTEN_ADDR=/ip4/0.0.0.0/tcp/26654 \
-  FM_IPC__SUBNET_ID=/r314159/t410fjzsmxroshdmvdq5bg4zwqxx5lznwxaga4h7zgqa \
-  /home/ipc/ipc/target/release/ipc-cli node start --home /home/ipc/.ipc-node
-
-# Let it run 15-20 seconds, then Ctrl+C to stop
-# In another terminal, check port:
-#   ssh philip@34.16.93.183 "ss -tuln | grep 26654"
-```
-
-**If port 26654 appears:** Env vars work; the wrapper script or how it's invoked is the problem.
-**If port 26654 does NOT appear:** Config or binary (e.g. f3-lifecycle branch) may disable the resolver.
-
----
-
-## Step 6: Check for Override Configs
-
-Config load order: default.toml → validator.toml → local.toml → env. Later overrides can clear earlier values.
-
-```bash
-ssh philip@34.16.93.183 "sudo -u ipc ls -la /home/ipc/.ipc-node/fendermint/config/"
-ssh philip@34.16.93.183 "sudo -u ipc cat /home/ipc/.ipc-node/fendermint/config/validator.toml 2>/dev/null || echo 'No validator.toml'"
-ssh philip@34.16.93.183 "sudo -u ipc cat /home/ipc/.ipc-node/fendermint/config/local.toml 2>/dev/null || echo 'No local.toml'"
-```
-
-## Step 7: Check Binary / Branch
-
-```bash
-# Fix safe.directory first, then check branch
-ssh philip@34.16.93.183 "sudo -u ipc git -C /home/ipc/ipc config --global --add safe.directory /home/ipc/ipc 2>/dev/null; sudo -u ipc bash -c 'cd /home/ipc/ipc && git branch -v && git log -1 --oneline'"
-```
-
-**Note:** If validators run `f3-lifecycle` (or another branch), resolver logic may differ from `feature/subnet-bootstrapping`.
-
----
-
-## Step 8: Check Default Config Template
-
-If the node was initialized with a different node-init, the default.toml may have been generated without resolver settings:
-
-```bash
-ssh philip@34.16.93.183 "sudo -u ipc head -100 /home/ipc/.ipc-node/fendermint/config/default.toml"
-```
-
----
-
-## Summary: Decision Tree
-
-| Config has listen_addr? | Config has subnet_id? | Log says "disabled"? | Likely cause |
-|-------------------------|----------------------|----------------------|--------------|
-| No / empty               | -                    | Yes                  | Config missing resolver.connection.listen_addr |
-| Yes                      | No / UNDEF           | Yes                  | Config missing ipc.subnet_id |
-| Yes                      | Yes                  | Yes                  | Env override not applied (script/quoting) or binary differs |
-| Yes                      | Yes                  | No ("starting...")   | Resolver starts but port bind fails (e.g. permission, conflict) |
-
----
-
-## After Finding Root Cause
-
-1. **If config is wrong:** Fix default.toml (or re-run node init with correct node-init.yml)
-2. **If env vars not applied:** Fix start script invocation (wrapper script, quoting, or use systemd with Environment=)
-3. **If binary/branch differs:** Build from feature/subnet-bootstrapping or adapt to that branch's config
\ No newline at end of file

diff --git a/scripts/ipc-subnet-manager/ipc-subnet-config.yml b/scripts/ipc-subnet-manager/ipc-subnet-config.yml
--- a/scripts/ipc-subnet-manager/ipc-subnet-config.yml
+++ b/scripts/ipc-subnet-manager/ipc-subnet-config.yml
@@ -24,21 +24,24 @@
     ssh_user: "philip"
     ipc_user: "ipc"
     role: "primary" # First node initialized
-    private_key: "0x867c766fa9ea9fab8929a6ec6a4fe32ccf33969035d3d7f2262f6eb8021b56d8"
+    # Private key loaded from environment: IPC_VALIDATORS_0_PRIVATE_KEY
+    private_key: "${IPC_VALIDATORS_0_PRIVATE_KEY}"
   - name: "validator-2"
     ip: "136.115.12.207"
     internal_ip: "10.128.0.3"
     ssh_user: "philip"
     ipc_user: "ipc"
     role: "secondary"
-    private_key: "0x40aa709b5d6765411f2afbdb0b4ae00e45a06425b37a386334c80482b203d04d"
+    # Private key loaded from environment: IPC_VALIDATORS_1_PRIVATE_KEY
+    private_key: "${IPC_VALIDATORS_1_PRIVATE_KEY}"
   - name: "validator-3"
     ip: "35.202.56.101"
     internal_ip: "10.128.0.4"
     ssh_user: "philip"
     ipc_user: "ipc"
     role: "secondary"
-    private_key: "0xc1099a062e296366a2ac3b26ac80a409833e6a74edbf677a0bd14580d2c68ea2"
+    # Private key loaded from environment: IPC_VALIDATORS_2_PRIVATE_KEY
+    private_key: "${IPC_VALIDATORS_2_PRIVATE_KEY}"
 # Network Configuration
 network:
   cometbft_p2p_port: 26656

diff --git a/scripts/ipc-subnet-manager/lib/config.sh b/scripts/ipc-subnet-manager/lib/config.sh
--- a/scripts/ipc-subnet-manager/lib/config.sh
+++ b/scripts/ipc-subnet-manager/lib/config.sh
@@ -484,7 +484,12 @@
     else
         local remote_ipc_dir
         remote_ipc_dir=$(get_config_value "paths.ipc_config_dir")
-        remote_ipc_dir="${remote_ipc_dir/#\~/$HOME}"
+        # Expand tilde using remote ipc_user's home, not local $HOME
+        local ipc_user
+        ipc_user=$(get_config_value "validators[$validator_idx].ipc_user")
+        if [ -n "$ipc_user" ] && [ "$ipc_user" != "null" ]; then
+            remote_ipc_dir="${remote_ipc_dir/#\~//home/$ipc_user}"
+        fi
         genesis_json_yml="$remote_ipc_dir/genesis_${subnet_id_no_slash//\//_}.json"
         genesis_sealed_yml="$remote_ipc_dir/genesis_sealed_${subnet_id_no_slash//\//_}.json"
     fi

ssh_user: "philip"
ipc_user: "ipc"
role: "primary" # First node initialized

Private keys committed in configuration file

High Severity

The config file contains three validator private_key values alongside real external IPs (GCP hosts on Calibration testnet). While these may be testnet keys, committing private keys to a public repository is a serious security concern — anyone can impersonate these validators, drain funds, or disrupt the subnet. These keys and the associated IP addresses/infrastructure details are now permanently in git history.
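
If the config carries placeholders such as `${IPC_VALIDATORS_0_PRIVATE_KEY}` instead of literal keys, something has to resolve them from the environment at load time. How the manager does this is not shown in the conversation, so the following is only a sketch under that assumption; `resolve_env_placeholder` is a hypothetical helper.

```bash
#!/usr/bin/env bash

# Hypothetical helper: if the value is exactly "${NAME}", print the value of the
# environment variable NAME; otherwise print the value unchanged.
resolve_env_placeholder() {
    local value="$1"
    local re='^\$\{([A-Za-z_][A-Za-z0-9_]*)\}$'
    if [[ "$value" =~ $re ]]; then
        local var_name="${BASH_REMATCH[1]}"
        # Fail loudly rather than pass an empty key downstream.
        if [ -z "${!var_name:-}" ]; then
            echo "error: environment variable $var_name is not set" >&2
            return 1
        fi
        printf '%s\n' "${!var_name}"
    else
        printf '%s\n' "$value"
    fi
}
```

Replacing the values in the file does not remove the original keys from git history, so keys that were already committed would still need to be rotated.
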


1. **If config is wrong:** Fix default.toml (or re-run node init with correct node-init.yml)
2. **If env vars not applied:** Fix start script invocation (wrapper script, quoting, or use systemd with Environment=)
3. **If binary/branch differs:** Build from feature/subnet-bootstrapping or adapt to that branch's config

Troubleshooting doc contains hardcoded infrastructure IPs

Medium Severity

This entire file appears to be a personal debugging/troubleshooting session log with hardcoded infrastructure IPs (e.g., 34.16.93.183), specific user accounts (philip), specific subnet IDs, and step-by-step SSH commands referencing a specific deployment. It reads like personal notes that were accidentally committed rather than reusable documentation.

else
local remote_ipc_dir
remote_ipc_dir=$(get_config_value "paths.ipc_config_dir")
remote_ipc_dir="${remote_ipc_dir/#\~/$HOME}"

Remote tilde expansion uses local HOME variable

Medium Severity

When constructing remote genesis file paths for node-init.yml, remote_ipc_dir is expanded using ${remote_ipc_dir/#\~/$HOME} which substitutes the local machine's $HOME value. If the local user's home differs from the remote ipc user's home (e.g., local is /home/philip but remote is /home/ipc), the paths written into the YAML will be wrong, causing ipc-cli node init on the remote to fail to find genesis files.
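
The difference can be illustrated with a small helper that expands a leading `~` against the remote user's home rather than the local `$HOME`. This is a sketch only: `expand_remote_tilde` is a hypothetical name, and it assumes the remote home is `/home/<user>`, which does not hold for every account (for example `root`).

```bash
#!/usr/bin/env bash

# Hypothetical helper: expand a leading "~" using the remote user's home.
# Assumes the conventional /home/<user> layout on the remote host.
expand_remote_tilde() {
    local path="$1" remote_user="$2"
    case "$path" in
        '~')
            printf '/home/%s\n' "$remote_user"
            ;;
        '~/'*)
            printf '/home/%s/%s\n' "$remote_user" "${path#\~/}"
            ;;
        *)
            # Absolute or relative paths without a leading tilde pass through.
            printf '%s\n' "$path"
            ;;
    esac
}
```

With a local user `philip` and a remote user `ipc`, `expand_remote_tilde '~/.ipc' ipc` yields `/home/ipc/.ipc`, whereas substituting the local `$HOME` would have produced `/home/philip/.ipc`.
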

@phutchins
Contributor Author

> @phutchins thanks for the PR, but can you please clarify why bootstrap provisions a full dev/build environment on each validator host (repo clone + Rust + Foundry + Node/pnpm) instead of using prebuilt/pinned ipc-cli and fendermint artifacts? What specific runtime or operational requirement makes on-host source builds necessary here, and is a forked repo actually required for this flow?

We should cross-compile and push the binaries, but that can wait for another iteration.

- Changed the build runner from Ubuntu 22.04 to Ubuntu 24.04 in the GitHub Actions workflow configuration.
- Updated the Dockerfile to use Ubuntu 24.04 to align with the glibc 2.39 requirement for binaries built on the Ubuntu 24.04 CI runner.

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 7 total unresolved issues (including 6 from previous reviews).

parent_gateway = "$parent_gateway"

EOF
fi

Omitting topdown overrides doesn't actually disable topdown syncing

High Severity

When init.topdown.enabled is false, the [ipc.topdown] section is simply omitted from fendermint-overrides. This doesn't disable topdown — it leaves fendermint's default topdown config intact (generated by ipc-cli node init using the parent.rpc_url from node-init.yml). Fendermint will still attempt to query the parent chain for the subnet, which fails for bootstrap subnets not yet activated on-chain, potentially causing node startup failures or continuous errors.

@phutchins phutchins merged commit 699d1d6 into main Mar 17, 2026
18 checks passed
@phutchins phutchins deleted the feature/subnet-bootstrapping branch March 17, 2026 17:38

2 participants