Conversation
…figuration updates
- Added a new bootstrap command to install dependencies (Rust, Foundry, Node.js) on fresh validator hosts.
- Updated initialization process to support resuming from previous failures.
- Modified subnet configuration with new validator IPs and registry addresses.
- Improved health check and execution commands for better reliability.
- Enhanced documentation to reflect new bootstrap steps and usage instructions.
…rd metrics
- Introduced a new troubleshooting document for diagnosing issues with the IPLD Resolver not listening on port 26654.
- Enhanced the dashboard script to initialize additional metrics for better monitoring, including block production rates and finality tracking.
- Updated health check scripts to ensure proper environment variable handling and improve logging for resolver-related configurations.
- Introduced a new function `ssh_exec_long` to handle long-running commands with streaming output, preventing SSH timeouts during builds.
- Updated the `update_validator_binaries` function to utilize the new long-running command execution, improving build process logging and error handling.
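The PR's actual `ssh_exec_long` isn't shown in this thread; a minimal sketch of the technique it describes — streaming remote output line by line with SSH keepalives so silent builds don't trip timeouts — might look like the following. The keepalive values and the `[host]` log prefix are assumptions, not the PR's code:

```shell
#!/usr/bin/env bash
# Sketch: run a long remote command, streaming output as it arrives.
# Keepalives prevent idle disconnects while a build produces no output.
ssh_exec_long() {
    local host=$1
    shift
    ssh -o ServerAliveInterval=30 -o ServerAliveCountMax=10 \
        "$host" "$@" 2>&1 | while IFS= read -r line; do
        printf '[%s] %s\n' "$host" "$line"   # prefix each line with the host
    done
    return "${PIPESTATUS[0]}"   # preserve ssh's exit code, not the loop's
}
```

Returning `PIPESTATUS[0]` matters: without it, the function would report the `while` loop's status and mask remote build failures.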
…d script
- Updated the calculation of `blocks_per_min` to accurately reflect block production rates based on time differences.
- Adjusted timestamp formatting logic to ensure proper handling of time zone indicators.
```bash
for arg in "$@"; do
    case $arg in
        --wait=*) wait_seconds="${arg#*=}" ;;
        --wait) shift; wait_seconds="${1:-30}" ;;
```
--wait value not parsed in for loop
Medium Severity
In cmd_check, the --wait VALUE form (space-separated) calls shift inside a for arg in "$@" loop. shift modifies $@ but has no effect on the loop's already-captured iteration list. As a result, wait_seconds="${1:-30}" reads $1 — the original first argument (e.g., "--wait") — instead of the intended value (e.g., 45). Since "--wait" is not a number, the [ "$wait_seconds" -gt 0 ] check silently fails and no sleep occurs. The suggested usage ./ipc-manager check --wait 45 (advertised in the error output) will never work correctly.
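A sketch of one common fix (names taken from the report; the surrounding `cmd_check` logic is omitted): iterate with `while`/`shift` instead of `for arg in "$@"`, so consuming the value token actually advances the argument list:

```shell
#!/usr/bin/env bash
# `for arg in "$@"` snapshots the iteration list up front, so `shift`
# inside the loop can't skip a value token. A while loop reads the
# positional parameters live, so shift works as intended.
parse_wait() {
    local wait_seconds=0
    while [ $# -gt 0 ]; do
        case "$1" in
            --wait=*) wait_seconds="${1#*=}"; shift ;;
            --wait)
                wait_seconds="${2:-30}"   # space-separated form: --wait 45
                shift
                [ $# -gt 0 ] && shift     # consume the value if present
                ;;
            *) shift ;;
        esac
    done
    echo "$wait_seconds"
}
parse_wait --wait 45    # prints 45 (the broken loop yielded "--wait")
parse_wait --wait=10    # prints 10
```

With this shape, `./ipc-manager check --wait 45` parses the way the error output advertises.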
- Modified the `ipc-subnet-config.yml` to reflect new registry and gateway addresses for the parent subnet.
- Enhanced the `config.sh` script to retrieve parent addresses from a unified source, ensuring backward compatibility with existing configurations.
- Updated YAML config synchronization logic to maintain consistency between subnet and ipc_cli.parent sections.
```bash
METRICS[peers]=0
METRICS[mempool_size]=0
METRICS[mempool_bytes]=0
METRICS[mempool_max]=5000
```
Dashboard mempool_max initialization prevents config reading
Low Severity
METRICS[mempool_max] is now initialized to 5000 in initialize_dashboard, which causes the conditional check [ -z "${METRICS[mempool_max]:-}" ] in fetch_metrics to always evaluate to false. The actual CometBFT mempool config value is never read from the node. Previously this key was uninitialized, so the first fetch_metrics call would read the real value. The dashboard now always shows 5000 as capacity regardless of actual config, causing incorrect mempool percentage calculations.
Additional Locations (1)
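A sketch of the straightforward repair (the helper below is a stub, not the script's real CometBFT query): leave `mempool_max` unset in `initialize_dashboard` so the one-time read in `fetch_metrics` still fires, and keep 5000 only as a display-time fallback:

```shell
#!/usr/bin/env bash
declare -A METRICS

initialize_dashboard() {
    METRICS[peers]=0
    METRICS[mempool_size]=0
    METRICS[mempool_bytes]=0
    # Deliberately NOT seeding METRICS[mempool_max]: the -z guard in
    # fetch_metrics must see it empty once so the real value gets read.
}

read_mempool_max_from_node() { echo 4096; }  # stub for the CometBFT config query

fetch_metrics() {
    if [ -z "${METRICS[mempool_max]:-}" ]; then
        METRICS[mempool_max]=$(read_mempool_max_from_node)
    fi
}

initialize_dashboard
fetch_metrics
echo "${METRICS[mempool_max]:-5000}"   # prints 4096 -- the node's value, not a hardcoded cap
```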
```bash
local peer_ip=$(get_config_value "validators[$peer_idx].ip")
if echo "$static_addrs" | grep -q "/ip4/$peer_ip/tcp/$libp2p_port"; then
local peer_ip=$(get_peer_ip "$peer_idx")
if echo "$static_addrs" | grep -q "/ip4/$peer_ip/tcp/$v_resolver_port"; then
```
Info command checks wrong port for peer static_addresses
Low Severity
In cmd_info, the static_addresses peer check uses $v_resolver_port (the current validator's resolver port) to verify peer entries. But static_addresses contains each peer's own resolver port, not the current validator's port. In local mode where each validator has a different port offset, this check always fails, producing misleading diagnostic output. The peer's port via get_resolver_port_for_validator "$peer_idx" is needed instead.
Additional Locations (1)
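A sketch of the corrected check — the two `get_*` stubs below use made-up values purely for illustration; `get_resolver_port_for_validator` is the helper the review points to:

```shell
#!/usr/bin/env bash
# Stub helpers standing in for the real config lookups (illustrative values)
get_peer_ip() { echo "10.0.0.$((10 + $1))"; }
get_resolver_port_for_validator() { echo $((26654 + $1)); }  # per-validator offset (local mode)

# Example static_addresses as the resolver would list them
static_addrs="/ip4/10.0.0.11/tcp/26655,/ip4/10.0.0.12/tcp/26656"

check_peer() {
    local peer_idx=$1 peer_ip peer_port
    peer_ip=$(get_peer_ip "$peer_idx")
    # The fix: look up the PEER's resolver port, not $v_resolver_port
    peer_port=$(get_resolver_port_for_validator "$peer_idx")
    if echo "$static_addrs" | grep -q "/ip4/$peer_ip/tcp/$peer_port"; then
        echo "found"
    else
        echo "missing"
    fi
}
check_peer 1   # prints found (10.0.0.11:26655 is listed)
check_peer 3   # prints missing (10.0.0.13:26657 isn't)
```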
@phutchins thanks for the PR, but can you please clarify why bootstrap provisions a full dev/build environment on each validator host (repo clone + Rust + Foundry + Node/pnpm) instead of using prebuilt/pinned ipc-cli and fendermint artifacts?
- Enhanced the `get_chain_id` function to differentiate between local and remote modes for fetching the chain ID.
- Implemented direct curl requests to the validator's external IP in remote mode, ensuring better connectivity and reliability.
- Maintained existing functionality for local mode by using localhost for API calls.
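A sketch of the local/remote split described above. URL construction is separated out so it can be checked without a live node; the `/status` endpoint, port, and `jq` path assume a CometBFT-style RPC and may differ from the script's actual calls:

```shell
#!/usr/bin/env bash
# Build the RPC URL for the chain-id query: remote mode targets the
# validator's external IP, local mode keeps using localhost.
chain_id_url() {
    local mode=$1 external_ip=$2 rpc_port=${3:-26657}
    if [ "$mode" = "remote" ]; then
        echo "http://$external_ip:$rpc_port/status"
    else
        echo "http://127.0.0.1:$rpc_port/status"
    fi
}

get_chain_id() {
    # -sf: silent, fail on HTTP errors so callers see a nonzero status
    curl -sf "$(chain_id_url "$@")" | jq -r '.result.node_info.network'
}
```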
Cursor Bugbot has reviewed your changes and found 3 potential issues.
There are 6 total unresolved issues (including 3 from previous reviews).
Autofix Details
Bugbot Autofix prepared fixes for all 3 issues found in the latest run.
- ✅ Fixed: Private keys committed in configuration file
- Replaced hardcoded private keys with environment variable placeholders (${IPC_VALIDATORS_N_PRIVATE_KEY}) so sensitive credentials are no longer committed to version control.
- ✅ Fixed: Troubleshooting doc contains hardcoded infrastructure IPs
- Deleted the TROUBLESHOOT-RESOLVER.md file which contained personal debugging notes with hardcoded IPs, usernames, and infrastructure details.
- ✅ Fixed: Remote tilde expansion uses local HOME variable
- Changed tilde expansion to use the remote ipc_user's home directory (/home/$ipc_user) instead of the local $HOME variable.
Or push these changes by commenting:
@cursor push 152a58a80a
Preview (152a58a80a)
diff --git a/scripts/ipc-subnet-manager/TROUBLESHOOT-RESOLVER.md b/scripts/ipc-subnet-manager/TROUBLESHOOT-RESOLVER.md
deleted file mode 100644
--- a/scripts/ipc-subnet-manager/TROUBLESHOOT-RESOLVER.md
+++ /dev/null
@@ -1,196 +0,0 @@
-# Systematic Troubleshooting: Port 26654 (IPLD Resolver) Not Listening
-
-## Diagnostic Results (from your run)
-
-| Check | Result |
-|-------|--------|
-| Config `listen_addr` | ✓ `/ip4/0.0.0.0/tcp/26654` |
-| Config `subnet_id` | ✓ `/r314159/t410fjzsmxroshdmvdq5bg4zwqxx5lznwxaga4h7zgqa` |
-| Config `[resolver] enabled` | ✓ `true` |
-| Start script | ✓ Has correct env vars |
-| Manual `env FM_... ipc-cli node start` | ✗ Port 26654 still not listening |
-| Logs: "IPLD Resolver disabled" or "starting..." | ✗ **Neither appears** |
-| Logs: "snapshots disabled" at node.rs | Line **142** (remote) vs **243** (current code) |
-
-**ROOT CAUSE:** The remote binary was built from a different branch (e.g. f3-lifecycle). Line numbers don't match current code; the resolver block may not exist or is structured differently in that binary. The config and env vars are correct—the binary simply doesn't have the resolver code.
-
----
-
-## Fix
-
-Rebuild the binary on validators from the branch that has the resolver code:
-
-```bash
-./ipc-manager update-binaries --branch feature/subnet-bootstrapping
-./ipc-manager restart --yes
-```
-
-Then verify:
-
-```bash
-./ipc-manager check
-ssh philip@34.16.93.183 "ss -tuln | grep 26654"
-```
-
----
-
-## Root Cause Logic (from fendermint)
-
-The resolver starts only when `resolver_enabled()` returns true:
-```rust
-// fendermint/app/settings/src/lib.rs:523-527
-pub fn resolver_enabled(&self) -> bool {
- !self.resolver.connection.listen_addr.is_empty()
- && self.ipc.subnet_id != *ipc_api::subnet_id::UNDEF
-}
-```
-
-**Both conditions must be true:**
-1. `resolver.connection.listen_addr` must be non-empty (e.g. `/ip4/0.0.0.0/tcp/26654`)
-2. `ipc.subnet_id` must not be UNDEF (root: 0, children: [])
-
-If disabled, logs show: `"IPLD Resolver disabled."`
-If enabled, logs show: `"starting the IPLD Resolver Service..."`
-
----
-
-## Step 1: Check Config on Remote
-
-SSH to validator-1 and inspect the fendermint config:
-
-```bash
-ssh philip@34.16.93.183 "sudo -u ipc cat /home/ipc/.ipc-node/fendermint/config/default.toml"
-```
-
-**Look for:**
-- `[resolver]` or `[resolver.connection]` section
-- `listen_addr = "/ip4/0.0.0.0/tcp/26654"` (or similar)
-- `[ipc]` section with `subnet_id = "/r314159/t410fjzsmxroshdmvdq5bg4zwqxx5lznwxaga4h7zgqa"`
-
-**Grep for key sections:**
-```bash
-ssh philip@34.16.93.183 "sudo -u ipc grep -A5 '\[resolver\]' /home/ipc/.ipc-node/fendermint/config/default.toml"
-ssh philip@34.16.93.183 "sudo -u ipc grep -A2 '\[ipc\]' /home/ipc/.ipc-node/fendermint/config/default.toml"
-ssh philip@34.16.93.183 "sudo -u ipc grep listen_addr /home/ipc/.ipc-node/fendermint/config/default.toml"
-```
-
----
-
-## Step 2: Check Logs for Resolver Decision (CRITICAL)
-
-```bash
-# Resolver decision
-ssh philip@34.16.93.183 "sudo -u ipc grep -E 'IPLD Resolver|resolver' /home/ipc/.ipc-node/logs/*.log 2>/dev/null | tail -20"
-
-# Also check startup logs
-ssh philip@34.16.93.183 "sudo -u ipc tail -100 /home/ipc/.ipc-node/logs/*.app.log 2>/dev/null | grep -E 'Resolver|resolver|listen|26654'"
-```
-
-**Interpretation:**
-- `"IPLD Resolver disabled."` → resolver_enabled() returned false (listen_addr empty and/or subnet_id UNDEF)
-- `"starting the IPLD Resolver Service..."` → resolver started (port issue may be elsewhere)
-
-**If logs show "disabled":** The binary is loading config but resolver_enabled() is false. Possible causes:
-- `validator.toml` or `local.toml` overrides and clears listen_addr
-- Config parsing bug (e.g. Multiaddr type)
-- Different binary (f3-lifecycle) with different logic
-
-**If logs show "starting...":** Resolver runs but port doesn't bind. Check for "IPLD Resolver Service failed" or bind errors.
-
----
-
-## Step 3: Check Start Script (What Actually Runs)
-
-```bash
-ssh philip@34.16.93.183 "sudo -u ipc cat /home/ipc/.ipc-node/start-node.sh 2>/dev/null || echo 'File not found'"
-```
-
-**Verify:** Does it contain `export FM_RESOLVER__CONNECTION__LISTEN_ADDR` and `export FM_IPC__SUBNET_ID`?
-
----
-
-## Step 4: Check How Node Is Currently Running
-
-```bash
-ssh philip@34.16.93.183 "ps aux | grep 'ipc-cli node start' | grep -v grep"
-```
-
-**Check:** Is the process started by start-node.sh or by a direct nohup command? (env vars only apply if set before the process starts)
-
----
-
-## Step 5: Manual Test – Run With Explicit Env Vars
-
-Stop the node, then run manually with env vars to isolate whether config or env is the issue:
-
-```bash
-# On validator-1 (34.16.93.183)
-ssh philip@34.16.93.183
-
-# Stop existing node
-sudo pkill -f "ipc-cli node start" || true
-sleep 3
-
-# Run as ipc user with explicit env vars (no wrapper script)
-sudo -u ipc env \
- FM_RESOLVER__CONNECTION__LISTEN_ADDR=/ip4/0.0.0.0/tcp/26654 \
- FM_IPC__SUBNET_ID=/r314159/t410fjzsmxroshdmvdq5bg4zwqxx5lznwxaga4h7zgqa \
- /home/ipc/ipc/target/release/ipc-cli node start --home /home/ipc/.ipc-node
-
-# Let it run 15-20 seconds, then Ctrl+C to stop
-# In another terminal, check port:
-# ssh philip@34.16.93.183 "ss -tuln | grep 26654"
-```
-
-**If port 26654 appears:** Env vars work; the wrapper script or how it's invoked is the problem.
-**If port 26654 does NOT appear:** Config or binary (e.g. f3-lifecycle branch) may disable the resolver.
-
----
-
-## Step 6: Check for Override Configs
-
-Config load order: default.toml → validator.toml → local.toml → env. Later overrides can clear earlier values.
-
-```bash
-ssh philip@34.16.93.183 "sudo -u ipc ls -la /home/ipc/.ipc-node/fendermint/config/"
-ssh philip@34.16.93.183 "sudo -u ipc cat /home/ipc/.ipc-node/fendermint/config/validator.toml 2>/dev/null || echo 'No validator.toml'"
-ssh philip@34.16.93.183 "sudo -u ipc cat /home/ipc/.ipc-node/fendermint/config/local.toml 2>/dev/null || echo 'No local.toml'"
-```
-
-## Step 7: Check Binary / Branch
-
-```bash
-# Fix safe.directory first, then check branch
-ssh philip@34.16.93.183 "sudo -u ipc git -C /home/ipc/ipc config --global --add safe.directory /home/ipc/ipc 2>/dev/null; sudo -u ipc bash -c 'cd /home/ipc/ipc && git branch -v && git log -1 --oneline'"
-```
-
-**Note:** If validators run `f3-lifecycle` (or another branch), resolver logic may differ from `feature/subnet-bootstrapping`.
-
----
-
-## Step 8: Check Default Config Template
-
-If the node was initialized with a different node-init, the default.toml may have been generated without resolver settings:
-
-```bash
-ssh philip@34.16.93.183 "sudo -u ipc head -100 /home/ipc/.ipc-node/fendermint/config/default.toml"
-```
-
----
-
-## Summary: Decision Tree
-
-| Config has listen_addr? | Config has subnet_id? | Log says "disabled"? | Likely cause |
-|-------------------------|----------------------|----------------------|--------------|
-| No / empty | - | Yes | Config missing resolver.connection.listen_addr |
-| Yes | No / UNDEF | Yes | Config missing ipc.subnet_id |
-| Yes | Yes | Yes | Env override not applied (script/quoting) or binary differs |
-| Yes | Yes | No ("starting...") | Resolver starts but port bind fails (e.g. permission, conflict) |
-
----
-
-## After Finding Root Cause
-
-1. **If config is wrong:** Fix default.toml (or re-run node init with correct node-init.yml)
-2. **If env vars not applied:** Fix start script invocation (wrapper script, quoting, or use systemd with Environment=)
-3. **If binary/branch differs:** Build from feature/subnet-bootstrapping or adapt to that branch's config
\ No newline at end of file
diff --git a/scripts/ipc-subnet-manager/ipc-subnet-config.yml b/scripts/ipc-subnet-manager/ipc-subnet-config.yml
--- a/scripts/ipc-subnet-manager/ipc-subnet-config.yml
+++ b/scripts/ipc-subnet-manager/ipc-subnet-config.yml
@@ -24,21 +24,24 @@
ssh_user: "philip"
ipc_user: "ipc"
role: "primary" # First node initialized
- private_key: "0x867c766fa9ea9fab8929a6ec6a4fe32ccf33969035d3d7f2262f6eb8021b56d8"
+ # Private key loaded from environment: IPC_VALIDATORS_0_PRIVATE_KEY
+ private_key: "${IPC_VALIDATORS_0_PRIVATE_KEY}"
- name: "validator-2"
ip: "136.115.12.207"
internal_ip: "10.128.0.3"
ssh_user: "philip"
ipc_user: "ipc"
role: "secondary"
- private_key: "0x40aa709b5d6765411f2afbdb0b4ae00e45a06425b37a386334c80482b203d04d"
+ # Private key loaded from environment: IPC_VALIDATORS_1_PRIVATE_KEY
+ private_key: "${IPC_VALIDATORS_1_PRIVATE_KEY}"
- name: "validator-3"
ip: "35.202.56.101"
internal_ip: "10.128.0.4"
ssh_user: "philip"
ipc_user: "ipc"
role: "secondary"
- private_key: "0xc1099a062e296366a2ac3b26ac80a409833e6a74edbf677a0bd14580d2c68ea2"
+ # Private key loaded from environment: IPC_VALIDATORS_2_PRIVATE_KEY
+ private_key: "${IPC_VALIDATORS_2_PRIVATE_KEY}"
# Network Configuration
network:
cometbft_p2p_port: 26656
diff --git a/scripts/ipc-subnet-manager/lib/config.sh b/scripts/ipc-subnet-manager/lib/config.sh
--- a/scripts/ipc-subnet-manager/lib/config.sh
+++ b/scripts/ipc-subnet-manager/lib/config.sh
@@ -484,7 +484,12 @@
else
local remote_ipc_dir
remote_ipc_dir=$(get_config_value "paths.ipc_config_dir")
- remote_ipc_dir="${remote_ipc_dir/#\~/$HOME}"
+ # Expand tilde using remote ipc_user's home, not local $HOME
+ local ipc_user
+ ipc_user=$(get_config_value "validators[$validator_idx].ipc_user")
+ if [ -n "$ipc_user" ] && [ "$ipc_user" != "null" ]; then
+ remote_ipc_dir="${remote_ipc_dir/#\~//home/$ipc_user}"
+ fi
genesis_json_yml="$remote_ipc_dir/genesis_${subnet_id_no_slash//\//_}.json"
genesis_sealed_yml="$remote_ipc_dir/genesis_sealed_${subnet_id_no_slash//\//_}.json"
fi

This Bugbot Autofix run was free. To enable autofix for future PRs, go to the Cursor dashboard.
```yaml
ssh_user: "philip"
ipc_user: "ipc"
role: "primary" # First node initialized
```
Private keys committed in configuration file
High Severity
The config file contains three validator private_key values alongside real external IPs (GCP hosts on Calibration testnet). While these may be testnet keys, committing private keys to a public repository is a serious security concern — anyone can impersonate these validators, drain funds, or disrupt the subnet. These keys and the associated IP addresses/infrastructure details are now permanently in git history.
Additional Locations (2)
1. **If config is wrong:** Fix default.toml (or re-run node init with correct node-init.yml)
2. **If env vars not applied:** Fix start script invocation (wrapper script, quoting, or use systemd with Environment=)
3. **If binary/branch differs:** Build from feature/subnet-bootstrapping or adapt to that branch's config
Troubleshooting doc contains hardcoded infrastructure IPs
Medium Severity
This entire file appears to be a personal debugging/troubleshooting session log with hardcoded infrastructure IPs (e.g., 34.16.93.183), specific user accounts (philip), specific subnet IDs, and step-by-step SSH commands referencing a specific deployment. It reads like personal notes that were accidentally committed rather than reusable documentation.
| else | ||
| local remote_ipc_dir | ||
| remote_ipc_dir=$(get_config_value "paths.ipc_config_dir") | ||
| remote_ipc_dir="${remote_ipc_dir/#\~/$HOME}" |
Remote tilde expansion uses local HOME variable
Medium Severity
When constructing remote genesis file paths for node-init.yml, remote_ipc_dir is expanded using ${remote_ipc_dir/#\~/$HOME} which substitutes the local machine's $HOME value. If the local user's home differs from the remote ipc user's home (e.g., local is /home/philip but remote is /home/ipc), the paths written into the YAML will be wrong, causing ipc-cli node init on the remote to fail to find genesis files.
We should cross-compile and push the binaries, but it can wait for another iteration.
- Changed the build runner from Ubuntu 22.04 to Ubuntu 24.04 in the GitHub Actions workflow configuration.
- Updated the Dockerfile to use Ubuntu 24.04 to align with the glibc 2.39 requirement for binaries built on the Ubuntu 24.04 CI runner.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 7 total unresolved issues (including 6 from previous reviews).
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
```bash
parent_gateway = "$parent_gateway"

EOF
fi
```
Omitting topdown overrides doesn't actually disable topdown syncing
High Severity
When init.topdown.enabled is false, the [ipc.topdown] section is simply omitted from fendermint-overrides. This doesn't disable topdown — it leaves fendermint's default topdown config intact (generated by ipc-cli node init using the parent.rpc_url from node-init.yml). Fendermint will still attempt to query the parent chain for the subnet, which fails for bootstrap subnets not yet activated on-chain, potentially causing node startup failures or continuous errors.
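One hedged repair is to emit the `[ipc.topdown]` section in both branches so the override always wins over the `ipc-cli node init` default, rather than omitting it. Note that the `enabled` key below is an assumption for illustration — the actual switch must be verified against fendermint's settings schema:

```shell
#!/usr/bin/env bash
# Write an explicit topdown override in BOTH cases, instead of omitting
# the section (which leaves fendermint's generated defaults in force).
write_topdown_overrides() {
    local enabled=$1 parent_gateway=$2 out=$3
    if [ "$enabled" = "true" ]; then
        cat >> "$out" <<EOF
[ipc.topdown]
enabled = true
parent_gateway = "$parent_gateway"
EOF
    else
        # 'enabled' is an assumed key name -- confirm the real disable
        # mechanism in fendermint's settings before relying on this.
        cat >> "$out" <<EOF
[ipc.topdown]
enabled = false
EOF
    fi
}

tmp=$(mktemp)
write_topdown_overrides false "" "$tmp"
grep -q 'enabled = false' "$tmp" && echo "topdown explicitly disabled"
```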



Note
Medium Risk
Medium risk because it changes CI build OS/container base and substantially rewires remote subnet initialization/bootstrapping, config generation, and node startup behavior, which could break deployments if assumptions about paths/ports/SSH differ.
Overview

Aligns build/runtime environments to Ubuntu 24.04 by moving GitHub Actions runners to `ubuntu-24.04` and switching the `fendermint` runner image base to `ubuntu:24.04` to match the newer glibc requirement.

Expands `ipc-subnet-manager` to support fresh-host bootstrapping and more resilient remote ops. Adds a `bootstrap` command (installs deps, clones/builds IPC), an `init --resume` mode, and a `diagnose` command; updates configs to support `internal_ip` for peer traffic, safer local-vs-remote path handling for `~/.ipc`, and more robust genesis/config propagation across hosts.

Hardens node start/health/peering flows. Node startup now reliably sets resolver/subnet env vars via a generated `start-node.sh`, health checks gain an optional `--wait`, log tailing avoids remote `grep` pipes, and multiple SSH/sed operations are reworked to avoid quoting/login-shell hangs (new `exec_on_host_simple`, keepalive options, and script-based remote edits).

Written by Cursor Bugbot for commit 7431ec8. This will update automatically on new commits.