
fix(ant-dev): clean up orphan anvil/antnode and stale node identities on stop#81

Merged: Nic-dorman merged 1 commit into main from fix/ant-dev-stop-orphan-anvil on May 14, 2026

@Nic-dorman
Collaborator

Helps mitigate #73 (Option B path).

ant-devnet keeps anvil alive past Testnet::new's scope with std::mem::forget(testnet) and relies on graceful Drop at process exit to clean it up. SIGTERM/SIGKILL skip destructors, so every ant dev stop leaks one anvil child and one ~/.local/share/ant/nodes/<peer_id>/ directory per spawned node (25 dirs on the default preset). After a handful of start/stop cycles — and especially after kill-mid-startup events — the LXC accumulates orphan anvils plus 100+ stale node dirs, and subsequent ant dev start runs flake or hang.

This is the Option B workaround proposed in #73 (the band-aid at the ant-dev layer). The proper fix is Option A: change ant-devnet/main.rs to use tempfile::TempDir + a tokio signal handler, mirroring how ant-client's MiniTestnet and ant-node's tests/e2e/testnet.rs already do it. That lives in WithAutonomi/ant-node and will go up as a separate PR there.

Changes in ant dev stop

  • pkill -9 -f anvil and pkill -9 -f .../antnode in addition to the existing ant-devnet pkill
  • rm -rf ~/.local/share/ant/{nodes,spill} so the next ant dev start begins from a clean slate
  • Centralised the existing pkill call sites into a _pkill() helper for readability

No behaviour change on Windows — pkill and the data-dir cleanup are POSIX-only branches.
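
The stop-time cleanup described above can be sketched roughly as follows. This is an illustration under assumptions, not the merged code: only the `_pkill()` name comes from the PR, while `stop_cleanup`, its parameters, and the bare `antnode` pattern (the PR elides the real path) are hypothetical. The `run` and `data_root` arguments are injectable so the sketch can be exercised without killing real processes.

```python
import os
import shutil
import subprocess
from pathlib import Path

# Patterns named in the PR. The antnode path is elided there ("..."),
# so a bare name stands in for it here (assumption).
KILL_PATTERNS = ("ant-devnet", "anvil", "antnode")
DATA_SUBDIRS = ("nodes", "spill")


def _pkill(pattern, run=subprocess.run):
    """Send SIGKILL to every process whose full command line matches pattern."""
    # pkill exits with status 1 when nothing matched, so that is not an error.
    return run(["pkill", "-9", "-f", pattern], check=False)


def stop_cleanup(data_root=None, run=subprocess.run, posix=None):
    """Kill leftover devnet processes and remove stale node data (POSIX only)."""
    if posix is None:
        posix = os.name == "posix"
    if not posix:
        return []  # Windows branch: behaviour unchanged
    root = Path(data_root) if data_root else Path.home() / ".local/share/ant"
    for pattern in KILL_PATTERNS:
        _pkill(pattern, run)
    removed = []
    for name in DATA_SUBDIRS:
        target = root / name
        if target.exists():
            shutil.rmtree(target, ignore_errors=True)
            removed.append(target)
    return removed
```

Because `pkill -f` matches against the whole command line, a short pattern such as `anvil` also hits unrelated processes that merely mention it, which is acceptable on a dedicated dev box but worth keeping in mind elsewhere.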

Test plan

  • Before: ant dev start followed by ant dev stop left an orphan anvil and 25 dirs in ~/.local/share/ant/nodes/. Reproducible every run.
  • After: the same start/stop sequence leaves zero processes and an empty data dir:
    --- after stop: should be empty ---
    (none - clean)
    --- nodes dir gone? ---
    ls: cannot access '/home/nic/.local/share/ant/nodes': No such file or directory
    (no nodes dir - clean)
    
  • Full cross-SDK e2e harness still green after the change (no SDK breakage)

… on stop

ant-devnet keeps anvil alive past Testnet::new's scope via std::mem::forget
on the AnvilInstance, then relies on graceful Drop at process exit to
clean it up. SIGTERM/SIGKILL skip destructors, so every ant dev stop
leaks one anvil child and one ~/.local/share/ant/nodes/<peer_id>/ tree
for each of the spawned nodes. After a handful of start/stop or
killed-mid-startup cycles, the LXC accumulates orphan anvils plus 100+
stale node dirs, and subsequent ant dev start runs flake or hang.

This is a workaround at the ant-dev layer (Option B in #73). The proper
fix lives in ant-devnet itself (Option A: tempfile::TempDir + tokio
signal handler, mirroring how ant-client's MiniTestnet and ant-node's
tests/e2e/testnet.rs already do it) and will be a separate PR against
WithAutonomi/ant-node.

In ant dev stop now:
- pkill anvil and antnode in addition to ant-devnet
- rm -rf ~/.local/share/ant/nodes and ~/.local/share/ant/spill so the
  next start begins from a clean state
- Centralise the pkill calls into a _pkill() helper

No behaviour change on Windows (the pkill / rm paths are POSIX-only).

Closes #16 (local task); helps mitigate #73 (upstream).
@Nic-dorman Nic-dorman merged commit 6dceb66 into main May 14, 2026
@Nic-dorman Nic-dorman deleted the fix/ant-dev-stop-orphan-anvil branch May 14, 2026 11:06
Nic-dorman added a commit that referenced this pull request May 14, 2026
…l) (#88)

`ant dev start` previously hardcoded `ant-devnet --preset default` (25
nodes). On a cold-cache fresh start that reproducibly hits the
manifest-wait timeout (#73) — even after #81's stop-time cleanup —
because spinning up 25 nodes exceeds the 6-min wait window. New
contributors following SETUP.md hit `Timed out waiting for devnet
manifest` on their first run with no obvious cause.

`--preset small` (10 nodes) finishes in seconds and is plenty for SDK
development. Switch the default to `small` and expose `--preset` so
users can opt back into `default` / `large` for stress runs.

## Why default changes (not just a new flag)

Defaults should make the documented happy path actually work. With
`default` as the default, `ant dev start --ant-node-dir …` per SETUP.md
fails on cold cache; with `small`, it works. The proper fix is #73
Option A in `WithAutonomi/ant-node` (`tempfile::TempDir` + tokio signal
handler in `ant-devnet/main.rs`) — once that lands and `default` works
reliably, this flag still gives users a fast option for tight iteration
loops.

## Test plan

- [x] `ant dev start --ant-node-dir ~/Projects/ant-node` (no --preset) →
      uses `small`, devnet ready in ~10s
- [x] `ant dev start --preset default --ant-node-dir …` → reproduces #73
      manifest-timeout symptom (as expected — the underlying ant-devnet
      bug is unchanged)
- [x] `ant dev start --help` shows the new flag with all three choices
- [x] Cross-SDK e2e harness (15/15 SDKs) green with the default preset
      change

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Nic-dorman added a commit that referenced this pull request May 14, 2026
Cuts v0.7.1 atop v0.7.0. Primarily refreshes the upstream `ant-core`
pin to the `ant-cli-v0.2.3` release tag (no API change for antd
consumers). Bundles a substantial round of cross-SDK example/build
fixes, dispatcher improvements, and CI/release workflow hardening.

## antd

- chore(antd): bump ant-core to v0.2.3 (#85)

## SDK example/build fixes

- fix(antd-php): use cost-estimate fields in example 02 (#74)
- fix(antd-elixir): print cost-estimate fields in examples (#75)
- fix(antd-lua): add missing discover module to rockspec (#76)
- fix(antd-kotlin): make put-response cost optional + ship gradle wrapper (#77)
- fix(antd-zig): pass payment_mode to dataPutPublic/dataPutPrivate (#79)
- fix(antd-java): make examples runnable via gradle :examples subproject (#80)
- fix(antd-zig): align stdlib API to declared 0.14.x minimum (#82)
- fix(antd-swift): port to Linux + populate cost-estimate fields (#87)

## ant-dev (developer CLI)

- fix(ant-dev): clean up orphan anvil/antnode and stale node identities on stop (#81)
- fix(ant-dev): tooling cluster — flag alias, sys.executable, anvil preflight, README (#83)
- feat(ant-dev): expand `ant dev example` to dispatch all 15 SDKs (#84)
- fix(ant-dev): dispatcher swift no-skip + lua LUA_PATH wrap (#86)
- feat(ant-dev): expose --preset flag on `ant dev start` (default: small) (#88)

## CI / release

- ci: authenticate arduino/setup-protoc on ci.yml too (#60)
- feat(release): publish antd-linux-arm64 artifact (#89)

## Validation

15/15 SDKs round-tripped end-to-end against a daemon built from this
commit on a Linux dev box (Ubuntu 24.04, 0.7.1 atop ant-core v0.2.3).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>