fix(update): wait for daemon process exit, not RPC port, before restart#74

Merged
Rinse12 merged 5 commits into master from fix/update-install-restart-race
Jun 8, 2026

Conversation

@Rinse12 Rinse12 commented Jun 8, 2026

Member

Problem

bitsocial update install --restart-daemons stops the running daemon, then restarts it with the new binary. But it only waited for the old daemon's RPC port (9138) to free before restarting.

The RPC port is released by daemonServer.destroy() — which runs before the daemon finishes killing its kubo child (the exit hook awaits killKuboProcess() last). So the new daemon could be spawned while the old kubo still held the IPFS API port (50019), and it dies on startup with:

Cannot start IPFS daemon because the IPFS API port 127.0.0.1:50019 (configured as http://127.0.0.1:50019/api/v0) is already in use.

…so port 9138 never comes up. This was hit on a production host after an in-place update.

This is a distinct race from the internal keepKuboUp races fixed in #71 — it's in the update install orchestration. See #70 (comment).

Fix

Wait for the old daemon's PID to actually exit before restarting, instead of just its RPC port. The daemon's exit hook (now reliable after #71) kills kubo before the process exits, so a gone PID guarantees the kubo API port is free.

  • update install now polls process.kill(pid, 0) until ESRCH (bounded at 60s) instead of tcpPortUsed.waitUntilFree(rpcPort).
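The polling described above can be sketched roughly as follows. This is a minimal illustration, not the PR's `_waitForProcessExit`: the function names, the 100 ms poll interval, and the injectable `probe` parameter are assumptions added here so the loop can be exercised without a real daemon.

```typescript
// Sketch only: names and defaults are illustrative, not taken from the PR.

/** Returns true while a process with this PID appears to exist. */
function isProcessAlive(pid: number): boolean {
  try {
    // Signal 0 runs the existence/permission check without delivering a signal.
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // ESRCH means no such process. Any other error (for example EPERM) means
    // the process may still exist, so keep treating it as alive.
    return (err as NodeJS.ErrnoException).code !== "ESRCH";
  }
}

/** Polls until the PID is gone; resolves false if the timeout elapses first. */
async function waitForProcessExit(
  pid: number,
  timeoutMs: number,
  pollMs = 100,
  probe: (pid: number) => boolean = isProcessAlive,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (probe(pid)) {
    if (Date.now() >= deadline) return false;
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
  return true;
}
```

A caller would restart the daemon only when `waitForProcessExit(oldPid, timeoutMs)` resolves `true`, and abort with an error otherwise.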

Test

New end-to-end test test/cli/update-install-restart-race.test.ts drives the real bitsocial update install:

  • same-version install (skips npm, runs the full stop + _restartDaemons path);
  • isolated XDG_DATA_HOME so it only sees the test daemon;
  • a bitsocial PATH shim that records whether the kubo port is still bound at the moment of restart, then exec's the real daemon.

That marker is the discriminator (the eventual daemon state self-heals via the watchdog, so asserting end-state wouldn't catch it). Red before the fix (inuse), green after (free).
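For illustration, the "is the kubo port still bound" observation could be made with a TCP connect probe like the one below. The shim in this PR may do it differently (it may well be a shell script); `probePort` is a hypothetical name, and treating every connection error as "free" is a simplifying assumption.

```typescript
// Sketch only: the real PATH shim may probe the port in another way.
import * as net from "node:net";

/** Resolves "inuse" if something accepts TCP connections on host:port, else "free". */
function probePort(port: number, host = "127.0.0.1"): Promise<"inuse" | "free"> {
  return new Promise((resolve) => {
    const socket = net.connect({ port, host });
    const finish = (state: "inuse" | "free") => {
      socket.destroy();
      resolve(state);
    };
    // A successful connect means a listener still holds the port.
    socket.once("connect", () => finish("inuse"));
    // Assumption: any error (typically ECONNREFUSED) is reported as "free".
    socket.once("error", () => finish("free"));
  });
}
```

A shim built this way would call `probePort(50019)` just before handing off to the real daemon and write the result to the marker file that the test reads.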

A PKC_CLI_TEST_KUBO_SHUTDOWN_DELAY_MS hook in daemon.ts (mirroring the existing PKC_CLI_TEST_* hooks) makes the window deterministic.
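A test-only delay hook of this kind might look like the sketch below. Only the environment variable name comes from the PR; the helper names and the parsing rules (unset, non-numeric, or non-positive values mean no delay) are assumptions.

```typescript
// Sketch only: parsing rules and helper names are assumptions, not the PR's code.

/** Reads a positive millisecond delay from an env var; 0 when unset or invalid. */
function testDelayMs(env: Record<string, string | undefined>, name: string): number {
  const raw = env[name];
  if (raw === undefined) return 0;
  const ms = Number(raw);
  return Number.isFinite(ms) && ms > 0 ? ms : 0;
}

/** Hypothetical stand-in for the point inside killKuboProcess() where the delay applies. */
async function delayBeforeKill(
  env: Record<string, string | undefined> = process.env,
): Promise<void> {
  const ms = testDelayMs(env, "PKC_CLI_TEST_KUBO_SHUTDOWN_DELAY_MS");
  if (ms > 0) await new Promise((resolve) => setTimeout(resolve, ms));
}
```

With the variable unset, the helper returns immediately, so production shutdown timing is unaffected.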

Verification

  • New test: red → green
  • Full suite: 252 passed / 1 skipped
  • Build clean

Follow-up (not in this PR)

update install's restart spawns a detached daemon that bypasses systemd (bitsocial.service). On systemd-managed hosts the documented update flow should restart via the unit. Tracked separately.

Summary by CodeRabbit

  • Bug Fixes

    • Ensure daemons fully exit before restart during update/install, fixing a restart race that could cause installations to fail.
  • Tests

    • Added a deterministic integration test that reproduces and verifies correct restart timing during updates.
  • Chores

    • Introduced a configurable daemon shutdown timeout (and a test-only delay) to make shutdown and restart behavior more predictable.

coderabbitai Bot commented Jun 8, 2026


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0b90bbde-6cff-4fd1-978e-e29fe1c816a9

📥 Commits

Reviewing files that changed from the base of the PR and between f3f403c and 475a943.

📒 Files selected for processing (2)
  • test/cli/update-install-restart-race.test.ts
  • test/kubo/kuboRpcGateway.integration.test.ts
📝 Walkthrough

Replaces port-based daemon shutdown waits with PID-exit polling using a new DAEMON_SHUTDOWN_TIMEOUT_MS constant, adds a test-only PKC_CLI_TEST_KUBO_SHUTDOWN_DELAY_MS hook to reproduce the race, and adds an integration test that verifies the old kubo API port is freed before restarting daemons.

Changes

Restart race fix: process-exit polling and deterministic reproduction test

| Layer / File(s) | Summary |
| --- | --- |
| Shutdown timeout constant and imports: src/common-utils/daemon-state.ts, src/cli/commands/daemon.ts, src/cli/commands/update/install.ts | Adds exported DAEMON_SHUTDOWN_TIMEOUT_MS = 120000 and updates imports to use it in async exit/wait logic. |
| Process-exit polling for daemon shutdown: src/cli/commands/update/install.ts | Replaces port-based wait loop with _waitForProcessExit(pid, timeoutMs), polling process.kill(pid, 0) until ESRCH or timeout; treats EPERM/other kill errors as non-terminal during polling. |
| Test hooks and race reproduction framework: src/cli/commands/daemon.ts, test/cli/update-install-restart-race.test.ts, test/common-utils/daemon-state.test.ts | Adds PKC_CLI_TEST_KUBO_SHUTDOWN_DELAY_MS delay inside killKuboProcess() for deterministic testing, switches asyncExitHook wait to the shutdown constant, and adds an integration test that uses a PATH shim and marker file to assert the old kubo API port is free before restart; adjusts a PID-reuse regression test to skip on Windows. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • bitsocialnet/bitsocial-cli#71: Both PRs modify src/cli/commands/daemon.ts daemon shutdown timing to handle restart-race conditions using environment-controlled delays for deterministic reproduction.
  • bitsocialnet/bitsocial-cli#22: Both PRs modify daemon restart/shutdown logic in src/cli/commands/update/install.ts, with this PR's process-exit polling directly tied to the retrieval PR's daemon restart flow.
  • bitsocialnet/bitsocial-cli#56: Both PRs modify the daemon restart logic in src/cli/commands/update/install.ts—this PR changes how daemons fully stop before restart, while the retrieval PR adds a user hint immediately before _restartDaemons() is called.

Poem

🐰 I poked the code with gentle paws,
Waiting on PIDs instead of ports,
A test-time pause to catch the cause,
Now restarts wait for proper sorts,
Hooray — no races in our forts! 🎉

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main fix: switching from waiting for RPC port to waiting for daemon process exit before restart. This aligns with the core change across multiple files. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cli/commands/update/install.ts`:
- Around line 75-78: The restart flow uses a hardcoded 60000ms in the call to
_waitForProcessExit, which is shorter than the daemon's own shutdown budget;
extract a shared constant (e.g., DAEMON_SHUTDOWN_TIMEOUT_MS) and use it in both
the daemon shutdown hook and in the update/install restart logic so both paths
use the same timeout; update the call to this._waitForProcessExit(d.pid, ...) to
pass the shared DAEMON_SHUTDOWN_TIMEOUT_MS constant and ensure the daemon
shutdown code references the same constant.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 10537132-dd0f-42bb-b9e1-c96055baee17

📥 Commits

Reviewing files that changed from the base of the PR and between 6c3aac4 and aa3bd70.

📒 Files selected for processing (3)
  • src/cli/commands/daemon.ts
  • src/cli/commands/update/install.ts
  • test/cli/update-install-restart-race.test.ts

Comment thread src/cli/commands/update/install.ts Outdated
…rt (issue #70)

`bitsocial update install --restart-daemons` only waited for the old daemon's
RPC port to free before restarting it. The RPC port is released by
daemonServer.destroy() *before* the daemon finishes killing its kubo child, so
the new daemon could be spawned while the old kubo still held the IPFS API port
and die on startup with "IPFS API port already in use" — port 9138 never comes
up. Seen in prod after an in-place update.

Wait for the old daemon's PID to actually exit instead. The daemon's exit hook
kills kubo before the process exits, so a gone PID guarantees the kubo API port
is free before we restart.

Adds an end-to-end test driving the real `update install` (same-version so it
skips npm, a `bitsocial` PATH shim that records whether the kubo port is still
bound at restart time) plus a deterministic timing hook
(PKC_CLI_TEST_KUBO_SHUTDOWN_DELAY_MS). The test isolates the daemon-state
directory by overriding HOME (env-paths derives it from HOME on every platform;
XDG_DATA_HOME only applies on Linux) so it never enumerates or restarts other
tests' daemons running in parallel.

@Rinse12 Rinse12 force-pushed the fix/update-install-restart-race branch from aa3bd70 to 73458b4 Compare June 8, 2026 08:06

The update-install restart flow waited only 60s for a stopped daemon's PID
to exit, but the daemon's async exit hook is given 120s to shut down kubo +
the RPC server. A slow-but-valid shutdown (60-120s) would abort
`update install --restart-daemons` midway with daemons stopped and nothing
installed.

Hoist a shared DAEMON_SHUTDOWN_TIMEOUT_MS constant and drive both the
daemon exit hook and the update-install wait from it, so the orchestrator's
patience matches the daemon's contract. Folds the timeout into the error
message so the two can't drift.

Addresses CodeRabbit review on #74.
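
As a rough sketch of what "folds the timeout into the error message" can mean: the message is derived from the same constant that bounds the wait, so changing one changes the other. The 120000 value is the one reported in the review walkthrough; the helper name and the message wording are made up for illustration.

```typescript
// Sketch only: helper name and message wording are illustrative.

/** Single source of truth for how long a daemon may take to shut down. */
const DAEMON_SHUTDOWN_TIMEOUT_MS = 120000;

/** Builds the timeout error text from the constant so it cannot drift from the wait. */
function shutdownTimeoutMessage(
  pid: number,
  timeoutMs: number = DAEMON_SHUTDOWN_TIMEOUT_MS,
): string {
  return `Daemon (PID ${pid}) did not exit within ${timeoutMs / 1000}s`;
}
```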

The PID-reuse scenario (issue #66) is a Docker-on-Linux problem and the
identity check that detects it relies on Unix process introspection
(/proc, ps) plus Unix tooling (sleep, bash). On Windows the identity is
undeterminable, so the code intentionally degrades to liveness-only — the
safe fallback. The test asserted Unix-only behavior, so it failed
deterministically on windows-latest CI.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/cli/update-install-restart-race.test.ts`:
- Around line 162-179: The test currently checks marker contents but not the
exit status, which can produce false positives; after calling
runUpdateInstall(...) and obtaining result, add an explicit assertion that
result.exitCode (or result.status depending on test harness) equals 0 to ensure
the command succeeded before inspecting markerFile — locate this right after the
const result = await runUpdateInstall(...) line (in the same test using
sharedEnv, markerFile and observations) and fail the test if the process did not
exit successfully.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1e362d39-2f25-48d8-9a95-e74857aa0364

📥 Commits

Reviewing files that changed from the base of the PR and between aa3bd70 and f3f403c.

📒 Files selected for processing (5)
  • src/cli/commands/daemon.ts
  • src/cli/commands/update/install.ts
  • src/common-utils/daemon-state.ts
  • test/cli/update-install-restart-race.test.ts
  • test/common-utils/daemon-state.test.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/cli/commands/daemon.ts
  • src/cli/commands/update/install.ts

Comment thread test/cli/update-install-restart-race.test.ts
Rinse12 added 2 commits June 8, 2026 10:11
…arker

Addresses CodeRabbit review on PR #74: the restart-race test inferred
success only indirectly via the marker observations. Add an explicit
exitCode === 0 assertion so a non-zero exit from the command itself
fails the test even when the marker happens to look right.

The kubo RPC + gateway integration beforeAll picked ephemeral ports with
getAvailablePort(), which closes its probe socket before returning the
port number. kubo binds the port itself afterwards, leaving a TOCTOU
window where another process can claim it. On CI this surfaced as the
gateway failing to bind (serveHTTPGateway: ... address already in use)
and the suite failing flakily.

Wrap startup in a bounded retry (4 attempts) that picks fresh ports and a
fresh repo on any 'already in use' rejection, and re-throws every other
error so real regressions still fail fast.
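
The retry shape described in this commit message could be sketched as below. It is not the test suite's actual code: `startWithRetry` and `isAddressInUse` are hypothetical names, and `start` stands in for whatever picks fresh ports, creates a fresh repo, and launches kubo.

```typescript
// Sketch only: `start` is a stand-in for the real kubo startup routine.

/** True when an error looks like a port-bind collision. */
function isAddressInUse(err: unknown): boolean {
  const message = err instanceof Error ? err.message : String(err);
  return /already in use/i.test(message);
}

/** Calls start() up to maxAttempts times, retrying only on bind collisions. */
async function startWithRetry<T>(
  start: (attempt: number) => Promise<T>,
  maxAttempts = 4,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      // The caller is expected to choose fresh ports and a fresh repo on each call.
      return await start(attempt);
    } catch (err) {
      // Unrelated failures surface immediately so real regressions fail fast.
      if (!isAddressInUse(err)) throw err;
      lastError = err;
    }
  }
  throw lastError;
}
```

Bounding the attempts keeps a persistent collision from hanging the suite, while the narrow error match avoids masking other startup failures.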
@Rinse12 Rinse12 merged commit 6d06d58 into master Jun 8, 2026
4 checks passed