fix(update): wait for daemon process exit, not RPC port, before restart#74
Conversation
|
Warning Review limit reached
More reviews will be available in 31 minutes and 4 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughReplaces port-based daemon shutdown waits with PID-exit polling using a new DAEMON_SHUTDOWN_TIMEOUT_MS constant, adds a test-only PKC_CLI_TEST_KUBO_SHUTDOWN_DELAY_MS hook to reproduce the race, and adds an integration test that verifies the old kubo API port is freed before restarting daemons. ChangesRestart race fix: process-exit polling and deterministic reproduction test
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/cli/commands/update/install.ts`:
- Around line 75-78: The restart flow uses a hardcoded 60000ms in the call to
_waitForProcessExit, which is shorter than the daemon's own shutdown budget;
extract a shared constant (e.g., DAEMON_SHUTDOWN_TIMEOUT_MS) and use it in both
the daemon shutdown hook and in the update/install restart logic so both paths
use the same timeout; update the call to this._waitForProcessExit(d.pid, ...) to
pass the shared DAEMON_SHUTDOWN_TIMEOUT_MS constant and ensure the daemon
shutdown code references the same constant.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 10537132-dd0f-42bb-b9e1-c96055baee17
📒 Files selected for processing (3)
src/cli/commands/daemon.tssrc/cli/commands/update/install.tstest/cli/update-install-restart-race.test.ts
…rt (issue #70) `bitsocial update install --restart-daemons` only waited for the old daemon's RPC port to free before restarting it. The RPC port is released by daemonServer.destroy() *before* the daemon finishes killing its kubo child, so the new daemon could be spawned while the old kubo still held the IPFS API port and die on startup with "IPFS API port already in use" — port 9138 never comes up. Seen in prod after an in-place update. Wait for the old daemon's PID to actually exit instead. The daemon's exit hook kills kubo before the process exits, so a gone PID guarantees the kubo API port is free before we restart. Adds an end-to-end test driving the real `update install` (same-version so it skips npm, a `bitsocial` PATH shim that records whether the kubo port is still bound at restart time) plus a deterministic timing hook (PKC_CLI_TEST_KUBO_SHUTDOWN_DELAY_MS). The test isolates the daemon-state directory by overriding HOME (env-paths derives it from HOME on every platform; XDG_DATA_HOME only applies on Linux) so it never enumerates or restarts other tests' daemons running in parallel.
aa3bd70 to
73458b4
Compare
The update-install restart flow waited only 60s for a stopped daemon's PID to exit, but the daemon's async exit hook is given 120s to shut down kubo + the RPC server. A slow-but-valid shutdown (60-120s) would abort `update install --restart-daemons` midway with daemons stopped and nothing installed. Hoist a shared DAEMON_SHUTDOWN_TIMEOUT_MS constant and drive both the daemon exit hook and the update-install wait from it, so the orchestrator's patience matches the daemon's contract. Folds the timeout into the error message so the two can't drift. Addresses CodeRabbit review on #74.
The PID-reuse scenario (issue #66) is a Docker-on-Linux problem and the identity check that detects it relies on Unix process introspection (/proc, ps) plus Unix tooling (sleep, bash). On Windows the identity is undeterminable, so the code intentionally degrades to liveness-only — the safe fallback. The test asserted Unix-only behavior, so it failed deterministically on windows-latest CI.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/cli/update-install-restart-race.test.ts`:
- Around line 162-179: The test currently checks marker contents but not the
exit status, which can produce false positives; after calling
runUpdateInstall(...) and obtaining result, add an explicit assertion that
result.exitCode (or result.status depending on test harness) equals 0 to ensure
the command succeeded before inspecting markerFile — locate this right after the
const result = await runUpdateInstall(...) line (in the same test using
sharedEnv, markerFile and observations) and fail the test if the process did not
exit successfully.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1e362d39-2f25-48d8-9a95-e74857aa0364
📒 Files selected for processing (5)
src/cli/commands/daemon.tssrc/cli/commands/update/install.tssrc/common-utils/daemon-state.tstest/cli/update-install-restart-race.test.tstest/common-utils/daemon-state.test.ts
🚧 Files skipped from review as they are similar to previous changes (2)
- src/cli/commands/daemon.ts
- src/cli/commands/update/install.ts
…arker Addresses CodeRabbit review on PR #74: the restart-race test inferred success only indirectly via the marker observations. Add an explicit exitCode === 0 assertion so a non-zero exit from the command itself fails the test even when the marker happens to look right.
The kubo RPC + gateway integration beforeAll picked ephemeral ports with getAvailablePort(), which closes its probe socket before returning the port number. kubo binds the port itself afterwards, leaving a TOCTOU window where another process can claim it. On CI this surfaced as the gateway failing to bind (serveHTTPGateway: ... address already in use) and the suite failing flakily. Wrap startup in a bounded retry (4 attempts) that picks fresh ports and a fresh repo on any 'already in use' rejection, and re-throws every other error so real regressions still fail fast.
Problem
bitsocial update install --restart-daemonsstops the running daemon, then restarts it with the new binary. But it only waited for the old daemon's RPC port (9138) to free before restarting.The RPC port is released by
daemonServer.destroy()— which runs before the daemon finishes killing its kubo child (the exit hook awaitskillKuboProcess()last). So the new daemon could be spawned while the old kubo still held the IPFS API port (50019), and it dies on startup with:…so port 9138 never comes up. This was hit on a production host after an in-place update.
This is a distinct race from the internal
keepKuboUpraces fixed in #71 — it's in theupdate installorchestration. See #70 (comment).Fix
Wait for the old daemon's PID to actually exit before restarting, instead of just its RPC port. The daemon's exit hook (now reliable after #71) kills kubo before the process exits, so a gone PID guarantees the kubo API port is free.
update installnow pollsprocess.kill(pid, 0)untilESRCH(bounded at 60s) instead oftcpPortUsed.waitUntilFree(rpcPort).Test
New end-to-end test
test/cli/update-install-restart-race.test.tsdrives the realbitsocial update install:_restartDaemonspath);XDG_DATA_HOMEso it only sees the test daemon;bitsocialPATH shim that records whether the kubo port is still bound at the moment of restart, then exec's the real daemon.That marker is the discriminator (the eventual daemon state self-heals via the watchdog, so asserting end-state wouldn't catch it). Red before the fix (
inuse), green after (free).A
PKC_CLI_TEST_KUBO_SHUTDOWN_DELAY_MShook indaemon.ts(mirroring the existingPKC_CLI_TEST_*hooks) makes the window deterministic.Verification
Follow-up (not in this PR)
update install's restart spawns a detached daemon that bypasses systemd (bitsocial.service). On systemd-managed hosts the documented update flow should restart via the unit. Tracked separately.Summary by CodeRabbit
Bug Fixes
Tests
Chores