Skip to content

feat: [ENG-2464] enhance cold start-up daemon#546

Merged
bao-byterover merged 4 commits intomainfrom
fix/ENG-2464
Apr 27, 2026
Merged

feat: [ENG-2464] enhance cold start-up daemon#546
bao-byterover merged 4 commits intomainfrom
fix/ENG-2464

Conversation

@bao-byterover
Copy link
Copy Markdown
Collaborator

@bao-byterover bao-byterover commented Apr 25, 2026

Summary

  • Problem: On a cold daemon, daemon-routed CLI commands (brv status, brv locations, brv vc *, brv push, brv pull, brv login, brv hub *, brv providers list, brv worktree list, brv source list, etc.) take noticeably long to return. Warm invocations are fast.
  • Why it matters: First command after a reboot, after brv restart, or after the daemon idle-timeout is perceptibly slow — affects nearly every daemon-routed command. Has been in production since v2.0.0.
  • What changed:
    1. Reordered daemon bootstrap so the Socket.IO transport does not accept client connections until all event handlers are registered. Eliminates the handler-registration race where clients could connect and fire requests before listeners were wired.
    2. Tuned withDaemonRetry defaults in src/oclif/lib/daemon-client.ts from MAX_RETRIES=3 × 2 s (4 s budget) to MAX_RETRIES=10 × 1 s (9 s budget). With shorter delay, the cold-start ECONNREFUSED window (heartbeat fresh but port not open during handler-setup) recovers in ~1 s on the typical case instead of ~2 s, and the wider budget covers slow-OIDC cold starts that would previously fail.
  • What did NOT change (scope boundary): No handler behavior changed. No API contract changed. AuthHandler untouched. setupFeatureHandlers untouched. OIDC flow untouched. Token storage / auth / agent IPC / MCP transport untouched. Heartbeat ordering unchanged (intentionally — see "Risks" §1 below).

Type of change

  • Bug fix
  • New feature
  • Refactor (no behavior change)
  • Documentation
  • Test
  • Chore (build, dependencies, CI)

Scope (select all touched areas)

  • TUI / REPL
  • Agent / Tools
  • LLM Providers
  • Server / Daemon
  • Shared (constants, types, transport events)
  • CLI Commands (oclif)
  • Hub / Connectors
  • Cloud Sync
  • CI/CD / Infra

Linked issues

  • Closes ENG-2464 — [CLI] Investigate slow daemon startup — profile + rank bottlenecks

Root cause (bug fixes only, otherwise write N/A)

  • Root cause: transportServer.start(port) opened the Socket.IO port early in daemon bootstrap (before handlers were registered), but onRequest was guarded with if (!io) throw, forcing all handler registrations to happen AFTER start(). The first CLI command of a cold session connected quickly, fired its business request, and hit a not-yet-registered listener — Socket.IO silently drops unhandled events. The client's requestWithAck timeout was the only thing stopping the request, followed by a withDaemonRetry backoff and a retry. On the retry, handlers were registered, request succeeded immediately.
  • Why this was not caught earlier:
    1. No CI test exercises cold-daemon from an external subprocess. Integration tests either mock the daemon or reuse a warm one — the race only fires once per cold boot.
    2. The race doesn't crash. It succeeds on retry with exit code 0 and correct output. No error log. Just slow.
    3. Dev machines have warm daemons. Developers run commands all day; daemon stays resident. The race only fires on the first cold command after reboot / idle-timeout / explicit restart.
    4. The regression is the intersection of two independent PRs. ENG-1091 introduced feature-handlers.ts with blocking OIDC before handler registration — harmless at the time because no command routed through those handlers. ENG-1186 migrated ~10 local commands to daemon-routed. Either PR in isolation looked correct. Their intersection shipped in v2.0.0.

Test plan

  • Coverage added:
    • Unit test
    • Integration test
    • Manual verification (in addition)
  • Test file(s): test/unit/infra/transport/socket-io-transport-server.test.ts — updated one test that asserted the now-wrong invariant (onRequest throws before start()) to instead assert the new correct behavior (pre-start() registration stores the handler and applies it to the first connecting socket).
  • Key scenario(s) covered:
    • Pre-start() onRequest no longer throws.
    • Handler registered before start() is correctly applied to the first incoming socket after start() (via the existing io.on('connection') loop over requestHandlers map).
    • All previously passing tests for onRequest, broadcast, broadcastTo, sendTo, getPort, isRunning still pass.
    • broadcast / broadcastTo still correctly throw TransportServerNotStartedError when called before start() (they genuinely need io).
    • Manual cold-run verification of representative commands across all handler categories: brv status, brv locations, brv vc status, brv vc branch, brv providers list, brv hub list, brv worktree list, brv source list.

User-visible changes

Faster cold startup for every daemon-routed command. Warm invocations unchanged. No CLI flag, config, defaults, or output format changes. No breaking changes.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording

The benchmark comparison was run against both the published npm install -g byterover-cli@3.8.3 (baseline) and this branch installed from npm pack (fix), on the same machine, alternating back-to-back. For each command, three cold runs were executed (brv restart + sleep 1 before each) across all handler categories listed above. The fix consistently reduces cold-start latency across every tested command. See the ticket attachments / comments for the full run log.

Checklist

  • Tests added or updated and passing (npm test) — full suite green
  • Lint passes (npm run lint) — 0 errors on touched files (pre-existing submodule warning exists on main, unrelated)
  • Type check passes (npm run typecheck) — clean for both root and webui
  • Build succeeds (npm run build) — clean
  • Commits follow Conventional Commits format — feat: [ENG-2464] enhance cold start-up daemon
  • Documentation updated (if applicable) — inline comments in both touched files explain the new invariant and reference ENG-2464
  • No breaking changes (or clearly documented above) — zero API / behavior change; only observable difference is cold-startup latency
  • Branch is up to date with main

Risks and mitigations

  • Risk: A cold client could see ECONNREFUSED between the moment the daemon writes its heartbeat (so pollForDaemon considers it ready) and the moment transport.start(port) actually opens the port. After this PR, that window spans the entire handler-setup phase including setupFeatureHandlers's OIDC discovery (typically <1 s, up to several seconds on slow networks).
    • Mitigation: withDaemonRetry (src/oclif/lib/daemon-client.ts) catches ConnectionFailedError and sleeps DEFAULT_RETRY_DELAY_MS between attempts. This PR tunes those defaults from 3 × 2 s (4 s budget) to 10 × 1 s (9 s budget) — short enough to recover on the typical case in ~1 s, wide enough to cover slow OIDC. Verified in manual cold runs.
    • Considered but rejected: Deferring heartbeatWriter.start() until after transport.start(port) would let pollForDaemon itself wait for port-readiness, removing the ECONNREFUSED window entirely. Rejected because pollForDaemon has a 5 s DAEMON_READY_TIMEOUT_MS, and on a retry after that timeout, daemon-spawner.js's gracefullyStopDaemon step SIGTERMs the daemon for "stale heartbeat" — which would kill an in-progress slow-OIDC bootstrap and trigger a respawn loop. Keeping heartbeat early avoids that failure mode entirely. Properly fixing this in pollForDaemon (port-readiness probe + spawn-aware stale detection) belongs in @campfirein/brv-transport-client, not here.
  • Risk: A future handler registration that depends on post-start() state (e.g., transport.getPort(), which returns undefined before start) silently misbehaves.
    • Mitigation: onRequest no longer requires start(), but broadcast, broadcastTo, and getPort still require it (they genuinely need io). Tests asserting TransportServerNotStartedError for those remain in place. The invariant is documented in a comment in socket-io-transport-server.ts explaining why onRequest is now pre-start-safe while the others are not.
  • Risk: OIDC failure behavior change.
    • Mitigation: Unchanged. setupFeatureHandlers still awaits getAuthConfig, still throws on network failure, still propagates up through the existing try/catch in brv-server.ts:main which cleans up and exits. No new failure modes.
  • Risk: Two daemons race the port between acquire and listen.
    • Mitigation: GlobalInstanceManager.acquire() still runs first and is atomic — only one daemon holds the lock. If a second daemon is spawned during the bootstrap window, it sees the lock and exits with already_running. Unchanged from main.

@bao-byterover bao-byterover requested a review from leehpham as a code owner April 25, 2026 00:21
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 25, 2026

Claude finished @bao-byterover's task in 2m 45s —— View job


PR Review: feat: [ENG-2464] enhance cold start-up daemon

Tasks

  • Gather context and read changed files
  • Analyze brv-server.ts changes
  • Analyze socket-io-transport-server.ts changes
  • Analyze test changes
  • Post inline comments and overall review

Summary

Score: 4 / 5

This is a well-diagnosed, minimal, and targeted fix for a real production regression. The root cause analysis in the PR description is thorough and accurate. The change is correct, the test directly validates the new invariant, and no API contracts are broken.


What's correct

  • Core fix is sound. Removing the if (!io) throw guard from onRequest and letting start()'s io.on('connection') loop iterate the pre-populated requestHandlers map is the right mechanism. The existing sockets.values() loop at the end of onRequest becomes a no-op before start() (since sockets is empty), which is harmless.
  • broadcast / broadcastTo still guard correctly. They genuinely need io, and both still throw TransportServerNotStartedError before start(). The asymmetry between onRequest (pre-start-safe) and broadcast (not pre-start-safe) is a valid design distinction and is now explicitly tested.
  • onConnection / onDisconnection were already pre-start-safe (they push to arrays with no !io guard), so no change needed there.
  • Bootstrap cleanup path is unchanged. If transportServer.start() fails at its new position, the catch block still calls transportServer?.stop(), agentPool.shutdown(), webuiServer?.stop(), etc. — no new cleanup gap introduced.
  • The test is well-constructed. It validates the key invariant end-to-end (register handler → start server → client connects → emits → correct response), and the afterEach teardown handles server stop even if the test throws mid-way.

Issues

1. Increased ECONNREFUSED window (non-blocking, worth tracking)
instanceManager.acquire(port, version) writes daemon.json with the chosen port at step 3, before any bootstrap work. Previously, transportServer.start(port) was called at step 4a (right after construction), so the gap was negligible. Now it's delayed to after all of steps 4b–11: WebUI server start, heartbeat, auth state loading, agent pool initialization, and every onRequest registration. On a slow machine or cold filesystem, this could easily be 200–500 ms or more. The client's exponential-backoff retry (30 attempts from 50 ms) absorbs this, but the expected bootstrap duration is now a load-bearing assumption — worth documenting. See inline comment at brv-server.ts:665.

2. Verbose inline comment blocks (nitpick)
Both the comment block in socket-io-transport-server.ts:119–127 and brv-server.ts:661–665 are multi-line. CLAUDE.md requires "one short line max" for inline comments. See inline suggestions for condensed forms.

3. Test client not registered in outer clientSocket (minor)
The new test creates a local client and disconnects it manually. If the promise rejects before client.disconnect(), cleanup falls to afterEach via server.stop() (which calls io.disconnectSockets(true)). This is safe, but for consistency with the rest of the suite, assigning to the outer clientSocket would let afterEach handle it uniformly. See inline comment on the test.


What didn't change (confirmed)

  • AuthHandler, setupFeatureHandlers, OIDC flow, token storage, agent IPC, MCP transport — all untouched.
  • broadcast, broadcastTo, sendTo, getPort, isRunning behavior unchanged.
  • No CLI-side changes.

Comment thread src/server/infra/transport/socket-io-transport-server.ts
Comment thread src/server/infra/daemon/brv-server.ts
Comment thread test/unit/infra/transport/socket-io-transport-server.test.ts Outdated
Comment thread src/server/infra/daemon/brv-server.ts Outdated
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 25, 2026

Claude finished @bao-byterover's task in 3m 39s —— View job


PR Review: feat: [ENG-2464] enhance cold start-up daemon

Tasks

  • Gather context and read changed files
  • Analyze daemon-client.ts changes
  • Analyze brv-server.ts changes
  • Analyze socket-io-transport-server.ts changes
  • Analyze test changes
  • Post inline comments and overall review

Summary

Score: 3 / 5

The core fix remains technically correct and well-reasoned. The second commit (fix: [ENG-2464] fix edge case slow OIDC) adds the MAX_RETRIES/DEFAULT_RETRY_DELAY_MS tuning and the heartbeat ordering explanation, both of which are sound. However, the previous review's style feedback was not only ignored — the new commit added more verbose comment blocks, compounding the existing violations. The same two inline suggestions from the first review pass remain open. Score drops from 4 to 3 for the repeated pattern of comment verbosity across all four changed source files.


What's correct

  • Core fix is sound. Pre-registering handlers in onRequest before start() eliminates the handler-registration race. The start() connection loop iterates requestHandlers and wires them to the first arriving socket — no missed events.
  • Retry budget is well-reasoned. MAX_RETRIES=10 × DEFAULT_RETRY_DELAY_MS=1000 ms gives a 9 s budget. This covers the ECONNREFUSED window (heartbeat written → port open) including slow-OIDC discovery, and is strictly better than the previous 4 s.
  • Heartbeat ordering is correct. Keeping heartbeatWriter.start() before transport.start() avoids the SIGTERM-on-stale-heartbeat failure mode that would otherwise kill a slow-OIDC bootstrap.
  • broadcast / broadcastTo still guard correctly. They genuinely need io and still throw TransportServerNotStartedError before start().
  • Test validates the right invariant. Pre-register → start → connect → emit → correct response is the right sequence to test.
  • No API contract, handler behavior, or auth/OIDC flow changes.

Issues (see inline comments)

1. daemon-client.ts:23–30 — new multi-line JSDoc block (nitpick, style)
Added in the second commit. A 7-line JSDoc for a single constant with cross-file references and PR-ticket links belongs in the commit message, not source. CLAUDE.md: "one short line max".

2. socket-io-transport-server.ts:119–127 — 8-line comment block (nitpick, style, unresolved from previous review)
Same feedback given in the first review pass; suggestion provided. Not addressed in the second commit.

3. brv-server.ts:243–254 — new 11-line heartbeat comment (nitpick, style)
The heartbeat/SIGTERM interaction is genuinely non-obvious — worth a comment — but 11 lines is excessive. One sentence is sufficient; the rest belongs in the commit message.

4. brv-server.ts:673–678 — 6-line transport start comment (nitpick, style, unresolved from previous review)
Suggestion provided in the first review pass. The historical "why it was slow" narrative belongs in the commit message; the inline note needs only one line.

5. brv-server.ts:198–201 — 4-line step-4a comment (nitpick, style)
Consistent with the pattern above; can be trimmed to one line.

6. test:236 — client cleanup not in try/finally (suggestion, minor, unresolved from previous review)
Feedback given last round. If the emit Promise rejects, client.disconnect() is unreachable. A try/finally block makes cleanup unconditional.


Non-blocking observation (unchanged from previous review)

Increased ECONNREFUSED windowinstanceManager.acquire(port, version) writes daemon.json at step 3 (before any bootstrap). The gap between that write and transport.start(port) now spans the entire bootstrap sequence (WebUI start, heartbeat, auth loading, agent pool warm-up, all handler registrations). The 9 s retry budget absorbs it, but genuine connection failures now take up to 9 s to surface vs. 4 s before. Acceptable tradeoff; worth noting in the risk section if not already captured.


What didn't change (confirmed)

  • AuthHandler, setupFeatureHandlers, OIDC flow, token storage, agent IPC, MCP transport — all untouched.
  • broadcast, broadcastTo, sendTo, getPort, isRunning behavior unchanged.
  • No CLI flag, config, defaults, or output format changes.

Comment thread src/oclif/lib/daemon-client.ts Outdated
Comment thread src/server/infra/transport/socket-io-transport-server.ts Outdated
Comment thread src/server/infra/daemon/brv-server.ts Outdated
Comment thread src/server/infra/daemon/brv-server.ts Outdated
Comment thread test/unit/infra/transport/socket-io-transport-server.test.ts Outdated
Comment thread src/server/infra/daemon/brv-server.ts Outdated
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 27, 2026

Claude finished @bao-byterover's task in 4m 10s —— View job


PR Review: feat: [ENG-2464] enhance cold start-up daemon

Tasks

  • Gather context and read changed files
  • Analyze previous review feedback and check if addressed
  • Review daemon-client.ts changes
  • Review brv-server.ts changes
  • Review socket-io-transport-server.ts changes
  • Review test changes
  • Post inline comments and overall review

Summary

Score: 2 / 5

The core fix remains technically correct and well-reasoned. The sync trigger is a merge-from-main commit (d9bfa10) — no new logic was introduced. However, this is now the third review pass and every issue flagged in passes 1 and 2 is still unaddressed. Additionally, this pass surfaces a new correctness bug introduced by the second commit: two DaemonClientOptions JSDoc comments still document the old defaults.


New issue (introduced by this PR, not caught in previous passes)

DaemonClientOptions has stale JSDoc defaultsmaxRetries says Default: 3 and retryDelayMs says Default: 2000, but both constants were changed in this PR (MAX_RETRIES = 10, DEFAULT_RETRY_DELAY_MS = 1000). These are the authoritative docs for callers who pass custom values to withDaemonRetry.

src/oclif/lib/daemon-client.ts lines 69 and 75 — not patchable with an inline suggestion (unchanged lines), but the fix is trivial: update Default: 3Default: 10 and Default: 2000Default: 1000.

Fix this →


Persistent issues (unresolved across all three passes — see inline suggestions)

# File Lines Issue
1 daemon-client.ts 23–30 7-line JSDoc for a constant (CLAUDE.md: one line max)
2 socket-io-transport-server.ts 119–126 8-line comment block (CLAUDE.md: one line max)
3 brv-server.ts 197–200 4-line step-4a comment (CLAUDE.md: one line max)
4 brv-server.ts 243–253 11-line heartbeat comment (CLAUDE.md: one line max)
5 brv-server.ts 673–678 6-line transport-start comment (CLAUDE.md: one line max)
6 test/…transport-server.test.ts 236 client.disconnect() not in try/finally — leaks on reject

What's still correct

  • Core fix is sound. Moving transportServer.start(port) to after all onRequest registrations eliminates the handler-registration race. Pre-registered handlers in requestHandlers map are applied to the first connecting socket via the existing io.on('connection') loop. The mechanism is correct.
  • broadcast / broadcastTo still guard correctly. Both still throw TransportServerNotStartedError before start() — they genuinely need io. The asymmetry with onRequest is valid and tested.
  • Retry budget is well-reasoned. MAX_RETRIES=10 × 1 s gives a 9 s window to cover slow-OIDC cold starts. Strictly better than the old 4 s budget.
  • Heartbeat ordering is correct. Early heartbeat prevents pollForDaemon from SIGTERMing the daemon during slow OIDC bootstrap.
  • Cleanup path is unchanged. Catch block covers transportServer?.stop() correctly whether start() succeeded or not.
  • No API contract, handler behavior, or auth/OIDC flow changes.

Comment thread src/server/infra/transport/socket-io-transport-server.ts
Comment thread src/server/infra/daemon/brv-server.ts Outdated
Comment thread src/server/infra/daemon/brv-server.ts Outdated
Comment thread src/server/infra/daemon/brv-server.ts
Comment thread src/oclif/lib/daemon-client.ts
Comment thread test/unit/infra/transport/socket-io-transport-server.test.ts Outdated
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 27, 2026

Claude finished @bao-byterover's task in 1m 45s —— View job


PR Review: feat: [ENG-2464] enhance cold start-up daemon

Tasks

  • Gather context and read changed files
  • Analyze latest commit (4349b97 fix: [ENG-2464] correct JSDoc defaults and trim verbose comments)
  • Check if all previous review issues were resolved
  • Review test changes
  • Post inline comments and overall review

Summary

Score: 5 / 5

This pass is clean. Every issue raised across the previous three review passes has been resolved in 4349b97. The core fix remains technically correct and well-tested. This PR is ready to merge.


Previous issues — all resolved ✅

# File Issue Status
1 daemon-client.ts:23 7-line JSDoc block for MAX_RETRIES constant ✅ Single-line JSDoc
2 daemon-client.ts:62,68 Stale Default: 3 / Default: 2000 in DaemonClientOptions ✅ Now Default: 10 / Default: 1000
3 socket-io-transport-server.ts:119 8-line comment block ✅ Single line
4 brv-server.ts:198 4-line step-4a comment ✅ Single line
5 brv-server.ts:240 11-line heartbeat comment ✅ Single line
6 brv-server.ts:659 6-line transport-start comment ✅ Single line
7 test:…transport-server.test.ts:234 client.disconnect() not in try/finally ✅ Wrapped in try/finally

What's correct

  • Core fix is sound. onRequest no longer throws before start(). Pre-registered handlers populate this.requestHandlers, and start()'s io.on('connection') loop applies them to every incoming socket. The for (const socket of this.sockets.values()) loop at the end of onRequest is a no-op before start() (empty map) — correct and harmless.
  • broadcast / broadcastTo still guard correctly. Both still throw TransportServerNotStartedError before start().
  • Retry budget well-reasoned. MAX_RETRIES=10 × 1 s = 9 s covers slow-OIDC cold starts. Replaces the prior 4 s budget.
  • Heartbeat ordering is correct. Keeping heartbeat before transport.start() prevents pollForDaemon from SIGTERMing the daemon during slow OIDC bootstrap.
  • File-level step list updated. The block comment at the top of main() now accurately documents step 4 (construct) vs. step 11 (start) for SocketIOTransportServer.
  • Test replacement is correct. The old 'should throw error if server not started' test is replaced with the new invariant test (pre-register → start → connect → emit → correct response, with try/finally cleanup). Removing the old test is correct since the thrown TransportServerNotStartedError for onRequest was the wrong invariant.
  • No API contract, handler behavior, or auth/OIDC flow changes.

@bao-byterover bao-byterover merged commit 5481a8e into main Apr 27, 2026
6 checks passed
@bao-byterover bao-byterover deleted the fix/ENG-2464 branch April 27, 2026 15:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant