Summary
The `test-bitsocial-cli (windows-latest)` CI job intermittently fails on one of two tests, while `macos-latest` and `ubuntu-latest` pass consistently. Because the failing test varies from run to run rather than being the same one deterministically, these are environment flakes, not code regressions. They still redden unrelated PRs (e.g. #74, whose diff touches neither test).
Two distinct root causes, both Windows-specific.
Flake 1 — test/cli/logs.test.ts > "continues watching old file if no new file appears"
AssertionError: expected '[2026-05-01T00:00:00.000Z] initial li…' to contain 'APPENDED_LINE'
❯ test/cli/logs.test.ts:462:31
What the test does: spawns `node ./bin/run logs --logPath <dir> -f`, waits until `initial line` appears on stdout, then appends `APPENDED_LINE` to the file and expects the follow-watcher to emit it within a fixed 8 s timeout, after which the test sends `SIGINT` to the child.
Root cause: the time budget, not the watch mechanism. `logs -f` already uses a 300 ms userspace poll loop (`src/cli/commands/logs.ts:294-306`) and reads directly from a tracked byte `position` (not gated on `stat().size`) specifically to work around Windows/NTFS quirks, so detection itself is sound. The flake is timing:
- `node ./bin/run` cold start (Node + oclif command load) is much slower on the Windows runner. The test gates the append on seeing `initial line`, so startup shouldn't matter, but the append → poll (≤300 ms) → `fd.read` → `stdout.write` → pipe-drain round trip still has to finish inside what's left of the 8 s window. On a loaded Windows runner, `initial line` can surface late, leaving too little margin.
- `shutdown()` calls `process.exit(0)` on SIGINT (`logs.ts:313`). `process.exit` does not wait for a still-draining stdout pipe. On Windows, pipe writes are asynchronous, so `APPENDED_LINE` can be queued for writing but never delivered because of the abrupt exit.
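The lost-write hazard in the second item can be illustrated with a minimal sketch of a drain-safe shutdown. The `flush` helper and the injected `exit` callback are hypothetical names that do not exist in `logs.ts`, and an in-memory `PassThrough` stands in for `process.stdout` so the snippet runs on any platform.

```typescript
import { PassThrough, Writable } from "node:stream";

// Hypothetical helper: resolves once everything written to `stream` so far
// has been processed. The empty write is queued behind earlier writes, so
// its callback runs only after they have completed.
function flush(stream: Writable): Promise<void> {
  return new Promise((resolve, reject) => {
    stream.write("", (err) => (err ? reject(err) : resolve()));
  });
}

// Hypothetical replacement for a bare process.exit(0): drain first, then exit.
// `exit` is injected so the demo below does not terminate the process.
async function shutdown(out: Writable, exit: (code: number) => void): Promise<void> {
  await flush(out);
  exit(0);
}

// Demo: `fakeStdout` stands in for process.stdout, `exitCodes` records exit calls.
const fakeStdout = new PassThrough();
const received: string[] = [];
fakeStdout.on("data", (chunk: Buffer) => received.push(chunk.toString()));

const exitCodes: number[] = [];
fakeStdout.write("APPENDED_LINE\n");
const done = shutdown(fakeStdout, (code) => exitCodes.push(code));
```

In the real command the wiring might look like `process.once("SIGINT", () => void shutdown(process.stdout, process.exit))`; whether the poll loop must also be stopped first depends on code not shown in this issue.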
Fix directions:
- Raise the test timeout and/or drive shutdown off observing `APPENDED_LINE` rather than a fixed wall-clock deadline.
- In `logs.ts`, flush stdout before exiting (`process.stdout.write("", cb)`, or `process.exitCode = 0` and let the event loop drain) instead of a bare `process.exit(0)`.
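The first direction could look like the sketch below: a small output watcher that lets the test wait for a marker with a generous upper bound and send `SIGINT` only after the marker has been observed. `watchOutput` and `waitFor` are hypothetical names, not existing test utilities, and a `PassThrough` stands in for the child's stdout.

```typescript
import { PassThrough, Readable } from "node:stream";

interface OutputWatcher {
  waitFor(marker: string, timeoutMs: number): Promise<string>;
}

// Hypothetical test helper: accumulates everything the stream emits, so a
// marker that arrives before waitFor() is called is still found.
function watchOutput(stream: Readable): OutputWatcher {
  let seen = "";
  const pending = new Set<() => void>(); // re-check callbacks of unresolved waiters

  stream.setEncoding("utf8");
  stream.on("data", (chunk: string) => {
    seen += chunk;
    Array.from(pending).forEach((check) => check());
  });

  return {
    waitFor(marker, timeoutMs) {
      return new Promise<string>((resolve, reject) => {
        const timer = setTimeout(() => {
          pending.delete(check);
          reject(new Error(`"${marker}" not seen within ${timeoutMs} ms; output so far: ${seen}`));
        }, timeoutMs);
        const check = (): void => {
          if (!seen.includes(marker)) return;
          clearTimeout(timer);
          pending.delete(check);
          resolve(seen);
        };
        pending.add(check);
        check(); // the marker may already have arrived
      });
    },
  };
}

// Demo with an in-memory stream in place of child.stdout.
const fakeChildStdout = new PassThrough();
const watcher = watchOutput(fakeChildStdout);
fakeChildStdout.write("[2026-05-01T00:00:00.000Z] initial line\n");

const demo = watcher.waitFor("initial line", 1000).then(() => {
  fakeChildStdout.write("APPENDED_LINE\n"); // stands in for appending to the log file
  return watcher.waitFor("APPENDED_LINE", 1000);
});
```

In the actual test, the timeout passed to `waitFor` becomes an upper bound rather than a deadline the happy path races against, and `child.kill("SIGINT")` would be called only after the second `waitFor` resolves.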
Flake 2 — test/common-utils/daemon-state.test.ts > "should prune a stale state file whose PID now belongs to a process that is not a bitsocial daemon"
AssertionError: expected { pid: 8220, …(3) } to be undefined
❯ test/common-utils/daemon-state.test.ts:153:60
What the test does: spawns `sleep 120`, writes a legacy state file (no `procStartTime`) carrying that PID, and expects `getAliveDaemonStates()` to prune it because the live process under that PID is `sleep`, not a bitsocial daemon (regression test for #66).
Root cause: the legacy-path identity check relies on reading the process command line, which has no Windows implementation (`src/common-utils/daemon-state.ts:54-67`):
```ts
async function getProcessCommandLine(pid) {
try { return (await fs.readFile(`/proc/${pid}/cmdline`, "utf-8"))... } // no /proc on Windows
catch {
try { return (await execFileAsync("ps", ["-p", String(pid), "-o", "args="]))... } // ps unreliable on Windows
catch { return undefined; }
}
}
```
In `isDaemonStateAlive` (line 141), when the command line cannot be determined the function returns `true` ("fall back to liveness only"), so the stale legacy state is not pruned.
On Windows there is no `/proc`. The `ps` fallback only sometimes resolves (Git for Windows ships a procps `ps`, found via PATH only because the CI step runs under Git Bash), and the freshly spawned `sleep` child isn't always visible in its output yet, despite awaiting `once(child, "spawn")`. When `ps` errors or returns nothing, `getProcessCommandLine` returns `undefined`, the state is not pruned, and the assertion fails. When `ps` happens to work, the test passes. Hence the intermittency.
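The decision path described above can be modelled as a small pure function. This is a simplified sketch, not the code in `daemon-state.ts`: `legacyVerdict` and `looksLikeDaemon` are hypothetical names, and the substring match is an assumed stand-in for whatever rule the real check uses.

```typescript
type Verdict = "keep" | "prune";

// Hypothetical predicate; the real matching rule is not shown in this issue.
const looksLikeDaemon = (commandLine: string): boolean => commandLine.includes("bitsocial");

// Simplified model of the legacy (no procStartTime) identity check.
function legacyVerdict(pidAlive: boolean, commandLine: string | undefined): Verdict {
  if (!pidAlive) return "prune"; // nothing is running under that PID
  if (commandLine === undefined) return "keep"; // identity unknown: fall back to liveness only
  return looksLikeDaemon(commandLine) ? "keep" : "prune";
}

const scenarios = {
  // Command line readable: the PID belongs to `sleep`, so the state is pruned.
  lookupWorked: legacyVerdict(true, "sleep 120"),
  // Lookup failed (no /proc, ps error): the state is kept and the test fails.
  lookupFailed: legacyVerdict(true, undefined),
};
console.log(scenarios);
```

The model makes the intermittency visible: the only input that differs between a passing and a failing run is whether the command-line lookup returned a value.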
Fix directions:
- Add a Windows branch to `getProcessCommandLine` (e.g. `wmic process where processid=<pid> get commandline` / `Get-CimInstance Win32_Process`), or
- Skip the legacy-cmdline-identity test on Windows with `it.skipIf(process.platform === "win32")` and rely on the `procStartTime` path (the modern non-legacy identity check) there, or
- Decide Windows is unsupported for this heuristic and document it.
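A minimal sketch of the first option follows, using `Get-CimInstance` through PowerShell rather than `wmic`, since Microsoft has deprecated `wmic`. The function name `getWindowsCommandLine`, the 10 s timeout, and the choice to return `undefined` on any failure are assumptions for illustration; the existing function would call this only when `process.platform === "win32"`.

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

// Hypothetical Windows branch: ask CIM for the command line of `pid`.
// Returns undefined when the lookup fails or the process has no command line,
// mirroring the undefined-on-failure contract of the existing function.
async function getWindowsCommandLine(pid: number): Promise<string | undefined> {
  // Only interpolate validated integers into the PowerShell command.
  if (!Number.isInteger(pid) || pid <= 0) return undefined;
  try {
    const { stdout } = await execFileAsync(
      "powershell.exe",
      [
        "-NoProfile",
        "-NonInteractive",
        "-Command",
        `(Get-CimInstance Win32_Process -Filter 'ProcessId = ${pid}').CommandLine`,
      ],
      { windowsHide: true, timeout: 10000 },
    );
    const commandLine = stdout.trim();
    return commandLine.length > 0 ? commandLine : undefined;
  } catch {
    // powershell.exe missing, timeout, or non-zero exit: identity is unknown.
    return undefined;
  }
}
```

Spawning PowerShell adds noticeable latency per lookup, so this trades speed for a deterministic answer. Because the function still returns `undefined` on failure, `isDaemonStateAlive` would keep falling back to liveness only in that case.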
Suggested triage
These two tests will keep randomly reddening unrelated PRs until addressed. Short term: make the Windows job non-required (or `continue-on-error`) so it stops blocking merges. Longer term: the two fixes above. This issue is distinct from #73 (a Linux ephemeral-port `EADDRINUSE` flake).
Evidence: in the PR #74 runs, `logs.test.ts` failed on one commit, `daemon-state.test.ts` failed on the prior commit, and macOS and Ubuntu were green throughout.