test: two Windows-only CI flakes (logs -f follow, daemon-state PID-reuse pruning) #75

@Rinse12

Summary

The test-bitsocial-cli (windows-latest) CI job intermittently fails on two different tests across runs, while macos-latest and ubuntu-latest pass consistently. Because a different test fails each run (not the same one deterministically), these are environment flakes, not code regressions — but they redden unrelated PRs (e.g. #74, whose diff touches neither test).

Two distinct root causes, both Windows-specific.


Flake 1 — test/cli/logs.test.ts > "continues watching old file if no new file appears"

AssertionError: expected '[2026-05-01T00:00:00.000Z] initial li…' to contain 'APPENDED_LINE'
 ❯ test/cli/logs.test.ts:462:31

What the test does: spawns node ./bin/run logs --logPath <dir> -f, waits until initial line appears on stdout, then appends APPENDED_LINE to the file and expects the follow-watcher to emit it within a fixed 8 s timeout before it SIGINTs the child.

Root cause: the budget, not the watch mechanism. logs -f already uses a 300 ms userspace poll loop (src/cli/commands/logs.ts:294-306) and reads directly from a tracked byte position (not gated on stat().size) specifically to work around Windows/NTFS quirks — so detection itself is sound. The flake is timing:

  1. node ./bin/run cold-start (Node + oclif command load) is much slower on the Windows runner. The test gates the append on seeing initial line, so startup alone shouldn't matter, but the append → poll (≤300 ms) → fd.read → stdout.write → pipe-drain round trip still has to finish inside what's left of the 8 s window. On a loaded Windows runner, initial line can surface late, leaving too little margin.
  2. shutdown() calls process.exit(0) on SIGINT (logs.ts:313). process.exit does not flush a still-draining stdout pipe. On Windows, pipe writes are async, so APPENDED_LINE can be written but lost on the abrupt exit.
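
To illustrate the margin problem in item 1, here is a minimal sketch of a test helper that resolves as soon as a marker shows up on a stream, so the timeout acts only as a failure bound rather than a deadline the happy path has to race. The name waitForOutput and its signature are hypothetical and do not come from the repository.

```typescript
import { PassThrough, Readable } from "node:stream";

// Hypothetical test helper (not from the repository): resolves with the
// accumulated output once `needle` has appeared, rejects after `timeoutMs`.
function waitForOutput(stream: Readable, needle: string, timeoutMs: number): Promise<string> {
  return new Promise((resolve, reject) => {
    let buffer = "";

    const timer = setTimeout(() => {
      stream.off("data", onData);
      reject(new Error(`timed out after ${timeoutMs} ms waiting for "${needle}"`));
    }, timeoutMs);

    // Accumulate chunks so a marker split across two chunks is still found.
    function onData(chunk: Buffer | string): void {
      buffer += chunk.toString();
      if (buffer.includes(needle)) {
        clearTimeout(timer);
        stream.off("data", onData);
        resolve(buffer);
      }
    }

    stream.on("data", onData);
  });
}

// Self-contained demo with an in-memory stream standing in for child.stdout.
const demo = new PassThrough();
const appended = waitForOutput(demo, "APPENDED_LINE", 1000);
demo.write("[2026-05-01T00:00:00.000Z] initial line\n");
demo.write("APPENDED_LINE\n");
appended.then(() => console.log("marker observed, safe to send SIGINT"));
```

In a test, the listener would need to be attached before the append happens (call waitForOutput first, then append, then await the promise), because a stream in flowing mode with no data listener drops its output.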

Fix directions:

  • Raise the test timeout and/or drive shutdown off observing APPENDED_LINE rather than a fixed wall-clock deadline.
  • In logs.ts, flush stdout before exiting (process.stdout.write("", cb), or set process.exitCode = 0 and let the event loop drain) instead of calling a bare process.exit(0).
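
A minimal sketch of the flush-before-exit idea, assuming the empty-write callback approach named above. flushThenExit is a hypothetical helper, not the code in logs.ts; the exit function is passed in so the behavior can be exercised without terminating the process.

```typescript
import { Writable } from "node:stream";

// Hypothetical helper (not from the repository): invoke `exit` only after
// every chunk written to `out` before this call has been processed.
function flushThenExit(out: Writable, exit: (code: number) => void, code = 0): void {
  // Writable processes chunks in order, so the callback of this empty write
  // runs only after all earlier writes have completed.
  out.write("", () => exit(code));
}

// Demo with a deliberately slow sink standing in for a draining stdout pipe.
const received: string[] = [];
const slowSink = new Writable({
  write(chunk, _encoding, callback) {
    setTimeout(() => {
      received.push(chunk.toString());
      callback();
    }, 20);
  },
});

slowSink.write("APPENDED_LINE\n");
flushThenExit(slowSink, (code) => {
  console.log(`exit(${code}) requested after ${received.length} chunk(s) were written`);
});
```

In the real command the call would presumably look like flushThenExit(process.stdout, process.exit). A production version might also add a fallback timer so a blocked pipe cannot hang shutdown indefinitely.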

Flake 2 — test/common-utils/daemon-state.test.ts > "should prune a stale state file whose PID now belongs to a process that is not a bitsocial daemon"

AssertionError: expected { pid: 8220, …(3) } to be undefined
 ❯ test/common-utils/daemon-state.test.ts:153:60

What the test does: spawns sleep 120, writes a legacy state file (no procStartTime) carrying that PID, and expects getAliveDaemonStates() to prune it because the live process under that PID is sleep, not a bitsocial daemon (regression test for #66).

Root cause: the legacy-path identity check relies on reading the process command line, which has no Windows implementation (src/common-utils/daemon-state.ts:54-67):

```ts
async function getProcessCommandLine(pid) {
  try {
    // no /proc on Windows
    return (await fs.readFile(`/proc/${pid}/cmdline`, "utf-8"))...
  } catch {
    try {
      // ps unreliable on Windows
      return (await execFileAsync("ps", ["-p", String(pid), "-o", "args="]))...
    } catch {
      return undefined;
    }
  }
}
```

isDaemonStateAlive (line 141) then returns true when the command line cannot be determined ("fall back to liveness only"), so the stale legacy state is not pruned.

On Windows there is no /proc, and the ps fallback resolves only some of the time: Git-for-Windows ships a procps ps, which is found via PATH only because the CI step runs under Git bash, and the freshly spawned sleep child isn't always visible in its output yet, despite awaiting once(child, "spawn"). When ps errors or returns nothing, getProcessCommandLine returns undefined, the state is not pruned, and the assertion fails. When ps happens to work, the test passes. Hence the intermittency.
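
The decision described above can be modeled as a small pure function. This is a sketch of the reported behavior only, not the repository's isDaemonStateAlive; the names judgeLegacyState and looksLikeDaemon, and the "bitsocial" substring match, are hypothetical.

```typescript
type Verdict = "alive" | "stale";

// Hypothetical model (not from the repository) of the legacy-path check for a
// state file without procStartTime.
function judgeLegacyState(
  pidIsRunning: boolean,
  commandLine: string | undefined,
  looksLikeDaemon: (cmd: string) => boolean,
): Verdict {
  if (!pidIsRunning) return "stale";
  // Command line unknown: fall back to liveness only. This is the branch that
  // keeps the stale state when neither /proc nor ps yields an answer.
  if (commandLine === undefined) return "alive";
  return looksLikeDaemon(commandLine) ? "alive" : "stale";
}

// Assumed predicate for illustration only.
const looksLikeDaemon = (cmd: string): boolean => cmd.includes("bitsocial");

console.log(judgeLegacyState(true, "sleep 120", looksLikeDaemon)); // ps worked: pruned
console.log(judgeLegacyState(true, undefined, looksLikeDaemon)); // ps failed: kept
```

The same live sleep process yields opposite verdicts depending only on whether the command line could be read, which matches the pass/fail split seen in CI.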

Fix directions:

  • Add a Windows branch to getProcessCommandLine (e.g. wmic process where processid=<pid> get commandline, or Get-CimInstance Win32_Process; wmic is deprecated on current Windows releases, so the latter is the safer choice), or
  • Skip the legacy-cmdline-identity test on Windows with it.skipIf(process.platform === "win32") and rely on the procStartTime path (the modern non-legacy identity check) there, or
  • Decide Windows is unsupported for this heuristic and document it.
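
A minimal sketch of the first option, assuming powershell.exe is on PATH and that querying Win32_Process through Get-CimInstance is acceptable. buildCimArgs and getWindowsCommandLine are hypothetical names, not functions from the repository.

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

// Hypothetical helper: builds the PowerShell arguments for one PID. Anything
// that is not a positive integer is rejected, so the PID can be interpolated
// into the command string safely.
function buildCimArgs(pid: number): string[] | undefined {
  if (!Number.isInteger(pid) || pid <= 0) return undefined;
  return [
    "-NoProfile",
    "-NonInteractive",
    "-Command",
    `(Get-CimInstance Win32_Process -Filter 'ProcessId = ${pid}').CommandLine`,
  ];
}

// Hypothetical Windows branch: resolves to the command line, or undefined when
// PowerShell is unavailable, the process is gone, or access is denied.
async function getWindowsCommandLine(pid: number): Promise<string | undefined> {
  const args = buildCimArgs(pid);
  if (args === undefined) return undefined;
  try {
    const { stdout } = await execFileAsync("powershell.exe", args, { windowsHide: true });
    const commandLine = stdout.trim();
    return commandLine.length > 0 ? commandLine : undefined;
  } catch {
    return undefined;
  }
}

getWindowsCommandLine(process.pid).then((cmd) => console.log(cmd ?? "(undetermined)"));
```

Starting PowerShell typically adds noticeable latency per call, so this would likely be reserved for the legacy path. It also still returns undefined on failure, so the "fall back to liveness only" behavior would remain for cases where the query itself fails.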

Suggested triage

These two tests will keep randomly reddening unrelated PRs until addressed. Short term: make the Windows job non-required (or continue-on-error) so it stops blocking merges. Longer term: the two fixes above. Distinct from #73 (a Linux ephemeral-port EADDRINUSE flake).

Evidence: PR #74 runs — logs.test.ts failed on one commit, daemon-state.test.ts failed on the prior commit, macOS+Ubuntu green throughout.
