Skip to content

porch: flaky 5s timeout in next.test.ts "emits BUILD tasks for fresh specify phase" #894

@amrmelsayed

Description

@amrmelsayed

Symptom

Vitest run fails with a single timeout in packages/codev/src/commands/porch/__tests__/next.test.ts:

❯ src/commands/porch/__tests__/next.test.ts (30 tests | 1 failed) 5474ms
  ⎯⎯⎯⎯⎯⎯⎯ Failed Tests 1 ⎯⎯⎯⎯⎯⎯⎯
  FAIL src/commands/porch/__tests__/next.test.ts > porch next > emits BUILD tasks for fresh specify phase
  Error: Test timed out in 5000ms.

 Test Files 1 failed | 150 passed (151)
      Tests 1 failed | 3179 passed | 21 skipped (3201)
##[error]Process completed with exit code 1.

Single test fails by exceeding vitest's 5000ms default timeout. The other 3179 tests pass cleanly. Total runtime for the failing file was 5474ms — i.e. essentially right at the timeout boundary.

Why this is a flake, not a regression

Concrete evidence: the failure was on the PR-check run for PR #893 (pir-885 — capitalize area group header labels). The exact same test passed on the post-merge run on main ~30 seconds later, with no code change between the two runs.

  • Failing run: 26545199849 (Tests on builder/pir-885, 2026-05-27T23:38:41Z)
  • Passing run: 26545216792 (Tests on main post-merge, 2026-05-27T23:39:09Z)

Same code (the merge was a fast-forward of the same commits), different result → timing-dependent flake.

Additionally: pir-885's change was localized to packages/core/src/area-grouping.ts (capitalize headers) + a test file. No porch code touched. The failing test is in packages/codev/src/commands/porch/__tests__/next.test.ts — different package, different concern.

Root cause hypothesis

The test "emits BUILD tasks for fresh specify phase" likely does something time-bounded that's borderline at the 5s default timeout on CI infrastructure (GitHub Actions Linux runners under variable load). Three plausible candidates:

  1. Spawns a real subprocess (e.g. git, afx, or node) and waits for it to complete. Subprocess startup on CI can vary from ~100ms to several seconds depending on host load. If it spawns several in sequence, the cumulative time crosses 5s.
  2. Awaits file-system operations (creating worktree, writing porch state, etc.) that depend on the runner's disk I/O speed.
  3. Real timer or interval inside the test that the test waits for, and the test's await-on-event pattern is racy with the timeout.

The fix surface depends on which of the three. The investigation phase of BUGFIX should reproduce locally (or reason about the code) before picking a fix.

Fix options

Pick after diagnosing the cause:

If cause is Fix
Genuinely slow subprocess/IO that's near 5s Bump test.timeout for this specific test (e.g. it('...', { timeout: 15_000 }, ...)) — pragmatic if the operation is legitimately heavy
Subprocess that should be mocked Mock the subprocess via vi.mock so the test doesn't depend on CI host performance
Racy timer / event-await Replace timer-based wait with deterministic event subscription; remove polling
Real test code path is slow even locally Refactor the test (or the code under test) to be faster — likely the right answer if it surfaces an actual codepath slowness

Default: bumping the timeout is the cheapest fix that resolves the immediate symptom. If the root cause is real performance bloat, the bump is a band-aid and a follow-up performance issue is warranted. But for an intermittent flake at the 5s boundary, a bump to 15s usually buys decisive headroom without papering over a real bug.

Acceptance criteria

  • Identify which of the three causes is actually responsible (with evidence — local repro or code reading)
  • Apply the appropriate fix per the table above
  • Test passes consistently across 5 consecutive CI runs (could verify by re-running the same workflow 5 times)
  • If the fix is just a timeout bump, note in the PR body whether a follow-up "porch next test is genuinely slow — investigate" issue should be filed
  • No regression: the rest of next.test.ts (29 other tests) still pass

Reproduce locally

pnpm --filter @cluesmith/codev test -- src/commands/porch/__tests__/next.test.ts
# Run several times to estimate flake rate
for i in {1..10}; do pnpm --filter @cluesmith/codev test -- src/commands/porch/__tests__/next.test.ts 2>&1 | tail -3; done

If it never fails locally but consistently flakes on CI, that's strong signal that CI infrastructure is the trigger (option 1 or 2 from the hypothesis).

Out of scope

  • Fixing the other 29 tests in the file — they pass; this issue is the one timeout
  • Auditing other porch tests for similar timeout-flake risk — separate sweep if this turns out to be a pattern
  • Investigating GitHub Actions runner performance — outside our control

Discovered while

PR #893 (pir-885 — capitalize area group headers) hit this flake during its PR-check Tests run on 2026-05-27. Merged anyway via --admin override since the failure was unrelated to the PR's change and the post-merge run on main confirmed the test passes.

Metadata

Metadata

Assignees

Labels

area/porchArea: Porch state machine / protocol orchestrationbugSomething isn't working

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions