server: @simlin/server Jest suite intermittently fails one test under parallel pre-commit load #635

@bpowers

Summary

The @simlin/server Jest suite (src/server) intermittently fails one test when run under the parallel load of the pre-commit hook. A single pre-commit run reported 1 failed, 72 passed for @simlin/server, but re-running the suite standalone passed cleanly 73/73 (confirmed twice). This is a pre-existing flake: the change that triggered the failed pre-commit run was a Rust-only change to src/simlin-engine/src/layout/metrics.rs, completely unrelated to the server, so the test itself was almost certainly the cause rather than the diff under test.

Why it matters

This undermines the reliability of the pre-commit gate (scripts/pre-commit). The hook runs many checks in parallel (Rust fmt/clippy/tests, WASM build, TS lint/typecheck/tests, Python tests -- see root CLAUDE.md "Pre-commit Hooks"), and under that contention one server test flips red. The consequence is a spurious red on a branch that is actually green: a developer can be blocked (or worse, conditioned to distrust the gate) by a failure that has nothing to do with their change. It is a developer-experience and CI-trust problem, not a product correctness bug, but it erodes the value of the canonical gate.

This is the same class of problem as the recently-filed #629 (pre-commit Rust pipeline spuriously failing on a cold cache due to parallel clippy --all-features + capped cargo test contending on the package-cache lock), but in a different component: that one is the Rust pipeline / cargo package-cache lock; this one is the @simlin/server Jest suite. It is also distinct from #474 (an order-dependent/flaky Rust engine test) and from the pysimlin Hypothesis health-check flake tracked in docs/tech-debt.md item 19.

Component affected

  • src/server -- the @simlin/server Jest test suite (7 test files; see src/server/CLAUDE.md).
  • Surfaces through scripts/pre-commit (and by extension CI, if the same suite runs there under parallel load).

How it reproduces / what's known

  • Symptom: under parallel pre-commit load, 1 failed, 72 passed; a standalone re-run passes 73/73 (confirmed twice in isolation).
  • The specific failing test has not been identified yet -- either the parallel run's output did not name it or that output was not captured, so it is unknown which of the 73 tests flaked.
  • Likely root-cause families for a "fails under load, passes in isolation" flake: test-isolation / shared mutable fixture state, a hard-coded port or other shared OS resource, a timer/setTimeout-driven async race, or resource contention (CPU starvation pushing an implicit timeout over the edge when the box is busy with the rest of the pre-commit run).
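To make the "hard-coded port" family concrete, here is a minimal TypeScript sketch using only Node's standard library. It is not known whether any server test actually binds a port; the helper names (listen, close, demoPortCollision) are hypothetical and are not taken from src/server. The sketch only shows the failure mode: a second listener on an already-taken fixed port fails with EADDRINUSE, while port 0 lets the OS hand out a free port each time.

```typescript
import { createServer, Server } from "node:http";
import type { AddressInfo } from "node:net";

// Hypothetical helper: start an HTTP server on `port`.
// Port 0 asks the OS to pick any currently free port.
function listen(port: number): Promise<Server> {
  return new Promise((resolve, reject) => {
    const server = createServer((_req, res) => res.end("ok"));
    server.once("error", reject); // e.g. EADDRINUSE when the port is taken
    server.listen(port, "127.0.0.1", () => resolve(server));
  });
}

// Hypothetical helper: stop a server and wait until it has closed.
function close(server: Server): Promise<void> {
  return new Promise((resolve, reject) =>
    server.close((err) => (err ? reject(err) : resolve())),
  );
}

// Illustration only: binding the same fixed port twice collides,
// while two port-0 binds receive distinct ports.
async function demoPortCollision(): Promise<{
  collided: boolean;
  distinctEphemeral: boolean;
}> {
  const first = await listen(0);
  const takenPort = (first.address() as AddressInfo).port;

  let collided = false;
  try {
    // Stands in for a second test (or worker) using the same hard-coded port.
    const duplicate = await listen(takenPort);
    await close(duplicate);
  } catch (err) {
    collided = (err as NodeJS.ErrnoException).code === "EADDRINUSE";
  }

  const second = await listen(0);
  const otherPort = (second.address() as AddressInfo).port;

  await close(first);
  await close(second);
  return { collided, distinctEphemeral: otherPort !== takenPort };
}
```

If a fixed port does turn out to be involved, the same collision could occur between Jest workers or with another process started by the pre-commit hook, which would match the "fails under load, passes in isolation" pattern.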

Possible approaches for resolution

  1. Identify the flaky test. Run the server suite repeatedly under load to reproduce. Useful levers:
    • Run with reduced/maximal worker parallelism to bracket the behavior, e.g. pnpm --filter @simlin/server test -- --runInBand (serial) vs. the default parallel run, and a high --maxWorkers while the machine is otherwise busy.
    • Loop the suite (e.g. a shell for loop, or a small wrapper script that re-invokes Jest) while a CPU/IO load generator runs in parallel to mimic the rest of the pre-commit hook.
    • Capture the failing test name and its error/stack the first time it trips.
  2. Find the root cause once the test is known: look for shared module-level fixture state not reset between tests, a fixed port / temp path collision, reliance on real timers where fake timers would do, an unawaited promise, or an implicit timeout that's too tight under contention.
  3. Make it deterministic: isolate the shared state (per-test fixtures / beforeEach reset), bind to an ephemeral port (:0) or unique temp paths, use fake timers / explicit awaits instead of wall-clock waits, and/or relax overly tight timeouts. Prefer fixing the root cause over papering over it with retries.
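Step 1 can be sketched as a small wrapper that re-runs a command until it first fails and keeps that run's output, so the failing test name and stack are captured. This is a minimal sketch under stated assumptions: the function name runUntilFailure and the result shape are hypothetical, the pnpm invocation in the trailing comment is the one quoted above, and the load generator is not included (it would be run separately, for example by starting the rest of the pre-commit hook alongside).

```typescript
import { spawnSync } from "node:child_process";

interface FlakeHuntResult {
  runs: number; // how many times the command was executed
  failedAt: number | null; // 1-based index of the first failing run, or null
  output: string; // combined stdout + stderr of the failing run ("" if none)
}

// Hypothetical helper: run `command args...` up to `maxRuns` times and stop
// at the first run that does not exit with status 0.
function runUntilFailure(
  command: string,
  args: string[],
  maxRuns: number,
): FlakeHuntResult {
  for (let i = 1; i <= maxRuns; i++) {
    const result = spawnSync(command, args, {
      encoding: "utf8",
      maxBuffer: 64 * 1024 * 1024, // leave room for verbose test output
    });
    // status is null if the process was killed by a signal or never started;
    // that is treated as a failure as well.
    if (result.status !== 0) {
      const output = `${result.stdout ?? ""}${result.stderr ?? ""}`;
      return { runs: i, failedAt: i, output };
    }
  }
  return { runs: maxRuns, failedAt: null, output: "" };
}

// Example invocations (not executed here); compare serial and parallel runs:
// runUntilFailure("pnpm", ["--filter", "@simlin/server", "test", "--", "--runInBand"], 50);
// runUntilFailure("pnpm", ["--filter", "@simlin/server", "test"], 50);
```

If the serial loop stays green while the parallel loop eventually trips, that points toward contention or cross-worker shared resources rather than state leaking between tests within one file.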

Context

Discovered during the layout-quality-eval work (branch layout-quality-eval). The triggering pre-commit run was for an unrelated Rust-only change to src/simlin-engine/src/layout/metrics.rs; the server flake is therefore pre-existing and not caused by that change.
