
perf(test-runner): parallelize YAML scenario execution #515

Merged
avihut merged 4 commits into master from daft-510/perf/parallelize-yaml-test-runner
May 17, 2026

Conversation

@avihut
Owner

@avihut avihut commented May 17, 2026

Summary

  • Adds --parallel and --jobs N (plus DAFT_MANUAL_TEST_JOBS) to the xtask manual-test runner, executing scenarios concurrently via rayon with per-scenario buffered output flushed in input order. Default stays serial; a follow-up PR will flip the default after a week of opt-in use.
  • Fixes orphan-spawn accumulation: the test env now sets DAFT_NO_LOG_CLEAN=1, suppressing detached daft __clean-logs children that would otherwise outlive xtask scenarios, get reparented to init, and steal CPU. This was effectively a precondition for parallel scaling working at all on local-dev runs.
  • Adds a benchmark harness for tracking progress on #509 (perf(test-runner): speed up the YAML scenario suite): mise run bench:tests:manual:scale runs a hyperfine parameter scan over --jobs values plus an opt-in per-scenario timing pass for p50/p95/max distributions. The methodology is in benches/README.md; the first reference baseline is checked in at benches/baselines/test-manual-scale-2026-05-17.md.
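
As a rough illustration of the first bullet (concurrent execution with per-scenario buffered output flushed in input order), here is a minimal sketch. It uses `std::thread::scope` with a shared atomic index rather than rayon, so it is not the PR's implementation; `ScenarioResult`, `run_scenario`, and `run_all` are hypothetical names, and the pass/fail rule is a stand-in.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

/// Outcome of one scenario; `output` is buffered instead of printed live.
#[derive(Debug, Clone, PartialEq)]
pub struct ScenarioResult {
    pub index: usize, // position in the input list
    pub name: String,
    pub passed: bool,
    pub output: String,
}

/// Hypothetical stand-in for executing one YAML scenario.
fn run_scenario(index: usize, name: &str) -> ScenarioResult {
    let passed = !name.contains("fail");
    let status = if passed { "ok" } else { "FAILED" };
    ScenarioResult {
        index,
        name: name.to_string(),
        passed,
        output: format!("[{}] {}\n", name, status),
    }
}

/// Runs every scenario on `jobs` workers, then replays the buffered output
/// in input order. Returns the combined output and the failed names.
pub fn run_all(names: &[&str], jobs: usize) -> (String, Vec<String>) {
    let next = AtomicUsize::new(0);
    let results = Mutex::new(Vec::with_capacity(names.len()));

    thread::scope(|s| {
        for _ in 0..jobs.max(1) {
            s.spawn(|| loop {
                // Each worker claims the next unclaimed input index.
                let i = next.fetch_add(1, Ordering::Relaxed);
                if i >= names.len() {
                    break;
                }
                let result = run_scenario(i, names[i]);
                results.lock().unwrap().push(result);
            });
        }
    });

    // Completion order is arbitrary; sorting by input index restores it.
    let mut results = results.into_inner().unwrap();
    results.sort_by_key(|r| r.index);

    // Serial fold: output and the failure list are built in input order.
    let mut output = String::new();
    let mut failed = Vec::new();
    for r in results {
        output.push_str(&r.output);
        if !r.passed {
            failed.push(r.name);
        }
    }
    (output, failed)
}
```

Because results are sorted before the fold, `run_all(&names, 1)` and `run_all(&names, 4)` return identical output and failure lists, which is the property the test plan checks for the real runner.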

Speedup (M1 Max 10-core, 572 scenarios, full corpus)

| Mode | Wall-clock | Speedup vs serial | Speedup vs pre-PR main |
| --- | --- | --- | --- |
| pre-PR main (serial, no fix) | ~1586 s | | 1.00× |
| this PR, --jobs 1 | 392.10 s ± 1.51 s | 1.00× | 4.05× |
| this PR, --parallel (cap=5) | 100.00 s ± 3.45 s | 3.92× | 15.9× |
| this PR, --jobs 10 | 82.19 s | 4.78× | 19.3× |

End-to-end: 26 min → 1.7 min on a 10-core box.

Scaling is near-linear up to num_cpus/2. Each pair below is speedup vs --jobs 1 / parallel efficiency (speedup divided by job count): jobs=2 (1.81× / 90%), jobs=4 (3.45× / 86%), jobs=5 (3.92× / 78%). Past num_cpus/2, returns diminish and scenario flakes appear, so available_parallelism()/2 is well-calibrated as the default cap. Zero new failures at the default cap across all measured runs.
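
The speedup and efficiency figures can be reproduced from the reported wall-clock means. The sketch below assumes efficiency means speedup divided by job count, which matches the "90% efficiency" wording in the benchmark commit; the function names are illustrative only.

```rust
/// Speedup of a run relative to the serial (`--jobs 1`) reference.
pub fn speedup(serial_secs: f64, parallel_secs: f64) -> f64 {
    serial_secs / parallel_secs
}

/// Parallel efficiency: speedup divided by worker count (1.0 is perfectly linear).
pub fn efficiency(serial_secs: f64, parallel_secs: f64, jobs: u32) -> f64 {
    speedup(serial_secs, parallel_secs) / f64::from(jobs)
}

/// Formats one row in the same "speedup / efficiency" shape used in the text.
pub fn scaling_line(serial_secs: f64, parallel_secs: f64, jobs: u32) -> String {
    format!(
        "jobs={}: {:.2}x / {:.0}%",
        jobs,
        speedup(serial_secs, parallel_secs),
        efficiency(serial_secs, parallel_secs, jobs) * 100.0
    )
}
```

With the serial mean of 392.10 s, `scaling_line(392.10, 217.22, 2)` gives `jobs=2: 1.81x / 90%` and `scaling_line(392.10, 100.00, 5)` gives `jobs=5: 3.92x / 78%`.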

Per-scenario distribution at the default cap stays clean: p50 567 ms, p95 1293 ms — only ~14% per-scenario slowdown from CPU contention, paid back many times over in wall-clock.

Rollout

The runner ships with serial as the default. --parallel is opt-in; --jobs N and DAFT_MANUAL_TEST_JOBS are the explicit overrides. The acceptance criteria's "parallel by default" flips in a follow-up PR after a week of opt-in use surfaces no new flakes. Interactive mode and --setup-only hard-error when --jobs > 1 — their semantics (TTY ownership, println! of work_dir for shell capture) don't survive the buffered worker model.
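
A minimal sketch of how the job count could be resolved from these inputs. The precedence shown (`--jobs`, then `DAFT_MANUAL_TEST_JOBS`, then `--parallel`, then serial) is an assumption, as are the `JobOptions` struct, the error messages, and the floor of 1 on the default cap; only the `available_parallelism()/2` cap and the hard error for interactive and `--setup-only` modes come from the description above.

```rust
use std::num::NonZeroUsize;
use std::thread;

/// Hypothetical bundle of the inputs that influence the job count.
#[derive(Debug, Default, Clone)]
pub struct JobOptions {
    pub jobs_flag: Option<usize>, // --jobs N
    pub parallel_flag: bool,      // --parallel
    pub env_jobs: Option<String>, // value of DAFT_MANUAL_TEST_JOBS, if set
    pub interactive: bool,
    pub setup_only: bool,
}

/// Default cap for `--parallel`: half the available cores, at least 1.
pub fn default_parallel_cap(cores: usize) -> usize {
    (cores / 2).max(1)
}

/// Core count as reported by the standard library, falling back to 1.
pub fn detected_cores() -> usize {
    thread::available_parallelism()
        .map(NonZeroUsize::get)
        .unwrap_or(1)
}

/// Resolves the worker count. The precedence order is an assumption.
pub fn resolve_jobs(opts: &JobOptions, cores: usize) -> Result<usize, String> {
    let jobs = if let Some(n) = opts.jobs_flag {
        n
    } else if let Some(raw) = &opts.env_jobs {
        raw.trim()
            .parse::<usize>()
            .map_err(|e| format!("invalid DAFT_MANUAL_TEST_JOBS={:?}: {}", raw, e))?
    } else if opts.parallel_flag {
        default_parallel_cap(cores)
    } else {
        1 // serial stays the default
    };

    if jobs == 0 {
        return Err("job count must be at least 1".to_string());
    }
    // Interactive and setup-only modes need the TTY and direct stdout,
    // which the buffered worker model cannot provide.
    if jobs > 1 && (opts.interactive || opts.setup_only) {
        return Err("interactive and --setup-only modes require --jobs 1".to_string());
    }
    Ok(jobs)
}
```

On a 10-core machine this yields 5 workers for `--parallel`, matching the cap=5 row in the speedup table.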

CI

CI in test.yml:487 still invokes the runner serially. A follow-up commit on this PR will enable --parallel there to measure the impact on CI wall-clock.

Test plan

  • CI checks pass (build, unit, integration matrix)
  • mise run test:unit locally (1721 + 65 xtask passing)
  • mise run clippy zero warnings
  • Manual target/release/xtask manual-test --ci --parallel produces deterministic input-ordered output matching serial
  • SIGINT mid-run cleans up sandbox dirs (no leaks)
  • Mixed pass/fail scenario set produces stable failed_scenarios list in input order under both serial and parallel

Refs #510, part of #509.

🤖 Generated with Claude Code

@avihut avihut added this to the Public Launch milestone May 17, 2026
@avihut avihut added the perf Performance improvement label May 17, 2026
@avihut avihut self-assigned this May 17, 2026
avihut and others added 4 commits May 17, 2026 08:11
Adds `--parallel` and `--jobs N` (plus `DAFT_MANUAL_TEST_JOBS`) to the
xtask manual-test runner, executing scenarios concurrently via rayon
with per-scenario buffered output flushed in input order. Default stays
serial; a follow-up PR will flip the default after a week of opt-in use.

- Sandbox path: nanos+pid+atomic counter to prevent collisions under
  parallel scheduling.
- Cleanup handler: registry promoted from `Option<PathBuf>` to a
  `HashSet<PathBuf>` with `CleanupGuard` RAII. SIGINT drains under the
  held lock + bounded re-rm loop to fight subprocess-recreation races.
- Aggregator: parallel `par_iter` → `collect` → input-order sort →
  serial fold, so stats and `failed_scenarios` ordering are stable
  regardless of completion order.
- Interactive / `--setup-only`: hard-error when `jobs > 1`, since TTY
  ownership and the `println!` of `work_dir` for shell capture don't
  fit the buffered worker model.
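
The sandbox-path item in the list above can be sketched as follows. This is an illustration under assumptions, not the runner's code: the `daft-manual-test-` prefix, the field order, and the function name are hypothetical. The atomic counter is what guarantees uniqueness within one process; the nanosecond timestamp and pid help separate processes and runs.

```rust
use std::path::{Path, PathBuf};
use std::process;
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{SystemTime, UNIX_EPOCH};

/// Process-wide counter; every call gets a distinct value, even when two
/// workers read the clock within the same nanosecond.
static SANDBOX_COUNTER: AtomicU64 = AtomicU64::new(0);

/// Builds a sandbox path from nanos + pid + atomic counter.
/// The directory name format is a hypothetical choice for this sketch.
pub fn unique_sandbox_path(base: &Path) -> PathBuf {
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_nanos())
        .unwrap_or(0);
    let pid = process::id();
    let seq = SANDBOX_COUNTER.fetch_add(1, Ordering::Relaxed);
    base.join(format!("daft-manual-test-{}-{}-{}", nanos, pid, seq))
}
```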

Wall-clock on the `tests/manual/scenarios/clone/` batch: 22.3s → 5.75s
at `--jobs 8` (~3.9× speedup). Refs #510, part of #509.
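
For the cleanup-handler change described in the list above, a minimal sketch of a `HashSet<PathBuf>` registry with an RAII guard might look like this. Only the `CleanupGuard` name and the registry type come from the commit message; `register`, `drain_all`, `remove_with_retries`, and the retry count of 3 are assumptions, and the wiring to an actual SIGINT handler is omitted.

```rust
use std::collections::HashSet;
use std::fs;
use std::path::{Path, PathBuf};
use std::sync::{Mutex, MutexGuard, OnceLock};

/// Process-wide set of sandbox directories that still need removal.
fn registry() -> MutexGuard<'static, HashSet<PathBuf>> {
    static REGISTRY: OnceLock<Mutex<HashSet<PathBuf>>> = OnceLock::new();
    REGISTRY
        .get_or_init(|| Mutex::new(HashSet::new()))
        .lock()
        .unwrap_or_else(|poisoned| poisoned.into_inner())
}

/// Bounded re-removal loop: a child process may recreate files while the
/// directory is being deleted, so try a few times before giving up.
fn remove_with_retries(path: &Path, attempts: u32) {
    for _ in 0..attempts {
        let _ = fs::remove_dir_all(path);
        if !path.exists() {
            break;
        }
    }
}

/// RAII guard: registers a sandbox on creation, removes it on drop.
pub struct CleanupGuard {
    path: PathBuf,
}

impl CleanupGuard {
    pub fn register(path: PathBuf) -> Self {
        registry().insert(path.clone());
        CleanupGuard { path }
    }
}

impl Drop for CleanupGuard {
    fn drop(&mut self) {
        registry().remove(&self.path);
        remove_with_retries(&self.path, 3);
    }
}

/// What an interrupt handler could call: drains every registered path
/// while holding the lock. Returns how many paths were registered.
pub fn drain_all() -> usize {
    let mut live = registry();
    let count = live.len();
    for path in live.drain() {
        remove_with_retries(&path, 3);
    }
    count
}
```

Holding the lock for the whole drain means no worker can register a new sandbox halfway through, which is one way to read "drains under the held lock".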

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`daft` spawns a detached `daft __clean-logs` background child via
setsid+spawn from `maybe_clean_logs()`. When the manual-test runner
invokes daft hundreds of times back-to-back — especially under
`--jobs > 1` — these children outlive their parent xtask scenarios,
get reparented to init, and accumulate as orphans that steal CPU.
Observed load average above 500 on a 10-core box during a single
full-corpus run, with each scenario then taking ~10× longer than its
fair share.

Effect on a 69-scenario hooks subset (M1 Max):
- Serial:    366s → 184s   (1.99× faster)
- jobs=5:    197s →  16s   (12.5× faster)

`DAFT_NO_UPDATE_CHECK` and `DAFT_NO_TRUST_PRUNE` already gate the
other two spawn-self startup tasks; adding `DAFT_NO_LOG_CLEAN`
follows the same pattern. The env var is read by `log_clean::is_disabled`
in production code — no daft-side changes needed.
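
On the runner side, the change amounts to adding one more variable to the environment of every spawned `daft` process. A hedged sketch, assuming the test env sets all three gates to `1` through `std::process::Command`; the helper names are hypothetical, and how `log_clean::is_disabled` interprets the value is not shown here.

```rust
use std::ffi::OsStr;
use std::process::Command;

/// The three spawn-self gates named in the commit message.
pub const SPAWN_GATES: [&str; 3] = [
    "DAFT_NO_UPDATE_CHECK",
    "DAFT_NO_TRUST_PRUNE",
    "DAFT_NO_LOG_CLEAN",
];

/// Builds a command whose environment suppresses the detached
/// background children. The helper itself is a hypothetical name.
pub fn daft_test_command(program: &str) -> Command {
    let mut cmd = Command::new(program);
    for gate in SPAWN_GATES {
        cmd.env(gate, "1");
    }
    cmd
}

/// Reports whether `name` is explicitly set on the command (for checks).
pub fn gate_is_set(cmd: &Command, name: &str) -> bool {
    cmd.get_envs()
        .any(|(key, value)| key == OsStr::new(name) && value.is_some())
}
```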

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a benchmark harness so #509 progress is measurable
SHA-over-SHA. Companion to `mise run bench` (daft-vs-git command
runtime) but at a different layer: this measures the YAML manual-test
runner's throughput, not individual daft command latency.

- `mise run bench:tests:manual:scale` — hyperfine parameter-scan over
  --jobs values; writes per-trial wall-clock to
  `benches/results/test-manual-scale.{md,json}`.
- `mise run bench:tests:manual:scale-{baseline,compare}` — pin/diff
  workflow matching the existing `bench:{baseline,compare}` pair.
- `benches/scenarios/test_manual_scale.sh` — driver. Sweeps
  BENCH_JOBS (default 1,2,4,8), 3 trials per value, then a Phase 2
  per-scenario timing pass at jobs=1 and at `--parallel`'s default cap
  for p50/p95/max distribution.
- Opt-in per-scenario timing via `DAFT_MANUAL_TEST_EMIT_TIMING=1` —
  the runner emits one grep-friendly `[bench] scenario="X" elapsed_ms=N`
  line per scenario, scoped to the runner half (excludes sandbox
  setup/teardown overhead so cumulative ≠ wall-clock).
- `benches/README.md` documents both bench families (daft-vs-git and
  test-runner) and the local vs. checked-in baseline split.
- `benches/baselines/test-manual-scale-2026-05-17.md` — first
  reference baseline (M1 Max, 10/10 cores). Shows 3.92× speedup at the
  default cap, 4.78× at full saturation, scenario failures appearing
  past num_cpus/2 — concrete evidence that `available_parallelism()/2`
  is well-chosen.
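
The per-scenario timing lines described in the list above are simple enough to post-process directly. The sketch below parses the `[bench] scenario="X" elapsed_ms=N` format and computes p50/p95/max; the nearest-rank percentile method is an assumption (the driver script may compute percentiles differently), and the function names are illustrative.

```rust
/// Parses one `[bench] scenario="X" elapsed_ms=N` line; other lines yield None.
pub fn parse_timing_line(line: &str) -> Option<(String, u64)> {
    let rest = line.trim().strip_prefix("[bench] scenario=\"")?;
    // Split on the last occurrence so a quote inside the name is tolerated.
    let (name, tail) = rest.rsplit_once("\" elapsed_ms=")?;
    let elapsed_ms = tail.trim().parse::<u64>().ok()?;
    Some((name.to_string(), elapsed_ms))
}

/// Nearest-rank percentile over an ascending-sorted slice.
pub fn percentile(sorted: &[u64], pct: f64) -> Option<u64> {
    if sorted.is_empty() {
        return None;
    }
    let rank = ((pct / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.clamp(1, sorted.len()) - 1])
}

/// Returns (p50, p95, max) in milliseconds for all timing lines in `log`.
pub fn summarize(log: &str) -> Option<(u64, u64, u64)> {
    let mut samples: Vec<u64> = log
        .lines()
        .filter_map(parse_timing_line)
        .map(|(_, ms)| ms)
        .collect();
    samples.sort_unstable();
    let max = *samples.last()?;
    Some((percentile(&samples, 50.0)?, percentile(&samples, 95.0)?, max))
}
```

Since the timing excludes sandbox setup and teardown, the sum of these samples will not equal the run's wall-clock time.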

Phase 1 results from that baseline (full 572-scenario corpus):

  --jobs 1   392.10 s ± 1.51 s   (1.00× — serial reference)
  --jobs 2   217.22 s ± 0.94 s   (1.81×,  90% efficiency)
  --jobs 4   113.78 s ± 0.46 s   (3.45×,  86%)
  --jobs 5   100.00 s ± 3.45 s   (3.92×,  78% — `--parallel` default)
  --jobs 8   ~144 s   (flaky)    (2.71×,  34%)
  --jobs 10   82.19 s             (4.78×,  48%)

End-to-end vs pre-#510 main on the same machine: 1586 s → 100 s, a
15.9× wall-clock reduction. The bulk of the win at `--jobs 1` (4.05×)
comes from the orphan-spawn suppression fix in #510's prior commit;
the remaining 3.92× is the parallelism this PR introduces.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pairs with the parallelization landing in this PR: CI was still on the
serial default, so the speedup wasn't visible in workflow runtime.
`--jobs $(nproc)` saturates the runner since CI is dedicated to the
job — no need for `--parallel`'s `num_cpus/2` headroom that exists for
concurrent local work.

Baseline serial CI timing on this PR (ubuntu-latest):
- integration-tests (default,  yaml): 2m 11s
- integration-tests (gitoxide, yaml): 2m 02s

Next run will show the delta.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@avihut avihut force-pushed the daft-510/perf/parallelize-yaml-test-runner branch from a156d81 to e6beb36 May 17, 2026 05:12
@avihut avihut merged commit f6cc15e into master May 17, 2026
25 of 26 checks passed
@avihut avihut deleted the daft-510/perf/parallelize-yaml-test-runner branch May 17, 2026 20:11