perf(test-runner): parallelize YAML scenario execution #515
Merged
Adds `--parallel` and `--jobs N` (plus `DAFT_MANUAL_TEST_JOBS`) to the xtask manual-test runner, executing scenarios concurrently via rayon with per-scenario buffered output flushed in input order. Default stays serial; a follow-up PR will flip the default after a week of opt-in use.

- Sandbox path: nanos+pid+atomic counter to prevent collisions under parallel scheduling.
- Cleanup handler: registry promoted from `Option<PathBuf>` to a `HashSet<PathBuf>` with `CleanupGuard` RAII. SIGINT drains under the held lock + bounded re-rm loop to fight subprocess-recreation races.
- Aggregator: parallel `par_iter` → `collect` → input-order sort → serial fold, so stats and `failed_scenarios` ordering are stable regardless of completion order.
- Interactive / `--setup-only`: hard-error when `jobs > 1`, since TTY ownership and the `println!` of `work_dir` for shell capture don't fit the buffered worker model.

Wall-clock on the `tests/manual/scenarios/clone/` batch: 22.3s → 5.75s at `--jobs 8` (~3.9× speedup).

Refs #510, part of #509.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
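The sandbox-path and aggregator bullets can be sketched together as follows. This is a minimal illustration, not the runner's actual code: it uses `std::thread::scope` in place of rayon so the example has no external dependencies, and every name in it (`unique_sandbox_path`, `run_scenario`, `run_parallel`, `aggregate`, `Outcome`, `Stats`) is hypothetical. The "scenario" is a stand-in that fails when its name contains `fail`.

```rust
use std::path::{Path, PathBuf};
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;
use std::time::{SystemTime, UNIX_EPOCH};

// Process-wide counter: separates paths minted within the same clock tick.
static SANDBOX_SEQ: AtomicU64 = AtomicU64::new(0);

/// Hypothetical helper: nanos + pid + atomic counter make each path unique.
fn unique_sandbox_path(base: &Path) -> PathBuf {
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_nanos())
        .unwrap_or(0);
    let pid = std::process::id();
    let seq = SANDBOX_SEQ.fetch_add(1, Ordering::Relaxed);
    base.join(format!("sandbox-{nanos}-{pid}-{seq}"))
}

#[derive(Debug)]
struct Outcome {
    index: usize, // position in the input list, used to restore order
    name: String,
    passed: bool,
    output: String, // buffered per-scenario output
    sandbox: PathBuf,
}

#[derive(Debug, Default, PartialEq)]
struct Stats {
    passed: usize,
    failed: usize,
    failed_scenarios: Vec<String>,
}

/// Stand-in for a real scenario run (assumption: fails if the name contains "fail").
fn run_scenario(index: usize, name: &str, base: &Path) -> Outcome {
    Outcome {
        index,
        name: name.to_string(),
        passed: !name.contains("fail"),
        output: format!("ran {name}\n"),
        sandbox: unique_sandbox_path(base),
    }
}

/// Runs scenarios on `jobs` workers, then sorts results back into input order.
fn run_parallel(names: &[&str], jobs: usize, base: &Path) -> Vec<Outcome> {
    let next = AtomicUsize::new(0);
    let done = Mutex::new(Vec::with_capacity(names.len()));
    thread::scope(|s| {
        for _ in 0..jobs.max(1) {
            s.spawn(|| {
                loop {
                    // Each worker claims the next unprocessed input index.
                    let i = next.fetch_add(1, Ordering::Relaxed);
                    let Some(name) = names.get(i) else { break };
                    let outcome = run_scenario(i, name, base);
                    done.lock().unwrap().push(outcome);
                }
            });
        }
    });
    let mut outcomes = done.into_inner().unwrap();
    outcomes.sort_by_key(|o| o.index); // completion order -> input order
    outcomes
}

/// Serial fold over input-ordered outcomes: stats plus flushed output.
fn aggregate(outcomes: &[Outcome]) -> (Stats, String) {
    let mut stats = Stats::default();
    let mut log = String::new();
    for o in outcomes {
        log.push_str(&o.output);
        if o.passed {
            stats.passed += 1;
        } else {
            stats.failed += 1;
            stats.failed_scenarios.push(o.name.clone());
        }
    }
    (stats, log)
}

fn main() {
    let names = ["clone-a", "clone-b-fail", "clone-c"];
    let outcomes = run_parallel(&names, 4, &std::env::temp_dir());
    let (stats, log) = aggregate(&outcomes);
    print!("{log}");
    println!("{stats:?}");
}
```

Because the fold runs only after the sort, the flushed output and the `failed_scenarios` list depend on the input order alone, whichever worker finishes first.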
`daft` spawns a detached `daft __clean-logs` background child via setsid+spawn from `maybe_clean_logs()`. When the manual-test runner invokes daft hundreds of times back-to-back (especially under `--jobs > 1`), these children outlive their parent xtask scenarios, get reparented to init, and accumulate as orphans that steal CPU. Observed load average above 500 on a 10-core box during a single full-corpus run, with each scenario then taking ~10× longer than its fair share.

Effect on a 69-scenario hooks subset (M1 Max):

- Serial: 366s → 184s (1.99× faster)
- jobs=5: 197s → 16s (12.5× faster)

`DAFT_NO_UPDATE_CHECK` and `DAFT_NO_TRUST_PRUNE` already gate the other two spawn-self startup tasks; adding `DAFT_NO_LOG_CLEAN` follows the same pattern. The env var is read by `log_clean::is_disabled` in production code, so no daft-side changes are needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
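On the runner side, the suppression amounts to setting one environment variable on each spawned `daft` process. The sketch below assumes the runner builds its invocations with `std::process::Command`; the helper name `scenario_command` is hypothetical, and the value `1` follows the `DAFT_NO_LOG_CLEAN=1` form used in the PR summary.

```rust
use std::process::Command;

/// Hypothetical helper: prepares a `daft` invocation for one test scenario.
fn scenario_command(daft_bin: &str) -> Command {
    let mut cmd = Command::new(daft_bin);
    // Ask daft not to spawn its detached `__clean-logs` child for this run.
    cmd.env("DAFT_NO_LOG_CLEAN", "1");
    cmd
}

fn main() {
    let cmd = scenario_command("daft");
    // Inspect the explicitly configured environment without running anything.
    for (key, value) in cmd.get_envs() {
        println!("{key:?} = {value:?}");
    }
}
```

Setting the variable per command, rather than in the runner's own process environment, keeps the change scoped to the children the runner launches.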
Adds a benchmark harness so #509 progress is measurable SHA-over-SHA. Companion to `mise run bench` (daft-vs-git command runtime) but at a different layer: this measures the YAML manual-test runner's throughput, not individual daft command latency.

- `mise run bench:tests:manual:scale`: hyperfine parameter-scan over `--jobs` values; writes per-trial wall-clock to `benches/results/test-manual-scale.{md,json}`.
- `mise run bench:tests:manual:scale-{baseline,compare}`: pin/diff workflow matching the existing `bench:{baseline,compare}` pair.
- `benches/scenarios/test_manual_scale.sh`: driver. Sweeps BENCH_JOBS (default 1,2,4,8), 3 trials per value, then a Phase 2 per-scenario timing pass at jobs=1 and at `--parallel`'s default cap for p50/p95/max distribution.
- Opt-in per-scenario timing via `DAFT_MANUAL_TEST_EMIT_TIMING=1`: the runner emits one grep-friendly `[bench] scenario="X" elapsed_ms=N` line per scenario, scoped to the runner half (excludes sandbox setup/teardown overhead so cumulative ≠ wall-clock).
- `benches/README.md` documents both bench families (daft-vs-git and test-runner) and the local vs. checked-in baseline split.
- `benches/baselines/test-manual-scale-2026-05-17.md`: first reference baseline (M1 Max, 10/10 cores). Shows 3.92× speedup at the default cap, 4.78× at full saturation, scenario failures appearing past num_cpus/2. This is concrete evidence that `available_parallelism()/2` is well-chosen.

Phase 1 results from that baseline (full 572-scenario corpus):

| `--jobs` | Wall-clock | Speedup | Efficiency | Notes |
|---|---|---|---|---|
| 1 | 392.10 s ± 1.51 s | 1.00× | | serial reference |
| 2 | 217.22 s ± 0.94 s | 1.81× | 90% | |
| 4 | 113.78 s ± 0.46 s | 3.45× | 86% | |
| 5 | 100.00 s ± 3.45 s | 3.92× | 78% | `--parallel` default |
| 8 | ~144 s | 2.71× | 34% | flaky |
| 10 | 82.19 s | 4.78× | 48% | |

End-to-end vs pre-#510 main on the same machine: 1586 s → 100 s, a 15.9× wall-clock reduction. The bulk of the win at `--jobs 1` (4.05×) comes from the orphan-spawn suppression fix in #510's prior commit; the remaining 3.92× is the parallelism this PR introduces.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
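The speedup and efficiency figures in the Phase 1 results appear to follow the usual definitions: speedup is serial wall-clock divided by parallel wall-clock, and efficiency is speedup divided by the job count. A small sketch under that assumption, using the mean wall-clock values quoted above (the function names are illustrative, not part of the bench driver):

```rust
/// Speedup of a parallel run relative to the serial reference.
fn speedup(serial_secs: f64, parallel_secs: f64) -> f64 {
    serial_secs / parallel_secs
}

/// Parallel efficiency: speedup per worker (1.0 would be perfect scaling).
fn efficiency(serial_secs: f64, parallel_secs: f64, jobs: u32) -> f64 {
    speedup(serial_secs, parallel_secs) / f64::from(jobs)
}

fn main() {
    // Mean wall-clock seconds from the Phase 1 results.
    let serial = 392.10;
    let runs = [(2u32, 217.22), (4, 113.78), (5, 100.00)];
    for (jobs, secs) in runs {
        println!(
            "--jobs {jobs}: {:.2}x speedup, {:.0}% efficiency",
            speedup(serial, secs),
            efficiency(serial, secs, jobs) * 100.0
        );
    }
}
```

The two end-to-end factors compose the same way: 1586 s / 392.10 s ≈ 4.05× from the orphan-spawn fix, times 3.92× from parallelism, gives the quoted ~15.9×.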
Pairs with the parallelization landing in this PR: CI was still on the serial default, so the speedup wasn't visible in workflow runtime. `--jobs $(nproc)` saturates the runner since CI is dedicated to the job, so there is no need for `--parallel`'s `num_cpus/2` headroom, which exists for concurrent local work.

Baseline serial CI timing on this PR (ubuntu-latest):

- integration-tests (default, yaml): 2m 11s
- integration-tests (gitoxide, yaml): 2m 02s

Next run will show the delta.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
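How the job count might be resolved from these inputs can be sketched as below. The precedence shown (explicit `--jobs`, then `DAFT_MANUAL_TEST_JOBS`, then `--parallel`'s default cap, then serial) is an assumption made for illustration; the PR does not spell it out. `resolve_jobs`, `default_cap`, and `JobsError` are hypothetical names.

```rust
use std::num::NonZeroUsize;
use std::thread;

#[derive(Debug, PartialEq)]
enum JobsError {
    Zero,
    InvalidEnv(String),
    IncompatibleMode,
}

/// Default cap for `--parallel`: half the available parallelism, at least 1.
fn default_cap(cpus: usize) -> usize {
    (cpus / 2).max(1)
}

/// Assumed precedence: `--jobs` flag, then env var, then `--parallel`, then serial.
fn resolve_jobs(
    parallel: bool,
    jobs_flag: Option<usize>,
    env_jobs: Option<&str>,
    cpus: usize,
    interactive_or_setup_only: bool,
) -> Result<usize, JobsError> {
    let jobs = if let Some(n) = jobs_flag {
        n
    } else if let Some(raw) = env_jobs {
        raw.trim()
            .parse::<usize>()
            .map_err(|_| JobsError::InvalidEnv(raw.to_string()))?
    } else if parallel {
        default_cap(cpus)
    } else {
        1 // serial stays the default
    };
    if jobs == 0 {
        return Err(JobsError::Zero);
    }
    // Interactive and setup-only modes cannot share a TTY or stdout across workers.
    if interactive_or_setup_only && jobs > 1 {
        return Err(JobsError::IncompatibleMode);
    }
    Ok(jobs)
}

fn main() {
    let cpus = thread::available_parallelism()
        .map(NonZeroUsize::get)
        .unwrap_or(1);
    println!("--parallel on this machine: {:?}", resolve_jobs(true, None, None, cpus, false));
    println!("CI-style --jobs {cpus}: {:?}", resolve_jobs(false, Some(cpus), None, cpus, false));
}
```

On the 10-core machine used for the baseline, `default_cap` yields 5, matching the `--parallel` default in the benchmark results, while a dedicated CI runner can pass the full core count explicitly.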
a156d81 to
e6beb36
Summary
- Adds `--parallel` and `--jobs N` (plus `DAFT_MANUAL_TEST_JOBS`) to the xtask manual-test runner, executing scenarios concurrently via rayon with per-scenario buffered output flushed in input order. Default stays serial; a follow-up PR will flip the default after a week of opt-in use.
- Sets `DAFT_NO_LOG_CLEAN=1`, suppressing detached `daft __clean-logs` children that would otherwise outlive xtask scenarios, get reparented to init, and steal CPU. This was effectively a precondition for parallel scaling working at all on local-dev runs.
- `mise run bench:tests:manual:scale` does a hyperfine parameter-scan over `--jobs` values + an opt-in per-scenario timing pass for p50/p95/max distributions. Methodology in `benches/README.md`; first reference baseline checked in at `benches/baselines/test-manual-scale-2026-05-17.md`.

Speedup (M1 Max 10-core, 572 scenarios, full corpus)

Configurations compared: `--jobs 1`, `--parallel` (cap=5), `--jobs 10`.

End-to-end: 26 min → 1.7 min on a 10-core box.

Scaling is near-linear up to `num_cpus/2`: jobs=2 (1.81× / 90%), jobs=4 (3.45× / 86%), jobs=5 (3.92× / 78%). Past `num_cpus/2`, returns diminish and scenario flakes appear, so `available_parallelism()/2` is well-calibrated as the default cap. Zero new failures at the default cap across all measured runs.

Per-scenario distribution at the default cap stays clean: p50 567 ms, p95 1293 ms. That is only a ~14% per-scenario slowdown from CPU contention, paid back many times over in wall-clock.
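A p50/p95/max summary like the one above can be derived from the `[bench] scenario="X" elapsed_ms=N` lines that the runner emits under `DAFT_MANUAL_TEST_EMIT_TIMING=1`. The sketch below is illustrative only: it assumes a nearest-rank percentile (the bench driver's actual method is not stated), the function names are hypothetical, and the sample lines are made-up values rather than measured data.

```rust
/// Extracts `elapsed_ms` from a line shaped like
/// `[bench] scenario="X" elapsed_ms=N`; returns None for any other line.
fn parse_elapsed_ms(line: &str) -> Option<u64> {
    let rest = line.trim().strip_prefix("[bench] ")?;
    let (_, value) = rest.rsplit_once("elapsed_ms=")?;
    value.trim().parse().ok()
}

/// Nearest-rank percentile over an ascending-sorted slice, for p in 1..=100.
fn percentile(sorted: &[u64], p: usize) -> Option<u64> {
    if sorted.is_empty() || p == 0 || p > 100 {
        return None;
    }
    // 1-based rank = ceil(p / 100 * n), computed in integer arithmetic.
    let rank = (p * sorted.len() + 99) / 100;
    sorted.get(rank - 1).copied()
}

fn main() {
    // Made-up sample output; non-[bench] lines are skipped by the parser.
    let log = "\
[bench] scenario=\"one\" elapsed_ms=400
running scenario two
[bench] scenario=\"two\" elapsed_ms=650
[bench] scenario=\"three\" elapsed_ms=520
[bench] scenario=\"four\" elapsed_ms=1300";

    let mut samples: Vec<u64> = log.lines().filter_map(parse_elapsed_ms).collect();
    samples.sort_unstable();

    println!("p50 = {:?} ms", percentile(&samples, 50));
    println!("p95 = {:?} ms", percentile(&samples, 95));
    println!("max = {:?} ms", samples.last());
}
```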
Rollout
The runner ships with serial as the default. `--parallel` is opt-in; `--jobs N` and `DAFT_MANUAL_TEST_JOBS` are the explicit overrides. The acceptance criteria's "parallel by default" flips in a follow-up PR after a week of opt-in use surfaces no new flakes. Interactive mode and `--setup-only` hard-error when `--jobs > 1`: their semantics (TTY ownership, `println!` of `work_dir` for shell capture) don't survive the buffered worker model.

CI
CI in `test.yml:487` still invokes serial. A follow-up commit on this PR will enable `--parallel` there to measure the impact on CI wall-clock.

Test plan
- `mise run test:unit` locally (1721 + 65 xtask passing)
- `mise run clippy` zero warnings
- `target/release/xtask manual-test --ci --parallel` produces deterministic input-ordered output matching serial
- `failed_scenarios` list in input order under both serial and parallel

Refs #510, part of #509.
🤖 Generated with Claude Code