fix(coordinator): honor needs: between background jobs (#454) #463
- Add Runtime Contract section to background-hook-jobs spec, making explicit that bg jobs must honor needs:, that Failed/Cancelled/Skipped deps cascade DepFailed to dependents, and that the bg closure maps anything-but-Completed to NodeStatus::Failed for DAG cascade purposes.
- Add the implementation plan for daft#454 (bg coordinator currently ignores needs: between background jobs).

Refs #454.
Refs #454. The test will pass once the coordinator's scheduler is replaced with DagGraph::run_parallel in a later commit.
Precondition for plugging the function into DagGraph::run_parallel in the next commit. Cancelled/Skipped outcomes collapse to Failed for DAG cascade purposes; the on-disk JobMeta retains the precise Completed/Failed/Cancelled distinction.
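A minimal sketch of the collapse described in this commit, assuming a four-variant `JobStatus` and a simplified two-variant `NodeStatus`. The real enums in the codebase may carry more variants or data, and `node_status_for` is a hypothetical name used only for illustration.

```rust
/// Precise outcome of a background job, as it might be kept in the on-disk
/// metadata. Variant names follow the commit text; the exact shape is assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobStatus {
    Completed,
    Failed,
    Cancelled,
    Skipped,
}

/// Coarse status handed back to the DAG scheduler (simplified for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NodeStatus {
    Succeeded,
    Failed,
}

/// Hypothetical helper: anything other than `Completed` counts as `Failed`
/// for cascade purposes, while the caller keeps the precise `JobStatus`.
pub fn node_status_for(status: JobStatus) -> NodeStatus {
    match status {
        JobStatus::Completed => NodeStatus::Succeeded,
        JobStatus::Failed | JobStatus::Cancelled | JobStatus::Skipped => NodeStatus::Failed,
    }
}
```

Keeping the two enums separate lets the scheduler stay simple (run dependents or do not) without losing the detail that users see in job listings.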
Address review feedback on 7d9a855. The production call site inside the fan-out loop discards the new NodeStatus return; make the discard explicit with a TODO referencing the next task that removes the loop entirely.
Replace the unconditional fan-out loop in run_all_with_cancel with a DagGraph::run_parallel call. The bg coordinator now uses the same scheduler primitive the foreground runner uses, so bg→bg needs: is honored at runtime in addition to being parsed and recorded. Failed/Cancelled deps cascade DepFailed to dependents (no spawn). Cycles and missing-dep references in the bg bucket are returned as errors instead of silently fanning out. Closes #454.
Address review feedback on 423342f.

- Synthesize meta.json + JobResult for bg jobs the DAG marks DepFailed. Without this, dep-failed jobs would be silently absent from `daft hooks jobs` listings, a UX regression vs the buggy pre-fix fan-out.
- Replace bare ? in fork_coordinator's child branch with explicit eprintln + process::exit(1) so DAG errors (cycles, missing deps) produce a visible diagnostic instead of a silent coordinator exit.

Refs #454.
Refs #454. Verifies (a) the dependent's command never runs (no marker file produced), and (b) the coordinator synthesizes a JobStatus::Skipped meta on disk so the job remains visible in `daft hooks jobs`.
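The synthesis step this test checks can be pictured roughly as follows. This is a sketch under assumptions: `JobMeta` here is a two-field stand-in for the real on-disk record, `synthesize_dep_failed` is a hypothetical name, and the records are returned in memory rather than written as meta.json.

```rust
use std::collections::BTreeMap;

/// Scheduler-side status (simplified; assumed shape).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NodeStatus {
    Succeeded,
    Failed,
    DepFailed,
}

/// User-visible status (variant names follow the commit text).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobStatus {
    Completed,
    Failed,
    Cancelled,
    Skipped,
}

/// Hypothetical stand-in for the on-disk job metadata.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct JobMeta {
    pub name: String,
    pub status: JobStatus,
}

/// Builds a `Skipped` record for every job the scheduler marked `DepFailed`.
/// Such jobs never spawned, so nothing else would have recorded them.
pub fn synthesize_dep_failed(outcomes: &BTreeMap<String, NodeStatus>) -> Vec<JobMeta> {
    outcomes
        .iter()
        .filter(|(_, status)| **status == NodeStatus::DepFailed)
        .map(|(name, _)| JobMeta {
            name: name.clone(),
            status: JobStatus::Skipped,
        })
        .collect()
}
```

Jobs that actually ran are left out on purpose, since in this sketch they are assumed to have written their own metadata.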
Refs #454. A killer thread cancels job A mid-run via cancel_single_job; B (needs: [A]) must not run and must appear on disk as Skipped via the coordinator's dep-failed synthesis path.
Refs #454. Verifies clause 4 of the Runtime Contract: cycles and missing-dependency references in the bg bucket surface as errors rather than silently fanning out.
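For illustration, the two error cases named here can be detected before any job is spawned. The sketch below is not the project's `DagGraph`; `DagError` and `validate_needs` are hypothetical names, and the input is assumed to be a map from job name to the names it `needs:`.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum DagError {
    MissingDep { job: String, dep: String },
    Cycle,
}

/// Checks a `job -> needs` map for unknown dependencies and cycles.
pub fn validate_needs(needs: &BTreeMap<String, Vec<String>>) -> Result<(), DagError> {
    // Every referenced dependency must itself be a job in the same bucket.
    for (job, deps) in needs {
        for dep in deps {
            if !needs.contains_key(dep) {
                return Err(DagError::MissingDep {
                    job: job.clone(),
                    dep: dep.clone(),
                });
            }
        }
    }

    // Kahn-style peeling: repeatedly mark jobs whose deps are all marked.
    let mut done: BTreeSet<&str> = BTreeSet::new();
    loop {
        let ready: Vec<&str> = needs
            .iter()
            .filter(|(job, deps)| {
                !done.contains(job.as_str()) && deps.iter().all(|d| done.contains(d.as_str()))
            })
            .map(|(job, _)| job.as_str())
            .collect();
        if ready.is_empty() {
            break;
        }
        done.extend(ready);
    }

    // Anything that could never be peeled off sits on (or behind) a cycle.
    if done.len() == needs.len() {
        Ok(())
    } else {
        Err(DagError::Cycle)
    }
}
```

Returning an error up front is what keeps a malformed bucket from silently fanning out.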
Two background jobs where the second `needs:` the first; the scenario asserts dep-b's start timestamp is >= dep-a's finish timestamp. Uses python3 time.time_ns() for cross-platform nanosecond resolution (macOS `date` does not support %N).
…ork safe)

The earlier Task 4 attempt (423342f, efc15b4) used DagGraph::run_parallel, which is built on std::thread::scope. On macOS, std::thread::scope can deadlock after libc::fork(): the malloc arenas inherited from the parent leave the second allocation inside a scoped worker hung indefinitely. The hang only surfaces in the integration tests because unit tests call run_all_with_cancel directly, without forking. Two pre-existing scenarios (background-jobs, mixed-fg-bg-invocation) regressed silently before this fix.

Replace the run_parallel call with a hand-rolled wave scheduler that reuses bare std::thread::spawn / JoinHandle::join (the primitive the buggy pre-fix coordinator used and which is known to work post-fork). DagGraph is still used for cycle / missing-dep detection and dependent lookups (no allocation-heavy paths after construction). Each wave spawns one thread per ready node, joins them all, and applies a status cascade that marks transitive dependents DepFailed when a parent fails.

Spec and plan are updated to document the post-fork constraint as binding so the runtime contract for needs: cannot drift back into a thread::scope-based implementation.

Refs #454.
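The wave loop described in this commit might look roughly like the sketch below. It is an illustration only, not the code in src/coordinator/process.rs: `Job` and `run_waves` are hypothetical names, each job is reduced to a plain function returning success or failure, and the graph is assumed to have been validated already (no cycles, no missing deps).

```rust
use std::collections::HashMap;
use std::thread;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NodeStatus {
    Succeeded,
    Failed,
    DepFailed,
}

/// Hypothetical job description: a name, its `needs:` list, and the work.
pub struct Job {
    pub name: &'static str,
    pub needs: Vec<&'static str>,
    pub run: fn() -> bool,
}

/// Runs jobs wave by wave using bare `thread::spawn` / `JoinHandle::join`.
pub fn run_waves(jobs: Vec<Job>) -> HashMap<&'static str, NodeStatus> {
    let mut status: HashMap<&'static str, NodeStatus> = HashMap::new();
    let mut pending = jobs;

    while !pending.is_empty() {
        // Cascade: a job with a Failed or DepFailed parent is never spawned.
        // Repeat until nothing changes so transitive dependents are caught.
        loop {
            let before = pending.len();
            pending.retain(|job| {
                let blocked = job.needs.iter().any(|d| {
                    matches!(status.get(d), Some(NodeStatus::Failed | NodeStatus::DepFailed))
                });
                if blocked {
                    status.insert(job.name, NodeStatus::DepFailed);
                }
                !blocked
            });
            if pending.len() == before {
                break;
            }
        }

        // Ready nodes are those whose deps have all succeeded.
        let (ready, waiting): (Vec<Job>, Vec<Job>) = pending.into_iter().partition(|job| {
            job.needs
                .iter()
                .all(|d| status.get(d) == Some(&NodeStatus::Succeeded))
        });
        pending = waiting;
        if ready.is_empty() {
            break; // nothing runnable; only reachable with an unvalidated graph
        }

        // Spawn the whole wave first, then join every handle before advancing.
        let handles: Vec<(&'static str, thread::JoinHandle<bool>)> = ready
            .into_iter()
            .map(|job| (job.name, thread::spawn(job.run)))
            .collect();
        for (name, handle) in handles {
            // A panicking worker is treated as a failure in this sketch.
            let ok = handle.join().unwrap_or(false);
            let outcome = if ok { NodeStatus::Succeeded } else { NodeStatus::Failed };
            status.insert(name, outcome);
        }
    }
    status
}
```

Joining the whole wave before starting the next one is simpler than tracking per-node readiness, at the cost that a fast chain can wait behind a slow sibling in the same wave.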
/claude review
Claude finished @avihut's task in 4m 19s

PR Review: fix(coordinator): honor needs: between background jobs

Summary

The fix is sound. The wave-based scheduler correctly implements topological ordering for background jobs, the DFS cascade for transitive

A few things worth discussing before merge:

Issues

1.
- SKILL.md: document that needs: between background jobs is honored at
  runtime by the wave scheduler, and that dep-failed bg jobs are recorded
  as Skipped (CLAUDE.md requires SKILL.md updates when feature behavior
  changes).
- src/coordinator/process.rs: add comments documenting non-obvious
  invariants flagged in review:
  * The wave loop intentionally waits for the whole wave before
    advancing; independent fast chains can be held up behind slow
    ones, but this is the simplest shape provably safe across
    libc::fork() on macOS without using std::thread::scope.
  * The `Skipped` arm in the wave_outcomes match is reserved for
    future closure variants; today run_single_background_job returns
    only Succeeded/Failed.
  * The cancel-all early-return splits semantics deliberately:
    JobResult.status is Skipped (the user cancelled), but the value
    returned to the wave scheduler is Failed so dependents cascade to
    DepFailed.
- Tests: marker files for the failure-cascade and cancellation-cascade
  unit tests now live inside the per-test TempDir instead of a shared
  /tmp path, eliminating a race risk on parallel test runs.

Refs #463 review.
Summary

- `DagGraph` machinery in the background coordinator so bg→bg `needs:` is honored at runtime, not just on paper.
- `run_single_background_job` now returns `NodeStatus`; a hand-rolled wave scheduler in `run_all_with_cancel` spawns one thread (via `std::thread::spawn`) per ready node, joins each wave, and cascades `DepFailed` to dependents.
- `DagGraph` is reused for cycle / missing-dep detection and dependent lookups but not for execution: `DagGraph::run_parallel` builds on `std::thread::scope`, which can deadlock post-`libc::fork()` on macOS due to inherited malloc-arena state. The wave scheduler avoids that by using the same bare-spawn primitive the pre-fix coordinator used.
- Failed/Cancelled deps cascade `DepFailed` to dependents; dep-failed bg jobs synthesize a `JobStatus::Skipped` `meta.json` so they remain visible to `daft hooks jobs`.

Fixes #454.
Test plan

- `bg_dependent_waits_for_dep_to_finish` (regression for "Background hook coordinator ignores `needs:` between background jobs", #454)
- `bg_dependent_skipped_when_dep_fails`
- `bg_dependent_skipped_when_dep_cancelled`
- `bg_cycle_in_needs_returns_error`, `bg_missing_dep_in_needs_returns_error`
- `tests/manual/scenarios/hooks/bg-needs-ordering.yml`
- `mise run fmt:check`, `mise run clippy`, `mise run test:unit`, `mise run test:integration`