
fix(coordinator): honor needs: between background jobs (#454)#463

Merged
avihut merged 12 commits into master from
daft-454/fix/background-job-coordinator-ignores-needs
May 2, 2026
Conversation

Owner

@avihut avihut commented May 2, 2026

Summary

  • Reuse the foreground DagGraph machinery in the background coordinator so
    bg→bg needs: is honored at runtime, not just on paper.
  • run_single_background_job now returns NodeStatus; a hand-rolled
    wave scheduler in run_all_with_cancel spawns one std::thread::spawn
    per ready node, joins each wave, and cascades DepFailed to dependents.
  • DagGraph is reused for cycle / missing-dep detection and dependent
    lookups but not for execution: DagGraph::run_parallel builds on
    std::thread::scope, which can deadlock post-libc::fork() on macOS due
    to inherited malloc-arena state. The wave scheduler avoids that by using
    the same bare-spawn primitive the pre-fix coordinator used.
  • Cancelled / failed deps cascade DepFailed to dependents; dep-failed bg
    jobs synthesize a JobStatus::Skipped meta.json so they remain visible
    to daft hooks jobs.

Fixes #454.

Test plan

  • Unit: bg_dependent_waits_for_dep_to_finish (regression for #454: background hook coordinator ignores needs: between background jobs)
  • Unit: bg_dependent_skipped_when_dep_fails
  • Unit: bg_dependent_skipped_when_dep_cancelled
  • Unit: bg_cycle_in_needs_returns_error, bg_missing_dep_in_needs_returns_error
  • YAML scenario: tests/manual/scenarios/hooks/bg-needs-ordering.yml
  • All existing coordinator unit tests still pass (85/85 in coordinator module)
  • Full integration matrix (default+gitoxide, bash+yaml): 504 scenarios, 1872 steps, all pass
  • mise run fmt:check, mise run clippy, mise run test:unit, mise run test:integration

avihut added 11 commits May 2, 2026 16:40
- Add Runtime Contract section to background-hook-jobs spec, making
  explicit that bg jobs must honor needs:, that Failed/Cancelled/Skipped
  deps cascade DepFailed to dependents, and that the bg closure maps
  anything-but-Completed to NodeStatus::Failed for DAG cascade purposes.
- Add the implementation plan for daft#454 (bg coordinator currently
  ignores needs: between background jobs).

Refs #454.
Refs #454. The test will pass once the coordinator's scheduler is
replaced with DagGraph::run_parallel in a later commit.
Precondition for plugging the function into DagGraph::run_parallel
in the next commit. Cancelled/Skipped outcomes collapse to Failed
for DAG cascade purposes; the on-disk JobMeta retains the precise
Completed/Failed/Cancelled distinction.
Address review feedback on 7d9a855. The production call site inside
the fan-out loop discards the new NodeStatus return; make the discard
explicit with a TODO referencing the next task that removes the loop
entirely.
Replace the unconditional fan-out loop in run_all_with_cancel with a
DagGraph::run_parallel call. The bg coordinator now uses the same
scheduler primitive the foreground runner uses, so bg→bg needs: is
honored at runtime in addition to being parsed and recorded.

Failed/Cancelled deps cascade DepFailed to dependents (no spawn).
Cycles and missing-dep references in the bg bucket are returned as
errors instead of silently fanning out.

Closes #454.
Address review feedback on 423342f.

- Synthesize meta.json + JobResult for bg jobs the DAG marks
  DepFailed. Without this, dep-failed jobs would be silently absent
  from `daft hooks jobs` listings — a UX regression vs the buggy
  pre-fix fan-out.
- Replace bare ? in fork_coordinator's child branch with explicit
  eprintln + process::exit(1) so DAG errors (cycles, missing deps)
  produce a visible diagnostic instead of a silent coordinator exit.

Refs #454.
Refs #454. Verifies (a) the dependent's command never runs (no marker
file produced), and (b) the coordinator synthesizes a JobStatus::Skipped
meta on disk so the job remains visible in `daft hooks jobs`.
Refs #454. A killer thread cancels A mid-run via cancel_single_job;
B (needs: [A]) must not run and must appear on disk as Skipped via the
coordinator's dep-failed synthesis path.
Refs #454. Verifies clause 4 of the Runtime Contract: cycles and
missing-dependency references in the bg bucket surface as errors
rather than silently fanning out.
Two background jobs where the second declares `needs:` on the first; the
scenario asserts dep-b's start timestamp is >= dep-a's finish
timestamp. Uses python3 time.time_ns() for cross-platform
nanosecond resolution (macOS `date` does not support %N).
…ork safe)

The earlier Task 4 attempt (423342f, efc15b4) used DagGraph::run_parallel,
which is built on std::thread::scope. On macOS, std::thread::scope can
deadlock after libc::fork() — malloc-arena locks inherited from the parent
can leave the second allocation inside a scoped worker hung indefinitely. The
hang only surfaces in the integration tests because unit tests call
run_all_with_cancel directly, without forking. Two pre-existing scenarios
(background-jobs, mixed-fg-bg-invocation) regressed silently before this
fix.

Replace the run_parallel call with a hand-rolled wave scheduler that
reuses bare std::thread::spawn / JoinHandle::join (the primitive the
buggy pre-fix coordinator used and which is known to work post-fork).
DagGraph is still used for cycle / missing-dep detection and dependent
lookups (no allocation-heavy paths after construction). Each wave spawns
one thread per ready node, joins them all, and applies a status cascade
that marks transitive dependents DepFailed when a parent Fails.

Spec and plan are updated to document the post-fork constraint as
binding so the runtime contract for needs: cannot drift back into a
thread::scope-based implementation.

Refs #454.
@avihut avihut added this to the Public Launch milestone May 2, 2026
@avihut avihut added the fix Bug fix label May 2, 2026
@avihut avihut self-assigned this May 2, 2026
Owner Author

avihut commented May 2, 2026

/claude review


claude Bot commented May 2, 2026

Claude finished @avihut's task in 4m 19s


PR Review: fix(coordinator): honor needs: between background jobs

  • Gather context: read changed files and git diff
  • Review src/coordinator/process.rs — correctness, Rust idioms, safety
  • Review tests/manual/scenarios/hooks/bg-needs-ordering.yml — test coverage
  • Review docs changes
  • Post final review

Summary

The fix is sound. The wave-based scheduler correctly implements topological ordering for background jobs, the DFS cascade for transitive DepFailed propagation is correct, and the macOS fork-safety concern (avoiding thread::scope post-fork()) is properly documented and addressed. Test coverage is thorough: unit tests cover the regression, fail-cascade, cancel-cascade, cycle, and missing-dep cases, and the YAML scenario provides end-to-end coverage.

A few things worth discussing before merge:


Issues

1. NodeStatus::Skipped arm in wave outcomes is dead code

src/coordinator/process.rs, lines 173–198

match status {
    NodeStatus::Succeeded | NodeStatus::Skipped => {
        // advance dependents
    }
    NodeStatus::Failed => { ... }
    _ => {}
}

run_single_background_job returns only NodeStatus::Succeeded or NodeStatus::Failed (the final if/else at lines 478–482 makes this explicit). The Skipped variant in the first arm is unreachable. It's harmless today, but it implies to readers that a job can return Skipped, and someone may rely on that implication in the future. Consider either removing it or adding a comment explaining it's reserved for future use.



2. Cancelled-before-start early return: JobResult.status is Skipped, return value is Failed

src/coordinator/process.rs, lines 279–289

if cancel_all.load(Ordering::Relaxed) {
    results.lock().unwrap().push(JobResult {
        status: NodeStatus::Skipped,   // ← what goes into results
        ...
    });
    return NodeStatus::Failed;          // ← what the DAG cascade sees
}

The semantic split — Skipped in the result, Failed to the wave scheduler — is explained at the bottom of the function (lines 474–482). But that comment is ~190 lines away from the early-return site. A reader of the early-return branch will see the inconsistency without context. A one-line inline comment on the return NodeStatus::Failed would make the intent clear without requiring the reader to scroll. Also note that this early-return path doesn't write a meta.json, so the cancelled-before-start job is invisible to daft hooks jobs — that's pre-existing but slightly surprising given that the PR adds synthesis for DepFailed jobs.


3. Wave scheduler serializes independent parallel chains (intentional, but worth documenting)

src/coordinator/process.rs, lines 125–199

The wave scheduler waits for every node in the current wave before advancing, which means:

A (slow, no deps) 
B (fast, no deps) → C

Here C waits for A even though they are completely independent. With a free-running scheduler (like DagGraph::run_parallel), C would start as soon as B finishes. The PR body explains the macOS fork-safety rationale, but the code comment inside the function doesn't mention the trade-off — it only states the reason for using bare spawns, not the implication for independent chains. For a docs/superpowers/specs audience this is probably fine, but a future maintainer might "fix" the serialization and reintroduce the deadlock. A brief comment near the while !ready.is_empty() loop noting "this serializes waves intentionally" would protect the invariant.


4. SKILL.md not updated (CLAUDE.md requires this)

SKILL.md lines 374–375 currently say:

Background jobs participate in the DAG. If a foreground job depends on a background job, the background job is promoted to foreground automatically.

This describes the foreground-promotes-background direction, but says nothing about bg→bg needs: ordering being honored at runtime (which was the bug). A reader of SKILL.md today would have no guarantee that needs: between two background-only jobs is respected by the coordinator. CLAUDE.md explicitly requires updating SKILL.md when feature behavior changes. Suggest adding a bullet to the "Key behaviors" list (around line 381) such as:

  • needs: between background jobs is honored at runtime — the coordinator schedules them in topological wave order.



5. Large planning doc committed to the repo

docs/superpowers/plans/2026-05-02-bg-jobs-honor-needs.md — 1022 lines

This appears to be an agent-generated implementation plan. It's the kind of document CLAUDE.md discourages committing ("Do not create planning, decision, or analysis documents unless the user asks for them"). Is this directory (docs/superpowers/plans/) intentionally part of the public repo? If so, that's fine and no action needed — just confirming intent.


Minor observations

  • bg-needs-ordering.yml uses python3 for nanoseconds: date +%s%N is available on Linux GNU coreutils and avoids the Python dependency, though python3 is fine for the CI environment this targets.

  • bg_dependent_skipped_when_dep_cancelled uses a global temp path (/tmp/should-never-exist-bg-needs-cancel): two parallel test runs could race on this file. Consider tempfile::NamedTempFile or a unique suffix.

  • handle.join().unwrap_or(NodeStatus::Failed): solid defensive choice — a panicked worker is treated as Failed, which fires the cascade correctly.

  • DAG index → job index alignment: The nodes vec is built from self.jobs.iter() in the same order, so self.jobs[idx] is always the right job for graph-index idx. This implicit alignment is an invariant worth a comment if the jobs collection is ever made mutable after graph construction.


Test coverage ✓

The new unit tests are comprehensive and well-structured:

  • bg_dependent_waits_for_dep_to_finish — timing assertion using started_at/finished_at timestamps
  • bg_dependent_skipped_when_dep_fails — synthesized Skipped meta + side-effect check
  • bg_dependent_skipped_when_dep_cancelled — cancel signal cascade + meta check
  • bg_cycle_in_needs_returns_error / bg_missing_dep_in_needs_returns_error — DAG validation
  • YAML scenario end-to-end regression

All five regression cases from the checklist in the PR body are covered, and all 85 existing coordinator tests pass, confirming no regressions.

- SKILL.md: document that needs: between background jobs is honored at
  runtime by the wave scheduler, and that dep-failed bg jobs are recorded
  as Skipped (CLAUDE.md requires SKILL.md updates when feature behavior
  changes).
- src/coordinator/process.rs: add comments documenting non-obvious
  invariants flagged in review:
    * The wave loop intentionally waits for the whole wave before
      advancing — independent fast chains can be held up behind slow
      ones, but this is the simplest shape provably safe across
      libc::fork() on macOS without using std::thread::scope.
    * The `Skipped` arm in the wave_outcomes match is reserved for
      future closure variants; today run_single_background_job returns
      only Succeeded/Failed.
    * The cancel-all early-return splits semantics deliberately:
      JobResult.status is Skipped (the user cancelled), but the value
      returned to the wave scheduler is Failed so dependents cascade to
      DepFailed.
- Tests: marker files for the failure-cascade and cancellation-cascade
  unit tests now live inside the per-test TempDir instead of a shared
  /tmp path, eliminating a race risk on parallel test runs.

Refs #463 review.
@avihut avihut merged commit d004c79 into master May 2, 2026
11 checks passed
@avihut avihut deleted the daft-454/fix/background-job-coordinator-ignores-needs branch May 2, 2026 19:02