
fix(coordinator): honor needs: between background jobs (#454)#463

Merged
avihut merged 12 commits into master from
daft-454/fix/background-job-coordinator-ignores-needs
May 2, 2026
Conversation

Owner

@avihut avihut commented May 2, 2026

Summary

  • Reuse the foreground DagGraph machinery in the background coordinator so
    bg→bg needs: is honored at runtime, not just on paper.
  • run_single_background_job now returns NodeStatus; a hand-rolled
    wave scheduler in run_all_with_cancel spawns one std::thread::spawn
    per ready node, joins each wave, and cascades DepFailed to dependents.
  • DagGraph is reused for cycle / missing-dep detection and dependent
    lookups but not for execution: DagGraph::run_parallel builds on
    std::thread::scope, which can deadlock post-libc::fork() on macOS due
    to inherited malloc-arena state. The wave scheduler avoids that by using
    the same bare-spawn primitive the pre-fix coordinator used.
  • Cancelled / failed deps cascade DepFailed to dependents; dep-failed bg
    jobs synthesize a JobStatus::Skipped meta.json so they remain visible
    to daft hooks jobs.

Fixes #454.

Test plan

  • Unit: bg_dependent_waits_for_dep_to_finish (regression for #454: background hook coordinator ignores needs: between background jobs)
  • Unit: bg_dependent_skipped_when_dep_fails
  • Unit: bg_dependent_skipped_when_dep_cancelled
  • Unit: bg_cycle_in_needs_returns_error, bg_missing_dep_in_needs_returns_error
  • YAML scenario: tests/manual/scenarios/hooks/bg-needs-ordering.yml
  • All existing coordinator unit tests still pass (85/85 in coordinator module)
  • Full integration matrix (default+gitoxide, bash+yaml): 504 scenarios, 1872 steps, all pass
  • mise run fmt:check, mise run clippy, mise run test:unit, mise run test:integration

avihut added 11 commits May 2, 2026 16:40
- Add Runtime Contract section to background-hook-jobs spec, making
  explicit that bg jobs must honor needs:, that Failed/Cancelled/Skipped
  deps cascade DepFailed to dependents, and that the bg closure maps
  anything-but-Completed to NodeStatus::Failed for DAG cascade purposes.
- Add the implementation plan for daft#454 (bg coordinator currently
  ignores needs: between background jobs).

Refs #454.
Refs #454. The test will pass once the coordinator's scheduler is
replaced with DagGraph::run_parallel in a later commit.
Precondition for plugging the function into DagGraph::run_parallel
in the next commit. Cancelled/Skipped outcomes collapse to Failed
for DAG cascade purposes; the on-disk JobMeta retains the precise
Completed/Failed/Cancelled distinction.
Address review feedback on 7d9a855. The production call site inside
the fan-out loop discards the new NodeStatus return; make the discard
explicit with a TODO referencing the next task that removes the loop
entirely.
Replace the unconditional fan-out loop in run_all_with_cancel with a
DagGraph::run_parallel call. The bg coordinator now uses the same
scheduler primitive the foreground runner uses, so bg→bg needs: is
honored at runtime in addition to being parsed and recorded.

Failed/Cancelled deps cascade DepFailed to dependents (no spawn).
Cycles and missing-dep references in the bg bucket are returned as
errors instead of silently fanning out.

Closes #454.
Address review feedback on 423342f.

- Synthesize meta.json + JobResult for bg jobs the DAG marks
  DepFailed. Without this, dep-failed jobs would be silently absent
  from `daft hooks jobs` listings — a UX regression vs the buggy
  pre-fix fan-out.
- Replace bare ? in fork_coordinator's child branch with explicit
  eprintln + process::exit(1) so DAG errors (cycles, missing deps)
  produce a visible diagnostic instead of a silent coordinator exit.

Refs #454.
Refs #454. Verifies (a) the dependent's command never runs (no marker
file produced), and (b) the coordinator synthesizes a JobStatus::Skipped
meta on disk so the job remains visible in `daft hooks jobs`.
Refs #454. A killer thread cancels A mid-run via cancel_single_job;
B (needs: [A]) must not run and must appear on disk as Skipped via the
coordinator's dep-failed synthesis path.
Refs #454. Verifies clause 4 of the Runtime Contract: cycles and
missing-dependency references in the bg bucket surface as errors
rather than silently fanning out.
Two background jobs where the second declares `needs:` on the first; the
scenario asserts dep-b's start timestamp is >= dep-a's finish
timestamp. Uses python3 time.time_ns() for cross-platform
nanosecond resolution (macOS `date` does not support %N).
…ork safe)

The earlier Task 4 attempt (423342f, efc15b4) used DagGraph::run_parallel,
which is built on std::thread::scope. On macOS, std::thread::scope can
deadlock after libc::fork() — malloc-arena locks inherited from the parent
can leave the second allocation inside a scoped worker hung indefinitely. The
hang only surfaces in the integration tests because unit tests call
run_all_with_cancel directly, without forking. Two pre-existing scenarios
(background-jobs, mixed-fg-bg-invocation) regressed silently before this
fix.

Replace the run_parallel call with a hand-rolled wave scheduler that
reuses bare std::thread::spawn / JoinHandle::join (the primitive the
buggy pre-fix coordinator used and which is known to work post-fork).
DagGraph is still used for cycle / missing-dep detection and dependent
lookups (no allocation-heavy paths after construction). Each wave spawns
one thread per ready node, joins them all, and applies a status cascade
that marks transitive dependents DepFailed when a parent Fails.

Spec and plan are updated to document the post-fork constraint as
binding so the runtime contract for needs: cannot drift back into a
thread::scope-based implementation.

Refs #454.
@avihut avihut added this to the Public Launch milestone May 2, 2026
@avihut avihut added the fix Bug fix label May 2, 2026
@avihut avihut self-assigned this May 2, 2026
Owner Author

avihut commented May 2, 2026

/claude review


claude Bot commented May 2, 2026

Claude finished @avihut's task in 4m 19s


PR Review: fix(coordinator): honor needs: between background jobs

  • Gather context: read changed files and git diff
  • Review src/coordinator/process.rs — correctness, Rust idioms, safety
  • Review tests/manual/scenarios/hooks/bg-needs-ordering.yml — test coverage
  • Review docs changes
  • Post final review

Summary

The fix is sound. The wave-based scheduler correctly implements topological ordering for background jobs, the DFS cascade for transitive DepFailed propagation is correct, and the macOS fork-safety concern (avoiding thread::scope post-fork()) is properly documented and addressed. Test coverage is thorough: unit tests cover the regression, fail-cascade, cancel-cascade, cycle, and missing-dep cases, and the YAML scenario provides end-to-end coverage.

A few things worth discussing before merge:


Issues

1. NodeStatus::Skipped arm in wave outcomes is dead code

src/coordinator/process.rs, lines 173–198

match status {
    NodeStatus::Succeeded | NodeStatus::Skipped => {
        // advance dependents
    }
    NodeStatus::Failed => { ... }
    _ => {}
}

run_single_background_job returns only NodeStatus::Succeeded or NodeStatus::Failed (the final if/else at lines 478–482 makes this explicit). The Skipped variant in the first arm is unreachable. It's harmless today, but it implies to readers that a job can return Skipped, and someone may rely on that implication in the future. Consider either removing it or adding a comment explaining it's reserved for future use.



2. Cancelled-before-start early return: JobResult.status is Skipped, return value is Failed

src/coordinator/process.rs, lines 279–289

if cancel_all.load(Ordering::Relaxed) {
    results.lock().unwrap().push(JobResult {
        status: NodeStatus::Skipped,   // ← what goes into results
        ...
    });
    return NodeStatus::Failed;          // ← what the DAG cascade sees
}

The semantic split — Skipped in the result, Failed to the wave scheduler — is explained at the bottom of the function (lines 474–482). But that comment is ~190 lines away from the early-return site. A reader of the early-return branch will see the inconsistency without context. A one-line inline comment on the return NodeStatus::Failed would make the intent clear without requiring the reader to scroll. Also note that this early-return path doesn't write a meta.json, so the cancelled-before-start job is invisible to daft hooks jobs — that's pre-existing but slightly surprising given that the PR adds synthesis for DepFailed jobs.


3. Wave scheduler serializes independent parallel chains (intentional, but worth documenting)

src/coordinator/process.rs, lines 125–199

The wave scheduler waits for every node in the current wave before advancing, which means:

A (slow, no deps) 
B (fast, no deps) → C

Here C waits for A even though they are completely independent. With a free-running scheduler (like DagGraph::run_parallel), C would start as soon as B finishes. The PR body explains the macOS fork-safety rationale, but the code comment inside the function doesn't mention the trade-off — it only states the reason for using bare spawns, not the implication for independent chains. For a docs/superpowers/specs audience this is probably fine, but a future maintainer might "fix" the serialization and reintroduce the deadlock. A brief comment near the while !ready.is_empty() loop noting "this serializes waves intentionally" would protect the invariant.


4. SKILL.md not updated (CLAUDE.md requires this)

SKILL.md lines 374–375 currently say:

Background jobs participate in the DAG. If a foreground job depends on a background job, the background job is promoted to foreground automatically.

This describes the foreground-promotes-background direction, but says nothing about bg→bg needs: ordering being honored at runtime (which was the bug). A reader of SKILL.md today would have no guarantee that needs: between two background-only jobs is respected by the coordinator. CLAUDE.md explicitly requires updating SKILL.md when feature behavior changes. Suggest adding a bullet to the "Key behaviors" list (around line 381) such as:

  • needs: between background jobs is honored at runtime — the coordinator schedules them in topological wave order.



5. Large planning doc committed to the repo

docs/superpowers/plans/2026-05-02-bg-jobs-honor-needs.md — 1022 lines

This appears to be an agent-generated implementation plan. It's the kind of document CLAUDE.md discourages committing ("Do not create planning, decision, or analysis documents unless the user asks for them"). Is this directory (docs/superpowers/plans/) intentionally part of the public repo? If so, that's fine and no action needed — just confirming intent.


Minor observations

  • bg-needs-ordering.yml uses python3 for nanoseconds: date +%s%N is available on Linux GNU coreutils and avoids the Python dependency, though python3 is fine for the CI environment this targets.

  • bg_dependent_skipped_when_dep_cancelled uses a global temp path (/tmp/should-never-exist-bg-needs-cancel): two parallel test runs could race on this file. Consider tempfile::NamedTempFile or a unique suffix.

  • handle.join().unwrap_or(NodeStatus::Failed): solid defensive choice — a panicked worker is treated as Failed, which fires the cascade correctly.

  • DAG index → job index alignment: The nodes vec is built from self.jobs.iter() in the same order, so self.jobs[idx] is always the right job for graph-index idx. This implicit alignment is an invariant worth a comment if the jobs collection is ever made mutable after graph construction.


Test coverage ✓

The new unit tests are comprehensive and well-structured:

  • bg_dependent_waits_for_dep_to_finish — timing assertion using started_at/finished_at timestamps
  • bg_dependent_skipped_when_dep_fails — synthesized Skipped meta + side-effect check
  • bg_dependent_skipped_when_dep_cancelled — cancel signal cascade + meta check
  • bg_cycle_in_needs_returns_error / bg_missing_dep_in_needs_returns_error — DAG validation
  • YAML scenario end-to-end regression

All five regression cases from the checklist in the PR body are covered, and all 85 existing coordinator tests pass, confirming no regressions.

- SKILL.md: document that needs: between background jobs is honored at
  runtime by the wave scheduler, and that dep-failed bg jobs are recorded
  as Skipped (CLAUDE.md requires SKILL.md updates when feature behavior
  changes).
- src/coordinator/process.rs: add comments documenting non-obvious
  invariants flagged in review:
    * The wave loop intentionally waits for the whole wave before
      advancing — independent fast chains can be held up behind slow
      ones, but this is the simplest shape provably safe across
      libc::fork() on macOS without using std::thread::scope.
    * The `Skipped` arm in the wave_outcomes match is reserved for
      future closure variants; today run_single_background_job returns
      only Succeeded/Failed.
    * The cancel-all early-return splits semantics deliberately:
      JobResult.status is Skipped (the user cancelled), but the value
      returned to the wave scheduler is Failed so dependents cascade to
      DepFailed.
- Tests: marker files for the failure-cascade and cancellation-cascade
  unit tests now live inside the per-test TempDir instead of a shared
  /tmp path, eliminating a race risk on parallel test runs.

Refs #463 review.
@avihut avihut merged commit d004c79 into master May 2, 2026
11 checks passed
@avihut avihut deleted the daft-454/fix/background-job-coordinator-ignores-needs branch May 2, 2026 19:02