Skip to content

feat(coordinator): redesign internals — persistence, IPC, structured logs (#476)#508

Merged
avihut merged 35 commits into
masterfrom
daft-476/chore/coordinator-redesign
May 16, 2026
Merged

feat(coordinator): redesign internals — persistence, IPC, structured logs (#476)#508
avihut merged 35 commits into
masterfrom
daft-476/chore/coordinator-redesign

Conversation

@avihut
Copy link
Copy Markdown
Owner

@avihut avihut commented May 15, 2026

Abstract

The hooks-jobs coordinator (daft __coordinator) backs every background
hook job daft runs and is the source of truth for job status, logs, and
cancel semantics. Until this PR its internals had grown organically:
durability lived in scattered JSON sidecars, the IPC was loose \n-delimited
JSON without versioning, log content went into a single output.log with
no schema, and crashed jobs were invisible until the next manual cleanup.

This PR redesigns those internals around three pillars plus a
hexagonal architecture (store / ports / adapters / domain) that future
DB-backed features land into:

  1. Persistence. A per-repo SQLite database at
    <state>/jobs/<repo>/coordinator.db (WAL mode) replaces filesystem
    sidecars (meta.json, repo-policy.json) as the system of record.
    Writes are atomic via SQLite transactions. A reconcile pass on
    coordinator boot flips rows for processes whose PIDs are gone to
    Crashed, surfacing failures the user could previously only spot by
    inspecting state directories. Critically, the new store also fixes
    the cross-process-access limitation of the design's first iteration
    (see "Migration notes" below).
  2. Wire protocol. Length-prefixed framed JSON envelopes with a
    stable v / ErrorCode taxonomy replace the old line-delimited
    exchange. A TailLogs { follow, since_seq } streaming variant lets
    the CLI live-tail structured records over the socket. The 8 MiB
    frame ceiling bounds local IPC.
  3. Logs. Per-job output.jsonl records ({seq, ts, kind: stdout|stderr|status, ...}) replace the legacy output.log. A
    per-job log.sampling_every_nth knob drops records without breaking
    sequence numbering. daft hooks jobs logs gains --stdout / --stderr / --status / --seq / --since-seq / --since / --follow flags.

Beyond the three pillars, the PR also lands the user-visible payoff for
the new persistence layer: daft hooks jobs ls / logs and the tab-
completion hot path now read from SQLite, so the new Crashed status is
finally visible from the CLI and stays visible while a coordinator is
actively writing (WAL allows concurrent readers).
A CancelMatching { hook, worktree, tag, invocation_prefix, older_than_secs } IPC variant and matching CLI flags collapse what used
to be three special-case helpers into a single predicate.

Three over-grown internal APIs were also tightened with context structs
(JobAdapterContext, HookExecutionContext, extended
JobInvocationContext) — all three lose their
#[allow(clippy::too_many_arguments)].

Closes #476.

Migration notes

The early iteration of this PR used redb for
persistence — pure-Rust, single-file, atomic transactions. During
review the cross-process access pattern surfaced as a hard wall: redb
takes a process-exclusive OS file lock on open, so the CLI could not
read job state while the coordinator was alive. The Tier-1 design
papered over this with a meta.json dual-write that the CLI fell back
to. Removing that workaround revealed the real issue and led to a swap
from redb to SQLite via rusqlite (bundled, WAL mode):

  • Cross-process concurrency model matches daft's actual access pattern
    (many readers + one writer across processes on the same host) — no
    workaround needed.
  • Fully safe Rust API honors daft's forbid(unsafe_code) policy
    without carveouts. LMDB via heed was evaluated and rejected: its
    Env::open is unsafe fn.
  • SQL fits the trust registry's pattern-match queries and the planned
    repo catalog's relational shape — future DB-backed features land in
    the same store layer rather than each one introducing a new DB.

Legacy files left over from earlier builds — coordinator.redb,
repo-policy.json, and per-job meta.json sidecars under
~/.local/state/daft/jobs/*/<inv>/<job>/ — are removed on first
coordinator startup by SqliteJobsStore::for_repo_base_with_wipe,
with a one-line stderr banner per file. The completion / CLI-reader
path uses a non-wiping constructor so the banner cannot leak into the
user's terminal mid-Tab press.

Architectural changes

Persistence: JSON sidecars → SQLite

flowchart LR
    subgraph Before["Before — JSON sidecars"]
        direction TB
        DM1["meta.json (per job)"]
        DR1["repo-policy.json"]
        DL1["output.log"]
    end

    subgraph After["After — SQLite + JSONL"]
        direction TB
        DB[("coordinator.db<br/>(WAL mode)")]
        DB --> T1["jobs"]
        DB --> T2["invocations"]
        DB --> T3["repo_policy"]
        DL2["output.jsonl (per job)"]
    end

    Before -- "legacy file sweep on coordinator startup" --> After
Loading

Writes go through the JobsStorePort trait (typed upsert_job,
upsert_invocation, write_repo_policy with field-merge semantics)
implemented by SqliteJobsStore. The adapter scrubs *_TOKEN /
*_SECRET / *_KEY / *_PASSWORD env vars at the persistence
boundary so no future caller can accidentally leak secrets to disk.

SQLite is the sole source of truth for job metadata. LogStore::clean
now gathers its retention snapshot via the port (lifted into a pure
coordinator/domain/retention.rs FCIS function), so meta.json no
longer flows through any production write or read path. Legacy
sidecars from earlier builds are swept on first coordinator startup.

Store / Ports / Adapters / Domain pattern

flowchart TB
    APP["Application<br/>(commands/hooks/jobs,<br/>IPC dispatch, complete.rs)"]
    DOMAIN["Domain<br/>(pure logic — no SQL, no syscalls)<br/>coordinator/domain/reconcile.rs"]
    PORTS["Ports (trait declarations)<br/>JobsStorePort · Clock · ProcessControl"]
    ADAPTERS["Adapters (impls)<br/>SqliteJobsStore · SystemClock · UnixProcess"]
    STORE["Store (data layer)<br/>connection · pool · migrate · repos · env_scrub"]

    APP --> DOMAIN
    APP --> ADAPTERS
    DOMAIN --> PORTS
    ADAPTERS --> PORTS
    ADAPTERS --> STORE
Loading

Hard rules (recorded in CLAUDE.md):

  • Domain code never imports rusqlite, nix, or std::env.
  • Store code never imports domain types — returns its own typed row
    structs (JobRow, InvocationRow, RepoPolicyRow).
  • Adapters are the only translation layer; cross-cutting concerns
    (env scrub, status enum ↔ TEXT conversion, policy merge) live there.

coordinator/domain/reconcile.rs is the canonical example — pure
logic, six unit tests against fake adapters. Future PRs will lift
more functions from process.rs (cancel, lifecycle) into the
domain layer using the same pattern.

Security baked into the store layer

Every consumer goes through src/store/connection.rs::bring_up, so
the security defaults are non-bypassable:

Concern Measure
SQL injection All queries parameterized; no format! allowed in src/store/repos/
File perms 0600 on file, 0700 on parent dir; verified once per process
Path safety Always derived from daft_state_dir(); canonicalized; prefix-asserted
Schema version Refuse-open if PRAGMA user_version > CURRENT
App identity PRAGMA application_id check refuses foreign SQLite files
WAL hygiene wal_autocheckpoint = 1000
Concurrent writes Writer pool size 1 + busy_timeout = 5000
Concurrent reads Reader pool SQLITE_OPEN_READ_ONLY + busy_timeout = 300
Secrets in env env_scrub::scrub strips *_TOKEN/_SECRET/_KEY/_PASSWORD before persisting JobRow.env
Migration safety rusqlite_migration-driven; refuses to open user_version > binary's current

Wire protocol: loose JSON → framed envelopes

sequenceDiagram
    participant CLI as daft CLI
    participant Coord as daft __coordinator

    Note over CLI,Coord: Length-prefixed (u32 BE) framed JSON envelopes.<br/>Every frame carries `v` (PROTOCOL_VERSION) and a tagged body.

    CLI->>Coord: RequestEnvelope { v:1, req:"TailLogs",<br/>payload:{job, follow:true, since_seq:42} }
    loop until terminal Status or EOF
        Coord-->>CLI: ResponseEnvelope { v:1, kind:"StreamFrame",<br/>payload: LogRecord {...} }
    end
    Coord-->>CLI: ResponseEnvelope { v:1, kind:"StreamEnd" }

    Note over CLI,Coord: On schema mismatch the coordinator emits<br/>Error { code: SchemaMismatch, message } instead of<br/>silently parse-failing — the typed ErrorCode taxonomy<br/>also covers JobNotFound / NoMatch / KillFailed / Internal.
Loading

Frames cap at 8 MiB so a misbehaving peer can't exhaust memory.
Streaming requests (currently just TailLogs) get many StreamFrame
envelopes terminated by exactly one StreamEnd. Single-response
requests exchange one envelope each way.

Coordinator lifecycle: spawn-self + crash recovery

flowchart TB
    A["Hook fires<br/>(worktree-post-create, ...)"] --> B{"Background jobs<br/>declared?"}
    B -- "no" --> Z["Foreground exec only<br/>(BufferingLogSink<br/>writes JobRow to SQLite)"]
    B -- "yes" --> C["spawn_coordinator()"]

    C --> C1["serialize state → 0600 tempfile"]
    C1 --> C2["current_exe().canonicalize()"]
    C2 --> C3["Command::new(canonical).args(__coordinator, tempfile)<br/>Stdio::null x3"]

    C3 --> D["child: daft __coordinator"]
    D --> D1["read + unlink state tempfile"]
    D1 --> D2["nix::setsid()"]
    D2 --> D3["SqliteJobsStore::for_repo_base(...)<br/>auto-wipe legacy coordinator.redb<br/>+ repo-policy.json"]
    D3 --> D4["coordinator::domain::reconcile_active_jobs()<br/>via UnixProcess + SystemClock adapters<br/>flips dead Running rows → Crashed"]
    D4 --> D5["run_all_with_cancel()<br/>each job in its own process_group(0)<br/>upserts JobRow at start + terminal"]
    D5 --> D6["process::exit"]
Loading

spawn_coordinator is the canonical pattern; fork() is not an
option (libc::fork is unsafe fn, conflicts with
forbid(unsafe_code)). The reconcile pass on boot is what makes the
"my last terminal crashed" case visible: the previous coordinator's
process group is gone, so its Running rows get flipped to Crashed
with finished_at = now, and the next daft hooks jobs ls shows them.

Cancel matching

flowchart LR
    subgraph CLI
        F1["--hook"]
        F2["--worktree"]
        F3["--tag"]
        F4["--inv"]
        F5["--older-than"]
        Fall["--all"]
    end

    F1 & F2 & F3 & F4 & F5 -- "AND" --> CM["CancelMatching {<br/>hook, worktree, tag,<br/>invocation_prefix, older_than_secs<br/>}"]
    Fall -- "no filter required" --> CA["CancelAll"]

    CM --> S["filter_matching_jobs over<br/>SqliteJobsStore::list_active_jobs(repo)"]
    CA --> S

    S --> K["killpg(pgid, SIGTERM)<br/>for each match"]
    K --> R["Cancelled { count, names }"]
Loading

Dependencies

  • Add rusqlite = "0.39" with bundled (compiles SQLite from source
    into the binary, ~+700 KB on release builds, +C compiler at build
    time).
  • Add rusqlite_migration = "2.5", r2d2 = "0.8", r2d2_sqlite = "0.33".
  • Drop redb = "4.1", bincode = "1.3".
  • All within the 7-day cooldown policy; no .dep-age-allowlist
    entries needed.

Release binary size went from 9.3 MB (master) to 10 MB on this branch.

Test plan

  • cargo test --lib — 1790 unit tests pass (90 added under
    store::*, coordinator::adapters::*,
    coordinator::domain::reconcile::*, plus a regression test for
    the legacy coordinator.redb / repo-policy.json sweep)
  • mise run clippy — zero warnings
  • mise run fmt:check — clean
  • mise run test:manual -- --ci — full YAML scenario sweep
  • Pre-push hook (mise run test:unit + build-check) — clean
  • Manual smoke: coordinator.db, -wal, -shm files appear in
    per-repo state dir; concurrent reader works while writer alive;
    legacy coordinator.redb / repo-policy.json auto-wiped with
    stderr banner

Tab completion benchmark

Per the original migration plan, daft __complete is on the
documented hot path (fires every Tab keypress) and the SQLite open
adds work the meta.json scan didn't pay. Measured on the daft project
itself (298 invocation directories on disk, ~30 worktrees) with
hyperfine, comparing release binaries:

binary mean range
master (meta.json hot path) 122 ms 78 ms – 177 ms
this branch (SQLite hot path) 165 ms 97 ms – 217 ms

Branch is ~43 ms slower (1.35×). The --version-only baseline for
both binaries is ~16 ms, so essentially all of the gap is the
SQLite open + PRAGMA bring-up + first reader checkout. The dominant
cost on either binary is the filesystem invocation scan that powers
the worktree drill-down, which is unchanged. 165 ms total stays
under the typical "feels sluggish" threshold (~200 ms).

The advisor flagged the ≤5 ms budget from the original plan as
speculative — order-of-magnitude is the real signal, and 1.35× is
not a regression that warrants the cache-file mitigation from the
plan's Risks section. If user feedback indicates otherwise, a
follow-up can switch the completion path off the pool onto a single
direct Connection::open_with_flags(... READ_ONLY) and skip the
bootstrap/migration check (we know the schema is current — the
binary was built with it).

Architectural direction

This PR also lands ARCHITECTURE.md at the repo
root, pinning the strategic direction the hexagonal coordinator
proves out:

  • Hexagonal at subsystem boundaries. The redb→SQLite cutover here
    was a single-adapter-file replacement; that's the property worth
    preserving as new subsystems land.
  • Functional core inside domain modules. reconcile is the
    worked example — pure function over a snapshot, no port callbacks
    during computation. Future domain extractions (cancel,
    lifecycle) follow this shape.
  • Vertical slice at the CLI command layer. Already the de facto
    pattern in src/commands/; explicitly not extended into hexagonal
    ceremony.
  • Gitoxide-style crate decomposition (long-term). One concern per
    crate, narrow public surface, independent versioning. Not a
    near-term refactor.
  • Native agentic IDE spin-out (long-term). The concrete second
    consumer that justifies lifting ports from inside-one-crate to
    at-a-crate-edge. daft-coordinator + daft-hooks + daft-trust
    become reusable libraries with IDE-supplied adapters for storage,
    process control, filesystem.

CLAUDE.md's Architecture section gets a landpost paragraph pointing
at the new doc, so agents and contributors hit the strategic direction
before any operational detail. The follow-ups below are all concrete
steps toward this direction.

Known follow-ups

  • The invocations SQL table is forward-looking but not yet
    populated by production code (Tier-1 added the schema). The FK
    from jobs.invocation_idinvocations.invocation_id is
    intentionally deferred to the migration that adds the population
    path; declaring it now would refuse every job insert.
  • Lift the remaining business logic in process.rs (cancel,
    lifecycle) into coordinator/domain/ using the
    reconcile_active_jobs pattern. Domain-level unit tests already
    cover the lifted function; the surrounding scaffold can come over
    incrementally.
  • Migrate the trust registry from JSON to src/store/ (new
    trust.db in ~/.config/daft/).
  • Implement the repo catalog on src/store/ (new repos.db).

🤖 Generated with Claude Code

@avihut
Copy link
Copy Markdown
Owner Author

avihut commented May 15, 2026

/claude review

@claude
Copy link
Copy Markdown

claude Bot commented May 15, 2026

Claude finished @avihut's task in 5m 4s —— View job


Review: feat(coordinator): redesign internals (#476)

This is a large, well-structured PR. The spawn pattern is correctly followed, canonicalize() on current_exe() is present, unsafe is not reintroduced, and the schema-version sentinel and frame-size limit are good protocol discipline. Test coverage for the store layer is thorough. The following are actionable findings in descending priority.


Issues

1. --inv alone silently falls through in cancel — likely bug

src/commands/hooks/jobs.rs:537-557

let has_filter =
    hook.is_some() || worktree.is_some() || tag.is_some() || older_than.is_some();
if has_filter {
    cancel_matching(hook…, inv.as_deref(),)  // inv passed here
} else if all {}
else if let Some(j) = job {}
else {
    anyhow::bail!("cancel requires a job address, --all, or at least one filter …");
}

inv is accepted by the Cancel subcommand and forwarded to cancel_matching, but it is not included in has_filter. A user running daft hooks jobs cancel --inv abc123 hits the final bail! instead of filtering by invocation prefix. Either add || inv.is_some() to has_filter, or document that --inv is only a secondary filter that must accompany another predicate.

Fix this →


2. : in job_name can corrupt key comparisons

src/coordinator/store/types.rs:116-118

pub fn job_key(repo_hash: &str, invocation_id: &str, job_name: &str) -> String {
    format!("{repo_hash}:{invocation_id}:{job_name}")
}

rsplitn(3, ':') in JobAddress::parse handles : in worktree names, but the redb key format doesn't escape : in job_name. A hook YAML with a job named build:test produces key {repo}:{inv}:build:test, which parses back as repo=repo, inv=inv, job=build:test and round-trips correctly for exact-match get_job. The range scan in list_jobs_for_repo uses {repo_hash}: as prefix and ; as upper bound — still safe. So this is not exploitable today.

However, invocation_id is a 16-hex string (safe) and repo_hash is a UUID (safe). Only job_name is user-controlled via YAML name:. A comment confirming this invariant ("job names must not contain :") or a validation site at YAML parse time would close this permanently.


3. reconcile_active_jobs comment misstates the probe semantics

src/coordinator/process.rs:62-66

// `signal::kill` with `None` is the POSIX existence probe:
// returns Ok(()) if the process group exists (regardless of permission)
let pid = Pid::from_raw(pgid as i32);
matches!(signal::kill(pid, None), Ok(()))

signal::kill(pid, None) sends signal 0 to PID pgid, not to the process group with that PGID. To probe a process group you'd use killpg(pgid, 0) (i.e. Pid::from_raw(-(pgid as i32))). In practice, since PID == PGID for these jobs via process_group(0), the probe lands on the right process. But the comment says "process group exists" when it really means "process group leader exists". A PID-reused process would also satisfy the probe — this is a known best-effort limitation, worth documenting inline rather than leaving the misleading comment.

Fix this →


4. send_response double-serde is fragile and silently drops errors

src/coordinator/process.rs:1099-1114

fn send_response(stream: &std::os::unix::net::UnixStream, response: &CoordinatorResponse) {
    let env = ResponseEnvelope {
        v: PROTOCOL_VERSION,
        body: match serde_json::to_value(response).and_then(serde_json::from_value) {
            Ok(b) => b,
            Err(_) => return,  // client hangs waiting for a response
        },
    };

The to_valuefrom_value round-trip exists to avoid needing CoordinatorResponse: Clone, per the comment. But serde_json::Value already implements Clone, so #[derive(Clone)] on CoordinatorResponse (which only has String, Vec, usize, and Value fields) would work and remove the round-trip entirely. As written, a serialization bug causes a silent return — the client hangs on read_frame indefinitely because no StreamEnd or error frame is sent.

Fix this →


5. fork_coordinator alias should be removed, not deferred

src/coordinator/process.rs:1356-1365

/// Backwards-compatible alias for callers still using the old name.
/// Functionally identical to [`spawn_coordinator`]; kept thin so the
/// rename across … can land in a follow-up.
pub fn fork_coordinator(state: CoordinatorState, store: LogStore) -> Result<…> {
    spawn_coordinator(state, store)
}

Both call sites (src/hooks/yaml_executor/mod.rs:405 and src/commands/hooks/jobs.rs:1850) are in this same PR. The alias was intended to let the rename land separately, but the rename didn't happen. This is dead API surface with a misleading name (fork_ implies fork()). Removing the alias and updating both call sites to spawn_coordinator is a one-line change per site.

Fix this →


6. follow=true tail loop has no shutdown escape hatch on the coordinator side

src/coordinator/process.rs:1186-1201

if n == 0 {
    if !follow { break; }
    let pos = match file.stream_position() { Ok(p) => p, Err(_) => break };
    std::thread::sleep(std::time::Duration::from_millis(200));
    if file.seek(SeekFrom::Start(pos)).is_err() { break; }
    reader = BufReader::new(&file);
    continue;
}

If the log writer thread panics before emitting a terminal Status::Finished|Signaled|Crashed record, handle_tail_logs will spin at 200ms/cycle forever — the connection read-timeout was cleared for follow=true at line 849, so neither side times out. The client will spin too since its stream_request iterator also cleared its read timeout. A shutdown signal check or a wall-clock deadline (e.g. "bail after N seconds at EOF with no new data after the coordinator PID file is gone") would close this.


7. Sampling inconsistency between foreground and background sinks

src/coordinator/process.rs:551 vs src/executor/log_sink.rs:122

// process.rs — background:
!seq.is_multiple_of(n as u64)

// log_sink.rs — foreground:
seq % (n as u64) != 0

Both expressions are semantically equivalent (is_multiple_of stabilised Rust 1.87, MSRV is 1.93). The inconsistency is cosmetic but signals the two paths diverged. Unifying to one form makes it easier to verify both paths sample identically, especially since is_multiple_of is clearer.


8. job_count() is pub with #[allow(dead_code)]

src/coordinator/store/mod.rs:201-207

A pub function should not need #[allow(dead_code)]. Either demote it to #[cfg(test)] + pub(crate), or add it to an existing test so the annotation is unnecessary.


What's done well

  • Spawn pattern is correct throughoutcanonicalize() on current_exe() is present and load-bearing; the comment in CLAUDE.md is accurately reflected in the code.
  • forbid(unsafe_code) preserved — no unsafe blocks introduced; nix safe wrappers used correctly for setsid() and killpg().
  • Schema-version sentinel — forward-compat refusal ("older binaries refuse to write data they don't fully understand") is correct and tested.
  • Frame size limit (8 MiB) — bounds local IPC without being absurdly small.
  • Protocol version on every envelopeSchemaMismatch error code surfaced instead of silent JSON parse failures.
  • Crash-recovery reconciliationreconcile_marks_dead_running_as_crashed and reconcile_leaves_alive_running_intact tests are solid. Spawning-and-waiting true to get a definitely-dead PID is a clean approach.
  • Tempfile at 0600 — coordinator state passed through a tempfile with default 0600 permissions; comment confirms this.
  • Fallback when redb lock can't be acquiredload_redb_job_meta_index returning None and falling back to meta.json is correctly documented as a known limitation with a test.
  • Hot-path is cleanshell-init, completions, and __* background spawns are all in skip_startup_tasks_for; no new commands on the hot path.
  • invocation_id_from_seed hash distribution — the ns+PID hash fixing the prefix-collision bug is clearly motivated and tested.
  • TOCTOU in write_repo_policy is benign in practice because the coordinator holds the exclusive redb lock, and is documented.

@avihut avihut self-assigned this May 15, 2026
@avihut avihut added the feat New feature label May 15, 2026
@avihut avihut added this to the Public Launch milestone May 15, 2026
avihut added a commit that referenced this pull request May 15, 2026
Windows build fix (jobs.rs:1197 referenced tail_logs which is Unix-only):
- Refactor follow_logs to early-return through the coordinator path on
  Unix and fall through to the file-read fallback otherwise; the
  Unix-only IPC call now lives in a #[cfg(unix)] helper.
- Gate anyhow::Context import in coordinator/process.rs so it isn't
  flagged unused on non-Unix where spawn_coordinator is excluded.

Review-driven changes (#508):
- #1 `cancel --inv X` no longer silently falls through to the bail!:
  `--inv` now counts as a filter. Extracted cancel_has_filter() so the
  invariant has a regression test.
- #2 Document `:` invariant on job_key — repo_hash and invocation_id
  are :-free by construction; job_name is user-controlled but the key
  format is safe via rsplitn(3) on read.
- #3 Fix reconcile probe comment: signal::kill(pid, None) probes the
  process-group leader's PID (== PGID via process_group(0)), not the
  whole process group. Note PID-reuse limitation.
- #4 Replace send_response double-serde round-trip with derive(Clone)
  on CoordinatorResponse + JobInfo; on serialization failure, emit a
  sentinel Internal-error frame instead of silently returning (which
  hung the client on read_frame).
- #5 Remove fork_coordinator alias; rename both call sites to
  spawn_coordinator. (`fork_` implies fork() which we don't use.)
- #6 Document tail-loop follow=true shutdown limitation inline: if the
  writer panics before a terminal Status record, the loop can spin
  forever. Tracked for follow-up; not load-bearing in practice.
- #7 Unify sampling expression: foreground sink now matches background
  via `!seq.is_multiple_of(n)`.
- #8 Remove dead pub job_count() and its sole `ReadableTableMetadata`
  import — no callers anywhere, not worth a cfg(test) helper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@avihut
Copy link
Copy Markdown
Owner Author

avihut commented May 15, 2026

Thanks for the thorough review. Addressed in 782272a + 41c9820 (numbering matches the original review list):

  • R1 --inv falls through silently — fixed. inv.is_some() now counts toward has_filter. Extracted cancel_has_filter() with a regression test.
  • R2 : in job_name key correctness — documented the invariant inline on job_key: repo_hash and invocation_id are :-free by construction; the key format remains safe via rsplitn(3, ':') on read and ; upper-bound on range scan even when job_name contains :.
  • R3 reconcile probe comment — rewrote: signal::kill(pid, None) probes the process-group leader's PID (PID == PGID because jobs are launched with process_group(0)). Added the PID-reuse caveat.
  • R4 send_response double-serde / silent error drop — replaced the to_value/from_value round-trip with derive(Clone) on CoordinatorResponse + JobInfo. On serialization failure we now emit a sentinel Error { code: Internal, message } frame so the client's read_frame doesn't hang.
  • R5 fork_coordinator alias — removed. Both call sites now reference spawn_coordinator.
  • R6 follow-loop shutdown escape — documented the limitation inline (writer panic before terminal Status would spin at 200ms/cycle indefinitely). Closing requires either a wall-clock idle deadline gated on stale-mtime + row status no longer Running, or a writer-liveness channel — tracking as a follow-up. Not load-bearing in practice.
  • R7 sampling expression — unified: src/executor/log_sink.rs now uses !seq.is_multiple_of(n) matching process.rs.
  • R8 pub job_count() with #[allow(dead_code)] — deleted (no callers anywhere) along with the now-unused ReadableTableMetadata import.

Also unblocked the Windows build that broke on this PR — refactored follow_logs to early-return through the coordinator path on Unix and fall back to the file-read otherwise; the Unix-only tail_logs IPC call now lives in a #[cfg(unix)] helper. Gated anyhow::Context import for cfg(unix) since spawn_coordinator (its sole user) is Unix-only.

The PR description was also expanded with an abstract and mermaid diagrams covering the four architectural shifts (persistence, IPC, lifecycle, cancel).

avihut and others added 13 commits May 15, 2026 22:04
…er cancel

Lands the durable-state half of the coordinator internals redesign:

- redb-backed `JobStore` at `<state>/jobs/<repo>/coordinator.redb` with
  schema-version guard (refuses writes when on-disk schema is newer than
  the binary)
- `JobStatus` gains `Crashed`/`Cancelling`/`Unknown` variants;
  `#[serde(other)] Unknown` gives older binaries a forward-compat read
  path for newer data
- Reconciliation pass on coordinator boot probes each active row's PGID
  via `kill(.., None)` and marks dead-process rows `Crashed`
- `JobSpec.tags` plumbed from YAML `JobDef.tags`; persisted into `JobRow`
- Dual-write `JobRow` at job start + complete alongside existing
  `meta.json` writes (transitional; meta.json cleanup is follow-up)
- `daft hooks jobs cancel` gains `--hook/--worktree/--tag/--older-than`;
  new `CancelMatching` IPC variant filters via redb and signals each
  matching PGID; refuses to run without at least one filter or `--all`
- `cancel_background_jobs_for_worktree` collapses to a single
  `CancelMatching { worktree }` call (was list-then-cancel-per-name)
- Adds `coordinator::framing` (length-prefixed u32-BE frames) and
  `coordinator::log_record` (LogRecord/OutputKind/StatusEvent JSONL
  types). Both modules are wired only at the seams that need them for
  Tier-1; the full envelope retag + output.jsonl writer migration ships
  with Tier-2.
- Stream-tags the executor output channel: `Sender<(OutputKind, String)>`
  ripples through `LogSink::on_job_output(.., kind, line)` so future
  JSONL writers can record stream origin per line.

Verification: 1731 unit tests (22 new), 15/15 bash hooks integration
tests, 572/572 YAML manual scenarios, clippy clean, fmt clean.

Refs #476

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tured logs reader

Replaces the flat `output.log` per-job file with structured `output.jsonl`:
one `LogRecord` per line carrying `seq`/`ts`/`kind` so consumers can
distinguish stdout from stderr, tail by sequence number, or filter by
record kind. `output.log` stays as a one-cycle legacy fallback path.

- Writers (`BufferingLogSink` + bg writer in `process.rs`) emit JSONL
  records using the stream-tagged channel (`(OutputKind, String)`) plus
  a terminal `Status::Finished`/`Status::Signaled` record per job
- New per-job YAML knob `log.sampling_every_nth: N` drops all but every
  Nth stdout/stderr line; `seq` advances on dropped records so consumers
  see explicit gaps. `Status` records are never sampled
- `daft hooks jobs logs` gains `--stdout` / `--stderr` / `--status` /
  `--seq` / `--since-seq` / `--since <duration>` flags via a shared
  `LogsFilter` applied at render time
- Reader falls back to `output.log` when `output.jsonl` is missing
  (pre-upgrade data), synthesizing `Stdout`-only records keyed by line
  number so existing logs remain readable for one release cycle

Verification: 1732 unit tests (3 new for sampling + log-format round-
trip), 572/572 YAML manual scenarios, clippy clean.

Refs #476

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… streaming

Replaces newline-delimited JSON IPC with length-prefixed framed envelopes
and adds a streaming `TailLogs` request that powers `daft hooks jobs
logs --follow`.

Wire format:
- Every message is now `RequestEnvelope`/`ResponseEnvelope` —
  `{"v":1,"req|kind":"...","payload":...}` — wrapped in a `u32`-BE
  length-prefixed frame via `coordinator::framing`. Length prefix lets
  payloads carry embedded newlines (log lines with `\n`) without escape
  ambiguity, and the explicit `v` field gives a forward-compat handhold
  for future protocol revisions.
- `CoordinatorResponse::Error` gains a typed `code: ErrorCode` field
  (`JOB_NOT_FOUND`/`SCHEMA_MISMATCH`/`INTERNAL`/...) alongside the
  human message; older callers using `..` destructure stay source-compatible.
- New `StreamFrame(serde_json::Value)` + `StreamEnd` response variants
  carry the streaming layer.

`TailLogs` streaming:
- New `CoordinatorRequest::TailLogs { job, follow, since_seq }`. Server
  tails `output.jsonl`, emits one `StreamFrame` per `LogRecord`, sends
  `StreamEnd` when the terminal `Status` record (Finished/Signaled/
  Crashed) appears or the source closes.
- `CoordinatorClient::stream_request` returns a sync iterator over
  `StreamFrame` payloads — daft stays sync, no async runtime imposed.
- `daft hooks jobs logs --follow` opens a tail via the coordinator when
  one is reachable; falls back to a one-shot file read when the
  coordinator is gone (per-invocation lifecycle: dead coordinator means
  the job is final).

`JobAddress` moved from `src/commands/hooks/jobs.rs` to
`src/coordinator/types.rs` (with serde derives) so it can ride on the
wire without a back-pointer from the coordinator layer into the CLI.

Verification: 1731 unit tests, 572/572 YAML manual scenarios, clippy
clean.

Refs #476

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Folds the remaining Tier-3 items from #476 into the coordinator-redesign
branch and closes the visibility gap that Tier-1's crash recovery left.

Two items are load-bearing for Tier-1 actually working end-to-end:

- `daft hooks jobs ls/logs` now read from `JobStore` first, falling back
  to `meta.json` only when no redb row exists. Reconciliation marks rows
  as `Crashed` in redb but never touched the legacy `meta.json`, so the
  new status was invisible to users until this migration.
- `repo-policy.json` moves into the `repo_policy_v1` table with a one-shot
  JSON→redb migration on `JobStore::open_for_repo_base`. redb's WAL gives
  atomicity for free, closing the Tier-3 `write_repo_policy` gap.

API surface cleanup (three over-grown functions get focused context
structs): `JobAdapterContext` for `yaml_jobs_to_specs` (8→6 args),
`HookExecutionContext` for `execute_yaml_hook_with_rc` (11→5 args), and
`JobInvocationContext` extended to absorb `run_single_background_job`'s
five passthrough Arcs (7→3 args). All three lose
`#[allow(clippy::too_many_arguments)]`.

Also fixes `prune --dry-run`'s asymmetric summary gate (defense-in-depth
when `jobs == 0 && invs > 0`) and updates the post-Tier-2 leftover in
`hooks/bg-output-silent.yml` that still looked for `output.log`.

`hooks/crashed-recovery.yml` was attempted but deferred — the YAML gate
hit a hidden dependency on coordinator single-flight semantics that
wasn't visible in the plan. Reconcile is fully covered by
`coordinator::process::tests::reconcile_marks_dead_running_as_crashed`,
so the end-to-end scenario is value-add, not value-blocking.

Verified: 1737 unit tests pass; clippy clean; 572 YAML scenarios pass
on retry (one pre-existing master-too truncation-feature regression
unrelated to this PR).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…r-3 readers

Adds a regression test (`concurrent_open_is_rejected_by_redb_lock`) and a
doc-comment on `load_redb_job_meta_index` calling out the known limitation
that redb's process-level lock causes the CLI reader to fall back to
`meta.json` whenever a coordinator is alive holding the same file.

Surfaced by advisor review of the Tier-3 cleanup. Behavior unchanged —
this is documentation + a regression gate so the fallback rationale is
discoverable if anyone later tries to change reader semantics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three tests in `coordinator::process::tests` exercise cross-thread
coordination by spawning probe/killer threads that race against the
registrar inside `run_single_background_job`. Under heavy parallel-test
load (the lefthook pre-push sweep at ~1740 tests), the fixed-sleep
delays missed their windows:

- `run_single_background_job_registers_and_deregisters_pid` — the probe
  ran before the registrar thread inserted the PID
- `per_job_cancel_marks_status_cancelled_not_failed` — `cancel_single_job`
  fired before the PID was recorded, so the SIGTERM target wasn't found
- `bg_dependent_skipped_when_dep_cancelled` — same race as above

Replace fixed sleeps with polling waits: probe loops until the PID
appears (up to 600ms / 2s respectively), killer loops until the PID is
visible before signalling. All three tests pass 3-for-3 in parallel
mode after the fix. The function under test is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`xtask test-matrix` is a long loop that spawns `xtask manual-test --ci`
per matrix entry. If the invoking shell dies mid-run (terminal closed,
session reload, parent signal lost), the loop survives reparented to
launchd/init and keeps burning CPU for the remaining entries — observed
in the wild as a 1h40m runaway during this PR's development.

Add an orphan check at each iteration boundary: when `getppid() == 1` we
bail out with a one-line stderr message instead of starting the next
entry. Uses `nix::unistd::getppid` (nix 0.31 already pinned in the daft
workspace, adding `process` feature). Non-Unix builds get a no-op.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Companion to the prior `xtask test-matrix` orphan-guard commit — the
new `nix` dependency in `xtask/Cargo.toml` (process feature; same 0.31
already pinned by the main daft crate) needs the corresponding lockfile
entry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Windows build fix (jobs.rs:1197 referenced tail_logs which is Unix-only):
- Refactor follow_logs to early-return through the coordinator path on
  Unix and fall through to the file-read fallback otherwise; the
  Unix-only IPC call now lives in a #[cfg(unix)] helper.
- Gate anyhow::Context import in coordinator/process.rs so it isn't
  flagged unused on non-Unix where spawn_coordinator is excluded.

Review-driven changes (#508):
- #1 `cancel --inv X` no longer silently falls through to the bail!:
  `--inv` now counts as a filter. Extracted cancel_has_filter() so the
  invariant has a regression test.
- #2 Document `:` invariant on job_key — repo_hash and invocation_id
  are :-free by construction; job_name is user-controlled but the key
  format is safe via rsplitn(3) on read.
- #3 Fix reconcile probe comment: signal::kill(pid, None) probes the
  process-group leader's PID (== PGID via process_group(0)), not the
  whole process group. Note PID-reuse limitation.
- #4 Replace send_response double-serde round-trip with derive(Clone)
  on CoordinatorResponse + JobInfo; on serialization failure, emit a
  sentinel Internal-error frame instead of silently returning (which
  hung the client on read_frame).
- #5 Remove fork_coordinator alias; rename both call sites to
  spawn_coordinator. (`fork_` implies fork() which we don't use.)
- #6 Document tail-loop follow=true shutdown limitation inline: if the
  writer panics before a terminal Status record, the loop can spin
  forever. Tracked for follow-up; not load-bearing in practice.
- #7 Unify sampling expression: foreground sink now matches background
  via `!seq.is_multiple_of(n)`.
- #8 Remove dead pub job_count() and its sole `ReadableTableMetadata`
  import — no callers anywhere, not worth a cfg(test) helper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…icy YAML scenarios

The Tier-3 cutover migrates `repo-policy.json` into redb and deletes
the sidecar on first open. Three YAML scenarios that previously
`cat`'d the file (repo-log-defaults-applied, repo-policy-not-
clobbered, prune-budget-accounting) had no way to observe the
persisted policy and were failing in CI.

Adds a hidden `daft __dump-store <table>` subcommand that reads
the redb-backed JobStore for the cwd's repo and prints the
requested table as JSON. Currently supports `repo-policy`; jobs
and invocations can follow if needed. The `__` prefix excludes it
from completions and skips startup background tasks.

The three scenarios now `daft __dump-store repo-policy` from the
worktree cwd and grep the JSON for `max_total_size_bytes`/
`keep_last` — same assertions, redb-backed source of truth.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eeps working

Tier-2 moved foreground writers to `output.jsonl` only. The per-file
truncator (`LogStore::truncate_oversized_logs`) and the legacy reader
fallback both still target `output.log`, so removing the dual-write
silently broke the truncation feature (`log-cleanup-per-file-cap.yml`
fails on this branch but passes on master).

Restores the legacy `output.log` as a sibling of `output.jsonl` in
`BufferingLogSink::on_job_complete` and `on_job_runner_skipped`. The
raw bytes are reconstructed from the same post-sampling record buffer
so both files reflect a single, consistent view of the job's output.
Matches the one-cycle compat posture the reader path already takes.

Background-job logs (coordinator-managed) still write JSONL only —
not in the truncator's path today.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ncator

Tier-2 of the coordinator redesign moved writers over to the structured
`output.jsonl` format. The legacy `output.log` byte-stream lingered on as
(a) a dual-write target, (b) a reader fallback, and (c) the input to a
byte-level `truncate_oversized_logs` cleanup pass. Daft is effectively
pre-release; preserving these legacy paths is perpetual tax with no
beneficiaries.

Cut entirely:
- `LogStore::log_path`, `LogStore::write_job_record` (only the JSONL
  variant remains).
- `LogStore::truncate_oversized_logs` and its plumbing in
  `prune_jobs` / `run_clean_logs`; `CleanSummary::truncated_logs`; the
  CLI's "truncated N log(s)" suffix.
- `JobMeta.log_truncated` and `JobMeta.original_size_bytes`; same
  fields on `JobRow`.
- The `output.log` read fallback in `render_job_log`.
- `tests/manual/scenarios/hooks/log-cleanup-per-file-cap.yml`.
- Synthetic `meta.json` field references in the surviving log-cleanup
  scenarios.

JSONL-aware truncation is a separate future feature if ever needed —
the typical motivating case (per-file size cap on a runaway log) is
addressable today by sampling or per-job log-size budgets enforced at
hook-fire time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`start_socket_listener` puts the listener in non-blocking mode so the
accept loop can poll the shutdown flag. On macOS (and other BSDs)
accept(2) returns a socket that inherits O_NONBLOCK from the listener,
so every accepted stream landed in `handle_client_connection`
non-blocking. `read_exact` then returned `WouldBlock` before any client
bytes arrived and the request silently dropped — the existing
`set_read_timeout(5s)` is a no-op on non-blocking sockets.

`test_socket_listener_list_jobs` exposed this as a deterministic EOF
mid-frame on the client side. The same race affected real coordinator
IPC on macOS.

Force the accepted stream back to blocking at the top of
`handle_client_connection` so `set_read_timeout` is the only thing
bounding reads.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@avihut avihut force-pushed the daft-476/chore/coordinator-redesign branch from 41c9820 to e386058 Compare May 16, 2026 06:28
avihut and others added 9 commits May 16, 2026 09:47
Phases 1-3 of the redb → SQLite migration. Lays down the new
data-access stack alongside the existing redb store; the cutover that
deletes redb + meta.json comes in a follow-up commit so reviewers can
see the new code in isolation.

**Phase 1 — store layer (`src/store/`)**

Pure SQLite, no daft business logic. Every consumer inherits the same
security defaults because they all funnel through `connection::bring_up`:

* `connection.rs` — applies `PRAGMA application_id`, `journal_mode = WAL`,
  `synchronous = NORMAL`, `foreign_keys = ON`, `busy_timeout`, and
  `wal_autocheckpoint`. Refuses to open a foreign SQLite file
  (mismatched `application_id`); tightens file/dir perms to 0600/0700
  on fresh creates.
* `pool.rs` — `r2d2_sqlite`-backed writer pool (size 1, enforces WAL's
  single-writer ordering) and reader pool (multi-slot, opened with
  `SQLITE_OPEN_READ_ONLY` and a 300ms busy_timeout so CLI paths fail
  fast).
* `migrate.rs` — `rusqlite_migration` runner. Refuses to open if the
  on-disk `user_version` is newer than this binary's understanding.
* `migrations/001_initial.sql` — `invocations`, `jobs`, `repo_policy`
  tables with explicit `ON DELETE CASCADE` FK from `jobs` to
  `invocations` and indexes for the `ls`/reconcile scan paths.
* `paths.rs` — canonical path resolution. `for_repo_under(base, hash)`
  canonicalizes the result and rejects symlink-escape attempts.
* `env_scrub.rs` — strips `*_TOKEN`/`*_SECRET`/`*_KEY`/`*_PASSWORD`
  (case-insensitive suffix match) before job env touches disk.
* `models/` + `repos/` — typed row structs (no bincode blobs) and a
  parameterized-SQL-only query layer. Multi-row writes use
  `repos::with_write_txn`.

**Phase 2 — coordinator ports + adapters**

`src/coordinator/ports/` declares the abstractions the domain layer
talks to: `JobsStorePort`, `Clock`, `ProcessControl`.
`src/coordinator/adapters/` implements them: `SqliteJobsStore` (wraps
the pool, scrubs env on every job upsert, converts RepoPolicyRow ↔
RepoPolicy), `SystemClock`, `UnixProcess` (wraps nix's `killpg` with a
signal-0 existence probe).

**Phase 3 — first domain extraction**

`src/coordinator/domain/reconcile.rs` lifts `reconcile_active_jobs` out
of `process.rs`. The domain version takes the three ports and returns
the count of jobs marked Crashed. Pure logic — no `nix::`, no SQL, no
`Utc::now()`. Six unit tests drive it with fully-mocked adapters.

This proves the layering separates correctly; future PRs can lift
`cancel.rs` and `lifecycle.rs` from `process.rs` into `domain/` using
the same pattern.

**Dependencies**

* `rusqlite = "0.39"` with `bundled` (compiles SQLite from source,
  ~+1 MB on release builds, no system-library skew)
* `rusqlite_migration = "2.5"`
* `r2d2 = "0.8"`
* `r2d2_sqlite = "0.33"` (kept on 0.33 instead of 0.34 to clear the
  7-day cooldown without an allowlist entry)

All within the existing cooldown policy; no `.dep-age-allowlist`
changes needed.

**Test coverage added**

`mise run test:unit` adds 89 new passing tests under
`store::*` (76), `coordinator::adapters::*` (7), and
`coordinator::domain::reconcile::*` (6). The redb-era
`concurrent_open_is_rejected_by_redb_lock` is mirrored by
`coordinator::adapters::sqlite_store::tests::concurrent_open_allowed`,
which asserts the new positive behavior — two pools can open the same
file and the reader sees committed writes mid-flight.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nitial schema

Three small enabling pieces for the upcoming JobStore → SqliteJobsStore
swap. Each is a no-op on its own; the next commit consumes them.

* `JobStatus::as_status_str` / `from_status_str` — the SQLite `jobs.status`
  column is TEXT, the domain uses an enum, and serde's `lowercase`
  rename happens to give the same tags. Putting both directions on the
  enum keeps the conversion at the storage boundary out of inline
  match statements at every call site.

* `SqliteJobsStore::for_repo_base(base)` — convenience that mirrors the
  old `JobStore::open_for_repo_base`. Wipes any sibling
  `coordinator.redb` / `repo-policy.json` files (loudly, one stderr
  line per file) the first time a repo's state dir is opened. Matches
  the pre-1.0 no-back-compat principle: cleanup is not silent.

* Initial migration no longer declares the `jobs → invocations` FK.
  Production code never populated `invocations` under the redb store
  (the table was added forward-looking in Tier-1 and forgotten), so
  the FK would refuse every job insert today. A follow-up migration
  will add it back when invocation tracking actually lands. Test
  expectations updated to pin the current "jobs can stand alone"
  shape and the no-cascade-on-delete behavior, so the change is
  visible the day the FK returns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…oduction

Every production call site that previously opened the per-repo redb
`JobStore` now opens `SqliteJobsStore::for_repo_base` instead. The redb
module remains in tree but is unused by production code; it gets
deleted in a follow-up commit once the meta.json dual-write is also
removed and the integration tests have stayed green for a round.

* `src/coordinator/process.rs`:
  * `JobStore` → `SqliteJobsStore` everywhere (types + opens).
  * `job_store_path` helper removed; callers use `for_repo_base`.
  * `reconcile_active_jobs` becomes a thin wrapper around the domain
    function that wires up `SystemClock` + `UnixProcess` adapters. The
    inline reconciliation logic was lifted into `domain/reconcile.rs`
    in the previous commit and is exercised by six unit tests there.
  * `job_row_from_meta` produces the new SQLite `JobRow` shape
    (no `schema_version`, `status: String` via `as_status_str`).
  * Two inline `reconcile_*` tests removed — superseded by the
    domain-layer tests. Filter and dual-write tests updated to the
    new shape.
* `src/log_clean.rs`, `src/hooks/yaml_executor/mod.rs`,
  `src/commands/dump_store.rs`, `src/commands/hooks/jobs.rs`: same
  swap. The `__dump-store repo-policy` debug helper keeps emitting
  the same JSON shape so scenarios that grep its output don't have
  to change.
* `src/commands/hooks/jobs.rs::load_redb_job_meta_index` — the name
  is preserved for callers but the impl now reads from SQLite. The
  prior redb-era "process-lock breaks concurrent reads while
  coordinator is alive" caveat is gone; WAL + the reader pool means
  the open succeeds mid-flight. The "fall back to meta.json" branch
  is still wired so unmigrated state and pre-store tests keep
  rendering — that fallback comes out in the next commit alongside
  the meta.json dual-write.

All 1,801 unit tests pass. Integration scenarios untouched (the
adapter is API-compatible with the redb store's public surface, and
the legacy `coordinator.redb` / `repo-policy.json` files in scratch
state dirs are auto-wiped on first open with a one-line stderr
banner).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…Lite

The cutover takes the **read** side off `meta.json` for every hot path
the user is likely to hit interactively, and starts routing every
**write** through the SQLite store alongside the legacy sidecar so the
two stay consistent during the transition.

The remaining `meta.json` write is kept on purpose: `LogStore::clean`
still reads `meta.json` for per-job retention metadata, and rewriting
that to query SQLite is its own scope. The final `meta.json` deletion +
`LogStore::clean` rewrite lands in a follow-up.

**Reads — fully off meta.json on the touched paths**

* `src/commands/complete.rs` — tab completion (the documented hot
  path) now opens a single `SqliteJobsStore` per `daft __complete`
  call and a new `jobs_for_invocation` helper batches the per-
  invocation lookup into one `list_jobs_for_invocation` query
  instead of one filesystem `read_meta` per job dir. Seven prior
  `store.read_meta(dir)` call sites collapsed to four
  `jobs_for_invocation(...)` calls. The reader pool's tight
  `busy_timeout` (300 ms) keeps a stuck SQLite open from blocking
  the user's shell.
* `src/coordinator/process.rs::build_job_list` (powers
  `ListJobs` IPC) now sources its data from `SqliteJobsStore::
  list_jobs_for_repo`; the prior `LogStore::list_job_dirs` +
  per-dir `read_meta` scan is gone.

**Writes — every path also goes to SQLite**

* `BufferingLogSink` (foreground hook path) gains a
  `SqliteJobsStore` opened at construction. `on_job_complete` and
  `on_job_runner_skipped` call a new `persist_job_row` helper that
  converts `JobMeta` → `store::models::JobRow` and upserts. Eight
  test sites updated for the new `repo_hash` arg on
  `BufferingLogSink::new`.
* `src/coordinator/process.rs::dep_failed` loop now persists a
  skipped `JobRow` so dep-failed jobs aren't invisible to
  `daft hooks jobs ls`. Previously they only landed in
  `meta.json`.
* The `JobMeta` → `JobRow` projection used by the tab-completion
  hot path lives next to `JobMeta` itself as
  `log_store::job_meta_from_row`, so future renderer call sites
  can adopt it without re-implementing the conversion.

**Notes on the deferred work**

The advisor's Phase-4c spec called for the meta.json *write* to go
away too. Doing so requires `LogStore::clean` to query SQLite for
per-job retention metadata (it currently scans `meta.json`).
`LogStore::clean` is tangled enough — stale-running detection, per-
worktree sanity floor, candidate aggregation — that a sloppy
rewrite would silently regress the retention pass. The right move
is one focused commit; this one stops short so the diff stays
reviewable and the YAML scenarios for retention/budget keep their
existing baselines.

All 1,801 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`src/coordinator/store/` was the redb-backed `JobStore` plus its key
schema and value types. As of the previous commit nothing in
production calls into it — every call site routes through
`SqliteJobsStore` / `JobsStorePort`. The module is unreachable code;
deleting it is the bookkeeping step that closes Phase 4 of the
redb → SQLite migration.

* Remove the `src/coordinator/store/` directory (mod.rs + types.rs +
  schema.rs) and the matching `pub mod store;` declaration in
  `src/coordinator/mod.rs`.
* Drop `redb = "4.1"` and `bincode = "1.3"` from `Cargo.toml`. The
  pinned bincode comment was about sharing the version with syntect
  and edtui — those crates keep using it through their own dep tree,
  daft doesn't need a top-level entry.

`coordinator.redb` files left over in `~/.local/state/daft/jobs/*/`
are removed on first open by `SqliteJobsStore::for_repo_base`'s
legacy-file sweep (announced via stderr banner), so users upgrading
across the cutover don't have stray bytes hanging around.

All 1,789 unit tests pass — 12 fewer than before because the redb
module's internal tests went with the module.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…Lite files

The scenario captured the per-repo state dir's newest subdirectory via
`ls -1tr | grep -v '\.json$' | tail -1`. After the SQLite cutover the
per-repo dir also contains `coordinator.db` plus the WAL/SHM
sidecars (`coordinator.db-wal`, `coordinator.db-shm`), and the WAL
file is touched on every commit so it sorts as "newest" — the probe
captured `coordinator.db-wal` instead of the most recent invocation
dir and the dir-existence check at the end of the scenario failed.

Switch to a `find -type d`-based probe so only directories are
considered. The same scenario already uses `find -type d` for the
count assertions; this aligns the newest-id capture with the same
filter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents the architectural decisions from the redb → SQLite migration so
the next agent (or human) lands new DB-backed features in the right
shape without re-deriving the constraints.

Four edits:

* **Critical Rule #4 (forbid unsafe)** — extended with a sentence on
  dependency selection: prefer crates with a fully safe public API.
  Cites SQLite via `rusqlite` over LMDB via `heed` (whose `Env::open`
  is `unsafe fn`) as the canonical worked example.
* **New section after Dependency Cooldown — `## Database & Storage`** —
  SQLite via `rusqlite` (bundled + WAL) is daft's canonical structured-
  data store; no other embedded DBs; redb's process-exclusive lock is
  why the prior choice didn't fit; the pool's reader/writer split;
  the WAL-on-network-fs constraint. References the persisted memory
  doc for the full redb post-mortem.
* **New `## Architecture` subsection — Store / Ports / Adapters /
  Domain pattern** — diagrams the four layers, names the hard rules
  (domain never imports `rusqlite`/`nix`/`std::env`; store never
  imports domain types; adapters are the only translation layer), and
  points at the canonical examples to study before touching the
  pattern.
* **New section — `## Adding a New DB-backed Feature`** — eight-step
  checklist analogous to "Adding a New Command", in the order that
  actually keeps the layering clean (schema first → models → repos
  → port trait → adapter → wiring → unit-tested domain logic →
  security defaults are non-bypassable).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three doc-comments still said "redb" after the cutover. None of these
are user-facing — they're inline comments and a CoordinatorRequest
variant doc — but they'd confuse the next reader trying to map the
code back to the actual store.

* `CoordinatorRequest::CancelMatching` doc: "redb-recorded
  Running/Cancelling rows" → "Running/Cancelling rows from the SQLite
  store".
* `cancel_matching` fn doc + error message: "redb JobRow" / "redb job
  store" / "pre-redb invocations" → "SQLite JobRow" / "SQLite job
  store" / "pre-store invocations".
* `run_coordinator` body comment: "redb-backed job store" →
  "SQLite-backed job store".

Historical references to the redb era are preserved where they're
load-bearing: the `LEGACY_FILES` constant explaining what gets
auto-wiped, the `concurrent_open_allowed` test comment explaining
what it replaced, and the `load_redb_job_meta_index` function name
(an interface kept stable on purpose so callers don't need to know
the store swapped underneath them — renaming is its own follow-up).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`SqliteJobsStore::for_repo_base` deletes `coordinator.redb` and
`repo-policy.json` siblings of the new `coordinator.db` the first
time the per-repo state dir is opened — the production "no
back-compat for pre-1.0 formats" path users will hit when they
upgrade across this PR. The sweep was relying on the
production-level `eprintln!` banner for visibility but had no test
asserting it actually fires.

Adds `for_repo_base_wipes_legacy_redb_and_policy_json` which plants
both legacy files, opens the store, and asserts the files are gone
and the new `coordinator.db` is in place.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captures the strategic intent the redb→SQLite migration vindicated: hexagonal
at subsystem boundaries, functional core inside domain modules, vertical slice
at the CLI command layer. Pins the long-term direction toward gitoxide-style
crate decomposition and the native agentic IDE spin-out — the concrete reason
crate-boundary ports will earn their keep. Lists patterns consciously rejected
(CQRS, event sourcing, microservices, ADR-per-decision) with the reasons
attached so future contributors don't relitigate.

CLAUDE.md gets a landpost paragraph at the top of the Architecture section
pointing at ARCHITECTURE.md, so agents and contributors hit the strategic
direction before any operational detail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@avihut
Copy link
Copy Markdown
Owner Author

avihut commented May 16, 2026

/claude review

@claude
Copy link
Copy Markdown

claude Bot commented May 16, 2026

Claude finished @avihut's task in 4m 50s —— View job


Review: feat(coordinator): redesign internals (#476)

This is a well-executed redesign. The hexagonal layering is clean, the spawn pattern follows CLAUDE.md precisely, forbid(unsafe_code) is preserved throughout, and the previous review's findings are all addressed. The following are new actionable findings.


Issues

1. verify_perms is defined but never called from production code

src/store/paths.rs:68–96

The PR description's security table lists "File perms: 0600 on file, 0700 on parent dir; verified once per process." This is only half-true. tighten_perms is called on fresh DB creation (correct — src/store/connection.rs:80), but verify_perms is never invoked from any production path — only from the module's own unit tests. Subsequent Pool::open calls on an existing DB go through bring_up → verify_application_id, but never verify_perms. A replacement attack (symlink a world-readable DB over the original) is not caught on re-open.

The fix is to call paths::verify_perms(db_path)? inside bring_up for non-read-only connections on existing DBs (the is_fresh_db = false branch). Or update the security table to say "set once at creation" rather than "verified once per process."

Fix this →


2. wipe_legacy_files prints to stderr on the tab-completion hot path

src/coordinator/adapters/sqlite_store.rs:57–75

complete.rs calls SqliteJobsStore::for_repo_base (lines 822, 1169, 1263) which always calls wipe_legacy_files. If a user has coordinator.redb or repo-policy.json on disk (upgrading from the previous design), the first tab press after the upgrade runs eprintln!("daft: removed legacy …"). CLAUDE.md explicitly calls out: "Stderr output is also problematic — eval only captures stdout … The same applies to the __complete tab-completion helper, which fires on every Tab keypress." The banner would appear mid-typing in the user's terminal.

The wipe is one-time (files are gone after first deletion), but the banner appearing on a tab press is confusing UX. The mitigation is to move the legacy wipe out of for_repo_base into a dedicated first-boot-or-coordinator-startup path (e.g., only call it from run_coordinator, not from every for_repo_base call). A flag file in the repo state dir could gate this.

Fix this →


3. format! in src/store/repos/jobs.rs violates the stated invariant

src/store/repos/jobs.rs:144–155

The CLAUDE.md security table records "All queries parameterized; no format! allowed in src/store/repos/." list_by_repo_and_statuses uses format! to build the placeholder list:

let placeholders = (2..=statuses.len() + 1)
    .map(|i| format!("?{i}"))
    .collect::<Vec<_>>()
    .join(", ");
let sql = format!(
    "… WHERE repo_hash = ?1 AND status IN ({placeholders}) …"
);

This is not a SQL injection risk — only the count comes from statuses.len() and the values are bound via params_vec. But it violates the "no format!" invariant that future contributors will rely on as a clear signal that queries are parameterized. The hardcoded callers are always two known statuses ("running" and "cancelling"), so a clean fix is to hardcode the two-placeholder form: WHERE repo_hash = ?1 AND status IN (?2, ?3). If a more general form is genuinely needed later, the rule should be updated in CLAUDE.md first.

Fix this →


4. for_repo_base bypasses the path-safety check that for_repo_under provides

src/coordinator/adapters/sqlite_store.rs:41–47

pub fn for_repo_base(base: &Path) -> Result<Self> {
    wipe_legacy_files(base);
    let db_path = base.join(crate::store::paths::COORDINATOR_DB);
    // ↑ no canonicalize(), no starts_with(base_dir) assertion
    let pool = Pool::open(&db_path);
}

paths::for_repo_under canonicalizes the parent and asserts it stays under base before returning the DB path — this is the symlink-escape guard (for_repo_under_rejects_symlink_escape test). for_repo_base skips this check. All production callers pass a store.base_dir that is itself already within the state dir, so this doesn't create a practical exploit today. But the layering is inconsistent: the paths module has a guard, and the adapter silently routes around it. The safe fix is to call paths::for_repo_under(base, "") or equivalent, or to document why the guard is intentionally skipped here.


5. Socket listener swallows unexpected errors without logging

src/coordinator/process.rs:743–749

Err(ref e) if e.kind() == std::io::ErrorKind::WouldBlock => {
    std::thread::sleep();
}
Err(_) => {
    // Unexpected error; break the loop.
    break;   // ← no eprintln!, no log
}

When an unexpected accept() error fires (file-descriptor exhaustion, protocol error), the socket listener exits silently. From that point on, cancel, tail-logs, and list-jobs IPC requests all fail at the client with "connection refused." This is hard to diagnose because no signal in the coordinator's output explains why the socket disappeared mid-flight. A one-line eprintln!("daft coordinator: socket listener error: {e}") before the break would surface this in whatever terminal collects the coordinator's output.

Fix this →


Minor observations

  • JobMeta.pid is set to the coordinator's own PID on the initial meta.json write (process.rs:468: pid: Some(std::process::id())), but the SQLite JobRow correctly uses child_pid = None for the initial write (via job_row_from_meta(..., child_pid = None)). So SQLite is correct; only the transitional meta.json sidecar carries the misleading coordinator PID. This is self-correcting once the meta.json path is retired in the follow-up.

  • stale_running_after_seconds from RepoPolicy is not used by the reconcilerclean_policy.rs:83 has fn stale_running_threshold consuming this field, but reconcile_active_jobs uses only live/dead PID probing. The field is wired up but the reconciler doesn't consult it. No bug (PID probe is sufficient), but the policy field has no effect today.

  • Pool::open in complete.rs runs migration validation on every tab press — The 165ms benchmark is acceptable, but migrate::run checks user_version on every Pool::open. Since the schema never changes at runtime, the completion path could eventually use a single Connection::open_with_flags(READ_ONLY) + skip the migration check, as the PR description's risks section notes. Not a blocker now.

  • wipe_legacy_files runs on every for_repo_base call, adding two stat() syscalls to the hot path — Once the legacy files are gone, the overhead is two p.exists() checks per completion-path open. Negligible but worth tracking for the eventual cleanup.


What's done well

  • Previous review findings all addressed — R1 (cancel_has_filter + regression test), R2 (: key safety comment), R3 (reconcile probe comment), R4 (derive(Clone) + error frame fallback), R5 (fork_coordinator alias removed), R7 (sampling unified to is_multiple_of), R8 (job_count dead code removed). Clean response.
  • Hexagonal ports/adapters/domain pattern is properly enforced — domain code has no rusqlite/nix/std::env imports; adapters are the only translation layer; the reconcile unit tests use in-memory fakes.
  • forbid(unsafe_code) fully preserved — no unsafe blocks, nix safe wrappers for setsid()/killpg().
  • canonicalize() on current_exe() is correct and load-bearing — the comment in spawn_coordinator explaining why this is necessary (symlink multicall dispatch) is accurate and important.
  • Frame size limit (8 MiB) with testswrite_rejects_oversize_payload and read_rejects_oversize_header both present.
  • Env scrub at the adapter boundaryupsert_job scrubs before any DB write; no caller can accidentally skip this. Tests confirm GH_TOKEN doesn't survive the round-trip.
  • Schema forward-compat sentinelrun refuses to apply migrations when user_version > current_version(), tested.
  • Concurrent access test replaces the redb lock testconcurrent_open_allowed directly demonstrates the property that motivated the redb→SQLite cutover.
  • cancel_has_filter extracted with regression test — clean fix for R1 and the function itself is now independently testable.
  • ACTIVE_STATUSES constant — shared between the fake store in tests and the adapter, so the reconciler tests are always consistent with the production filter.

avihut and others added 6 commits May 16, 2026 14:04
The redb→SQLite cutover left a one-line stderr banner ("daft: removed
legacy ... at ...") behind the `for_repo_base` constructor. The
completion path (`__complete`) opens the store on every Tab press and
runs inside `eval`-captured shell code that only captures stdout, so the
banner leaks straight into the user's terminal mid-keystroke if any
legacy file is still on disk after upgrade.

Split the constructor:
- `for_repo_base` — no wipe; safe to call from completion / CLI readers.
- `for_repo_base_with_wipe` — wipes legacy files and emits the stderr
  banner; called only from `run_coordinator` startup, where the
  diagnostic surface is already in use for other startup messages.

Documents the trust boundary on `for_repo_base` per PR review Issue #4:
`base` is derived upstream by `LogStore::new` from `daft_state_dir() +
jobs + repo_hash` (canonicalize-and-asserted by `paths::for_repo_under`),
so the adapter doesn't redo that check at every open. New callers MUST
keep the chain — no user input feeds `base`.

Adds a regression test (`for_repo_base_does_not_wipe_legacy_files`) so
the hot-path-stderr issue cannot reoccur silently.

Addresses PR #508 review Issues #2 + #4 + Minor #4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The security table in PR #508's description claims "0600 file / 0700
parent dir, verified once per process at first open." That claim was
half-true: `tighten_perms` ran on fresh-DB creation (correct), but
subsequent `Pool::open` calls on an existing DB went through
`bring_up`'s `verify_application_id` without ever re-checking the
filesystem perms.

The threat: an admin process (or a misconfigured backup tool) replaces
the 0o600 DB file with a 0o644 copy between daft invocations. Until the
file is recreated, the new perms persist and the contents are
world-readable. Application_id verification can't catch this because the
header bytes are intact.

Add a single `paths::verify_perms` call in `Pool::open` after
migrations finish but before the writer/reader pools come up. Fresh DBs
pass trivially (they were just tightened in `bring_up`); existing DBs
get the real check. Once per pool — no per-checkout cost.

Adds `pool_rejects_world_readable_db` as a regression test (tighten on
first open → loosen externally → assert the second open fails with
`PermissionsTooOpen`).

Addresses PR #508 review Issue #1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tuses

The original `list_by_repo_and_statuses(conn, repo_hash, statuses)`
built its placeholder list with `format!("?{i}")` and `format!(... IN
({placeholders}) ...)`. Not a SQL injection — only the placeholder
count came from `statuses.len()`, values were bound via `params_vec` —
but it violates the no-`format!`-in-repos invariant. That invariant is
load-bearing because future contributors read the absence of `format!`
in `src/store/repos/` as a clear signal that every query is
parameterized; once `format!` is used "carefully" in one spot, the
signal degrades.

The only caller (`SqliteJobsStore::list_active_jobs`) always passes
exactly two statuses (`"running"`, `"cancelling"`). Rename to
`list_by_repo_and_two_statuses` with a fixed `(status_a, status_b)`
signature and hardcoded `WHERE repo_hash = ?1 AND status IN (?2, ?3)`.
If a third active-set status appears, add a sibling method rather than
re-introducing dynamic placeholders.

Repo layer stays neutral on which statuses constitute "active" — the
caller still picks; only the cardinality is fixed at the type level.

Addresses PR #508 review Issue #3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The IPC socket listener's `accept()` loop has three arms: success
(dispatch to client handler), `WouldBlock` (sleep + retry), and any
other error (break the loop). The last arm exited silently — once it
fired, every subsequent `cancel`, `tail-logs`, and `list-jobs` IPC
request returned "connection refused" at the client, and nothing in the
coordinator's output explained why the socket disappeared.

Add a one-line `eprintln!` before the break so the underlying error
(FD exhaustion, protocol error, ...) lands in whatever terminal collects
coordinator stderr. The listener still exits — the unexpected-error
case isn't recoverable from inside the loop — but the operator gets a
breadcrumb instead of a void.

Addresses PR #508 review Issue #5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…h SQLite

The legacy `LogStore::clean` walked the filesystem and read each per-job
`meta.json` to drive retention decisions. PR #508's redb→SQLite
migration left that path in tree as the transitional reason for the
meta.json dual-write: production code wrote both SQLite rows and the
sidecar so the retention pass kept seeing per-job
`retention_seconds`/`worktree`/`started_at`. Pre-flight audit confirmed
the SQLite `jobs` row carries every field `clean` reads (`name`,
`worktree`, `started_at`, `status`, `retention_seconds`) with no
defaults — no schema migration needed.

Restructured along the hexagonal-plus-FCIS direction pinned in
ARCHITECTURE.md:

- **New pure module** `coordinator/domain/retention.rs`. Defines
  `JobSnapshot`, `RetentionInput`, `RetentionDecision`,
  `EvictionTarget`, `StaleMarkTarget`, and a single pure function
  `retention(input) -> decision`. No traits, no `&dyn Port`, no I/O,
  no `Utc::now`. Computes which jobs to flip to stale-running and
  which job dirs to evict, returns them as values for the shell to
  apply. 10 unit tests assert directly against returned decisions:
  time-based retention, keep_last floor, per-job override,
  retention_override supremacy, stale-running with/without socket,
  Running jobs exempt, multi-worktree per-floor accounting,
  jobs_per_inv/candidates_per_inv alignment for whole-invocation
  removal, dry-run shape parity.

- **`LogStore::clean` becomes the imperative shell**. Gathers the
  snapshot via `JobsStorePort::list_jobs_for_repo` (replacing the
  per-job `read_meta` loop), stats `output.jsonl` for sizes, probes
  the coordinator socket once, calls `retention(input)`, then applies
  the decision: stale-running marks via the port, evictions via
  rename-to-`.deleting-` then `remove_dir_all`. Signature now takes
  `&dyn JobsStorePort` + `repo_hash` so the data source is explicit.

- **Callers updated.** `log_clean::run_clean_logs` and
  `commands::hooks::jobs::prune_jobs` already constructed a
  `SqliteJobsStore` adjacent to the `LogStore`; both now thread it
  through `clean`. `log_clean` falls through (cannot retention-sweep
  without the store); `prune_jobs` `?`-propagates the open error.

- **Tests reseeded via port.** `test_clean_old_logs`,
  `clean_uses_per_job_retention_from_sqlite_row` (renamed from
  `_from_meta`), `clean_keep_last_floor_overrides_retention`,
  `clean_detects_stale_running_when_socket_missing`, and
  `dry_run_tallies_removed_invocations` now seed via
  `JobsStorePort::upsert_job` through a small `seed_job_row` helper.
  `seed_inv_with_jobs` gains an `Option<(&SqliteJobsStore, &str)>`
  parameter so budget tests (which stat the filesystem, not SQLite)
  can pass `None`.

The transitional `meta.json` write in production still happens — it's
unused by retention now, but `LogStore::clean` was the last reader and
deleting the write side will land in a follow-up commit so this diff
stays focused on the data-source swap + FCIS extraction.

Test count: 1792 → 1802 (10 new domain tests). All retention behavior
that the YAML scenarios exercise (per-job retention, keep_last floor,
stale-running, dry-run) is now also covered by pure-function unit
tests that don't depend on filesystem timing.

Addresses follow-up #55 (partial — meta.json write deletion comes
next).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #508 B1 made `LogStore::clean` query SQLite for retention metadata,
which leaves `meta.json` with no production reader. This commit removes
the writes (and the now-stale fallback reads), adds the sidecar to the
legacy-file sweep, and switches verification paths over to the
`JobsStorePort`.

Production write sites removed:
- `coordinator/process.rs` — three `store.write_meta(&job_dir, &meta)`
  calls at the dep-failed, initial-running, and terminal-update points.
  The matching `JobsStorePort::upsert_job` calls already write the
  authoritative row to SQLite.
- `coordinator/log_store.rs::write_job_record_jsonl` — no longer writes
  the sidecar. The helper now owns just the JSONL payload + per-job
  directory; metadata flows separately via `JobsStorePort::upsert_job`
  from `BufferingLogSink::persist_job_row`.

Production read sites removed:
- `commands/hooks/jobs.rs::read_job_meta_redb_first` — renamed to
  `lookup_job_meta` and reshaped as `(index, inv, job_dir) ->
  Option<JobMeta>`. The meta.json fallback is gone; SQLite-loaded
  index is the sole source. Callers (`build_job_list`, group rendering,
  status filtering, `format_jobs_logs`, retry resolution) switch from
  `Result` to `Option` handling.
- `retry_command` builds the SQLite index up front and threads it into
  `resolve_retry_invocation` (new `index` param) and the inline meta
  loop. The two remaining `store.read_meta` calls in those code paths
  are replaced by `lookup_job_meta`.

Legacy-file sweep extended:
- `sqlite_store::wipe_legacy_files` now also walks `<inv>/<job>/` for
  pre-cutover `meta.json` sidecars and removes them with a stderr
  banner. The sweep runs only from `for_repo_base_with_wipe`
  (coordinator startup), so the completion / CLI-reader hot path never
  prints to stderr mid-keystroke.
- The regression test grew a `meta.json` case in both the wipe and
  non-wipe variants.

Tests reseeded:
- `BufferingLogSink` tests verify against the SQLite row via
  `SqliteJobsStore::get_job`, not `LogStore::read_meta`.
- `process.rs` tests gain a real `SqliteJobsStore` via `make_test_state`
  + `make_ctx`'s new `job_store` parameter; assertions against terminal
  status, dep-failed synthesis, and stale-running flow through the port.
- `write_job_record_jsonl_creates_meta_and_jsonl_atomically` renamed to
  `..._writes_jsonl_without_meta_sidecar` and now asserts the sidecar
  is *absent*.
- Test for `read_job_meta_redb_first_falls_back_to_meta_json_when_redb_silent`
  removed — the behavior it pinned is gone. The companion test for
  "prefers redb when index has row" becomes `lookup_job_meta_returns_indexed_row`
  plus a sibling `lookup_job_meta_returns_none_when_index_misses`.

`LogStore::write_meta` and `LogStore::read_meta` themselves remain in
tree for now — only test setup still uses them to seed JobMeta
fixtures. They have no production callers; deleting them and migrating
the remaining test seedings to the port is a separate mechanical
cleanup.

Closes the meta.json dual-write follow-up referenced in PR #508's
"Known follow-ups" section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@avihut
Copy link
Copy Markdown
Owner Author

avihut commented May 16, 2026

Thanks for the second-pass review. Most of the findings are legitimate
and are addressed in be764d7d
387ed12f.
One Minor observation is a false alarm; tracing below.

Issues (addressed)

  • Feature: Shorthand Remote URL Support #1 verify_perms defined but never called from production
    fixed in f2e7b57c
    (fix(store): verify file perms once per pool open). Pool::open
    now calls paths::verify_perms(db_path)? once after migrations
    finish, both for fresh and pre-existing DBs. Tightened the security
    claim to actually hold. Regression test
    pool_rejects_world_readable_db chmods a freshly-created DB to
    0o644 and asserts the second open fails with PermissionsTooOpen.

  • Feature: Root Level .git Dir #2 wipe_legacy_files prints to stderr on the tab-completion hot
    path
    — fixed in be764d7d
    (refactor(coordinator): move legacy-file wipe off the hot path).
    for_repo_base split into two constructors:

    • for_repo_base — no wipe; used by completion / CLI readers.
    • for_repo_base_with_wipe — wipes legacy files and emits the
      stderr banner; called only from run_coordinator.
      Regression test for_repo_base_does_not_wipe_legacy_files plants
      legacy files and asserts the completion-path constructor leaves
      them alone.
  • Install Through brew #3 format! in src/store/repos/jobs.rs violates the
    no-format! invariant
    — fixed in 46c0921f
    (fix(store): hardcode IN-list placeholders ...). list_by_repo_and_statuses(&[&str])
    list_by_repo_and_two_statuses(status_a, status_b) with
    hardcoded SQL WHERE repo_hash = ?1 AND status IN (?2, ?3). The
    caller (SqliteJobsStore::list_active_jobs) always passes exactly
    two statuses, so the API shrink is lossless. If we ever need three,
    a sibling method is the right shape — re-introducing dynamic
    placeholders would weaken the CI grep-gate's signal.

  • Command Shortcuts #4 for_repo_base bypasses the path-safety check — addressed
    via the same commit (be764d7d)
    by documenting the trust boundary on for_repo_base. The full
    paths::for_repo_under check (canonicalize + prefix-assert
    against state dir) can't be applied uniformly without coupling the
    adapter to daft_state_dir() (which would break every test that
    uses a tempdir). The docstring now records that production callers
    derive base from LogStore::base_dir, which is itself constructed
    via <state_dir>/jobs/<repo_hash> — the chain is upstream-trusted.
    Per the review's own framing, "or to document why the guard is
    intentionally skipped" is a valid resolution; that's what's in tree.

  • Shell Completions #5 Socket listener swallows unexpected errors without logging
    — fixed in 658545d2
    (fix(coordinator): surface socket-listener errors before exiting).
    One-line eprintln! before the break in the Err(_) arm.

Follow-up from the previous review (now closed)

The LogStore::cleanmeta.json cutover that was previously
flagged as a deferred follow-up landed in
6535ef57
and 387ed12f:

  • B1 lifts retention into coordinator/domain/retention.rs as a pure
    FCIS function (10 unit tests, no fake-adapter mocking) and rewrites
    LogStore::clean as the imperative shell that gathers a snapshot
    via JobsStorePort::list_jobs_for_repo + filesystem stat, calls
    retention(input), and applies the returned RetentionDecision.
  • B2 drops the three production write_meta calls in
    coordinator/process.rs, the write_meta inside
    LogStore::write_job_record_jsonl, and the meta.json read fallback
    in commands/hooks/jobs.rs (read_job_meta_redb_first
    lookup_job_meta, ResultOption). meta.json is added to
    the legacy-file sweep so post-upgrade installs get cleaned up the
    same way the redb-era files are.

Minor observations

  • Minor Feature: Shorthand Remote URL Support #1 (JobMeta.pid mismatch on initial meta.json write)
    auto-resolved by B2: that initial write no longer happens.

  • Minor Feature: Root Level .git Dir #2 (stale_running_after_seconds claimed unused) — this
    is a false alarm. The field IS consulted by production code. End-
    to-end trace:

    1. src/coordinator/clean_policy.rs:82-85
      RepoPolicy::stale_running_after_resolved(&self) -> i64
      returns the field with an 86_400s default.
    2. src/coordinator/log_store.rs::clean (post-B1 shell) /
      src/coordinator/domain/retention.rs::retention (pure core) —
      consumes the resolved value as the stale-Running threshold
      (Duration::try_seconds(...)).
    3. Production callers of clean: src/log_clean.rs:120
      (run_clean_logs, fires on every daft invocation via the
      once-per-24h background sweep) and
      src/commands/hooks/jobs.rs::prune_jobs (the user-facing
      daft hooks jobs prune subcommand).
    4. The stale_running_marked_when_socket_missing and
      stale_running_skipped_when_socket_alive domain tests directly
      exercise the threshold path.

    Recommendation: the bot's index missed the indirection through
    stale_running_after_resolved. No change to make.

  • Minor Install Through brew #3 (per-tab-press migration check in Pool::open)
    acknowledged in the PR body's tab-completion benchmark section.
    165 ms total is acceptable for now; the Connection::open_with_flags(... READ_ONLY) + skip-migration path is a follow-up if real user
    feedback indicates the cost matters.

  • Minor Command Shortcuts #4 (extra stat() syscalls in the hot path from
    wipe_legacy_files)
    — auto-resolved by Feature: Root Level .git Dir #2's fix: the completion
    path no longer calls wipe_legacy_files, so the per-keystroke
    exists() checks are gone.

Diff summary

Commit Subject
be764d7d refactor(coordinator): move legacy-file wipe off the hot path
f2e7b57c fix(store): verify file perms once per pool open
46c0921f fix(store): hardcode IN-list placeholders in list_by_repo_and_two_statuses
658545d2 fix(coordinator): surface socket-listener errors before exiting
6535ef57 refactor(coordinator): lift retention into FCIS domain + route through SQLite
387ed12f chore(coordinator): drop meta.json from production paths

mise run test:unit clean (1802 tests); mise run clippy clean;
mise run fmt:check clean. CI on the pushed branch will tell us if
anything regresses against the integration sweep.

avihut and others added 2 commits May 16, 2026 15:37
B2 (`chore(coordinator): drop meta.json from production paths`) dropped
the `write_meta` call inside `LogStore::write_job_record_jsonl`, but the
YAML executor's pre-execution skip loop (the path that handles
per-job `skip: true` / `only: false` / platform / tracking filters)
was relying on that sidecar as its ONLY persistence — the
BufferingLogSink isn't constructed until after the loop, so its
`persist_job_row` path doesn't run for these jobs. Result: every
condition-skipped job became invisible to `daft hooks jobs ls` and to
the SQLite store.

Caught by the
`tests/manual/scenarios/hooks/per-job-skip-records.yml` YAML scenario
in CI.

Open the per-repo `SqliteJobsStore` once up-front (the same handle that
later writes the repo policy) and route both the skipped-job
persistence loop and the repo-policy write through it. Each skipped
job now gets a proper `JobRow` upsert alongside the `output.jsonl`
that carries the reason string. The on-disk shape converges with the
foreground/background paths.

Best-effort: if the store can't be opened, we still write the JSONL
log file and emit a one-line stderr warning; only the SQLite
persistence is skipped. Production paths are no different from before
in that failure mode — they printed the same warning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
B2 dropped `meta.json` from all production code paths, so YAML
scenarios that read the sidecar (`bg-output-silent`'s settle poll) or
mutated it (the log-cleanup family's backdating step, the synthetic
Running/stale invocation injectors) stopped working: the file simply
isn't there, and even when it was, daft no longer consulted it for
retention decisions.

Switch every affected scenario to operate on `coordinator.db`
directly via `sqlite3` (poll status, backdate `started_at` /
`finished_at` / `created_at`, INSERT synthetic invocations + jobs).
The eight scenarios that were red in CI on the previous push:

- `hooks/bg-output-silent.yml` — settle poll now queries
  `SELECT COUNT(*) FROM jobs WHERE name LIKE 'silent-%-job' AND
  status = 'running'`.
- `hooks/log-cleanup-dry-run.yml`,
  `hooks/log-cleanup-respects-retention.yml`,
  `hooks/prune-dry-run-invocations.yml` — backdate via
  `UPDATE jobs SET started_at = ...; UPDATE invocations SET
  created_at = ...;` with a Python-computed ISO-8601 timestamp.
- `hooks/log-cleanup-honors-keep-last.yml` — per-row shift in Python
  (preserves relative ordering of `started_at` so the keep_last sanity
  floor still applies in the expected order).
- `hooks/log-cleanup-skips-running.yml`,
  `hooks/log-cleanup-stale-running-detected.yml` — synthetic
  invocation+job INSERTs via Python's `sqlite3` module, matching the
  column shape from `001_initial.sql`. Placeholder `output.jsonl`
  files still touch the filesystem so the eviction path has something
  to remove.

All eight scenarios pass locally; the production fix in the previous
commit was caught by `hooks/per-job-skip-records.yml` (not modified
here — the test itself was correct, the production code was the bug).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@avihut
Copy link
Copy Markdown
Owner Author

avihut commented May 16, 2026

/claude review

@claude
Copy link
Copy Markdown

claude Bot commented May 16, 2026

Claude finished @avihut's task in 7m 30s —— View job


Review: feat(coordinator): redesign internals (#476) — Pass 3

All previous findings from passes 1 and 2 are correctly addressed. Four new items follow, two moderate and two minor.


Issues

1. load_redb_job_meta_index / redb_index are misleading misnomers

src/commands/hooks/jobs.rs:654–671

fn load_redb_job_meta_index(
    repo_hash: &str,
    log_store_base: &Path,
) -> Option<HashMap<(String, String), JobMeta>> {
    let store = SqliteJobsStore::for_repo_base(log_store_base).ok()?;
    let rows = store.list_jobs_for_repo(repo_hash).ok()?;}

The function opens a SQLite store and maps rows from it — it has nothing to do with redb. The name survives from the pre-cutover dual-write era when it read from redb and fell back to meta.json. Every call site passes the result as redb_index (lines 723, 885, 942, 1249, 1281, 1641, …), which now describes a SQLite-populated HashMap. Future contributors reading this file will look for a redb dependency that no longer exists.

The fix is a rename: load_redb_job_meta_indexload_sqlite_job_meta_index and redb_indexsqlite_index at every site. No logic change.

Fix this →


2. The no format! CI grep-gate is documented but does not exist

src/store/repos/mod.rs:6 and src/store/repos/jobs.rs:136–137

// mod.rs:6
//! All SQL is parameterized — no `format!` into queries — and a
//! CI grep-gate enforces that in `src/store/repos/`.

// jobs.rs:136-137
/// Fixed at exactly two statuses so the SQL stays parameterized with
/// hardcoded placeholders — the `no format!` invariant in
/// `src/store/repos/` is enforced by a CI grep-gate, …

No grep-gate exists in any workflow file (.github/workflows/) or mise task. The invariant is currently satisfied — format! genuinely does not appear in production query code in src/store/repos/ — but it is enforced by convention only, not tooling. A future contributor can add format! without any CI signal.

Two clean resolutions:

  • Add the gate: a CI step or mise task running ! grep -rPn 'format!' src/store/repos/*.rs | grep -v '^\s*//' would make the claim true.
  • Correct the comments: change "enforced by a CI grep-gate" to "enforced by convention" so the claim accurately reflects the current state.

Fix this →


3. Two stale "redb" doc comments in process.rs and prune.rs

src/coordinator/process.rs:338,355 and src/core/worktree/prune.rs:835

// process.rs:338
/// Optional dual-write target for the redb-backed `JobRow` lifecycle
//              ^^^^^^^^^^ should be "SQLite-backed"

// process.rs:355
/// Project a `JobMeta` into a redb `JobRow`. …
//                             ^^^^ should be "SQLite"

// prune.rs:835
/// filters its redb-recorded active jobs in one round trip …
//              ^^^^ should be "SQLite-recorded"

Redb is gone from the codebase. These three comments still use the old terminology. No logic impact, but they create false signal for future readers (and tooling like grep redb src/ which was being used to verify the cutover was complete).


4. meta.pid = Some(std::process::id()) is a dead assignment

src/coordinator/process.rs:468

let mut meta = JobMeta {pid: Some(std::process::id()),  // coordinator's own PID — never used};

job_row_from_meta ignores meta.pid entirely (it uses the child_pid parameter directly for JobRow.pid and JobRow.pgid). The meta.json write path that previously consumed meta.pid was removed in B2 (chore(coordinator): drop meta.json from production paths). This assignment now has no effect, but it looks meaningful — a reader would reasonably assume the coordinator's PID is being persisted into the initial SQLite row, when in fact None is written.

Fix: either remove the assignment (pid: None) or add a comment explaining it's inert (// not propagated to SQLite — overridden by child_pid in job_row_from_meta).


What's done well

  • All pass-2 findings correctly addressed. verify_perms is now called from Pool::open (with regression test). wipe_legacy_files is split into for_repo_base (no wipe, safe on the completion hot path) vs. for_repo_base_with_wipe (coordinator-only). format! is actually absent from src/store/repos/ production code. cancel_has_filter includes inv.is_some() with a dedicated regression test. fork_coordinator alias is gone.
  • Domain purity holds end-to-end. src/coordinator/domain/reconcile.rs and retention.rs have zero imports of rusqlite, nix, or std::env. Confirmed by grep.
  • process_group_alive now uses killpg correctly. nix::sys::signal::killpg(Pid::from_raw(pgid as i32), None) probes the process group — the rewrite from pass-2 fixed the semantic mismatch the first review flagged.
  • forbid(unsafe_code) preserved. The unsafe { libc::getpgrp() } in unix_process.rs:48 is inside #[cfg(test)], which the #![cfg_attr(not(test), forbid(unsafe_code))] gate in lib.rs allows.
  • Legacy-file sweep regression tests. for_repo_base_with_wipe_removes_legacy_files and for_repo_base_does_not_wipe_legacy_files both exist and cover the expected post-conditions precisely.
  • Condition-skipped jobs persist to SQLite. src/hooks/yaml_executor/mod.rs:306–354 writes both the JSONL record and the JobRow for jobs filtered out before execution. The regression commit a45d5a1 adds test coverage.

avihut and others added 4 commits May 16, 2026 22:05
`run_single_background_job` set `meta.pid = Some(std::process::id())` —
the coordinator's own PID — but production `job_row_from_meta` ignores
`meta.pid` entirely and uses the `child_pid` parameter for the SQLite
`JobRow.pid`/`JobRow.pgid`. The assignment looked meaningful but did
nothing in the row that actually gets persisted.

Replace with `pid: None` and a comment pointing at the terminal write
that populates `child_pid`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Redb is gone — the cutover landed in this PR. Five doc comments still
used the old terminology and would confuse readers using `grep redb src/`
as a cutover-completeness oracle.

Three flagged by the pass-3 review:
- src/coordinator/process.rs:338 (`JobInvocationContext.job_store` doc)
- src/coordinator/process.rs:355 (`job_row_from_meta` doc)
- src/core/worktree/prune.rs:835 (`cancel_background_jobs_for_worktree` doc)

Two more found while sweeping for consistency:
- src/executor/mod.rs:105 (`JobSpec.tags` doc)
- src/commands/hooks/jobs.rs:652 (`load_*_job_meta_index` doc — drop the
  redb-era contrast clause; keep the positive SQLite + WAL statement)

The legacy-wipe references at `process.rs:1396-1397` and inside
`src/coordinator/adapters/sqlite_store.rs` correctly describe historical
files (`coordinator.redb`) that the wipe code deletes — left alone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…_meta_index

Pure mechanical rename. The function and the `redb_index` variable/
parameter survived from the pre-cutover dual-write era when the index
was sourced from redb with a meta.json fallback. Today it opens
`SqliteJobsStore` and reads `JobRow`s directly — there's no redb
dependency to explain.

- Function: `load_redb_job_meta_index` → `load_sqlite_job_meta_index`
- Variable/parameter at 16 call sites: `redb_index` → `sqlite_index`

Confined to `src/commands/hooks/jobs.rs`. No behavior change; existing
unit tests (1802 passing) cover the rename via `cargo check` and
`cargo test`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`src/store/repos/{mod,jobs}.rs` claimed "a CI grep-gate enforces" the
no-`format!` invariant. No such gate existed in any workflow or mise
task — the claim was true by convention only. Make the claim true by
adding the gate.

New mise task `lint:repos-no-format` greps for `format!\s*\(` (matches
macro calls, ignores doc-comment prose like `` `format!` ``) and fails
loudly when one is found. Wired into the existing `lint:rust` umbrella
via `#MISE depends`, so `mise run lint` and the CI `mise run clippy`
step pick it up automatically.

Today the gate passes — no `format!()` calls exist in src/store/repos/
— and verified by temporarily injecting one (it correctly failed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@avihut avihut merged commit fe3e9fa into master May 16, 2026
13 checks passed
@avihut avihut deleted the daft-476/chore/coordinator-redesign branch May 16, 2026 19:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feat New feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

chore: coordinator internals redesign — job queue, IPC, persistence, cancel semantics

1 participant