feat(coordinator): redesign internals — persistence, IPC, structured logs (#476) by avihut · Pull Request #508 · avihut/daft

avihut · 2026-05-15T16:46:56Z

Abstract

The hooks-jobs coordinator (daft __coordinator) backs every background
hook job daft runs and is the source of truth for job status, logs, and
cancel semantics. Until this PR its internals had grown organically:
durability lived in scattered JSON sidecars, the IPC was loose \n-delimited
JSON without versioning, log content went into a single output.log with
no schema, and crashed jobs were invisible until the next manual cleanup.

This PR redesigns those internals around three pillars plus a
hexagonal architecture (store / ports / adapters / domain) that future
DB-backed features land into:

Persistence. A per-repo SQLite database at
<state>/jobs/<repo>/coordinator.db (WAL mode) replaces filesystem
sidecars (meta.json, repo-policy.json) as the system of record.
Writes are atomic via SQLite transactions. A reconcile pass on
coordinator boot flips rows for processes whose PIDs are gone to
Crashed, surfacing failures the user could previously only spot by
inspecting state directories. Critically, the new store also fixes
the cross-process-access limitation of the design's first iteration
(see "Migration notes" below).
Wire protocol. Length-prefixed framed JSON envelopes with a
stable v / ErrorCode taxonomy replace the old line-delimited
exchange. A TailLogs { follow, since_seq } streaming variant lets
the CLI live-tail structured records over the socket. The 8 MiB
frame ceiling bounds local IPC.
Logs. Per-job output.jsonl records ({seq, ts, kind: stdout|stderr|status, ...}) replace the legacy output.log. A
per-job log.sampling_every_nth knob drops records without breaking
sequence numbering. daft hooks jobs logs gains --stdout / --stderr / --status / --seq / --since-seq / --since / --follow flags.

Beyond the three pillars, the PR also lands the user-visible payoff for
the new persistence layer: daft hooks jobs ls / logs and the tab-
completion hot path now read from SQLite, so the new Crashed status is
finally visible from the CLI and stays visible while a coordinator is
actively writing (WAL allows concurrent readers).
A CancelMatching { hook, worktree, tag, invocation_prefix, older_than_secs } IPC variant and matching CLI flags collapse what used
to be three special-case helpers into a single predicate.

Three over-grown internal APIs were also tightened with context structs
(JobAdapterContext, HookExecutionContext, extended
JobInvocationContext) — all three lose their
#[allow(clippy::too_many_arguments)].

Closes #476.

Migration notes

The early iteration of this PR used redb for
persistence — pure-Rust, single-file, atomic transactions. During
review the cross-process access pattern surfaced as a hard wall: redb
takes a process-exclusive OS file lock on open, so the CLI could not
read job state while the coordinator was alive. The Tier-1 design
papered over this with a meta.json dual-write that the CLI fell back
to. Removing that workaround revealed the real issue and led to a swap
from redb to SQLite via rusqlite (bundled, WAL mode):

Cross-process concurrency model matches daft's actual access pattern
(many readers + one writer across processes on the same host) — no
workaround needed.
Fully safe Rust API honors daft's forbid(unsafe_code) policy
without carveouts. LMDB via heed was evaluated and rejected: its
Env::open is unsafe fn.
SQL fits the trust registry's pattern-match queries and the planned
repo catalog's relational shape — future DB-backed features land in
the same store layer rather than each one introducing a new DB.

Legacy files left over from earlier builds — coordinator.redb,
repo-policy.json, and per-job meta.json sidecars under
~/.local/state/daft/jobs/*/<inv>/<job>/ — are removed on first
coordinator startup by SqliteJobsStore::for_repo_base_with_wipe,
with a one-line stderr banner per file. The completion / CLI-reader
path uses a non-wiping constructor so the banner cannot leak into the
user's terminal mid-Tab press.

Architectural changes

Persistence: JSON sidecars → SQLite

flowchart LR
    subgraph Before["Before — JSON sidecars"]
        direction TB
        DM1["meta.json (per job)"]
        DR1["repo-policy.json"]
        DL1["output.log"]
    end

    subgraph After["After — SQLite + JSONL"]
        direction TB
        DB[("coordinator.db<br/>(WAL mode)")]
        DB --> T1["jobs"]
        DB --> T2["invocations"]
        DB --> T3["repo_policy"]
        DL2["output.jsonl (per job)"]
    end

    Before -- "legacy file sweep on coordinator startup" --> After

Writes go through the JobsStorePort trait (typed upsert_job,
upsert_invocation, write_repo_policy with field-merge semantics)
implemented by SqliteJobsStore. The adapter scrubs *_TOKEN /
*_SECRET / *_KEY / *_PASSWORD env vars at the persistence
boundary so no future caller can accidentally leak secrets to disk.

SQLite is the sole source of truth for job metadata. LogStore::clean
now gathers its retention snapshot via the port (lifted into a pure
coordinator/domain/retention.rs FCIS function), so meta.json no
longer flows through any production write or read path. Legacy
sidecars from earlier builds are swept on first coordinator startup.

Store / Ports / Adapters / Domain pattern

flowchart TB
    APP["Application<br/>(commands/hooks/jobs,<br/>IPC dispatch, complete.rs)"]
    DOMAIN["Domain<br/>(pure logic — no SQL, no syscalls)<br/>coordinator/domain/reconcile.rs"]
    PORTS["Ports (trait declarations)<br/>JobsStorePort · Clock · ProcessControl"]
    ADAPTERS["Adapters (impls)<br/>SqliteJobsStore · SystemClock · UnixProcess"]
    STORE["Store (data layer)<br/>connection · pool · migrate · repos · env_scrub"]

    APP --> DOMAIN
    APP --> ADAPTERS
    DOMAIN --> PORTS
    ADAPTERS --> PORTS
    ADAPTERS --> STORE

Hard rules (recorded in CLAUDE.md):

Domain code never imports rusqlite, nix, or std::env.
Store code never imports domain types — returns its own typed row
structs (JobRow, InvocationRow, RepoPolicyRow).
Adapters are the only translation layer; cross-cutting concerns
(env scrub, status enum ↔ TEXT conversion, policy merge) live there.

coordinator/domain/reconcile.rs is the canonical example — pure
logic, six unit tests against fake adapters. Future PRs will lift
more functions from process.rs (cancel, lifecycle) into the
domain layer using the same pattern.

Security baked into the store layer

Every consumer goes through src/store/connection.rs::bring_up, so
the security defaults are non-bypassable:

Concern	Measure
SQL injection	All queries parameterized; no `format!` allowed in `src/store/repos/`
File perms	0600 on file, 0700 on parent dir; verified once per process
Path safety	Always derived from `daft_state_dir()`; canonicalized; prefix-asserted
Schema version	Refuse-open if `PRAGMA user_version > CURRENT`
App identity	`PRAGMA application_id` check refuses foreign SQLite files
WAL hygiene	`wal_autocheckpoint = 1000`
Concurrent writes	Writer pool size 1 + `busy_timeout = 5000`
Concurrent reads	Reader pool `SQLITE_OPEN_READ_ONLY` + `busy_timeout = 300`
Secrets in env	`env_scrub::scrub` strips `*_TOKEN/_SECRET/_KEY/_PASSWORD` before persisting `JobRow.env`
Migration safety	`rusqlite_migration`-driven; refuses to open `user_version > binary's current`

Wire protocol: loose JSON → framed envelopes

sequenceDiagram
    participant CLI as daft CLI
    participant Coord as daft __coordinator

    Note over CLI,Coord: Length-prefixed (u32 BE) framed JSON envelopes.<br/>Every frame carries `v` (PROTOCOL_VERSION) and a tagged body.

    CLI->>Coord: RequestEnvelope { v:1, req:"TailLogs",<br/>payload:{job, follow:true, since_seq:42} }
    loop until terminal Status or EOF
        Coord-->>CLI: ResponseEnvelope { v:1, kind:"StreamFrame",<br/>payload: LogRecord {...} }
    end
    Coord-->>CLI: ResponseEnvelope { v:1, kind:"StreamEnd" }

    Note over CLI,Coord: On schema mismatch the coordinator emits<br/>Error { code: SchemaMismatch, message } instead of<br/>silently parse-failing — the typed ErrorCode taxonomy<br/>also covers JobNotFound / NoMatch / KillFailed / Internal.

Frames cap at 8 MiB so a misbehaving peer can't exhaust memory.
Streaming requests (currently just TailLogs) get many StreamFrame
envelopes terminated by exactly one StreamEnd. Single-response
requests exchange one envelope each way.

Coordinator lifecycle: spawn-self + crash recovery

flowchart TB
    A["Hook fires<br/>(worktree-post-create, ...)"] --> B{"Background jobs<br/>declared?"}
    B -- "no" --> Z["Foreground exec only<br/>(BufferingLogSink<br/>writes JobRow to SQLite)"]
    B -- "yes" --> C["spawn_coordinator()"]

    C --> C1["serialize state → 0600 tempfile"]
    C1 --> C2["current_exe().canonicalize()"]
    C2 --> C3["Command::new(canonical).args(__coordinator, tempfile)<br/>Stdio::null x3"]

    C3 --> D["child: daft __coordinator"]
    D --> D1["read + unlink state tempfile"]
    D1 --> D2["nix::setsid()"]
    D2 --> D3["SqliteJobsStore::for_repo_base(...)<br/>auto-wipe legacy coordinator.redb<br/>+ repo-policy.json"]
    D3 --> D4["coordinator::domain::reconcile_active_jobs()<br/>via UnixProcess + SystemClock adapters<br/>flips dead Running rows → Crashed"]
    D4 --> D5["run_all_with_cancel()<br/>each job in its own process_group(0)<br/>upserts JobRow at start + terminal"]
    D5 --> D6["process::exit"]

spawn_coordinator is the canonical pattern; fork() is not an
option (libc::fork is unsafe fn, conflicts with
forbid(unsafe_code)). The reconcile pass on boot is what makes the
"my last terminal crashed" case visible: the previous coordinator's
process group is gone, so its Running rows get flipped to Crashed
with finished_at = now, and the next daft hooks jobs ls shows them.

Cancel matching

flowchart LR
    subgraph CLI
        F1["--hook"]
        F2["--worktree"]
        F3["--tag"]
        F4["--inv"]
        F5["--older-than"]
        Fall["--all"]
    end

    F1 & F2 & F3 & F4 & F5 -- "AND" --> CM["CancelMatching {<br/>hook, worktree, tag,<br/>invocation_prefix, older_than_secs<br/>}"]
    Fall -- "no filter required" --> CA["CancelAll"]

    CM --> S["filter_matching_jobs over<br/>SqliteJobsStore::list_active_jobs(repo)"]
    CA --> S

    S --> K["killpg(pgid, SIGTERM)<br/>for each match"]
    K --> R["Cancelled { count, names }"]

Dependencies

Add rusqlite = "0.39" with bundled (compiles SQLite from source
into the binary, ~+700 KB on release builds, +C compiler at build
time).
Add rusqlite_migration = "2.5", r2d2 = "0.8", r2d2_sqlite = "0.33".
Drop redb = "4.1", bincode = "1.3".
All within the 7-day cooldown policy; no .dep-age-allowlist
entries needed.

Release binary size went from 9.3 MB (master) to 10 MB on this branch.

Test plan

cargo test --lib — 1790 unit tests pass (90 added under
store::*, coordinator::adapters::*,
coordinator::domain::reconcile::*, plus a regression test for
the legacy coordinator.redb / repo-policy.json sweep)
mise run clippy — zero warnings
mise run fmt:check — clean
mise run test:manual -- --ci — full YAML scenario sweep
Pre-push hook (mise run test:unit + build-check) — clean
Manual smoke: coordinator.db, -wal, -shm files appear in
per-repo state dir; concurrent reader works while writer alive;
legacy coordinator.redb / repo-policy.json auto-wiped with
stderr banner

Tab completion benchmark

Per the original migration plan, daft __complete is on the
documented hot path (fires every Tab keypress) and the SQLite open
adds work the meta.json scan didn't pay. Measured on the daft project
itself (298 invocation directories on disk, ~30 worktrees) with
hyperfine, comparing release binaries:

binary	mean	range
master (meta.json hot path)	122 ms	78 ms – 177 ms
this branch (SQLite hot path)	165 ms	97 ms – 217 ms

Branch is ~43 ms slower (1.35×). The --version-only baseline for
both binaries is ~16 ms, so essentially all of the gap is the
SQLite open + PRAGMA bring-up + first reader checkout. The dominant
cost on either binary is the filesystem invocation scan that powers
the worktree drill-down, which is unchanged. 165 ms total stays
under the typical "feels sluggish" threshold (~200 ms).

The advisor flagged the ≤5 ms budget from the original plan as
speculative — order-of-magnitude is the real signal, and 1.35× is
not a regression that warrants the cache-file mitigation from the
plan's Risks section. If user feedback indicates otherwise, a
follow-up can switch the completion path off the pool onto a single
direct Connection::open_with_flags(... READ_ONLY) and skip the
bootstrap/migration check (we know the schema is current — the
binary was built with it).

Architectural direction

This PR also lands ARCHITECTURE.md at the repo
root, pinning the strategic direction the hexagonal coordinator
proves out:

Hexagonal at subsystem boundaries. The redb→SQLite cutover here
was a single-adapter-file replacement; that's the property worth
preserving as new subsystems land.
Functional core inside domain modules. reconcile is the
worked example — pure function over a snapshot, no port callbacks
during computation. Future domain extractions (cancel,
lifecycle) follow this shape.
Vertical slice at the CLI command layer. Already the de facto
pattern in src/commands/; explicitly not extended into hexagonal
ceremony.
Gitoxide-style crate decomposition (long-term). One concern per
crate, narrow public surface, independent versioning. Not a
near-term refactor.
Native agentic IDE spin-out (long-term). The concrete second
consumer that justifies lifting ports from inside-one-crate to
at-a-crate-edge. daft-coordinator + daft-hooks + daft-trust
become reusable libraries with IDE-supplied adapters for storage,
process control, filesystem.

CLAUDE.md's Architecture section gets a landpost paragraph pointing
at the new doc, so agents and contributors hit the strategic direction
before any operational detail. The follow-ups below are all concrete
steps toward this direction.

Known follow-ups

The invocations SQL table is forward-looking but not yet
populated by production code (Tier-1 added the schema). The FK
from jobs.invocation_id → invocations.invocation_id is
intentionally deferred to the migration that adds the population
path; declaring it now would refuse every job insert.
Lift the remaining business logic in process.rs (cancel,
lifecycle) into coordinator/domain/ using the
reconcile_active_jobs pattern. Domain-level unit tests already
cover the lifted function; the surrounding scaffold can come over
incrementally.
Migrate the trust registry from JSON to src/store/ (new
trust.db in ~/.config/daft/).
Implement the repo catalog on src/store/ (new repos.db).

🤖 Generated with Claude Code

avihut · 2026-05-15T16:48:14Z

/claude review

claude · 2026-05-15T16:48:34Z

Claude finished @avihut's task in 5m 4s —— View job

Review: feat(coordinator): redesign internals (#476)

This is a large, well-structured PR. The spawn pattern is correctly followed, canonicalize() on current_exe() is present, unsafe is not reintroduced, and the schema-version sentinel and frame-size limit are good protocol discipline. Test coverage for the store layer is thorough. The following are actionable findings in descending priority.

Issues

1. `--inv` alone silently falls through in `cancel` — likely bug

src/commands/hooks/jobs.rs:537-557

let has_filter =
    hook.is_some() || worktree.is_some() || tag.is_some() || older_than.is_some();
if has_filter {
    cancel_matching(hook…, inv.as_deref(), …)  // inv passed here
} else if all { … }
else if let Some(j) = job { … }
else {
    anyhow::bail!("cancel requires a job address, --all, or at least one filter …");
}

inv is accepted by the Cancel subcommand and forwarded to cancel_matching, but it is not included in has_filter. A user running daft hooks jobs cancel --inv abc123 hits the final bail! instead of filtering by invocation prefix. Either add || inv.is_some() to has_filter, or document that --inv is only a secondary filter that must accompany another predicate.

Fix this →

2. `:` in `job_name` can corrupt key comparisons

src/coordinator/store/types.rs:116-118

pub fn job_key(repo_hash: &str, invocation_id: &str, job_name: &str) -> String {
    format!("{repo_hash}:{invocation_id}:{job_name}")
}

rsplitn(3, ':') in JobAddress::parse handles : in worktree names, but the redb key format doesn't escape : in job_name. A hook YAML with a job named build:test produces key {repo}:{inv}:build:test, which parses back as repo=repo, inv=inv, job=build:test and round-trips correctly for exact-match get_job. The range scan in list_jobs_for_repo uses {repo_hash}: as prefix and ; as upper bound — still safe. So this is not exploitable today.

However, invocation_id is a 16-hex string (safe) and repo_hash is a UUID (safe). Only job_name is user-controlled via YAML name:. A comment confirming this invariant ("job names must not contain :") or a validation site at YAML parse time would close this permanently.

3. `reconcile_active_jobs` comment misstates the probe semantics

src/coordinator/process.rs:62-66

// `signal::kill` with `None` is the POSIX existence probe:
// returns Ok(()) if the process group exists (regardless of permission)
let pid = Pid::from_raw(pgid as i32);
matches!(signal::kill(pid, None), Ok(()))

signal::kill(pid, None) sends signal 0 to PID pgid, not to the process group with that PGID. To probe a process group you'd use killpg(pgid, 0) (i.e. Pid::from_raw(-(pgid as i32))). In practice, since PID == PGID for these jobs via process_group(0), the probe lands on the right process. But the comment says "process group exists" when it really means "process group leader exists". A PID-reused process would also satisfy the probe — this is a known best-effort limitation, worth documenting inline rather than leaving the misleading comment.

Fix this →

4. `send_response` double-serde is fragile and silently drops errors

src/coordinator/process.rs:1099-1114

fn send_response(stream: &std::os::unix::net::UnixStream, response: &CoordinatorResponse) {
    let env = ResponseEnvelope {
        v: PROTOCOL_VERSION,
        body: match serde_json::to_value(response).and_then(serde_json::from_value) {
            Ok(b) => b,
            Err(_) => return,  // client hangs waiting for a response
        },
    };

The to_value→from_value round-trip exists to avoid needing CoordinatorResponse: Clone, per the comment. But serde_json::Value already implements Clone, so #[derive(Clone)] on CoordinatorResponse (which only has String, Vec, usize, and Value fields) would work and remove the round-trip entirely. As written, a serialization bug causes a silent return — the client hangs on read_frame indefinitely because no StreamEnd or error frame is sent.

Fix this →

5. `fork_coordinator` alias should be removed, not deferred

src/coordinator/process.rs:1356-1365

/// Backwards-compatible alias for callers still using the old name.
/// Functionally identical to [`spawn_coordinator`]; kept thin so the
/// rename across … can land in a follow-up.
pub fn fork_coordinator(state: CoordinatorState, store: LogStore) -> Result<…> {
    spawn_coordinator(state, store)
}

Both call sites (src/hooks/yaml_executor/mod.rs:405 and src/commands/hooks/jobs.rs:1850) are in this same PR. The alias was intended to let the rename land separately, but the rename didn't happen. This is dead API surface with a misleading name (fork_ implies fork()). Removing the alias and updating both call sites to spawn_coordinator is a one-line change per site.

Fix this →

6. `follow=true` tail loop has no shutdown escape hatch on the coordinator side

src/coordinator/process.rs:1186-1201

if n == 0 {
    if !follow { break; }
    let pos = match file.stream_position() { Ok(p) => p, Err(_) => break };
    std::thread::sleep(std::time::Duration::from_millis(200));
    if file.seek(SeekFrom::Start(pos)).is_err() { break; }
    reader = BufReader::new(&file);
    continue;
}

If the log writer thread panics before emitting a terminal Status::Finished|Signaled|Crashed record, handle_tail_logs will spin at 200ms/cycle forever — the connection read-timeout was cleared for follow=true at line 849, so neither side times out. The client will spin too since its stream_request iterator also cleared its read timeout. A shutdown signal check or a wall-clock deadline (e.g. "bail after N seconds at EOF with no new data after the coordinator PID file is gone") would close this.

7. Sampling inconsistency between foreground and background sinks

src/coordinator/process.rs:551 vs src/executor/log_sink.rs:122

// process.rs — background:
!seq.is_multiple_of(n as u64)

// log_sink.rs — foreground:
seq % (n as u64) != 0

Both expressions are semantically equivalent (is_multiple_of stabilised Rust 1.87, MSRV is 1.93). The inconsistency is cosmetic but signals the two paths diverged. Unifying to one form makes it easier to verify both paths sample identically, especially since is_multiple_of is clearer.

8. `job_count()` is `pub` with `#[allow(dead_code)]`

src/coordinator/store/mod.rs:201-207

A pub function should not need #[allow(dead_code)]. Either demote it to #[cfg(test)] + pub(crate), or add it to an existing test so the annotation is unnecessary.

What's done well

Spawn pattern is correct throughout — canonicalize() on current_exe() is present and load-bearing; the comment in CLAUDE.md is accurately reflected in the code.
forbid(unsafe_code) preserved — no unsafe blocks introduced; nix safe wrappers used correctly for setsid() and killpg().
Schema-version sentinel — forward-compat refusal ("older binaries refuse to write data they don't fully understand") is correct and tested.
Frame size limit (8 MiB) — bounds local IPC without being absurdly small.
Protocol version on every envelope — SchemaMismatch error code surfaced instead of silent JSON parse failures.
Crash-recovery reconciliation — reconcile_marks_dead_running_as_crashed and reconcile_leaves_alive_running_intact tests are solid. Spawning-and-waiting true to get a definitely-dead PID is a clean approach.
Tempfile at 0600 — coordinator state passed through a tempfile with default 0600 permissions; comment confirms this.
Fallback when redb lock can't be acquired — load_redb_job_meta_index returning None and falling back to meta.json is correctly documented as a known limitation with a test.
Hot-path is clean — shell-init, completions, and __* background spawns are all in skip_startup_tasks_for; no new commands on the hot path.
invocation_id_from_seed hash distribution — the ns+PID hash fixing the prefix-collision bug is clearly motivated and tested.
TOCTOU in write_repo_policy is benign in practice because the coordinator holds the exclusive redb lock, and is documented.

Windows build fix (jobs.rs:1197 referenced tail_logs which is Unix-only): - Refactor follow_logs to early-return through the coordinator path on Unix and fall through to the file-read fallback otherwise; the Unix-only IPC call now lives in a #[cfg(unix)] helper. - Gate anyhow::Context import in coordinator/process.rs so it isn't flagged unused on non-Unix where spawn_coordinator is excluded. Review-driven changes (#508): - #1 `cancel --inv X` no longer silently falls through to the bail!: `--inv` now counts as a filter. Extracted cancel_has_filter() so the invariant has a regression test. - #2 Document `:` invariant on job_key — repo_hash and invocation_id are :-free by construction; job_name is user-controlled but the key format is safe via rsplitn(3) on read. - #3 Fix reconcile probe comment: signal::kill(pid, None) probes the process-group leader's PID (== PGID via process_group(0)), not the whole process group. Note PID-reuse limitation. - #4 Replace send_response double-serde round-trip with derive(Clone) on CoordinatorResponse + JobInfo; on serialization failure, emit a sentinel Internal-error frame instead of silently returning (which hung the client on read_frame). - #5 Remove fork_coordinator alias; rename both call sites to spawn_coordinator. (`fork_` implies fork() which we don't use.) - #6 Document tail-loop follow=true shutdown limitation inline: if the writer panics before a terminal Status record, the loop can spin forever. Tracked for follow-up; not load-bearing in practice. - #7 Unify sampling expression: foreground sink now matches background via `!seq.is_multiple_of(n)`. - #8 Remove dead pub job_count() and its sole `ReadableTableMetadata` import — no callers anywhere, not worth a cfg(test) helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

avihut · 2026-05-15T17:25:02Z

Thanks for the thorough review. Addressed in 782272a + 41c9820 (numbering matches the original review list):

R1 --inv falls through silently — fixed. inv.is_some() now counts toward has_filter. Extracted cancel_has_filter() with a regression test.
R2 : in job_name key correctness — documented the invariant inline on job_key: repo_hash and invocation_id are :-free by construction; the key format remains safe via rsplitn(3, ':') on read and ; upper-bound on range scan even when job_name contains :.
R3 reconcile probe comment — rewrote: signal::kill(pid, None) probes the process-group leader's PID (PID == PGID because jobs are launched with process_group(0)). Added the PID-reuse caveat.
R4 send_response double-serde / silent error drop — replaced the to_value/from_value round-trip with derive(Clone) on CoordinatorResponse + JobInfo. On serialization failure we now emit a sentinel Error { code: Internal, message } frame so the client's read_frame doesn't hang.
R5 fork_coordinator alias — removed. Both call sites now reference spawn_coordinator.
R6 follow-loop shutdown escape — documented the limitation inline (writer panic before terminal Status would spin at 200ms/cycle indefinitely). Closing requires either a wall-clock idle deadline gated on stale-mtime + row status no longer Running, or a writer-liveness channel — tracking as a follow-up. Not load-bearing in practice.
R7 sampling expression — unified: src/executor/log_sink.rs now uses !seq.is_multiple_of(n) matching process.rs.
R8 pub job_count() with #[allow(dead_code)] — deleted (no callers anywhere) along with the now-unused ReadableTableMetadata import.

Also unblocked the Windows build that broke on this PR — refactored follow_logs to early-return through the coordinator path on Unix and fall back to the file-read otherwise; the Unix-only tail_logs IPC call now lives in a #[cfg(unix)] helper. Gated anyhow::Context import for cfg(unix) since spawn_coordinator (its sole user) is Unix-only.

The PR description was also expanded with an abstract and mermaid diagrams covering the four architectural shifts (persistence, IPC, lifecycle, cancel).

…er cancel Lands the durable-state half of the coordinator internals redesign: - redb-backed `JobStore` at `<state>/jobs/<repo>/coordinator.redb` with schema-version guard (refuses writes when on-disk schema is newer than the binary) - `JobStatus` gains `Crashed`/`Cancelling`/`Unknown` variants; `#[serde(other)] Unknown` gives older binaries a forward-compat read path for newer data - Reconciliation pass on coordinator boot probes each active row's PGID via `kill(.., None)` and marks dead-process rows `Crashed` - `JobSpec.tags` plumbed from YAML `JobDef.tags`; persisted into `JobRow` - Dual-write `JobRow` at job start + complete alongside existing `meta.json` writes (transitional; meta.json cleanup is follow-up) - `daft hooks jobs cancel` gains `--hook/--worktree/--tag/--older-than`; new `CancelMatching` IPC variant filters via redb and signals each matching PGID; refuses to run without at least one filter or `--all` - `cancel_background_jobs_for_worktree` collapses to a single `CancelMatching { worktree }` call (was list-then-cancel-per-name) - Adds `coordinator::framing` (length-prefixed u32-BE frames) and `coordinator::log_record` (LogRecord/OutputKind/StatusEvent JSONL types). Both modules are wired only at the seams that need them for Tier-1; the full envelope retag + output.jsonl writer migration ships with Tier-2. - Stream-tags the executor output channel: `Sender<(OutputKind, String)>` ripples through `LogSink::on_job_output(.., kind, line)` so future JSONL writers can record stream origin per line. Verification: 1731 unit tests (22 new), 15/15 bash hooks integration tests, 572/572 YAML manual scenarios, clippy clean, fmt clean. Refs #476 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…tured logs reader Replaces the flat `output.log` per-job file with structured `output.jsonl`: one `LogRecord` per line carrying `seq`/`ts`/`kind` so consumers can distinguish stdout from stderr, tail by sequence number, or filter by record kind. `output.log` stays as a one-cycle legacy fallback path. - Writers (`BufferingLogSink` + bg writer in `process.rs`) emit JSONL records using the stream-tagged channel (`(OutputKind, String)`) plus a terminal `Status::Finished`/`Status::Signaled` record per job - New per-job YAML knob `log.sampling_every_nth: N` drops all but every Nth stdout/stderr line; `seq` advances on dropped records so consumers see explicit gaps. `Status` records are never sampled - `daft hooks jobs logs` gains `--stdout` / `--stderr` / `--status` / `--seq` / `--since-seq` / `--since <duration>` flags via a shared `LogsFilter` applied at render time - Reader falls back to `output.log` when `output.jsonl` is missing (pre-upgrade data), synthesizing `Stdout`-only records keyed by line number so existing logs remain readable for one release cycle Verification: 1732 unit tests (3 new for sampling + log-format round- trip), 572/572 YAML manual scenarios, clippy clean. Refs #476 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… streaming Replaces newline-delimited JSON IPC with length-prefixed framed envelopes and adds a streaming `TailLogs` request that powers `daft hooks jobs logs --follow`. Wire format: - Every message is now `RequestEnvelope`/`ResponseEnvelope` — `{"v":1,"req|kind":"...","payload":...}` — wrapped in a `u32`-BE length-prefixed frame via `coordinator::framing`. Length prefix lets payloads carry embedded newlines (log lines with `\n`) without escape ambiguity, and the explicit `v` field gives a forward-compat handhold for future protocol revisions. - `CoordinatorResponse::Error` gains a typed `code: ErrorCode` field (`JOB_NOT_FOUND`/`SCHEMA_MISMATCH`/`INTERNAL`/...) alongside the human message; older callers using `..` destructure stay source-compatible. - New `StreamFrame(serde_json::Value)` + `StreamEnd` response variants carry the streaming layer. `TailLogs` streaming: - New `CoordinatorRequest::TailLogs { job, follow, since_seq }`. Server tails `output.jsonl`, emits one `StreamFrame` per `LogRecord`, sends `StreamEnd` when the terminal `Status` record (Finished/Signaled/ Crashed) appears or the source closes. - `CoordinatorClient::stream_request` returns a sync iterator over `StreamFrame` payloads — daft stays sync, no async runtime imposed. - `daft hooks jobs logs --follow` opens a tail via the coordinator when one is reachable; falls back to a one-shot file read when the coordinator is gone (per-invocation lifecycle: dead coordinator means the job is final). `JobAddress` moved from `src/commands/hooks/jobs.rs` to `src/coordinator/types.rs` (with serde derives) so it can ride on the wire without a back-pointer from the coordinator layer into the CLI. Verification: 1731 unit tests, 572/572 YAML manual scenarios, clippy clean. Refs #476 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Folds the remaining Tier-3 items from #476 into the coordinator-redesign branch and closes the visibility gap that Tier-1's crash recovery left. Two items are load-bearing for Tier-1 actually working end-to-end: - `daft hooks jobs ls/logs` now read from `JobStore` first, falling back to `meta.json` only when no redb row exists. Reconciliation marks rows as `Crashed` in redb but never touched the legacy `meta.json`, so the new status was invisible to users until this migration. - `repo-policy.json` moves into the `repo_policy_v1` table with a one-shot JSON→redb migration on `JobStore::open_for_repo_base`. redb's WAL gives atomicity for free, closing the Tier-3 `write_repo_policy` gap. API surface cleanup (three over-grown functions get focused context structs): `JobAdapterContext` for `yaml_jobs_to_specs` (8→6 args), `HookExecutionContext` for `execute_yaml_hook_with_rc` (11→5 args), and `JobInvocationContext` extended to absorb `run_single_background_job`'s five passthrough Arcs (7→3 args). All three lose `#[allow(clippy::too_many_arguments)]`. Also fixes `prune --dry-run`'s asymmetric summary gate (defense-in-depth when `jobs == 0 && invs > 0`) and updates the post-Tier-2 leftover in `hooks/bg-output-silent.yml` that still looked for `output.log`. `hooks/crashed-recovery.yml` was attempted but deferred — the YAML gate hit a hidden dependency on coordinator single-flight semantics that wasn't visible in the plan. Reconcile is fully covered by `coordinator::process::tests::reconcile_marks_dead_running_as_crashed`, so the end-to-end scenario is value-add, not value-blocking. Verified: 1737 unit tests pass; clippy clean; 572 YAML scenarios pass on retry (one pre-existing master-too truncation-feature regression unrelated to this PR). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…r-3 readers Adds a regression test (`concurrent_open_is_rejected_by_redb_lock`) and a doc-comment on `load_redb_job_meta_index` calling out the known limitation that redb's process-level lock causes the CLI reader to fall back to `meta.json` whenever a coordinator is alive holding the same file. Surfaced by advisor review of the Tier-3 cleanup. Behavior unchanged — this is documentation + a regression gate so the fallback rationale is discoverable if anyone later tries to change reader semantics. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three tests in `coordinator::process::tests` exercise cross-thread coordination by spawning probe/killer threads that race against the registrar inside `run_single_background_job`. Under heavy parallel-test load (the lefthook pre-push sweep at ~1740 tests), the fixed-sleep delays missed their windows: - `run_single_background_job_registers_and_deregisters_pid` — the probe ran before the registrar thread inserted the PID - `per_job_cancel_marks_status_cancelled_not_failed` — `cancel_single_job` fired before the PID was recorded, so the SIGTERM target wasn't found - `bg_dependent_skipped_when_dep_cancelled` — same race as above Replace fixed sleeps with polling waits: probe loops until the PID appears (up to 600ms / 2s respectively), killer loops until the PID is visible before signalling. All three tests pass 3-for-3 in parallel mode after the fix. The function under test is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`xtask test-matrix` is a long loop that spawns `xtask manual-test --ci` per matrix entry. If the invoking shell dies mid-run (terminal closed, session reload, parent signal lost), the loop survives reparented to launchd/init and keeps burning CPU for the remaining entries — observed in the wild as a 1h40m runaway during this PR's development. Add an orphan check at each iteration boundary: when `getppid() == 1` we bail out with a one-line stderr message instead of starting the next entry. Uses `nix::unistd::getppid` (nix 0.31 already pinned in the daft workspace, adding `process` feature). Non-Unix builds get a no-op. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Companion to the prior `xtask test-matrix` orphan-guard commit — the new `nix` dependency in `xtask/Cargo.toml` (process feature; same 0.31 already pinned by the main daft crate) needs the corresponding lockfile entry. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Windows build fix (jobs.rs:1197 referenced tail_logs which is Unix-only): - Refactor follow_logs to early-return through the coordinator path on Unix and fall through to the file-read fallback otherwise; the Unix-only IPC call now lives in a #[cfg(unix)] helper. - Gate anyhow::Context import in coordinator/process.rs so it isn't flagged unused on non-Unix where spawn_coordinator is excluded. Review-driven changes (#508): - #1 `cancel --inv X` no longer silently falls through to the bail!: `--inv` now counts as a filter. Extracted cancel_has_filter() so the invariant has a regression test. - #2 Document `:` invariant on job_key — repo_hash and invocation_id are :-free by construction; job_name is user-controlled but the key format is safe via rsplitn(3) on read. - #3 Fix reconcile probe comment: signal::kill(pid, None) probes the process-group leader's PID (== PGID via process_group(0)), not the whole process group. Note PID-reuse limitation. - #4 Replace send_response double-serde round-trip with derive(Clone) on CoordinatorResponse + JobInfo; on serialization failure, emit a sentinel Internal-error frame instead of silently returning (which hung the client on read_frame). - #5 Remove fork_coordinator alias; rename both call sites to spawn_coordinator. (`fork_` implies fork() which we don't use.) - #6 Document tail-loop follow=true shutdown limitation inline: if the writer panics before a terminal Status record, the loop can spin forever. Tracked for follow-up; not load-bearing in practice. - #7 Unify sampling expression: foreground sink now matches background via `!seq.is_multiple_of(n)`. - #8 Remove dead pub job_count() and its sole `ReadableTableMetadata` import — no callers anywhere, not worth a cfg(test) helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…icy YAML scenarios The Tier-3 cutover migrates `repo-policy.json` into redb and deletes the sidecar on first open. Three YAML scenarios that previously `cat`'d the file (repo-log-defaults-applied, repo-policy-not- clobbered, prune-budget-accounting) had no way to observe the persisted policy and were failing in CI. Adds a hidden `daft __dump-store <table>` subcommand that reads the redb-backed JobStore for the cwd's repo and prints the requested table as JSON. Currently supports `repo-policy`; jobs and invocations can follow if needed. The `__` prefix excludes it from completions and skips startup background tasks. The three scenarios now `daft __dump-store repo-policy` from the worktree cwd and grep the JSON for `max_total_size_bytes`/ `keep_last` — same assertions, redb-backed source of truth. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…eeps working Tier-2 moved foreground writers to `output.jsonl` only. The per-file truncator (`LogStore::truncate_oversized_logs`) and the legacy reader fallback both still target `output.log`, so removing the dual-write silently broke the truncation feature (`log-cleanup-per-file-cap.yml` fails on this branch but passes on master). Restores the legacy `output.log` as a sibling of `output.jsonl` in `BufferingLogSink::on_job_complete` and `on_job_runner_skipped`. The raw bytes are reconstructed from the same post-sampling record buffer so both files reflect a single, consistent view of the job's output. Matches the one-cycle compat posture the reader path already takes. Background-job logs (coordinator-managed) still write JSONL only — not in the truncator's path today. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ncator Tier-2 of the coordinator redesign moved writers over to the structured `output.jsonl` format. The legacy `output.log` byte-stream lingered on as (a) a dual-write target, (b) a reader fallback, and (c) the input to a byte-level `truncate_oversized_logs` cleanup pass. Daft is effectively pre-release; preserving these legacy paths is perpetual tax with no beneficiaries. Cut entirely: - `LogStore::log_path`, `LogStore::write_job_record` (only the JSONL variant remains). - `LogStore::truncate_oversized_logs` and its plumbing in `prune_jobs` / `run_clean_logs`; `CleanSummary::truncated_logs`; the CLI's "truncated N log(s)" suffix. - `JobMeta.log_truncated` and `JobMeta.original_size_bytes`; same fields on `JobRow`. - The `output.log` read fallback in `render_job_log`. - `tests/manual/scenarios/hooks/log-cleanup-per-file-cap.yml`. - Synthetic `meta.json` field references in the surviving log-cleanup scenarios. JSONL-aware truncation is a separate future feature if ever needed — the typical motivating case (per-file size cap on a runaway log) is addressable today by sampling or per-job log-size budgets enforced at hook-fire time. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`start_socket_listener` puts the listener in non-blocking mode so the accept loop can poll the shutdown flag. On macOS (and other BSDs) accept(2) returns a socket that inherits O_NONBLOCK from the listener, so every accepted stream landed in `handle_client_connection` non-blocking. `read_exact` then returned `WouldBlock` before any client bytes arrived and the request silently dropped — the existing `set_read_timeout(5s)` is a no-op on non-blocking sockets. `test_socket_listener_list_jobs` exposed this as a deterministic EOF mid-frame on the client side. The same race affected real coordinator IPC on macOS. Force the accepted stream back to blocking at the top of `handle_client_connection` so `set_read_timeout` is the only thing bounding reads. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Phases 1-3 of the redb → SQLite migration. Lays down the new data-access stack alongside the existing redb store; the cutover that deletes redb + meta.json comes in a follow-up commit so reviewers can see the new code in isolation. **Phase 1 — store layer (`src/store/`)** Pure SQLite, no daft business logic. Every consumer inherits the same security defaults because they all funnel through `connection::bring_up`: * `connection.rs` — applies `PRAGMA application_id`, `journal_mode = WAL`, `synchronous = NORMAL`, `foreign_keys = ON`, `busy_timeout`, and `wal_autocheckpoint`. Refuses to open a foreign SQLite file (mismatched `application_id`); tightens file/dir perms to 0600/0700 on fresh creates. * `pool.rs` — `r2d2_sqlite`-backed writer pool (size 1, enforces WAL's single-writer ordering) and reader pool (multi-slot, opened with `SQLITE_OPEN_READ_ONLY` and a 300ms busy_timeout so CLI paths fail fast). * `migrate.rs` — `rusqlite_migration` runner. Refuses to open if the on-disk `user_version` is newer than this binary's understanding. * `migrations/001_initial.sql` — `invocations`, `jobs`, `repo_policy` tables with explicit `ON DELETE CASCADE` FK from `jobs` to `invocations` and indexes for the `ls`/reconcile scan paths. * `paths.rs` — canonical path resolution. `for_repo_under(base, hash)` canonicalizes the result and rejects symlink-escape attempts. * `env_scrub.rs` — strips `*_TOKEN`/`*_SECRET`/`*_KEY`/`*_PASSWORD` (case-insensitive suffix match) before job env touches disk. * `models/` + `repos/` — typed row structs (no bincode blobs) and a parameterized-SQL-only query layer. Multi-row writes use `repos::with_write_txn`. **Phase 2 — coordinator ports + adapters** `src/coordinator/ports/` declares the abstractions the domain layer talks to: `JobsStorePort`, `Clock`, `ProcessControl`. `src/coordinator/adapters/` implements them: `SqliteJobsStore` (wraps the pool, scrubs env on every job upsert, converts RepoPolicyRow ↔ RepoPolicy), `SystemClock`, `UnixProcess` (wraps nix's `killpg` with a signal-0 existence probe). **Phase 3 — first domain extraction** `src/coordinator/domain/reconcile.rs` lifts `reconcile_active_jobs` out of `process.rs`. The domain version takes the three ports and returns the count of jobs marked Crashed. Pure logic — no `nix::`, no SQL, no `Utc::now()`. Six unit tests drive it with fully-mocked adapters. This proves the layering separates correctly; future PRs can lift `cancel.rs` and `lifecycle.rs` from `process.rs` into `domain/` using the same pattern. **Dependencies** * `rusqlite = "0.39"` with `bundled` (compiles SQLite from source, ~+1 MB on release builds, no system-library skew) * `rusqlite_migration = "2.5"` * `r2d2 = "0.8"` * `r2d2_sqlite = "0.33"` (kept on 0.33 instead of 0.34 to clear the 7-day cooldown without an allowlist entry) All within the existing cooldown policy; no `.dep-age-allowlist` changes needed. **Test coverage added** `mise run test:unit` adds 89 new passing tests under `store::*` (76), `coordinator::adapters::*` (7), and `coordinator::domain::reconcile::*` (6). The redb-era `concurrent_open_is_rejected_by_redb_lock` is mirrored by `coordinator::adapters::sqlite_store::tests::concurrent_open_allowed`, which asserts the new positive behavior — two pools can open the same file and the reader sees committed writes mid-flight. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…nitial schema Three small enabling pieces for the upcoming JobStore → SqliteJobsStore swap. Each is a no-op on its own; the next commit consumes them. * `JobStatus::as_status_str` / `from_status_str` — the SQLite `jobs.status` column is TEXT, the domain uses an enum, and serde's `lowercase` rename happens to give the same tags. Putting both directions on the enum keeps the conversion at the storage boundary out of inline match statements at every call site. * `SqliteJobsStore::for_repo_base(base)` — convenience that mirrors the old `JobStore::open_for_repo_base`. Wipes any sibling `coordinator.redb` / `repo-policy.json` files (loudly, one stderr line per file) the first time a repo's state dir is opened. Matches the pre-1.0 no-back-compat principle: cleanup is not silent. * Initial migration no longer declares the `jobs → invocations` FK. Production code never populated `invocations` under the redb store (the table was added forward-looking in Tier-1 and forgotten), so the FK would refuse every job insert today. A follow-up migration will add it back when invocation tracking actually lands. Test expectations updated to pin the current "jobs can stand alone" shape and the no-cascade-on-delete behavior, so the change is visible the day the FK returns. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…oduction Every production call site that previously opened the per-repo redb `JobStore` now opens `SqliteJobsStore::for_repo_base` instead. The redb module remains in tree but is unused by production code; it gets deleted in a follow-up commit once the meta.json dual-write is also removed and the integration tests have stayed green for a round. * `src/coordinator/process.rs`: * `JobStore` → `SqliteJobsStore` everywhere (types + opens). * `job_store_path` helper removed; callers use `for_repo_base`. * `reconcile_active_jobs` becomes a thin wrapper around the domain function that wires up `SystemClock` + `UnixProcess` adapters. The inline reconciliation logic was lifted into `domain/reconcile.rs` in the previous commit and is exercised by six unit tests there. * `job_row_from_meta` produces the new SQLite `JobRow` shape (no `schema_version`, `status: String` via `as_status_str`). * Two inline `reconcile_*` tests removed — superseded by the domain-layer tests. Filter and dual-write tests updated to the new shape. * `src/log_clean.rs`, `src/hooks/yaml_executor/mod.rs`, `src/commands/dump_store.rs`, `src/commands/hooks/jobs.rs`: same swap. The `__dump-store repo-policy` debug helper keeps emitting the same JSON shape so scenarios that grep its output don't have to change. * `src/commands/hooks/jobs.rs::load_redb_job_meta_index` — the name is preserved for callers but the impl now reads from SQLite. The prior redb-era "process-lock breaks concurrent reads while coordinator is alive" caveat is gone; WAL + the reader pool means the open succeeds mid-flight. The "fall back to meta.json" branch is still wired so unmigrated state and pre-store tests keep rendering — that fallback comes out in the next commit alongside the meta.json dual-write. All 1,801 unit tests pass. Integration scenarios untouched (the adapter is API-compatible with the redb store's public surface, and the legacy `coordinator.redb` / `repo-policy.json` files in scratch state dirs are auto-wiped on first open with a one-line stderr banner). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…Lite The cutover takes the **read** side off `meta.json` for every hot path the user is likely to hit interactively, and starts routing every **write** through the SQLite store alongside the legacy sidecar so the two stay consistent during the transition. The remaining `meta.json` write is kept on purpose: `LogStore::clean` still reads `meta.json` for per-job retention metadata, and rewriting that to query SQLite is its own scope. The final `meta.json` deletion + `LogStore::clean` rewrite lands in a follow-up. **Reads — fully off meta.json on the touched paths** * `src/commands/complete.rs` — tab completion (the documented hot path) now opens a single `SqliteJobsStore` per `daft __complete` call and a new `jobs_for_invocation` helper batches the per- invocation lookup into one `list_jobs_for_invocation` query instead of one filesystem `read_meta` per job dir. Seven prior `store.read_meta(dir)` call sites collapsed to four `jobs_for_invocation(...)` calls. The reader pool's tight `busy_timeout` (300 ms) keeps a stuck SQLite open from blocking the user's shell. * `src/coordinator/process.rs::build_job_list` (powers `ListJobs` IPC) now sources its data from `SqliteJobsStore:: list_jobs_for_repo`; the prior `LogStore::list_job_dirs` + per-dir `read_meta` scan is gone. **Writes — every path also goes to SQLite** * `BufferingLogSink` (foreground hook path) gains a `SqliteJobsStore` opened at construction. `on_job_complete` and `on_job_runner_skipped` call a new `persist_job_row` helper that converts `JobMeta` → `store::models::JobRow` and upserts. Eight test sites updated for the new `repo_hash` arg on `BufferingLogSink::new`. * `src/coordinator/process.rs::dep_failed` loop now persists a skipped `JobRow` so dep-failed jobs aren't invisible to `daft hooks jobs ls`. Previously they only landed in `meta.json`. * The `JobMeta` → `JobRow` projection used by the tab-completion hot path lives next to `JobMeta` itself as `log_store::job_meta_from_row`, so future renderer call sites can adopt it without re-implementing the conversion. **Notes on the deferred work** The advisor's Phase-4c spec called for the meta.json *write* to go away too. Doing so requires `LogStore::clean` to query SQLite for per-job retention metadata (it currently scans `meta.json`). `LogStore::clean` is tangled enough — stale-running detection, per- worktree sanity floor, candidate aggregation — that a sloppy rewrite would silently regress the retention pass. The right move is one focused commit; this one stops short so the diff stays reviewable and the YAML scenarios for retention/budget keep their existing baselines. All 1,801 unit tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`src/coordinator/store/` was the redb-backed `JobStore` plus its key schema and value types. As of the previous commit nothing in production calls into it — every call site routes through `SqliteJobsStore` / `JobsStorePort`. The module is unreachable code; deleting it is the bookkeeping step that closes Phase 4 of the redb → SQLite migration. * Remove the `src/coordinator/store/` directory (mod.rs + types.rs + schema.rs) and the matching `pub mod store;` declaration in `src/coordinator/mod.rs`. * Drop `redb = "4.1"` and `bincode = "1.3"` from `Cargo.toml`. The pinned bincode comment was about sharing the version with syntect and edtui — those crates keep using it through their own dep tree, daft doesn't need a top-level entry. `coordinator.redb` files left over in `~/.local/state/daft/jobs/*/` are removed on first open by `SqliteJobsStore::for_repo_base`'s legacy-file sweep (announced via stderr banner), so users upgrading across the cutover don't have stray bytes hanging around. All 1,789 unit tests pass — 12 fewer than before because the redb module's internal tests went with the module. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…Lite files The scenario captured the per-repo state dir's newest subdirectory via `ls -1tr | grep -v '\.json$' | tail -1`. After the SQLite cutover the per-repo dir also contains `coordinator.db` plus the WAL/SHM sidecars (`coordinator.db-wal`, `coordinator.db-shm`), and the WAL file is touched on every commit so it sorts as "newest" — the probe captured `coordinator.db-wal` instead of the most recent invocation dir and the dir-existence check at the end of the scenario failed. Switch to a `find -type d`-based probe so only directories are considered. The same scenario already uses `find -type d` for the count assertions; this aligns the newest-id capture with the same filter. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Documents the architectural decisions from the redb → SQLite migration so the next agent (or human) lands new DB-backed features in the right shape without re-deriving the constraints. Four edits: * **Critical Rule #4 (forbid unsafe)** — extended with a sentence on dependency selection: prefer crates with a fully safe public API. Cites SQLite via `rusqlite` over LMDB via `heed` (whose `Env::open` is `unsafe fn`) as the canonical worked example. * **New section after Dependency Cooldown — `## Database & Storage`** — SQLite via `rusqlite` (bundled + WAL) is daft's canonical structured- data store; no other embedded DBs; redb's process-exclusive lock is why the prior choice didn't fit; the pool's reader/writer split; the WAL-on-network-fs constraint. References the persisted memory doc for the full redb post-mortem. * **New `## Architecture` subsection — Store / Ports / Adapters / Domain pattern** — diagrams the four layers, names the hard rules (domain never imports `rusqlite`/`nix`/`std::env`; store never imports domain types; adapters are the only translation layer), and points at the canonical examples to study before touching the pattern. * **New section — `## Adding a New DB-backed Feature`** — eight-step checklist analogous to "Adding a New Command", in the order that actually keeps the layering clean (schema first → models → repos → port trait → adapter → wiring → unit-tested domain logic → security defaults are non-bypassable). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three doc-comments still said "redb" after the cutover. None of these are user-facing — they're inline comments and a CoordinatorRequest variant doc — but they'd confuse the next reader trying to map the code back to the actual store. * `CoordinatorRequest::CancelMatching` doc: "redb-recorded Running/Cancelling rows" → "Running/Cancelling rows from the SQLite store". * `cancel_matching` fn doc + error message: "redb JobRow" / "redb job store" / "pre-redb invocations" → "SQLite JobRow" / "SQLite job store" / "pre-store invocations". * `run_coordinator` body comment: "redb-backed job store" → "SQLite-backed job store". Historical references to the redb era are preserved where they're load-bearing: the `LEGACY_FILES` constant explaining what gets auto-wiped, the `concurrent_open_allowed` test comment explaining what it replaced, and the `load_redb_job_meta_index` function name (an interface kept stable on purpose so callers don't need to know the store swapped underneath them — renaming is its own follow-up). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`SqliteJobsStore::for_repo_base` deletes `coordinator.redb` and `repo-policy.json` siblings of the new `coordinator.db` the first time the per-repo state dir is opened — the production "no back-compat for pre-1.0 formats" path users will hit when they upgrade across this PR. The sweep was relying on the production-level `eprintln!` banner for visibility but had no test asserting it actually fires. Adds `for_repo_base_wipes_legacy_redb_and_policy_json` which plants both legacy files, opens the store, and asserts the files are gone and the new `coordinator.db` is in place. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Captures the strategic intent the redb→SQLite migration vindicated: hexagonal at subsystem boundaries, functional core inside domain modules, vertical slice at the CLI command layer. Pins the long-term direction toward gitoxide-style crate decomposition and the native agentic IDE spin-out — the concrete reason crate-boundary ports will earn their keep. Lists patterns consciously rejected (CQRS, event sourcing, microservices, ADR-per-decision) with the reasons attached so future contributors don't relitigate. CLAUDE.md gets a landpost paragraph at the top of the Architecture section pointing at ARCHITECTURE.md, so agents and contributors hit the strategic direction before any operational detail. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

avihut · 2026-05-16T10:17:24Z

/claude review

claude · 2026-05-16T10:17:38Z

Claude finished @avihut's task in 4m 50s —— View job

Review: feat(coordinator): redesign internals (#476)

This is a well-executed redesign. The hexagonal layering is clean, the spawn pattern follows CLAUDE.md precisely, forbid(unsafe_code) is preserved throughout, and the previous review's findings are all addressed. The following are new actionable findings.

Issues

1. `verify_perms` is defined but never called from production code

src/store/paths.rs:68–96

The PR description's security table lists "File perms: 0600 on file, 0700 on parent dir; verified once per process." This is only half-true. tighten_perms is called on fresh DB creation (correct — src/store/connection.rs:80), but verify_perms is never invoked from any production path — only from the module's own unit tests. Subsequent Pool::open calls on an existing DB go through bring_up → verify_application_id, but never verify_perms. A replacement attack (symlink a world-readable DB over the original) is not caught on re-open.

The fix is to call paths::verify_perms(db_path)? inside bring_up for non-read-only connections on existing DBs (the is_fresh_db = false branch). Or update the security table to say "set once at creation" rather than "verified once per process."

Fix this →

2. `wipe_legacy_files` prints to stderr on the tab-completion hot path

src/coordinator/adapters/sqlite_store.rs:57–75

complete.rs calls SqliteJobsStore::for_repo_base (lines 822, 1169, 1263) which always calls wipe_legacy_files. If a user has coordinator.redb or repo-policy.json on disk (upgrading from the previous design), the first tab press after the upgrade runs eprintln!("daft: removed legacy …"). CLAUDE.md explicitly calls out: "Stderr output is also problematic — eval only captures stdout … The same applies to the __complete tab-completion helper, which fires on every Tab keypress." The banner would appear mid-typing in the user's terminal.

The wipe is one-time (files are gone after first deletion), but the banner appearing on a tab press is confusing UX. The mitigation is to move the legacy wipe out of for_repo_base into a dedicated first-boot-or-coordinator-startup path (e.g., only call it from run_coordinator, not from every for_repo_base call). A flag file in the repo state dir could gate this.

Fix this →

3. `format!` in `src/store/repos/jobs.rs` violates the stated invariant

src/store/repos/jobs.rs:144–155

The CLAUDE.md security table records "All queries parameterized; no format! allowed in src/store/repos/." list_by_repo_and_statuses uses format! to build the placeholder list:

let placeholders = (2..=statuses.len() + 1)
    .map(|i| format!("?{i}"))
    .collect::<Vec<_>>()
    .join(", ");
let sql = format!(
    "… WHERE repo_hash = ?1 AND status IN ({placeholders}) …"
);

This is not a SQL injection risk — only the count comes from statuses.len() and the values are bound via params_vec. But it violates the "no format!" invariant that future contributors will rely on as a clear signal that queries are parameterized. The hardcoded callers are always two known statuses ("running" and "cancelling"), so a clean fix is to hardcode the two-placeholder form: WHERE repo_hash = ?1 AND status IN (?2, ?3). If a more general form is genuinely needed later, the rule should be updated in CLAUDE.md first.

Fix this →

4. `for_repo_base` bypasses the path-safety check that `for_repo_under` provides

src/coordinator/adapters/sqlite_store.rs:41–47

pub fn for_repo_base(base: &Path) -> Result<Self> {
    wipe_legacy_files(base);
    let db_path = base.join(crate::store::paths::COORDINATOR_DB);
    // ↑ no canonicalize(), no starts_with(base_dir) assertion
    let pool = Pool::open(&db_path)…;
}

paths::for_repo_under canonicalizes the parent and asserts it stays under base before returning the DB path — this is the symlink-escape guard (for_repo_under_rejects_symlink_escape test). for_repo_base skips this check. All production callers pass a store.base_dir that is itself already within the state dir, so this doesn't create a practical exploit today. But the layering is inconsistent: the paths module has a guard, and the adapter silently routes around it. The safe fix is to call paths::for_repo_under(base, "") or equivalent, or to document why the guard is intentionally skipped here.

5. Socket listener swallows unexpected errors without logging

src/coordinator/process.rs:743–749

Err(ref e) if e.kind() == std::io::ErrorKind::WouldBlock => {
    std::thread::sleep(…);
}
Err(_) => {
    // Unexpected error; break the loop.
    break;   // ← no eprintln!, no log
}

When an unexpected accept() error fires (file-descriptor exhaustion, protocol error), the socket listener exits silently. From that point on, cancel, tail-logs, and list-jobs IPC requests all fail at the client with "connection refused." This is hard to diagnose because no signal in the coordinator's output explains why the socket disappeared mid-flight. A one-line eprintln!("daft coordinator: socket listener error: {e}") before the break would surface this in whatever terminal collects the coordinator's output.

Fix this →

Minor observations

JobMeta.pid is set to the coordinator's own PID on the initial meta.json write (process.rs:468: pid: Some(std::process::id())), but the SQLite JobRow correctly uses child_pid = None for the initial write (via job_row_from_meta(..., child_pid = None)). So SQLite is correct; only the transitional meta.json sidecar carries the misleading coordinator PID. This is self-correcting once the meta.json path is retired in the follow-up.
stale_running_after_seconds from RepoPolicy is not used by the reconciler — clean_policy.rs:83 has fn stale_running_threshold consuming this field, but reconcile_active_jobs uses only live/dead PID probing. The field is wired up but the reconciler doesn't consult it. No bug (PID probe is sufficient), but the policy field has no effect today.
Pool::open in complete.rs runs migration validation on every tab press — The 165ms benchmark is acceptable, but migrate::run checks user_version on every Pool::open. Since the schema never changes at runtime, the completion path could eventually use a single Connection::open_with_flags(READ_ONLY) + skip the migration check, as the PR description's risks section notes. Not a blocker now.
wipe_legacy_files runs on every for_repo_base call, adding two stat() syscalls to the hot path — Once the legacy files are gone, the overhead is two p.exists() checks per completion-path open. Negligible but worth tracking for the eventual cleanup.

What's done well

Previous review findings all addressed — R1 (cancel_has_filter + regression test), R2 (: key safety comment), R3 (reconcile probe comment), R4 (derive(Clone) + error frame fallback), R5 (fork_coordinator alias removed), R7 (sampling unified to is_multiple_of), R8 (job_count dead code removed). Clean response.
Hexagonal ports/adapters/domain pattern is properly enforced — domain code has no rusqlite/nix/std::env imports; adapters are the only translation layer; the reconcile unit tests use in-memory fakes.
forbid(unsafe_code) fully preserved — no unsafe blocks, nix safe wrappers for setsid()/killpg().
canonicalize() on current_exe() is correct and load-bearing — the comment in spawn_coordinator explaining why this is necessary (symlink multicall dispatch) is accurate and important.
Frame size limit (8 MiB) with tests — write_rejects_oversize_payload and read_rejects_oversize_header both present.
Env scrub at the adapter boundary — upsert_job scrubs before any DB write; no caller can accidentally skip this. Tests confirm GH_TOKEN doesn't survive the round-trip.
Schema forward-compat sentinel — run refuses to apply migrations when user_version > current_version(), tested.
Concurrent access test replaces the redb lock test — concurrent_open_allowed directly demonstrates the property that motivated the redb→SQLite cutover.
cancel_has_filter extracted with regression test — clean fix for R1 and the function itself is now independently testable.
ACTIVE_STATUSES constant — shared between the fake store in tests and the adapter, so the reconciler tests are always consistent with the production filter.

The redb→SQLite cutover left a one-line stderr banner ("daft: removed legacy ... at ...") behind the `for_repo_base` constructor. The completion path (`__complete`) opens the store on every Tab press and runs inside `eval`-captured shell code that only captures stdout, so the banner leaks straight into the user's terminal mid-keystroke if any legacy file is still on disk after upgrade. Split the constructor: - `for_repo_base` — no wipe; safe to call from completion / CLI readers. - `for_repo_base_with_wipe` — wipes legacy files and emits the stderr banner; called only from `run_coordinator` startup, where the diagnostic surface is already in use for other startup messages. Documents the trust boundary on `for_repo_base` per PR review Issue #4: `base` is derived upstream by `LogStore::new` from `daft_state_dir() + jobs + repo_hash` (canonicalize-and-asserted by `paths::for_repo_under`), so the adapter doesn't redo that check at every open. New callers MUST keep the chain — no user input feeds `base`. Adds a regression test (`for_repo_base_does_not_wipe_legacy_files`) so the hot-path-stderr issue cannot reoccur silently. Addresses PR #508 review Issues #2 + #4 + Minor #4. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The security table in PR #508's description claims "0600 file / 0700 parent dir, verified once per process at first open." That claim was half-true: `tighten_perms` ran on fresh-DB creation (correct), but subsequent `Pool::open` calls on an existing DB went through `bring_up`'s `verify_application_id` without ever re-checking the filesystem perms. The threat: an admin process (or a misconfigured backup tool) replaces the 0o600 DB file with a 0o644 copy between daft invocations. Until the file is recreated, the new perms persist and the contents are world-readable. Application_id verification can't catch this because the header bytes are intact. Add a single `paths::verify_perms` call in `Pool::open` after migrations finish but before the writer/reader pools come up. Fresh DBs pass trivially (they were just tightened in `bring_up`); existing DBs get the real check. Once per pool — no per-checkout cost. Adds `pool_rejects_world_readable_db` as a regression test (tighten on first open → loosen externally → assert the second open fails with `PermissionsTooOpen`). Addresses PR #508 review Issue #1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…tuses The original `list_by_repo_and_statuses(conn, repo_hash, statuses)` built its placeholder list with `format!("?{i}")` and `format!(... IN ({placeholders}) ...)`. Not a SQL injection — only the placeholder count came from `statuses.len()`, values were bound via `params_vec` — but it violates the no-`format!`-in-repos invariant. That invariant is load-bearing because future contributors read the absence of `format!` in `src/store/repos/` as a clear signal that every query is parameterized; once `format!` is used "carefully" in one spot, the signal degrades. The only caller (`SqliteJobsStore::list_active_jobs`) always passes exactly two statuses (`"running"`, `"cancelling"`). Rename to `list_by_repo_and_two_statuses` with a fixed `(status_a, status_b)` signature and hardcoded `WHERE repo_hash = ?1 AND status IN (?2, ?3)`. If a third active-set status appears, add a sibling method rather than re-introducing dynamic placeholders. Repo layer stays neutral on which statuses constitute "active" — the caller still picks; only the cardinality is fixed at the type level. Addresses PR #508 review Issue #3. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The IPC socket listener's `accept()` loop has three arms: success (dispatch to client handler), `WouldBlock` (sleep + retry), and any other error (break the loop). The last arm exited silently — once it fired, every subsequent `cancel`, `tail-logs`, and `list-jobs` IPC request returned "connection refused" at the client, and nothing in the coordinator's output explained why the socket disappeared. Add a one-line `eprintln!` before the break so the underlying error (FD exhaustion, protocol error, ...) lands in whatever terminal collects coordinator stderr. The listener still exits — the unexpected-error case isn't recoverable from inside the loop — but the operator gets a breadcrumb instead of a void. Addresses PR #508 review Issue #5. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…h SQLite The legacy `LogStore::clean` walked the filesystem and read each per-job `meta.json` to drive retention decisions. PR #508's redb→SQLite migration left that path in tree as the transitional reason for the meta.json dual-write: production code wrote both SQLite rows and the sidecar so the retention pass kept seeing per-job `retention_seconds`/`worktree`/`started_at`. Pre-flight audit confirmed the SQLite `jobs` row carries every field `clean` reads (`name`, `worktree`, `started_at`, `status`, `retention_seconds`) with no defaults — no schema migration needed. Restructured along the hexagonal-plus-FCIS direction pinned in ARCHITECTURE.md: - **New pure module** `coordinator/domain/retention.rs`. Defines `JobSnapshot`, `RetentionInput`, `RetentionDecision`, `EvictionTarget`, `StaleMarkTarget`, and a single pure function `retention(input) -> decision`. No traits, no `&dyn Port`, no I/O, no `Utc::now`. Computes which jobs to flip to stale-running and which job dirs to evict, returns them as values for the shell to apply. 10 unit tests assert directly against returned decisions: time-based retention, keep_last floor, per-job override, retention_override supremacy, stale-running with/without socket, Running jobs exempt, multi-worktree per-floor accounting, jobs_per_inv/candidates_per_inv alignment for whole-invocation removal, dry-run shape parity. - **`LogStore::clean` becomes the imperative shell**. Gathers the snapshot via `JobsStorePort::list_jobs_for_repo` (replacing the per-job `read_meta` loop), stats `output.jsonl` for sizes, probes the coordinator socket once, calls `retention(input)`, then applies the decision: stale-running marks via the port, evictions via rename-to-`.deleting-` then `remove_dir_all`. Signature now takes `&dyn JobsStorePort` + `repo_hash` so the data source is explicit. - **Callers updated.** `log_clean::run_clean_logs` and `commands::hooks::jobs::prune_jobs` already constructed a `SqliteJobsStore` adjacent to the `LogStore`; both now thread it through `clean`. `log_clean` falls through (cannot retention-sweep without the store); `prune_jobs` `?`-propagates the open error. - **Tests reseeded via port.** `test_clean_old_logs`, `clean_uses_per_job_retention_from_sqlite_row` (renamed from `_from_meta`), `clean_keep_last_floor_overrides_retention`, `clean_detects_stale_running_when_socket_missing`, and `dry_run_tallies_removed_invocations` now seed via `JobsStorePort::upsert_job` through a small `seed_job_row` helper. `seed_inv_with_jobs` gains an `Option<(&SqliteJobsStore, &str)>` parameter so budget tests (which stat the filesystem, not SQLite) can pass `None`. The transitional `meta.json` write in production still happens — it's unused by retention now, but `LogStore::clean` was the last reader and deleting the write side will land in a follow-up commit so this diff stays focused on the data-source swap + FCIS extraction. Test count: 1792 → 1802 (10 new domain tests). All retention behavior that the YAML scenarios exercise (per-job retention, keep_last floor, stale-running, dry-run) is now also covered by pure-function unit tests that don't depend on filesystem timing. Addresses follow-up #55 (partial — meta.json write deletion comes next). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PR #508 B1 made `LogStore::clean` query SQLite for retention metadata, which leaves `meta.json` with no production reader. This commit removes the writes (and the now-stale fallback reads), adds the sidecar to the legacy-file sweep, and switches verification paths over to the `JobsStorePort`. Production write sites removed: - `coordinator/process.rs` — three `store.write_meta(&job_dir, &meta)` calls at the dep-failed, initial-running, and terminal-update points. The matching `JobsStorePort::upsert_job` calls already write the authoritative row to SQLite. - `coordinator/log_store.rs::write_job_record_jsonl` — no longer writes the sidecar. The helper now owns just the JSONL payload + per-job directory; metadata flows separately via `JobsStorePort::upsert_job` from `BufferingLogSink::persist_job_row`. Production read sites removed: - `commands/hooks/jobs.rs::read_job_meta_redb_first` — renamed to `lookup_job_meta` and reshaped as `(index, inv, job_dir) -> Option<JobMeta>`. The meta.json fallback is gone; SQLite-loaded index is the sole source. Callers (`build_job_list`, group rendering, status filtering, `format_jobs_logs`, retry resolution) switch from `Result` to `Option` handling. - `retry_command` builds the SQLite index up front and threads it into `resolve_retry_invocation` (new `index` param) and the inline meta loop. The two remaining `store.read_meta` calls in those code paths are replaced by `lookup_job_meta`. Legacy-file sweep extended: - `sqlite_store::wipe_legacy_files` now also walks `<inv>/<job>/` for pre-cutover `meta.json` sidecars and removes them with a stderr banner. The sweep runs only from `for_repo_base_with_wipe` (coordinator startup), so the completion / CLI-reader hot path never prints to stderr mid-keystroke. - The regression test grew a `meta.json` case in both the wipe and non-wipe variants. Tests reseeded: - `BufferingLogSink` tests verify against the SQLite row via `SqliteJobsStore::get_job`, not `LogStore::read_meta`. - `process.rs` tests gain a real `SqliteJobsStore` via `make_test_state` + `make_ctx`'s new `job_store` parameter; assertions against terminal status, dep-failed synthesis, and stale-running flow through the port. - `write_job_record_jsonl_creates_meta_and_jsonl_atomically` renamed to `..._writes_jsonl_without_meta_sidecar` and now asserts the sidecar is *absent*. - Test for `read_job_meta_redb_first_falls_back_to_meta_json_when_redb_silent` removed — the behavior it pinned is gone. The companion test for "prefers redb when index has row" becomes `lookup_job_meta_returns_indexed_row` plus a sibling `lookup_job_meta_returns_none_when_index_misses`. `LogStore::write_meta` and `LogStore::read_meta` themselves remain in tree for now — only test setup still uses them to seed JobMeta fixtures. They have no production callers; deleting them and migrating the remaining test seedings to the port is a separate mechanical cleanup. Closes the meta.json dual-write follow-up referenced in PR #508's "Known follow-ups" section. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

avihut · 2026-05-16T11:46:11Z

Thanks for the second-pass review. Most of the findings are legitimate
and are addressed in be764d7d
… 387ed12f.
One Minor observation is a false alarm; tracing below.

Issues (addressed)

Feature: Shorthand Remote URL Support #1 verify_perms defined but never called from production —
fixed in f2e7b57c
(fix(store): verify file perms once per pool open). Pool::open
now calls paths::verify_perms(db_path)? once after migrations
finish, both for fresh and pre-existing DBs. Tightened the security
claim to actually hold. Regression test
pool_rejects_world_readable_db chmods a freshly-created DB to
0o644 and asserts the second open fails with PermissionsTooOpen.
Feature: Root Level .git Dir #2 wipe_legacy_files prints to stderr on the tab-completion hot
path — fixed in be764d7d
(refactor(coordinator): move legacy-file wipe off the hot path).
for_repo_base split into two constructors:
- for_repo_base — no wipe; used by completion / CLI readers.
- for_repo_base_with_wipe — wipes legacy files and emits the
  stderr banner; called only from run_coordinator.
  Regression test for_repo_base_does_not_wipe_legacy_files plants
  legacy files and asserts the completion-path constructor leaves
  them alone.
Install Through brew #3 format! in src/store/repos/jobs.rs violates the
no-format! invariant — fixed in 46c0921f
(fix(store): hardcode IN-list placeholders ...). list_by_repo_and_statuses(&[&str])
→ list_by_repo_and_two_statuses(status_a, status_b) with
hardcoded SQL WHERE repo_hash = ?1 AND status IN (?2, ?3). The
caller (SqliteJobsStore::list_active_jobs) always passes exactly
two statuses, so the API shrink is lossless. If we ever need three,
a sibling method is the right shape — re-introducing dynamic
placeholders would weaken the CI grep-gate's signal.
Command Shortcuts #4 for_repo_base bypasses the path-safety check — addressed
via the same commit (be764d7d)
by documenting the trust boundary on for_repo_base. The full
paths::for_repo_under check (canonicalize + prefix-assert
against state dir) can't be applied uniformly without coupling the
adapter to daft_state_dir() (which would break every test that
uses a tempdir). The docstring now records that production callers
derive base from LogStore::base_dir, which is itself constructed
via <state_dir>/jobs/<repo_hash> — the chain is upstream-trusted.
Per the review's own framing, "or to document why the guard is
intentionally skipped" is a valid resolution; that's what's in tree.
Shell Completions #5 Socket listener swallows unexpected errors without logging
— fixed in 658545d2
(fix(coordinator): surface socket-listener errors before exiting).
One-line eprintln! before the break in the Err(_) arm.

Follow-up from the previous review (now closed)

The LogStore::clean → meta.json cutover that was previously
flagged as a deferred follow-up landed in
6535ef57
and 387ed12f:

B1 lifts retention into coordinator/domain/retention.rs as a pure
FCIS function (10 unit tests, no fake-adapter mocking) and rewrites
LogStore::clean as the imperative shell that gathers a snapshot
via JobsStorePort::list_jobs_for_repo + filesystem stat, calls
retention(input), and applies the returned RetentionDecision.
B2 drops the three production write_meta calls in
coordinator/process.rs, the write_meta inside
LogStore::write_job_record_jsonl, and the meta.json read fallback
in commands/hooks/jobs.rs (read_job_meta_redb_first →
lookup_job_meta, Result → Option). meta.json is added to
the legacy-file sweep so post-upgrade installs get cleaned up the
same way the redb-era files are.

Minor observations

Minor Feature: Shorthand Remote URL Support #1 (JobMeta.pid mismatch on initial meta.json write) —
auto-resolved by B2: that initial write no longer happens.
Minor Feature: Root Level .git Dir #2 (stale_running_after_seconds claimed unused) — this
is a false alarm. The field IS consulted by production code. End-
to-end trace:
1. src/coordinator/clean_policy.rs:82-85 —
  RepoPolicy::stale_running_after_resolved(&self) -> i64
  returns the field with an 86_400s default.
2. src/coordinator/log_store.rs::clean (post-B1 shell) /
  src/coordinator/domain/retention.rs::retention (pure core) —
  consumes the resolved value as the stale-Running threshold
  (Duration::try_seconds(...)).
3. Production callers of clean: src/log_clean.rs:120
  (run_clean_logs, fires on every daft invocation via the
  once-per-24h background sweep) and
  src/commands/hooks/jobs.rs::prune_jobs (the user-facing
  daft hooks jobs prune subcommand).
4. The stale_running_marked_when_socket_missing and
  stale_running_skipped_when_socket_alive domain tests directly
  exercise the threshold path.
Recommendation: the bot's index missed the indirection through
stale_running_after_resolved. No change to make.
Minor Install Through brew #3 (per-tab-press migration check in Pool::open) —
acknowledged in the PR body's tab-completion benchmark section.
165 ms total is acceptable for now; the Connection::open_with_flags(... READ_ONLY) + skip-migration path is a follow-up if real user
feedback indicates the cost matters.
Minor Command Shortcuts #4 (extra stat() syscalls in the hot path from
wipe_legacy_files) — auto-resolved by Feature: Root Level .git Dir #2's fix: the completion
path no longer calls wipe_legacy_files, so the per-keystroke
exists() checks are gone.

Diff summary

Commit	Subject
be764d7d	`refactor(coordinator): move legacy-file wipe off the hot path`
f2e7b57c	`fix(store): verify file perms once per pool open`
46c0921f	`fix(store): hardcode IN-list placeholders in list_by_repo_and_two_statuses`
658545d2	`fix(coordinator): surface socket-listener errors before exiting`
6535ef57	`refactor(coordinator): lift retention into FCIS domain + route through SQLite`
387ed12f	`chore(coordinator): drop meta.json from production paths`

mise run test:unit clean (1802 tests); mise run clippy clean;
mise run fmt:check clean. CI on the pushed branch will tell us if
anything regresses against the integration sweep.

B2 (`chore(coordinator): drop meta.json from production paths`) dropped the `write_meta` call inside `LogStore::write_job_record_jsonl`, but the YAML executor's pre-execution skip loop (the path that handles per-job `skip: true` / `only: false` / platform / tracking filters) was relying on that sidecar as its ONLY persistence — the BufferingLogSink isn't constructed until after the loop, so its `persist_job_row` path doesn't run for these jobs. Result: every condition-skipped job became invisible to `daft hooks jobs ls` and to the SQLite store. Caught by the `tests/manual/scenarios/hooks/per-job-skip-records.yml` YAML scenario in CI. Open the per-repo `SqliteJobsStore` once up-front (the same handle that later writes the repo policy) and route both the skipped-job persistence loop and the repo-policy write through it. Each skipped job now gets a proper `JobRow` upsert alongside the `output.jsonl` that carries the reason string. The on-disk shape converges with the foreground/background paths. Best-effort: if the store can't be opened, we still write the JSONL log file and emit a one-line stderr warning; only the SQLite persistence is skipped. Production paths are no different from before in that failure mode — they printed the same warning. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

B2 dropped `meta.json` from all production code paths, so YAML scenarios that read the sidecar (`bg-output-silent`'s settle poll) or mutated it (the log-cleanup family's backdating step, the synthetic Running/stale invocation injectors) stopped working: the file simply isn't there, and even when it was, daft no longer consulted it for retention decisions. Switch every affected scenario to operate on `coordinator.db` directly via `sqlite3` (poll status, backdate `started_at` / `finished_at` / `created_at`, INSERT synthetic invocations + jobs). The eight scenarios that were red in CI on the previous push: - `hooks/bg-output-silent.yml` — settle poll now queries `SELECT COUNT(*) FROM jobs WHERE name LIKE 'silent-%-job' AND status = 'running'`. - `hooks/log-cleanup-dry-run.yml`, `hooks/log-cleanup-respects-retention.yml`, `hooks/prune-dry-run-invocations.yml` — backdate via `UPDATE jobs SET started_at = ...; UPDATE invocations SET created_at = ...;` with a Python-computed ISO-8601 timestamp. - `hooks/log-cleanup-honors-keep-last.yml` — per-row shift in Python (preserves relative ordering of `started_at` so the keep_last sanity floor still applies in the expected order). - `hooks/log-cleanup-skips-running.yml`, `hooks/log-cleanup-stale-running-detected.yml` — synthetic invocation+job INSERTs via Python's `sqlite3` module, matching the column shape from `001_initial.sql`. Placeholder `output.jsonl` files still touch the filesystem so the eviction path has something to remove. All eight scenarios pass locally; the production fix in the previous commit was caught by `hooks/per-job-skip-records.yml` (not modified here — the test itself was correct, the production code was the bug). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

avihut · 2026-05-16T12:43:35Z

/claude review

claude · 2026-05-16T12:43:49Z

Claude finished @avihut's task in 7m 30s —— View job

Review: feat(coordinator): redesign internals (#476) — Pass 3

All previous findings from passes 1 and 2 are correctly addressed. Four new items follow, two moderate and two minor.

Issues

1. `load_redb_job_meta_index` / `redb_index` are misleading misnomers

src/commands/hooks/jobs.rs:654–671

fn load_redb_job_meta_index(
    repo_hash: &str,
    log_store_base: &Path,
) -> Option<HashMap<(String, String), JobMeta>> {
    let store = SqliteJobsStore::for_repo_base(log_store_base).ok()?;
    let rows = store.list_jobs_for_repo(repo_hash).ok()?;
    …
}

The function opens a SQLite store and maps rows from it — it has nothing to do with redb. The name survives from the pre-cutover dual-write era when it read from redb and fell back to meta.json. Every call site passes the result as redb_index (lines 723, 885, 942, 1249, 1281, 1641, …), which now describes a SQLite-populated HashMap. Future contributors reading this file will look for a redb dependency that no longer exists.

The fix is a rename: load_redb_job_meta_index → load_sqlite_job_meta_index and redb_index → sqlite_index at every site. No logic change.

Fix this →

2. The `no format!` CI grep-gate is documented but does not exist

src/store/repos/mod.rs:6 and src/store/repos/jobs.rs:136–137

// mod.rs:6
//! All SQL is parameterized — no `format!` into queries — and a
//! CI grep-gate enforces that in `src/store/repos/`.

// jobs.rs:136-137
/// Fixed at exactly two statuses so the SQL stays parameterized with
/// hardcoded placeholders — the `no format!` invariant in
/// `src/store/repos/` is enforced by a CI grep-gate, …

No grep-gate exists in any workflow file (.github/workflows/) or mise task. The invariant is currently satisfied — format! genuinely does not appear in production query code in src/store/repos/ — but it is enforced by convention only, not tooling. A future contributor can add format! without any CI signal.

Two clean resolutions:

Add the gate: a CI step or mise task running ! grep -rPn 'format!' src/store/repos/*.rs | grep -v '^\s*//' would make the claim true.
Correct the comments: change "enforced by a CI grep-gate" to "enforced by convention" so the claim accurately reflects the current state.

Fix this →

3. Two stale "redb" doc comments in `process.rs` and `prune.rs`

src/coordinator/process.rs:338,355 and src/core/worktree/prune.rs:835

// process.rs:338
/// Optional dual-write target for the redb-backed `JobRow` lifecycle
//              ^^^^^^^^^^ should be "SQLite-backed"

// process.rs:355
/// Project a `JobMeta` into a redb `JobRow`. …
//                             ^^^^ should be "SQLite"

// prune.rs:835
/// filters its redb-recorded active jobs in one round trip …
//              ^^^^ should be "SQLite-recorded"

Redb is gone from the codebase. These three comments still use the old terminology. No logic impact, but they create false signal for future readers (and tooling like grep redb src/ which was being used to verify the cutover was complete).

4. `meta.pid = Some(std::process::id())` is a dead assignment

src/coordinator/process.rs:468

let mut meta = JobMeta {
    …
    pid: Some(std::process::id()),  // coordinator's own PID — never used
    …
};

job_row_from_meta ignores meta.pid entirely (it uses the child_pid parameter directly for JobRow.pid and JobRow.pgid). The meta.json write path that previously consumed meta.pid was removed in B2 (chore(coordinator): drop meta.json from production paths). This assignment now has no effect, but it looks meaningful — a reader would reasonably assume the coordinator's PID is being persisted into the initial SQLite row, when in fact None is written.

Fix: either remove the assignment (pid: None) or add a comment explaining it's inert (// not propagated to SQLite — overridden by child_pid in job_row_from_meta).

What's done well

All pass-2 findings correctly addressed. verify_perms is now called from Pool::open (with regression test). wipe_legacy_files is split into for_repo_base (no wipe, safe on the completion hot path) vs. for_repo_base_with_wipe (coordinator-only). format! is actually absent from src/store/repos/ production code. cancel_has_filter includes inv.is_some() with a dedicated regression test. fork_coordinator alias is gone.
Domain purity holds end-to-end. src/coordinator/domain/reconcile.rs and retention.rs have zero imports of rusqlite, nix, or std::env. Confirmed by grep.
process_group_alive now uses killpg correctly. nix::sys::signal::killpg(Pid::from_raw(pgid as i32), None) probes the process group — the rewrite from pass-2 fixed the semantic mismatch the first review flagged.
forbid(unsafe_code) preserved. The unsafe { libc::getpgrp() } in unix_process.rs:48 is inside #[cfg(test)], which the #![cfg_attr(not(test), forbid(unsafe_code))] gate in lib.rs allows.
Legacy-file sweep regression tests. for_repo_base_with_wipe_removes_legacy_files and for_repo_base_does_not_wipe_legacy_files both exist and cover the expected post-conditions precisely.
Condition-skipped jobs persist to SQLite. src/hooks/yaml_executor/mod.rs:306–354 writes both the JSONL record and the JobRow for jobs filtered out before execution. The regression commit a45d5a1 adds test coverage.

`run_single_background_job` set `meta.pid = Some(std::process::id())` — the coordinator's own PID — but production `job_row_from_meta` ignores `meta.pid` entirely and uses the `child_pid` parameter for the SQLite `JobRow.pid`/`JobRow.pgid`. The assignment looked meaningful but did nothing in the row that actually gets persisted. Replace with `pid: None` and a comment pointing at the terminal write that populates `child_pid`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Redb is gone — the cutover landed in this PR. Five doc comments still used the old terminology and would confuse readers using `grep redb src/` as a cutover-completeness oracle. Three flagged by the pass-3 review: - src/coordinator/process.rs:338 (`JobInvocationContext.job_store` doc) - src/coordinator/process.rs:355 (`job_row_from_meta` doc) - src/core/worktree/prune.rs:835 (`cancel_background_jobs_for_worktree` doc) Two more found while sweeping for consistency: - src/executor/mod.rs:105 (`JobSpec.tags` doc) - src/commands/hooks/jobs.rs:652 (`load_*_job_meta_index` doc — drop the redb-era contrast clause; keep the positive SQLite + WAL statement) The legacy-wipe references at `process.rs:1396-1397` and inside `src/coordinator/adapters/sqlite_store.rs` correctly describe historical files (`coordinator.redb`) that the wipe code deletes — left alone. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…_meta_index Pure mechanical rename. The function and the `redb_index` variable/ parameter survived from the pre-cutover dual-write era when the index was sourced from redb with a meta.json fallback. Today it opens `SqliteJobsStore` and reads `JobRow`s directly — there's no redb dependency to explain. - Function: `load_redb_job_meta_index` → `load_sqlite_job_meta_index` - Variable/parameter at 16 call sites: `redb_index` → `sqlite_index` Confined to `src/commands/hooks/jobs.rs`. No behavior change; existing unit tests (1802 passing) cover the rename via `cargo check` and `cargo test`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`src/store/repos/{mod,jobs}.rs` claimed "a CI grep-gate enforces" the no-`format!` invariant. No such gate existed in any workflow or mise task — the claim was true by convention only. Make the claim true by adding the gate. New mise task `lint:repos-no-format` greps for `format!\s*\(` (matches macro calls, ignores doc-comment prose like `` `format!` ``) and fails loudly when one is found. Wired into the existing `lint:rust` umbrella via `#MISE depends`, so `mise run lint` and the CI `mise run clippy` step pick it up automatically. Today the gate passes — no `format!()` calls exist in src/store/repos/ — and verified by temporarily injecting one (it correctly failed). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

avihut self-assigned this May 15, 2026

avihut added the feat New feature label May 15, 2026

avihut added this to the Public Launch milestone May 15, 2026

avihut and others added 13 commits May 15, 2026 22:04

avihut force-pushed the daft-476/chore/coordinator-redesign branch from 41c9820 to e386058 Compare May 16, 2026 06:28

avihut and others added 9 commits May 16, 2026 09:47

avihut and others added 6 commits May 16, 2026 14:04

avihut and others added 2 commits May 16, 2026 15:37

avihut and others added 4 commits May 16, 2026 22:05

avihut merged commit fe3e9fa into master May 16, 2026
13 checks passed

avihut deleted the daft-476/chore/coordinator-redesign branch May 16, 2026 19:27

wheatley-the-moronic-ci-bot Bot mentioned this pull request May 15, 2026

chore: release v1.13.0 #487

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(coordinator): redesign internals — persistence, IPC, structured logs (#476)#508

feat(coordinator): redesign internals — persistence, IPC, structured logs (#476)#508
avihut merged 35 commits into
masterfrom
daft-476/chore/coordinator-redesign

avihut commented May 15, 2026 •

edited

Loading

Uh oh!

avihut commented May 15, 2026

Uh oh!

claude Bot commented May 15, 2026 •

edited

Loading

Uh oh!

avihut commented May 15, 2026 •

edited

Loading

Uh oh!

avihut commented May 16, 2026

Uh oh!

claude Bot commented May 16, 2026 •

edited

Loading

Uh oh!

avihut commented May 16, 2026

Uh oh!

avihut commented May 16, 2026

Uh oh!

claude Bot commented May 16, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

avihut commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Abstract

Migration notes

Architectural changes

Persistence: JSON sidecars → SQLite

Store / Ports / Adapters / Domain pattern

Security baked into the store layer

Wire protocol: loose JSON → framed envelopes

Coordinator lifecycle: spawn-self + crash recovery

Cancel matching

Dependencies

Test plan

Tab completion benchmark

Architectural direction

Known follow-ups

Uh oh!

avihut commented May 15, 2026

Uh oh!

claude Bot commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review: feat(coordinator): redesign internals (#476)

Issues

1. --inv alone silently falls through in cancel — likely bug

2. : in job_name can corrupt key comparisons

3. reconcile_active_jobs comment misstates the probe semantics

4. send_response double-serde is fragile and silently drops errors

5. fork_coordinator alias should be removed, not deferred

6. follow=true tail loop has no shutdown escape hatch on the coordinator side

7. Sampling inconsistency between foreground and background sinks

8. job_count() is pub with #[allow(dead_code)]

What's done well

Uh oh!

avihut commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

avihut commented May 16, 2026

Uh oh!

claude Bot commented May 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review: feat(coordinator): redesign internals (#476)

Issues

1. verify_perms is defined but never called from production code

2. wipe_legacy_files prints to stderr on the tab-completion hot path

3. format! in src/store/repos/jobs.rs violates the stated invariant

4. for_repo_base bypasses the path-safety check that for_repo_under provides

5. Socket listener swallows unexpected errors without logging

Minor observations

What's done well

Uh oh!

avihut commented May 16, 2026

Issues (addressed)

Follow-up from the previous review (now closed)

Minor observations

Diff summary

Uh oh!

avihut commented May 16, 2026

Uh oh!

claude Bot commented May 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review: feat(coordinator): redesign internals (#476) — Pass 3

Issues

1. load_redb_job_meta_index / redb_index are misleading misnomers

2. The no format! CI grep-gate is documented but does not exist

3. Two stale "redb" doc comments in process.rs and prune.rs

4. meta.pid = Some(std::process::id()) is a dead assignment

What's done well

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

avihut commented May 15, 2026 •

edited

Loading

claude Bot commented May 15, 2026 •

edited

Loading

1. `--inv` alone silently falls through in `cancel` — likely bug

2. `:` in `job_name` can corrupt key comparisons

3. `reconcile_active_jobs` comment misstates the probe semantics

4. `send_response` double-serde is fragile and silently drops errors

5. `fork_coordinator` alias should be removed, not deferred

6. `follow=true` tail loop has no shutdown escape hatch on the coordinator side

8. `job_count()` is `pub` with `#[allow(dead_code)]`

avihut commented May 15, 2026 •

edited

Loading

claude Bot commented May 16, 2026 •

edited

Loading

1. `verify_perms` is defined but never called from production code

2. `wipe_legacy_files` prints to stderr on the tab-completion hot path

3. `format!` in `src/store/repos/jobs.rs` violates the stated invariant

4. `for_repo_base` bypasses the path-safety check that `for_repo_under` provides

claude Bot commented May 16, 2026 •

edited

Loading

1. `load_redb_job_meta_index` / `redb_index` are misleading misnomers

2. The `no format!` CI grep-gate is documented but does not exist

3. Two stale "redb" doc comments in `process.rs` and `prune.rs`

4. `meta.pid = Some(std::process::id())` is a dead assignment