feat(coordinator): redesign internals — persistence, IPC, structured logs (#476)#508
Conversation
|
/claude review |
|
Claude finished @avihut's task in 5m 4s —— View job Review: feat(coordinator): redesign internals (#476)This is a large, well-structured PR. The spawn pattern is correctly followed, Issues1.
|
Windows build fix (jobs.rs:1197 referenced tail_logs which is Unix-only): - Refactor follow_logs to early-return through the coordinator path on Unix and fall through to the file-read fallback otherwise; the Unix-only IPC call now lives in a #[cfg(unix)] helper. - Gate anyhow::Context import in coordinator/process.rs so it isn't flagged unused on non-Unix where spawn_coordinator is excluded. Review-driven changes (#508): - #1 `cancel --inv X` no longer silently falls through to the bail!: `--inv` now counts as a filter. Extracted cancel_has_filter() so the invariant has a regression test. - #2 Document `:` invariant on job_key — repo_hash and invocation_id are :-free by construction; job_name is user-controlled but the key format is safe via rsplitn(3) on read. - #3 Fix reconcile probe comment: signal::kill(pid, None) probes the process-group leader's PID (== PGID via process_group(0)), not the whole process group. Note PID-reuse limitation. - #4 Replace send_response double-serde round-trip with derive(Clone) on CoordinatorResponse + JobInfo; on serialization failure, emit a sentinel Internal-error frame instead of silently returning (which hung the client on read_frame). - #5 Remove fork_coordinator alias; rename both call sites to spawn_coordinator. (`fork_` implies fork() which we don't use.) - #6 Document tail-loop follow=true shutdown limitation inline: if the writer panics before a terminal Status record, the loop can spin forever. Tracked for follow-up; not load-bearing in practice. - #7 Unify sampling expression: foreground sink now matches background via `!seq.is_multiple_of(n)`. - #8 Remove dead pub job_count() and its sole `ReadableTableMetadata` import — no callers anywhere, not worth a cfg(test) helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Thanks for the thorough review. Addressed in 782272a + 41c9820 (numbering matches the original review list):
Also unblocked the Windows build that broke on this PR — refactored The PR description was also expanded with an abstract and mermaid diagrams covering the four architectural shifts (persistence, IPC, lifecycle, cancel). |
…er cancel
Lands the durable-state half of the coordinator internals redesign:
- redb-backed `JobStore` at `<state>/jobs/<repo>/coordinator.redb` with
schema-version guard (refuses writes when on-disk schema is newer than
the binary)
- `JobStatus` gains `Crashed`/`Cancelling`/`Unknown` variants;
`#[serde(other)] Unknown` gives older binaries a forward-compat read
path for newer data
- Reconciliation pass on coordinator boot probes each active row's PGID
via `kill(.., None)` and marks dead-process rows `Crashed`
- `JobSpec.tags` plumbed from YAML `JobDef.tags`; persisted into `JobRow`
- Dual-write `JobRow` at job start + complete alongside existing
`meta.json` writes (transitional; meta.json cleanup is follow-up)
- `daft hooks jobs cancel` gains `--hook/--worktree/--tag/--older-than`;
new `CancelMatching` IPC variant filters via redb and signals each
matching PGID; refuses to run without at least one filter or `--all`
- `cancel_background_jobs_for_worktree` collapses to a single
`CancelMatching { worktree }` call (was list-then-cancel-per-name)
- Adds `coordinator::framing` (length-prefixed u32-BE frames) and
`coordinator::log_record` (LogRecord/OutputKind/StatusEvent JSONL
types). Both modules are wired only at the seams that need them for
Tier-1; the full envelope retag + output.jsonl writer migration ships
with Tier-2.
- Stream-tags the executor output channel: `Sender<(OutputKind, String)>`
ripples through `LogSink::on_job_output(.., kind, line)` so future
JSONL writers can record stream origin per line.
Verification: 1731 unit tests (22 new), 15/15 bash hooks integration
tests, 572/572 YAML manual scenarios, clippy clean, fmt clean.
Refs #476
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tured logs reader Replaces the flat `output.log` per-job file with structured `output.jsonl`: one `LogRecord` per line carrying `seq`/`ts`/`kind` so consumers can distinguish stdout from stderr, tail by sequence number, or filter by record kind. `output.log` stays as a one-cycle legacy fallback path. - Writers (`BufferingLogSink` + bg writer in `process.rs`) emit JSONL records using the stream-tagged channel (`(OutputKind, String)`) plus a terminal `Status::Finished`/`Status::Signaled` record per job - New per-job YAML knob `log.sampling_every_nth: N` drops all but every Nth stdout/stderr line; `seq` advances on dropped records so consumers see explicit gaps. `Status` records are never sampled - `daft hooks jobs logs` gains `--stdout` / `--stderr` / `--status` / `--seq` / `--since-seq` / `--since <duration>` flags via a shared `LogsFilter` applied at render time - Reader falls back to `output.log` when `output.jsonl` is missing (pre-upgrade data), synthesizing `Stdout`-only records keyed by line number so existing logs remain readable for one release cycle Verification: 1732 unit tests (3 new for sampling + log-format round- trip), 572/572 YAML manual scenarios, clippy clean. Refs #476 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… streaming
Replaces newline-delimited JSON IPC with length-prefixed framed envelopes
and adds a streaming `TailLogs` request that powers `daft hooks jobs
logs --follow`.
Wire format:
- Every message is now `RequestEnvelope`/`ResponseEnvelope` —
`{"v":1,"req|kind":"...","payload":...}` — wrapped in a `u32`-BE
length-prefixed frame via `coordinator::framing`. Length prefix lets
payloads carry embedded newlines (log lines with `\n`) without escape
ambiguity, and the explicit `v` field gives a forward-compat handhold
for future protocol revisions.
- `CoordinatorResponse::Error` gains a typed `code: ErrorCode` field
(`JOB_NOT_FOUND`/`SCHEMA_MISMATCH`/`INTERNAL`/...) alongside the
human message; older callers using `..` destructure stay source-compatible.
- New `StreamFrame(serde_json::Value)` + `StreamEnd` response variants
carry the streaming layer.
`TailLogs` streaming:
- New `CoordinatorRequest::TailLogs { job, follow, since_seq }`. Server
tails `output.jsonl`, emits one `StreamFrame` per `LogRecord`, sends
`StreamEnd` when the terminal `Status` record (Finished/Signaled/
Crashed) appears or the source closes.
- `CoordinatorClient::stream_request` returns a sync iterator over
`StreamFrame` payloads — daft stays sync, no async runtime imposed.
- `daft hooks jobs logs --follow` opens a tail via the coordinator when
one is reachable; falls back to a one-shot file read when the
coordinator is gone (per-invocation lifecycle: dead coordinator means
the job is final).
`JobAddress` moved from `src/commands/hooks/jobs.rs` to
`src/coordinator/types.rs` (with serde derives) so it can ride on the
wire without a back-pointer from the coordinator layer into the CLI.
Verification: 1731 unit tests, 572/572 YAML manual scenarios, clippy
clean.
Refs #476
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Folds the remaining Tier-3 items from #476 into the coordinator-redesign branch and closes the visibility gap that Tier-1's crash recovery left. Two items are load-bearing for Tier-1 actually working end-to-end: - `daft hooks jobs ls/logs` now read from `JobStore` first, falling back to `meta.json` only when no redb row exists. Reconciliation marks rows as `Crashed` in redb but never touched the legacy `meta.json`, so the new status was invisible to users until this migration. - `repo-policy.json` moves into the `repo_policy_v1` table with a one-shot JSON→redb migration on `JobStore::open_for_repo_base`. redb's WAL gives atomicity for free, closing the Tier-3 `write_repo_policy` gap. API surface cleanup (three over-grown functions get focused context structs): `JobAdapterContext` for `yaml_jobs_to_specs` (8→6 args), `HookExecutionContext` for `execute_yaml_hook_with_rc` (11→5 args), and `JobInvocationContext` extended to absorb `run_single_background_job`'s five passthrough Arcs (7→3 args). All three lose `#[allow(clippy::too_many_arguments)]`. Also fixes `prune --dry-run`'s asymmetric summary gate (defense-in-depth when `jobs == 0 && invs > 0`) and updates the post-Tier-2 leftover in `hooks/bg-output-silent.yml` that still looked for `output.log`. `hooks/crashed-recovery.yml` was attempted but deferred — the YAML gate hit a hidden dependency on coordinator single-flight semantics that wasn't visible in the plan. Reconcile is fully covered by `coordinator::process::tests::reconcile_marks_dead_running_as_crashed`, so the end-to-end scenario is value-add, not value-blocking. Verified: 1737 unit tests pass; clippy clean; 572 YAML scenarios pass on retry (one pre-existing master-too truncation-feature regression unrelated to this PR). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…r-3 readers Adds a regression test (`concurrent_open_is_rejected_by_redb_lock`) and a doc-comment on `load_redb_job_meta_index` calling out the known limitation that redb's process-level lock causes the CLI reader to fall back to `meta.json` whenever a coordinator is alive holding the same file. Surfaced by advisor review of the Tier-3 cleanup. Behavior unchanged — this is documentation + a regression gate so the fallback rationale is discoverable if anyone later tries to change reader semantics. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three tests in `coordinator::process::tests` exercise cross-thread coordination by spawning probe/killer threads that race against the registrar inside `run_single_background_job`. Under heavy parallel-test load (the lefthook pre-push sweep at ~1740 tests), the fixed-sleep delays missed their windows: - `run_single_background_job_registers_and_deregisters_pid` — the probe ran before the registrar thread inserted the PID - `per_job_cancel_marks_status_cancelled_not_failed` — `cancel_single_job` fired before the PID was recorded, so the SIGTERM target wasn't found - `bg_dependent_skipped_when_dep_cancelled` — same race as above Replace fixed sleeps with polling waits: probe loops until the PID appears (up to 600ms / 2s respectively), killer loops until the PID is visible before signalling. All three tests pass 3-for-3 in parallel mode after the fix. The function under test is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`xtask test-matrix` is a long loop that spawns `xtask manual-test --ci` per matrix entry. If the invoking shell dies mid-run (terminal closed, session reload, parent signal lost), the loop survives reparented to launchd/init and keeps burning CPU for the remaining entries — observed in the wild as a 1h40m runaway during this PR's development. Add an orphan check at each iteration boundary: when `getppid() == 1` we bail out with a one-line stderr message instead of starting the next entry. Uses `nix::unistd::getppid` (nix 0.31 already pinned in the daft workspace, adding `process` feature). Non-Unix builds get a no-op. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Companion to the prior `xtask test-matrix` orphan-guard commit — the new `nix` dependency in `xtask/Cargo.toml` (process feature; same 0.31 already pinned by the main daft crate) needs the corresponding lockfile entry. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Windows build fix (jobs.rs:1197 referenced tail_logs which is Unix-only): - Refactor follow_logs to early-return through the coordinator path on Unix and fall through to the file-read fallback otherwise; the Unix-only IPC call now lives in a #[cfg(unix)] helper. - Gate anyhow::Context import in coordinator/process.rs so it isn't flagged unused on non-Unix where spawn_coordinator is excluded. Review-driven changes (#508): - #1 `cancel --inv X` no longer silently falls through to the bail!: `--inv` now counts as a filter. Extracted cancel_has_filter() so the invariant has a regression test. - #2 Document `:` invariant on job_key — repo_hash and invocation_id are :-free by construction; job_name is user-controlled but the key format is safe via rsplitn(3) on read. - #3 Fix reconcile probe comment: signal::kill(pid, None) probes the process-group leader's PID (== PGID via process_group(0)), not the whole process group. Note PID-reuse limitation. - #4 Replace send_response double-serde round-trip with derive(Clone) on CoordinatorResponse + JobInfo; on serialization failure, emit a sentinel Internal-error frame instead of silently returning (which hung the client on read_frame). - #5 Remove fork_coordinator alias; rename both call sites to spawn_coordinator. (`fork_` implies fork() which we don't use.) - #6 Document tail-loop follow=true shutdown limitation inline: if the writer panics before a terminal Status record, the loop can spin forever. Tracked for follow-up; not load-bearing in practice. - #7 Unify sampling expression: foreground sink now matches background via `!seq.is_multiple_of(n)`. - #8 Remove dead pub job_count() and its sole `ReadableTableMetadata` import — no callers anywhere, not worth a cfg(test) helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…icy YAML scenarios The Tier-3 cutover migrates `repo-policy.json` into redb and deletes the sidecar on first open. Three YAML scenarios that previously `cat`'d the file (repo-log-defaults-applied, repo-policy-not- clobbered, prune-budget-accounting) had no way to observe the persisted policy and were failing in CI. Adds a hidden `daft __dump-store <table>` subcommand that reads the redb-backed JobStore for the cwd's repo and prints the requested table as JSON. Currently supports `repo-policy`; jobs and invocations can follow if needed. The `__` prefix excludes it from completions and skips startup background tasks. The three scenarios now `daft __dump-store repo-policy` from the worktree cwd and grep the JSON for `max_total_size_bytes`/ `keep_last` — same assertions, redb-backed source of truth. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eeps working Tier-2 moved foreground writers to `output.jsonl` only. The per-file truncator (`LogStore::truncate_oversized_logs`) and the legacy reader fallback both still target `output.log`, so removing the dual-write silently broke the truncation feature (`log-cleanup-per-file-cap.yml` fails on this branch but passes on master). Restores the legacy `output.log` as a sibling of `output.jsonl` in `BufferingLogSink::on_job_complete` and `on_job_runner_skipped`. The raw bytes are reconstructed from the same post-sampling record buffer so both files reflect a single, consistent view of the job's output. Matches the one-cycle compat posture the reader path already takes. Background-job logs (coordinator-managed) still write JSONL only — not in the truncator's path today. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ncator Tier-2 of the coordinator redesign moved writers over to the structured `output.jsonl` format. The legacy `output.log` byte-stream lingered on as (a) a dual-write target, (b) a reader fallback, and (c) the input to a byte-level `truncate_oversized_logs` cleanup pass. Daft is effectively pre-release; preserving these legacy paths is perpetual tax with no beneficiaries. Cut entirely: - `LogStore::log_path`, `LogStore::write_job_record` (only the JSONL variant remains). - `LogStore::truncate_oversized_logs` and its plumbing in `prune_jobs` / `run_clean_logs`; `CleanSummary::truncated_logs`; the CLI's "truncated N log(s)" suffix. - `JobMeta.log_truncated` and `JobMeta.original_size_bytes`; same fields on `JobRow`. - The `output.log` read fallback in `render_job_log`. - `tests/manual/scenarios/hooks/log-cleanup-per-file-cap.yml`. - Synthetic `meta.json` field references in the surviving log-cleanup scenarios. JSONL-aware truncation is a separate future feature if ever needed — the typical motivating case (per-file size cap on a runaway log) is addressable today by sampling or per-job log-size budgets enforced at hook-fire time. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`start_socket_listener` puts the listener in non-blocking mode so the accept loop can poll the shutdown flag. On macOS (and other BSDs) accept(2) returns a socket that inherits O_NONBLOCK from the listener, so every accepted stream landed in `handle_client_connection` non-blocking. `read_exact` then returned `WouldBlock` before any client bytes arrived and the request silently dropped — the existing `set_read_timeout(5s)` is a no-op on non-blocking sockets. `test_socket_listener_list_jobs` exposed this as a deterministic EOF mid-frame on the client side. The same race affected real coordinator IPC on macOS. Force the accepted stream back to blocking at the top of `handle_client_connection` so `set_read_timeout` is the only thing bounding reads. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
41c9820 to
e386058
Compare
Phases 1-3 of the redb → SQLite migration. Lays down the new data-access stack alongside the existing redb store; the cutover that deletes redb + meta.json comes in a follow-up commit so reviewers can see the new code in isolation. **Phase 1 — store layer (`src/store/`)** Pure SQLite, no daft business logic. Every consumer inherits the same security defaults because they all funnel through `connection::bring_up`: * `connection.rs` — applies `PRAGMA application_id`, `journal_mode = WAL`, `synchronous = NORMAL`, `foreign_keys = ON`, `busy_timeout`, and `wal_autocheckpoint`. Refuses to open a foreign SQLite file (mismatched `application_id`); tightens file/dir perms to 0600/0700 on fresh creates. * `pool.rs` — `r2d2_sqlite`-backed writer pool (size 1, enforces WAL's single-writer ordering) and reader pool (multi-slot, opened with `SQLITE_OPEN_READ_ONLY` and a 300ms busy_timeout so CLI paths fail fast). * `migrate.rs` — `rusqlite_migration` runner. Refuses to open if the on-disk `user_version` is newer than this binary's understanding. * `migrations/001_initial.sql` — `invocations`, `jobs`, `repo_policy` tables with explicit `ON DELETE CASCADE` FK from `jobs` to `invocations` and indexes for the `ls`/reconcile scan paths. * `paths.rs` — canonical path resolution. `for_repo_under(base, hash)` canonicalizes the result and rejects symlink-escape attempts. * `env_scrub.rs` — strips `*_TOKEN`/`*_SECRET`/`*_KEY`/`*_PASSWORD` (case-insensitive suffix match) before job env touches disk. * `models/` + `repos/` — typed row structs (no bincode blobs) and a parameterized-SQL-only query layer. Multi-row writes use `repos::with_write_txn`. **Phase 2 — coordinator ports + adapters** `src/coordinator/ports/` declares the abstractions the domain layer talks to: `JobsStorePort`, `Clock`, `ProcessControl`. `src/coordinator/adapters/` implements them: `SqliteJobsStore` (wraps the pool, scrubs env on every job upsert, converts RepoPolicyRow ↔ RepoPolicy), `SystemClock`, `UnixProcess` (wraps nix's `killpg` with a signal-0 existence probe). **Phase 3 — first domain extraction** `src/coordinator/domain/reconcile.rs` lifts `reconcile_active_jobs` out of `process.rs`. The domain version takes the three ports and returns the count of jobs marked Crashed. Pure logic — no `nix::`, no SQL, no `Utc::now()`. Six unit tests drive it with fully-mocked adapters. This proves the layering separates correctly; future PRs can lift `cancel.rs` and `lifecycle.rs` from `process.rs` into `domain/` using the same pattern. **Dependencies** * `rusqlite = "0.39"` with `bundled` (compiles SQLite from source, ~+1 MB on release builds, no system-library skew) * `rusqlite_migration = "2.5"` * `r2d2 = "0.8"` * `r2d2_sqlite = "0.33"` (kept on 0.33 instead of 0.34 to clear the 7-day cooldown without an allowlist entry) All within the existing cooldown policy; no `.dep-age-allowlist` changes needed. **Test coverage added** `mise run test:unit` adds 89 new passing tests under `store::*` (76), `coordinator::adapters::*` (7), and `coordinator::domain::reconcile::*` (6). The redb-era `concurrent_open_is_rejected_by_redb_lock` is mirrored by `coordinator::adapters::sqlite_store::tests::concurrent_open_allowed`, which asserts the new positive behavior — two pools can open the same file and the reader sees committed writes mid-flight. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nitial schema Three small enabling pieces for the upcoming JobStore → SqliteJobsStore swap. Each is a no-op on its own; the next commit consumes them. * `JobStatus::as_status_str` / `from_status_str` — the SQLite `jobs.status` column is TEXT, the domain uses an enum, and serde's `lowercase` rename happens to give the same tags. Putting both directions on the enum keeps the conversion at the storage boundary out of inline match statements at every call site. * `SqliteJobsStore::for_repo_base(base)` — convenience that mirrors the old `JobStore::open_for_repo_base`. Wipes any sibling `coordinator.redb` / `repo-policy.json` files (loudly, one stderr line per file) the first time a repo's state dir is opened. Matches the pre-1.0 no-back-compat principle: cleanup is not silent. * Initial migration no longer declares the `jobs → invocations` FK. Production code never populated `invocations` under the redb store (the table was added forward-looking in Tier-1 and forgotten), so the FK would refuse every job insert today. A follow-up migration will add it back when invocation tracking actually lands. Test expectations updated to pin the current "jobs can stand alone" shape and the no-cascade-on-delete behavior, so the change is visible the day the FK returns. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…oduction
Every production call site that previously opened the per-repo redb
`JobStore` now opens `SqliteJobsStore::for_repo_base` instead. The redb
module remains in tree but is unused by production code; it gets
deleted in a follow-up commit once the meta.json dual-write is also
removed and the integration tests have stayed green for a round.
* `src/coordinator/process.rs`:
* `JobStore` → `SqliteJobsStore` everywhere (types + opens).
* `job_store_path` helper removed; callers use `for_repo_base`.
* `reconcile_active_jobs` becomes a thin wrapper around the domain
function that wires up `SystemClock` + `UnixProcess` adapters. The
inline reconciliation logic was lifted into `domain/reconcile.rs`
in the previous commit and is exercised by six unit tests there.
* `job_row_from_meta` produces the new SQLite `JobRow` shape
(no `schema_version`, `status: String` via `as_status_str`).
* Two inline `reconcile_*` tests removed — superseded by the
domain-layer tests. Filter and dual-write tests updated to the
new shape.
* `src/log_clean.rs`, `src/hooks/yaml_executor/mod.rs`,
`src/commands/dump_store.rs`, `src/commands/hooks/jobs.rs`: same
swap. The `__dump-store repo-policy` debug helper keeps emitting
the same JSON shape so scenarios that grep its output don't have
to change.
* `src/commands/hooks/jobs.rs::load_redb_job_meta_index` — the name
is preserved for callers but the impl now reads from SQLite. The
prior redb-era "process-lock breaks concurrent reads while
coordinator is alive" caveat is gone; WAL + the reader pool means
the open succeeds mid-flight. The "fall back to meta.json" branch
is still wired so unmigrated state and pre-store tests keep
rendering — that fallback comes out in the next commit alongside
the meta.json dual-write.
All 1,801 unit tests pass. Integration scenarios untouched (the
adapter is API-compatible with the redb store's public surface, and
the legacy `coordinator.redb` / `repo-policy.json` files in scratch
state dirs are auto-wiped on first open with a one-line stderr
banner).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…Lite The cutover takes the **read** side off `meta.json` for every hot path the user is likely to hit interactively, and starts routing every **write** through the SQLite store alongside the legacy sidecar so the two stay consistent during the transition. The remaining `meta.json` write is kept on purpose: `LogStore::clean` still reads `meta.json` for per-job retention metadata, and rewriting that to query SQLite is its own scope. The final `meta.json` deletion + `LogStore::clean` rewrite lands in a follow-up. **Reads — fully off meta.json on the touched paths** * `src/commands/complete.rs` — tab completion (the documented hot path) now opens a single `SqliteJobsStore` per `daft __complete` call and a new `jobs_for_invocation` helper batches the per- invocation lookup into one `list_jobs_for_invocation` query instead of one filesystem `read_meta` per job dir. Seven prior `store.read_meta(dir)` call sites collapsed to four `jobs_for_invocation(...)` calls. The reader pool's tight `busy_timeout` (300 ms) keeps a stuck SQLite open from blocking the user's shell. * `src/coordinator/process.rs::build_job_list` (powers `ListJobs` IPC) now sources its data from `SqliteJobsStore:: list_jobs_for_repo`; the prior `LogStore::list_job_dirs` + per-dir `read_meta` scan is gone. **Writes — every path also goes to SQLite** * `BufferingLogSink` (foreground hook path) gains a `SqliteJobsStore` opened at construction. `on_job_complete` and `on_job_runner_skipped` call a new `persist_job_row` helper that converts `JobMeta` → `store::models::JobRow` and upserts. Eight test sites updated for the new `repo_hash` arg on `BufferingLogSink::new`. * `src/coordinator/process.rs::dep_failed` loop now persists a skipped `JobRow` so dep-failed jobs aren't invisible to `daft hooks jobs ls`. Previously they only landed in `meta.json`. * The `JobMeta` → `JobRow` projection used by the tab-completion hot path lives next to `JobMeta` itself as `log_store::job_meta_from_row`, so future renderer call sites can adopt it without re-implementing the conversion. **Notes on the deferred work** The advisor's Phase-4c spec called for the meta.json *write* to go away too. Doing so requires `LogStore::clean` to query SQLite for per-job retention metadata (it currently scans `meta.json`). `LogStore::clean` is tangled enough — stale-running detection, per- worktree sanity floor, candidate aggregation — that a sloppy rewrite would silently regress the retention pass. The right move is one focused commit; this one stops short so the diff stays reviewable and the YAML scenarios for retention/budget keep their existing baselines. All 1,801 unit tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`src/coordinator/store/` was the redb-backed `JobStore` plus its key schema and value types. As of the previous commit nothing in production calls into it — every call site routes through `SqliteJobsStore` / `JobsStorePort`. The module is unreachable code; deleting it is the bookkeeping step that closes Phase 4 of the redb → SQLite migration. * Remove the `src/coordinator/store/` directory (mod.rs + types.rs + schema.rs) and the matching `pub mod store;` declaration in `src/coordinator/mod.rs`. * Drop `redb = "4.1"` and `bincode = "1.3"` from `Cargo.toml`. The pinned bincode comment was about sharing the version with syntect and edtui — those crates keep using it through their own dep tree, daft doesn't need a top-level entry. `coordinator.redb` files left over in `~/.local/state/daft/jobs/*/` are removed on first open by `SqliteJobsStore::for_repo_base`'s legacy-file sweep (announced via stderr banner), so users upgrading across the cutover don't have stray bytes hanging around. All 1,789 unit tests pass — 12 fewer than before because the redb module's internal tests went with the module. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…Lite files The scenario captured the per-repo state dir's newest subdirectory via `ls -1tr | grep -v '\.json$' | tail -1`. After the SQLite cutover the per-repo dir also contains `coordinator.db` plus the WAL/SHM sidecars (`coordinator.db-wal`, `coordinator.db-shm`), and the WAL file is touched on every commit so it sorts as "newest" — the probe captured `coordinator.db-wal` instead of the most recent invocation dir and the dir-existence check at the end of the scenario failed. Switch to a `find -type d`-based probe so only directories are considered. The same scenario already uses `find -type d` for the count assertions; this aligns the newest-id capture with the same filter. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents the architectural decisions from the redb → SQLite migration so the next agent (or human) lands new DB-backed features in the right shape without re-deriving the constraints. Four edits: * **Critical Rule #4 (forbid unsafe)** — extended with a sentence on dependency selection: prefer crates with a fully safe public API. Cites SQLite via `rusqlite` over LMDB via `heed` (whose `Env::open` is `unsafe fn`) as the canonical worked example. * **New section after Dependency Cooldown — `## Database & Storage`** — SQLite via `rusqlite` (bundled + WAL) is daft's canonical structured- data store; no other embedded DBs; redb's process-exclusive lock is why the prior choice didn't fit; the pool's reader/writer split; the WAL-on-network-fs constraint. References the persisted memory doc for the full redb post-mortem. * **New `## Architecture` subsection — Store / Ports / Adapters / Domain pattern** — diagrams the four layers, names the hard rules (domain never imports `rusqlite`/`nix`/`std::env`; store never imports domain types; adapters are the only translation layer), and points at the canonical examples to study before touching the pattern. * **New section — `## Adding a New DB-backed Feature`** — eight-step checklist analogous to "Adding a New Command", in the order that actually keeps the layering clean (schema first → models → repos → port trait → adapter → wiring → unit-tested domain logic → security defaults are non-bypassable). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three doc-comments still said "redb" after the cutover. None of these are user-facing — they're inline comments and a CoordinatorRequest variant doc — but they'd confuse the next reader trying to map the code back to the actual store. * `CoordinatorRequest::CancelMatching` doc: "redb-recorded Running/Cancelling rows" → "Running/Cancelling rows from the SQLite store". * `cancel_matching` fn doc + error message: "redb JobRow" / "redb job store" / "pre-redb invocations" → "SQLite JobRow" / "SQLite job store" / "pre-store invocations". * `run_coordinator` body comment: "redb-backed job store" → "SQLite-backed job store". Historical references to the redb era are preserved where they're load-bearing: the `LEGACY_FILES` constant explaining what gets auto-wiped, the `concurrent_open_allowed` test comment explaining what it replaced, and the `load_redb_job_meta_index` function name (an interface kept stable on purpose so callers don't need to know the store swapped underneath them — renaming is its own follow-up). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`SqliteJobsStore::for_repo_base` deletes `coordinator.redb` and `repo-policy.json` siblings of the new `coordinator.db` the first time the per-repo state dir is opened — the production "no back-compat for pre-1.0 formats" path users will hit when they upgrade across this PR. The sweep was relying on the production-level `eprintln!` banner for visibility but had no test asserting it actually fires. Adds `for_repo_base_wipes_legacy_redb_and_policy_json` which plants both legacy files, opens the store, and asserts the files are gone and the new `coordinator.db` is in place. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captures the strategic intent the redb→SQLite migration vindicated: hexagonal at subsystem boundaries, functional core inside domain modules, vertical slice at the CLI command layer. Pins the long-term direction toward gitoxide-style crate decomposition and the native agentic IDE spin-out — the concrete reason crate-boundary ports will earn their keep. Lists patterns consciously rejected (CQRS, event sourcing, microservices, ADR-per-decision) with the reasons attached so future contributors don't relitigate. CLAUDE.md gets a landpost paragraph at the top of the Architecture section pointing at ARCHITECTURE.md, so agents and contributors hit the strategic direction before any operational detail. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
/claude review |
|
Claude finished @avihut's task in 4m 50s —— View job Review: feat(coordinator): redesign internals (#476)This is a well-executed redesign. The hexagonal layering is clean, the spawn pattern follows CLAUDE.md precisely, Issues1.
|
The redb→SQLite cutover left a one-line stderr banner ("daft: removed
legacy ... at ...") behind the `for_repo_base` constructor. The
completion path (`__complete`) opens the store on every Tab press and
runs inside `eval`-captured shell code that only captures stdout, so the
banner leaks straight into the user's terminal mid-keystroke if any
legacy file is still on disk after upgrade.
Split the constructor:
- `for_repo_base` — no wipe; safe to call from completion / CLI readers.
- `for_repo_base_with_wipe` — wipes legacy files and emits the stderr
banner; called only from `run_coordinator` startup, where the
diagnostic surface is already in use for other startup messages.
Documents the trust boundary on `for_repo_base` per PR review Issue #4:
`base` is derived upstream by `LogStore::new` from `daft_state_dir() +
jobs + repo_hash` (canonicalize-and-asserted by `paths::for_repo_under`),
so the adapter doesn't redo that check at every open. New callers MUST
keep the chain — no user input feeds `base`.
Adds a regression test (`for_repo_base_does_not_wipe_legacy_files`) so
the hot-path-stderr issue cannot reoccur silently.
Addresses PR #508 review Issues #2 + #4 + Minor #4.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The security table in PR #508's description claims "0600 file / 0700 parent dir, verified once per process at first open." That claim was half-true: `tighten_perms` ran on fresh-DB creation (correct), but subsequent `Pool::open` calls on an existing DB went through `bring_up`'s `verify_application_id` without ever re-checking the filesystem perms. The threat: an admin process (or a misconfigured backup tool) replaces the 0o600 DB file with a 0o644 copy between daft invocations. Until the file is recreated, the new perms persist and the contents are world-readable. Application_id verification can't catch this because the header bytes are intact. Add a single `paths::verify_perms` call in `Pool::open` after migrations finish but before the writer/reader pools come up. Fresh DBs pass trivially (they were just tightened in `bring_up`); existing DBs get the real check. Once per pool — no per-checkout cost. Adds `pool_rejects_world_readable_db` as a regression test (tighten on first open → loosen externally → assert the second open fails with `PermissionsTooOpen`). Addresses PR #508 review Issue #1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tuses
The original `list_by_repo_and_statuses(conn, repo_hash, statuses)`
built its placeholder list with `format!("?{i}")` and `format!(... IN
({placeholders}) ...)`. Not a SQL injection — only the placeholder
count came from `statuses.len()`, values were bound via `params_vec` —
but it violates the no-`format!`-in-repos invariant. That invariant is
load-bearing because future contributors read the absence of `format!`
in `src/store/repos/` as a clear signal that every query is
parameterized; once `format!` is used "carefully" in one spot, the
signal degrades.
The only caller (`SqliteJobsStore::list_active_jobs`) always passes
exactly two statuses (`"running"`, `"cancelling"`). Rename to
`list_by_repo_and_two_statuses` with a fixed `(status_a, status_b)`
signature and hardcoded `WHERE repo_hash = ?1 AND status IN (?2, ?3)`.
If a third active-set status appears, add a sibling method rather than
re-introducing dynamic placeholders.
Repo layer stays neutral on which statuses constitute "active" — the
caller still picks; only the cardinality is fixed at the type level.
Addresses PR #508 review Issue #3.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The IPC socket listener's `accept()` loop has three arms: success (dispatch to client handler), `WouldBlock` (sleep + retry), and any other error (break the loop). The last arm exited silently — once it fired, every subsequent `cancel`, `tail-logs`, and `list-jobs` IPC request returned "connection refused" at the client, and nothing in the coordinator's output explained why the socket disappeared. Add a one-line `eprintln!` before the break so the underlying error (FD exhaustion, protocol error, ...) lands in whatever terminal collects coordinator stderr. The listener still exits — the unexpected-error case isn't recoverable from inside the loop — but the operator gets a breadcrumb instead of a void. Addresses PR #508 review Issue #5. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…h SQLite The legacy `LogStore::clean` walked the filesystem and read each per-job `meta.json` to drive retention decisions. PR #508's redb→SQLite migration left that path in tree as the transitional reason for the meta.json dual-write: production code wrote both SQLite rows and the sidecar so the retention pass kept seeing per-job `retention_seconds`/`worktree`/`started_at`. Pre-flight audit confirmed the SQLite `jobs` row carries every field `clean` reads (`name`, `worktree`, `started_at`, `status`, `retention_seconds`) with no defaults — no schema migration needed. Restructured along the hexagonal-plus-FCIS direction pinned in ARCHITECTURE.md: - **New pure module** `coordinator/domain/retention.rs`. Defines `JobSnapshot`, `RetentionInput`, `RetentionDecision`, `EvictionTarget`, `StaleMarkTarget`, and a single pure function `retention(input) -> decision`. No traits, no `&dyn Port`, no I/O, no `Utc::now`. Computes which jobs to flip to stale-running and which job dirs to evict, returns them as values for the shell to apply. 10 unit tests assert directly against returned decisions: time-based retention, keep_last floor, per-job override, retention_override supremacy, stale-running with/without socket, Running jobs exempt, multi-worktree per-floor accounting, jobs_per_inv/candidates_per_inv alignment for whole-invocation removal, dry-run shape parity. - **`LogStore::clean` becomes the imperative shell**. Gathers the snapshot via `JobsStorePort::list_jobs_for_repo` (replacing the per-job `read_meta` loop), stats `output.jsonl` for sizes, probes the coordinator socket once, calls `retention(input)`, then applies the decision: stale-running marks via the port, evictions via rename-to-`.deleting-` then `remove_dir_all`. Signature now takes `&dyn JobsStorePort` + `repo_hash` so the data source is explicit. - **Callers updated.** `log_clean::run_clean_logs` and `commands::hooks::jobs::prune_jobs` already constructed a `SqliteJobsStore` adjacent to the `LogStore`; both now thread it through `clean`. `log_clean` falls through (cannot retention-sweep without the store); `prune_jobs` `?`-propagates the open error. - **Tests reseeded via port.** `test_clean_old_logs`, `clean_uses_per_job_retention_from_sqlite_row` (renamed from `_from_meta`), `clean_keep_last_floor_overrides_retention`, `clean_detects_stale_running_when_socket_missing`, and `dry_run_tallies_removed_invocations` now seed via `JobsStorePort::upsert_job` through a small `seed_job_row` helper. `seed_inv_with_jobs` gains an `Option<(&SqliteJobsStore, &str)>` parameter so budget tests (which stat the filesystem, not SQLite) can pass `None`. The transitional `meta.json` write in production still happens — it's unused by retention now, but `LogStore::clean` was the last reader and deleting the write side will land in a follow-up commit so this diff stays focused on the data-source swap + FCIS extraction. Test count: 1792 → 1802 (10 new domain tests). All retention behavior that the YAML scenarios exercise (per-job retention, keep_last floor, stale-running, dry-run) is now also covered by pure-function unit tests that don't depend on filesystem timing. Addresses follow-up #55 (partial — meta.json write deletion comes next). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #508 B1 made `LogStore::clean` query SQLite for retention metadata, which leaves `meta.json` with no production reader. This commit removes the writes (and the now-stale fallback reads), adds the sidecar to the legacy-file sweep, and switches verification paths over to the `JobsStorePort`. Production write sites removed: - `coordinator/process.rs` — three `store.write_meta(&job_dir, &meta)` calls at the dep-failed, initial-running, and terminal-update points. The matching `JobsStorePort::upsert_job` calls already write the authoritative row to SQLite. - `coordinator/log_store.rs::write_job_record_jsonl` — no longer writes the sidecar. The helper now owns just the JSONL payload + per-job directory; metadata flows separately via `JobsStorePort::upsert_job` from `BufferingLogSink::persist_job_row`. Production read sites removed: - `commands/hooks/jobs.rs::read_job_meta_redb_first` — renamed to `lookup_job_meta` and reshaped as `(index, inv, job_dir) -> Option<JobMeta>`. The meta.json fallback is gone; SQLite-loaded index is the sole source. Callers (`build_job_list`, group rendering, status filtering, `format_jobs_logs`, retry resolution) switch from `Result` to `Option` handling. - `retry_command` builds the SQLite index up front and threads it into `resolve_retry_invocation` (new `index` param) and the inline meta loop. The two remaining `store.read_meta` calls in those code paths are replaced by `lookup_job_meta`. Legacy-file sweep extended: - `sqlite_store::wipe_legacy_files` now also walks `<inv>/<job>/` for pre-cutover `meta.json` sidecars and removes them with a stderr banner. The sweep runs only from `for_repo_base_with_wipe` (coordinator startup), so the completion / CLI-reader hot path never prints to stderr mid-keystroke. - The regression test grew a `meta.json` case in both the wipe and non-wipe variants. Tests reseeded: - `BufferingLogSink` tests verify against the SQLite row via `SqliteJobsStore::get_job`, not `LogStore::read_meta`. - `process.rs` tests gain a real `SqliteJobsStore` via `make_test_state` + `make_ctx`'s new `job_store` parameter; assertions against terminal status, dep-failed synthesis, and stale-running flow through the port. - `write_job_record_jsonl_creates_meta_and_jsonl_atomically` renamed to `..._writes_jsonl_without_meta_sidecar` and now asserts the sidecar is *absent*. - Test for `read_job_meta_redb_first_falls_back_to_meta_json_when_redb_silent` removed — the behavior it pinned is gone. The companion test for "prefers redb when index has row" becomes `lookup_job_meta_returns_indexed_row` plus a sibling `lookup_job_meta_returns_none_when_index_misses`. `LogStore::write_meta` and `LogStore::read_meta` themselves remain in tree for now — only test setup still uses them to seed JobMeta fixtures. They have no production callers; deleting them and migrating the remaining test seedings to the port is a separate mechanical cleanup. Closes the meta.json dual-write follow-up referenced in PR #508's "Known follow-ups" section. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Thanks for the second-pass review. Most of the findings are legitimate Issues (addressed)
Follow-up from the previous review (now closed)The
Minor observations
Diff summary
|
B2 (`chore(coordinator): drop meta.json from production paths`) dropped the `write_meta` call inside `LogStore::write_job_record_jsonl`, but the YAML executor's pre-execution skip loop (the path that handles per-job `skip: true` / `only: false` / platform / tracking filters) was relying on that sidecar as its ONLY persistence — the BufferingLogSink isn't constructed until after the loop, so its `persist_job_row` path doesn't run for these jobs. Result: every condition-skipped job became invisible to `daft hooks jobs ls` and to the SQLite store. Caught by the `tests/manual/scenarios/hooks/per-job-skip-records.yml` YAML scenario in CI. Open the per-repo `SqliteJobsStore` once up-front (the same handle that later writes the repo policy) and route both the skipped-job persistence loop and the repo-policy write through it. Each skipped job now gets a proper `JobRow` upsert alongside the `output.jsonl` that carries the reason string. The on-disk shape converges with the foreground/background paths. Best-effort: if the store can't be opened, we still write the JSONL log file and emit a one-line stderr warning; only the SQLite persistence is skipped. Production paths are no different from before in that failure mode — they printed the same warning. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
B2 dropped `meta.json` from all production code paths, so YAML scenarios that read the sidecar (`bg-output-silent`'s settle poll) or mutated it (the log-cleanup family's backdating step, the synthetic Running/stale invocation injectors) stopped working: the file simply isn't there, and even when it was, daft no longer consulted it for retention decisions. Switch every affected scenario to operate on `coordinator.db` directly via `sqlite3` (poll status, backdate `started_at` / `finished_at` / `created_at`, INSERT synthetic invocations + jobs). The eight scenarios that were red in CI on the previous push: - `hooks/bg-output-silent.yml` — settle poll now queries `SELECT COUNT(*) FROM jobs WHERE name LIKE 'silent-%-job' AND status = 'running'`. - `hooks/log-cleanup-dry-run.yml`, `hooks/log-cleanup-respects-retention.yml`, `hooks/prune-dry-run-invocations.yml` — backdate via `UPDATE jobs SET started_at = ...; UPDATE invocations SET created_at = ...;` with a Python-computed ISO-8601 timestamp. - `hooks/log-cleanup-honors-keep-last.yml` — per-row shift in Python (preserves relative ordering of `started_at` so the keep_last sanity floor still applies in the expected order). - `hooks/log-cleanup-skips-running.yml`, `hooks/log-cleanup-stale-running-detected.yml` — synthetic invocation+job INSERTs via Python's `sqlite3` module, matching the column shape from `001_initial.sql`. Placeholder `output.jsonl` files still touch the filesystem so the eviction path has something to remove. All eight scenarios pass locally; the production fix in the previous commit was caught by `hooks/per-job-skip-records.yml` (not modified here — the test itself was correct, the production code was the bug). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
/claude review |
|
Claude finished @avihut's task in 7m 30s —— View job Review: feat(coordinator): redesign internals (#476) — Pass 3All previous findings from passes 1 and 2 are correctly addressed. Four new items follow, two moderate and two minor. Issues1.
|
`run_single_background_job` set `meta.pid = Some(std::process::id())` — the coordinator's own PID — but production `job_row_from_meta` ignores `meta.pid` entirely and uses the `child_pid` parameter for the SQLite `JobRow.pid`/`JobRow.pgid`. The assignment looked meaningful but did nothing in the row that actually gets persisted. Replace with `pid: None` and a comment pointing at the terminal write that populates `child_pid`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Redb is gone — the cutover landed in this PR. Five doc comments still used the old terminology and would confuse readers using `grep redb src/` as a cutover-completeness oracle. Three flagged by the pass-3 review: - src/coordinator/process.rs:338 (`JobInvocationContext.job_store` doc) - src/coordinator/process.rs:355 (`job_row_from_meta` doc) - src/core/worktree/prune.rs:835 (`cancel_background_jobs_for_worktree` doc) Two more found while sweeping for consistency: - src/executor/mod.rs:105 (`JobSpec.tags` doc) - src/commands/hooks/jobs.rs:652 (`load_*_job_meta_index` doc — drop the redb-era contrast clause; keep the positive SQLite + WAL statement) The legacy-wipe references at `process.rs:1396-1397` and inside `src/coordinator/adapters/sqlite_store.rs` correctly describe historical files (`coordinator.redb`) that the wipe code deletes — left alone. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…_meta_index Pure mechanical rename. The function and the `redb_index` variable/ parameter survived from the pre-cutover dual-write era when the index was sourced from redb with a meta.json fallback. Today it opens `SqliteJobsStore` and reads `JobRow`s directly — there's no redb dependency to explain. - Function: `load_redb_job_meta_index` → `load_sqlite_job_meta_index` - Variable/parameter at 16 call sites: `redb_index` → `sqlite_index` Confined to `src/commands/hooks/jobs.rs`. No behavior change; existing unit tests (1802 passing) cover the rename via `cargo check` and `cargo test`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`src/store/repos/{mod,jobs}.rs` claimed "a CI grep-gate enforces" the
no-`format!` invariant. No such gate existed in any workflow or mise
task — the claim was true by convention only. Make the claim true by
adding the gate.
New mise task `lint:repos-no-format` greps for `format!\s*\(` (matches
macro calls, ignores doc-comment prose like `` `format!` ``) and fails
loudly when one is found. Wired into the existing `lint:rust` umbrella
via `#MISE depends`, so `mise run lint` and the CI `mise run clippy`
step pick it up automatically.
Today the gate passes — no `format!()` calls exist in src/store/repos/
— and verified by temporarily injecting one (it correctly failed).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Abstract
The hooks-jobs coordinator (
daft __coordinator) backs every backgroundhook job daft runs and is the source of truth for job status, logs, and
cancel semantics. Until this PR its internals had grown organically:
durability lived in scattered JSON sidecars, the IPC was loose
\n-delimitedJSON without versioning, log content went into a single
output.logwithno schema, and crashed jobs were invisible until the next manual cleanup.
This PR redesigns those internals around three pillars plus a
hexagonal architecture (
store / ports / adapters / domain) that futureDB-backed features land into:
<state>/jobs/<repo>/coordinator.db(WAL mode) replaces filesystemsidecars (
meta.json,repo-policy.json) as the system of record.Writes are atomic via SQLite transactions. A reconcile pass on
coordinator boot flips rows for processes whose PIDs are gone to
Crashed, surfacing failures the user could previously only spot byinspecting state directories. Critically, the new store also fixes
the cross-process-access limitation of the design's first iteration
(see "Migration notes" below).
stable
v/ErrorCodetaxonomy replace the old line-delimitedexchange. A
TailLogs { follow, since_seq }streaming variant letsthe CLI live-tail structured records over the socket. The 8 MiB
frame ceiling bounds local IPC.
output.jsonlrecords ({seq, ts, kind: stdout|stderr|status, ...}) replace the legacyoutput.log. Aper-job
log.sampling_every_nthknob drops records without breakingsequence numbering.
daft hooks jobs logsgains--stdout / --stderr / --status / --seq / --since-seq / --since / --followflags.Beyond the three pillars, the PR also lands the user-visible payoff for
the new persistence layer:
daft hooks jobs ls / logsand the tab-completion hot path now read from SQLite, so the new
Crashedstatus isfinally visible from the CLI and stays visible while a coordinator is
actively writing (WAL allows concurrent readers).
A
CancelMatching { hook, worktree, tag, invocation_prefix, older_than_secs }IPC variant and matching CLI flags collapse what usedto be three special-case helpers into a single predicate.
Three over-grown internal APIs were also tightened with context structs
(
JobAdapterContext,HookExecutionContext, extendedJobInvocationContext) — all three lose their#[allow(clippy::too_many_arguments)].Closes #476.
Migration notes
The early iteration of this PR used redb for
persistence — pure-Rust, single-file, atomic transactions. During
review the cross-process access pattern surfaced as a hard wall: redb
takes a process-exclusive OS file lock on open, so the CLI could not
read job state while the coordinator was alive. The Tier-1 design
papered over this with a
meta.jsondual-write that the CLI fell backto. Removing that workaround revealed the real issue and led to a swap
from redb to SQLite via
rusqlite(bundled, WAL mode):(many readers + one writer across processes on the same host) — no
workaround needed.
forbid(unsafe_code)policywithout carveouts. LMDB via
heedwas evaluated and rejected: itsEnv::openisunsafe fn.repo catalog's relational shape — future DB-backed features land in
the same store layer rather than each one introducing a new DB.
Legacy files left over from earlier builds —
coordinator.redb,repo-policy.json, and per-jobmeta.jsonsidecars under~/.local/state/daft/jobs/*/<inv>/<job>/— are removed on firstcoordinator startup by
SqliteJobsStore::for_repo_base_with_wipe,with a one-line stderr banner per file. The completion / CLI-reader
path uses a non-wiping constructor so the banner cannot leak into the
user's terminal mid-
Tabpress.Architectural changes
Persistence: JSON sidecars → SQLite
flowchart LR subgraph Before["Before — JSON sidecars"] direction TB DM1["meta.json (per job)"] DR1["repo-policy.json"] DL1["output.log"] end subgraph After["After — SQLite + JSONL"] direction TB DB[("coordinator.db<br/>(WAL mode)")] DB --> T1["jobs"] DB --> T2["invocations"] DB --> T3["repo_policy"] DL2["output.jsonl (per job)"] end Before -- "legacy file sweep on coordinator startup" --> AfterWrites go through the
JobsStorePorttrait (typedupsert_job,upsert_invocation,write_repo_policywith field-merge semantics)implemented by
SqliteJobsStore. The adapter scrubs*_TOKEN/*_SECRET/*_KEY/*_PASSWORDenv vars at the persistenceboundary so no future caller can accidentally leak secrets to disk.
SQLite is the sole source of truth for job metadata.
LogStore::cleannow gathers its retention snapshot via the port (lifted into a pure
coordinator/domain/retention.rsFCIS function), someta.jsonnolonger flows through any production write or read path. Legacy
sidecars from earlier builds are swept on first coordinator startup.
Store / Ports / Adapters / Domain pattern
flowchart TB APP["Application<br/>(commands/hooks/jobs,<br/>IPC dispatch, complete.rs)"] DOMAIN["Domain<br/>(pure logic — no SQL, no syscalls)<br/>coordinator/domain/reconcile.rs"] PORTS["Ports (trait declarations)<br/>JobsStorePort · Clock · ProcessControl"] ADAPTERS["Adapters (impls)<br/>SqliteJobsStore · SystemClock · UnixProcess"] STORE["Store (data layer)<br/>connection · pool · migrate · repos · env_scrub"] APP --> DOMAIN APP --> ADAPTERS DOMAIN --> PORTS ADAPTERS --> PORTS ADAPTERS --> STOREHard rules (recorded in CLAUDE.md):
rusqlite,nix, orstd::env.structs (
JobRow,InvocationRow,RepoPolicyRow).(env scrub, status enum ↔ TEXT conversion, policy merge) live there.
coordinator/domain/reconcile.rsis the canonical example — purelogic, six unit tests against fake adapters. Future PRs will lift
more functions from
process.rs(cancel,lifecycle) into thedomain layer using the same pattern.
Security baked into the store layer
Every consumer goes through
src/store/connection.rs::bring_up, sothe security defaults are non-bypassable:
format!allowed insrc/store/repos/daft_state_dir(); canonicalized; prefix-assertedPRAGMA user_version > CURRENTPRAGMA application_idcheck refuses foreign SQLite fileswal_autocheckpoint = 1000busy_timeout = 5000SQLITE_OPEN_READ_ONLY+busy_timeout = 300env_scrub::scrubstrips*_TOKEN/_SECRET/_KEY/_PASSWORDbefore persistingJobRow.envrusqlite_migration-driven; refuses to openuser_version > binary's currentWire protocol: loose JSON → framed envelopes
sequenceDiagram participant CLI as daft CLI participant Coord as daft __coordinator Note over CLI,Coord: Length-prefixed (u32 BE) framed JSON envelopes.<br/>Every frame carries `v` (PROTOCOL_VERSION) and a tagged body. CLI->>Coord: RequestEnvelope { v:1, req:"TailLogs",<br/>payload:{job, follow:true, since_seq:42} } loop until terminal Status or EOF Coord-->>CLI: ResponseEnvelope { v:1, kind:"StreamFrame",<br/>payload: LogRecord {...} } end Coord-->>CLI: ResponseEnvelope { v:1, kind:"StreamEnd" } Note over CLI,Coord: On schema mismatch the coordinator emits<br/>Error { code: SchemaMismatch, message } instead of<br/>silently parse-failing — the typed ErrorCode taxonomy<br/>also covers JobNotFound / NoMatch / KillFailed / Internal.Frames cap at 8 MiB so a misbehaving peer can't exhaust memory.
Streaming requests (currently just
TailLogs) get manyStreamFrameenvelopes terminated by exactly one
StreamEnd. Single-responserequests exchange one envelope each way.
Coordinator lifecycle: spawn-self + crash recovery
flowchart TB A["Hook fires<br/>(worktree-post-create, ...)"] --> B{"Background jobs<br/>declared?"} B -- "no" --> Z["Foreground exec only<br/>(BufferingLogSink<br/>writes JobRow to SQLite)"] B -- "yes" --> C["spawn_coordinator()"] C --> C1["serialize state → 0600 tempfile"] C1 --> C2["current_exe().canonicalize()"] C2 --> C3["Command::new(canonical).args(__coordinator, tempfile)<br/>Stdio::null x3"] C3 --> D["child: daft __coordinator"] D --> D1["read + unlink state tempfile"] D1 --> D2["nix::setsid()"] D2 --> D3["SqliteJobsStore::for_repo_base(...)<br/>auto-wipe legacy coordinator.redb<br/>+ repo-policy.json"] D3 --> D4["coordinator::domain::reconcile_active_jobs()<br/>via UnixProcess + SystemClock adapters<br/>flips dead Running rows → Crashed"] D4 --> D5["run_all_with_cancel()<br/>each job in its own process_group(0)<br/>upserts JobRow at start + terminal"] D5 --> D6["process::exit"]spawn_coordinatoris the canonical pattern;fork()is not anoption (libc::fork is
unsafe fn, conflicts withforbid(unsafe_code)). The reconcile pass on boot is what makes the"my last terminal crashed" case visible: the previous coordinator's
process group is gone, so its
Runningrows get flipped toCrashedwith
finished_at = now, and the nextdaft hooks jobs lsshows them.Cancel matching
flowchart LR subgraph CLI F1["--hook"] F2["--worktree"] F3["--tag"] F4["--inv"] F5["--older-than"] Fall["--all"] end F1 & F2 & F3 & F4 & F5 -- "AND" --> CM["CancelMatching {<br/>hook, worktree, tag,<br/>invocation_prefix, older_than_secs<br/>}"] Fall -- "no filter required" --> CA["CancelAll"] CM --> S["filter_matching_jobs over<br/>SqliteJobsStore::list_active_jobs(repo)"] CA --> S S --> K["killpg(pgid, SIGTERM)<br/>for each match"] K --> R["Cancelled { count, names }"]Dependencies
rusqlite = "0.39"withbundled(compiles SQLite from sourceinto the binary, ~+700 KB on release builds, +C compiler at build
time).
rusqlite_migration = "2.5",r2d2 = "0.8",r2d2_sqlite = "0.33".redb = "4.1",bincode = "1.3"..dep-age-allowlistentries needed.
Release binary size went from 9.3 MB (master) to 10 MB on this branch.
Test plan
cargo test --lib— 1790 unit tests pass (90 added understore::*,coordinator::adapters::*,coordinator::domain::reconcile::*, plus a regression test forthe legacy
coordinator.redb/repo-policy.jsonsweep)mise run clippy— zero warningsmise run fmt:check— cleanmise run test:manual -- --ci— full YAML scenario sweepmise run test:unit+build-check) — cleancoordinator.db,-wal,-shmfiles appear inper-repo state dir; concurrent reader works while writer alive;
legacy
coordinator.redb/repo-policy.jsonauto-wiped withstderr banner
Tab completion benchmark
Per the original migration plan,
daft __completeis on thedocumented hot path (fires every Tab keypress) and the SQLite open
adds work the meta.json scan didn't pay. Measured on the daft project
itself (298 invocation directories on disk, ~30 worktrees) with
hyperfine, comparing release binaries:
Branch is ~43 ms slower (1.35×). The
--version-only baseline forboth binaries is ~16 ms, so essentially all of the gap is the
SQLite open + PRAGMA bring-up + first reader checkout. The dominant
cost on either binary is the filesystem invocation scan that powers
the worktree drill-down, which is unchanged. 165 ms total stays
under the typical "feels sluggish" threshold (~200 ms).
The advisor flagged the ≤5 ms budget from the original plan as
speculative — order-of-magnitude is the real signal, and 1.35× is
not a regression that warrants the cache-file mitigation from the
plan's Risks section. If user feedback indicates otherwise, a
follow-up can switch the completion path off the pool onto a single
direct
Connection::open_with_flags(... READ_ONLY)and skip thebootstrap/migration check (we know the schema is current — the
binary was built with it).
Architectural direction
This PR also lands
ARCHITECTURE.mdat the reporoot, pinning the strategic direction the hexagonal coordinator
proves out:
was a single-adapter-file replacement; that's the property worth
preserving as new subsystems land.
reconcileis theworked example — pure function over a snapshot, no port callbacks
during computation. Future domain extractions (
cancel,lifecycle) follow this shape.pattern in
src/commands/; explicitly not extended into hexagonalceremony.
crate, narrow public surface, independent versioning. Not a
near-term refactor.
consumer that justifies lifting ports from inside-one-crate to
at-a-crate-edge.
daft-coordinator+daft-hooks+daft-trustbecome reusable libraries with IDE-supplied adapters for storage,
process control, filesystem.
CLAUDE.md's Architecture section gets a landpost paragraph pointing
at the new doc, so agents and contributors hit the strategic direction
before any operational detail. The follow-ups below are all concrete
steps toward this direction.
Known follow-ups
invocationsSQL table is forward-looking but not yetpopulated by production code (Tier-1 added the schema). The FK
from
jobs.invocation_id→invocations.invocation_idisintentionally deferred to the migration that adds the population
path; declaring it now would refuse every job insert.
process.rs(cancel,lifecycle) intocoordinator/domain/using thereconcile_active_jobspattern. Domain-level unit tests alreadycover the lifted function; the surrounding scaffold can come over
incrementally.
src/store/(newtrust.dbin~/.config/daft/).src/store/(newrepos.db).🤖 Generated with Claude Code