[FORGE-64] feat(orchestrator): question channel infrastructure (P2-T11)#87
Merged
Merged
Conversation
…r (T11a) Adds the filesystem-only question/answer I/O primitives that downstream orchestrator tasks (FORGE-20 dispatcher, FORGE-32 worker writer) will import. - `writeQuestionAtomic` / `writeAnswerAtomic` use `link()` + `unlink()` for the final placement (NOT `rename()`), so a duplicate-id race is rejected at the OS level with EEXIST → `DUPLICATE_ID`. The "never overwritten" invariant in spec/ORCHESTRATOR.md §"File semantics" is impossible to honor with rename on Linux, where POSIX rename silently clobbers. - Per-call temp filename suffix (pid + monotonic counter + random) guards against same-pid concurrent writers (e.g. Promise.all on the same id). - Every fs call in the chain is wrapped in its own try/catch per docs/learnings/2026-Q2/toctou-between-stat-and-read-leaks-raw-fs-errors.md — a stat success never implies a read success. Reader exposes an explicit `IS_DIRECTORY` code so the TOCTOU class is testable. - Reader enforces `QUESTION_FILE_MAX_BYTES` (64KB) before JSON.parse; oversized junk files surface as `OVERSIZED` without ever reaching zod. - `listOpenQuestions` filters by absence of corresponding answer file, skips `.tmp` leftovers, surfaces corrupt entries via `onSkip` without crashing the listing. Tests cover: happy paths, duplicate rejection, schema rejection at the write boundary, EISDIR / NOT_FOUND on every read code path, oversized payload rejection, 8-way concurrent-writer race (exactly one wins), temp-file cleanup on failure, listing sort + answer filtering.
`createQuestionWatcher` exposes a typed event stream over the .forge/questions/ directory. Designed so the dispatcher (FORGE-20) can subscribe without touching fs primitives directly. - 50ms debounce per filename absorbs FSEvents coalescing on macOS and inotify rename pairs on Linux. Configurable via `debounceMs`. - Filters `.tmp` files so events fired during the atomic write window never reach the consumer — only the canonical .json filename triggers a downstream readQuestion call. - Suppresses `NOT_FOUND` errors silently (benign race between event and read); surfaces SCHEMA_INVALID, OVERSIZED, IS_DIRECTORY via onError so corrupted writes are operator-visible. - Adds `src/orchestrator/questions/index.ts` barrel + re-exports the whole question channel surface from `src/orchestrator/index.ts` for one-import consumption by downstream tasks. Tests cover: fires on write via writeQuestionAtomic, debounce coalesces rapid writes to one event, .tmp leftovers ignored, stop() cancels pending timers, corrupt JSON surfaces via onError.
Supervisor-side commands that operate purely on .forge/ filesystem state — no dispatcher imports, so they work whether the orchestrator is running or not. - `forge orchestrate questions --open` lists open questions sorted by created_at, formatted as plain text (no markup — per ORCHESTRATOR.md §"Security posture" question contents are untrusted). - `forge orchestrate answer <question_id> --option <id> [--note <text>]` validates the option against the question's allowed ids, refuses to overwrite an existing answer, and writes atomically. A concurrent supervisor writing the same answer id surfaces as a typed DUPLICATE_ID → "answer was written by a concurrent supervisor" rather than data loss. Both commands accept injectable stdout/stderr streams and (for answer) an injectable clock — tests capture output deterministically with a PassThrough rather than mocking console globals.
Closes out the question-channel CLI surface and wires it into the forge
binary. All four subcommands now reachable via
`forge orchestrate <questions|answer|status|attach>`.
- `forge orchestrate status [<run_id>]` reads
.forge/orchestrator/{run_id}/state.json without requiring a live
dispatcher. Auto-detects the latest run via lexicographic sort over
UUIDv7-prefixed run ids. State file capped at 1MB (distinct from the
64KB cap on question files — state.json grows with worker count).
- `forge orchestrate attach [<run_id>]` tails
.forge/orchestrator/{run_id}/notifications.jsonl, formatting events
via the existing tryParseEventLine helper. Replays history first, then
follows new lines via fs.watch. Corrupt lines surface as `[warn]` on
stderr without crashing the supervisor (per ORCHESTRATOR.md §"JSONL
notification stream").
- attach checks pid liveness via process.kill(pid, 0) and warns when the
dispatcher has exited — tailing historical logs is still useful.
- src/bin/forge.ts gains a `dispatchOrchestrate()` block that hand-parses
--flag and --flag=value forms (keeping the existing no-commander
convention from `init`).
Replaces the test that previously asserted `forge orchestrate` was an
unimplemented command — now exercises the dispatch + a smoke for
`questions --open`.
Reflects the FORGE-64 implementation in spec. POSIX rename silently clobbers an existing target on Linux, which would let two concurrent writers on the same question_id stomp each other and quietly violate the "never overwritten" rule the very next row asserts. link+unlink fails with EEXIST when the target exists, making the invariant true at the OS level rather than aspirational at the docs level.
This was referenced May 13, 2026
This was referenced May 14, 2026
firatcand
added a commit
that referenced
this pull request
May 14, 2026
… (Codex review) (#95) Closes 4 implementation bugs and 2 cheap enhancements that an earlier Codex review of FORGE-64 (PR #87) surfaced in the question/answer atomic writer. Architectural choice (link+unlink over rename for OS-level enforcement of the "never overwritten" invariant) is unchanged. - Bug #1: try/finally ensures temp-file cleanup on every code path including DUPLICATE_ID. Cleanup ownership moved from inside placeAtomic to the caller's finally block — single unconditional unlink point. - Bug #2: writeSync loop with Buffer offsets. Short writes are recoverable (NFS / quota / signal-interrupted); writeSync returning 0 with bytes still to write is treated as IO_ERROR rather than an infinite-loop trigger. - Bug #3: closeSync errors surface as IO_ERROR when no prior write/fsync error exists (delayed write-back on NFS/quota paths per close(2)). Prior errors win — close error is informational, prior error is actionable. - Bug #4: payload byte-size cap enforced at the write boundary (PAYLOAD_TOO_LARGE) before any disk I/O. Schema caps strings by UTF-16 code units; the cap is enforced in UTF-8 bytes. A schema-valid payload of 4-byte-per-codepoint characters can exceed 64KB even while passing zod validation — the cap closes that gap. - Enh #6: temp filenames use crypto.randomBytes(8).toString('hex') instead of Math.random(). Closes the hostile-same-UID-process question definitively. - Enh #7: durability contract documented in spec/ORCHESTRATOR.md "File semantics" — file fsync protects against partial writes, not power-loss durability of placement; dispatcher reconciles from tracker on restart. - Enh #5: deferred to plans/BACKLOG.md (capability probe → forge doctor). Internal: adds __fsForTesting export on src/orchestrator/questions/writer.ts as a test seam. node:test's mock.method cannot patch the frozen node:fs Module Namespace directly; the seam is a plain mutable object holding fs method references used internally by the writer. Tests use mock.method on this object to simulate partial writes, closeSync failures, and validate the random-bytes temp-name shape. Not part of the public API. Test coverage: +8 unit tests covering each bug and Enh #6. Total 16 tests in writer.test.ts (was 8), all pass. Full suite 609 pass / 0 fail / 8 pre-existing skipped. Build OK, smoke ./dist/bin/forge.cjs --version OK. Codex second-opinion review on the diff: no findings. Unblocks FORGE-20 (dispatcher core). Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Merged
11 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
First link in the serial Phase 2 orchestrator chain (FORGE-31 → 20 → 21 → 32 → 22). Implements the filesystem-only question/answer channel that every downstream orchestrator task will import.
link(tmp, target)+unlink(tmp)— notrename. POSIX rename silently overwrites on Linux, which would violate the "never overwritten" invariant inspec/ORCHESTRATOR.md§"File semantics".linkfails withEEXIST, giving OS-level enforcement. 8-way concurrent-writer race test confirms exactly one wins withDUPLICATE_ID.docs/learnings/2026-Q2/toctou-between-stat-and-read-leaks-raw-fs-errors.md— everyfscall individually wrapped, explicitIS_DIRECTORYregression test, size cap (QUESTION_FILE_MAX_BYTES) enforced beforeJSON.parse.fs.watchmailbox watcher with 50ms debounce (absorbs FSEvents coalescing on macOS),.tmpfiltering, typedQuestionWatcherEventemission.forge orchestrate {questions,answer,status,attach}. All four subcommands operate purely on filesystem state — they work whether or not a dispatcher is running.attachtolerates corrupt JSONL lines and dead dispatcher PIDs without crashing.Architecture / contract
Downstream tasks import from
src/orchestrator/questions/index.ts:createQuestionWatcher,readQuestion,readAnswer,listOpenQuestions,QuestionChannelErrorwriteQuestionAtomicsrc/cli/orchestrate-*.tsWhy
Question channel ships first because it has zero dispatcher dependency and every other Phase 2 orchestrator task imports its contracts. Built on top of the schemas + JSONL helpers shipped in PR #84 — this PR adds the I/O + CLI surface.
Test plan
npm run typecheck— cleannpm test— 602 pass / 8 skipped (pre-existing) / 0 fail (+46 new tests)npm run build— dist/bin/forge.{cjs,mjs} build cleanlynode dist/bin/forge.cjs --version,... orchestrate questions --open,... orchestrate(usage) all workforge orchestrate questions --open, answer it withforge orchestrate answer, verify the answer file lands atomically.Acceptance criteria (from Linear)
writeQuestionAtomic+writeAnswerAtomicwrite via temp + fsync + link/unlink; no half-written file observablereadQuestion+readAnswerenforceQUESTION_FILE_MAX_BYTES(64KB) and zod schema; typed errors on rejectforge orchestrate questions --openlists open questions, sorted bycreated_atforge orchestrate answer <question_id> --option <id> [--note <text>]writes atomically; rejects duplicateforge orchestrate statusreads.forge/orchestrator/{run_id}/state.jsonwithout a live dispatcherforge orchestrate attachtails.forge/orchestrator/{run_id}/notifications.jsonland streams typed events.forge/tree without a dispatcher runningLinked: closes FORGE-64