Skip to content

[FORGE-64] feat(orchestrator): question channel infrastructure (P2-T11)#87

Merged
firatcand merged 6 commits into
mainfrom
feat/FORGE-64-question-channel-infrastructure
May 14, 2026
Merged

[FORGE-64] feat(orchestrator): question channel infrastructure (P2-T11)#87
firatcand merged 6 commits into
mainfrom
feat/FORGE-64-question-channel-infrastructure

Conversation

@firatcand
Copy link
Copy Markdown
Owner

Summary

First link in the serial Phase 2 orchestrator chain (FORGE-31 → 20 → 21 → 32 → 22). Implements the filesystem-only question/answer channel that every downstream orchestrator task will import.

  • Atomic file placement via link(tmp, target) + unlink(tmp) — not rename. POSIX rename silently overwrites on Linux, which would violate the "never overwritten" invariant in spec/ORCHESTRATOR.md §"File semantics". link fails with EEXIST, giving OS-level enforcement. 8-way concurrent-writer race test confirms exactly one wins with DUPLICATE_ID.
  • TOCTOU discipline applied per docs/learnings/2026-Q2/toctou-between-stat-and-read-leaks-raw-fs-errors.md — every fs call individually wrapped, explicit IS_DIRECTORY regression test, size cap (QUESTION_FILE_MAX_BYTES) enforced before JSON.parse.
  • fs.watch mailbox watcher with 50ms debounce (absorbs FSEvents coalescing on macOS), .tmp filtering, typed QuestionWatcherEvent emission.
  • CLI surface: forge orchestrate {questions,answer,status,attach}. All four subcommands operate purely on filesystem state — they work whether or not a dispatcher is running. attach tolerates corrupt JSONL lines and dead dispatcher PIDs without crashing.
  • Spec doc updated to reflect the link+unlink atomicity primitive (closes the silent-clobber gap in the rename-based phrasing).

Architecture / contract

Downstream tasks import from src/orchestrator/questions/index.ts:

  • FORGE-20 (dispatcher): createQuestionWatcher, readQuestion, readAnswer, listOpenQuestions, QuestionChannelError
  • FORGE-32 (worker question writer): writeQuestionAtomic
  • Neither imports anything from src/cli/orchestrate-*.ts

Why

Question channel ships first because it has zero dispatcher dependency and every other Phase 2 orchestrator task imports its contracts. Built on top of the schemas + JSONL helpers shipped in PR #84 — this PR adds the I/O + CLI surface.

Test plan

  • npm run typecheck — clean
  • npm test — 602 pass / 8 skipped (pre-existing) / 0 fail (+46 new tests)
  • npm run build — dist/bin/forge.{cjs,mjs} build cleanly
  • Bundled CJS smoke — node dist/bin/forge.cjs --version, ... orchestrate questions --open, ... orchestrate (usage) all work
  • Gitleaks — no secrets leaked
  • Conventional commits — 5/5
  • Manual: write a question file from a worker fixture, watch it surface in forge orchestrate questions --open, answer it with forge orchestrate answer, verify the answer file lands atomically.

Acceptance criteria (from Linear)

  • writeQuestionAtomic + writeAnswerAtomic write via temp + fsync + link/unlink; no half-written file observable
  • readQuestion + readAnswer enforce QUESTION_FILE_MAX_BYTES (64KB) and zod schema; typed errors on reject
  • Question/answer files never overwrite; duplicate-id write rejected
  • fs.watch-based watcher fires on new question files and emits typed events; debounced for rapid bursts
  • forge orchestrate questions --open lists open questions, sorted by created_at
  • forge orchestrate answer <question_id> --option <id> [--note <text>] writes atomically; rejects duplicate
  • forge orchestrate status reads .forge/orchestrator/{run_id}/state.json without a live dispatcher
  • forge orchestrate attach tails .forge/orchestrator/{run_id}/notifications.jsonl and streams typed events
  • All CLI subcommands work on a fixture .forge/ tree without a dispatcher running
  • Unit tests cover atomic-write race + schema-rejection + oversized payload

Linked: closes FORGE-64

firatcand added 5 commits May 13, 2026 17:48
…r (T11a)

Adds the filesystem-only question/answer I/O primitives that downstream
orchestrator tasks (FORGE-20 dispatcher, FORGE-32 worker writer) will
import.

- `writeQuestionAtomic` / `writeAnswerAtomic` use `link()` + `unlink()` for
  the final placement (NOT `rename()`), so a duplicate-id race is rejected
  at the OS level with EEXIST → `DUPLICATE_ID`. The "never overwritten"
  invariant in spec/ORCHESTRATOR.md §"File semantics" is impossible to
  honor with rename on Linux, where POSIX rename silently clobbers.
- Per-call temp filename suffix (pid + monotonic counter + random) guards
  against same-pid concurrent writers (e.g. Promise.all on the same id).
- Every fs call in the chain is wrapped in its own try/catch per
  docs/learnings/2026-Q2/toctou-between-stat-and-read-leaks-raw-fs-errors.md
  — a stat success never implies a read success. Reader exposes an
  explicit `IS_DIRECTORY` code so the TOCTOU class is testable.
- Reader enforces `QUESTION_FILE_MAX_BYTES` (64KB) before JSON.parse;
  oversized junk files surface as `OVERSIZED` without ever reaching zod.
- `listOpenQuestions` filters by absence of corresponding answer file,
  skips `.tmp` leftovers, surfaces corrupt entries via `onSkip` without
  crashing the listing.

Tests cover: happy paths, duplicate rejection, schema rejection at the
write boundary, EISDIR / NOT_FOUND on every read code path, oversized
payload rejection, 8-way concurrent-writer race (exactly one wins),
temp-file cleanup on failure, listing sort + answer filtering.
`createQuestionWatcher` exposes a typed event stream over the
.forge/questions/ directory. Designed so the dispatcher (FORGE-20) can
subscribe without touching fs primitives directly.

- 50ms debounce per filename absorbs FSEvents coalescing on macOS and
  inotify rename pairs on Linux. Configurable via `debounceMs`.
- Filters `.tmp` files so events fired during the atomic write window
  never reach the consumer — only the canonical .json filename triggers
  a downstream readQuestion call.
- Suppresses `NOT_FOUND` errors silently (benign race between event and
  read); surfaces SCHEMA_INVALID, OVERSIZED, IS_DIRECTORY via onError so
  corrupted writes are operator-visible.
- Adds `src/orchestrator/questions/index.ts` barrel + re-exports the
  whole question channel surface from `src/orchestrator/index.ts` for
  one-import consumption by downstream tasks.

Tests cover: fires on write via writeQuestionAtomic, debounce coalesces
rapid writes to one event, .tmp leftovers ignored, stop() cancels
pending timers, corrupt JSON surfaces via onError.
Supervisor-side commands that operate purely on .forge/ filesystem
state — no dispatcher imports, so they work whether the orchestrator
is running or not.

- `forge orchestrate questions --open` lists open questions sorted by
  created_at, formatted as plain text (no markup — per ORCHESTRATOR.md
  §"Security posture" question contents are untrusted).
- `forge orchestrate answer <question_id> --option <id> [--note <text>]`
  validates the option against the question's allowed ids, refuses to
  overwrite an existing answer, and writes atomically. A concurrent
  supervisor writing the same answer id surfaces as a typed
  DUPLICATE_ID → "answer was written by a concurrent supervisor"
  rather than data loss.

Both commands accept injectable stdout/stderr streams and (for answer)
an injectable clock — tests capture output deterministically with a
PassThrough rather than mocking console globals.
Closes out the question-channel CLI surface and wires it into the forge
binary. All four subcommands now reachable via
`forge orchestrate <questions|answer|status|attach>`.

- `forge orchestrate status [<run_id>]` reads
  .forge/orchestrator/{run_id}/state.json without requiring a live
  dispatcher. Auto-detects the latest run via lexicographic sort over
  UUIDv7-prefixed run ids. State file capped at 1MB (distinct from the
  64KB cap on question files — state.json grows with worker count).
- `forge orchestrate attach [<run_id>]` tails
  .forge/orchestrator/{run_id}/notifications.jsonl, formatting events
  via the existing tryParseEventLine helper. Replays history first, then
  follows new lines via fs.watch. Corrupt lines surface as `[warn]` on
  stderr without crashing the supervisor (per ORCHESTRATOR.md §"JSONL
  notification stream").
- attach checks pid liveness via process.kill(pid, 0) and warns when the
  dispatcher has exited — tailing historical logs is still useful.
- src/bin/forge.ts gains a `dispatchOrchestrate()` block that hand-parses
  --flag and --flag=value forms (keeping the existing no-commander
  convention from `init`).

Replaces the test that previously asserted `forge orchestrate` was an
unimplemented command — now exercises the dispatch + a smoke for
`questions --open`.
Reflects the FORGE-64 implementation in spec. POSIX rename silently
clobbers an existing target on Linux, which would let two concurrent
writers on the same question_id stomp each other and quietly violate
the "never overwritten" rule the very next row asserts. link+unlink
fails with EEXIST when the target exists, making the invariant true at
the OS level rather than aspirational at the docs level.
@firatcand firatcand merged commit 45a8159 into main May 14, 2026
9 checks passed
@firatcand firatcand deleted the feat/FORGE-64-question-channel-infrastructure branch May 14, 2026 06:14
firatcand added a commit that referenced this pull request May 14, 2026
… (Codex review) (#95)

Closes 4 implementation bugs and 2 cheap enhancements that an earlier Codex
review of FORGE-64 (PR #87) surfaced in the question/answer atomic writer.
Architectural choice (link+unlink over rename for OS-level enforcement of
the "never overwritten" invariant) is unchanged.

- Bug #1: try/finally ensures temp-file cleanup on every code path including
  DUPLICATE_ID. Cleanup ownership moved from inside placeAtomic to the
  caller's finally block — single unconditional unlink point.
- Bug #2: writeSync loop with Buffer offsets. Short writes are recoverable
  (NFS / quota / signal-interrupted); writeSync returning 0 with bytes still
  to write is treated as IO_ERROR rather than an infinite-loop trigger.
- Bug #3: closeSync errors surface as IO_ERROR when no prior write/fsync
  error exists (delayed write-back on NFS/quota paths per close(2)). Prior
  errors win — close error is informational, prior error is actionable.
- Bug #4: payload byte-size cap enforced at the write boundary
  (PAYLOAD_TOO_LARGE) before any disk I/O. Schema caps strings by UTF-16
  code units; the cap is enforced in UTF-8 bytes. A schema-valid payload of
  4-byte-per-codepoint characters can exceed 64KB even while passing zod
  validation — the cap closes that gap.
- Enh #6: temp filenames use crypto.randomBytes(8).toString('hex') instead
  of Math.random(). Closes the hostile-same-UID-process question definitively.
- Enh #7: durability contract documented in spec/ORCHESTRATOR.md "File
  semantics" — file fsync protects against partial writes, not power-loss
  durability of placement; dispatcher reconciles from tracker on restart.
- Enh #5: deferred to plans/BACKLOG.md (capability probe → forge doctor).

Internal: adds __fsForTesting export on src/orchestrator/questions/writer.ts
as a test seam. node:test's mock.method cannot patch the frozen node:fs
Module Namespace directly; the seam is a plain mutable object holding fs
method references used internally by the writer. Tests use mock.method on
this object to simulate partial writes, closeSync failures, and validate
the random-bytes temp-name shape. Not part of the public API.

Test coverage: +8 unit tests covering each bug and Enh #6. Total 16 tests
in writer.test.ts (was 8), all pass. Full suite 609 pass / 0 fail / 8
pre-existing skipped. Build OK, smoke ./dist/bin/forge.cjs --version OK.

Codex second-opinion review on the diff: no findings.

Unblocks FORGE-20 (dispatcher core).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant