feat(gateway): fail-fast provider semantics self-test at boot#739
Conversation
fro-bot
left a comment
There was a problem hiding this comment.
Verdict: PASS
This is a clean, well-scoped safety improvement. It wires the existing validateProviderSemantics self-test into the boot sequence so a misconfigured object-store provider (one that silently ignores If-None-Match/If-Match) aborts startup before the gateway connects to Discord or serves traffic. The injection follows the established GatewayProgramDeps pattern, the placement is correct, and the new tests cover both the ordering guarantee and the fail-fast path.
Verification performed
- Traced the error-propagation chain:
validateProviderSemantics(Result) →validateProviderSemanticsEffect(Effect<void, Error>) →main.tsEffect.runPromise(rejects on failure) →program.tsEffect.tryPromise(catches + re-fails). Thecatchnormalizes non-Errorrejections vianew Error(String(error)), so the message survives the double-wrap. Theboot fails fasttest confirmsloginis never called. - Confirmed placement (
program.ts:138-141): the self-test runs aftercreateS3Adapterand beforestartAnnounceServer,installShutdownHandlers, andlogin. A failed probe therefore never opens the HTTP ingress or registers shutdown handlers — clean abort, no partial boot. - Ran
pnpm --filter @fro-bot/gateway test: 681 passed (34 files).
Blocking issues
None.
Non-blocking concerns
program.ts:131-137constructsselfTestCoordConfigwith the same five fields (storeAdapter,storeConfig,lockTtlSeconds,heartbeatIntervalMs,staleThresholdMs) duplicated atmentionDeps(lines 259-265) andrecoverStaleRuns(lines 333-339). Three literal copies of the coordination-config shape now exist in one function; a smallmakeCoordinationConfig(s3Adapter, config)helper would remove the drift risk. Not blocking — purely a maintainability nit, and the TTL/heartbeat/stale constants are the module defaults in every case.- The self-test issues a real write+delete probe against the object store on every boot. This is the documented intent (fail-fast over silent corruption) and the probe key is namespaced (
self-test/_probe/locks/semantics.json), so it won't collide with live coordination state. Worth keeping in mind for environments where boot frequency is high, but acceptable for a single-process gateway.
Missing tests
None that block. The two new tests (self-test runs once before login; boot rejects without connecting when the self-test fails) cover the behavior this PR adds. The underlying semantics matrix (IfNoneMatch/IfMatch phases) is already exercised by runtime/src/coordination/self-test.test.ts, and the Effect wrapper by runtime-effect.test.ts. One optional addition: assert the self-test runs before startAnnounceServer (currently only ordering vs. login is asserted) — minor, since the announce server starts after the self-test in source order regardless.
Risk assessment (LOW/MED/HIGH): LOW
Additive, dependency-injected change confined to the gateway boot path. No public API surface changes, no behavior change for an already-correctly-configured provider (the probe succeeds and boot proceeds identically). Worst-case regression is a gateway that refuses to start against a non-compliant store — which is the explicit, desired fail-closed behavior. Blast radius is limited to startup; runtime request handling is untouched. The plan-doc edits are documentation reconciliation only.
Run Summary
| Field | Value |
|---|---|
| Event | pull_request |
| Repository | fro-bot/agent |
| Run ID | 26856145401 |
| Cache | hit |
| Session | ses_1751ef036ffe1akU51JQzX4G55 |
fro-bot
left a comment
There was a problem hiding this comment.
Verdict: PASS
Re-review after commit 020732d. The core change is unchanged from my prior review (which was dismissed): it wires the existing validateProviderSemantics self-test into the boot sequence so a misconfigured object-store provider aborts startup before the gateway connects to Discord or serves traffic. The new commit extracts a makeCoordinationConfig helper, directly resolving the maintainability nit I raised previously. The change is clean and behavior-preserving.
Verification performed
- Confirmed
makeCoordinationConfig(program.ts:34-42) is a pure helper typed asCoordinationConfig, returning the exact same five fields (storeAdapter,storeConfig,lockTtlSeconds,heartbeatIntervalMs,staleThresholdMs) that were previously duplicated. It is now applied at all three call sites: self-test (:147), mention handler (:269), and stale-run recovery (:337). No values changed — pure deduplication. - Re-traced the fail-fast chain:
validateProviderSemantics(Result) →validateProviderSemanticsEffect(Effect<void, Error>) →main.tsEffect.runPromise→program.tsEffect.tryPromise(catch normalizes non-Errorvianew Error(String(error))). Placement is still correct: self-test runs aftercreateS3Adapterand beforestartAnnounceServer,installShutdownHandlers, andlogin, so a failed probe never opens the HTTP ingress. - Ran
pnpm --filter @fro-bot/gateway test: 681 passed (34 files).
Blocking issues
None.
Non-blocking concerns
- The self-test issues a real write+delete probe against the object store on every boot. This is the documented intent (fail-fast over silent corruption) and the probe key is namespaced (
self-test/_probe/locks/semantics.json), so it won't collide with live coordination state. Acceptable for a single-process gateway; flagged only for awareness in high-boot-frequency environments.
Missing tests
None that block. The two boot tests (self-test runs once before login; boot rejects without connecting when the self-test fails) cover the added behavior; the semantics matrix is covered by runtime/src/coordination/self-test.test.ts and the Effect wrapper by runtime-effect.test.ts. The makeCoordinationConfig extraction needs no dedicated test — it is exercised transitively by the existing 681 passing tests and changes no values. Optional (unchanged from prior review): assert self-test ordering vs. startAnnounceServer, not just login.
Risk assessment (LOW/MED/HIGH): LOW
Additive, dependency-injected change confined to the gateway boot path, plus a behavior-preserving local refactor. No public API surface changes; no behavior change for a correctly-configured provider. Worst-case regression is a gateway that refuses to start against a non-compliant store — the explicit, desired fail-closed behavior. The refactor's blast radius is internal to makeGatewayProgram and fully covered by the green test suite. Plan-doc edits are documentation reconciliation only.
Run Summary
| Field | Value |
|---|---|
| Event | pull_request |
| Repository | fro-bot/agent |
| Run ID | 26856437316 |
| Cache | hit |
| Session | ses_1751ef036ffe1akU51JQzX4G55 |
Wires the coordination provider semantics self-test into the gateway boot sequence so a misconfigured object-store provider fails startup before the gateway connects to Discord or serves traffic.
What
The gateway's lock and run-state coordination depends on the object-store honoring conditional writes (
If-None-Match/If-Match). That contract was validated byvalidateProviderSemanticsbut never exercised at boot — a provider that silently ignores conditional headers would only surface as corruption during a real lock operation.This runs the self-test fail-fast after the S3 adapter is created and before
client.login(). A provider that doesn't honor conditional semantics now aborts startup with a clear error instead of booting into a broken coordination state.runProviderSelfTeston the program deps (real impl inmain.ts, stubbed in tests), consistent with the existing readiness/login injection pattern.Also reconciles the plan docs: the
no-fro-botblock role is dropped as a redundant deny-list over the existing allow-list authorization model (exclude a user by not granting the trigger role).