feat(llmo-config): fail-closed writeConfig validation (SITES-43238)#1574
feat(llmo-config): fail-closed writeConfig validation (SITES-43238)#1574
Conversation
Step 2 of SITES-43238. Step 1 (writer-side filter in spacecat-audit-worker) merged 2026-05-04 and is soaking in prod. This plan covers: - Phase 1: caller audit across consumer repos (pre-flight) - Phase 2: implementation in writeConfig + LlmoConfigValidationError - Phase 3: release coordination (semantic-release major bump + dep bumps) - Phase 4: production verification (>=48h soak watching for new errors) Each phase has explicit validation gates per workspace CLAUDE.md. Refs: SITES-43238 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Audit complete (2026-05-04). Two production callers of writeConfig: - spacecat-audit-worker drs-config-writer.js:200 — already filtered by the writer-side guard in PR #2442 (step 1, merged 2026-05-04). - spacecat-api-service llmo.js:543 — already calls llmoConfigSchema.safeParse before writeConfig and returns 400 on failure, so fail-closed inside writeConfig is a no-op for it. No fix-at-edge work required before Phase 2. Refs: SITES-43238 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Validate the LLMO configuration against its Zod schema before issuing the S3 PutObject. On validation failure, throw LlmoConfigValidationError without calling S3, so invalid configs cannot reach the bucket through this function. The error class carries the offending siteId and the Zod issue list so callers and log readers can identify the failing fields without re-parsing the original error. This closes the asymmetry between readConfig (which already fails closed since 2025) and writeConfig (which was fail-open until now), and prevents the SITES-43238 class of corruption at the platform level — not just at the DRS edge that step 1 fixed. Step 2 of SITES-43238. Phase 1 caller audit (committed in the previous commit) confirmed both production callers (audit-worker DRS writer, api-service LLMO controller) are safe — neither emits invalid configs. Tests added in writeConfig describe block: invalid category region, missing required field, and an explicit assertion that the thrown error carries name, siteId, and Zod issues. BREAKING CHANGE: writeConfig now throws LlmoConfigValidationError when the supplied config does not match the published llmoConfig schema. Previously, writeConfig accepted any object and persisted it verbatim. Existing callers that produce schema-valid configs are unaffected; any caller producing invalid configs must fix the source before upgrading. Refs: SITES-43238 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
danieljchuser
left a comment
There was a problem hiding this comment.
Hey @dzehnder,
Strengths
- Validation runs before S3 I/O (
src/llmo-config.js:131-135), structurally guaranteeing fail-closed: invalid configs cannot reach the bucket regardless of which caller evolves next. - Closes a real read/write asymmetry latent since 2025.
readConfighas been fail-closed; this bringswriteConfigto parity at the platform layer, adding defense-in-depth without duplicating the edge-level filtering from step 1. LlmoConfigValidationErrorcarries structured fields (siteId,issues) with an explicitly setname, giving consumers something specific to catch and log without re-parsing.- Uses
safeParse(notparse), keeping control flow clean - no catching a thrown ZodError just to re-wrap it. - Tests assert both the throw type and the
s3Client.sendnot-called invariant (test/llmo-config.test.js:204-244), which is exactly the right shape for a fail-closed contract. - No new dependencies, no lockfile changes. Reuses the existing Zod schema from
schemas.js, so there is no new supply-chain surface. - BREAKING CHANGE footer with a documented consumer-bump order (audit-worker then api-service) is the right release primitive. The caller audit in the plan doc is auditable with file:line references, not just asserted.
Issues
Minor (Nice to Have)
-
src/llmo-config.js:25- Missingcauseon Error constructor.LlmoConfigValidationErrordoes not pass{ cause: zodError }tosuper(), so the original Zod error's stack is lost. The.issuesarray preserves the useful payload, but passingcauseis one line and improves debuggability without changing the public shape:super(`LLMO config for site ${siteId} failed schema validation: ${summary}`, { cause: zodError });
-
src/llmo-config.js:24-25- Summary string includes Zodi.message, which may echo user-supplied values. Some Zod validators surface received values in default message text. Since the LLMO config includes user-curated content (brands, competitors, regions) on the api-service path, invalid values could be reflected into CloudWatch/Coralogix logs via the error message. Low-impact in practice (the api-service pre-validates and returns 400 before reachingwriteConfig), but the safer pattern usesi.codeinstead ofi.messagein the summary:.map((i) => `${i.path.join('.')}: ${i.code}`)
This keeps the full human-readable detail in
this.issuesfor trusted callers to inspect. -
test/llmo-config.test.js:240- Message-format contract is under-asserted. The error class promises log readers can "identify the failing fields without re-parsing," but the test only assertscaught.messageincludessiteId. If someone later changes the summary shape, the test still passes. A small assertion likeexpect(caught.message).to.match(/categories\..+\.region/)would lock in the formatting contract.
Recommendations
- The summary string is unbounded - a config with many failing fields produces a long error message. Not worth gating the merge, but if these errors show up in alerts, consider truncating to the first N issues with a
(+M more)suffix and leaving the full list on.issues. readConfigcurrently lets rawZodErrorpropagate on parse failure, whilewriteConfignow throws the typedLlmoConfigValidationError. Consumers wanting to catch "bad LLMO config for site X" will need two catch shapes. A small follow-up wrappingreadConfig's parse failure in the same (or sibling) error type would close the asymmetry fully.
Assessment
Ready to merge? Yes.
The change is small, well-scoped, and correct. Validation is placed before the side effect, errors are structured, tests verify the no-S3-call invariant, and CI is green. The three minor items above are optional polish - the cause addition being the cheapest win. The caller audit and release plan are thorough.
danieljchuser
left a comment
There was a problem hiding this comment.
Approving - clean change, no blockers. Minor polish items noted in the earlier comment review.
Address minor polish items from PR #1574 review: - Pass { cause: zodError } to super() so the original ZodError stack is preserved on the cause chain. Improves debuggability without changing the public error shape. - Use issue `code` instead of `message` in the summary string. Zod's default messages can echo received values, and the LLMO config can carry user-curated content (brand names, competitor URLs) on the api-service write path. The full human-readable detail (including the values) remains on `this.issues` for trusted callers to inspect. - Lock down the message-format contract with a regex assertion so the path-shape (e.g. categories.<uuid>.region) is not silently dropped by a future maintainer. - Switch the test fixture UUID to a v4-conformant value so Zod's validation reaches the inner field rather than failing at the record key. Refs: SITES-43238 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
This PR will trigger a major release when merged. |
|
Thanks @danieljchuser — appreciate the careful review. Pushed a4335cc addressing the three Minor polish items. Applied
While in there I also fixed the test fixture UUID — it was Deferred
CI green, 1038 tests passing in PR remains approved; merge is gated only on the step 1 soak window (≥24h since 2026-05-04 08:42 UTC) and the Coralogix audit-worker error-rate check. |
…tionError Closes the read/write asymmetry surfaced in PR #1574 review. readConfig previously let raw ZodError propagate from llmoConfig.parse on schema failure, while writeConfig now throws the typed LlmoConfigValidationError. This commit aligns readConfig: switch to safeParse, throw LlmoConfigValidationError(siteId, zodError) on failure, and propagate result.data on success. The error type change is a runtime contract change for any caller that catches ZodError specifically. In practice today's only consumer (spacecat-api-service llmo controller) propagates readConfig errors to a 5xx unchanged. The major-version bump from the prior writeConfig commit's BREAKING CHANGE footer covers this together. Closes SITES-43908. Refs: SITES-43238. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Update: rolled the readConfig wrapping into this PR after all. Pushed 3ee0c5b — readConfig now uses Why in-PR rather than as the separate follow-up I originally filed (SITES-43908):
Plan doc updated to record readConfig as rolled-in scope.
|
## [@adobe/spacecat-shared-utils-v1.114.0](https://github.com/adobe/spacecat-shared/compare/@adobe/spacecat-shared-utils-v1.113.0...@adobe/spacecat-shared-utils-v1.114.0) (2026-05-04) ### Features * **llmo-config:** fail-closed writeConfig validation (SITES-43238) ([#1574](#1574)) ([a177743](a177743))
|
🎉 This PR is included in version @adobe/spacecat-shared-utils-v1.114.0 🎉 The release is available on: Your semantic-release bot 📦🚀 |
Summary
Step 2 of SITES-43238. Closes the asymmetry between
readConfig(already fails closed since 2025) andwriteConfig(fail-open until now). Step 1 (writer-side filter inspacecat-audit-workerPR #2442) merged 2026-05-04 and stops new bad data at the DRS edge; this PR backstops it at the platform level.What this PR does
In
packages/spacecat-shared-utils/src/llmo-config.js:LlmoConfigValidationError(extendsError) carryingsiteIdand Zodissues. Message includes a one-line summary of issue paths so log lines are diagnosable without inspecting the error object.writeConfignow callsllmoConfig.safeParse(config)beforePutObject. On failure, throwsLlmoConfigValidationErrorand does not call S3. Successful path is unchanged — the originalconfigis still what gets serialized to the bucket, so byte-for-byte output for schema-valid input is preserved.LlmoConfigValidationErroras a top-level named export fromsrc/index.js.Plan doc with full phase breakdown lives at
packages/spacecat-shared-utils/docs/plans/2026-05-04-llmo-config-fail-closed-writeconfig.md— included in this PR for reviewer context.Phase 1 caller audit (pre-flight)
Two production callers of
writeConfigacross all spacecat-* consumer repos:configsrc/drs-prompt-generation/drs-config-writer.js:200src/controllers/llmo/llmo.js:543llmoConfigSchema.safeParse(newConfig)at line 533 and returns HTTP 400 on failureNo fix-at-edge work was required before this PR. See the plan doc for the full audit and the validation gate.
Test plan
writeConfigtests still pass without modification (success path is unchanged).describe('writeConfig', ...)block:LlmoConfigValidationErrorwhen category region is non-alpha-2 ('en-us')LlmoConfigValidationErrorwhen a required field is missing (entitiesdeleted)name,siteId, and a non-emptyissuesarray;s3Client.sendis never calledLlmoConfigValidationErrorexport.npm test -w packages/spacecat-shared-utils— 1038 passing,llmo-config.jsat 100% coverage.npm run lint -w packages/spacecat-shared-utils— clean.Release coordination
Commit message uses
feat(llmo-config):with aBREAKING CHANGE:footer documenting the new throw behavior, so semantic-release will publish a major-version bump for@adobe/spacecat-shared-utils.After release, dependency bumps in this order:
spacecat-audit-worker(DRS prompt writer, depends on step 1 already being live)spacecat-api-service(LLMO config endpoints — already pre-validates, fail-closed is a no-op)Each consumer's CI is the gate. If a consumer's tests start failing on dep bump, that indicates a Phase 1 miss and must be fixed at the edge.
Branch name note
This branch (
docs/SITES-43238-step-2-plan) was originally created for the plan doc only. The plan + implementation landed together in two commits —docs:for the audit findings update,feat:for the implementation — so reviewers can see the audit context that justifies the implementation.🤖 Generated with Claude Code