feat(llmo-config): fail-closed writeConfig validation (SITES-43238) by dzehnder · Pull Request #1574 · adobe/spacecat-shared

dzehnder · 2026-05-04T09:14:27Z

Summary

Step 2 of SITES-43238. Closes the asymmetry between readConfig (which has failed closed since 2025) and writeConfig (fail-open until now). Step 1 (the writer-side filter in spacecat-audit-worker PR #2442) merged on 2026-05-04 and stops new bad data at the DRS edge; this PR backstops it at the platform level.

What this PR does

In packages/spacecat-shared-utils/src/llmo-config.js:

Add LlmoConfigValidationError (extends Error) carrying siteId and Zod issues. Message includes a one-line summary of issue paths so log lines are diagnosable without inspecting the error object.
writeConfig now calls llmoConfig.safeParse(config) before PutObject. On failure, throws LlmoConfigValidationError and does not call S3. Successful path is unchanged — the original config is still what gets serialized to the bucket, so byte-for-byte output for schema-valid input is preserved.
Re-export LlmoConfigValidationError as a top-level named export from src/index.js.
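
As a rough illustration of the flow described above, here is a minimal sketch rather than the code in this PR. The `llmoConfig` object is a hand-written stand-in that only mimics the result shape of Zod's `safeParse`; the alpha-2 region rule, the `writeConfig(siteId, config, s3Client)` signature, and the object passed to `s3Client.send` are all assumptions made for the example.

```javascript
// Stand-in for the real Zod schema: only the safeParse result shape is modelled.
// The single rule below (two-letter region) is a hypothetical simplification.
const llmoConfig = {
  safeParse(config) {
    const issues = [];
    for (const [id, category] of Object.entries(config.categories ?? {})) {
      if (!/^[a-z]{2}$/i.test(category.region ?? '')) {
        issues.push({ path: ['categories', id, 'region'], code: 'custom', message: 'Invalid region' });
      }
    }
    return issues.length === 0
      ? { success: true, data: config }
      : { success: false, error: { issues } };
  },
};

class LlmoConfigValidationError extends Error {
  constructor(siteId, zodError) {
    // One-line summary of the failing paths so a log line is readable on its own.
    const paths = zodError.issues.map((i) => i.path.join('.')).join(', ');
    super(`LLMO config for site ${siteId} failed schema validation: ${paths}`);
    this.name = 'LlmoConfigValidationError';
    this.siteId = siteId;
    this.issues = zodError.issues;
  }
}

// Hypothetical signature; the real function issues an S3 PutObject.
async function writeConfig(siteId, config, s3Client) {
  const result = llmoConfig.safeParse(config);
  if (!result.success) {
    // Fail closed: throw before any S3 call is made.
    throw new LlmoConfigValidationError(siteId, result.error);
  }
  // Serialize the original config (not result.data) so output for valid input is unchanged.
  await s3Client.send({ key: `${siteId}.json`, body: JSON.stringify(config) });
}
```

Serializing `config` rather than `result.data` is what keeps the output byte-for-byte identical for schema-valid input, since a schema may strip or transform fields in its parsed result.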

Plan doc with full phase breakdown lives at packages/spacecat-shared-utils/docs/plans/2026-05-04-llmo-config-fail-closed-writeconfig.md — included in this PR for reviewer context.

Phase 1 caller audit (pre-flight)

Two production callers of writeConfig across all spacecat-* consumer repos:

| Repo | File:line | Source of `config` | Risk |
| --- | --- | --- | --- |
| spacecat-audit-worker | `src/drs-prompt-generation/drs-config-writer.js:200` | DRS prompts merged into config | Low: already filtered by PR #2442 |
| spacecat-api-service | `src/controllers/llmo/llmo.js:543` | User input via PUT endpoint | None: controller already calls `llmoConfigSchema.safeParse(newConfig)` at line 533 and returns HTTP 400 on failure |

No fix-at-edge work was required before this PR. See the plan doc for the full audit and the validation gate.

Test plan

- Existing writeConfig tests still pass without modification (success path is unchanged).
- New tests in the describe('writeConfig', ...) block:
  - throws LlmoConfigValidationError when category region is non-alpha-2 ('en-us')
  - throws LlmoConfigValidationError when a required field is missing (entities deleted)
  - thrown error carries name, siteId, and a non-empty issues array; s3Client.send is never called
- Index exports test updated to include the new top-level LlmoConfigValidationError export.
- npm test -w packages/spacecat-shared-utils: 1038 passing, llmo-config.js at 100% coverage.
- npm run lint -w packages/spacecat-shared-utils: clean.

Release coordination

Commit message uses feat(llmo-config): with a BREAKING CHANGE: footer documenting the new throw behavior, so semantic-release will publish a major-version bump for @adobe/spacecat-shared-utils.

After release, bump dependencies in this order:

1. spacecat-audit-worker (DRS prompt writer, depends on step 1 already being live)
2. spacecat-api-service (LLMO config endpoints; already pre-validates, so fail-closed is a no-op)
3. Any other consumer flagged later

Each consumer's CI is the gate. If a consumer's tests start failing on dep bump, that indicates a Phase 1 miss and must be fixed at the edge.

Branch name note

This branch (docs/SITES-43238-step-2-plan) was originally created for the plan doc only. The plan + implementation landed together in two commits — docs: for the audit findings update, feat: for the implementation — so reviewers can see the audit context that justifies the implementation.

🤖 Generated with Claude Code

Step 2 of SITES-43238. Step 1 (writer-side filter in spacecat-audit-worker) merged 2026-05-04 and is soaking in prod. This plan covers:
- Phase 1: caller audit across consumer repos (pre-flight)
- Phase 2: implementation in writeConfig + LlmoConfigValidationError
- Phase 3: release coordination (semantic-release major bump + dep bumps)
- Phase 4: production verification (>=48h soak watching for new errors)

Each phase has explicit validation gates per workspace CLAUDE.md.

Refs: SITES-43238
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Audit complete (2026-05-04). Two production callers of writeConfig:
- spacecat-audit-worker drs-config-writer.js:200: already filtered by the writer-side guard in PR #2442 (step 1, merged 2026-05-04).
- spacecat-api-service llmo.js:543: already calls llmoConfigSchema.safeParse before writeConfig and returns 400 on failure, so fail-closed inside writeConfig is a no-op for it.

No fix-at-edge work required before Phase 2.

Refs: SITES-43238
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Validate the LLMO configuration against its Zod schema before issuing the S3 PutObject. On validation failure, throw LlmoConfigValidationError without calling S3, so invalid configs cannot reach the bucket through this function. The error class carries the offending siteId and the Zod issue list so callers and log readers can identify the failing fields without re-parsing the original error.

This closes the asymmetry between readConfig (which has failed closed since 2025) and writeConfig (which was fail-open until now), and prevents the SITES-43238 class of corruption at the platform level, not just at the DRS edge that step 1 fixed.

Step 2 of SITES-43238. Phase 1 caller audit (committed in the previous commit) confirmed both production callers (audit-worker DRS writer, api-service LLMO controller) are safe: neither emits invalid configs.

Tests added in writeConfig describe block: invalid category region, missing required field, and an explicit assertion that the thrown error carries name, siteId, and Zod issues.

BREAKING CHANGE: writeConfig now throws LlmoConfigValidationError when the supplied config does not match the published llmoConfig schema. Previously, writeConfig accepted any object and persisted it verbatim. Existing callers that produce schema-valid configs are unaffected; any caller producing invalid configs must fix the source before upgrading.

Refs: SITES-43238
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

danieljchuser

Hey @dzehnder,

Strengths

Validation runs before S3 I/O (src/llmo-config.js:131-135), structurally guaranteeing fail-closed: invalid configs cannot reach the bucket regardless of which caller evolves next.
Closes a real read/write asymmetry latent since 2025. readConfig has been fail-closed; this brings writeConfig to parity at the platform layer, adding defense-in-depth without duplicating the edge-level filtering from step 1.
LlmoConfigValidationError carries structured fields (siteId, issues) with an explicitly set name, giving consumers something specific to catch and log without re-parsing.
Uses safeParse (not parse), keeping control flow clean - no catching a thrown ZodError just to re-wrap it.
Tests assert both the throw type and the s3Client.send not-called invariant (test/llmo-config.test.js:204-244), which is exactly the right shape for a fail-closed contract.
No new dependencies, no lockfile changes. Reuses the existing Zod schema from schemas.js, so there is no new supply-chain surface.
BREAKING CHANGE footer with a documented consumer-bump order (audit-worker then api-service) is the right release primitive. The caller audit in the plan doc is auditable with file:line references, not just asserted.

Issues

Minor (Nice to Have)

src/llmo-config.js:25 - Missing cause on Error constructor. LlmoConfigValidationError does not pass { cause: zodError } to super(), so the original Zod error's stack is lost. The .issues array preserves the useful payload, but passing cause is one line and improves debuggability without changing the public shape:
```
super(`LLMO config for site ${siteId} failed schema validation: ${summary}`, { cause: zodError });
```
src/llmo-config.js:24-25 - Summary string includes Zod i.message, which may echo user-supplied values. Some Zod validators surface received values in default message text. Since the LLMO config includes user-curated content (brands, competitors, regions) on the api-service path, invalid values could be reflected into CloudWatch/Coralogix logs via the error message. Low-impact in practice (the api-service pre-validates and returns 400 before reaching writeConfig), but the safer pattern uses i.code instead of i.message in the summary:
```
.map((i) => `${i.path.join('.')}: ${i.code}`)
```
This keeps the full human-readable detail in this.issues for trusted callers to inspect.
test/llmo-config.test.js:240 - Message-format contract is under-asserted. The error class promises log readers can "identify the failing fields without re-parsing," but the test only asserts caught.message includes siteId. If someone later changes the summary shape, the test still passes. A small assertion like expect(caught.message).to.match(/categories\..+\.region/) would lock in the formatting contract.
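
Taken together, the three suggestions could look roughly like the sketch below. It is an illustration under assumptions, not the repository's code: the `'; '` separator, the hand-built `fakeZodError`, and the sample UUID and issue values are invented for the example.

```javascript
class LlmoConfigValidationError extends Error {
  constructor(siteId, zodError) {
    // Summarize with issue codes, not messages: default messages may echo received values.
    const summary = zodError.issues
      .map((i) => `${i.path.join('.')}: ${i.code}`)
      .join('; ');
    // Passing `cause` keeps the original error (and its stack) on the cause chain.
    super(`LLMO config for site ${siteId} failed schema validation: ${summary}`, { cause: zodError });
    this.name = 'LlmoConfigValidationError';
    this.siteId = siteId;
    // Full detail, including human-readable messages, stays here for trusted callers.
    this.issues = zodError.issues;
  }
}

// Hand-built stand-in for a ZodError; every value here is hypothetical.
const fakeZodError = Object.assign(new Error('schema validation failed'), {
  issues: [{
    path: ['categories', '123e4567-e89b-42d3-a456-426614174000', 'region'],
    code: 'invalid_string',
    message: 'Invalid region "en-us"',
  }],
});

const err = new LlmoConfigValidationError('site-1', fakeZodError);

// The kind of check the third item asks for: the path shape is part of the message.
console.log(/categories\..+\.region/.test(err.message));
// The rejected value appears only on err.issues, not in the log-facing message.
console.log(err.message.includes('en-us'));
```

The `{ cause }` option on the `Error` constructor is standard JavaScript (ES2022), so no extra dependency is needed for it.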

Recommendations

The summary string is unbounded - a config with many failing fields produces a long error message. Not worth gating the merge, but if these errors show up in alerts, consider truncating to the first N issues with a (+M more) suffix and leaving the full list on .issues.
readConfig currently lets raw ZodError propagate on parse failure, while writeConfig now throws the typed LlmoConfigValidationError. Consumers wanting to catch "bad LLMO config for site X" will need two catch shapes. A small follow-up wrapping readConfig's parse failure in the same (or sibling) error type would close the asymmetry fully.
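
For the truncation idea in the first recommendation, a small helper along these lines would be enough. This is only a sketch: `summarizeIssues`, its default limit of 5, and the `'; '` separator are hypothetical choices, not anything in the PR.

```javascript
// Hypothetical helper: summarize at most `limit` issues and note how many were left out.
// The complete list would still be available on the error's `issues` property.
function summarizeIssues(issues, limit = 5) {
  const shown = issues
    .slice(0, limit)
    .map((i) => `${i.path.join('.')}: ${i.code}`)
    .join('; ');
  const remaining = issues.length - limit;
  return remaining > 0 ? `${shown} (+${remaining} more)` : shown;
}

// Usage with made-up issues: seven failures, only the first five are spelled out.
const manyIssues = Array.from({ length: 7 }, (_, n) => ({
  path: ['categories', `c${n}`, 'region'],
  code: 'invalid_string',
}));
console.log(summarizeIssues(manyIssues));
```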

Assessment

Ready to merge? Yes.

The change is small, well-scoped, and correct. Validation is placed before the side effect, errors are structured, tests verify the no-S3-call invariant, and CI is green. The three minor items above are optional polish - the cause addition being the cheapest win. The caller audit and release plan are thorough.

danieljchuser

Approving - clean change, no blockers. Minor polish items are noted in the earlier review comment.

Address minor polish items from PR #1574 review:
- Pass { cause: zodError } to super() so the original ZodError stack is preserved on the cause chain. Improves debuggability without changing the public error shape.
- Use issue `code` instead of `message` in the summary string. Zod's default messages can echo received values, and the LLMO config can carry user-curated content (brand names, competitor URLs) on the api-service write path. The full human-readable detail (including the values) remains on `this.issues` for trusted callers to inspect.
- Lock down the message-format contract with a regex assertion so the path-shape (e.g. categories.<uuid>.region) is not silently dropped by a future maintainer.
- Switch the test fixture UUID to a v4-conformant value so Zod's validation reaches the inner field rather than failing at the record key.

Refs: SITES-43238
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-04T11:47:14Z

This PR will trigger a major release when merged.

dzehnder · 2026-05-04T11:47:21Z

Thanks @danieljchuser — appreciate the careful review. Pushed a4335cc addressing the three Minor polish items.

Applied

{ cause: zodError } on super() — preserves the original ZodError on the cause chain. One line, real debuggability win.
i.code instead of i.message in the summary: agreed with the log-hygiene reasoning. Even though api-service pre-validates today, fail-closed is exactly the place to be conservative about untrusted input. Full human-readable detail (values and all) remains on this.issues for trusted callers. Added a comment in the source explaining the choice so a future maintainer doesn't "improve" it back.
Message-format regex assertion — added expect(caught.message).to.match(/categories\..+\.region/) plus an assertion that caught.cause.issues === caught.issues so the cause wiring is also locked in.

While in there I also fixed the test fixture UUID — it was 00000000-...-000000000001, which fails Zod's v4 check and produced an invalid_key issue at categories.<uuid> rather than reaching the inner region field. Switched to a v4-conformant UUID so the test exercises the actual path the production code will hit.
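
To make the fixture problem concrete, here is a sketch of a version-4 UUID check. The pattern is a plain RFC 4122 v4 regex written for this example and is an assumption, not Zod's actual implementation; both sample IDs are made up.

```javascript
// Assumption: a generic RFC 4122 version-4 pattern (version nibble 4, variant 8/9/a/b).
const UUID_V4 = /^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/i;

// Hypothetical IDs for illustration only.
const nonV4Id = '11111111-1111-1111-1111-111111111111'; // version nibble is 1
const v4Id = '123e4567-e89b-42d3-a456-426614174000'; // version nibble 4, variant a

console.log(UUID_V4.test(nonV4Id)); // false
console.log(UUID_V4.test(v4Id)); // true
```

With a record keyed by UUID, a key that fails this kind of check is rejected at the key itself, so validation never gets as far as the nested region field the test is meant to exercise.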

Deferred

Truncating long summaries — same trade-off discussion we had on the step 1 WARN log: real-world cardinality of validation issues is small, log volume scales with distinct issues per write (not prompts). I'd rather wait for prod evidence before adding the top-N+others machinery here too. Will revisit if step 4 prod monitoring shows long error lines.
Wrapping readConfig parse failures in LlmoConfigValidationError — agreed this would close the asymmetry, but it changes a different function's runtime behavior on a separate code path. Filing a follow-up sub-task on SITES-43238 rather than expanding this PR's scope. The asymmetry itself is pre-existing (readConfig has thrown raw ZodError since the schema was added).

CI green, 1038 tests passing in spacecat-shared-utils, llmo-config.js at 100% coverage, lint clean.

PR remains approved; merge is gated only on the step 1 soak window (≥24h since 2026-05-04 08:42 UTC) and the Coralogix audit-worker error-rate check.

…tionError

Closes the read/write asymmetry surfaced in PR #1574 review. readConfig previously let raw ZodError propagate from llmoConfig.parse on schema failure, while writeConfig now throws the typed LlmoConfigValidationError. This commit aligns readConfig: switch to safeParse, throw LlmoConfigValidationError(siteId, zodError) on failure, and propagate result.data on success.

The error type change is a runtime contract change for any caller that catches ZodError specifically. In practice today's only consumer (spacecat-api-service llmo controller) propagates readConfig errors to a 5xx unchanged. The major-version bump from the prior writeConfig commit's BREAKING CHANGE footer covers this together.

Closes SITES-43908.
Refs: SITES-43238.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

dzehnder · 2026-05-04T12:02:18Z

Update: rolled the readConfig wrapping into this PR after all.

Pushed 3ee0c5b — readConfig now uses safeParse and throws LlmoConfigValidationError(siteId, zodError) on schema failure (instead of raw ZodError). Closes the read/write asymmetry you flagged so consumers have a uniform catch contract across both paths.

Why in-PR rather than as the separate follow-up I originally filed (SITES-43908):

Diff is 6 lines of source + 1 test rewrite. Smaller than the audit/release coordination overhead of a separate PR.
Same package, same release, same major-version bump. Splitting forces a second @adobe/spacecat-shared-utils major bump and a second round of consumer dep bumps for what is essentially one decision ("all schema failures throw LlmoConfigValidationError").
The existing test "throws when the configuration fails schema validation" was rewritten to assert the new error type, name, siteId, issues, and cause, the same shape as the writeConfig assertions.

Plan doc updated to record readConfig as rolled-in scope.
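
The readConfig change described in this comment can be sketched as follows. This is a hedged illustration, not the committed code: `llmoConfig` is a stand-in with a single invented rule (an `entities` object is required), and `fetchRaw` is a hypothetical callback standing in for the S3 read and JSON parse.

```javascript
// Stand-in schema that mimics the safeParse result shape; the rule is hypothetical.
const llmoConfig = {
  safeParse(value) {
    if (value && typeof value.entities === 'object' && value.entities !== null) {
      return { success: true, data: value };
    }
    const error = Object.assign(new Error('schema validation failed'), {
      issues: [{ path: ['entities'], code: 'invalid_type', message: 'Required' }],
    });
    return { success: false, error };
  },
};

class LlmoConfigValidationError extends Error {
  constructor(siteId, zodError) {
    const summary = zodError.issues.map((i) => `${i.path.join('.')}: ${i.code}`).join('; ');
    super(`LLMO config for site ${siteId} failed schema validation: ${summary}`, { cause: zodError });
    this.name = 'LlmoConfigValidationError';
    this.siteId = siteId;
    this.issues = zodError.issues;
  }
}

// Hypothetical signature: `fetchRaw` replaces the real S3 GetObject + JSON.parse step.
async function readConfig(siteId, fetchRaw) {
  const raw = await fetchRaw(siteId);
  const result = llmoConfig.safeParse(raw);
  if (!result.success) {
    // Same typed error as the write path, so callers need one catch shape.
    throw new LlmoConfigValidationError(siteId, result.error);
  }
  return result.data;
}
```

A consumer can then handle both paths with a single `err instanceof LlmoConfigValidationError` check (or a comparison on `err.name`) instead of also catching a raw ZodError from reads.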

npm test: 1038 passing, llmo-config.js at 100% coverage. Lint clean. SITES-43908 will be closed against this commit.

## [@adobe/spacecat-shared-utils-v1.114.0](https://github.com/adobe/spacecat-shared/compare/@adobe/spacecat-shared-utils-v1.113.0...@adobe/spacecat-shared-utils-v1.114.0) (2026-05-04)

### Features

* **llmo-config:** fail-closed writeConfig validation (SITES-43238) ([#1574](#1574)) ([a177743](a177743))

solaris007 · 2026-05-04T12:21:03Z

🎉 This PR is included in version @adobe/spacecat-shared-utils-v1.114.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

dzehnder and others added 3 commits May 4, 2026 10:50

danieljchuser reviewed May 4, 2026

danieljchuser approved these changes May 4, 2026

dzehnder merged commit a177743 into main May 4, 2026
5 checks passed

dzehnder deleted the docs/SITES-43238-step-2-plan branch May 4, 2026 12:14

solaris007 added the released label May 4, 2026

dzehnder mentioned this pull request May 4, 2026

feat: add LLMO config schema-sweep script (SITES-43238 step 3a) adobe/spacecat-audit-worker#2470

Open

5 tasks

dzehnder merged 5 commits into main from docs/SITES-43238-step-2-plan
