Skip to content

feat(llmo-config): fail-closed writeConfig validation (SITES-43238)#1574

Merged
dzehnder merged 5 commits intomainfrom
docs/SITES-43238-step-2-plan
May 4, 2026
Merged

feat(llmo-config): fail-closed writeConfig validation (SITES-43238)#1574
dzehnder merged 5 commits intomainfrom
docs/SITES-43238-step-2-plan

Conversation

@dzehnder
Copy link
Copy Markdown
Contributor

@dzehnder dzehnder commented May 4, 2026

Summary

Step 2 of SITES-43238. Closes the asymmetry between readConfig (already fails closed since 2025) and writeConfig (fail-open until now). Step 1 (writer-side filter in spacecat-audit-worker PR #2442) merged 2026-05-04 and stops new bad data at the DRS edge; this PR backstops it at the platform level.

What this PR does

In packages/spacecat-shared-utils/src/llmo-config.js:

  • Add LlmoConfigValidationError (extends Error) carrying siteId and Zod issues. Message includes a one-line summary of issue paths so log lines are diagnosable without inspecting the error object.
  • writeConfig now calls llmoConfig.safeParse(config) before PutObject. On failure, throws LlmoConfigValidationError and does not call S3. Successful path is unchanged — the original config is still what gets serialized to the bucket, so byte-for-byte output for schema-valid input is preserved.
  • Re-export LlmoConfigValidationError as a top-level named export from src/index.js.

Plan doc with full phase breakdown lives at packages/spacecat-shared-utils/docs/plans/2026-05-04-llmo-config-fail-closed-writeconfig.md — included in this PR for reviewer context.

Phase 1 caller audit (pre-flight)

Two production callers of writeConfig across all spacecat-* consumer repos:

Repo File:line Source of config Risk
spacecat-audit-worker src/drs-prompt-generation/drs-config-writer.js:200 DRS prompts merged into config Low — already filtered by PR #2442
spacecat-api-service src/controllers/llmo/llmo.js:543 User input via PUT endpoint None — controller already calls llmoConfigSchema.safeParse(newConfig) at line 533 and returns HTTP 400 on failure

No fix-at-edge work was required before this PR. See the plan doc for the full audit and the validation gate.

Test plan

  • Existing writeConfig tests still pass without modification (success path is unchanged).
  • New tests in the describe('writeConfig', ...) block:
    • throws LlmoConfigValidationError when category region is non-alpha-2 ('en-us')
    • throws LlmoConfigValidationError when a required field is missing (entities deleted)
    • thrown error carries name, siteId, and a non-empty issues array; s3Client.send is never called
  • Index exports test updated to include the new top-level LlmoConfigValidationError export.
  • npm test -w packages/spacecat-shared-utils — 1038 passing, llmo-config.js at 100% coverage.
  • npm run lint -w packages/spacecat-shared-utils — clean.

Release coordination

Commit message uses feat(llmo-config): with a BREAKING CHANGE: footer documenting the new throw behavior, so semantic-release will publish a major-version bump for @adobe/spacecat-shared-utils.

After release, dependency bumps in this order:

  1. spacecat-audit-worker (DRS prompt writer, depends on step 1 already being live)
  2. spacecat-api-service (LLMO config endpoints — already pre-validates, fail-closed is a no-op)
  3. Any other consumer flagged later

Each consumer's CI is the gate. If a consumer's tests start failing on dep bump, that indicates a Phase 1 miss and must be fixed at the edge.

Branch name note

This branch (docs/SITES-43238-step-2-plan) was originally created for the plan doc only. The plan + implementation landed together in two commits — docs: for the audit findings update, feat: for the implementation — so reviewers can see the audit context that justifies the implementation.

🤖 Generated with Claude Code

dzehnder and others added 3 commits May 4, 2026 10:50
Step 2 of SITES-43238. Step 1 (writer-side filter in spacecat-audit-worker)
merged 2026-05-04 and is soaking in prod. This plan covers:

- Phase 1: caller audit across consumer repos (pre-flight)
- Phase 2: implementation in writeConfig + LlmoConfigValidationError
- Phase 3: release coordination (semantic-release major bump + dep bumps)
- Phase 4: production verification (>=48h soak watching for new errors)

Each phase has explicit validation gates per workspace CLAUDE.md.

Refs: SITES-43238

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Audit complete (2026-05-04). Two production callers of writeConfig:

- spacecat-audit-worker drs-config-writer.js:200 — already filtered
  by the writer-side guard in PR #2442 (step 1, merged 2026-05-04).
- spacecat-api-service llmo.js:543 — already calls
  llmoConfigSchema.safeParse before writeConfig and returns 400 on
  failure, so fail-closed inside writeConfig is a no-op for it.

No fix-at-edge work required before Phase 2.

Refs: SITES-43238

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Validate the LLMO configuration against its Zod schema before issuing
the S3 PutObject. On validation failure, throw LlmoConfigValidationError
without calling S3, so invalid configs cannot reach the bucket through
this function.

The error class carries the offending siteId and the Zod issue list so
callers and log readers can identify the failing fields without
re-parsing the original error.

This closes the asymmetry between readConfig (which already fails closed
since 2025) and writeConfig (which was fail-open until now), and
prevents the SITES-43238 class of corruption at the platform level —
not just at the DRS edge that step 1 fixed.

Step 2 of SITES-43238. Phase 1 caller audit (committed in the previous
commit) confirmed both production callers (audit-worker DRS writer,
api-service LLMO controller) are safe — neither emits invalid configs.

Tests added in writeConfig describe block: invalid category region,
missing required field, and an explicit assertion that the thrown error
carries name, siteId, and Zod issues.

BREAKING CHANGE: writeConfig now throws LlmoConfigValidationError when
the supplied config does not match the published llmoConfig schema.
Previously, writeConfig accepted any object and persisted it verbatim.
Existing callers that produce schema-valid configs are unaffected; any
caller producing invalid configs must fix the source before upgrading.

Refs: SITES-43238

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@danieljchuser danieljchuser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @dzehnder,

Strengths

  • Validation runs before S3 I/O (src/llmo-config.js:131-135), structurally guaranteeing fail-closed: invalid configs cannot reach the bucket regardless of which caller evolves next.
  • Closes a real read/write asymmetry latent since 2025. readConfig has been fail-closed; this brings writeConfig to parity at the platform layer, adding defense-in-depth without duplicating the edge-level filtering from step 1.
  • LlmoConfigValidationError carries structured fields (siteId, issues) with an explicitly set name, giving consumers something specific to catch and log without re-parsing.
  • Uses safeParse (not parse), keeping control flow clean - no catching a thrown ZodError just to re-wrap it.
  • Tests assert both the throw type and the s3Client.send not-called invariant (test/llmo-config.test.js:204-244), which is exactly the right shape for a fail-closed contract.
  • No new dependencies, no lockfile changes. Reuses the existing Zod schema from schemas.js, so there is no new supply-chain surface.
  • BREAKING CHANGE footer with a documented consumer-bump order (audit-worker then api-service) is the right release primitive. The caller audit in the plan doc is auditable with file:line references, not just asserted.

Issues

Minor (Nice to Have)

  1. src/llmo-config.js:25 - Missing cause on Error constructor. LlmoConfigValidationError does not pass { cause: zodError } to super(), so the original Zod error's stack is lost. The .issues array preserves the useful payload, but passing cause is one line and improves debuggability without changing the public shape:

    super(`LLMO config for site ${siteId} failed schema validation: ${summary}`, { cause: zodError });
  2. src/llmo-config.js:24-25 - Summary string includes Zod i.message, which may echo user-supplied values. Some Zod validators surface received values in default message text. Since the LLMO config includes user-curated content (brands, competitors, regions) on the api-service path, invalid values could be reflected into CloudWatch/Coralogix logs via the error message. Low-impact in practice (the api-service pre-validates and returns 400 before reaching writeConfig), but the safer pattern uses i.code instead of i.message in the summary:

    .map((i) => `${i.path.join('.')}: ${i.code}`)

    This keeps the full human-readable detail in this.issues for trusted callers to inspect.

  3. test/llmo-config.test.js:240 - Message-format contract is under-asserted. The error class promises log readers can "identify the failing fields without re-parsing," but the test only asserts caught.message includes siteId. If someone later changes the summary shape, the test still passes. A small assertion like expect(caught.message).to.match(/categories\..+\.region/) would lock in the formatting contract.

Recommendations

  • The summary string is unbounded - a config with many failing fields produces a long error message. Not worth gating the merge, but if these errors show up in alerts, consider truncating to the first N issues with a (+M more) suffix and leaving the full list on .issues.
  • readConfig currently lets raw ZodError propagate on parse failure, while writeConfig now throws the typed LlmoConfigValidationError. Consumers wanting to catch "bad LLMO config for site X" will need two catch shapes. A small follow-up wrapping readConfig's parse failure in the same (or sibling) error type would close the asymmetry fully.

Assessment

Ready to merge? Yes.

The change is small, well-scoped, and correct. Validation is placed before the side effect, errors are structured, tests verify the no-S3-call invariant, and CI is green. The three minor items above are optional polish - the cause addition being the cheapest win. The caller audit and release plan are thorough.

Copy link
Copy Markdown
Contributor

@danieljchuser danieljchuser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving - clean change, no blockers. Minor polish items noted in the earlier comment review.

Address minor polish items from PR #1574 review:

- Pass { cause: zodError } to super() so the original ZodError stack is
  preserved on the cause chain. Improves debuggability without changing
  the public error shape.
- Use issue `code` instead of `message` in the summary string. Zod's
  default messages can echo received values, and the LLMO config can
  carry user-curated content (brand names, competitor URLs) on the
  api-service write path. The full human-readable detail (including the
  values) remains on `this.issues` for trusted callers to inspect.
- Lock down the message-format contract with a regex assertion so the
  path-shape (e.g. categories.<uuid>.region) is not silently dropped by
  a future maintainer.
- Switch the test fixture UUID to a v4-conformant value so Zod's
  validation reaches the inner field rather than failing at the record
  key.

Refs: SITES-43238

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 4, 2026

This PR will trigger a major release when merged.

@dzehnder
Copy link
Copy Markdown
Contributor Author

dzehnder commented May 4, 2026

Thanks @danieljchuser — appreciate the careful review. Pushed a4335cc addressing the three Minor polish items.

Applied

  1. { cause: zodError } on super() — preserves the original ZodError on the cause chain. One line, real debuggability win.

  2. i.code instead of i.message in the summary — agreed with the log-hygiene reasoning. Even though api-service pre-validates today, fail-closed is exactly the place to be conservative about untrusted input. Full human-readable detail (values, all) remains on this.issues for trusted callers. Added a comment in the source explaining the choice so a future maintainer doesn't "improve" it back.

  3. Message-format regex assertion — added expect(caught.message).to.match(/categories\..+\.region/) plus an assertion that caught.cause.issues === caught.issues so the cause wiring is also locked in.

While in there I also fixed the test fixture UUID — it was 00000000-...-000000000001, which fails Zod's v4 check and produced an invalid_key issue at categories.<uuid> rather than reaching the inner region field. Switched to a v4-conformant UUID so the test exercises the actual path the production code will hit.

Deferred

  • Truncating long summaries — same trade-off discussion we had on the step 1 WARN log: real-world cardinality of validation issues is small, log volume scales with distinct issues per write (not prompts). I'd rather wait for prod evidence before adding the top-N+others machinery here too. Will revisit if step 4 prod monitoring shows long error lines.

  • Wrapping readConfig parse failures in LlmoConfigValidationError — agreed this would close the asymmetry, but it changes a different function's runtime behavior on a separate code path. Filing a follow-up sub-task on SITES-43238 rather than expanding this PR's scope. The asymmetry itself is pre-existing (readConfig has thrown raw ZodError since the schema was added).

CI green, 1038 tests passing in spacecat-shared-utils, llmo-config.js at 100% coverage, lint clean.

PR remains approved; merge is gated only on the step 1 soak window (≥24h since 2026-05-04 08:42 UTC) and the Coralogix audit-worker error-rate check.

…tionError

Closes the read/write asymmetry surfaced in PR #1574 review. readConfig
previously let raw ZodError propagate from llmoConfig.parse on schema
failure, while writeConfig now throws the typed LlmoConfigValidationError.
This commit aligns readConfig: switch to safeParse, throw
LlmoConfigValidationError(siteId, zodError) on failure, and propagate
result.data on success.

The error type change is a runtime contract change for any caller that
catches ZodError specifically. In practice today's only consumer
(spacecat-api-service llmo controller) propagates readConfig errors to
a 5xx unchanged. The major-version bump from the prior writeConfig
commit's BREAKING CHANGE footer covers this together.

Closes SITES-43908. Refs: SITES-43238.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dzehnder
Copy link
Copy Markdown
Contributor Author

dzehnder commented May 4, 2026

Update: rolled the readConfig wrapping into this PR after all.

Pushed 3ee0c5b — readConfig now uses safeParse and throws LlmoConfigValidationError(siteId, zodError) on schema failure (instead of raw ZodError). Closes the read/write asymmetry you flagged so consumers have a uniform catch contract across both paths.

Why in-PR rather than as the separate follow-up I originally filed (SITES-43908):

  • Diff is 6 lines of source + 1 test rewrite. Smaller than the audit/release coordination overhead of a separate PR.
  • Same package, same release, same major-version bump. Splitting forces a second @adobe/spacecat-shared-utils major bump and a second round of consumer dep bumps for what is essentially one decision ("all schema failures throw LlmoConfigValidationError").
  • Existing throws when the configuration fails schema validation test was rewritten to assert the new error type, name, siteId, issues, and cause — same shape as the writeConfig assertions.

Plan doc updated to record readConfig as rolled-in scope.

npm test: 1038 passing, llmo-config.js at 100% coverage. Lint clean. SITES-43908 will be closed against this commit.

@dzehnder dzehnder merged commit a177743 into main May 4, 2026
5 checks passed
@dzehnder dzehnder deleted the docs/SITES-43238-step-2-plan branch May 4, 2026 12:14
solaris007 pushed a commit that referenced this pull request May 4, 2026
## [@adobe/spacecat-shared-utils-v1.114.0](https://github.com/adobe/spacecat-shared/compare/@adobe/spacecat-shared-utils-v1.113.0...@adobe/spacecat-shared-utils-v1.114.0) (2026-05-04)

### Features

* **llmo-config:** fail-closed writeConfig validation (SITES-43238) ([#1574](#1574)) ([a177743](a177743))
@solaris007
Copy link
Copy Markdown
Member

🎉 This PR is included in version @adobe/spacecat-shared-utils-v1.114.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants