diff --git a/packages/spacecat-shared-utils/docs/plans/2026-05-04-llmo-config-fail-closed-writeconfig.md b/packages/spacecat-shared-utils/docs/plans/2026-05-04-llmo-config-fail-closed-writeconfig.md new file mode 100644 index 000000000..e12d25b30 --- /dev/null +++ b/packages/spacecat-shared-utils/docs/plans/2026-05-04-llmo-config-fail-closed-writeconfig.md @@ -0,0 +1,182 @@ +# LLMO Config: Fail-Closed `writeConfig` Validation + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Make `llmoConfig.writeConfig` in `@adobe/spacecat-shared-utils` validate against the published Zod schema before `PutObject`. Throw a structured error on failure so no caller can silently corrupt an LLMO config in S3. + +**Architecture:** Single source-level change in `packages/spacecat-shared-utils/src/llmo-config.js`. Add `llmoConfig.parse(config)` upstream of the S3 write; on validation failure throw a typed error (`LlmoConfigValidationError`) carrying the Zod issue list and site context. The function signature and import path are unchanged; runtime contract changes from fail-open to fail-closed. + +**Tech Stack:** Node.js ESM, Zod (already a dependency via `schemas.js`), Mocha/Chai/Sinon for tests. + +**Context:** Step 2 of [SITES-43238](https://jira.corp.adobe.com/browse/SITES-43238). Step 1 (writer-side filter in `spacecat-audit-worker`) merged 2026-05-04 in [PR #2442](https://github.com/adobe/spacecat-audit-worker/pull/2442) — commit `7793a536`. This step prevents future regressions of the same class from any caller. + +**Prerequisites:** + +- Step 1 has soaked ≥24h in prod with the writer-side filter active. Confirmed via Coralogix WARN signals (`Dropped non-alpha-2 region values from DRS prompts`, `Skipped DRS categories with no valid region`) and no audit-worker error-rate spike. +- Phase 1 caller audit (below) is complete and any non-DRS callers identified as emitting potentially-invalid configs have a fix-at-edge PR merged. + +--- + +## File Map + +**Modified files:** + +- `packages/spacecat-shared-utils/src/llmo-config.js` — `writeConfig`: parse against schema before PutObject; throw on failure. +- `packages/spacecat-shared-utils/src/llmo-config.d.ts` (if it exists; otherwise add type for the new error) — declare the new error class export. +- `packages/spacecat-shared-utils/src/index.js` — export `LlmoConfigValidationError`. +- `packages/spacecat-shared-utils/test/llmo-config.test.js` — add fail-closed test cases for `writeConfig`. + +**New files:** + +- None. The error class lives next to `writeConfig` in `llmo-config.js`. + +**Also rolled in (originally tracked separately as SITES-43908):** + +- `readConfig` switched from raw `llmoConfig.parse` to `safeParse` and now throws `LlmoConfigValidationError` (instead of raw `ZodError`) on schema failure. Closes the read/write asymmetry surfaced in PR #1574 review so consumers have a single catch contract. + +**Out of scope (this plan):** + +- `customerConfigV2` read/write — different schema, separate concern. +- Repair of already-corrupted S3 configs — that's step 3. + +All paths below are relative to `packages/spacecat-shared-utils/` unless stated otherwise. + +--- + +## Phase 1: Caller Audit (pre-flight) + +**Purpose:** Identify every caller of `llmoConfig.writeConfig` and confirm none silently produce configs that would fail validation. Without this, flipping the contract to fail-closed will turn pre-existing latent bugs into production exceptions. + +### Task 1.1: Enumerate callers across consumer repos + +- [x] Grep for `writeConfig` import + call patterns across the workspace. +- [x] Search pattern: `writeConfig\(` in `src/` of every spacecat-* consumer repo. + +### Task 1.2: Classify each call site + +- [x] Audit table (executed 2026-05-04 against current main of each repo): + +| Repo | File:line | Source of `config` arg | Risk | Notes | +|---|---|---|---|---| +| spacecat-audit-worker | `src/drs-prompt-generation/drs-config-writer.js:200` | DRS prompts merged into config (read-then-modify) | Low | Step 1 (PR #2442, merged 2026-05-04) filters non-alpha-2 regions and skips new categories without a valid region before reaching writeConfig. Residual risk is anything else the schema rejects that the filter doesn't catch — covered by fail-closed itself. | +| spacecat-api-service | `src/controllers/llmo/llmo.js:543` | User input via PUT endpoint | None | Controller already calls `llmoConfigSchema.safeParse(newConfig)` at line 533 and returns HTTP 400 on failure before calling writeConfig. Fail-closed inside writeConfig is a no-op for this caller. | + +No other production callers found across `spacecat-api-service`, `spacecat-audit-worker`, `spacecat-import-worker`, `spacecat-content-processor`, `spacecat-content-scraper`, `spacecat-fulfillment-worker`, `spacecat-task-processor`, `spacecat-autofix-worker`, `spacecat-reporting-worker`, or `spacecat-jobs-dispatcher`. + +### Task 1.3: File fix-at-edge tasks if needed + +- [x] None required. Both known callers either pre-validate (api-service) or pre-filter (audit-worker via step 1). + +**Validation gate (Phase 1):** PASSED. + +- Audit table covers 100% of `writeConfig` call sites in the consumer-repo set above. +- No fix-at-edge PRs needed before Phase 2. + +--- + +## Phase 2: Implementation + +**Purpose:** Add fail-closed validation to `writeConfig`. One function, one error class, tests. + +### Task 2.1: Add `LlmoConfigValidationError` and validate in `writeConfig` + +- [ ] Define `LlmoConfigValidationError` near the top of `src/llmo-config.js`: + - Extends `Error`. + - Constructor takes `(siteId, zodError)` and exposes `siteId` and `issues` (Zod's `error.issues`) as enumerable properties. + - `name = 'LlmoConfigValidationError'`. + - Message includes `siteId` and a one-line summary of issue paths/messages so log lines are diagnosable without inspecting the error object. +- [ ] In `writeConfig`, before constructing the `PutObjectCommand`, call `llmoConfig.safeParse(config)`. On failure, throw `new LlmoConfigValidationError(siteId, result.error)`. +- [ ] Update the JSDoc on `writeConfig` to document the new throw behavior and the error type. + +### Task 2.2: Export the error class from the package entry point + +- [ ] Re-export `LlmoConfigValidationError` from `src/index.js` (matching the pattern used for other named exports). +- [ ] If a `.d.ts` declaration exists for `llmo-config`, add the error class declaration. If not, do nothing — JS-only consumers don't need it. + +### Task 2.3: Add tests in `test/llmo-config.test.js` + +- [ ] Inside `describe('writeConfig', ...)`: + - [ ] `throws LlmoConfigValidationError when category region is non-alpha-2` — config with `categories: { id: { name: 'x', region: 'en-us' } }` rejects, no PutObject call (assert via Sinon spy on `s3Client.send`). + - [ ] `throws LlmoConfigValidationError when prompt regions contain invalid values` — `aiTopics[*].prompts[*].regions: ['global']`. + - [ ] `throws LlmoConfigValidationError when a required field is missing` — e.g. category without `name`. + - [ ] Error includes `siteId` and a non-empty `issues` array. + - [ ] Existing valid-config test still passes unchanged. + +**Validation gate (Phase 2):** + +- `npm test -w packages/spacecat-shared-utils` is green. +- Coverage thresholds for the package hold (100% lines/statements, 97% branches per `.nycrc.json`). +- `npm run lint -w packages/spacecat-shared-utils` is clean. +- All existing `writeConfig` tests in this file still pass without modification. + +--- + +## Phase 3: Release Coordination + +**Purpose:** Ship the change to consumers in a controlled order. The contract change is runtime-breaking for any caller writing invalid configs; the audit in Phase 1 should make this a non-event in practice. + +### Task 3.1: Conventional commit + semantic-release + +- [ ] Commit message: `feat(llmo-config): fail-closed writeConfig validation` with a `BREAKING CHANGE:` footer documenting that `writeConfig` now throws `LlmoConfigValidationError` on schema failure. This forces a major-version bump for `@adobe/spacecat-shared-utils`. +- [ ] PR description references SITES-43238 and links the Phase 1 audit table. +- [ ] On merge to main, semantic-release publishes the new major version automatically. + +### Task 3.2: Bump dependency in critical consumers + +- [ ] Open dep-bump PRs (or manual updates) in the consumers identified in Phase 1 as actually using `writeConfig`. Order by risk: + 1. `spacecat-audit-worker` (DRS prompt writer; depends on step 1's filter being live) + 2. `spacecat-api-service` (LLMO config endpoints) + 3. Any other consumer flagged in Phase 1 +- [ ] Each dep-bump PR runs the consumer's full test suite; failures here indicate a Phase 1 miss and must be fixed before merging the bump. + +**Validation gate (Phase 3):** + +- spacecat-shared-utils new major version published. +- Each consumer dep-bump PR: CI green, no test failures attributable to fail-closed validation. +- For any consumer whose CI surfaces an invalid-config write that Phase 1 missed: do not merge the bump; reopen Phase 1 for that caller, fix at edge, then retry. + +--- + +## Phase 4: Production Verification + +**Purpose:** Confirm the fail-closed behavior is live and not generating production errors in normal operation. + +### Task 4.1: Watch for unexpected validation failures + +- [ ] After each consumer dep-bump deploy, monitor Coralogix for `LlmoConfigValidationError` occurrences across all consumer Lambda subsystems for ≥48h. +- [ ] For each occurrence: identify the caller, the `siteId`, and the failing issue paths. Decide whether to (a) fix at source if a real bug, or (b) add a writer-side filter analogous to step 1 if the upstream data is the problem. + +### Task 4.2: Re-run the SITES-43238 reproducer + +- [ ] After step 3 (repair) completes for site `78d59744-e06c-4d14-a77a-9490c1464116`, confirm `GET /api/v1/sites/78d59744-e06c-4d14-a77a-9490c1464116/llmo/config` returns 200 (or the expected response shape — not 400 from a Zod failure on read). +- [ ] No correctness regression for sites that were healthy before this change. + +**Validation gate (Phase 4):** + +- Zero `LlmoConfigValidationError` occurrences in prod logs over ≥48h, OR each occurrence has a documented root cause and an open ticket / merged fix. +- The SITES-43238 reproducer no longer 400s on read for the originally-affected site (this gate may be satisfied by step 3's repair rather than this step alone). +- Step 2 declared complete on the Jira ticket. + +--- + +## Risks + +1. **Hidden invalid-writing caller missed by Phase 1.** Most likely cause of post-deploy incidents. Mitigation: Phase 4's 48h watch + a pre-merge dry-run that runs a handful of representative writes through the new code path. +2. **Zod parse cost on every write.** LLMO configs are small JSON; parse is microseconds. Negligible. Not mitigated. +3. **Caller catches `Error` and swallows.** If any caller has a generic `try/catch` around `writeConfig` and discards the error, fail-closed silently degrades to a no-op. Phase 1 should flag these. Mitigation: any caller that catches around `writeConfig` should re-throw `LlmoConfigValidationError` explicitly or escalate. +4. **Major-version bump churn.** Renovate/dependabot may push the new version to consumers we haven't yet audited. Mitigation: pin in `renovate.json5` until Phase 3 is complete, OR rely on consumer CI to catch issues. + +## Deferred + +- Schema-evolution policy (what happens when the schema changes and old configs become invalid). Out of scope; tracked separately if the schema starts to evolve. +- Read-validate-write helper (helper that reads, applies a mutator, writes — atomically schema-checked). Possible follow-up if multiple callers re-implement that pattern. +- CloudWatch / Coralogix counter metric for validation failures. Useful for trend-watching but not required for correctness. + +## Timeline (estimate) + +- Phase 1 (audit): 1 day +- Phase 2 (implementation + tests): half a day +- Phase 3 (release + critical dep bumps): 1–2 days +- Phase 4 (prod verification soak): 2–3 days + +End-to-end ~1 calendar week assuming Phase 1 surfaces no surprises. diff --git a/packages/spacecat-shared-utils/src/index.js b/packages/spacecat-shared-utils/src/index.js index 71ec03619..7199a5da2 100644 --- a/packages/spacecat-shared-utils/src/index.js +++ b/packages/spacecat-shared-utils/src/index.js @@ -109,6 +109,7 @@ export { detectAEMVersion, DELIVERY_TYPES, AUTHORING_TYPES } from './aem.js'; export { determineAEMCSPageId, getPageEditUrl } from './aem-content-api-utils.js'; export * as llmoConfig from './llmo-config.js'; +export { LlmoConfigValidationError } from './llmo-config.js'; export * as llmoStrategy from './llmo-strategy.js'; export * as schemas from './schemas.js'; diff --git a/packages/spacecat-shared-utils/src/llmo-config.js b/packages/spacecat-shared-utils/src/llmo-config.js index dac1bfd09..408bd9b35 100644 --- a/packages/spacecat-shared-utils/src/llmo-config.js +++ b/packages/spacecat-shared-utils/src/llmo-config.js @@ -18,6 +18,31 @@ import { llmoConfig } from './schemas.js'; * @import { LLMOConfig } from "./schemas.js" */ +/** + * Thrown by `writeConfig` when the supplied LLMO configuration does not match + * the published Zod schema. Exposes the offending site and the Zod issue list + * so callers and log readers can identify the failing fields without + * re-parsing the error. + */ +export class LlmoConfigValidationError extends Error { + constructor(siteId, zodError) { + // Use issue `code` rather than `message` in the summary: Zod's default + // messages can echo received values, which may include user-supplied + // content (brand names, competitor URLs) on the api-service write path. + // The full message and value remain on `this.issues` for trusted callers. + const summary = zodError.issues + .map((i) => `${i.path.join('.')}: ${i.code}`) + .join('; '); + super( + `LLMO config for site ${siteId} failed schema validation: ${summary}`, + { cause: zodError }, + ); + this.name = 'LlmoConfigValidationError'; + this.siteId = siteId; + this.issues = zodError.issues; + } +} + /** * @param {string} siteId The ID of the site to get the config directory for. * @returns {string} The configuration directory path for the given site ID. @@ -62,6 +87,10 @@ export function defaultConfig() { * Reads the LLMO configuration for a given site. * Returns an empty configuration if the configuration does not exist. * + * If the persisted config exists but fails schema validation, throws + * `LlmoConfigValidationError` so callers have a uniform error contract + * across read and write paths. + * * @param {string} sideId The ID of the site. * @param {S3Client} s3Client The S3 client to use for reading the configuration. * @param {object} [options] @@ -70,6 +99,7 @@ export function defaultConfig() { * @param {string} [options.s3Bucket] Optional S3 bucket name. * @returns {Promise<{config: LLMOConfig, exists: boolean, version?: string}>} The configuration, * a flag indicating if it existed, and the version ID if it exists. + * @throws {LlmoConfigValidationError} If the persisted config fails schema validation. * @throws {Error} If reading the configuration fails for reasons other than it not existing. */ export async function readConfig(sideId, s3Client, options) { @@ -97,20 +127,35 @@ export async function readConfig(sideId, s3Client, options) { throw new Error('LLMO config body is empty'); } const text = await body.transformToString(); - const config = llmoConfig.parse(JSON.parse(text)); - return { config, exists: true, version: res.VersionId || undefined }; + const result = llmoConfig.safeParse(JSON.parse(text)); + if (!result.success) { + throw new LlmoConfigValidationError(sideId, result.error); + } + return { config: result.data, exists: true, version: res.VersionId || undefined }; } /** * Writes the LLMO configuration for a given site. + * + * Validates `config` against the published Zod schema before issuing the S3 + * `PutObject`. If validation fails, throws `LlmoConfigValidationError` and + * does not call S3, so invalid configs cannot reach the bucket through this + * function. + * * @param {string} siteId The ID of the site. * @param {LLMOConfig} config The configuration object to write. * @param {S3Client} s3Client The S3 client to use for reading the configuration. * @param {object} [options] * @param {string} [options.s3Bucket] Optional S3 bucket name. * @returns {Promise<{ version: string }>} The version of the configuration written. + * @throws {LlmoConfigValidationError} If `config` does not match the LLMO schema. */ export async function writeConfig(siteId, config, s3Client, options) { + const result = llmoConfig.safeParse(config); + if (!result.success) { + throw new LlmoConfigValidationError(siteId, result.error); + } + const s3Bucket = options?.s3Bucket || process.env.S3_BUCKET_NAME; const putObjectCommand = new PutObjectCommand({ diff --git a/packages/spacecat-shared-utils/test/index.test.js b/packages/spacecat-shared-utils/test/index.test.js index c82672dd6..693d1705f 100644 --- a/packages/spacecat-shared-utils/test/index.test.js +++ b/packages/spacecat-shared-utils/test/index.test.js @@ -103,6 +103,7 @@ describe('Index Exports', () => { 'isoCalendarWeekMonday', 'isoCalendarWeekSunday', 'llmoConfig', + 'LlmoConfigValidationError', 'llmoStrategy', 'logWrapper', 'OPPORTUNITY_DEPENDENCY_MAP', diff --git a/packages/spacecat-shared-utils/test/llmo-config.test.js b/packages/spacecat-shared-utils/test/llmo-config.test.js index 119f06e47..d3b391dab 100644 --- a/packages/spacecat-shared-utils/test/llmo-config.test.js +++ b/packages/spacecat-shared-utils/test/llmo-config.test.js @@ -24,6 +24,7 @@ import { customerConfigV2Path, readCustomerConfigV2, writeCustomerConfigV2, + LlmoConfigValidationError, } from '../src/llmo-config.js'; use(sinonChai); @@ -160,13 +161,25 @@ describe('llmo-config utilities', () => { await expect(readConfig(siteId, s3Client)).rejectedWith(SyntaxError); }); - it('throws when the configuration fails schema validation', async () => { + it('throws LlmoConfigValidationError when the persisted config fails schema validation', async () => { const body = { transformToString: sinon.stub().resolves(JSON.stringify({ entities: {} })), }; s3Client.send.resolves({ Body: body }); - await expect(readConfig(siteId, s3Client)).rejectedWith(Error); + let caught; + try { + await readConfig(siteId, s3Client); + } catch (e) { + caught = e; + } + + expect(caught).instanceOf(LlmoConfigValidationError); + expect(caught.name).equals('LlmoConfigValidationError'); + expect(caught.siteId).equals(siteId); + expect(caught.issues).to.be.an('array').with.length.greaterThan(0); + expect(caught.cause).to.exist; + expect(caught.cause.issues).to.equal(caught.issues); }); }); @@ -200,6 +213,57 @@ describe('llmo-config utilities', () => { await expect(writeConfig(siteId, validConfig, s3Client)).rejectedWith('Failed to get version ID after writing LLMO config'); }); + + it('throws LlmoConfigValidationError when category region is invalid', async () => { + const invalidConfig = { + ...validConfig, + categories: { + '550e8400-e29b-41d4-a716-446655440000': { name: 'brand', region: 'en-us' }, + }, + }; + + await expect(writeConfig(siteId, invalidConfig, s3Client)) + .rejectedWith(LlmoConfigValidationError); + expect(s3Client.send).not.called; + }); + + it('throws LlmoConfigValidationError when a required field is missing', async () => { + const invalidConfig = { ...validConfig }; + delete invalidConfig.entities; + + await expect(writeConfig(siteId, invalidConfig, s3Client)) + .rejectedWith(LlmoConfigValidationError); + expect(s3Client.send).not.called; + }); + + it('LlmoConfigValidationError carries siteId, name, Zod issues, and cause', async () => { + const invalidConfig = { + ...validConfig, + categories: { + '550e8400-e29b-41d4-a716-446655440000': { name: 'brand', region: 'en-us' }, + }, + }; + + let caught; + try { + await writeConfig(siteId, invalidConfig, s3Client); + } catch (e) { + caught = e; + } + + expect(caught).instanceOf(LlmoConfigValidationError); + expect(caught.name).equals('LlmoConfigValidationError'); + expect(caught.siteId).equals(siteId); + expect(caught.issues).to.be.an('array').with.length.greaterThan(0); + + // Message format: "...failed schema validation: : " + expect(caught.message).to.include(siteId); + expect(caught.message).to.match(/categories\..+\.region/); + + // Original ZodError preserved as cause for stack-chain debugging. + expect(caught.cause).to.exist; + expect(caught.cause.issues).to.equal(caught.issues); + }); }); describe('customerConfigV2Path', () => {