Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,182 @@
# LLMO Config: Fail-Closed `writeConfig` Validation

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Make `llmoConfig.writeConfig` in `@adobe/spacecat-shared-utils` validate against the published Zod schema before `PutObject`. Throw a structured error on failure so no caller can silently corrupt an LLMO config in S3.

**Architecture:** Single source-level change in `packages/spacecat-shared-utils/src/llmo-config.js`. Add `llmoConfig.parse(config)` upstream of the S3 write; on validation failure throw a typed error (`LlmoConfigValidationError`) carrying the Zod issue list and site context. The function signature and import path are unchanged; runtime contract changes from fail-open to fail-closed.

**Tech Stack:** Node.js ESM, Zod (already a dependency via `schemas.js`), Mocha/Chai/Sinon for tests.

**Context:** Step 2 of [SITES-43238](https://jira.corp.adobe.com/browse/SITES-43238). Step 1 (writer-side filter in `spacecat-audit-worker`) merged 2026-05-04 in [PR #2442](https://github.com/adobe/spacecat-audit-worker/pull/2442) — commit `7793a536`. This step prevents future regressions of the same class from any caller.

**Prerequisites:**

- Step 1 has soaked ≥24h in prod with the writer-side filter active. Confirmed via Coralogix WARN signals (`Dropped non-alpha-2 region values from DRS prompts`, `Skipped DRS categories with no valid region`) and no audit-worker error-rate spike.
- Phase 1 caller audit (below) is complete and any non-DRS callers identified as emitting potentially-invalid configs have a fix-at-edge PR merged.

---

## File Map

**Modified files:**

- `packages/spacecat-shared-utils/src/llmo-config.js` — `writeConfig`: parse against schema before PutObject; throw on failure.
- `packages/spacecat-shared-utils/src/llmo-config.d.ts` (if it exists; otherwise add type for the new error) — declare the new error class export.
- `packages/spacecat-shared-utils/src/index.js` — export `LlmoConfigValidationError`.
- `packages/spacecat-shared-utils/test/llmo-config.test.js` — add fail-closed test cases for `writeConfig`.

**New files:**

- None. The error class lives next to `writeConfig` in `llmo-config.js`.

**Also rolled in (originally tracked separately as SITES-43908):**

- `readConfig` switched from raw `llmoConfig.parse` to `safeParse` and now throws `LlmoConfigValidationError` (instead of raw `ZodError`) on schema failure. Closes the read/write asymmetry surfaced in PR #1574 review so consumers have a single catch contract.

**Out of scope (this plan):**

- `customerConfigV2` read/write — different schema, separate concern.
- Repair of already-corrupted S3 configs — that's step 3.

All paths below are relative to `packages/spacecat-shared-utils/` unless stated otherwise.

---

## Phase 1: Caller Audit (pre-flight)

**Purpose:** Identify every caller of `llmoConfig.writeConfig` and confirm none silently produce configs that would fail validation. Without this, flipping the contract to fail-closed will turn pre-existing latent bugs into production exceptions.

### Task 1.1: Enumerate callers across consumer repos

- [x] Grep for `writeConfig` import + call patterns across the workspace.
- [x] Search pattern: `writeConfig\(` in `src/` of every spacecat-* consumer repo.

### Task 1.2: Classify each call site

- [x] Audit table (executed 2026-05-04 against current main of each repo):

| Repo | File:line | Source of `config` arg | Risk | Notes |
|---|---|---|---|---|
| spacecat-audit-worker | `src/drs-prompt-generation/drs-config-writer.js:200` | DRS prompts merged into config (read-then-modify) | Low | Step 1 (PR #2442, merged 2026-05-04) filters non-alpha-2 regions and skips new categories without a valid region before reaching writeConfig. Residual risk is anything else the schema rejects that the filter doesn't catch — covered by fail-closed itself. |
| spacecat-api-service | `src/controllers/llmo/llmo.js:543` | User input via PUT endpoint | None | Controller already calls `llmoConfigSchema.safeParse(newConfig)` at line 533 and returns HTTP 400 on failure before calling writeConfig. Fail-closed inside writeConfig is a no-op for this caller. |

No other production callers found across `spacecat-api-service`, `spacecat-audit-worker`, `spacecat-import-worker`, `spacecat-content-processor`, `spacecat-content-scraper`, `spacecat-fulfillment-worker`, `spacecat-task-processor`, `spacecat-autofix-worker`, `spacecat-reporting-worker`, or `spacecat-jobs-dispatcher`.

### Task 1.3: File fix-at-edge tasks if needed

- [x] None required. Both known callers either pre-validate (api-service) or pre-filter (audit-worker via step 1).

**Validation gate (Phase 1):** PASSED.

- Audit table covers 100% of `writeConfig` call sites in the consumer-repo set above.
- No fix-at-edge PRs needed before Phase 2.

---

## Phase 2: Implementation

**Purpose:** Add fail-closed validation to `writeConfig`. One function, one error class, tests.

### Task 2.1: Add `LlmoConfigValidationError` and validate in `writeConfig`

- [ ] Define `LlmoConfigValidationError` near the top of `src/llmo-config.js`:
- Extends `Error`.
- Constructor takes `(siteId, zodError)` and exposes `siteId` and `issues` (Zod's `error.issues`) as enumerable properties.
- `name = 'LlmoConfigValidationError'`.
- Message includes `siteId` and a one-line summary of issue paths/messages so log lines are diagnosable without inspecting the error object.
- [ ] In `writeConfig`, before constructing the `PutObjectCommand`, call `llmoConfig.safeParse(config)`. On failure, throw `new LlmoConfigValidationError(siteId, result.error)`.
- [ ] Update the JSDoc on `writeConfig` to document the new throw behavior and the error type.

### Task 2.2: Export the error class from the package entry point

- [ ] Re-export `LlmoConfigValidationError` from `src/index.js` (matching the pattern used for other named exports).
- [ ] If a `.d.ts` declaration exists for `llmo-config`, add the error class declaration. If not, do nothing — JS-only consumers don't need it.

### Task 2.3: Add tests in `test/llmo-config.test.js`

- [ ] Inside `describe('writeConfig', ...)`:
- [ ] `throws LlmoConfigValidationError when category region is non-alpha-2` — config with `categories: { id: { name: 'x', region: 'en-us' } }` rejects, no PutObject call (assert via Sinon spy on `s3Client.send`).
- [ ] `throws LlmoConfigValidationError when prompt regions contain invalid values` — `aiTopics[*].prompts[*].regions: ['global']`.
- [ ] `throws LlmoConfigValidationError when a required field is missing` — e.g. category without `name`.
- [ ] Error includes `siteId` and a non-empty `issues` array.
- [ ] Existing valid-config test still passes unchanged.

**Validation gate (Phase 2):**

- `npm test -w packages/spacecat-shared-utils` is green.
- Coverage thresholds for the package hold (100% lines/statements, 97% branches per `.nycrc.json`).
- `npm run lint -w packages/spacecat-shared-utils` is clean.
- All existing `writeConfig` tests in this file still pass without modification.

---

## Phase 3: Release Coordination

**Purpose:** Ship the change to consumers in a controlled order. The contract change is runtime-breaking for any caller writing invalid configs; the audit in Phase 1 should make this a non-event in practice.

### Task 3.1: Conventional commit + semantic-release

- [ ] Commit message: `feat(llmo-config): fail-closed writeConfig validation` with a `BREAKING CHANGE:` footer documenting that `writeConfig` now throws `LlmoConfigValidationError` on schema failure. This forces a major-version bump for `@adobe/spacecat-shared-utils`.
- [ ] PR description references SITES-43238 and links the Phase 1 audit table.
- [ ] On merge to main, semantic-release publishes the new major version automatically.

### Task 3.2: Bump dependency in critical consumers

- [ ] Open dep-bump PRs (or manual updates) in the consumers identified in Phase 1 as actually using `writeConfig`. Order by risk:
1. `spacecat-audit-worker` (DRS prompt writer; depends on step 1's filter being live)
2. `spacecat-api-service` (LLMO config endpoints)
3. Any other consumer flagged in Phase 1
- [ ] Each dep-bump PR runs the consumer's full test suite; failures here indicate a Phase 1 miss and must be fixed before merging the bump.

**Validation gate (Phase 3):**

- spacecat-shared-utils new major version published.
- Each consumer dep-bump PR: CI green, no test failures attributable to fail-closed validation.
- For any consumer whose CI surfaces an invalid-config write that Phase 1 missed: do not merge the bump; reopen Phase 1 for that caller, fix at edge, then retry.

---

## Phase 4: Production Verification

**Purpose:** Confirm the fail-closed behavior is live and not generating production errors in normal operation.

### Task 4.1: Watch for unexpected validation failures

- [ ] After each consumer dep-bump deploy, monitor Coralogix for `LlmoConfigValidationError` occurrences across all consumer Lambda subsystems for ≥48h.
- [ ] For each occurrence: identify the caller, the `siteId`, and the failing issue paths. Decide whether to (a) fix at source if a real bug, or (b) add a writer-side filter analogous to step 1 if the upstream data is the problem.

### Task 4.2: Re-run the SITES-43238 reproducer

- [ ] After step 3 (repair) completes for site `78d59744-e06c-4d14-a77a-9490c1464116`, confirm `GET /api/v1/sites/78d59744-e06c-4d14-a77a-9490c1464116/llmo/config` returns 200 (or the expected response shape — not 400 from a Zod failure on read).
- [ ] No correctness regression for sites that were healthy before this change.

**Validation gate (Phase 4):**

- Zero `LlmoConfigValidationError` occurrences in prod logs over ≥48h, OR each occurrence has a documented root cause and an open ticket / merged fix.
- The SITES-43238 reproducer no longer 400s on read for the originally-affected site (this gate may be satisfied by step 3's repair rather than this step alone).
- Step 2 declared complete on the Jira ticket.

---

## Risks

1. **Hidden invalid-writing caller missed by Phase 1.** Most likely cause of post-deploy incidents. Mitigation: Phase 4's 48h watch + a pre-merge dry-run that runs a handful of representative writes through the new code path.
2. **Zod parse cost on every write.** LLMO configs are small JSON; parse is microseconds. Negligible. Not mitigated.
3. **Caller catches `Error` and swallows.** If any caller has a generic `try/catch` around `writeConfig` and discards the error, fail-closed silently degrades to a no-op. Phase 1 should flag these. Mitigation: any caller that catches around `writeConfig` should re-throw `LlmoConfigValidationError` explicitly or escalate.
4. **Major-version bump churn.** Renovate/dependabot may push the new version to consumers we haven't yet audited. Mitigation: pin in `renovate.json5` until Phase 3 is complete, OR rely on consumer CI to catch issues.

## Deferred

- Schema-evolution policy (what happens when the schema changes and old configs become invalid). Out of scope; tracked separately if the schema starts to evolve.
- Read-validate-write helper (helper that reads, applies a mutator, writes — atomically schema-checked). Possible follow-up if multiple callers re-implement that pattern.
- CloudWatch / Coralogix counter metric for validation failures. Useful for trend-watching but not required for correctness.

## Timeline (estimate)

- Phase 1 (audit): 1 day
- Phase 2 (implementation + tests): half a day
- Phase 3 (release + critical dep bumps): 1–2 days
- Phase 4 (prod verification soak): 2–3 days

End-to-end ~1 calendar week assuming Phase 1 surfaces no surprises.
1 change: 1 addition & 0 deletions packages/spacecat-shared-utils/src/index.js
Original file line number Diff line number Diff line change
Expand Up @@ -109,6 +109,7 @@ export { detectAEMVersion, DELIVERY_TYPES, AUTHORING_TYPES } from './aem.js';
export { determineAEMCSPageId, getPageEditUrl } from './aem-content-api-utils.js';

export * as llmoConfig from './llmo-config.js';
export { LlmoConfigValidationError } from './llmo-config.js';
export * as llmoStrategy from './llmo-strategy.js';
export * as schemas from './schemas.js';

Expand Down
49 changes: 47 additions & 2 deletions packages/spacecat-shared-utils/src/llmo-config.js
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,31 @@ import { llmoConfig } from './schemas.js';
* @import { LLMOConfig } from "./schemas.js"
*/

/**
* Thrown by `writeConfig` when the supplied LLMO configuration does not match
* the published Zod schema. Exposes the offending site and the Zod issue list
* so callers and log readers can identify the failing fields without
* re-parsing the error.
*/
export class LlmoConfigValidationError extends Error {
constructor(siteId, zodError) {
// Use issue `code` rather than `message` in the summary: Zod's default
// messages can echo received values, which may include user-supplied
// content (brand names, competitor URLs) on the api-service write path.
// The full message and value remain on `this.issues` for trusted callers.
const summary = zodError.issues
.map((i) => `${i.path.join('.')}: ${i.code}`)
.join('; ');
super(
`LLMO config for site ${siteId} failed schema validation: ${summary}`,
{ cause: zodError },
);
this.name = 'LlmoConfigValidationError';
this.siteId = siteId;
this.issues = zodError.issues;
}
}

/**
* @param {string} siteId The ID of the site to get the config directory for.
* @returns {string} The configuration directory path for the given site ID.
Expand Down Expand Up @@ -62,6 +87,10 @@ export function defaultConfig() {
* Reads the LLMO configuration for a given site.
* Returns an empty configuration if the configuration does not exist.
*
* If the persisted config exists but fails schema validation, throws
* `LlmoConfigValidationError` so callers have a uniform error contract
* across read and write paths.
*
* @param {string} sideId The ID of the site.
* @param {S3Client} s3Client The S3 client to use for reading the configuration.
* @param {object} [options]
Expand All @@ -70,6 +99,7 @@ export function defaultConfig() {
* @param {string} [options.s3Bucket] Optional S3 bucket name.
* @returns {Promise<{config: LLMOConfig, exists: boolean, version?: string}>} The configuration,
* a flag indicating if it existed, and the version ID if it exists.
* @throws {LlmoConfigValidationError} If the persisted config fails schema validation.
* @throws {Error} If reading the configuration fails for reasons other than it not existing.
*/
export async function readConfig(sideId, s3Client, options) {
Expand Down Expand Up @@ -97,20 +127,35 @@ export async function readConfig(sideId, s3Client, options) {
throw new Error('LLMO config body is empty');
}
const text = await body.transformToString();
const config = llmoConfig.parse(JSON.parse(text));
return { config, exists: true, version: res.VersionId || undefined };
const result = llmoConfig.safeParse(JSON.parse(text));
if (!result.success) {
throw new LlmoConfigValidationError(sideId, result.error);
}
return { config: result.data, exists: true, version: res.VersionId || undefined };
}

/**
* Writes the LLMO configuration for a given site.
*
* Validates `config` against the published Zod schema before issuing the S3
* `PutObject`. If validation fails, throws `LlmoConfigValidationError` and
* does not call S3, so invalid configs cannot reach the bucket through this
* function.
*
* @param {string} siteId The ID of the site.
* @param {LLMOConfig} config The configuration object to write.
* @param {S3Client} s3Client The S3 client to use for reading the configuration.
* @param {object} [options]
* @param {string} [options.s3Bucket] Optional S3 bucket name.
* @returns {Promise<{ version: string }>} The version of the configuration written.
* @throws {LlmoConfigValidationError} If `config` does not match the LLMO schema.
*/
export async function writeConfig(siteId, config, s3Client, options) {
const result = llmoConfig.safeParse(config);
if (!result.success) {
throw new LlmoConfigValidationError(siteId, result.error);
}

const s3Bucket = options?.s3Bucket || process.env.S3_BUCKET_NAME;

const putObjectCommand = new PutObjectCommand({
Expand Down
1 change: 1 addition & 0 deletions packages/spacecat-shared-utils/test/index.test.js
Original file line number Diff line number Diff line change
Expand Up @@ -103,6 +103,7 @@ describe('Index Exports', () => {
'isoCalendarWeekMonday',
'isoCalendarWeekSunday',
'llmoConfig',
'LlmoConfigValidationError',
'llmoStrategy',
'logWrapper',
'OPPORTUNITY_DEPENDENCY_MAP',
Expand Down
68 changes: 66 additions & 2 deletions packages/spacecat-shared-utils/test/llmo-config.test.js
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,7 @@ import {
customerConfigV2Path,
readCustomerConfigV2,
writeCustomerConfigV2,
LlmoConfigValidationError,
} from '../src/llmo-config.js';

use(sinonChai);
Expand Down Expand Up @@ -160,13 +161,25 @@ describe('llmo-config utilities', () => {
await expect(readConfig(siteId, s3Client)).rejectedWith(SyntaxError);
});

it('throws when the configuration fails schema validation', async () => {
it('throws LlmoConfigValidationError when the persisted config fails schema validation', async () => {
const body = {
transformToString: sinon.stub().resolves(JSON.stringify({ entities: {} })),
};
s3Client.send.resolves({ Body: body });

await expect(readConfig(siteId, s3Client)).rejectedWith(Error);
let caught;
try {
await readConfig(siteId, s3Client);
} catch (e) {
caught = e;
}

expect(caught).instanceOf(LlmoConfigValidationError);
expect(caught.name).equals('LlmoConfigValidationError');
expect(caught.siteId).equals(siteId);
expect(caught.issues).to.be.an('array').with.length.greaterThan(0);
expect(caught.cause).to.exist;
expect(caught.cause.issues).to.equal(caught.issues);
});
});

Expand Down Expand Up @@ -200,6 +213,57 @@ describe('llmo-config utilities', () => {

await expect(writeConfig(siteId, validConfig, s3Client)).rejectedWith('Failed to get version ID after writing LLMO config');
});

it('throws LlmoConfigValidationError when category region is invalid', async () => {
const invalidConfig = {
...validConfig,
categories: {
'550e8400-e29b-41d4-a716-446655440000': { name: 'brand', region: 'en-us' },
},
};

await expect(writeConfig(siteId, invalidConfig, s3Client))
.rejectedWith(LlmoConfigValidationError);
expect(s3Client.send).not.called;
});

it('throws LlmoConfigValidationError when a required field is missing', async () => {
const invalidConfig = { ...validConfig };
delete invalidConfig.entities;

await expect(writeConfig(siteId, invalidConfig, s3Client))
.rejectedWith(LlmoConfigValidationError);
expect(s3Client.send).not.called;
});

it('LlmoConfigValidationError carries siteId, name, Zod issues, and cause', async () => {
const invalidConfig = {
...validConfig,
categories: {
'550e8400-e29b-41d4-a716-446655440000': { name: 'brand', region: 'en-us' },
},
};

let caught;
try {
await writeConfig(siteId, invalidConfig, s3Client);
} catch (e) {
caught = e;
}

expect(caught).instanceOf(LlmoConfigValidationError);
expect(caught.name).equals('LlmoConfigValidationError');
expect(caught.siteId).equals(siteId);
expect(caught.issues).to.be.an('array').with.length.greaterThan(0);

// Message format: "...failed schema validation: <path>: <code>"
expect(caught.message).to.include(siteId);
expect(caught.message).to.match(/categories\..+\.region/);

// Original ZodError preserved as cause for stack-chain debugging.
expect(caught.cause).to.exist;
expect(caught.cause.issues).to.equal(caught.issues);
});
});

describe('customerConfigV2Path', () => {
Expand Down
Loading