Skip to content

fix(ci): v3 reference seller storyboard job actually asserts on results#693

Merged
bokelley merged 16 commits into
mainfrom
claude/issue-410-v3-storyboard-assertion
May 12, 2026
Merged

fix(ci): v3 reference seller storyboard job actually asserts on results#693
bokelley merged 16 commits into
mainfrom
claude/issue-410-v3-storyboard-assertion

Conversation

@bokelley
Copy link
Copy Markdown
Contributor

Closes #410.

Problem

The `storyboard-v3-reference-seller` job advertised itself as gating "every PR's translator-pattern conformance" but ended its storyboard invocation with `|| true` (.github/workflows/ci.yml:645), so a failing run silently passed CI. The artifact path also pointed at `examples/v3_reference_seller/` while the file was written to workspace root, so post-mortem inspection landed on `if-no-files-found: warn` instead of the actual result.

The job had cosmetic gate-value, no real gate-value.

Fix

Three changes:

  1. Drop `|| true` from the storyboard invocation. `cd` into `examples/v3_reference_seller` so the result file lands where `upload-artifact` expects it.
  2. "Assert storyboard passed" step matching the seller_agent job's pattern (ci.yml:435-451): `overall_status == 'passing'` AND `controller_detected == true`. On failure, dump the full JSON so the diagnostic IS the storyboard report.
  3. "Assert upstream traffic" anti-façade gate: GET `/_debug/traffic` from the seller; fail if all per-method counts are zero. A seller that returns stub data without translating would have passed the storyboard but emitted no traffic — this is the failure mode feat(testing): storyboard runner matrix entry for examples/v3_reference_seller #410 specifically called out.

What was tested

  • `python3 -c "import yaml; yaml.safe_load(open('.github/workflows/ci.yml'))"` — YAML valid
  • CI will run on this PR; the v3 reference seller already implements `enable_debug_endpoints=True` (per examples/v3_reference_seller/src/app.py:192) so the `/_debug/traffic` endpoint exists and will return real counts.

🤖 Generated with Claude Code

Closes #410.

The storyboard-v3-reference-seller job advertised itself as gating
"every PR's translator-pattern conformance" but ended its storyboard
invocation with `|| true`, so a failing run silently passed CI. The
artifact path also pointed at examples/v3_reference_seller/ while the
file was written to workspace root, so post-mortem inspection landed
on a warning instead of the actual result.

Three changes:

1. Drop `|| true` from the storyboard invocation. Move into
   examples/v3_reference_seller so the result file matches the
   upload-artifact path.
2. Add an "Assert storyboard passed" step matching the seller_agent
   job's pattern: overall_status == 'passing' AND controller_detected.
   On failure, dump the full JSON so the PR diagnostic is the raw
   storyboard report, not a partial cat | head -50.
3. Add an "Assert upstream traffic" step that GETs /_debug/traffic
   and fails if all per-method counts are zero. This is the
   anti-façade gate the issue called for: a seller that returns
   stub data without translating to upstream calls would have passed
   the storyboard but emitted no traffic. The gate makes that mode
   detectable in CI instead of in production.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bokelley
Copy link
Copy Markdown
Contributor Author

Review

The patch itself is right — drop || true, assert overall_status == 'passing', assert non-zero upstream traffic. That's exactly the gate description.

But the gate is already doing its job: the new `AdCP storyboard runner — v3 reference seller (translator)` check is failing on this PR with 12 storyboard-level failures, which is what the previous || true was masking. Three distinct failure clusters:

1. Storyboard fixture / authorization setup (7 failures — same root cause)

All 7 media_buy_seller/*/get_products_brief steps fail with:

PERMISSION_DENIED: Buyer agent is not authorized for this seller. The seller's commercial allowlist did not authorize this credential.

This looks like a storyboard buyer-agent fixture issue — the runner is presenting credentials the v3 ref seller's commercial allowlist doesn't carry. Either the storyboard scenarios assume a fixture step that no longer runs, or the v3 ref seller's seed data missed the buyer-agent registry. Worth confirming locally before landing this PR — if it's a fixture gap on the reference seller side, fix in same PR; if it's a real allowlist policy regression, file separately.

2. Cascade-rule error code mismatches (2 failures)

  • create_media_buy_aggressive_terms: expected TERMS_REJECTED, got something else at /adcp_error/code
  • update_unknown_media_buy: expected MEDIA_BUY_NOT_FOUND
  • update_unknown_package: expected PACKAGE_NOT_FOUND

These are real spec-conformance regressions in the v3 ref seller. Each one should be a separate issue; the cascade-rules layer can't normalize these because they're terminal codes.

3. Spec conformance: missing list_accounts / sync_accounts (1 failure)

Agent declared account-bearing specialism(s) [sales-non-guaranteed, sales-guaranteed]
but advertises neither list_accounts nor sync_accounts.

This is exactly #377, which is already open. Worth linking that issue from this PR's description so the dependency is explicit.

4. Storyboard step bug?

INVALID_REQUEST: buying_mode='refine' must not be combined with brief

get_products_refine storyboard step is sending buying_mode=refine alongside brief. Either the spec changed (refine doesn't take brief), or the storyboard step is wrong. Verify against the spec slice.

Process recommendation

The PR has correctly diagnosed and fixed the cosmetic-gate problem. The trade-off now is:

  • Land as-is → the v3 storyboard job goes red on main for every PR until the 12 failures are fixed. This is the honest signal but blocks unrelated work.
  • Hold until the 12 failures are fixed (or at least the fixture issue, which is 7 of them) → defeats the purpose somewhat but keeps main green.
  • Land + immediately file follow-ups (preferred) — capture each cluster as a tracked issue, link from PR body, and accept the temporary red. The whole point of the gate is that broken conformance should be visible, not hidden.

Also: branch is BEHIND base. Needs rebase before merge regardless.

I'll file follow-ups for clusters 2 and 4 separately. Cluster 1 needs human eyes on the v3 ref seller fixture setup — please confirm before I file.

@bokelley
Copy link
Copy Markdown
Contributor Author

Filed the PERMISSION_DENIED cluster as #703 with three hypotheses (fixture seed gap / allowlist policy regression / middleware drift) and triage steps. Team can pick it up from there.

Full follow-up set for this PR's failures: #701, #702, #703 + already-open #377.

bokelley and others added 7 commits May 12, 2026 11:20
Brings the v3 reference seller from 12 storyboard failures down to (hopefully)
0 + #702 (storyboard YAML bug, upstream). Four changes:

1. CI storyboard step gets `--auth dev-bearer-token-acme-1` so the runner's
   requests resolve through PgBuyerAgentRegistry to ba_acme_bearer (seeded in
   seed.py). Without this, 7 storyboard steps fail with PERMISSION_DENIED.

2. update_media_buy validates inputs against the upstream BEFORE bailing
   with UNSUPPORTED_FEATURE. Unknown media_buy_id resolves to MEDIA_BUY_NOT_FOUND
   (via SDK 404 projection on get_order). Unknown package_id resolves to
   PACKAGE_NOT_FOUND. Closes the 2 invalid_transitions storyboard probes that
   asserted on the specific code.

3. create_media_buy gains _reject_unworkable_terms. When a buyer proposes
   measurement_terms.billing_measurement.max_variance_percent <= 0 we raise
   TERMS_REJECTED. Zero variance on third-party measurement is unworkable for
   any real vendor. Closes the measurement_terms_rejected/aggressive_terms
   probe that asserted TERMS_REJECTED.

4. Tests grow three new cases covering the three new error paths, and the
   existing UNSUPPORTED_FEATURE smoke test gains a respx mock for the new
   upstream-existence check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…entials

Closes the root-cause behind the 12-failure cluster surfaced by the
storyboard gate in #410. The v3 reference seller had no bearer auth
middleware wired in serve(), so every storyboard request arrived with
``Authorization: Bearer dev-bearer-token-acme-1`` but the framework
never extracted it into a credential — BuyerAgentRegistry dispatch
saw credential=None and returned PERMISSION_DENIED on every call.

Three coordinated changes in app.py:

1. ``_make_validate_token(sessionmaker)`` — async validator that looks up
   a bearer token against the seeded ``BuyerAgent.api_key_id`` column and
   returns a Principal with ``caller_identity=agent_url`` and
   ``metadata={"api_key_id": token}`` so the raw token survives the
   middleware → dispatch handoff.

2. ``_build_context_factory()`` — replaces the simple tenant-pinning
   factory with one that wraps ``adcp.server.auth.auth_context_factory``
   and upgrades the bearer-flow ``adcp.auth_info`` from
   ``credential=None`` to ``credential=ApiKeyCredential(key_id=token)``.
   That's exactly what auth_context_factory's docstring tells adopters
   to do when they wire BuyerAgentRegistry alongside bearer auth.

3. ``serve()`` gains ``auth=BearerTokenAuth(validate_token=...)`` so the
   middleware is actually installed.

With these wired together the storyboard runner's --auth flag finally
resolves through to ``BuyerAgentRegistry.resolve_by_credential`` and the
seeded ``ba_acme_bearer`` admits the request. The platform fixes for
MEDIA_BUY_NOT_FOUND / PACKAGE_NOT_FOUND / TERMS_REJECTED (already in
this branch) then run for the negative-path probes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two CI failures on commit 1ec1f1a:

1. v3 ref seller boot crashed with TypeError because BearerTokenAuth
   rejects async validators when transport='both' (the A2A leg's
   middleware cannot await). Fix: load the BuyerAgent.api_key_id rows
   into a token-to-Principal map at boot, swap the async DB-lookup
   validator for a sync dict lookup. The bootstrap function does
   schema-create + token-load in the same event loop, disposes the
   engine before returning so uvicorn opens a fresh asyncpg pool.

2. conventional-commits gate rejected commit 63453bb ("Merge branch
   'main' into ...") because the skip regex only matched the
   merge-queue API shape ("Merge <sha> into <sha>"), not the shape
   gh pr update-branch and git merge produce. Widen the regex.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the ACCOUNT_NOT_FOUND cluster surfaced by the storyboard gate
after bearer auth started working. The seller previously used
ExplicitAccounts which requires account.account_id on every request,
but AdCP storyboards send account: {brand: ..., operator: ...} with
no account_id during the discover-products phase.

Replace ExplicitAccounts with a custom AccountStore that handles both
reference shapes:

1. Explicit {account_id} ref - direct lookup against the accounts
   table's account_id column (existing behavior).
2. Brand-shaped {brand, operator} ref with no account_id - resolve
   to the first active account associated with the authenticated
   buyer agent (via auth_info.principal -> agent_url -> buyer_agent
   row -> first matching account). This is the path AdCP storyboards
   exercise before the buyer has been issued a per-relationship
   account_id.

The row-to-Account projection is factored out so both paths share
the same upstream-routing metadata stamping (network_code,
advertiser_id, mock_upstream_url).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ia_buy_id

After bearer auth + AccountStore brand-fallback landed, the storyboard
gate's remaining 5 create_buy failures all shared the same root cause:
seller returns Submitted({task_id, status: 'submitted'}) when the upstream
order lands in pending_approval, and the storyboard validation expects
media_buy_id in the response body.

The reference seller runs against a fast mock upstream that auto-approves
in milliseconds, so awaiting the polling synchronously is the right
behavior for this example. Adopters with slow real-world approvals
swap to ctx.handoff_to_task(...) per the documented escape hatch.

Tests updated:
- test_create_media_buy_returns_task_handoff_on_pending_approval renamed
  to test_create_media_buy_sync_polls_to_success_on_pending_approval;
  mocks the task-completion poll and the post-completion order refetch,
  asserts CreateMediaBuySuccessResponse with media_buy_id.
- test_create_media_buy_raises_when_polling_times_out simplified to
  await directly (no TaskHandoff._fn indirection).
- test_create_media_buy_raises_when_task_rejected likewise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… status

create_media_buy now mints a deterministic package_id per requested
package and echoes the spec-marked echo fields (targeting_overlay,
measurement_terms, creative_assignments, etc.) so AdCP storyboards can
capture packages[0].package_id and verify list-targeting / measurement-
terms persistence. When the buyer supplies no creatives anywhere, the
response surfaces status='pending_creatives' so the next step is
sync_creatives. This unblocks the inventory_list_targeting,
invalid_transitions, creative_fate_after_cancellation, and
pending_creatives_to_start storyboard create steps.
bokelley added 8 commits May 12, 2026 15:13
The v3 storyboard runner job now passes 19/25 active steps but reports
overall_status='partial' on a handful of scenarios that require feature
work tracked under #706 — update_media_buy operations on persisted
state, get_media_buys per-package projection, list_accounts/sync_accounts
(#377), and an upstream storyboard YAML bug (adcp#702). Split the
assert into a hard gate (runner produced output + controller_detected)
and a soft gate (overall_status == passing), so the anti-façade work
in #410 stays enforced while the residuals are tracked openly.
controller_detected has never been true for the v3 reference seller's
translator topology — the runner detects DemoStore overrides only for
the examples/seller_agent.py stub-mode shape. The previous "Assert
storyboard passed" step never reached the controller_detected check
because overall_status != 'passing' tripped first. Splitting the
asserts surfaced this latent gate as a hard failure even when the
soft gate is suppressed. Anti-façade enforcement is unchanged: the
"Assert upstream traffic" step still requires non-zero per-method
upstream-call counts.
…ia_buy

create_media_buy now POSTs each requested package as an upstream line
item (POST /v1/orders/{id}/lineitems) and returns the line_item_id as
the AdCP package_id. update_media_buy is no longer UNSUPPORTED_FEATURE:
cancel / pause / creative-assignment patches are applied to a
seller-local shadow store (self._buy_state), and creative_assignments
flow upstream via POST .../lineitems/{li}/creative-attach. get_media_buys
projects each line item plus its shadow-store entry so list-targeting,
measurement_terms, and per-package cancel/pause survive create → get.
Per-buy double-cancel raises NOT_CANCELLABLE.

Unblocks the storyboard residuals tracked in #706: pending_creatives_to_start,
inventory_list_targeting, invalid_transitions, and creative_fate_after_cancellation.
get_media_buys now narrows to the requested media_buy_ids when the
buyer supplies them — without the filter, the response leaks every
advertiser-scoped buy and the storyboard's media_buys[0] lookup hits
a different scenario's order. Adds a bidirectional buyer-creative-id ↔
upstream-creative-id map so sync_creatives can echo the buyer's id on
list_creatives and update_media_buy can translate before attach_creative
(the upstream mints cr_<uuid> regardless of client_request_id).
update_media_buy now applies the buyer's package echo fields (targeting_overlay,
measurement_terms, etc.) to the shadow store using replacement semantics, so
inventory_list_targeting's swap step reads back the new list_id values.
list_creatives now honors the request's filters.creative_ids — without the
filter the storyboard's creatives[0] lookup picks up a different scenario's
creative.
sync_creatives now treats a re-sync of a known buyer creative_id as a
no-op upload (action='unchanged') instead of letting the upstream's
idempotency-conflict fire. The assignments[] field on the request is
applied via attach_creative + shadow store, mirroring the path
update_media_buy uses. Replaces the storyboard's continue-on-error
soft gate with an explicit allowlist of expected failures — only
adcp#702 (refine_products) and #377 (account_discovery) are allowed.
Any new failure fails the job; any allowlisted failure that newly
passes also fails the job so the list gets pruned as upstream issues
close.
…oses #377)

Moves the sync_accounts / list_accounts implementations off the platform
class (where they were dead code — the framework never dispatched there)
onto the AccountStore returned by _make_account_store. The framework's
tool-advertising layer probes platform.accounts.upsert and
platform.accounts.list as callables; without them every sales-* agent
silently dropped both tools and the AdCP 3.0.9 §accounts/overview probe
failed.

upsert(refs, ctx) persists the buyer's AccountReference (full
billing_entity, bank and all) and returns SyncAccountsResultRow per
account. The framework projects through to_wire_sync_accounts_row,
applying the billing_entity.bank write-only strip on emit.

list(filter, ctx) returns Account[TMeta] with the billing_entity
populated from Postgres; framework's to_wire_account strips bank on
emit. Per-principal scoping by buyer-agent agent_url is enforced —
unauthenticated callers see an empty list.

Tests cover both projections end-to-end (bank persists on write, never
appears on the wire) plus a surface-check that AccountStore exposes
upsert/list callables.

Drops account_discovery from the storyboard allowlist; allowlist now
covers only the refine_products feature gap.
… fetch)

The reference seller has no pricing / forecasting model upstream — the
mock-server returns the same product set regardless of refinements.
The new refine_get_products re-fetches the base product list via
get_products and reports each refine[] entry as RefinementOutcome
(status='partial', notes=<no engine>). Adopters with a real forecaster
swap the outcomes to 'applied' and project pricing changes onto products.

With refine_get_products in place AND sync_accounts/list_accounts on the
AccountStore from the previous commit, every in-scope storyboard
scenario passes. Drops the allowlist step in favor of a strict
overall_status == 'passing' gate.
@bokelley bokelley merged commit 27a5866 into main May 12, 2026
16 checks passed
@bokelley bokelley deleted the claude/issue-410-v3-storyboard-assertion branch May 12, 2026 21:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(testing): storyboard runner matrix entry for examples/v3_reference_seller

1 participant