Skip to content

spec(idempotency): rules 9 (concurrent retries) + 10 (downstream reconciliation); IDEMPOTENCY_IN_FLIGHT#4402

Merged
bokelley merged 3 commits into
mainfrom
bokelley/idempotency-blog-notes
May 11, 2026
Merged

spec(idempotency): rules 9 (concurrent retries) + 10 (downstream reconciliation); IDEMPOTENCY_IN_FLIGHT#4402
bokelley merged 3 commits into
mainfrom
bokelley/idempotency-blog-notes

Conversation

@bokelley
Copy link
Copy Markdown
Contributor

@bokelley bokelley commented May 11, 2026

Summary

Closes the last two gaps the existing idempotency rules 1–8 left underspecified, both surfaced by a recent post on idempotency that walks through the failure modes a replay cache doesn't explain:

  • Rule 9 — Concurrent retries / first-insert-wins. A second request carrying the same (authenticated_agent, account_id, idempotency_key) MAY arrive while the first is still executing. Sellers MUST resolve the race deterministically via INSERT … ON CONFLICT DO NOTHING on the scope tuple (with the canonical payload hash persisted at INSERT — not a sentinel — so same-key/different-payload mismatches are detectable before the first request completes), and MAY adopt one of two policies (behaving consistently): wait-and-replay (block the second request, return cached) or reject-and-redirect (return new IDEMPOTENCY_IN_FLIGHT with error.details.retry_after, bounded above by replay_ttl_seconds). In-flight row lifetime is bounded by the seller's declared per-task handler timeout.
  • Rule 10 — Crossing service boundaries / downstream reconciliation. When a seller invokes a downstream system, "errors don't cache" (rule 3) is necessary but not sufficient. Sellers MUST adopt either write-claim-before-invoke or thread-buyer-key (where the threaded key is a per-downstream-provider HMAC derivative, not the buyer's raw key across trust boundaries, and not a single shared seller secret across all downstreams). Best-effort dedup on downstream response inspection is explicitly forbidden. Rule 10 is reviewer-graded (per parallel_dispatch_runner.reviewer_checks), not programmatically graded.
  • New error code IDEMPOTENCY_IN_FLIGHT (held for 3.1 per wire-stability policy). Recovery: transient. Buyers MUST retry with the same key — minting a fresh one turns a safe retry into a double-execution race.

Storyboard coverage

static/compliance/source/universal/idempotency.yaml gains a concurrent_retry phase using two new cross-response check kinds (cross_response_count_distinct, cross_response_field_equal) that operate on the resolved response set across N parallel dispatches. The check kinds live in a separate cross_response_check_kinds enum (not authored_check_kinds) to prevent per-response-interpretation footguns; storyboard-schema.yaml documents the parallel_dispatch step block and the kinds' semantics. The runner contract is in static/compliance/source/test-kits/parallel-dispatch-runner.yaml; runners without parallel-dispatch support grade the phase not_applicable.

Wonderstruck (Python sales-agent) validation

Ran the new rules against the canonical Python sales-agent at wonderstruck.sales-agent.scope3.com:

Expert review

Ran through run-by-experts (4 reviewers in parallel) twice. Round 1 surfaced 1 convergent blocker + 4 single-reviewer must-fix items; all addressed in commit 404fec7322. Round 2 surfaced 0 convergent blockers + 3 single-reviewer must-fix items; all addressed in commit 4d98a0a198. Key fixes:

  • Split cross_response_check_kinds out of authored_check_kinds (blocker: per-response-interpretation footgun)
  • Bound in-flight row lifetime to declared handler timeout (locks-buyer-out footgun)
  • Bound retry_after at replay_ttl_seconds (vacuous-wait footgun)
  • Rule 9 hash-at-INSERT requirement (so same-key/different-payload is detectable mid-flight)
  • Rule 10 per-downstream-provider HMAC derivation (not shared seller secret)
  • Rule 10 strip buyer key from propagated downstream errors
  • Rule 10 fail-closed on ambiguous downstream lookup
  • Rule 10 raw-key clause narrowed to "across trust boundaries" (intra-tenant microservice forwarding permitted)
  • Rule 10 explicitly flagged as reviewer-graded

Follow-up issues

Test plan

  • node scripts/lint-error-codes.cjs, lint-error-code-drift.cjs, lint-storyboard-check-enum.cjs — clean
  • node scripts/build-schemas.cjs, build-compliance.cjs — clean
  • All 14 storyboard authoring lints — clean
  • node --test tests/lint-storyboard-check-enum.test.cjs — 6/6 pass (after function rename)
  • npm test — 35/36 pass; the one failure (TMP serve_window_sec) is pre-existing on main, not introduced by this PR
  • Validated rule 9 behaviorally against Wonderstruck — first-insert-wins works, both dispatches converge on one media_buy_id

🤖 Generated with Claude Code

bokelley and others added 2 commits May 11, 2026 10:08
…m reconciliation); introduce IDEMPOTENCY_IN_FLIGHT

Two new normative rules in L1/security.mdx#idempotency close the
last gaps the existing 1-8 left underspecified — what happens when a
second request arrives while the first is still running, and what a
seller must do about downstream calls whose outcome it lost.

Rule 9 (first-insert-wins). Sellers MUST resolve concurrent same-key
retries deterministically via INSERT ... ON CONFLICT DO NOTHING on
(authenticated_agent, account_id, idempotency_key), and MAY pick one
of two policies: wait-and-replay (block, return cached) or
reject-and-redirect (return new IDEMPOTENCY_IN_FLIGHT with
error.details.retry_after). Verified against the Python sales-agent
(Wonderstruck) — its wait-and-replay implementation passes out of the
box; the new rule documents existing canonical behavior.

Rule 10 (downstream reconciliation). When a seller invokes a downstream
system during request handling, it MUST adopt either write-claim-
before-invoke or thread-buyer-key. Best-effort dedup on downstream
response inspection is explicitly forbidden.

New error code IDEMPOTENCY_IN_FLIGHT (held for 3.1 per wire-stability
policy). Recovery: transient. Buyers MUST retry with the SAME key
after retry_after — minting a fresh key here turns a safe retry into
a double-execution race.

Storyboard concurrent_retry phase added to
static/compliance/source/universal/idempotency.yaml, with two new
cross-response check kinds (cross_response_count_distinct,
cross_response_field_equal). Runner contract documented in
static/compliance/source/test-kits/parallel-dispatch-runner.yaml.
Runners without parallel-dispatch support skip with not_applicable.

Author skill and calling-an-agent.mdx updated so buyers retry rather
than mint fresh keys on the new error code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eck kinds, bound in-flight TTL, tighten rule 10 reconciliation

Expert review (run-by-experts skill, 4 reviewers in parallel) surfaced 3
convergent blockers and 4 single-reviewer must-fix items on the rule-9/10
spec text and storyboard surface. All addressed:

CONVERGENT (2+ reviewers):
- Cross-response check kinds (cross_response_field_equal,
  cross_response_count_distinct) moved out of authored_check_kinds into
  a sibling cross_response_check_kinds enum in runner-output-contract.yaml.
  Prevents the per-response-interpretation footgun. Lint updated to read
  both arrays. Storyboard-schema.yaml gains a dedicated section
  documenting parallel_dispatch step block + the two check kinds.
  Runner-output-contract.yaml expected/actual blocks document both
  kinds' payload shapes.
- In-flight TTL bound: rule 9 now requires sellers to bound in-flight
  row lifetime to their declared per-task handler timeout and release
  the row on timeout. Also caps retry_after at <= replay_ttl_seconds —
  recommends an order of magnitude below the TTL ceiling.
- Rule 10 reviewer-graded note: explicit clause that rule 10 conformance
  is reviewer-graded, not programmatically graded by the storyboard
  suite. parallel_dispatch_runner already lists rule-10 as a
  reviewer_check; this surfaces it in the spec text too.

SINGLE-REVIEWER MUST-FIX:
- Rule 9 hash-availability: spec text now requires sellers to persist
  the canonical payload hash on INSERT (not a sentinel). Closes the
  gap where the SDK store's __adcp_in_flight__ placeholder prevented
  the rule-9-vs-rule-5 boundary from being detectable.
- Rule 10 HMAC derivation rationale: tightened to per-downstream-
  provider derivation (HMAC(K_provider, idempotency_key)) rather than
  a single shared seller secret across all downstreams. Prevents
  cross-provider replay if any single downstream is compromised.
- Rule 10 downstream-error key strip: sellers MUST NOT propagate the
  buyer's idempotency_key (or a reversible derivative) in error
  envelopes originating from a downstream. Closes a cross-trust-
  boundary key-disclosure surface.
- Rule 10 claim-row-without-invoke: if the downstream reports no
  record of downstream_request_id (claim row persisted but seller
  crashed pre-invoke), the seller MUST proceed with the invocation
  rather than fail-closed.
- Storyboard verify_media_buy_count narrative: corrected from
  "only TWO media buys" claim (false after the new concurrent_retry
  phase creates a third) to "captured media_buy_ids both exist and
  are distinct" — matches what the step's query actually asserts.

Plus a transitional note in the changeset clarifying that
SERVICE_UNAVAILABLE + retry_after on the in-flight branch remains
conformant under wait-and-replay; IDEMPOTENCY_IN_FLIGHT is only
required for reject-and-redirect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…l-closed on ambiguous downstream lookup, rename lint helper

Round-2 expert review surfaced 3 single-reviewer must-fix items
(no convergent blockers). All three addressed:

- Rule 10 raw-key clause was too absolute (security review #5). The
  prior MUST-NOT-pass-raw forbade legitimate intra-tenant patterns
  where a seller threads the key into a microservice it operates
  end-to-end (same KMS, same audit log, same operator). Reworded to
  forbid passing across trust boundaries specifically — same-trust-
  principal forwarding is permitted, though per-provider HMAC
  derivation remains the better default.

- Rule 10 claim-row-without-invoke fail-closed on ambiguous lookup
  (security review #7-low). The prior "if downstream reports no
  record, proceed with invocation" wording assumed the downstream's
  lookup response is authentic. Added: on transient 5xx, network
  error, or malformed response, the seller MUST fail closed (return
  a transient error to the buyer per rule 9), not proceed on an
  unauthenticated "no record" signal.

- Renamed lint helper loadAuthoredCheckKinds → loadKnownCheckKinds
  to match its actual behavior (returns the union of authored + cross-
  response check kinds after PR's round-2 split). Pure naming; lint
  output and behavior unchanged. Updated tests/lint-storyboard-
  check-enum.test.cjs to match.

Follow-up filed: #4406 — declare
capabilities.idempotency.in_flight_max_seconds so buyers can compute
retry budgets bounded by the seller's handler-timeout rather than
the much-wider replay_ttl_seconds ceiling. Not blocking this PR;
additive 3.1 capability declaration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bokelley bokelley merged commit 231bc2e into main May 11, 2026
23 checks passed
@bokelley bokelley deleted the bokelley/idempotency-blog-notes branch May 11, 2026 14:49
bokelley added a commit that referenced this pull request May 11, 2026
…onds (#4409)

Closes #4406. Follow-up to #4402 (rules 9 + 10 + IDEMPOTENCY_IN_FLIGHT).

Rule 9 bounds in-flight row lifetime to the seller's per-task handler
timeout but the bound is not buyer-observable today — the existing
capabilities surface declares replay_ttl_seconds only (1h–7d), much
wider than a realistic handler timeout. Buyers retrying on
IDEMPOTENCY_IN_FLIGHT have no way to compute a tight retry budget.

This adds an optional in_flight_max_seconds field to the
IdempotencySupported branch of adcp.idempotency:

- Optional in 3.1 (additive; SDKs that don't see it fall back to
  rule 9's order-of-magnitude SHOULD heuristic).
- Required when supported: true in 4.0 (same migration path
  replay_ttl_seconds followed across 2.x → 3.x).
- Bounded integer >= 1, <= 604800 at the schema layer; cross-field
  bound (<= replay_ttl_seconds) enforced by composed-schema test
  suite since JSON Schema cannot express field-relative bounds.
- Forbidden on IdempotencyUnsupported branch (mirrors the existing
  replay_ttl_seconds treatment — no replay window means no in-flight
  bound).

Rule 9 in security.mdx updated to point at the new capability field
as the primary retry-budget bound when declared (Option A per the
triage design pass); the order-of-magnitude heuristic remains the
fallback when the field is absent.

Composed-schema validation suite gains 5 new tests:
- IdempotencySupported with in_flight_max_seconds accepts
- Rejects in_flight_max_seconds: 0 (below minimum 1)
- Rejects in_flight_max_seconds on unsupported branch
- Schema accepts in_flight_max_seconds > replay_ttl_seconds at the
  schema layer (programmatic cross-field assertion catches it
  separately)
- Cross-field assertion: in_flight_max_seconds > replay_ttl_seconds
  is detected

40/40 tests pass.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant