Skip to content

fix(training-agent): PostgresReplayStore so vector 016 catches cross-instance replays#3351

Merged
bokelley merged 1 commit intomainfrom
bokelley/replay-store-postgres-3338
Apr 27, 2026
Merged

fix(training-agent): PostgresReplayStore so vector 016 catches cross-instance replays#3351
bokelley merged 1 commit intomainfrom
bokelley/replay-store-postgres-3338

Conversation

@bokelley
Copy link
Copy Markdown
Contributor

What and why

#3346 closed the cross-route replay bug but adcp grade request-signing vector 016 (replayed-nonce) still failed in prod. Root cause: Fly runs min_machines_running = 2 web machines and the per-process InMemoryReplayStore can't see across machines. Probe 1 hits machine A and consumes the nonce; probe 2 gets routed to machine B by Fly's LB, machine B has never seen the nonce locally, accepts.

Swaps to PostgresReplayStore from @adcp/client/signing/server (shipped in adcp-client#1018, 5.21.0+). All instances share one adcp_replay_cache table; the (keyid, scope, nonce) primary key serializes concurrent inserts and ON CONFLICT DO NOTHING returns 'replayed' on the second submission.

What's in this PR

  • Migration 447_adcp_replay_cache.sql(keyid, scope, nonce) PK + expires_at TTL column + two supporting indexes. Schema mirrors the SDK's getReplayStoreMigration() exactly.
  • getReplayStore() singleton in request-signing.ts shared across all per-route authenticators (default / strict / strict-required / strict-forbidden). Safe to share because (keyid, scope, nonce) already partitions by route via the @target-uri-derived scope.
  • startReplayCacheSweeper() wired in index.ts boot path — runs sweepExpiredReplays(pool) every 60s with an unref'd interval (won't keep the loop alive on shutdown). Postgres has no native TTL; without sweeping the table grows unboundedly.

Test plan

  • tsc --noEmit clean
  • server/tests/integration/training-agent-strict.test.ts 15/15 pass (no regression)
  • Post-deploy: npx adcp grade request-signing https://agenticadvertising.org/api/training-agent/mcp-strict --transport mcp --skip-rate-abuse --only 016-replayed-nonce → expect PASS

Closes

#3338 (the remaining cross-instance piece). #3346 closed the cross-route piece earlier.

Sibling work

Upstream adcp-client#1018 is the SDK side that made this trivial — three months of in-flight work landed in 5.21.0 just before we needed it.

…s cross-instance replays

#3346 closed the cross-route bug but vector 016 still failed in prod
because Fly runs min_machines_running = 2 web machines and the
per-process InMemoryReplayStore can't see across machines. Probe 1
hits machine A and consumes the nonce; probe 2 routes to machine B,
which has never seen the nonce locally, accepts.

Swaps to PostgresReplayStore from @adcp/client/signing/server (5.21.0+,
adcp-client#1018). All instances share one adcp_replay_cache table.

Adds:
- Migration 447_adcp_replay_cache.sql — schema mirrors the SDK's
  getReplayStoreMigration() output: (keyid, scope, nonce) PK +
  expires_at TTL column + two indexes.
- startReplayCacheSweeper() called from index.ts at boot — runs
  sweepExpiredReplays every 60s. Postgres has no native TTL.

Singleton replay store across the per-route authenticators (default /
strict / strict-required / strict-forbidden). The (keyid, scope, nonce)
primary key partitions by route via the @target-uri-derived scope, so
sharing the table is safe.

Closes the remaining piece of #3338. Grader vector 016 should pass
against /mcp-strict once deployed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bokelley
Copy link
Copy Markdown
Contributor Author

Post-deploy verification — completed.

Ran the full grader against /api/training-agent/mcp-strict after the post-#3351 deploy (commit 2bb3dcd) landed:

npx adcp grade request-signing https://agenticadvertising.org/api/training-agent/mcp-strict --transport mcp --skip-rate-abuse
→ 31 passed, 2 failed, 6 skipped

Specifically:

  • neg/016-replayed-nonce PASS401 request_signature_replayed. The PostgresReplayStore catches the cross-instance replay this PR was filed to fix.
  • All 8 graded positive vectors pass (the 4 marked SKIP are mcp_mode_flattens_url_edges, not failures).
  • Manual cross-check via .context/demo-signing.mjs against prod: valid signature → 200, replay → 401-replayed, malformed header → 401-malformed, body tampered → 401-digest-mismatch, unsigned → 401-required. All five behaviors as expected.

The two remaining neg/007 and neg/018 failures are the capability-semantics vectors covered separately by #3340 — they PASS on the dedicated /mcp-strict-required and /mcp-strict-forbidden routes:

npx adcp grade request-signing .../mcp-strict-required --only 007-missing-content-digest → PASS (401 request_signature_components_incomplete)
npx adcp grade request-signing .../mcp-strict-forbidden --only 018-digest-covered-when-forbidden → PASS (401 request_signature_components_unexpected)

Net: 33/33 graded vectors pass across the four strict-mode routes, closing #3338.

@bokelley
Copy link
Copy Markdown
Contributor Author

Verification noted — 33/33 graded vectors passing across all four strict-mode routes, #3338 closed. Thanks for the thorough post-deploy rundown.


Generated by Claude Code

@bokelley
Copy link
Copy Markdown
Contributor Author

Glad the verification landed cleanly — 33/33 across all four routes is a solid close on #3338. Nothing further from me here.


Generated by Claude Code

@bokelley
Copy link
Copy Markdown
Contributor Author

Acknowledged — 33/33 is a clean close. Nothing further needed here either.


Generated by Claude Code

@bokelley
Copy link
Copy Markdown
Contributor Author

Noted — closed cleanly, nothing queued on my end either.


Generated by Claude Code

bokelley added a commit that referenced this pull request Apr 28, 2026
…ol (#3379)

#3351 swapped to PostgresReplayStore to close the cross-instance replay
gap (#3338). Worked in production but broke the storyboard runner:
CI runs the full server in-process without initializing a Postgres
pool, and getReplayStore() was unconditionally calling getPool() which
throws "Database not initialized."

Symptom (#3373 storyboards CI):
  signed_requests  ✗ 3P / 28F / 9S  (every positive vector returned
                                     401 because PostgresReplayStore
                                     .insert rejected on the
                                     unavailable pool, verifier failed
                                     closed)

Fix:
- getReplayStore() falls back to InMemoryReplayStore when getPool()
  throws — gated on NODE_ENV !== 'production' so a misconfigured prod
  still fails loudly (preserves the multi-instance protection).
- startReplayCacheSweeper() is a silent no-op when no pool is
  initialized.
- resetRequestSigning() now also clears the cached _replayStore so
  test suites that swap process state stay coherent.

Verified locally:
  signed_requests  ✓ 31P / 9S / 0N/A  (recovered)

Production unaffected — prod always has a pool, so PostgresReplayStore
is used and cross-instance replay protection holds. Confirmed via
adcp grade --only 016-replayed-nonce → still PASS.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bokelley pushed a commit that referenced this pull request Apr 28, 2026
…uthenticator

The comment still referenced per-authenticator InMemoryReplayStore as the
isolation mechanism. Since #3351, production isolation comes from the
@target-uri-derived scope column of the Postgres singleton.

https://claude.ai/code/session_015vh3dBv4iAKcoWj4a16ygU
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant