ENG-3564: Design doc for dynamic DB credentials via AWS Secrets Manager by erosselli · Pull Request #8016 · ethyca/fides

erosselli · 2026-04-23T14:02:52Z

Summary

Adds a design doc for integrating AWS Secrets Manager to enable dynamic database credential rotation without pod restarts.
Covers the secret provider abstraction, engine integration (creator pattern), auto-retry on auth failure, and readonly replica credential fallback.
No code changes — design doc only.

Test plan

Review design doc for completeness and correctness
Discuss open questions (alternating user strategy, SQLAlchemy 2.0 migration path)

🤖 Generated with Claude Code

Describes the architecture for allowing Fides to pull DB credentials from AWS Secrets Manager at runtime, enabling credential rotation without pod restarts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

erosselli · 2026-04-23T14:06:32Z

+
+Database-specific settings on `DatabaseSettings` reference which secret to use:
+
+- `database.credential_secret_id`: the Secrets Manager secret name/ARN containing the DB credentials. When `secrets.provider` is `"static"`, this is ignored and credentials come from `user`/`password` as today.


Should the secret IDs live under `database` (and, if we later do this for other credentials such as Redis, under each of those separate sections), or should this be `secrets.db_credential_secret_id`? Open to thoughts on it.
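To make the two placements concrete, here is a minimal sketch using plain dataclasses. Only `database.credential_secret_id`, `secrets.provider`, and `secrets.db_credential_secret_id` come from the discussion; the class names, the helper, and the non-static provider value used in any example are hypothetical and are not Fides' actual settings classes.

```python
from dataclasses import dataclass
from typing import Optional


# Option A: each service section owns the ID of its own secret.
@dataclass
class DatabaseSettingsA:  # hypothetical stand-in for DatabaseSettings
    user: Optional[str] = None
    password: Optional[str] = None
    credential_secret_id: Optional[str] = None  # database.credential_secret_id


# Option B: every secret ID is grouped under the secrets section.
@dataclass
class SecretsSettingsB:  # hypothetical
    provider: str = "static"  # secrets.provider
    db_credential_secret_id: Optional[str] = None  # secrets.db_credential_secret_id


def uses_dynamic_db_credentials(provider: str, secret_id: Optional[str]) -> bool:
    """Dynamic credentials apply only with a non-static provider and a secret ID."""
    return provider != "static" and bool(secret_id)
```

Either layout feeds the same decision; the difference is only where an operator looks for the setting and how it scales to other services.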

JadeCara

Good metrics and debug logging around this will be really important. This is one of those things where we would:
a) want to know as soon as it fails
b) want good signal about why it is failing

Other than that - left a few comments, but this is exciting stuff!

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…on, circuit breaker, SQLSTATE error detection

- Add stale-while-revalidate fallback when Secrets Manager is unreachable (T-2)
- Wrap secret values in SecretValue class with redacted __repr__/__str__ (T-3)
- Add circuit breaker to prevent retry amplification on bad credentials (T-5)
- Use SQLSTATE 28P01 instead of string matching for auth error detection (T-6)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
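As an illustration of the redacting wrapper mentioned in this commit, the following is a minimal sketch, not the design doc's actual class. The accessor name `get_secret_value` is an assumption made for the example.

```python
class SecretValue:
    """Holds a secret string and keeps it out of repr()/str() output."""

    __slots__ = ("_value",)

    def __init__(self, value: str) -> None:
        self._value = value

    def get_secret_value(self) -> str:
        """Return the raw secret; intended to be called only at the point of use."""
        return self._value

    def __repr__(self) -> str:
        return "SecretValue('<redacted>')"

    def __str__(self) -> str:
        return "<redacted>"
```

This covers direct logging and string formatting of the wrapper itself. It does not protect the raw string once `get_secret_value()` has been called and the result is passed elsewhere.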

erosselli · 2026-04-27T13:18:26Z

/code-review

claude

Design Review: Dynamic Database Credentials via AWS Secrets Manager

This is a well-structured design document. The problem statement is clear, the SecretProvider abstraction is cleanly separated from the database engine layer, and several non-obvious failure modes are addressed thoughtfully (thundering-herd protection, stale-while-revalidate, circuit breaker, SQLSTATE-based error detection). The decision to use the creator pattern rather than a SQLAlchemy event hook is the right call for psycopg2.

Key concerns

Functional gaps before implementation:

- boto3 authentication (see inline, line 69): The configuration section doesn't address how the boto3 client authenticates to AWS: IAM role, explicit credentials, or custom endpoint. This is a deployment blocker and needs to be in the design before anyone can implement or test this. A LocalStack endpoint override is also needed for CI.
- connect_args forwarding (see inline, line 95): With the creator pattern, connect_args passed to create_engine are not forwarded to the creator callable. SSL settings, keepalive configuration, and any custom type codecs that currently live in connect_args must be explicitly merged into psycopg2.connect() / asyncpg.connect() inside the creator. The design's claim that "all other engine options remain unchanged" is not accurate without this being called out explicitly.
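The connect_args point can be sketched as a small factory. This is an illustrative sketch only: `make_creator`, `get_credentials`, and `base_params` are hypothetical names, and the `username`/`password` keys are an assumption, since the secret JSON schema is still unspecified in the design.

```python
from typing import Any, Callable, Dict, Mapping


def make_creator(
    connect: Callable[..., Any],
    get_credentials: Callable[[], Mapping[str, str]],
    base_params: Mapping[str, Any],
    connect_args: Mapping[str, Any],
) -> Callable[[], Any]:
    """Build a zero-argument creator that looks up credentials per connection."""

    def creator() -> Any:
        # Fresh lookup on every new connection, so rotated passwords are used.
        creds = get_credentials()
        # connect_args (SSL, keepalives, ...) must be merged in by hand here,
        # because the engine does not pass them to a creator callable.
        params: Dict[str, Any] = {**base_params, **connect_args}
        params["user"] = creds["username"]  # assumed key name
        params["password"] = creds["password"]  # assumed key name
        return connect(**params)

    return creator
```

With psycopg2, the result would be passed as `create_engine("postgresql+psycopg2://", creator=make_creator(psycopg2.connect, ...))`. Taking `connect` as a parameter keeps the sketch testable with a fake driver.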

Design clarifications needed:

- Secret JSON schema (see inline, line 41): The field names in the Secrets Manager JSON (username/password or user/password?) should be formally specified, not just shown as an example. This affects both the rotation Lambda and any validation the provider should perform on the fetched value.
- Stale TTL semantics + invalidate() (see inline, line 54): The reference point for cache_stale_ttl_seconds when invalidate() is called (and the subsequent fetch fails) needs to be defined explicitly to avoid unintended extension of the stale window.
- Secrets Manager staging labels (see inline, line 49): The retry-on-auth-failure path implicitly assumes AWSCURRENT is already updated when the old password stops working. This holds for some rotation strategies but not all, so the assumption is worth stating.
- asyncpg SQLSTATE 28000 (see inline, line 128): Aurora/RDS can return 28000 (generic auth failure) instead of 28P01 during rotation. The decision to narrow to 28P01 only should be deliberate and documented.
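One possible answer to the stale-TTL question above is to anchor the stale window to the last successful fetch, so that invalidate() can never extend it. The sketch below assumes that semantics; the class and parameter names are hypothetical, the design may choose differently, and the code is not thread-safe.

```python
import time
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class StaleWhileRevalidateCache(Generic[T]):
    """Caches one value; staleness is measured from the last successful fetch."""

    def __init__(self, fetch: Callable[[], T], ttl: float, stale_ttl: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl
        self._stale_ttl = stale_ttl
        self._clock = clock
        self._value: Optional[T] = None
        self._fetched_at: Optional[float] = None  # time of last successful fetch
        self._force_refresh = False

    def invalidate(self) -> None:
        """Force a refetch on the next get() without moving the stale anchor."""
        self._force_refresh = True

    def get(self) -> T:
        now = self._clock()
        age = None if self._fetched_at is None else now - self._fetched_at
        if age is not None and not self._force_refresh and age < self._ttl:
            return self._value  # still fresh
        try:
            value = self._fetch()
        except Exception:
            # Serve the old value only within ttl + stale_ttl of the last success.
            if age is not None and age < self._ttl + self._stale_ttl:
                return self._value
            raise
        self._value, self._fetched_at, self._force_refresh = value, now, False
        return value
```

Under this choice, a failed refetch after invalidate() still serves the old value, but only until the original deadline passes.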

Minor notes

- The greenlet guard suggestion (line 105) is low priority but would make the failure mode more debuggable if the SQLAlchemy pin is ever changed.
- The 4-level readonly credential fallback chain (line 74) is correct but complex; a short diagram or table in the doc would help reviewers verify it's right.
- The __eq__ note on SecretValue (line 41) is primarily a testing concern. It is fine to defer but worth keeping in mind when writing the test suite.

Overall the architecture is sound. The main asks are: fill in the boto3 auth configuration, explicitly call out the connect_args forwarding requirement, and specify the secret JSON schema. Once those are addressed this is ready to implement.

- Document AWS authentication mechanism (boto3 credential chain, required IAM permissions, AWSCURRENT staging label) and add endpoint_url config for LocalStack support
- Clarify that the creator callable must explicitly forward connect_args (SSL, keepalives, JSON codecs) to avoid silent regressions
- Add 1-2s retry delay to cover AWSPENDING → AWSCURRENT propagation window
- Note that SQLSTATE 28000 should also trigger retry for Aurora/RDS

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
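A minimal sketch of SQLSTATE-based auth-failure detection covering both codes named in this commit. The helper names are hypothetical. It relies on psycopg2 exposing the code as `pgcode`, asyncpg as `sqlstate`, and SQLAlchemy keeping the driver error on `.orig`. Whether a driver actually populates the code on connection-time failures is an assumption that should be verified before relying on it.

```python
from typing import Optional

# 28P01 = invalid_password, 28000 = invalid_authorization_specification
AUTH_FAILURE_SQLSTATES = frozenset({"28P01", "28000"})


def sqlstate_of(exc: BaseException) -> Optional[str]:
    """Best-effort SQLSTATE lookup across driver exception types."""
    # SQLAlchemy's DBAPIError keeps the underlying driver exception on `.orig`.
    orig = getattr(exc, "orig", None)
    target = orig if orig is not None else exc
    for attr in ("pgcode", "sqlstate"):  # psycopg2, then asyncpg
        code = getattr(target, attr, None)
        if isinstance(code, str):
            return code
    return None


def is_auth_failure(exc: BaseException) -> bool:
    """True when the exception carries one of the auth-failure SQLSTATEs."""
    return sqlstate_of(exc) in AUTH_FAILURE_SQLSTATES
```

An error with no SQLSTATE at all is treated as a non-auth failure here, so it would not trigger the credential refresh and retry path.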

daveqnet · 2026-04-29T09:45:26Z

Hi @erosselli, thanks for putting this together, this is a solid design and plan. I've some security recommendations which I'd like you to consider. Not blocking.

Please remember not to discuss any security issues with existing/legacy code here in a public PR (nudge me on an internal channel).

Credential leakage

The proposed SecretValue wrapper handles the obvious cases - logger.info(secret) will print <redacted>, which is the right idea - but won't cover everything. You will understand the low-level code details here better than me, but Claude is telling me that pydantic serialization, frame locals in exceptions, and driver-constructed exceptions will all bypass the wrapper as currently proposed.

Would it be possible for the design to commit to:

- The DB password must never appear in any log, traceback or exception, at any log level, including DEBUG, and including driver-level errors from psycopg2 and asyncpg.
- Define what can be logged for credential operations, e.g. secret ID and PostgreSQL error code.
- Test plan should include a forced auth failure that captures all log/error output and asserts the password string doesn't appear. It'd need to cover both psycopg2 and asyncpg paths.
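The forced-auth-failure test above could be built on a helper like the following. This is a sketch under assumptions: the logger name `fides.db` and both function names are hypothetical, and a real test would call it once per driver with an actual failing connection attempt.

```python
import io
import logging
import traceback
from typing import Callable


def capture_failure_output(attempt_connect: Callable[[], object],
                           logger_name: str = "fides.db") -> str:
    """Run attempt_connect() and return all log text plus the formatted traceback."""
    buffer = io.StringIO()
    handler = logging.StreamHandler(buffer)
    logger = logging.getLogger(logger_name)
    old_level = logger.level
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)  # capture everything, including DEBUG
    try:
        try:
            attempt_connect()
        except Exception:
            logger.exception("connection attempt failed")
            buffer.write(traceback.format_exc())
    finally:
        logger.removeHandler(handler)
        logger.setLevel(old_level)
    return buffer.getvalue()


def assert_no_leak(output: str, password: str) -> None:
    """Fail if the raw password appears anywhere in the captured output."""
    assert password not in output, "password leaked into log or traceback output"
```

Note that `traceback.format_exc()` does not render frame locals, so tooling that does (for example, error reporters that attach locals) would need its own check.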

Silent failures

Both of these are about silent failures, but at different layers. The first is a user/customer deployer risk. The second is a developer risk.

- Log at WARN/WARNING when config is incoherent, e.g. credential_secret_id is set but secrets.provider is still static. Not a startup failure (customers may stage config before flipping the switch), but a visible log on every startup so it can't quietly end up using the env-var password forever.
- Add a TLS-enforcement test. The doc correctly flags that connect_args don't auto-flow through the creator pattern. Easy to miss one and never notice, since the connection still works, just unencrypted. Bring up Postgres in TLS-required mode, run a connection through each engine, assert success.
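The config-coherence warning from the first point might look like the sketch below. The function name is hypothetical and the message wording is illustrative; only the two setting names and the "warn, don't fail" behavior come from the comment.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def warn_if_incoherent(provider: str, credential_secret_id: Optional[str]) -> bool:
    """Log a WARNING when a secret ID is set but the provider is still "static".

    Returns True if the warning was emitted. Never raises, so staged
    configuration does not block startup.
    """
    if provider == "static" and credential_secret_id:
        logger.warning(
            "database.credential_secret_id is set but secrets.provider is "
            "'static'; the secret is ignored and static user/password "
            "credentials are used."
        )
        return True
    return False
```

Calling this once during startup would make the mismatch visible on every boot without changing which credentials are used.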

- Move from docs/design/ to design-docs/ to avoid accidental inclusion in public documentation builds
- Add Section 5: Security Invariants addressing credential leakage prevention, loggable fields allow-list, config coherence warning, and test requirements (credential leakage + TLS enforcement)
- Add connect_args forwarding note to Section 4

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

erosselli marked this pull request as ready for review April 23, 2026 14:06

JadeCara approved these changes Apr 23, 2026

erosselli requested a review from a team April 24, 2026 13:18

erosselli and others added 2 commits April 24, 2026 10:50

Fix naming inconsistency: get_credentials() → get_secret()

fda6973

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

claude Bot reviewed Apr 27, 2026

erosselli mentioned this pull request Apr 28, 2026

ENG-3564: SecretProvider abstraction and AWS Secrets Manager provider #8051

Open

8 tasks

erosselli added this pull request to the merge queue Apr 29, 2026

Merged via the queue into main with commit 99330d0 Apr 29, 2026
46 of 47 checks passed

erosselli deleted the erosselli/ENG-3564-design-doc branch April 29, 2026 15:05

erosselli mentioned this pull request May 8, 2026

ENG-3566: Refactor engine creation to use SQLAlchemy creator pattern #8148

Draft

10 tasks

erosselli merged 5 commits into main from erosselli/ENG-3564-design-doc
