
ENG-3564: Design doc for dynamic DB credentials via AWS Secrets Manager #8016

Merged
erosselli merged 5 commits into main from erosselli/ENG-3564-design-doc
Apr 29, 2026

@erosselli
Contributor

Summary

  • Adds a design doc for integrating AWS Secrets Manager to enable dynamic database credential rotation without pod restarts.
  • Covers the secret provider abstraction, engine integration (creator pattern), auto-retry on auth failure, and readonly replica credential fallback.
  • No code changes — design doc only.

Test plan

  • Review design doc for completeness and correctness
  • Discuss open questions (alternating user strategy, SQLAlchemy 2.0 migration path)

🤖 Generated with Claude Code

Describes the architecture for allowing Fides to pull DB credentials
from AWS Secrets Manager at runtime, enabling credential rotation
without pod restarts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@vercel
Contributor

vercel Bot commented Apr 23, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

2 Skipped Deployments
| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| fides-plus-nightly | Ignored | Preview | Apr 29, 2026 2:50pm |
| fides-privacy-center | Ignored | | Apr 29, 2026 2:50pm |



Database-specific settings on `DatabaseSettings` reference which secret to use:

- `database.credential_secret_id`: the Secrets Manager secret name/ARN containing the DB credentials. When `secrets.provider` is `"static"`, this is ignored and credentials come from `user`/`password` as today.
Contributor Author


Should the secret IDs live under "database" (and, if we later do this for other credentials such as Redis, under each of those separate sections), or should this be secrets.db_credential_secret_id? Open to thoughts on it.

@erosselli erosselli marked this pull request as ready for review April 23, 2026 14:06
Contributor

@JadeCara JadeCara left a comment


Good metrics and debugging logs around this seem really important. This is one of those things where we would:
a) want to know as soon as it failed
b) want some good signal around why it was failing

Other than that - left a few comments, but this is exciting stuff!

@erosselli erosselli requested a review from a team April 24, 2026 13:18
erosselli and others added 2 commits April 24, 2026 10:50
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…on, circuit breaker, SQLSTATE error detection

- Add stale-while-revalidate fallback when Secrets Manager is unreachable (T-2)
- Wrap secret values in SecretValue class with redacted __repr__/__str__ (T-3)
- Add circuit breaker to prevent retry amplification on bad credentials (T-5)
- Use SQLSTATE 28P01 instead of string matching for auth error detection (T-6)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@erosselli
Contributor Author

/code-review

Contributor

@claude claude Bot left a comment


Design Review: Dynamic Database Credentials via AWS Secrets Manager

This is a well-structured design document. The problem statement is clear, the SecretProvider abstraction is cleanly separated from the database engine layer, and several non-obvious failure modes are addressed thoughtfully (thundering-herd protection, stale-while-revalidate, circuit breaker, SQLSTATE-based error detection). The decision to use the creator pattern rather than a SQLAlchemy event hook is the right call for psycopg2.
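For readers unfamiliar with the stale-while-revalidate fallback mentioned above, here is a minimal sketch of the idea, not the design doc's implementation. The class name, the `ttl_seconds`/`stale_ttl_seconds` parameters, and the injectable `clock` are hypothetical. It also assumes the stale window starts when the TTL expires and is always measured from the last successful fetch, which is one possible answer to the semantics question raised below. Thundering-herd protection (locking) is deliberately left out.

```python
import time
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class StaleWhileRevalidateCache(Generic[T]):
    """Hypothetical cache: serve a fresh value, fall back to stale on fetch failure."""

    def __init__(
        self,
        fetch: Callable[[], T],
        ttl_seconds: float,
        stale_ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._stale_ttl = stale_ttl_seconds
        self._clock = clock
        self._value: Optional[T] = None
        self._fetched_at: Optional[float] = None  # time of last successful fetch
        self._force_refresh = False

    def invalidate(self) -> None:
        # Force the next get() to refetch. _fetched_at is kept on purpose,
        # so invalidating never extends the stale window (assumption).
        self._force_refresh = True

    def get(self) -> T:
        now = self._clock()
        has_value = self._fetched_at is not None
        if has_value and not self._force_refresh and now - self._fetched_at < self._ttl:
            return self._value  # still fresh
        try:
            value = self._fetch()
        except Exception:
            # Provider unreachable: serve the old value only inside the stale window.
            if has_value and now - self._fetched_at < self._ttl + self._stale_ttl:
                return self._value
            raise
        self._value, self._fetched_at, self._force_refresh = value, now, False
        return value
```

Note that in this sketch a failed refetch after `invalidate()` still returns the old value, which may be a credential already known to be bad; whether that is acceptable is part of the semantics the doc would need to pin down.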

Key concerns

Functional gaps before implementation:

  1. boto3 authentication (see inline, line 69): The configuration section doesn't address how the boto3 client authenticates to AWS — IAM role, explicit credentials, or custom endpoint. This is a deployment blocker and needs to be in the design before anyone can implement or test this. A LocalStack endpoint override is also needed for CI.

  2. connect_args forwarding (see inline, line 95): With the creator pattern, connect_args passed to create_engine are not forwarded to the creator callable. SSL settings, keepalive configuration, and any custom type codecs that currently live in connect_args must be explicitly merged into psycopg2.connect() / asyncpg.connect() inside the creator. The design's claim that "all other engine options remain unchanged" is not accurate without this being called out explicitly.
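The connect_args concern can be illustrated with a short sketch. Everything named here is hypothetical: `make_creator`, the `get_credentials` callable, and the `username`/`password` keys (the actual secret field names are an open question in this review). In practice `connect` would be a driver function such as `psycopg2.connect`; it is injected here so the merge logic can be exercised without a database.

```python
from typing import Any, Callable, Mapping


def make_creator(
    connect: Callable[..., Any],
    get_credentials: Callable[[], Mapping[str, str]],
    base_params: Mapping[str, Any],
    connect_args: Mapping[str, Any],
) -> Callable[[], Any]:
    """Build a zero-argument creator that reads credentials on every new connection."""

    def creator() -> Any:
        creds = get_credentials()  # fetched per connection, so rotation is picked up
        # connect_args (SSL, keepalives, ...) must be merged explicitly,
        # because the engine does not forward them to a custom creator.
        params = {**connect_args, **base_params}
        params["user"] = creds["username"]      # assumed field name
        params["password"] = creds["password"]  # assumed field name
        return connect(**params)

    return creator
```

The returned callable is what would be handed to SQLAlchemy as `create_engine(..., creator=creator)`. Dropping the `connect_args` merge would still produce working connections, just without the SSL or keepalive settings, which is why the regression is easy to miss.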

Design clarifications needed:

  1. Secret JSON schema (see inline, line 41): The field names in the Secrets Manager JSON (username/password or user/password?) should be formally specified, not just shown as an example. This affects both the rotation Lambda and any validation the provider should perform on the fetched value.

  2. Stale TTL semantics + invalidate() (see inline, line 54): The reference point for cache_stale_ttl_seconds when invalidate() is called (and the subsequent fetch fails) needs to be defined explicitly to avoid unintended extension of the stale window.

  3. Secrets Manager staging labels (see inline, line 49): The retry-on-auth-failure path implicitly assumes AWSCURRENT is already updated when the old password stops working. This holds for some rotation strategies but not all — worth stating the assumption.

  4. asyncpg SQLSTATE 28000 (see inline, line 128): Aurora/RDS can return 28000 (generic auth failure) instead of 28P01 during rotation. The decision to narrow to 28P01 only should be deliberate and documented.
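As a rough illustration of SQLSTATE-based detection covering both codes, the sketch below checks an exception for a code instead of matching on message text. The helper names are hypothetical. It relies on psycopg2 exceptions exposing `pgcode` and asyncpg exceptions exposing `sqlstate`; whether each driver actually populates the attribute for connection-time failures should be verified before depending on it.

```python
from typing import Optional

# 28P01 = invalid_password, 28000 = invalid_authorization_specification
AUTH_FAILURE_SQLSTATES = frozenset({"28P01", "28000"})


def extract_sqlstate(exc: BaseException) -> Optional[str]:
    """Return the SQLSTATE carried by a driver exception, if any."""
    for attr in ("pgcode", "sqlstate"):  # psycopg2, asyncpg
        code = getattr(exc, attr, None)
        if isinstance(code, str):
            return code
    return None


def is_auth_failure(exc: BaseException) -> bool:
    """True when the exception carries an authentication-failure SQLSTATE."""
    return extract_sqlstate(exc) in AUTH_FAILURE_SQLSTATES
```

Keeping the accepted codes in one set makes the choice between "28P01 only" and "28P01 plus 28000" a single, reviewable line.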

Minor notes

  • The greenlet guard suggestion (line 105) is low priority but would make the failure mode more debuggable if the SQLAlchemy pin is ever changed.
  • The 4-level readonly credential fallback chain (line 74) is correct but complex — a short diagram or table in the doc would help reviewers verify it's right.
  • The __eq__ note on SecretValue (line 41) is primarily a testing concern — fine to defer but worth keeping in mind when writing the test suite.
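To make the `__eq__` note concrete, here is a minimal sketch of what a redacting wrapper with value equality could look like. Only the `SecretValue` name and the redacted `__repr__`/`__str__` come from this PR; the `get_secret_value` accessor, the exact redaction text in `__repr__`, and the equality behavior are assumptions.

```python
import hmac


class SecretValue:
    """Hypothetical wrapper that hides its value from repr() and str()."""

    __slots__ = ("_value",)

    def __init__(self, value: str) -> None:
        self._value = value

    def get_secret_value(self) -> str:
        # Explicit accessor: the only intended way to read the raw value.
        return self._value

    def __repr__(self) -> str:
        return "SecretValue('<redacted>')"

    def __str__(self) -> str:
        return "<redacted>"

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, SecretValue):
            return NotImplemented
        # Constant-time comparison on bytes; lets tests compare two secrets
        # without unwrapping them.
        return hmac.compare_digest(self._value.encode(), other._value.encode())

    def __hash__(self) -> int:
        return hash(self._value)
```

Without `__eq__`, two wrappers holding the same password compare unequal by identity, which tends to push tests toward unwrapping the value and printing it in assertion failures.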

Overall the architecture is sound. The main asks are: fill in the boto3 auth configuration, explicitly call out the connect_args forwarding requirement, and specify the secret JSON schema. Once those are addressed this is ready to implement.


- Document AWS authentication mechanism (boto3 credential chain, required
  IAM permissions, AWSCURRENT staging label) and add endpoint_url config
  for LocalStack support
- Clarify that the creator callable must explicitly forward connect_args
  (SSL, keepalives, JSON codecs) to avoid silent regressions
- Add 1-2s retry delay to cover AWSPENDING → AWSCURRENT propagation window
- Note that SQLSTATE 28000 should also trigger retry for Aurora/RDS

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@daveqnet
Contributor

daveqnet commented Apr 29, 2026

Hi @erosselli, thanks for putting this together, this is a solid design and plan. I've some security recommendations which I'd like you to consider. Not blocking.

Please remember not to discuss any security issues with existing/legacy code here in a public PR (nudge me on an internal channel).

Credential leakage

The proposed SecretValue wrapper handles the obvious cases - logger.info(secret) will print <redacted>, which is the right idea - but won't cover everything. You will understand the low-level code details here better than me, but Claude is telling me that pydantic serialization, frame locals in exceptions, and driver-constructed exceptions will all bypass the wrapper as currently proposed.

Would it be possible for the design to commit to:

  1. The DB password must never appear in any log, traceback or exception — at any log level, including DEBUG, and including driver-level errors from psycopg2 and asyncpg.
  2. Define what can be logged for credential operations e.g. secret ID and PostgreSQL error code.
  3. Test plan should include a forced auth failure that captures all log/error output and asserts the password string doesn't appear. It'd need to cover both psycopg2 and asyncpg paths.
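The shape of the forced-auth-failure test in point 3 might look like the sketch below. It is an illustration only: `fake_connect` is a hypothetical stand-in for a real psycopg2 or asyncpg call, and the logger name, secret ID, and password are made up. A real test would run against both drivers and also capture stderr and any exception text that is serialized.

```python
import io
import logging
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def captured_logs(level: int = logging.DEBUG) -> Iterator[io.StringIO]:
    """Capture everything sent to the root logger, down to DEBUG."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    root = logging.getLogger()
    old_level = root.level
    root.addHandler(handler)
    root.setLevel(level)
    try:
        yield stream
    finally:
        root.removeHandler(handler)
        root.setLevel(old_level)


def fake_connect(user: str, password: str) -> None:
    # Hypothetical stand-in for a driver call that rejects the credentials.
    raise ConnectionError("password authentication failed (SQLSTATE 28P01)")


def run_forced_auth_failure(password: str) -> str:
    """Force an auth failure and return all log output it produced."""
    log = logging.getLogger("db.credentials")
    with captured_logs() as stream:
        try:
            fake_connect("fides", password)
        except ConnectionError:
            # Log only allow-listed fields: the secret ID and the error itself.
            log.exception("connect failed for secret_id=%s", "db/creds")
    return stream.getvalue()
```

The assertion is then simply that the password string is absent from the captured output, while the secret ID and the PostgreSQL error code are present.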

Silent failures

Both of these are about silent failures, but at different layers. The first is a user/customer deployer risk. The second is a developer risk.

  1. Log at WARN/WARNING when config is incoherent e.g. credential_secret_id is set but secrets.provider is still static. Not a startup failure (customers may stage config before flipping the switch), but a visible log on every startup so it can't quietly end up using the env-var password forever.
  2. Add a TLS-enforcement test. The doc correctly flags that connect_args don't auto-flow through the creator pattern. Easy to miss one and never notice, since the connection still works, just unencrypted. Bring up Postgres in TLS-required mode, run a connection through each engine, assert success.
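The config-coherence warning in point 1 could be as small as the following sketch. The function name and message wording are hypothetical; the only behavior taken from this PR is that `credential_secret_id` is ignored when `secrets.provider` is `"static"`.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def warn_if_incoherent(provider: str, credential_secret_id: Optional[str]) -> bool:
    """Warn (without failing startup) when a secret ID is set but will be ignored.

    Returns True when a warning was emitted.
    """
    if provider == "static" and credential_secret_id:
        # Log the setting names only, never any credential value.
        logger.warning(
            "database.credential_secret_id is set but secrets.provider is "
            "'static'; the secret is ignored and user/password are used"
        )
        return True
    return False
```

Calling this once on every startup keeps a staged configuration visible in the logs until the provider is actually switched.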

- Move from docs/design/ to design-docs/ to avoid accidental inclusion
  in public documentation builds
- Add Section 5: Security Invariants addressing credential leakage
  prevention, loggable fields allow-list, config coherence warning,
  and test requirements (credential leakage + TLS enforcement)
- Add connect_args forwarding note to Section 4

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@erosselli erosselli added this pull request to the merge queue Apr 29, 2026
Merged via the queue into main with commit 99330d0 Apr 29, 2026
46 of 47 checks passed
@erosselli erosselli deleted the erosselli/ENG-3564-design-doc branch April 29, 2026 15:05

3 participants