ENG-3564: Design doc for dynamic DB credentials via AWS Secrets Manager#8016
ENG-3564: Design doc for dynamic DB credentials via AWS Secrets Manager#8016
Conversation
Describes the architecture for allowing Fides to pull DB credentials from AWS Secrets Manager at runtime, enabling credential rotation without pod restarts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
The latest updates on your projects. Learn more about Vercel for GitHub. 2 Skipped Deployments
|
|
|
||
| Database-specific settings on `DatabaseSettings` reference which secret to use: | ||
|
|
||
| - `database.credential_secret_id`: the Secrets Manager secret name/ARN containing the DB credentials. When `secrets.provider` is `"static"`, this is ignored and credentials come from `user`/`password` as today. |
There was a problem hiding this comment.
should the secret_ids be under "database" ( and in the future if we do this for other creds, e.g redis, under each of those separate sections) , or should this be secrets.db_credential_secret_id ? Open to thoughts on it
JadeCara
left a comment
There was a problem hiding this comment.
This seems like something that having some good metrics/debugging logs around will be really important. This is one of those things we would:
a) want to know as soon as it failed
b) have some good signal around why it was failing
Other than that - left a few comments, but this is exciting stuff!
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…on, circuit breaker, SQLSTATE error detection - Add stale-while-revalidate fallback when Secrets Manager is unreachable (T-2) - Wrap secret values in SecretValue class with redacted __repr__/__str__ (T-3) - Add circuit breaker to prevent retry amplification on bad credentials (T-5) - Use SQLSTATE 28P01 instead of string matching for auth error detection (T-6) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/code-review |
There was a problem hiding this comment.
Design Review: Dynamic Database Credentials via AWS Secrets Manager
This is a well-structured design document. The problem statement is clear, the SecretProvider abstraction is cleanly separated from the database engine layer, and several non-obvious failure modes are addressed thoughtfully (thundering-herd protection, stale-while-revalidate, circuit breaker, SQLSTATE-based error detection). The decision to use the creator pattern rather than a SQLAlchemy event hook is the right call for psycopg2.
Key concerns
Functional gaps before implementation:
-
boto3 authentication (see inline, line 69): The configuration section doesn't address how the boto3 client authenticates to AWS — IAM role, explicit credentials, or custom endpoint. This is a deployment blocker and needs to be in the design before anyone can implement or test this. A LocalStack endpoint override is also needed for CI.
-
connect_argsforwarding (see inline, line 95): With thecreatorpattern,connect_argspassed tocreate_engineare not forwarded to the creator callable. SSL settings, keepalive configuration, and any custom type codecs that currently live inconnect_argsmust be explicitly merged intopsycopg2.connect()/asyncpg.connect()inside the creator. The design's claim that "all other engine options remain unchanged" is not accurate without this being called out explicitly.
Design clarifications needed:
-
Secret JSON schema (see inline, line 41): The field names in the Secrets Manager JSON (
username/passwordoruser/password?) should be formally specified, not just shown as an example. This affects both the rotation Lambda and any validation the provider should perform on the fetched value. -
Stale TTL semantics +
invalidate()(see inline, line 54): The reference point forcache_stale_ttl_secondswheninvalidate()is called (and the subsequent fetch fails) needs to be defined explicitly to avoid unintended extension of the stale window. -
Secrets Manager staging labels (see inline, line 49): The retry-on-auth-failure path implicitly assumes
AWSCURRENTis already updated when the old password stops working. This holds for some rotation strategies but not all — worth stating the assumption. -
asyncpg SQLSTATE
28000(see inline, line 128): Aurora/RDS can return28000(generic auth failure) instead of28P01during rotation. The decision to narrow to28P01only should be deliberate and documented.
Minor notes
- The greenlet guard suggestion (line 105) is low priority but would make the failure mode more debugable if the SQLAlchemy pin is ever changed.
- The 4-level readonly credential fallback chain (line 74) is correct but complex — a short diagram or table in the doc would help reviewers verify it's right.
- The
__eq__note onSecretValue(line 41) is primarily a testing concern — fine to defer but worth keeping in mind when writing the test suite.
Overall the architecture is sound. The main asks are: fill in the boto3 auth configuration, explicitly call out the connect_args forwarding requirement, and specify the secret JSON schema. Once those are addressed this is ready to implement.
🔬 Codegraph: connected (47570 nodes)
💡 Write /code-review in a comment to re-run this review.
- Document AWS authentication mechanism (boto3 credential chain, required IAM permissions, AWSCURRENT staging label) and add endpoint_url config for LocalStack support - Clarify that the creator callable must explicitly forward connect_args (SSL, keepalives, JSON codecs) to avoid silent regressions - Add 1-2s retry delay to cover AWSPENDING → AWSCURRENT propagation window - Note that SQLSTATE 28000 should also trigger retry for Aurora/RDS Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Hi @erosselli, thanks for putting this together, this is a solid design and plan. I've some security recommendations which I'd like you to consider. Not blocking. Please remember not to discuss any security issues with existing/legacy code here in a public PR (nudge me on an internal channel). Credential leakage The proposed Would it be possible for the design to commit to:
Silent failures Both of these are about silent failures, but at different layers. The first is a user/customer deployer risk. The second is a developer risk.
|
- Move from docs/design/ to design-docs/ to avoid accidental inclusion in public documentation builds - Add Section 5: Security Invariants addressing credential leakage prevention, loggable fields allow-list, config coherence warning, and test requirements (credential leakage + TLS enforcement) - Add connect_args forwarding note to Section 4 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Test plan
🤖 Generated with Claude Code