[ENG-3281] Add all pools to database healthcheck, add extended pool status data #7856
Conversation
…for prewarming state (can be alerted on, for example)
Code Review
The overall structure is solid — extracting per-pool health checks into small helper functions (_check_sync_session, _check_async_session, _get_async_readonly_prewarming_details) makes the main handler readable, and the new PoolStatus/PoolPrewarming models are clean. The prewarm-awareness logic in ensure_async_readonly_pool_prewarmed is correct (double-checked locking, global guarded inside the async lock).
A few issues worth addressing before merge:
Must Fix

- `database_revision` is silently dropped (line 161) — hardcoding it to `None` is a breaking change for consumers that relied on the migration revision in the health response. If intentional, the field should be removed from the schema and the changelog updated to mention it; otherwise the revision check needs to be preserved.
- `get_readonly_api_session()` failures cause a 500, not a 503 (lines 122–131) — the `try/finally` block has no `except` clause, so any exception thrown while acquiring the readonly session propagates up as an unhandled error rather than being reported as an unhealthy pool. An `except Exception` branch that sets `pools["api_sync_readonly"] = PoolStatus(healthy=False)` would fix this.

Suggestions

- `"skipped"` in `PoolHealth` is never used (line 29) — either use it for non-configured pools or remove it from the literal to avoid a confusing dead value.
- Non-configured pools are reported as `healthy=True` (lines 133, 153) — a pool that was never checked isn't really healthy. This is where `"skipped"` would be the right signal, and it would let callers distinguish "pool is healthy" from "pool is not configured".
- `session = session  # type: AsyncSession` (line 186) — no-op reassignment; use a proper inline type annotation instead.
- Redundant `if/else` in `test_database_healthcheck` (test lines ~3436–3441) — both branches assert identically, so the conditional is dead code.
Minor
The "needs migration" status that the old get_db_health could return is gone without mention in the changelog. If migration-state health reporting is intentionally dropped, it's worth a brief note there.
```python
database="unhealthy" if has_unhealthy_pool else "healthy",
pools=pools,
async_readonly_pool_prewarmed=async_readonly_pool_prewarmed,
database_revision=None,
```
Breaking change: database_revision is now always null
Previously the response populated database_revision with the actual Alembic migration revision (from get_db_health). This hardcodes it to None, silently breaking any consumer of this endpoint that reads database_revision to detect migration state.
If migration-state reporting is intentionally removed, the changelog entry should call that out explicitly and the field should probably be removed from DatabaseHealthCheck rather than left as an always-null optional. If it should still be reported, the revision check needs to be re-added.
This sounds concerning, is dropping migration-state awareness intentional here? Doesn't seem to be part of the ticket. Is claude right about this?
```python
if CONFIG.database.sqlalchemy_readonly_database_uri:
    readonly_db: Optional[Session] = None
    try:
```
Missing except — errors from get_readonly_api_session() will surface as 500 instead of unhealthy
The try/finally block closes the session but doesn't catch exceptions. If get_readonly_api_session() itself raises (e.g. pool exhausted, misconfigured URI), the exception propagates uncaught from database_health and FastAPI returns a 500 rather than a well-formed {"database": "unhealthy", ...} with 503.
The intent is clearly to surface this as an unhealthy pool. Suggest adding an except branch:
```python
try:
    readonly_db = get_readonly_api_session()
    pools["api_sync_readonly"] = PoolStatus(
        healthy=_check_sync_session(readonly_db) == "healthy"
    )
except Exception as error:
    logger.error("Unable to reach sync readonly pool: {}", Pii(str(error)))
    pools["api_sync_readonly"] = PoolStatus(healthy=False)
finally:
    if readonly_db:
        readonly_db.close()
```
```python
CacheHealth = Literal["healthy", "unhealthy", "no cache configured", "skipped"]
PoolHealth = Literal["healthy", "unhealthy", "skipped"]
```
"skipped" is defined in PoolHealth but never returned
All code paths set a pool to either healthy=True or healthy=False via PoolStatus(healthy=...). The PoolHealth literal on the top-level database field never uses "skipped" either. Either use it for non-configured pools (instead of healthy=True) or remove it from the type to avoid confusion.
```python
else:
    pools["api_async_readonly"] = PoolStatus(healthy=True)
```
Unconfigured pools reported as healthy=True — semantically misleading
When async_readonly_database_uri is not set, the pool entry is still added to the response as healthy: true. The same pattern appears for api_sync_readonly at line 133. A pool that doesn't exist isn't "healthy" — it wasn't checked at all. Since PoolHealth already declares "skipped" as a valid literal, this would be a natural place to use it, and consumers could distinguish "configured and healthy" from "not configured".
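To make the distinction concrete, here is a hedged, self-contained sketch of what "skipped" reporting could look like. The `PoolHealth`/`PoolStatus` names mirror the PR, but the model here is a plain dataclass stand-in rather than the actual Pydantic schema, and `report_pool` is a hypothetical helper:

```python
from dataclasses import dataclass
from typing import Literal

PoolHealth = Literal["healthy", "unhealthy", "skipped"]

@dataclass
class PoolStatus:
    health: PoolHealth

def report_pool(configured: bool, check_passed: bool) -> PoolStatus:
    """Map a pool's state to a PoolStatus, using 'skipped' for unconfigured pools."""
    if not configured:
        # Never checked, so neither healthy nor unhealthy
        return PoolStatus(health="skipped")
    return PoolStatus(health="healthy" if check_passed else "unhealthy")

# A pool that was never configured is reported as skipped, not healthy:
print(report_pool(configured=False, check_passed=False).health)  # skipped
print(report_pool(configured=True, check_passed=True).health)    # healthy
```

With this shape, a consumer can treat `"skipped"` as "not applicable" in alerting rules instead of conflating it with a passing check.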
| """Return health status for an async pool-backed session factory.""" | ||
| try: | ||
| async with session_factory() as session: | ||
| session = session # type: AsyncSession |
There was a problem hiding this comment.
No-op reassignment used as a type comment
session = session # type: AsyncSession is a no-op — it reassigns the variable to itself. The # type: comment form is for legacy Python 2-style annotations and has no effect here either. Use a proper inline annotation instead:
```python
async with session_factory() as session:
    session: AsyncSession  # or just rely on the factory's type annotation
    await session.execute(text("SELECT 1"))
```

Or, if the goal is just to satisfy a type checker, annotate `session_factory` more precisely so the checker can infer the session type.
```python
if test_config.database.sqlalchemy_readonly_database_uri:
    assert payload["pools"]["api_sync_readonly"]["healthy"] is True
    assert payload["pools"]["api_sync_readonly"]["prewarming"] is None
else:
    assert payload["pools"]["api_sync_readonly"]["healthy"] is True
    assert payload["pools"]["api_sync_readonly"]["prewarming"] is None
```
Both branches of the if/else are identical — the conditional is dead code
The if test_config.database.sqlalchemy_readonly_database_uri: block and the else: block make exactly the same assertions. The conditional adds no test value and may mask a missing test for the "URI configured" branch (e.g. asserting the pool was actually checked). Simplify to:
```python
assert payload["pools"]["api_sync_readonly"]["healthy"] is True
assert payload["pools"]["api_sync_readonly"]["prewarming"] is None
```
Codecov Report

❌ Your patch status has failed because the patch coverage (66.95%) is below the target coverage (100.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files:

```
@@           Coverage Diff            @@
##             main    #7856    +/-   ##
==========================================
- Coverage   85.06%   85.03%   -0.04%
==========================================
  Files         629      629
  Lines       40861    40971     +110
  Branches     4748     4763      +15
==========================================
+ Hits        34758    34838      +80
- Misses       5030     5052      +22
- Partials     1073     1081      +8
```
nreyes-dev
left a comment
LGTM except for the stuff that Claude pointed to. Particularly the migration thing.
My Claude came up with these two critical things that would need to be fixed for sure before merging this:
- Missing except on sync readonly session — get_readonly_api_session() failure gives a 500 instead of a properly formatted 503. Real bug.
- database_revision / migration check silently dropped — not in scope of the ticket, breaks existing behavior. Should either be preserved or explicitly called out as intentional.
…eck for consistency, uses 'skipped' instead of leaving it for future use, and marks non-configured pools as skipped
/code-review
Code Review: Database Health Endpoint — Per-Pool Details & Readonly Prewarming
This PR expands the /health/database endpoint to report per-pool health, adds PoolStatus/PoolPrewarming response models, and handles optional readonly pools as "skipped". The overall shape is clean and the test coverage is solid. There is one correctness bug that survived the stated "no-op assignment fix", and a few minor issues worth addressing.
Bug (should fix)
database.py:143 — double call to context.get_current_revision()
current_revision is correctly stored on line 141 and returned on line 149, but the if-comparison on line 143 still calls context.get_current_revision() a second time instead of using the stored variable. The commit message describes this as a fixed "no-op assignment" but the fix is incomplete. No observed behavioral bug since the revision is stable between calls, but the code is semantically wrong and the stored value is misleading. Fix: replace context.get_current_revision() on line 143 with current_revision.
Suggestions
health.py:172 — migration_health == "unhealthy" branch appears to be dead code
get_db_health returns "unhealthy" only when the sync DB is unreachable, which also causes the sync primary pool check to return "unhealthy", making has_unhealthy_pool true first. This branch can never be reached. Either add a comment explaining it as an intentional safety net or remove it.
health.py:206 — intermediate sess: AsyncSession variable
Exists only to satisfy the type checker because session_factory: Any makes the async with result untyped. Better to type session_factory properly (e.g. Callable[[], AsyncContextManager[AsyncSession]]) and drop the intermediate variable. See inline comment.
ctl_session.py:207 — lock scope comment
The return ASYNC_READONLY_POOL_WARMED at the end of ensure_async_readonly_pool_prewarmed is read outside the lock. This is safe (Python GIL, bool assignment is atomic), but a brief comment explaining the lock is only guarding against double-warming would help future readers.
test_api.py:test_database_healthcheck — consider splitting the test
The single integration test uses if/else branches for readonly-configured vs. not-configured paths. In most CI environments only one branch exercises. Splitting into two named tests would make failures immediately identifiable from the test name.
What looks good
- `PoolHealth = Literal["healthy", "unhealthy", "skipped"]` is a clean addition; `"skipped"` is correctly excluded from the `has_unhealthy_pool` check.
- `async_readonly_pool_prewarmed = None` when the pool is unconfigured is the right sentinel (distinct from `False` meaning "configured but not warmed").
- The `readonly_db.close()` in a `finally` block correctly handles cleanup even when `_check_sync_session` raises.
- New tests cover all major code paths: unhealthy pool → 503, needs migration → 503, readonly session failure → 503, prewarmed details in response.
- `_get_async_readonly_prewarming_details` correctly guards `target <= 0` to avoid a division-by-zero for `capacity_percentage`.
| """Return health status for an async pool-backed session factory.""" | ||
| try: | ||
| async with session_factory() as raw_session: | ||
| sess: AsyncSession = raw_session |
There was a problem hiding this comment.
src/fides/api/v1/endpoints/health.py:206
The intermediate sess: AsyncSession = raw_session assignment is only here to satisfy the type checker, because session_factory: Any makes the async with result untyped. Consider typing session_factory more precisely instead — e.g.:
```python
from collections.abc import Callable
from contextlib import AbstractAsyncContextManager

async def _check_async_session(
    session_factory: Callable[[], AbstractAsyncContextManager[AsyncSession]]
) -> PoolHealth:
    try:
        async with session_factory() as sess:
            await sess.execute(text("SELECT 1"))
        return "healthy"
    ...
```

This removes the intermediate variable and makes the intent clearer.
```python
    overall_database: DatabaseResponseHealth = "unhealthy"
elif migration_health == "needs migration":
    overall_database = "needs migration"
elif migration_health == "unhealthy":
```
src/fides/api/v1/endpoints/health.py:168-173
Minor: the migration_health == "unhealthy" branch (line 172) is currently dead code. get_db_health in database.py only returns "healthy", "unhealthy", or "needs migration", and the "unhealthy" case from get_db_health is already subsumed by has_unhealthy_pool (since the sync primary pool will also be unhealthy whenever get_db_health returns "unhealthy"). Either document why this branch exists as a safety net, or remove it to avoid confusion.
I think what claude said here is true, isn't it? Approving anyways, worst case this is just a dead branch of the if/else block.
```python
    readonly_async_engine,
)
ASYNC_READONLY_POOL_WARMED = True
```
src/fides/api/db/ctl_session.py:207
ensure_async_readonly_pool_prewarmed always returns ASYNC_READONLY_POOL_WARMED at the end, but the value is read under the lock only within the if not ASYNC_READONLY_POOL_WARMED branch. If the pool was already warmed before this call, the function returns True without entering the lock — which is fine for the warm check, but means the return ASYNC_READONLY_POOL_WARMED at line 207 reads the global without the lock held. This is benign (it's a bool and Python GIL protects single-variable reads), but it's worth a comment explaining that the lock is only needed to prevent double-warming, not to protect the final read.
```python
assert detail["database_revision"] == "9e83545ed9b6"


def test_database_healthcheck_readonly_pool_session_failure_returns_503(
    self,
```
tests/ctl/core/test_api.py:3438 (the test_database_healthcheck function)
The single test function contains if/else branches for both "readonly configured" and "readonly not configured" paths. In CI the test will silently skip the readonly branch assertions if the pool isn't configured. Consider splitting into two separate tests (e.g. test_database_healthcheck_no_readonly and test_database_healthcheck_with_readonly) with appropriate pytest.mark.skipif or fixture-driven conditions, so a failing readonly assertion is immediately identifiable from the test name rather than requiring test output inspection.
/code-review
Code Review: Database Health Pool Details (#7856)
This PR expands the /health/database endpoint from a single database/database_revision response into a structured per-pool health report. The direction is solid — per-pool visibility is a meaningful observability improvement — and the overall structure is clean. A few things worth addressing before merge:
Issues
Redundant flag read after ensure_async_readonly_pool_prewarmed (minor)
ensure_async_readonly_pool_prewarmed() already returns ASYNC_READONLY_POOL_WARMED, but the return value is dropped and is_async_readonly_pool_prewarmed() is called immediately after. The two calls can be collapsed to a single await ensure_async_readonly_pool_prewarmed().
capacity_percentage can exceed 1.0 (minor)
The calculation (checked_in + checked_out) / target does not account for SQLAlchemy pool overflow connections. Values > 1.0 are possible and may surprise consumers. The field should be documented or clamped, or renamed to signal that overflow is possible.
migration_health == "unhealthy" branch appears unreachable (minor)
get_db_health in database.py only produces "healthy" or "needs migration". If this third branch is genuinely reachable it needs a test; if it isn't, it should be removed to avoid confusion.
Health endpoint now has up to 4 sequential pool round-trips with no timeout (suggestion)
A hung pool will stall the entire health response. Consider per-probe timeouts so a slow pool surfaces as "unhealthy" quickly rather than blocking the endpoint.
Primary sync pool SELECT 1 is redundant with get_db_health (nit)
Connectivity to the primary sync pool is already exercised by the migration check. The extra SELECT 1 doesn't add health signal, but it does give the pool a consistent named slot in the response. Worth a brief comment explaining the intent.
ensure_async_readonly_pool_prewarmed return semantics are ambiguous (nit)
False is returned both when the pool is not configured and when it's configured but not yet warmed. A docstring clarifying the three states would help future callers.
What's good
- The `database.py` change avoiding a duplicate `context.get_current_revision()` call is a clean micro-improvement.
- The test rewrite — replacing a single parametrized test with focused, named scenario tests — is a clear readability win and makes failure messages much more actionable.
- The `finally: readonly_db.close()` guard in the sync readonly path correctly prevents session leaks on error.
- Using `Pii(str(error))` consistently in error logs maintains existing PII-scrubbing discipline.
- The `PoolStatus(health="skipped")` pattern for unconfigured optional pools is a clean API contract.
```python
else:
    pools["api_async_readonly"] = PoolStatus(health="skipped")
    async_readonly_pool_prewarmed = None
```
src/fides/api/v1/endpoints/health.py:168
ensure_async_readonly_pool_prewarmed() already returns the current value of ASYNC_READONLY_POOL_WARMED, but the return value is discarded and is_async_readonly_pool_prewarmed() is called immediately after to read the same flag. The two calls can be collapsed:
```python
async_readonly_pool_prewarmed = await ensure_async_readonly_pool_prewarmed()
```

This removes the redundant `is_async_readonly_pool_prewarmed()` call and eliminates the tiny window (irrelevant in practice, but semantically cleaner) between the two reads of the global flag.
```python
capacity_percentage: Optional[float]
if target <= 0:
    capacity_percentage = None
else:
```
src/fides/api/v1/endpoints/health.py:236
capacity_percentage is computed as (checked_in + checked_out) / target, which can exceed 1.0 when SQLAlchemy's max_overflow allows connections beyond the configured pool size. Consumers of this field may reasonably expect a value in [0, 1]. Consider either:
- Clamping: `min(1.0, (checked_in + checked_out) / target)`
- Renaming the field to `pool_utilization_ratio` and documenting that values > 1.0 indicate overflow connections are in use.

At minimum, add a note to `PoolPrewarming.capacity_percentage` that values > 1.0 are possible.
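A minimal sketch of the clamping option, assuming `checked_in`, `checked_out`, and `target` are the plain integers pulled from the pool stats (the function name and `clamp` flag are hypothetical, not from the PR):

```python
from typing import Optional

def capacity_percentage(
    checked_in: int, checked_out: int, target: int, clamp: bool = True
) -> Optional[float]:
    """Pool utilization ratio; with max_overflow in play the raw ratio can exceed 1.0."""
    if target <= 0:
        return None  # avoid division by zero for unsized pools
    ratio = (checked_in + checked_out) / target
    # Clamp so consumers can rely on a [0, 1] range, or expose the raw
    # ratio if overflow usage should be visible to monitoring.
    return min(1.0, ratio) if clamp else ratio

print(capacity_percentage(3, 2, 5))               # 1.0
print(capacity_percentage(3, 4, 5, clamp=False))  # 1.4 (overflow in use)
```

Whichever option is chosen, documenting it on the schema field keeps the API contract unambiguous for dashboard consumers.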
```python
else:
    overall_database = "healthy"


response = DatabaseHealthCheck(
```
src/fides/api/v1/endpoints/health.py:183
The migration_health == "unhealthy" branch appears to be dead code (or at least untested). get_db_health in database.py only sets db_health to "needs migration" or "healthy" — it doesn't appear to return "unhealthy" under normal failure paths. If this branch is genuinely reachable (e.g., via an exception path elsewhere in get_db_health), it should have a corresponding test. If it isn't reachable, it should be removed to avoid confusion.
```diff
@@ -83,21 +114,136 @@ def get_cache_health() -> str:
             },
```
src/fides/api/v1/endpoints/health.py:114
The endpoint now issues up to four sequential network round-trips (two sync pool probes + two async pool probes). If any pool is hung or slow, the health check will block for a correspondingly long time, which can cause monitoring systems to time out and falsely report the service as down.
Consider wrapping each pool probe with asyncio.wait_for(..., timeout=<seconds>) (or a connect_args socket timeout at the engine level) so that a single slow pool surfaces as "unhealthy" quickly rather than stalling the whole response.
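A hedged sketch of the per-probe timeout idea. `probe_with_timeout` and the two probe coroutines are hypothetical stand-ins, not code from the PR; the point is just that `asyncio.wait_for` converts a hung probe into a bounded "unhealthy" result:

```python
import asyncio

async def probe_with_timeout(probe, timeout: float = 2.0) -> str:
    """Run one pool probe, mapping a timeout or any error to 'unhealthy'."""
    try:
        await asyncio.wait_for(probe(), timeout=timeout)
        return "healthy"
    except Exception:  # includes asyncio.TimeoutError
        return "unhealthy"

async def fast_probe():
    await asyncio.sleep(0)  # stands in for a quick SELECT 1

async def hung_probe():
    await asyncio.sleep(60)  # simulates a stalled pool

async def main():
    print(await probe_with_timeout(fast_probe))                # healthy
    print(await probe_with_timeout(hung_probe, timeout=0.05))  # unhealthy

asyncio.run(main())
```

Running the probes concurrently via `asyncio.gather` would additionally cap the whole endpoint at roughly one timeout instead of four.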
```python
return response


def _check_sync_session(db: Session) -> PoolHealth:
```
src/fides/api/v1/endpoints/health.py:198
_check_sync_session runs SELECT 1 on the DI-injected db session that get_db_health has already used (via db.connection()). The connectivity has effectively already been verified by the migration check. The primary sync pool entry does add a named slot in the response, but it doesn't give incremental health signal over what get_db_health already established. This is fine as-is, but worth a brief comment explaining the intent (e.g., "provides a consistent pools entry even though connectivity was already verified by get_db_health").
```python
    )
    ASYNC_READONLY_POOL_WARMED = True


# Lock only serializes warm_async_pool; the flag read below is safe without the lock
```
src/fides/api/db/ctl_session.py:208
The function returns ASYNC_READONLY_POOL_WARMED read outside the lock. The comment explains this is safe within asyncio's single-threaded event loop (no concurrent coroutines can mutate the flag between the lock release and this read). That reasoning is correct, but the two early-return False paths (lines before the lock) mean the function can also return False when the pool is simply not configured — which is indistinguishable from "configured but not yet warmed". This is fine for the current call sites but could be surprising to future callers. A docstring clarifying the three return states ("not configured", "configured but not warmed", "configured and warmed") would help.
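To make the three return states and the lock's role explicit, here is a hedged, self-contained sketch. The globals and the `warm_async_pool` body are simplified stand-ins for what `ctl_session.py` actually does, not the real implementation:

```python
import asyncio

# Hypothetical stand-ins for the module-level state in ctl_session.py
ASYNC_READONLY_POOL_WARMED = False
_WARM_LOCK = asyncio.Lock()
READONLY_CONFIGURED = True

async def warm_async_pool() -> None:
    await asyncio.sleep(0)  # placeholder for opening the pooled connections

async def ensure_async_readonly_pool_prewarmed() -> bool:
    """Three return states: False (pool not configured), False (configured
    but warming has not completed), True (configured and warmed).

    The lock only prevents double-warming; the final read is safe without
    it on a single asyncio event loop.
    """
    global ASYNC_READONLY_POOL_WARMED
    if not READONLY_CONFIGURED:
        return False
    if not ASYNC_READONLY_POOL_WARMED:          # fast path, unlocked check
        async with _WARM_LOCK:
            if not ASYNC_READONLY_POOL_WARMED:  # re-check under the lock
                await warm_async_pool()
                ASYNC_READONLY_POOL_WARMED = True
    return ASYNC_READONLY_POOL_WARMED

print(asyncio.run(ensure_async_readonly_pool_prewarmed()))  # True
```

Spelling out the states in the docstring, as suggested above, keeps future callers from conflating "not configured" with "not yet warmed".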
erosselli
left a comment
This makes sense, thanks! Let's make sure to get CI green, or add our coverage-ignore comments if there are parts of the code we can't test.
Ticket ENG-3281
Description Of Changes
Code Changes
Steps to Confirm
N/A
Pre-Merge Checklist
- `CHANGELOG.md` updated
- Rebased against `main`
- `downgrade()` migration is correct and works