Skip to content

fix(health): DatabaseHealthCheck has same executor+shutdownNow() race condition as pre-#34490 telemetry and uses DbConnectionFactory.getConnection() anti-pattern #34839

@spbolton

Description

@spbolton

Problem Statement

DatabaseHealthCheck.performCheck() contains two structural problems that are direct parallels of issues already fixed or tracked elsewhere in the connection management epic.

Problem 1 — Same race condition as pre-#34490 telemetry (lines 67–94)

The health check uses the identical executor + shutdownNow() pattern that #34490 fixed for MetricStatsCollector:

ExecutorService executor = Executors.newSingleThreadExecutor();
Future<String> future = executor.submit(() -> {
    try (Connection conn = DbConnectionFactory.getConnection()) {  // ThreadLocal
        try (var stmt = conn.prepareStatement("SELECT 1")) { ... }
    }
});
try {
    return future.get(2, TimeUnit.SECONDS);
} catch (TimeoutException e) {
    future.cancel(true);         // interrupt to executor thread
} finally {
    executor.shutdownNow();      // same pattern as pre-fix MetricStatsCollector
}

try-with-resources is not interrupt-resilient. If shutdownNow() interrupts the executor thread while conn.close() is executing, the close is abandoned and the connection is orphaned — the same race condition that wrapConnection()'s Thread.interrupted() / restore pattern in DbConnectionFactory was specifically designed to prevent. The health check never received the equivalent fix.

Problem 2 — DbConnectionFactory.getConnection() anti-pattern (#34489)

DbConnectionFactory.getConnection() is a ThreadLocal-first lookup. For an explicit connectivity test, this is wrong:

  1. If the executor thread happens to have a connection in its ThreadLocal (possible with thread reuse), the health check validates the cached connection rather than testing a fresh network path to RDS — it could report "healthy" while connectivity is broken.
  2. After try-with-resources closes the connection, the ThreadLocal on the executor thread still holds the reference. If newSingleThreadExecutor() reuses that thread on the next probe cycle, DbConnectionFactory.connectionExists() returns true on a stale reference.

The correct source for a health check is DbConnectionFactory.getDataSource().getConnection() — direct from the pool, bypassing ThreadLocal entirely.

Problem 3 — PreparedStatement for SELECT 1 instead of Connection.isValid()

conn.prepareStatement("SELECT 1") allocates a PreparedStatement and parses a query. conn.isValid(timeoutSeconds) is the JDBC4 standard for connection health validation — pgjdbc implements it without a PreparedStatement, and it is semantically correct for this use case. The health check should use conn.isValid(2) rather than a manual SQL round-trip.


Impact

Each k8s readiness probe fires every 7 seconds per pod. With 177 StatefulSets on k8s-us-prod-1, this is ~25 probe invocations/second fleet-wide. Under the race condition (2-second timeout + shutdownNow()), a slow RDS response could orphan a connection on every probe — adding a high-frequency source of connection leaks on top of the background thread issues already tracked in this epic.


Files

  • dotCMS/src/main/java/com/dotcms/health/checks/cdi/DatabaseHealthCheck.java — lines 67–94

Acceptance Criteria

  • DbConnectionFactory.getConnection() replaced with DbConnectionFactory.getDataSource().getConnection() — bypasses ThreadLocal, always tests a real pool connection
  • try-with-resources on the connection replaced with wrapConnection() pattern (or explicit interrupt-resilient finally block matching DbConnectionFactory.wrapConnection()) so that future.cancel(true) / shutdownNow() cannot orphan the connection
  • conn.prepareStatement("SELECT 1") replaced with conn.isValid(2) — JDBC4 standard, no PreparedStatement allocation
  • executor.shutdownNow() in finally replaced with graceful executor.shutdown() + awaitTermination() + fallback shutdownNow(), consistent with the pattern in MetricStatsCollector after fix(telemetry): fixtelemetry DB connection leak in MetricStatsColl (#34480) #34490
  • When DatabaseHealthEventManager is in event-driven mode and its cached result is fresher than the probe interval, the health check returns the cached status directly without opening a connection
  • Existing health check behaviour (2-second timeout, DOWN on failure, excluded from liveness probes) is preserved

Metadata

Metadata

Assignees

No one assigned

    Type

    Projects

    Status

    QA

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions