Problem Statement
DatabaseHealthCheck.performCheck() contains three structural problems, two of which are direct parallels of issues already fixed or tracked elsewhere in the connection management epic.
Problem 1 — Same race condition as pre-#34490 telemetry (lines 67–94)
The health check uses the identical executor + shutdownNow() pattern that #34490 fixed for MetricStatsCollector:
    ExecutorService executor = Executors.newSingleThreadExecutor();
    Future<String> future = executor.submit(() -> {
        try (Connection conn = DbConnectionFactory.getConnection()) { // ThreadLocal
            try (var stmt = conn.prepareStatement("SELECT 1")) { ... }
        }
    });
    try {
        return future.get(2, TimeUnit.SECONDS);
    } catch (TimeoutException e) {
        future.cancel(true); // interrupt to executor thread
    } finally {
        executor.shutdownNow(); // same pattern as pre-fix MetricStatsCollector
    }
try-with-resources is not interrupt-resilient. If shutdownNow() interrupts the executor thread while conn.close() is executing, the close is abandoned and the connection is orphaned — the same race condition that wrapConnection()'s Thread.interrupted() / restore pattern in DbConnectionFactory was specifically designed to prevent. The health check never received the equivalent fix.
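The clear-then-restore idea described above can be sketched as follows. This is a minimal illustration, not the actual DbConnectionFactory.wrapConnection() code: the class InterruptSafeClose and the method closeIgnoringInterrupt are hypothetical names, and the fake resource stands in for a JDBC connection.

```java
import java.util.concurrent.atomic.AtomicBoolean;

final class InterruptSafeClose {

    /**
     * Closes the resource with the interrupt flag temporarily cleared,
     * then restores the flag so callers still observe the interruption.
     */
    static void closeIgnoringInterrupt(AutoCloseable resource) {
        // Thread.interrupted() returns the current flag AND clears it.
        boolean wasInterrupted = Thread.interrupted();
        try {
            resource.close();
        } catch (Exception e) {
            // Swallowed in this sketch; real code would log the failure.
        } finally {
            if (wasInterrupted) {
                Thread.currentThread().interrupt(); // restore the flag
            }
        }
    }

    public static void main(String[] args) {
        AtomicBoolean flagSeenDuringClose = new AtomicBoolean(true);
        // Fake resource: records whether close() ran with the flag set.
        AutoCloseable fake = () ->
                flagSeenDuringClose.set(Thread.currentThread().isInterrupted());

        Thread.currentThread().interrupt(); // simulate future.cancel(true)
        closeIgnoringInterrupt(fake);

        System.out.println("flag visible inside close(): " + flagSeenDuringClose.get());
        System.out.println("flag restored afterwards: " + Thread.interrupted());
    }
}
```

The sketch covers a flag that is already set when close() begins. Whether an interrupt that arrives mid-close is tolerated depends on the driver.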
Problem 2 — DbConnectionFactory.getConnection() anti-pattern (#34489)
DbConnectionFactory.getConnection() is a ThreadLocal-first lookup. For an explicit connectivity test, this is wrong:
- If the executor thread happens to have a connection in its ThreadLocal (possible with thread reuse), the health check validates the cached connection rather than testing a fresh network path to RDS — it could report "healthy" while connectivity is broken.
- After try-with-resources closes the connection, the ThreadLocal on the executor thread still holds the reference. If newSingleThreadExecutor() reuses that thread on the next probe cycle, DbConnectionFactory.connectionExists() returns true on a stale reference.
The correct source for a health check is DbConnectionFactory.getDataSource().getConnection() — direct from the pool, bypassing ThreadLocal entirely.
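The stale-reference hazard can be reproduced in isolation. The sketch below does not use DbConnectionFactory. It models a ThreadLocal-first lookup with a hypothetical CACHE field and FakeConnection class, and it assumes the same worker thread serves two consecutive probes.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ThreadLocalStaleDemo {

    /** Stand-in for a connection; only tracks whether it was closed. */
    static final class FakeConnection implements AutoCloseable {
        volatile boolean closed;
        @Override public void close() { closed = true; }
    }

    // Hypothetical ThreadLocal-first cache, modelled on the behaviour described above.
    private static final ThreadLocal<FakeConnection> CACHE = new ThreadLocal<>();

    /** Returns the thread's cached connection if present, else creates and caches one. */
    static FakeConnection getConnection() {
        FakeConnection conn = CACHE.get();
        if (conn == null) {
            conn = new FakeConnection();
            CACHE.set(conn);
        }
        return conn;
    }

    /** Runs two probes on one reused thread; true if the second got a closed connection. */
    static boolean secondProbeSeesClosedConnection() {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            // Probe 1: try-with-resources closes the connection but leaves the ThreadLocal set.
            executor.submit(() -> {
                try (FakeConnection conn = getConnection()) {
                    return conn.closed;
                }
            }).get();
            // Probe 2: same worker thread, so the lookup returns the closed instance.
            return executor.submit(() -> getConnection().closed).get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("second probe got a closed connection: "
                + secondProbeSeesClosedConnection());
    }
}
```

Taking the connection directly from the pool, as recommended above, avoids this because nothing is cached on the worker thread between probes.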
Problem 3 — PreparedStatement for SELECT 1 instead of Connection.isValid()
conn.prepareStatement("SELECT 1") allocates a PreparedStatement and parses a query. conn.isValid(timeoutSeconds) is the JDBC4 standard for connection health validation — pgjdbc implements it without a PreparedStatement, and it is semantically correct for this use case. The health check should use conn.isValid(2) rather than a manual SQL round-trip.
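A minimal sketch of the isValid()-based check, assuming the caller already holds a Connection obtained from the pool. IsValidProbe, isHealthy, and stubConnection are hypothetical names. The stub exists only so the example runs without a database.

```java
import java.lang.reflect.Proxy;
import java.sql.Connection;
import java.sql.SQLException;

final class IsValidProbe {

    /** Returns true only if the driver confirms the connection within the timeout. */
    static boolean isHealthy(Connection conn, int timeoutSeconds) {
        try {
            // JDBC 4.0: the driver chooses its own validation mechanism; timeout is in seconds.
            return conn.isValid(timeoutSeconds);
        } catch (SQLException e) {
            // Per the JDBC Javadoc, isValid throws only for a negative timeout.
            return false;
        }
    }

    /** Demo-only stub: a Connection whose isValid() returns a fixed answer. */
    static Connection stubConnection(boolean valid) {
        return (Connection) Proxy.newProxyInstance(
                IsValidProbe.class.getClassLoader(),
                new Class<?>[] {Connection.class},
                (proxy, method, methodArgs) ->
                        "isValid".equals(method.getName()) ? Boolean.valueOf(valid) : null);
    }

    public static void main(String[] args) {
        System.out.println("reachable database: " + isHealthy(stubConnection(true), 2));
        System.out.println("unreachable database: " + isHealthy(stubConnection(false), 2));
    }
}
```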
Impact
Each k8s readiness probe fires every 7 seconds per pod. With 177 StatefulSets on k8s-us-prod-1, this is ~25 probe invocations/second fleet-wide. Under the race condition (2-second timeout + shutdownNow()), a slow RDS response could orphan a connection on every probe — adding a high-frequency source of connection leaks on top of the background thread issues already tracked in this epic.
Files
dotCMS/src/main/java/com/dotcms/health/checks/cdi/DatabaseHealthCheck.java — lines 67–94
Acceptance Criteria
- DbConnectionFactory.getConnection() replaced with DbConnectionFactory.getDataSource().getConnection(): bypasses ThreadLocal, always tests a real pool connection
- try-with-resources on the connection replaced with the wrapConnection() pattern (or an explicit interrupt-resilient finally block matching DbConnectionFactory.wrapConnection()) so that future.cancel(true) / shutdownNow() cannot orphan the connection
- conn.prepareStatement("SELECT 1") replaced with conn.isValid(2): JDBC4 standard, no PreparedStatement allocation
- executor.shutdownNow() in finally replaced with graceful executor.shutdown() + awaitTermination() + fallback shutdownNow(), consistent with the pattern in MetricStatsCollector after #34490
- If DatabaseHealthEventManager is in event-driven mode and its cached result is fresher than the probe interval, the health check returns the cached status directly without opening a connection
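The graceful shutdown sequence named in the acceptance criteria might look like the sketch below. It follows the general shutdown() / awaitTermination() / shutdownNow() idiom from the ExecutorService documentation and is not copied from MetricStatsCollector. GracefulShutdown and shutdownGracefully are hypothetical names, and the grace period is an assumed parameter.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class GracefulShutdown {

    /**
     * Lets in-flight tasks finish first and interrupts only if they overrun the grace period.
     * Returns true if the executor terminated without needing shutdownNow().
     */
    static boolean shutdownGracefully(ExecutorService executor, long timeout, TimeUnit unit) {
        executor.shutdown(); // stop accepting work; running tasks are not interrupted
        try {
            if (executor.awaitTermination(timeout, unit)) {
                return true;
            }
            executor.shutdownNow(); // fallback: interrupt whatever is still running
            return false;
        } catch (InterruptedException e) {
            executor.shutdownNow();
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            return false;
        }
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        executor.submit(() -> System.out.println("probe finished"));
        System.out.println("clean shutdown: "
                + shutdownGracefully(executor, 1, TimeUnit.SECONDS));
    }
}
```

With this ordering, a probe that is already inside conn.close() gets the grace period to finish before any interrupt is sent.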