Context Update — 2026-03-03
Original framing of this issue was based on a false premise.
This issue was filed (2026-03-02) after observing idle DB connections on k8s-us-prod-1 and assuming the fix in #34490 (shipped in 26.02.27-01) was already deployed. The implication was that connections were still leaking post-fix, and therefore an additional mechanism (query timeouts) was required.
What the data actually shows: A fleet-wide sweep of all 177 dotCMS StatefulSets on k8s-us-prod-1 found that every affected instance is running 26.02.17-01_2263c4e. The 26.02.17-01 image was built 10 days before the fix shipped in 26.02.27-01. These pods have never received the #34490 fix. The observed connections are pre-fix behavior, not a post-fix regression.
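For reference, the sweep reduces to listing each StatefulSet's container image. A hypothetical one-liner (the `app=dotcms` label selector and first-container index are assumptions; adjust to the actual labels on `k8s-us-prod-1`):

```sh
# Count pods per image tag across all namespaces — label and container
# index are assumptions, not the verified production selectors.
kubectl get statefulsets --all-namespaces -l app=dotcms \
  -o jsonpath='{range .items[*]}{.spec.template.spec.containers[0].image}{"\n"}{end}' \
  | sort | uniq -c | sort -rn
```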
Does #34490 actually fix the observed problem?
Yes — with high confidence. Code review of the merged PR confirms two independent fixes that together close the structural leak:
- Structural fix (unconditional): `DBMetricType.getValue()` now wraps the query call in `DbConnectionFactory.wrapConnection()` on the executor thread. This makes connection lifecycle management explicit and independent of `@CloseDBIfOpened` firing on CDI proxies (which it silently fails to do). This fix is unconditionally correct — no timing dependency, no replication required to validate.
- Race condition fix: The `wrapConnection()` finally block now calls `Thread.interrupted()` to clear the interrupted status before running `closeSilently()`, then restores the flag after. The executor timeout path now checks `future.isDone()` before calling `cancel(true)`. The shutdown sequence uses `executor.shutdown()` + `awaitTermination(5s)` before falling back to `shutdownNow()`. Together these eliminate the narrow timing window where an interrupt could arrive mid-cleanup.
The inability to replicate in tests is not a confidence problem for the structural fix — wrapConnection() is unconditionally correct and can be reasoned about statically. The race condition window is also now extremely narrow given the isDone() guard.
All 207 observed orphaned connections have `state = idle` in `pg_stat_activity`. `idle` means the query completed — PostgreSQL finished executing and is waiting for the next command. This confirms the observed failure mode is the structural leak (query completed but connection never returned to the pool), which is exactly what #34490 targets. The still-in-flight case (Mode A below: query still blocking, socket hung, `state = active`) is not present in current production data.
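For verification, the state breakdown can be pulled with a standard `pg_stat_activity` query — a sketch, assuming the database is named `dotcms` as elsewhere in this issue:

```sql
-- State distribution of all connections to the dotcms database.
SELECT state, count(*)
FROM pg_stat_activity
WHERE datname = 'dotcms'
GROUP BY state
ORDER BY count(*) DESC;
```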
Revised priority and framing
| Item | Status |
| --- | --- |
| Fix for observed production issue | Deploy `26.02.27-01` to affected customers — this is the action, not this issue |
| This issue as a whole | Nice-to-have defense-in-depth; not required to fix current production behavior |
| `DB_MAXWAIT`/`DB_MAX_WAIT` naming bug | Still a real, independent bug — worth fixing regardless of the above |
| Query timeout / `setQueryTimeout()` | Valuable for genuinely slow/blocked queries; not the cause of currently observed idle connections |
Recommended sequence:
1. Upgrade affected customers from `26.02.17-01` → `26.02.27-01`
2. Run a `pg_stat_activity` sweep post-upgrade to confirm accumulation stops
3. If confirmed resolved: treat this issue as a standalone defense-in-depth enhancement (particularly the `DB_MAXWAIT` naming fix, which is independently worth shipping)
4. If connections persist post-upgrade: then `setQueryTimeout()` becomes the primary investigation target
Description
Metric queries in `MetricsFactoryImpl` have no per-query timeout and the JDBC socket has no keepalive. An executor thread processing a metric can block indefinitely on JDBC I/O, never run connection cleanup, and permanently orphan a connection that persists in `pg_stat_activity` until the pod is restarted. This directly undermines the fix in #34490.
Investigation also identified an env var naming bug and HikariCP pool configuration issues that compound the risk and affect all connections. These are captured with risk-tiered recommendations.
Failure Modes
Mode A — Query still running (lock wait, slow query)
1. `future.cancel(true)` sets the interrupt flag on the executor thread
2. JDBC is blocking in `Statement.execute()` — pgjdbc does not respond to `Thread.interrupt()` during socket reads
3. `wrapConnection()`'s finally block cannot run until `delegate.execute()` returns
4. The executor thread stays alive holding a checked-out HikariCP connection indefinitely
Fix: `Statement.setQueryTimeout()` — fires a server-side cancel via PostgreSQL's cancel protocol (separate TCP socket). PostgreSQL cancels the query, pgjdbc unblocks and throws, the exception propagates through `wrapConnection()`, cleanup runs, connection returned to HikariCP.
Mode B — Query completed server-side, TCP-level hang
`pg_stat_activity` shows `state = idle` — PostgreSQL has processed the query and sent the result. But pgjdbc is still blocking in `SocketInputStream.read()` waiting to receive those bytes due to a network-level issue (dead TCP session, dropped packet after server sent result).
`Statement.setQueryTimeout()` does not help here — there is no running query to cancel. The original socket read remains blocked indefinitely.
Fix: `tcpKeepAlive=true` in JDBC URL — enables OS-level TCP keepalive probes. When the OS detects a dead connection it closes the socket, pgjdbc throws `SocketException`, the executor thread unblocks, and cleanup runs. This only fires on dead connections and has no effect on legitimate slow queries.
Why Not `socketTimeout`
`socketTimeout` in the JDBC URL is a global TCP read deadline applied to every query in the application — bulk operations, report queries, content search — all would be killed at the threshold. The `setQueryTimeout()` approach is scoped to metric queries only; `tcpKeepAlive` relies on the OS to distinguish dead connections from slow ones.
Fix 1 — Telemetry Query Timeout
Add `getQueryTimeoutSeconds()` to `TimeoutConfig`
The query timeout is derived from the existing `telemetry.metric.timeout.seconds` — no new configuration property needed. The metric timeout is the operator-visible contract (outer executor deadline); the query timeout is an implementation detail that fires 2 seconds earlier, allowing the resulting exception to propagate and the connection to be returned before the executor deadline fires.
```java
/**
 * Gets the JDBC query timeout for metric DB queries.
 * Derived from the metric timeout with a 2-second buffer so the server-side
 * cancel fires before the executor's future.get() deadline.
 *
 * @return the JDBC query timeout in seconds (always at least 1)
 */
public int getQueryTimeoutSeconds() {
    return Math.max(1, getMetricTimeoutSeconds() - 2);
}
```
| Timeout | Source | Default | Role |
| --- | --- | --- | --- |
| `telemetry.metric.timeout.seconds` | existing config | 10s | executor `future.get()` deadline — outer bound |
| `getQueryTimeoutSeconds()` | derived: `metricTimeout - 2` | 8s | JDBC `setQueryTimeout()` — fires first |
Add `setQueryTimeout(int seconds)` to `DotConnect`
Add a `queryTimeoutSeconds` field (default `0` = no timeout) with a fluent setter. In `executeQuery(Connection conn)`, call `statement.setQueryTimeout(queryTimeoutSeconds)` on both the `PreparedStatement` path (~line 693) and the `Statement` path (~line 719) before execution. When `queryTimeoutSeconds == 0` no timeout is set — existing callers are unaffected.
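A minimal sketch of that change, assuming `DotConnect`'s existing fluent-setter style (surrounding class and both execution paths elided):

```java
// Sketch only — the default of 0 preserves behaviour for existing callers.
private int queryTimeoutSeconds = 0;

public DotConnect setQueryTimeout(final int seconds) {
    this.queryTimeoutSeconds = seconds;
    return this; // fluent, chains with DotConnect's existing setters
}

// In executeQuery(Connection conn), on both the PreparedStatement and
// Statement paths, immediately before execution:
if (queryTimeoutSeconds > 0) {
    statement.setQueryTimeout(queryTimeoutSeconds); // server-side cancel at the deadline
}
```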
Inject `TimeoutConfig` into `MetricsFactoryImpl`
`getValue()` and `getList()` chain `.setQueryTimeout(timeoutConfig.getQueryTimeoutSeconds())` on their metric queries. `getSchemaDBVersion()` does not use `setQueryTimeout` — fast schema lookup, not a leak risk.
Fix 2 — JDBC URL and HikariCP Keepalive
Add `tcpKeepAlive=true` to `DB_BASE_URL`
Enables `SO_KEEPALIVE` on the pgjdbc socket. OS TCP keepalive probes detect dead connections and close the socket, unblocking pgjdbc. The probe interval is governed by OS settings (`tcp_keepalive_time`, `tcp_keepalive_intvl`, `tcp_keepalive_probes`) — tuning these in the k8s pod spec reduces detection time, as sketched below.
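A pod-spec sketch of that tuning — it assumes the kubelet has been configured to allow these namespaced sysctls, which are outside Kubernetes' default safe set, and the values are examples rather than recommendations:

```yaml
spec:
  securityContext:
    sysctls:
      - name: net.ipv4.tcp_keepalive_time    # idle seconds before the first probe
        value: "120"
      - name: net.ipv4.tcp_keepalive_intvl   # seconds between probes
        value: "30"
      - name: net.ipv4.tcp_keepalive_probes  # failed probes before the socket is closed
        value: "3"
```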
Add `keepaliveTime` to `SystemEnvDataSourceStrategy`
The current `DB_IDLE_TIMEOUT = 300s` means connections above `minimumIdle` are retired at 5 minutes — just before the AWS NAT gateway's ~350s idle TCP timeout, so those connections are safe. However, the `minimumIdle` connections (recommended: 1) are never retired by `idleTimeout` and can be idle for hours, becoming silently dead on the NAT without either side knowing.
`keepaliveTime = 120s` pings each idle connection every 2 minutes (via JDBC4 `isValid()`, since no test query is configured — see the `DB_VALIDATION_QUERY` note below), keeping the NAT entry alive. This change has no impact on request latency.
`keepaliveTime` is not currently exposed as an env var. Both `SystemEnvDataSourceStrategy` and `setenv.sh` need to be updated:
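A sketch of the strategy-side plumbing, assuming a plain env lookup for brevity (the real class presumably shares whatever env-with-default helper the other `DB_*` settings use); `setenv.sh` would gain a matching `export DB_KEEPALIVE_TIME=${DB_KEEPALIVE_TIME:-"120000"}` line:

```java
// Sketch only — follows the DB_* pattern described above; the actual helper
// methods in SystemEnvDataSourceStrategy may differ.
// hikariConfig: the HikariConfig instance the strategy is building.
final long keepaliveMs = Long.parseLong(
        System.getenv().getOrDefault("DB_KEEPALIVE_TIME", "120000"));
hikariConfig.setKeepaliveTime(keepaliveMs); // HikariCP pings idle connections at this interval
```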
Note on `DB_VALIDATION_QUERY`: Leave it empty. pgjdbc supports JDBC4 `Connection.isValid()`, which HikariCP uses automatically — free validation with no extra round-trip. Setting `connectionTestQuery` would add overhead to every connection checkout across the entire application.
Bug: `DB_MAXWAIT` / `DB_MAX_WAIT` Naming Mismatch
The `maxLifetime` setting configured in `setenv.sh` is never actually applied. There is a naming mismatch between the env var exported by `setenv.sh` and the name the code reads:
| Location | Variable name |
| --- | --- |
| `setenv.sh` | `DB_MAX_WAIT` (underscore between MAX and WAIT) |
| `DataSourceStrategyProvider.java` line 22 | `DB_MAXWAIT` (no underscore) |
Because these names do not match, `SystemEnvDataSourceStrategy` never finds the `DB_MAX_WAIT` value and always falls back to its hardcoded default of `"60000"` (60 seconds). Every other pool constant follows the underscore convention (`DB_MAX_TOTAL`, `DB_MIN_IDLE`, `DB_IDLE_TIMEOUT`, etc.) — `DB_MAXWAIT` is the sole exception and is incorrect.
Effective `maxLifetime` at runtime: 60 seconds. HikariCP retires and replaces connections every ~60 seconds. With `maximumPoolSize = 200` under load each pod generates up to ~3 reconnects/second, each triggering a `SET application_name` on the new connection. This is the direct cause of the `SET application_name` churn visible in `pg_stat_activity`.
Fix: Rename the constant in `DataSourceStrategyProvider.java`:
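The one-line change, shown as a diff (constant name per the Tier 1 acceptance criteria below):

```diff
- public static final String CONNECTION_DB_MAX_WAIT = "DB_MAXWAIT";
+ public static final String CONNECTION_DB_MAX_WAIT = "DB_MAX_WAIT";
```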
Migration impact: Any k8s ConfigMap or deployment script that explicitly sets `DB_MAXWAIT` must be updated to `DB_MAX_WAIT`. Deployments relying on the `setenv.sh` default (which already uses `DB_MAX_WAIT`) will automatically benefit once the constant is corrected.
Once the naming is fixed, the `setenv.sh` default of 900s becomes effective. This should be raised to 1800s as part of the same change (see Tier 1 acceptance criteria).
Related: HikariCP Pool Configuration
The effective pool settings (what actually runs) after accounting for the naming bug:
| Property | Env var (code reads) | setenv.sh exports | Actual runtime value |
| --- | --- | --- | --- |
| `maximumPoolSize` | `DB_MAX_TOTAL` | `DB_MAX_TOTAL=200` | 200 |
| `minimumIdle` | `DB_MIN_IDLE` | `DB_MIN_IDLE=3` | 3 |
| `maxLifetime` | `DB_MAXWAIT` | `DB_MAX_WAIT=900000` ⚠️ mismatch | 60s (hardcoded default — setenv.sh value ignored) |
| `idleTimeout` | `DB_IDLE_TIMEOUT` | `DB_IDLE_TIMEOUT=300000` | 300s |
| `connectionTimeout` | `DB_CONNECTION_TIMEOUT` | `DB_CONNECTION_TIMEOUT=5000` | 5s |
| `connectionTestQuery` | `DB_VALIDATION_QUERY` | `DB_VALIDATION_QUERY=""` | "" (fine — JDBC4 isValid automatic) |
| `keepaliveTime` | `DB_KEEPALIVE_TIME` | not set | not set |
Pool self-regulation
The pool self-regulates around actual load: connections above `minimumIdle` are retired after `idleTimeout = 300s` of idleness. `maximumPoolSize = 200` is a burst ceiling, not a permanent floor — the pool sheds connections within 5 minutes of traffic dropping.
`DB_MIN_IDLE` → 1 (Tier 1, safe)
`minimumIdle = 3` is hard to justify over 1. The only practical use case for keeping warm connections is having one ready for background threads (Quartz, push sync) during quiet periods. One connection covers this — the remaining 2 provide negligible benefit since HikariCP ramps additional connections in parallel rapidly on burst. Reducing to 1 minimises permanent RDS slot consumption and reduces the keepalive concern to a single connection per pod.
`DB_CONNECTION_TIMEOUT` → 30s (Tier 1, safe)
`connectionTimeout = 5s` means any thread requesting a pool connection during a brief burst — publish + telemetry + reindex overlap — has only 5 seconds before HikariCP throws a `SQLTransientConnectionException`. This produces noisy false failures during spikes that would self-resolve within seconds. 30 seconds (HikariCP's own recommended default) gives legitimate requests time to succeed without masking actual pool exhaustion. This change has no effect on leak behaviour.
`DB_MAX_TOTAL = 200` — observe before changing (Tier 2)
`DB_MAX_TOTAL = 200` was likely sized to match the application's peak thread demand. dotCMS runs many concurrent thread pools (Tomcat request handlers, content indexing, push publishing, Quartz, workflow processors, telemetry) that can each hold a database connection simultaneously. Reducing this without observability data risks introducing `connectionTimeout` failures on services that currently work.
Approach: Establish a baseline first by querying `pg_stat_activity` during a content publish, full reindex, and normal traffic window:
```sql
SELECT client_addr, application_name, state, count(*)
FROM pg_stat_activity
WHERE datname = 'dotcms'
GROUP BY client_addr, application_name, state
ORDER BY count(*) DESC;
```
Note: Each pod runs one permanent LISTEN/NOTIFY connection (`JDBCPubSubImpl`) that holds a pool slot indefinitely, appearing as a connection running frequent `SELECT 1` queries. Filter this out when analysing load-driven connection counts — it is 1 fixed connection per pod, not a traffic signal.
Once peak data is available, reduce `DB_MAX_TOTAL` in small increments (200 → 150 → 100) monitoring HikariCP JMX metrics and `connectionTimeout` errors after each step. Tomcat max thread count and async indexing thread pool sizing should be reviewed as part of this work.
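A hypothetical probe for those JMX metrics, assuming HikariCP MBean registration is enabled (`registerMbeans=true`) and the pool is registered under the name `dotcms` — both assumptions to verify against the actual configuration:

```java
import java.lang.management.ManagementFactory;
import javax.management.JMX;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import com.zaxxer.hikari.HikariPoolMXBean;

public class PoolProbe {
    public static void main(String[] args) throws Exception {
        MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
        // HikariCP registers its pool MXBean as "type=Pool (<poolName>)".
        ObjectName name = new ObjectName("com.zaxxer.hikari:type=Pool (dotcms)");
        HikariPoolMXBean pool = JMX.newMXBeanProxy(mbs, name, HikariPoolMXBean.class);
        System.out.printf("active=%d idle=%d total=%d awaiting=%d%n",
                pool.getActiveConnections(), pool.getIdleConnections(),
                pool.getTotalConnections(), pool.getThreadsAwaitingConnection());
    }
}
```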
Related: PostgreSQL Server-Side Timeouts
`idle_in_transaction_session_timeout` — defer (Tier 3)
This setting kills sessions idle inside an open transaction and is the most impactful safety net for stuck transaction leaks. However, dotCMS performs Elasticsearch/OpenSearch indexing operations inside database transactions during content publish. These can hold a transaction open for an extended period depending on content volume and index response time.
Setting any value here before this pattern is remediated would cause catastrophic publish failures. This must be tracked as a separate technical debt item. Once index operations are moved outside transaction boundaries, a starting value of 5–10 minutes is appropriate.
`idle_session_timeout` — after keepalive is confirmed (Tier 3)
With `keepaliveTime = 120s` in place, pooled connections receive a ping every 2 minutes — they will never be idle long enough to trigger this setting. It only catches connections that have fully escaped pool management.
Set to 3600000ms (1 hour) per role once `keepaliveTime` is confirmed working in production:
```sql
ALTER ROLE dotcms_user SET idle_session_timeout = '3600000';
```
Apply per role, not globally, to avoid affecting superuser and monitoring connections. Must be greater than `maxLifetime` (30 min) to avoid killing connections HikariCP intends to keep alive.
Required timeout ordering once all settings are in place:
```
idle_session_timeout (server, 60 min)
  > maxLifetime (HikariCP, 30 min)   — pool retires connections before the server kills them
  > NAT idle timeout (AWS, ~350s)
  > keepaliveTime (HikariCP, 2 min)  — pings keep minimumIdle connections alive through the NAT
```
Temporary Operational Workarounds
These changes can be applied to the StatefulSet and RDS without any code deployment to limit connection accumulation while the permanent fixes are in progress. All StatefulSet changes require a rolling pod restart.
StatefulSet environment variables
Apply all of the following in a single StatefulSet update:
```yaml
env:
  # Fix the DB_MAXWAIT naming bug immediately — the code reads DB_MAXWAIT,
  # not DB_MAX_WAIT. Setting this directly bypasses the setenv.sh mismatch
  # and changes maxLifetime from 60s to 30 minutes, stopping connection churn.
  - name: DB_MAXWAIT
    value: "1800000"
  # Enable TCP keepalive and add a 10-minute socket read deadline as a
  # temporary backstop for Mode B (TCP-level hang). socketTimeout kills any
  # socket read blocked longer than 10 minutes — conservative enough not to
  # affect legitimate operations. Remove socketTimeout once the code fix ships.
  - name: DB_BASE_URL
    value: "jdbc:postgresql://host:5432/dotcms?tcpKeepAlive=true&socketTimeout=600"
  # Reduce permanent idle footprint from 3 connections to 1 per pod.
  - name: DB_MIN_IDLE
    value: "1"
```
Priority order and rationale:
| Change | Impact | Risk |
| --- | --- | --- |
| `DB_MAXWAIT=1800000` | Stops connection churn immediately — single biggest impact | None |
| `tcpKeepAlive=true` in `DB_BASE_URL` | OS detects and closes dead TCP connections (Mode B) | None |
| `socketTimeout=600` in `DB_BASE_URL` | Backstop: unblocks any socket read stuck >10 min | Low — 10 min is conservative |
| `DB_MIN_IDLE=1` | Reduces permanent idle connections per pod | None |
RDS — apply per role (immediate, no restart required)
These take effect for new sessions only. Existing connections are unaffected until they reconnect.
```sql
-- Backstop for any query running longer than 10 minutes.
-- Covers Mode A (slow/locked query) at the database level while the
-- setQueryTimeout() code fix is pending. Conservative value preserves
-- legitimate long-running operations.
ALTER ROLE dotcms_user SET statement_timeout = '600000';

-- Backstop for connections stuck idle inside an open transaction for
-- longer than 30 minutes. This is deliberately conservative given the
-- known indexing-in-transaction anti-pattern — genuine publish transactions
-- do not sit idle in BEGIN for 30 minutes. Provides a safety net for
-- truly stuck connections without risking publish failures.
ALTER ROLE dotcms_user SET idle_in_transaction_session_timeout = '1800000';

-- Backstop for connections that have fully escaped pool management
-- (leaked connections that neither code nor HikariCP will clean up).
-- 2 hours ensures pooled connections cycling at maxLifetime=30min are
-- never affected. Provides a ceiling on how long a leaked connection
-- can persist before RDS terminates it.
ALTER ROLE dotcms_user SET idle_session_timeout = '7200000';
```
Verify settings applied:
```sql
SELECT rolname, rolconfig FROM pg_roles WHERE rolname = 'dotcms_user';
```
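If the role settings need to take effect before natural connection turnover, existing sessions can be recycled — a sketch, and a disruptive one, so a maintenance-window step only; HikariCP opens replacement connections automatically:

```sql
-- Terminates every dotcms_user backend except the current session;
-- new connections inherit the ALTER ROLE settings above.
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE usename = 'dotcms_user'
  AND pid <> pg_backend_pid();
```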
What these workarounds do and do not address
| Failure mode | Workaround coverage | Permanent fix |
| --- | --- | --- |
| Connection churn from 60s maxLifetime | ✅ Fixed by `DB_MAXWAIT=1800000` | `DB_MAXWAIT` rename in code |
| Mode A — slow/locked metric query blocking indefinitely | ⚠️ Partially covered by `statement_timeout=10min` (database kills query eventually) | `setQueryTimeout()` in `DotConnect` |
| Mode B — TCP hang after query completes | ✅ `tcpKeepAlive=true` handles dead connections; `socketTimeout=600` covers edge cases | `tcpKeepAlive=true` in code path |
| Leaked connections persisting indefinitely | ✅ `idle_session_timeout=2h` and `idle_in_transaction_session_timeout=30min` provide a ceiling | `keepaliveTime` + code fixes |
| NAT silently killing minimumIdle connections | ⚠️ Mitigated by `DB_MIN_IDLE=1` reducing exposure; not eliminated without `keepaliveTime` | `keepaliveTime` via `DB_KEEPALIVE_TIME` env var |
Priority
High — without this, the `wrapConnection()` pattern added in #34490 cannot reliably prevent connection orphaning regardless of whether the blocking JDBC call is still in-flight or has already completed server-side. The `DB_MAXWAIT` naming bug means `maxLifetime` is running at 60s on all deployments regardless of `setenv.sh` configuration, causing significant unnecessary connection churn.
Acceptance Criteria
Tier 1 — Safe to apply now
- `getQueryTimeoutSeconds()` added to `TimeoutConfig` — returns `Math.max(1, getMetricTimeoutSeconds() - 2)`, no new config property
- `setQueryTimeout(int seconds)` added to `DotConnect` — default `0` (no timeout), fluent setter, applied to both `PreparedStatement` and `Statement` paths in `executeQuery()`
- `TimeoutConfig` injected into `MetricsFactoryImpl`; `getValue()` and `getList()` chain `.setQueryTimeout(timeoutConfig.getQueryTimeoutSeconds())`
- `getSchemaDBVersion()` does not use `setQueryTimeout`
- `tcpKeepAlive=true` added to `DB_BASE_URL` in k8s ConfigMap; `socketTimeout=600` removed from `DB_BASE_URL` (workaround no longer needed)
- `keepaliveTime` added to `SystemEnvDataSourceStrategy` via env var `DB_KEEPALIVE_TIME`; default `120000` in `setenv.sh`
- `CONNECTION_DB_MAX_WAIT` constant in `DataSourceStrategyProvider.java` renamed from `"DB_MAXWAIT"` to `"DB_MAX_WAIT"` to match `setenv.sh` and naming convention
- `DB_MAX_WAIT` default raised to `1800000` in `setenv.sh`; `DB_MAXWAIT` workaround entry removed from StatefulSet env vars
- `DB_MIN_IDLE` default lowered to `1` in `setenv.sh`
- K8s ConfigMaps and deployment scripts audited for `DB_MAXWAIT` references and updated to `DB_MAX_WAIT`
- Verified: existing `DotConnect` callers without `setQueryTimeout()` are unaffected
- Verified: a slow/locked metric query is cancelled at `metricTimeout - 2s` and the connection is returned to HikariCP
- `DB_CONNECTION_TIMEOUT` default raised to `30000` in `setenv.sh`
Tier 2 — Requires observability data first
- Peak connection count per pod baselined via `pg_stat_activity` across publish, reindex, and normal traffic scenarios (LISTEN/NOTIFY connection filtered out)
- Tomcat max threads and async thread pool configuration reviewed alongside connection pool sizing
- `DB_MAX_TOTAL` reduced incrementally (200 → 150 → 100) based on observed peak, with HikariCP JMX metrics monitored after each step
Tier 3 — Requires code remediation first
- Indexing-in-transaction pattern tracked as technical debt — index operations to be moved outside transaction boundaries or executed post-commit
- `idle_in_transaction_session_timeout` tightened from 30min workaround to 5–10 minutes only after Tier 3 code work is complete
- `idle_session_timeout` tightened from 2h workaround to 1h only after `keepaliveTime` (Tier 1) is confirmed working in production
Context
Companion fix to #34490:
This issue — `setQueryTimeout()` ensures JDBC returns for slow/locked queries; `tcpKeepAlive` handles TCP-level hangs; `keepaliveTime` prevents NAT from silently killing idle connections; the `DB_MAXWAIT` naming fix makes `maxLifetime` configuration effective for the first time; server-side timeouts (Tier 3) catch any connection that escapes pool management entirely.
Tier 1 changes can be applied immediately. Tier 2 and Tier 3 require observability work and code remediation respectively. The temporary operational workarounds above can be applied to production now to reduce RDS connection pressure while the code work proceeds.