
enhancement(db): Add per-query Statement.setQueryTimeout(), tcpKeepAlive, and HikariCP keepaliveTime as defense-in-depth #34832

@spbolton


Context Update — 2026-03-03

The original framing of this issue was based on a false premise.

This issue was filed (2026-03-02) after observing idle DB connections on k8s-us-prod-1 and assuming the fix in #34490 (shipped in 26.02.27-01) was already deployed. The implication was that connections were still leaking post-fix, and therefore an additional mechanism (query timeouts) was required.

What the data actually shows: A fleet-wide sweep of all 177 dotCMS StatefulSets on k8s-us-prod-1 found that every affected instance is running 26.02.17-01_2263c4e. The 26.02.17-01 image was built 10 days before the fix shipped in 26.02.27-01. These pods have never received the #34490 fix. The observed connections are pre-fix behavior, not a post-fix regression.

Does #34490 actually fix the observed problem?

Yes — with high confidence. Code review of the merged PR confirms two independent fixes that together close the structural leak:

  1. Structural fix (unconditional): DBMetricType.getValue() now wraps the query call in DbConnectionFactory.wrapConnection() on the executor thread. This makes connection lifecycle management explicit and independent of @CloseDBIfOpened firing on CDI proxies (which it silently does not). This fix is unconditionally correct — no timing dependency, no replication required to validate.

  2. Race condition fix: The wrapConnection() finally block now calls Thread.interrupted() to clear the interrupted status before running closeSilently(), then restores the flag after. The executor timeout path now checks future.isDone() before calling cancel(true). The shutdown sequence uses executor.shutdown() + awaitTermination(5s) before falling back to shutdownNow(). Together these eliminate the narrow timing window where an interrupt could arrive mid-cleanup.

The inability to replicate in tests is not a confidence problem for the structural fix — wrapConnection() is unconditionally correct and can be reasoned about statically. The race condition window is also now extremely narrow given the isDone() guard.

All 207 observed orphaned connections have state = idle in pg_stat_activity. idle means the query completed — PostgreSQL finished executing and is waiting for the next command. This confirms the observed failure mode is Mode A (query completed but connection not returned to pool), which is exactly what #34490 targets. Mode B (query still blocking, socket hung, state = active) is not present in current production data.

Revised priority and framing

| Item | Status |
|---|---|
| Fix for observed production issue | Deploy 26.02.27-01 to affected customers — this is the action, not this issue |
| This issue (#34832) — Mode B / TCP hang | Nice-to-have defense-in-depth; not required to fix current production behavior |
| `DB_MAXWAIT`/`DB_MAX_WAIT` naming bug | Still a real, independent bug — worth fixing regardless of the above |
| Query timeout / `setQueryTimeout()` | Valuable for genuinely slow/blocked queries; not the cause of currently observed idle connections |

Recommended sequence:

  1. Upgrade affected customers from 26.02.17-01 to 26.02.27-01
  2. Run pg_stat_activity sweep post-upgrade to confirm accumulation stops
  3. If confirmed resolved: treat this issue as a standalone defense-in-depth enhancement (particularly the DB_MAXWAIT naming fix, which is independently worth shipping)
  4. If connections persist post-upgrade: then setQueryTimeout() becomes the primary investigation target

Description

Metric queries in `MetricsFactoryImpl` have no per-query timeout and the JDBC socket has no keepalive. An executor thread processing a metric can block indefinitely on JDBC I/O, never run connection cleanup, and permanently orphan a connection that persists in `pg_stat_activity` until the pod is restarted. This directly undermines the fix in #34490.

Investigation also identified an env var naming bug and HikariCP pool configuration issues that compound the risk and affect all connections. These are captured below with risk-tiered recommendations.


Failure Modes

Mode A — Query still running (lock wait, slow query)

  1. `future.cancel(true)` sets the interrupt flag on the executor thread
  2. JDBC is blocking in `Statement.execute()` — pgjdbc does not respond to `Thread.interrupt()` during socket reads
  3. `wrapConnection()`'s finally block cannot run until `delegate.execute()` returns
  4. The executor thread stays alive holding a checked-out HikariCP connection indefinitely

Fix: `Statement.setQueryTimeout()` — fires a server-side cancel via PostgreSQL's cancel protocol (separate TCP socket). PostgreSQL cancels the query, pgjdbc unblocks and throws, the exception propagates through `wrapConnection()`, cleanup runs, connection returned to HikariCP.

Mode B — Query completed server-side, TCP-level hang

`pg_stat_activity` shows `state = idle` — PostgreSQL has processed the query and sent the result. But pgjdbc is still blocking in `SocketInputStream.read()` waiting to receive those bytes due to a network-level issue (dead TCP session, dropped packet after server sent result).

`Statement.setQueryTimeout()` does not help here — there is no running query to cancel. The original socket read remains blocked indefinitely.

Fix: `tcpKeepAlive=true` in JDBC URL — enables OS-level TCP keepalive probes. When the OS detects a dead connection it closes the socket, pgjdbc throws `SocketException`, the executor thread unblocks, and cleanup runs. This only fires on dead connections and has no effect on legitimate slow queries.
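
As a minimal sketch of the URL change, a helper like the following (the class and method names are invented for illustration) could append the parameter to an existing `DB_BASE_URL` without clobbering values an operator has already set:

```java
// Illustrative helper only — not an existing dotCMS API. Appends
// tcpKeepAlive=true to a JDBC URL unless the parameter is already present.
class JdbcUrlKeepAlive {

    static String withTcpKeepAlive(final String jdbcUrl) {
        if (jdbcUrl.contains("tcpKeepAlive=")) {
            return jdbcUrl; // already configured — leave the operator's value alone
        }
        // Use '?' for the first query parameter, '&' for subsequent ones.
        final String separator = jdbcUrl.contains("?") ? "&" : "?";
        return jdbcUrl + separator + "tcpKeepAlive=true";
    }
}
```

For example, `withTcpKeepAlive("jdbc:postgresql://host:5432/dotcms")` yields the URL shown in Fix 2 below, while a URL that already carries `tcpKeepAlive=false` is returned unchanged.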


Why Not `socketTimeout`

`socketTimeout` in the JDBC URL is a global TCP read deadline applied to every query in the application — bulk operations, report queries, content search — all would be killed at the threshold. The `setQueryTimeout()` approach is scoped to metric queries only; `tcpKeepAlive` relies on the OS to distinguish dead connections from slow ones.


Fix 1 — Telemetry Query Timeout

Add `getQueryTimeoutSeconds()` to `TimeoutConfig`

The query timeout is derived from the existing `telemetry.metric.timeout.seconds` — no new configuration property needed. The metric timeout is the operator-visible contract (outer executor deadline); the query timeout is an implementation detail that fires 2 seconds earlier, allowing the resulting exception to propagate and the connection to be returned before the executor deadline fires.

/**
 * Gets the JDBC query timeout for metric DB queries.
 * Derived from the metric timeout with a 2-second buffer so the server-side
 * cancel fires before the executor's future.get() deadline.
 *
 * @return the JDBC query timeout in seconds (always at least 1)
 */
public int getQueryTimeoutSeconds() {
    return Math.max(1, getMetricTimeoutSeconds() - 2);
}
| Timeout | Source | Default | Role |
|---|---|---|---|
| `telemetry.metric.timeout.seconds` | existing config | 10s | executor `future.get()` deadline — outer bound |
| `getQueryTimeoutSeconds()` | derived: `metricTimeout - 2` | 8s | JDBC `setQueryTimeout()` — fires first |

Add `setQueryTimeout(int seconds)` to `DotConnect`

Add a `queryTimeoutSeconds` field (default `0` = no timeout) with a fluent setter. In `executeQuery(Connection conn)`, call `statement.setQueryTimeout(queryTimeoutSeconds)` on both the `PreparedStatement` path (~line 693) and the `Statement` path (~line 719) before execution. When `queryTimeoutSeconds == 0` no timeout is set — existing callers are unaffected.
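
A minimal sketch of these additions (the real `DotConnect` has many more members; `applyQueryTimeout()` here stands in for the two call sites in `executeQuery()`):

```java
import java.sql.SQLException;
import java.sql.Statement;

// Simplified sketch of the proposed DotConnect changes, not the real class.
class DotConnectSketch {

    private int queryTimeoutSeconds = 0; // 0 = no timeout; existing callers unaffected

    public DotConnectSketch setQueryTimeout(final int seconds) {
        this.queryTimeoutSeconds = seconds;
        return this; // fluent, matching DotConnect's chaining style
    }

    public int getQueryTimeout() {
        return queryTimeoutSeconds;
    }

    // Would be called on both the PreparedStatement and Statement paths
    // before execution; a zero value leaves the statement untouched.
    void applyQueryTimeout(final Statement statement) throws SQLException {
        if (queryTimeoutSeconds > 0) {
            statement.setQueryTimeout(queryTimeoutSeconds);
        }
    }
}
```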

Inject `TimeoutConfig` into `MetricsFactoryImpl`

@Inject
private TimeoutConfig timeoutConfig;

@Override
public Optional<Object> getValue(final String sqlQuery) throws DotDataException {
    final List<Map<String, Object>> results = new DotConnect()
            .setSQL(sqlQuery)
            .setQueryTimeout(timeoutConfig.getQueryTimeoutSeconds())
            .loadObjectResults();
    // ...
}

@Override
public Optional<List<String>> getList(final String sqlQuery) throws DotDataException {
    final List<Map<String, Object>> results = new DotConnect()
            .setSQL(sqlQuery)
            .setQueryTimeout(timeoutConfig.getQueryTimeoutSeconds())
            .loadObjectResults();
    // ...
}

`getSchemaDBVersion()` does not use `setQueryTimeout` — fast schema lookup, not a leak risk.


Fix 2 — JDBC URL and HikariCP Keepalive

Add `tcpKeepAlive=true` to `DB_BASE_URL`

jdbc:postgresql://host:5432/dotcms?tcpKeepAlive=true

Enables `SO_KEEPALIVE` on the pgjdbc socket. OS TCP keepalive probes detect dead connections and close the socket, unblocking pgjdbc. The probe interval is governed by OS settings (`tcp_keepalive_time`, `tcp_keepalive_intvl`, `tcp_keepalive_probes`) — tuning these in the k8s pod spec reduces detection time.

Add `keepaliveTime` to `SystemEnvDataSourceStrategy`

The current `DB_IDLE_TIMEOUT = 300s` means connections above `minimumIdle` are retired at 5 minutes — just before the AWS NAT gateway's ~350s idle TCP timeout, so those connections are safe. However, the `minimumIdle` connections (recommended: 1) are never retired by `idleTimeout` and can be idle for hours, becoming silently dead on the NAT without either side knowing.

`keepaliveTime = 120s` sends a lightweight `SELECT 1` every 2 minutes on idle connections, keeping the NAT entry alive. This change has no impact on request latency.

`keepaliveTime` is not currently exposed as an env var. Both `SystemEnvDataSourceStrategy` and `setenv.sh` need to be updated:

export DB_KEEPALIVE_TIME=${DB_KEEPALIVE_TIME:-"120000"}
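
On the code side, the env var could be parsed with a helper along these lines (the class and method names are hypothetical; the real wiring belongs in `SystemEnvDataSourceStrategy` and would end with `hikariConfig.setKeepaliveTime(...)`, available since HikariCP 4.0):

```java
// Hypothetical parsing helper for the proposed DB_KEEPALIVE_TIME env var.
// The 120000 ms fallback matches the setenv.sh default shown above.
class KeepaliveConfig {

    static long keepaliveTimeMillis(final String envValue) {
        if (envValue == null || envValue.trim().isEmpty()) {
            return 120_000L; // default: ping idle connections every 2 minutes
        }
        return Long.parseLong(envValue.trim());
    }
}
```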

Note on `DB_VALIDATION_QUERY`: Leave it empty. pgjdbc supports JDBC4 `Connection.isValid()`, which HikariCP uses automatically — free validation with no extra round-trip. Setting `connectionTestQuery` would add overhead to every connection checkout across the entire application.


Bug: `DB_MAXWAIT` / `DB_MAX_WAIT` Naming Mismatch

The `maxLifetime` setting configured in `setenv.sh` is never actually applied. There is a naming mismatch between the env var exported by `setenv.sh` and the name the code reads:

| Location | Variable name |
|---|---|
| `setenv.sh` | `DB_MAX_WAIT` (underscore between MAX and WAIT) |
| `DataSourceStrategyProvider.java` line 22 | `DB_MAXWAIT` (no underscore) |

Because these names do not match, `SystemEnvDataSourceStrategy` never finds the `DB_MAX_WAIT` value and always falls back to its hardcoded default of `"60000"` (60 seconds). Every other pool constant follows the underscore convention (`DB_MAX_TOTAL`, `DB_MIN_IDLE`, `DB_IDLE_TIMEOUT`, etc.) — `DB_MAXWAIT` is the sole exception and is incorrect.

Effective `maxLifetime` at runtime: 60 seconds. HikariCP retires and replaces connections every ~60 seconds. With `maximumPoolSize = 200` under load each pod generates up to ~3 reconnects/second, each triggering a `SET application_name` on the new connection. This is the direct cause of the `SET application_name` churn visible in `pg_stat_activity`.
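
The reconnect figure is a back-of-envelope estimate, assuming a fully utilised pool in which every connection is retired once per `maxLifetime` window:

```java
// Back-of-envelope check of the churn figure above. Assumes every connection
// in the pool is retired once per maxLifetime window (i.e. full utilisation).
class ChurnEstimate {

    static double reconnectsPerSecond(final int poolSize, final int maxLifetimeSeconds) {
        return (double) poolSize / maxLifetimeSeconds;
    }

    public static void main(final String[] args) {
        // 200 connections retired every 60s ≈ 3.3 reconnects/second per pod
        System.out.println(reconnectsPerSecond(200, 60));
    }
}
```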

Fix: Rename the constant in `DataSourceStrategyProvider.java`:

// Before
static final String CONNECTION_DB_MAX_WAIT = "DB_MAXWAIT";

// After
static final String CONNECTION_DB_MAX_WAIT = "DB_MAX_WAIT";

Migration impact: Any k8s ConfigMap or deployment script that explicitly sets `DB_MAXWAIT` must be updated to `DB_MAX_WAIT`. Deployments relying on the `setenv.sh` default (which already uses `DB_MAX_WAIT`) will automatically benefit once the constant is corrected.

Once the naming is fixed, the `setenv.sh` default of 900s becomes effective. This should be raised to 1800s as part of the same change (see Tier 1 acceptance criteria).


Related: HikariCP Pool Configuration

The effective pool settings (what actually runs) after accounting for the naming bug:

| Property | Env var (code reads) | setenv.sh exports | Actual runtime value |
|---|---|---|---|
| `maximumPoolSize` | `DB_MAX_TOTAL` | `DB_MAX_TOTAL=200` | 200 |
| `minimumIdle` | `DB_MIN_IDLE` | `DB_MIN_IDLE=3` | 3 |
| `maxLifetime` | `DB_MAXWAIT` ⚠️ mismatch | `DB_MAX_WAIT=900000` | 60s (hardcoded default — setenv.sh value ignored) |
| `idleTimeout` | `DB_IDLE_TIMEOUT` | `DB_IDLE_TIMEOUT=300000` | 300s |
| `connectionTimeout` | `DB_CONNECTION_TIMEOUT` | `DB_CONNECTION_TIMEOUT=5000` | 5s |
| `connectionTestQuery` | `DB_VALIDATION_QUERY` | `DB_VALIDATION_QUERY=""` | "" (fine — JDBC4 `isValid()` automatic) |
| `keepaliveTime` | `DB_KEEPALIVE_TIME` | not set | not set |

Pool self-regulation

The pool self-regulates around actual load: connections above `minimumIdle` are retired after `idleTimeout = 300s` of idleness. `maximumPoolSize = 200` is a burst ceiling, not a permanent floor — the pool sheds load within 5 minutes of traffic dropping.

`DB_MIN_IDLE` → 1 (Tier 1, safe)

`minimumIdle = 3` is hard to justify over 1. The only practical use case for keeping warm connections is having one ready for background threads (Quartz, push sync) during quiet periods. One connection covers this — the remaining 2 provide negligible benefit since HikariCP ramps additional connections in parallel rapidly on burst. Reducing to 1 minimises permanent RDS slot consumption and reduces the keepalive concern to a single connection per pod.

`DB_CONNECTION_TIMEOUT` → 30s (Tier 1, safe)

`connectionTimeout = 5s` means any thread requesting a pool connection during a brief burst — publish + telemetry + reindex overlap — has only 5 seconds before HikariCP throws a `SQLTransientConnectionException`. This produces noisy false failures during spikes that would self-resolve within seconds. 30 seconds (HikariCP's own recommended default) gives legitimate requests time to succeed without masking actual pool exhaustion. This change has no effect on leak behaviour.

`DB_MAX_TOTAL = 200` — observe before changing (Tier 2)

`DB_MAX_TOTAL = 200` was likely sized to match the application's peak thread demand. dotCMS runs many concurrent thread pools (Tomcat request handlers, content indexing, push publishing, Quartz, workflow processors, telemetry) that can each hold a database connection simultaneously. Reducing this without observability data risks introducing `connectionTimeout` failures on services that currently work.

Approach: Establish a baseline first by querying `pg_stat_activity` during a content publish, full reindex, and normal traffic window:

SELECT client_addr, application_name, state, count(*)
FROM pg_stat_activity
WHERE datname = 'dotcms'
GROUP BY client_addr, application_name, state
ORDER BY count(*) DESC;

Note: Each pod runs one permanent LISTEN/NOTIFY connection (`JDBCPubSubImpl`) that holds a pool slot indefinitely, appearing as a connection running frequent `SELECT 1` queries. Filter this out when analysing load-driven connection counts — it is 1 fixed connection per pod, not a traffic signal.

Once peak data is available, reduce `DB_MAX_TOTAL` in small increments (200 → 150 → 100) monitoring HikariCP JMX metrics and `connectionTimeout` errors after each step. Tomcat max thread count and async indexing thread pool sizing should be reviewed as part of this work.


Related: PostgreSQL Server-Side Timeouts

`idle_in_transaction_session_timeout` — defer (Tier 3)

This setting kills sessions idle inside an open transaction and is the most impactful safety net for stuck transaction leaks. However, dotCMS performs Elasticsearch/OpenSearch indexing operations inside database transactions during content publish. These can hold a transaction open for an extended period depending on content volume and index response time.

Setting any value here before this pattern is remediated would cause catastrophic publish failures. This must be tracked as a separate technical debt item. Once index operations are moved outside transaction boundaries, a starting value of 5–10 minutes is appropriate.

`idle_session_timeout` — after keepalive is confirmed (Tier 3)

With `keepaliveTime = 120s` in place, pooled connections receive a ping every 2 minutes — they will never be idle long enough to trigger this setting. It only catches connections that have fully escaped pool management.

Set to 3600000ms (1 hour) per role once `keepaliveTime` is confirmed working in production:

ALTER ROLE dotcms_user SET idle_session_timeout = '3600000';

Apply per role, not globally, to avoid affecting superuser and monitoring connections. Must be greater than `maxLifetime` (30 min) to avoid killing connections HikariCP intends to keep alive.

Required timeout ordering once all settings are in place:

idle_session_timeout  (server, 60 min)
    > maxLifetime     (HikariCP, 30 min)  — pool retires connections before server kills them
    > keepaliveTime   (HikariCP, 2 min)   — pings keep minimumIdle connections alive through NAT

`keepaliveTime` must in turn be shorter than the NAT idle timeout (AWS, ~350s) so the pings fire before the NAT entry expires.
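
A sketch of a sanity check over this ordering — the constants mirror the values proposed in this issue, not anything dotCMS currently ships:

```java
// Sanity check of the proposed timeout ordering. All values come from this
// issue's recommendations and are illustrative, not current defaults.
class TimeoutOrderingCheck {

    static final long KEEPALIVE_TIME_MS = 120_000L;   // HikariCP keepaliveTime, 2 min
    static final long NAT_IDLE_MS       = 350_000L;   // AWS NAT idle timeout, ~350s
    static final long MAX_LIFETIME_MS   = 1_800_000L; // HikariCP maxLifetime, 30 min
    static final long IDLE_SESSION_MS   = 3_600_000L; // server idle_session_timeout, 60 min

    static boolean orderingHolds() {
        return IDLE_SESSION_MS > MAX_LIFETIME_MS    // pool retires before server kills
            && MAX_LIFETIME_MS > KEEPALIVE_TIME_MS  // pings fit within a connection's life
            && KEEPALIVE_TIME_MS < NAT_IDLE_MS;     // pings fire before the NAT entry expires
    }
}
```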

Temporary Operational Workarounds

These changes can be applied to the StatefulSet and RDS without any code deployment to limit connection accumulation while the permanent fixes are in progress. All StatefulSet changes require a rolling pod restart.

StatefulSet environment variables

Apply all of the following in a single StatefulSet update:

env:
  # Fix the DB_MAXWAIT naming bug immediately — the code reads DB_MAXWAIT,
  # not DB_MAX_WAIT. Setting this directly bypasses the setenv.sh mismatch
  # and changes maxLifetime from 60s to 30 minutes, stopping connection churn.
  - name: DB_MAXWAIT
    value: "1800000"

  # Enable TCP keepalive and add a 10-minute socket read deadline as a
  # temporary backstop for Mode B (TCP-level hang). socketTimeout kills any
  # socket read blocked longer than 10 minutes — conservative enough not to
  # affect legitimate operations. Remove socketTimeout once the code fix ships.
  - name: DB_BASE_URL
    value: "jdbc:postgresql://host:5432/dotcms?tcpKeepAlive=true&socketTimeout=600"

  # Reduce permanent idle footprint from 3 connections to 1 per pod.
  - name: DB_MIN_IDLE
    value: "1"

Priority order and rationale:

| Change | Impact | Risk |
|---|---|---|
| `DB_MAXWAIT=1800000` | Stops connection churn immediately — single biggest impact | None |
| `tcpKeepAlive=true` in `DB_BASE_URL` | OS detects and closes dead TCP connections (Mode B) | None |
| `socketTimeout=600` in `DB_BASE_URL` | Backstop: unblocks any socket read stuck >10 min | Low — 10 min is conservative |
| `DB_MIN_IDLE=1` | Reduces permanent idle connections per pod | None |

RDS — apply per role (immediate, no restart required)

These take effect for new sessions only. Existing connections are unaffected until they reconnect.

-- Backstop for any query running longer than 10 minutes.
-- Covers Mode A (slow/locked query) at the database level while the
-- setQueryTimeout() code fix is pending. Conservative value preserves
-- legitimate long-running operations.
ALTER ROLE dotcms_user SET statement_timeout = '600000';

-- Backstop for connections stuck idle inside an open transaction for
-- longer than 30 minutes. This is deliberately conservative given the
-- known indexing-in-transaction anti-pattern — genuine publish transactions
-- do not sit idle in BEGIN for 30 minutes. Provides a safety net for
-- truly stuck connections without risking publish failures.
ALTER ROLE dotcms_user SET idle_in_transaction_session_timeout = '1800000';

-- Backstop for connections that have fully escaped pool management
-- (leaked connections that neither code nor HikariCP will clean up).
-- 2 hours ensures pooled connections cycling at maxLifetime=30min are
-- never affected. Provides a ceiling on how long a leaked connection
-- can persist before RDS terminates it.
ALTER ROLE dotcms_user SET idle_session_timeout = '7200000';

Verify settings applied:

SELECT rolname, rolconfig FROM pg_roles WHERE rolname = 'dotcms_user';

What these workarounds do and do not address

| Failure mode | Workaround coverage | Permanent fix |
|---|---|---|
| Connection churn from 60s `maxLifetime` | ✅ Fixed by `DB_MAXWAIT=1800000` | `DB_MAXWAIT` rename in code |
| Mode A — slow/locked metric query blocking indefinitely | ⚠️ Partially covered by `statement_timeout=10min` (database kills query eventually) | `setQueryTimeout()` in `DotConnect` |
| Mode B — TCP hang after query completes | ✅ `tcpKeepAlive=true` handles dead connections; `socketTimeout=600` covers edge cases | `tcpKeepAlive=true` in code path |
| Leaked connections persisting indefinitely | ✅ `idle_session_timeout=2h` and `idle_in_transaction_session_timeout=30min` provide a ceiling | `keepaliveTime` + code fixes |
| NAT silently killing `minimumIdle` connections | ⚠️ Mitigated by `DB_MIN_IDLE=1` reducing exposure; not eliminated without `keepaliveTime` | `keepaliveTime` via `DB_KEEPALIVE_TIME` env var |

Priority

High — without this, the `wrapConnection()` pattern added in #34490 cannot reliably prevent connection orphaning regardless of whether the blocking JDBC call is still in-flight or has already completed server-side. The `DB_MAXWAIT` naming bug means `maxLifetime` is running at 60s on all deployments regardless of `setenv.sh` configuration, causing significant unnecessary connection churn.


Acceptance Criteria

Tier 1 — Safe to apply now

  • `getQueryTimeoutSeconds()` added to `TimeoutConfig` — returns `Math.max(1, getMetricTimeoutSeconds() - 2)`, no new config property
  • `setQueryTimeout(int seconds)` added to `DotConnect` — default `0` (no timeout), fluent setter, applied to both `PreparedStatement` and `Statement` paths in `executeQuery()`
  • `TimeoutConfig` injected into `MetricsFactoryImpl`; `getValue()` and `getList()` chain `.setQueryTimeout(timeoutConfig.getQueryTimeoutSeconds())`
  • `getSchemaDBVersion()` does not use `setQueryTimeout`
  • `tcpKeepAlive=true` added to `DB_BASE_URL` in k8s ConfigMap; `socketTimeout=600` removed from `DB_BASE_URL` (workaround no longer needed)
  • `keepaliveTime` added to `SystemEnvDataSourceStrategy` via env var `DB_KEEPALIVE_TIME`; default `120000` in `setenv.sh`
  • `CONNECTION_DB_MAX_WAIT` constant in `DataSourceStrategyProvider.java` renamed from `"DB_MAXWAIT"` to `"DB_MAX_WAIT"` to match `setenv.sh` and naming convention
  • `DB_MAX_WAIT` default raised to `1800000` in `setenv.sh`; `DB_MAXWAIT` workaround entry removed from StatefulSet env vars
  • `DB_MIN_IDLE` default lowered to `1` in `setenv.sh`
  • K8s ConfigMaps and deployment scripts audited for `DB_MAXWAIT` references and updated to `DB_MAX_WAIT`
  • Verified: existing `DotConnect` callers without `setQueryTimeout()` are unaffected
  • Verified: a slow/locked metric query is cancelled at `metricTimeout - 2s` and the connection is returned to HikariCP
  • `DB_CONNECTION_TIMEOUT` default raised to `30000` in `setenv.sh`

Tier 2 — Requires observability data first

  • Peak connection count per pod baselined via `pg_stat_activity` across publish, reindex, and normal traffic scenarios (LISTEN/NOTIFY connection filtered out)
  • Tomcat max threads and async thread pool configuration reviewed alongside connection pool sizing
  • `DB_MAX_TOTAL` reduced incrementally (200 → 150 → 100) based on observed peak, with HikariCP JMX metrics monitored after each step

Tier 3 — Requires code remediation first

  • Indexing-in-transaction pattern tracked as technical debt — index operations to be moved outside transaction boundaries or executed post-commit
  • `idle_in_transaction_session_timeout` tightened from 30min workaround to 5–10 minutes only after Tier 3 code work is complete
  • `idle_session_timeout` tightened from 2h workaround to 1h only after `keepaliveTime` (Tier 1) is confirmed working in production

Context

Companion fix to #34490:

  • fix(telemetry): fix telemetry DB connection leak in MetricStatsColl (#34480) #34490 — `wrapConnection()` ensures `closeSilently()` runs once JDBC returns
  • This issue — `setQueryTimeout()` ensures JDBC returns for slow/locked queries; `tcpKeepAlive` handles TCP-level hangs; `keepaliveTime` prevents NAT from silently killing idle connections; `DB_MAXWAIT` naming fix makes `maxLifetime` configuration effective for the first time; server-side timeouts (Tier 3) catch any connection that escapes pool management entirely

Tier 1 changes can be applied immediately. Tier 2 and Tier 3 require observability work and code remediation respectively. The temporary operational workarounds above can be applied to production now to reduce RDS connection pressure while the code work proceeds.
