
enhancement(db): Add per-query Statement.setQueryTimeout(), tcpKeepAlive, and HikariCP keepaliveTime as defense-in-depth #34832

@spbolton


Context Update — 2026-03-03

The original framing of this issue was based on a false premise.

This issue was filed (2026-03-02) after observing idle DB connections on k8s-us-prod-1 and assuming the fix in #34490 (shipped in 26.02.27-01) was already deployed. The implication was that connections were still leaking post-fix, and therefore an additional mechanism (query timeouts) was required.

What the data actually shows: A fleet-wide sweep of all 177 dotCMS StatefulSets on k8s-us-prod-1 found that every affected instance is running 26.02.17-01_2263c4e. The 26.02.17-01 image was built 10 days before the fix shipped in 26.02.27-01. These pods have never received the #34490 fix. The observed connections are pre-fix behavior, not a post-fix regression.

Does #34490 actually fix the observed problem?

Yes — with high confidence. Code review of the merged PR confirms two independent fixes that together close the structural leak:

  1. Structural fix (unconditional): DBMetricType.getValue() now wraps the query call in DbConnectionFactory.wrapConnection() on the executor thread. This makes connection lifecycle management explicit and independent of @CloseDBIfOpened firing on CDI proxies (which it silently does not). This fix is unconditionally correct — no timing dependency, no replication required to validate.

  2. Race condition fix: The wrapConnection() finally block now calls Thread.interrupted() to clear the interrupted status before running closeSilently(), then restores the flag after. The executor timeout path now checks future.isDone() before calling cancel(true). The shutdown sequence uses executor.shutdown() + awaitTermination(5s) before falling back to shutdownNow(). Together these eliminate the narrow timing window where an interrupt could arrive mid-cleanup.

The inability to replicate in tests is not a confidence problem for the structural fix — wrapConnection() is unconditionally correct and can be reasoned about statically. The race condition window is also now extremely narrow given the isDone() guard.

All 207 observed orphaned connections have state = idle in pg_stat_activity. idle means the query completed — PostgreSQL finished executing and is waiting for the next command. This confirms the observed failure mode is Mode A (query completed but connection not returned to pool), which is exactly what #34490 targets. Mode B (query still blocking, socket hung, state = active) is not present in current production data.

Revised priority and framing

| Item | Status |
|---|---|
| Fix for observed production issue | Deploy 26.02.27-01 to affected customers — this is the action, not this issue |
| This issue (#34832) — Mode B / TCP hang | Nice-to-have defense-in-depth; not required to fix current production behavior |
| `DB_MAXWAIT`/`DB_MAX_WAIT` naming bug | Still a real, independent bug — worth fixing regardless of the above |
| Query timeout / `setQueryTimeout()` | Valuable for genuinely slow/blocked queries; not the cause of currently observed idle connections |

Recommended sequence:

  1. Upgrade affected customers from 26.02.17-01 to 26.02.27-01
  2. Run pg_stat_activity sweep post-upgrade to confirm accumulation stops
  3. If confirmed resolved: treat this issue as a standalone defense-in-depth enhancement (particularly the DB_MAXWAIT naming fix, which is independently worth shipping)
  4. If connections persist post-upgrade: then setQueryTimeout() becomes the primary investigation target

Description

Metric queries in `MetricsFactoryImpl` have no per-query timeout and the JDBC socket has no keepalive. An executor thread processing a metric can block indefinitely on JDBC I/O, never run connection cleanup, and permanently orphan a connection that persists in `pg_stat_activity` until the pod is restarted. This directly undermines the fix in #34490.

Investigation also identified an env var naming bug and HikariCP pool configuration issues that compound the risk and affect all connections. These are captured below with risk-tiered recommendations.


Failure Modes

Mode A — Query still running (lock wait, slow query)

  1. `future.cancel(true)` sets the interrupt flag on the executor thread
  2. JDBC is blocking in `Statement.execute()` — pgjdbc does not respond to `Thread.interrupt()` during socket reads
  3. `wrapConnection()`'s finally block cannot run until `delegate.execute()` returns
  4. The executor thread stays alive holding a checked-out HikariCP connection indefinitely

Fix: `Statement.setQueryTimeout()` — fires a server-side cancel via PostgreSQL's cancel protocol (separate TCP socket). PostgreSQL cancels the query, pgjdbc unblocks and throws, the exception propagates through `wrapConnection()`, cleanup runs, connection returned to HikariCP.

Mode B — Query completed server-side, TCP-level hang

`pg_stat_activity` shows `state = idle` — PostgreSQL has processed the query and sent the result. But pgjdbc is still blocking in `SocketInputStream.read()` waiting to receive those bytes due to a network-level issue (dead TCP session, dropped packet after server sent result).

`Statement.setQueryTimeout()` does not help here — there is no running query to cancel. The original socket read remains blocked indefinitely.

Fix: `tcpKeepAlive=true` in JDBC URL — enables OS-level TCP keepalive probes. When the OS detects a dead connection it closes the socket, pgjdbc throws `SocketException`, the executor thread unblocks, and cleanup runs. This only fires on dead connections and has no effect on legitimate slow queries.
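
As a minimal sketch of the URL change, a helper like the following (the class and method names are invented for illustration) could append the parameter to an existing `DB_BASE_URL` without clobbering values an operator has already set:

```java
// Illustrative helper only — not an existing dotCMS API. Appends
// tcpKeepAlive=true to a JDBC URL unless the parameter is already present.
class JdbcUrlKeepAlive {

    static String withTcpKeepAlive(final String jdbcUrl) {
        if (jdbcUrl.contains("tcpKeepAlive=")) {
            return jdbcUrl; // already configured — leave the operator's value alone
        }
        // Use '?' for the first query parameter, '&' for subsequent ones.
        final String separator = jdbcUrl.contains("?") ? "&" : "?";
        return jdbcUrl + separator + "tcpKeepAlive=true";
    }
}
```

For example, `withTcpKeepAlive("jdbc:postgresql://host:5432/dotcms")` yields the URL shown in Fix 2 below, while a URL that already carries `tcpKeepAlive=false` is returned unchanged.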


Why Not `socketTimeout`

`socketTimeout` in the JDBC URL is a global TCP read deadline applied to every query in the application — bulk operations, report queries, content search — all would be killed at the threshold. The `setQueryTimeout()` approach is scoped to metric queries only; `tcpKeepAlive` relies on the OS to distinguish dead connections from slow ones.


Fix 1 — Telemetry Query Timeout

Add `getQueryTimeoutSeconds()` to `TimeoutConfig`

The query timeout is derived from the existing `telemetry.metric.timeout.seconds` — no new configuration property needed. The metric timeout is the operator-visible contract (outer executor deadline); the query timeout is an implementation detail that fires 2 seconds earlier, allowing the resulting exception to propagate and the connection to be returned before the executor deadline fires.

/**
 * Gets the JDBC query timeout for metric DB queries.
 * Derived from the metric timeout with a 2-second buffer so the server-side
 * cancel fires before the executor's future.get() deadline.
 *
 * @return the JDBC query timeout in seconds (always at least 1)
 */
public int getQueryTimeoutSeconds() {
    return Math.max(1, getMetricTimeoutSeconds() - 2);
}
| Timeout | Source | Default | Role |
|---|---|---|---|
| `telemetry.metric.timeout.seconds` | existing config | 10s | executor `future.get()` deadline — outer bound |
| `getQueryTimeoutSeconds()` | derived: `metricTimeout - 2` | 8s | JDBC `setQueryTimeout()` — fires first |

Add `setQueryTimeout(int seconds)` to `DotConnect`

Add a `queryTimeoutSeconds` field (default `0` = no timeout) with a fluent setter. In `executeQuery(Connection conn)`, call `statement.setQueryTimeout(queryTimeoutSeconds)` on both the `PreparedStatement` path (~line 693) and the `Statement` path (~line 719) before execution. When `queryTimeoutSeconds == 0` no timeout is set — existing callers are unaffected.
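
A minimal sketch of these additions (the real `DotConnect` has many more members; `applyQueryTimeout()` here stands in for the two call sites in `executeQuery()`):

```java
import java.sql.SQLException;
import java.sql.Statement;

// Simplified sketch of the proposed DotConnect changes, not the real class.
class DotConnectSketch {

    private int queryTimeoutSeconds = 0; // 0 = no timeout; existing callers unaffected

    public DotConnectSketch setQueryTimeout(final int seconds) {
        this.queryTimeoutSeconds = seconds;
        return this; // fluent, matching DotConnect's chaining style
    }

    public int getQueryTimeout() {
        return queryTimeoutSeconds;
    }

    // Would be called on both the PreparedStatement and Statement paths
    // before execution; a zero value leaves the statement untouched.
    void applyQueryTimeout(final Statement statement) throws SQLException {
        if (queryTimeoutSeconds > 0) {
            statement.setQueryTimeout(queryTimeoutSeconds);
        }
    }
}
```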

Inject `TimeoutConfig` into `MetricsFactoryImpl`

@Inject
private TimeoutConfig timeoutConfig;

@Override
public Optional<Object> getValue(final String sqlQuery) throws DotDataException {
    final List<Map<String, Object>> results = new DotConnect()
            .setSQL(sqlQuery)
            .setQueryTimeout(timeoutConfig.getQueryTimeoutSeconds())
            .loadObjectResults();
    // ...
}

@Override
public Optional<List<String>> getList(final String sqlQuery) throws DotDataException {
    final List<Map<String, Object>> results = new DotConnect()
            .setSQL(sqlQuery)
            .setQueryTimeout(timeoutConfig.getQueryTimeoutSeconds())
            .loadObjectResults();
    // ...
}

`getSchemaDBVersion()` does not use `setQueryTimeout` — fast schema lookup, not a leak risk.


Fix 2 — JDBC URL and HikariCP Keepalive

Add `tcpKeepAlive=true` to `DB_BASE_URL`

jdbc:postgresql://host:5432/dotcms?tcpKeepAlive=true

Enables `SO_KEEPALIVE` on the pgjdbc socket. OS TCP keepalive probes detect dead connections and close the socket, unblocking pgjdbc. The probe interval is governed by OS settings (`tcp_keepalive_time`, `tcp_keepalive_intvl`, `tcp_keepalive_probes`) — tuning these in the k8s pod spec reduces detection time.

Add `keepaliveTime` to `SystemEnvDataSourceStrategy`

The current `DB_IDLE_TIMEOUT = 300s` means connections above `minimumIdle` are retired at 5 minutes — just before the AWS NAT gateway's ~350s idle TCP timeout, so those connections are safe. However, the `minimumIdle` connections (recommended: 1) are never retired by `idleTimeout` and can be idle for hours, becoming silently dead on the NAT without either side knowing.

`keepaliveTime = 120s` sends a lightweight `SELECT 1` every 2 minutes on idle connections, keeping the NAT entry alive. This change has no impact on request latency.

`keepaliveTime` is not currently exposed as an env var. Both `SystemEnvDataSourceStrategy` and `setenv.sh` need to be updated:

export DB_KEEPALIVE_TIME=${DB_KEEPALIVE_TIME:-"120000"}
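
On the code side, the env var could be parsed with a helper along these lines (the class and method names are hypothetical; the real wiring belongs in `SystemEnvDataSourceStrategy` and would end with `hikariConfig.setKeepaliveTime(...)`, available since HikariCP 4.0):

```java
// Hypothetical parsing helper for the proposed DB_KEEPALIVE_TIME env var.
// The 120000 ms fallback matches the setenv.sh default shown above.
class KeepaliveConfig {

    static long keepaliveTimeMillis(final String envValue) {
        if (envValue == null || envValue.trim().isEmpty()) {
            return 120_000L; // default: ping idle connections every 2 minutes
        }
        return Long.parseLong(envValue.trim());
    }
}
```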

Note on `DB_VALIDATION_QUERY`: Leave it empty. pgjdbc supports JDBC4 `Connection.isValid()`, which HikariCP uses automatically — free validation with no extra round-trip. Setting `connectionTestQuery` would add overhead to every connection checkout across the entire application.


Bug: `DB_MAXWAIT` / `DB_MAX_WAIT` Naming Mismatch

The `maxLifetime` setting configured in `setenv.sh` is never actually applied. There is a naming mismatch between the env var exported by `setenv.sh` and the name the code reads:

| Location | Variable name |
|---|---|
| `setenv.sh` | `DB_MAX_WAIT` (underscore between MAX and WAIT) |
| `DataSourceStrategyProvider.java` line 22 | `DB_MAXWAIT` (no underscore) |

Because these names do not match, `SystemEnvDataSourceStrategy` never finds the `DB_MAX_WAIT` value and always falls back to its hardcoded default of `"60000"` (60 seconds). Every other pool constant follows the underscore convention (`DB_MAX_TOTAL`, `DB_MIN_IDLE`, `DB_IDLE_TIMEOUT`, etc.) — `DB_MAXWAIT` is the sole exception and is incorrect.

Effective `maxLifetime` at runtime: 60 seconds. HikariCP retires and replaces connections every ~60 seconds. With `maximumPoolSize = 200` under load each pod generates up to ~3 reconnects/second, each triggering a `SET application_name` on the new connection. This is the direct cause of the `SET application_name` churn visible in `pg_stat_activity`.
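
The reconnect figure is a back-of-envelope estimate, assuming a fully utilised pool in which every connection is retired once per `maxLifetime` window:

```java
// Back-of-envelope check of the churn figure above. Assumes every connection
// in the pool is retired once per maxLifetime window (i.e. full utilisation).
class ChurnEstimate {

    static double reconnectsPerSecond(final int poolSize, final int maxLifetimeSeconds) {
        return (double) poolSize / maxLifetimeSeconds;
    }

    public static void main(final String[] args) {
        // 200 connections retired every 60s ≈ 3.3 reconnects/second per pod
        System.out.println(reconnectsPerSecond(200, 60));
    }
}
```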

Fix: Rename the constant in `DataSourceStrategyProvider.java`:

// Before
static final String CONNECTION_DB_MAX_WAIT = "DB_MAXWAIT";

// After
static final String CONNECTION_DB_MAX_WAIT = "DB_MAX_WAIT";

Migration impact: Any k8s ConfigMap or deployment script that explicitly sets `DB_MAXWAIT` must be updated to `DB_MAX_WAIT`. Deployments relying on the `setenv.sh` default (which already uses `DB_MAX_WAIT`) will automatically benefit once the constant is corrected.

Once the naming is fixed, the `setenv.sh` default of 900s becomes effective. This should be raised to 1800s as part of the same change (see Tier 1 acceptance criteria).


Related: HikariCP Pool Configuration

The effective pool settings (what actually runs) after accounting for the naming bug:

| Property | Env var (code reads) | setenv.sh exports | Actual runtime value |
|---|---|---|---|
| `maximumPoolSize` | `DB_MAX_TOTAL` | `DB_MAX_TOTAL=200` | 200 |
| `minimumIdle` | `DB_MIN_IDLE` | `DB_MIN_IDLE=3` | 3 |
| `maxLifetime` | `DB_MAXWAIT` ⚠️ mismatch | `DB_MAX_WAIT=900000` | 60s (hardcoded default — setenv.sh value ignored) |
| `idleTimeout` | `DB_IDLE_TIMEOUT` | `DB_IDLE_TIMEOUT=300000` | 300s |
| `connectionTimeout` | `DB_CONNECTION_TIMEOUT` | `DB_CONNECTION_TIMEOUT=5000` | 5s |
| `connectionTestQuery` | `DB_VALIDATION_QUERY` | `DB_VALIDATION_QUERY=""` | "" (fine — JDBC4 `isValid()` automatic) |
| `keepaliveTime` | `DB_KEEPALIVE_TIME` | not set | not set |

Pool self-regulation

The pool self-regulates around actual load: connections above `minimumIdle` are retired after `idleTimeout = 300s` of idleness. `maximumPoolSize = 200` is a burst ceiling, not a permanent floor — the pool sheds load within 5 minutes of traffic dropping.

`DB_MIN_IDLE` → 1 (Tier 1, safe)

`minimumIdle = 3` is hard to justify over 1. The only practical use case for keeping warm connections is having one ready for background threads (Quartz, push sync) during quiet periods. One connection covers this — the remaining 2 provide negligible benefit since HikariCP ramps additional connections in parallel rapidly on burst. Reducing to 1 minimises permanent RDS slot consumption and reduces the keepalive concern to a single connection per pod.

`DB_CONNECTION_TIMEOUT` → 30s (Tier 1, safe)

`connectionTimeout = 5s` means any thread requesting a pool connection during a brief burst — publish + telemetry + reindex overlap — has only 5 seconds before HikariCP throws a `SQLTransientConnectionException`. This produces noisy false failures during spikes that would self-resolve within seconds. 30 seconds (HikariCP's own recommended default) gives legitimate requests time to succeed without masking actual pool exhaustion. This change has no effect on leak behaviour.

`DB_MAX_TOTAL = 200` — observe before changing (Tier 2)

`DB_MAX_TOTAL = 200` was likely sized to match the application's peak thread demand. dotCMS runs many concurrent thread pools (Tomcat request handlers, content indexing, push publishing, Quartz, workflow processors, telemetry) that can each hold a database connection simultaneously. Reducing this without observability data risks introducing `connectionTimeout` failures on services that currently work.

Approach: Establish a baseline first by querying `pg_stat_activity` during a content publish, full reindex, and normal traffic window:

SELECT client_addr, application_name, state, count(*)
FROM pg_stat_activity
WHERE datname = 'dotcms'
GROUP BY client_addr, application_name, state
ORDER BY count(*) DESC;

Note: Each pod runs one permanent LISTEN/NOTIFY connection (`JDBCPubSubImpl`) that holds a pool slot indefinitely, appearing as a connection running frequent `SELECT 1` queries. Filter this out when analysing load-driven connection counts — it is 1 fixed connection per pod, not a traffic signal.

Once peak data is available, reduce `DB_MAX_TOTAL` in small increments (200 → 150 → 100) monitoring HikariCP JMX metrics and `connectionTimeout` errors after each step. Tomcat max thread count and async indexing thread pool sizing should be reviewed as part of this work.


Related: PostgreSQL Server-Side Timeouts

`idle_in_transaction_session_timeout` — defer (Tier 3)

This setting kills sessions idle inside an open transaction and is the most impactful safety net for stuck transaction leaks. However, dotCMS performs Elasticsearch/OpenSearch indexing operations inside database transactions during content publish. These can hold a transaction open for an extended period depending on content volume and index response time.

Setting any value here before this pattern is remediated would cause catastrophic publish failures. This must be tracked as a separate technical debt item. Once index operations are moved outside transaction boundaries, a starting value of 5–10 minutes is appropriate.

`idle_session_timeout` — after keepalive is confirmed (Tier 3)

With `keepaliveTime = 120s` in place, pooled connections receive a ping every 2 minutes — they will never be idle long enough to trigger this setting. It only catches connections that have fully escaped pool management.

Set to 3600000ms (1 hour) per role once `keepaliveTime` is confirmed working in production:

ALTER ROLE dotcms_user SET idle_session_timeout = '3600000';

Apply per role, not globally, to avoid affecting superuser and monitoring connections. Must be greater than `maxLifetime` (30 min) to avoid killing connections HikariCP intends to keep alive.

Required timeout ordering once all settings are in place:

idle_session_timeout  (server, 60 min)
    > maxLifetime     (HikariCP, 30 min)  — pool retires connections before server kills them
    > keepaliveTime   (HikariCP, 2 min)   — pings keep minimumIdle connections alive through NAT

`keepaliveTime` must in turn be shorter than the NAT idle timeout (AWS, ~350s) so the pings fire before the NAT entry expires.
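
A sketch of a sanity check over this ordering — the constants mirror the values proposed in this issue, not anything dotCMS currently ships:

```java
// Sanity check of the proposed timeout ordering. All values come from this
// issue's recommendations and are illustrative, not current defaults.
class TimeoutOrderingCheck {

    static final long KEEPALIVE_TIME_MS = 120_000L;   // HikariCP keepaliveTime, 2 min
    static final long NAT_IDLE_MS       = 350_000L;   // AWS NAT idle timeout, ~350s
    static final long MAX_LIFETIME_MS   = 1_800_000L; // HikariCP maxLifetime, 30 min
    static final long IDLE_SESSION_MS   = 3_600_000L; // server idle_session_timeout, 60 min

    static boolean orderingHolds() {
        return IDLE_SESSION_MS > MAX_LIFETIME_MS    // pool retires before server kills
            && MAX_LIFETIME_MS > KEEPALIVE_TIME_MS  // pings fit within a connection's life
            && KEEPALIVE_TIME_MS < NAT_IDLE_MS;     // pings fire before the NAT entry expires
    }
}
```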

Temporary Operational Workarounds

These changes can be applied to the StatefulSet and RDS without any code deployment to limit connection accumulation while the permanent fixes are in progress. All StatefulSet changes require a rolling pod restart.

StatefulSet environment variables

Apply all of the following in a single StatefulSet update:

env:
  # Fix the DB_MAXWAIT naming bug immediately — the code reads DB_MAXWAIT,
  # not DB_MAX_WAIT. Setting this directly bypasses the setenv.sh mismatch
  # and changes maxLifetime from 60s to 30 minutes, stopping connection churn.
  - name: DB_MAXWAIT
    value: "1800000"

  # Enable TCP keepalive and add a 10-minute socket read deadline as a
  # temporary backstop for Mode B (TCP-level hang). socketTimeout kills any
  # socket read blocked longer than 10 minutes — conservative enough not to
  # affect legitimate operations. Remove socketTimeout once the code fix ships.
  - name: DB_BASE_URL
    value: "jdbc:postgresql://host:5432/dotcms?tcpKeepAlive=true&socketTimeout=600"

  # Reduce permanent idle footprint from 3 connections to 1 per pod.
  - name: DB_MIN_IDLE
    value: "1"

Priority order and rationale:

| Change | Impact | Risk |
|---|---|---|
| `DB_MAXWAIT=1800000` | Stops connection churn immediately — single biggest impact | None |
| `tcpKeepAlive=true` in `DB_BASE_URL` | OS detects and closes dead TCP connections (Mode B) | None |
| `socketTimeout=600` in `DB_BASE_URL` | Backstop: unblocks any socket read stuck >10 min | Low — 10 min is conservative |
| `DB_MIN_IDLE=1` | Reduces permanent idle connections per pod | None |

RDS — apply per role (immediate, no restart required)

These take effect for new sessions only. Existing connections are unaffected until they reconnect.

-- Backstop for any query running longer than 10 minutes.
-- Covers Mode A (slow/locked query) at the database level while the
-- setQueryTimeout() code fix is pending. Conservative value preserves
-- legitimate long-running operations.
ALTER ROLE dotcms_user SET statement_timeout = '600000';

-- Backstop for connections stuck idle inside an open transaction for
-- longer than 30 minutes. This is deliberately conservative given the
-- known indexing-in-transaction anti-pattern — genuine publish transactions
-- do not sit idle in BEGIN for 30 minutes. Provides a safety net for
-- truly stuck connections without risking publish failures.
ALTER ROLE dotcms_user SET idle_in_transaction_session_timeout = '1800000';

-- Backstop for connections that have fully escaped pool management
-- (leaked connections that neither code nor HikariCP will clean up).
-- 2 hours ensures pooled connections cycling at maxLifetime=30min are
-- never affected. Provides a ceiling on how long a leaked connection
-- can persist before RDS terminates it.
ALTER ROLE dotcms_user SET idle_session_timeout = '7200000';

Verify settings applied:

SELECT rolname, rolconfig FROM pg_roles WHERE rolname = 'dotcms_user';

What these workarounds do and do not address

| Failure mode | Workaround coverage | Permanent fix |
|---|---|---|
| Connection churn from 60s `maxLifetime` | ✅ Fixed by `DB_MAXWAIT=1800000` | `DB_MAXWAIT` rename in code |
| Mode A — slow/locked metric query blocking indefinitely | ⚠️ Partially covered by `statement_timeout=10min` (database kills query eventually) | `setQueryTimeout()` in `DotConnect` |
| Mode B — TCP hang after query completes | ✅ `tcpKeepAlive=true` handles dead connections; `socketTimeout=600` covers edge cases | `tcpKeepAlive=true` in code path |
| Leaked connections persisting indefinitely | ✅ `idle_session_timeout=2h` and `idle_in_transaction_session_timeout=30min` provide a ceiling | `keepaliveTime` + code fixes |
| NAT silently killing `minimumIdle` connections | ⚠️ Mitigated by `DB_MIN_IDLE=1` reducing exposure; not eliminated without `keepaliveTime` | `keepaliveTime` via `DB_KEEPALIVE_TIME` env var |

Priority

High — without this, the `wrapConnection()` pattern added in #34490 cannot reliably prevent connection orphaning regardless of whether the blocking JDBC call is still in-flight or has already completed server-side. The `DB_MAXWAIT` naming bug means `maxLifetime` is running at 60s on all deployments regardless of `setenv.sh` configuration, causing significant unnecessary connection churn.


Acceptance Criteria

Tier 1 — Safe to apply now

  • `getQueryTimeoutSeconds()` added to `TimeoutConfig` — returns `Math.max(1, getMetricTimeoutSeconds() - 2)`, no new config property
  • `setQueryTimeout(int seconds)` added to `DotConnect` — default `0` (no timeout), fluent setter, applied to both `PreparedStatement` and `Statement` paths in `executeQuery()`
  • `TimeoutConfig` injected into `MetricsFactoryImpl`; `getValue()` and `getList()` chain `.setQueryTimeout(timeoutConfig.getQueryTimeoutSeconds())`
  • `getSchemaDBVersion()` does not use `setQueryTimeout`
  • `tcpKeepAlive=true` added to `DB_BASE_URL` in k8s ConfigMap; `socketTimeout=600` removed from `DB_BASE_URL` (workaround no longer needed)
  • `keepaliveTime` added to `SystemEnvDataSourceStrategy` via env var `DB_KEEPALIVE_TIME`; default `120000` in `setenv.sh`
  • `CONNECTION_DB_MAX_WAIT` constant in `DataSourceStrategyProvider.java` renamed from `"DB_MAXWAIT"` to `"DB_MAX_WAIT"` to match `setenv.sh` and naming convention
  • `DB_MAX_WAIT` default raised to `1800000` in `setenv.sh`; `DB_MAXWAIT` workaround entry removed from StatefulSet env vars
  • `DB_MIN_IDLE` default lowered to `1` in `setenv.sh`
  • K8s ConfigMaps and deployment scripts audited for `DB_MAXWAIT` references and updated to `DB_MAX_WAIT`
  • Verified: existing `DotConnect` callers without `setQueryTimeout()` are unaffected
  • Verified: a slow/locked metric query is cancelled at `metricTimeout - 2s` and the connection is returned to HikariCP
  • `DB_CONNECTION_TIMEOUT` default raised to `30000` in `setenv.sh`

Tier 2 — Requires observability data first

  • Peak connection count per pod baselined via `pg_stat_activity` across publish, reindex, and normal traffic scenarios (LISTEN/NOTIFY connection filtered out)
  • Tomcat max threads and async thread pool configuration reviewed alongside connection pool sizing
  • `DB_MAX_TOTAL` reduced incrementally (200 → 150 → 100) based on observed peak, with HikariCP JMX metrics monitored after each step

Tier 3 — Requires code remediation first

  • Indexing-in-transaction pattern tracked as technical debt — index operations to be moved outside transaction boundaries or executed post-commit
  • `idle_in_transaction_session_timeout` tightened from 30min workaround to 5–10 minutes only after Tier 3 code work is complete
  • `idle_session_timeout` tightened from 2h workaround to 1h only after `keepaliveTime` (Tier 1) is confirmed working in production

Context

Companion fix to #34490:

  • fix(telemetry): fix telemetry DB connection leak in MetricStatsColl (#34480) #34490 — `wrapConnection()` ensures `closeSilently()` runs once JDBC returns
  • This issue — `setQueryTimeout()` ensures JDBC returns for slow/locked queries; `tcpKeepAlive` handles TCP-level hangs; `keepaliveTime` prevents NAT from silently killing idle connections; `DB_MAXWAIT` naming fix makes `maxLifetime` configuration effective for the first time; server-side timeouts (Tier 3) catch any connection that escapes pool management entirely

Tier 1 changes can be applied immediately. Tier 2 and Tier 3 require observability work and code remediation respectively. The temporary operational workarounds above can be applied to production now to reduce RDS connection pressure while the code work proceeds.
