fix(duckdb): stop orphaned queries on timeout/cancel to prevent worker hangs #68
…r hangs
Query timeouts (and user cancellations) raced the native DuckDB operation via Promise.race, but the loser was never cancelled: prepared.run() kept executing on a libuv worker thread while the caller's finally immediately disconnected the connection. Closing a connection out from under a live native query can trip an extension abort() that kills the whole worker/API process, and the orphaned queries starve the (default 4-thread) libuv pool so unrelated jobs hang in "Thinking".
withQueryTimeout now interrupts the query and waits for the native operation to actually settle (bounded by QUERY_INTERRUPT_GRACE_MS) before rejecting. The new safeDisconnect helper defers teardown until any still-unwinding query finishes, so a connection is never closed mid-query.
Wired safeDisconnect into every query call site (agent tools, MCP tools, SQL AST validation, DuckDB console, connections + data-browser API routes) and raised UV_THREADPOOL_SIZE in the container entrypoint as defense-in-depth.
Co-authored-by: Cursor <cursoragent@cursor.com>
🚅 Deployed to the archmax-pr-68 environment in archmax SemLayer
// The query ignored the interrupt within the grace window. Hand the
// still-pending settle handle to `safeDisconnect` so it defers the
// teardown rather than disconnecting under a live native operation.
pendingNativeOps.set(connection, opSettled);
This pending-operation tracking is bypassed by several callers in this same file that still run db.disconnectSync() in finally after withQueryTimeout(): attachConnection after the ATTACH timeout path, attachIcebergCatalog after the Iceberg ATTACH, and materialiseModelViewsLocked after CREATE SCHEMA / CREATE OR REPLACE VIEW. If one of those operations times out and does not settle within the grace window, withQueryTimeout() records it here, but the caller closes the connection immediately anyway. That reintroduces the crash/orphaned-worker behavior this PR is trying to prevent on the MCP execute_query setup/materialization path. Replace those finally blocks with safeDisconnect(db), and add a regression test covering attach/materialization timeout, so every connection that ran through withQueryTimeout() honors the deferred disconnect.
Docker image ready: docker pull ghcr.io/archmaxai/archmax:pr-68
…t paths
Bugbot flagged three in-file callers that ran through withQueryTimeout but still called db.disconnectSync() directly in finally: attachConnection, attachIcebergCatalog, and materialiseModelViewsLocked. If one of those ATTACH / CREATE SCHEMA / CREATE OR REPLACE VIEW operations timed out and did not settle within the interrupt grace, the connection was closed out from under a live native operation, reintroducing the crash/orphan behavior this PR prevents on the MCP execute_query setup/materialization path. Route them through safeDisconnect so the deferred disconnect is honored everywhere.
Co-authored-by: Cursor <cursoragent@cursor.com>
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 7bf4692.
…t waits for all
pendingNativeOps tracked at most one settle handle per connection, but a single connection runs several withQueryTimeout calls in sequence (materialisation pass, data-browser exists/count/data). A second timeout overwrote the first's handle, so safeDisconnect could close the connection while the first orphaned query was still live. Merge the new handle with any prior one (Promise.all) so the deferred disconnect waits for every still-running query to unwind.
Co-authored-by: Cursor <cursoragent@cursor.com>
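The handle merge described in this commit can be illustrated with a minimal sketch. It is not the code from the PR: the `Conn` type, the `trackPendingOp` helper, and the `disconnect` callback parameter are hypothetical stand-ins, and the real `safeDisconnect` signature may differ. Only the `pendingNativeOps` name and the use of `Promise.all` come from the commit message.

```typescript
// Hypothetical stand-in for a DuckDB connection; only object identity matters here.
type Conn = object;

// One merged "everything has unwound" handle per connection.
const pendingNativeOps = new WeakMap<Conn, Promise<void>>();

// Record a still-running native operation without dropping earlier ones.
function trackPendingOp(conn: Conn, opSettled: Promise<unknown>): void {
  // Only completion matters, so map both fulfilment and rejection to void.
  const settled: Promise<void> = opSettled.then(
    () => undefined,
    () => undefined,
  );
  const prior = pendingNativeOps.get(conn);
  // Merge with any earlier handle instead of overwriting it.
  const merged: Promise<void> = prior
    ? Promise.all([prior, settled]).then(() => undefined)
    : settled;
  pendingNativeOps.set(conn, merged);
}

// Wait for every tracked operation, then run the real teardown.
async function safeDisconnect(conn: Conn, disconnect: () => void): Promise<void> {
  let pending = pendingNativeOps.get(conn);
  while (pending) {
    await pending;
    const latest = pendingNativeOps.get(conn);
    if (latest === pending) {
      pendingNativeOps.delete(conn);
      break;
    }
    pending = latest; // another timeout was recorded while we waited
  }
  disconnect();
}
```

With a plain `set` in place of the merge, the second timeout's handle would replace the first, and the teardown could run while the first query was still live.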

Summary
Some query timeouts (and user cancellations) left the underlying native DuckDB query running after the tool call had already returned, which in turn destabilized both the semantic model builder and playground agents. This fixes the orphaned-execution cascade.
Root cause
withQueryTimeout used Promise.race([operation(), timeoutPromise]). When the timeout/cancel won, the loser was never cancelled: prepared.run() kept executing on a libuv worker thread while the caller's finally { db.disconnectSync() } fired immediately. Consequences:
- Closing a connection out from under a live native query can trip an extension abort() that kills the whole worker/API process (see entrypoint.sh). This is why only some (federated/non-instantly-interruptible) timeouts caused it.
- No UV_THREADPOOL_SIZE was set (libuv default 4) while WORKER_CONCURRENCY=5, so a few orphaned native queries starved all async I/O and new jobs hung.
Fix
- withQueryTimeout now calls interrupt() and waits for the native op to actually settle (bounded by new QUERY_INTERRUPT_GRACE_MS, default 30s) before rejecting, so the connection is never torn down mid-query.
- safeDisconnect() defers disconnectSync until a still-unwinding query finishes; wired into every query call site (agent tools ×2, MCP tools, SQL AST validation, DuckDB console, and the connections + data-browser API routes).
- Set UV_THREADPOOL_SIZE (default 16, overridable) in the container entrypoint as defense-in-depth.
- Documented QUERY_INTERRUPT_GRACE_MS in .env.example.
Behavioral note
Interruptible queries unwind in milliseconds (no change). For genuinely non-interruptible queries, the tool now waits up to QUERY_INTERRUPT_GRACE_MS for the query to unwind before returning. This is intentional, since waiting (bounded, configurable) is preferable to disconnecting mid-query and aborting the process. Timeouts remain a recoverable tool result (agent continues); only user cancellation aborts the run, unchanged.
Test plan
- withQueryTimeout unit tests: waits-for-unwind before rejecting, deferred safeDisconnect, already-aborted guard, QUERY_INTERRUPT_GRACE_MS parsing.
- duckdb.test.ts (105) + service tests (mcp-tools, agent-tools) pass.
- connections-reinit, connections-firebird: updated mocks + pass.
- pnpm typecheck for @archmax/core and @archmax/api.
Made with Cursor
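The test plan mentions QUERY_INTERRUPT_GRACE_MS parsing. A rough sketch of what such parsing could look like follows. The function name `parseInterruptGraceMs` is hypothetical, and falling back to the default on empty, non-numeric, or negative input is an assumption. Only the 30s default is stated in the PR description.

```typescript
// Default taken from the PR description ("default 30s").
const DEFAULT_QUERY_INTERRUPT_GRACE_MS = 30000;

// Hypothetical parser: the fallback rules below are assumptions, not the PR's code.
function parseInterruptGraceMs(raw: string | undefined): number {
  // Unset or blank values use the default.
  if (raw === undefined || raw.trim() === "") {
    return DEFAULT_QUERY_INTERRUPT_GRACE_MS;
  }
  const parsed = Number(raw);
  // Reject NaN, Infinity, and negative durations rather than passing them on.
  if (!Number.isFinite(parsed) || parsed < 0) {
    return DEFAULT_QUERY_INTERRUPT_GRACE_MS;
  }
  // Timers take whole milliseconds.
  return Math.floor(parsed);
}
```

In a Node process, the input would typically be `process.env.QUERY_INTERRUPT_GRACE_MS`, read once at startup.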
Note
Medium Risk
Touches core DuckDB lifecycle for every federated query path; mis-tuned grace/timeouts could delay tool responses, but the change reduces native crash and pool-starvation risk.
Overview
Fixes orphaned DuckDB work after query timeouts/cancellations by changing how withQueryTimeout ends a lost race: it now interrupt()s the connection and waits (up to QUERY_INTERRUPT_GRACE_MS) for the native operation to settle before rejecting, instead of returning while prepared.run() is still on a libuv worker thread.
Adds safeDisconnect, which defers disconnectSync when a timed-out query is still unwinding (tracking multiple pending ops per connection so later timeouts don't mask earlier live queries). All query teardown paths (agent/MCP tools, SQL AST validation, federation console, connection test/reinit, data browser, and ATTACH/materialise flows) switch from disconnectSync to safeDisconnect.
Also skips starting queries when the abort signal is already set, documents QUERY_INTERRUPT_GRACE_MS in .env.example, and sets UV_THREADPOOL_SIZE default 16 in the container entrypoint so a few stuck federated queries are less likely to starve the whole process.
Reviewed by Cursor Bugbot for commit 0d42b29.
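The interrupt-then-wait behavior summarized in the overview can be sketched roughly as below. This is an illustration under stated assumptions, not the PR's implementation: `InterruptibleOp`, `QueryTimeoutError`, `timer`, `withQueryTimeoutSketch`, and the `onStillPending` callback are all hypothetical names, and the real code works against a DuckDB connection and prepared statement.

```typescript
// Hypothetical shape; the real code wraps a DuckDB prepared statement and connection.
interface InterruptibleOp<T> {
  run(): Promise<T>;
  interrupt(): void;
}

class QueryTimeoutError extends Error {}

// Resolve with `tag` after `ms`; cancel() clears the underlying timer.
function timer<Tag extends string>(ms: number, tag: Tag) {
  let handle: ReturnType<typeof setTimeout> | undefined;
  const promise = new Promise<Tag>((resolve) => {
    handle = setTimeout(() => resolve(tag), ms);
  });
  return { promise, cancel: () => clearTimeout(handle) };
}

async function withQueryTimeoutSketch<T>(
  op: InterruptibleOp<T>,
  timeoutMs: number,
  graceMs: number,
  onStillPending: (settled: Promise<void>) => void,
): Promise<T> {
  const running = op.run();
  // Settles when the native work finishes, whether it succeeded or failed.
  const settled: Promise<void> = running.then(() => undefined, () => undefined);

  const deadline = timer(timeoutMs, "timeout");
  try {
    const winner = await Promise.race([
      running.then((value) => ({ value })),
      deadline.promise,
    ]);
    if (winner !== "timeout") return winner.value;
  } finally {
    deadline.cancel();
  }

  // Timeout won: ask the engine to stop, then wait (bounded) for it to unwind.
  op.interrupt();
  const grace = timer(graceMs, "expired");
  const outcome = await Promise.race([
    settled.then(() => "settled" as const),
    grace.promise,
  ]);
  grace.cancel();
  // Still running after the grace window: hand the handle to the teardown path.
  if (outcome === "expired") onStillPending(settled);
  throw new QueryTimeoutError(`query exceeded ${timeoutMs} ms`);
}
```

The difference from a bare `Promise.race` is that rejection happens only after the operation has settled or the grace window has expired, and in the latter case the settle handle is passed on so that teardown can be deferred.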