fix(duckdb): stop orphaned queries on timeout/cancel to prevent worker hangs by tobias-gp · Pull Request #68 · archmaxai/archmax

tobias-gp · 2026-06-06T05:25:59Z

Summary

Some query timeouts (and user cancellations) left the underlying native DuckDB query running after the tool call had already returned, which in turn destabilized both the semantic model builder and playground agents. This fixes the orphaned-execution cascade.

Root cause

withQueryTimeout used Promise.race([operation(), timeoutPromise]). When the timeout/cancel won, the loser was never cancelled — prepared.run() kept executing on a libuv worker thread — while the caller's finally { db.disconnectSync() } fired immediately. Consequences:

Orphaned execution: the query "kept executing although the tool call had finished".
Process crashes / "general instability": disconnecting a connection out from under a live native query can trip a scanner-extension abort() that kills the whole worker/API process (see entrypoint.sh). This is why only some (federated/non-instantly-interruptible) timeouts caused it.
Stuck "Thinking" / pool pollution: no UV_THREADPOOL_SIZE was set (libuv default 4) while WORKER_CONCURRENCY=5, so a few orphaned native queries starved all async I/O and new jobs hung.

Fix

withQueryTimeout now calls interrupt() and waits for the native op to actually settle (bounded by new QUERY_INTERRUPT_GRACE_MS, default 30s) before rejecting, so the connection is never torn down mid-query.
New safeDisconnect() defers disconnectSync until a still-unwinding query finishes; wired into every query call site (agent tools ×2, MCP tools, SQL AST validation, DuckDB console, and the connections + data-browser API routes).
Early-abort guard so an already-cancelled run never starts a query.
Raised UV_THREADPOOL_SIZE (default 16, overridable) in the container entrypoint as defense-in-depth.
Documented QUERY_INTERRUPT_GRACE_MS in .env.example.

Behavioral note

Interruptible queries unwind in milliseconds (no change). Genuinely non-interruptible queries now wait up to QUERY_INTERRUPT_GRACE_MS to unwind before the tool returns — intentional, since waiting (bounded, configurable) is preferable to disconnecting mid-query and aborting the process. Timeouts remain a recoverable tool result (agent continues); only user cancellation aborts the run, unchanged.

Test plan

withQueryTimeout unit tests: waits-for-unwind before rejecting, deferred safeDisconnect, already-aborted guard, QUERY_INTERRUPT_GRACE_MS parsing.
Full duckdb.test.ts (105) + service tests (mcp-tools, agent-tools) pass.
API integration tests (connections-reinit, connections-firebird) updated mocks + pass.
pnpm typecheck for @archmax/core and @archmax/api.
Manual: trigger a federated query timeout and confirm the worker stays responsive and no orphaned query / process crash.

Made with Cursor

Note

Medium Risk
Touches core DuckDB lifecycle for every federated query path; mis-tuned grace/timeouts could delay tool responses, but the change reduces native crash and pool-starvation risk.

Overview
Fixes orphaned DuckDB work after query timeouts/cancellations by changing how withQueryTimeout ends a lost race: it now interrupt()s the connection and waits (up to QUERY_INTERRUPT_GRACE_MS) for the native operation to settle before rejecting, instead of returning while prepared.run() is still on a libuv worker thread.

Adds safeDisconnect, which defers disconnectSync when a timed-out query is still unwinding (tracking multiple pending ops per connection so later timeouts don’t mask earlier live queries). All query teardown paths—agent/MCP tools, SQL AST validation, federation console, connection test/reinit, data browser, and ATTACH/materialise flows—switch from disconnectSync to safeDisconnect.

Also skips starting queries when the abort signal is already set, documents QUERY_INTERRUPT_GRACE_MS in .env.example, and sets UV_THREADPOOL_SIZE default 16 in the container entrypoint so a few stuck federated queries are less likely to starve the whole process.

^{Reviewed by Cursor Bugbot for commit 0d42b29. Bugbot is set up for automated code reviews on this repo. Configure here.}

…r hangs Query timeouts (and user cancellations) raced the native DuckDB operation via Promise.race, but the loser was never cancelled: prepared.run() kept executing on a libuv worker thread while the caller's finally immediately disconnected the connection. Closing a connection out from under a live native query can trip an extension abort() that kills the whole worker/API process, and the orphaned queries starve the (default 4-thread) libuv pool so unrelated jobs hang in "Thinking". withQueryTimeout now interrupts the query and waits for the native operation to actually settle (bounded by QUERY_INTERRUPT_GRACE_MS) before rejecting. The new safeDisconnect helper defers teardown until any still-unwinding query finishes, so a connection is never closed mid-query. Wired safeDisconnect into every query call site (agent tools, MCP tools, SQL AST validation, DuckDB console, connections + data-browser API routes) and raised UV_THREADPOOL_SIZE in the container entrypoint as defense-in-depth. Co-authored-by: Cursor <cursoragent@cursor.com>

railway-app · 2026-06-06T05:27:52Z

🚅 Deployed to the archmax-pr-68 environment in archmax SemLayer

Service	Status	Updated (UTC)
archmax_external_dbs	✅ Success (View Logs)	Jun 6, 2026 at 9:40 am
archmax_standalone	✅ Success (View Logs)	Jun 6, 2026 at 9:40 am
archmax_standalone_with_volume	✅ Success (View Logs)	Jun 6, 2026 at 9:40 am

cursor · 2026-06-06T05:28:35Z

+      // The query ignored the interrupt within the grace window. Hand the
+      // still-pending settle handle to `safeDisconnect` so it defers the
+      // teardown rather than disconnecting under a live native operation.
+      pendingNativeOps.set(connection, opSettled);


This pending-operation tracking is bypassed by several callers in this same file that still run db.disconnectSync() in finally after withQueryTimeout(): attachConnection after the ATTACH timeout path, attachIcebergCatalog after the Iceberg ATTACH, and materialiseModelViewsLocked after CREATE SCHEMA / CREATE OR REPLACE VIEW. If one of those operations times out and does not settle within the grace window, withQueryTimeout() records it here, but the caller immediately closes the connection anyway, reintroducing the crash/orphaned-worker behavior this PR is trying to prevent on the MCP execute_query setup/materialization path. Replace those finally blocks with safeDisconnect(db) (and add a regression covering attach/materialization timeout) so every connection that ran through withQueryTimeout() honors the deferred disconnect.

github-actions · 2026-06-06T05:36:52Z

Docker image ready

docker pull ghcr.io/archmaxai/archmax:pr-68

…t paths Bugbot flagged three in-file callers that ran through withQueryTimeout but still called db.disconnectSync() directly in finally: attachConnection, attachIcebergCatalog, and materialiseModelViewsLocked. If one of those ATTACH / CREATE SCHEMA / CREATE OR REPLACE VIEW operations timed out and did not settle within the interrupt grace, the connection was closed out from under a live native operation — reintroducing the crash/orphan behavior this PR prevents on the MCP execute_query setup/materialization path. Route them through safeDisconnect so the deferred disconnect is honored everywhere. Co-authored-by: Cursor <cursoragent@cursor.com>

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 7bf4692. Configure here.}

…t waits for all pendingNativeOps tracked at most one settle handle per connection, but a single connection runs several withQueryTimeout calls in sequence (materialisation pass, data-browser exists/count/data). A second timeout overwrote the first's handle, so safeDisconnect could close the connection while the first orphaned query was still live. Merge the new handle with any prior one (Promise.all) so the deferred disconnect waits for every still-running query to unwind. Co-authored-by: Cursor <cursoragent@cursor.com>

railway-app Bot temporarily deployed to archmax SemLayer / archmax-pr-68 June 6, 2026 05:27 Destroyed

cursor Bot reviewed Jun 6, 2026

View reviewed changes

railway-app Bot temporarily deployed to archmax SemLayer / archmax-pr-68 June 6, 2026 05:51 Destroyed

cursor Bot reviewed Jun 6, 2026

View reviewed changes

Comment thread packages/core/src/services/duckdb.ts Outdated

railway-app Bot temporarily deployed to archmax SemLayer / archmax-pr-68 June 6, 2026 09:39 Destroyed

tobias-gp added the release label Jun 6, 2026

tobias-gp merged commit 433006d into main Jun 6, 2026
8 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(duckdb): stop orphaned queries on timeout/cancel to prevent worker hangs#68

fix(duckdb): stop orphaned queries on timeout/cancel to prevent worker hangs#68
tobias-gp merged 3 commits into
mainfrom
fix/duckdb-timeout-orphaned-query

tobias-gp commented Jun 6, 2026 •

edited by cursor Bot

Loading

Uh oh!

railway-app Bot commented Jun 6, 2026 •

edited

Loading

Uh oh!

cursor Bot Jun 6, 2026

Uh oh!

github-actions Bot commented Jun 6, 2026

Uh oh!

cursor Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

tobias-gp commented Jun 6, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root cause

Fix

Behavioral note

Test plan

Uh oh!

railway-app Bot commented Jun 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cursor Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Jun 6, 2026

Docker image ready

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

tobias-gp commented Jun 6, 2026 •

edited by cursor Bot

Loading

railway-app Bot commented Jun 6, 2026 •

edited

Loading