Skip to content

fix(duckdb): stop orphaned queries on timeout/cancel to prevent worker hangs#68

Merged
tobias-gp merged 3 commits into
mainfrom
fix/duckdb-timeout-orphaned-query
Jun 6, 2026
Merged

fix(duckdb): stop orphaned queries on timeout/cancel to prevent worker hangs#68
tobias-gp merged 3 commits into
mainfrom
fix/duckdb-timeout-orphaned-query

Conversation

@tobias-gp
Copy link
Copy Markdown
Contributor

@tobias-gp tobias-gp commented Jun 6, 2026

Summary

Some query timeouts (and user cancellations) left the underlying native DuckDB query running after the tool call had already returned, which in turn destabilized both the semantic model builder and playground agents. This fixes the orphaned-execution cascade.

Root cause

withQueryTimeout used Promise.race([operation(), timeoutPromise]). When the timeout/cancel won, the loser was never cancelled — prepared.run() kept executing on a libuv worker thread — while the caller's finally { db.disconnectSync() } fired immediately. Consequences:

  • Orphaned execution: the query "kept executing although the tool call had finished".
  • Process crashes / "general instability": disconnecting a connection out from under a live native query can trip a scanner-extension abort() that kills the whole worker/API process (see entrypoint.sh). This is why only some (federated/non-instantly-interruptible) timeouts caused it.
  • Stuck "Thinking" / pool pollution: no UV_THREADPOOL_SIZE was set (libuv default 4) while WORKER_CONCURRENCY=5, so a few orphaned native queries starved all async I/O and new jobs hung.

Fix

  • withQueryTimeout now calls interrupt() and waits for the native op to actually settle (bounded by new QUERY_INTERRUPT_GRACE_MS, default 30s) before rejecting, so the connection is never torn down mid-query.
  • New safeDisconnect() defers disconnectSync until a still-unwinding query finishes; wired into every query call site (agent tools ×2, MCP tools, SQL AST validation, DuckDB console, and the connections + data-browser API routes).
  • Early-abort guard so an already-cancelled run never starts a query.
  • Raised UV_THREADPOOL_SIZE (default 16, overridable) in the container entrypoint as defense-in-depth.
  • Documented QUERY_INTERRUPT_GRACE_MS in .env.example.

Behavioral note

Interruptible queries unwind in milliseconds (no change). Genuinely non-interruptible queries now wait up to QUERY_INTERRUPT_GRACE_MS to unwind before the tool returns — intentional, since waiting (bounded, configurable) is preferable to disconnecting mid-query and aborting the process. Timeouts remain a recoverable tool result (agent continues); only user cancellation aborts the run, unchanged.

Test plan

  • withQueryTimeout unit tests: waits-for-unwind before rejecting, deferred safeDisconnect, already-aborted guard, QUERY_INTERRUPT_GRACE_MS parsing.
  • Full duckdb.test.ts (105) + service tests (mcp-tools, agent-tools) pass.
  • API integration tests (connections-reinit, connections-firebird) updated mocks + pass.
  • pnpm typecheck for @archmax/core and @archmax/api.
  • Manual: trigger a federated query timeout and confirm the worker stays responsive and no orphaned query / process crash.

Made with Cursor


Note

Medium Risk
Touches core DuckDB lifecycle for every federated query path; mis-tuned grace/timeouts could delay tool responses, but the change reduces native crash and pool-starvation risk.

Overview
Fixes orphaned DuckDB work after query timeouts/cancellations by changing how withQueryTimeout ends a lost race: it now interrupt()s the connection and waits (up to QUERY_INTERRUPT_GRACE_MS) for the native operation to settle before rejecting, instead of returning while prepared.run() is still on a libuv worker thread.

Adds safeDisconnect, which defers disconnectSync when a timed-out query is still unwinding (tracking multiple pending ops per connection so later timeouts don’t mask earlier live queries). All query teardown paths—agent/MCP tools, SQL AST validation, federation console, connection test/reinit, data browser, and ATTACH/materialise flows—switch from disconnectSync to safeDisconnect.

Also skips starting queries when the abort signal is already set, documents QUERY_INTERRUPT_GRACE_MS in .env.example, and sets UV_THREADPOOL_SIZE default 16 in the container entrypoint so a few stuck federated queries are less likely to starve the whole process.

Reviewed by Cursor Bugbot for commit 0d42b29. Bugbot is set up for automated code reviews on this repo. Configure here.

…r hangs

Query timeouts (and user cancellations) raced the native DuckDB operation via
Promise.race, but the loser was never cancelled: prepared.run() kept executing
on a libuv worker thread while the caller's finally immediately disconnected
the connection. Closing a connection out from under a live native query can
trip an extension abort() that kills the whole worker/API process, and the
orphaned queries starve the (default 4-thread) libuv pool so unrelated jobs
hang in "Thinking".

withQueryTimeout now interrupts the query and waits for the native operation to
actually settle (bounded by QUERY_INTERRUPT_GRACE_MS) before rejecting. The new
safeDisconnect helper defers teardown until any still-unwinding query finishes,
so a connection is never closed mid-query. Wired safeDisconnect into every
query call site (agent tools, MCP tools, SQL AST validation, DuckDB console,
connections + data-browser API routes) and raised UV_THREADPOOL_SIZE in the
container entrypoint as defense-in-depth.

Co-authored-by: Cursor <cursoragent@cursor.com>
@railway-app railway-app Bot temporarily deployed to archmax SemLayer / archmax-pr-68 June 6, 2026 05:27 Destroyed
@railway-app
Copy link
Copy Markdown

railway-app Bot commented Jun 6, 2026

🚅 Deployed to the archmax-pr-68 environment in archmax SemLayer

Service Status Web Updated (UTC)
archmax_external_dbs ✅ Success (View Logs) Jun 6, 2026 at 9:40 am
archmax_standalone ✅ Success (View Logs) Jun 6, 2026 at 9:40 am
archmax_standalone_with_volume ✅ Success (View Logs) Jun 6, 2026 at 9:40 am

Comment thread packages/core/src/services/duckdb.ts Outdated
// The query ignored the interrupt within the grace window. Hand the
// still-pending settle handle to `safeDisconnect` so it defers the
// teardown rather than disconnecting under a live native operation.
pendingNativeOps.set(connection, opSettled);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pending-operation tracking is bypassed by several callers in this same file that still run db.disconnectSync() in finally after withQueryTimeout(): attachConnection after the ATTACH timeout path, attachIcebergCatalog after the Iceberg ATTACH, and materialiseModelViewsLocked after CREATE SCHEMA / CREATE OR REPLACE VIEW. If one of those operations times out and does not settle within the grace window, withQueryTimeout() records it here, but the caller immediately closes the connection anyway, reintroducing the crash/orphaned-worker behavior this PR is trying to prevent on the MCP execute_query setup/materialization path. Replace those finally blocks with safeDisconnect(db) (and add a regression covering attach/materialization timeout) so every connection that ran through withQueryTimeout() honors the deferred disconnect.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 6, 2026

Docker image ready

docker pull ghcr.io/archmaxai/archmax:pr-68

…t paths

Bugbot flagged three in-file callers that ran through withQueryTimeout but
still called db.disconnectSync() directly in finally: attachConnection,
attachIcebergCatalog, and materialiseModelViewsLocked. If one of those
ATTACH / CREATE SCHEMA / CREATE OR REPLACE VIEW operations timed out and did
not settle within the interrupt grace, the connection was closed out from
under a live native operation — reintroducing the crash/orphan behavior this
PR prevents on the MCP execute_query setup/materialization path. Route them
through safeDisconnect so the deferred disconnect is honored everywhere.

Co-authored-by: Cursor <cursoragent@cursor.com>
@railway-app railway-app Bot temporarily deployed to archmax SemLayer / archmax-pr-68 June 6, 2026 05:51 Destroyed
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 7bf4692. Configure here.

Comment thread packages/core/src/services/duckdb.ts Outdated
…t waits for all

pendingNativeOps tracked at most one settle handle per connection, but a single
connection runs several withQueryTimeout calls in sequence (materialisation
pass, data-browser exists/count/data). A second timeout overwrote the first's
handle, so safeDisconnect could close the connection while the first orphaned
query was still live. Merge the new handle with any prior one (Promise.all) so
the deferred disconnect waits for every still-running query to unwind.

Co-authored-by: Cursor <cursoragent@cursor.com>
@railway-app railway-app Bot temporarily deployed to archmax SemLayer / archmax-pr-68 June 6, 2026 09:39 Destroyed
@tobias-gp tobias-gp merged commit 433006d into main Jun 6, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant