Skip to content

feat: support distributed query cancellation in query queue and cancellation promise for Athena#10953

Merged
paveltiunov merged 26 commits into
masterfrom
cursor/distributed-cancel-athena-5ec0
May 29, 2026
Merged

feat: support distributed query cancellation in query queue and cancellation promise for Athena#10953
paveltiunov merged 26 commits into
masterfrom
cursor/distributed-cancel-athena-5ec0

Conversation

@paveltiunov
Copy link
Copy Markdown
Member

@paveltiunov paveltiunov commented May 27, 2026

Description of Changes Made

Distributed Query Cancellation (QueryQueue)

In a distributed/multi-node Cube deployment with CubeStore, when a query is cancelled (manually, via orphan/stale detection, or timeout reconciliation) by a node different from the one executing the query, the queue entry is removed but the actual database query keeps running until it completes or times out. This is because the cancel handler (queue.handles[cancelHandler]) is process-local — it only exists on the node executing the query.

Changes:

  • Enhanced the heartbeat interval in processQuery to detect external cancellation
  • After updating the heartbeat, if a cancel handler has been registered, the executing node checks whether the queue item still exists via getQueryDef
  • If the queue item has been removed (cancelled externally) and the query is still executing, the local cancel handler is invoked via processCancel, triggering actual database query cancellation
  • Added queryExecutionFinished flag to prevent false positives (e.g., detecting "gone" after natural completion)
  • Added localCancelHandler tracking to capture the cancel handle registered by the query handler

This enables distributed query cancellation at heartbeat interval granularity (default 30s) without requiring inter-process messaging or CubeStore protocol changes.

Athena Cancellation Promise (AthenaDriver)

The Athena driver previously returned plain Promise objects from query methods, which meant the query orchestrator could not cancel in-flight Athena queries on timeout, manual cancel, or pre-aggregation cancellation.

Changes:

  • query(), memory(), stream(), downloadQueryResults(), and loadPreAggregationIntoTable() now return MaybeCancelablePromise with a .cancel() method
  • Calling .cancel() sets a cancellation flag and calls stopQueryExecution() on AWS Athena via the existing stopQuery() method
  • waitForSuccess() now accepts an optional isCancelled callback to detect cancellation during the poll loop, enabling fast exit from the polling cycle
  • The stream() release function now properly stops the Athena query instead of being a no-op
  • Cancellation is checked at multiple points: after startQuery(), during each poll iteration in waitForSuccess(), and during result row iteration

This integrates with the query orchestrator's existing cancel machinery — the cancelCombinator pattern in PreAggregationLoader and the cancel handler registration in QueryCache.createQueue will automatically detect and use the .cancel() method.

REST API for Query Cancellation by Request ID

Added a new public API endpoint for cancelling running queries by their request ID.

Endpoint: DELETE /cubejs-api/v1/running-query/:requestId

Flow:

  • ApiGateway registers the route with user auth middleware
  • OrchestratorApi.cancelQueryByRequestId() delegates to the orchestrator
  • QueryOrchestrator.cancelQueryByRequestId() searches across all SQL query and pre-aggregation queues for all data sources
  • QueryQueue.cancelQueryByRequestId() scans queued queries, matches by requestId, and cancels via cancelQuery()

Check List

  • Tests have been run in packages where changes have been made if available
  • Linter has been run for changed code
  • Tests for the changes have been added if not covered yet
  • Docs have been added / updated if required
Open in Web Open in Cursor 

@github-actions github-actions Bot added driver:athena Issues related to the AWS Athena driver javascript Pull requests that update Javascript code data source driver labels May 27, 2026
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 27, 2026

Claude finished @paveltiunov's task in 2m 17s —— View job


Code Review: Distributed Query Cancellation + Athena Cancel Support

  • Read and understand the changed files
  • Review QueryQueue distributed cancellation changes
  • Review AthenaDriver cancellation promise changes
  • Check for correctness, security, and performance issues
  • Post detailed review feedback

Overall: Solid design — piggybacking cancellation detection on the existing heartbeat is a pragmatic approach that avoids CubeStore protocol changes. The Athena cancellation support is a meaningful improvement over the previous no-op release and plain promises. I've left inline comments on the specific issues below.

Bugs Found

  1. Partial result on cancellation (AthenaDriver)memory() and query() use if (cancelled) break during row iteration, which exits the loop and returns truncated data to the caller as if it were a complete result. Should throw instead of break. See inline comments at line 287 and line 368.

  2. Synchronous throw in loadPreAggregationIntoTable — The S3 config validation at line 444 was previously wrapped by async and returned as a rejected promise. Now it's a synchronous throw since the method is no longer async. Callers using .catch() without a try-catch will miss this.

Race Condition (Low Severity)

The heartbeat-based cancellation detection in QueryQueue has a TOCTOU window between getQueryDef() returning null and queryExecutionFinished being checked — a normally-completed query could be mistakenly "cancelled" if the timing aligns. The double-check mitigates but doesn't eliminate this. For Athena specifically this is benign (stopQuery is best-effort). See inline comment.

Code Quality

  • The cancelable-promise boilerplate is duplicated 4 times with near-identical structure. A small helper would reduce surface area for the partial-result bug. See suggestion.
  • Nice fix of the rerutns typo → returns in the query() docstring.

Testing

No tests were added for either change. Key scenarios that would benefit from coverage:

  • Cancel during waitForSuccess polling — verify stopQueryExecution is called
  • Cancel during row iteration — verify error is thrown (not partial data)
  • QueryQueue heartbeat detecting external cancellation — verify cancel handler invoked
  • QueryQueue heartbeat when query completes normally — verify no false cancellation

Security & Performance

  • No security concerns — no new input surfaces, no credential handling changes.
  • Performance impact is minimal: one extra getQueryDef call per heartbeat interval (default 30s) when a cancel handler is registered. This is acceptable.

Comment thread packages/cubejs-athena-driver/src/AthenaDriver.ts
Comment thread packages/cubejs-athena-driver/src/AthenaDriver.ts
Comment thread packages/cubejs-athena-driver/src/AthenaDriver.ts Outdated
Comment thread packages/cubejs-query-orchestrator/src/orchestrator/QueryQueue.ts Outdated
Comment on lines +273 to +297
const promise: any = (async () => {
qid = await this.startQuery(query, values);
if (cancelled) {
await this.stopQuery(qid);
throw new Error('Query was cancelled');
}
await this.waitForSuccess(qid, () => cancelled);
const iter = this.lazyRowIterator(qid, query, true);
const types = <TableStructure><unknown>((await iter.next()).value);
const rows: Row[] = [];
for await (const row of iter) {
if (cancelled) break;
rows.push(<Row>row);
}
return { types, rows };
})();

promise.cancel = async () => {
cancelled = true;
if (qid) {
await this.stopQuery(qid);
}
};

return promise;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit (code quality): The cancelable-promise wrapping pattern (declare qid/cancelled, create IIFE, attach .cancel()) is duplicated nearly identically across memory(), stream(), query(), and loadPreAggregationIntoTable() — four copies of the same boilerplate.

Consider extracting a small helper like:

private makeCancelable<T>(
  fn: (isCancelled: () => boolean) => Promise<T>
): MaybeCancelablePromise<T> {
  let qid: AthenaQueryId | null = null;
  let cancelled = false;
  const promise: any = fn(() => cancelled);
  promise.cancel = async () => {
    cancelled = true;
    if (qid) await this.stopQuery(qid);
  };
  return promise;
}

Not blocking, but would reduce the surface area for the partial-result bug found above from four sites to one.

}
} finally {
// catch block can throw an exception, it's why it's important to clearInterval here
queryExecutionFinished = true;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good: The old comment (// catch block can throw an exception, it's why it's important to clearInterval here) was removed, and queryExecutionFinished = true is now set before clearInterval. This correctly signals to any in-flight heartbeat callback that execution is done.

One edge case: if a heartbeat callback is currently executing its getQueryDef check when queryExecutionFinished is set to true, the callback will exit early via the if (queryExecutionFinished) return; guard — which is correct behavior.

protected async waitForSuccess(qid: AthenaQueryId, isCancelled?: () => boolean): Promise<void> {
const startedTime = Date.now();
for (let i = 0; Date.now() - startedTime <= this.config.pollTimeout; i++) {
if (isCancelled?.()) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cancellation check runs before checkStatus(), so if the query was cancelled between polls but the Athena query already succeeded, waitForSuccess will throw "Query was cancelled" even though the data is ready. This is fine in practice (the stopQuery call is best-effort and a no-op on a completed query), but worth noting that the ordering means cancellation always wins — even a microsecond before success. This seems intentional for consistency.

@paveltiunov paveltiunov marked this pull request as ready for review May 28, 2026 00:15
@paveltiunov paveltiunov requested a review from a team as a code owner May 28, 2026 00:15
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 28, 2026

Claude finished @paveltiunov's task in 4m 31s —— View job


Comprehensive Code Review: Distributed Query Cancellation + Athena Cancel Support

  • Read and analyze changed files in detail
  • Review QueryQueue distributed cancellation changes
  • Review AthenaDriver cancellation promise changes
  • Check for bugs, security, performance, and testing issues
  • Post comprehensive review with inline comments

Summary

The design is sound — piggybacking cancellation detection on the existing heartbeat avoids CubeStore protocol changes, and adding MaybeCancelablePromise to Athena methods integrates cleanly with the existing cancelCombinator machinery in PreAggregationLoader. I found 2 bugs that should be fixed before merge and a few lower-priority items.

Bugs Found

Severity Issue Location
Medium Partial result on cancellationmemory() and query() use if (cancelled) break during row iteration, which exits the loop and returns truncated data to the caller as a complete result. Should throw instead of break for consistency with the other cancel checkpoints. AthenaDriver.ts:284, AthenaDriver.ts:366
Low Unhandled promise rejection in heartbeat — The callback was changed from non-async to async, but updateHeartBeat is awaited outside any try/catch. A connection failure now becomes an unhandled rejection instead of a silently ignored promise. QueryQueue.ts:789
Low Synchronous throw in loadPreAggregationIntoTable — The S3 config validation is now a synchronous throw since the method is no longer async. Current callers are safe (verified both call sites in PreAggregationLoader.ts — both use await inside try/catch), but moving the check inside the async IIFE would be more defensive. AthenaDriver.ts:442

Race Condition Analysis (QueryQueue)

I traced through the possible interleavings of the heartbeat-based cancellation detection and found the double-check pattern is correct. The key invariant: queryExecutionFinished = true is set in the finally block (line 928) before setResultAndRemoveQuery removes the queue item (line 932). Since !currentDef && !queryExecutionFinished evaluates synchronously, there's no microtask boundary where the flag could change between the two reads. Details in inline comment.

Code Quality

  • The cancelable-promise boilerplate is repeated 4x with near-identical structure across memory(), stream(), query(), and loadPreAggregationIntoTable(). A small makeCancelable helper would reduce duplication and the surface area for the partial-result bug. Suggestion.
  • Nice fix of the rerutns typo in the query() docstring.
  • Good: the stream() release function now properly calls stopQuery instead of being a no-op.

Security & Performance

  • Security: No concerns — no new input surfaces, no credential handling changes.
  • Performance: One extra getQueryDef call per heartbeat interval (default 30s) when a cancel handler is registered. Negligible overhead.

Testing

No tests were added for either change. Key scenarios that would benefit from coverage:

  • Cancel during waitForSuccess polling — verify stopQueryExecution is called
  • Cancel during row iteration — verify an error is thrown (not partial data returned)
  • QueryQueue heartbeat detecting external cancellation — verify cancel handler is invoked
  • QueryQueue heartbeat after normal completion — verify no false cancellation

Recommendation

Fix the two breakthrow changes in memory() and query() (the partial-result bug), and wrap updateHeartBeat in a try/catch. The rest are suggestions.

Comment thread packages/cubejs-athena-driver/src/AthenaDriver.ts Outdated
Comment thread packages/cubejs-athena-driver/src/AthenaDriver.ts Outdated
Comment thread packages/cubejs-query-orchestrator/src/orchestrator/QueryQueue.ts Outdated
Comment thread packages/cubejs-athena-driver/src/AthenaDriver.ts
Comment on lines +791 to +810
if (!queryExecutionFinished && localCancelHandler !== null) {
try {
const currentDef = await queueConnection.getQueryDef(queryKeyHashed, queueId);
if (!currentDef && !queryExecutionFinished) {
this.logger('Cancelling query due to external cancellation', {
queueId,
queryKey: query.queryKey,
queuePrefix: this.redisQueuePrefix,
requestId: query.requestId,
metadata: query.query?.metadata,
preAggregationId: query.query?.preAggregation?.preAggregationId,
newVersionEntry: query.query?.newVersionEntry,
preAggregation: query.query?.preAggregation,
addedToQueueTime: query.addedToQueueTime,
});
const cancelQuery = { ...query, cancelHandler: localCancelHandler };
if (this.cancelHandlers[query.queryHandler]) {
await this.cancelHandlers[query.queryHandler](cancelQuery);
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Race condition analysis — actually well-handled. I traced through the possible interleavings:

When a query completes normally, the finally block (line 928) sets queryExecutionFinished = true before setResultAndRemoveQuery (line 932) removes the queue item. So if a heartbeat callback is mid-flight:

  1. It passes the first !queryExecutionFinished guard (line 791)
  2. getQueryDef resolves — even if the item was just removed by setResultAndRemoveQuery, queryExecutionFinished is already true
  3. The second check !queryExecutionFinished (line 794) catches it — no false cancellation

Since both reads in !currentDef && !queryExecutionFinished happen in the same synchronous expression (no microtask boundary between them), this double-check pattern is sound in single-threaded JS.

One suggestion: consider adding a comment documenting this invariant ("queryExecutionFinished must be set before the queue item is removed") since it's the load-bearing ordering guarantee.

Comment on lines +270 to +297
let qid: AthenaQueryId | null = null;
let cancelled = false;

const promise: any = (async () => {
qid = await this.startQuery(query, values);
if (cancelled) {
await this.stopQuery(qid);
throw new Error('Query was cancelled');
}
await this.waitForSuccess(qid, () => cancelled);
const iter = this.lazyRowIterator(qid, query, true);
const types = <TableStructure><unknown>((await iter.next()).value);
const rows: Row[] = [];
for await (const row of iter) {
if (cancelled) break;
rows.push(<Row>row);
}
return { types, rows };
})();

promise.cancel = async () => {
cancelled = true;
if (qid) {
await this.stopQuery(qid);
}
};

return promise;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit (code quality): The cancelable-promise wrapping pattern (let qid/cancelled, create async IIFE, attach .cancel()) is repeated nearly identically across memory(), stream(), query(), and loadPreAggregationIntoTable(). A small helper would centralize the logic and reduce the surface area for bugs like the partial-result issue:

private makeCancelable<T>(
  fn: (isCancelled: () => boolean) => Promise<T>
): MaybeCancelablePromise<T> {
  let qid: AthenaQueryId | null = null;
  let cancelled = false;
  const promise: any = fn(() => cancelled);
  promise.cancel = async () => {
    cancelled = true;
    if (qid) await this.stopQuery(qid);
  };
  return promise;
}

Not blocking, but would cut four ~15-line boilerplate blocks down to one.

@cursor cursor Bot requested a review from a team as a code owner May 28, 2026 00:35
});
}));

app.delete(`${this.basePath}/v1/running-query/:requestId`, userMiddlewares, userAsyncHandler(async (req: any, res) => {
@codecov
Copy link
Copy Markdown

codecov Bot commented May 28, 2026

Codecov Report

❌ Patch coverage is 21.73913% with 36 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.84%. Comparing base (f39bcf0) to head (2f3928d).
⚠️ Report is 12 commits behind head on master.

Files with missing lines Patch % Lines
...-query-orchestrator/src/orchestrator/QueryQueue.ts 19.23% 21 Missing ⚠️
packages/cubejs-api-gateway/src/gateway.ts 12.50% 7 Missing ⚠️
...orchestrator/src/orchestrator/QueryOrchestrator.ts 0.00% 6 Missing ⚠️
...ubejs-query-orchestrator/src/orchestrator/utils.ts 66.66% 0 Missing and 1 partial ⚠️
...ges/cubejs-server-core/src/core/OrchestratorApi.ts 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master   #10953       +/-   ##
===========================================
+ Coverage   58.51%   78.84%   +20.33%     
===========================================
  Files         216      470      +254     
  Lines       17014    93380    +76366     
  Branches     3457     3462        +5     
===========================================
+ Hits         9956    73629    +63673     
- Misses       6555    19248    +12693     
  Partials      503      503               
Flag Coverage Δ
cube-backend 58.41% <21.73%> (-0.10%) ⬇️
cubesql 83.41% <ø> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@cursor cursor Bot requested a review from keydunov as a code owner May 28, 2026 01:17
@github-actions github-actions Bot added the rust Pull requests that update Rust code label May 28, 2026
cursoragent and others added 4 commits May 28, 2026 22:21
…llation promise for Athena

Distributed Query Cancellation (QueryQueue):
- During the heartbeat interval, the executing node now checks if the
  queue item was removed externally (by another node calling cancelQuery
  or reconcile removing orphaned/stalled queries)
- If the queue item is gone and a cancel handler is registered locally,
  the local cancel handler is invoked, triggering actual DB query
  cancellation
- This closes the gap where cancellation from one node would remove the
  queue entry but the actual DB query kept running on the executing node

Athena Cancellation Promise (AthenaDriver):
- query(), memory(), stream(), downloadQueryResults(), and
  loadPreAggregationIntoTable() now return MaybeCancelablePromise with
  a .cancel() method
- Calling cancel() sets a flag that aborts the poll loop in
  waitForSuccess() and calls stopQueryExecution() on AWS Athena
- waitForSuccess() now accepts an optional isCancelled callback to
  detect cancellation during the poll loop
- The stream() release function now properly stops the Athena query
- This enables the query orchestrator to cancel in-flight Athena queries
  on timeout, manual cancel, or pre-aggregation cancellation via
  cancelCombinator

Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
…llback

Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
Add DELETE /cubejs-api/v1/running-query/:requestId endpoint that cancels
in-flight queries matching the given request ID across all SQL and
pre-aggregation queues.

- QueryQueue.cancelQueryByRequestId: scans queued queries and cancels
  those matching the request ID
- QueryOrchestrator.cancelQueryByRequestId: searches across all SQL
  query and pre-aggregation queues for all data sources
- OrchestratorApi.cancelQueryByRequestId: delegates to orchestrator
- ApiGateway: registers DELETE route with user auth middleware

Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
cursoragent and others added 11 commits May 28, 2026 22:21
…ding

- Remove timeout argument from test() call — jest.setTimeout at the
  describe level already applies, and the types don't support the
  3-argument overload (TS2769)
- Remove blank line before closing brace (padded-blocks lint error)

Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
…tion

Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
The SQL API appends -span-N suffixes to request IDs on each
continue-wait retry, so the same logical query may have different
span suffixes in the queue. Strip the -span-N suffix before
comparing to match all queries from the same request.

Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
The /cubesql endpoint's native transport has an internal retry loop
that re-submits cancelled queries with new span IDs, so cancellation
doesn't surface as an error to the client. Use /load instead where
the continue-wait loop is client-controlled.

After cancellation, /load returns { error: 'Continue wait' } since
the queue item is gone. Assert result.error is defined.

Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
When a client closes the connection to /v1/cubesql, the native
transport's continue-wait retry loop would keep polling indefinitely.

Fix: listen for res 'close' event in the gateway and mark the request
ID as closed. On each retry iteration, the sqlApiLoad callback checks
if the request was closed and throws 'Client disconnected' — a
non-continue-wait error that breaks the transport retry loop.

Closed request IDs are auto-cleaned after 5 minutes.

Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
When a client closes the HTTP connection to /v1/cubesql, the native
transport's continue-wait retry loop would keep polling indefinitely.

Fix: register a 'close' event listener on the Node.js response stream
from the Rust side (OnCloseHandler, modeled after OnDrainHandler).
When the stream closes, a oneshot channel fires. In handle_sql_query,
tokio::select! races the execute() future against the close signal.
If the client disconnects, the execute future is dropped, which
cancels the transport retry loop and all downstream operations.

Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
@cursor cursor Bot force-pushed the cursor/distributed-cancel-athena-5ec0 branch from 76def12 to 1cd4b72 Compare May 28, 2026 22:22
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 28, 2026

Review: latest synchronize (Rust client-disconnect + UUID-prefix cancel matching)

Verdict: No blocking issues. 0 high, 0 medium, 2 low (both inline). Prior partial-result and unhandled-rejection bugs are fixed (verified in 9f38db4); resolved 8 stale review threads.

Details

Tasks

  • Diff vs prior review (9f38db4..HEAD)
  • Re-verify previous bug fixes in AthenaDriver.ts and QueryQueue.ts
  • Review new Rust client-disconnect handling (stream.rs, node_export.rs)
  • Review UUID-prefix cancelQueryByRequestId change
  • Resolve stale claude-authored threads where concerns were addressed
  • Post inline comments for new observations

Previously-flagged bugs (verified fixed in 9f38db4)

  • memory() and query() now throw on mid-iteration cancel instead of break → no more silent partial results. ✅
  • S3 config validation moved inside the async IIFE in loadPreAggregationIntoTable → rejected-promise semantics preserved. ✅
  • updateHeartBeat wrapped in try/catch → no unhandled rejection on connection failure. ✅

Resolved 8 stale claude review threads covering these.

New observations (inline)

  1. Rust client-disconnect (node_export.rs:444-449)tokio::select! cancels the consumer side; whether the underlying database query is actually aborted depends on the upstream driver's drop behavior. Also noting: 'close' fires on normal end too (safe due to take()), and the JS listener outlives close_rx (consistent with OnDrainHandler).
  2. UUID-prefix cancel matching (QueryQueue.ts:520-531) — implicit semantic widening from exact match to prefix match; lastIndexOf('-span-') accepts any input, so a regex anchored to /-span-\d+$/ would be more defensive if requestId is ever user-supplied.

Still applicable from earlier reviews (left open)

  • Code-quality nit: makeCancelable helper to dedupe the 4 cancelable-promise blocks in AthenaDriver.
  • No new tests for the heartbeat cancellation path itself (a JS smoke test was added for the REST cancel endpoint via SQL API).
  • Security scan flagged missing rate-limiting on DELETE /v1/running-query/:requestId (github-advanced-security) — unchanged.

Security & performance

  • No new concerns beyond the existing CodeQL note.
  • One extra getQueryDef per heartbeat when a cancel handler is registered — negligible.

View job run · branch cursor/distributed-cancel-athena-5ec0

Comment on lines +444 to +449
let result = tokio::select! {
_ = close_rx => {
Err(CubeError::internal("Client disconnected".to_string()))
}
res = execute() => res
};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question — does cancellation actually propagate to the database? When the disconnect arm wins, tokio::select! drops the execute() future, which cancels the in-flight stream.next().await on the next yield. That stops consumption of batches, but whether the underlying database query is actually aborted depends on how get_df_batches / the upstream driver handles drop. If it spawns work on a separate task (or if a downstream driver like Athena holds an HTTP connection that completes regardless of consumer), the query keeps running until natural completion. Worth confirming this is "stop polling/streaming on disconnect" (analogous to the SQL API REST flow) rather than "cancel the underlying query."

Stream 'close' also fires on normal end. Node's Writable emits 'close' after the response naturally finishes. In the success path, execute() resolves first and tokio::select! returns Ok; close_rx is dropped and the JS listener firing later is harmless (the Arc<Mutex<Option<Sender>>>::take() makes subsequent on_close calls no-ops). Just noting it's deliberate that the close listener outlives close_rx.

Listener never detached. Like the existing OnDrainHandler, the 'close' listener stays attached for the lifetime of stream_methods.stream. Consistent with the existing pattern, but worth noting if the same JS stream is ever reused across queries — a stale handler from a prior query would fire.

Comment on lines +520 to +531
public async cancelQueryByRequestId(requestId: string): Promise<QueryDef[]> {
const extractUUID = (id: string) => {
const idx = id.lastIndexOf('-span-');
return idx !== -1 ? id.substring(0, idx) : id;
};

const targetUUID = extractUUID(requestId);
const queries: any[] = await this.getQueries();
const cancelled: QueryDef[] = [];

for (const query of queries) {
if (query.requestId && extractUUID(query.requestId) === targetUUID) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UUID-prefix match is broader than the prior exact match. Two observations:

  1. Cross-span cancellation is now possible. If two logical requests share the same UUID prefix but differ only in -span-N, cancelling one cancels the other. This appears intentional (the goal is to cancel all sub-spans of a parent request), but it's an implicit semantic widening from the previous exact-match behavior — worth a one-line comment documenting the intent.

  2. lastIndexOf('-span-') accepts any input. If requestId is ever user-supplied (e.g., via x-request-id header), a string like evil-span-1 would be silently truncated to evil and match any query whose UUID happens to be evil. In practice the prefix is a generated UUID so collisions are negligible, but if the goal is strictly "strip a server-generated -span-N suffix," consider anchoring with a regex like /-span-\d+$/ so unrelated strings aren't quietly normalized.

Neither blocks merge.

- Start slow query via /v1/cubesql (REST SQL API)
- Poll pg_stat_activity on the underlying Postgres to confirm
  pg_sleep is running
- Abort the HTTP connection (triggers Rust-side close detection)
- Cancel the query in the queue via DELETE /running-query/:requestId
- Poll pg_stat_activity again to confirm pg_sleep has stopped,
  verifying the query is no longer executing in Postgres

Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 28, 2026

Claude finished @cursor[bot]'s task in 1m 16s —— View job


Review of latest synchronize (test rewrite)

Verdict: No blocking issues. 0 high, 0 medium, 0 low new. Diff since prior review (9f38db4..HEAD) is small: docs, the Rust client-disconnect path (already reviewed), the UUID-prefix cancel match (already reviewed), and a test rewrite landed in 454cebf.

Details

Tasks

  • Diff vs prior review (9f38db4..HEAD)
  • Re-check prior bug fixes
  • Review new test rewrite (smoke-cubesql.test.ts)
  • Re-confirm Rust + UUID-prefix observations
  • Post findings

New: test rewrite (smoke-cubesql.test.ts:1097-1173)

Nice improvement — the new test verifies the end-to-end cancellation path by checking pg_stat_activity directly, rather than just inspecting the HTTP response. This is meaningful integration coverage for the heartbeat-driven distributed cancel because pg_sleep(30) disappearing within the 20s polling window can only happen if cancel actually propagated to the backing Postgres connection (not natural completion).

Two minor observations (non-blocking):

  1. The test exercises both the Rust-side OnCloseHandler (abort) and the queue cancel (DELETE /running-query/...) in the same case. If both paths work, you can't distinguish which one stopped the query — fine for a smoke test, but consider a second case that omits abortController.abort() to isolate the queue-cancel path.
  2. The pg_stat_activity query uses LIKE '%pg_sleep%' and excludes pg_backend_pid(), which is correct, but if any other test runs concurrently against the same Postgres container it could see a stale match. Not an issue today (sequential test ordering), worth noting if parallelization changes.

Re-verified

  • Partial-result, sync-throw, and heartbeat unhandled-rejection fixes from 9f38db4 still in place.
  • Rust OnCloseHandler and UUID-prefix cancelQueryByRequestId observations from the 22:22Z review still apply (existing unresolved threads).

Still applicable from earlier reviews (left open)

  • makeCancelable helper to dedupe the 4 Athena cancelable-promise blocks.
  • Rust disconnect: confirm whether dropping execute() actually cancels the upstream DB query (vs just stopping consumption).
  • UUID-prefix match: implicit semantic widening from exact match; lastIndexOf('-span-') accepts any input.
  • github-advanced-security: missing rate-limiting on DELETE /v1/running-query/:requestId.

Security & performance

  • No new concerns.
· [`cursor/distributed-cancel-athena-5ec0`](https://github.com/cube-js/cube/tree/cursor/distributed-cancel-athena-5ec0)

- Start slow query via /v1/cubesql REST SQL API
- Confirm pg_sleep is running via pg_stat_activity on the underlying
  Postgres
- Abort HTTP connection (triggers Rust-side close detection)
- Cancel query in queue via DELETE /running-query/:requestId
- Verify second cancel returns empty (query is gone from queue)
- Clean up lingering pg_sleep via pg_terminate_backend in finally

Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 28, 2026

Claude finished @cursor[bot]'s task in 34s —— View job


Review of latest synchronize (test refinement in 78ad4cd)

Verdict: No new blocking issues. The only change since my prior review (454cebf) is a test refinement.

Details

Tasks

  • Diff vs prior review (454cebf..HEAD)
  • Review test changes in 78ad4cd
  • Re-check prior fix invariants
  • Post findings

Diff since last review

Only packages/cubejs-testing/test/smoke-cubesql.test.ts changed. The test was tightened in a meaningful way:

  • Removed the post-cancel "second cancel returns empty" assertion — that only proved the queue entry was gone, not that the backend query actually stopped.
  • Added a polling loop on pg_stat_activity that waits up to 20s (40 × 500ms) for the pg_sleep query to disappear, then asserts sleepStopped === true.
  • Removed the pg_terminate_backend cleanup in finally — the test now relies on the cancel path actually working, rather than masking failures by force-killing the leftover query.

This is the right direction: the test now fails loudly if cancellation doesn't propagate to Postgres, instead of silently passing while leaving a 30s pg_sleep to be reaped by the cleanup.

Minor follow-ups (non-blocking, carried from prior reviews)

The cleanup removal means a real cancellation regression won't be hidden — but if sleepStopped ever flips to false, the test leaves an active pg_sleep in the container for up to its natural ~30s duration. Probably fine for CI; worth knowing if tests get parallelized against a shared Postgres.

The two earlier observations still apply (existing unresolved threads):

  • The test exercises both Rust-side OnCloseHandler and the queue cancel — can't isolate which path stopped the query if both work.
  • pg_stat_activity LIKE '%pg_sleep%' match would collide if other tests run concurrently against the same container.

Re-affirmed earlier threads (still applicable, not re-posting):

  • makeCancelable helper to dedupe the 4 Athena cancelable-promise blocks.
  • Rust disconnect: confirm execute() drop actually aborts upstream DB query vs just halting consumption.
  • UUID-prefix cancelQueryByRequestId match — implicit semantic widening; defensive regex anchor would harden against user-supplied input.
  • github-advanced-security: missing rate-limiting on DELETE /v1/running-query/:requestId.

Security & performance

No new concerns.

· [`cursor/distributed-cancel-athena-5ec0`](https://github.com/cube-js/cube/tree/cursor/distributed-cancel-athena-5ec0)

Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 28, 2026

Claude finished @cursor[bot]'s task in 23s —— View job


Review of latest synchronize (58061ee)

Verdict: No new findings. 0 high, 0 medium, 0 low new.

Details

The only diff since my prior review (78ad4cd..HEAD) is a one-line bump in SlowQuery.js: pg_sleep(30)pg_sleep(90). Commit message says "for cancel test reliability."

This widens the window in which the heartbeat-driven distributed cancel can observably stop the query before the upstream Postgres pg_sleep would have completed naturally — without this bump, a slow CI run could finish the sleep within the 20s poll window the test grants, which would look like a successful cancel even if cancellation never propagated. Sensible reliability fix.

No re-review of code paths needed — observations from prior reviews (makeCancelable helper, Rust OnCloseHandler upstream-cancel semantics, UUID-prefix match widening, missing rate-limiting on the DELETE endpoint) still apply on existing unresolved threads.

Security & performance

  • No new concerns.
· [`cursor/distributed-cancel-athena-5ec0`](https://github.com/cube-js/cube/tree/cursor/distributed-cancel-athena-5ec0)

Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 28, 2026

Claude finished @cursor[bot]'s task in 24s —— View job


Review of latest synchronize (06ea600)

Verdict: No new findings. 0 high, 0 medium, 0 low new.

Details

Only change since prior review (58061ee..HEAD) is a single-commit cargo fmt pass in packages/cubejs-backend-native/src/node_export.rs: import reordering (OnCloseHandler/OnDrainHandler alpha-sort, tokio::sync::oneshot moved into the std/third-party block) and one OnCloseHandler::new(...) call collapsed to a single line. Pure formatting — no behavior change.

Re-affirmed earlier threads still applicable (not re-posting):

  • makeCancelable helper to dedupe the 4 Athena cancelable-promise blocks.
  • Rust disconnect: confirm execute() drop actually aborts the upstream DB query vs just halting consumption.
  • UUID-prefix cancelQueryByRequestId match — implicit semantic widening; defensive regex anchor would harden against user-supplied input.
  • github-advanced-security: missing rate-limiting on DELETE /v1/running-query/:requestId.

Security & performance

  • No new concerns.

· cursor/distributed-cancel-athena-5ec0 ·

Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 29, 2026

Claude finished @cursor[bot]'s task in 22s —— View job


Review of latest synchronize (95aaf6a)

Verdict: No new findings. 0 high, 0 medium, 0 low new.

Details

Only diff since prior review (06ea600..HEAD) is a single 2s setTimeout inserted in smoke-cubesql.test.ts between the first and second DELETE /running-query/:requestId calls, to let the queue cleanup settle before asserting the second cancel returns empty. Pure test reliability tweak — no production code touched.

Minor observation (non-blocking): a fixed sleep is a small flake risk if CI is slow; a poll loop on the second cancel's empty response would be more robust, but 2s is generous for an in-process queue cleanup and matches the project's existing test patterns.

Re-affirmed earlier threads still applicable (not re-posting):

  • makeCancelable helper to dedupe the 4 Athena cancelable-promise blocks.
  • Rust disconnect: confirm execute() drop actually aborts the upstream DB query vs just halting consumption.
  • UUID-prefix cancelQueryByRequestId match — implicit semantic widening; defensive regex anchor would harden against user-supplied input.
  • github-advanced-security: missing rate-limiting on DELETE /v1/running-query/:requestId.

Security & performance

  • No new concerns.
· [`cursor/distributed-cancel-athena-5ec0`](https://github.com/cube-js/cube/tree/cursor/distributed-cancel-athena-5ec0)

…e and QueryCache

Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 29, 2026

Claude finished @cursor[bot]'s task in 26s —— View job


Review of latest synchronize (2f3928d)

Verdict: No new findings. 0 high, 0 medium, 0 low new.

Details

Only diff since prior review (95aaf6a..HEAD) is a pure refactor: extractRequestUUID extracted into packages/cubejs-query-orchestrator/src/orchestrator/utils.ts and reused from both QueryQueue.cancelQueryByRequestId and QueryCache.extractRequestUUID (which now delegates). Behavior is preserved exactly — same lastIndexOf('-span-') logic, same fallback. The static method on QueryCache is kept as a thin wrapper, so external callers aren't broken.

Nice DRY cleanup. The earlier observation about the implicit semantic widening of prefix-match (and the lastIndexOf accepting any input) now applies to a single source of truth, which makes a future regex-anchor hardening even cheaper if you decide to do it.

Re-affirmed earlier threads still applicable (not re-posting):

  • makeCancelable helper to dedupe the 4 Athena cancelable-promise blocks.
  • Rust disconnect: confirm execute() drop actually aborts the upstream DB query vs just halting consumption.
  • UUID-prefix cancelQueryByRequestId match — implicit semantic widening; defensive regex anchor would harden against user-supplied input (now centralized in utils.ts:extractRequestUUID).
  • github-advanced-security: missing rate-limiting on DELETE /v1/running-query/:requestId.

Security & performance

  • No new concerns.

· cursor/distributed-cancel-athena-5ec0 ·

@paveltiunov paveltiunov merged commit 0c08a78 into master May 29, 2026
159 of 162 checks passed
@paveltiunov paveltiunov deleted the cursor/distributed-cancel-athena-5ec0 branch May 29, 2026 04:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data source driver driver:athena Issues related to the AWS Athena driver javascript Pull requests that update Javascript code rust Pull requests that update Rust code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants