
aitools: parallelize discover-schema across tables and probes#5097

Merged
simonfaltum merged 14 commits into main from simonfaltum/aitools-pr4-discover-parallel
Apr 28, 2026

Conversation

@simonfaltum
Member

@simonfaltum simonfaltum commented Apr 27, 2026

Stack

This PR is part of a 4-PR stack making aitools data exploration faster for ai-dev-kit. Each PR is independently reviewable; merge in order.

  1. #5092 — aitools: extract pollStatement helper and pin OnWaitTimeout (base: main)
  2. #5093 — aitools: run multiple SQL queries in parallel from one invocation (base: #5092)
  3. #5095 — aitools: add 'tools statement' lifecycle commands (base: #5093)
  4. #5097 — aitools: parallelize discover-schema across tables and probes (base: #5095) ← this PR

Use git diff <base>...HEAD or set the comparison base in the GitHub UI to see only this PR's changes; the default "Files changed" diff against main includes ancestor PRs.


Why

discover-schema walked tables sequentially and ran each table's three probes (DESCRIBE, sample SELECT, null counts) one after the other. For ai-dev-kit's data-exploration phase that meant warehouse-bound work was idle most of the time. Same root cause as the multi-query exploration latency that #5093 (batch query) fixed; same fix.

This is a pure latency win. No new user-facing API surface, no output-shape change.

Changes

Two layers of parallelism plus a shared statement budget:

  1. Across tables. The for-loop in RunE becomes an errgroup.Group. A failure on one table never aborts the others; it's rendered inline as "Error discovering ..." exactly as before.
  2. Within a table. discoverTable still runs DESCRIBE first because the column list feeds the null-counts query. After DESCRIBE returns, the sample SELECT and null-counts probes run concurrently. Output text is assembled once both probes finish, preserving the existing COLUMNS / SAMPLE DATA / NULL COUNTS order.
  3. Single warehouse-statement budget. A new sqlGate (chan struct{} of capacity N + statement_id tracking) wraps every executeSQL call. --concurrency (default 8) caps total in-flight statements globally, regardless of how many tables you pass. So --concurrency 1 actually serializes statement load, not just table fan-out.

Switch executeSQL to use pollStatement (the helper extracted in #5092) instead of the SDK's ExecuteAndWait. Pins OnWaitTimeout: CONTINUE. Failed states flow through checkFailedState, yielding more specific error messages (e.g. "query failed: SYNTAX_ERROR near 'oops'") than the previous hand-rolled branch. The user-visible "SAMPLE DATA: Error - %v" / "NULL COUNTS: Error - %v" wrapping is unchanged. Future polling-helper improvements land here for free.

Cancellation discipline mirroring batch.go (#5093): signal handler cancels a derived pollCtx; sqlGate records each statement_id post-submission; on cancellation the recorded IDs are swept via CancelExecution before returning root.ErrAlreadyPrinted. Without this, parallelism would orphan up to N×2 statements server-side on Ctrl+C.

--concurrency validation mirrors cmd/fs/cp.go and #5093: PreRunE rejects values <= 0 with errInvalidBatchConcurrency. Table-name validation also runs in PreRunE so malformed identifiers are rejected before MustWorkspaceClient runs (no unnecessary auth roundtrip on bad input).

Output unchanged for any input that previously succeeded. Same dividers, same header/probe ordering, same per-probe error wrapping.

Test plan

  • go test ./experimental/aitools/... passes.

  • make checks clean.

  • make fmt no drift.

  • make lint 0 issues.

  • New unit tests in discover_schema_test.go:

    • quoteTableName table-driven (valid, missing parts, too many parts, injection attempts, empty parts, leading-digit identifiers, backtick in name)
    • parseDescribeResult skips metadata rows (#-prefixed and empty)
    • sqlGate.run pins OnWaitTimeout: CONTINUE, propagates FAILED state, wraps transport errors, records IDs, respects cancelled context
    • cancelDiscoverInFlight calls API per ID; empty list is a no-op
    • discoverTable: sample and null-count probes run concurrently after DESCRIBE (deterministic atomic-counter + sync.OnceFunc + channel-close barrier; sequential execution surfaces a timeout error)
    • discoverTable: a sample-probe failure does not abort null counts
    • --concurrency 0 and -1 rejected at PreRunE
    • Invalid table name (not CATALOG.SCHEMA.TABLE) and injection attempts rejected at PreRunE before any API call
  • Manual smoke against a real warehouse:

    databricks experimental aitools tools discover-schema \
      samples.nyctaxi.trips samples.tpch.orders samples.tpch.customer

@arsenyinfo
Contributor

Column names containing backticks break the null-counts SQL

  • Priority: P3
  • Location: experimental/aitools/cmd/discover_schema.go:256-260 (column source at lines 306-317)
  • Scenario: parseDescribeResult returns raw DESCRIBE-derived column names without escaping; discoverTable interpolates each directly into SUM(CASE WHEN `%s` IS NULL THEN 1 ELSE 0 END) AS `%s_nulls`. A column name containing a backtick (valid in Databricks/Delta DDL via doubled-backtick escaping) terminates the quoted identifier mid-string, producing invalid SQL. The sample probe uses SELECT * and succeeds, so the user sees only a NULL COUNTS: Error - query failed: PARSE_SYNTAX_ERROR ... line that is easy to misattribute. Existing tests cover backticks in table names (discover_schema_test.go:33) but not DESCRIBE-returned column names.
  • Potential solution: Before interpolation, escape embedded backticks in both the identifier and the alias positions with strings.ReplaceAll(col, "`", "``"); alternatively, filter out columns that do not match sqlIdentifierRe and degrade gracefully for those columns.

🔍 Reviewed by nitpicker

Contributor

@arsenyinfo arsenyinfo left a comment


Approved, but I think this change should also be propagated to https://github.com/databricks/databricks-agent-skills/tree/main/skills/databricks-core

@simonfaltum simonfaltum force-pushed the simonfaltum/aitools-pr3-statement branch from ff58192 to a34f39e on April 28, 2026 at 08:14
@simonfaltum simonfaltum force-pushed the simonfaltum/aitools-pr4-discover-parallel branch from b5fc38d to 0d9fecf on April 28, 2026 at 08:19
Refactor `executeAndPoll` in `experimental/aitools/cmd/query.go` to extract
a pure `pollStatement(ctx, api, resp)` helper. The helper polls until the
statement reaches a terminal state and returns the response without any
signal handling, spinner, or server-side cancellation; those concerns stay
in `executeAndPoll` where they belong.

Also pin `OnWaitTimeout: CONTINUE` explicitly on the `ExecuteStatement`
call. The SDK default happens to be CONTINUE today, but relying on it is
a hidden coupling: a server-side default flip would silently break the
poll loop by killing the statement before our first GET.

Behavior is unchanged for the existing `query` command. Follow-up PRs
(parallel batch queries, statement lifecycle command tree) will reuse the
helper.

Co-authored-by: Isaac
Allow `databricks experimental aitools tools query` to accept several SQLs
in a single invocation and run them in parallel against the warehouse.
Pass multiple positional arguments and/or repeat `--file` to fan out:

  databricks experimental aitools tools query \
    --warehouse <wh> --output json \
    "SELECT count(*) FROM t" \
    "SELECT min(ts), max(ts) FROM t" \
    "SELECT col, count(*) FROM t GROUP BY 1"

Multi-query output is always a JSON array of one object per input,
preserving input order. The shape is `{sql, statement_id, state,
elapsed_ms, columns, rows, error}`. Individual statement failures don't
abort siblings; each is encoded in the per-result `error` field, and the
exit code is non-zero when any statement failed.

A new `--concurrency` flag (default 8) caps in-flight statements. On
Ctrl+C the still-running statements are cancelled server-side via
CancelExecution before exit.

Single-query behavior is unchanged. The previous restriction that
forbade mixing `--file` and a positional SQL is lifted, since both now
contribute to the batch.

Co-authored-by: Isaac
Address two findings from a cursor PR review:

1. --concurrency was passed straight into errgroup.SetLimit. A value of
   0 deadlocks (errgroup refuses to add goroutines), and a negative
   value silently removes the cap. Add a PreRunE check that rejects
   anything <= 0 with errInvalidBatchConcurrency, matching the shape
   used by cmd/fs/cp.go for the same flag.

2. The Long help previously said multi-query results come back "in
   input order", which was ambiguous when --file and positional SQLs
   are mixed. The actual behavior (already covered by
   TestResolveSQLsMixedFileAndPositional) is: --file inputs first in
   flag order, then positional SQLs in arg order. Tighten the help
   text to state that contract precisely.

Adds two unit tests that verify --concurrency 0 and -1 are rejected
before any API call.

Co-authored-by: Isaac
… cases

Two pairs of cobra-level tests were each testing one rejection code
path with two flag values. Fold them into table-driven subtests so the
shared assertion lives in one place:

- TestQueryCommandBatchTextOutputRejected + ...CsvOutputRejected →
  TestQueryCommandBatchOutputRejection (text, csv subtests)
- TestQueryCommandConcurrencyZeroRejected + ...NegativeRejected →
  TestQueryCommandConcurrencyRejection (0, -1 subtests)

Same coverage, half the test functions.

Co-authored-by: Isaac
Address Arseni's P2 finding on the batch PR. cancelInFlight (batch.go)
and cancelStatement (query.go) used to derive the cancel-RPC ctx via
context.WithTimeout(ctx, cancelTimeout). On the actual hot path (Ctrl+C
or parent ctx cancelled), the inbound ctx is already cancelled by the
time we reach the cancel sweep. The SDK then short-circuits on
ctx.Err() and the cancel RPC never reaches the warehouse, leaving
in-flight statements running server-side.

Wrap with context.WithoutCancel(ctx) (Go 1.21+) so the timeout context
keeps the caller's values but drops the cancellation signal. The cancel
RPC now actually fires.

Also tighten the existing tests:
- TestExecuteBatchContextCancellationCancelsInFlight
- TestExecuteAndPollCancelledContextCallsCancelExecution

Both previously matched mock.Anything for the ctx argument, so they
passed regardless of whether the bug was present. They now use a
mock.MatchedBy predicate asserting c.Err() == nil on the cancel-RPC ctx.
This is a regression guard; reverting the production fix makes the
tests fail with "unexpected call" because the matcher no longer matches.

Co-authored-by: Isaac
Adds a low-level command tree for asynchronous SQL statement
management, complementing the synchronous 'tools query':

  databricks experimental aitools tools statement submit  "SELECT ..."
  databricks experimental aitools tools statement get     <statement_id>
  databricks experimental aitools tools statement status  <statement_id>
  databricks experimental aitools tools statement cancel  <statement_id>

submit fires an ExecuteStatement with WaitTimeout=0s and
OnWaitTimeout=CONTINUE, returning the statement_id immediately. get
polls (via pollStatement from #5092) until terminal and emits rows on
success or an error object on failure. status performs a single GET
without polling. cancel sends CancelExecution.

All four subcommands emit a uniform JSON shape {statement_id, state,
warehouse_id, columns, rows, error} with omitempty, so the payload only
includes the fields that each subcommand populates.

Important UX nuance: 'statement get' Ctrl+C stops polling but does NOT
cancel the server-side statement. Users that want server-side
termination call 'statement cancel' explicitly. (This differs from
'tools query', which cancels server-side on Ctrl+C because the user
invoked the synchronous path.) The pollStatement helper from #5092 is
already designed to propagate ctx errors without touching the server,
so 'get' inherits this behavior for free.

Co-authored-by: Isaac
Address a cursor PR review finding: 'statement get' and 'statement
status' previously only set info.Error when pollResp.Status.Error was
non-nil. The Statements API can return a non-success terminal state
(FAILED, CANCELED, CLOSED) with no Error payload, so the JSON contract
"emits rows on success or an error object on failure" wasn't actually
guaranteed. Skill consumers couldn't branch on `error == null` alone:
they had to also inspect `state`. Especially bad for 'get', which
exits non-zero on non-success terminal states without giving the
caller structured failure detail.

Add a shared helper, statementErrorFromStatus, that returns a
batchResultError for any terminal non-success state, populated from
the SDK's ServiceError when present and synthesizing
"statement reached terminal state X" when the backend doesn't supply
one. Mirrors the pattern already used by runOneBatchQuery in batch.go,
so the contract is uniform across batch and single-statement paths.

Both 'get' and 'status' now use the helper. PENDING and RUNNING still
emit no error (legitimately mid-flight).

New tests:
- table-driven coverage of statementErrorFromStatus across nil,
  succeeded, running, pending, failed-with-error, failed-no-error,
  canceled-no-error, closed-no-error
- getStatementResult with CLOSED state and no Error
- getStatementResult with FAILED state and no Error
- getStatementStatus with FAILED state and no Error
- getStatementStatus with RUNNING state confirms no error is set

Co-authored-by: Isaac
…tests

Self-review pass on the test suite found ~8 functions worth trimming
without losing coverage:

Drop (cobra built-ins, not our contract):
- TestStatementSubmitArgsBound: tests cobra's MaximumNArgs(1)
- TestStatementGetRequiresStatementID: tests cobra's ExactArgs(1)
- TestStatementCancelRequiresStatementID: tests cobra's ExactArgs(1)

Drop (already covered by TestStatementErrorFromStatus, the table-driven
helper test added with the cursor-fix commit):
- TestGetStatementResultClosedTerminalSynthesizesError
- TestGetStatementResultFailedWithoutBackendErrorSynthesizesError
- TestGetStatementStatusFailedWithoutBackendErrorSynthesizesError
- TestGetStatementStatusRunningHasNoError

Fold:
- TestRenderStatementInfo + TestRenderStatementInfoOmitsEmptyFields →
  one table-driven TestRenderStatementInfo with the full and minimal
  cases as subtests.

Kept the validation we actually wrote (TestStatementSubmitRejectsMultipleSQLs)
and the wiring tests that pin distinct contracts
(TestGetStatementResultPolls, TestGetStatementResultFailedStateReportsError,
TestGetStatementResultDoesNotCancelServerSideOnContextCancel,
TestGetStatementStatusSinglePoll, TestGetStatementStatusReportsError,
the cancel pair, and submit pair).

Co-authored-by: Isaac
…put before auth

Address two findings from Arseni's review.

P2 (statement_get.go):
getStatementResult used to return (info, err) when fetchAllRows failed
after a SUCCEEDED state. RunE then discarded the populated info and
surfaced only the raw Go error, so the user got an unstructured
"fetch result chunk N: ..." string with no statement_id and no
machine-readable error field. That contradicts the contract in the
failed-terminal path two cases above, which renders JSON and returns
root.ErrAlreadyPrinted.

Now: on chunk-fetch failure, populate info.Error with the chunk-fetch
message and return (info, nil). RunE renders the partial info as JSON
and signals exit-non-zero based on info.Error != nil. The caller still
gets statement_id and columns; the error field carries the failure
detail. New test
TestGetStatementResultChunkFetchFailureRendersPartialInfo locks this
in.

P3 (statement_submit.go):
The PR description claims submit validates input before accessing
WorkspaceClient. The code didn't actually deliver that: PreRunE was
root.MustWorkspaceClient (auth/profile setup), then RunE did the
resolveSQLs / "exactly one" checks. So a malformed invocation hit auth
errors before ever surfacing the input error.

Move resolveSQLs and the length check into a custom PreRunE that runs
before root.MustWorkspaceClient, mirroring the pattern in
query.go:113-118. The result is stashed in a closure variable
(sqlStatement) for RunE to consume. Existing test
TestStatementSubmitRejectsMultipleSQLs is renamed to
...BeforeWorkspaceClient and no longer needs to stub out PreRunE: the
new ordering means a bad invocation gets the validation error without
ever attempting workspace-client setup.

Co-authored-by: Isaac
@simonfaltum simonfaltum force-pushed the simonfaltum/aitools-pr3-statement branch from dba7285 to 9b52b65 on April 28, 2026 at 09:12
discover-schema previously walked tables sequentially and ran each
table's three probes (DESCRIBE, sample SELECT, null counts) one after
the other. For ai-dev-kit's data-exploration phase that meant
warehouse-bound work was idle most of the time. Same root cause as the
multi-query exploration latency that PR 2 fixed; same fix.

Two layers of parallelism:

1. Tables fan out via errgroup with --concurrency (default 8). A
   failure on one table never aborts the others; it gets rendered
   inline as "Error discovering ...".
2. Within a table, DESCRIBE still runs first because the column list
   feeds the null-counts query. After DESCRIBE returns, the sample
   SELECT and null-counts probes run concurrently. The output text is
   assembled once both finish, preserving the existing column order
   (COLUMNS, SAMPLE DATA, NULL COUNTS).

Switch executeSQL from the SDK's ExecuteAndWait helper to
ExecuteStatement + pollStatement (the helper extracted in #5092). This
brings discover-schema in line with query.go and statement.go: explicit
OnWaitTimeout=CONTINUE on every call, and any future polling-helper
improvement (e.g. signal handling) lands here for free. Failed states
now flow through checkFailedState, which yields more specific error
messages (e.g. "query failed: SYNTAX_ERROR ...") than the previous
hand-rolled branch. The user-visible "SAMPLE DATA: Error - %v" / "NULL
COUNTS: Error - %v" wrapping is unchanged.

Add --concurrency validation matching the cmd/fs/cp.go and
experimental/aitools/cmd/query.go pattern: PreRunE rejects values <= 0
with errInvalidBatchConcurrency.

Tests added in discover_schema_test.go:
- quoteTableName (table-driven across valid identifiers, missing
  parts, injection attempts, empty parts, leading-digit identifiers)
- parseDescribeResult skipping metadata rows
- executeSQL pins OnWaitTimeout=CONTINUE
- executeSQL propagates server-reported FAILED state
- executeSQL wraps transport errors
- discoverTable: sample and null-count probes run concurrently after
  DESCRIBE (atomic peak-counter assertion)
- discoverTable: a sample failure does not abort null counts
- --concurrency 0 and -1 rejected at PreRunE time
- invalid table name (not CATALOG.SCHEMA.TABLE) rejected at RunE
  validation before any API call

Co-authored-by: Isaac
…l on Ctrl+C

Address two findings from a cursor PR review.

1. --concurrency previously capped table-level fan-out via
   errgroup.SetLimit, but each table issued up to two probes after
   DESCRIBE, so peak warehouse load was 2*concurrency rather than the
   advertised "max in-flight statements." A user setting
   --concurrency 1 to stay under a warehouse cap still saw two
   statements concurrently. Replace the table-level limit with a
   shared sqlGate (chan struct{} of capacity N + statement_id
   tracking) that wraps every executeSQL call. Now --concurrency
   really means "max statements in flight at any moment, across all
   tables and probes." Update the help text to match.

2. After switching from ExecuteAndWait to ExecuteStatement +
   pollStatement (PR4 first commit), Ctrl+C left up to 2*concurrency
   statements running server-side because nothing called
   CancelExecution. Add the same cancellation discipline used in
   batch.go: signal handler cancels a derived pollCtx, gate records
   each statement_id post-submission, and on cancellation we sweep
   the recorded IDs via CancelExecution before returning
   root.ErrAlreadyPrinted.

Also addressed:

- Move table-name validation into PreRunE so a malformed identifier
  is rejected before MustWorkspaceClient runs (real CLI-lifecycle
  improvement, not just a test trick).
- Replace the timing-based parallelism test with a deterministic
  barrier (atomic counter + sync.OnceFunc + channel close): both
  probes must arrive before either is allowed to leave; if they ran
  sequentially the first probe times out and surfaces an error.

Tests reorganized:
- sqlGate.run: pins OnWaitTimeout, propagates FAILED, wraps transport
  errors, records ids, respects cancelled context
- cancelDiscoverInFlight: per-id calls, empty list is a no-op
- discoverTable: deterministic concurrent-probes assertion;
  per-probe failure does not abort siblings
- cobra-level: invalid table name and injection attempts rejected
  before any workspace client setup

Co-authored-by: Isaac
…tion

Self-review pass on the test suite found ~3 functions worth trimming:

Drop:
- TestDiscoverSchemaInjectionAttemptRejected: TestQuoteTableName already
  has an "injection in catalog" case; the cobra-level wiring is already
  tested by TestDiscoverSchemaInvalidTableNameRejectedBeforeWorkspaceClient
  with a different bad input.
- TestCancelDiscoverInFlightHandlesEmptyList: just verifies "no API calls
  when list is empty"; the mock would fail loudly on any unexpected call,
  making this a tautology.

Fold:
- TestDiscoverSchemaConcurrencyZeroRejected + ...NegativeRejected →
  TestDiscoverSchemaConcurrencyRejection (0, -1 subtests).

Co-authored-by: Isaac
sqlGate.run used to enter a select with two ready cases when the caller
passed an already-cancelled context: the gate had free slots, so
`g.sem <- struct{}{}` was ready, and `<-ctx.Done()` was also ready. Go
picks pseudo-randomly between simultaneously-ready cases, so on roughly
half of those calls we proceeded to submit a statement under a
cancelled context.

Added an early `ctx.Err()` check before the select. The flaky test
TestSQLGateRunRespectsCancelledContext is deterministic now (verified
with -count=20).

Surfaced by rebasing PR 4 on top of the trimmed PR 3, which changed
test execution conditions enough to flip the coin.

Co-authored-by: Isaac
Address Arseni's P3 finding on the discover-schema PR.
parseDescribeResult returned column names verbatim and discoverTable
interpolated them into the null-counts SQL inside backtick-quoted
identifier positions, e.g.

  SUM(CASE WHEN `<col>` IS NULL THEN 1 ELSE 0 END) AS `<col>_nulls`

Databricks/Delta DDL allows column names containing backticks via
doubled-backtick escaping (`weird``col`). Without escaping in the SQL
we generate, an embedded backtick in the column name terminates the
quoted identifier mid-string and produces a PARSE_SYNTAX_ERROR.
Sample-data uses SELECT * so it succeeds, and the user sees only a
confusing "NULL COUNTS: Error - ..." line that's easy to misattribute
to the warehouse.

Escape via strings.ReplaceAll(col, "`", "``") in both the identifier
and the alias positions before interpolation. New test
TestDiscoverTableEscapesBackticksInColumnNames pins the doubled form
in both spots and asserts the no-error code path.

Co-authored-by: Isaac
@simonfaltum simonfaltum force-pushed the simonfaltum/aitools-pr4-discover-parallel branch from b2396af to 82d2052 on April 28, 2026 at 09:14
Base automatically changed from simonfaltum/aitools-pr3-statement to main April 28, 2026 12:28
@simonfaltum simonfaltum added this pull request to the merge queue Apr 28, 2026
Merged via the queue into main with commit a3f0765 Apr 28, 2026
18 checks passed
@simonfaltum simonfaltum deleted the simonfaltum/aitools-pr4-discover-parallel branch April 28, 2026 12:39