[SPARK-56147][SQL] spark-sql cli correctly handles SQL Scripting compound blocks #54946
pan3793 wants to merge 1 commit into apache:master
cc @srielau @cloud-fan, could you please take a look?
cc @davidm-db @MaxGekk could you please take a look?
cloud-fan left a comment:
Summary
Prior state and problem: The spark-sql CLI uses semicolons as statement boundaries — both for splitting multi-statement input in batch mode (splitSemiColon) and for deciding when to execute in interactive mode (line ends with ;). SQL Scripting compound blocks (e.g., BEGIN SELECT 1; SELECT 2; END;) use semicolons internally, so the CLI would prematurely split or execute incomplete blocks.
Design approach: A lightweight keyword-based scanner (SqlScriptBlockTracker) tracks block nesting depth by recognizing SQL scripting keywords (BEGIN, END, CASE, IF, DO, LOOP, REPEAT) while respecting quotes and comments. The same tracker class is used in two places:
- Interactive mode: `sqlScriptingBlockDepth` checks the accumulated input. If depth > 0, the CLI continues accumulating instead of executing.
- Batch mode: `splitSemiColon` skips semicolons when `tracker.depth > 0`.
Key design decisions:
- `BEGIN` unconditionally increments depth. All other block-opening keywords (`CASE`, `IF`, `DO`, `LOOP`, `REPEAT`) only increment when already inside a block (depth > 0), preventing false positives from SQL expressions like `CASE ... END` and the `IF()` function.
- `IF` followed by `(` is treated as the Spark SQL `IF()` function, not a scripting `IF`.
- Decorative suffixes after `END` (e.g., `END IF`, `END CASE`) are recognized and don't change depth.
Implementation: Two places consume the tracker: a standalone `sqlScriptingBlockDepth` method (companion object) for the interactive loop, and inline integration into `splitSemiColon` (instance method) for batch splitting. Both implement their own character-by-character quote/comment scanning. The return type of `splitSemiColon` is also cleaned up from `JList[String]` to `Array[String]`.
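The keyword rules above can be sketched as follows. This is a hypothetical reconstruction of the idea, not the PR's actual `SqlScriptBlockTracker`; class and method names are illustrative only, and the caller is assumed to feed only characters that sit outside quotes and comments.

```scala
// Minimal sketch of a keyword-based block-depth tracker following the rules
// described above. NOT the PR's SqlScriptBlockTracker; names are illustrative.
class BlockDepthTracker {
  private var depth = 0
  private var prevWord = ""
  private val word = new StringBuilder

  def currentDepth: Int = depth

  // Feed one character already known to be outside quotes/comments.
  def processChar(c: Char): Unit = {
    if (c.isLetter || c == '_') {
      word.append(c)
    } else {
      flushWord()
      // IF immediately followed by '(' is the Spark SQL IF() function,
      // not a scripting IF block: undo the tentative increment.
      if (c == '(' && prevWord == "IF" && depth > 0) depth -= 1
    }
  }

  def finish(): Unit = flushWord()

  private def flushWord(): Unit = {
    if (word.nonEmpty) {
      val upper = word.toString.toUpperCase
      word.clear()
      upper match {
        // BEGIN always opens a block.
        case "BEGIN" => depth += 1
        // Other openers count only inside a block, and never as a
        // decorative suffix of END (END IF, END CASE, ...).
        case "CASE" | "IF" | "DO" | "LOOP" | "REPEAT"
            if depth > 0 && prevWord != "END" => depth += 1
        case "END" if depth > 0 => depth -= 1
        case _ =>
      }
      prevWord = upper
    }
  }
}
```

With this sketch, `BEGIN SELECT 1; IF ... END IF;` stays at depth 1 until the closing `END;`, while a plain `CASE ... END` expression at the top level never opens a block.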
```scala
 * Computes the SQL scripting block depth of the given SQL text.
 * Returns 0 when the text is not inside any scripting block, > 0 when still open.
 */
private[hive] def sqlScriptingBlockDepth(text: String): Int = {
```
The quote/comment/escape scanning in this method (~lines 460-512) is nearly identical to the scanning in `splitSemiColon` (~lines 762-866): both track `insideSingleQuote`, `insideDoubleQuote`, `insideSimpleComment`, `bracketedCommentLevel`, `escape`, `leavingBracketedComment` with the same toggle logic, and both feed characters to a `SqlScriptBlockTracker`.

Consider extracting the common scanning loop so both callers can share it. For example, a helper method that takes a text and a callback/visitor could iterate character-by-character with the quote/comment state machine, letting `sqlScriptingBlockDepth` just read the tracker's depth, and `splitSemiColon` handle semicolons and substring extraction. This would eliminate the duplication and ensure both paths stay in sync when future changes arise (e.g., adding backtick-quoted identifier support).
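One possible shape for such a shared scanner, purely as an illustration (the state-variable names follow the review comment above; the object name and visitor signature are assumptions, not code from the PR):

```scala
// Hypothetical extraction of the shared quote/comment state machine. The
// visitor receives (index, char, masked), where masked is true when the
// character sits inside a quote or comment. splitSemiColon could use the
// index to find splittable semicolons; sqlScriptingBlockDepth would feed
// only unmasked characters to its tracker.
object SqlScanner {
  def scan(text: String)(visit: (Int, Char, Boolean) => Unit): Unit = {
    var insideSingleQuote = false
    var insideDoubleQuote = false
    var insideSimpleComment = false
    var bracketedCommentLevel = 0
    var escape = false
    var i = 0
    while (i < text.length) {
      val c = text.charAt(i)
      val next = if (i + 1 < text.length) text.charAt(i + 1) else '\u0000'
      val masked = insideSingleQuote || insideDoubleQuote ||
        insideSimpleComment || bracketedCommentLevel > 0
      visit(i, c, masked)

      if (insideSingleQuote || insideDoubleQuote) {
        if (c == '\'' && insideSingleQuote && !escape) insideSingleQuote = false
        else if (c == '"' && insideDoubleQuote && !escape) insideDoubleQuote = false
        escape = !escape && c == '\\'
      } else if (insideSimpleComment) {
        if (c == '\n') insideSimpleComment = false // -- comment ends at EOL
      } else if (bracketedCommentLevel > 0) {
        if (c == '*' && next == '/') {             // close bracketed comment
          bracketedCommentLevel -= 1
          visit(i + 1, next, true)                 // '/' still in the comment
          i += 1
        } else if (c == '/' && next == '*') {      // nested bracketed comment
          bracketedCommentLevel += 1
          visit(i + 1, next, true)
          i += 1
        }
      } else {
        c match {
          case '\''               => insideSingleQuote = true
          case '"'                => insideDoubleQuote = true
          case '-' if next == '-' => insideSimpleComment = true
          case '/' if next == '*' =>
            bracketedCommentLevel = 1
            visit(i + 1, next, true)
            i += 1
          case _                  =>
        }
      }
      i += 1
    }
  }
}
```

A caller like `splitSemiColon` would then only collect the indices of unmasked semicolons, and `sqlScriptingBlockDepth` would forward unmasked characters to the tracker, keeping both paths on one state machine.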
```scala
if (!insideComment && !insideAnyQuote) {
  tracker.processChar(c)
} else if (tracker.depth >= 0) {
```
`tracker.depth >= 0` is always true since depth is only decremented when > 0. Simplify to `else`:

```diff
-} else if (tracker.depth >= 0) {
+} else {
```
```scala
}
prevWord = upper
upper match {
  case "BEGIN" => depth += 1
```
`BEGIN` is in Spark's `nonReserved` grammar list, meaning it can appear as a regular identifier; e.g., `SELECT begin FROM t;` is valid Spark SQL. Since the tracker unconditionally increments depth on `BEGIN`, this would cause the interactive CLI to enter continuation mode (waiting for `END;`) instead of executing the statement.

In practice this is rare since `begin` is an unusual identifier choice, but it could confuse users. One mitigation: only treat `BEGIN` as a block opener when it appears at a statement boundary (start of input, or right after a semicolon-split point). Would this be worth addressing?
this is a good catch! and the suggested approach also sgtm
On second thought, this might be a more complex problem. For example, suppose Spark SQL is extended to support `CREATE PROCEDURE` SQL syntax like Databricks SQL. In this case, `BEGIN` is in the middle of the statement, so it's hard to determine the statement boundary ...
https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-syntax-ddl-create-procedure
```sql
CREATE PROCEDURE procedure_name AS
BEGIN
  ...
END
```
Can we add some tests that include ...
What changes were proposed in this pull request?
The `spark-sql` cli now correctly handles SQL Scripting compound blocks (e.g., `BEGIN ... END`, `IF ... END IF`, `WHILE ... DO ... END WHILE`, `CASE ... END CASE`) by tracking block nesting depth during input processing.

Changes:
- Add a `SqlScriptBlockTracker` class in the companion object that tracks SQL Scripting block depth by scanning keyword tokens (`BEGIN`, `END`, `CASE`, `IF`, `DO`, `LOOP`, `REPEAT`) while correctly handling decorative suffixes after `END` (e.g., `END IF`, `END CASE`).
- Update `splitSemiColon` to use `SqlScriptBlockTracker` so semicolons inside compound blocks are not treated as statement boundaries.
- Make the interactive loop check `sqlScriptingBlockDepth` and continue accumulating input when the user is still inside an open scripting block.

Why are the changes needed?
The `spark-sql` CLI uses semicolons to determine statement boundaries, both for splitting multi-statement input (`splitSemiColon`) and for deciding when to execute in interactive mode (line ends with `;`). SQL Scripting compound blocks use semicolons as internal statement terminators (e.g., `BEGIN SELECT 1; SELECT 2; END;`), so the CLI incorrectly splits or prematurely executes incomplete blocks.

For example, in interactive mode:
After this change, the CLI waits until the block is fully closed (`END;`) before executing.

Does this PR introduce any user-facing change?
Yes. The `spark-sql` CLI now correctly accepts multi-line SQL Scripting blocks in both interactive mode and file/`-e` mode without prematurely splitting or executing them.

How was this patch tested?
New UTs are added in `CliSuite.scala`, with additional manual testing in `spark-sql`.

Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Claude Sonnet 4.6), OpenCode (MiMo V2 Pro)