[SPARK-56147][SQL] spark-sql cli correctly handles SQL Scripting compound blocks #54946

Open

pan3793 wants to merge 1 commit into apache:master from pan3793:SPARK-48338

Conversation

@pan3793
Member

pan3793 commented Mar 23, 2026

What changes were proposed in this pull request?

The spark-sql cli now correctly handles SQL Scripting compound blocks (e.g., BEGIN...END, IF...END IF, WHILE...DO...END WHILE, CASE...END CASE) by tracking block nesting depth during input processing.

Changes:

  • Added SqlScriptBlockTracker class in the companion object that tracks SQL Scripting block depth by scanning keyword tokens (BEGIN, END, CASE, IF, DO, LOOP, REPEAT) while correctly handling decorative suffixes after END (e.g., END IF, END CASE).
  • Updated splitSemiColon to use SqlScriptBlockTracker so semicolons inside compound blocks are not treated as statement boundaries.
  • Updated the interactive input loop to call sqlScriptingBlockDepth and continue accumulating input when the user is still inside an open scripting block.
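The keyword logic described above can be sketched as follows. This is a simplified illustration only, not the PR's actual SqlScriptBlockTracker: the object and method names are invented, and the real tracker additionally handles quotes, comments, and the IF(...) function form.

```scala
// Simplified sketch of the depth-tracking idea; the real SqlScriptBlockTracker
// also respects quotes/comments and distinguishes the IF(...) function.
object SqlBlockDepthSketch {
  // Keywords that open a block only when already inside one (depth > 0).
  private val conditionalOpeners = Set("CASE", "IF", "DO", "LOOP", "REPEAT")
  // Keywords that may follow END as a decorative suffix (END IF, END WHILE, ...).
  private val endSuffixes = Set("IF", "CASE", "WHILE", "LOOP", "REPEAT", "FOR")

  def depthOf(sql: String): Int = {
    var depth = 0
    var afterEnd = false
    for (word <- sql.split("[^A-Za-z_]+").filter(_.nonEmpty).map(_.toUpperCase)) {
      if (afterEnd && endSuffixes.contains(word)) {
        // Decorative suffix after END: no depth change.
      } else word match {
        case "BEGIN" => depth += 1 // unconditionally opens a block
        case "END" if depth > 0 => depth -= 1
        case kw if conditionalOpeners.contains(kw) && depth > 0 => depth += 1
        case _ => // ordinary token
      }
      afterEnd = word == "END"
    }
    depth
  }
}
```

A depth of 0 means the input is a complete set of statements; a positive depth means the CLI should keep accumulating lines.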

Why are the changes needed?

The spark-sql CLI uses semicolons to determine statement boundaries, both for splitting multi-statement input (splitSemiColon) and for deciding when to execute in interactive mode (line ends with ;). SQL Scripting compound blocks use semicolons as internal statement terminators (e.g., BEGIN SELECT 1; SELECT 2; END;), so the CLI incorrectly splits or prematurely executes incomplete blocks.

For example, in interactive mode:

spark-sql> BEGIN
         >   SELECT 1;   <-- CLI fires here with incomplete "BEGIN\n  SELECT 1;"

After this change, the CLI waits until the block is fully closed (END;) before executing.

Does this PR introduce any user-facing change?

Yes. The spark-sql CLI now correctly accepts multi-line SQL Scripting blocks in both interactive mode and file/-e mode without prematurely splitting or executing them.

How was this patch tested?

New UTs are added in CliSuite.scala.

Additionally, verified manually with spark-sql:

$ build/sbt -Phive,hive-thriftserver clean package
$ SPARK_PREPEND_CLASSES=true bin/spark-sql 
spark-sql (default)> SELECT current_timestamp();
2026-03-23 13:46:37.797724
Time taken: 0.033 seconds, Fetched 1 row(s)
spark-sql (default)> BEGIN
                   >   DECLARE counter INT DEFAULT 1;
                   >   DECLARE total INT DEFAULT 0;
                   > 
                   >   WHILE counter <= 5 DO
                   >     SET total = total + counter;
                   >     SET counter = counter + 1;
                   >   END WHILE;
                   > 
                   >   SELECT total AS sum_of_first_five;
                   > END;
15
Time taken: 0.363 seconds, Fetched 1 row(s)
spark-sql (default)> SELECT current_timestamp();
2026-03-23 13:46:38.303671
Time taken: 0.028 seconds, Fetched 1 row(s)
spark-sql (default)> 

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Sonnet 4.6), OpenCode (MiMo V2 Pro)

@pan3793
Member Author

pan3793 commented Mar 23, 2026

cc @srielau @cloud-fan, could you please take a look?

@pan3793 pan3793 changed the title [SPARK-48338][SQL] spark-sql cli correctly handles SQL Scripting compound blocks [SPARK-56147][SQL] spark-sql cli correctly handles SQL Scripting compound blocks Mar 23, 2026
@pan3793
Member Author

pan3793 commented Mar 25, 2026

cc @davidm-db @MaxGekk could you please take a look?

@cloud-fan
Contributor

cloud-fan left a comment

Summary

Prior state and problem: The spark-sql CLI uses semicolons as statement boundaries — both for splitting multi-statement input in batch mode (splitSemiColon) and for deciding when to execute in interactive mode (line ends with ;). SQL Scripting compound blocks (e.g., BEGIN SELECT 1; SELECT 2; END;) use semicolons internally, so the CLI would prematurely split or execute incomplete blocks.

Design approach: A lightweight keyword-based scanner (SqlScriptBlockTracker) tracks block nesting depth by recognizing SQL scripting keywords (BEGIN, END, CASE, IF, DO, LOOP, REPEAT) while respecting quotes and comments. The same tracker class is used in two places:

  1. Interactive mode: sqlScriptingBlockDepth checks the accumulated input. If depth > 0, the CLI continues accumulating instead of executing.
  2. Batch mode: splitSemiColon skips semicolons when tracker.depth > 0.

Key design decisions:

  • BEGIN unconditionally increments depth. All other block-opening keywords (CASE, IF, DO, LOOP, REPEAT) only increment when already inside a block (depth > 0), preventing false positives from SQL expressions like CASE...END and the IF() function.
  • IF followed by ( is treated as the Spark SQL IF() function, not a scripting IF.
  • Decorative suffixes after END (e.g., END IF, END CASE) are recognized and don't change depth.
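The IF disambiguation in the second bullet amounts to a small lookahead after the keyword. Below is a hypothetical sketch, not the PR's code: the helper name is invented, and whether whitespace is permitted between IF and ( is an assumption of this sketch.

```scala
// Hypothetical sketch: decide whether IF ending at index `ifEnd` (just past
// the keyword) is the IF(...) function rather than a scripting IF block.
// Skipping whitespace before '(' is an assumption of this sketch.
def looksLikeIfFunction(text: String, ifEnd: Int): Boolean = {
  var i = ifEnd
  while (i < text.length && text.charAt(i).isWhitespace) i += 1
  i < text.length && text.charAt(i) == '('
}
```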

Implementation: Two places consume the tracker: a standalone sqlScriptingBlockDepth method (companion object) for the interactive loop, and inline integration into splitSemiColon (instance method) for batch splitting. Both implement their own character-by-character quote/comment scanning. The return type of splitSemiColon is also cleaned up from JList[String] to Array[String].

/**
 * Computes the SQL scripting block depth of the given SQL text.
 * Returns 0 when the text is not inside any scripting block, > 0 when still open.
 */
private[hive] def sqlScriptingBlockDepth(text: String): Int = {
Contributor

The quote/comment/escape scanning in this method (~lines 460-512) is nearly identical to the scanning in splitSemiColon (~lines 762-866) — both track insideSingleQuote, insideDoubleQuote, insideSimpleComment, bracketedCommentLevel, escape, leavingBracketedComment with the same toggle logic, and both feed characters to a SqlScriptBlockTracker.

Consider extracting the common scanning loop so both callers can share it. For example, a helper method that takes a text and a callback/visitor could iterate character-by-character with the quote/comment state machine, letting sqlScriptingBlockDepth just read the tracker's depth, and splitSemiColon handle semicolons and substring extraction. This would eliminate the duplication and ensure both paths stay in sync when future changes arise (e.g., adding backtick-quoted identifier support).
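One possible shape for such a shared helper, as a hypothetical sketch of the suggestion above (not code from this PR; the name, the callback signature, and the simplified escape rules are all invented):

```scala
// Hypothetical shared scanner: invokes visit(c, i) only for characters that
// lie outside quotes and comments, using a simplified state machine.
object SqlScanSketch {
  def scan(text: String)(visit: (Char, Int) => Unit): Unit = {
    var inSingleQuote = false
    var inDoubleQuote = false
    var inLineComment = false
    var bracketedCommentLevel = 0
    var escaped = false
    var i = 0
    while (i < text.length) {
      val c = text.charAt(i)
      if (inLineComment) {
        if (c == '\n') inLineComment = false
      } else if (bracketedCommentLevel > 0) {
        // Nested bracketed comments: /* ... /* ... */ ... */
        if (c == '*' && i + 1 < text.length && text.charAt(i + 1) == '/') {
          bracketedCommentLevel -= 1; i += 1
        } else if (c == '/' && i + 1 < text.length && text.charAt(i + 1) == '*') {
          bracketedCommentLevel += 1; i += 1
        }
      } else if (inSingleQuote || inDoubleQuote) {
        if (escaped) escaped = false
        else if (c == '\\') escaped = true
        else if (c == '\'' && inSingleQuote) inSingleQuote = false
        else if (c == '"' && inDoubleQuote) inDoubleQuote = false
      } else c match {
        case '\'' => inSingleQuote = true
        case '"' => inDoubleQuote = true
        case '-' if i + 1 < text.length && text.charAt(i + 1) == '-' =>
          inLineComment = true; i += 1
        case '/' if i + 1 < text.length && text.charAt(i + 1) == '*' =>
          bracketedCommentLevel = 1; i += 1
        case _ => visit(c, i)
      }
      i += 1
    }
  }
}
```

With such a helper, sqlScriptingBlockDepth could reduce to feeding each visited character to the tracker and reading its depth, while splitSemiColon's callback would additionally watch for ';' at depth 0 and record split offsets.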


if (!insideComment && !insideAnyQuote) {
tracker.processChar(c)
} else if (tracker.depth >= 0) {
Contributor

tracker.depth >= 0 is always true since depth is only decremented when > 0. Simplify to else:

Suggested change
} else if (tracker.depth >= 0) {
} else {

}
prevWord = upper
upper match {
case "BEGIN" => depth += 1
Contributor

BEGIN is in Spark's nonReserved grammar list, meaning it can appear as a regular identifier — e.g., SELECT begin FROM t; is valid Spark SQL. Since the tracker unconditionally increments depth on BEGIN, this would cause the interactive CLI to enter continuation mode (waiting for END;) instead of executing the statement.

In practice this is rare since begin is an unusual identifier choice, but it could confuse users. One mitigation: only treat BEGIN as a block opener when it appears at a statement boundary (start of input, or right after a semicolon-split point). Would this be worth addressing?

@pan3793
Member Author

pan3793 commented Apr 8, 2026

This is a good catch! And the suggested approach also SGTM.

@pan3793
Member Author

On second thought, this might be a more complex problem. For example, suppose Spark SQL is extended to support CREATE PROCEDURE syntax like Databricks SQL. In that case, BEGIN appears in the middle of the statement, so it's hard to determine the statement boundary ...

https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-syntax-ddl-create-procedure

CREATE PROCEDURE procedure_name AS
BEGIN
  ...
END

@miland-db
Contributor

Can we add some tests that include SQL Exception Handlers that are available inside SQL scripts?

BEGIN
  DECLARE EXIT HANDLER FOR SQLEXCEPTION
  BEGIN
    VALUES('Hello from the exception handler.');
  END;

  SELECT * FROM does_not_exist;
END;
