[SPARK-56147][SQL] spark-sql cli correctly handles SQL Scripting compound blocks #54946
pan3793 wants to merge 1 commit into apache:master
cc @srielau @cloud-fan, could you please take a look?
cc @davidm-db @MaxGekk could you please take a look?
cloud-fan left a comment:
Summary
Prior state and problem: The spark-sql CLI uses semicolons as statement boundaries — both for splitting multi-statement input in batch mode (splitSemiColon) and for deciding when to execute in interactive mode (line ends with ;). SQL Scripting compound blocks (e.g., BEGIN SELECT 1; SELECT 2; END;) use semicolons internally, so the CLI would prematurely split or execute incomplete blocks.
Design approach: A lightweight keyword-based scanner (SqlScriptBlockTracker) tracks block nesting depth by recognizing SQL scripting keywords (BEGIN, END, CASE, IF, DO, LOOP, REPEAT) while respecting quotes and comments. The same tracker class is used in two places:
- Interactive mode: `sqlScriptingBlockDepth` checks the accumulated input. If depth > 0, the CLI continues accumulating instead of executing.
- Batch mode: `splitSemiColon` skips semicolons when `tracker.depth > 0`.
Key design decisions:
- `BEGIN` unconditionally increments depth. All other block-opening keywords (`CASE`, `IF`, `DO`, `LOOP`, `REPEAT`) only increment when already inside a block (depth > 0), preventing false positives from SQL expressions like `CASE ... END` and the `IF()` function.
- `IF` followed by `(` is treated as the Spark SQL `IF()` function, not a scripting `IF`.
- Decorative suffixes after `END` (e.g., `END IF`, `END CASE`) are recognized and don't change depth.
Implementation: Two places consume the tracker: a standalone `sqlScriptingBlockDepth` method (companion object) for the interactive loop, and inline integration into `splitSemiColon` (instance method) for batch splitting. Both implement their own character-by-character quote/comment scanning. The return type of `splitSemiColon` is also cleaned up from `JList[String]` to `Array[String]`.
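The keyword rules above can be sketched as follows. This is a hypothetical reconstruction of the idea, not the PR's actual `SqlScriptBlockTracker`; class and method names are illustrative only, and the caller is assumed to feed only characters that sit outside quotes and comments.

```scala
// Minimal sketch of a keyword-based block-depth tracker following the rules
// described above. NOT the PR's SqlScriptBlockTracker; names are illustrative.
class BlockDepthTracker {
  private var depth = 0
  private var prevWord = ""
  private val word = new StringBuilder

  def currentDepth: Int = depth

  // Feed one character already known to be outside quotes/comments.
  def processChar(c: Char): Unit = {
    if (c.isLetter || c == '_') {
      word.append(c)
    } else {
      flushWord()
      // IF immediately followed by '(' is the Spark SQL IF() function,
      // not a scripting IF block: undo the tentative increment.
      if (c == '(' && prevWord == "IF" && depth > 0) depth -= 1
    }
  }

  def finish(): Unit = flushWord()

  private def flushWord(): Unit = {
    if (word.nonEmpty) {
      val upper = word.toString.toUpperCase
      word.clear()
      upper match {
        // BEGIN always opens a block.
        case "BEGIN" => depth += 1
        // Other openers count only inside a block, and never as a
        // decorative suffix of END (END IF, END CASE, ...).
        case "CASE" | "IF" | "DO" | "LOOP" | "REPEAT"
            if depth > 0 && prevWord != "END" => depth += 1
        case "END" if depth > 0 => depth -= 1
        case _ =>
      }
      prevWord = upper
    }
  }
}
```

With this sketch, `BEGIN SELECT 1; IF ... END IF;` stays at depth 1 until the closing `END;`, while a plain `CASE ... END` expression at the top level never opens a block.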
```scala
 * Computes the SQL scripting block depth of the given SQL text.
 * Returns 0 when the text is not inside any scripting block, > 0 when still open.
 */
private[hive] def sqlScriptingBlockDepth(text: String): Int = {
```
The quote/comment/escape scanning in this method (~lines 460-512) is nearly identical to the scanning in `splitSemiColon` (~lines 762-866): both track `insideSingleQuote`, `insideDoubleQuote`, `insideSimpleComment`, `bracketedCommentLevel`, `escape`, `leavingBracketedComment` with the same toggle logic, and both feed characters to a `SqlScriptBlockTracker`.

Consider extracting the common scanning loop so both callers can share it. For example, a helper method that takes a text and a callback/visitor could iterate character-by-character with the quote/comment state machine, letting `sqlScriptingBlockDepth` just read the tracker's depth, and `splitSemiColon` handle semicolons and substring extraction. This would eliminate the duplication and ensure both paths stay in sync when future changes arise (e.g., adding backtick-quoted identifier support).
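One possible shape for such a shared scanner, purely as an illustration (the state-variable names follow the review comment above; the object name and visitor signature are assumptions, not code from the PR):

```scala
// Hypothetical extraction of the shared quote/comment state machine. The
// visitor receives (index, char, masked), where masked is true when the
// character sits inside a quote or comment. splitSemiColon could use the
// index to find splittable semicolons; sqlScriptingBlockDepth would feed
// only unmasked characters to its tracker.
object SqlScanner {
  def scan(text: String)(visit: (Int, Char, Boolean) => Unit): Unit = {
    var insideSingleQuote = false
    var insideDoubleQuote = false
    var insideSimpleComment = false
    var bracketedCommentLevel = 0
    var escape = false
    var i = 0
    while (i < text.length) {
      val c = text.charAt(i)
      val next = if (i + 1 < text.length) text.charAt(i + 1) else '\u0000'
      val masked = insideSingleQuote || insideDoubleQuote ||
        insideSimpleComment || bracketedCommentLevel > 0
      visit(i, c, masked)

      if (insideSingleQuote || insideDoubleQuote) {
        if (c == '\'' && insideSingleQuote && !escape) insideSingleQuote = false
        else if (c == '"' && insideDoubleQuote && !escape) insideDoubleQuote = false
        escape = !escape && c == '\\'
      } else if (insideSimpleComment) {
        if (c == '\n') insideSimpleComment = false // -- comment ends at EOL
      } else if (bracketedCommentLevel > 0) {
        if (c == '*' && next == '/') {             // close bracketed comment
          bracketedCommentLevel -= 1
          visit(i + 1, next, true)                 // '/' still in the comment
          i += 1
        } else if (c == '/' && next == '*') {      // nested bracketed comment
          bracketedCommentLevel += 1
          visit(i + 1, next, true)
          i += 1
        }
      } else {
        c match {
          case '\''               => insideSingleQuote = true
          case '"'                => insideDoubleQuote = true
          case '-' if next == '-' => insideSimpleComment = true
          case '/' if next == '*' =>
            bracketedCommentLevel = 1
            visit(i + 1, next, true)
            i += 1
          case _                  =>
        }
      }
      i += 1
    }
  }
}
```

A caller like `splitSemiColon` would then only collect the indices of unmasked semicolons, and `sqlScriptingBlockDepth` would forward unmasked characters to the tracker, keeping both paths on one state machine.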
```scala
if (!insideComment && !insideAnyQuote) {
  tracker.processChar(c)
} else if (tracker.depth >= 0) {
```
`tracker.depth >= 0` is always true since depth is only decremented when > 0. Simplify to `else`:

```diff
-} else if (tracker.depth >= 0) {
+} else {
```
```scala
}
prevWord = upper
upper match {
  case "BEGIN" => depth += 1
```
`BEGIN` is in Spark's `nonReserved` grammar list, meaning it can appear as a regular identifier; e.g., `SELECT begin FROM t;` is valid Spark SQL. Since the tracker unconditionally increments depth on `BEGIN`, this would cause the interactive CLI to enter continuation mode (waiting for `END;`) instead of executing the statement.

In practice this is rare since `begin` is an unusual identifier choice, but it could confuse users. One mitigation: only treat `BEGIN` as a block opener when it appears at a statement boundary (start of input, or right after a semicolon-split point). Would this be worth addressing?
this is a good catch! and the suggested approach also sgtm
On second thought, this might be a more complex problem. For example, suppose Spark SQL is extended to support `CREATE PROCEDURE` SQL syntax like Databricks SQL. In this case, `BEGIN` is in the middle of the statement, so it's hard to determine the statement boundary ...
https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-syntax-ddl-create-procedure
```sql
CREATE PROCEDURE procedure_name AS
BEGIN
  ...
END
```
Can we add some tests that include ...
What changes were proposed in this pull request?
The `spark-sql` cli now correctly handles SQL Scripting compound blocks (e.g., `BEGIN ... END`, `IF ... END IF`, `WHILE ... DO ... END WHILE`, `CASE ... END CASE`) by tracking block nesting depth during input processing.

Changes:
- Add a `SqlScriptBlockTracker` class in the companion object that tracks SQL Scripting block depth by scanning keyword tokens (`BEGIN`, `END`, `CASE`, `IF`, `DO`, `LOOP`, `REPEAT`) while correctly handling decorative suffixes after `END` (e.g., `END IF`, `END CASE`).
- Update `splitSemiColon` to use `SqlScriptBlockTracker` so semicolons inside compound blocks are not treated as statement boundaries.
- Make the interactive loop check `sqlScriptingBlockDepth` and continue accumulating input when the user is still inside an open scripting block.

Why are the changes needed?
The `spark-sql` CLI uses semicolons to determine statement boundaries, both for splitting multi-statement input (`splitSemiColon`) and for deciding when to execute in interactive mode (line ends with `;`). SQL Scripting compound blocks use semicolons as internal statement terminators (e.g., `BEGIN SELECT 1; SELECT 2; END;`), so the CLI incorrectly splits or prematurely executes incomplete blocks.

For example, in interactive mode:
After this change, the CLI waits until the block is fully closed (`END;`) before executing.

Does this PR introduce any user-facing change?
Yes. The `spark-sql` CLI now correctly accepts multi-line SQL Scripting blocks in both interactive mode and file/`-e` mode without prematurely splitting or executing them.

How was this patch tested?
New UTs are added in `CliSuite.scala`, with additional manual testing in `spark-sql`.

Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Claude Sonnet 4.6), OpenCode (MiMo V2 Pro)