
[SPARK-48343][SQL] Introduction of SQL Scripting interpreter #47026

Closed
wants to merge 15 commits into from

Conversation

davidm-db (Contributor) commented Jun 19, 2024

What changes were proposed in this pull request?

The previous PR introduced parser changes for SQL Scripting. This PR is a follow-up that introduces the interpreter for the SQL Scripting language and proposes the following changes:

  • SqlScriptingExecutionNode - introduces execution nodes for SQL scripting, used during the interpretation phase:
    • SingleStatementExec - executable node for SingleStatement logical node; wraps logical plan of the single statement.
    • CompoundNestedStatementIteratorExec - implements base recursive iterator logic for all nesting statements.
    • CompoundBodyExec - concrete implementation of CompoundNestedStatementIteratorExec for CompoundBody logical node.
  • SqlScriptingInterpreter - introduces the interpreter for SQL scripts. The product of interpretation is an iterator over the statements that should be executed.

Follow-up PRs will introduce further statements, support for exceptions thrown from parser/interpreter, exception handling in SQL, etc.
More details can be found in the Jira item for this task and its parent (where the design doc is uploaded as well).
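For illustration, the execution-node structure described above can be sketched roughly like this. This is a hedged sketch only: the class names mirror the PR, but the bodies are simplified stand-ins, not Spark's actual implementation. A compound body flattens recursively into an iterator over its leaf statements:

```scala
// Illustrative sketch - simplified stand-ins, not Spark's actual classes.
sealed trait CompoundStatementExec
final case class SingleStatementExec(sql: String) extends CompoundStatementExec
final case class CompoundBodyExec(statements: Seq[CompoundStatementExec])
    extends CompoundStatementExec {
  // Recursive iterator over leaf statements, in execution order.
  def iterator: Iterator[SingleStatementExec] = statements.iterator.flatMap {
    case s: SingleStatementExec => Iterator.single(s)
    case b: CompoundBodyExec    => b.iterator
  }
}

val script = CompoundBodyExec(Seq(
  SingleStatementExec("DECLARE x = 1"),
  CompoundBodyExec(Seq(SingleStatementExec("SELECT x"))),
  SingleStatementExec("DROP TEMPORARY VARIABLE x")))
println(script.iterator.map(_.sql).toList)
```

Nested compounds disappear from the consumer's point of view; it only ever sees a flat stream of single statements in execution order.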

Why are the changes needed?

The intent is to add support for SQL scripting (and stored procedures down the line). It gives users the ability to develop complex logic and ETL entirely in SQL.

Until now, users had to write verbose SQL statements or combine SQL + Python to express such logic efficiently. This is an effort to bridge that gap and enable complex logic to be written entirely in SQL.

Does this PR introduce any user-facing change?

No.
This PR is the second in a series of PRs that will introduce changes to the sql() API to add support for SQL scripting, but for now the API remains unchanged.
In the future, the API will remain the same as well, but it will gain the ability to execute SQL scripts.

How was this patch tested?

There are tests for the newly introduced changes:

  • SqlScriptingExecutionNodeSuite - unit tests for execution nodes.
  • SqlScriptingInterpreterSuite - tests for interpreter (with parser integration).

Was this patch authored or co-authored using generative AI tooling?

No.

github-actions bot added the SQL label Jun 19, 2024

/** Get the SQL query text corresponding to this statement. */
def getText(sqlScriptText: String): String = {
if (origin.startIndex.isEmpty || origin.stopIndex.isEmpty) {
Contributor:

What are the cases when origin is empty?

Especially, having one of them being empty and other not seems like an error against which we should assert?

davidm-db (Author) · Jun 19, 2024:

If SingleStatement is created with withOrigin in the AST builder (which should be done, since it implements the WithOrigin trait), these fields should never be empty. Asserting just seemed like too much here because getText is something I see us using when creating exception messages and the like, so it didn't seem to be on the critical path.
But if you think so, I have nothing against asserting here :) We should probably change the same method in SingleStatement as well then, but that's fine.

davidm-db (Author) · Jun 26, 2024:

I figured out we can rely on Origin entirely for this, because it holds the SQL text together with the start and end text indices. I've simplified getText so it no longer takes an input param. Since Origin is an existing mechanism that should never fail, I've now added the assert as you suggested, and did the same in the SingleStatement logical operator.
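A minimal model of that Origin-based getText, with Origin as a stand-in case class here (assuming only the sqlText/startIndex/stopIndex fields the method actually needs):

```scala
// Stand-in for catalyst's Origin, modeling only the fields getText needs.
case class Origin(sqlText: Option[String], startIndex: Option[Int], stopIndex: Option[Int])

// Same shape as the simplified getText: assert the fields are set, then slice
// the original SQL text by the statement's start/stop indices (stop inclusive).
def getText(origin: Origin): String = {
  assert(origin.sqlText.isDefined && origin.startIndex.isDefined && origin.stopIndex.isDefined)
  origin.sqlText.get.substring(origin.startIndex.get, origin.stopIndex.get + 1)
}

val o = Origin(Some("BEGIN SELECT 1; END"), Some(6), Some(14))
println(getText(o)) // prints "SELECT 1;"
```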

val sqlScript =
"""
|BEGIN
|DECLARE var = 1;
Contributor:

Can you add a test that asserts that DECLARE can be used only at the top of a BEGIN/END block?

davidm-db (Author):

We removed that logic from the parser PR to simplify it, have a follow-up item to add this check - https://issues.apache.org/jira/browse/SPARK-48345.

test("session vars - set and read scoped") {
val sqlScript =
"""
|BEGIN
Contributor:

TODO (in next PR) is a test that:

  1. creates a var
  2. does some operation that throws an error (e.g. insert against constraint/create table that already exists...)
  3. asserts that the variable is no longer in the session after the error is thrown.

This currently is not going to work since we manually insert DROPs.

davidm-db (Author) · Jun 19, 2024:

Agreed, we already have an item for local variables support that includes this topic: https://issues.apache.org/jira/browse/SPARK-48530.

.map(new SingleStatementExec(_, Origin(), isInternal = true))
.reverse
new CompoundBodyExec(
body.collection.map(st => transformTreeIntoExecutable(st)) ++ dropVariables)
Contributor:

Instead of just concatenating DROPs, an alternative could be to:

  1. Keep "cleanup/internal" statements in side structure.
  2. Execute them as part of finalize or reset calls.

My fear is that by just keeping them at the end it might be hard to support the following scenarios:

  1. premature return (e.g. during break)
  2. error handling (exception thrown in the middle of the block).
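One way the suggested side-structure cleanup could look. This is a hedged sketch with assumed names (BlockWithCleanup is hypothetical, not the PR's code): cleanup statements live in a side structure and run in a finalize step, so they execute even when the body exits early:

```scala
// Hypothetical sketch of the alternative suggested above: cleanup statements
// are kept separately and run via try/finally, so they execute even when an
// error (or early exit) interrupts the body.
class BlockWithCleanup(body: Seq[String], cleanup: Seq[String]) {
  def execute(run: String => Unit): Unit =
    try body.foreach(run)
    finally cleanup.foreach(run) // also runs on break/error paths
}

val executed = scala.collection.mutable.Buffer[String]()
val block = new BlockWithCleanup(
  body = Seq("DECLARE x = 1", "SELECT 1/0"),
  cleanup = Seq("DROP TEMPORARY VARIABLE x"))
try block.execute { s =>
  if (s.contains("1/0")) throw new RuntimeException("error mid-block")
  executed += s
} catch { case _: RuntimeException => () }
// The DROP still ran even though the second statement failed.
println(executed.toList)
```

With the PR's current approach (DROPs appended at the end of the statement list), an error in the middle would skip the cleanup entirely, which is exactly the concern raised above.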

Contributor:

The agreement is to do this in an upcoming PR.

/** Get the SQL query text corresponding to this statement. */
def getText(sqlScriptText: String): String = {
if (origin.startIndex.isEmpty || origin.stopIndex.isEmpty) {
return null
Contributor:

What does it mean when the SQL query text is null? Should it be a user-facing or an internal error?

davidm-db (Author):

Do you mean the sqlScriptText parameter here? This method is intended for internal use only - in our POC we used it for exception messages, tests and the like. Any error here would be internal, but it really depends on what is passed as the parameter (and that should be correct) - so I don't think we should raise any errors here.

Another point is that we were not aware of Origin when working on the POC - now it seems it is sufficient to return the text info on its own, so we might end up dropping the getText function completely, but we'd like to figure that out in the upcoming PRs.

case Some(body: NonLeafStatementExec) => body.hasNext
case Some(_: LeafStatementExec) => true
case None => false
case _ => throw new IllegalStateException("Unknown statement type")
Contributor:

nit: SparkException.internalError

davidm-db (Author) · Jun 21, 2024:

@allisonwang-db for my education, is there any specific reason for this change? Do we want to throw only internal errors when not caused by the user? Should we create a specific error class/condition for this?

sealed trait CompoundStatementExec extends Logging {
/**
* Whether the statement originates from the SQL script or is created during the interpretation.
* Example: DropVariable statements are automatically created at the end of each compound.
Contributor:

Are we planning to add anything other than dropping variables in the future? Dropping temp tables/views/functions?

davidm-db (Author):

Definitely, we will have other temporary objects to drop in the future. For now, we made a really simple change like this so that our scripts are functional for testing, etc.

Follow-up PRs will first introduce labels for compound bodies (both implicit and explicit), which is a requirement for further work/improvements in the variables area - like introducing variable shadowing and similar. Once the labels are introduced, we will work on variable improvements - this will probably include some type of context (which might later be used for a ton of other things: call stack tracking once we introduce procedures, tracking info for statements like BREAK/CONTINUE, etc.). Then we will know exactly how we want to handle these drops - it probably won't be as explicit as in this PR, but will rather involve some interaction with the context. At that point we will also figure out whether we still need this isInternal property down the road.

Contributor:

It is questionable whether temp tables/views/functions should be tied to a compound scope like variables (probably not). They will continue to live at the session level.


/**
* Whether this statement had to be executed during the interpretation phase.
* Example: Statements in conditions of If/Else, While, etc.
Contributor:

Do we have test cases that include If/Else and While statements? Are they already supported, or will they be supported in the future?

davidm-db (Author):

These statements will follow in upcoming PRs; we already have them ready/tried out in our POC. I can share more details offline if you are interested.

For now, we are introducing just the basic structure in order to keep the PRs simpler to review.

/**
* Concrete implementation of the interpreter for SQL scripting.
*/
case class SqlScriptingInterpreter() extends ProceduralLanguageInterpreter {
Contributor:

Where do we call reset? Or is it preparation for loops in the future?

davidm-db (Author):

You are right, it is preparation for loops in the future.

*/
private def getDeclareVarNameFromPlan(plan: LogicalPlan): Option[UnresolvedIdentifier] =
plan match {
case CreateVariable(name: UnresolvedIdentifier, _, _) => Some(name)
Contributor:

To confirm: we do create session variables when executing the CompoundBody, but we remove them at the end so they behave like local variables?

davidm-db (Author):

Correct. This is not the target behavior, and I explained the plan for variable-related statements in a comment above, but we did it like this for now to simplify the PR and to have functioning scripts for testing purposes.

Comment on lines 40 to 43
/** Get the SQL query text corresponding to this statement. */
def getText: String = {
assert(origin.sqlText.isDefined && origin.startIndex.isDefined && origin.stopIndex.isDefined)
origin.sqlText.get.substring(origin.startIndex.get, origin.stopIndex.get + 1)
davidm-db (Author):

Check comment for this change.

@@ -164,6 +164,7 @@ object CheckConnectJvmClientCompatibility {
ProblemFilters.exclude[Problem]("org.apache.spark.sql.streaming.ui.*"),
ProblemFilters.exclude[Problem]("org.apache.spark.sql.test.*"),
ProblemFilters.exclude[Problem]("org.apache.spark.sql.util.*"),
ProblemFilters.exclude[Problem]("org.apache.spark.sql.scripting.*"),
davidm-db (Author):

Any concerns with this? I think this is all right, since we don't need any of the interpreting logic on the client side for Connect. In parallel with this PR, we are testing scripting with Connect e2e and it works without any changes on the client side, especially considering the interpreting logic.

davidm-db (Author):

Thanks! Wasn't aware of this...

/**
* Non-leaf node in the execution tree. It is an iterator over executable child nodes.
*/
trait NonLeafStatementExec extends CompoundStatementExec with Iterator[CompoundStatementExec]
Contributor:

This is a bit different from catalyst TreeNode.

Shall we use an explicit tree structure? The CompoundStatementExec interface would then have def children: Seq[CompoundStatementExec].

davidm-db (Author):

It is different, and here is the explanation - we didn't inherit TreeNode since the interpreter does not go through the regular execution path. The intended use of the interpreter is:

  • Call the interpreter's buildExecutionPlan with the parser's output plan.
  • It will return an iterator over executable statements.
  • Use something as simple as foreach, flatMap, etc., and leverage the already-implemented interpreter interface that will iterate through the statements.
  • Alternatively, for the debugging feature, the iterator is especially useful because we can operate on it directly based on the debugging commands.

If we were to introduce exec nodes that inherit TreeNode, this would introduce another layer, because we would need to implement interpreter logic outside of these classes.

Another reason why having children on executable nodes might be problematic is cases where we have loops, for example - we cannot know in advance the number of statements that will be executed.

Does this make sense to you?
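The consumption pattern described above can be sketched like so. This is a hedged illustration: the names mirror the PR, but buildExecutionPlan here just returns a prebuilt iterator rather than lowering a real parse tree:

```scala
// Hedged sketch of the intended usage - illustrative stand-ins, not Spark's API.
sealed trait CompoundStatementExec
final case class SingleStatementExec(sql: String, isInternal: Boolean = false)
    extends CompoundStatementExec

object SqlScriptingInterpreter {
  // Assume the real buildExecutionPlan lowers the parsed tree into an
  // iterator of executable statements; here it is prebuilt for illustration.
  def buildExecutionPlan(stmts: Seq[CompoundStatementExec]): Iterator[CompoundStatementExec] =
    stmts.iterator
}

val plan = SqlScriptingInterpreter.buildExecutionPlan(Seq(
  SingleStatementExec("SELECT 1"),
  SingleStatementExec("DROP TEMPORARY VARIABLE v", isInternal = true)))

// Consumers drive execution with plain iterator combinators; internal
// cleanup statements are skipped when collecting user-facing results.
val results = plan.flatMap {
  case s: SingleStatementExec if !s.isInternal => Some(s.sql)
  case _ => None
}.toList
println(results)
```

Because execution state lives in the iterator rather than in a TreeNode transform, a debugger can pull one statement at a time from the same interface.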

Comment on lines 67 to 68
extends LeafStatementExec
with WithOrigin {
Contributor:

Suggested change
extends LeafStatementExec
with WithOrigin {
extends LeafStatementExec with WithOrigin {

davidm-db (Author):

Done.

with WithOrigin {

/**
* Whether this statement had to be executed during the interpretation phase.
Contributor:

The doc string reads very confusingly, as the variable is named consumed.

Contributor:

Do you mean "Whether this statement has been executed"?

davidm-db (Author):

Yes, that's what is meant here. Statements in conditions, for example, need to be executed during interpretation so that we know the condition value. After interpretation is done, this flag indicates whether the statement was executed during the interpretation phase.

I can rename this variable to isExecuted; I think that would resolve the confusion here.

davidm-db (Author):

Done.

github-actions bot added the BUILD label Jul 4, 2024
val compoundBody = spark.sessionState.sqlParser.parseScript(sqlText)
val executionPlan = interpreter.buildExecutionPlan(compoundBody)
val result = executionPlan.flatMap {
case statement: SingleStatementExec =>
Contributor:

I'm thinking about the proper abstraction we should use. What we get at the beginning is the AST CompoundPlanStatement; what we need at the end is an iterator of SingleStatementExec. In the middle layer we use CompoundStatementExec.

I don't have a good sense of it yet, as loop/if-else is a very important part of the abstraction. Can we implement if-else first so that we can get a better sense of the abstraction?

davidm-db (Author):

I shared the POC pull request in Slack with the If-Else/While implementations. We can continue the discussion there and write the summary here afterwards.

cloud-fan (Contributor):

thanks, merging to master!

@cloud-fan cloud-fan closed this in 2c54aa5 Jul 8, 2024
ericm-db pushed a commit to ericm-db/spark that referenced this pull request Jul 10, 2024
Closes apache#47026 from davidm-db/sql_scripting_interpreter.

Authored-by: David Milicevic <david.milicevic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
jingz-db pushed a commit to jingz-db/spark that referenced this pull request Jul 22, 2024
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
6 participants