[SPARK-48344][SQL] SQL API change to support execution of compound statements #47403
miland-db wants to merge 23 commits into apache:master from
Conversation
      isInternal = false)
  }

  def execute(executionPlan: Iterator[CompoundStatementExec]): Iterator[Array[Row]] = {
I would maybe make `buildExecutionPlan` private, and then call it from `execute` - it seems like we will never need to call `buildExecutionPlan` from outside?
For example, if we add separate `execute*` functions in the future, they can also call `buildExecutionPlan` internally.
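A minimal sketch of the suggested shape, assuming the types from this PR (`CompoundBody`, `CompoundStatementExec`, `SingleStatementExec`, `shouldCollectResult`); the bodies and the `collectResult()` accessor are placeholders, not the actual implementation:

```scala
import org.apache.spark.sql.Row

class SqlScriptingInterpreter {
  // Plan construction stays private: callers only ever go through execute().
  private def buildExecutionPlan(compoundBody: CompoundBody): Iterator[CompoundStatementExec] = {
    ??? // transform the parsed CompoundBody into execution nodes (placeholder)
  }

  // Public entry point that drives the private plan builder.
  def execute(compoundBody: CompoundBody): Iterator[Array[Row]] = {
    val executionPlan = buildExecutionPlan(compoundBody)
    executionPlan.collect {
      case statement: SingleStatementExec if statement.shouldCollectResult =>
        statement.collectResult() // hypothetical accessor for the statement's rows
    }
  }
}
```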
We are using buildExecutionPlan in SqlScriptingInterpreterSuite. Should we change the testing logic as well?
We can call `buildExecutionPlan` from `execute`, but can we also leave it public?
This probably means that we should change the tests a bit - we are simulating `execute` in the tests as well, because before we didn't have it in the interpreter. Now that we have it, we should probably update the tests.
Let's do what you suggested last for now (leave this comment unresolved) until other folks review the PR, and then we can figure out whether we want to change the tests as well.
Just as a reminder, we should probably include SQL config with this PR as well! Should be fairly simple, but let's talk about it on Monday.
        // execute the plan directly if it is not a single statement
        val lastRow = executeScript(plan).foldLeft(Array.empty[Row])((_, next) => next)
        val attributes = DataTypeUtils.toAttributes(lastRow.head.schema)
        Dataset.ofRows(self, LocalRelation.fromExternalRows(attributes, lastRow.toIndexedSeq))
shall we return the last DataFrame from the script instead of collecting the result at the driver side?
this is a bit of a hard topic at the moment... we decided on this approach for the preview for multiple reasons:
- this is what we will do for the initial version of the changes for Spark Connect execution
- discussions around multiple topics are still open:
  - multiple results API - decisions around this will affect the single result API as well
  - multiple results API - from a correctness perspective, all statements (including SELECTs) need to be executed eagerly. It makes sense then to have the same behavior with the single result API as well - in this case all statements are executed eagerly, but results for all of them except the last one are dropped.
  - there is still an open discussion whether to include a `return` statement - a ton of questions about stored procedures are still an open topic

what will probably happen down the line is:
- `sql()` API remains unchanged and only the last DataFrame is returned (as you suggested). This still requires a lot of work to support Connect execution; the current approach works with Connect already.
- [optional] new API to do what we are doing at the moment.
- new API for multiple results, stored procedures, execute immediate, etc.

since the last part is still an open question, we figured we would do the simplest thing that works e2e in all cases and then, after we gather initial feedback from the preview and understand better what we want to do for stored procedures/multiple results, we can actually commit to implementing all of the API changes.
please let us know your thoughts on this.
> all statements (including SELECTs) need to be executed eagerly
We need to materialize the query result but we don't need to collect the result at the driver side (except for the last statement). E.g. we can write the query result to a noop sink.
What is the noop sink doing? Is it `df.write.format("noop").mode("overwrite").save()`? Is it the same as doing `df.collect()` but it just throws away the result?
We have a hard time determining which statement is the last one. That is the reason why we are doing it this way (we have to save the result of the last DataFrame).
Yea it's the same as doing df.collect() but it just throws away the result
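For context, a hedged sketch of the two options being discussed; the query text is made up, and `spark` is assumed to be an existing SparkSession. The noop data source is a built-in Spark sink that runs the query and discards the output:

```scala
import org.apache.spark.sql.Row

// Hypothetical statement taken from somewhere in a script.
val df = spark.sql("SELECT * FROM some_table")

// Eager execution without collecting at the driver: the plan runs fully,
// but the noop sink discards every output row.
df.write.format("noop").mode("overwrite").save()

// Eager execution that also brings all rows to the driver (what this PR currently
// does for every statement, since the last statement isn't known in advance).
val rows: Array[Row] = df.collect()
```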
      shouldCollectResult = true)
  }

  def execute(compoundBody: CompoundBody): Iterator[Array[Row]] = {
can we decouple the two parts: 1) produce DataFrames from the script. 2) execute the DataFrames one by one w.r.t. the order.
We decided to do it this way because each statement has to be executed as soon as we encounter it, since it can throw an exception. Very soon we will be introducing handlers/conditions to deal with these exceptions.
We did it the way you suggested before, but as soon as we started introducing exception handling, we realized it's hardly extensible in the future. When a SQL statement throws an exception, exception handlers need to be executed immediately, and we need the interpreter for that - hence, we need to do execution within the interpreter as well.
I'm OK with a simple implementation (collect all the result rows) as a beginning, but we should build the basic framework with proper abstraction, otherwise I don't know how to review it. The idea of handlers makes sense.
For now, because we don't know which statement is the last one (which would let us do only one `collect()` and use the noop sink for the rest), we decided to do `collect()` for all statements.
Either way, it is important to execute each statement as soon as we encounter it to be able to handle errors properly. The PR introducing handlers is currently a work in progress and will probably explain why we did things the way we did in this PR.
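A hedged sketch of why execution stays inside the interpreter: each statement runs as soon as it is reached so a (future) condition handler can react immediately. The fragment is assumed to live inside the interpreter with a SparkSession `session` in scope; `dispatchToHandler` is a hypothetical placeholder, not code from this PR:

```scala
import org.apache.spark.sql.{Dataset, Row}

// Hypothetical placeholder for the handler dispatch that a follow-up PR would add.
private def dispatchToHandler(e: Exception): Unit = ()

def executeInternal(executionPlan: Iterator[CompoundStatementExec]): Iterator[Array[Row]] = {
  executionPlan.flatMap {
    case statement: SingleStatementExec =>
      try {
        // Execute eagerly the moment the statement is reached...
        val rows = Dataset.ofRows(session, statement.parsedPlan).collect()
        // ...and keep the rows only if the statement is marked for collection.
        if (statement.shouldCollectResult) Some(rows) else None
      } catch {
        case e: Exception =>
          dispatchToHandler(e) // a handlers/conditions PR would take over here
          None
      }
    case _ => None // non-leaf nodes only drive control flow and produce no rows
  }
}
```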
@cloud-fan any unresolved topics here? Milan won't be available next week - I can finish the PR next week if needed, but let's try to figure out if there are any concerns so Milan can try to sort them out today.
        Dataset.ofRows(self, singleStmtPlan.parsedPlan, tracker)
      case _ =>
        // execute the plan directly if it is not a single statement
        val lastRow = executeScript(plan).foldLeft(Array.empty[Row])((_, next) => next)
let's think about whether we want to do this exactly this way, because:
- `executeScript` is basically a simple one-liner and an alias for the interpreter's `execute` function
- when we introduce multiple results in the future, it seems best to:
  - have `executeMultipleResults` in the interpreter
  - have each function (`execute` and `executeMultipleResults` and maybe something new?) collect data based on the type of data it needs to return

I propose that the `execute` family of methods in the interpreter should be responsible for handling the logic of which data is returned, instead of fetching the last row here in SparkSession.
I didn't write a ton of details here, I'm writing this comment as a reminder and we can discuss more offline.
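A rough sketch of the proposed split, where the interpreter's execute* methods own the decision about what to return and `SparkSession.sql` only wraps it. `executeMultipleResults` is just the name floated in this thread, and `CompoundBody` is assumed from this PR:

```scala
import org.apache.spark.sql.Row

// Each execute* entry point decides which results it collects and returns.
trait SqlScriptingInterpreterApi {
  def execute(compoundBody: CompoundBody): Array[Row]                          // last result only
  def executeMultipleResults(compoundBody: CompoundBody): Iterator[Array[Row]] // one result per statement
}

// SparkSession.sql would then only wrap whatever the interpreter hands back,
// instead of folding over all results to find the last one, roughly:
//   val lastResult = scriptingInterpreter.execute(compoundBody)
//   val attributes = DataTypeUtils.toAttributes(lastResult.head.schema)
//   Dataset.ofRows(self, LocalRelation.fromExternalRows(attributes, lastResult.toIndexedSeq))
```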
      .trim
  }

  test("SQL Scripting not enabled") {
nit: move this above the `// Helper methods` comment
| "feature flag.") | ||
| .version("4.0.0") | ||
| .booleanConf | ||
| .createWithDefault(Utils.isTesting) |
Did we talk about changing this to false by default? Why did we decide not to do it?
  /**
   * Script interpreter that produces execution plan and executes SQL scripts.
   */
  protected lazy val scriptingInterpreter: SqlScriptingInterpreter = {
nit: sqlScriptingInterpreter maybe?
      shouldCollectResult = true)
  }

  def execute(compoundBody: CompoundBody): Iterator[Array[Row]] = {
nit: let's keep public stuff at top, and private below
What changes were proposed in this pull request?
Previous PRs introduced basic changes for SQL Scripting.
This PR is a follow-up to introduce changes to the SQL API in `SparkSession` to support execution of compound statements.

- SQL Config `SQL_SCRIPTING_ENABLED` is added to enable usage of SQL Scripting.
- `CompoundNestedStatementIteratorExec` is removed since it is an unnecessary abstraction layer.
- Statement flags:
  - The `isInternal` flag still exists to indicate whether a statement was added by us (generated during tree transformation, for example to drop variables). We will see in the future if we still need this flag.
  - `shouldCollectResult` indicates whether we should collect and potentially return the result, or throw it away.

Why are the changes needed?
This change is needed to open the possibility for users to execute and get results from sql scripts.
Does this PR introduce any user-facing change?
Users will now be able to execute SQL scripts using the `spark.sql()` API.
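An illustrative usage sketch: the config key `spark.sql.scripting.enabled` is assumed to be the one backing `SQL_SCRIPTING_ENABLED`, and the script body is a made-up example. Per the discussion above, only the last statement's result comes back as a DataFrame:

```scala
// Assumes an existing SparkSession `spark`.
spark.conf.set("spark.sql.scripting.enabled", "true")

val df = spark.sql(
  """BEGIN
    |  SELECT 1;
    |  SELECT 2 AS last_result;
    |END""".stripMargin)

// Expected to show only the result of the last statement (last_result = 2).
df.show()
```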
How was this patch tested?
There are tests for newly introduced execution changes:
- `SqlScriptingExecutionNodeSuite` - updated unit tests for execution nodes.

Was this patch authored or co-authored using generative AI tooling?
No.