
[SPARK-48344][SQL] SQL API change to support execution of compound statements #47403

Closed

miland-db wants to merge 23 commits into apache:master from miland-db:sql_batch_execution

Conversation

@miland-db
Contributor

miland-db commented Jul 18, 2024

What changes were proposed in this pull request?

Previous PRs introduced basic changes for SQL Scripting.
This PR is a follow-up that introduces changes to the SQL API in SparkSession to support execution of compound statements.

SQL Config SQL_SCRIPTING_ENABLED is added to enable usage of SQL Scripting.

CompoundNestedStatementIteratorExec is removed since it is an unnecessary abstraction layer.

Statement flags

The isInternal flag still exists to indicate whether a statement was added by us (generated during tree transformation, for example to drop variables). We will see in the future whether we still need this flag.

shouldCollectResult indicates whether we should collect and potentially return the result, or throw it away.
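
For illustration only, the two flags could sit on an execution node roughly like this; the class name and the sqlText field are assumptions for the sketch, not the actual Spark classes:

```scala
// Hypothetical sketch, not the actual Spark implementation.
// A leaf statement in the script execution plan carrying the two flags described above.
case class SingleStatementExecSketch(
    sqlText: String,              // the statement's SQL text
    isInternal: Boolean,          // true if we generated the statement ourselves (e.g. to drop variables)
    shouldCollectResult: Boolean  // true if the statement's result should be collected and returned
)
```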

Why are the changes needed?

This change is needed to allow users to execute SQL scripts and get results from them.

Does this PR introduce any user-facing change?

Users will now be able to execute SQL scripts using the spark.sql() API.
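
For illustration, assuming spark is an active SparkSession, usage could look roughly like the following; the config key and the script body are assumptions based on the description above, not code from this PR:

```scala
// Assumed key for the new SQL_SCRIPTING_ENABLED config; verify the exact name in the PR.
spark.conf.set("spark.sql.scripting.enabled", "true")

// A compound statement goes through the existing spark.sql() API;
// only the result of the last statement is returned as a DataFrame.
val df = spark.sql(
  """BEGIN
    |  DECLARE x INT DEFAULT 1;
    |  SELECT x + 1 AS result;
    |END""".stripMargin)
df.show()
```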

How was this patch tested?

There are tests for the newly introduced execution changes:

SqlScriptingExecutionNodeSuite - updated unit tests for execution nodes.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Jul 18, 2024
isInternal = false)
}

def execute(executionPlan: Iterator[CompoundStatementExec]): Iterator[Array[Row]] = {
Contributor

I would maybe make buildExecutionPlan private and then call it from execute - it seems like we will never need to call buildExecutionPlan from outside?
For example, if we add a separate execute* function in the future, it can also call buildExecutionPlan internally.

Contributor Author

miland-db commented Jul 19, 2024

We are using buildExecutionPlan in SqlScriptingInterpreterSuite. Should we change the testing logic as well?

We can call buildExecutionPlan from execute, but can we also leave it public?

Contributor

This probably means that we should change the tests a bit - we are simulating execute in the tests as well, because before we didn't have it in the interpreter. Now that we have it, we should probably change the tests.

Let's do what you said for now, until other folks review the PR (leaving this comment unresolved), and then we can figure out whether we want to change the tests as well.
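
For reference, the suggested shape would be roughly the following (a sketch only: imports are omitted, bodies are elided, and the class name is made up; CompoundBody, CompoundStatementExec and Row are the types already used in this PR):

```scala
// Hypothetical sketch of the suggestion: buildExecutionPlan becomes private
// and is only reachable through execute().
class SqlScriptingInterpreterSketch {
  private def buildExecutionPlan(compound: CompoundBody): Iterator[CompoundStatementExec] = {
    // ... transform the parsed compound body into an iterator of executable nodes ...
    ???
  }

  def execute(compound: CompoundBody): Iterator[Array[Row]] = {
    val executionPlan = buildExecutionPlan(compound)
    // ... walk the plan, executing statements and yielding collected results ...
    ???
  }
}
```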

@davidm-db
Contributor

Just as a reminder, we should probably include the SQL config with this PR as well! It should be fairly simple, but let's talk about it on Monday.

@miland-db miland-db changed the title [WIP][SPARK-][SQL] SQL API change to support execution of compound statements [WIP][SPARK-48344][SQL] SQL API change to support execution of compound statements Jul 22, 2024
@miland-db miland-db changed the title [WIP][SPARK-48344][SQL] SQL API change to support execution of compound statements [SPARK-48344][SQL] SQL API change to support execution of compound statements Jul 22, 2024
@miland-db miland-db requested a review from davidm-db July 22, 2024 15:14
// execute the plan directly if it is not a single statement
val lastRow = executeScript(plan).foldLeft(Array.empty[Row])((_, next) => next)
val attributes = DataTypeUtils.toAttributes(lastRow.head.schema)
Dataset.ofRows(self, LocalRelation.fromExternalRows(attributes, lastRow.toIndexedSeq))
Contributor

shall we return the last DataFrame from the script instead of collecting the result at the driver side?

Contributor

davidm-db commented Jul 23, 2024

this is a bit of a hard topic at the moment... we decided on this approach for the preview for multiple reasons:

  • this is what we will do for the initial version of the changes for Spark Connect execution
  • discussions around multiple topics are still open:
    • multiple results API - decisions around this will affect the single results API as well
    • multiple results API - from a correctness perspective, all statements (including SELECTs) need to be executed eagerly. It then makes sense to have the same behavior with the single results API as well - in that case, all statements are executed eagerly, but the results of all of them except the last one are dropped.
    • there is still an open discussion about whether to include a return statement
    • a ton of questions about stored procedures are still an open topic

what will probably happen down the line is:

  • sql() API remains unchanged and only the last DataFrame is returned (as you suggested). This still requires a lot of work to support Connect execution; the current approach already works with Connect.
  • [optional] new API to do what we are doing at the moment.
  • new API for multiple results, stored procedures, execute immediate, etc.

since the last part is still an open question, we figured we would do the simplest thing that works e2e in all cases, and then, after we gather initial feedback from the preview and better understand what we want to do for stored procedures/multiple results, we can actually commit to implementing all of the API changes.

please let us know your thoughts on this.

Contributor

all statements (including SELECTs) need to be executed eagerly

We need to materialize the query result but we don't need to collect the result at the driver side (except for the last statement). E.g. we can write the query result to a noop sink.

Contributor Author

miland-db commented Jul 23, 2024

What does the noop sink do? Is it df.write.format("noop").mode("overwrite").save()? Is it the same as doing df.collect(), but just throwing away the result?

We have a hard time determining which statement is the last one. That is the reason we are doing it this way (we have to save the result of the last DataFrame).

Contributor

Yeah, it's the same as doing df.collect(), but it just throws away the result.
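
As a rough illustration of the difference being discussed (the DataFrame here is just a placeholder; noop is Spark's built-in no-op data source):

```scala
val df = spark.range(10).selectExpr("id", "id * 2 AS doubled")

// Materializes the query (runs it end to end) but discards the rows on the executors;
// nothing is brought back to the driver.
df.write.format("noop").mode("overwrite").save()

// Also materializes the query, but additionally ships all result rows to the driver.
val rows = df.collect()
```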

shouldCollectResult = true)
}

def execute(compoundBody: CompoundBody): Iterator[Array[Row]] = {
Contributor

can we decouple the two parts: 1) produce DataFrames from the script; 2) execute the DataFrames one by one, in order?

Contributor Author

We decided to do it this way because each statement has to be executed as soon as we encounter it, since it can throw an exception. Very soon we will be introducing handlers/conditions to deal with these exceptions.

Contributor

We did it the way you suggested before, but as soon as we started introducing exception handling, we realized it's hardly extensible. When a SQL statement throws an exception, exception handlers need to be executed immediately, and we need the interpreter for that - hence, we need to do execution within the interpreter as well.

Contributor

I'm OK with a simple implementation (collect all the result rows) as a beginning, but we should build the basic framework with proper abstraction, otherwise I don't know how to review it. The idea of handlers makes sense.

Contributor Author

For now, because we can't tell which statement is the last one (which would let us do only one collect() and use the noop sink for the rest), we decided to do a collect for all statements.

Either way, it is important to execute each statement as soon as we encounter it, to be able to handle errors properly. The PR introducing handlers is currently a work in progress and will probably explain why we did things the way we did in this PR.
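
To make the point about error handling concrete, here is a hedged sketch (names are illustrative, not the PR's actual interpreter code): because each statement runs as soon as it is reached, a future handler can react at exactly the failing statement, before anything after it executes.

```scala
// Illustrative only (not the PR's actual interpreter code): each statement is executed
// the moment it is reached, so an error handler can run at exactly the failing statement
// before any later statement starts.
def runEagerly[R](
    statements: Seq[String],
    runStatement: String => R,
    handleError: (String, Throwable) => R): Seq[R] =
  statements.map { sql =>
    try runStatement(sql)
    catch {
      // a future CONDITION / HANDLER mechanism would hook in roughly here
      case e: Exception => handleError(sql, e)
    }
  }
```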

@miland-db miland-db requested a review from cloud-fan July 25, 2024 09:37
# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/errors/SqlScriptingErrors.scala
#	sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/SqlScriptingParserSuite.scala
@davidm-db
Contributor

@cloud-fan any unresolved topics here? Milan won't be available next week - I can finish the PR next week if needed, but let's try to figure out if there are any concerns so Milan can try to sort them out today.

Dataset.ofRows(self, singleStmtPlan.parsedPlan, tracker)
case _ =>
// execute the plan directly if it is not a single statement
val lastRow = executeScript(plan).foldLeft(Array.empty[Row])((_, next) => next)
Contributor

let's think about whether we want to do this exactly this way, because:

  • executeScript is basically a simple one-liner and an alias for the interpreter's execute function
  • when we introduce multiple results in the future, it seems best to:
    • have executeMultipleResults in the interpreter
    • have each function (execute, executeMultipleResults, and maybe something new?) collect data based on the type of data it needs to return

I propose that the execute family of methods in the interpreter be responsible for the logic of which data is returned, instead of fetching the last row here in SparkSession.

I didn't write a ton of details here; I'm writing this comment as a reminder, and we can discuss more offline.

.trim
}

test("SQL Scripting not enabled") {
Contributor

nit: move this above the // Helper methods

"feature flag.")
.version("4.0.0")
.booleanConf
.createWithDefault(Utils.isTesting)
Contributor

Did we talk about changing this to false by default? Why did we decide not to do it?

/**
* Script interpreter that produces an execution plan and executes SQL scripts.
*/
protected lazy val scriptingInterpreter: SqlScriptingInterpreter = {
Contributor

nit: sqlScriptingInterpreter maybe?

shouldCollectResult = true)
}

def execute(compoundBody: CompoundBody): Iterator[Array[Row]] = {
Contributor

nit: let's keep public stuff at top, and private below
