
[SPARK-48343][SQL] Introduction of SQL Scripting interpreter #47026

Closed
wants to merge 15 commits into from

Conversation

davidm-db (Contributor) commented Jun 19, 2024

What changes were proposed in this pull request?

The previous PR introduced parser changes for SQL Scripting. This PR is a follow-up that introduces the interpreter for the SQL Scripting language and proposes the following changes:

  • SqlScriptingExecutionNode - introduces execution nodes for SQL scripting, used during the interpretation phase:
    • SingleStatementExec - executable node for SingleStatement logical node; wraps logical plan of the single statement.
    • CompoundNestedStatementIteratorExec - implements base recursive iterator logic for all nesting statements.
    • CompoundBodyExec - concrete implementation of CompoundNestedStatementIteratorExec for CompoundBody logical node.
  • SqlScriptingInterpreter - introduces the interpreter for SQL scripts. The product of interpretation is an iterator over the statements that should be executed.

Follow-up PRs will introduce further statements, support for exceptions thrown from parser/interpreter, exception handling in SQL, etc.
More details can be found in the Jira item for this task and its parent (where the design doc is uploaded as well).
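For illustration, the execution-node structure described above can be sketched roughly like this. This is a hedged sketch only: the class names mirror the PR, but the bodies are simplified stand-ins, not Spark's actual implementation. A compound body flattens recursively into an iterator over its leaf statements:

```scala
// Illustrative sketch - simplified stand-ins, not Spark's actual classes.
sealed trait CompoundStatementExec
final case class SingleStatementExec(sql: String) extends CompoundStatementExec
final case class CompoundBodyExec(statements: Seq[CompoundStatementExec])
    extends CompoundStatementExec {
  // Recursive iterator over leaf statements, in execution order.
  def iterator: Iterator[SingleStatementExec] = statements.iterator.flatMap {
    case s: SingleStatementExec => Iterator.single(s)
    case b: CompoundBodyExec    => b.iterator
  }
}

val script = CompoundBodyExec(Seq(
  SingleStatementExec("DECLARE x = 1"),
  CompoundBodyExec(Seq(SingleStatementExec("SELECT x"))),
  SingleStatementExec("DROP TEMPORARY VARIABLE x")))
println(script.iterator.map(_.sql).toList)
```

Nested compounds disappear from the consumer's point of view; it only ever sees a flat stream of single statements in execution order.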

Why are the changes needed?

The intent is to add support for SQL scripting (and stored procedures down the line). It gives users the ability to develop complex logic and ETL entirely in SQL.

Until now, users had to write verbose SQL statements or combine SQL + Python to express such logic efficiently. This is an effort to bridge that gap and enable complex logic to be written entirely in SQL.

Does this PR introduce any user-facing change?

No.
This PR is the second in a series of PRs that will introduce changes to the sql() API to add support for SQL scripting, but for now the API remains unchanged.
In the future, the API will remain the same as well, but it will gain the ability to execute SQL scripts.

How was this patch tested?

There are tests for the newly introduced changes:

  • SqlScriptingExecutionNodeSuite - unit tests for execution nodes.
  • SqlScriptingInterpreterSuite - tests for interpreter (with parser integration).

Was this patch authored or co-authored using generative AI tooling?

No.

github-actions bot added the SQL label Jun 19, 2024

/** Get the SQL query text corresponding to this statement. */
def getText(sqlScriptText: String): String = {
if (origin.startIndex.isEmpty || origin.stopIndex.isEmpty) {
Contributor:

What are the cases when origin is empty?

Especially, having one of them being empty and other not seems like an error against which we should assert?

davidm-db (Author) · Jun 19, 2024:

If SingleStatement is created with withOrigin in the AST builder (which should be done, since it implements the WithOrigin trait), these fields should never be empty. Asserting just seemed like too much here because getText is something I see us using when creating exception messages and the like, so it didn't seem to be on the critical path.
But if you think so, I have nothing against asserting here :) We should probably change the same method in SingleStatement as well then, but that's fine.

davidm-db (Author) · Jun 26, 2024:

I figured out we can rely on Origin entirely for this, because it holds the SQL text together with the start and end text indices. I've simplified getText so it no longer takes an input param. Since Origin is an existing mechanism that should never fail, I've now added the assert as you suggested, and did the same in the SingleStatement logical operator.
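A minimal model of that Origin-based getText, with Origin as a stand-in case class here (assuming only the sqlText/startIndex/stopIndex fields the method actually needs):

```scala
// Stand-in for catalyst's Origin, modeling only the fields getText needs.
case class Origin(sqlText: Option[String], startIndex: Option[Int], stopIndex: Option[Int])

// Same shape as the simplified getText: assert the fields are set, then slice
// the original SQL text by the statement's start/stop indices (stop inclusive).
def getText(origin: Origin): String = {
  assert(origin.sqlText.isDefined && origin.startIndex.isDefined && origin.stopIndex.isDefined)
  origin.sqlText.get.substring(origin.startIndex.get, origin.stopIndex.get + 1)
}

val o = Origin(Some("BEGIN SELECT 1; END"), Some(6), Some(14))
println(getText(o)) // prints "SELECT 1;"
```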

val sqlScript =
"""
|BEGIN
|DECLARE var = 1;
Contributor:

Can you add a test that asserts that DECLARE can be used only at the top of a BEGIN/END block?

davidm-db (Author):

We removed that logic from the parser PR to simplify it, have a follow-up item to add this check - https://issues.apache.org/jira/browse/SPARK-48345.

test("session vars - set and read scoped") {
val sqlScript =
"""
|BEGIN
Contributor:

TODO (in next PR) is a test that:

  1. creates a var
  2. does some operation that throws an error (e.g. insert against constraint/create table that already exists...)
  3. asserts that the variable is no longer in the session after the error is thrown.

This currently is not going to work since we manually insert DROPs.

davidm-db (Author) · Jun 19, 2024:

Agreed, we already have an item for local variables support that includes this topic: https://issues.apache.org/jira/browse/SPARK-48530.

.map(new SingleStatementExec(_, Origin(), isInternal = true))
.reverse
new CompoundBodyExec(
body.collection.map(st => transformTreeIntoExecutable(st)) ++ dropVariables)
Contributor:

Instead of just concatenating DROPs, an alternative could be to:

  1. Keep "cleanup/internal" statements in side structure.
  2. Execute them as part of finalize or reset calls.

My fear is that by just keeping them at the end it might be hard to support the following scenarios:

  1. premature return (e.g. during break)
  2. error handling (exception thrown in the middle of the block).
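One way the suggested side-structure cleanup could look. This is a hedged sketch with assumed names (BlockWithCleanup is hypothetical, not the PR's code): cleanup statements live in a side structure and run in a finalize step, so they execute even when the body exits early:

```scala
// Hypothetical sketch of the alternative suggested above: cleanup statements
// are kept separately and run via try/finally, so they execute even when an
// error (or early exit) interrupts the body.
class BlockWithCleanup(body: Seq[String], cleanup: Seq[String]) {
  def execute(run: String => Unit): Unit =
    try body.foreach(run)
    finally cleanup.foreach(run) // also runs on break/error paths
}

val executed = scala.collection.mutable.Buffer[String]()
val block = new BlockWithCleanup(
  body = Seq("DECLARE x = 1", "SELECT 1/0"),
  cleanup = Seq("DROP TEMPORARY VARIABLE x"))
try block.execute { s =>
  if (s.contains("1/0")) throw new RuntimeException("error mid-block")
  executed += s
} catch { case _: RuntimeException => () }
// The DROP still ran even though the second statement failed.
println(executed.toList)
```

With the PR's current approach (DROPs appended at the end of the statement list), an error in the middle would skip the cleanup entirely, which is exactly the concern raised above.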

Contributor:

The agreement is to do this in an upcoming PR.

/** Get the SQL query text corresponding to this statement. */
def getText(sqlScriptText: String): String = {
if (origin.startIndex.isEmpty || origin.stopIndex.isEmpty) {
return null
Contributor:

What does it mean when the SQL query text is null? Should it be a user-facing or an internal error?

davidm-db (Author):

Do you mean the sqlScriptText parameter here? This method is intended for internal use only - in our POC we used it for exception messages, tests and the like. Any error here would be internal, but it really depends on what is passed as the parameter (and that should be correct) - so I don't think we should raise any errors here.

Another point is that we were not aware of Origin when working on the POC - now it seems it is sufficient to return the text info on its own, so we might end up dropping the getText function completely, but we'd like to figure that out in the upcoming PRs.

case Some(body: NonLeafStatementExec) => body.hasNext
case Some(_: LeafStatementExec) => true
case None => false
case _ => throw new IllegalStateException("Unknown statement type")
Contributor:

nit: SparkException.internalError

davidm-db (Author) · Jun 21, 2024:

@allisonwang-db for my education, is there any specific reason for this change? Do we want to throw only internal errors when not caused by the user? Should we create a specific error class/condition for this?

sealed trait CompoundStatementExec extends Logging {
/**
* Whether the statement originates from the SQL script or is created during the interpretation.
* Example: DropVariable statements are automatically created at the end of each compound.
Contributor:

Are we planning to add anything other than dropping variables in the future? Dropping temp tables/views/functions?

davidm-db (Author):

Definitely, we will have other temporary objects to drop in the future. For now, we made a really simple change like this so that our scripts are functional for testing, etc.

Follow-up PRs will first introduce labels for compound bodies (both implicit and explicit), which is a requirement for further work/improvements in the variables area - like introducing variable shadowing and similar. Once the labels are introduced, we will work on variable improvements - this will probably include some type of context (which might later be used for a ton of other things: call stack tracking once we introduce procedures, tracking info for statements like BREAK/CONTINUE, etc.). Then we will know exactly how we want to handle these drops - it probably won't be as explicit as in this PR, but will rather involve some interaction with the context. At that point we will also figure out whether we still need this isInternal property down the road.

Contributor:

It is questionable whether temp tables/views/functions should be tied to a compound scope like variables (probably not). They will continue to live at the session level.


/**
* Whether this statement had to be executed during the interpretation phase.
* Example: Statements in conditions of If/Else, While, etc.
Contributor:

Do we have test cases that include If/Else and While statements? Are they already supported, or will they be supported in the future?

davidm-db (Author):

These statements will follow in upcoming PRs; we already have them ready/tried out in our POC. I can share more details offline if you are interested.

For now, we are introducing just the basic structure in order to keep the PRs simpler to review.

/**
* Concrete implementation of the interpreter for SQL scripting.
*/
case class SqlScriptingInterpreter() extends ProceduralLanguageInterpreter {
Contributor:

Where do we call reset? Or is it preparation for loops in the future?

davidm-db (Author):

You are right, it is preparation for loops in the future.

*/
private def getDeclareVarNameFromPlan(plan: LogicalPlan): Option[UnresolvedIdentifier] =
plan match {
case CreateVariable(name: UnresolvedIdentifier, _, _) => Some(name)
Contributor:

To confirm: we do create session variables when executing the CompoundBody, but we remove them at the end so they behave like local variables?

davidm-db (Author):

Correct. This is not the target behavior, and I explained the plan for variable-related statements in a comment above, but we did it like this for now to simplify the PR and to have functioning scripts for testing purposes.

Comment on lines 40 to 43
/** Get the SQL query text corresponding to this statement. */
def getText: String = {
assert(origin.sqlText.isDefined && origin.startIndex.isDefined && origin.stopIndex.isDefined)
origin.sqlText.get.substring(origin.startIndex.get, origin.stopIndex.get + 1)
davidm-db (Author):

Check comment for this change.

@@ -164,6 +164,7 @@ object CheckConnectJvmClientCompatibility {
ProblemFilters.exclude[Problem]("org.apache.spark.sql.streaming.ui.*"),
ProblemFilters.exclude[Problem]("org.apache.spark.sql.test.*"),
ProblemFilters.exclude[Problem]("org.apache.spark.sql.util.*"),
ProblemFilters.exclude[Problem]("org.apache.spark.sql.scripting.*"),
davidm-db (Author):

Any concerns with this? I think this is all right, since we don't need any of the interpreting logic on the client side for Connect. In parallel with this PR, we are testing scripting with Connect e2e and it works without any changes on the client side, especially considering the interpreting logic.

davidm-db (Author):

Thanks! Wasn't aware of this...

/**
* Non-leaf node in the execution tree. It is an iterator over executable child nodes.
*/
trait NonLeafStatementExec extends CompoundStatementExec with Iterator[CompoundStatementExec]
Contributor:

This is a bit different from catalyst TreeNode.

Shall we use an explicit tree structure? The CompoundStatementExec interface would then have def children: Seq[CompoundStatementExec].

davidm-db (Author):

It is different, and here is the explanation - we didn't inherit TreeNode since the interpreter does not go through the regular execution path. The intended use of the interpreter is:

  • Call the interpreter's buildExecutionPlan with the parser's output plan.
  • It will return an iterator over executable statements.
  • Use something as simple as foreach, flatMap, etc., and leverage the already-implemented interpreter interface that will iterate through the statements.
  • Alternatively, for the debugging feature, the iterator is especially useful because we can operate on it directly based on the debugging commands.

If we were to introduce exec nodes that inherit TreeNode, this would introduce another layer, because we would need to implement interpreter logic outside of these classes.

Another reason why having children on executable nodes might be problematic is cases where we have loops, for example - we cannot know in advance the number of statements that will be executed.

Does this make sense to you?
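The consumption pattern described above can be sketched like so. This is a hedged illustration: the names mirror the PR, but buildExecutionPlan here just returns a prebuilt iterator rather than lowering a real parse tree:

```scala
// Hedged sketch of the intended usage - illustrative stand-ins, not Spark's API.
sealed trait CompoundStatementExec
final case class SingleStatementExec(sql: String, isInternal: Boolean = false)
    extends CompoundStatementExec

object SqlScriptingInterpreter {
  // Assume the real buildExecutionPlan lowers the parsed tree into an
  // iterator of executable statements; here it is prebuilt for illustration.
  def buildExecutionPlan(stmts: Seq[CompoundStatementExec]): Iterator[CompoundStatementExec] =
    stmts.iterator
}

val plan = SqlScriptingInterpreter.buildExecutionPlan(Seq(
  SingleStatementExec("SELECT 1"),
  SingleStatementExec("DROP TEMPORARY VARIABLE v", isInternal = true)))

// Consumers drive execution with plain iterator combinators; internal
// cleanup statements are skipped when collecting user-facing results.
val results = plan.flatMap {
  case s: SingleStatementExec if !s.isInternal => Some(s.sql)
  case _ => None
}.toList
println(results)
```

Because execution state lives in the iterator rather than in a TreeNode transform, a debugger can pull one statement at a time from the same interface.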

Comment on lines 67 to 68
extends LeafStatementExec
with WithOrigin {
Contributor:

Suggested change
extends LeafStatementExec
with WithOrigin {
extends LeafStatementExec with WithOrigin {

davidm-db (Author):

Done.

with WithOrigin {

/**
* Whether this statement had to be executed during the interpretation phase.
Contributor:

The doc string reads very confusingly, as the variable is named consumed.

Contributor:

Do you mean "Whether this statement has been executed"?

davidm-db (Author):

Yes, that's what is meant here. Statements in conditions, for example, need to be executed during interpretation so that we know the condition value. After interpretation is done, this flag indicates whether the statement was executed during the interpretation phase.

I can rename this variable to isExecuted; I think that would resolve the confusion here.

davidm-db (Author):

Done.

github-actions bot added the BUILD label Jul 4, 2024
val compoundBody = spark.sessionState.sqlParser.parseScript(sqlText)
val executionPlan = interpreter.buildExecutionPlan(compoundBody)
val result = executionPlan.flatMap {
case statement: SingleStatementExec =>
Contributor:

I'm thinking about the proper abstraction we should use. What we get at the beginning is the AST CompoundPlanStatement; what we need at the end is an iterator of SingleStatementExec. In the middle layer we use CompoundStatementExec.

I don't have a good sense of it yet, as loop/if-else is a very important part of the abstraction. Can we implement if-else first so that we can get a better sense of the abstraction?

davidm-db (Author):

I shared the POC pull request in Slack with the If-Else/While implementations. We can continue the discussion there and write the summary here afterwards.

cloud-fan (Contributor):

thanks, merging to master!

@cloud-fan cloud-fan closed this in 2c54aa5 Jul 8, 2024
ericm-db pushed a commit to ericm-db/spark that referenced this pull request Jul 10, 2024
Closes apache#47026 from davidm-db/sql_scripting_interpreter.

Authored-by: David Milicevic <david.milicevic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
jingz-db pushed a commit to jingz-db/spark that referenced this pull request Jul 22, 2024
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
6 participants