[SPARK-48343][SQL] Introduction of SQL Scripting interpreter #47026
Conversation
```scala
/** Get the SQL query text corresponding to this statement. */
def getText(sqlScriptText: String): String = {
  if (origin.startIndex.isEmpty || origin.stopIndex.isEmpty) {
```
What are the cases when origin is empty?
Especially, having one of them empty and the other not seems like an error against which we should assert?
If `SingleStatement` is created with `withOrigin` in the AST builder (which should be done since it implements the `WithOrigin` trait), these fields should never be empty. Asserting just seemed like too much here because `getText` is something I see us using when creating exception messages and the like, so it didn't seem to be on the critical path.
But if you think so, I have nothing against asserting here :) We should probably change the same method in the `SingleStatement` logical operator as well then, but that's fine.
I figured out we can use `Origin` completely for this, because it holds the SQL text together with the start and end text indices. I've simplified `getText` so it does not have an input param. Since `Origin` is an already existing mechanism that should never fail, I've added the assert now like you suggested - did the same in the `SingleStatement` logical operator.
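To make the `Origin` mechanism concrete - a minimal, self-contained sketch (the sample script text and indices here are made up for illustration) of how the simplified `getText` recovers a statement's text from `Origin`:

```scala
import org.apache.spark.sql.catalyst.trees.{CurrentOrigin, Origin}

object OriginSketch extends App {
  // An Origin as withOrigin would populate it in the AST builder: the full
  // SQL text plus the start/stop indices of one statement within it.
  val origin = Origin(
    sqlText = Some("BEGIN SELECT 1; END"),
    startIndex = Some(6),
    stopIndex = Some(14))

  val text = CurrentOrigin.withOrigin(origin) {
    val o = CurrentOrigin.get
    // The invariant asserted in getText: all three fields must be defined.
    assert(o.sqlText.isDefined && o.startIndex.isDefined && o.stopIndex.isDefined)
    o.sqlText.get.substring(o.startIndex.get, o.stopIndex.get + 1)
  }
  println(text) // prints: SELECT 1;
}
```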
sql/core/src/test/scala/org/apache/spark/sql/scripting/SqlScriptingIntegrationSuite.scala (outdated, resolved)
sql/core/src/test/scala/org/apache/spark/sql/scripting/SqlScriptingIntegrationSuite.scala (outdated, resolved)
sql/core/src/test/scala/org/apache/spark/sql/scripting/SqlScriptingInterpreterSuite.scala (outdated, resolved)
sql/core/src/main/scala/org/apache/spark/sql/scripting/SqlScriptingInterpreter.scala (resolved)
```scala
val sqlScript =
  """
    |BEGIN
    |DECLARE var = 1;
```
Can you add a test that asserts that DECLARE can be used only at the top of a BEGIN/END block?
We removed that logic from the parser PR to simplify it; there is a follow-up item to add this check - https://issues.apache.org/jira/browse/SPARK-48345.
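For illustration, a hypothetical sketch of what such a test could look like once the check lands (the exact error surfaced by the parser is not defined yet, hence the generic intercept):

```scala
// Hypothetical test: DECLARE should only be allowed at the top of a
// BEGIN/END block; a declaration after other statements should be rejected.
test("DECLARE is only allowed at the top of a compound") {
  val sqlScript =
    """
      |BEGIN
      |  SELECT 1;
      |  DECLARE var = 1; -- not at the top of the block, should fail
      |END
      |""".stripMargin
  intercept[Exception] {
    spark.sessionState.sqlParser.parseScript(sqlScript)
  }
}
```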
test("session vars - set and read scoped") { | ||
val sqlScript = | ||
""" | ||
|BEGIN |
TODO (in next PR) is a test that:
- creates a var
- does some operation that throws an error (e.g. an insert against a constraint / creating a table that already exists...)
- asserts that the variable is no longer in the session after the error is thrown.

This currently is not going to work since we manually insert DROPs.
Agreed, we already have an item for local variables support that includes this topic: https://issues.apache.org/jira/browse/SPARK-48530.
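A sketch of the TODO test described above, assuming compound-level cleanup will eventually run even when a statement fails; the `runSqlScript` helper is hypothetical:

```scala
// Hypothetical follow-up test: a variable declared in a compound should not
// survive in the session after an error aborts the script mid-block.
test("session var is dropped after an error inside the compound") {
  val sqlScript =
    """
      |BEGIN
      |  DECLARE var = 1;
      |  CREATE TABLE t (a INT);
      |  CREATE TABLE t (a INT); -- fails: table already exists
      |END
      |""".stripMargin
  intercept[Exception] {
    runSqlScript(sqlScript) // hypothetical helper: parse + execute the script
  }
  // The variable must no longer be visible at the session level.
  intercept[Exception] {
    spark.sql("SELECT var")
  }
}
```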
sql/core/src/main/scala/org/apache/spark/sql/scripting/SqlScriptingExecutionNode.scala (outdated, resolved)
```scala
      .map(new SingleStatementExec(_, Origin(), isInternal = true))
      .reverse
    new CompoundBodyExec(
      body.collection.map(st => transformTreeIntoExecutable(st)) ++ dropVariables)
```
Instead of just concatenating DROPs, an alternative could be to:
- Keep "cleanup/internal" statements in a side structure.
- Execute them as part of `finalize` or `reset` calls.

My fear is that by just keeping them at the end it might be hard to support the following scenarios:
- premature return (e.g. during break)
- error handling (an exception thrown in the middle of the block).
Agreement is to do this in a PR to come.
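A rough sketch of the alternative discussed above - cleanup statements kept in a side structure and flushed from a finalize-style hook, so they run however the body exits (class and method names are hypothetical):

```scala
// Hypothetical shape: cleanup statements live next to, not inside, the body.
class CompoundBodyExecSketch(
    statements: Seq[CompoundStatementExec],
    cleanup: Seq[LeafStatementExec]) {

  def execute(runStatement: CompoundStatementExec => Unit): Unit = {
    try {
      statements.foreach(runStatement)
    } finally {
      // Runs on normal completion, premature return (break) and errors alike.
      cleanup.foreach(runStatement)
    }
  }
}
```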
Force-pushed from 2720c0e to 3013eb6.
```scala
/** Get the SQL query text corresponding to this statement. */
def getText(sqlScriptText: String): String = {
  if (origin.startIndex.isEmpty || origin.stopIndex.isEmpty) {
    return null
```
What does it mean when the SQL query text is null? Should it be a user-facing or an internal error?
Do you mean the `sqlScriptText` parameter here? This method is intended for internal use only - in our POC we used it for exception messages, tests and similar. If there is any error here it is internal, but it really depends on what is passed as a parameter (and this should be correct) - so I don't think we should introduce any errors here.
Another point is the fact that we were not aware of `Origin` when working on the POC - now it seems like it is sufficient to return the text info on its own, so we might end up dropping the `getText` function completely, but we'd like to figure that out in the upcoming PRs.
```scala
case Some(body: NonLeafStatementExec) => body.hasNext
case Some(_: LeafStatementExec) => true
case None => false
case _ => throw new IllegalStateException("Unknown statement type")
```
nit: `SparkException.internalError`
@allisonwang-db for my education, any specific reason for this change? Do we want to throw only internal errors if not caused by the user? Should we create a specific error class/condition for this?
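For reference, the suggested change is a one-line swap; `SparkException.internalError` marks the failure as a bug in Spark rather than a user error:

```scala
// Suggested: an unknown statement type here indicates a bug in the
// interpreter itself, so surface it as an internal error.
case _ => throw SparkException.internalError("Unknown statement type")
```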
sql/core/src/main/scala/org/apache/spark/sql/scripting/SqlScriptingExecutionNode.scala (outdated, resolved)
```scala
sealed trait CompoundStatementExec extends Logging {

  /**
   * Whether the statement originates from the SQL script or is created during the interpretation.
   * Example: DropVariable statements are automatically created at the end of each compound.
```
Is there anything other than drop var that we are planning to add in the future? Drop temp tables/views/functions?
Definitely, we will have other temporary stuff to drop in the future. For now, we made a really simple change like this, so our scripts are functional for testing, etc.
Follow-up PRs will first introduce labels for compound bodies (both implicit and explicit), which is a requirement for further work/improvements in the variables area - like introducing variable shadowing and similar. Once the labels are introduced, we will work on variable improvements - this will probably include some type of context (which might later be used for a ton of other stuff - call stack tracking once we introduce procedures, tracking info for some other statements like BREAK/CONTINUE, etc). Then, we will know exactly how we want to handle these drops - it probably won't be as explicit as in this PR, but will rather include some interaction with the context. Then, we will figure out exactly whether we need this `isInternal` property down the road or not.
It is questionable whether temp tables/views/functions should be tied to a compound scope (probably not) like variables. They will continue to live at the session level.
sql/core/src/test/scala/org/apache/spark/sql/scripting/SqlScriptingInterpreterSuite.scala (resolved)
```scala
  /**
   * Whether this statement had to be executed during the interpretation phase.
   * Example: Statements in conditions of If/Else, While, etc.
```
Do we have test cases that include If/Else and While statements? Are they already supported, or will they be supported in the future?
These statements will follow in upcoming PRs; we already have them ready/tried out in our POC. I can share more details offline if you are interested.
For now, we are introducing just the basic structure in order to have simpler PRs to review.
sql/core/src/main/scala/org/apache/spark/sql/scripting/SqlScriptingInterpreter.scala (outdated, resolved)
```scala
/**
 * Concrete implementation of the interpreter for SQL scripting.
 */
case class SqlScriptingInterpreter() extends ProceduralLanguageInterpreter {
```
Where do we call `reset`? Or is it preparing for loops in the future?
You are right, it is a preparation for loops in the future.
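To sketch that intent - a hypothetical loop node would replay its body by calling `reset` between iterations (the `WhileStatementExecSketch` shape below is illustrative only):

```scala
// Hypothetical loop node: after each pass over the body iterator, reset()
// rewinds the child statements so the next iteration can replay them.
class WhileStatementExecSketch(
    condition: () => Boolean,
    body: CompoundBodyExec) {

  def run(runStatement: CompoundStatementExec => Unit): Unit = {
    while (condition()) {
      body.foreach(runStatement) // body is an Iterator[CompoundStatementExec]
      body.reset()               // prepare the body for the next iteration
    }
  }
}
```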
```scala
   */
  private def getDeclareVarNameFromPlan(plan: LogicalPlan): Option[UnresolvedIdentifier] =
    plan match {
      case CreateVariable(name: UnresolvedIdentifier, _, _) => Some(name)
```
To confirm: we do create session variables when executing the `CompoundBody`, but we remove them at the end to pretend they're local variables?
Correct, this is not the target behavior - I explained the plan for variable-related statements in a comment above - but we did it like this for now to simplify the PR and still have functioning scripts for testing purposes.
sql/core/src/main/scala/org/apache/spark/sql/scripting/SqlScriptingExecutionNode.scala (outdated, resolved)
```scala
/** Get the SQL query text corresponding to this statement. */
def getText: String = {
  assert(origin.sqlText.isDefined && origin.startIndex.isDefined && origin.stopIndex.isDefined)
  origin.sqlText.get.substring(origin.startIndex.get, origin.stopIndex.get + 1)
```
Check the comment above for this change.
```diff
@@ -164,6 +164,7 @@ object CheckConnectJvmClientCompatibility {
   ProblemFilters.exclude[Problem]("org.apache.spark.sql.streaming.ui.*"),
   ProblemFilters.exclude[Problem]("org.apache.spark.sql.test.*"),
   ProblemFilters.exclude[Problem]("org.apache.spark.sql.util.*"),
+  ProblemFilters.exclude[Problem]("org.apache.spark.sql.scripting.*"),
```
Any concerns with this? I think this is all right since we don't need any of the interpreting logic on the client side for Connect. In parallel to this PR, we are testing scripting with Connect e2e and it works without any changes on the client side, especially considering interpreting logic.
we should mark it as a private package in https://github.com/apache/spark/blob/master/project/SparkBuild.scala#L328-L336
Thanks! Wasn't aware of this...
Force-pushed from f0babf7 to 6d402fc.
```scala
/**
 * Non-leaf node in the execution tree. It is an iterator over executable child nodes.
 */
trait NonLeafStatementExec extends CompoundStatementExec with Iterator[CompoundStatementExec]
```
This is a bit different from the catalyst `TreeNode`. Shall we use an explicit tree structure? The `CompoundStatementExec` interface should have `def children: Seq[CompoundStatementExec]`.
It is different, and here is the explanation - we didn't inherit `TreeNode` since the interpreter does not go through the regular execution path. The intended use of the interpreter is:
- Call the interpreter's `buildExecutionPlan` with the parser output plan.
- It will return an iterator over executable statements.
- Use something as simple as `foreach`, `flatMap`, etc. and leverage the interpreter interface that has already been implemented and that will iterate through the statements.
- Alternatively, for the debugging feature, the iterator is especially useful because we can operate on it directly based on the debugging commands.

If we were to introduce exec nodes that inherit `TreeNode`, this would introduce another layer, because we would need to implement the interpreter logic outside of these classes.
Another reason why having `children` on executable nodes might be a bit ambiguous is cases when we have loops, for example - we cannot know in advance the number of statements to be executed.
Does this make sense to you?
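For example, the intended consumption pattern looks roughly like this (mirroring the test code later in this conversation; the actual execution of each statement is elided):

```scala
// Build the execution plan and drive it with plain iterator combinators.
val compoundBody = spark.sessionState.sqlParser.parseScript(sqlScript)
val executionPlan = interpreter.buildExecutionPlan(compoundBody)
executionPlan.foreach {
  case statement: SingleStatementExec =>
    println(statement.getText) // e.g. inspect or execute the wrapped plan here
  case _ => // non-leaf nodes only drive the iteration
}
```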
```scala
  extends LeafStatementExec
  with WithOrigin {
```
Suggested change:
```diff
-  extends LeafStatementExec
-  with WithOrigin {
+  extends LeafStatementExec with WithOrigin {
```
Done.
```scala
    with WithOrigin {

  /**
   * Whether this statement had to be executed during the interpretation phase.
```
The doc string reads very confusingly, as the variable name is `consumed`.
Do you mean "Whether this statement has been executed"?
Yes, that's what is meant here. Statements in conditions, for example, need to be executed during the interpretation so we know the condition value. After the interpretation is done, this flag would indicate whether the statement has been executed during the interpretation phase.
I can rename this variable to `isExecuted`; I think it would resolve the confusion here.
Done.
```scala
val compoundBody = spark.sessionState.sqlParser.parseScript(sqlText)
val executionPlan = interpreter.buildExecutionPlan(compoundBody)
val result = executionPlan.flatMap {
  case statement: SingleStatementExec =>
```
I'm thinking about the proper abstraction we should use. What we get at the beginning is the AST `CompoundPlanStatement`; what we need at the end is an iterator of `SingleStatementExec`. In the middle layer we use `CompoundStatementExec`.
I don't have a good sense of it now, as loop/if-else is a very important part of the abstraction. Can we implement if-else first so that we can have a better sense of the abstraction?
Shared the POC pull request in Slack with the If-Else/While implementations. We can continue the discussion there and write the summary here afterwards.
thanks, merging to master!
### What changes were proposed in this pull request?

Previous PR (apache#46665) introduced parser changes for SQL Scripting. This PR is a follow-up to introduce the interpreter for the SQL Scripting language and proposes the following changes:
- `SqlScriptingExecutionNode` - introduces execution nodes for SQL scripting, used during the interpretation phase:
  - `SingleStatementExec` - executable node for the `SingleStatement` logical node; wraps the logical plan of the single statement.
  - `CompoundNestedStatementIteratorExec` - implements base recursive iterator logic for all nesting statements.
  - `CompoundBodyExec` - concrete implementation of `CompoundNestedStatementIteratorExec` for the `CompoundBody` logical node.
- `SqlScriptingInterpreter` - introduces the interpreter for SQL scripts. The product of interpretation is an iterator over the statements that should be executed.

Follow-up PRs will introduce further statements, support for exceptions thrown from the parser/interpreter, exception handling in SQL, etc.
More details can be found in the Jira item for this task (https://issues.apache.org/jira/browse/SPARK-48343) and its parent (where the design doc is uploaded as well).
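As a rough illustration of how the pieces fit together (based on the test code in the conversation above; `Dataset.ofRows` is Spark-internal, so this is a sketch rather than a public API):

```scala
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.scripting.{SingleStatementExec, SqlScriptingInterpreter}

// Parse a SQL script into a CompoundBody, interpret it into an iterator of
// executable statements, and collect the results of the single statements.
val sqlScript =
  """
    |BEGIN
    |  SELECT 1;
    |  SELECT 2;
    |END
    |""".stripMargin
val compoundBody = spark.sessionState.sqlParser.parseScript(sqlScript)
val interpreter = SqlScriptingInterpreter()
val executionPlan = interpreter.buildExecutionPlan(compoundBody)
val results = executionPlan.flatMap {
  case statement: SingleStatementExec =>
    Some(Dataset.ofRows(spark, statement.parsedPlan)) // execute the wrapped plan
  case _ => None // non-leaf nodes only drive the iteration
}.toSeq
```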
### Why are the changes needed?

The intent is to add support for SQL scripting (and stored procedures down the line). It gives users the ability to develop complex logic and ETL entirely in SQL.
Until now, users had to write verbose SQL statements or combine SQL + Python to efficiently write the logic. This is an effort to bridge that gap and enable complex logic to be written entirely in SQL.
### Does this PR introduce _any_ user-facing change?

No.
This PR is the second in a series of PRs that will introduce changes to the sql() API to add support for SQL scripting, but for now, the API remains unchanged.
In the future, the API will remain the same as well, but it will gain the new possibility to execute SQL scripts.
### How was this patch tested?

There are tests for the newly introduced interpreter changes:
- `SqlScriptingExecutionNodeSuite` - unit tests for execution nodes.
- `SqlScriptingInterpreterSuite` - tests for the interpreter (with parser integration).

### Was this patch authored or co-authored using generative AI tooling?

No.