-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-49912] Refactor simple CASE statement to evaluate the case variable only once #50027
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-49912] Refactor simple CASE statement to evaluate the case variable only once #50027
Conversation
| * @param elseBody Body that is executed if none of the conditions are met, i.e. ELSE branch. | ||
| */ | ||
| case class CaseStatement( | ||
| case class SearchedCaseStatement( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: update the description comment, to be equivalent of "SIMPLE variant" as below. Also, we might include a small examples to illustrate which is which, I always get confused...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| override def getTreeIterator: Iterator[CompoundStatementExec] = treeIterator | ||
|
|
||
| override def reset(): Unit = { | ||
| state = CaseState.Condition |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
caseVariableExec.reset() should be added as well? I guess it's not important, but it will reset the isExecuted flag if it's ever important for anything
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, added it.
| Origin(sqlText = Some(conditionText), | ||
| startIndex = Some(0), | ||
| stopIndex = Some(conditionText.length - 1)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of this, but won't it screw-up the line number in the error message for example? Shall we add some test cases to the SqlScriptingExecutionSuite to cover exceptions and similar?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have tests for these exceptions, they are in SqlScriptingInterpreterSuite.
You're right it will screw up line numbers, perhaps it would be better to copy the Origin from the original caseVariable expression. This will keep the proper line number, however would just point to one part of the equality expression.
For example, if we have this script:
BEGIN
CASE 1
WHEN NULL THEN
SELECT 41;
ELSE
SELECT 43;
END CASE;
END
The error we would have if we kept the Origin from case variable expression:
{LINE:3} [BOOLEAN_STATEMENT_WITH_EMPTY_ROW] Boolean statement 1 is invalid. Expected single row with a value of the BOOLEAN type, but got an empty row. SQLSTATE: 21000
The error with the hacked origin:
[BOOLEAN_STATEMENT_WITH_EMPTY_ROW] Boolean statement (1 = NULL) is invalid. Expected single row with a value of the BOOLEAN type, but got an empty row. SQLSTATE: 21000
Not sure which is better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we could expand the hack to copy the just line number from caseVariableExec? That way we would have both the line number and the proper equality expression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's an easy fix (and it looks like it is) I would definitely go with this approach (copying the line number to the hacked origin). If not, I think it's better to keep the line number and then improve the messaging later on as a follow-up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the line propagation - d06ab73
cloud-fan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM if CI passes
davidm-db
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, minor comment left regarding the exception messaging.
|
thanks, merging to master/4.0! |
…iable only once ### What changes were proposed in this pull request? In this PR, CASE statement is refactored. Existing `CaseStatement` is split into two - `SimpleCaseStatement` and `SearchedCaseStatement`. `SearchedCaseStatement` retains the old behavior, while for `SimpleCaseStatement` a new logical and execution node are added - `SimpleCaseStatement` and `SimpleCaseStatementExec`. Previously, a simple case statement would evaluate the case variable again for every WHEN clause in the CASE. This is both inefficient, and could produce unexpected behavior if the evaluation has a side effect. `SimpleCaseStatementExec` now caches the result of the evaluation, and compares the WHEN conditions to the cached `Literal`. ### Why are the changes needed? Previous iteration of simple CASE evaluated the expression multiple times. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tests were added to `SqlScriptingParserSuite`, `SqlScriptingInterpreterSuite` and `SqlScriptingExecutionNodeSuite`. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #50027 from dusantism-db/scripting-case-improvements-v2. Authored-by: Dušan Tišma <dusan.tisma@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 48fc0fb) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…iable only once ### What changes were proposed in this pull request? In this PR, CASE statement is refactored. Existing `CaseStatement` is split into two - `SimpleCaseStatement` and `SearchedCaseStatement`. `SearchedCaseStatement` retains the old behavior, while for `SimpleCaseStatement` a new logical and execution node are added - `SimpleCaseStatement` and `SimpleCaseStatementExec`. Previously, a simple case statement would evaluate the case variable again for every WHEN clause in the CASE. This is both inefficient, and could produce unexpected behavior if the evaluation has a side effect. `SimpleCaseStatementExec` now caches the result of the evaluation, and compares the WHEN conditions to the cached `Literal`. ### Why are the changes needed? Previous iteration of simple CASE evaluated the expression multiple times. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tests were added to `SqlScriptingParserSuite`, `SqlScriptingInterpreterSuite` and `SqlScriptingExecutionNodeSuite`. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#50027 from dusantism-db/scripting-case-improvements-v2. Authored-by: Dušan Tišma <dusan.tisma@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 8c70b3f) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
In this PR, CASE statement is refactored. Existing
CaseStatementis split into two -SimpleCaseStatementandSearchedCaseStatement.SearchedCaseStatementretains the old behavior, while forSimpleCaseStatementa new logical and execution node are added -SimpleCaseStatementandSimpleCaseStatementExec.Previously, a simple case statement would evaluate the case variable again for every WHEN clause in the CASE. This is both inefficient, and could produce unexpected behavior if the evaluation has a side effect.
SimpleCaseStatementExecnow caches the result of the evaluation, and compares the WHEN conditions to the cachedLiteral.Why are the changes needed?
Previous iteration of simple CASE evaluated the expression multiple times.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Tests were added to
SqlScriptingParserSuite,SqlScriptingInterpreterSuiteandSqlScriptingExecutionNodeSuite.Was this patch authored or co-authored using generative AI tooling?
No.