[SPARK-46625] CTE with Identifier clause as reference #47180
I'm not a big fan of this approach, as it duplicates the handling of IDENTIFIER clauses in [...]. IMO, the root cause is that we special-case CTE resolution and run [...]. My idea: let's split CTE resolution into two steps:
@cloud-fan Please take a look at the pushed changes. I've created a rough draft that should work with your approach (if I understood it correctly). I don't have a deep understanding of all possible uses of CTEs; could changing the order of these few rules cause any major issues?
Yes, this is the approach I was talking about, LGTM!
SubqueryAlias(table, CTERelationRef(d.id, d.resolved, d.output, d.isStreaming))
// Add unresolved with CTE relations to the plan and we
// will do CTE resolution later in analyzer based on this node.
UnresolvedWithCTERelations(u, cteRelations)
Thinking more about it, I think we can still return `SubqueryAlias(...)` as before here, which is like a shortcut, as we know we should look up CTE relations first when resolving `UnresolvedRelation`. There is no need to delay it.
}
}.getOrElse(u)

case p: PlanWithUnresolvedIdentifier =>
Let's keep the rule name unchanged and only make this change here. We should also add comments to explain it:
// We must look up CTE relations first when resolving `UnresolvedRelation`s, but we can't do it here
// as `PlanWithUnresolvedIdentifier` is a leaf node and may produce `UnresolvedRelation` later. Here
// we wrap it with `UnresolvedWithCTERelations` so that we can delay the CTE relations lookup after
// `PlanWithUnresolvedIdentifier` is resolved.
Another thing is that we should guarantee that the `UnresolvedRelation` is resolved by the CTE relations lookup before the normal table lookup path. My proposal: `UnresolvedWithCTERelations` should be a leaf node so that the normal table lookup rule can't transform the `UnresolvedRelation` inside it. `ResolveIdentifierClause` should handle `UnresolvedWithCTERelations` specially and resolve the `PlanWithUnresolvedIdentifier` inside it.
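The two-step idea proposed above can be sketched with a toy plan model. This is only an illustration under stated assumptions: the case classes below borrow their names loosely from the discussion but are hypothetical stand-ins, not Spark's actual classes, signatures, or rules. `transformUp` here only descends into `Project`, so the wrapper behaves like a leaf node.

```scala
// Toy model: hypothetical stand-ins, NOT Spark's real Catalyst classes.
object CteSketch {
  sealed trait Plan
  case class UnresolvedRelation(name: String) extends Plan
  case class CTERef(name: String) extends Plan
  case class Table(name: String) extends Plan
  // Stands in for PlanWithUnresolvedIdentifier: the relation name is not known yet.
  case class UnresolvedIdentifier(value: String) extends Plan
  // Leaf wrapper: `inner` is a field, not a child, so generic rules never see it.
  case class WithCTERelations(inner: Plan, ctes: Set[String]) extends Plan
  case class Project(child: Plan) extends Plan

  // Bottom-up traversal that only descends into real children (Project.child).
  def transformUp(p: Plan)(rule: PartialFunction[Plan, Plan]): Plan = {
    val rebuilt = p match {
      case Project(c) => Project(transformUp(c)(rule))
      case leaf       => leaf
    }
    rule.applyOrElse(rebuilt, (q: Plan) => q)
  }

  // Step 1: resolve plain relations against CTEs immediately (the shortcut),
  // but only wrap identifier plans, because their name is still unknown.
  def substituteCTE(plan: Plan, ctes: Set[String]): Plan = transformUp(plan) {
    case UnresolvedRelation(n) if ctes.contains(n) => CTERef(n)
    case i: UnresolvedIdentifier                   => WithCTERelations(i, ctes)
  }

  // Normal table lookup: cannot reach inside the leaf wrapper.
  def resolveTables(plan: Plan, catalog: Set[String]): Plan = transformUp(plan) {
    case UnresolvedRelation(n) if catalog.contains(n) => Table(n)
  }

  // Step 2: evaluate the identifier, then look up CTE relations first.
  def resolveIdentifier(plan: Plan): Plan = transformUp(plan) {
    case WithCTERelations(UnresolvedIdentifier(n), ctes) =>
      if (ctes.contains(n)) CTERef(n) else UnresolvedRelation(n)
    case UnresolvedIdentifier(n) => UnresolvedRelation(n)
  }

  val ctes    = Set("S", "T")
  val catalog = Set("T") // a catalog table with the same name as the CTE
  val wrapped = substituteCTE(Project(UnresolvedIdentifier("T")), ctes)
  // The table lookup leaves the wrapper untouched, so the CTE still wins.
  val resolved = resolveIdentifier(resolveTables(wrapped, catalog))
}
```

In this sketch the table lookup runs before the identifier is resolved and still cannot shadow the CTE, which is the guarantee the leaf-node proposal is after.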
I have made the changes suggested in these comments. Hopefully I understood your suggestions correctly, please review :)
@@ -1796,6 +1796,9 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
case s: Sort if !s.resolved || s.missingInput.nonEmpty =>
resolveReferencesInSort(s)

case u: UnresolvedWithCTERelations =>
Thinking about it more, we now need to special-case `UnresolvedWithCTERelations` twice: once in `ResolveReferences` to resolve session variables, and once in `ResolveIdentifierClause` to resolve the identifier and look up CTE relations.
How about we make `UnresolvedWithCTERelations` a unary node, and special-case it only once, in `ResolveRelations`, so that we look up CTE relations for `UnresolvedRelation`s inside `UnresolvedWithCTERelations`? Sorry for the back and forth!
No worries!
Hm, the issue with that approach is that `ResolveRelations` traverses the tree bottom-up, so we would do the table lookup before the CTE relations lookup, since it will encounter `UnresolvedRelation` before `UnresolvedWithCTERelations`?
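The ordering concern raised here can be reproduced in a small toy model. The node types, the `catalog` set, and the `via` tag are all hypothetical and only illustrate how a bottom-up traversal reaches the child before its unary wrapper; none of this is Spark's actual `ResolveRelations` code.

```scala
// Toy model: hypothetical node types, NOT Spark's real Catalyst classes.
object BottomUpOrderSketch {
  sealed trait Node
  case class Rel(name: String) extends Node                   // unresolved relation
  case class Resolved(name: String, via: String) extends Node // records who resolved it
  // Unary wrapper: the relation is exposed as a child.
  case class CteScope(child: Node, ctes: Set[String]) extends Node

  // Bottom-up traversal: children are rewritten before their parent.
  def transformUp(n: Node)(rule: PartialFunction[Node, Node]): Node = {
    val rebuilt = n match {
      case CteScope(c, ctes) => CteScope(transformUp(c)(rule), ctes)
      case other             => other
    }
    rule.applyOrElse(rebuilt, (x: Node) => x)
  }

  val catalog = Set("T")

  // A single rule that tries to give the CTE lookup priority via case ordering.
  def resolveRelations(n: Node): Node = transformUp(n) {
    case CteScope(Rel(name), ctes) if ctes.contains(name) => Resolved(name, "cte")
    case Rel(name) if catalog.contains(name)              => Resolved(name, "catalog")
  }

  // The child Rel("T") is visited first, so the catalog case fires before the
  // rule ever sees the enclosing CteScope.
  val result = resolveRelations(CteScope(Rel("T"), Set("T")))
}
```

Even though the CTE case is listed first, the relation in this sketch is resolved through the catalog, because the traversal never presents the wrapper with an unresolved child.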
You are right, it's better to keep the bottom-up resolution.
Thanks, merging to master!
Closes apache#47180 from nebojsa-db/SPARK-46625.
Authored-by: Nebojsa Savic <nebojsa.savic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
DECLARE agg = 'max';
DECLARE col = 'c1';
DECLARE tab = 'T';
WITH S(c1, c2) AS (VALUES(1, 2), (2, 3)),
T(c1, c2) AS (VALUES ('a', 'b'), ('c', 'd'))
SELECT IDENTIFIER(agg)(IDENTIFIER(col)) FROM IDENTIFIER(tab);
-- OR
WITH S(c1, c2) AS (VALUES(1, 2), (2, 3)),
T(c1, c2) AS (VALUES ('a', 'b'), ('c', 'd'))
SELECT IDENTIFIER('max')(IDENTIFIER('c1')) FROM IDENTIFIER('T');
Currently, we don't support the IDENTIFIER clause as a CTE reference, as in the queries above.
### Why are the changes needed?
They add support for the IDENTIFIER clause as a CTE reference, for both constant string expressions and session variables.
### Does this PR introduce _any_ user-facing change?
Yes, in the sense that the IDENTIFIER clause will now be supported as a CTE reference.
### How was this patch tested?
Added tests as part of this PR.
### Was this patch authored or co-authored using generative AI tooling?
No.