[SPARK-46625] CTE with Identifier clause as reference #47180

nebojsa-db · 2024-07-02T13:47:44Z

What changes were proposed in this pull request?

DECLARE agg = 'max';
DECLARE col = 'c1';
DECLARE tab = 'T';

WITH S(c1, c2) AS (VALUES(1, 2), (2, 3)),
T(c1, c2) AS (VALUES ('a', 'b'), ('c', 'd'))
SELECT IDENTIFIER(agg)(IDENTIFIER(col)) FROM IDENTIFIER(tab);

-- OR

WITH S(c1, c2) AS (VALUES(1, 2), (2, 3)),
T(c1, c2) AS (VALUES ('a', 'b'), ('c', 'd'))
SELECT IDENTIFIER('max')(IDENTIFIER('c1')) FROM IDENTIFIER('T');

Currently, the IDENTIFIER clause is not supported as part of a CTE reference.

Why are the changes needed?

Adding support for Identifier clause as part of CTE reference for both constant string expressions and session variables.

Does this PR introduce any user-facing change?

Yes. The IDENTIFIER clause is now supported as a CTE reference.

How was this patch tested?

Added tests as part of this PR.

Was this patch authored or co-authored using generative AI tooling?

No.

cloud-fan · 2024-07-03T12:42:58Z

I'm not a big fan of this approach, as this duplicates the handling of IDENTIFIER clauses in CTESubstitution.

IMO, the root cause is we special-case CTE resolution and run CTESubstitution as an individual batch at the very beginning. The ideal solution is to look up CTE relations together with the normal table lookup.

My idea: let's split CTE resolution into two steps:

1. Identify the available CTE relations for each UnresolvedRelation. Depending on the position of the UnresolvedRelation, the available CTE relations can be very different (e.g. in the main query, in the CTE relations, in a nested CTE, etc.). Then we wrap UnresolvedRelation with a new node, WithCTERelations, to hold the available CTE relations.
2. In the analyzer main batch, we wait for the IDENTIFIER clause to be handled, then unwrap WithCTERelations by looking up CTE relations and resolving UnresolvedRelation. If the lookup fails, restore it to UnresolvedRelation so that the normal table lookup rule can handle it later.
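For illustration, the two steps can be sketched with a toy plan type. This is only a sketch under stated assumptions: every name below (`TwoStepCteSketch`, `WithCteRelations`, `wrap`, `resolveIdentifier`, `unwrap`) is a hypothetical stand-in rather than a Catalyst class or code from this PR, and wrapping an identifier placeholder in step 1 is an assumption made so that the example covers the IDENTIFIER case.

```scala
// Toy model: every type here is a hypothetical stand-in, not a Catalyst class.
object TwoStepCteSketch {
  sealed trait Plan
  case class UnresolvedRelation(name: String) extends Plan
  // A relation whose name is still an expression, e.g. IDENTIFIER(tab).
  case class UnresolvedIdentifier(variable: String) extends Plan
  // Step 1 output: a relation plus the CTE names visible at its position.
  case class WithCteRelations(child: Plan, visibleCtes: Set[String]) extends Plan
  case class CteRef(name: String) extends Plan

  // Step 1: record which CTEs are in scope instead of substituting right away.
  def wrap(plan: Plan, visibleCtes: Set[String]): Plan = plan match {
    case _: UnresolvedRelation | _: UnresolvedIdentifier =>
      WithCteRelations(plan, visibleCtes)
    case other => other
  }

  // Step 2a: turn the identifier into a name once its variable is known.
  def resolveIdentifier(plan: Plan, vars: Map[String, String]): Plan = plan match {
    case WithCteRelations(UnresolvedIdentifier(v), ctes) if vars.contains(v) =>
      WithCteRelations(UnresolvedRelation(vars(v)), ctes)
    case other => other
  }

  // Step 2b: unwrap. A hit becomes a CTE reference; a miss is restored to a
  // plain unresolved relation so that normal table lookup can handle it.
  def unwrap(plan: Plan): Plan = plan match {
    case WithCteRelations(UnresolvedRelation(n), ctes) =>
      if (ctes.contains(n)) CteRef(n) else UnresolvedRelation(n)
    case other => other
  }
}
```

With `tab = 'T'` and the CTEs `S` and `T` in scope, wrapping `UnresolvedIdentifier("tab")`, resolving it, and unwrapping gives `CteRef("T")`; a name outside the visible set comes back as a plain `UnresolvedRelation`.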

nebojsa-db · 2024-07-03T16:01:12Z

@cloud-fan Please take a look at the pushed changes. I've created a rough draft that should work with your approach (if I understood it correctly). I don't have a deep understanding of all possible uses of CTEs; could changing the order of these few rules cause any major issues?

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CTESubstitution.scala

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveIdentifierClause.scala

cloud-fan · 2024-07-04T08:27:23Z

Yes this is the approach I was talking about, LGTM!

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

cloud-fan · 2024-07-05T02:55:29Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollectCTEDefinitions.scala

-            SubqueryAlias(table, CTERelationRef(d.id, d.resolved, d.output, d.isStreaming))
+            // Add unresolved with CTE relations to the plan and we
+            // will do CTE resolution later in analyzer based on this node.
+            UnresolvedWithCTERelations(u, cteRelations)


Thinking more about it, I think we can still return SubqueryAlias(...) as before here, which is like a shortcut as we know we should look up CTE relations first when resolving UnresolvedRelation. There is no need to delay it.

cloud-fan · 2024-07-05T02:59:03Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollectCTEDefinitions.scala

          }
        }.getOrElse(u)

+      case p: PlanWithUnresolvedIdentifier =>


Let's keep the rule name unchanged and only make this change here. We should also add comments to explain it:

// We must look up CTE relations first when resolving `UnresolvedRelation`s, but we can't do it here
// as `PlanWithUnresolvedIdentifier` is a leaf node and may produce `UnresolvedRelation` later. Here
// we wrap it with `UnresolvedWithCTERelations` so that we can delay the CTE relations lookup after
// `PlanWithUnresolvedIdentifier` is resolved.

Another thing is that we should guarantee that the UnresolvedRelation is resolved by the CTE relations lookup before the normal table lookup path. My proposal: UnresolvedWithCTERelations should be a leaf node so that the normal table lookup rule can't transform the UnresolvedRelation inside it. ResolveIdentifierClause should handle UnresolvedWithCTERelations specially and resolve the PlanWithUnresolvedIdentifier inside it.
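A minimal sketch of why a leaf wrapper shields the inner relation, assuming a toy tree with a generic bottom-up transform. The names (`LeafWrapperSketch`, `CteScoped`, `tableLookup`, `cteLookup`) are hypothetical and only mimic the idea; they are not the Catalyst `TreeNode` API or the rules in this PR.

```scala
// Toy model: hypothetical types that only mimic the shape of a plan tree.
object LeafWrapperSketch {
  sealed trait Plan {
    def children: Seq[Plan]
    def withChildren(newChildren: Seq[Plan]): Plan
  }
  case class Relation(name: String) extends Plan {
    def children: Seq[Plan] = Nil
    def withChildren(newChildren: Seq[Plan]): Plan = this
  }
  case class Table(name: String) extends Plan {
    def children: Seq[Plan] = Nil
    def withChildren(newChildren: Seq[Plan]): Plan = this
  }
  case class CteRef(name: String) extends Plan {
    def children: Seq[Plan] = Nil
    def withChildren(newChildren: Seq[Plan]): Plan = this
  }
  case class Project(child: Plan) extends Plan {
    def children: Seq[Plan] = Seq(child)
    def withChildren(newChildren: Seq[Plan]): Plan = Project(newChildren.head)
  }
  // Leaf wrapper: the inner plan is a plain field, not a child, so generic
  // traversals never descend into it.
  case class CteScoped(inner: Plan, cteNames: Set[String]) extends Plan {
    def children: Seq[Plan] = Nil
    def withChildren(newChildren: Seq[Plan]): Plan = this
  }

  // Generic bottom-up transform: rewrite the children first, then the node.
  def transformUp(plan: Plan)(rule: PartialFunction[Plan, Plan]): Plan = {
    val rebuilt = plan.withChildren(plan.children.map(c => transformUp(c)(rule)))
    if (rule.isDefinedAt(rebuilt)) rule(rebuilt) else rebuilt
  }

  // Stand-in for the normal table lookup rule.
  val tableLookup: PartialFunction[Plan, Plan] = {
    case Relation(n) => Table(n)
  }

  // Stand-in for the rule that special-cases the wrapper.
  val cteLookup: PartialFunction[Plan, Plan] = {
    case CteScoped(Relation(n), ctes) =>
      if (ctes.contains(n)) CteRef(n) else Relation(n)
  }
}
```

In this toy, running `tableLookup` over `Project(CteScoped(Relation("T"), Set("T")))` leaves the plan unchanged, because the wrapper exposes no children. Only `cteLookup` can open the wrapper, and on a miss it hands back a bare `Relation` for the table rule to pick up later.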

nebojsa-db · 2024-07-06

I have made the changes suggested in these comments. Hopefully I understood your suggestions correctly; please review :)

cloud-fan · 2024-07-08T09:20:19Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

@@ -1796,6 +1796,9 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
      case s: Sort if !s.resolved || s.missingInput.nonEmpty =>
        resolveReferencesInSort(s)

+      case u: UnresolvedWithCTERelations =>


Thinking about it more, now we need to special-case UnresolvedWithCTERelations twice: once in ResolveReferences to resolve session variables and once in ResolveIdentifierClause to resolve identifier and look up CTE relations.

How about we make UnresolvedWithCTERelations a unary node, and only special-case it once, in ResolveRelations, so that we look up CTE relations for the UnresolvedRelations inside UnresolvedWithCTERelations? Sorry for the back and forth!

nebojsa-db · 2024-07-08

No worries!
Hm, the issue with that approach is that ResolveRelations traverses the tree bottom-up, so it would do the table lookup before the CTE relations lookup, since it encounters UnresolvedRelation before UnresolvedWithCTERelations.

cloud-fan · 2024-07-09

You are right, it's better to keep the bottom-up resolution.
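The ordering concern in this exchange can be illustrated with a small sketch, assuming a toy unary wrapper and a single bottom-up rule. `BottomUpOrderSketch`, `UnaryCteScoped`, and `resolveRelations` are hypothetical names for this illustration only, not the actual Catalyst rule.

```scala
// Toy model: hypothetical types, not Catalyst classes.
object BottomUpOrderSketch {
  sealed trait Plan
  case class Relation(name: String) extends Plan
  case class Table(name: String) extends Plan
  case class CteRef(name: String) extends Plan
  // Unary variant of the wrapper: its child is visible to traversals.
  case class UnaryCteScoped(child: Plan, cteNames: Set[String]) extends Plan

  // Bottom-up transform: the child is rewritten before its parent is visited.
  def transformUp(plan: Plan)(rule: PartialFunction[Plan, Plan]): Plan = {
    val rebuilt = plan match {
      case UnaryCteScoped(c, names) => UnaryCteScoped(transformUp(c)(rule), names)
      case leaf => leaf
    }
    if (rule.isDefinedAt(rebuilt)) rule(rebuilt) else rebuilt
  }

  // One rule that tries to prefer CTEs but also performs the table lookup.
  val resolveRelations: PartialFunction[Plan, Plan] = {
    case UnaryCteScoped(Relation(n), ctes) if ctes.contains(n) => CteRef(n)
    case Relation(n) => Table(n)
  }
}
```

Applied to `UnaryCteScoped(Relation("T"), Set("T"))`, the rule reaches the inner `Relation` first and turns it into `Table("T")`. By the time the wrapper is visited, the CTE case no longer matches, so the CTE named `T` is shadowed by the table lookup.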

cloud-fan · 2024-07-09T14:30:35Z

thanks, merging to master!

Closes apache#47180 from nebojsa-db/SPARK-46625.

Authored-by: Nebojsa Savic <nebojsa.savic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Working CTE with session variables and Identifier clause in general.

3a01831

github-actions bot added the SQL label Jul 2, 2024

srielau approved these changes Jul 2, 2024

Added test for string pipe inside identifier clause.

0a7d045

nikolamand-db approved these changes Jul 3, 2024

nebojsa-db added 2 commits July 3, 2024 17:58

Split CTE substitution in to first identify all CTE defs and do CTE resolution separately.

186f1f4

Added new analyzer rule.

d352a53

Fixed hints and nesting.

518103d

cloud-fan reviewed Jul 4, 2024

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CTESubstitution.scala

cloud-fan reviewed Jul 4, 2024

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveIdentifierClause.scala

Refactored code to better suit split collect/resolve approach.

4afd30b

cloud-fan reviewed Jul 5, 2024

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

cloud-fan reviewed Jul 5, 2024

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

cloud-fan reviewed Jul 5, 2024

nebojsa-db added 3 commits July 6, 2024 19:33

Fixed nit comments.

7df6378

Changed approach for resolving identifier clause with cte.

3f573ad

Scala style fix.

0601caf

cloud-fan reviewed Jul 8, 2024

cloud-fan approved these changes Jul 9, 2024

cloud-fan closed this in d824e9e Jul 9, 2024
