Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-46625] CTE with Identifier clause as reference #47180

Closed
wants to merge 9 commits into from

Conversation

nebojsa-db
Copy link
Contributor

What changes were proposed in this pull request?

DECLARE agg = 'max';
DECLARE col = 'c1';
DECLARE tab = 'T';

WITH S(c1, c2) AS (VALUES(1, 2), (2, 3)),
T(c1, c2) AS (VALUES ('a', 'b'), ('c', 'd'))
SELECT IDENTIFIER(agg)(IDENTIFIER(col)) FROM IDENTIFIER(tab);

-- OR

WITH S(c1, c2) AS (VALUES(1, 2), (2, 3)),
T(c1, c2) AS (VALUES ('a', 'b'), ('c', 'd'))
SELECT IDENTIFIER('max')(IDENTIFIER('c1')) FROM IDENTIFIER('T');

Currently we don't support Identifier clause as part of CTE reference.

Why are the changes needed?

Adding support for Identifier clause as part of CTE reference for both constant string expressions and session variables.

Does this PR introduce any user-facing change?

It contains user facing changes in sense that identifier clause as cte reference will now be supported.

How was this patch tested?

Added tests as part of this PR.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Jul 2, 2024
@cloud-fan
Copy link
Contributor

cloud-fan commented Jul 3, 2024

I'm not a big fan of this approach, as this duplicates the handling of IDENTIFIER clauses in CTESubstitution.

IMO, the root cause is we special-case CTE resolution and run CTESubstitution as an individual batch at the very beginning. The ideal solution is to look up CTE relations together with the normal table lookup.

My idea: let's split CTE resolution into two steps:

  1. identify the available CTE relations for each UnresolvedRelation. Given the position of UnresolvedRelation, the available CTE relations can be very different (e.g. in the main query, in the CTE relations, in nested CTE, etc.). Then we wrap UnresolvedRelation with a new node WithCTERelations to hold available CTE relations.
  2. In the analyzer main batch, we wait for the IDENTIFIER clause to be handled, then unwrap WithCTERelations by looking up CTE relations and resoving UnresolvedRelation. If the lookup fails, restore to UnresolvedRelation so that normal table lookup rule can handle it later.

@nebojsa-db
Copy link
Contributor Author

I'm not a big fan of this approach, as this duplicates the handling of IDENTIFIER clauses in CTESubstitution.

IMO, the root cause is we special-case CTE resolution and run CTESubstitution as an individual batch at the very beginning. The ideal solution is to look up CTE relations together with the normal table lookup.

My idea: let's split CTE resolution into two steps:

  1. identify the available CTE relations for each UnresolvedRelation. Given the position of UnresolvedRelation, the available CTE relations can be very different (e.g. in the main query, in the CTE relations, in nested CTE, etc.). Then we wrap UnresolvedRelation with a new node WithCTERelations to hold available CTE relations.
  2. In the analyzer main batch, we wait for the IDENTIFIER clause to be handled, then unwrap WithCTERelations by looking up CTE relations and resoving UnresolvedRelation. If the lookup fails, restore to UnresolvedRelation so that normal table lookup rule can handle it later.

@cloud-fan Please take a look at the pushed changes now, I've created a rough draft changes which should work with your approach (if I understood correctly). I don't have deep understanding of all possible uses of CTEs and if changing the order of these few rules could cause some major issues?

@cloud-fan
Copy link
Contributor

Yes this is the approach I was talking about, LGTM!

SubqueryAlias(table, CTERelationRef(d.id, d.resolved, d.output, d.isStreaming))
// Add unresolved with CTE relations to the plan and we
// will do CTE resolution later in analyzer based on this node.
UnresolvedWithCTERelations(u, cteRelations)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking more about it, I think we can still return SubqueryAlias(...) as before here, which is like a shortcut as we know we should look up CTE relations first when resolving UnresolvedRelation. There is no need to delay it.

}
}.getOrElse(u)

case p: PlanWithUnresolvedIdentifier =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep the rule name unchanged and only make this change here. We should also add comments to explain it

// We must look up CTE relations first when resolving `UnresolvedRelation`s, but we can't do it here
// as `PlanWithUnresolvedIdentifier` is a leaf node and may produce `UnresolvedRelation` later. Here
// we wrap it with `UnresolvedWithCTERelations` so that we can delay the CTE relations lookup after
// `PlanWithUnresolvedIdentifier` is resolved.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thing is that we should guarantee the UnresolvedRelation should be resolved by CTE relations lookup before the normal table lookup path. My proposal: UnresolvedWithCTERelations should be a leaf node so that the normal table lookup rule can't transform the UnresolvedRelation inside it. ResolveIdentifierClause should handle UnresolvedWithCTERelations specially and resolve PlanWithUnresolvedIdentifier inside it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have made changes suggested in these comments, hopefully I understood correctly what you suggested, please review :)

@@ -1796,6 +1796,9 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
case s: Sort if !s.resolved || s.missingInput.nonEmpty =>
resolveReferencesInSort(s)

case u: UnresolvedWithCTERelations =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about it more, now we need to special-case UnresolvedWithCTERelations twice: once in ResolveReferences to resolve session variables and once in ResolveIdentifierClause to resolve identifier and look up CTE relations.

How about we make UnresolvedWithCTERelations an unary code, and only special case it once in ResolveRelations that we should look up from CTE relations for UnresolvedRelations insideUnresolvedWithCTERelations? Sorry for the back and forth!

Copy link
Contributor Author

@nebojsa-db nebojsa-db Jul 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worry!
Hm, issue with that approach is that ResolveRelations is traversing the tree in bottom up manner so we will first do table lookup instead of CTE relations lookup since it will first encounter UnresolvedRelation instead of UnresolvedWithCTERelations?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right, it's better to keep the bottom-up resolotion.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in d824e9e Jul 9, 2024
ericm-db pushed a commit to ericm-db/spark that referenced this pull request Jul 10, 2024
### What changes were proposed in this pull request?
DECLARE agg = 'max';
DECLARE col = 'c1';
DECLARE tab = 'T';

WITH S(c1, c2) AS (VALUES(1, 2), (2, 3)),
      T(c1, c2) AS (VALUES ('a', 'b'), ('c', 'd'))
SELECT IDENTIFIER(agg)(IDENTIFIER(col)) FROM IDENTIFIER(tab);

-- OR

WITH S(c1, c2) AS (VALUES(1, 2), (2, 3)),
      T(c1, c2) AS (VALUES ('a', 'b'), ('c', 'd'))
SELECT IDENTIFIER('max')(IDENTIFIER('c1')) FROM IDENTIFIER('T');

Currently we don't support Identifier clause as part of CTE reference.

### Why are the changes needed?
Adding support for Identifier clause as part of CTE reference for both constant string expressions and session variables.

### Does this PR introduce _any_ user-facing change?
It contains user facing changes in sense that identifier clause as cte reference will now be supported.

### How was this patch tested?
Added tests as part of this PR.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#47180 from nebojsa-db/SPARK-46625.

Authored-by: Nebojsa Savic <nebojsa.savic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
biruktesf-db pushed a commit to biruktesf-db/spark that referenced this pull request Jul 11, 2024
### What changes were proposed in this pull request?
DECLARE agg = 'max';
DECLARE col = 'c1';
DECLARE tab = 'T';

WITH S(c1, c2) AS (VALUES(1, 2), (2, 3)),
      T(c1, c2) AS (VALUES ('a', 'b'), ('c', 'd'))
SELECT IDENTIFIER(agg)(IDENTIFIER(col)) FROM IDENTIFIER(tab);

-- OR

WITH S(c1, c2) AS (VALUES(1, 2), (2, 3)),
      T(c1, c2) AS (VALUES ('a', 'b'), ('c', 'd'))
SELECT IDENTIFIER('max')(IDENTIFIER('c1')) FROM IDENTIFIER('T');

Currently we don't support Identifier clause as part of CTE reference.

### Why are the changes needed?
Adding support for Identifier clause as part of CTE reference for both constant string expressions and session variables.

### Does this PR introduce _any_ user-facing change?
It contains user facing changes in sense that identifier clause as cte reference will now be supported.

### How was this patch tested?
Added tests as part of this PR.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#47180 from nebojsa-db/SPARK-46625.

Authored-by: Nebojsa Savic <nebojsa.savic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
jingz-db pushed a commit to jingz-db/spark that referenced this pull request Jul 22, 2024
### What changes were proposed in this pull request?
DECLARE agg = 'max';
DECLARE col = 'c1';
DECLARE tab = 'T';

WITH S(c1, c2) AS (VALUES(1, 2), (2, 3)),
      T(c1, c2) AS (VALUES ('a', 'b'), ('c', 'd'))
SELECT IDENTIFIER(agg)(IDENTIFIER(col)) FROM IDENTIFIER(tab);

-- OR

WITH S(c1, c2) AS (VALUES(1, 2), (2, 3)),
      T(c1, c2) AS (VALUES ('a', 'b'), ('c', 'd'))
SELECT IDENTIFIER('max')(IDENTIFIER('c1')) FROM IDENTIFIER('T');

Currently we don't support Identifier clause as part of CTE reference.

### Why are the changes needed?
Adding support for Identifier clause as part of CTE reference for both constant string expressions and session variables.

### Does this PR introduce _any_ user-facing change?
It contains user facing changes in sense that identifier clause as cte reference will now be supported.

### How was this patch tested?
Added tests as part of this PR.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#47180 from nebojsa-db/SPARK-46625.

Authored-by: Nebojsa Savic <nebojsa.savic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
### What changes were proposed in this pull request?
DECLARE agg = 'max';
DECLARE col = 'c1';
DECLARE tab = 'T';

WITH S(c1, c2) AS (VALUES(1, 2), (2, 3)),
      T(c1, c2) AS (VALUES ('a', 'b'), ('c', 'd'))
SELECT IDENTIFIER(agg)(IDENTIFIER(col)) FROM IDENTIFIER(tab);

-- OR

WITH S(c1, c2) AS (VALUES(1, 2), (2, 3)),
      T(c1, c2) AS (VALUES ('a', 'b'), ('c', 'd'))
SELECT IDENTIFIER('max')(IDENTIFIER('c1')) FROM IDENTIFIER('T');

Currently we don't support Identifier clause as part of CTE reference.

### Why are the changes needed?
Adding support for Identifier clause as part of CTE reference for both constant string expressions and session variables.

### Does this PR introduce _any_ user-facing change?
It contains user facing changes in sense that identifier clause as cte reference will now be supported.

### How was this patch tested?
Added tests as part of this PR.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#47180 from nebojsa-db/SPARK-46625.

Authored-by: Nebojsa Savic <nebojsa.savic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants