[SPARK-51109][SQL] CTE in subquery expression as grouping column #49829

cloud-fan · 2025-02-06T09:44:30Z

What changes were proposed in this pull request?

This is a long-standing problem. With the GROUP BY ordinal feature, it is easy for users to write a complicated expression as the GROUP BY expression and also put it in the SELECT list. This is usually fine because the two copies of the expression remain identical, but problems can occur with subquery expressions, duplicated relations, and CTE inlining. Consider this example:

```
CREATE VIEW v AS
WITH r AS (SELECT c1 + c2 AS c FROM t)
SELECT * FROM r;

SELECT (SELECT max(c) FROM v WHERE c > id) FROM range(1) GROUP BY 1;
```

A scalar subquery appears in both the GROUP BY expression and the SELECT list. The scalar subquery scans table t, and because the subquery appears twice, DeduplicateRelations is triggered. This puts the output attributes of the CTE definition and its reference out of sync in the second scalar subquery, so InlineCTE adds a cosmetic Project to adjust the output attribute IDs. CheckAnalysis inlines CTEs at the beginning, and this extra cosmetic Project in the second scalar subquery makes Spark think it is not semantically equal to the first scalar subquery, so the query fails.

The proposal here is to remove cosmetic Projects during plan canonicalization.
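
To illustrate the idea, here is a toy model in Python, not Spark's actual Scala implementation. Every name in it (`Attr`, `Alias`, `Scan`, `Project`, `canonicalize`, the `strip` flag) is hypothetical, and `c1 + c2 AS c` is simplified to `c1 AS c`. It assumes a "cosmetic" Project is one that re-aliases each output attribute of its child under the same name, changing only the attribute IDs. Attribute references are normalized to ordinals in the child's output, so removing such a Project leaves the parent's references intact.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Attr:
    name: str
    expr_id: int


@dataclass(frozen=True)
class Alias:
    child: Attr
    name: str
    expr_id: int


@dataclass(frozen=True)
class Scan:
    table: str
    output: Tuple[Attr, ...]


@dataclass(frozen=True)
class Project:
    project_list: Tuple[Union[Attr, Alias], ...]
    child: Union["Project", Scan]


def _canon(plan, strip):
    """Return (canonical form, [(expr_id, name), ...] for the plan's output)."""
    if isinstance(plan, Scan):
        out = [(a.expr_id, a.name) for a in plan.output]
        return ("Scan", plan.table, tuple(n for _, n in out)), out
    child_canon, child_out = _canon(plan.child, strip)
    # Replace attribute ids by their position in the child's output.
    ordinal = {eid: i for i, (eid, _) in enumerate(child_out)}
    items = tuple(
        (type(e).__name__, e.name,
         ordinal[e.child.expr_id if isinstance(e, Alias) else e.expr_id])
        for e in plan.project_list
    )
    out = [(e.expr_id, e.name) for e in plan.project_list]
    # Cosmetic: aliases every child output, in order, under the same name.
    cosmetic = items == tuple(
        ("Alias", name, i) for i, (_, name) in enumerate(child_out)
    )
    if strip and cosmetic:
        # Drop the node but keep its output ids so the parent still resolves.
        return child_canon, out
    return ("Project", items, child_canon), out


def canonicalize(plan, strip=True):
    return _canon(plan, strip)[0]


# First copy of the subquery: the CTE body reads t with the original ids.
plan1 = Project(
    (Attr("c", 3),),
    Project((Alias(Attr("c1", 1), "c", 3),),
            Scan("t", (Attr("c1", 1), Attr("c2", 2)))),
)

# Second copy: the relation got fresh ids, and an extra Project only re-ids c.
plan2 = Project(
    (Attr("c", 23),),
    Project((Alias(Attr("c", 13), "c", 23),),  # cosmetic Project
            Project((Alias(Attr("c1", 11), "c", 13),),
                    Scan("t", (Attr("c1", 11), Attr("c2", 12))))),
)

same_without_strip = canonicalize(plan1, strip=False) == canonicalize(plan2, strip=False)
same_with_strip = canonicalize(plan1) == canonicalize(plan2)
```

In this sketch the two plans compare as different when the cosmetic Project is kept and as equal once it is dropped during canonicalization, which is the behavior the proposal aims for.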

Why are the changes needed?

bug fix

Does this PR introduce any user-facing change?

Yes, some queries that failed before now work.

How was this patch tested?

new test

Was this patch authored or co-authored using generative AI tooling?

no

cloud-fan · 2025-02-06T09:44:43Z

cc @peter-toth @gengliangwang

peter-toth

LGTM, pending tests.

cloud-fan · 2025-02-10T03:00:09Z

thanks for the review! merging to master/4.0!

Closes #49829 from cloud-fan/cte.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit c7edcae)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Closes apache#49829 from cloud-fan/cte.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit b3be934)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

CTE in subquery expression as grouping column

c7013e5

github-actions bot added the SQL label Feb 7, 2025

update tests

3bc481b

cloud-fan force-pushed the cte branch from 723de80 to 3bc481b Compare February 7, 2025 11:09

peter-toth approved these changes Feb 7, 2025

cloud-fan closed this in c7edcae Feb 10, 2025
