@cloud-fan
Contributor

What changes were proposed in this pull request?

This is a long-standing problem. With the GROUP BY ordinal feature, it is easy for users to write a complicated expression in the SELECT list and have it also serve as the GROUP BY expression. This is usually fine because the expression stays the same in both places, but problems can occur when subquery expressions, duplicated relations, and CTE inlining are involved. Consider this example:

```
CREATE VIEW v AS
WITH r AS (SELECT c1 + c2 AS c FROM t)
SELECT * FROM r;

SELECT (SELECT max(c) FROM v WHERE c > id) FROM range(1) GROUP BY 1;
```

A scalar subquery appears in both the GROUP BY expression and the SELECT list. The scalar subquery scans table `t`, and because it appears twice, `DeduplicateRelations` is triggered. This puts the output attributes of the CTE definition and its reference out of sync in the second scalar subquery, so `InlineCTE` adds a cosmetic `Project` to adjust the output attribute IDs. `CheckAnalysis` inlines CTEs at the start, and this extra cosmetic `Project` in the second scalar subquery makes Spark treat it as not semantically equal to the first scalar subquery, so the query fails.

The proposal here is to remove cosmetic Projects during plan canonicalization.
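
To illustrate the idea outside of Spark, here is a minimal Python sketch of a toy plan model. It is not Spark's implementation: the classes `Attr`, `Alias`, `Scan`, `Project` and the functions `output`, `is_cosmetic`, and `canonicalize` are hypothetical stand-ins, and "cosmetic" is assumed here to mean a `Project` that forwards every child column in order and changes only names or expression IDs.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Attr:
    """A column reference identified by a name and an expression ID."""
    name: str
    expr_id: int


@dataclass(frozen=True)
class Alias:
    """Re-exposes `child` under a name and a fresh expression ID."""
    child: Attr
    name: str
    expr_id: int


@dataclass(frozen=True)
class Scan:
    table: str
    output: Tuple[Attr, ...]


@dataclass(frozen=True)
class Project:
    project_list: Tuple[Union[Attr, Alias], ...]
    child: Union["Scan", "Project"]


def output(plan):
    """Output attributes of a plan node."""
    if isinstance(plan, Scan):
        return plan.output
    return tuple(Attr(e.name, e.expr_id) for e in plan.project_list)


def referenced(item):
    """The child attribute that a project-list item reads."""
    return item.child if isinstance(item, Alias) else item


def is_cosmetic(project):
    """True if the Project forwards every child column, in order, unchanged
    except for its name or expression ID (assumed definition of 'cosmetic')."""
    child_out = output(project.child)
    if len(project.project_list) != len(child_out):
        return False
    return all(referenced(i) == a for i, a in zip(project.project_list, child_out))


def canonicalize(plan, strip_cosmetic):
    """Erase names and expression IDs, replacing references with ordinals.
    Optionally drop cosmetic Projects, since they carry no information
    once names and IDs are erased."""
    if isinstance(plan, Scan):
        return ("Scan", plan.table, len(plan.output))
    child = canonicalize(plan.child, strip_cosmetic)
    if strip_cosmetic and is_cosmetic(plan):
        return child
    child_out = output(plan.child)
    ordinals = tuple(child_out.index(referenced(i)) for i in plan.project_list)
    return ("Project", ordinals, child)


# First copy of the subquery body: a plain scan.
first = Scan("t", (Attr("c", 1),))
# Second copy: relation IDs were deduplicated (1 -> 2), and a cosmetic
# Project re-aliases the column to a new ID (3).
second = Project((Alias(Attr("c", 2), "c", 3),), Scan("t", (Attr("c", 2),)))

same_without_fix = canonicalize(first, False) == canonicalize(second, False)
same_with_fix = canonicalize(first, True) == canonicalize(second, True)
```

In this toy model, `same_without_fix` is `False` because the extra `Project` node survives canonicalization, while `same_with_fix` is `True` because the cosmetic `Project` is dropped. A `Project` that prunes or reorders columns is not considered cosmetic and is kept.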

Why are the changes needed?

bug fix

Does this PR introduce any user-facing change?

Yes. Some queries that failed before now work.

How was this patch tested?

new test

Was this patch authored or co-authored using generative AI tooling?

no

@cloud-fan
Contributor Author

cc @peter-toth @gengliangwang

@github-actions github-actions bot added the SQL label Feb 7, 2025
Contributor

@peter-toth peter-toth left a comment

LGTM, pending tests.

@cloud-fan
Contributor Author

thanks for the review! merging to master/4.0!

@cloud-fan cloud-fan closed this in c7edcae Feb 10, 2025
cloud-fan added a commit that referenced this pull request Feb 10, 2025
Closes #49829 from cloud-fan/cte.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit c7edcae)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 14, 2025
Closes apache#49829 from cloud-fan/cte.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit b3be934)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>