[SPARK-40149][SQL][FOLLOWUP] Avoid adding extra Project in AddMetadataColumns #39895

Closed

Conversation

allisonwang-db
Contributor

@allisonwang-db allisonwang-db commented Feb 6, 2023

What changes were proposed in this pull request?

This PR is a follow-up for #37758. It updates the rule AddMetadataColumns to avoid introducing an extra Project.

Why are the changes needed?

To fix an issue introduced by #37758.

-- t1: [key, value] t2: [key, value]
select t1.key, t2.key from t1 full outer join t2 using (key)

Before this PR, the rule AddMetadataColumns added a new Project between the USING join and the select list:

Project [key, key]
+- Project [key, key, key, key] <--- extra project
   +- Project [coalesce(key, key) AS key, value, value, key, key]
      +- Join FullOuter, (key = key)
         :- LocalRelation <empty>, [key#0, value#0]
         +- LocalRelation <empty>, [key#0, value#0]

After this PR, this extra Project is no longer added.
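One rough way to spot a regression like this while debugging is to count the Project operators in the printed plan. The following is a minimal sketch, not code from this PR: it works on the plan text quoted above (with the `<--- extra project` annotation dropped), and the helper names `operator_names` and `count_projects` are hypothetical.

```python
# Plan text copied from the "before" plan in this PR description,
# without the "<--- extra project" annotation.
BEFORE_PLAN = """\
Project [key, key]
+- Project [key, key, key, key]
   +- Project [coalesce(key, key) AS key, value, value, key, key]
      +- Join FullOuter, (key = key)
         :- LocalRelation <empty>, [key#0, value#0]
         +- LocalRelation <empty>, [key#0, value#0]
"""


def operator_names(plan_text):
    """Return the operator name on each plan line, top to bottom."""
    names = []
    for line in plan_text.splitlines():
        # Drop indentation and the tree-drawing prefixes ("+- ", ":- ").
        stripped = line.lstrip(" :+-|")
        if stripped:
            # The operator name is the first whitespace-separated token.
            names.append(stripped.split()[0].rstrip(","))
    return names


def count_projects(plan_text):
    """Count how many Project operators appear in the plan text."""
    return operator_names(plan_text).count("Project")


print(operator_names(BEFORE_PLAN))
print(count_projects(BEFORE_PLAN))
```

For the plan above this reports three Project operators. Going by the description in this PR, the same query should yield one fewer after the fix; that expectation is inferred from the text here, not taken from an actual post-fix plan.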

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a new unit test.

@github-actions github-actions bot added the SQL label Feb 6, 2023
@cloud-fan
Contributor

The pyspark failure is unrelated, merging to master/3.4!

@cloud-fan cloud-fan closed this in 286d336 Feb 7, 2023
cloud-fan pushed a commit that referenced this pull request Feb 7, 2023
…aColumns

This PR is a follow-up for #37758. It updates the rule `AddMetadataColumns` to avoid introducing extra `Project`.

To fix an issue introduced by #37758.
```sql
-- t1: [key, value] t2: [key, value]
select t1.key, t2.key from t1 full outer join t2 using (key)
```
Before this PR, the rule `AddMetadataColumns` will add a new Project between the using join and the select list:
```
Project [key, key]
+- Project [key, key, key, key] <--- extra project
   +- Project [coalesce(key, key) AS key, value, value, key, key]
      +- Join FullOuter, (key = key)
         :- LocalRelation <empty>, [key#0, value#0]
         +- LocalRelation <empty>, [key#0, value#0]
```
After this PR, this extra Project will be removed.

No

Add a new UT.

Closes #39895 from allisonwang-db/spark-40149-follow-up.

Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 286d336)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023