[SPARK-40149][SQL][3.2] Propagate metadata columns through Project #37818

cloud-fan · 2022-09-07T11:25:53Z

Backport of #37758 to 3.2.

What changes were proposed in this pull request?

This PR fixes a regression caused by #32017.

In #32017, we tried to be more conservative and decided not to propagate metadata columns through certain operators, including `Project`. However, that decision considered only the SQL API, not the DataFrame API. In practice it is very common to chain `Project` operators in DataFrame code, e.g. `df.withColumn(...).withColumn(...)...`, and it is very inconvenient if metadata columns are not propagated through `Project`.

This PR makes 2 changes:

1. `Project` should propagate metadata columns.
2. `SubqueryAlias` should only propagate metadata columns if the child is a leaf node or also a `SubqueryAlias`.

The second change is needed to still forbid weird queries like `SELECT m from (SELECT a from t)`, which is the main motivation of #32017.
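The two rules can be illustrated with a small toy model. This is only a sketch in plain Python, not Spark's Catalyst code: the classes, the `metadata_output` method, and the table `t` with regular column `a` and metadata column `m` are all hypothetical names chosen for illustration.

```python
class Relation:
    """Leaf node: a table with regular output columns and metadata columns."""
    def __init__(self, output, metadata):
        self.output = tuple(output)
        self.metadata = tuple(metadata)

    def metadata_output(self):
        return self.metadata


class Filter:
    """Keeps its child's columns and passes metadata columns through."""
    def __init__(self, child):
        self.child = child
        self.output = child.output

    def metadata_output(self):
        return self.child.metadata_output()


class Project:
    def __init__(self, columns, child):
        self.output = tuple(columns)
        self.child = child

    def metadata_output(self):
        # Change 1 (toy version): propagate the child's metadata columns,
        # except those that are already selected as regular output.
        return tuple(c for c in self.child.metadata_output()
                     if c not in self.output)


class SubqueryAlias:
    def __init__(self, alias, child):
        self.alias = alias
        self.child = child
        self.output = child.output

    def metadata_output(self):
        # Change 2 (toy version): propagate only when the child is a leaf
        # node or another SubqueryAlias; otherwise hide metadata columns.
        if isinstance(self.child, (Relation, SubqueryAlias)):
            return self.child.metadata_output()
        return ()


t = Relation(["a"], ["m"])

# Models df.withColumn(...).withColumn(...): chained Projects keep "m" reachable.
chained = Project(["a", "b", "c"], Project(["a", "b"], t))

# Models SELECT m from (SELECT a from t): the alias sits on a Project, so "m" is hidden.
subquery = SubqueryAlias("s", Project(["a"], t))

# An alias directly on a table (or on another alias) still exposes "m".
aliased_table = SubqueryAlias("x", SubqueryAlias("y", t))

# Models df.filter(...).as("t"): the alias sits on a non-leaf node, so "m" is hidden.
aliased_filter = SubqueryAlias("t", Filter(t))

for plan in (chained, subquery, aliased_table, aliased_filter):
    print(type(plan).__name__, plan.metadata_output())
```

In this model the alias acts as the boundary that stops propagation, which is why allowing `Project` to propagate does not re-enable the subquery case.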

After metadata columns are propagated, a problem from #31666 is exposed: the natural join metadata columns may confuse the analyzer and lead to a wrong analyzed plan. For example, in `SELECT t1.value FROM t1 LEFT JOIN t2 USING (key) ORDER BY key`, how should `ORDER BY key` be resolved? It should be resolved to `t1.key`, which is in the output of the left join, via the rule `ResolveMissingReferences`. However, if `Project` can propagate metadata columns, `ORDER BY key` will be resolved to `t2.key`.

To solve this problem, this PR only allows qualified access to the metadata columns of a natural join. This is not a breaking change, because previously people could only access natural join metadata columns with a qualifier anyway, in the `Project` right after the `Join`. It actually enables more use cases, as people can now access natural join metadata columns in ORDER BY. I've added a test for it.
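The effect of qualified-only access can be sketched as a name lookup. Again this is a toy model under stated assumptions, not the analyzer's implementation: `MetadataColumn`, `resolve_metadata`, and the `qualified_access_only` flag are hypothetical names, and the second column `m` is invented to show the contrast with an ordinary metadata column.

```python
from typing import NamedTuple, Optional, Sequence


class MetadataColumn(NamedTuple):
    qualifier: str
    name: str
    qualified_access_only: bool


def resolve_metadata(ref: str,
                     candidates: Sequence[MetadataColumn]) -> Optional[MetadataColumn]:
    """Resolve 'name' or 'qualifier.name' against a list of metadata columns."""
    qualifier, _, name = ref.rpartition(".")
    for col in candidates:
        if col.name != name:
            continue
        if qualifier:
            # A qualified reference must match the column's qualifier.
            if col.qualifier == qualifier:
                return col
        elif not col.qualified_access_only:
            # An unqualified reference never matches a qualified-only column.
            return col
    return None


# Hidden column of `t1 LEFT JOIN t2 USING (key)` in this toy model.
join_key = MetadataColumn("t2", "key", qualified_access_only=True)
# A hypothetical ordinary metadata column, for contrast.
plain = MetadataColumn("t1", "m", qualified_access_only=False)
candidates = [join_key, plain]

print(resolve_metadata("key", candidates))     # unqualified: no match
print(resolve_metadata("t2.key", candidates))  # qualified: matches join_key
print(resolve_metadata("m", candidates))       # ordinary column: matches plain
```

Because the unqualified `key` finds nothing among the metadata columns, `ORDER BY key` in the example query is left to be resolved against the join output as `t1.key`, while `ORDER BY t2.key` can still reach the hidden column.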

Why are the changes needed?

To fix a regression.

Does this PR introduce any user-facing change?

For the SQL API, there is no change: a `SubqueryAlias` always comes with a `Project` or `Aggregate`, so we still don't propagate metadata columns through a SELECT group.

For the DataFrame API, the behavior becomes more lenient. The only breaking case is a `SubqueryAlias` that follows an operator that can propagate metadata columns, e.g. `df.filter(...).as("t").select("t.metadata_col")`. But this is a weird use case and I don't think we should have supported it in the first place.

How was this patch tested?

New tests.

Closes apache#37758 from cloud-fan/metadata.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 99ae1d9)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

cloud-fan · 2022-09-07T11:26:48Z

cc @viirya @huaxingao

cloud-fan · 2022-09-07T15:44:17Z

The failed Python tests are unrelated: https://github.com/cloud-fan/spark/actions/runs/3007233458

cloud-fan · 2022-09-07T15:44:49Z

merging to 3.2

Closes #37818 from cloud-fan/backport.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

viirya · 2022-09-07T16:40:01Z

lgtm

Closes apache#37818 from cloud-fan/backport.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit d566017)

github-actions bot added the SQL label Sep 7, 2022

huaxingao approved these changes Sep 7, 2022

cloud-fan closed this Sep 7, 2022
