
[SPARK-40149][SQL] Propagate metadata columns through Project #37758

Closed

wants to merge 4 commits

Conversation

@cloud-fan (Contributor) commented Sep 1, 2022

What changes were proposed in this pull request?

This PR fixes a regression caused by #32017.

In #32017, we tried to be more conservative and decided not to propagate metadata columns through certain operators, including `Project`. However, that decision was made with only the SQL API in mind, not the DataFrame API. In fact, it's very common to chain `Project` operators in the DataFrame API, e.g. `df.withColumn(...).withColumn(...)...`, and it's very inconvenient if metadata columns are not propagated through `Project`.
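For illustration, a minimal sketch (assuming a `spark` session and a DSv2 table `t1` that exposes `index` and `_partition` metadata columns, as in the tests mentioned below):

```scala
import org.apache.spark.sql.functions.lit

// Each withColumn adds a Project on top of the scan. With this change, the
// table's metadata columns stay reachable through the whole chain.
val df = spark.table("t1")
  .withColumn("flag", lit(true))
  .withColumn("tag", lit("a"))

// Works after this PR; previously Project dropped the metadata output.
df.select("index", "_partition").show()
```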

This PR makes 2 changes:

  1. Project should propagate metadata columns
  2. SubqueryAlias should only propagate metadata columns if the child is a leaf node or also a SubqueryAlias

The second change is needed to still forbid weird queries like `SELECT m FROM (SELECT a FROM t)`, which was the main motivation of #32017.
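Concretely (a hedged sketch; `t` is a hypothetical table exposing a metadata column `m`):

```scala
// Still disallowed: the inner SELECT is a Project under a SubqueryAlias, and
// SubqueryAlias only propagates metadata columns for leaf (or aliased) children.
spark.sql("SELECT m FROM (SELECT a FROM t)")  // analysis error, as before

// Still fine: metadata columns remain accessible directly on the table.
spark.sql("SELECT a, m FROM t")
```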

After propagating metadata columns, a problem from #31666 is exposed: the natural join metadata columns may confuse the analyzer and lead to a wrong analyzed plan. For example, given `SELECT t1.value FROM t1 LEFT JOIN t2 USING (key) ORDER BY key`, how should we resolve `ORDER BY key`? It should be resolved to `t1.key` via the rule `ResolveMissingReferences`, since `t1.key` is in the output of the left join. However, if `Project` can propagate metadata columns, `ORDER BY key` will be resolved to `t2.key`.

To solve this problem, this PR only allows qualified access to the metadata columns of a natural join. This is not a breaking change, as people could previously access natural join metadata columns only via qualified names anyway, in the `Project` directly after the `Join`. It actually enables more use cases, as people can now access natural join metadata columns in `ORDER BY`. I've added a test for it.
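For example (tables `t1` and `t2` as in the description above):

```scala
// Unqualified `key` resolves to t1.key via ResolveMissingReferences, since the
// natural/using-join metadata columns now require a qualifier.
spark.sql("SELECT t1.value FROM t1 LEFT JOIN t2 USING (key) ORDER BY key")

// Newly possible: the join's hidden column referenced explicitly in ORDER BY.
spark.sql("SELECT t1.value FROM t1 LEFT JOIN t2 USING (key) ORDER BY t2.key")
```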

Why are the changes needed?

fix a regression

Does this PR introduce any user-facing change?

For the SQL API, there is no change: a `SubqueryAlias` always comes with a `Project` or `Aggregate`, so we still don't propagate metadata columns through a SELECT group.

For the DataFrame API, the behavior becomes more lenient. The only breaking case is an operator that can propagate metadata columns followed by a `SubqueryAlias`, e.g. `df.filter(...).as("t").select("t.metadata_col")`. But this is a weird use case, and I don't think we should have supported it in the first place.

How was this patch tested?

new tests

@github-actions bot added the SQL label Sep 1, 2022
@cloud-fan (Contributor, Author)

cc @karenfeng @viirya @huaxingao

@viirya (Member) left a comment

Looks good. But there are a few SQL test failures. Are they related?

@cloud-fan (Contributor, Author)

@viirya I've updated the PR description to cover the newly discovered issue.

@@ -191,20 +191,20 @@ package object util extends Logging {
/**
* If set, this metadata column is a candidate during qualified star expansions.
@viirya (Member) commented on this diff:

This comment is out of date.

@@ -3523,8 +3525,8 @@ class Analyzer(override val catalogManager: CatalogManager)
val project = Project(projectList, Join(left, right, joinType, newCondition, hint))
project.setTagValue(
Project.hiddenOutputTag,
hiddenList.map(_.markAsSupportsQualifiedStar()) ++
project.child.metadataOutput.filter(_.supportsQualifiedStar))
hiddenList.map(_.markAsQualifiedAccess()) ++
@viirya (Member) commented on this diff:

The new semantics read a bit oddly. Previously it was understandable: `markAsSupportsQualifiedStar` means the `hiddenList` can be accessed by a qualified star. But how should `markAsQualifiedAccess` be interpreted here? I think `hiddenList` holds the duplicated join keys; what does it mean to mark them as "qualified access"?

@cloud-fan (Contributor, Author): How about `markAsQualifiedAccessOnly`?

@viirya (Member): Hmm, sounds okay.

@viirya (Member) left a comment

All tests passed, so I think the new `markAsQualifiedAccess` works in practice, at least for the natural join case, although I still have a question about how to interpret its semantics.

}
}

test("SPARK-34923: propagate metadata columns through Project") {
A contributor commented on this diff:

nit: should this be `test("SPARK-40149: propagate metadata columns through Project for DataFrame API")`?

Shall we keep the

      assertThrows[AnalysisException] {
        sql(s"SELECT index, _partition from (SELECT id, data FROM $t1)")
      }

@cloud-fan (Contributor, Author)

thanks for review, merging to master/3.3!

@cloud-fan cloud-fan closed this in 99ae1d9 Sep 7, 2022
cloud-fan added a commit that referenced this pull request Sep 7, 2022
Closes #37758 from cloud-fan/metadata.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 99ae1d9)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan added a commit to cloud-fan/spark that referenced this pull request Sep 7, 2022
Closes apache#37758 from cloud-fan/metadata.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 99ae1d9)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan added a commit that referenced this pull request Sep 7, 2022
backport #37758 to 3.2

Closes #37818 from cloud-fan/backport.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Nov 15, 2022
### What changes were proposed in this pull request?

Skip `UnresolvedHint` in the rule `AddMetadataColumns` to avoid calling `exprId` on `UnresolvedAttribute`.
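A minimal sketch of the idea (a hypothetical simplification; the actual rule's match arms differ):

```scala
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, UnresolvedHint}

// Hypothetical guard: bail out before inspecting attributes whenever the node
// is an UnresolvedHint, whose attributes may still be UnresolvedAttribute
// (calling .exprId on those throws UnresolvedException).
def hasMetadataColSafe(plan: LogicalPlan, check: LogicalPlan => Boolean): Boolean =
  plan match {
    case _: UnresolvedHint => false // skip unresolved hints entirely
    case other             => check(other)
  }
```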

### Why are the changes needed?

```
CREATE TABLE t1(c1 bigint) USING PARQUET;
CREATE TABLE t2(c2 bigint) USING PARQUET;
SELECT /*+ hash(t2) */ * FROM t1 join t2 on c1 = c2;
```

failed with msg:
```
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to exprId on unresolved object
  at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.exprId(unresolved.scala:147)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$AddMetadataColumns$.$anonfun$hasMetadataCol$4(Analyzer.scala:1005)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$AddMetadataColumns$.$anonfun$hasMetadataCol$4$adapted(Analyzer.scala:1005)
  at scala.collection.Iterator.exists(Iterator.scala:969)
  at scala.collection.Iterator.exists$(Iterator.scala:967)
  at scala.collection.AbstractIterator.exists(Iterator.scala:1431)
  at scala.collection.IterableLike.exists(IterableLike.scala:79)
  at scala.collection.IterableLike.exists$(IterableLike.scala:78)
  at scala.collection.AbstractIterable.exists(Iterable.scala:56)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$AddMetadataColumns$.$anonfun$hasMetadataCol$3(Analyzer.scala:1005)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$AddMetadataColumns$.$anonfun$hasMetadataCol$3$adapted(Analyzer.scala:1005)
```

Before, this only produced a warning: `WARN HintErrorLogger: Unrecognized hint: hash(t2)`

### Does this PR introduce _any_ user-facing change?

Yes, this fixes a regression from 3.3.1.

Note: the root cause is that since #32841 we mark `UnresolvedHint` as resolved once its child is resolved; #37758 then triggered this bug.

### How was this patch tested?

add test

Closes #38662 from ulysses-you/hint.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Nov 15, 2022
Closes #38662 from ulysses-you/hint.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit a9bf5d2)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
HyukjinKwon pushed a commit that referenced this pull request Dec 5, 2022
…ter subquery alias


### What changes were proposed in this pull request?
This fixes a regression caused by #37758. In #37758, we decided to only allow qualified-name access for using/natural join hidden columns, to fix other problems around hidden columns.

We thought that is not a breaking change, as you can only access the join hidden columns by qualified names to disambiguate. However, one case is missed: when we wrap the join with a subquery alias, the ambiguity is gone and we should allow simple name access.

This PR fixes this bug by removing the qualified access only restriction in `SubqueryAlias.output`.
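A hedged repro consistent with this description (hypothetical tables `t1`/`t2` sharing a `key` column):

```scala
// The inner projection pulls in the hidden join column t1.key. Wrapping the
// join in a subquery alias removes any ambiguity, so simple-name access to
// `key` should resolve; before this fix it failed with UNRESOLVED_COLUMN.
spark.sql("SELECT key FROM (SELECT t1.key FROM t1 JOIN t2 USING (key)) AS s")
```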

### Why are the changes needed?
fix a regression.

### Does this PR introduce _any_ user-facing change?
Yes, certain queries that failed with `UNRESOLVED_COLUMN` before this PR work now.

### How was this patch tested?
new tests

Closes #38862 from cloud-fan/join.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
cloud-fan added a commit that referenced this pull request Dec 6, 2022
…ter subquery alias

Closes #38862 from cloud-fan/join.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
Closes apache#38662 from ulysses-you/hint.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 18, 2022
…ter subquery alias

Closes apache#38862 from cloud-fan/join.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
gengliangwang added a commit that referenced this pull request Jan 6, 2023
…tadataColumnSuite

### What changes were proposed in this pull request?

Move the new test case for metadata columns in #39081 to `MetadataColumnSuite`.

### Why are the changes needed?

All metadata column related test cases should go into `MetadataColumnSuite`. For example:

- #37758
- #39152

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

GA tests

Closes #39425 from gengliangwang/moveTest.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
cloud-fan pushed a commit that referenced this pull request Feb 7, 2023
…aColumns

### What changes were proposed in this pull request?
This PR is a follow-up for #37758. It updates the rule `AddMetadataColumns` to avoid introducing extra `Project`.

### Why are the changes needed?

To fix an issue introduced by #37758.
```sql
-- t1: [key, value] t2: [key, value]
select t1.key, t2.key from t1 full outer join t2 using (key)
```
Before this PR, the rule `AddMetadataColumns` would add a new Project between the using join and the select list:
```
Project [key, key]
+- Project [key, key, key, key] <--- extra project
   +- Project [coalesce(key, key) AS key, value, value, key, key]
      +- Join FullOuter, (key = key)
         :- LocalRelation <empty>, [key#0, value#0]
         +- LocalRelation <empty>, [key#0, value#0]
```
After this PR, this extra Project will be removed.
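Presumably (a sketch inferred from the plan above, not output reproduced from the PR), the analyzed plan then becomes:

```
Project [key, key]
+- Project [coalesce(key, key) AS key, value, value, key, key]
   +- Join FullOuter, (key = key)
      :- LocalRelation <empty>, [key#0, value#0]
      +- LocalRelation <empty>, [key#0, value#0]
```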

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Add a new UT.

Closes #39895 from allisonwang-db/spark-40149-follow-up.

Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Feb 7, 2023
…aColumns

Closes #39895 from allisonwang-db/spark-40149-follow-up.

Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 286d336)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023
backport apache#37758 to 3.2

Closes apache#37818 from cloud-fan/backport.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit d566017)
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
…aColumns

Closes apache#39895 from allisonwang-db/spark-40149-follow-up.

Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 286d336)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>