[SPARK-34719][SQL][3.0] Correctly resolve the view query with duplicated column names #31894

cloud-fan · 2021-03-19T06:30:30Z

backport #31811 to 3.0

What changes were proposed in this pull request?

For permanent views (and the new SQL temp views in Spark 3.1), we store the view SQL text and re-parse and re-analyze it when reading the view. For a query like SELECT * FROM ..., the view schema should not change when the referenced table changes its schema, so we record the view query output column names when creating the view. When reading the view, we add a SELECT recorded_column_names FROM ... on top of the view query to retain the original view query schema.

In Spark 3.1 and before, the final SELECT is added after the analysis phase: https://github.com/apache/spark/blob/branch-3.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala#L67

If the view query has duplicated output column names, we always pick the first column when reading a view. A simple repro:

scala> sql("create view c(x, y) as select 1 a, 2 a")
res0: org.apache.spark.sql.DataFrame = []

scala> sql("select * from c").show
+---+---+
|  x|  y|
+---+---+
|  1|  1|
+---+---+

In the master branch, reading the view fails instead: b891862 adds the final SELECT during analysis, so the query fails with Reference 'a' is ambiguous.

This PR proposes to resolve the view query output column names from the matching attributes by ordinal.

For example, with create view c(x, y) as select 1 a, 2 a, the view query output column names are [a, a]. When reading the view, there are 2 matching attributes (e.g. [a#1, a#2]) and we can simply match them by ordinal.
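To make the difference concrete, here is a minimal sketch in plain Python (not Spark code) that models attributes as (name, expression ID) pairs. The function names resolve_first_match and resolve_by_ordinal are hypothetical and only illustrate the two lookup strategies described above.

```python
# Illustrative model only; these helpers are hypothetical, not Spark APIs.


def resolve_first_match(recorded_names, output_attrs):
    """Look each recorded name up by name and take the first hit."""
    return [
        next(attr for attr in output_attrs if attr[0] == name)
        for name in recorded_names
    ]


def resolve_by_ordinal(recorded_names, output_attrs):
    """Pick the k-th matching attribute for the k-th occurrence of a name."""
    seen = {}  # name -> how many times it has already been resolved
    resolved = []
    for name in recorded_names:
        candidates = [attr for attr in output_attrs if attr[0] == name]
        ordinal = seen.get(name, 0)
        resolved.append(candidates[ordinal])
        seen[name] = ordinal + 1
    return resolved


# View query "select 1 a, 2 a": both output attributes are named "a".
recorded = ["a", "a"]
attrs = [("a", 1), ("a", 2)]  # stands in for [a#1, a#2]

print(resolve_first_match(recorded, attrs))  # [('a', 1), ('a', 1)]
print(resolve_by_ordinal(recorded, attrs))   # [('a', 1), ('a', 2)]
```

The first-match lookup returns the same attribute twice, which corresponds to the 1, 1 row in the repro; the ordinal lookup returns each attribute once.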

A negative example is:

create table t(a int)
create view v as select *, 1 as col from t
replace table t(a int, col int)

When reading the view, the recorded view query output column names are [a, col], but the re-analyzed view query now outputs two attributes named col (one from the replaced table and one from 1 as col), so we should fail the query. See the tests for details.
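The check in this negative example can be sketched as follows. This is a plain-Python model under one assumption: each recorded name is expected to match exactly as many attributes as the number of times it was recorded, and any other count is treated as an error. The function resolve_view_columns and its use of ValueError are hypothetical, not Spark's implementation.

```python
from collections import Counter

# Illustrative model only; resolve_view_columns is a hypothetical helper.


def resolve_view_columns(recorded_names, output_names):
    """Map each recorded name to an index in output_names, matching by ordinal.

    Raises ValueError when the number of attributes with a given name differs
    from the number of times that name was recorded (assumed failure rule).
    """
    expected = Counter(recorded_names)  # name -> expected number of candidates
    seen = Counter()                    # name -> occurrences resolved so far
    resolved = []
    for name in recorded_names:
        candidates = [i for i, n in enumerate(output_names) if n == name]
        if len(candidates) != expected[name]:
            raise ValueError(
                f"column '{name}': expected {expected[name]} matching "
                f"attribute(s), found {len(candidates)}"
            )
        resolved.append(candidates[seen[name]])
        seen[name] += 1
    return resolved


# Duplicated names recorded at creation time still resolve one to one.
print(resolve_view_columns(["a", "a"], ["a", "a"]))  # [0, 1]

# After "replace table t(a int, col int)", the view query outputs [a, col, col].
try:
    resolve_view_columns(["a", "col"], ["a", "col", "col"])
except ValueError as err:
    print("failed:", err)
```

Failing here rather than picking one of the two col attributes avoids silently reading a column the view author never selected.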

Why are the changes needed?

bug fix

Does this PR introduce any user-facing change?

yes

How was this patch tested?

new test

…ted column names

This PR targets branch 3.1 because it's not a correctness bug in master, and there are code conflicts. I'll fix master later.

Closes apache#31811 from cloud-fan/view.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>

cloud-fan · 2021-03-19T06:30:52Z

cc @maropu

AmplabJenkins · 2021-03-19T07:08:57Z

Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40825/

AmplabJenkins · 2021-03-19T07:11:02Z

Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136243/

maropu · 2021-03-19T07:30:38Z

LGTM if the tests pass. also cc: @viirya

viirya · 2021-03-19T07:35:55Z

> This PR targets branch 3.1

branch 3.1 -> branch 3.0?

viirya

lgtm

…ted column names

backport #31811 to 3.0

Closes #31894 from cloud-fan/backport.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>

maropu · 2021-03-20T02:10:18Z

okay, thanks! Merged to branch-3.0.

…ted column names

backport #31811 to 2.4

Closes #31894 from cloud-fan/backport.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>

cloud-fan · 2021-03-22T15:53:31Z

I also backported this to 2.4. The conflicts are trivial and I ran the test locally.

viirya · 2021-03-22T16:26:02Z

@cloud-fan Thanks.

viirya approved these changes Mar 19, 2021

maropu closed this Mar 20, 2021
