[SPARK-36449][SQL] v2 ALTER TABLE REPLACE COLUMNS should check duplicates for the user specified columns #33676

imback82 · 2021-08-08T05:54:12Z

What changes were proposed in this pull request?

Currently, v2 ALTER TABLE REPLACE COLUMNS does not check duplicates for the user specified columns. For example,

spark.sql(s"CREATE TABLE $t (id int) USING $v2Format")
spark.sql(s"ALTER TABLE $t REPLACE COLUMNS (data string, data string)")

doesn't fail the analysis, and it's up to the catalog implementation to handle it.

Why are the changes needed?

To check the duplicate columns during analysis.
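As a rough illustration of what such an analysis-time check involves, here is a minimal stand-alone sketch. It does not use Spark and is not the code in this PR: the object `DuplicateColumnCheck`, the `Resolver` alias, and both helper methods are hypothetical names introduced for the example. Only the error message text is taken from the PR description.

```scala
// Hypothetical stand-alone sketch; not Spark's actual implementation.
object DuplicateColumnCheck {
  // A resolver decides whether two names refer to the same column.
  type Resolver = (String, String) => Boolean

  val caseSensitive: Resolver = (a, b) => a == b
  val caseInsensitive: Resolver = (a, b) => a.equalsIgnoreCase(b)

  // Returns each name that repeats an earlier one under the resolver,
  // reported once, in the order the repeats are first seen.
  def findDuplicates(names: Seq[String], resolver: Resolver): Seq[String] = {
    val repeats = names.zipWithIndex.collect {
      case (name, i) if names.take(i).exists(resolver(_, name)) => name
    }
    repeats.foldLeft(Vector.empty[String]) { (acc, n) =>
      if (acc.exists(resolver(_, n))) acc else acc :+ n
    }
  }

  // Fails fast, before any catalog would see the column list.
  def checkNoDuplicates(names: Seq[String], resolver: Resolver): Unit = {
    val dups = findDuplicates(names, resolver)
    if (dups.nonEmpty) {
      throw new IllegalArgumentException(
        "Found duplicate column(s) in the user specified columns: " +
          dups.map(n => s"`$n`").mkString(", "))
    }
  }
}
```

With this sketch, `checkNoDuplicates(Seq("data", "data"), caseSensitive)` throws, while `Seq("data", "DATA")` is rejected only under `caseInsensitive`.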

Does this PR introduce any user-facing change?

Yes, now the above command will print out the following:

org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the user specified columns: `data`

How was this patch tested?

Added new unit tests

imback82 · 2021-08-08T05:55:21Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala

-      if (struct.findNestedField(fieldNames, includeCollections = true, r).isDefined) {
+    def checkColumnNotExists(op: String, fieldNames: Seq[String], struct: StructType): Unit = {
+      if (struct.findNestedField(
+          fieldNames, includeCollections = true, alter.conf.resolver).isDefined) {


capturing resolver directly from alter variable to simplify.

imback82 · 2021-08-08T05:56:07Z

sql/core/src/test/scala/org/apache/spark/sql/connector/V2CommandsCaseSensitivitySuite.scala


+  test("SPARK-36449: Replacing columns with duplicate name should not be allowed") {
+    alterTableTest(
+      () => ReplaceColumns(


need to create a new ReplaceColumns. Otherwise, analyzed will be set to true after the first iteration.

SparkQA · 2021-08-08T06:49:32Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46704/

SparkQA · 2021-08-08T07:45:11Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46704/

SparkQA · 2021-08-08T10:59:13Z

Test build #142192 has finished for PR 33676 at commit e91d7f9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

imback82 · 2021-08-08T19:03:01Z

cc @cloud-fan

cloud-fan · 2021-08-09T08:31:08Z

sql/core/src/test/scala/org/apache/spark/sql/connector/V2CommandsCaseSensitivitySuite.scala

+  }
+
  private def alterTableTest(
      alter: AlterTableCommand,


how about simply changing this to a by-name parameter? alter: => AlterTableCommand

thanks, updated.
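The difference discussed in this thread can be sketched in isolation. The example below is a simplified model, not the test suite's code: `Plan`, `runWithShared`, and `runWithByName` are hypothetical names, and the mutable `analyzed` flag only stands in for the behaviour described above, where a command is marked as analyzed after the first iteration.

```scala
object ByNameDemo {
  // Hypothetical stand-in for a command that remembers it was analyzed.
  final class Plan(val name: String) {
    var analyzed: Boolean = false
  }

  // Plain parameter: the same instance is reused on every iteration,
  // so only the first iteration sees a not-yet-analyzed plan.
  def runWithShared(plan: Plan, settings: Seq[Boolean]): Int =
    settings.count { _ =>
      val fresh = !plan.analyzed
      plan.analyzed = true
      fresh
    }

  // By-name parameter: the argument expression is re-evaluated each time
  // `plan` is referenced, so every iteration builds a new instance.
  def runWithByName(plan: => Plan, settings: Seq[Boolean]): Int =
    settings.count { _ =>
      val p = plan // evaluates the caller's expression here
      val fresh = !p.analyzed
      p.analyzed = true
      fresh
    }
}
```

Both methods return the number of iterations that started from a fresh plan. Compared with passing an explicit `() => Plan` thunk, the by-name form keeps call sites unchanged: callers still write `runWithByName(new Plan("replace"), Seq(true, false))`.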

SparkQA · 2021-08-09T16:09:42Z

Test build #142231 has started for PR 33676 at commit 0b94f76.

SparkQA · 2021-08-09T17:07:42Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46743/

SparkQA · 2021-08-09T18:07:07Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46743/

AmplabJenkins · 2021-08-09T18:07:14Z

Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/46743/

cloud-fan · 2021-08-10T05:20:27Z

thanks, merging to master/3.2!

…ates for the user specified columns

### What changes were proposed in this pull request?

Currently, v2 ALTER TABLE REPLACE COLUMNS does not check duplicates for the user specified columns. For example,
```
spark.sql(s"CREATE TABLE $t (id int) USING $v2Format")
spark.sql(s"ALTER TABLE $t REPLACE COLUMNS (data string, data string)")
```
doesn't fail the analysis, and it's up to the catalog implementation to handle it.

### Why are the changes needed?

To check the duplicate columns during analysis.

### Does this PR introduce _any_ user-facing change?

Yes, now the above command will print out the following:
```
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the user specified columns: `data`
```

### How was this patch tested?

Added new unit tests

Closes #33676 from imback82/replace_cols_duplicates.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit e1a5d94)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

initial commit

e91d7f9

github-actions bot added the SQL label Aug 8, 2021

imback82 commented Aug 8, 2021

cloud-fan reviewed Aug 9, 2021

cloud-fan approved these changes Aug 9, 2021

address PR comment

0b94f76

cloud-fan closed this in e1a5d94 Aug 10, 2021
