Skip to content

Conversation

@imback82
Copy link
Contributor

@imback82 imback82 commented Aug 8, 2021

What changes were proposed in this pull request?

Currently, v2 ALTER TABLE REPLACE COLUMNS does not check duplicates for the user specified columns. For example,

spark.sql(s"CREATE TABLE $t (id int) USING $v2Format")
spark.sql(s"ALTER TABLE $t REPLACE COLUMNS (data string, data string)")

doesn't fail the analysis, and it's up to the catalog implementation to handle it.

Why are the changes needed?

To check the duplicate columns during analysis.

Does this PR introduce any user-facing change?

Yes, now the above will command will print out the following:

org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the user specified columns: `data`

How was this patch tested?

Added new unit tests

@github-actions github-actions bot added the SQL label Aug 8, 2021
if (struct.findNestedField(fieldNames, includeCollections = true, r).isDefined) {
def checkColumnNotExists(op: String, fieldNames: Seq[String], struct: StructType): Unit = {
if (struct.findNestedField(
fieldNames, includeCollections = true, alter.conf.resolver).isDefined) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

capturing resolver directly from alter variable to simplify.


test("SPARK-36449: Replacing columns with duplicate name should not be allowed") {
alterTableTest(
() => ReplaceColumns(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to create a new ReplaceColumns. Otherwise, analyzed will be set to true after the first iteration.

@SparkQA
Copy link

SparkQA commented Aug 8, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46704/

@SparkQA
Copy link

SparkQA commented Aug 8, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46704/

@SparkQA
Copy link

SparkQA commented Aug 8, 2021

Test build #142192 has finished for PR 33676 at commit e91d7f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@imback82
Copy link
Contributor Author

imback82 commented Aug 8, 2021

cc @cloud-fan

}

private def alterTableTest(
alter: AlterTableCommand,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about simply changing this to by-name parameter? alter: => AlterTableCommand

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, updated.

@SparkQA
Copy link

SparkQA commented Aug 9, 2021

Test build #142231 has started for PR 33676 at commit 0b94f76.

@SparkQA
Copy link

SparkQA commented Aug 9, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46743/

@SparkQA
Copy link

SparkQA commented Aug 9, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46743/

@AmplabJenkins
Copy link

Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/46743/

@cloud-fan
Copy link
Contributor

thanks, merging to master/3.2!

@cloud-fan cloud-fan closed this in e1a5d94 Aug 10, 2021
cloud-fan pushed a commit that referenced this pull request Aug 10, 2021
…ates for the user specified columns

### What changes were proposed in this pull request?

Currently, v2 ALTER TABLE REPLACE COLUMNS does not check duplicates for the user specified columns. For example,
```
spark.sql(s"CREATE TABLE $t (id int) USING $v2Format")
spark.sql(s"ALTER TABLE $t REPLACE COLUMNS (data string, data string)")
```
doesn't fail the analysis, and it's up to the catalog implementation to handle it.

### Why are the changes needed?

To check the duplicate columns during analysis.

### Does this PR introduce _any_ user-facing change?

Yes, now the above will command will print out the following:
```
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the user specified columns: `data`
```

### How was this patch tested?

Added new unit tests

Closes #33676 from imback82/replace_cols_duplicates.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit e1a5d94)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants