[SPARK-36352][SQL][3.0] Spark should check result plan's output schema name #33764

Closed
wants to merge 3 commits into from

AngersZhuuuu
Contributor

What changes were proposed in this pull request?

Spark should check the output schema names of the result plan.

Why are the changes needed?

In the current code, some optimizer rules may change a plan's output schema names. The output is only ever checked with semantic equality, and that check does not catch a change in column names.
For example, with SchemaPruning, if we have a plan

Project[a, B]
|--Scan[A, b, c]

the original output schema is a, B. After SchemaPruning, it becomes

Project[A, b]
|--Scan[A, b]

This changes the plan's schema. When we use CTAS, the table schema is the same as the query plan's output.
Since the schema has changed, it is no longer consistent with the original SQL. So we need to check the final result plan's schema against the original plan's schema.
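
To illustrate the gap described above, here is a minimal, self-contained sketch in plain Python. It is not Spark's code or API: `Attr`, `expr_id`, `same_output_semantically`, `same_output_names`, and `check_rule_output` are hypothetical names invented for this example, and the model assumes that semantic equality compares columns by an internal id rather than by name.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Attr:
    """Toy stand-in for a resolved column: a display name plus a unique id."""
    name: str
    expr_id: int

    def semantic_equals(self, other: "Attr") -> bool:
        # Assumption of this sketch: only the id is compared, so "a" and "A"
        # with the same id are considered equal.
        return self.expr_id == other.expr_id


def same_output_semantically(before: List[Attr], after: List[Attr]) -> bool:
    """True when both outputs refer to the same columns, ignoring names."""
    return len(before) == len(after) and all(
        b.semantic_equals(a) for b, a in zip(before, after)
    )


def same_output_names(before: List[Attr], after: List[Attr]) -> bool:
    """True when the output column names match exactly, including case."""
    return [b.name for b in before] == [a.name for a in after]


def check_rule_output(before: List[Attr], after: List[Attr]) -> None:
    """Reject a rewritten plan whose output differs semantically or by name."""
    if not same_output_semantically(before, after):
        raise ValueError("rule changed the plan's output attributes")
    if not same_output_names(before, after):
        raise ValueError(
            "rule changed output names from "
            f"{[b.name for b in before]} to {[a.name for a in after]}"
        )


# Project[a, B] before the rule, Project[A, b] after it (same ids, new names).
original = [Attr("a", 1), Attr("B", 2)]
pruned = [Attr("A", 1), Attr("b", 2)]
```

In this model, `same_output_semantically(original, pruned)` passes while `same_output_names(original, pruned)` does not, so a check based only on semantic equality would let the renamed output reach CTAS. `check_rule_output` adds the name comparison on top.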

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing unit tests.

@AngersZhuuuu
Contributor Author

FYI @cloud-fan

@SparkQA
SparkQA commented Aug 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47063/

@SparkQA
SparkQA commented Aug 17, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47063/

@SparkQA
SparkQA commented Aug 17, 2021

Test build #142560 has finished for PR 33764 at commit be57420.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AngersZhuuuu AngersZhuuuu changed the title [SPARK-36352][SQL][3.0] Spark should check result plan's output schema name [WIP][SPARK-36352][SQL][3.0] Spark should check result plan's output schema name Aug 19, 2021
@AngersZhuuuu
Contributor Author

retest this please

@SparkQA
SparkQA commented Aug 24, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47237/

@SparkQA
SparkQA commented Aug 24, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47237/

@SparkQA
SparkQA commented Aug 24, 2021

Test build #142737 has finished for PR 33764 at commit be57420.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AngersZhuuuu AngersZhuuuu changed the title [WIP][SPARK-36352][SQL][3.0] Spark should check result plan's output schema name [SPARK-36352][SQL][3.0] Spark should check result plan's output schema name Aug 26, 2021
@SparkQA
SparkQA commented Aug 26, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47289/

@SparkQA
SparkQA commented Aug 26, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47289/

@SparkQA
SparkQA commented Aug 26, 2021

Test build #142792 has finished for PR 33764 at commit cf782d7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AngersZhuuuu
Contributor Author

ping @cloud-fan

@cloud-fan
Contributor

thanks, merging to 3.0!

cloud-fan pushed a commit that referenced this pull request Aug 26, 2021
…a name

Closes #33764 from AngersZhuuuu/SPARK-36352-3.0.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan cloud-fan closed this Aug 26, 2021