[SPARK-36352][SQL][3.0] Spark should check result plan's output schema name #33764

AngersZhuuuu · 2021-08-17T15:01:54Z

What changes were proposed in this pull request?

Spark should check the output schema names of the result plan.

Why are the changes needed?

In the current code, some optimizer rules may change a plan's output schema. The output is always checked with semantic equality, which does not compare column names, so a rule can change the names in the output schema without being noticed.
For example, with SchemaPruning, if we have the plan

Project[a, B]
|--Scan[A, b, c]

the original output schema is a, B. After SchemaPruning, it becomes

Project[A, b]
|--Scan[A, b]

This changes the plan's schema. When we use CTAS, the table schema is the same as the query plan's output.
Since the schema has changed, it is no longer consistent with the original SQL. So we need to check the final result plan's schema against the original plan's schema.
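
To illustrate the gap described above, here is a minimal sketch in plain Python. It is a toy model, not Spark's Catalyst API: the `Attribute` class, the `expr_id` field, and the helper functions are hypothetical names chosen for illustration. It assumes that semantic equality compares attributes by identity only, while a schema-name check compares the names themselves.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Attribute:
    """Toy stand-in for an output attribute: a display name plus an id."""
    name: str
    expr_id: int  # hypothetical identity used for semantic comparison


def semantic_equals(a: Attribute, b: Attribute) -> bool:
    # Assumption of this sketch: semantic equality ignores the name
    # and only looks at the attribute's identity.
    return a.expr_id == b.expr_id


def same_output_semantically(before: List[Attribute], after: List[Attribute]) -> bool:
    # Position-by-position semantic comparison of two outputs.
    return len(before) == len(after) and all(
        semantic_equals(x, y) for x, y in zip(before, after)
    )


def same_output_names(before: List[Attribute], after: List[Attribute]) -> bool:
    # Exact, case-sensitive comparison of the output column names.
    return [a.name for a in before] == [a.name for a in after]


# Output of Project[a, B] before the rule runs.
original_output = [Attribute("a", 1), Attribute("B", 2)]
# Output of Project[A, b] after the rule: same attributes, different names.
pruned_output = [Attribute("A", 1), Attribute("b", 2)]

semantic_ok = same_output_semantically(original_output, pruned_output)
names_ok = same_output_names(original_output, pruned_output)
```

In this model the semantic check accepts the rewritten plan while the name check rejects it, which is the kind of mismatch that would surface in a CTAS table schema.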

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing UTs.


AngersZhuuuu · 2021-08-17T15:02:44Z

FYI @cloud-fan

SparkQA · 2021-08-17T15:54:21Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47063/

SparkQA · 2021-08-17T16:19:00Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47063/

SparkQA · 2021-08-17T17:34:19Z

Test build #142560 has finished for PR 33764 at commit be57420.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AngersZhuuuu · 2021-08-24T15:59:27Z

retest this please

SparkQA · 2021-08-24T16:53:00Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47237/

SparkQA · 2021-08-24T17:20:13Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47237/

SparkQA · 2021-08-24T19:04:57Z

Test build #142737 has finished for PR 33764 at commit be57420.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-08-26T07:03:23Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47289/

SparkQA · 2021-08-26T07:05:48Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47289/

SparkQA · 2021-08-26T10:50:54Z

Test build #142792 has finished for PR 33764 at commit cf782d7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AngersZhuuuu · 2021-08-26T11:34:34Z

ping @cloud-fan

cloud-fan · 2021-08-26T14:09:07Z

thanks, merging to 3.0!

…a name

Closes #33764 from AngersZhuuuu/SPARK-36352-3.0.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

[SPARK-36352][SQL][3.0] Spark should check result plan's output schem…

be57420

…a name

AngersZhuuuu changed the title ~~[SPARK-36352][SQL][3.0] Spark should check result plan's output schema name~~ [WIP][SPARK-36352][SQL][3.0] Spark should check result plan's output schema name Aug 19, 2021

AngersZhuuuu added 2 commits August 25, 2021 21:38

Merge branch 'branch-3.0' into SPARK-36352-3.0

bb3d854

Update StreamSuite.scala

cf782d7

AngersZhuuuu changed the title ~~[WIP][SPARK-36352][SQL][3.0] Spark should check result plan's output schema name~~ [SPARK-36352][SQL][3.0] Spark should check result plan's output schema name Aug 26, 2021

cloud-fan closed this Aug 26, 2021
