[SPARK-23523] [SQL] Fix the incorrect result caused by the rule OptimizeMetadataOnlyQuery #20684

gatorsmile · 2018-02-27T05:24:40Z

What changes were proposed in this pull request?

val tablePath = new File(s"${path.getCanonicalPath}/cOl3=c/cOl1=a/cOl5=e")
Seq(("a", "b", "c", "d", "e")).toDF("cOl1", "cOl2", "cOl3", "cOl4", "cOl5")
  .write.json(tablePath.getCanonicalPath)
val df = spark.read.json(path.getCanonicalPath).select("CoL1", "CoL5", "CoL3").distinct()
df.show()

It generates a wrong result:

[c,e,a]

The expected row is [a,e,c], since cOl1=a, cOl5=e, and cOl3=c.

There is a bug in the rule OptimizeMetadataOnlyQuery: it does not respect the attribute order of the original leaf node. This PR fixes it.
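
For readers who want to see how an ordering mismatch can produce [c,e,a], here is a minimal sketch in plain Scala. It is not Spark's code. `Attr`, `attrsByFilter`, `attrsByLookup`, and `project` are hypothetical names, and the sketch assumes the partition values arrive in directory order (cOl3, cOl1, cOl5) while the relation output is in data-schema order.

```scala
object AttributeOrderSketch {
  // Hypothetical stand-in for a Catalyst attribute; only the name matters here.
  final case class Attr(name: String)

  // Assumed relation output, in data-schema order.
  val output: Seq[Attr] = Seq("cOl1", "cOl2", "cOl3", "cOl4", "cOl5").map(Attr(_))

  // Assumed partition columns in directory order, with the single partition row.
  val partitionColumnNames: Seq[String] = Seq("cOl3", "cOl1", "cOl5")
  val partitionRow: Seq[String] = Seq("c", "a", "e")

  // Filtering the output keeps the output's order, not the partition order.
  def attrsByFilter: Seq[Attr] = {
    val partColumns = partitionColumnNames.map(_.toLowerCase).toSet
    output.filter(a => partColumns.contains(a.name.toLowerCase))
  }

  // Looking each partition column up by name keeps the partition order.
  def attrsByLookup: Seq[Attr] = {
    val attrMap = output.map(a => a.name -> a).toMap
    partitionColumnNames.map(attrMap)
  }

  // Pair attributes with row values by position, then pick the requested columns.
  def project(attrs: Seq[Attr], cols: Seq[String]): Seq[String] = {
    val byName = attrs.map(_.name.toLowerCase).zip(partitionRow).toMap
    cols.map(c => byName(c.toLowerCase))
  }

  def main(args: Array[String]): Unit = {
    val cols = Seq("CoL1", "CoL5", "CoL3")
    println(project(attrsByFilter, cols).mkString("[", ",", "]")) // prints [c,e,a]
    println(project(attrsByLookup, cols).mkString("[", ",", "]")) // prints [a,e,c]
  }
}
```

In this model the filtered attributes come out as cOl1, cOl3, cOl5, so pairing them by position with the row (c, a, e) binds cOl1 to c and cOl3 to a, which reproduces the wrong row.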

How was this patch tested?

Added a test case

gatorsmile · 2018-02-27T05:26:45Z

sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuery.scala

-    relation.output.filter(a => partColumns.contains(a.name.toLowerCase))
+    val attrMap = relation.output.map(_.name).zip(relation.output).toMap
+    partitionColumnNames.map { colName =>
+      attrMap.getOrElse(colName,


Do we need to consider the case sensitivity when comparing the names? cc @cloud-fan
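
One way to address the case-sensitivity question is to normalize both the map keys and the lookup key. The sketch below is only an illustration in plain Scala and is not necessarily what was committed. `Attr` and `resolve` are hypothetical names, and the exception thrown on a missing column is an assumption, since the diff above is truncated before its default branch.

```scala
import java.util.Locale

object CaseInsensitiveLookupSketch {
  // Hypothetical stand-in for a Catalyst attribute.
  final case class Attr(name: String)

  // Resolve partition column names against the output, ignoring case,
  // and return the attributes in partition-column order.
  def resolve(partitionColumnNames: Seq[String], output: Seq[Attr]): Seq[Attr] = {
    val attrMap = output.map(a => a.name.toLowerCase(Locale.ROOT) -> a).toMap
    partitionColumnNames.map { colName =>
      attrMap.getOrElse(
        colName.toLowerCase(Locale.ROOT),
        // Assumed error handling; the real default branch is not shown above.
        throw new NoSuchElementException(s"Unable to find column $colName"))
    }
  }
}
```

Using `Locale.ROOT` keeps the lowercasing independent of the JVM's default locale.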

cloud-fan · 2018-02-27T07:34:31Z

good catch! LGTM

SparkQA · 2018-02-27T08:05:01Z

Test build #87707 has finished for PR 20684 at commit 1bfaef8.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-02-27T08:05:02Z

Test build #87700 has finished for PR 20684 at commit ce702c7.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2018-02-27T08:07:27Z

retest this please.

jiangxb1987

LGTM

SparkQA · 2018-02-27T10:25:05Z

Test build #87708 has finished for PR 20684 at commit 1bfaef8.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-02-27T10:33:40Z

Test build #87709 has finished for PR 20684 at commit 1bfaef8.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-02-27T11:47:24Z

retest this please

SparkQA · 2018-02-27T15:19:45Z

Test build #87716 has finished for PR 20684 at commit 1bfaef8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2018-02-28T05:43:50Z

Hi, @gatorsmile and @cloud-fan.
Since the 2.3 vote passed, can we have this in branch-2.3 for Apache Spark 2.3.1?

gatorsmile · 2018-02-28T08:10:32Z

We are still waiting for the official announcement of Spark 2.3 release. This will be merged to 2.3.1 for sure.

dongjoon-hyun · 2018-02-28T16:24:27Z

I see. Thank you for the confirmation, @gatorsmile!

dongjoon-hyun · 2018-03-01T18:21:58Z

Gentle ping, @gatorsmile, since 2.3 was officially announced yesterday.

…zeMetadataOnlyQuery

## What changes were proposed in this pull request?

```Scala
val tablePath = new File(s"${path.getCanonicalPath}/cOl3=c/cOl1=a/cOl5=e")
Seq(("a", "b", "c", "d", "e")).toDF("cOl1", "cOl2", "cOl3", "cOl4", "cOl5")
  .write.json(tablePath.getCanonicalPath)
val df = spark.read.json(path.getCanonicalPath).select("CoL1", "CoL5", "CoL3").distinct()
df.show()
```

It generates a wrong result.

```
[c,e,a]
```

We have a bug in the rule `OptimizeMetadataOnlyQuery`. We should respect the attribute order in the original leaf node. This PR is to fix it.

## How was this patch tested?

Added a test case

Author: gatorsmile <gatorsmile@gmail.com>

Closes apache#20684 from gatorsmile/optimizeMetadataOnly.

…he rule OptimizeMetadataOnlyQuery

This PR is to backport #20684 and #20693 to Spark 2.3 branch

---

## What changes were proposed in this pull request?

```Scala
val tablePath = new File(s"${path.getCanonicalPath}/cOl3=c/cOl1=a/cOl5=e")
Seq(("a", "b", "c", "d", "e")).toDF("cOl1", "cOl2", "cOl3", "cOl4", "cOl5")
  .write.json(tablePath.getCanonicalPath)
val df = spark.read.json(path.getCanonicalPath).select("CoL1", "CoL5", "CoL3").distinct()
df.show()
```

It generates a wrong result.

```
[c,e,a]
```

We have a bug in the rule `OptimizeMetadataOnlyQuery`. We should respect the attribute order in the original leaf node. This PR is to fix it.

## How was this patch tested?

Added a test case

Author: Xingbo Jiang <xingbo.jiang@databricks.com>
Author: gatorsmile <gatorsmile@gmail.com>

Closes #20763 from gatorsmile/backport23523.

gatorsmile added 2 commits February 26, 2018 21:18

fix.

292e87f

move

ce702c7

gatorsmile commented Feb 27, 2018

fix

1bfaef8

jiangxb1987 approved these changes Feb 27, 2018

asfgit closed this in 414ee86 Feb 27, 2018

gatorsmile mentioned this pull request Mar 7, 2018

[SPARK-23523] [SQL] [BACKPORT-2.3] Fix the incorrect result caused by the rule OptimizeMetadataOnlyQuery #20763

Closed

mkressirer mentioned this pull request Mar 13, 2018

[SPARK-23523][SQL][BACKPORT-2.3] Fix the incorrect result caused by t… toasttab/spark#13

Merged
