[SPARK-23614][SQL] Fix incorrect reuse exchange when caching is used #20831

viirya · 2018-03-15T06:42:44Z

What changes were proposed in this pull request?

We should provide a customized canonicalized plan for InMemoryRelation and InMemoryTableScanExec. Otherwise, two different cached plans can wrongly be treated as producing the same result, which in turn causes an exchange to be reused incorrectly.

For a test query like this:

val cached = spark.createDataset(Seq(TestDataUnion(1, 2, 3), TestDataUnion(4, 5, 6))).cache()
val group1 = cached.groupBy("x").agg(min(col("y")) as "value")
val group2 = cached.groupBy("x").agg(min(col("z")) as "value")
group1.union(group2)

Canonicalized plans before:

First exchange:

Exchange hashpartitioning(none#0, 5)
+- *(1) HashAggregate(keys=[none#0], functions=[partial_min(none#1)], output=[none#0, none#4])
   +- *(1) InMemoryTableScan [none#0, none#1]
         +- InMemoryRelation [x#4253, y#4254, z#4255], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
               +- LocalTableScan [x#4253, y#4254, z#4255]

Second exchange:

Exchange hashpartitioning(none#0, 5)
+- *(3) HashAggregate(keys=[none#0], functions=[partial_min(none#1)], output=[none#0, none#4])
   +- *(3) InMemoryTableScan [none#0, none#1]
         +- InMemoryRelation [x#4253, y#4254, z#4255], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
               +- LocalTableScan [x#4253, y#4254, z#4255]

Note that the two canonicalized plans are identical, even though the two InMemoryTableScans read different columns.

Canonicalized plans after:

First exchange:

Exchange hashpartitioning(none#0, 5)
+- *(1) HashAggregate(keys=[none#0], functions=[partial_min(none#1)], output=[none#0, none#4])
   +- *(1) InMemoryTableScan [none#0, none#1]
         +- InMemoryRelation [none#0, none#1, none#2], true, 10000, StorageLevel(memory, 1 replicas)
               +- LocalTableScan [none#0, none#1, none#2]

Second exchange:

Exchange hashpartitioning(none#0, 5)
+- *(3) HashAggregate(keys=[none#0], functions=[partial_min(none#1)], output=[none#0, none#4])
   +- *(3) InMemoryTableScan [none#0, none#2]
         +- InMemoryRelation [none#0, none#1, none#2], true, 10000, StorageLevel(memory, 1 replicas)
               +- LocalTableScan [none#0, none#1, none#2]
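
The difference between the two sets of plans can be sketched with a toy model. This is not Spark's code: Attr, Scan, canonicalizeByPosition and canonicalizeByOrdinal are hypothetical names introduced only for illustration. It assumes the "before" behaviour amounts to numbering a scan's columns by their position in the scan's own attribute list, and the "after" behaviour to numbering them by their ordinal in the cached relation's output.

```scala
object CanonicalizeSketch {
  // Toy stand-in for an attribute reference such as y#4254 (name plus expression id).
  final case class Attr(name: String, exprId: Long)

  // Toy stand-in for InMemoryTableScan: the columns it reads from the cached relation.
  final case class Scan(attributes: Seq[Attr])

  // Positional numbering: an id depends only on where the column sits in the
  // scan's own list, so [x, y] and [x, z] both become [none#0, none#1].
  def canonicalizeByPosition(scan: Scan): Scan =
    Scan(scan.attributes.zipWithIndex.map { case (_, i) => Attr("none", i.toLong) })

  // Ordinal numbering: an id is the column's index in the relation's output,
  // so [x, y] becomes [none#0, none#1] while [x, z] becomes [none#0, none#2].
  def canonicalizeByOrdinal(scan: Scan, relationOutput: Seq[Attr]): Scan =
    Scan(scan.attributes.map { a =>
      val ordinal = relationOutput.indexWhere(_.exprId == a.exprId)
      if (ordinal == -1) a else Attr("none", ordinal.toLong)
    })

  val x = Attr("x", 4253L)
  val y = Attr("y", 4254L)
  val z = Attr("z", 4255L)
  val relationOutput: Seq[Attr] = Seq(x, y, z)
  val scanXY = Scan(Seq(x, y))
  val scanXZ = Scan(Seq(x, z))

  def main(args: Array[String]): Unit = {
    // Prints true: the two different scans collide.
    println(canonicalizeByPosition(scanXY) == canonicalizeByPosition(scanXZ))
    // Prints false: the two scans stay distinct.
    println(canonicalizeByOrdinal(scanXY, relationOutput) ==
      canonicalizeByOrdinal(scanXZ, relationOutput))
  }
}
```

Under this model, exchange reuse keyed on the positional form would treat min(y) and min(z) as the same computation, which matches the wrong reuse described above.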

How was this patch tested?

Added unit test.

SparkQA · 2018-03-15T07:05:01Z

Test build #88256 has finished for PR 20831 at commit e1f28e2.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2018-03-15T07:36:32Z

retest this please.

maropu · 2018-03-15T08:34:34Z

sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala

@@ -68,6 +69,15 @@ case class InMemoryRelation(

  override protected def innerChildren: Seq[SparkPlan] = Seq(child)

+  override def doCanonicalize(): logical.LogicalPlan =
+    copy(output = output.map(QueryPlan.normalizeExprId(_, child.output)),
+      storageLevel = new StorageLevel(),


StorageLevel.NONE?

SparkQA · 2018-03-15T09:31:34Z

Test build #88258 has finished for PR 20831 at commit e1f28e2.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-03-15T10:56:32Z

Test build #88262 has finished for PR 20831 at commit a3f86b2.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2018-03-15T12:27:37Z

sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala

+      tableName = None)(
+      _cachedColumnBuffers,
+      sizeInBytesStats,
+      statsOfPlanToCache)


Do we need to copy this cached data?

_cachedColumnBuffers can't be null. If it is null, copy will trigger buildBuffers.

statsOfPlanToCache and sizeInBytesStats, too? For instance, ResolvedHint drops hints in canonicalization:

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/hints.scala

Line 44 in 3675af7

override def doCanonicalize(): LogicalPlan = child.canonicalized

cachedColumnBuffers, sizeInBytesStats, and statsOfPlanToCache are not considered when comparing two InMemoryRelations. So instead of creating empty instances of the statistics, I just use the original values.

aha, ok. Thanks!
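
The copy(...)(...) call in the diff above suggests that these three values live in a second parameter list of the case class. Assuming that is the case, the behaviour discussed in this thread follows from how Scala case classes work: equals and hashCode only look at the first parameter list, and copy only supplies defaults for the first list, so the second list must be passed again. A minimal sketch with a hypothetical Relation class (not Spark's InMemoryRelation):

```scala
object SecondParamListSketch {
  // Hypothetical stand-in: plan-defining fields in the first parameter list,
  // cached state in the second.
  final case class Relation(output: Seq[String], batchSize: Int)(val cachedBuffers: Array[Byte])

  val a = Relation(Seq("x", "y"), 10000)(Array[Byte](1, 2, 3))
  val b = Relation(Seq("x", "y"), 10000)(Array[Byte](9))

  // copy has defaults only for the first list; the cached state is handed over
  // explicitly, here by reusing the original reference rather than rebuilding it.
  val c = a.copy(batchSize = 5000)(a.cachedBuffers)

  def main(args: Array[String]): Unit = {
    println(a == b)                             // true: second list is ignored by equals
    println(a.hashCode == b.hashCode)           // true: and by hashCode
    println(a == c)                             // false: batchSize differs
    println(c.cachedBuffers eq a.cachedBuffers) // true: same buffer instance is reused
  }
}
```

So passing the original values through costs nothing in terms of plan comparison, which is consistent with the reply above.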

SparkQA · 2018-03-15T17:24:09Z

Test build #88266 has finished for PR 20831 at commit fb0f949.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2018-03-15T22:27:43Z

cc @cloud-fan

cloud-fan · 2018-03-20T23:15:32Z

sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala

@@ -68,6 +69,15 @@ case class InMemoryRelation(

  override protected def innerChildren: Seq[SparkPlan] = Seq(child)

+  override def doCanonicalize(): logical.LogicalPlan =
+    copy(output = output.map(QueryPlan.normalizeExprId(_, child.output)),
+      storageLevel = StorageLevel.NONE,


can we follow the parameter order in the constructor?

It is followed. I just left out useCompression and batchSize, as they are primitives and don't need to be canonicalized here.

cloud-fan · 2018-03-20T23:17:06Z

sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala

@@ -169,7 +174,10 @@ case class InMemoryTableScanExec(
  override def outputOrdering: Seq[SortOrder] =
    relation.child.outputOrdering.map(updateAttribute(_).asInstanceOf[SortOrder])

-  private def statsFor(a: Attribute) = relation.partitionStatistics.forAttribute(a)
+  // When we make canonicalized plan, we can't find a normalized attribute in this map.
+  // We return a `ColumnStatisticsSchema` for normalized attribute in this case.


It looks weird to call statsFor on a canonicalized InMemoryTableScanExec, can we just make buildFilter a lazy val?

I tried that at the beginning. However, partitionFilters uses buildFilter. Making partitionFilters a lazy val doesn't work because, when copy is called, the initialization of the new InMemoryTableScanExec tries to materialize partitionFilters in order to copy its value.

Turning partitionFilters and buildFilter into methods is not enough either: we would also need to remove @transient from relation and InMemoryRelation.partitionStatistics. So I don't think it is worth it, and I left it as is.

I don't get it. Regardless of how copy is implemented in Scala, ideally we can just mark buildFilter and partitionFilters as lazy and, in doCanonicalize, create a new InMemoryTableScanExec, which won't materialize partitionFilters in either the current InMemoryTableScanExec or the new one.

One problem I can think of is serializing a canonicalized InMemoryTableScanExec, but that should never happen.

Ah, sorry, I got it wrong. The reason it doesn't work is that relation is @transient. partitionFilters needs to be non-lazy; otherwise, when we need to access relation on an executor, we will get a NullPointerException.

And I think it isn't worth removing @transient from relation and InMemoryRelation.partitionStatistics just for this. So I leave it as is.

This can be solved if we add a val stats = relation.partitionStatistics, can't it?

Yes. I think so. Updated.
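
The interaction between @transient and lazy evaluation discussed in this thread can be reproduced outside Spark with plain Java serialization. The sketch below is illustrative only: Stats, Relation, ScanViaRelation and ScanViaStats are hypothetical stand-ins, and the round trip through a byte array plays the role of shipping a plan from the driver to an executor.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object TransientSketch {
  // Stats plays the role of the partition statistics; Relation plays the role
  // of the driver-only cached relation and is deliberately not Serializable.
  final class Stats(val upperBound: Int) extends Serializable
  final class Relation(val stats: Stats)

  // The lazy value reads through the transient field. If it is first evaluated
  // after deserialization, relation is null.
  final class ScanViaRelation(@transient val relation: Relation) extends Serializable {
    lazy val bound: Int = relation.stats.upperBound
  }

  // The statistics are captured eagerly in a field that is serialized, and the
  // lazy value reads that field instead of the transient relation.
  final class ScanViaStats(@transient val relation: Relation) extends Serializable {
    val stats: Stats = relation.stats
    lazy val bound: Int = stats.upperBound
  }

  // Serialize to bytes and read back, as a stand-in for driver-to-executor shipping.
  def roundTrip[T <: AnyRef](obj: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  // True if evaluating the lazy value after the round trip hits the null relation.
  def failsAfterRoundTrip: Boolean =
    try { roundTrip(new ScanViaRelation(new Relation(new Stats(42)))).bound; false }
    catch { case _: NullPointerException => true }

  // The lazy value evaluated after the round trip, using the captured statistics.
  def boundAfterRoundTrip: Int =
    roundTrip(new ScanViaStats(new Relation(new Stats(42)))).bound
}
```

Capturing the statistics in their own field keeps relation transient while still letting the filter-related values stay lazy, which appears to be the design the thread settled on.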

cloud-fan · 2018-03-20T23:18:29Z

good catch! LGTM except a few comments

cloud-fan · 2018-03-21T23:57:38Z

LGTM, pending jenkins

viirya · 2018-03-21T23:59:54Z

Thanks!

SparkQA · 2018-03-22T02:51:48Z

Test build #88493 has finished for PR 20831 at commit a592882.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2018-03-22T02:55:22Z

retest this please.

SparkQA · 2018-03-22T06:17:52Z

Test build #88497 has finished for PR 20831 at commit a592882.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #20831 from viirya/SPARK-23614.
(cherry picked from commit b2edc30)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

cloud-fan · 2018-03-23T04:24:25Z

thanks, merging to master/2.3!

viirya · 2018-03-23T04:30:56Z

Thanks! @cloud-fan

Fix incorrect reuse exchange when caching is used.

e1f28e2

maropu reviewed Mar 15, 2018

Address comment.

a3f86b2

maropu reviewed Mar 15, 2018

Fix error.

fb0f949

cloud-fan reviewed Mar 20, 2018

Make few variables as lazy.

a592882

asfgit closed this in b2edc30 Mar 23, 2018

viirya deleted the SPARK-23614 branch December 27, 2023 18:21
