[SPARK-13671] [SPARK-13311] [SQL] Use different physical plans for RDD and data sources #11514

davies · 2016-03-04T08:05:25Z

What changes were proposed in this pull request?

This PR splits PhysicalRDD into two classes, PhysicalRDD and PhysicalScan. PhysicalRDD is used for DataFrames created from an existing RDD. PhysicalScan is used for DataFrames created from data sources. This enables us to apply different optimizations to each of them.

Also fixes the problem with sameResult() when comparing two DataSourceScan nodes.

Also fixes the equality check and toString for In. It would be better to use Seq there, but we can't break this public API (sad).
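For context on the `In` point: a Scala case class compares an `Array` field by reference, so two filters holding the same values are not equal unless `equals`/`hashCode` are overridden. A minimal sketch of the issue, where `NaiveIn` and `FixedIn` are hypothetical stand-ins and not the actual Spark classes or the exact code in this patch:

```scala
object InEqualitySketch {
  // Default case-class equality compares the Array field by reference.
  case class NaiveIn(attribute: String, values: Array[Any])

  // Hypothetical variant that compares, hashes and prints the array by content.
  case class FixedIn(attribute: String, values: Array[Any]) {
    override def equals(other: Any): Boolean = other match {
      case FixedIn(a, vs) => a == attribute && vs.toSeq == values.toSeq
      case _ => false
    }
    // Keep hashCode consistent with the content-based equals above.
    override def hashCode(): Int = 31 * attribute.hashCode + values.toSeq.hashCode
    override def toString: String = s"In($attribute, [${values.mkString(",")}])"
  }
}
```

With these definitions, two `NaiveIn("a", Array[Any](1, 2))` instances are unequal, while the corresponding `FixedIn` instances are equal and share a hash code. That matters whenever filters are compared as part of a plan.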

How was this patch tested?

Existing tests. Manually tested with TPCDS queries Q59 and Q64: all the duplicated exchanges can now be reused, and there is a 40+% performance improvement (saving half of the scan).

marmbrus · 2016-03-04T08:28:54Z

sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala

+}
+
+/** Physical plan node for scanning data from a relation. */
+private[sql] case class PhysicalScan(


DataSourceScan?

marmbrus · 2016-03-04T08:29:09Z

will conflict with #11509

SparkQA · 2016-03-04T09:42:52Z

Test build #52453 has finished for PR 11514 at commit 0e78b3a.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

liancheng · 2016-03-04T11:08:40Z

sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala

+  override def simpleString: String = {
+    s"RDD $nodeName${output.mkString("[", ",", "]")}"
+  }
+}


Should we override outputPartitioning and set it to UnknownPartitioning(rdd.partitions.length)?

davies · 2016-03-05

If a partitioning is UnknownPartitioning, the number of partitions is meaningless, I think.

SparkQA · 2016-03-07T23:51:05Z

Test build #52610 has finished for PR 11514 at commit d2d2062.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-08T07:31:17Z

Test build #2618 has finished for PR 11514 at commit d2d2062.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-08T07:38:41Z

Test build #2619 has finished for PR 11514 at commit d2d2062.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-08T07:42:05Z

Test build #52644 has finished for PR 11514 at commit 0278fd9.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

## What changes were proposed in this pull request?

A query can contain common parts, for example in a self join. It would be good to avoid computing the duplicated part more than once, to save CPU and memory (Broadcast or cache).

Exchange materializes the underlying RDD by shuffle or collect, so it is a great point to check for duplicates and reuse them. Duplicated exchanges are exchanges that generate exactly the same result inside a query.

To find the duplicated exchanges, we need to be able to compare SparkPlans to check whether they have the same results. We already have that for LogicalPlan, so we should move it into QueryPlan to make it available for SparkPlan.

Once we can find the duplicated exchanges, we should replace all of them with the same SparkPlan object (possibly wrapped by ReusedExchange for explain); the plan tree then becomes a DAG. Since the planner only works with trees, this rule should be the last one in the entire planning.

After the rule, the plan will look like:

```
WholeStageCodegen
:  +- Project [id#0L]
:     +- BroadcastHashJoin [id#0L], [id#2L], Inner, BuildRight, None
:        :- Project [id#0L]
:        :  +- BroadcastHashJoin [id#0L], [id#1L], Inner, BuildRight, None
:        :     :- Range 0, 1, 4, 1024, [id#0L]
:        :     +- INPUT
:        +- INPUT
:- BroadcastExchange HashedRelationBroadcastMode(true,List(id#1L),List(id#1L))
:  +- WholeStageCodegen
:     :  +- Range 0, 1, 4, 1024, [id#1L]
+- ReusedExchange [id#2L], BroadcastExchange HashedRelationBroadcastMode(true,List(id#1L),List(id#1L))
```

![bjoin](https://cloud.githubusercontent.com/assets/40902/13414787/209e8c5c-df0a-11e5-8a0f-edff69d89e83.png)

For a three-way SortMergeJoin:

```
== Physical Plan ==
WholeStageCodegen
:  +- Project [id#0L]
:     +- SortMergeJoin [id#0L], [id#4L], None
:        :- INPUT
:        +- INPUT
:- WholeStageCodegen
:  :  +- Project [id#0L]
:  :     +- SortMergeJoin [id#0L], [id#3L], None
:  :        :- INPUT
:  :        +- INPUT
:  :- WholeStageCodegen
:  :  :  +- Sort [id#0L ASC], false, 0
:  :  :     +- INPUT
:  :  +- Exchange hashpartitioning(id#0L, 200), None
:  :     +- WholeStageCodegen
:  :        :  +- Range 0, 1, 4, 33554432, [id#0L]
:  +- WholeStageCodegen
:     :  +- Sort [id#3L ASC], false, 0
:     :     +- INPUT
:     +- ReusedExchange [id#3L], Exchange hashpartitioning(id#0L, 200), None
+- WholeStageCodegen
   :  +- Sort [id#4L ASC], false, 0
   :     +- INPUT
   +- ReusedExchange [id#4L], Exchange hashpartitioning(id#0L, 200), None
```

![sjoin](https://cloud.githubusercontent.com/assets/40902/13414790/27aea61c-df0a-11e5-8cbf-fbc985c31d95.png)

If the same ShuffleExchange or BroadcastExchange is shared, execute()/executeBroadcast() will be called by different parents, so it should cache the RDD/Broadcast and return the same one to all the parents.

## How was this patch tested?

Added some unit tests for this. Also did some manual tests on TPCDS queries Q59 and Q64; we can see that some exchanges are reused (this requires a change to sameResult in PhysicalRDD, which is done in #11514).

Author: Davies Liu <davies@databricks.com>

Closes #11403 from davies/dedup.
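The deduplication idea described above can be sketched on a toy plan tree. This is only an illustration under stated assumptions: `Plan`, `RangeScan`, `Exchange`, `Join`, `ReusedExchange` and `reuseExchanges` below are hypothetical stand-ins, not Spark's classes, and `sameResult` is reduced to structural equality (the real check also has to normalize things like expression ids).

```scala
import scala.collection.mutable.ArrayBuffer

object ReuseExchangeSketch {
  // Toy plan nodes; hypothetical, not Spark's SparkPlan hierarchy.
  sealed trait Plan
  case class RangeScan(start: Long, end: Long) extends Plan
  case class Exchange(partitioning: String, child: Plan) extends Plan
  case class Join(left: Plan, right: Plan) extends Plan
  case class ReusedExchange(original: Exchange) extends Plan

  // Stand-in for sameResult: plain structural equality.
  def sameResult(a: Plan, b: Plan): Boolean = a == b

  // Keep the first exchange of each kind; replace later duplicates with a reference to it.
  def reuseExchanges(plan: Plan): Plan = {
    // Pairs of (exchange as it appeared in the input, exchange kept in the output).
    val seen = ArrayBuffer.empty[(Exchange, Exchange)]
    def visit(p: Plan): Plan = p match {
      case e: Exchange =>
        seen.collectFirst { case (orig, kept) if sameResult(orig, e) => kept } match {
          case Some(kept) => ReusedExchange(kept)
          case None =>
            val kept = e.copy(child = visit(e.child))
            seen += ((e, kept))
            kept
        }
      case Join(l, r) => Join(visit(l), visit(r))
      case leaf => leaf
    }
    visit(plan)
  }
}
```

For a three-way join over the same exchange, `reuseExchanges(Join(Join(ex, ex.copy()), ex.copy()))` keeps the first `ex` and turns the other two into `ReusedExchange(ex)`, which mirrors the shape of the SortMergeJoin plan shown above.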

SparkQA · 2016-03-09T23:36:14Z

Test build #52775 has finished for PR 11514 at commit 6cfa545.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class In(attribute: String, values: Seq[Any]) extends Filter

davies · 2016-03-10T00:51:55Z

sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SparkPlanGraph.scala

@@ -93,6 +93,10 @@ private[sql] object SparkPlanGraph {
      case "Subquery" if subgraph != null =>
        // Subquery should not be included in WholeStageCodegen
        buildSparkPlanGraphNode(planInfo, nodeIdGenerator, nodes, edges, parent, null, exchanges)
+      case "ReusedExchange" =>


This case was lost while fixing conflicts in the last PR (#11403).
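The body of the restored case is not shown in this hunk. As a rough sketch of what such a case might do when building a plan graph (an assumption, not the code from this patch), a reused exchange can be drawn as an extra edge from the already-built exchange node rather than as a new node. `PlanInfo`, `Node`, `Edge` and `build` below are hypothetical names:

```scala
import scala.collection.mutable

object PlanGraphSketch {
  // Toy types; hypothetical, not the SparkPlanGraph classes.
  case class PlanInfo(nodeName: String, children: Seq[PlanInfo])
  case class Node(id: Int, name: String)
  case class Edge(fromId: Int, toId: Int)

  def build(root: PlanInfo): (Seq[Node], Seq[Edge]) = {
    val nodes = mutable.ArrayBuffer.empty[Node]
    val edges = mutable.ArrayBuffer.empty[Edge]
    // Exchanges already turned into graph nodes, keyed by their plan info.
    val exchanges = mutable.HashMap.empty[PlanInfo, Node]

    def visit(info: PlanInfo, parent: Option[Node]): Unit = info.nodeName match {
      case "ReusedExchange" =>
        // No new node: point from the existing exchange node to this parent.
        for {
          p <- parent
          original <- info.children.headOption
          node <- exchanges.get(original)
        } edges += Edge(node.id, p.id)
      case name =>
        val node = Node(nodes.size, name)
        nodes += node
        if (name.contains("Exchange")) exchanges(info) = node
        parent.foreach(p => edges += Edge(node.id, p.id))
        info.children.foreach(visit(_, Some(node)))
    }

    visit(root, None)
    (nodes.toSeq, edges.toSeq)
  }
}
```

This sketch relies on the original exchange being visited before the node that reuses it; a reused exchange whose original has not been seen yet is silently skipped.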

SparkQA · 2016-03-10T02:15:14Z

Test build #52788 has finished for PR 11514 at commit c4ea2e8.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class In(attribute: String, values: Array[Any]) extends Filter

SparkQA · 2016-03-10T23:01:19Z

Test build #52855 has finished for PR 11514 at commit c159b25.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-11T00:52:27Z

Test build #52865 has finished for PR 11514 at commit b482d2c.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

nongli · 2016-03-11T01:24:37Z

lgtm

SparkQA · 2016-03-12T01:47:10Z

Test build #52966 has finished for PR 11514 at commit b3d2df0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

davies · 2016-03-12T08:48:26Z

Merged into master

rxin · 2016-07-29T05:29:18Z

sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala

+
+  // Ignore rdd when checking results
+  override def sameResult(plan: SparkPlan ): Boolean = plan match {
+    case other: DataSourceScan => relation == other.relation && metadata == other.metadata


this is actually wrong because we cannot ignore the rdd, otherwise scans of different partitions are treated as "sameResult"!
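A toy illustration of the concern, assuming hypothetical `Relation` and `Scan` types (not Spark's DataSourceScan) in which `partitions` stands in for the data the underlying RDD would actually read:

```scala
object SameResultSketch {
  // Toy stand-ins; hypothetical, not Spark's classes.
  case class Relation(path: String)

  case class Scan(relation: Relation, metadata: Map[String, String], partitions: Seq[String]) {
    // Mirrors the comparison in the diff: the data actually scanned is ignored.
    def sameResultIgnoringData(other: Scan): Boolean =
      relation == other.relation && metadata == other.metadata

    // Hypothetical stricter check that also compares which partitions are read.
    def sameResultStrict(other: Scan): Boolean =
      sameResultIgnoringData(other) && partitions == other.partitions
  }
}
```

Two scans of the same relation with the same metadata but different `partitions` compare as equal under `sameResultIgnoringData`, so an exchange above one could be wrongly reused for the other; `sameResultStrict` tells them apart.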

separate physical RDD and scan

0e78b3a

marmbrus reviewed Mar 4, 2016

liancheng reviewed Mar 4, 2016

davies mentioned this pull request Mar 4, 2016

[SPARK-13523] [SQL] Reuse exchanges in a query #11403

Closed

fix conflict

d2d2062

davies force-pushed the existing_rdd branch from 3767c17 to d2d2062 Compare March 7, 2016 23:29

Davies Liu added 2 commits March 7, 2016 23:31

Merge branch 'master' of github.com:apache/spark into existing_rdd

c8db837

fix tests

0278fd9

Davies Liu added 2 commits March 9, 2016 14:08

Merge branch 'master' of github.com:apache/spark into existing_rdd

618e555

fix sameResult on DataSourceScan

6cfa545

fix In

c4ea2e8

davies changed the title ~~[SPARK-13671] [SQL] Use different physical plans for RDD and data sources~~ [SPARK-13671] [SPARK-13311] [SQL] Use different physical plans for RDD and data sources Mar 10, 2016

davies reviewed Mar 10, 2016

fix tests

c159b25

fix test

b482d2c

Davies Liu added 2 commits March 11, 2016 15:50

Merge branch 'master' of github.com:apache/spark into existing_rdd

5975560

fix tests

b3d2df0

asfgit closed this in ba8c86d Mar 12, 2016

rxin reviewed Jul 29, 2016