
[SPARK-24172][SQL] we should not apply operator pushdown to data source v2 many times #21230

Closed (2 commits)

Conversation

cloud-fan (Contributor) commented May 3, 2018

What changes were proposed in this pull request?

In `PushDownOperatorsToDataSource`, we use `transformUp` to match `PhysicalOperation` and apply pushdown. This is problematic if we have multiple `Filter` and `Project` above the data source v2 relation.

e.g. for a query

```
Project
  Filter
    DataSourceV2Relation
```

The pattern match will be triggered twice and we will do operator pushdown twice. This is unnecessary; we can use `mapChildren` to apply pushdown only once.
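
To make the double match concrete, here is a self-contained toy model in plain Scala (illustrative only: `Plan`, `Op`, and this local `transformUp` are simplified stand-ins, not the real Catalyst classes):

```scala
// Toy model: a bottom-up transform fires a PhysicalOperation-style pattern
// once at the Filter level and again at the Project level of the same stack.
object TransformUpDemo extends App {
  sealed trait Plan
  case class Relation(name: String) extends Plan
  case class Filter(child: Plan) extends Plan
  case class Project(child: Plan) extends Plan

  // Mimics Catalyst's transformUp: rewrite the children first, then the node.
  def transformUp(p: Plan)(rule: PartialFunction[Plan, Plan]): Plan = {
    val rewritten = p match {
      case Filter(c)  => Filter(transformUp(c)(rule))
      case Project(c) => Project(transformUp(c)(rule))
      case leaf       => leaf
    }
    rule.applyOrElse(rewritten, identity[Plan])
  }

  // Stand-in for PhysicalOperation: matches a Project/Filter stack on a relation.
  object Op {
    def unapply(p: Plan): Option[Relation] = p match {
      case Filter(c)  => strip(c)
      case Project(c) => strip(c)
      case _          => None
    }
    private def strip(p: Plan): Option[Relation] = p match {
      case Filter(c)   => strip(c)
      case Project(c)  => strip(c)
      case r: Relation => Some(r)
    }
  }

  var pushdowns = 0
  transformUp(Project(Filter(Relation("t")))) {
    case Op(r) => pushdowns += 1; r // "pushdown": collapse the stack onto the relation
  }
  println(s"pushdown applied $pushdowns times") // prints 2
}
```

With `mapChildren`, the rule body runs at most once per direct child of the root, so the whole `Project`/`Filter` stack is collapsed by a single application.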

How was this patch tested?

Existing tests.


SparkQA commented May 3, 2018

Test build #90140 has finished for PR 21230 at commit e224f8a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Inline review comment on an early version of the change, which guarded a `transformDown` with a `pushed` flag:

```scala
var pushed = false
plan transformDown {
  // PhysicalOperation guarantees that filters are deterministic; no need to check
  case PhysicalOperation(project, filters, relation: DataSourceV2Relation) if !pushed =>
```

Member commented:

Is it possible that one plan has multiple `PhysicalOperation` matches?

cloud-fan (Contributor, Author) replied May 4, 2018:

`PhysicalOperation` just accumulates the `Project` and `Filter` nodes above a specific node. If we transform down a tree and only transform once, we will never hit `PhysicalOperation` more than once.
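
As a rough sketch of that accumulation (simplified; the real `PhysicalOperation` in `org.apache.spark.sql.catalyst.planning` also rewrites expressions through aliases and checks that collected filters are deterministic):

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, NamedExpression}
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project}

// Simplified PhysicalOperation-style extractor: walk down through adjacent
// Project/Filter nodes, accumulating them until some other node is reached.
// A single match at the top consumes the whole stack, so one top-down
// application can only hit it once per relation.
object SimplePhysicalOperation {
  def unapply(plan: LogicalPlan): Option[(Seq[NamedExpression], Seq[Expression], LogicalPlan)] =
    Some(collect(plan))

  private def collect(plan: LogicalPlan): (Seq[NamedExpression], Seq[Expression], LogicalPlan) =
    plan match {
      case Project(projectList, child) =>
        val (_, filters, leaf) = collect(child)
        (projectList, filters, leaf)
      case Filter(condition, child) =>
        val (projects, filters, leaf) = collect(child)
        (projects, condition +: filters, leaf)
      case other => (Nil, Nil, other)
    }
}
```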

rdblue (Contributor) commented May 7, 2018

So it was the use of `transformUp` that caused this rule to match multiple times, right? In that case, would it make more sense to do what @marmbrus suggested in the immutable plan PR and make this a strategy instead of an optimizer rule?

That approach fits with what I suggested on #21118. We could have the scan node handle the filter and the projection so that it doesn't matter whether the source produces UnsafeRow or InternalRow.
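
For reference, a planner-side version of that idea might look roughly like the sketch below (this is not the code from #21262; `PushDownV2Strategy` and `pushedScan` are made-up names, and the pushdown body is stubbed out):

```scala
import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.expressions.{And, Expression, NamedExpression}
import org.apache.spark.sql.catalyst.planning.PhysicalOperation
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.{FilterExec, ProjectExec, SparkPlan}
import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation

object PushDownV2Strategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case PhysicalOperation(project, filters, relation: DataSourceV2Relation) =>
      // Pushdown happens exactly once, while converting to a physical plan,
      // so the optimized logical plan is never rewritten by a pushdown rule.
      val (scan, postScanFilters) = pushedScan(relation, project, filters)
      // Filters the source could not handle, plus the projection, are applied
      // on top of the scan.
      val filtered = postScanFilters.reduceLeftOption(And)
        .map(FilterExec(_, scan)).getOrElse(scan)
      ProjectExec(project, filtered) :: Nil
    case _ => Nil
  }

  // Placeholder: would return the pushed-down scan plus the filters left
  // over for Spark to evaluate. See #21262 for a real implementation.
  private def pushedScan(
      relation: DataSourceV2Relation,
      project: Seq[NamedExpression],
      filters: Seq[Expression]): (SparkPlan, Seq[Expression]) = ???
}
```

Since `FilterExec` and `ProjectExec` consume `InternalRow` (of which `UnsafeRow` is a subclass), the leftover work runs the same way regardless of which row representation the source emits.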

rdblue (Contributor) commented May 8, 2018

@cloud-fan, I opened #21262, which is similar to this but does pushdown when converting to a physical plan. You might like it as an alternative because it cleans up `DataSourceV2Relation` quite a bit and adds `output` to the case class arguments like other relations.

The drawback to that approach, which I had forgotten about, is that it breaks `computeStats`, because that runs on the optimized plan (and this affects all the other code paths as well).

Up to you how to continue with this work; I just think we should consider the other approach since it solves a few problems. And `computeStats` is something we should update to work on physical plans anyway, right? Just let me know how you want to move forward. If you want to pull that commit into this PR, I'll close the other one.

cloud-fan (Contributor, Author) commented:

Hi @rdblue, thanks for your new approach! As you said, the major problem is statistics. This is unfortunately a flaw in Spark's CBO design: statistics should belong to the physical node, but they currently belong to the logical node.

For file-based data sources, since they are built-in, we can create rules to update statistics during the logical phase, e.g. `PruneFileSourcePartitions`. But for external sources like Iceberg, we cannot update statistics before planning, so a shuffle join may be wrongly planned when a broadcast join is applicable. In other words, users may need to write custom optimizer rules to make their data sources work well.
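
To illustrate that burden, a user-side rule could be registered roughly like this (entirely hypothetical: `RefreshV2Stats` and `withFreshStats` are made-up names; only the `SparkSessionExtensions` hook is a real API):

```scala
import org.apache.spark.sql.SparkSessionExtensions
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation

// Hypothetical source-specific rule that refreshes a relation's statistics
// after pushdown, so joins get planned (broadcast vs. shuffle) against
// realistic size estimates.
case class RefreshV2Stats() extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case r: DataSourceV2Relation => withFreshStats(r)
  }
  // Stub: a real source would re-estimate sizes from its own metadata here.
  private def withFreshStats(r: DataSourceV2Relation): LogicalPlan = r
}

// Wired up through the real extension point
// (spark.sql.extensions=com.example.MyExtensions).
class MyExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(ext: SparkSessionExtensions): Unit =
    ext.injectOptimizerRule(_ => RefreshV2Stats())
}
```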

That said, I do like your approach if we can fix the statistics problem first. I'm not sure how hard that is or how soon it can be fixed. cc @wzhfy

Until then, I'd like to keep the pushdown logic in the optimizer and leave the hard work to Spark instead of users. What do you think?

rdblue (Contributor) commented May 8, 2018

Sounds good to me. Let's plan on getting this one in to fix the current problem, and commit the other approach when stats are fixed.

cloud-fan (Contributor, Author) commented:

retest this please

SparkQA commented May 10, 2018

Test build #90436 has finished for PR 21230 at commit e224f8a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

rdblue (Contributor) commented May 10, 2018

+1 (assuming tests pass)

```diff
@@ -23,17 +23,10 @@ import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project}
 import org.apache.spark.sql.catalyst.rules.Rule

 object PushDownOperatorsToDataSource extends Rule[LogicalPlan] {
-  override def apply(
-      plan: LogicalPlan): LogicalPlan = plan transformUp {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan.mapChildren {
```
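
Pieced together from the hunk above, the rule now has roughly this shape (a condensed sketch; `pushDown` is a stand-in for the unchanged pushdown body, which is elided here):

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, NamedExpression}
import org.apache.spark.sql.catalyst.planning.PhysicalOperation
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation

object PushDownOperatorsToDataSource extends Rule[LogicalPlan] {
  // mapChildren rewrites only the direct children of the node, instead of
  // transformUp's full bottom-up traversal, so the PhysicalOperation pattern
  // can no longer fire once per Project/Filter layer of the same stack.
  override def apply(plan: LogicalPlan): LogicalPlan = plan.mapChildren {
    // PhysicalOperation guarantees that filters are deterministic; no need to check
    case PhysicalOperation(project, filters, relation: DataSourceV2Relation) =>
      pushDown(project, filters, relation)
    case other => other
  }

  // Elided: the existing pushdown logic from the PR.
  private def pushDown(
      project: Seq[NamedExpression],
      filters: Seq[Expression],
      relation: DataSourceV2Relation): LogicalPlan = ???
}
```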

Member commented on the diff:

Could you update the PR description (`transformDown` -> `mapChildren`), too?

SparkQA commented May 10, 2018

Test build #90467 has finished for PR 21230 at commit f73440c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

dongjoon-hyun (Member) left a review:

+1, LGTM.

SparkQA commented May 11, 2018

Test build #90488 has finished for PR 21230 at commit 953cd7a.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

kiszk (Member) commented May 11, 2018

retest this please

SparkQA commented May 11, 2018

Test build #90506 has finished for PR 21230 at commit 953cd7a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

gatorsmile (Member) commented:

LGTM. Thanks! Merged to master.

gengliangwang (Member) left a review:

LGTM

asfgit closed this in 928845a on May 11, 2018.
robert3005 pushed a commit to palantir/spark that referenced this pull request Jun 24, 2018
[SPARK-24172][SQL] we should not apply operator pushdown to data source v2 many times


Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#21230 from cloud-fan/step2.