
[SPARK-24478][SQL] Move projection and filter push down to physical conversion #21503

Conversation

rdblue
Contributor

@rdblue rdblue commented Jun 6, 2018

What changes were proposed in this pull request?

This removes the v2 optimizer rule for push-down and instead pushes filters and required columns when converting to a physical plan, as suggested by @marmbrus. This makes the v2 relation cleaner because the output and filters do not change in the logical plan.

A side-effect of this change is that the stats from the logical (optimized) plan no longer reflect pushed filters and projection. This is a temporary state, until the planner gathers stats from the physical plan instead. An alternative to this approach is rdblue@9d3a11e.
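Concretely, the shape of the change is roughly the following. This is a toy Scala sketch of the idea only, not code from this PR; all names (PushableFilter, ToyRelation, ToyReader, planScan) are illustrative, and pruneColumns/pushFilters mirror the v2 reader mix-ins.

case class PushableFilter(column: String, value: Int)

// Logical side: the relation keeps its full output and is never mutated
// by an optimizer push-down rule.
case class ToyRelation(table: String, output: Seq[String])

// Reader configured at planning time, mirroring the v2 read API.
class ToyReader(table: String) {
  var requiredColumns: Seq[String] = Nil
  var pushedFilters: Seq[PushableFilter] = Nil
  def pruneColumns(cols: Seq[String]): Unit = requiredColumns = cols
  // Returns the filters the source cannot evaluate, which Spark must apply
  // after the scan; this toy source accepts everything.
  def pushFilters(filters: Seq[PushableFilter]): Seq[PushableFilter] = {
    pushedFilters = filters
    Nil
  }
}

// Physical conversion: the only place projection and filters are pushed.
def planScan(
    rel: ToyRelation,
    project: Seq[String],
    filters: Seq[PushableFilter]): ToyReader = {
  val reader = new ToyReader(rel.table)
  reader.pruneColumns(project)
  val postScanFilters = reader.pushFilters(filters)
  assert(postScanFilters.isEmpty) // the toy reader accepted every filter
  reader
}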

The first commit was proposed in #21262. This PR replaces #21262.

How was this patch tested?

Existing tests.

@SparkQA

SparkQA commented Jun 7, 2018

Test build #91507 has finished for PR 21503 at commit 9d3a11e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait SupportsPhysicalStats extends LeafNode
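The trait in the last bullet comes from the alternative stats commit (rdblue@9d3a11e). A hypothetical Scala shape, inferred from the discussion below rather than copied from that commit:

import org.apache.spark.sql.catalyst.expressions.{Expression, NamedExpression}
import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics}

// A leaf plan node implementing this could report statistics that account
// for the filters and projection that will be pushed at physical conversion.
trait SupportsPhysicalStats extends LeafNode {
  def computeStats(
      projection: Seq[NamedExpression],
      filters: Seq[Expression]): Statistics
}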

@rdblue
Contributor Author

rdblue commented Jun 12, 2018

@cloud-fan, this is the PR for moving push-down to the physical plan conversion and reporting the stats correctly. Sorry for the confusion; I had sent a link to just the second commit.

(projectSet ++ filterSet).toSeq
}

val reader = relation.newReader
Contributor

To confirm: do we now have to do operator pushdown twice? Once in the plan visitor to calculate statistics, and once here to build the physical plan, right?

Contributor Author

It will configure two readers: one for the pushdown when converting to a physical plan, and one for stats. The stats one should be temporary, though, since we eventually want to gather stats from the physical plan instead. Configuring two readers instead of one decouples the two problems, so we can move forward with pushdown that works like the other data sources.
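Continuing the toy sketch from the description above, the decoupling looks like this; a fresh reader per call site means the stats reader shares no state with the scan reader:

def newToyReader(): ToyReader = new ToyReader("t")

val statsReader = newToyReader()    // used only to estimate statistics
statsReader.pushFilters(Seq(PushableFilter("a", 1)))

val scanReader = newToyReader()     // configured during physical planning
scanReader.pruneColumns(Seq("a"))
// statsReader.requiredColumns is still Nil: nothing leaks between the two.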

Contributor

It's nice to decouple the problem and do pushdown during planning, but I feel the cost of this approach is too high. For file-based data sources, we need to query the Hive metastore to apply partition pruning during filter pushdown, and this can be very expensive. Doing it twice looks scary to me.

cc @gatorsmile @dongjoon-hyun @mallman , please correct me if I have a wrong understanding.

Also cc @wzhfy: do you have an estimate of how long it will take to move statistics to the physical plan?

Contributor Author

@cloud-fan, there's nothing forcing other data sources to implement the new trait. Other sources can continue to report stats for the entire table and not account for filters (the code assumes that row counts don't change). This just opens the option of reporting stats that are more accurate using the filters and projection that will be pushed.

Ideally, I think that stats-based decisions would happen after pushdown so we get data that is as accurate as possible. But for now, this fixes the regression for v2 sources that happens because pushdown moves to a later step (conversion to the physical plan, like the other sources).

Contributor

@cloud-fan cloud-fan Jun 12, 2018

there's nothing forcing other data sources to implement the new trait ...

Hmmm, I'm a little confused here. All v2 data sources (which will all be DataSourceV2Relation) would have to apply pushdown twice, right? Or are you suggesting we should not migrate the file-based data sources to data source v2?

Contributor Author

@rdblue rdblue Jun 12, 2018

I don't mind either option #1 or #2. #2 is basically what happens for non-v2 data sources right now. Plus, both should be temporary.

I think it is a bad idea to continue with hacky code that keeps the reader in the logical plan. The code is much cleaner without it, and we've already spent too much time making sure everything still works. The main example that comes to mind is setting the requested projection and then finding out what the plan's output is after pushdown. I think these hacks are slowing progress on the v2 sources.

Contributor

Yea, the second proposal is what happens for the v1 data sources. For file-based data sources we kind of pick the third proposal: we add an optimizer rule, PruneFileSourcePartitions, that pushes some of the filters down to the data source at the logical phase, to get precise stats.

I'd like to pick from the 2nd and 3rd proposals (the 3rd proposal is also temporary, until we move stats to the physical plan). Applying pushdown twice is hard to work around (we would need to cache), while for file-based data sources we can keep the PruneFileSourcePartitions rule to work around the issue in the 2nd proposal.
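For reference, a toy sketch of what such a logical-phase rule does (simplified; not the real PruneFileSourcePartitions, and reusing the illustrative PushableFilter from the description above):

case class ToyPartition(values: Map[String, Int], sizeInBytes: Long)
case class ToyFileRelation(partitions: Seq[ToyPartition]) {
  def sizeInBytes: Long = partitions.map(_.sizeInBytes).sum // drives stats
}

// Keep only partitions matching the partition filters, so statistics
// computed on the logical plan already reflect the pruning.
def prunePartitions(
    rel: ToyFileRelation,
    filters: Seq[PushableFilter]): ToyFileRelation =
  ToyFileRelation(rel.partitions.filter { p =>
    filters.forall(f => p.values.get(f.column).contains(f.value))
  })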

Let's also get more inputs from other people.

Contributor

@jose-torres jose-torres Jun 13, 2018

I'm not strongly opposed to any of the options, but based on the description above, option 2 would be my choice if I had to pick one. A temporary state where functionality is missing is easier to reason about than a temporary state where we deliberately impose a fuzzy lifecycle.

Contributor

@rdblue do you have time to prepare a PR for the 2nd proposal? I can do that too if you are busy with other stuff.

Contributor Author

Yeah, no problem. I can just remove the stats commit from this one.

@rdblue rdblue force-pushed the SPARK-24478-move-push-down-to-physical-conversion branch from 9d3a11e to c8517e1 on June 13, 2018 at 15:59
}

- override def computeStats(): Statistics = reader match {
+ override def computeStats(): Statistics = newReader match {
    case r: SupportsReportStatistics =>
      Statistics(sizeInBytes = r.getStatistics.sizeInBytes().orElse(conf.defaultSizeInBytes))
    case _ =>
      Statistics(sizeInBytes = conf.defaultSizeInBytes)
  }

override def newInstance(): DataSourceV2Relation = {
Contributor

I think we don't need to override newInstance now.

Contributor Author

I thought that initially, but the canonicalization test was failing without this.

@cloud-fan
Contributor

Let's also update the classdoc of SupportsReportStatistics to mention that currently Spark will call it before any operator pushdown, which effectively means the returned statistics are for a full scan.
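The requested note might read roughly like this (a paraphrase in Scala; the actual interface is Java, in org.apache.spark.sql.sources.v2.reader):

import org.apache.spark.sql.sources.v2.reader.Statistics

/**
 * A mix-in for readers that report statistics to Spark.
 *
 * Currently Spark calls this before any operator pushdown, so the returned
 * statistics effectively describe a full scan of the table.
 */
trait SupportsReportStatistics {
  def getStatistics: Statistics
}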

@rdblue
Contributor Author

rdblue commented Jun 13, 2018

Updated the stats interface.

@SparkQA

SparkQA commented Jun 13, 2018

Test build #91789 has finished for PR 21503 at commit d5caf83.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 13, 2018

Test build #91782 has finished for PR 21503 at commit c8517e1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Contributor Author

rdblue commented Jun 14, 2018

@cloud-fan, tests are passing for c8517e1, which has all of the functional changes. The Jenkins job ran out of memory for the last commit, but its only change is to comments, adding the note you requested. Should be good to go.

@cloud-fan
Contributor

cc @rxin if you are interested.

@cloud-fan
Contributor

thanks, merging to master!

@rdblue
Contributor Author

rdblue commented Jun 18, 2018

Thank you for reviewing this, @cloud-fan!

asfgit pushed a commit that referenced this pull request Jun 19, 2018
…physical conversion

## What changes were proposed in this pull request?

This is a followup of #21503, to completely move operator pushdown to the planner rule.

The code is mostly from #21319

## How was this patch tested?

existing tests

Author: Wenchen Fan <wenchen@databricks.com>

Closes #21574 from cloud-fan/followup.