
[SPARK-24478][SQL] Move projection and filter push down to physical conversion #21503

Conversation

rdblue
Contributor

@rdblue rdblue commented Jun 6, 2018

What changes were proposed in this pull request?

This removes the v2 optimizer rule for push-down and instead pushes filters and required columns when converting to a physical plan, as suggested by @marmbrus. This makes the v2 relation cleaner because the output and filters do not change in the logical plan.

A side-effect of this change is that the stats from the logical (optimized) plan no longer reflect pushed filters and projection. This is a temporary state, until the planner gathers stats from the physical plan instead. An alternative to this approach is rdblue@9d3a11e.
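Concretely, the shape of the change is roughly the following. This is a toy Scala sketch of the idea only, not code from this PR; all names (PushableFilter, ToyRelation, ToyReader, planScan) are illustrative, and pruneColumns/pushFilters mirror the v2 reader mix-ins.

case class PushableFilter(column: String, value: Int)

// Logical side: the relation keeps its full output and is never mutated
// by an optimizer push-down rule.
case class ToyRelation(table: String, output: Seq[String])

// Reader configured at planning time, mirroring the v2 read API.
class ToyReader(table: String) {
  var requiredColumns: Seq[String] = Nil
  var pushedFilters: Seq[PushableFilter] = Nil
  def pruneColumns(cols: Seq[String]): Unit = requiredColumns = cols
  // Returns the filters the source cannot evaluate, which Spark must apply
  // after the scan; this toy source accepts everything.
  def pushFilters(filters: Seq[PushableFilter]): Seq[PushableFilter] = {
    pushedFilters = filters
    Nil
  }
}

// Physical conversion: the only place projection and filters are pushed.
def planScan(
    rel: ToyRelation,
    project: Seq[String],
    filters: Seq[PushableFilter]): ToyReader = {
  val reader = new ToyReader(rel.table)
  reader.pruneColumns(project)
  val postScanFilters = reader.pushFilters(filters)
  assert(postScanFilters.isEmpty) // the toy reader accepted every filter
  reader
}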

The first commit was proposed in #21262. This PR replaces #21262.

How was this patch tested?

Existing tests.

@SparkQA

SparkQA commented Jun 7, 2018

Test build #91507 has finished for PR 21503 at commit 9d3a11e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait SupportsPhysicalStats extends LeafNode
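The trait in the last bullet comes from the alternative stats commit (rdblue@9d3a11e). A hypothetical Scala shape, inferred from the discussion below rather than copied from that commit:

import org.apache.spark.sql.catalyst.expressions.{Expression, NamedExpression}
import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics}

// A leaf plan node implementing this could report statistics that account
// for the filters and projection that will be pushed at physical conversion.
trait SupportsPhysicalStats extends LeafNode {
  def computeStats(
      projection: Seq[NamedExpression],
      filters: Seq[Expression]): Statistics
}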

@rdblue
Contributor Author

rdblue commented Jun 12, 2018

@cloud-fan, this is the PR for moving push-down to the physical plan conversion and reporting the stats correctly. Sorry for the confusion; I had sent a link to just the second commit.

(projectSet ++ filterSet).toSeq
}

val reader = relation.newReader
Contributor

To confirm: do we now have to do operator pushdown twice? Once in the plan visitor to calculate statistics, and once here to build the physical plan, right?

Contributor Author

It will configure two readers: one for the pushdown when converting to a physical plan, and one for stats. The stats one should be temporary, though, since we eventually want to gather stats from the physical plan instead. Configuring two readers instead of one decouples the two problems, so we can move forward with pushdown that works like the other data sources.
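Continuing the toy sketch from the description above, the decoupling looks like this; a fresh reader per call site means the stats reader shares no state with the scan reader:

def newToyReader(): ToyReader = new ToyReader("t")

val statsReader = newToyReader()    // used only to estimate statistics
statsReader.pushFilters(Seq(PushableFilter("a", 1)))

val scanReader = newToyReader()     // configured during physical planning
scanReader.pruneColumns(Seq("a"))
// statsReader.requiredColumns is still Nil: nothing leaks between the two.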

Contributor

It's nice to decouple the problem and do pushdown during planning, but I feel the cost of this approach is too high. For file-based data sources, we need to query the Hive metastore to apply partition pruning during filter pushdown, and this can be very expensive. Doing it twice looks scary to me.

cc @gatorsmile @dongjoon-hyun @mallman , please correct me if I have a wrong understanding.

Also cc @wzhfy: do you have an estimate of how long it will take to move statistics to the physical plan?

Contributor Author

@cloud-fan, there's nothing forcing other data sources to implement the new trait. Other sources can continue to report stats for the entire table and not account for filters (the code assumes that row counts don't change). This just opens the option of reporting stats that are more accurate using the filters and projection that will be pushed.

Ideally, I think that stats-based decisions would happen after pushdown so we get data that is as accurate as possible. But for now, this fixes the regression for v2 sources that happens because pushdown moves to a later step (conversion to the physical plan, like the other sources).

Contributor

@cloud-fan cloud-fan Jun 12, 2018

there's nothing forcing other data sources to implement the new trait ...

Hmmm, I'm a little confused here. All v2 data sources (which will all be DataSourceV2Relation) would have to apply pushdown twice, right? Or are you suggesting we should not migrate the file-based data sources to data source v2?

Contributor Author

@rdblue rdblue Jun 12, 2018

I don't mind either option #1 or #2. #2 is basically what happens for non-v2 data sources right now. Plus, both should be temporary.

I think it is a bad idea to continue with hacky code that keeps the reader in the logical plan. The code is much cleaner without it, and we've already spent too much time making sure everything still works. The main example that comes to mind is setting the requested projection and then finding out what the plan's output is after pushdown. I think these hacks are slowing progress on the v2 sources.

Contributor

Yea, the second proposal is what happens for the v1 data sources. For file-based data sources we kind of pick the third proposal: we add an optimizer rule, PruneFileSourcePartitions, that pushes some of the filters down to the data source at the logical phase, to get precise stats.

I'd like to pick from the 2nd and 3rd proposals (the 3rd proposal is also temporary, until we move stats to the physical plan). Applying pushdown twice is hard to work around (we would need to cache), while for file-based data sources we can keep the PruneFileSourcePartitions rule to work around the issue in the 2nd proposal.
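For reference, a toy sketch of what such a logical-phase rule does (simplified; not the real PruneFileSourcePartitions, and reusing the illustrative PushableFilter from the description above):

case class ToyPartition(values: Map[String, Int], sizeInBytes: Long)
case class ToyFileRelation(partitions: Seq[ToyPartition]) {
  def sizeInBytes: Long = partitions.map(_.sizeInBytes).sum // drives stats
}

// Keep only partitions matching the partition filters, so statistics
// computed on the logical plan already reflect the pruning.
def prunePartitions(
    rel: ToyFileRelation,
    filters: Seq[PushableFilter]): ToyFileRelation =
  ToyFileRelation(rel.partitions.filter { p =>
    filters.forall(f => p.values.get(f.column).contains(f.value))
  })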

Let's also get more inputs from other people.

Contributor

@jose-torres jose-torres Jun 13, 2018

I'm not strongly opposed to any of the options, but based on the description above, option 2 would be my choice if I had to pick one. A temporary state where functionality is missing is easier to reason about than a temporary state where we deliberately impose a fuzzy lifecycle.

Contributor

@rdblue do you have time to prepare a PR for the 2nd proposal? I can do that too if you are busy with other stuff.

Contributor Author

Yeah, no problem. I can just remove the stats commit from this one.

@rdblue rdblue force-pushed the SPARK-24478-move-push-down-to-physical-conversion branch from 9d3a11e to c8517e1 on June 13, 2018 at 15:59
}

- override def computeStats(): Statistics = reader match {
+ override def computeStats(): Statistics = newReader match {
    case r: SupportsReportStatistics =>
      Statistics(sizeInBytes = r.getStatistics.sizeInBytes().orElse(conf.defaultSizeInBytes))
    case _ =>
      Statistics(sizeInBytes = conf.defaultSizeInBytes)
  }

override def newInstance(): DataSourceV2Relation = {
Contributor

I think we don't need to override newInstance now.

Contributor Author

I thought that initially, but the canonicalization test was failing without this.

@cloud-fan
Contributor

Let's also update the classdoc of SupportsReportStatistics to mention that currently Spark will call it before any operator pushdown, which effectively means the returned statistics are for a full scan.
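The requested note might read roughly like this (a paraphrase in Scala; the actual interface is Java, in org.apache.spark.sql.sources.v2.reader):

import org.apache.spark.sql.sources.v2.reader.Statistics

/**
 * A mix-in for readers that report statistics to Spark.
 *
 * Currently Spark calls this before any operator pushdown, so the returned
 * statistics effectively describe a full scan of the table.
 */
trait SupportsReportStatistics {
  def getStatistics: Statistics
}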

@rdblue
Contributor Author

rdblue commented Jun 13, 2018

Updated the stats interface.

@SparkQA

SparkQA commented Jun 13, 2018

Test build #91789 has finished for PR 21503 at commit d5caf83.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 13, 2018

Test build #91782 has finished for PR 21503 at commit c8517e1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Contributor Author

rdblue commented Jun 14, 2018

@cloud-fan, tests are passing for c8517e1, which has all of the functional changes. The Jenkins job ran out of memory for the last commit, but its only change is to comments, adding the note you requested. Should be good to go.

@cloud-fan
Contributor

cc @rxin if you are interested.

@cloud-fan
Contributor

thanks, merging to master!

@rdblue
Contributor Author

rdblue commented Jun 18, 2018

Thank you for reviewing this, @cloud-fan!

asfgit pushed a commit that referenced this pull request Jun 19, 2018
…physical conversion

## What changes were proposed in this pull request?

This is a followup of #21503, to completely move operator pushdown to the planner rule.

The code is mostly from #21319

## How was this patch tested?

existing tests

Author: Wenchen Fan <wenchen@databricks.com>

Closes #21574 from cloud-fan/followup.