[SPARK-29277][SQL] Add early DSv2 filter and projection pushdown #25955

Closed
wants to merge 6 commits

Conversation

@rdblue (Contributor) commented Sep 28, 2019

### What changes were proposed in this pull request?

This adds a new rule, `V2ScanRelationPushDown`, to push filters and projections into a new `DataSourceV2ScanRelation` in the optimizer. That scan is then used when converting to a physical scan node. The new relation correctly reports stats based on the scan.

To run scan pushdown before rules where stats are used, this adds a new optimizer override, `earlyScanPushDownRules`, and a batch for early pushdown in the optimizer, before cost-based join reordering. The other early pushdown rule, `PruneFileSourcePartitions`, is moved into the early pushdown rule set.

This also moves the pushdown helper methods from `DataSourceV2Strategy` into a util class.
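
For illustration, a minimal sketch of how the hook and batch could fit together; the batch name, class names, and surrounding optimizer shape are assumptions for illustration, not the PR's verbatim source:

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.{Rule, RuleExecutor}
import org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions
import org.apache.spark.sql.execution.datasources.v2.V2ScanRelationPushDown

// Sketch only: an optimizer that exposes an early pushdown hook, run before
// cost-based join reordering so that pushed-down scans can report stats.
abstract class SketchOptimizer extends RuleExecutor[LogicalPlan] {
  // Subclasses override this to inject source-specific pushdown rules.
  def earlyScanPushDownRules: Seq[Rule[LogicalPlan]] = Nil

  override protected def batches: Seq[Batch] = Seq(
    Batch("Early Filter and Projection Push-Down", Once, earlyScanPushDownRules: _*)
    // ... cost-based join reordering and the remaining batches would follow ...
  )
}

// The concrete optimizer supplies the rules, as SparkOptimizer does in this PR.
class SketchSparkOptimizer extends SketchOptimizer {
  override def earlyScanPushDownRules: Seq[Rule[LogicalPlan]] =
    Seq(PruneFileSourcePartitions, V2ScanRelationPushDown)
}
```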

### Why are the changes needed?

This is needed for DSv2 sources to supply stats for cost-based rules in the optimizer.
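
For context, a DSv2 source reports these stats from its `Scan` via the `SupportsReportStatistics` mix-in, which is a real DSv2 interface; the schema and numbers in this sketch are made up for illustration:

```scala
import java.util.OptionalLong

import org.apache.spark.sql.connector.read.{Scan, Statistics, SupportsReportStatistics}
import org.apache.spark.sql.types.StructType

// Illustrative scan that reports stats after filters and projections are
// pushed; with early pushdown, these numbers reach cost-based rules such as
// join reordering.
class ExampleScan extends Scan with SupportsReportStatistics {
  override def readSchema(): StructType = new StructType().add("id", "long")

  override def estimateStatistics(): Statistics = new Statistics {
    override def sizeInBytes(): OptionalLong = OptionalLong.of(64L * 1024 * 1024)
    override def numRows(): OptionalLong = OptionalLong.of(1000000L)
  }
}
```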

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

This updates the stats implementation of `DataSourceV2Relation` so that tests fail if stats are accessed before early pushdown for v2 relations.

@rdblue (Contributor, Author) commented Sep 28, 2019

FYI, @cloud-fan & @brkyvz.


@rdblue (Contributor, Author) commented Oct 2, 2019

The test failures are expected and are caused by removing the `computeStats` implementation from `DataSourceV2Relation`. The purpose is to see where `computeStats` is called before the early pushdown rules run.


@SparkQA commented Oct 4, 2019

Test build #111757 has finished for PR 25955 at commit 1498c43.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue changed the title from "[SPARK-29277][SQL] Add early DSv2 filter and projection pushdown (WIP)" to "[SPARK-29277][SQL] Add early DSv2 filter and projection pushdown" on Oct 4, 2019
@rdblue (Contributor, Author) commented Oct 4, 2019

All tests are passing without calling the `computeStats` method before pushdown, so I'm confident that the early pushdown rule is in the right place. I'm adding back a `computeStats` method that throws `UnsupportedOperationException` while testing, to ensure that it is not incorrectly called in the future. It does not throw when not testing, so that user queries don't fail.
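
A minimal sketch of that guard as it might sit inside `DataSourceV2Relation`, assuming `Utils.isTesting` (a real Spark helper) and simplifying the non-testing fallback; `newScanBuilder()` is the relation's scan-builder method, visible in a diff fragment later in this conversation:

```scala
import org.apache.spark.sql.catalyst.plans.logical.Statistics
import org.apache.spark.sql.connector.read.SupportsReportStatistics
import org.apache.spark.util.Utils

// Sketch: fail fast in tests if stats are requested before early pushdown,
// but degrade gracefully for user queries.
override def computeStats(): Statistics = if (Utils.isTesting) {
  throw new UnsupportedOperationException(
    s"BUG: computeStats called before pushdown on DSv2 relation: $name")
} else {
  // Bad stats are better than failing a query: build a scan and ask it for
  // an estimate, falling back to the default size.
  newScanBuilder().build() match {
    case scan: SupportsReportStatistics =>
      val stats = scan.estimateStatistics()
      Statistics(sizeInBytes = BigInt(stats.sizeInBytes().orElse(conf.defaultSizeInBytes)))
    case _ =>
      Statistics(sizeInBytes = conf.defaultSizeInBytes)
  }
}
```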

@rdblue (Contributor, Author) commented Oct 4, 2019

@cloud-fan, this is ready for another review, assuming tests pass. Thanks!

@SparkQA commented Oct 5, 2019

Test build #111793 has finished for PR 25955 at commit 598a6fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue (Contributor, Author) commented Oct 17, 2019

@cloud-fan, I rebased and updated this if you want to have another look.

I updated this as you suggested so that `optimizedPlan` will always contain `DataSourceV2ScanRelation`. That change allows us to remove quite a few cases.

I also updated this to solve the problem where DDL commands would have other rules run on the relation, including early push-down. As we discussed, I removed the relation from `children` for DDL commands so that rules are not run automatically, and added cases to `ResolveTables` for those plans (see the sketch below). Right now, that is done for `DescribeTable` and `AlterTable`. Other DDL commands create tables and don't have a relation.
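
As a hypothetical illustration of that approach (the class shape and the `resolveV2Relation` helper are simplified assumptions, not the PR's exact code):

```scala
import org.apache.spark.sql.catalyst.plans.logical.{Command, LogicalPlan}

// Sketch: a DDL plan holds its relation as a plain field rather than a child,
// so analyzer/optimizer rules (including early pushdown) do not recurse into it.
case class DescribeTableSketch(table: LogicalPlan, isExtended: Boolean) extends Command {
  override def children: Seq[LogicalPlan] = Nil // relation intentionally omitted
}

// ResolveTables then resolves the relation with an explicit case, roughly:
//   case desc @ DescribeTableSketch(u: UnresolvedV2Relation, _) =>
//     resolveV2Relation(u)                      // assumed helper
//       .map(rel => desc.copy(table = rel))
//       .getOrElse(desc)
```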

I should point out that I didn't change `DeleteFromTable` or `UpdateTable`. Those aren't DDL commands because they modify data. Those plans also rely on the relation as a child to resolve references in the delete and update expressions. Because some rules need to run, I think it should be okay if all of the rules run. This still works because the plans only rely on the output of the table, and it doesn't matter if the underlying relation is converted to `DataSourceV2ScanRelation`.

If we want to avoid converting the relation underneath `DeleteFromTable`, we could skip early push-down when there is no filter or projection. But that strategy would cause `DataSourceV2Relation` to show up in optimized plans again and require adding back all the cases I just removed. I don't have a strong opinion here and could go either way.

Last, I made a small change to `Analyzer` while I was updating the `ResolveTables` rule. The cases for `UnresolvedRelation` and `InsertIntoStatement` used `lookupV2RelationAndCatalog`, which has been removed elsewhere. I removed those last uses so we could get rid of that method. Now, `UnresolvedRelation` and `InsertIntoStatement` are resolved in `ResolveCatalogs` to use `UnresolvedV2Relation` and then matched in `ResolveTables`.

          .map(rel => desc.copy(table = rel))
          .getOrElse(desc)

      case alter @ AlterTable(_, _, u: UnresolvedV2Relation, _) =>
@rdblue (Contributor, Author) commented on this diff:

I tested out a trait that worked for all of the plans that need to be resolved here, but the code was longer with the trait and implementations. If we need it later because we have more cases in this rule, it should be easy to add. I don't think we need it right now.

@@ -32,6 +32,16 @@ class ResolveCatalogs(val catalogManager: CatalogManager)
   import org.apache.spark.sql.connector.catalog.CatalogV2Util._

   override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
+    case unresolved @ UnresolvedRelation(nameParts) =>
+      nameParts match {
+        case AsTemporaryViewIdentifier(i) if catalogManager.v1SessionCatalog.isTemporaryTable(i) =>
@rdblue (Contributor, Author) commented on this diff:

This was added to convert to a v2 unresolved relation. Matching the view identifier ensures the same behavior as `lookupV2RelationAndCatalog`.

@@ -64,7 +69,8 @@ class SparkOptimizer(
   override def nonExcludableRules: Seq[String] = super.nonExcludableRules :+
     ExtractPythonUDFFromJoinCondition.ruleName :+
     ExtractPythonUDFFromAggregate.ruleName :+ ExtractGroupingPythonUDFFromAggregate.ruleName :+
-    ExtractPythonUDFs.ruleName
+    ExtractPythonUDFs.ruleName :+
+    V2ScanRelationPushDown.ruleName
@rdblue (Contributor, Author) commented on this diff:

Only `DataSourceV2ScanRelation` will be converted to a physical scan node, so the early push-down rule is now required.
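
For illustration, a simplified sketch of why the rule must always run; this shows only the scan case of a strategy, not the full `DataSourceV2Strategy`:

```scala
import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.datasources.v2.{BatchScanExec, DataSourceV2ScanRelation}

// Sketch: physical planning matches only the pushed-down scan relation. If
// V2ScanRelationPushDown were excluded, a bare DataSourceV2Relation would
// reach this point and have no physical translation.
object SketchV2Strategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case DataSourceV2ScanRelation(_, scan, output) =>
      BatchScanExec(output, scan) :: Nil
    case _ => Nil
  }
}
```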


@rdblue (Contributor, Author) commented Oct 30, 2019

@cloud-fan, I've removed the refactor and tests are passing. Can you take another look at this?

s"BUG: computeStats called before pushdown on DSv2 relation: $name")
} else {
// when not testing, return stats because bad stats are better than failing a query
newScanBuilder() match {
A reviewer (Contributor) commented on this diff:

This was inlined in a previous commit; why is it reverted?

@@ -3218,6 +3218,8 @@ class Dataset[T] private[sql](
         fr.inputFiles
       case r: HiveTableRelation =>
         r.tableMeta.storage.locationUri.map(_.toString).toArray
+      case DataSourceV2ScanRelation(table: FileTable, _, _) =>
+        table.fileIndex.inputFiles
       case DataSourceV2Relation(table: FileTable, _, _) =>
A reviewer (Contributor) commented on this diff:

In a previous discussion, we decided to make `V2ScanRelationPushDown` mandatory, so that `DataSourceV2Relation` won't appear in the optimized plan. Why did we change our minds?

@cloud-fan (Contributor) commented:

I noticed that, except for the table resolution refactor, we also reverted the changes that make `V2ScanRelationPushDown` mandatory. Is that intentional?

@rdblue (Contributor, Author) commented Oct 30, 2019

@cloud-fan, those changes were accidentally dropped while rolling back. I've updated this to restore them.


@SparkQA commented Oct 30, 2019

Test build #112935 has finished for PR 25955 at commit 4220723.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -17,7 +17,7 @@
 package org.apache.spark.sql.execution.datasources.orc

 import org.apache.spark.SparkConf
-import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.{DataFrame, Row}
A reviewer (Contributor) commented on this diff:

unnecessary change

@cloud-fan (Contributor) commented:

LGTM if tests pass

@cloud-fan (Contributor) commented:

We need to bring back the changes in `CheckAnalysis`: https://github.com/apache/spark/pull/26318/files#diff-1d14ac233eac6f233c027dba0bdf871dR100

@rdblue (Contributor, Author) commented Oct 30, 2019

I added the cases back to `CheckAnalysis`.

@SparkQA commented Oct 30, 2019

Test build #112955 has finished for PR 25955 at commit c223e05.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue (Contributor, Author) commented Oct 30, 2019

@cloud-fan, tests are passing with the additional checks.

@rdblue (Contributor, Author) commented Oct 31, 2019

Merging to master. Thanks for reviewing, @cloud-fan and @brkyvz!

@rdblue closed this in cfc80d0 on Oct 31, 2019
@dongjoon-hyun (Member) commented Oct 31, 2019

Hi, guys.

Sorry, but this breaks our SBT build with the Hadoop 3.2 (JDK8) profile.

I checked that the failure happens consistently on the master branch.

$ build/sbt "sql/testOnly *.OrcFilterSuite" -Phadoop-3.2
...
[info] - filter pushdown - integer *** FAILED *** (2 seconds, 173 milliseconds)
[info]   org.apache.spark.sql.AnalysisException: Can not match OrcTable in the query.;
...
[info] *** 10 TESTS FAILED ***
[error] Failed: Total 14, Failed 10, Errors 0, Passed 4
[error] Failed tests:
[error] 	org.apache.spark.sql.execution.datasources.orc.OrcFilterSuite
[error] (sql/test:testOnly) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 57 s, completed Oct 30, 2019 11:05:15 PM

Without this patch, the build recovers. Although this is not part of 3.0.0-preview RC2, I'll revert this urgently in order to recover all of our Jenkins environments during the vote. Also, we may have an RC3.

cc @jiangxb1987 since he is a release manager for 3.0.0-preview.

@dongjoon-hyun (Member) commented Oct 31, 2019

@rdblue, could you open this PR once more with the [test-hadoop3.2] string in the PR title, please?

@dongjoon-hyun (Member) commented Oct 31, 2019

All Jenkins jobs using Maven look okay (including JDK11 tests).

cloud-fan pushed a commit to cloud-fan/spark that referenced this pull request on Oct 31, 2019 (the commit message repeats this PR's description):

Closes apache#25955 from rdblue/move-v2-pushdown.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Ryan Blue <blue@apache.org>
dongjoon-hyun pushed a commit that referenced this pull request on Oct 31, 2019:

…on pushdown

Bring back #25955 (the commit message repeats this PR's description).

Closes #26341 from cloud-fan/back.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>