[SPARK-29277][SQL] Add early DSv2 filter and projection pushdown #25955

Closed
wants to merge 6 commits

Conversation

@rdblue (Contributor) commented Sep 28, 2019

### What changes were proposed in this pull request?

This adds a new rule, `V2ScanRelationPushDown`, to push filters and projections into a new `DataSourceV2ScanRelation` in the optimizer. That scan is then used when converting to a physical scan node. The new relation correctly reports stats based on the scan.

To run scan pushdown before rules where stats are used, this adds a new optimizer override, `earlyScanPushDownRules`, and a batch for early pushdown in the optimizer, before cost-based join reordering. The other early pushdown rule, `PruneFileSourcePartitions`, is moved into the early pushdown rule set.

This also moves the pushdown helper methods from `DataSourceV2Strategy` into a util class.
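
For illustration, a minimal sketch of how the hook and batch could fit together; the batch name, class names, and surrounding optimizer shape are assumptions for illustration, not the PR's verbatim source:

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.{Rule, RuleExecutor}
import org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions
import org.apache.spark.sql.execution.datasources.v2.V2ScanRelationPushDown

// Sketch only: an optimizer that exposes an early pushdown hook, run before
// cost-based join reordering so that pushed-down scans can report stats.
abstract class SketchOptimizer extends RuleExecutor[LogicalPlan] {
  // Subclasses override this to inject source-specific pushdown rules.
  def earlyScanPushDownRules: Seq[Rule[LogicalPlan]] = Nil

  override protected def batches: Seq[Batch] = Seq(
    Batch("Early Filter and Projection Push-Down", Once, earlyScanPushDownRules: _*)
    // ... cost-based join reordering and the remaining batches would follow ...
  )
}

// The concrete optimizer supplies the rules, as SparkOptimizer does in this PR.
class SketchSparkOptimizer extends SketchOptimizer {
  override def earlyScanPushDownRules: Seq[Rule[LogicalPlan]] =
    Seq(PruneFileSourcePartitions, V2ScanRelationPushDown)
}
```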

### Why are the changes needed?

This is needed for DSv2 sources to supply stats for cost-based rules in the optimizer.
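
For context, a DSv2 source reports these stats from its `Scan` via the `SupportsReportStatistics` mix-in, which is a real DSv2 interface; the schema and numbers in this sketch are made up for illustration:

```scala
import java.util.OptionalLong

import org.apache.spark.sql.connector.read.{Scan, Statistics, SupportsReportStatistics}
import org.apache.spark.sql.types.StructType

// Illustrative scan that reports stats after filters and projections are
// pushed; with early pushdown, these numbers reach cost-based rules such as
// join reordering.
class ExampleScan extends Scan with SupportsReportStatistics {
  override def readSchema(): StructType = new StructType().add("id", "long")

  override def estimateStatistics(): Statistics = new Statistics {
    override def sizeInBytes(): OptionalLong = OptionalLong.of(64L * 1024 * 1024)
    override def numRows(): OptionalLong = OptionalLong.of(1000000L)
  }
}
```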

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

This updates the stats implementation of `DataSourceV2Relation` so that tests fail if stats are accessed before early pushdown for v2 relations.

@rdblue (Contributor, Author) commented Sep 28, 2019

FYI, @cloud-fan & @brkyvz.


@rdblue (Contributor, Author) commented Oct 2, 2019

The test failures are expected and are caused by removing the `computeStats` implementation from `DataSourceV2Relation`. The purpose is to see where `computeStats` is called before the early pushdown rules run.


@SparkQA commented Oct 4, 2019

Test build #111757 has finished for PR 25955 at commit 1498c43.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue changed the title from "[SPARK-29277][SQL] Add early DSv2 filter and projection pushdown (WIP)" to "[SPARK-29277][SQL] Add early DSv2 filter and projection pushdown" on Oct 4, 2019
@rdblue (Contributor, Author) commented Oct 4, 2019

All tests are passing without calling the `computeStats` method before pushdown, so I'm confident that the early pushdown rule is in the right place. I'm adding back a `computeStats` method that throws `UnsupportedOperationException` while testing, to ensure that it is not incorrectly called in the future. It does not throw when not testing, so that user queries don't fail.
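
A minimal sketch of that guard as it might sit inside `DataSourceV2Relation`, assuming `Utils.isTesting` (a real Spark helper) and simplifying the non-testing fallback; `newScanBuilder()` is the relation's scan-builder method, visible in a diff fragment later in this conversation:

```scala
import org.apache.spark.sql.catalyst.plans.logical.Statistics
import org.apache.spark.sql.connector.read.SupportsReportStatistics
import org.apache.spark.util.Utils

// Sketch: fail fast in tests if stats are requested before early pushdown,
// but degrade gracefully for user queries.
override def computeStats(): Statistics = if (Utils.isTesting) {
  throw new UnsupportedOperationException(
    s"BUG: computeStats called before pushdown on DSv2 relation: $name")
} else {
  // Bad stats are better than failing a query: build a scan and ask it for
  // an estimate, falling back to the default size.
  newScanBuilder().build() match {
    case scan: SupportsReportStatistics =>
      val stats = scan.estimateStatistics()
      Statistics(sizeInBytes = BigInt(stats.sizeInBytes().orElse(conf.defaultSizeInBytes)))
    case _ =>
      Statistics(sizeInBytes = conf.defaultSizeInBytes)
  }
}
```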

@rdblue (Contributor, Author) commented Oct 4, 2019

@cloud-fan, this is ready for another review, assuming tests pass. Thanks!

@SparkQA commented Oct 5, 2019

Test build #111793 has finished for PR 25955 at commit 598a6fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue (Contributor, Author) commented Oct 17, 2019

@cloud-fan, I rebased and updated this if you want to have another look.

I updated this as you suggested so that `optimizedPlan` will always contain `DataSourceV2ScanRelation`. That change allows us to remove quite a few cases.

I also updated this to solve the problem where DDL commands would have other rules run on the relation, including early push-down. As we discussed, I removed the relation from `children` for DDL commands so that rules are not run automatically, and added cases to `ResolveTables` for those plans (see the sketch below). Right now, that is done for `DescribeTable` and `AlterTable`. Other DDL commands create tables and don't have a relation.
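
As a hypothetical illustration of that approach (the class shape and the `resolveV2Relation` helper are simplified assumptions, not the PR's exact code):

```scala
import org.apache.spark.sql.catalyst.plans.logical.{Command, LogicalPlan}

// Sketch: a DDL plan holds its relation as a plain field rather than a child,
// so analyzer/optimizer rules (including early pushdown) do not recurse into it.
case class DescribeTableSketch(table: LogicalPlan, isExtended: Boolean) extends Command {
  override def children: Seq[LogicalPlan] = Nil // relation intentionally omitted
}

// ResolveTables then resolves the relation with an explicit case, roughly:
//   case desc @ DescribeTableSketch(u: UnresolvedV2Relation, _) =>
//     resolveV2Relation(u)                      // assumed helper
//       .map(rel => desc.copy(table = rel))
//       .getOrElse(desc)
```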

I should point out that I didn't change `DeleteFromTable` or `UpdateTable`. Those aren't DDL commands because they modify data. Those plans also rely on the relation as a child to resolve references in the delete and update expressions. Because some rules need to run, I think it should be okay if all of the rules run. This still works because the plans only rely on the output of the table, and it doesn't matter if the underlying relation is converted to `DataSourceV2ScanRelation`.

If we want to avoid converting the relation underneath `DeleteFromTable`, we could skip early push-down when there is no filter or projection. But that strategy would cause `DataSourceV2Relation` to show up in optimized plans again and require adding back all the cases I just removed. I don't have a strong opinion here and could go either way.

Last, I made a small change to `Analyzer` while I was updating the `ResolveTables` rule. The cases for `UnresolvedRelation` and `InsertIntoStatement` used `lookupV2RelationAndCatalog`, which has been removed elsewhere. I removed those last uses so we could get rid of that method. Now, `UnresolvedRelation` and `InsertIntoStatement` are resolved in `ResolveCatalogs` to use `UnresolvedV2Relation` and then matched in `ResolveTables`.

          .map(rel => desc.copy(table = rel))
          .getOrElse(desc)

      case alter @ AlterTable(_, _, u: UnresolvedV2Relation, _) =>
@rdblue (Contributor, Author) commented on this diff:

I tested out a trait that worked for all of the plans that need to be resolved here, but the code was longer with the trait and implementations. If we need it later because we have more cases in this rule, it should be easy to add. I don't think we need it right now.

@@ -32,6 +32,16 @@ class ResolveCatalogs(val catalogManager: CatalogManager)
   import org.apache.spark.sql.connector.catalog.CatalogV2Util._

   override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
+    case unresolved @ UnresolvedRelation(nameParts) =>
+      nameParts match {
+        case AsTemporaryViewIdentifier(i) if catalogManager.v1SessionCatalog.isTemporaryTable(i) =>
@rdblue (Contributor, Author) commented on this diff:

This was added to convert to a v2 unresolved relation. Matching the view identifier ensures the same behavior as `lookupV2RelationAndCatalog`.

@@ -64,7 +69,8 @@ class SparkOptimizer(
   override def nonExcludableRules: Seq[String] = super.nonExcludableRules :+
     ExtractPythonUDFFromJoinCondition.ruleName :+
     ExtractPythonUDFFromAggregate.ruleName :+ ExtractGroupingPythonUDFFromAggregate.ruleName :+
-    ExtractPythonUDFs.ruleName
+    ExtractPythonUDFs.ruleName :+
+    V2ScanRelationPushDown.ruleName
@rdblue (Contributor, Author) commented on this diff:

Only `DataSourceV2ScanRelation` will be converted to a physical scan node, so the early push-down rule is now required.
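
For illustration, a simplified sketch of why the rule must always run; this shows only the scan case of a strategy, not the full `DataSourceV2Strategy`:

```scala
import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.datasources.v2.{BatchScanExec, DataSourceV2ScanRelation}

// Sketch: physical planning matches only the pushed-down scan relation. If
// V2ScanRelationPushDown were excluded, a bare DataSourceV2Relation would
// reach this point and have no physical translation.
object SketchV2Strategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case DataSourceV2ScanRelation(_, scan, output) =>
      BatchScanExec(output, scan) :: Nil
    case _ => Nil
  }
}
```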


@rdblue (Contributor, Author) commented Oct 30, 2019

@cloud-fan, I've removed the refactor and tests are passing. Can you take another look at this?

s"BUG: computeStats called before pushdown on DSv2 relation: $name")
} else {
// when not testing, return stats because bad stats are better than failing a query
newScanBuilder() match {
A reviewer (Contributor) commented on this diff:

This was inlined in a previous commit; why is it reverted?

@@ -3218,6 +3218,8 @@ class Dataset[T] private[sql](
         fr.inputFiles
       case r: HiveTableRelation =>
         r.tableMeta.storage.locationUri.map(_.toString).toArray
+      case DataSourceV2ScanRelation(table: FileTable, _, _) =>
+        table.fileIndex.inputFiles
       case DataSourceV2Relation(table: FileTable, _, _) =>
A reviewer (Contributor) commented on this diff:

In a previous discussion, we decided to make `V2ScanRelationPushDown` mandatory, so that `DataSourceV2Relation` won't appear in the optimized plan. Why did we change our minds?

@cloud-fan (Contributor) commented:

I noticed that, except for the table resolution refactor, we also reverted the changes that make `V2ScanRelationPushDown` mandatory. Is that intentional?

@rdblue (Contributor, Author) commented Oct 30, 2019

@cloud-fan, those changes were accidentally dropped while rolling back. I've updated this to restore them.


@SparkQA commented Oct 30, 2019

Test build #112935 has finished for PR 25955 at commit 4220723.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -17,7 +17,7 @@
 package org.apache.spark.sql.execution.datasources.orc

 import org.apache.spark.SparkConf
-import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.{DataFrame, Row}
A reviewer (Contributor) commented on this diff:

unnecessary change

@cloud-fan (Contributor) commented:

LGTM if tests pass

@cloud-fan (Contributor) commented:

We need to bring back the changes in `CheckAnalysis`: https://github.com/apache/spark/pull/26318/files#diff-1d14ac233eac6f233c027dba0bdf871dR100

@rdblue (Contributor, Author) commented Oct 30, 2019

I added the cases back to `CheckAnalysis`.

@SparkQA commented Oct 30, 2019

Test build #112955 has finished for PR 25955 at commit c223e05.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue (Contributor, Author) commented Oct 30, 2019

@cloud-fan, tests are passing with the additional checks.

@rdblue (Contributor, Author) commented Oct 31, 2019

Merging to master. Thanks for reviewing, @cloud-fan and @brkyvz!

@rdblue closed this in cfc80d0 on Oct 31, 2019
@dongjoon-hyun (Member) commented Oct 31, 2019

Hi, guys.

Sorry, but this breaks our SBT build with the Hadoop 3.2 (JDK8) profile.

I checked that the failure happens consistently on the master branch.

$ build/sbt "sql/testOnly *.OrcFilterSuite" -Phadoop-3.2
...
[info] - filter pushdown - integer *** FAILED *** (2 seconds, 173 milliseconds)
[info]   org.apache.spark.sql.AnalysisException: Can not match OrcTable in the query.;
...
[info] *** 10 TESTS FAILED ***
[error] Failed: Total 14, Failed 10, Errors 0, Passed 4
[error] Failed tests:
[error] 	org.apache.spark.sql.execution.datasources.orc.OrcFilterSuite
[error] (sql/test:testOnly) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 57 s, completed Oct 30, 2019 11:05:15 PM

Without this patch, the build recovers. Although this is not part of 3.0.0-preview RC2, I'll revert this urgently in order to recover all of our Jenkins environments during the vote. Also, we may have an RC3.

cc @jiangxb1987 since he is a release manager for 3.0.0-preview.

@dongjoon-hyun (Member) commented Oct 31, 2019

@rdblue, could you open this PR once more with the [test-hadoop3.2] string in the PR title, please?

@dongjoon-hyun (Member) commented Oct 31, 2019

All Jenkins jobs using Maven look okay (including JDK11 tests).

cloud-fan pushed a commit to cloud-fan/spark that referenced this pull request on Oct 31, 2019 (the commit message repeats this PR's description):

Closes apache#25955 from rdblue/move-v2-pushdown.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Ryan Blue <blue@apache.org>
dongjoon-hyun pushed a commit that referenced this pull request on Oct 31, 2019:

…on pushdown

Bring back #25955 (the commit message repeats this PR's description).

Closes #26341 from cloud-fan/back.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>