
[SPARK-23877][SQL]: Use filter predicates to prune partitions in metadata-only queries #20988

Closed

Conversation

@rdblue rdblue (Contributor) commented Apr 5, 2018

What changes were proposed in this pull request?

This updates the OptimizeMetadataOnlyQuery rule to use filter expressions when listing partitions, if there are filter nodes in the logical plan. This avoids listing all partitions for large tables on the driver.

This also fixes a minor bug where the partitions returned from fsRelation cannot be serialized without hitting a stack level too deep error. This is caused by serializing a stream to executors, where the stream is a recursive structure. If the stream is too long, the serialization stack reaches the maximum level of depth. The fix is to create a LocalRelation using an Array instead of the incoming Seq.
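The serialization problem can be pictured with a small standalone sketch (illustrative only, not the actual Spark code; the object and method names are made up). A Scala `Stream` is a recursive (head, rest-of-stream) structure, so default Java serialization descends one stack frame per element; materializing to a flat `Array` first avoids the recursion.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object StreamVsArraySerialization {
  // Serialize an object with plain Java serialization and return the byte count.
  def serializedSize(obj: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    bytes.size()
  }

  def main(args: Array[String]): Unit = {
    // A long, lazily-built sequence, like the result of listing many partitions.
    val partitions: Seq[Int] = Stream.from(0).take(100000)

    // serializedSize(partitions)  // recurses once per element; can overflow the stack

    // The fix described above: materialize to an Array before building the relation.
    val flat: Array[Int] = partitions.toArray
    println(serializedSize(flat) > 0)  // an Array serializes iteratively
  }
}
```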

How was this patch tested?

Existing tests for metadata-only queries.

The LocalRelation created for partition data for metadata-only queries
may be a stream produced by listing directories. If the stream is large,
serializing the LocalRelation to executors results in a stack overflow
because the stream is a recursive structure of (head, rest-of-stream).
@SparkQA commented Apr 6, 2018

Test build #88959 has finished for PR 20988 at commit 2345896.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue rdblue (Contributor, Author) commented Apr 13, 2018

@cloud-fan or @gatorsmile, could you review this?

@cloud-fan (Contributor) commented:

Can we add a test? We can use HiveCatalogMetrics.METRIC_PARTITIONS_FETCHED.getCount() to check whether this patch really reduces the number of partitions being fetched.
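Such a test could look roughly like the sketch below (hedged: the table name, partition layout, and `totalPartitionCount` are illustrative, and it assumes a Hive-enabled test SparkSession with metadata-only optimization on; it is not runnable outside a Spark test suite).

```scala
import org.apache.spark.metrics.source.HiveCatalogMetrics

// Reset the catalog metrics, run a metadata-only query with a partition
// filter, and check that only the matching partitions were fetched.
HiveCatalogMetrics.reset()
spark.sql("SELECT DISTINCT p FROM partitioned_table WHERE p > 5").collect()
val fetched = HiveCatalogMetrics.METRIC_PARTITIONS_FETCHED.getCount()
assert(fetched < totalPartitionCount)  // totalPartitionCount: hypothetical helper
```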


case p @ Project(projectList, child) if projectList.forall(_.deterministic) =>
  unapply(child).flatMap { case (partAttrs, filters, relation) =>
    if (p.references.subsetOf(partAttrs)) Some((p.outputSet, filters, relation)) else None
@cloud-fan (Contributor) commented:

What about Filter(p > 1, Project(a, p, Table(a, b, p, partitioned by p)))? p > 1 should also be a partition filter.

@cloud-fan (Contributor) commented:

I'd propose something top-down like

def getPartitionedRelation(
    plan: LogicalPlan,
    predicates: Seq[Expression]): Option[(AttributeSet, Seq[Expression], LogicalPlan)] = {
  plan match {
    case Filter(condition, child) if condition.deterministic =>
      getPartitionedRelation(child, predicates ++ splitConjunctivePredicates(condition))
   
    case Project(projectList, child) if projectList.forall(_.deterministic) =>
      getPartitionedRelation(child, predicates.filter(_.references.subsetOf(child.outputSet)))
    
    case l @ LogicalRelation(fsRelation: HadoopFsRelation, _, _, _) if fsRelation.partitionSchema.nonEmpty =>
      val partAttrs = ...
      val partitionFilters = predicates.filter(_.references.subsetOf(partAttrs))
      Some(...)

    case _ => None
  }
}

@rdblue rdblue (Contributor, Author) commented:

@cloud-fan, that is basically how this works already. Each matched node calls unapply(child) to get the result from the child node, then it adds the current node's conditions to that result. Using unapply instead of getPartitionedRelation makes this work in the matching rule:

  case a @ Aggregate(_, aggExprs, child @ PartitionedRelation(partAttrs, filters, relation)) =>

@rdblue rdblue (Contributor, Author) commented Apr 18, 2018

@cloud-fan, I've added the test. Thanks for letting me know about HiveCatalogMetrics, that's exactly what I needed.

@SparkQA commented Apr 19, 2018

Test build #89535 has finished for PR 20988 at commit 6e0685e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case f @ Filter(condition, child) if condition.deterministic =>
  unapply(child).flatMap { case (partAttrs, filters, relation) =>
    if (f.references.subsetOf(partAttrs)) {
      Some((partAttrs, splitConjunctivePredicates(condition) ++ filters, relation))
@cloud-fan (Contributor) commented:

There is a bug here. Consider Filter(x > 1, Project(p + 1 as x, Table(a, p, partitioned by p))): we would mistakenly report x > 1 as a partition predicate and use it to list partitions, which fails.

I think we should use PhysicalOperation here, which can substitute the attributes in the filter for us.
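The alias substitution that PhysicalOperation performs can be pictured with a toy expression tree (a simplified sketch, not catalyst code; Attr, Lit, Add, and Gt are made-up stand-ins). Rewriting x > 1 through the alias x = p + 1 yields p + 1 > 1, which genuinely refers to the partition column p.

```scala
sealed trait Expr
case class Attr(name: String) extends Expr
case class Lit(v: Int) extends Expr
case class Add(l: Expr, r: Expr) extends Expr
case class Gt(l: Expr, r: Expr) extends Expr

object AliasSubstitution {
  // Replace attribute references with the expressions they alias, recursively.
  def substitute(e: Expr, aliases: Map[String, Expr]): Expr = e match {
    case Attr(n)   => aliases.getOrElse(n, e)
    case Add(l, r) => Add(substitute(l, aliases), substitute(r, aliases))
    case Gt(l, r)  => Gt(substitute(l, aliases), substitute(r, aliases))
    case other     => other
  }

  def main(args: Array[String]): Unit = {
    // Project(p + 1 AS x) under Filter(x > 1):
    val aliases = Map("x" -> Add(Attr("p"), Lit(1)))
    val rewritten = substitute(Gt(Attr("x"), Lit(1)), aliases)
    println(rewritten)  // Gt(Add(Attr(p),Lit(1)),Lit(1))
  }
}
```

After substitution the predicate's references are {p}, so the subset-of-partition-attributes check applies safely.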

@rdblue rdblue (Contributor, Author) commented Apr 19, 2018:

Good catch. I've added that as a test case and updated the PartitionedRelation code to keep track of both the original partition attributes (which the filter needs) and the top-most node's output that is used by the rule. One thing to note: the optimizer usually pushes Filter below Project by the time this runs, so it is difficult to even construct this test case.

As for using PhysicalOperation instead of PartitionedRelation, I don't see a compelling reason for such an invasive change. The current approach adds a couple of results to unapply and keeps mostly the same logic. PhysicalOperation would lose the check that the references are a subset of the partition attributes and would be a much larger change. If you think this should be refactored, let's talk about that separately to understand the motivation for the change.

@SparkQA commented Apr 19, 2018

Test build #89588 has finished for PR 20988 at commit 93cf217.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented:
thanks, merging to master!

@asfgit asfgit closed this in b3fde5a Apr 20, 2018
ghost pushed a commit to dbtsai/spark that referenced this pull request Apr 23, 2018
…ndling of Project and Filter over partitioned relation

## What changes were proposed in this pull request?

A followup of apache#20988

`PhysicalOperation` can collect Project and Filters over a certain plan and substitute the alias with the original attributes in the bottom plan. We can use it in `OptimizeMetadataOnlyQuery` rule to handle the Project and Filter over partitioned relation.

## How was this patch tested?

existing test

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#21111 from cloud-fan/refactor.
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Mar 7, 2019
…ata-only queries

## What changes were proposed in this pull request?

This updates the OptimizeMetadataOnlyQuery rule to use filter expressions when listing partitions, if there are filter nodes in the logical plan. This avoids listing all partitions for large tables on the driver.

This also fixes a minor bug where the partitions returned from fsRelation cannot be serialized without hitting a stack level too deep error. This is caused by serializing a stream to executors, where the stream is a recursive structure. If the stream is too long, the serialization stack reaches the maximum level of depth. The fix is to create a LocalRelation using an Array instead of the incoming Seq.

## How was this patch tested?

Existing tests for metadata-only queries.

Author: Ryan Blue <blue@apache.org>

Closes apache#20988 from rdblue/SPARK-23877-metadata-only-push-filters.

(cherry picked from commit b3fde5a)
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Mar 7, 2019
…ndling of Project and Filter over partitioned relation

A followup of apache#20988

`PhysicalOperation` can collect Project and Filters over a certain plan and substitute the alias with the original attributes in the bottom plan. We can use it in `OptimizeMetadataOnlyQuery` rule to handle the Project and Filter over partitioned relation.

existing test

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#21111 from cloud-fan/refactor.

(cherry picked from commit f70f46d)

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LocalRelation.scala
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Oct 15, 2019 (same cherry pick of b3fde5a as the Mar 7, 2019 push)
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Oct 15, 2019 (same cherry pick of f70f46d as the Mar 7, 2019 push, with the same LocalRelation.scala conflicts)