[SPARK-12503] [SQL] Pushing Limit Through Union All #10451

Closed
wants to merge 36 commits

Conversation

gatorsmile
Member

"Rule that applies to a Limit on top of a Union. The original Limit won't go away after applying this rule, but additional Limit nodes will be created on top of each child of Union, so that these children produce less rows and Limit can be further optimized for children Relations."

– from https://issues.apache.org/jira/browse/CALCITE-832

The same topic was also addressed in Hive: https://issues.apache.org/jira/browse/HIVE-11775. That change has been merged into Hive.

This PR is a performance improvement. The idea is similar to predicate pushdown: pushing the Limit below the Union reduces the number of rows processed by Union All.
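For illustration, a query along the following lines reproduces the plans below (this snippet is our reconstruction, not taken from the PR; it assumes a Spark 1.6-era spark-shell session where `sc` is predefined):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)   // sc: SparkContext, predefined in spark-shell
import sqlContext.implicits._

val df1 = (1 to 20).map(Tuple1.apply).toDF("i")   // the 20-row relation
val df2 = (1 to 10).map(Tuple1.apply).toDF("i")   // the 10-row relation

// explain(true) prints the analyzed and optimized logical plans shown below.
df1.unionAll(df2).limit(1).explain(true)
```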

After the improvement, we can see the changes in the optimized plan:

== Analyzed Logical Plan ==
i: int
Limit 1
+- Union
   :- Project [_1#0 AS i#1]
   :  +- LocalRelation [_1#0], [[1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11],[12],[13],[14],[15],[16],[17],[18],[19],[20]]
   +- Project [_1#2 AS i#3]
      +- LocalRelation [_1#2], [[1],[2],[3],[4],[5],[6],[7],[8],[9],[10]]

== Optimized Logical Plan ==
Limit 1
+- Union
   :- Limit 1 <---- extra Limit
   :  +- LocalRelation [i#1], [[1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11],[12],[13],[14],[15],[16],[17],[18],[19],[20]]
   +- Limit 1 <---- extra Limit
      +- LocalRelation [i#3], [[1],[2],[3],[4],[5],[6],[7],[8],[9],[10]]

@gatorsmile gatorsmile changed the title [SPARK-12503] [SQL] Pushing Limit Through Union [SPARK-12503] [SQL] Pushing Limit Through Union ALL Dec 23, 2015
@gatorsmile gatorsmile changed the title [SPARK-12503] [SQL] Pushing Limit Through Union ALL [SPARK-12503] [SQL] Pushing Limit Through Union All Dec 23, 2015
@SparkQA

SparkQA commented Dec 23, 2015

Test build #48248 has finished for PR 10451 at commit 56fd782.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

@marmbrus @rxin Could you check if this is an appropriate improvement for Spark too?

Thanks!

case Limit(exp, Union(left, right)) =>
  Limit(exp,
    Union(
      Limit(exp, left),
      Limit(exp, right)
    )
  )
Member Author

A bug exists here. Will fix it soon. Thanks!

@SparkQA

SparkQA commented Dec 24, 2015

Test build #48261 has finished for PR 10451 at commit 77105e3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case Limit(exp, Union(left, right)) =>
  Limit(exp,
    Union(
      CombineLimits(Limit(exp, left)),
      CombineLimits(Limit(exp, right))
    )
  )
Contributor

We need a stop condition, or this rule will keep pushing Limit down forever.

Member Author

Thank you for your review! Since we call CombineLimits here, we will not add an extra Limit in the subsequent iteration. Thus, the plan will not change again, and the rule will stop automatically, right?

Contributor

After thinking about it more, there may be a problem: if left or right is an operator that a Limit can be pushed through (currently there is no such operator, but we can't guarantee there never will be), then every time you push down a Limit here, it will be pushed down further. CombineLimits then cannot detect that you have already pushed the Limit down, and this rule keeps generating new Limits and pushing them down.

I think we should have a better way to detect whether we have already pushed the Limit down, or add a comment saying that this rule assumes the newly added Limits on top of left and right won't be removed by other optimization rules.
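A toy sketch of the non-convergence risk (class and function names here are ours, not Catalyst's):

```scala
// Simplified stand-ins for Catalyst plan nodes.
sealed trait Plan
case class Limit(n: Int, child: Plan) extends Plan
case class Union(left: Plan, right: Plan) extends Plan
case class Leaf(name: String) extends Plan

// The unguarded rule: it fires every time it sees Limit over Union.
def pushLimit(p: Plan): Plan = p match {
  case Limit(n, Union(l, r)) => Limit(n, Union(Limit(n, l), Limit(n, r)))
  case other                 => other
}

// Catalyst applies rule batches until the plan stops changing. If another
// rule moves the freshly added inner Limits further down, Limit-over-Union
// matches again on the next iteration, so fresh Limits are generated forever.
```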

Member Author

You are right. The Limit might not converge to a fixed position after multiple pushdowns.

Let me think about it. Thank you!

@SparkQA

SparkQA commented Dec 24, 2015

Test build #48299 has finished for PR 10451 at commit 7f25d91.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 29, 2015

Test build #48400 has finished for PR 10451 at commit 358d62e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Contributor

Thanks for working on this. I think it's getting pretty close. A few minor cleanups that might be nice:

  • I think we should consider pulling all the Limit rules into their own LimitPushDown rule. The reasoning here is twofold: we can clearly comment in one central place on the requirements with respect to implementing maxRows, and it will be easier to turn off if it is ever doing the wrong thing.
  • We should do a pass through and add maxRows to any other logical plans where it makes sense (see the sketch after this list). Off the top of my head:
    • Filter = child.maxRows
    • Union = for (leftMax <- left.maxRows; rightMax <- right.maxRows) yield Add(leftMax, rightMax)
    • Distinct = child.maxRows
    • Aggregate = child.maxRows
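A minimal, self-contained sketch of these overrides (toy stand-ins for the Catalyst classes, not the real Spark source):

```scala
sealed trait LogicalPlan { def maxRows: Option[Long] }

case class Relation(rows: Option[Long]) extends LogicalPlan {
  def maxRows: Option[Long] = rows                 // usually unknown for a source
}
case class Filter(child: LogicalPlan) extends LogicalPlan {
  def maxRows: Option[Long] = child.maxRows        // filtering never adds rows
}
case class Distinct(child: LogicalPlan) extends LogicalPlan {
  def maxRows: Option[Long] = child.maxRows        // deduplication only removes rows
}
case class Union(left: LogicalPlan, right: LogicalPlan) extends LogicalPlan {
  // Defined only when both sides are bounded: at most the sum of the bounds.
  def maxRows: Option[Long] =
    for (l <- left.maxRows; r <- right.maxRows) yield l + r
}
```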

// push-down rule that is unable to infer the value of maxRows. Any operator that a Limit can
// be pushed past should override this function.
case Limit(exp, Union(left, right))
    if left.maxRows.isEmpty || right.maxRows.isEmpty =>
Contributor

Is there a reason not to check left and right separately?

Member Author

Here is an example: even if one side already has a Limit child/descendant, we can still push the Limit down to reduce the number of returned rows.

https://github.com/gatorsmile/spark/blob/unionLimit/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/PushdownLimitsSuite.scala#L50-L57

* Any operator that a Limit can be pushed past should override the maxRows function.
*
* Note: This rule has to be done when the logical plan is stable;
* otherwise, it could impact the other rules.
Contributor

I'm not sure what this means?

Member Author

If we push Limit through Filter, Aggregate, or Distinct, the results will be wrong. For example, df.aggregate().limit(1) and df.limit(1).aggregate() will generate different results.

This statement only matters if we can push Limit through some operators. So far, we have not found any eligible operators except outer/left-outer/right-outer Join and Union. Thus, let me revert it. Thanks!
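To make the ordering problem concrete, a hedged illustration (ours, not from the PR; assumes the same spark-shell setup as the snippet in the description):

```scala
val df = (1 to 3).map(Tuple1.apply).toDF("i")

df.groupBy().sum("i").limit(1).show()   // aggregates all rows first: sum is 6
df.limit(1).groupBy().sum("i").show()   // aggregates one arbitrary row: sum is 1, 2, or 3
```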

@gatorsmile
Member Author

After rethinking the Limit push-down rules, we are unable to push Limit through any operator that could change the number of rows or generate output values based on its inputs. Thus, so far, the eligible candidates are Project, Union All, and Outer/LeftOuter/RightOuter Join. Please correct me if my understanding is not right.

Feel free to let me know if the code needs an update. Thank you!

// safe to push down Limit through it. Once we add UNION DISTINCT, we will not be able to
// push down Limit.
case Limit(exp, Union(left, right))
    if left.maxRows.isEmpty || right.maxRows.isEmpty =>
Contributor

Okay, but why not break this into two parts, so that we push to the left when the left is not limited, and push to the right when the right is not limited? Right now you push to both sides if either one is not limited.
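A sketch of that suggestion (ours, not the PR's code), reusing the toy Plan classes from the earlier sketch plus a maxRows helper:

```scala
// Bound on the rows a toy plan can return, mirroring LogicalPlan.maxRows.
def maxRows(p: Plan): Option[Int] = p match {
  case Limit(n, _) => Some(n)
  case Union(l, r) => for (lm <- maxRows(l); rm <- maxRows(r)) yield lm + rm
  case Leaf(_)     => None
}

// Push a Limit independently to each side that is not already bounded.
def pushLimitPerSide(p: Plan): Plan = p match {
  case Limit(n, Union(l, r)) =>
    val newL = if (maxRows(l).isEmpty) Limit(n, l) else l
    val newR = if (maxRows(r).isEmpty) Limit(n, r) else r
    Limit(n, Union(newL, newR))
  case other => other
}
```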

Member Author

Yeah, you are right. : )

Contributor

Should we also check the limit value? If maxRows is larger than the limit we want to push down, it seems it still makes sense to push it down.

Member Author

Yeah, that also makes sense. Will make the change after these three running test cases finish. : )

@SparkQA

SparkQA commented Dec 30, 2015

Test build #48435 has finished for PR 10451 at commit ca5c104.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 30, 2015

Test build #48431 has finished for PR 10451 at commit 2823a57.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 30, 2015

Test build #48433 has finished for PR 10451 at commit cfbeea7.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

The latest version covers both cases:

  1. If a child's maxRows is larger than the limit value, add the extra Limit.
  2. If a child's maxRows is None, add the extra Limit.

Hopefully, you like the latest implementation. : ) @marmbrus @cloud-fan
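Expressed against the same toy maxRows helper from the earlier sketches, the combined condition could look like this (our sketch, not the PR's code):

```scala
// Push when the child's bound is unknown (None) or looser than the limit.
def shouldPushLimit(n: Int, child: Plan): Boolean =
  maxRows(child).forall(_ > n)
```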

@SparkQA

SparkQA commented Dec 30, 2015

Test build #48461 has finished for PR 10451 at commit 7cf955f.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 30, 2015

Test build #48465 has finished for PR 10451 at commit 7899312.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Jan 12, 2016

Let's close the limit push down pull requests. We will need to design this more properly because it is expensive to push down large limits.

@gatorsmile
Member Author

Sure, let me close it.

@gatorsmile gatorsmile closed this Jan 12, 2016
asfgit pushed a commit that referenced this pull request Feb 15, 2016
This patch adds a new optimizer rule for performing limit pushdown. Limits will now be pushed down in two cases:

- If a limit is on top of a `UNION ALL` operator, then a partition-local limit operator will be pushed to each of the union operator's children.
- If a limit is on top of an `OUTER JOIN` then a partition-local limit will be pushed to one side of the join. For `LEFT OUTER` and `RIGHT OUTER` joins, the limit will be pushed to the left and right side, respectively. For `FULL OUTER` join, we will only push limits when at most one of the inputs is already limited: if one input is limited we will push a smaller limit on top of it and if neither input is limited then we will limit the input which is estimated to be larger.

These optimizations were proposed previously by gatorsmile in #10451 and #10454, but those earlier PRs were closed and deferred for later because at that time Spark's physical `Limit` operator would trigger a full shuffle to perform global limits so there was a chance that pushdowns could actually harm performance by causing additional shuffles/stages. In #7334, we split the `Limit` operator into separate `LocalLimit` and `GlobalLimit` operators, so we can now push down only local limits (which don't require extra shuffles). This patch is based on both of gatorsmile's patches, with changes and simplifications due to partition-local-limiting.

When we push down the limit, we still keep the original limit in place, so we need a mechanism to ensure that the optimizer rule doesn't keep pattern-matching once the limit has been pushed down. In order to handle this, this patch adds a `maxRows` method to `SparkPlan` which returns the maximum number of rows that the plan can compute, then defines the pushdown rules to only push limits to children if the children's maxRows are greater than the limit's maxRows. This idea is carried over from #10451; see that patch for additional discussion.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #11121 from JoshRosen/limit-pushdown-2.
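A simplified, self-contained sketch (ours, based on the commit message above, not the merged source) of that maxRows guard; adding the LocalLimit only when the child's bound is unknown or too loose is also what stops the rule from re-firing:

```scala
sealed trait LogicalPlan { def maxRows: Option[Long] }

case class LocalLimit(limit: Long, child: LogicalPlan) extends LogicalPlan {
  def maxRows: Option[Long] = Some(limit)        // a limited child is bounded
}
case class Scan(bound: Option[Long]) extends LogicalPlan {
  def maxRows: Option[Long] = bound
}

def maybePushLocalLimit(limit: Long, child: LogicalPlan): LogicalPlan =
  child.maxRows match {
    case Some(max) if max <= limit => child      // already tight enough: no-op
    case _                         => LocalLimit(limit, child)
  }
```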
@gatorsmile gatorsmile deleted the unionLimit branch August 6, 2016 15:38