[SPARK-29343][SQL] Eliminate sorts without limit in the subquery of Join/Aggregation #26011

WangGuangxin · 2019-10-03T09:23:54Z

What changes were proposed in this pull request?

This PR complements #21853.
A Sort without a Limit in a Join subquery is useless. The same holds in a GroupBy when the aggregate function is order-irrelevant, such as count or sum.
This PR tries to remove this kind of Sort operator in the SQL optimizer.

Why are the changes needed?

For example, `select count(1) from (select a from test1 order by a)` is equivalent to `select count(1) from (select a from test1)`.
Likewise, `select * from (select a from test1 order by a) t1 join (select b from test2) t2 on t1.a = t2.b` is equivalent to `select * from (select a from test1) t1 join (select b from test2) t2 on t1.a = t2.b`.

Removing useless Sort operators can improve performance.
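
The equivalence claimed above can be illustrated outside Spark with plain Scala collections. This is only a sketch: the `test1` and `test2` values are made-up stand-ins for the tables in the example queries, and the nested-loop `join` is a hypothetical model, not how Spark executes a join.

```scala
object SortIrrelevanceSketch {
  // Hypothetical stand-ins for the rows of test1.a and test2.b.
  val test1: Seq[Int] = Seq(3, 1, 2, 2)
  val test2: Seq[Int] = Seq(2, 3, 5)

  // count(1) depends only on how many rows there are, not on their order.
  def count(rows: Seq[Int]): Int = rows.size

  // Inner equi-join on a = b, modelled as a simple nested loop.
  def join(left: Seq[Int], right: Seq[Int]): Seq[(Int, Int)] =
    for (a <- left; b <- right if a == b) yield (a, b)

  def main(args: Array[String]): Unit = {
    // Sorting the subquery output does not change the count.
    println(count(test1.sorted) == count(test1))
    // A join result carries no guaranteed row order, so compare as multisets
    // (here: by sorting both results before comparing).
    println(join(test1.sorted, test2).sorted == join(test1, test2).sorted)
  }
}
```

Both comparisons hold because neither operator promises anything about output order, which is why the inner sort is wasted work.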

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a new unit test suite, RemoveSortInSubquerySuite.scala.

dongjoon-hyun · 2019-10-03T09:31:01Z

ok to test

SparkQA · 2019-10-03T11:33:27Z

Test build #111734 has finished for PR 26011 at commit 75b43f5.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-10-03T16:52:23Z

Test build #111741 has finished for PR 26011 at commit e29b323.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class Average(child: Expression) extends DeclarativeAggregate with ImplicitCastInputTypes
case class Count(children: Seq[Expression]) extends DeclarativeAggregate with OrderIrrelevantAggs
case class Max(child: Expression) extends DeclarativeAggregate with OrderIrrelevantAggs
case class Min(child: Expression) extends DeclarativeAggregate with OrderIrrelevantAggs
trait OrderIrrelevantAggs extends AggregateFunction
case class Sum(child: Expression) extends DeclarativeAggregate with ImplicitCastInputTypes
case class AnyAgg(arg: Expression) extends UnevaluableBooleanAggBase(arg)

SparkQA · 2019-10-04T07:05:02Z

Test build #111766 has finished for PR 26011 at commit 25a3bb8.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

WangGuangxin · 2019-10-04T07:11:11Z

retest this please

HyukjinKwon · 2019-10-04T08:14:55Z

retest this please

SparkQA · 2019-10-04T11:50:16Z

Test build #111770 has finished for PR 26011 at commit 25a3bb8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-10-07T04:08:29Z

Test build #111828 has finished for PR 26011 at commit d21c683.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class AnyAgg(arg: Expression) extends UnevaluableBooleanAggBase(arg)

SparkQA · 2019-10-13T12:01:41Z

Test build #111994 has finished for PR 26011 at commit d2328a0.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-10-13T17:04:37Z

Test build #111997 has finished for PR 26011 at commit a9e9be9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

WangGuangxin · 2019-10-14T02:02:39Z

@dongjoon-hyun @dilipbiswal @maropu @gatorsmile Could you please take a look at this PR when you have time?

dilipbiswal · 2019-10-14T06:17:17Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RemoveSortInSubquery.scala

+ *
+ * This rule try to remove this kind of [[Sort]] operator.
+ */
+object RemoveSortInSubquery extends Rule[LogicalPlan] with PredicateHelper {


Should the existing RemoveRedundantSorts handle this as well? I ask because I don't see anything subquery-specific in the new rule.

+1 for merging the 2 rules. We can call the merged rule EliminateSorts, as not only redundant sorts are removed.

dilipbiswal · 2019-10-14T06:35:38Z

The idea looks reasonable to me. cc @cloud-fan

maropu · 2019-10-23T07:16:37Z

@WangGuangxin Are you there? Can you update the pr based on the reviews above?

maropu · 2019-10-23T07:21:02Z

...src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/OrderIrrelevantAggs.scala

+ * has nothing to do with the order of input data.
+ * For example, [[Sum]] is [[OrderIrrelevantAggs]] while [[First]] is not.
+ */
+trait OrderIrrelevantAggs extends AggregateFunction {


I think this trait approach doesn't work for ScalaUDAF. Is this expected?

Yes, it's not suitable for ScalaUDAF. In fact, I use OrderIrrelevantAggs as a mixin trait and apply it only to min, max, and several other aggregate functions, not to ScalaUDAF. I hope I understood you correctly.

Maybe OrderIrrelevantAggs doesn't need to extend AggregateFunction.

If there are not too many aggregate functions for this optimization, can we list them inside the rule like the others? e.g., https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala#L162

OK, I think your suggestion is better.
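
The suggestion above, listing the eligible functions inside the rule instead of marking them with a trait, can be sketched with a toy model. The `AggFunc` hierarchy below is hypothetical and is not Catalyst's `AggregateFunction`; it only shows the shape of the allow-list approach.

```scala
object OrderIrrelevanceCheckSketch {
  // Hypothetical, simplified aggregate functions (not Catalyst classes).
  sealed trait AggFunc
  case object Min extends AggFunc
  case object Max extends AggFunc
  case object Count extends AggFunc
  case object First extends AggFunc
  final case class UserDefined(name: String) extends AggFunc

  // The allow-list lives in one place, next to the rule that uses it.
  // Anything not listed, including user-defined aggregates, is treated
  // as order-sensitive by default.
  def isOrderIrrelevant(f: AggFunc): Boolean = f match {
    case Min | Max | Count => true
    case _ => false
  }
}
```

One appeal of this design is that the default is conservative: a newly added or user-defined aggregate never qualifies unless someone explicitly adds it to the match.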

SparkQA · 2019-10-24T06:30:26Z

Test build #112579 has finished for PR 26011 at commit 425f76d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-10-24T07:08:27Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

@@ -967,12 +967,18 @@ object EliminateSorts extends Rule[LogicalPlan] {
 * Removes redundant Sort operation. This can happen:
 * 1) if the child is already sorted
 * 2) if there is another Sort operator separated by 0...n Project/Filter operators
+ * 3) if the Sort operator is within Join and without Limit
+ * 4) if the Sort operator is within GroupBy and the aggregate function is order irrelevant
 */
 object RemoveRedundantSorts extends Rule[LogicalPlan] {


Can we rename it to EliminateSorts? Some sorts are not redundant, but we remove them as well according to SQL semantics.

There is another rule named EliminateSorts. Can we merge the two?

yea let's merge them.

cloud-fan · 2019-10-24T07:14:29Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

 */
 object RemoveRedundantSorts extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
    case Sort(orders, true, child) if SortOrder.orderingSatisfies(child.outputOrdering, orders) =>
      child
    case s @ Sort(_, _, child) => s.copy(child = recursiveRemoveSort(child))
+    case j @ Join(originLeft, originRight, _, _, _) =>


shall we make sure the join condition is deterministic?

SparkQA · 2019-10-26T11:27:45Z

Test build #112712 has finished for PR 26011 at commit 8082324.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2019-10-26T12:57:49Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

@@ -953,26 +952,25 @@ object CombineFilters extends Rule[LogicalPlan] with PredicateHelper {
 }

 /**
- * Removes no-op SortOrder from Sort
+ * Removes Sort operation. This can happen:
+ * 1) if the sort is noop


I think this statement is a little ambiguous, so could you make it more precise?

maropu · 2019-10-26T12:58:43Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

+ * 1) if the sort is noop
+ * 2) if the child is already sorted
+ * 3) if there is another Sort operator separated by 0...n Project/Filter operators
+ * 4) if the Sort operator is within Join and without Limit


Without Limit? It seems the rule below does not check that condition.

Also, "within Join" -> "within Join having deterministic conditions only"?

The Limit case is excluded by canEliminateSort.

That statement is still ambiguous. I think condition 4) is similar to 3); that is, it might be a Join separated by 0...n Project/Filter operators only.
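
To make the conditions discussed here concrete, the following toy model sketches the idea of removing sorts below a deterministic join while stopping at a Limit. All types are hypothetical simplifications (a Boolean flag replaces the real join condition, and only `Project` is treated as order-agnostic); this is not the Catalyst API or the merged rule.

```scala
object EliminateSortsSketch {
  // Hypothetical, heavily simplified logical plan nodes.
  sealed trait Plan
  final case class Scan(table: String) extends Plan
  final case class Sort(child: Plan) extends Plan
  final case class Limit(n: Int, child: Plan) extends Plan
  final case class Project(child: Plan) extends Plan
  final case class Join(left: Plan, right: Plan, deterministicCond: Boolean) extends Plan

  // Drop sorts that are reachable through Project nodes only.
  // Any other node (for example Limit) stops the descent, because a
  // sort below a Limit determines which rows survive.
  def removeSorts(plan: Plan): Plan = plan match {
    case Sort(child)    => removeSorts(child)
    case Project(child) => Project(removeSorts(child))
    case other          => other
  }

  // Only rewrite joins whose condition is deterministic.
  // (A real optimizer would apply this over the whole tree; this
  // sketch only looks at the root node.)
  def optimize(plan: Plan): Plan = plan match {
    case Join(l, r, true) => Join(removeSorts(l), removeSorts(r), true)
    case other            => other
  }

  def main(args: Array[String]): Unit = {
    val plan = Join(Project(Sort(Scan("test1"))), Limit(1, Sort(Scan("test2"))), true)
    // The left sort is removed; the right one is kept because of the Limit.
    println(optimize(plan))
  }
}
```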

maropu · 2019-10-26T13:00:07Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

@@ -987,6 +985,24 @@ object RemoveRedundantSorts extends Rule[LogicalPlan] {
    case f: Filter => f.condition.deterministic
    case _ => false
  }
+
+  def isOrderIrrelevantAggs(aggs: Seq[NamedExpression]): Boolean = {


nit: private?

maropu · 2019-10-26T13:00:25Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

+    case j @ Join(originLeft, originRight, _, cond, _) if cond.forall(_.deterministic) =>
+      j.copy(left = recursiveRemoveSort(originLeft), right = recursiveRemoveSort(originRight))
+    case g @ Aggregate(_, aggs, originChild) if isOrderIrrelevantAggs(aggs) =>
+      g.copy(child = recursiveRemoveSort(originChild))
  }

  def recursiveRemoveSort(plan: LogicalPlan): LogicalPlan = plan match {


(This is not related to this pr though...) nit: private?

maropu · 2019-10-26T13:01:47Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

+      case _ => false
+    }
+
+    aggs.flatMap { e =>


If aggs only has a single PythonUDF, it seems this method returns true.

Yes, I'll try to fix it.

Thanks for pointing this out. I've updated the code and added a UT for UDFs.
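
The pitfall raised in this thread is a case of vacuous truth: if non-aggregate expressions are filtered out before the check, `forall` on the resulting empty collection returns true. The sketch below uses hypothetical `AggCall` and `UdfCall` types to show the shape of the problem and one possible stricter check; it is not necessarily the fix that was merged.

```scala
object VacuousForallSketch {
  // Hypothetical expression types (not Catalyst classes).
  sealed trait Expr
  final case class AggCall(name: String) extends Expr
  final case class UdfCall(name: String) extends Expr

  private val orderIrrelevant = Set("min", "max", "count")

  // Problematic shape: UDF calls are dropped by collect, so a list that
  // contains only a UDF becomes empty and forall is trivially true.
  def buggyCheck(exprs: Seq[Expr]): Boolean =
    exprs.collect { case AggCall(n) => n }.forall(orderIrrelevant)

  // Stricter shape: every expression must itself be a recognised
  // order-irrelevant aggregate; anything else fails the check.
  def strictCheck(exprs: Seq[Expr]): Boolean = exprs.forall {
    case AggCall(n) => orderIrrelevant(n)
    case _          => false
  }
}
```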

SparkQA · 2019-11-03T12:34:52Z

Test build #113160 has finished for PR 26011 at commit f2d9ec1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2019-11-04T00:23:51Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

-        case ae: AggregateExpression => ae.aggregateFunction
-      }
-    }.forall(isOrderIrrelevantAggFunction)
+    def checkValidAggregateExpression(expr: Expression): Boolean = expr match {


nit: private?

We cannot make it private because this is a nested function within private isOrderIrrelevantAggs

Ur, this is an inner func. I see.

maropu · 2019-11-04T02:14:30Z

@dilipbiswal @cloud-fan Looks good overall and could you check this?

SparkQA · 2019-11-04T05:55:25Z

Test build #113186 has finished for PR 26011 at commit 4eccd2a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-11-04T06:52:31Z

thanks, merging to master!

maropu · 2019-11-04T06:56:59Z

Thanks, @WangGuangxin and @cloud-fan !

gatorsmile · 2019-11-11T07:25:55Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

+      case _: Min => true
+      case _: Max => true
+      case _: Count => true
+      case _: Average => true


We could still see a precision difference after eliminating the sort for floating-point data types. I am afraid some end users might prefer adding a sort in these cases to ensure the results are consistent.

cc @maryannxue @cloud-fan

ah that's a good point. AVG over floating values is order sensitive. Not sure if this can really affect queries in practice, but better to be conservative here. @WangGuangxin can you fix it in a followup?

Sure, I'll fix it in a followup

@WangGuangxin do you have no time for the follow-up now? Could I take this over?

Same will be true of all central moments, BTW

I revisited this and have a small question. Avg is in fact transformed into Sum and Count, so I think there should be no precision problem?

Sum is the issue itself, actually; see the followup PR
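
The order sensitivity of a floating-point sum can be reproduced with plain Scala doubles. This is a minimal sketch with hand-picked values; `sumInOrder` is a hypothetical helper that adds left to right and says nothing about how Spark's `Sum` is implemented.

```scala
object FloatSumOrderSketch {
  // Add the values strictly left to right, starting from 0.0.
  def sumInOrder(xs: Seq[Double]): Double = xs.foldLeft(0.0)(_ + _)

  // Hand-picked values: at magnitude 1e16 adjacent doubles are 2 apart,
  // so adding 1.0 to -1e16 is rounded away.
  val values: Seq[Double] = Seq(1e16, -1e16, 1.0)

  def main(args: Array[String]): Unit = {
    // Original order: (1e16 + -1e16) + 1.0
    println(sumInOrder(values))
    // Ascending order: (-1e16 + 1.0) + 1e16, where the 1.0 is lost
    println(sumInOrder(values.sorted))
  }
}
```

With these inputs the first sum is 1.0 and the second is 0.0, so removing (or adding) a sort in front of a floating-point sum can change the result.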

…alMomentAgg from order-insensitive aggregates

### What changes were proposed in this pull request?

This pr is to remove floating-point `Sum/Average/CentralMomentAgg` from order-insensitive aggregates in `EliminateSorts`.

This pr comes from the gatorsmile suggestion: #26011 (comment)

### Why are the changes needed?

Bug fix.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added tests in `SubquerySuite`.

Closes #26534 from maropu/SPARK-29343-FOLLOWUP.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

Eliminate sorts without limit in the subquery of Join/Aggregation

75b43f5

dongjoon-hyun changed the title ~~[SPARK-29343]Eliminate sorts without limit in the subquery of Join/Aggregation~~ [SPARK-29343][SQL] Eliminate sorts without limit in the subquery of Join/Aggregation Oct 3, 2019

dongjoon-hyun added the SQL label Oct 3, 2019

add OrderIrrelevantAggs constrain

e29b323

update ut

25a3bb8

update

d21c683

update

d2328a0

fix ut

a9e9be9

dilipbiswal reviewed Oct 14, 2019

maropu reviewed Oct 23, 2019

merge with RemoveRedudanctSorts

425f76d

cloud-fan reviewed Oct 24, 2019

WangGuangxin added 2 commits October 26, 2019 15:36

merge with eliminate sorts

9072db7

merge test suites

8082324

maropu reviewed Oct 26, 2019

exclude udf

f2d9ec1

maropu reviewed Nov 4, 2019

update comments

4eccd2a

cloud-fan closed this in 83c39d1 Nov 4, 2019

gatorsmile reviewed Nov 11, 2019

maropu mentioned this pull request Nov 15, 2019

[SPARK-29343][SQL][FOLLOW-UP] Remove floating-point Sum/Average/CentralMomentAgg from order-insensitive aggregates #26534

Closed
