-
Notifications
You must be signed in to change notification settings - Fork 28.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-29343][SQL] Eliminate sorts without limit in the subquery of Join/Aggregation #26011
Conversation
ok to test |
Test build #111734 has finished for PR 26011 at commit
|
Test build #111741 has finished for PR 26011 at commit
|
Test build #111766 has finished for PR 26011 at commit
|
retest this please |
1 similar comment
retest this please |
Test build #111770 has finished for PR 26011 at commit
|
Test build #111828 has finished for PR 26011 at commit
|
Test build #111994 has finished for PR 26011 at commit
|
Test build #111997 has finished for PR 26011 at commit
|
@dongjoon-hyun @dilipbiswal @maropu @gatorsmile Could you please take a look at this PR when you have time? |
* | ||
* This rule try to remove this kind of [[Sort]] operator. | ||
*/ | ||
object RemoveSortInSubquery extends Rule[LogicalPlan] with PredicateHelper { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the existing RemoveRedundantSorts handle this as well ? The reason i ask is, i don't see any thing subquery specific in the new rule ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for merging the 2 rules. We can call the merged rule EliminateSorts
, as not only redundant sorts are removed.
The idea looks reasonable to me. cc @cloud-fan |
@WangGuangxin Are you there? Can you update the pr based on the reviews above? |
* has nothing to do with the order of input data. | ||
* For example, [[Sum]] is [[OrderIrrelevantAggs]] while [[First]] is not. | ||
*/ | ||
trait OrderIrrelevantAggs extends AggregateFunction { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this trait approach doesn't work for ScalaUDAF
. Is this expected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's not suitable for ScalaUDAF
. In fact, I used OrderIrrelevantAggs
as a mixin trait, and I only apply this trait to min
, max
and several other agg funciton, not on ScalaUDAF
. Hope I understand you correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe OrderIrrelevantAggs
don't need to extends AggregateFunction
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is not too many agg functions for this optimization, can we list up them inside the rule like the others? e.g., https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala#L162
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I think your suggestion is better.
Test build #112579 has finished for PR 26011 at commit
|
@@ -967,12 +967,18 @@ object EliminateSorts extends Rule[LogicalPlan] { | |||
* Removes redundant Sort operation. This can happen: | |||
* 1) if the child is already sorted | |||
* 2) if there is another Sort operator separated by 0...n Project/Filter operators | |||
* 3) if the Sort operator is within Join and without Limit | |||
* 4) if the Sort operator is within GroupBy and the aggregate function is order irrelevant | |||
*/ | |||
object RemoveRedundantSorts extends Rule[LogicalPlan] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we rename it to EliminateSorts
? Some sorts are not redundant but we remove them as well according to the SQL semantic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is another rule named EliminateSorts
. Can we merge it together?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea let's merge them.
*/ | ||
object RemoveRedundantSorts extends Rule[LogicalPlan] { | ||
def apply(plan: LogicalPlan): LogicalPlan = plan transformDown { | ||
case Sort(orders, true, child) if SortOrder.orderingSatisfies(child.outputOrdering, orders) => | ||
child | ||
case s @ Sort(_, _, child) => s.copy(child = recursiveRemoveSort(child)) | ||
case j @ Join(originLeft, originRight, _, _, _) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we make sure the join condition is deterministic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
Test build #112712 has finished for PR 26011 at commit
|
@@ -953,26 +952,25 @@ object CombineFilters extends Rule[LogicalPlan] with PredicateHelper { | |||
} | |||
|
|||
/** | |||
* Removes no-op SortOrder from Sort | |||
* Removes Sort operation. This can happen: | |||
* 1) if the sort is noop |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this statement is a little ambiguous, so could you make it more precise?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
* 1) if the sort is noop | ||
* 2) if the child is already sorted | ||
* 3) if there is another Sort operator separated by 0...n Project/Filter operators | ||
* 4) if the Sort operator is within Join and without Limit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
without limit? It seems the rule below does not check that condition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, within Join
-> within Join having deterministic conditions only
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
limit
is restricted by canEliminateSort
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That stmt is still ambiguous... probably, I think the condition 4)
is similar to 3
? I mean That might be Join separated by 0...n Project/Filter operators only
.
@@ -987,6 +985,24 @@ object RemoveRedundantSorts extends Rule[LogicalPlan] { | |||
case f: Filter => f.condition.deterministic | |||
case _ => false | |||
} | |||
|
|||
def isOrderIrrelevantAggs(aggs: Seq[NamedExpression]): Boolean = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: private?
case j @ Join(originLeft, originRight, _, cond, _) if cond.forall(_.deterministic) => | ||
j.copy(left = recursiveRemoveSort(originLeft), right = recursiveRemoveSort(originRight)) | ||
case g @ Aggregate(_, aggs, originChild) if isOrderIrrelevantAggs(aggs) => | ||
g.copy(child = recursiveRemoveSort(originChild)) | ||
} | ||
|
||
def recursiveRemoveSort(plan: LogicalPlan): LogicalPlan = plan match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(This is not related to this pr though...) nit: private?
case _ => false | ||
} | ||
|
||
aggs.flatMap { e => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If aggs
only has a single PythonUDF
, it seems this method returns true
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I'll try to fix it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing out this. I've updated and add a UT for udf
Test build #113160 has finished for PR 26011 at commit
|
case ae: AggregateExpression => ae.aggregateFunction | ||
} | ||
}.forall(isOrderIrrelevantAggFunction) | ||
def checkValidAggregateExpression(expr: Expression): Boolean = expr match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: private?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot make it private because this is a nested function within private isOrderIrrelevantAggs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ur, this is an inner func. I see.
@dilipbiswal @cloud-fan Looks good overall and could you check this? |
Test build #113186 has finished for PR 26011 at commit
|
thanks, merging to master! |
Thanks, @WangGuangxin and @cloud-fan ! |
case _: Min => true | ||
case _: Max => true | ||
case _: Count => true | ||
case _: Average => true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could still have a precision difference after eliminating the sort for the floating point data type. I am afraid some end users might prefer to adding a sort in these cases to ensure the results are consistent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah that's a good point. AVG over floating values is order sensitive. Not sure if this can really affect queries in practice, but better to be conservative here. @WangGuangxin can you fix it in a followup?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'll fix it in a followup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WangGuangxin do you have no time for the follow-up now? Could I take this over?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same will be true of all central moments, BTW
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I revisit this and have a small question. In fact Avg
is transformed to Sum
and Count
, so I think there should be no precision problem?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sum
is the issue itself, actually; see the followup PR
…alMomentAgg from order-insensitive aggregates ### What changes were proposed in this pull request? This pr is to remove floating-point `Sum/Average/CentralMomentAgg` from order-insensitive aggregates in `EliminateSorts`. This pr comes from the gatorsmile suggestion: #26011 (comment) ### Why are the changes needed? Bug fix. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added tests in `SubquerySuite`. Closes #26534 from maropu/SPARK-29343-FOLLOWUP. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
What changes were proposed in this pull request?
This is somewhat a complement of #21853.
The
Sort
withoutLimit
operator inJoin
subquery is useless, it's the same case inGroupBy
when the aggregation function is order irrelevant, such ascount
,sum
.This PR try to remove this kind of
Sort
operator inSQL Optimizer
.Why are the changes needed?
For example,
select count(1) from (select a from test1 order by a)
is equal toselect count(1) from (select a from test1)
.'select * from (select a from test1 order by a) t1 join (select b from test2) t2 on t1.a = t2.b' is equal to
select * from (select a from test1) t1 join (select b from test2) t2 on t1.a = t2.b
.Remove useless
Sort
operator can improve performance.Does this PR introduce any user-facing change?
No
How was this patch tested?
Adding new UT
RemoveSortInSubquerySuite.scala