-
Notifications
You must be signed in to change notification settings - Fork 28.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-22084][SQL] Fix performance regression in aggregation strategy #19301
Changes from 4 commits
6f555c2
5aaae4c
adce474
bf7d2cf
85a93ac
e76c43c
7f89fe7
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -17,6 +17,8 @@ | |
|
||
package org.apache.spark.sql.catalyst.expressions.aggregate | ||
|
||
import java.util.Objects | ||
|
||
import org.apache.spark.sql.catalyst.InternalRow | ||
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute | ||
import org.apache.spark.sql.catalyst.expressions._ | ||
|
@@ -72,11 +74,19 @@ object AggregateExpression { | |
aggregateFunction: AggregateFunction, | ||
mode: AggregateMode, | ||
isDistinct: Boolean): AggregateExpression = { | ||
val state = if (aggregateFunction.resolved) { | ||
Seq(aggregateFunction.toString, aggregateFunction.dataType, | ||
aggregateFunction.nullable, mode, isDistinct) | ||
} else { | ||
Seq(aggregateFunction.toString, mode, isDistinct) | ||
} | ||
val hashCode = state.map(Objects.hashCode).foldLeft(0)((a, b) => 31 * a + b) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. what's the purpose here? |
||
|
||
AggregateExpression( | ||
aggregateFunction, | ||
mode, | ||
isDistinct, | ||
NamedExpression.newExprId) | ||
ExprId(hashCode)) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think this is the right fix. Semantically the It's kind of an optimization in aggregate planner, we should detect these semantically different but duplicated aggregate functions and only plan one of them. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agreed with @cloud-fan. This should be an optimization done in aggregate planner, instead of forcibly setting expr id here. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @cloud-fan @viirya I've tried to optimize in aggregate planner (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala#L211). // A single aggregate expression might appear multiple times in resultExpressions.
// In order to avoid evaluating an individual aggregate function multiple times, we'll
// build a set of the distinct aggregate expressions and build a function which can
// be used to re-write expressions so that they reference the single copy of the
// aggregate function which actually gets computed.
val aggregateExpressions = resultExpressions.flatMap { expr =>
expr.collect {
case agg: AggregateExpression =>
val aggregateFunction = agg.aggregateFunction
val state = if (aggregateFunction.resolved) {
Seq(aggregateFunction.toString, aggregateFunction.dataType,
aggregateFunction.nullable, agg.mode, agg.isDistinct)
} else {
Seq(aggregateFunction.toString, agg.mode, agg.isDistinct)
}
val hashCode = state.map(Objects.hashCode).foldLeft(0)((a, b) => 31 * a + b)
(hashCode, agg)
}
}.groupBy(_._1).map { case (_, values) =>
values.head._2
}.toSeq But it's difficult to distinguish between different typed aggregators without expr id. Current solution can works well for all of aggregate functions. I'm not familiar with typed aggregators, any suggestions will be appreciated. |
||
} | ||
} | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks all the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q -> O