[SPARK-27296][SQL] Allows Aggregator to be registered as a UDF #25024
Conversation
Test build #107090 has finished for PR 25024 at commit

Test build #107124 has finished for PR 25024 at commit

Test build #107132 has finished for PR 25024 at commit
Perf numbers?
Test build #107294 has finished for PR 25024 at commit
@rxin I wrote up my benchmarking results above, under "How was this patch tested". For an aggregator with non-trivial ser/de, the improvement can be two orders of magnitude, but it is smaller for simpler aggregators.
Can you look into how this actually works under the hood? The code is still doing serialization, so it doesn't really make sense to me that it is significantly faster than the existing UDAF. I also find it very weird to have two UDAF interfaces that look very similar. It would be great if we could fix the old one.
@rxin the key difference is in the `update` method. The basic pattern of a UDAF `update` is:

```scala
def update(buf: MutableAggregationBuffer, input: Row): Unit = {
  val agg = buf.getAs[AggregatorType](0) // UDT deserializes the aggregator from 'buf'
  agg.update(input)                      // update the state of your aggregation
  buf(0) = agg                           // UDT re-serializes the aggregator back into 'buf'
}
```

The consequence of this is that it calls deserialize and (re)serialize on the actual aggregating structure for every single input row. If your dataframe has a million rows, it's doing ser/de on your aggregator a million times, not just at the end of each data partition. Compare that with the UDIA (which is driven by `TypedImperativeAggregate`):

```scala
def update(agg: AggregatorType, input: Row): AggregatorType = {
  agg.update(input) // update the state of your aggregator from the input
  agg               // return the aggregator
}
```

You can see that here there is no ser/de of the aggregator at all when processing input rows (which is as it should be). The `TypedImperativeAggregate` only invokes ser/de on the aggregator when it is collecting partial results across partitions (and at the end, when it is presenting final results into the output data frame). So, imagine a data frame with 10 partitions and 1 million rows: the UDAF does ser/de on the aggregator a million (plus 10) times, while the UDIA does ser/de only 10 times.
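For reference, here is a minimal sketch of the shape of the `TypedImperativeAggregate` contract being described (paraphrased, not the actual Spark source; the real class is internal to Catalyst and operates on `InternalRow`):

```scala
import org.apache.spark.sql.Row

// Sketch of the TypedImperativeAggregate contract: the aggregation buffer is
// an arbitrary JVM object, and ser/de happens only at partition boundaries.
abstract class TypedAggSketch[T] {
  def createAggregationBuffer(): T       // fresh buffer, once per partition
  def update(buffer: T, input: Row): T   // per input row: no ser/de involved
  def merge(buffer: T, other: T): T      // combine two partial results
  def eval(buffer: T): Any               // produce the final output value
  def serialize(buffer: T): Array[Byte]  // called only at partition boundaries
  def deserialize(bytes: Array[Byte]): T // called only when merging partials
}
```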
With respect to "fixing UDAF" (instead of creating a new UDIA), I have convinced myself there is no path there, but here is where I went with that: as described above, the basic pattern of a UDAF def update(buf: MutableAggregationBuffer, input: Row): Unit = {
val agg = buf.getAs[AggregatorType](0) // UDT deserializes the aggregator from 'buf'
agg.update(input) // update the state of your aggregation
buf(0) = agg // UDT re-serializes the aggregator back into buf
} So, the problem arises out of the UDT, which does ser/de. In theory, IF you could just store However, if you try this trick (and I did), Spark will crash with an "unrecognized data type" exception, because it only allows defined subclasses of I do not think Spark/Catalyst can be made to cope with raw object references in Row objects, as it needs to know how to operate on whatever objects live in Rows. It requires a "closed universe" of possible DataTypes. Even if you disabled the enforcement and allowed arbitrary object references in Rows, it would break spark. As an aside, spark arguably already has two parallel aggregator interfaces: UDAF and TypedImperativeAggregate (which is what all the predefined aggregators use). This PR is exposing that second one to users. |
To elaborate on the 'raw object reference' above, what I specifically did was try using a `DataType` that wraps a raw object reference. That immediately fails here:

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala, line 215 (at 3139d64)

For fun I tried defaulting that to "identity" for the unmatched case. So that is a flavor of Catalyst's problem with handling anything outside its defined universe of data types.
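As a self-contained illustration of that closed universe (my example, not the Spark source), Catalyst-style code dispatches over the known `DataType` subclasses and throws on anything else:

```scala
import org.apache.spark.sql.types._

// Exhaustive dispatch over the known DataType subclasses: a raw object
// reference has no case to land in, so it hits the failure branch.
def describe(dt: DataType): String = dt match {
  case _: NumericType   => "numeric"
  case StringType       => "string"
  case ArrayType(et, _) => s"array<${describe(et)}>"
  case st: StructType   =>
    st.fields.map(f => describe(f.dataType)).mkString("struct<", ", ", ">")
  case other =>
    throw new UnsupportedOperationException(s"Unrecognized data type: $other")
}
```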
With respect to the redundancy of UDAF and UDIA, I would actually propose a deprecation path for the existing UDAF. The alternative proposed on this PR has behavior parity. It could use a story for Java. When the final presentation type uses a UDT, it ought to work via pyspark exactly the same as UDAF.
I'd like to merge this, unless people have additional concerns or questions.
Thanks. Will take a look at it this week. |
Force-pushed from 63802cd to 72795a9.
Test build #111540 has finished for PR 25024 at commit
cc @rxin @cloud-fan @hvanhovell

pros:

cons:

In summary, doing this with enhancements to Aggregator is definitely feasible; however, I do not think it can provide total feature parity with UDAF or UDIA.
What I want to see is a single UDAF API that works for all the use cases, if it's possible. It's really confusing to end-users if there are a lot of UDAF APIs in Spark and they don't know which one to use. Since ser/de is inevitable, I think we can build this on `Aggregator`. My proposal is to unify around `Aggregator`.
@cloud-fan I could get behind that as a unification. I'll look at `Aggregator`.
@cloud-fan ok I think I can get there with `Aggregator`.
Test build #112194 has finished for PR 25024 at commit
looks pretty good except some minor comments, thanks for the great work!
Test build #116183 has finished for PR 25024 at commit
looks pretty good now!
Test build #116252 has finished for PR 25024 at commit
@cloud-fan I'm beginning to feel like this is ready to merge, unless you have additional feedback. What do you think?

@rdblue do you have anything to add? if not I'll merge it this week.

Looks good to me. Thanks to both of you for all of the hard work on this feature!

thanks, merging to master!
A follow-up review thread on this code in `ScalaAggregator`:

```scala
    with ImplicitCastInputTypes
    with Logging {

  private[this] lazy val inputEncoder = inputEncoderNR.resolveAndBind()
```
I think this is the problem. We shouldn't keep the encoder unresolved in the query plan and resolve it on the executor side. We can follow `ResolveEncodersInUDF`: add a rule to resolve the encoders in `ScalaAggregator` on the driver side.
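For concreteness, a sketch of the rule shape being suggested, assuming access to Spark's internals (`ScalaAggregator` is the class added by this PR; `withEncodersResolved` is a hypothetical helper standing in for copying the aggregator with its encoders resolved):

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.aggregate.ScalaAggregator

// Analyzer rule sketch: resolve encoders held by ScalaAggregator expressions
// on the driver, so resolved (not unresolved) expressions reach executors.
object ResolveEncodersInScalaAggSketch extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUp {
    case p => p.transformExpressionsUp {
      // Hypothetical helper: would apply resolveAndBind to each encoder.
      case agg: ScalaAggregator[_, _, _] => agg.withEncodersResolved
    }
  }
}
```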
Yea, this defers resolving the encoder to the executors; we should resolve it on the driver.
Thank you for pinging me, @cloud-fan.
… and UnresolvedMapObjects

Context: The fix for SPARK-27296 introduced by #25024 allows `Aggregator` objects to appear in queries. This works fine for aggregators with atomic input types, e.g. `Aggregator[Double, _, _]`. However, it can cause a null pointer exception if the input type is `Array[_]`. This was historically considered an ignorable case for serialization of `UnresolvedMapObjects`, but the new `ScalaAggregator` class causes these expressions to be serialized over to executors because the resolve-and-bind is being deferred.

### What changes were proposed in this pull request?

A new rule `ResolveEncodersInScalaAgg` that performs the resolution of the expressions contained in the encoders, so that properly resolved expressions are serialized over to executors.

### Why are the changes needed?

Applying an aggregator of the form `Aggregator[Array[_], _, _]` using `functions.udaf()` currently causes a null pointer error in Catalyst.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

A unit test has been added that does aggregation with array types for input, buffer, and output. I have done additional testing with my own custom aggregators in the spark REPL.

Closes #28983 from erikerlandson/fix-spark-32159.

Authored-by: Erik Erlandson <eerlands@redhat.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 1cb5bfc)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?

Defines a new subclass of UDF: `UserDefinedAggregator`. Also allows `Aggregator` to be registered as a UDF. Under the hood, the implementation is based on the internal `TypedImperativeAggregate` class that Spark's predefined aggregators make use of. The effect is that custom user-defined aggregators are now serialized only on partition boundaries, instead of being serialized and deserialized at each input row.

The two new modes of using `Aggregator` are sketched below.
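A minimal sketch of those two modes (my example: `LongSum`, the column name, and the view name are assumptions; `functions.udaf` and `spark.udf.register` are the entry points this PR adds):

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.{col, udaf}

// A trivial Aggregator used to demonstrate registration; the interesting
// part is the two udaf(...) call sites below, not the aggregation logic.
object LongSum extends Aggregator[Long, Long, Long] {
  def zero: Long = 0L
  def reduce(b: Long, a: Long): Long = b + a
  def merge(b1: Long, b2: Long): Long = b1 + b2
  def finish(b: Long): Long = b
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

val spark = SparkSession.builder.getOrCreate()
val df = spark.range(1000).toDF("x")

// Mode 1: wrap the Aggregator as an untyped UDF for the DataFrame API.
val longSum = udaf(LongSum, Encoders.scalaLong)
df.agg(longSum(col("x"))).show()

// Mode 2: register it by name for use from SQL.
spark.udf.register("long_sum", udaf(LongSum, Encoders.scalaLong))
df.createOrReplaceTempView("tbl")
spark.sql("SELECT long_sum(x) FROM tbl").show()
```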
### How was this patch tested?
Unit testing has been added that corresponds to the testing suites for `UserDefinedAggregateFunction`. Additionally, unit tests explicitly count the number of aggregator ser/de cycles, to ensure that it is governed only by the number of data partitions.

To evaluate the performance impact, I did two comparisons.
The code and REPL results are recorded on this gist. To characterize its behavior, I benchmarked both a relatively simple aggregator and an aggregator with a complex internal structure (a t-digest).
### Performance
The following compares the new `Aggregator`-based aggregation against UDAF. In this scenario, the new aggregation is about 100x faster. The size of the improvement depends on the complexity of the aggregator: for very simple aggregators (e.g. implementing 'sum'), it is more like 25-30%.
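For context, a rough sketch of the kind of timing comparison described (my harness, not the author's gist; it assumes a DataFrame with a column `x` and two aggregation functions to compare):

```scala
import org.apache.spark.sql.{Column, DataFrame}

// Crude wall-clock timing of one aggregation pass; a real benchmark would
// warm up the JVM and average over multiple runs.
def time[A](label: String)(body: => A): A = {
  val t0 = System.nanoTime()
  val result = body
  println(f"$label: ${(System.nanoTime() - t0) / 1e9}%.3f s")
  result
}

// Run the same aggregation through both interfaces and report the timings.
def compare(df: DataFrame, newAgg: Column => Column, oldAgg: Column => Column): Unit = {
  time("Aggregator via udaf()") { df.agg(newAgg(df("x"))).collect() }
  time("legacy UDAF")           { df.agg(oldAgg(df("x"))).collect() }
}
```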