[SPARK-18980][SQL] implement Aggregator with TypedImperativeAggregate #16383

cloud-fan wants to merge 2 commits into apache:master from cloud-fan/aggregator
Conversation
aggregate:                               Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
RDD sum                                        1913 / 1942         52.3          19.1       1.0X
DataFrame sum                                    46 /   61       2157.7           0.5      41.3X
Dataset sum using Aggregator                   4656 / 4758         21.5          46.6       0.4X
Dataset complex Aggregator                     6636 / 7039         15.1          66.4       0.3X
The result of master branch:
[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Mac OS X 10.12.1
[info] Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
[info]
[info] aggregate: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------
[info] RDD sum 1887 / 1898 53.0 18.9 1.0X
[info] DataFrame sum 46 / 60 2152.2 0.5 40.6X
[info] Dataset sum using Aggregator 4549 / 4579 22.0 45.5 0.4X
[info] Dataset complex Aggregator 12885 / 13830 7.8 128.9 0.1X
You can see that for the complex aggregator we get about a 2x speedup, with no performance regression for the simple aggregator.
Since this benchmark uses only one key in the aggregation, it should run hash-based aggregation via ObjectHashAggregateExec. When the number of keys is large, it will fall back to sort-based aggregation, which I think is the more common usage pattern. Can we still see such a performance improvement there?
Hash-based vs. sort-based only decides how we "group" the records, while this PR speeds up the "aggregating" part.
Yeah, but sort-based aggregation adds extra cost for the sorting, especially an external sort. With TypedImperativeAggregate, Aggregator can now easily fall back to sort-based aggregation. I am wondering if that degrades performance.
So we have a trade-off here: waste time on buffer de/serialization, or be more likely to fall back to sort.

I think being more likely to fall back to sort is just an implementation limitation of the current object hash aggregate; once we add a size-estimation interface to TypedImperativeAggregate, we can be more aggressive about when to fall back to sort.
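The trade-off can be sketched outside Spark. Below is a toy illustration (hypothetical names, plain Java serialization instead of Spark's encoders, not Spark's actual classes) of why paying a serialize/deserialize round trip per input row is expensive for a complex buffer, while the typed-imperative style keeps the object on the heap:

```scala
import java.io._

// Hypothetical complex buffer; case classes are Serializable by default.
case class ComplexBuffer(counts: Map[String, Long])

def serialize(b: ComplexBuffer): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  oos.writeObject(b)
  oos.close()
  bos.toByteArray
}

def deserialize(bytes: Array[Byte]): ComplexBuffer =
  new ObjectInputStream(new ByteArrayInputStream(bytes))
    .readObject().asInstanceOf[ComplexBuffer]

// Declarative style: the buffer lives in row storage as bytes, so every
// input row pays a full decode + encode round trip.
def declarativeUpdate(stored: Array[Byte], word: String): Array[Byte] = {
  val buf = deserialize(stored)
  serialize(ComplexBuffer(buf.counts.updated(word, buf.counts.getOrElse(word, 0L) + 1L)))
}

// Typed-imperative style: the buffer object stays in memory; it is only
// serialized when spilling/shuffling or at the very end.
def imperativeUpdate(buf: ComplexBuffer, word: String): ComplexBuffer =
  ComplexBuffer(buf.counts.updated(word, buf.counts.getOrElse(word, 0L) + 1L))
```

Both paths produce the same result, but the declarative path walks the object graph twice per input row, which is the per-row cost this PR removes for complex buffers.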
  /**
-  * In-place updates the aggregation buffer object with an input row. buffer = buffer + input.
+  * Updates the aggregation buffer object with an input row and returns a new buffer object. For
+  * performance, the function may do in-place update and return it instead of constructing new
For this change, do we have a use case that doesn't do in-place update?
For example, TypedSumDouble and TypedSumLong: we can't do in-place update when the buffer type is a primitive type.
I think those are the cases that don't need serialization?
But the Aggregator interface doesn't guarantee in-place update, and since this is a public interface, you can't change it to force users to do in-place updates.
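A minimal, hypothetical Aggregator-like trait (not Spark's actual Aggregator API) makes both cases concrete: with a primitive buffer, reduce has no choice but to return a new value; with a mutable buffer, it may update in place and return the same object. This is also why the caller must always use the returned buffer rather than assume in-place mutation:

```scala
// Hypothetical stand-in for Aggregator's reduce contract.
trait MiniAggregator[IN, BUF] {
  def zero: BUF
  // May update `b` in place, or may return a brand-new buffer.
  def reduce(b: BUF, a: IN): BUF
}

// Primitive buffer (like TypedSumLong): Long is immutable, so reduce
// must return a new value.
object SumLong extends MiniAggregator[Long, Long] {
  def zero: Long = 0L
  def reduce(b: Long, a: Long): Long = b + a
}

// Mutable buffer: reduce mutates and returns the same object.
class Counter { var n: Long = 0L }
object CountAgg extends MiniAggregator[Long, Counter] {
  def zero: Counter = new Counter
  def reduce(b: Counter, a: Long): Counter = { b.n += a; b }
}

// Caller code must not assume in-place update: always use the result.
def fold[IN, BUF](agg: MiniAggregator[IN, BUF], inputs: Seq[IN]): BUF =
  inputs.foldLeft(agg.zero)(agg.reduce)
```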
Test build #70517 has finished for PR 16383 at commit
  final override def update(buffer: InternalRow, input: InternalRow): Unit = {
-   update(getBufferObject(buffer), input)
+   buffer(mutableAggBufferOffset) = update(getBufferObject(buffer), input)
I do not find that InternalRow implements apply(int); is there an implicit cast here?
Scala will invoke `update` here, see http://daily-scala.blogspot.com/2009/08/apply-update-methods.html
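For reference, the desugaring can be shown with a tiny stand-in class (hypothetical, not Spark's InternalRow): Scala rewrites `obj(i) = v` to `obj.update(i, v)` and `obj(i)` to `obj.apply(i)`:

```scala
// Minimal stand-in demonstrating Scala's apply/update sugar.
class MiniRow(size: Int) {
  private val values = new Array[Any](size)
  def update(i: Int, v: Any): Unit = values(i) = v  // called by r(i) = v
  def apply(i: Int): Any = values(i)                // called by r(i)
}

val r = new MiniRow(3)
r(0) = "x"           // compiles to r.update(0, "x")
assert(r(0) == "x")  // compiles to r.apply(0)
```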
      deser: Expression,
      cls: Class[_],
      schema: StructType): TypedAggregateExpression = {
    copy(inputDeserializer = Some(deser), inputClass = Some(cls), inputSchema = Some(schema))
Where do we need inputClass? TypedAggregateExpression has this parameter but I don't see it used anywhere.
It was there before; it looks like it's only used to show more information in explain, but I'm not going to change that in this PR.
    inputAggBufferOffset: Int = 0)
  extends TypedImperativeAggregate[Any] with TypedAggregateExpression with NonSQLExpression {

  override def deterministic: Boolean = true
I have a question about deterministic here. How the data is processed is actually delegated to the Aggregator, and I think an Aggregator can easily produce non-deterministic output, especially since Aggregator is used for user-defined aggregations.

Do you think we should let the Aggregator decide whether it is a deterministic expression or not?
Like UDFs, we can assume the Aggregator is always deterministic. I think in the future we should allow users to define non-deterministic UDFs (including Aggregators).
      newInputAggBufferOffset: Int): ComplexTypedAggregateExpression =
    copy(inputAggBufferOffset = newInputAggBufferOffset)

  override def withInputInfo(
This looks the same as SimpleTypedAggregateExpression.withInputInfo. Since the return type is TypedAggregateExpression, can we implement it just once in TypedAggregateExpression?
How do we implement `copy` in a trait?
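To illustrate the difficulty (a hypothetical sketch, not the real Spark classes): `copy` is synthesized per case class, with a signature derived from that class's constructor, so a shared trait can only declare the operation abstractly and let each case class delegate to its own `copy`:

```scala
// A trait cannot call `copy` itself: the method is generated separately
// for each case class, so only the concrete class can delegate to it.
trait TypedAggExprLike {
  def withOffset(n: Int): TypedAggExprLike
}

case class SimpleExpr(offset: Int = 0) extends TypedAggExprLike {
  override def withOffset(n: Int): SimpleExpr = copy(offset = n)
}

case class ComplexExpr(offset: Int = 0, extra: String = "") extends TypedAggExprLike {
  override def withOffset(n: Int): ComplexExpr = copy(offset = n)
}
```

Each implementation is a one-liner, but hoisting it into the trait would require something like a reflective makeCopy-style helper rather than the plain `copy` call.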
LGTM. This is a cool improvement.

retest this please

Test build #70583 has started for PR 16383 at commit

retest this please

Test build #70596 has finished for PR 16383 at commit

Merging to master! I'll address comments in a follow-up PR if there are any.
## What changes were proposed in this pull request?

Currently we implement `Aggregator` with `DeclarativeAggregate`, which serializes/deserializes the buffer object every time we process an input.

This PR implements `Aggregator` with `TypedImperativeAggregate`, which avoids serializing/deserializing the buffer object many times. The benchmark shows we get about a 2x speedup.

For simple buffer objects that don't need serialization, we still go with `DeclarativeAggregate` to avoid a performance regression.

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#16383 from cloud-fan/aggregator.