[SPARK-16973][SQL] remove the buffer offsets in ImperativeAggregate by cloud-fan · Pull Request #14562 · apache/spark

cloud-fan · 2016-08-09T10:19:27Z

What changes were proposed in this pull request?

The mutableAggBufferOffset and inputAggBufferOffset fields in ImperativeAggregate are hard to understand and tightly coupled with the aggregation implementation. What's worse, every ImperativeAggregate implementation needs to understand this concept and handle it carefully.

This PR isolates the buffer offsets concept in the base class ImperativeAggregate by introducing a sliced row. The aggregate interface then moves to ImperativeAggregateImpl, so ImperativeAggregateImpl implementations no longer need to care about the buffer offsets.
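To make the sliced row idea concrete, here is a minimal sketch. It does not use Spark's InternalRow or the classes added by this PR; SimpleRow, ArrayBackedRow, SlicedRowSketch, and CountSketch are hypothetical names invented for illustration, and only Long fields are modelled.

```scala
// Hypothetical stand-in for a row; NOT Spark's InternalRow.
trait SimpleRow {
  def numFields: Int
  def getLong(ordinal: Int): Long
  def setLong(ordinal: Int, value: Long): Unit
}

// A row backed by a plain array, playing the role of the shared aggregation buffer.
final class ArrayBackedRow(values: Array[Long]) extends SimpleRow {
  def numFields: Int = values.length
  def getLong(ordinal: Int): Long = values(ordinal)
  def setLong(ordinal: Int, value: Long): Unit = { values(ordinal) = value }
}

// A window over `numFields` fields of a target row, starting at `offset`.
// Callers address the slice with ordinals starting at 0.
final class SlicedRowSketch(offset: Int, val numFields: Int) extends SimpleRow {
  private var target: SimpleRow = null

  // Re-point the slice at another underlying row without allocating a new slice.
  def pointTo(row: SimpleRow): SlicedRowSketch = { target = row; this }

  def getLong(ordinal: Int): Long = {
    require(ordinal >= 0 && ordinal < numFields, s"ordinal $ordinal out of range")
    target.getLong(offset + ordinal)
  }

  def setLong(ordinal: Int, value: Long): Unit = {
    require(ordinal >= 0 && ordinal < numFields, s"ordinal $ordinal out of range")
    target.setLong(offset + ordinal, value)
  }
}

// An aggregate that only ever sees its own slice and knows nothing about offsets.
final class CountSketch {
  def initialize(buffer: SimpleRow): Unit = buffer.setLong(0, 0L)
  def update(buffer: SimpleRow): Unit = buffer.setLong(0, buffer.getLong(0) + 1L)
}
```

With a three-field shared buffer and a slice created as `new SlicedRowSketch(1, 1)`, the count lands in field 1 of the shared buffer while fields 0 and 2 are left alone.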

How was this patch tested?

existing tests.

cloud-fan · 2016-08-09T10:19:51Z

cc @yhuai @liancheng @clockfly

SparkQA · 2016-08-09T11:46:45Z

Test build #63438 has finished for PR 14562 at commit 9658056.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class CollectList(child: Expression) extends Collect
- case class CollectSet(child: Expression) extends Collect
- trait BaseSlicedInternalRow extends InternalRow
- case class SlicedInternalRow(offset: Int, numFields: Int) extends BaseSlicedInternalRow
- case class SlicedMutableRow(offset: Int, numFields: Int)

hvanhovell · 2016-08-09T12:51:20Z

Does this have performance implications? We are adding a layer of indirection to a hot code path.

hvanhovell · 2016-08-09T12:55:11Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/slicedRows.scala

+  }
+
+  override def copy(): InternalRow = {
+    throw new UnsupportedOperationException("Cannot copy a SlicedMutableRow")


SlicedMutableRow -> SlicedInternalRow?

cloud-fan · 2016-08-09T13:19:13Z

@hvanhovell I'm not sure about the performance. I will benchmark it later; hopefully the JVM can inline these calls successfully.

rxin · 2016-08-10T04:03:35Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/slicedRows.scala

+  }
+}
+
+case class SlicedInternalRow(offset: Int, numFields: Int) extends BaseSlicedInternalRow {


Does this need to be a case class?

Hmm, does a case class have a performance penalty? It doesn't need to be one, though.

It generates a lot of crap in bytecode, so it would be good not to generate it unless it is useful.
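For context on the bytecode remark, the sketch below contrasts a case class with a plain class. The names SliceAsCaseClass and SliceAsPlainClass are hypothetical and are not the classes in this PR. For a case class the Scala compiler synthesizes equals, hashCode, toString, copy, the Product methods, and a companion object with apply and unapply; a plain class gets none of these.

```scala
// Hypothetical names for illustration only.

// Case class: the compiler also generates equals, hashCode, toString, copy,
// productArity/productElement, and a companion with apply/unapply.
case class SliceAsCaseClass(offset: Int, numFields: Int)

// Plain class: only the constructor and the two accessors are generated.
class SliceAsPlainClass(val offset: Int, val numFields: Int)
```

As a result, `SliceAsCaseClass(1, 2) == SliceAsCaseClass(1, 2)` is true (structural equality), while two `new SliceAsPlainClass(1, 2)` instances compare by reference and are not equal. If none of the generated members are needed, the plain class is the leaner choice.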

SparkQA · 2016-08-10T11:08:49Z

Test build #63524 has finished for PR 14562 at commit 1444176.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class SlicedInternalRow(protected val offset: Int, val numFields: Int)
- class SlicedMutableRow(protected val offset: Int, val numFields: Int)

SparkQA · 2016-08-10T20:37:01Z

Test build #63552 has finished for PR 14562 at commit 8e0531b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

clockfly · 2016-08-16T23:12:47Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala

-   */
-  def withNewMutableAggBufferOffset(newMutableAggBufferOffset: Int): ImperativeAggregate
+  final def setMutableBufferOffset(offset: Int): Unit = {
+    assert(mutableBufferRow == null)


Do you want to do a runtime check? Then how about using require?

assert may be removed by the compiler.

Require implies that the caller has passed a bad argument. Assert checks if invariants hold (the class is not in an unexpected state). I think this should be an assert.

Instead of setting an attribute that is supposed to be immutable, can we use copy to copy the whole class?
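The require versus assert distinction in this thread can be sketched as follows. OffsetHolderSketch is a hypothetical class, only loosely modelled on the setter in the diff. In Scala, `require` throws IllegalArgumentException and signals a bad argument from the caller, while `assert` throws AssertionError and guards an internal invariant; in Scala 2, `assert` calls can be elided at compile time (for example with `-Xdisable-assertions`).

```scala
// Hypothetical class; not the code from this PR.
final class OffsetHolderSketch {
  private var offset: Int = -1
  private var alreadySet: Boolean = false

  def setOffset(newOffset: Int): Unit = {
    // Caller error: a bad argument was passed in.
    require(newOffset >= 0, s"offset must be non-negative, got $newOffset")
    // Invariant: the offset is set at most once during this object's lifetime.
    assert(!alreadySet, "offset was already set")
    offset = newOffset
    alreadySet = true
  }

  def currentOffset: Int = offset
}
```

Calling `setOffset(-1)` fails the `require`; calling `setOffset` twice with valid arguments fails the `assert`, provided assertions have not been disabled.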

clockfly · 2016-08-16T23:57:34Z

My biggest concerns are performance and abstraction.

There seem to be too many function calls to update each row. Is this cost too high?
The new methods added to ImperativeAggregate may create confusion. For example, there are both:

def update
def doUpdate
def merge
def doMerge
def initialize
def doInitialize
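The paired method names follow a template-method shape: the public method deals with the offset and then delegates to the do-prefixed hook. The sketch below is a simplified guess at that shape, not the PR's code. OffsetAwareAggregateSketch, SliceView, and SumSketch are hypothetical names, and the buffer is an Array[Long] rather than an InternalRow.

```scala
// Hypothetical simplification: the shared buffer is an Array[Long].

// A view of the shared buffer that shifts every ordinal by `offset`.
final class SliceView(underlying: Array[Long], offset: Int) {
  def apply(ordinal: Int): Long = underlying(offset + ordinal)
  def update(ordinal: Int, value: Long): Unit = { underlying(offset + ordinal) = value }
}

abstract class OffsetAwareAggregateSketch(bufferOffset: Int) {
  // Public entry points: called with the whole shared buffer.
  // This sketch allocates a view per call for brevity; a real implementation
  // would presumably reuse one view to avoid allocation on the hot path.
  final def initialize(sharedBuffer: Array[Long]): Unit =
    doInitialize(new SliceView(sharedBuffer, bufferOffset))

  final def update(sharedBuffer: Array[Long], input: Long): Unit =
    doUpdate(new SliceView(sharedBuffer, bufferOffset), input)

  // Hooks for subclasses: they see only their own slice, starting at ordinal 0.
  protected def doInitialize(buffer: SliceView): Unit
  protected def doUpdate(buffer: SliceView, input: Long): Unit
}

final class SumSketch(bufferOffset: Int) extends OffsetAwareAggregateSketch(bufferOffset) {
  protected def doInitialize(buffer: SliceView): Unit = { buffer(0) = 0L }
  protected def doUpdate(buffer: SliceView, input: Long): Unit = { buffer(0) = buffer(0) + input }
}
```

Each public call adds one extra method call and one level of indirection per row, which is the cost being questioned here.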

clockfly · 2016-08-16T23:59:04Z

Maybe there is another alternative: for example, we can define an InternalRowReader that wraps the offset.

Subclasses of ImperativeAggregate would need to use the InternalRowReader to read fields from an InternalRow.
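The comment names InternalRowReader but gives no API, so the following is only one possible reading of the suggestion. OffsetRowReaderSketch and MaxSketch are hypothetical names, and an Array[Long] stands in for InternalRow. Here the offset lives in a small helper object, and the aggregate passes the full row to the helper on every access instead of receiving a sliced row.

```scala
// Hypothetical sketch; the thread does not define InternalRowReader's API.

// Wraps the offset; all reads and writes on the full row go through it.
final class OffsetRowReaderSketch(offset: Int) {
  def getLong(row: Array[Long], ordinal: Int): Long = row(offset + ordinal)
  def setLong(row: Array[Long], ordinal: Int, value: Long): Unit = { row(offset + ordinal) = value }
}

// The aggregate never does offset arithmetic itself.
final class MaxSketch(reader: OffsetRowReaderSketch) {
  def initialize(buffer: Array[Long]): Unit = reader.setLong(buffer, 0, Long.MinValue)

  def update(buffer: Array[Long], input: Long): Unit = {
    val current = reader.getLong(buffer, 0)
    if (input > current) reader.setLong(buffer, 0, input)
  }
}
```

Compared with a sliced row, this keeps the row type unchanged but requires every implementation to route its accesses through the reader.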

cloud-fan · 2016-08-17T07:45:41Z

...src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HyperLogLogPlusPlus.scala

-  }
-
-  // Note: although this simply copies aggBufferAttributes, this common code can not be placed
-  // in the superclass because that will lead to initialization ordering issues.


Finally got rid of it!

SparkQA · 2016-08-17T08:55:12Z

Test build #63906 has finished for PR 14562 at commit e50aece.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- abstract class Collect extends ImperativeAggregateImpl
- sealed abstract class AggregateFunction extends Expression
- case class ImperativeAggregate(

SparkQA · 2016-08-17T09:05:20Z

Test build #63908 has finished for PR 14562 at commit 5c8f324.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- abstract class Collect extends ImperativeAggregateImpl
- sealed abstract class AggregateFunction extends Expression
- case class ImperativeAggregate(

remove the buffer offsets in ImperativeAggregate

9658056

hvanhovell reviewed Aug 9, 2016

rxin reviewed Aug 10, 2016

address comments

1444176

fix a bug

8e0531b

clockfly reviewed Aug 16, 2016

cloud-fan added 2 commits August 17, 2016 11:27

Merge remote-tracking branch 'origin/master' into agg-minor

50db3ac

refactor

5c8f324

cloud-fan force-pushed the agg-minor branch from e50aece to 5c8f324 Compare August 17, 2016 07:42

cloud-fan reviewed Aug 17, 2016

cloud-fan closed this Nov 7, 2016

cloud-fan deleted the agg-minor branch December 14, 2016 12:33
