[SPARK-2554][SQL] CountDistinct partial aggregation and object allocation improvements #1935

marmbrus · 2014-08-14T02:52:20Z

No description provided.

SparkQA · 2014-08-14T02:55:00Z

QA tests have started for PR 1935. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18514/consoleFull

SparkQA · 2014-08-14T04:04:38Z

QA results for PR 1935:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
abstract class MergeableAggregate extends PartialAggregate {
case class ReturnAggregate(child: AggregateExpression)
case class ReturnAggregateFunction(agg: AggregateExpression, base: AggregateExpression)
case class MergeAggregates(child: Expression)
case class MergeAggregateFunctions(expr: Expression, base: AggregateExpression)
abstract class MergableAggregateFunction extends AggregateFunction {
case class CountDistinct(expressions: Seq[Expression]) extends MergeableAggregate {
case class CountDistinctFunction(

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18514/consoleFull

chenghao-intel · 2014-08-14T05:12:34Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala

+
+  def this() = this(null, null) // Required for serialization.
+
+  var currentValue: MergableAggregateFunction = null


Should we give currentValue a default value? Then we could drop the null checks in eval and update.
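
One way to read this suggestion is to start currentValue in a neutral state, so that update and eval need no null checks. A minimal sketch, assuming hypothetical stand-in types (MergeableState, EmptyState, CountState, Merger) rather than the MergableAggregateFunction from the patch:

```scala
object DefaultValueSketch {
  // Hypothetical stand-in for a mergeable aggregation state (not the patch's class).
  sealed trait MergeableState {
    def merge(other: MergeableState): MergeableState
    def result: Long
  }

  // Neutral starting state: merging it with anything returns the other side.
  case object EmptyState extends MergeableState {
    def merge(other: MergeableState): MergeableState = other
    def result: Long = 0L
  }

  final case class CountState(count: Long) extends MergeableState {
    def merge(other: MergeableState): MergeableState = other match {
      case CountState(c) => CountState(count + c)
      case EmptyState    => this
    }
    def result: Long = count
  }

  final class Merger {
    // Defaulting to EmptyState instead of null removes the null checks below.
    private var currentValue: MergeableState = EmptyState

    def update(next: MergeableState): Unit = {
      currentValue = currentValue.merge(next)
    }

    def eval: Long = currentValue.result
  }
}
```

The trade-off is one extra singleton type in exchange for branch-free update and eval; whether that fits the serialization constraint noted in the diff is not something this sketch addresses.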

chenghao-intel · 2014-08-14T05:49:00Z

Don't forget SumDistinct :-)
One concern is the memory usage after the data is shuffled,
e.g. select sum(distinct(value)) from src

We can probably improve that in another PR.
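
To make the memory concern concrete, here is a minimal sketch of set-based partial aggregation for distinct aggregates. It uses plain Scala collections, not Spark; the object name, the function names, and the use of Option to model SQL NULL are all assumptions for illustration:

```scala
object PartialDistinctSketch {
  // Map side: collect the distinct non-null values of one partition.
  // None models a SQL NULL, which distinct aggregates ignore.
  def collectSet(partition: Seq[Option[Int]]): Set[Int] =
    partition.flatten.toSet

  // Reduce side: union the partial sets from all partitions.
  def combineSets(partitions: Seq[Seq[Option[Int]]]): Set[Int] =
    partitions.map(collectSet).foldLeft(Set.empty[Int])(_ union _)

  def countDistinct(partitions: Seq[Seq[Option[Int]]]): Long =
    combineSets(partitions).size.toLong

  def sumDistinct(partitions: Seq[Seq[Option[Int]]]): Long =
    combineSets(partitions).foldLeft(0L)(_ + _)
}
```

In this sketch the combined set holds every distinct value at once, so memory on the reduce side grows with the number of distinct values rather than staying constant as it does for a plain sum or count.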


SparkQA · 2014-08-18T00:35:41Z

QA tests have started for PR 1935 at commit 9153652.

This patch merges cleanly.

SparkQA · 2014-08-18T00:36:24Z

QA tests have finished for PR 1935 at commit 9153652.

This patch fails unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class CountDistinct(expressions: Seq[Expression]) extends PartialAggregate
- case class CollectHashSet(expressions: Seq[Expression]) extends AggregateExpression
- case class CollectHashSetFunction(
- case class CombineSetsAndCount(inputSet: Expression) extends AggregateExpression
- case class CombineSetsAndCountFunction(
- case class CountDistinctFunction(
- case class MaxOf(left: Expression, right: Expression) extends Expression
- case class NewSet(elementType: DataType) extends LeafExpression
- case class AddItemToSet(item: Expression, set: Expression) extends Expression
- case class CombineSets(left: Expression, right: Expression) extends BinaryExpression
- case class CountSet(child: Expression) extends UnaryExpression

SparkQA · 2014-08-18T01:20:19Z

QA tests have started for PR 1935 at commit 38c7449.

This patch merges cleanly.

SparkQA · 2014-08-18T01:23:38Z

QA tests have finished for PR 1935 at commit 38c7449.

This patch fails unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class CountDistinct(expressions: Seq[Expression]) extends PartialAggregate
- case class CollectHashSet(expressions: Seq[Expression]) extends AggregateExpression
- case class CollectHashSetFunction(
- case class CombineSetsAndCount(inputSet: Expression) extends AggregateExpression
- case class CombineSetsAndCountFunction(
- case class CountDistinctFunction(
- case class MaxOf(left: Expression, right: Expression) extends Expression
- case class NewSet(elementType: DataType) extends LeafExpression
- case class AddItemToSet(item: Expression, set: Expression) extends Expression
- case class CombineSets(left: Expression, right: Expression) extends BinaryExpression
- case class CountSet(child: Expression) extends UnaryExpression

SparkQA · 2014-08-18T01:45:32Z

QA tests have started for PR 1935 at commit f31b8ad.

This patch merges cleanly.

SparkQA · 2014-08-18T02:52:30Z

QA tests have finished for PR 1935 at commit f31b8ad.

This patch fails unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class CountDistinct(expressions: Seq[Expression]) extends PartialAggregate
- case class CollectHashSet(expressions: Seq[Expression]) extends AggregateExpression
- case class CollectHashSetFunction(
- case class CombineSetsAndCount(inputSet: Expression) extends AggregateExpression
- case class CombineSetsAndCountFunction(
- case class CountDistinctFunction(
- case class MaxOf(left: Expression, right: Expression) extends Expression
- case class NewSet(elementType: DataType) extends LeafExpression
- case class AddItemToSet(item: Expression, set: Expression) extends Expression
- case class CombineSets(left: Expression, right: Expression) extends BinaryExpression
- case class CountSet(child: Expression) extends UnaryExpression

ueshin · 2014-08-18T04:50:27Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala

+
+  override def children = left :: right :: Nil
+
+  override def references = (left.flatMap(_.references) ++ right.flatMap(_.references)).toSet


Should this be left.references ++ right.references, or children.flatMap(_.references).toSet?
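
The second form suggested here can be sketched on a toy expression tree. The Expr and Attr types below are hypothetical stand-ins, and this MaxOf only mirrors the shape of the class in the patch; none of it is Catalyst code:

```scala
object ReferencesSketch {
  // Toy expression tree: references default to the union of the children's references.
  sealed trait Expr {
    def children: Seq[Expr]
    def references: Set[String] = children.flatMap(_.references).toSet
  }

  // A leaf that refers to a single named attribute.
  final case class Attr(name: String) extends Expr {
    def children: Seq[Expr] = Nil
    override def references: Set[String] = Set(name)
  }

  // A binary node; it inherits the default references from Expr.
  final case class MaxOf(left: Expr, right: Expr) extends Expr {
    def children: Seq[Expr] = left :: right :: Nil
  }
}
```

Defining references once in terms of children means a binary node only has to list its children, and duplicates collapse because the result is a Set.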

SparkQA · 2014-08-20T21:25:46Z

QA tests have started for PR 1935 at commit b2e8ef3.

This patch merges cleanly.

SparkQA · 2014-08-20T22:10:47Z

QA tests have started for PR 1935 at commit c122cca.

This patch merges cleanly.

rxin · 2014-08-20T22:18:15Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala

+
+  def this() = this(null, null) // Required for serialization.
+
+  val seen = new OpenHashSet[Any]()


Does this support null?

marmbrus · 2014-08-20

I'm not sure, but we never put null into it (we always put rows in, and count distinct semantics don't count null anyway).

rxin · 2014-08-20

Maybe add a comment there explaining that we never put null into it. I think OpenHashSet doesn't support null.

marmbrus · 2014-08-20

Actually, I think most HashSets don't support null. scala.collection.mutable.HashSet throws an exception if you try to add null.
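
The invariant discussed in this thread (only rows go into the set, and rows containing a null are not counted) can be sketched as follows. This uses scala.collection.mutable.HashSet in place of OpenHashSet and models a row as Seq[Any]; both choices, and the names, are assumptions for illustration:

```scala
import scala.collection.mutable

object NullSkippingSketch {
  // Counts distinct rows, skipping any row that contains a null column.
  // The set itself never receives null: only whole rows are added.
  def countDistinct(rows: Seq[Seq[Any]]): Long = {
    val seen = mutable.HashSet.empty[Seq[Any]]
    rows.foreach { row =>
      if (!row.contains(null)) seen += row
    }
    seen.size.toLong
  }
}
```

Because the null check happens before insertion, the sketch does not depend on how the underlying set treats null elements.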

rxin · 2014-08-20T22:26:42Z

Nice job. LGTM other than some comments. You probably want to remove WIP from the title.

SparkQA · 2014-08-20T22:37:40Z

QA tests have finished for PR 1935 at commit b2e8ef3.

This patch fails unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class JoinedRow2 extends Row
- class JoinedRow3 extends Row
- class JoinedRow4 extends Row
- class JoinedRow5 extends Row
- class GenericRow(protected[sql] val values: Array[Any]) extends Row
- final class Mutable$tpe extends MutableValue
- abstract class MutableValue extends Serializable
- final class MutableInt extends MutableValue
- final class MutableFloat extends MutableValue
- final class MutableBoolean extends MutableValue
- final class MutableDouble extends MutableValue
- final class MutableShort extends MutableValue
- final class MutableLong extends MutableValue
- final class MutableByte extends MutableValue
- final class MutableAny extends MutableValue
- class SpecificMutableRow(val values: Array[MutableValue]) extends MutableRow
- case class CountDistinct(expressions: Seq[Expression]) extends PartialAggregate
- case class CollectHashSet(expressions: Seq[Expression]) extends AggregateExpression
- case class CollectHashSetFunction(
- case class CombineSetsAndCount(inputSet: Expression) extends AggregateExpression
- case class CombineSetsAndCountFunction(
- case class CountDistinctFunction(
- case class MaxOf(left: Expression, right: Expression) extends Expression
- case class NewSet(elementType: DataType) extends LeafExpression
- case class AddItemToSet(item: Expression, set: Expression) extends Expression
- case class CombineSets(left: Expression, right: Expression) extends BinaryExpression
- case class CountSet(child: Expression) extends UnaryExpression

SparkQA · 2014-08-20T23:22:01Z

QA tests have finished for PR 1935 at commit c122cca.

This patch fails unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- shift # Ignore main class (org.apache.spark.deploy.SparkSubmit) and use our own
- class JoinedRow2 extends Row
- class JoinedRow3 extends Row
- class JoinedRow4 extends Row
- class JoinedRow5 extends Row
- class GenericRow(protected[sql] val values: Array[Any]) extends Row
- abstract class MutableValue extends Serializable
- final class MutableInt extends MutableValue
- final class MutableFloat extends MutableValue
- final class MutableBoolean extends MutableValue
- final class MutableDouble extends MutableValue
- final class MutableShort extends MutableValue
- final class MutableLong extends MutableValue
- final class MutableByte extends MutableValue
- final class MutableAny extends MutableValue
- final class SpecificMutableRow(val values: Array[MutableValue]) extends MutableRow
- case class CountDistinct(expressions: Seq[Expression]) extends PartialAggregate
- case class CollectHashSet(expressions: Seq[Expression]) extends AggregateExpression
- case class CollectHashSetFunction(
- case class CombineSetsAndCount(inputSet: Expression) extends AggregateExpression
- case class CombineSetsAndCountFunction(
- case class CountDistinctFunction(
- case class MaxOf(left: Expression, right: Expression) extends Expression
- case class NewSet(elementType: DataType) extends LeafExpression
- case class AddItemToSet(item: Expression, set: Expression) extends Expression
- case class CombineSets(left: Expression, right: Expression) extends BinaryExpression
- case class CountSet(child: Expression) extends UnaryExpression

SparkQA · 2014-08-20T23:50:48Z

QA tests have started for PR 1935 at commit 8074a80.

This patch merges cleanly.

SparkQA · 2014-08-21T01:04:24Z

QA tests have finished for PR 1935 at commit 8074a80.

This patch fails unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- shift # Ignore main class (org.apache.spark.deploy.SparkSubmit) and use our own
- case class SparkListenerTaskStart(stageId: Int, stageAttemptId: Int, taskInfo: TaskInfo)
- class JoinedRow2 extends Row
- class JoinedRow3 extends Row
- class JoinedRow4 extends Row
- class JoinedRow5 extends Row
- class GenericRow(protected[sql] val values: Array[Any]) extends Row
- abstract class MutableValue extends Serializable
- final class MutableInt extends MutableValue
- final class MutableFloat extends MutableValue
- final class MutableBoolean extends MutableValue
- final class MutableDouble extends MutableValue
- final class MutableShort extends MutableValue
- final class MutableLong extends MutableValue
- final class MutableByte extends MutableValue
- final class MutableAny extends MutableValue
- final class SpecificMutableRow(val values: Array[MutableValue]) extends MutableRow
- case class CountDistinct(expressions: Seq[Expression]) extends PartialAggregate
- case class CollectHashSet(expressions: Seq[Expression]) extends AggregateExpression
- case class CollectHashSetFunction(
- case class CombineSetsAndCount(inputSet: Expression) extends AggregateExpression
- case class CombineSetsAndCountFunction(
- case class CountDistinctFunction(
- case class MaxOf(left: Expression, right: Expression) extends Expression
- case class NewSet(elementType: DataType) extends LeafExpression
- case class AddItemToSet(item: Expression, set: Expression) extends Expression
- case class CombineSets(left: Expression, right: Expression) extends BinaryExpression
- case class CountSet(child: Expression) extends UnaryExpression

SparkQA · 2014-08-23T19:45:54Z

QA tests have started for PR 1935 at commit 5c7848d.

This patch merges cleanly.

SparkQA · 2014-08-23T21:08:41Z

QA tests have finished for PR 1935 at commit 5c7848d.

This patch passes unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class JoinedRow2 extends Row
- class JoinedRow3 extends Row
- class JoinedRow4 extends Row
- class JoinedRow5 extends Row
- class GenericRow(protected[sql] val values: Array[Any]) extends Row
- abstract class MutableValue extends Serializable
- final class MutableInt extends MutableValue
- final class MutableFloat extends MutableValue
- final class MutableBoolean extends MutableValue
- final class MutableDouble extends MutableValue
- final class MutableShort extends MutableValue
- final class MutableLong extends MutableValue
- final class MutableByte extends MutableValue
- final class MutableAny extends MutableValue
- final class SpecificMutableRow(val values: Array[MutableValue]) extends MutableRow
- case class CountDistinct(expressions: Seq[Expression]) extends PartialAggregate
- case class CollectHashSet(expressions: Seq[Expression]) extends AggregateExpression
- case class CollectHashSetFunction(
- case class CombineSetsAndCount(inputSet: Expression) extends AggregateExpression
- case class CombineSetsAndCountFunction(
- case class CountDistinctFunction(
- case class MaxOf(left: Expression, right: Expression) extends Expression
- case class NewSet(elementType: DataType) extends LeafExpression
- case class AddItemToSet(item: Expression, set: Expression) extends Expression
- case class CombineSets(left: Expression, right: Expression) extends BinaryExpression
- case class CountSet(child: Expression) extends UnaryExpression

marmbrus · 2014-08-23T23:20:29Z

Thanks for looking this over! I've merged to master and 1.1

…tion improvements

Author: Michael Armbrust <michael@databricks.com>
Author: Gregory Owen <greowen@gmail.com>

Closes #1935 from marmbrus/countDistinctPartial and squashes the following commits:

5c7848d [Michael Armbrust] turn off caching in the constructor
8074a80 [Michael Armbrust] fix tests
32d216f [Michael Armbrust] reynolds comments
c122cca [Michael Armbrust] Address comments, add tests
b2e8ef3 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into countDistinctPartial
fae38f4 [Michael Armbrust] Fix style
fdca896 [Michael Armbrust] cleanup
93d0f64 [Michael Armbrust] metastore concurrency fix.
db44a30 [Michael Armbrust] JIT hax.
3868f6c [Michael Armbrust] Merge pull request #9 from GregOwen/countDistinctPartial
c9e67de [Gregory Owen] Made SpecificRow and types serializable by Kryo
2b46c4b [Michael Armbrust] Merge remote-tracking branch 'origin/master' into countDistinctPartial
8ff6402 [Michael Armbrust] Add specific row.
58d15f1 [Michael Armbrust] disable codegen logging
87d101d [Michael Armbrust] Fix isNullAt bug
abee26d [Michael Armbrust] WIP
27984d0 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into countDistinctPartial
57ae3b1 [Michael Armbrust] Fix order dependent test
b3d0f64 [Michael Armbrust] Add golden files.
c1f7114 [Michael Armbrust] Improve tests / fix serialization.
f31b8ad [Michael Armbrust] more fixes
38c7449 [Michael Armbrust] comments and style
9153652 [Michael Armbrust] better toString
d494598 [Michael Armbrust] Fix tests now that the planner is better
41fbd1d [Michael Armbrust] Never try and create an empty hash set.
050bb97 [Michael Armbrust] Skip no-arg constructors for kryo,
bd08239 [Michael Armbrust] WIP
213ada8 [Michael Armbrust] First draft of partially aggregated and code generated count distinct / max

(cherry picked from commit 7e191fe)
Signed-off-by: Michael Armbrust <michael@databricks.com>


chenghao-intel reviewed Aug 14, 2014

marmbrus added 6 commits August 17, 2014 17:29

First draft of partially aggregated and code generated count distinct…

213ada8

… / max

WIP

bd08239

Skip no-arg constructors for kryo,

050bb97

Never try and create an empty hash set.

41fbd1d

Fix tests now that the planner is better

d494598

better toString

9153652

comments and style

38c7449

more fixes

f31b8ad

ueshin reviewed Aug 18, 2014

marmbrus added 7 commits August 17, 2014 22:31

Improve tests / fix serialization.

c1f7114

Add golden files.

b3d0f64

Fix order dependent test

57ae3b1

Merge remote-tracking branch 'origin/master' into countDistinctPartial

27984d0

WIP

abee26d

Fix isNullAt bug

87d101d

disable codegen logging

58d15f1

Merge remote-tracking branch 'origin/master' into countDistinctPartial

b2e8ef3

marmbrus force-pushed the countDistinctPartial branch from ae8cb53 to b2e8ef3 Compare August 20, 2014 21:20

Address comments, add tests

c122cca

rxin reviewed Aug 20, 2014

reynolds comments

32d216f

fix tests

8074a80

turn off caching in the constructor

5c7848d

marmbrus changed the title ~~[WIP][SPARK-2554][SQL] CountDistinct and SumDistinct should do partial aggregation~~ [SPARK-2554][SQL] CountDistinct and SumDistinct should do partial aggregation Aug 23, 2014

marmbrus changed the title ~~[SPARK-2554][SQL] CountDistinct and SumDistinct should do partial aggregation~~ [SPARK-2554][SQL] CountDistinct should do partial aggregation Aug 23, 2014

marmbrus changed the title ~~[SPARK-2554][SQL] CountDistinct should do partial aggregation~~ [SPARK-2554][SQL] CountDistinct partial aggregation and object allocation improvements Aug 23, 2014

asfgit closed this in 7e191fe Aug 23, 2014

marmbrus deleted the countDistinctPartial branch August 27, 2014 20:44
