
[SPARK-18429] [SQL] implement a new Aggregate for CountMinSketch #15877

Closed
wants to merge 8 commits into from

Conversation

wzhfy
Contributor

@wzhfy wzhfy commented Nov 14, 2016

What changes were proposed in this pull request?

This PR implements a new aggregate function that generates a count-min sketch; it is a wrapper around CountMinSketch.

How was this patch tested?

Added test cases.

@wzhfy
Contributor Author

wzhfy commented Nov 14, 2016

cc @rxin

@rxin
Contributor

rxin commented Nov 14, 2016

cc @liancheng

@SparkQA

SparkQA commented Nov 14, 2016

Test build #68604 has finished for PR 15877 at commit a4753e4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CountMinSketchAgg(

@wzhfy
Contributor Author

wzhfy commented Nov 14, 2016

retest this please

@SparkQA

SparkQA commented Nov 14, 2016

Test build #68610 has finished for PR 15877 at commit a4753e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CountMinSketchAgg(

buffer.mergeInPlace(input)
}

override def eval(buffer: CountMinSketch): Any = new GenericArrayData(serialize(buffer))
Contributor

Is this an array of bytes?

Contributor Author

yes

Contributor

It is better to just return the byte array and change the data type to BinaryType.

Contributor Author

Yes, that's better, thanks!

}

override def checkInputDataTypes(): TypeCheckResult = {
val defaultCheck = super.checkInputDataTypes()
Contributor

I don't think we need to check this (the super class does not implement this).

Contributor Author

ExpectsInputTypes.checkInputDataTypes() checks validity of input types, right?

Contributor

That is fair.

}

override def createAggregationBuffer(): CountMinSketch = {
val eps: Double = epsExpression.eval().asInstanceOf[Double]
Contributor

Should we cache this in lazy vals? I am not sure about the performance implications.

Contributor Author

OK, I'll change them to lazy vals.

// ignore empty rows
if (value != null) {
// UTF8String is a spark sql type, while CountMinSketch accepts String type
buffer.add(if (value.isInstanceOf[UTF8String]) value.toString else value)
Contributor

How bad would it be to add support for UTF8 string to CMS? Or to pass the UTF8 byte array to CMS?

Contributor Author

Yes, we should pass the byte array to CMS.
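For illustration only (not part of the PR): the byte-based path hands the raw UTF-8 bytes of the key to the sketch via addBinary, skipping the per-row UTF8String-to-java.lang.String conversion. The helper name below is hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: raw UTF-8 bytes as the sketch key. The class and method
// names are hypothetical; addBinary is the CountMinSketch entry point assumed.
public class Utf8Key {
    static byte[] utf8Key(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] key = utf8Key("café");
        System.out.println(key.length); // 'é' encodes as two bytes, so 5 total
        // cms.addBinary(key);  // hypothetical use with a CountMinSketch buffer
    }
}
```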

/**
* This function returns a count-min sketch of a column with the given eps, confidence and seed.
* A count-min sketch is a probabilistic data structure used for summarizing streams of data in
* sub-linear space, which is useful for equality predicates and join size estimation.
Contributor

Maybe add something about the return type? A developer should know how to work with these bytes.

Contributor Author

@wzhfy wzhfy Nov 15, 2016

OK, I wrote this in the usage string; I'll add it here too, thanks.

copy(inputAggBufferOffset = newInputAggBufferOffset)

override def inputTypes: Seq[AbstractDataType] = {
// currently `CountMinSketch` supports integral and string types
Contributor

Should we expand this?

Contributor Author

@rxin suggested that for unsupported types, we hash the value before building the count-min sketch, i.e. CountMinSketchAgg(hash(col)).

agg.merge(mergeBuffer, group1Buffer)
agg.merge(mergeBuffer, group2Buffer)
checkResult(agg.eval(mergeBuffer), allItems, exactFreq)
}
Contributor

This might also be a good place to test merging in a different order, and the merging of an empty partition.

Contributor Author

Ok, I'll also test these.

data: Array[T],
exactFreq: Map[T, Long]): Unit = {
result match {
case arrayData: ArrayData =>
Contributor

Add case _ => fail("unexpected return type") to get a nicer error when something goes wrong there.

!seedExpression.foldable) {
TypeCheckFailure(
"The eps, confidence or seed provided must be a literal or constant foldable")
} else if (epsExpression.eval() == null || confidenceExpression.eval() == null ||
Contributor

Should we also check for negative eps and confidence values?
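For illustration, a minimal sketch of the driver-side validation being suggested; the class name, method name, and messages are hypothetical, not the PR's actual code:

```java
// Minimal sketch of the suggested analysis-time check: reject null, negative,
// or out-of-range eps/confidence on the driver instead of letting the
// CountMinSketch constructor throw on an executor. Names are illustrative.
public class SketchParamCheck {
    static void checkSketchParams(Double eps, Double confidence) {
        if (eps == null || confidence == null) {
            throw new IllegalArgumentException("eps and confidence must not be null");
        }
        if (eps <= 0.0) {
            throw new IllegalArgumentException("Relative error must be positive, got " + eps);
        }
        if (confidence <= 0.0 || confidence >= 1.0) {
            throw new IllegalArgumentException("Confidence must be within (0.0, 1.0), got " + confidence);
        }
    }

    public static void main(String[] args) {
        checkSketchParams(0.01, 0.99); // valid arguments: no exception
        try {
            checkSketchParams(-1.0, 0.99); // negative eps is rejected up front
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```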

confidenceExpression = Literal(confidence),
seedExpression = Literal(seed))
val err = intercept[IllegalArgumentException] {
invalidAgg.createAggregationBuffer()
Contributor

See my comment in the CMS agg. This is too late to throw such an error. I'd rather have driver-side errors than executor-side errors.

Contributor Author

Yes, we should have driver side errors, thanks.

Contributor

@hvanhovell hvanhovell left a comment

This looks pretty good! I have left a few minor comments. Also consider registering this aggregate in the FunctionRegistry and adding it to functions.scala.

@rxin
Contributor

rxin commented Nov 15, 2016

Yes, please register count_min_sketch and the alias cmsketch in FunctionRegistry.

case StringType => buffer.addBinary(value.asInstanceOf[UTF8String].getBytes)
case ByteType => buffer.addLong(value.asInstanceOf[Byte])
case ShortType => buffer.addLong(value.asInstanceOf[Short])
case IntegerType => buffer.addLong(value.asInstanceOf[Int])
Contributor

Add DateType?

case ByteType => buffer.addLong(value.asInstanceOf[Byte])
case ShortType => buffer.addLong(value.asInstanceOf[Short])
case IntegerType => buffer.addLong(value.asInstanceOf[Int])
case LongType => buffer.addLong(value.asInstanceOf[Long])
Contributor

Add TimestampType?

val value = child.eval(input)
// ignore empty rows
if (value != null) {
child.dataType match {
Contributor

A general question: which is faster, a pattern match at runtime or a virtual function call here?

cc @davies @cloud-fan

Contributor

virtual function dispatch is usually a lot faster than pattern match.

Contributor

Although I don't know if it matters much here, given we are going to run it through many hash functions.

@SparkQA

SparkQA commented Nov 15, 2016

Test build #68664 has finished for PR 15877 at commit 0cca205.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 15, 2016

Test build #68666 has finished for PR 15877 at commit 2064846.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -261,6 +261,8 @@ object FunctionRegistry {
expression[VarianceSamp]("var_samp"),
expression[CollectList]("collect_list"),
expression[CollectSet]("collect_set"),
expression[CountMinSketchAgg]("count_min_sketch"),
expression[CountMinSketchAgg]("cmsketch"),
Contributor

Actually, I take my word back. Let's add only count_min_sketch. I don't think it's worth having an alias, given this sketch is difficult to consume (it returns binary).

* @group agg_funcs
* @since 2.2.0
*/
def count_min_sketch(e: Column, eps: Double, confidence: Double, seed: Int): Column = {
Contributor

let's not add these for now.

@SparkQA

SparkQA commented Nov 16, 2016

Test build #68699 has finished for PR 15877 at commit 7bfdd40.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 16, 2016

Test build #68703 has finished for PR 15877 at commit 6143997.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Contributor Author

wzhfy commented Nov 17, 2016

Hi @hvanhovell @rxin, I've updated this pr, does it look good to you now?

val value = child.eval(input)
// Ignore empty rows
if (value != null) {
child.dataType match {
Contributor

Let's not do a pattern match for every update. We should use an update function instead, for example:

private[this] val doUpdate: (CountMinSketch, Any) => Unit = child.dataType match {
  case StringType => (cms, value) => cms.addBinary(value.asInstanceOf[UTF8String].getBytes)
  case ByteType => (cms, value) => cms.addLong(value.asInstanceOf[Byte])
  ...
}

override def update(buffer: CountMinSketch, input: InternalRow): Unit = {
  val value = child.eval(input)
  if (value != null) {
    doUpdate(buffer, value)
  }
}

// Currently `CountMinSketch` supports integral (date/timestamp is represented as int/long
// internally) and string types.
Seq(TypeCollection(IntegralType, StringType, DateType, TimestampType),
DoubleType, DoubleType, IntegerType)
Contributor

Also add FloatType (use Float.floatToIntBits), DoubleType (use Double.doubleToLongBits), BooleanType and BinaryType? We could also add support for Decimal, but that would be a bit harder to get right.
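For reference, the conversions mentioned here are plain java.lang APIs that turn a floating-point value into a stable integer bit pattern suitable for CountMinSketch.addLong. The class and method names below are illustrative, not the PR's code:

```java
// Sketch of the suggested encoding: floatToIntBits/doubleToLongBits produce a
// stable key per value (canonicalizing NaNs), which can be fed to addLong.
public class BitKeys {
    static long floatKey(float f) {
        return Float.floatToIntBits(f); // int bits, sign-extended to long
    }

    static long doubleKey(double d) {
        return Double.doubleToLongBits(d);
    }

    public static void main(String[] args) {
        System.out.println(floatKey(1.5f) == floatKey(1.5f));  // same value, same key
        System.out.println(floatKey(0.0f) == floatKey(-0.0f)); // +0.0 and -0.0 get distinct keys
    }
}
```

Note that +0.0 and -0.0 compare equal as floats but have different bit patterns, so this encoding counts them separately; whether that is acceptable is a design choice.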

Contributor Author

@rxin @hvanhovell If we really want to support all these types, would it be better to move this conversion and pattern-match logic into CountMinSketch? That is, make CMS support these types itself. Then when users query, e.g., a float column, they don't need conversions like cms.estimateCount(Float.floatToIntBits(value)).

Contributor

Yes if we want to add support for those I think it'd make sense to do it in count-min sketch itself too.

@SparkQA

SparkQA commented Nov 22, 2016

Test build #68985 has finished for PR 15877 at commit ca4a13f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 22, 2016

Test build #68994 has finished for PR 15877 at commit b009ff8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Contributor Author

wzhfy commented Nov 23, 2016

cc @rxin @hvanhovell

@@ -152,6 +153,16 @@ public void add(Object item) {
public void add(Object item, long count) {
if (item instanceof String) {
addString((String) item, count);
} else if (item instanceof BigDecimal) {
addString(((BigDecimal) item).toString(), count);
Contributor Author

@wzhfy wzhfy Nov 23, 2016

Here I use a string to represent the decimal because there is a one-to-one mapping between BigDecimal and String.

Contributor

Is this true?

"1.0" and "1.00" are the same value but not the same string representation.

Contributor Author

Sorry I didn't describe it accurately. It should be "There is a one-to-one mapping between the distinguishable values and the result of this conversion." (from java doc of BigDecimal)
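A small standalone illustration of this point (not part of the PR): BigDecimal#toString distinguishes values that differ only in scale, while compareTo treats them as numerically equal, so the string form is one-to-one with distinguishable values:

```java
import java.math.BigDecimal;

// BigDecimal#toString is one-to-one with distinguishable values (unscaled
// value + scale), so "1.0" and "1.00" map to different string keys even
// though compareTo considers them equal.
public class DecimalKeys {
    public static void main(String[] args) {
        BigDecimal a = new BigDecimal("1.0");
        BigDecimal b = new BigDecimal("1.00");
        System.out.println(a.compareTo(b) == 0);               // true: numerically equal
        System.out.println(a.toString().equals(b.toString())); // false: distinct strings
    }
}
```

This means a sketch keyed on toString counts 1.0 and 1.00 separately, which is the behavior the thread is debating.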

@SparkQA

SparkQA commented Nov 23, 2016

Test build #69049 has finished for PR 15877 at commit 1bfb6fd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 23, 2016

Test build #69047 has finished for PR 15877 at commit a6bbefc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Nov 29, 2016

Thanks - I'm going to merge this in master. I will submit a follow-up PR to simplify this a little bit, and remove the handling of float/double/decimal types and require explicit user action on how to turn that into long.

@asfgit asfgit closed this in d57a594 Nov 29, 2016
@rxin
Contributor

rxin commented Nov 29, 2016

Hey guys - after looking at the PR more, I'm afraid we have gone overboard with testing here. Most of the test cases written just repeat each other and do exactly the same thing. For testing something like this I'd probably just have a simple end-to-end test and be done with it, because most of the complicated logic is isolated in the actual CountMinSketch implementation itself, which already has good test coverage.

assert(buffer.equals(agg.deserialize(agg.serialize(buffer))))
}

def testHighLevelInterface[T: ClassTag](
Contributor

@wzhfy can you comment on why we need to test both the high level interface and the low level interface?

Contributor Author

I just followed the style in ApproximatePercentileSuite, which is also a TypedImperativeAggregate. I thought they were used to test different levels of operations of a TypedImperativeAggregate, e.g. update(buffer: InternalRow, input: InternalRow) and update(buffer: T, input: InternalRow).

robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 2, 2016
## What changes were proposed in this pull request?

This PR implements a new Aggregate to generate count min sketch, which is a wrapper of CountMinSketch.

## How was this patch tested?

add test cases

Author: wangzhenhua <wangzhenhua@huawei.com>

Closes apache#15877 from wzhfy/cms.
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 15, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017