
[SPARK-31500][SQL] collect_set() of BinaryType returns duplicate elements #28351

Closed

Conversation

Contributor

@planga82 planga82 commented Apr 26, 2020

What changes were proposed in this pull request?

The collect_set() aggregate function should produce a set of distinct elements. When the column argument's type is BinaryType this is not the case.

Example:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

case class R(id: String, value: String, bytes: Array[Byte])
def makeR(id: String, value: String) = R(id, value, value.getBytes)
val df = Seq(makeR("a", "dog"), makeR("a", "cat"), makeR("a", "cat"), makeR("b", "fish")).toDF()
// In the example below "bytesSet" erroneously has duplicates but "stringSet" does not (as expected).
df.agg(collect_set('value) as "stringSet", collect_set('bytes) as "bytesSet").show(truncate=false)
// The same problem is displayed when using window functions.
val win = Window.partitionBy('id).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
val result = df.select(
  collect_set('value).over(win) as "stringSet",
  collect_set('bytes).over(win) as "bytesSet"
)
.select('stringSet, 'bytesSet, size('stringSet) as "stringSetSize", size('bytesSet) as "bytesSetSize")
.show()

We use a HashSet buffer to accumulate the results. The problem is that array equality in Scala does not behave as expected: arrays are plain Java arrays, and == does not compare their contents:
Array(1, 2, 3) == Array(1, 2, 3)  // false
The result is that duplicates are not removed from the HashSet.
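For illustration, a minimal Scala REPL sketch of the behavior (not part of the patch):

```scala
import scala.collection.mutable

val a = "cat".getBytes
val b = "cat".getBytes

a == b             // false: == on Java arrays compares references, not contents
a.sameElements(b)  // true: content comparison

// Because equality is identity-based, a HashSet keeps both copies:
val set = mutable.HashSet[Any](a, b)
set.size           // 2, although the contents are identical
```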

The proposed solution is to change the type of the buffered elements so that duplicates are dropped, and then, in the last stage, when all the data is in the HashSet buffer, convert the elements back to the original type.
This transformation is only applied when the data type is BinaryType.
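A minimal sketch of the mechanism, using Spark's internal UnsafeArrayData (which, as the patch relies on, implements content-based equals/hashCode); simplified and illustrative only:

```scala
import scala.collection.mutable

import org.apache.spark.sql.catalyst.expressions.UnsafeArrayData

val a = "cat".getBytes
val b = "cat".getBytes

// Wrapping each Array[Byte] as an UnsafeArrayData gives the HashSet
// content-based equality, so the duplicate is dropped while buffering:
val buffer = mutable.HashSet[Any](
  UnsafeArrayData.fromPrimitiveArray(a),
  UnsafeArrayData.fromPrimitiveArray(b))
buffer.size  // 1

// In the final stage the wrapped elements are converted back:
val bytes = buffer.iterator
  .map(_.asInstanceOf[UnsafeArrayData].toByteArray)
  .toArray
```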

Why are the changes needed?

Fixes the bug explained above.

Does this PR introduce any user-facing change?

Yes. Now collect_set() correctly deduplicates arrays of bytes.

How was this patch tested?

Unit testing

@planga82
Contributor Author

CC @hvanhovell @rxin

@planga82
Contributor Author

@hvanhovell I think now we have a better solution.

.select(struct($"x", $"y").as("a"))
val ret = df1.select(collect_set($"a")).collect()
.map(r => r.getAs[mutable.WrappedArray[_]](0)).head
assert(ret.length == 2)
Member

Could you move the new tests into a new test unit like test("SPARK-31500: collect_set() of BinaryType returns duplicate elements") { ..}?

@@ -139,6 +143,31 @@ case class CollectSet(

def this(child: Expression) = this(child, 0, 0)

/* SPARK-31500
* Array[Byte](BinaryType) Scala equality don't works as expected
Member

nit:

  /* 
   * SPARK-31500: Array[Byte](BinaryType) Scala equality don't works as expected

Contributor

It is not a Scala issue. Java byte arrays use referential equality and identity hash codes. This has tripped up many many people before.
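For illustration, the same point checked in a Scala REPL:

```scala
val a = "cat".getBytes
val b = "cat".getBytes

a.equals(b)                    // false: identity-based equals
a.hashCode == b.hashCode       // false in practice: identity hash codes
java.util.Arrays.equals(a, b)  // true: content-based comparison
```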

Contributor Author

Yes, I know the main reason; I'm going to explain it better.

@maropu
Member

maropu commented Apr 29, 2020

ok to test

@SparkQA

SparkQA commented Apr 29, 2020

Test build #122029 has finished for PR 28351 at commit 586c4b7.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -65,8 +67,10 @@ abstract class Collect[T <: Growable[Any] with Iterable[Any]] extends TypedImper
new GenericArrayData(buffer.toArray)
}

lazy val typeChild = child.dataType
Contributor

Couple of things:

  • Can you name this bufferElementType or something like that?
  • Please make it protected.
  • I would prefer to make this an abstract method (lazy val is not really needed) and make the subclass implement it.

@@ -46,13 +46,15 @@ abstract class Collect[T <: Growable[Any] with Iterable[Any]] extends TypedImper
// actual order of input rows.
override lazy val deterministic: Boolean = false

def getValueOnUpdate(value: Any): Any = InternalRow.copyValue(value)
Contributor

@hvanhovell hvanhovell Apr 29, 2020

Similar comments to the ones at typeChild (see the sketch after this list):

  • Make this protected.
  • Can we name this convertToBufferElement or something like that?
  • I would also prefer to make this an abstract method and have the subclass implement it.
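A sketch of the shape suggested by this comment and the earlier typeChild one (CollectShape is a stand-in name for illustration; the real class is the Collect aggregate):

```scala
import scala.collection.mutable

import org.apache.spark.sql.types.DataType

// Both members become protected and abstract in the base class,
// and each subclass supplies its own implementation.
trait CollectShape {
  protected def bufferElementType: DataType
  protected def convertToBufferElement(value: Any): Any

  // update() funnels every incoming value through the conversion hook:
  def update(buffer: mutable.HashSet[Any], value: Any): mutable.HashSet[Any] =
    buffer += convertToBufferElement(value)
}
```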

case other => other
}

override def getValueOnUpdate(value: Any): Any = InternalRow.copyValue(value) match {
Contributor

Please match on the datatype (or create a flag).

override def eval(buffer: mutable.HashSet[Any]): Any = {
val bufferUpdated: mutable.HashSet[Any] =
child.dataType match {
case BinaryType => buffer.map(_.asInstanceOf[UnsafeArrayData].toByteArray)
Contributor

The map call produces another set, which is not for free. We could do the following:

override def eval(buffer: mutable.HashSet[Any]): Any = {
  val array = child.dataType match {
    case BinaryType => buffer.iterator.map(_.asInstanceOf[UnsafeArrayData].toByteArray).toArray
    case _ => buffer.toArray
  }
  new GenericArrayData(array)
}

val bytesTest2 = "test2".getBytes
val df1 = Seq(bytesTest1, bytesTest1, bytesTest2).toDF("a")
val ret = df1.select(collect_set($"a")).collect()
.map(r => r.getAs[mutable.WrappedArray[_]](0)).head
Contributor

Please avoid casting things to their concrete implementation. This can break when we upgrade to a newer Scala version. In this case, use Seq[_].

@planga82
Contributor Author

Thank you all for the comments, they are very interesting.

@SparkQA

SparkQA commented Apr 29, 2020

Test build #122070 has finished for PR 28351 at commit 2cd1a22.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 29, 2020

Test build #122077 has finished for PR 28351 at commit a5b1dd3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

override def convertToBufferElement(value: Any): Any = {
val v = InternalRow.copyValue(value)
Contributor

Nit: you only need to copy for the default case.

Contributor Author

Yes, it's true. It's cleaner that way.

@@ -17,6 +17,7 @@

package org.apache.spark.sql

import scala.collection.mutable
Contributor

Do we still need this import?

Contributor Author

We don't need it, I forgot to remove it, thanks.

override def eval(buffer: mutable.HashSet[Any]): Any = {
val array = child.dataType match {
case BinaryType =>
buffer.iterator.map(_.asInstanceOf[UnsafeArrayData].toByteArray).toArray
Contributor

It is a bit safer to cast to ArrayData here.

Contributor Author

OK

@SparkQA

SparkQA commented Apr 30, 2020

Test build #122138 has finished for PR 28351 at commit 7ea0059.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@hvanhovell hvanhovell left a comment

LGTM

@dongjoon-hyun
Member

dongjoon-hyun commented May 1, 2020

Hi, @hvanhovell. What about your comment? Are we going to merge this as-is, or do you want to revise the comment more?

@dongjoon-hyun
Member

cc @holdenk since this is a correctness issue in Apache Spark 2.0.2 ~ 2.4.5 at least.

* Array[Byte](BinaryType) Scala equality don't works as expected
* so HashSet return duplicates, we have to change types to drop
* this duplicates and make collect_set work as expected for this
* data type
Member

Could you make this comment clearer for others and move it to lines 163-164?

Contributor Author

Ok, it's better to move it to that line. I've tried to clarify the message.

@@ -530,6 +530,26 @@ class DataFrameAggregateSuite extends QueryTest
)
}

test("SPARK-31500: collect_set() of BinaryType returns duplicate elements") {
Member

nit: collect_set() of BinaryType should not return duplicate elements?

val df = Seq(bytesTest1, bytesTest1, bytesTest2).toDF("a")
val ret = df.select(collect_set($"a")).collect()
.map(r => r.getAs[Seq[_]](0)).head
assert(ret.length == 2)
Member

nit: checkAnswer(df.select(size(collect_set($"a"))), Row(2) :: Nil)?

.select(struct($"x", $"y").as("a"))
val ret1 = df1.select(collect_set($"a")).collect()
.map(r => r.getAs[Seq[_]](0)).head
assert(ret1.length == 2)
Member

nit: checkAnswer(df1.select(size(collect_set($"a"))), Row(2) :: Nil)?

Member

+1

Contributor Author

Shorter way to compare, good idea!

* data type
*/
override lazy val bufferElementType = child.dataType match {
case BinaryType => ArrayType(BinaryType)
Member

ArrayType(ByteType)?

Contributor Author

Good catch. It works either way, but I think ArrayType(ByteType) describes the buffered value (an array of bytes) more accurately.


override def convertToBufferElement(value: Any): Any = child.dataType match {
/*
* SPARK-31500
Member

nit: I think we don't need this JIRA ID here.

Contributor Author

Ok, yes, we can look at the commit for more information.

* SPARK-31500
* collect_set() of BinaryType should not return duplicate elements,
* Java byte arrays use referential equality and identity hash codes
* so we need to use a different Scala object
Member

nit: a different Scala object -> a catalyst value for arrays?

Contributor Author

Ok

@maropu
Member

maropu commented May 1, 2020

Looks fine except for the minor comments.

@SparkQA

SparkQA commented May 1, 2020

Test build #122168 has finished for PR 28351 at commit 4782aea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu maropu closed this in 4fecc20 May 1, 2020
maropu pushed a commit that referenced this pull request May 1, 2020
[SPARK-31500][SQL] collect_set() of BinaryType returns duplicate elements

### What changes were proposed in this pull request?

The collect_set() aggregate function should produce a set of distinct elements. When the column argument's type is BinaryType this is not the case.

Example:
```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

case class R(id: String, value: String, bytes: Array[Byte])
def makeR(id: String, value: String) = R(id, value, value.getBytes)
val df = Seq(makeR("a", "dog"), makeR("a", "cat"), makeR("a", "cat"), makeR("b", "fish")).toDF()
// In the example below "bytesSet" erroneously has duplicates but "stringSet" does not (as expected).
df.agg(collect_set('value) as "stringSet", collect_set('bytes) as "bytesSet").show(truncate=false)
// The same problem is displayed when using window functions.
val win = Window.partitionBy('id).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
val result = df.select(
  collect_set('value).over(win) as "stringSet",
  collect_set('bytes).over(win) as "bytesSet"
)
.select('stringSet, 'bytesSet, size('stringSet) as "stringSetSize", size('bytesSet) as "bytesSetSize")
.show()
```

We use a HashSet buffer to accumulate the results. The problem is that array equality in Scala does not behave as expected: arrays are plain Java arrays, and == does not compare their contents:
Array(1, 2, 3) == Array(1, 2, 3)  // false
The result is that duplicates are not removed from the HashSet.

The proposed solution is to change the type of the buffered elements so that duplicates are dropped, and then, in the last stage, when all the data is in the HashSet buffer, convert the elements back to the original type.
This transformation is only applied when the data type is BinaryType.

### Why are the changes needed?
Fixes the bug explained above.

### Does this PR introduce any user-facing change?
Yes. Now `collect_set()` correctly deduplicates arrays of bytes.

### How was this patch tested?
Unit testing

Closes #28351 from planga82/feature/SPARK-31500_COLLECT_SET_bug.

Authored-by: Pablo Langa <soypab@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(cherry picked from commit 4fecc20)
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
@maropu
Member

maropu commented May 1, 2020

Thanks, all! Merged to master/branch-3.0/branch-2.4. cc: @dongjoon-hyun @holdenk

@SparkQA

SparkQA commented May 1, 2020

Test build #122170 has finished for PR 28351 at commit 67f55a6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Thank you all.
