[SPARK-42664][CONNECT] Support `bloomFilter` function for `DataFrameStatFunctions` #42414
Conversation
cc @hvanhovell I made a clean one, let's restart this.

@LuciferYang does this return the same results as the one in sql/core?

Let me check again, this PR has been sitting for too long, I also can't remember clearly ...

@hvanhovell I generated some random sequences (covering the 5 data types that need to be supported) and used different parameters to compare the output results. So I think their results should be consistent.

@LuciferYang by consistent you mean exactly the same?

Yes. Have you found any cases with different results?
```scala
    fpp: Double): BloomFilter = {
  val agg = if (!fpp.isNaN) {
    Column.fn("bloom_filter_agg", col, lit(expectedNumItems), lit(fpp))
```
I don't really like the ambiguity here. Since we are managing this function ourselves, can we just have one way of invoking it? I kind of prefer `Column.fn("bloom_filter_agg", col, lit(expectedNumItems), lit(numBits))`.

Alternatively you pass all three, where you pick either fpp or numItems and pass null for the other field. Another idea would be to have different names.
Let me think about how to refactor.
fe958a6 changed to only use `Column.fn("bloom_filter_agg", col, lit(expectedNumItems), lit(numBits))`.
Maybe add a negative test case where mightContain evaluates to false?
6ffbfa0 Added checks for values that are definitely not included.
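The negative check being discussed can be illustrated with a toy filter. The following is a deliberately minimal two-hash sketch, not Spark's `org.apache.spark.util.sketch.BloomFilter`; the class and method names are invented for illustration. The key property: an inserted value always passes `mightContain`, while a value that was never inserted should almost always fail it.

```java
import java.util.BitSet;

// Toy two-hash Bloom filter, only to illustrate the negative-test idea:
// a value never inserted should (usually) fail mightContain.
// This is NOT Spark's BloomFilter implementation.
public class ToyBloomFilter {
    private final BitSet bits;
    private final int numBits;

    public ToyBloomFilter(int numBits) {
        this.numBits = numBits;
        this.bits = new BitSet(numBits);
    }

    private int[] positions(long v) {
        int h1 = Long.hashCode(v);
        int h2 = Long.hashCode(v * 0x9E3779B97F4A7C15L); // cheap second hash
        return new int[] { Math.floorMod(h1, numBits), Math.floorMod(h2, numBits) };
    }

    public void putLong(long v) {
        for (int pos : positions(v)) bits.set(pos);
    }

    public boolean mightContain(long v) {
        for (int pos : positions(v)) {
            if (!bits.get(pos)) return false; // definitely absent
        }
        return true; // all probed bits set: maybe present (false positives possible)
    }

    public static void main(String[] args) {
        ToyBloomFilter f = new ToyBloomFilter(1 << 16);
        for (long v = 0; v < 100; v++) f.putLong(v);
        // Inserted values are always reported present (no false negatives).
        System.out.println(f.mightContain(42L));
        // A value that was never inserted is very likely reported absent.
        System.out.println(f.mightContain(123_456L));
    }
}
```

Note the asymmetry: the test in this PR can assert `forall(mightContain)` exactly, but a negative assertion is only probabilistic, so it should use values far from the inserted set and a filter sized generously for the item count.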
```scala
/**
 * `BloomFilterHelper` is used to bridge helper methods in `BloomFilter`
 */
private[spark] object BloomFilterHelper {
```
Why can't you directly reference `BloomFilter.optimalNumOfBits(expectedNumItems, fpp)`? Alternatively you can hide a lot of this by creating dedicated constructors for the `BloomFilterAggregate`.
4709dd5 made `BloomFilter.optimalNumOfBits` public and called it directly.
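For reference, the sizing formula that `optimalNumOfBits` is understood to implement is the classic m = -n · ln(p) / (ln 2)². The sketch below is a standalone re-derivation for illustration (an assumption based on the standard Bloom-filter analysis, not a copy of Spark's source):

```java
// Standalone sketch of the classic Bloom-filter sizing formula:
//   m = -n * ln(p) / (ln 2)^2
// where n = expected number of items and p = target false positive rate.
public class BloomSizing {
    public static long optimalNumOfBits(long n, double p) {
        return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    public static void main(String[] args) {
        // For 1000 expected items at fpp = 0.03, roughly 7.3 bits per item.
        System.out.println(optimalNumOfBits(1000L, 0.03));
    }
}
```

This is why the client can normalize the `(expectedNumItems, fpp)` overload down to `(expectedNumItems, numBits)` before invoking the aggregate: `numBits` is fully determined by the other two parameters.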
```scala
    SQLConf.get.getConf(RUNTIME_BLOOM_FILTER_MAX_NUM_BITS))

// Mark as lazy so that `updater` is not evaluated during tree transformation.
private lazy val updater: BloomFilterUpdater = first.dataType match {
```
For the record, lazy vals are not free.
Yes, but I haven't thought of another way yet. This is similar to the cases of `estimatedNumItems` and `numBits`: if it's not lazy, there will be an `Invalid call to dataType on unresolved object` error.
Lines 143 to 151 in 55b07b1:

```scala
// Mark as lazy so that `estimatedNumItems` is not evaluated during tree transformation.
private lazy val estimatedNumItems: Long =
  Math.min(estimatedNumItemsExpression.eval().asInstanceOf[Number].longValue,
    SQLConf.get.getConf(RUNTIME_BLOOM_FILTER_MAX_NUM_ITEMS))
// Mark as lazy so that `numBits` is not evaluated during tree transformation.
private lazy val numBits: Long =
  Math.min(numBitsExpression.eval().asInstanceOf[Number].longValue,
    SQLConf.get.getConf(RUNTIME_BLOOM_FILTER_MAX_NUM_BITS))
```
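The deferred-evaluation behavior those lazy vals rely on can be sketched in isolation: nothing runs until first access, then the result is memoized. This is an illustrative wrapper, not Spark code; the `Lazy` class and its names are invented for the example (Scala's `lazy val` generates roughly this pattern, plus per-instance locking, which is part of why it "is not free"):

```java
import java.util.function.Supplier;

// Minimal memoizing wrapper: the supplier runs at most once, on first get().
// Analogous to a Scala lazy val guarding eval() during tree transformation.
public class Lazy<T> {
    private Supplier<T> supplier;
    private T value;
    private boolean computed = false;

    public Lazy(Supplier<T> supplier) { this.supplier = supplier; }

    public synchronized T get() {
        if (!computed) {
            value = supplier.get();
            computed = true;
            supplier = null; // allow the supplier (and its captures) to be GC'd
        }
        return value;
    }

    public static void main(String[] args) {
        int[] evals = {0};
        Lazy<Long> numBits = new Lazy<>(() -> { evals[0]++; return 67_108_864L; });
        // Nothing evaluated yet -- analogous to a transformation visiting the
        // expression tree without forcing eval() on unresolved children.
        System.out.println("evaluations before get(): " + evals[0]);
        numBits.get();
        numBits.get();
        System.out.println("evaluations after two gets: " + evals[0]);
    }
}
```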
hvanhovell left a comment:
Looks pretty good! Can you address the comments?
```scala
// Check expectedNumItems is LongType and value greater than 0L
val expectedNumItemsExpr = children(1)
val expectedNumItems = expectedNumItemsExpr match {
```
Changed to `Column.fn("bloom_filter_agg", col, lit(expectedNumItems), lit(numBits))`; the logic indeed appears simpler now, and I have a point for discussion.

@hvanhovell Do you think we should check the validity of the input here? By checking here, the error message can be exactly the same as the API in sql/core. However, if we use the validation mechanism of `BloomFilterAggregate`, the content of the error message will be different, but the code will be more concise.

Perhaps we don't need to ensure that the error message is the same as before?
We can do that in a follow-up.
```scala
val filter1 = df.stat.bloomFilter("id", 1000, 0.03)
assert(filter1.expectedFpp() - 0.03 < 1e-3)
assert(data.forall(filter1.mightContain))
assert(notContainValues.forall(n => !filter1.mightContain(n)))
```
Added checks for values that are definitely not included.
```scala
  numBits
}

if (fpp <= 0d || fpp >= 1d) {
```
In the subsequent process, fpp is no longer involved, so a check is added here. Otherwise, if the user passes an invalid fpp value, the error message will be "Number of bits must be positive", which is quite strange.
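A minimal sketch of the range check described here, with an invented class name and an error message modeled on the discussion (Spark's actual message may differ). Validating fpp up front means the user sees a message about fpp itself, rather than the downstream "Number of bits must be positive":

```java
// Illustrative client-side validation of the false positive probability,
// done before fpp is converted to numBits and dropped from the call.
public class FppCheck {
    public static void requireValidFpp(double fpp) {
        if (fpp <= 0d || fpp >= 1d) {
            throw new IllegalArgumentException(
                "False positive probability must be within range (0.0, 1.0)");
        }
    }

    public static void main(String[] args) {
        requireValidFpp(0.03); // valid: no exception
        try {
            requireValidFpp(1.5); // invalid: caught below
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```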
common/sketch/src/main/java/org/apache/spark/util/sketch/BloomFilter.java
```diff
  * @param p false positive rate (must be 0 < p < 1)
  */
-private static long optimalNumOfBits(long n, double p) {
+public static long optimalNumOfBits(long n, double p) {
```
Changed to public because `DataFrameStatFunctions#buildBloomFilter` needs this method to calculate the `numBits` from `expectedNumItems` and `fpp`.
If you find `(must be 0 < p < 1)` to be quite messy, we can try changing it to `(must be {@literal 0 < p < 1})`.
I am good.
Force-pushed from 0f2a7b1 to 80a6b4b.
hvanhovell left a comment:
LGTM
…tatFunctions`

### What changes were proposed in this pull request?
This PR uses `BloomFilterAggregate` to implement the `bloomFilter` function for `DataFrameStatFunctions`.

### Why are the changes needed?
Add Spark Connect JVM client API coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Add new test
- Manually check Scala 2.13

Closes #42414 from LuciferYang/SPARK-42664-backup.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>
(cherry picked from commit b9f1114)
Signed-off-by: Herman van Hovell <herman@databricks.com>
Thanks @hvanhovell ~
What changes were proposed in this pull request?

This PR uses `BloomFilterAggregate` to implement the `bloomFilter` function for `DataFrameStatFunctions`.

Why are the changes needed?

Add Spark Connect JVM client API coverage.

Does this PR introduce any user-facing change?

No

How was this patch tested?

- Add new test
- Manually check Scala 2.13