
Conversation

@LuciferYang
Contributor

What changes were proposed in this pull request?

This PR uses `BloomFilterAggregate` to implement the `bloomFilter` function for `DataFrameStatFunctions`.

Why are the changes needed?

Add Spark Connect JVM client API coverage.

Does this PR introduce any user-facing change?

No

How was this patch tested?

  • Add new test
  • Manually check Scala 2.13

@LuciferYang
Contributor Author

cc @hvanhovell I made a clean one, let's restart this.

@hvanhovell
Contributor

@LuciferYang does this return the same results as the one in sql/core?

@LuciferYang
Contributor Author

Let me check again; this PR has been open for too long and I can't remember clearly ...

@LuciferYang
Contributor Author

LuciferYang commented Aug 9, 2023

@hvanhovell I generated some random sequences (covering the 5 data types that need to be supported) and, with different parameters, compared the outputs (including `numHashFunctions`, `bits.bitCount`, and `bits.data.mkString` of `BloomFilterImpl`) with the outputs in the sql/core module, and their results are consistent.

So I think their results should be consistent.

@hvanhovell
Contributor

@LuciferYang by consistent you mean exactly the same?

@LuciferYang
Contributor Author

> @LuciferYang by consistent you mean exactly the same?

Yes. Have you found any cases with different results?

fpp: Double): BloomFilter = {

val agg = if (!fpp.isNaN) {
Column.fn("bloom_filter_agg", col, lit(expectedNumItems), lit(fpp))
Contributor

I don't really like the ambiguity here. Since we are managing this function ourselves, can we just have one way of invoking it? I kind of prefer `Column.fn("bloom_filter_agg", col, lit(expectedNumItems), lit(numBits))`.

Alternatively, you pass all three, picking either fpp or numItems and passing null for the other field. Another idea would be to use different names.

Contributor Author

Let me think about how to refactor.

Contributor Author

fe958a6 changed it to only use `Column.fn("bloom_filter_agg", col, lit(expectedNumItems), lit(numBits))`.

Contributor

Maybe add a negative test case where mightContain evaluates to false?

Contributor Author

6ffbfa0 Added checks for values that are definitely not included.

/**
* `BloomFilterHelper` is used to bridge helper methods in `BloomFilter`
*/
private[spark] object BloomFilterHelper {
Contributor

Why can't you directly reference `BloomFilter.optimalNumOfBits(expectedNumItems, fpp)`? Alternatively, you can hide a lot of this by creating dedicated constructors for the `BloomFilterAggregate`.

Contributor Author

4709dd5 made `BloomFilter.optimalNumOfBits` public and called it directly.
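For reference, the conversion from (expectedNumItems, fpp) to a bit count follows the standard optimal-bits formula m = -n * ln(p) / (ln 2)^2. A self-contained sketch (the class and method names here are assumed for illustration, not Spark's actual API surface):

```java
// Illustrative sketch (class and method names assumed): the standard
// optimal-bits formula m = -n * ln(p) / (ln 2)^2, as used by Guava-style
// Bloom filter implementations, including Spark's sketch module.
final class BloomFilterMath {
    static long optimalNumOfBits(long n, double p) {
        return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }
}
```

Lowering fpp increases the required number of bits; for example, 1000 items at fpp = 0.03 needs roughly 7300 bits.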

SQLConf.get.getConf(RUNTIME_BLOOM_FILTER_MAX_NUM_BITS))

// Mark as lazy so that `updater` is not evaluated during tree transformation.
private lazy val updater: BloomFilterUpdater = first.dataType match {
Contributor

For the record, lazy vals are not free.

Contributor Author

Yes, but I haven't thought of another way yet. This is similar to the cases of `estimatedNumItems` and `numBits`: if it's not lazy, there will be an `Invalid call to dataType on unresolved object` error.

// Mark as lazy so that `estimatedNumItems` is not evaluated during tree transformation.
private lazy val estimatedNumItems: Long =
Math.min(estimatedNumItemsExpression.eval().asInstanceOf[Number].longValue,
SQLConf.get.getConf(RUNTIME_BLOOM_FILTER_MAX_NUM_ITEMS))
// Mark as lazy so that `numBits` is not evaluated during tree transformation.
private lazy val numBits: Long =
Math.min(numBitsExpression.eval().asInstanceOf[Number].longValue,
SQLConf.get.getConf(RUNTIME_BLOOM_FILTER_MAX_NUM_BITS))
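The deferral being relied on here can be illustrated outside Spark: a Scala lazy val behaves like a memoized supplier, evaluating its expression only on first access (i.e. after the plan is resolved), not while the tree is still being transformed. A minimal Java analogue, with assumed names:

```java
import java.util.function.Supplier;

// Simplified illustration (assumed names): a Scala lazy val behaves like a
// memoized supplier -- the wrapped expression only runs on first access,
// not when the value is declared, which is why the expression is not
// evaluated during tree transformation.
final class Memoized<T> implements Supplier<T> {
    private final Supplier<T> compute;
    private boolean evaluated = false;
    private T value;

    Memoized(Supplier<T> compute) { this.compute = compute; }

    boolean isEvaluated() { return evaluated; }

    @Override
    public synchronized T get() {
        if (!evaluated) {
            value = compute.get();  // first access triggers evaluation
            evaluated = true;
        }
        return value;
    }
}
```

The cost the reviewer alludes to is real: each access of a lazy val pays a check (and, in Scala, synchronization), which this analogue makes visible.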

Contributor

@hvanhovell left a comment

Looks pretty good! Can you address the comments?


// Check that expectedNumItems is LongType and its value is greater than 0L
val expectedNumItemsExpr = children(1)
val expectedNumItems = expectedNumItemsExpr match {
Contributor Author

After changing to `Column.fn("bloom_filter_agg", col, lit(expectedNumItems), lit(numBits))`, the logic indeed appears simpler now, and I have a point for discussion.

@hvanhovell Do you think we should check the validity of the input here? By checking here, the error message can be exactly the same as the API in sql/core. However, if we use the validation mechanism of `BloomFilterAggregate`, the error message will be different, but the code will be more concise.

Perhaps we don't need to ensure that the error message is the same as before?

Contributor

We can do that in a follow-up.

val filter1 = df.stat.bloomFilter("id", 1000, 0.03)
assert(filter1.expectedFpp() - 0.03 < 1e-3)
assert(data.forall(filter1.mightContain))
assert(notContainValues.forall(n => !filter1.mightContain(n)))
Contributor Author

Added checks for values that are definitely not included.
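The guarantee this negative check exercises can be sketched with a toy Bloom filter (purely illustrative, not Spark's implementation): inserted values always report `mightContain == true` (no false negatives), while absent values may report true only with small probability (false positives).

```java
import java.util.BitSet;

// Toy Bloom filter (illustrative only, not Spark's implementation). It
// demonstrates the property the test relies on: inserted values always
// report mightContain == true (no false negatives), while absent values
// may report true with small probability (false positives).
final class ToyBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    ToyBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Murmur3-style finalizer used as a cheap seeded hash.
    private int index(long item, int seed) {
        long h = item + seed * 0xC2B2AE3D27D4EB4FL;
        h = (h ^ (h >>> 33)) * 0xFF51AFD7ED558CCDL;
        h = (h ^ (h >>> 33)) * 0xC4CEB9FE1A85EC53L;
        h ^= h >>> 33;
        return (int) Math.floorMod(h, (long) numBits);
    }

    void put(long item) {
        for (int i = 0; i < numHashes; i++) bits.set(index(item, i));
    }

    boolean mightContain(long item) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(item, i))) return false;
        }
        return true;
    }
}
```

Because absent values can occasionally collide, a negative test like the one above should pick values far outside the inserted range and tolerate the configured false-positive rate.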

numBits
}

if (fpp <= 0d || fpp >= 1d) {
Contributor Author

@LuciferYang Aug 10, 2023

In the subsequent process, fpp is no longer involved, so a check is added here. Otherwise, if the user passes an invalid fpp value, the error message will be "Number of bits must be positive", which is quite strange.
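A standalone sketch of this kind of up-front check (class name and message wording are assumed for illustration):

```java
// Sketch (names and message wording assumed): validate fpp before it is
// folded into numBits, so an invalid fpp yields an error about fpp itself
// rather than the confusing "Number of bits must be positive". NaN
// comparisons are always false, so a NaN sentinel (meaning numBits was
// passed directly) deliberately slips through this check, mirroring the
// PR's `if (fpp <= 0d || fpp >= 1d)` logic.
final class FppValidation {
    static void requireValidFpp(double fpp) {
        if (fpp <= 0d || fpp >= 1d) {
            throw new IllegalArgumentException(
                "False positive probability must be within range (0.0, 1.0)");
        }
    }
}
```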

* @param p false positive rate (must be 0 < p < 1)
*/
private static long optimalNumOfBits(long n, double p) {
public static long optimalNumOfBits(long n, double p) {
Contributor Author

It was changed to public because `DataFrameStatFunctions#buildBloomFilter` needs to use this method to calculate `numBits` from `expectedNumItems` and `fpp`.

Contributor Author

If you find `(must be 0 < p < 1)` to be quite messy, we can try changing it to `(must be {@literal 0 < p < 1})`.

Contributor

I am good.

@LuciferYang
Contributor Author

The unidoc check still fails, but I can run it successfully locally; I am investigating how to resolve this.

Contributor

@hvanhovell left a comment

LGTM

hvanhovell pushed a commit that referenced this pull request Aug 15, 2023
…tatFunctions`

### What changes were proposed in this pull request?
This PR uses `BloomFilterAggregate` to implement the `bloomFilter` function for `DataFrameStatFunctions`.

### Why are the changes needed?
Add Spark Connect JVM client API coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Add new test
- Manually check Scala 2.13

Closes #42414 from LuciferYang/SPARK-42664-backup.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>
(cherry picked from commit b9f1114)
Signed-off-by: Herman van Hovell <herman@databricks.com>
@LuciferYang
Contributor Author

Thanks @hvanhovell ~

valentinp17 pushed a commit to valentinp17/spark that referenced this pull request Aug 24, 2023
