
[SPARK-46787][CONNECT] bloomFilter function should throw AnalysisException for invalid input #44821

Closed

Conversation

zhengruifeng (Contributor):

What changes were proposed in this pull request?

The bloomFilter function should throw an AnalysisException for invalid input.

Why are the changes needed?

  1. BloomFilterAggregate itself validates the input and throws meaningful errors; we should not handle that invalid input ourselves and throw InvalidPlanInput in the Planner.
  2. To be consistent with the vanilla Scala API and other functions.

Does this PR introduce any user-facing change?

Yes: InvalidPlanInput -> AnalysisException.
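
For illustration, a minimal sketch of the change from a user's perspective, adapted from this PR's test updates (assumes a ScalaTest suite with a spark session available, as in ClientDataFrameStatSuite):

    import org.apache.spark.sql.AnalysisException

    val df = spark.range(1000).toDF("id")

    // Before this PR, the Connect planner rejected the call with the internal
    // InvalidPlanInput (surfaced as a SparkException); after it, the error is
    // an AnalysisException raised by the analyzer:
    val error = intercept[AnalysisException] {
      df.stat.bloomFilter("id", -1000, 100)
    }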

How was this patch tested?

Updated CI tests.

Was this patch authored or co-authored using generative AI tooling?

No.

      df.stat.bloomFilter("id", -1000, 100)
    }.getMessage
    assert(message1.contains("Expected insertions must be positive"))
    assert(message1.contains("VALUE_OUT_OF_RANGE"))
Member:

Could you just invoke the getErrorClass() method, please?

Contributor:

+1
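
The suggested assertion might look like the following sketch (assuming the same suite context as above; getErrorClass comes from Spark's SparkThrowable error framework, and the expected value matches the test failure shown later in the conversation):

    val error1 = intercept[AnalysisException] {
      df.stat.bloomFilter("id", -1000, 100)
    }
    // Compare the structured error class instead of message substrings.
    assert(error1.getErrorClass == "DATATYPE_MISMATCH.VALUE_OUT_OF_RANGE")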

      numBits
    }

    if (fpp <= 0d || fpp >= 1d) {
Contributor:

I don't know the reason for removing the check. It seems like a breaking change.

zhengruifeng (Contributor Author), Jan 23, 2024:

We don't have such a check in DataFrameStatFunctions in vanilla Spark; BloomFilterAggregate will check the value range.
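
For reference, a schematic sketch of the kind of analysis-time range check BloomFilterAggregate performs via checkInputDataTypes; this is not the actual Spark source, and the parameter names here are illustrative:

    import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
    import org.apache.spark.sql.catalyst.analysis.TypeCheckResult.DataTypeMismatch

    // Reject a non-positive item count during analysis rather than at
    // planning time; DataTypeMismatch is what surfaces as DATATYPE_MISMATCH.*
    // in the resulting AnalysisException.
    def checkEstimatedNumItems(estimatedNumItems: Long): TypeCheckResult = {
      if (estimatedNumItems <= 0L) {
        DataTypeMismatch(
          errorSubClass = "VALUE_OUT_OF_RANGE",
          messageParameters = Map(
            "exprName" -> "estimatedNumItems",
            "valueRange" -> "[0, positive]",
            "currentValue" -> estimatedNumItems.toString))
      } else {
        TypeCheckResult.TypeCheckSuccess
      }
    }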

Contributor:

I see. So this change tries to fix the bug.

Contributor:

For vanilla Spark, before the refactoring work in #43391, this check logic existed when constructing a BloomFilter via the following factory method:

    public static BloomFilter create(long expectedNumItems, double fpp) {
      if (fpp <= 0D || fpp >= 1D) {
        throw new IllegalArgumentException(
          "False positive probability must be within range (0.0, 1.0)"
        );
      }
      return create(expectedNumItems, optimalNumOfBits(expectedNumItems, fpp));
    }

LuciferYang (Contributor), Jan 23, 2024:

After the refactoring work in #43391 and this PR (https://github.com/apache/spark/pull/44821/files), if an invalid fpp value is passed, like df.stat.bloomFilter("id", 1000, -1.0), I see the following error message:

[info] org.apache.spark.sql.AnalysisException: 
[DATATYPE_MISMATCH.VALUE_OUT_OF_RANGE] Cannot resolve "bloom_filter_agg(id, 1000, 0)" due to data type mismatch: The numBits must be between [0, positive] (current value = 0L). SQLSTATE: 42K09

Personally, I feel the new error message is not very user-friendly. My input expression is not bloom_filter_agg(id, 1000, 0), and I did not specify the numBits parameter. Why is this error reported? How should I fix it?

Is there a way to improve this?
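
One option, sketched below, is to validate fpp eagerly on the client so the error names the parameter the user actually passed (the helper name here is hypothetical; the check itself mirrors the one discussed further down):

    // Hypothetical client-side guard; an equivalent check could live directly
    // in DataFrameStatFunctions.bloomFilter.
    def requireValidFpp(fpp: Double): Unit = {
      if (fpp <= 0d || fpp >= 1d) {
        throw new IllegalArgumentException(
          s"False positive probability must be within range (0.0, 1.0): $fpp")
      }
    }

    requireValidFpp(-1.0) // throws with a message that names fpp itself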

    val expectedNumItems = expectedNumItemsExpr match {
      case Literal(l: Long, LongType) => l
      case _ =>
        throw InvalidPlanInput("Expected insertions must be long literal.")
    }
Contributor:

Why remove these checks?

zhengruifeng (Contributor Author):

InvalidPlanInput is a kind of internal exception; we should throw an analysis exception instead:
1. to be consistent with vanilla Spark;
2. for a better error message.

Contributor:

Got it.


@@ -248,19 +248,19 @@ class ClientDataFrameStatSuite extends RemoteSparkSession {

   test("Bloom filter test invalid inputs") {
     val df = spark.range(1000).toDF("id")
-    val message1 = intercept[SparkException] {
+    val error1 = intercept[AnalysisException] {
Contributor:

[info] - Bloom filter test invalid inputs *** FAILED *** (13 milliseconds)
[info]   "[DATATYPE_MISMATCH.]VALUE_OUT_OF_RANGE" did not equal "[]VALUE_OUT_OF_RANGE" (ClientDataFrameStatSuite.scala:254)
[info]   Analysis:
[info]   "[DATATYPE_MISMATCH.]VALUE_OUT_OF_RANGE" -> "[]VALUE_OUT_OF_RANGE"


zhengruifeng (Contributor Author):

@LuciferYang then I think we can try adding the following check back:

    if (fpp <= 0D || fpp >= 1D) {
      throw new IllegalArgumentException(
        "False positive probability must be within range (0.0, 1.0)"
      );
    }

LuciferYang (Contributor) left a comment:

+1, LGTM

@@ -536,6 +536,11 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
    * @since 2.0.0
    */
   def bloomFilter(col: Column, expectedNumItems: Long, fpp: Double): BloomFilter = {
+    if (fpp <= 0D || fpp >= 1D) {
Contributor:

Perhaps in a follow-up, we could try moving the fpp check to the BloomFilter#optimalNumOfBits(long, double) method?
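
A rough Scala sketch of that idea (the real method lives in Spark's Java sketch library; the sizing formula below is the standard Bloom filter bound, and this placement is only a proposal):

    // Centralizing the fpp check where numBits is derived would let every
    // caller share a single validation.
    def optimalNumOfBits(expectedNumItems: Long, fpp: Double): Long = {
      if (fpp <= 0d || fpp >= 1d) {
        throw new IllegalArgumentException(
          "False positive probability must be within range (0.0, 1.0)")
      }
      // Standard sizing: m = -n * ln(p) / (ln 2)^2
      (-expectedNumItems * math.log(fpp) / (math.log(2) * math.log(2))).toLong
    }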

zhengruifeng (Contributor Author):

Yeah, on second thought, it's worth a separate PR to unify the fpp check. Let me remove the fpp check for now.

Contributor:

OK

LuciferYang (Contributor):

Merged into master for Spark 4.0. Thanks @zhengruifeng @beliefer and @MaxGekk ~

zhengruifeng deleted the connect_bloom_filter_agg_error branch on January 25, 2024 at 03:15.