
[SPARK-46787][CONNECT] bloomFilter function should throw AnalysisException for invalid input #44821

Closed

Conversation

zhengruifeng (Contributor):

What changes were proposed in this pull request?

The bloomFilter function should throw an AnalysisException for invalid input.

Why are the changes needed?

  1. BloomFilterAggregate itself validates the input and throws meaningful errors; we should not handle that invalid input ourselves and throw InvalidPlanInput in the Planner.
  2. To be consistent with the vanilla Scala API and other functions.

Does this PR introduce any user-facing change?

Yes: InvalidPlanInput -> AnalysisException.
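
For illustration, a minimal sketch of the change from a user's perspective, adapted from this PR's test updates (assumes a ScalaTest suite with a spark session available, as in ClientDataFrameStatSuite):

    import org.apache.spark.sql.AnalysisException

    val df = spark.range(1000).toDF("id")

    // Before this PR, the Connect planner rejected the call with the internal
    // InvalidPlanInput (surfaced as a SparkException); after it, the error is
    // an AnalysisException raised by the analyzer:
    val error = intercept[AnalysisException] {
      df.stat.bloomFilter("id", -1000, 100)
    }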

How was this patch tested?

Updated CI tests.

Was this patch authored or co-authored using generative AI tooling?

No.

      df.stat.bloomFilter("id", -1000, 100)
    }.getMessage
    assert(message1.contains("Expected insertions must be positive"))
    assert(message1.contains("VALUE_OUT_OF_RANGE"))
Member:

Could you just invoke the getErrorClass() method, please?

Contributor:

+1
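
The suggested assertion might look like the following sketch (assuming the same suite context as above; getErrorClass comes from Spark's SparkThrowable error framework, and the expected value matches the test failure shown later in the conversation):

    val error1 = intercept[AnalysisException] {
      df.stat.bloomFilter("id", -1000, 100)
    }
    // Compare the structured error class instead of message substrings.
    assert(error1.getErrorClass == "DATATYPE_MISMATCH.VALUE_OUT_OF_RANGE")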

      numBits
    }

    if (fpp <= 0d || fpp >= 1d) {
Contributor:

I don't know the reason for removing the check. It seems like a breaking change.

zhengruifeng (Contributor Author), Jan 23, 2024:

We don't have such a check in DataFrameStatFunctions in vanilla Spark; BloomFilterAggregate will check the value range.
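
For reference, a schematic sketch of the kind of analysis-time range check BloomFilterAggregate performs via checkInputDataTypes; this is not the actual Spark source, and the parameter names here are illustrative:

    import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
    import org.apache.spark.sql.catalyst.analysis.TypeCheckResult.DataTypeMismatch

    // Reject a non-positive item count during analysis rather than at
    // planning time; DataTypeMismatch is what surfaces as DATATYPE_MISMATCH.*
    // in the resulting AnalysisException.
    def checkEstimatedNumItems(estimatedNumItems: Long): TypeCheckResult = {
      if (estimatedNumItems <= 0L) {
        DataTypeMismatch(
          errorSubClass = "VALUE_OUT_OF_RANGE",
          messageParameters = Map(
            "exprName" -> "estimatedNumItems",
            "valueRange" -> "[0, positive]",
            "currentValue" -> estimatedNumItems.toString))
      } else {
        TypeCheckResult.TypeCheckSuccess
      }
    }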

Contributor:

I see. So this change tries to fix the bug.

Contributor:

For vanilla Spark, before the refactoring work in #43391, this check logic existed when constructing a BloomFilter via the following factory method:

    public static BloomFilter create(long expectedNumItems, double fpp) {
      if (fpp <= 0D || fpp >= 1D) {
        throw new IllegalArgumentException(
          "False positive probability must be within range (0.0, 1.0)"
        );
      }
      return create(expectedNumItems, optimalNumOfBits(expectedNumItems, fpp));
    }

LuciferYang (Contributor), Jan 23, 2024:

After the refactoring work in #43391 and this PR (https://github.com/apache/spark/pull/44821/files), if an invalid fpp value is passed, like df.stat.bloomFilter("id", 1000, -1.0), I see the following error message:

[info] org.apache.spark.sql.AnalysisException: 
[DATATYPE_MISMATCH.VALUE_OUT_OF_RANGE] Cannot resolve "bloom_filter_agg(id, 1000, 0)" due to data type mismatch: The numBits must be between [0, positive] (current value = 0L). SQLSTATE: 42K09

Personally, I feel the new error message is not very user-friendly. My input expression is not bloom_filter_agg(id, 1000, 0), and I did not specify the numBits parameter. Why is this error reported? How should I fix it?

Is there a way to improve this?
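
One option, sketched below, is to validate fpp eagerly on the client so the error names the parameter the user actually passed (the helper name here is hypothetical; the check itself mirrors the one discussed further down):

    // Hypothetical client-side guard; an equivalent check could live directly
    // in DataFrameStatFunctions.bloomFilter.
    def requireValidFpp(fpp: Double): Unit = {
      if (fpp <= 0d || fpp >= 1d) {
        throw new IllegalArgumentException(
          s"False positive probability must be within range (0.0, 1.0): $fpp")
      }
    }

    requireValidFpp(-1.0) // throws with a message that names fpp itself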

    val expectedNumItems = expectedNumItemsExpr match {
      case Literal(l: Long, LongType) => l
      case _ =>
        throw InvalidPlanInput("Expected insertions must be long literal.")
    }
Contributor:

Why remove these checks?

zhengruifeng (Contributor Author):

InvalidPlanInput is a kind of internal exception; we should throw an analysis exception instead:
1. to be consistent with vanilla Spark;
2. for a better error message.

Contributor:

Got it.


@@ -248,19 +248,19 @@ class ClientDataFrameStatSuite extends RemoteSparkSession {

   test("Bloom filter test invalid inputs") {
     val df = spark.range(1000).toDF("id")
-    val message1 = intercept[SparkException] {
+    val error1 = intercept[AnalysisException] {
Contributor:

[info] - Bloom filter test invalid inputs *** FAILED *** (13 milliseconds)
[info]   "[DATATYPE_MISMATCH.]VALUE_OUT_OF_RANGE" did not equal "[]VALUE_OUT_OF_RANGE" (ClientDataFrameStatSuite.scala:254)
[info]   Analysis:
[info]   "[DATATYPE_MISMATCH.]VALUE_OUT_OF_RANGE" -> "[]VALUE_OUT_OF_RANGE"


zhengruifeng (Contributor Author):

@LuciferYang then I think we can try adding the following check back:

    if (fpp <= 0D || fpp >= 1D) {
      throw new IllegalArgumentException(
        "False positive probability must be within range (0.0, 1.0)"
      );
    }

LuciferYang (Contributor) left a comment:

+1, LGTM

@@ -536,6 +536,11 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
    * @since 2.0.0
    */
   def bloomFilter(col: Column, expectedNumItems: Long, fpp: Double): BloomFilter = {
+    if (fpp <= 0D || fpp >= 1D) {
Contributor:

Perhaps in a follow-up, we could try moving the fpp check to the BloomFilter#optimalNumOfBits(long, double) method?
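
A rough Scala sketch of that idea (the real method lives in Spark's Java sketch library; the sizing formula below is the standard Bloom filter bound, and this placement is only a proposal):

    // Centralizing the fpp check where numBits is derived would let every
    // caller share a single validation.
    def optimalNumOfBits(expectedNumItems: Long, fpp: Double): Long = {
      if (fpp <= 0d || fpp >= 1d) {
        throw new IllegalArgumentException(
          "False positive probability must be within range (0.0, 1.0)")
      }
      // Standard sizing: m = -n * ln(p) / (ln 2)^2
      (-expectedNumItems * math.log(fpp) / (math.log(2) * math.log(2))).toLong
    }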

zhengruifeng (Contributor Author):

Yeah, on second thought, it's worth a separate PR to unify the fpp check. Let me remove the fpp check for now.

Contributor:

OK

LuciferYang (Contributor):

Merged into master for Spark 4.0. Thanks @zhengruifeng @beliefer and @MaxGekk ~

zhengruifeng deleted the connect_bloom_filter_agg_error branch on January 25, 2024 at 03:15.