[SPARK-46787][CONNECT] bloomFilter function should throw AnalysisException for invalid input #44821
Conversation
Force-pushed from 10dbd7e to 1d27ff8
      df.stat.bloomFilter("id", -1000, 100)
    }.getMessage
    assert(message1.contains("Expected insertions must be positive"))
    assert(message1.contains("VALUE_OUT_OF_RANGE"))
Could you just invoke the getErrorClass() method, please?
+1
      numBits
    }

    if (fpp <= 0d || fpp >= 1d) {
I don't know the reason for removing this check. It seems like a breaking change.
We don't have such a check in DataFrameStatFunctions in vanilla Spark; BloomFilterAggregate will check the value range.
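As a rough illustration of the kind of server-side value-range validation being described, here is a minimal sketch; the names and messages are illustrative stand-ins, not Spark's actual BloomFilterAggregate internals:

```scala
// Hypothetical sketch of server-side argument validation; the real checks
// live in Spark's BloomFilterAggregate and surface as DATATYPE_MISMATCH
// analysis errors rather than an Either.
def validateBloomFilterArgs(expectedNumItems: Long, numBits: Long): Either[String, Unit] =
  if (expectedNumItems <= 0L)
    Left(s"VALUE_OUT_OF_RANGE: expectedNumItems must be positive (got $expectedNumItems)")
  else if (numBits <= 0L)
    Left(s"VALUE_OUT_OF_RANGE: numBits must be positive (got $numBits)")
  else
    Right(())
```

The point of centralizing the check server-side is that both classic and Connect clients then fail with the same analysis-time error.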
I see. So this change tries to fix the bug.
For vanilla Spark, before the refactoring work #43391, there was this check when constructing a BloomFilter via the following constructor (spark/common/sketch/src/main/java/org/apache/spark/util/sketch/BloomFilter.java, lines 244 to 251 at bc889c8):
public static BloomFilter create(long expectedNumItems, double fpp) {
  if (fpp <= 0D || fpp >= 1D) {
    throw new IllegalArgumentException(
      "False positive probability must be within range (0.0, 1.0)"
    );
  }
  return create(expectedNumItems, optimalNumOfBits(expectedNumItems, fpp));
}
After the refactoring work #43391 and https://github.com/apache/spark/pull/44821/files, if an invalid fpp value is passed, like df.stat.bloomFilter("id", 1000, -1.0), I see the following error message:

[info] org.apache.spark.sql.AnalysisException:
[DATATYPE_MISMATCH.VALUE_OUT_OF_RANGE] Cannot resolve "bloom_filter_agg(id, 1000, 0)" due to data type mismatch: The numBits must be between [0, positive] (current value = 0L). SQLSTATE: 42K09

Personally, I feel the new error message is not very user-friendly. My input expression is not bloom_filter_agg(id, 1000, 0), and I did not specify the numBits parameter. Why is this error reported, and how should I fix it? Is there a way to improve this?
    val expectedNumItems = expectedNumItemsExpr match {
      case Literal(l: Long, LongType) => l
      case _ =>
        throw InvalidPlanInput("Expected insertions must be long literal.")
Why remove these checks?
InvalidPlanInput is kind of an internal exception; we should throw an analysis exception instead: 1) to be consistent with vanilla Spark; 2) for a better error message.
got it.
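For context, the removed planner-side check followed this pattern-matching shape. The sketch below uses a stand-in expression ADT rather than Spark's actual Expression classes, purely to illustrate why the failure was an internal error instead of an analysis error:

```scala
// Stand-in expression types (illustrative only, not Spark's Expression tree).
sealed trait Expr
case class LongLit(value: Long) extends Expr
case class ColumnRef(name: String) extends Expr

// Mirrors the removed planner check: only a long literal was accepted, and
// anything else raised the internal InvalidPlanInput-style error, instead of
// letting the analyzer produce a proper AnalysisException.
def extractExpectedNumItems(e: Expr): Long = e match {
  case LongLit(v) => v
  case other =>
    throw new IllegalArgumentException(s"Expected insertions must be long literal, got: $other")
}
```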
@@ -248,19 +248,19 @@ class ClientDataFrameStatSuite extends RemoteSparkSession {

   test("Bloom filter test invalid inputs") {
     val df = spark.range(1000).toDF("id")
-    val message1 = intercept[SparkException] {
+    val error1 = intercept[AnalysisException] {
[info] - Bloom filter test invalid inputs *** FAILED *** (13 milliseconds)
[info] "[DATATYPE_MISMATCH.]VALUE_OUT_OF_RANGE" did not equal "[]VALUE_OUT_OF_RANGE" (ClientDataFrameStatSuite.scala:254)
[info] Analysis:
[info] "[DATATYPE_MISMATCH.]VALUE_OUT_OF_RANGE" -> "[]VALUE_OUT_OF_RANGE"
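The failure above comes from Spark error classes being hierarchical ("CLASS.SUBCLASS"), so a comparison against the bare subclass name does not match the full error class. A tiny illustrative helper (not a Spark API) makes the shape of the mismatch clear:

```scala
// An error class like "DATATYPE_MISMATCH.VALUE_OUT_OF_RANGE" carries a main
// class and a subclass; a test can match the full name or just the trailing
// subclass, but not a bare prefix.
def errorClassMatches(actual: String, expected: String): Boolean =
  actual == expected || actual.endsWith("." + expected)
```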
@LuciferYang then I think we can try adding the following check back:
Force-pushed from 9059a1a to 44031f6
+1, LGTM
@@ -536,6 +536,11 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
    * @since 2.0.0
    */
   def bloomFilter(col: Column, expectedNumItems: Long, fpp: Double): BloomFilter = {
+    if (fpp <= 0D || fpp >= 1D) {
Perhaps in a follow-up, we could try moving the fpp check into the BloomFilter#optimalNumOfBits(long, double) method?
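The suggested move could look roughly like this, assuming optimalNumOfBits keeps the usual Bloom filter sizing formula -n * ln(p) / (ln 2)^2; this is a hypothetical Scala sketch, not the actual follow-up implementation:

```scala
// Sketch: fold the fpp range check into optimalNumOfBits itself, so every
// caller (classic API, Connect, SQL) inherits the same validation.
def optimalNumOfBits(expectedNumItems: Long, fpp: Double): Long = {
  if (fpp <= 0d || fpp >= 1d)
    throw new IllegalArgumentException(
      s"False positive probability must be within range (0.0, 1.0), got $fpp")
  math.ceil(-expectedNumItems * math.log(fpp) / (math.log(2) * math.log(2))).toLong
}
```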
Yeah, on second thought, it's worth a separate PR to unify the check of fpp. Let me remove the fpp check for now.
OK
Merged into master for Spark 4.0. Thanks @zhengruifeng @beliefer and @MaxGekk ~
What changes were proposed in this pull request?
The bloomFilter function should throw AnalysisException for invalid input.

Why are the changes needed?
BloomFilterAggregate itself validates the input and throws meaningful errors. We should not handle invalid input and throw InvalidPlanInput in the planner.

Does this PR introduce any user-facing change?
Yes, InvalidPlanInput -> AnalysisException.

How was this patch tested?
Updated CI.

Was this patch authored or co-authored using generative AI tooling?
No.