New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-16484][SQL] Update hll function type checks to also check for non-foldable inputs #41203
[SPARK-16484][SQL] Update hll function type checks to also check for non-foldable inputs #41203
Conversation
@bersprockets here are the changes to handle non-foldable input args, based on our conversation in #40615. cc @dtenedor @mkaravel |
@RyanBerti thanks for the update! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except one suggestion to dedup two similar code blocks.
@@ -265,6 +288,26 @@ case class HllUnionAgg( | |||
override protected def withNewChildrenInternal(newLeft: Expression, newRight: Expression): | |||
HllUnionAgg = copy(left = newLeft, right = newRight) | |||
|
|||
// Overrides for ExpectsInputTypes | |||
|
|||
override def checkInputDataTypes(): TypeCheckResult = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks mostly duplicated with L104-120 above. Can we dedup into one place and just call the helper twice instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's possible, but might be more valuable to create a new trait that extends ExpectsInputTypes that by default allows all of the types defined by inputTypes to be non-foldable, but provides a mechanism to override each type and define whether its foldable/non-foldable. Feels like this could be utilized by other functions too - thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we define e.g.
private def checkInputDataTypesForHllFunction(
defaultCheck: TypeCheckResult, right: Expression): TypeCheckResult = {
if (defaultCheck.isFailure) {
defaultCheck
} else if (!right.foldable) {
DataTypeMismatch(
errorSubClass = "NON_FOLDABLE_INPUT",
messageParameters = Map(
"inputName" -> "lgConfigK",
"inputType" -> toSQLType(right.dataType),
"inputExpr" -> toSQLExpr(right)
)
)
} else {
TypeCheckSuccess
}
}
@@ -265,6 +288,26 @@ case class HllUnionAgg( | |||
override protected def withNewChildrenInternal(newLeft: Expression, newRight: Expression): | |||
HllUnionAgg = copy(left = newLeft, right = newRight) | |||
|
|||
// Overrides for ExpectsInputTypes | |||
|
|||
override def checkInputDataTypes(): TypeCheckResult = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we define e.g.
private def checkInputDataTypesForHllFunction(
defaultCheck: TypeCheckResult, right: Expression): TypeCheckResult = {
if (defaultCheck.isFailure) {
defaultCheck
} else if (!right.foldable) {
DataTypeMismatch(
errorSubClass = "NON_FOLDABLE_INPUT",
messageParameters = Map(
"inputName" -> "lgConfigK",
"inputType" -> toSQLType(right.dataType),
"inputExpr" -> toSQLExpr(right)
)
)
} else {
TypeCheckSuccess
}
}
@dtenedor I just pushed a commit that tries to generalize the foldable check, as I'm seeing duplicate code in the datasketches functions as well as others (see ApproxCountDistinctForIntervals, ApproximatePercentile, CountMinSketchAgg, etc). I'm open to modifying the trait as needed, or reverting the commit and implementing your suggestion. |
The new trait looks good. In the future we can think about reusing it. |
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExpectsInputTypes.scala
Outdated
Show resolved
Hide resolved
…essions/ExpectsInputTypes.scala Co-authored-by: Gera Shegalov <gshegalov@nvidia.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@HyukjinKwon could you help us merge this one? |
) | ||
checkAnswer(res, Nil) | ||
} | ||
assert(error8.toString contains "UNEXPECTED_INPUT_FOLDABLE_VALUE") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you use checkError
for checking the error class and other fields, please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm seeing various other negative test cases in this suite that use assert, and no examples of checkError. Is there a good example of checkError I could reference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The HiveUDAF suite uses it pretty extensively.
) | ||
checkAnswer(res, Nil) | ||
} | ||
assert(error9.toString contains "UNEXPECTED_INPUT_FOLDABLE_VALUE") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto: Use checkError
) | ||
checkAnswer(res, Nil) | ||
} | ||
assert(error10.toString contains "UNEXPECTED_INPUT_FOLDABLE_VALUE") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, use checkError
@@ -471,6 +471,11 @@ | |||
"class <className> not found." | |||
] | |||
}, | |||
"UNEXPECTED_INPUT_FOLDABLE_VALUE" : { | |||
"message" : [ | |||
"Parameter <paramIndex> requires a foldable value of <requiredFoldable>, however <inputSql> has a foldable value of <inputFoldable>." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From user's point of view, the message is not clear. What's the foldable value of true
?
Can't you just re-use the existing error class: NON_FOLDABLE_INPUT
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @MaxGekk - I introduced the new error message as the text for NON_FOLDABLE_INPUT
only makes sense when we expect the input to be foldable, but it is not. I tried to mimic the ExpectsInputTypes.checkInputTypes
implementation in ExpectsInputTypesAndFoldable
, which checks the foldable attribute of the expression against a user-provided expected value; this allows the trait to check if the foldable value is true OR false, and thus I needed an error message that was more general than NON_FOLDABLE_INPUT
. I wonder if we'll ever want to make sure that an input expression isn't foldable (IE for this function arg, you should only ever specify a column)? If not, then maybe I can update the ExpectsInputTypesAndFoldable
to only check for cases where we expect an expression to be foldable, and then re-use the NON_FOLDABLE_INPUT
error message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when we expect the input to be foldable, but it is not
I wonder when we need to raise an error when the expected input is any expr (means either foldable or non-foldable ), but it is foldable actually. What's the case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if there's a clear use case where we'd want to check if the input is not foldable, but this implementation + the new error message does provide that level of flexibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think non-foldable might be enough, I can't think of a case where we would not want to allow a foldable input. This could also simplify the code a little bit in the checker since Seq[Option[Bool]] could become Seq[Bool]
|
||
override def inputTypes: Seq[AbstractDataType] = Seq(BinaryType, BooleanType) | ||
|
||
override def inputIsFoldable: Seq[Option[Boolean]] = Seq(None, Some(true)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None
means that you don't care of it is foldable or not, correct? You just require to be foldable for the second parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's correct.
@MaxGekk do you still think this PR needs changes, based on our discussion? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM provided the existing points raised but others are addressed.
@@ -471,6 +471,11 @@ | |||
"class <className> not found." | |||
] | |||
}, | |||
"UNEXPECTED_INPUT_FOLDABLE_VALUE" : { | |||
"message" : [ | |||
"Parameter <paramIndex> requires a foldable value of <requiredFoldable>, however <inputSql> has a foldable value of <inputFoldable>." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think non-foldable might be enough, I can't think of a case where we would not want to allow a foldable input. This could also simplify the code a little bit in the checker since Seq[Option[Bool]] could become Seq[Bool]
* a value of None bypasses the check for the given column | ||
*/ | ||
trait ExpectsInputTypesAndFoldable extends ExpectsInputTypes { | ||
def inputIsFoldable: Seq[Option[Boolean]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it's a little bike-sheddy so feel free to disregard but I'd have a small preference towards "requireFoldableInputs" or something since as it stands it sounds like this function is returning if the inputs are foldable rather than the required state.
) | ||
checkAnswer(res, Nil) | ||
} | ||
assert(error8.toString contains "UNEXPECTED_INPUT_FOLDABLE_VALUE") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The HiveUDAF suite uses it pretty extensively.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
What changes were proposed in this pull request?
This PR fixes a null pointer thrown when non-foldable values are provided to the lgConfigK or allowDifferentLgConfigK arguments of the hll_sketch_agg, hll_union_agg, or hll_union functions.
Why are the changes needed?
We should be handling this error gracefully, rather than throwing a null pointer exception.
Does this PR introduce any user-facing change?
This PR introduces two new error messages, replacing what would be null pointer exceptions.
How was this patch tested?
Three new negative test cases have been added to the 'SPARK-16484: hll_*_agg + hll_union negative tests' test of the DataframeAggregateSuite.