Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-16484][SQL] Update hll function type checks to also check for non-foldable inputs #41203

Conversation

RyanBerti
Copy link
Contributor

What changes were proposed in this pull request?

This PR fixes a null pointer thrown when non-foldable values are provided to the lgConfigK or allowDifferentLgConfigK arguments of the hll_sketch_agg, hll_union_agg, or hll_union functions.

Why are the changes needed?

We should be handling this error gracefully, rather than throwing a null pointer exception.

Does this PR introduce any user-facing change?

This PR introduces two new error messages, replacing what would be null pointer exceptions.

How was this patch tested?

Three new negative test cases have been added to the 'SPARK-16484: hll_*_agg + hll_union negative tests' test of the DataframeAggregateSuite.

@github-actions github-actions bot added the SQL label May 17, 2023
@RyanBerti RyanBerti changed the title Update hll function type checks to also check for non-foldable inputs [SPARK-16484][SQL] Update hll function type checks to also check for non-foldable inputs May 17, 2023
@RyanBerti
Copy link
Contributor Author

@bersprockets here are the changes to handle non-foldable input args, based on our conversation in #40615. cc @dtenedor @mkaravel

@dtenedor
Copy link
Contributor

@RyanBerti thanks for the update!

Copy link
Contributor

@dtenedor dtenedor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except one suggestion to dedup two similar code blocks.

@@ -265,6 +288,26 @@ case class HllUnionAgg(
override protected def withNewChildrenInternal(newLeft: Expression, newRight: Expression):
HllUnionAgg = copy(left = newLeft, right = newRight)

// Overrides for ExpectsInputTypes

override def checkInputDataTypes(): TypeCheckResult = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks mostly duplicated with L104-120 above. Can we dedup into one place and just call the helper twice instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's possible, but might be more valuable to create a new trait that extends ExpectsInputTypes that by default allows all of the types defined by inputTypes to be non-foldable, but provides a mechanism to override each type and define whether its foldable/non-foldable. Feels like this could be utilized by other functions too - thoughts?

Copy link
Contributor

@dtenedor dtenedor May 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we define e.g.

  private def checkInputDataTypesForHllFunction(
      defaultCheck: TypeCheckResult, right: Expression): TypeCheckResult = {
    if (defaultCheck.isFailure) {
      defaultCheck
    } else if (!right.foldable) {
      DataTypeMismatch(
        errorSubClass = "NON_FOLDABLE_INPUT",
        messageParameters = Map(
          "inputName" -> "lgConfigK",
          "inputType" -> toSQLType(right.dataType),
          "inputExpr" -> toSQLExpr(right)
        )
      )
    } else {
      TypeCheckSuccess
    }
  }

@@ -265,6 +288,26 @@ case class HllUnionAgg(
override protected def withNewChildrenInternal(newLeft: Expression, newRight: Expression):
HllUnionAgg = copy(left = newLeft, right = newRight)

// Overrides for ExpectsInputTypes

override def checkInputDataTypes(): TypeCheckResult = {
Copy link
Contributor

@dtenedor dtenedor May 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we define e.g.

  private def checkInputDataTypesForHllFunction(
      defaultCheck: TypeCheckResult, right: Expression): TypeCheckResult = {
    if (defaultCheck.isFailure) {
      defaultCheck
    } else if (!right.foldable) {
      DataTypeMismatch(
        errorSubClass = "NON_FOLDABLE_INPUT",
        messageParameters = Map(
          "inputName" -> "lgConfigK",
          "inputType" -> toSQLType(right.dataType),
          "inputExpr" -> toSQLExpr(right)
        )
      )
    } else {
      TypeCheckSuccess
    }
  }

@github-actions github-actions bot added the CORE label May 17, 2023
@RyanBerti
Copy link
Contributor Author

RyanBerti commented May 17, 2023

@dtenedor I just pushed a commit that tries to generalize the foldable check, as I'm seeing duplicate code in the datasketches functions as well as others (see ApproxCountDistinctForIntervals, ApproximatePercentile, CountMinSketchAgg, etc). I'm open to modifying the trait as needed, or reverting the commit and implementing your suggestion.

@dtenedor
Copy link
Contributor

The new trait looks good. In the future we can think about reusing it.

RyanBerti and others added 2 commits May 18, 2023 10:13
…essions/ExpectsInputTypes.scala

Co-authored-by: Gera Shegalov <gshegalov@nvidia.com>
Copy link
Contributor

@gerashegalov gerashegalov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@RyanBerti
Copy link
Contributor Author

@HyukjinKwon could you help us merge this one?

)
checkAnswer(res, Nil)
}
assert(error8.toString contains "UNEXPECTED_INPUT_FOLDABLE_VALUE")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use checkError for checking the error class and other fields, please.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm seeing various other negative test cases in this suite that use assert, and no examples of checkError. Is there a good example of checkError I could reference?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The HiveUDAF suite uses it pretty extensively.

)
checkAnswer(res, Nil)
}
assert(error9.toString contains "UNEXPECTED_INPUT_FOLDABLE_VALUE")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto: Use checkError

)
checkAnswer(res, Nil)
}
assert(error10.toString contains "UNEXPECTED_INPUT_FOLDABLE_VALUE")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, use checkError

@@ -471,6 +471,11 @@
"class <className> not found."
]
},
"UNEXPECTED_INPUT_FOLDABLE_VALUE" : {
"message" : [
"Parameter <paramIndex> requires a foldable value of <requiredFoldable>, however <inputSql> has a foldable value of <inputFoldable>."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From user's point of view, the message is not clear. What's the foldable value of true?

Can't you just re-use the existing error class: NON_FOLDABLE_INPUT?

Copy link
Contributor Author

@RyanBerti RyanBerti May 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @MaxGekk - I introduced the new error message as the text for NON_FOLDABLE_INPUT only makes sense when we expect the input to be foldable, but it is not. I tried to mimic the ExpectsInputTypes.checkInputTypes implementation in ExpectsInputTypesAndFoldable, which checks the foldable attribute of the expression against a user-provided expected value; this allows the trait to check if the foldable value is true OR false, and thus I needed an error message that was more general than NON_FOLDABLE_INPUT. I wonder if we'll ever want to make sure that an input expression isn't foldable (IE for this function arg, you should only ever specify a column)? If not, then maybe I can update the ExpectsInputTypesAndFoldable to only check for cases where we expect an expression to be foldable, and then re-use the NON_FOLDABLE_INPUT error message.

Copy link
Member

@MaxGekk MaxGekk May 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when we expect the input to be foldable, but it is not

I wonder when we need to raise an error when the expected input is any expr (means either foldable or non-foldable ), but it is foldable actually. What's the case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if there's a clear use case where we'd want to check if the input is not foldable, but this implementation + the new error message does provide that level of flexibility.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think non-foldable might be enough, I can't think of a case where we would not want to allow a foldable input. This could also simplify the code a little bit in the checker since Seq[Option[Bool]] could become Seq[Bool]


override def inputTypes: Seq[AbstractDataType] = Seq(BinaryType, BooleanType)

override def inputIsFoldable: Seq[Option[Boolean]] = Seq(None, Some(true))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None means that you don't care of it is foldable or not, correct? You just require to be foldable for the second parameter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's correct.

@RyanBerti
Copy link
Contributor Author

@MaxGekk do you still think this PR needs changes, based on our discussion?

Copy link
Contributor

@holdenk holdenk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM provided the existing points raised but others are addressed.

@@ -471,6 +471,11 @@
"class <className> not found."
]
},
"UNEXPECTED_INPUT_FOLDABLE_VALUE" : {
"message" : [
"Parameter <paramIndex> requires a foldable value of <requiredFoldable>, however <inputSql> has a foldable value of <inputFoldable>."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think non-foldable might be enough, I can't think of a case where we would not want to allow a foldable input. This could also simplify the code a little bit in the checker since Seq[Option[Bool]] could become Seq[Bool]

* a value of None bypasses the check for the given column
*/
trait ExpectsInputTypesAndFoldable extends ExpectsInputTypes {
def inputIsFoldable: Seq[Option[Boolean]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it's a little bike-sheddy so feel free to disregard but I'd have a small preference towards "requireFoldableInputs" or something since as it stands it sounds like this function is returning if the inputs are foldable rather than the required state.

)
checkAnswer(res, Nil)
}
assert(error8.toString contains "UNEXPECTED_INPUT_FOLDABLE_VALUE")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The HiveUDAF suite uses it pretty extensively.

@github-actions
Copy link

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Sep 15, 2023
@github-actions github-actions bot closed this Sep 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants