-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix issues with array_contains and array_overlap with null left side arguments #15974
fix issues with array_contains and array_overlap with null left side arguments #15974
Conversation
|
||
final Object[] array1 = arrayExpr1.asArray(); | ||
final Object[] array2 = arrayExpr2.asArray(); | ||
if (array1 == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So in these 2 cases, if the first argument is null, we follow 3VL and return null. This might not be backward compatible with someone who is already following the older version to return false.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this function already returned null, though in a non-obvious way. Both of these functions were previously extending the ArraysFunction
base class, which is mostly intended for stuff like array_append
and such to do stuff to two array arguments, and has sections like this for when the arguments were null
if (arrayExpr1.asArray() == null) {
return arrayExpr1;
}
but weren't really returning the standardized long boolean type. The other problem with extending this, is that for something like array_contains, that second check for a null argument meant there was no way to check if the null element was present in an array.
So anyway, that is what i meant by 'correctly return null', they return LONG null instead of whatever array type null, so that the return type is always consistently LONG
@@ -83,7 +83,11 @@ public void reduce( | |||
final RexNode literal; | |||
|
|||
if (sqlTypeName == SqlTypeName.BOOLEAN) { | |||
literal = rexBuilder.makeLiteral(exprResult.asBoolean(), constExp.getType(), true); | |||
if (exprResult.valueOrDefault() == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also seems similar to the earlier comment. Apart from array_contains and array_overlap this will also change some other results as well. Should we also think about backward compatibility for all of these or just say that these are the correct results now ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this code impacts things that the planner determines are constants, and it was a bug that it was using exprResult.asBoolean()
without checking if exprResult
was null. This means that there are a handful of things (not just array_contains/array_overlap) where the function behaves differently if it is a constant than if it was a non-constant executed against a real table, because most of the functions in the real table would correctly return null, while as a constant they would not. This was the case for array_contains/array_overlap, evaluating a constant would return false, while evaluating a table would return null, and how i caught this bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to check whether we're in 2VL mode and make a FALSE literal instead of NULL literal in that mode? Or in 2VL mode checked somewhere else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
expressions don't really natively have a 2vl mode other than druid.generic.useDefaultValueForNull=false, so I suppose I could check NullHandling.sqlCompatible()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this would be a behavior change though, since previously this would always return null, it just wasn't a LONG
typed null, it was instead whatever the lhs argument was
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't suggesting we should change behavior, so let's keep it then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh i guess the behavior change would be changing the native array_contains/array_overlap functions to not return null except in some cases.
This part of the code is actually a bug fix, since previously boolean constants weren't properly allowed to be null even though the native function might have been a null result, which I guess is a sort of behavior change though the previous behavior was certainly incorrect because the behavior of constant expressions that were considered boolean didn't match non-constant expressions.
NullHandling.sqlCompatible() | ||
? ImmutableList.of() | ||
: ImmutableList.of(new Object[]{"", 5L}) | ||
ImmutableList.of(new Object[]{NullHandling.defaultStringValue(), 5L}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the expected results in SQL compatible mode were incorrect before in this test because mv_contains
was considered not nullable by the SQL planner so the 'is not true' part was eliminated translating the filter into:
NOT(MV_CONTAINS(lookup(dim1, 'lookyloo'), 'xabc'))
@@ -83,7 +83,11 @@ public void reduce( | |||
final RexNode literal; | |||
|
|||
if (sqlTypeName == SqlTypeName.BOOLEAN) { | |||
literal = rexBuilder.makeLiteral(exprResult.asBoolean(), constExp.getType(), true); | |||
if (exprResult.valueOrDefault() == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to check whether we're in 2VL mode and make a FALSE literal instead of NULL literal in that mode? Or in 2VL mode checked somewhere else?
if (array2 == null) { | ||
return ExprEval.ofLongBoolean(false); | ||
} | ||
final Set<Object> set = array1Type.isPrimitiveArray() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ArrayContains should be an ExprMacro, since the rhs expr is often literal, and we don't want to be rebuilding a set for each row (building sets isn't super-cheap).
for (Object check : array1) { | ||
any |= array2.contains(check); | ||
ExpressionType array1Type = arrayExpr1.asArrayType(); | ||
final Set<Object> set = array1Type.isPrimitiveArray() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment about making this an ExprMacro.
final Object[] array1 = arrayExpr1.asArray(); | ||
final Object[] array2 = arrayExpr2.asArray(); | ||
if (array1 == null) { | ||
return ExprEval.ofLong(null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes me wonder what happens when the rhs array includes null.
Consider MV_CONTAINS(x, ARRAY[NULL, 'abc', 'def'])
, when x
is a MVD where one row contains a single NULL
. Is that equivalent to x IS NULL OR MV_CONTAINS(x, ARRAY['abc', 'def'])
(which returns TRUE
) or is it equivalent to MV_CONTAINS(x, ARRAY['abc', 'def'])
?
For that matter, consider MV_CONTAINS(x, ARRAY['abc', 'def'])
in the same situation. Does it return FALSE
or does it return UNKNOWN
?
Do the answers to these questions depend on how the MV_CONTAINS
is planned— whether it ends up as an array_contains
or a filter or something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
heh, good question. So if I remember correctly, by default for an expression like MV_CONTAINS
the value NULL
would just be NULL
, so this would be checking MV_CONTAINS(NULL, ARRAY[NULL, 'abc', 'def'])
.
This is controlled https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/segment/virtual/ExpressionSelectors.java#L344 where the 'homogenize' flag which translates stuff like NULL
and []
into [NULL]
is only done for scalar expressions which are transformed to apply the expression across the MVD, but since MV_CONTAINS
is array_contains
native expression, this transformation does not happen because we are using the MVD as an array instead.
So, in all of these cases it returns UNKNOWN
when evaluated as an expression.
Looking at MV_CONTAINS as a filter in the sql operator conversions, it should behave the same way except in one case I think due to how we rewrite it into native filters. The case where I think it would be different is MV_CONTAINS(mvd, [NULL])
would rewrite into basically mvd is null
, which would return true instead of unknown. Maybe we should prevent the MV_CONTAINS(mvd, [NULL])
case from being allowed to rewrite in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh to clarify a bit more, MV_CONTAINS(x, ARRAY[NULL, 'abc', 'def'])
gets rewritten at sql planner into x is null AND x = 'abc' AND x = 'def'
if x
is a string (contains is an AND operation, not an OR operation, overlap is the OR operation though we use an IN filter instead). array_contains
/mv_contains
are allowed to check if an array contains a null element.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new version of this patch with mv_harmonize_nulls
helps make this more clear.
* modify singleThreaded stuff to allow optimizing Function similar to how we do for ExprMacro - removed SingleThreadSpecializable in favor of default impl of asSingleThreaded on Expr with clear javadocs that most callers shouldn't be calling it directly and should be using Expr.singleThreaded static method which uses a shuttle and delegates to asSingleThreaded instead * add optimized 'singleThreaded' versions of array_contains and array_overlap * add mv_harmonize_nulls native expression to use with MV_CONTAINS and MV_OVERLAP to allow them to behave consistently with filter rewrites, coercing null and [] into [null] * fix bug with casting rhs argument for native array_contains and array_overlap expressions
...g/apache/druid/sql/calcite/expression/builtin/BaseExpressionDimFilterOperatorConversion.java
Fixed
Show fixed
Hide fixed
...ava/org/apache/druid/sql/calcite/expression/builtin/MultiValueStringOperatorConversions.java
Fixed
Show fixed
Hide fixed
...ava/org/apache/druid/sql/calcite/expression/builtin/MultiValueStringOperatorConversions.java
Fixed
Show fixed
Hide fixed
@@ -3243,6 +3252,62 @@ public Set<Expr> getArrayInputs(List<Expr> args) | |||
} | |||
} | |||
|
|||
class MultiValueStringHarmonizeNullsFunction implements Function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could use a comment as to the purpose of this function, and why it's undocumented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can do, should we document it though? is it useful for people to call or should it remain primarily for internal use?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm my feeling is undocumented for now, I don't really want to have to add SQL bindings and stuff. Unless you have a use case in mind.
final Object[] array1 = arrayExpr1.asArray(); | ||
final Object[] array2 = arrayExpr2.asArray(); | ||
if (array1 == null) { | ||
return ExprEval.ofLong(null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new version of this patch with mv_harmonize_nulls
helps make this more clear.
public ExprEval apply(List<Expr> args, Expr.ObjectBinding bindings) | ||
{ | ||
final ExprEval<?> lhsExpr = args.get(0).eval(bindings); | ||
final Object[] array1 = lhsExpr.asArray(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be great if this didn't need to wrap the lhs
in an array if lhs
is a scalar. Then it would be a better fit for implementing expr version of the scalar IN
filter, like array_contains(scalar_field, array('a', 'b', 'c'))
for scalar_field IN ('a', 'b', 'c')
. The current code would still work, it just has more overhead than necessary.
Could also do this in a follow-up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: array_contains
is not an IN filter, you're thinking of array_overlap
, but i get the idea. I could add a 'asSingleThreaded' optimization for these cases, array_contains
with a scalar field and a rhs array argument with length greater than 1 would be an 'all false unless unknown' optimization since it could not possible contain all of the elements, and array_overlap would be as you suggested.
private static List<DruidExpression> harmonizeNullsMvdArg0OperandList(List<DruidExpression> druidExpressions) | ||
{ | ||
final List<DruidExpression> newArgs; | ||
if (druidExpressions.get(0).isDirectColumnAccess()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not always harmonize the first arg if it's of type VARCHAR
or STRING
? So something like MV_OVERLAP(LOOKUP(mvd, 'lookupTable'), ARRAY['stuff', 'things'])
would still have the first arg harmonized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, good question. I guess I was trying to avoid weird stuff happening in cases where someone was using one of the other MV_ functions which do not coerce arguments like this like MV_APPEND
and such. For scalar function arguments like this, we actually do automatically do this harmonization in the native layer because we do it whenever we need to 'apply' a function across an MVD, of which this is a case of that happening.
However, I guess that does create a difference in behavior if there is a mix segments which have single value strings and multi-value strings, since the single value strings would not need applied on the native layer... (I need to double check this actually, its possible we always apply in this case) I hate mvds so so much 😭
Side note, I hope we can get the higher order function stuff done, because it makes me sad everytime i see a query like this when it really should be MV_OVERLAP(MAP(x -> LOOKUP(x, 'lookupTable'), mvd), ARRAY['stuff', 'things'])
. Granted, that the native layer will automatically apply this transformation (and when it maps like this it does automatically do this harmonization), but i still find it confusing to look at, and if it was written like this, then I think that it would be strange to harmonize the arg0 of MV_OVERLAP
instead of harmonizing the mvd
variable, e.g. MV_OVERLAP(MAP(x -> LOOKUP(x, 'lookupTable'), MV_HAROMONIZE_NULLS(mvd)), ARRAY['stuff', 'things'])
.
So... I guess we can apply it all the time, i'll need to update some other tests, especially some of the tests i have with array columns which are wrapping the array columns with ARRAY_TO_MV
and then testing the MV_
functions, who do correctly distinguish between null
, []
and [null]
but now would no longer at least when using with these functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So... I guess we can apply it all the time, i'll need to update some other tests, especially some of the tests i have with array columns which are wrapping the array columns with
ARRAY_TO_MV
and then testing theMV_
functions, who do correctly distinguish betweennull
,[]
and[null]
but now would no longer at least when using with these functions.
How do you define "correctly" in this sentence? I would think that if you apply ARRAY_TO_MV
to a legit array, it's more correct to harmonize the arrays than to not harmonize them. That makes them behave more like they would if they had been ingested as MVDs, which I think is the point of ARRAY_TO_MV
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I was trying to avoid weird stuff happening in cases where someone was using one of the other MV_ functions which do not coerce arguments like this like
MV_APPEND
and such.
Do you have an example in mind of the weird stuff?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have an example in mind of the weird stuff?
Well, so today MV_CONCAT(mvd, ARRAY['foo'])
plans to array_concat(mvd, array('foo'))
which returns null
instead of [null, 'foo']
if the mvd
row is null
because we don't coerce those values into [null]
.
Functions like this are why we turned off the blanket coercion we were doing in the first place (druid.expressions.homogenizeNullMultiValueStringArrays
) because (at the time at least) it felt too strange for the latter answer to be the behavior, and part of the reasoning for this is because when you do a scan query, which is the only way you can see MVDs as their entire (barring grouping on them with MV_TO_ARRAY) you never see the value as [null]
, rather it is null
because it also uses getObject
.
So if this was the argument and then rewritten into MV_CONTAINS(MV_HAROMONIZE_NULLS(MV_CONCAT(mvd, ['foo'])), 'foo')
at the native expression layer it would be like array_contains([null], 'foo')
which is false, instead of array_contains(null, 'foo')
which is null
.
How do you define "correctly" in this sentence? I would think that if you apply ARRAY_TO_MV to a legit array, it's more correct to harmonize the arrays than to not harmonize them. That makes them behave more like they would if they had been ingested as MVDs, which I think is the point of ARRAY_TO_MV?
I guess there is no such thing as "correctly" because mvds themselves are completely inconsistent. I imagined ARRAY_TO_MV
to primarily be used to get stuff like automatic unnest and other native query mvd behaviors (outside of expressions), and didn't really imagine reducing the fidelity of the underlying arrays to have the same deficiencies as mvds, but I suppose that is another way of thinking about it and be used with native stuff. It gets really weird doing ARRAY_TO_MV
and then using MV_
functions that plan to native expressions (which themselves actually turn the mvds into ARRAY so that the expressions can handle them), so it didn't occur to me that we would want to do coercion because at the native layer ARRAY_TO_MV
is basically cast(array as ARRAY<STRING>)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, given that and given that the native layer harmonizes MVDs when passing them into scalar functions anyway, I suppose it's reasonable to only harmonize direct MVD accesses here. When they're passed into array functions, we probably don't want to harmonize.
@@ -83,7 +83,11 @@ public void reduce( | |||
final RexNode literal; | |||
|
|||
if (sqlTypeName == SqlTypeName.BOOLEAN) { | |||
literal = rexBuilder.makeLiteral(exprResult.asBoolean(), constExp.getType(), true); | |||
if (exprResult.valueOrDefault() == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't suggesting we should change behavior, so let's keep it then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, although I'd appreciate javadocs for mv_harmonize_nulls
. With regard to my other comments, I am ok with the current state of the patch given the discussions.
Description
Fixes a couple of issues with
ARRAY_CONTAINS
andARRAY_OVERLAP
and null handling. First is now if the left side is null the functions will correctly return null. This was also an issue in the SQL planner layer, where boolean typed constants were never created as null literals if the boolean expression was nullable, which has been fixed as well. The type inference of these functions now reflects that a null left side argument makes the function return type nullable.This PR also changes how
MV_CONTAINS
andMV_OVERLAP
are planned into native expressions to behave consistently with the native filter rewrites used, adding a new nativemv_harmonize_nulls
expression which coercesnull
and[]
into[null]
.Finally, i adjusted about how the 'singleThreaded' stuff works with
Expr
to allow built-inFunction
to be able to do some of the same optimizations we do withExprMacro
, such as optimize when an argument is a constant. The change is moving the method to a default implementation onExpr
The real change here is that these methods now accept a
InputBindingInspector
so thatExpr
can analyze inputs such as to compute output type and the like. The javadocs try to be clear that most callers shouldn't be calling this method directly since it only translates a single layer ofExpr
.Function
has a similar method:which
FunctionExpr
takes advantage of:allowing any function to specialize itself before computation. The only ones I've done so far are
array_contains
andarray_overlap
, but i'm sure there are plenty of others that could benefit.It does make an improvement. Given queries:
before:
after:
and feels pretty clean, to optimize
Function
implementations only need to implementasSingleThreaded
and allow it to pick internally defined class that overrides theapply
method.This PR has: