Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix issues with array_contains and array_overlap with null left side arguments #15974

Merged

Conversation

clintropolis
Copy link
Member

@clintropolis clintropolis commented Feb 27, 2024

Description

Fixes a couple of issues with ARRAY_CONTAINS and ARRAY_OVERLAP and null handling. First is now if the left side is null the functions will correctly return null. This was also an issue in the SQL planner layer, where boolean typed constants were never created as null literals if the boolean expression was nullable, which has been fixed as well. The type inference of these functions now reflects that a null left side argument makes the function return type nullable.

This PR also changes how MV_CONTAINS and MV_OVERLAP are planned into native expressions to behave consistently with the native filter rewrites used, adding a new native mv_harmonize_nulls expression which coerces null and [] into [null].

Finally, i adjusted about how the 'singleThreaded' stuff works with Expr to allow built-in Function to be able to do some of the same optimizations we do with ExprMacro, such as optimize when an argument is a constant. The change is moving the method to a default implementation on Expr

  /**
   * Possibly convert the {@link Expr} into an optimized, possibly not thread-safe {@link Expr}. Does not convert
   * child {@link Expr}. Most callers should use {@link Expr#singleThreaded(Expr, InputBindingInspector)} to convert
   * an entire tree, which delegates to this method to translate individual nodes.
   */
  default Expr asSingleThreaded(InputBindingInspector inspector)
  {
    return this;
  }

The real change here is that these methods now accept a InputBindingInspector so that Expr can analyze inputs such as to compute output type and the like. The javadocs try to be clear that most callers shouldn't be calling this method directly since it only translates a single layer of Expr.

Function has a similar method:

  /**
   * Possibly convert a {@link Function} into an optimized, possibly not thread-safe {@link Function}.
   */
  default Function asSingleThreaded(List<Expr> args, Expr.InputBindingInspector inspector)
  {
    return this;
  }

which FunctionExpr takes advantage of:

  @Override
  public Expr asSingleThreaded(InputBindingInspector inspector)
  {
    return new FunctionExpr(
        function.asSingleThreaded(args, inspector),
        name,
        args
    );
  }

allowing any function to specialize itself before computation. The only ones I've done so far are array_contains and array_overlap, but i'm sure there are plenty of others that could benefit.

It does make an improvement. Given queries:

      "SELECT ARRAY_CONTAINS(\"multi-string3\", 100) FROM foo",
      "SELECT ARRAY_CONTAINS(\"multi-string3\", ARRAY[1, 2, 10, 11, 20, 22, 30, 33, 40, 44, 50, 55, 100]) FROM foo",
      "SELECT ARRAY_OVERLAP(\"multi-string3\", ARRAY[1, 100]) FROM foo",
      "SELECT ARRAY_OVERLAP(\"multi-string3\", ARRAY[1, 2, 10, 11, 20, 22, 30, 33, 40, 44, 50, 55, 100]) FROM foo"

before:

Benchmark                        (query)  (rowsPerSegment)  (schema)  (vectorize)  Mode  Cnt     Score     Error  Units
SqlExpressionBenchmark.querySql       42           5000000  explicit        false  avgt    5  4832.125 ± 106.946  ms/op
SqlExpressionBenchmark.querySql       43           5000000  explicit        false  avgt    5  6263.339 ± 210.716  ms/op
SqlExpressionBenchmark.querySql       44           5000000  explicit        false  avgt    5  4470.768 ±  55.625  ms/op
SqlExpressionBenchmark.querySql       45           5000000  explicit        false  avgt    5  5476.418 ± 145.618  ms/op

after:

Benchmark                        (query)  (rowsPerSegment)  (schema)  (vectorize)  Mode  Cnt     Score     Error  Units
SqlExpressionBenchmark.querySql       42           5000000  explicit        false  avgt    5  4740.515 ±  82.529  ms/op
SqlExpressionBenchmark.querySql       43           5000000  explicit        false  avgt    5  5035.238 ±  99.058  ms/op
SqlExpressionBenchmark.querySql       44           5000000  explicit        false  avgt    5  4450.752 ±  65.535  ms/op
SqlExpressionBenchmark.querySql       45           5000000  explicit        false  avgt    5  4726.881 ±  76.610  ms/op

and feels pretty clean, to optimize Function implementations only need to implement asSingleThreaded and allow it to pick internally defined class that overrides the apply method.


This PR has:

  • been self-reviewed.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.


final Object[] array1 = arrayExpr1.asArray();
final Object[] array2 = arrayExpr2.asArray();
if (array1 == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in these 2 cases, if the first argument is null, we follow 3VL and return null. This might not be backward compatible with someone who is already following the older version to return false.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function already returned null, though in a non-obvious way. Both of these functions were previously extending the ArraysFunction base class, which is mostly intended for stuff like array_append and such to do stuff to two array arguments, and has sections like this for when the arguments were null

if (arrayExpr1.asArray() == null) {
  return arrayExpr1;
}

but weren't really returning the standardized long boolean type. The other problem with extending this, is that for something like array_contains, that second check for a null argument meant there was no way to check if the null element was present in an array.

So anyway, that is what i meant by 'correctly return null', they return LONG null instead of whatever array type null, so that the return type is always consistently LONG

@@ -83,7 +83,11 @@ public void reduce(
final RexNode literal;

if (sqlTypeName == SqlTypeName.BOOLEAN) {
literal = rexBuilder.makeLiteral(exprResult.asBoolean(), constExp.getType(), true);
if (exprResult.valueOrDefault() == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also seems similar to the earlier comment. Apart from array_contains and array_overlap this will also change some other results as well. Should we also think about backward compatibility for all of these or just say that these are the correct results now ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code impacts things that the planner determines are constants, and it was a bug that it was using exprResult.asBoolean() without checking if exprResult was null. This means that there are a handful of things (not just array_contains/array_overlap) where the function behaves differently if it is a constant than if it was a non-constant executed against a real table, because most of the functions in the real table would correctly return null, while as a constant they would not. This was the case for array_contains/array_overlap, evaluating a constant would return false, while evaluating a table would return null, and how i caught this bug.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to check whether we're in 2VL mode and make a FALSE literal instead of NULL literal in that mode? Or in 2VL mode checked somewhere else?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expressions don't really natively have a 2vl mode other than druid.generic.useDefaultValueForNull=false, so I suppose I could check NullHandling.sqlCompatible()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would be a behavior change though, since previously this would always return null, it just wasn't a LONG typed null, it was instead whatever the lhs argument was

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't suggesting we should change behavior, so let's keep it then.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh i guess the behavior change would be changing the native array_contains/array_overlap functions to not return null except in some cases.

This part of the code is actually a bug fix, since previously boolean constants weren't properly allowed to be null even though the native function might have been a null result, which I guess is a sort of behavior change though the previous behavior was certainly incorrect because the behavior of constant expressions that were considered boolean didn't match non-constant expressions.

NullHandling.sqlCompatible()
? ImmutableList.of()
: ImmutableList.of(new Object[]{"", 5L})
ImmutableList.of(new Object[]{NullHandling.defaultStringValue(), 5L})
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the expected results in SQL compatible mode were incorrect before in this test because mv_contains was considered not nullable by the SQL planner so the 'is not true' part was eliminated translating the filter into:

NOT(MV_CONTAINS(lookup(dim1, 'lookyloo'), 'xabc'))

@@ -83,7 +83,11 @@ public void reduce(
final RexNode literal;

if (sqlTypeName == SqlTypeName.BOOLEAN) {
literal = rexBuilder.makeLiteral(exprResult.asBoolean(), constExp.getType(), true);
if (exprResult.valueOrDefault() == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to check whether we're in 2VL mode and make a FALSE literal instead of NULL literal in that mode? Or in 2VL mode checked somewhere else?

if (array2 == null) {
return ExprEval.ofLongBoolean(false);
}
final Set<Object> set = array1Type.isPrimitiveArray()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ArrayContains should be an ExprMacro, since the rhs expr is often literal, and we don't want to be rebuilding a set for each row (building sets isn't super-cheap).

for (Object check : array1) {
any |= array2.contains(check);
ExpressionType array1Type = arrayExpr1.asArrayType();
final Set<Object> set = array1Type.isPrimitiveArray()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment about making this an ExprMacro.

final Object[] array1 = arrayExpr1.asArray();
final Object[] array2 = arrayExpr2.asArray();
if (array1 == null) {
return ExprEval.ofLong(null);
Copy link
Contributor

@gianm gianm Mar 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes me wonder what happens when the rhs array includes null.

Consider MV_CONTAINS(x, ARRAY[NULL, 'abc', 'def']), when x is a MVD where one row contains a single NULL. Is that equivalent to x IS NULL OR MV_CONTAINS(x, ARRAY['abc', 'def']) (which returns TRUE) or is it equivalent to MV_CONTAINS(x, ARRAY['abc', 'def'])?

For that matter, consider MV_CONTAINS(x, ARRAY['abc', 'def']) in the same situation. Does it return FALSE or does it return UNKNOWN?

Do the answers to these questions depend on how the MV_CONTAINS is planned— whether it ends up as an array_contains or a filter or something else?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

heh, good question. So if I remember correctly, by default for an expression like MV_CONTAINS the value NULL would just be NULL, so this would be checking MV_CONTAINS(NULL, ARRAY[NULL, 'abc', 'def']).

This is controlled https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/segment/virtual/ExpressionSelectors.java#L344 where the 'homogenize' flag which translates stuff like NULL and [] into [NULL] is only done for scalar expressions which are transformed to apply the expression across the MVD, but since MV_CONTAINS is array_contains native expression, this transformation does not happen because we are using the MVD as an array instead.

So, in all of these cases it returns UNKNOWN when evaluated as an expression.

Looking at MV_CONTAINS as a filter in the sql operator conversions, it should behave the same way except in one case I think due to how we rewrite it into native filters. The case where I think it would be different is MV_CONTAINS(mvd, [NULL]) would rewrite into basically mvd is null, which would return true instead of unknown. Maybe we should prevent the MV_CONTAINS(mvd, [NULL]) case from being allowed to rewrite in this case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh to clarify a bit more, MV_CONTAINS(x, ARRAY[NULL, 'abc', 'def']) gets rewritten at sql planner into x is null AND x = 'abc' AND x = 'def' if x is a string (contains is an AND operation, not an OR operation, overlap is the OR operation though we use an IN filter instead). array_contains/mv_contains are allowed to check if an array contains a null element.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new version of this patch with mv_harmonize_nulls helps make this more clear.

* modify singleThreaded stuff to allow optimizing Function similar to how we do for ExprMacro - removed SingleThreadSpecializable in favor of default impl of asSingleThreaded on Expr with clear javadocs that most callers shouldn't be calling it directly and should be using Expr.singleThreaded static method which uses a shuttle and delegates to asSingleThreaded instead
* add optimized 'singleThreaded' versions of array_contains and array_overlap
* add mv_harmonize_nulls native expression to use with MV_CONTAINS and MV_OVERLAP to allow them to behave consistently with filter rewrites, coercing null and [] into [null]
* fix bug with casting rhs argument for native array_contains and array_overlap expressions
@@ -3243,6 +3252,62 @@ public Set<Expr> getArrayInputs(List<Expr> args)
}
}

class MultiValueStringHarmonizeNullsFunction implements Function
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use a comment as to the purpose of this function, and why it's undocumented.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can do, should we document it though? is it useful for people to call or should it remain primarily for internal use?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm my feeling is undocumented for now, I don't really want to have to add SQL bindings and stuff. Unless you have a use case in mind.

final Object[] array1 = arrayExpr1.asArray();
final Object[] array2 = arrayExpr2.asArray();
if (array1 == null) {
return ExprEval.ofLong(null);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new version of this patch with mv_harmonize_nulls helps make this more clear.

public ExprEval apply(List<Expr> args, Expr.ObjectBinding bindings)
{
final ExprEval<?> lhsExpr = args.get(0).eval(bindings);
final Object[] array1 = lhsExpr.asArray();
Copy link
Contributor

@gianm gianm Mar 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if this didn't need to wrap the lhs in an array if lhs is a scalar. Then it would be a better fit for implementing expr version of the scalar IN filter, like array_contains(scalar_field, array('a', 'b', 'c')) for scalar_field IN ('a', 'b', 'c'). The current code would still work, it just has more overhead than necessary.

Could also do this in a follow-up.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: array_contains is not an IN filter, you're thinking of array_overlap, but i get the idea. I could add a 'asSingleThreaded' optimization for these cases, array_contains with a scalar field and a rhs array argument with length greater than 1 would be an 'all false unless unknown' optimization since it could not possible contain all of the elements, and array_overlap would be as you suggested.

private static List<DruidExpression> harmonizeNullsMvdArg0OperandList(List<DruidExpression> druidExpressions)
{
final List<DruidExpression> newArgs;
if (druidExpressions.get(0).isDirectColumnAccess()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not always harmonize the first arg if it's of type VARCHAR or STRING? So something like MV_OVERLAP(LOOKUP(mvd, 'lookupTable'), ARRAY['stuff', 'things']) would still have the first arg harmonized.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, good question. I guess I was trying to avoid weird stuff happening in cases where someone was using one of the other MV_ functions which do not coerce arguments like this like MV_APPEND and such. For scalar function arguments like this, we actually do automatically do this harmonization in the native layer because we do it whenever we need to 'apply' a function across an MVD, of which this is a case of that happening.

However, I guess that does create a difference in behavior if there is a mix segments which have single value strings and multi-value strings, since the single value strings would not need applied on the native layer... (I need to double check this actually, its possible we always apply in this case) I hate mvds so so much 😭

Side note, I hope we can get the higher order function stuff done, because it makes me sad everytime i see a query like this when it really should be MV_OVERLAP(MAP(x -> LOOKUP(x, 'lookupTable'), mvd), ARRAY['stuff', 'things']). Granted, that the native layer will automatically apply this transformation (and when it maps like this it does automatically do this harmonization), but i still find it confusing to look at, and if it was written like this, then I think that it would be strange to harmonize the arg0 of MV_OVERLAP instead of harmonizing the mvd variable, e.g. MV_OVERLAP(MAP(x -> LOOKUP(x, 'lookupTable'), MV_HAROMONIZE_NULLS(mvd)), ARRAY['stuff', 'things']).

So... I guess we can apply it all the time, i'll need to update some other tests, especially some of the tests i have with array columns which are wrapping the array columns with ARRAY_TO_MV and then testing the MV_ functions, who do correctly distinguish between null, [] and [null] but now would no longer at least when using with these functions.

Copy link
Contributor

@gianm gianm Mar 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So... I guess we can apply it all the time, i'll need to update some other tests, especially some of the tests i have with array columns which are wrapping the array columns with ARRAY_TO_MV and then testing the MV_ functions, who do correctly distinguish between null, [] and [null] but now would no longer at least when using with these functions.

How do you define "correctly" in this sentence? I would think that if you apply ARRAY_TO_MV to a legit array, it's more correct to harmonize the arrays than to not harmonize them. That makes them behave more like they would if they had been ingested as MVDs, which I think is the point of ARRAY_TO_MV?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I was trying to avoid weird stuff happening in cases where someone was using one of the other MV_ functions which do not coerce arguments like this like MV_APPEND and such.

Do you have an example in mind of the weird stuff?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have an example in mind of the weird stuff?

Well, so today MV_CONCAT(mvd, ARRAY['foo']) plans to array_concat(mvd, array('foo')) which returns null instead of [null, 'foo'] if the mvd row is null because we don't coerce those values into [null].

Functions like this are why we turned off the blanket coercion we were doing in the first place (druid.expressions.homogenizeNullMultiValueStringArrays) because (at the time at least) it felt too strange for the latter answer to be the behavior, and part of the reasoning for this is because when you do a scan query, which is the only way you can see MVDs as their entire (barring grouping on them with MV_TO_ARRAY) you never see the value as [null], rather it is null because it also uses getObject.

So if this was the argument and then rewritten into MV_CONTAINS(MV_HAROMONIZE_NULLS(MV_CONCAT(mvd, ['foo'])), 'foo') at the native expression layer it would be like array_contains([null], 'foo') which is false, instead of array_contains(null, 'foo') which is null.

How do you define "correctly" in this sentence? I would think that if you apply ARRAY_TO_MV to a legit array, it's more correct to harmonize the arrays than to not harmonize them. That makes them behave more like they would if they had been ingested as MVDs, which I think is the point of ARRAY_TO_MV?

I guess there is no such thing as "correctly" because mvds themselves are completely inconsistent. I imagined ARRAY_TO_MV to primarily be used to get stuff like automatic unnest and other native query mvd behaviors (outside of expressions), and didn't really imagine reducing the fidelity of the underlying arrays to have the same deficiencies as mvds, but I suppose that is another way of thinking about it and be used with native stuff. It gets really weird doing ARRAY_TO_MV and then using MV_ functions that plan to native expressions (which themselves actually turn the mvds into ARRAY so that the expressions can handle them), so it didn't occur to me that we would want to do coercion because at the native layer ARRAY_TO_MV is basically cast(array as ARRAY<STRING>).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, given that and given that the native layer harmonizes MVDs when passing them into scalar functions anyway, I suppose it's reasonable to only harmonize direct MVD accesses here. When they're passed into array functions, we probably don't want to harmonize.

@@ -83,7 +83,11 @@ public void reduce(
final RexNode literal;

if (sqlTypeName == SqlTypeName.BOOLEAN) {
literal = rexBuilder.makeLiteral(exprResult.asBoolean(), constExp.getType(), true);
if (exprResult.valueOrDefault() == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't suggesting we should change behavior, so let's keep it then.

Copy link
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, although I'd appreciate javadocs for mv_harmonize_nulls. With regard to my other comments, I am ok with the current state of the patch given the discussions.

@clintropolis clintropolis merged commit dd9bc37 into apache:master Mar 14, 2024
87 checks passed
@clintropolis clintropolis deleted the fix-array-contains-overlap-nulls branch March 14, 2024 01:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants