expression filter support for vectorized query engines #10613

clintropolis · 2020-11-28T00:20:24Z

Description

This PR adds support for ExpressionFilter in vectorized query engines, and expands vectorization support for conditional expressions to support string typed columns.

The signature of the Filter interface method which checks if a matcher can be vectorized has been updated to accept a ColumnInspector, so that the ExpressionVirtualColumn.canVectorize method can do its thing to determine if the underlying expression can be vectorized:

  default boolean canVectorizeMatcher(ColumnInspector inspector)
  {
    return false;
  }

Expression filters on string typed expressions use a new ObjectVectorValueMatcher (since string expressions most naturally use a VectorObjectSelector instead of the dictionary encoded vector dimension selectors), and DruidPredicateFactory has been updated to include a method to make an object matching predicate to work with that.

  default Predicate<Object> makeObjectPredicate()
  {
    // default to try to use string predicate;
    final Predicate<String> stringPredicate = makeStringPredicate();
    return o -> stringPredicate.apply((String) o);
  }

Similarly, VectorColumnProcessorFactory has been expanded with a new method to allow constructing these VectorObjectSelector based matchers:

  T makeObjectProcessor(ColumnCapabilities capabilities, VectorObjectSelector selector);

I imagine that this potentially opens the door to someday having filters that could support any of the complex typed columns we support as inputs as well using the predicate factory pattern, though I haven't really spent time thinking of any use cases for this yet.

Finally, this PR fixes a bug when using a vectorizable string typed expression virtual column as an input to a filter, where the virtual column would report itself as vectorizable, but then not have implemented a single value dictionary selector. Instead, these non-dictionary string columns will now use the string object selector and should work correctly now. I've added additional virtual column inputs to filter tests to cover this case as well.

This PR has:

been self-reviewed.
added documentation for new or modified features or behaviors.
added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
added integration tests.
been tested in a test Druid cluster.

Key changed/added classes in this PR

Filter
ExpressionFilter
VectorComparisonProcessors
BivariateFunctionVectorObjectProcessor
VectorValueMatcherFactory
ObjectVectorValueMatcher

gianm · 2021-01-08T07:28:35Z

core/src/main/java/org/apache/druid/math/expr/BinaryLogicalOperatorExpr.java

@@ -74,7 +74,7 @@ public ExprType getOutputType(InputBindingInspector inspector)
  @Override
  public boolean canVectorize(InputBindingInspector inspector)
  {
-    return inspector.areNumeric(left, right) && inspector.canVectorize(left, right);


This opens not just STRING but also LONG_ARRAY and DOUBLE_ARRAY. Is that intentional (will numeric arrays work)?

i think canVectorize will still end up false because array types/functions are not yet vectorizable, but yeah i guess array types would not be well handled here even if they were, though maybe that is the case for non-vectorized expression as well. I'll do a bit of exploration and see what up

Did this exploration lead somewhere?

array types do not currently implement canVectorize so will return the default false. We probably should not use canVectorize for argument validation and maybe should consider doing this explicitly when we can resolve the input types I think.

Got it. IMO we shouldn't rely on the fact that the array types happen to not be vectorizable. If they suddenly became vectorizable then this method would start returning true, maybe improperly.

like, I agree here, but canVectorize still feels wrong to saddle with type validation, it can explode when creating the processor if the types are nonsense

I think what might be best is to introduce a 'validate' method to expressions so that arguments can be checked up front, and maybe explode if the input is madness

In that case, could you add a comment about the situation?

gianm · 2021-01-08T07:32:04Z

core/src/main/java/org/apache/druid/math/expr/ExprTypeConversion.java


 import javax.annotation.Nullable;
 import java.util.List;

 public class ExprTypeConversion
 {
+  public static Pair<ExprType, ExprType> coerceNull(Expr.InputBindingInspector inspector, Expr left, Expr right)


This method deserves a javadoc.

i did one better and just removed it. null literals output type is now null instead of their type, so that they behave and are handled like missing columns (since null doesn't have a type)

gianm · 2021-01-08T07:35:42Z

core/src/main/java/org/apache/druid/math/expr/IdentifierExpr.java

        {
-          return new ExprEvalDoubleVector(bindings.getDoubleVector(binding), bindings.getNullVector(binding));
+          return new ExprEvalStringVector(Arrays.stream(bindings.getObjectVector(binding)).map(x -> (String) x).toArray(String[]::new));


What guarantees that the objects in the vector are Strings?

not sure why i had this code like this since in this case it is getting back nulls... it can just call bindings.getObjectVector. changed

oops, i lied, it needs to cast because non-existent column gives back Object[] of nulls instead of String[] of nulls

gianm · 2021-01-08T07:39:34Z

processing/src/main/java/org/apache/druid/query/dimension/DimensionSpec.java

@@ -70,6 +71,11 @@ default MultiValueDimensionVectorSelector decorate(MultiValueDimensionVectorSele
    throw new UOE("DimensionSpec[%s] cannot vectorize", getClass().getName());
  }

+  default VectorObjectSelector decorate(VectorObjectSelector selector)


This method should have a javadoc. The others should too, and they don't, but don't let that prevent you from adding one here 🙂

As I type this, it's not clear to me what this method does, but hopefully it becomes clear elsewhere in the patch.

This is used with the poorly named VectorColumnSelectorFactory.makeStringObjectSelector, which should be renamed VectorColumnSelectorFactory.makeDimensionObjectSelector or something, which is for doing dimensiony things with a vector object selector similar to what is done for SingleValueDimensionVectorSelector or MultiValueDimensionVectorSelector. There are currently no implementations of this, like is the case with the other decorate methods with vector selectors, i was just trying for symmetry here so that the object selector could behave similar to the other string selectors when used in this context

I think this makes sense, so if we're going to keep this method, that'd all be good stuff to add to the javadoc. This method does need one because it's not obvious from the signature how it works, what its preconditions are, and what kinds of objects will come out of the selector it returns.

Will the VectorObjectSelector passed to this method always return an array whose values are all Strings? And is the same thing expected of the returned VectorObjectSelector? If so, please add that to the javadoc. If not, then I must be missing something and could you please explain it to me 🙂

By the way, what happens if the input is multi-value? Will some of the entries in the Object[] of either the input or the output VectorObjectSelector actually be String[] instead of String? Or is multi-value not supported today?

I haven't yet implemented support for multi-value string object selectors, but your question also seems applicable if we were to add array types too. I guess the contents of the array would be dependent on whatever the underlying selector is providing, and it would be the responsibility of the DimensionSpec implementation to handle decoration appropriately? I guess we should also probably consider adding methods for the primitive numeric columns as well, since they are not masqueraded as "dimension selectors" in the vectorized engine, (if we ever want to fully support all of the non-vectorized behavior here I think).

I don't know, the more I think about this, do we really need DimensionSpec, like at all? The stuff it does is all over the place, and it feels like maybe this type of stuff should just be pushed down into either expression virtual columns or something more specialized as appropriate (more generic predicate filtered selectors virtual column?). Then we maybe we can just ditch this stuff entirely in some sunny future instead of implementing vectorized versions of decorate for the various DimensionSpec implementations. Or is there an advantage to DimensionSpec that I am missing?

I don't know, the more I think about this, do we really need DimensionSpec, like at all?

I don't think so, because the stuff it does could be done by virtual columns. Removing them would be a big backwards-incompatibility, but we don't necessarily have to remove them; we could just make them not-vectorizable and add notes in the docs that they are a legacy thing.

Today the only DimensionSpec that is vectorizable is DefaultDimensionSpec (the one that does nothing other than pass-through), so nothing special is needed to support it.

I've removed the decorate method I added in this PR, in favor of removing this functionality from the vectorized engine. Since no decorate, no need for a separate "object dimension selector", I've removed that from VectorColumnSelectorFactory as well and reverted the signatures changes to processors that were needed to support that 👍

If we do the same thing and ditch decorate for SingleValueDimensionVectorSelector and MultiValueDimensionVectorSelector there is quite a bit of simplification we can do, because we can discard the DimensionSpec much earlier and just work with columnName like the other selectors instead of going all the way down like it does now.

gianm · 2021-01-08T07:40:34Z

processing/src/main/java/org/apache/druid/query/filter/DruidPredicateFactory.java

@@ -27,6 +27,13 @@
 {
  Predicate<String> makeStringPredicate();

+  default Predicate<Object> makeObjectPredicate()
+  {
+    // default to try to use string predicate;


"try" makes it sound like this could potentially not work out. Why might it not work out? What bad stuff will happen if it doesn't work out?

currently I don't think it should be possible because only expression filter will make a matcher factory with object matching predicate and only for string column types, but i guess cast exceptions if an object selector of a complex column somehow got in here from future code, see other comment about thinking a bit further about this 😅

(see next comment #10613 (comment))

gianm · 2021-01-08T07:45:34Z

processing/src/main/java/org/apache/druid/segment/DimensionHandlerUtils.java

+      if (!forceSingleValue) {
+        // if column is not dictionary encoded (and not non-existent or complex), use object selector
+        if (effectiveCapabilites.isDictionaryEncoded().isFalse() ||
+            effectiveCapabilites.areDictionaryValuesUnique().isFalse()


Why does it matter if the dictionary values are unique? (Even if they're not, can't we still use a regular string selector?)

by regular string selector do you mean dimension selector? The answer is yes we could, but if the dictionary values are not unique, i came to the conclusion that they aren't especially useful since in all the cases I can think of you can't use them as is and you're going to have to lookup the string values anyway (e.g. filter matching and vector group by).

I was trying to provide an avenue to avoid the extra steps with expression virtual columns, and allow an object selector to be created directly since the string is the useable part in this case, and the dictionary ids just feel like an obstruction. But maybe this is overly broad and there are cases where we would want it to act like a dictionary encoded column for some reason that I have missed?

I think your reasoning makes sense, and it sounds like your assumption is that when selecting string-typed things that aren't dictionary-encoded, we should prefer the object selector vs. one of the DimensionVectorSelectors, because that'll incur less overhead.

This logic should be spelled out somewhere though. IMO javadocs on the appropriate methods in VectorColumnSelectorFactory is the best place. They should say when callers should prefer using the object selector vs. the more specifically-typed selectors.

I think I've got this covered better now

gianm · 2021-01-08T08:06:36Z

processing/src/main/java/org/apache/druid/segment/filter/ExpressionFilter.java

+      }
+
+      // The hashcode and equals are to make SubclassesMustOverrideEqualsAndHashCodeTest stop complaining..
+      // DruidPredicateFactory doesn't really need equals or hashcode, in fact only the 'toString' method is called


Interesting. It sounds like a bug that DruidPredicateFactory is annotated with @SubclassesMustOverrideEqualsAndHashCodeTest but yet its toString method is used for equality checks. Is that right? If so, perhaps a good fix would be:

Keep the annotation on DimensionPredicateFilter

Switch the equality checks in DimensionPredicateFilter to use DruidPredicateFactory equals rather than toString

If we do that, IMO it'd be good to do it separately from this patch and also separately from #8256.

DruidPredicateFactory doesn't really need equals or hashcode

How else would DimensionPredicateFilter's equals be implemented?

yeah, I was planning on addressing this in a separate PR, the comments are partially to help remind us to do it in case I didn't get to it immediately

👍 although I'm still wondering: how else would DimensionPredicateFilter's equals be implemented, if not via equals and hashCode on DruidPredicateFactory?

(Assuming we stop it from using toString, which IMO we should)

oops, I forgot to rework this comment, it is assuming that it keeps using toString. If we changed that to not use the string for its equality then it would be the toString that isn't necessary, will try to rework

changed comment to suggest DimensionPredicateFilter should use equals of predicate factory instead of string form

gianm · 2021-01-08T08:08:15Z

processing/src/main/java/org/apache/druid/segment/filter/ExpressionFilter.java

+      }
+      // in default mode, just fallback to using a long matcher since nearly all boolean-ish expressions
+      // output a long value so it is probably a safe bet? idk, ending up here by using all null-ish things
+      // in default mode is dancing on the edge of insanity anyway...


Love the comment but the patch shouldn't go live like this 🙂

Can you come up with a scenario where we'll actually hit this case and add some tests for it? Maybe in the process of doing that, a better way to behave or a better way to describe the behavior will come up.

I reworked this to be cooler I think, since a vectorizable expression that produces a 'null' output type means all inputs are null constants or missing columns, so the function is a constant function, we can just evaluate the expression and make a constant matcher of the correct type instead of just going with something that worked in most cases by luck.

gianm · 2021-01-08T08:16:41Z

processing/src/main/java/org/apache/druid/segment/filter/ExpressionFilter.java

+      case STRING:
+        return VectorValueMatcherColumnProcessorFactory.instance().makeObjectProcessor(
+            // using 'numeric' capabilities creator so we are configured to NOT be dictionary encoded, etc
+            ColumnCapabilitiesImpl.createSimpleNumericColumnCapabilities(ValueType.STRING),


createSimpleNumericColumnCapabilities is a weird method to call with type STRING. Is this a mistake? Or is the method just named poorly?

it is strange, but on purpose for the reason the comment says since the numeric capabilities defaults are correct in this case. I will make sure all current users are numeric types, and if so then just make these capabilites from scratch because we should be counting on something geared towards numeric behavior in case those change in some unexpected way. Or, if it is used by other non-numbers then rename it to something more appropriate, like createSimpleSingleValueColumnCapabilites or similar

I think one of those two options you mentioned would be good. IMO we shouldn't call a function named createSimpleNumericColumnCapabilities in this context, even if it happens to do the right thing, because it "looks wrong" and that trips people up that are trying to read through it.

added a createSimpleStringColumnCapabilities

gianm · 2021-01-08T08:21:00Z

processing/src/main/java/org/apache/druid/segment/vector/VectorColumnSelectorFactory.java

@@ -57,6 +57,15 @@ default int getMaxVectorSize()
   */
  MultiValueDimensionVectorSelector makeMultiValueDimensionSelector(DimensionSpec dimensionSpec);

+  /**
+   * Returns an object selector for string columns, useful for non-dictionary encoded strings, or when


What does it do when called on a non-string column? Cast the stuff to strings? Return null? Something else?

hmm, i have no idea why I named this function like this, it should probably be called makeDimensionObjectSelector. Its just expected to make a VectorObjectSelector from a DimensionSpec rather than a raw column name

The javadoc here should be fleshed out to describe the expected contract. IMO it should answer questions like:

If you call this with a DimensionSpec that refers to a non-string column, what happens?

Will the returned VectorObjectSelector always return a String or null?

…might it return List for a multi-value string column?

I think javadocs are maybe cooler now in describing the intent of these methods

gianm · 2021-02-25T16:20:18Z

processing/src/main/java/org/apache/druid/query/filter/DruidPredicateFactory.java

+   * behavior with non-vectorized matchers which use a string predicate with null inputs for these 'nil' matchers,
+   * do the same thing here...
+   */
+  default Predicate<Object> makeObjectPredicate()


Based on #10613 (comment) it seems like the intent of this method is that it will only be called for complex types (or maybe nonexistent columns too). If that's the case, please add some javadoc explaining that this method will only be called for complex types. It's analogous to "makeObjectProcessor" in "VectorColumnProcessorFactory". (You could even @see to that method in the javadocs if you want.)

adjusted javadocs

gianm · 2021-02-25T16:36:49Z

processing/src/main/java/org/apache/druid/segment/vector/VectorColumnSelectorFactory.java

+   * isn't useful. Right now, this should probably only be called on single valued STRING inputs (multi-value STRING
+   * vector object selector is not yet implemented), but perhaps a world can be imagined where this could work with
+   * COMPLEX too perhaps if it were useful for them to interact with {@link DimensionSpec} in some manner? May be called
+   * for non-existent columns.


These javadocs should explicitly constrain the type of Object that may show up in the returned VectorObjectSelector. Is it always going to be a String or a null? Might it be String[], if the underlying column is multi-value?

gianm · 2021-02-25T16:37:33Z

processing/src/main/java/org/apache/druid/segment/vector/VectorColumnSelectorFactory.java

   *
   * If you need to write code that adapts to different input types, you should write a
   * {@link org.apache.druid.segment.VectorColumnProcessorFactory} and use one of the
   * {@link org.apache.druid.segment.ColumnProcessors#makeVectorProcessor} functions instead of using this method.
   */
  MultiValueDimensionVectorSelector makeMultiValueDimensionSelector(DimensionSpec dimensionSpec);

+  /**
+   * Returns a dimension object selector. Useful for non-dictionary encoded strings, or when using the dictionary
+   * isn't useful. Right now, this should probably only be called on single valued STRING inputs (multi-value STRING


Get rid of the "probably" 🙂

Either it will work or it won't. If it doesn't work, say firmly not to do it, and ideally describe what will happen if you do it accidentally (the method will throw an error? the method will return a selector with undefined behavior? etc). If it does work, but might be a bad idea for some reason, spell that out.

i got rid of this method, and merged these javadocs into the makeObjectSelector javadocs

gianm · 2021-02-25T16:41:31Z

processing/src/main/java/org/apache/druid/segment/column/ColumnCapabilitiesImpl.java

+  public static ColumnCapabilitiesImpl createSimpleStringColumnCapabilities()
+  {
+    return new ColumnCapabilitiesImpl().setType(ValueType.STRING)
+                                       .setHasMultipleValues(false)


Is it ok that hasMultipleValues is false here? That seems like the potentially-sketchiest of the behaviors, since someone might use this method for a multi-value string accidentally. How about naming it createSimpleSingleValueStringColumnCapabilities?

Method names: the longer the better. Go, Java.

gianm · 2021-02-25T16:46:15Z

Thanks for your patience, @clintropolis. Most of the remaining comments I have on the code are about clarifying the method contracts properly.

At this point I wanted to ask about the testing. I haven't looked at it much, and I would deeply appreciate it if you could describe briefly what the testing strategy is for this new feature, and why you think it covers all the important cases. That'll make reviewing it a lot quicker.

clintropolis · 2021-02-27T22:52:10Z

At this point I wanted to ask about the testing. I haven't looked at it much, and I would deeply appreciate it if you could describe briefly what the testing strategy is for this new feature, and why you think it covers all the important cases. That'll make reviewing it a lot quicker.

I guess there are 2 main areas of changes, some modifications in druid-core at the Expr layer so that logical operations can process single value string columns, and in druid-processing vector processors and filter matcher stuffs.

The Expr changes are mostly covered by VectorExprSanityTest, which does a sort of brute-force comparision of vectorized and non-vectorized expressions and ensures that the outputs are identical, but I added a few other odds and ends to cover the changes to null constants.

For the filter changes, most of the coverage is provided by the tests that extend BaseFilterTest, which itself has been improved to include testing matches against a virtual column (which was missing before, and serves as coverage for the bug for vectorized filter matching on virtual columns which is fixed by this PR), with new test cases added to SelectorFilterTest and BoundFilterTest to cover this and new functionality. In ExpressionFilterTest, every vectorizable expression is now using assertFilterMatches instead of assertFilterMatchesSkipVectorize, but there are a couple of stragglers such as the like expression which are not vectorizable yet still using the latter method. A couple calcite tests which previously could not vectorize are now vectorizable as well.

The only stuff without great coverage as far as I can tell from the report is the changes vector processor stuff with group by engine, but that will be covered in a follow-up PR that adds the vectorized expression support to vector group by (dictionary building vectorized group by), and the decorate methods in DimensionSpec which i think is expected since nothing implements them (or the other vectorized decorate methods for that matter)

gianm

LGTM, thanks @clintropolis

expression filter support for vectorized query engines

bdee07d

clintropolis added Performance Area - Querying labels Nov 28, 2020

remove unused codes

b53ae6e

clintropolis added the WIP label Nov 29, 2020

clintropolis added 3 commits November 28, 2020 23:18

more tests

14c83db

refactor, more tests

3737851

suppress

c6cbe3f

clintropolis added Bug and removed WIP labels Dec 1, 2020

gianm reviewed Jan 8, 2021

View reviewed changes

clintropolis mentioned this pull request Jan 20, 2021

Vectorized theta sketch aggregator + rework of VectorColumnProcessorFactory. #10767

Merged

clintropolis added 7 commits February 7, 2021 19:36

Merge remote-tracking branch 'upstream/master' into vector-expr-filter

3736bed

more

5903fab

more

711a6b5

more

95bcf60

Merge remote-tracking branch 'upstream/master' into vector-expr-filter

8d29fd6

oops, i was wrong

a3e5c91

comment

406a0fa

gianm reviewed Feb 25, 2021

View reviewed changes

clintropolis added 2 commits March 4, 2021 04:26

remove decorate, object dimension selector, more javadocs

e3b0a94

style

3477524

gianm approved these changes Mar 16, 2021

View reviewed changes

gianm merged commit 4cd4a22 into apache:master Mar 16, 2021

clintropolis deleted the vector-expr-filter branch March 16, 2021 19:28

clintropolis mentioned this pull request Mar 18, 2021

vector group by support for string expressions #11010

Merged

6 tasks

clintropolis mentioned this pull request May 7, 2021

Fix vectorized cardinality bug on certain string columns. #11199

Merged

clintropolis added this to the 0.22.0 milestone Aug 12, 2021

clintropolis mentioned this pull request Sep 3, 2021

[Draft] 0.22.0 Release Notes #11657

Closed

expression filter support for vectorized query engines #10613

expression filter support for vectorized query engines #10613

Conversation

clintropolis commented Nov 28, 2020 • edited

Description

Key changed/added classes in this PR

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gianm commented Feb 25, 2021

clintropolis commented Feb 27, 2021

gianm left a comment

Choose a reason for hiding this comment

clintropolis commented Nov 28, 2020 •

edited