Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

expression filter support for vectorized query engines #10613

Merged
merged 14 commits into from Mar 16, 2021

Conversation

clintropolis
Copy link
Member

@clintropolis clintropolis commented Nov 28, 2020

Description

This PR adds support for ExpressionFilter in vectorized query engines, and expands vectorization support for conditional expressions to support string typed columns.

The signature of the Filter interface method which checks if a matcher can be vectorized has been updated to accept a ColumnInspector, so that the ExpressionVirtualColumn.canVectorize method can do its thing to determine if the underlying expression can be vectorized:

  default boolean canVectorizeMatcher(ColumnInspector inspector)
  {
    return false;
  }

Expression filters on string typed expressions use a new ObjectVectorValueMatcher (since string expressions most naturally use a VectorObjectSelector instead of the dictionary encoded vector dimension selectors), and DruidPredicateFactory has been updated to include a method to make an object matching predicate to work with that.

  default Predicate<Object> makeObjectPredicate()
  {
    // default to try to use string predicate;
    final Predicate<String> stringPredicate = makeStringPredicate();
    return o -> stringPredicate.apply((String) o);
  }

Similarly, VectorColumnProcessorFactory has been expanded with a new method to allow constructing these VectorObjectSelector based matchers:

  T makeObjectProcessor(ColumnCapabilities capabilities, VectorObjectSelector selector);

I imagine that this potentially opens the door to someday having filters that could support any of the complex typed columns we support as inputs as well using the predicate factory pattern, though I haven't really spent time thinking of any use cases for this yet.

Finally, this PR fixes a bug when using a vectorizable string typed expression virtual column as an input to a filter, where the virtual column would report itself as vectorizable, but then not have implemented a single value dictionary selector. Instead, these non-dictionary string columns will now use the string object selector and should work correctly now. I've added additional virtual column inputs to filter tests to cover this case as well.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Key changed/added classes in this PR
  • Filter
  • ExpressionFilter
  • VectorComparisonProcessors
  • BivariateFunctionVectorObjectProcessor
  • VectorValueMatcherFactory
  • ObjectVectorValueMatcher

@clintropolis clintropolis added Bug and removed WIP labels Dec 1, 2020
@@ -74,7 +74,7 @@ public ExprType getOutputType(InputBindingInspector inspector)
@Override
public boolean canVectorize(InputBindingInspector inspector)
{
return inspector.areNumeric(left, right) && inspector.canVectorize(left, right);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This opens not just STRING but also LONG_ARRAY and DOUBLE_ARRAY. Is that intentional (will numeric arrays work)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think canVectorize will still end up false because array types/functions are not yet vectorizable, but yeah i guess array types would not be well handled here even if they were, though maybe that is the case for non-vectorized expression as well. I'll do a bit of exploration and see what up

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did this exploration lead somewhere?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

array types do not currently implement canVectorize so will return the default false. We probably should not use canVectorize for argument validation and maybe should consider doing this explicitly when we can resolve the input types I think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. IMO we shouldn't rely on the fact that the array types happen to not be vectorizable. If they suddenly became vectorizable then this method would start returning true, maybe improperly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

like, I agree here, but canVectorize still feels wrong to saddle with type validation, it can explode when creating the processor if the types are nonsense

I think what might be best is to introduce a 'validate' method to expressions so that arguments can be checked up front, and maybe explode if the input is madness

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, could you add a comment about the situation?


import javax.annotation.Nullable;
import java.util.List;

public class ExprTypeConversion
{
public static Pair<ExprType, ExprType> coerceNull(Expr.InputBindingInspector inspector, Expr left, Expr right)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method deserves a javadoc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i did one better and just removed it. null literals output type is now null instead of their type, so that they behave and are handled like missing columns (since null doesn't have a type)

{
return new ExprEvalDoubleVector(bindings.getDoubleVector(binding), bindings.getNullVector(binding));
return new ExprEvalStringVector(Arrays.stream(bindings.getObjectVector(binding)).map(x -> (String) x).toArray(String[]::new));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What guarantees that the objects in the vector are Strings?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure why i had this code like this since in this case it is getting back nulls... it can just call bindings.getObjectVector. changed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, i lied, it needs to cast because non-existent column gives back Object[] of nulls instead of String[] of nulls

@@ -70,6 +71,11 @@ default MultiValueDimensionVectorSelector decorate(MultiValueDimensionVectorSele
throw new UOE("DimensionSpec[%s] cannot vectorize", getClass().getName());
}

default VectorObjectSelector decorate(VectorObjectSelector selector)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method should have a javadoc. The others should too, and they don't, but don't let that prevent you from adding one here 🙂

As I type this, it's not clear to me what this method does, but hopefully it becomes clear elsewhere in the patch.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is used with the poorly named VectorColumnSelectorFactory.makeStringObjectSelector, which should be renamed VectorColumnSelectorFactory.makeDimensionObjectSelector or something, which is for doing dimensiony things with a vector object selector similar to what is done for SingleValueDimensionVectorSelector or MultiValueDimensionVectorSelector. There are currently no implementations of this, like is the case with the other decorate methods with vector selectors, i was just trying for symmetry here so that the object selector could behave similar to the other string selectors when used in this context

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this makes sense, so if we're going to keep this method, that'd all be good stuff to add to the javadoc. This method does need one because it's not obvious from the signature how it works, what its preconditions are, and what kinds of objects will come out of the selector it returns.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the VectorObjectSelector passed to this method always return an array whose values are all Strings? And is the same thing expected of the returned VectorObjectSelector? If so, please add that to the javadoc. If not, then I must be missing something and could you please explain it to me 🙂

By the way, what happens if the input is multi-value? Will some of the entries in the Object[] of either the input or the output VectorObjectSelector actually be String[] instead of String? Or is multi-value not supported today?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't yet implemented support for multi-value string object selectors, but your question also seems applicable if we were to add array types too. I guess the contents of the array would be dependent on whatever the underlying selector is providing, and it would be the responsibility of the DimensionSpec implementation to handle decoration appropriately? I guess we should also probably consider adding methods for the primitive numeric columns as well, since they are not masqueraded as "dimension selectors" in the vectorized engine, (if we ever want to fully support all of the non-vectorized behavior here I think).

I don't know, the more I think about this, do we really need DimensionSpec, like at all? The stuff it does is all over the place, and it feels like maybe this type of stuff should just be pushed down into either expression virtual columns or something more specialized as appropriate (more generic predicate filtered selectors virtual column?). Then we maybe we can just ditch this stuff entirely in some sunny future instead of implementing vectorized versions of decorate for the various DimensionSpec implementations. Or is there an advantage to DimensionSpec that I am missing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know, the more I think about this, do we really need DimensionSpec, like at all?

I don't think so, because the stuff it does could be done by virtual columns. Removing them would be a big backwards-incompatibility, but we don't necessarily have to remove them; we could just make them not-vectorizable and add notes in the docs that they are a legacy thing.

Today the only DimensionSpec that is vectorizable is DefaultDimensionSpec (the one that does nothing other than pass-through), so nothing special is needed to support it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed the decorate method I added in this PR, in favor of removing this functionality from the vectorized engine. Since no decorate, no need for a separate "object dimension selector", I've removed that from VectorColumnSelectorFactory as well and reverted the signatures changes to processors that were needed to support that 👍

If we do the same thing and ditch decorate for SingleValueDimensionVectorSelector and MultiValueDimensionVectorSelector there is quite a bit of simplification we can do, because we can discard the DimensionSpec much earlier and just work with columnName like the other selectors instead of going all the way down like it does now.

@@ -27,6 +27,13 @@
{
Predicate<String> makeStringPredicate();

default Predicate<Object> makeObjectPredicate()
{
// default to try to use string predicate;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"try" makes it sound like this could potentially not work out. Why might it not work out? What bad stuff will happen if it doesn't work out?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently I don't think it should be possible because only expression filter will make a matcher factory with object matching predicate and only for string column types, but i guess cast exceptions if an object selector of a complex column somehow got in here from future code, see other comment about thinking a bit further about this 😅

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(see next comment #10613 (comment))

if (!forceSingleValue) {
// if column is not dictionary encoded (and not non-existent or complex), use object selector
if (effectiveCapabilites.isDictionaryEncoded().isFalse() ||
effectiveCapabilites.areDictionaryValuesUnique().isFalse()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does it matter if the dictionary values are unique? (Even if they're not, can't we still use a regular string selector?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by regular string selector do you mean dimension selector? The answer is yes we could, but if the dictionary values are not unique, i came to the conclusion that they aren't especially useful since in all the cases I can think of you can't use them as is and you're going to have to lookup the string values anyway (e.g. filter matching and vector group by).

I was trying to provide an avenue to avoid the extra steps with expression virtual columns, and allow an object selector to be created directly since the string is the useable part in this case, and the dictionary ids just feel like an obstruction. But maybe this is overly broad and there are cases where we would want it to act like a dictionary encoded column for some reason that I have missed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your reasoning makes sense, and it sounds like your assumption is that when selecting string-typed things that aren't dictionary-encoded, we should prefer the object selector vs. one of the DimensionVectorSelectors, because that'll incur less overhead.

This logic should be spelled out somewhere though. IMO javadocs on the appropriate methods in VectorColumnSelectorFactory is the best place. They should say when callers should prefer using the object selector vs. the more specifically-typed selectors.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I've got this covered better now

}

// The hashcode and equals are to make SubclassesMustOverrideEqualsAndHashCodeTest stop complaining..
// DruidPredicateFactory doesn't really need equals or hashcode, in fact only the 'toString' method is called
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. It sounds like a bug that DruidPredicateFactory is annotated with @SubclassesMustOverrideEqualsAndHashCodeTest but yet its toString method is used for equality checks. Is that right? If so, perhaps a good fix would be:

  • Keep the annotation on DimensionPredicateFilter
  • Switch the equality checks in DimensionPredicateFilter to use DruidPredicateFactory equals rather than toString

If we do that, IMO it'd be good to do it separately from this patch and also separately from #8256.

DruidPredicateFactory doesn't really need equals or hashcode

How else would DimensionPredicateFilter's equals be implemented?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I was planning on addressing this in a separate PR, the comments are partially to help remind us to do it in case I didn't get to it immediately

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 although I'm still wondering: how else would DimensionPredicateFilter's equals be implemented, if not via equals and hashCode on DruidPredicateFactory?

(Assuming we stop it from using toString, which IMO we should)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, I forgot to rework this comment, it is assuming that it keeps using toString. If we changed that to not use the string for its equality then it would be the toString that isn't necessary, will try to rework

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed comment to suggest DimensionPredicateFilter should use equals of predicate factory instead of string form

}
// in default mode, just fallback to using a long matcher since nearly all boolean-ish expressions
// output a long value so it is probably a safe bet? idk, ending up here by using all null-ish things
// in default mode is dancing on the edge of insanity anyway...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love the comment but the patch shouldn't go live like this 🙂

Can you come up with a scenario where we'll actually hit this case and add some tests for it? Maybe in the process of doing that, a better way to behave or a better way to describe the behavior will come up.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reworked this to be cooler I think, since a vectorizable expression that produces a 'null' output type means all inputs are null constants or missing columns, so the function is a constant function, we can just evaluate the expression and make a constant matcher of the correct type instead of just going with something that worked in most cases by luck.

case STRING:
return VectorValueMatcherColumnProcessorFactory.instance().makeObjectProcessor(
// using 'numeric' capabilities creator so we are configured to NOT be dictionary encoded, etc
ColumnCapabilitiesImpl.createSimpleNumericColumnCapabilities(ValueType.STRING),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

createSimpleNumericColumnCapabilities is a weird method to call with type STRING. Is this a mistake? Or is the method just named poorly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is strange, but on purpose for the reason the comment says since the numeric capabilities defaults are correct in this case. I will make sure all current users are numeric types, and if so then just make these capabilites from scratch because we should be counting on something geared towards numeric behavior in case those change in some unexpected way. Or, if it is used by other non-numbers then rename it to something more appropriate, like createSimpleSingleValueColumnCapabilites or similar

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think one of those two options you mentioned would be good. IMO we shouldn't call a function named createSimpleNumericColumnCapabilities in this context, even if it happens to do the right thing, because it "looks wrong" and that trips people up that are trying to read through it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a createSimpleStringColumnCapabilities

@@ -57,6 +57,15 @@ default int getMaxVectorSize()
*/
MultiValueDimensionVectorSelector makeMultiValueDimensionSelector(DimensionSpec dimensionSpec);

/**
* Returns an object selector for string columns, useful for non-dictionary encoded strings, or when
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does it do when called on a non-string column? Cast the stuff to strings? Return null? Something else?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, i have no idea why I named this function like this, it should probably be called makeDimensionObjectSelector. Its just expected to make a VectorObjectSelector from a DimensionSpec rather than a raw column name

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The javadoc here should be fleshed out to describe the expected contract. IMO it should answer questions like:

  • If you call this with a DimensionSpec that refers to a non-string column, what happens?
  • Will the returned VectorObjectSelector always return a String or null?
  • …might it return List for a multi-value string column?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think javadocs are maybe cooler now in describing the intent of these methods

* behavior with non-vectorized matchers which use a string predicate with null inputs for these 'nil' matchers,
* do the same thing here...
*/
default Predicate<Object> makeObjectPredicate()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on #10613 (comment) it seems like the intent of this method is that it will only be called for complex types (or maybe nonexistent columns too). If that's the case, please add some javadoc explaining that this method will only be called for complex types. It's analogous to "makeObjectProcessor" in "VectorColumnProcessorFactory". (You could even @see to that method in the javadocs if you want.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adjusted javadocs

* isn't useful. Right now, this should probably only be called on single valued STRING inputs (multi-value STRING
* vector object selector is not yet implemented), but perhaps a world can be imagined where this could work with
* COMPLEX too perhaps if it were useful for them to interact with {@link DimensionSpec} in some manner? May be called
* for non-existent columns.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These javadocs should explicitly constrain the type of Object that may show up in the returned VectorObjectSelector. Is it always going to be a String or a null? Might it be String[], if the underlying column is multi-value?

*
* If you need to write code that adapts to different input types, you should write a
* {@link org.apache.druid.segment.VectorColumnProcessorFactory} and use one of the
* {@link org.apache.druid.segment.ColumnProcessors#makeVectorProcessor} functions instead of using this method.
*/
MultiValueDimensionVectorSelector makeMultiValueDimensionSelector(DimensionSpec dimensionSpec);

/**
* Returns a dimension object selector. Useful for non-dictionary encoded strings, or when using the dictionary
* isn't useful. Right now, this should probably only be called on single valued STRING inputs (multi-value STRING
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Get rid of the "probably" 🙂

Either it will work or it won't. If it doesn't work, say firmly not to do it, and ideally describe what will happen if you do it accidentally (the method will throw an error? the method will return a selector with undefined behavior? etc). If it does work, but might be a bad idea for some reason, spell that out.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i got rid of this method, and merged these javadocs into the makeObjectSelector javadocs

public static ColumnCapabilitiesImpl createSimpleStringColumnCapabilities()
{
return new ColumnCapabilitiesImpl().setType(ValueType.STRING)
.setHasMultipleValues(false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it ok that hasMultipleValues is false here? That seems like the potentially-sketchiest of the behaviors, since someone might use this method for a multi-value string accidentally. How about naming it createSimpleSingleValueStringColumnCapabilities?

Method names: the longer the better. Go, Java.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed ^

@gianm
Copy link
Contributor

gianm commented Feb 25, 2021

Thanks for your patience, @clintropolis. Most of the remaining comments I have on the code are about clarifying the method contracts properly.

At this point I wanted to ask about the testing. I haven't looked at it much, and I would deeply appreciate it if you could describe briefly what the testing strategy is for this new feature, and why you think it covers all the important cases. That'll make reviewing it a lot quicker.

@clintropolis
Copy link
Member Author

At this point I wanted to ask about the testing. I haven't looked at it much, and I would deeply appreciate it if you could describe briefly what the testing strategy is for this new feature, and why you think it covers all the important cases. That'll make reviewing it a lot quicker.

I guess there are 2 main areas of changes, some modifications in druid-core at the Expr layer so that logical operations can process single value string columns, and in druid-processing vector processors and filter matcher stuffs.

The Expr changes are mostly covered by VectorExprSanityTest, which does a sort of brute-force comparision of vectorized and non-vectorized expressions and ensures that the outputs are identical, but I added a few other odds and ends to cover the changes to null constants.

For the filter changes, most of the coverage is provided by the tests that extend BaseFilterTest, which itself has been improved to include testing matches against a virtual column (which was missing before, and serves as coverage for the bug for vectorized filter matching on virtual columns which is fixed by this PR), with new test cases added to SelectorFilterTest and BoundFilterTest to cover this and new functionality. In ExpressionFilterTest, every vectorizable expression is now using assertFilterMatches instead of assertFilterMatchesSkipVectorize, but there are a couple of stragglers such as the like expression which are not vectorizable yet still using the latter method. A couple calcite tests which previously could not vectorize are now vectorizable as well.

The only stuff without great coverage as far as I can tell from the report is the changes vector processor stuff with group by engine, but that will be covered in a follow-up PR that adds the vectorized expression support to vector group by (dictionary building vectorized group by), and the decorate methods in DimensionSpec which i think is expected since nothing implements them (or the other vectorized decorate methods for that matter)

Copy link
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @clintropolis

@gianm gianm merged commit 4cd4a22 into apache:master Mar 16, 2021
@clintropolis clintropolis deleted the vector-expr-filter branch March 16, 2021 19:28
@clintropolis clintropolis added this to the 0.22.0 milestone Aug 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants