-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add single input string expression dimension vector selector and better expression planning #11213
add single input string expression dimension vector selector and better expression planning #11213
Conversation
…er expression planning
…o be less aggressive about vectorizing
* the the string input. | ||
*/ | ||
@Override | ||
public boolean useDictionaryEncodedSelector(ColumnCapabilities capabilities) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add @Nullable
for capabilities
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, it should never be null for this method though (nor any of the other VectorColumnProcessorFactory
methods), ColumnProcessors
will return a nil vector selector if the capabilities are null, since null capabilities in the vectorized engine means that the column doesn't exist.
* We do this even for things like virtual columns that have a single string input, because it allows deferring | ||
* accessing any of the actual string values, which involves at minimum reading utf8 byte values and converting | ||
* them to string form (if not already cached), and in the case of expressions, computing the expression output for | ||
* the the string input. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* the the string input. | |
* the string input. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
quite a few 'the the', fixed all I could find.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a lot! Thanks for fixing them.
* deal with the actual string value in exchange for the increased complexity of dealing with dictionary encoded | ||
* selectors. | ||
*/ | ||
default boolean useDictionaryEncodedSelector(ColumnCapabilities capabilities) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add @Nullable
for capabilities
.
.setDictionaryValuesSorted(sorted && dimensionSpec.getExtractionFn().preservesOrdering()) | ||
.setDictionaryValuesUnique( | ||
unique && dimensionSpec.getExtractionFn().getExtractionType() == ExtractionFn.ExtractionType.ONE_TO_ONE | ||
) | ||
.setHasMultipleValues(dimensionSpec.mustDecorate() || mayBeMultiValue(columnCapabilities)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: dimensionSpec.mustDecorate()
always returns false here.
* If no output type was able to be inferred during planning, returns null | ||
*/ | ||
@Nullable | ||
public ColumnCapabilities inferColumnCapabilities(@Nullable ValueType hint) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public ColumnCapabilities inferColumnCapabilities(@Nullable ValueType hint) | |
public ColumnCapabilities inferColumnCapabilities(@Nullable ValueType outputTypeHint) |
return ColumnCapabilitiesImpl.createSimpleSingleValueStringColumnCapabilities(); | ||
} | ||
// we don't know what we don't know | ||
return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we still infer in some cases if a non-null hint
is provided, such as when outputType
is null but hint
is ValueType.DOUBLE
? I'm not sure when this case can happen though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The contract of this method is that it only returns column capabilities if it could infer an output type, and it factors the hinted type into things (where the hint is from SQL planner or user who sent JSON query).
The caller of this method, ExpressionVirtualColumn.capabilities
, will construct ColumnCapabilities
using the hint if this inferColumnCapabilities
method returns null.
@@ -54,7 +55,13 @@ public static SingleValueDimensionVectorSelector makeSingleValueDimensionVectorS | |||
String constant = plan.getExpression().eval(ExprUtils.nilBindings()).asString(); | |||
return ConstantVectorSelectors.singleValueDimensionVectorSelector(factory.getReadableVectorInspector(), constant); | |||
} | |||
throw new IllegalStateException("Only constant expressions currently support dimension selectors"); | |||
if (plan.is(ExpressionPlan.Trait.SINGLE_INPUT_SCALAR) && ExprType.STRING == plan.getOutputType()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also use SingleStringInputDeferredEvaluationExpressionDimensionVectorSelector
when the plan is SINGLE_INPUT_MAPPABLE
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not at this time, at least for expressions that aren't also SINGLE_INPUT_SCALAR
. The vector engine uses different selectors for single and multi value string dimensions, and all SINGLE_INPUT_MAPPABLE
that have a single input and single output will also have the trait SINGLE_INPUT_SCALAR
. This means any SINGLE_INPUT_MAPPABLE
that aren't caught by this if will be on multi-valued columns so need to use the multi-valued dimension vector selector, which isn't implemented for vectorized expression processing yet.
I guess we could hypothetically unroll multi-valued rows into a vector if we had some sort of input binding which could handle that mapping and still processing them as single valued dims, but I'm not certain if the complexity would be worth it over just using the multi-valued selector. I also can imagine reworking array processing to be much closer to scalar vector processing, so I need to think a bit more about the best way to tackle this.
* construct the appropriate capabilities for virtual columns, while the base inspector directly supplies the | ||
* capabilities for non-virtual columns. | ||
*/ | ||
public class VirtualizedColumnInspector implements ColumnInspector |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
{ | ||
Preconditions.checkArgument(capabilities != null, "Capabilities must not be null"); | ||
Preconditions.checkArgument(capabilities.getType() == ValueType.STRING, "Must only be called on a STRING column"); | ||
return capabilities.isDictionaryEncoded().isTrue(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for recording (and helping myself to remember later in the future), this means that the groupBy vector engine will use the dictionary IDs to compute per-segment results, and decode them when merging those results, which is what non-vectorized engine does today. When the column is dictionary encoded but not unique, this optimization might not be always good because there could be some sort of tradeoff depending on the column cardinality post expression evaluation. Even though I think this optimization is likely good in most cases, it could worth investigating further later to understand the tradeoff better.
@Override | ||
public int getMaxVectorSize() | ||
{ | ||
return 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would worth mentioning why the vector size is 1. Can you add some comment about it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added javadocs to the bindings class to detail its usage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
thanks for review @jihoonson 🤘 |
…er expression planning (apache#11213) * add single input string expression dimension vector selector and better expression planning * better * fixes * oops * rework how vector processor factories choose string processors, fix to be less aggressive about vectorizing * oops * javadocs, renaming * more javadocs * benchmarks * use string expression vector processor with vector size 1 instead of expr.eval * better logging * javadocs, surprising number of the the * more * simplify
Description
This PR adds some improvements and fixes to the query processing system, mostly related to expression planning and execution.
ExpressionPlan
has been reworked to now produce theColumnCapabilities
directly instead ofExpressionVirtualColumn
doing this from the plan, which keeps everything a bit tighter together and much actual logic out of being inExpressionVirtualColumn
. TheExpressionPlanner
was relatively well covered by tests prior to this PR, but it was almost entirely indirect coverage from the set of queries we were running, and had no direct tests. I've changed this added quite a few tests to model various scenarios that can occur when planning expressions against an underlyingColumnInspector
so hopefully things will be better defined.Part of the reason for reworking this stuff is so that single input strings can correctly report themselves as dictionary encoded to allow for deferring expression evaluation until the time
lookupName
is called, an optimization that had not yet made its way to the vectorized expression processing, but is available now throughSingleStringDeferredEvaluationExpressionDimensionVectorSelector
. This deferred evaluation offers quite a performance improvement, allowing using the native string grouping strategy and lazily executing expression on value lookup, instead of eagerly evaluating them and using the dictionary building grouping strategy.Given the queries:
non-vectorized:
eager vector object selector/dictionary building grouping strategy (unoptimized previous behavior which ended up being slower than the optimized non-vectorized engine) :
new lazy evaluated expression grouping optimization for parity with non-vectorized engine using
Expr.eval
:new lazy evaluated expression grouping optimization for parity with non-vectorized engine using
ExprVectorProcessor.evalVector
with a vector size of 1:The last one measured slightly better than using
Expr.eval
, so this PR went with usingExprVectorProcessor
even though the vector size is only ever 1. I suspect this will often be the case since the vectorized expression processors are stronger typed and so can eliminate most branching required to process individual rows since it knows the types of everything ahead of time.This optimization can only be used by single input strings, because it can delegate dictionary ids to use for grouping to the underlying column. String expressions with multiple input columns will still use the default vector object selector path.
Vector selector construction in
ColumnProcessors
has been adjusted to push the decision on whether or not to use dictionary encoded selectors or object selectors into the processor factory, and group by engine now takes a less aggressive stance that any dictionary encoded column should use a dictionary encoded selector, which means single input string expressions will now use theSingleStringDeferredEvaluationExpressionDimensionVectorSelector
and group on the underlying dictionary ids, while aggregators or filters on virtual columns will still use the object selector, which should still be more performant in non-grouping cases where evaluation cannot be deferred.This PR also fixes an unrelated issue with vector query engines not having access to virtual column capabilities when trying to determine when if they can vectorize, and solves it by introducing a
VirtualizedColumnInspector
which wraps anotherColumnInspector
and is now used in these engines instead of the direct segment adapter, fixing the issue described in this comment #11188 (comment)This PR has: