Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add single input string expression dimension vector selector and better expression planning #11213

Merged
merged 18 commits into from
Jul 6, 2021

Conversation

clintropolis
Copy link
Member

@clintropolis clintropolis commented May 6, 2021

Description

This PR adds some improvements and fixes to the query processing system, mostly related to expression planning and execution. ExpressionPlan has been reworked to now produce the ColumnCapabilities directly instead of ExpressionVirtualColumn doing this from the plan, which keeps everything a bit tighter together and much actual logic out of being in ExpressionVirtualColumn. The ExpressionPlanner was relatively well covered by tests prior to this PR, but it was almost entirely indirect coverage from the set of queries we were running, and had no direct tests. I've changed this added quite a few tests to model various scenarios that can occur when planning expressions against an underlying ColumnInspector so hopefully things will be better defined.

Part of the reason for reworking this stuff is so that single input strings can correctly report themselves as dictionary encoded to allow for deferring expression evaluation until the time lookupName is called, an optimization that had not yet made its way to the vectorized expression processing, but is available now through SingleStringDeferredEvaluationExpressionDimensionVectorSelector. This deferred evaluation offers quite a performance improvement, allowing using the native string grouping strategy and lazily executing expression on value lookup, instead of eagerly evaluating them and using the dictionary building grouping strategy.

Given the queries:

      // 28: group by single input string low cardinality expr with expr agg
      "SELECT CONCAT(string2, '-', 'foo'), SUM(long1 * long4) FROM foo GROUP BY 1 ORDER BY 2",
      // 28: group by single input string high cardinality expr with expr agg
      "SELECT CONCAT(string3, '-', 'foo'), SUM(long1 * long4) FROM foo GROUP BY 1 ORDER BY 2"

non-vectorized:

Benchmark                        (query)  (rowsPerSegment)  (vectorize)  Mode  Cnt    Score    Error  Units
SqlExpressionBenchmark.querySql       28           5000000        false  avgt    5  686.636 ± 22.514  ms/op
SqlExpressionBenchmark.querySql       29           5000000        false  avgt    5  852.508 ± 28.924  ms/op

eager vector object selector/dictionary building grouping strategy (unoptimized previous behavior which ended up being slower than the optimized non-vectorized engine) :

Benchmark                        (query)  (rowsPerSegment)  (vectorize)  Mode  Cnt    Score     Error  Units
SqlExpressionBenchmark.querySql       28           5000000        force  avgt    5  976.128 ±  78.928  ms/op
SqlExpressionBenchmark.querySql       29           5000000        force  avgt    5  960.815 ± 272.107  ms/op

new lazy evaluated expression grouping optimization for parity with non-vectorized engine using Expr.eval:

Benchmark                        (query)  (rowsPerSegment)  (vectorize)  Mode  Cnt    Score     Error  Units
SqlExpressionBenchmark.querySql       28           5000000        force  avgt    5  158.428 ±  6.382  ms/op
SqlExpressionBenchmark.querySql       29           5000000        force  avgt    5  194.414 ± 20.348  ms/op

new lazy evaluated expression grouping optimization for parity with non-vectorized engine using ExprVectorProcessor.evalVector with a vector size of 1:

Benchmark                        (query)  (rowsPerSegment)  (vectorize)  Mode  Cnt    Score    Error  Units
SqlExpressionBenchmark.querySql       28           5000000        force  avgt    5  155.643 ± 11.507  ms/op
SqlExpressionBenchmark.querySql       29           5000000        force  avgt    5  164.360 ±  8.924  ms/op

The last one measured slightly better than using Expr.eval, so this PR went with using ExprVectorProcessor even though the vector size is only ever 1. I suspect this will often be the case since the vectorized expression processors are stronger typed and so can eliminate most branching required to process individual rows since it knows the types of everything ahead of time.

This optimization can only be used by single input strings, because it can delegate dictionary ids to use for grouping to the underlying column. String expressions with multiple input columns will still use the default vector object selector path.

Vector selector construction in ColumnProcessors has been adjusted to push the decision on whether or not to use dictionary encoded selectors or object selectors into the processor factory, and group by engine now takes a less aggressive stance that any dictionary encoded column should use a dictionary encoded selector, which means single input string expressions will now use the SingleStringDeferredEvaluationExpressionDimensionVectorSelector and group on the underlying dictionary ids, while aggregators or filters on virtual columns will still use the object selector, which should still be more performant in non-grouping cases where evaluation cannot be deferred.

This PR also fixes an unrelated issue with vector query engines not having access to virtual column capabilities when trying to determine when if they can vectorize, and solves it by introducing a VirtualizedColumnInspector which wraps another ColumnInspector and is now used in these engines instead of the direct segment adapter, fixing the issue described in this comment #11188 (comment)


This PR has:

  • been self-reviewed.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

* the the string input.
*/
@Override
public boolean useDictionaryEncodedSelector(ColumnCapabilities capabilities)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add @Nullable for capabilities.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, it should never be null for this method though (nor any of the other VectorColumnProcessorFactory methods), ColumnProcessors will return a nil vector selector if the capabilities are null, since null capabilities in the vectorized engine means that the column doesn't exist.

* We do this even for things like virtual columns that have a single string input, because it allows deferring
* accessing any of the actual string values, which involves at minimum reading utf8 byte values and converting
* them to string form (if not already cached), and in the case of expressions, computing the expression output for
* the the string input.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* the the string input.
* the string input.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quite a few 'the the', fixed all I could find.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a lot! Thanks for fixing them.

* deal with the actual string value in exchange for the increased complexity of dealing with dictionary encoded
* selectors.
*/
default boolean useDictionaryEncodedSelector(ColumnCapabilities capabilities)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add @Nullable for capabilities.

.setDictionaryValuesSorted(sorted && dimensionSpec.getExtractionFn().preservesOrdering())
.setDictionaryValuesUnique(
unique && dimensionSpec.getExtractionFn().getExtractionType() == ExtractionFn.ExtractionType.ONE_TO_ONE
)
.setHasMultipleValues(dimensionSpec.mustDecorate() || mayBeMultiValue(columnCapabilities));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: dimensionSpec.mustDecorate() always returns false here.

* If no output type was able to be inferred during planning, returns null
*/
@Nullable
public ColumnCapabilities inferColumnCapabilities(@Nullable ValueType hint)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public ColumnCapabilities inferColumnCapabilities(@Nullable ValueType hint)
public ColumnCapabilities inferColumnCapabilities(@Nullable ValueType outputTypeHint)

return ColumnCapabilitiesImpl.createSimpleSingleValueStringColumnCapabilities();
}
// we don't know what we don't know
return null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we still infer in some cases if a non-null hint is provided, such as when outputType is null but hint is ValueType.DOUBLE? I'm not sure when this case can happen though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The contract of this method is that it only returns column capabilities if it could infer an output type, and it factors the hinted type into things (where the hint is from SQL planner or user who sent JSON query).

The caller of this method, ExpressionVirtualColumn.capabilities, will construct ColumnCapabilities using the hint if this inferColumnCapabilities method returns null.

@@ -54,7 +55,13 @@ public static SingleValueDimensionVectorSelector makeSingleValueDimensionVectorS
String constant = plan.getExpression().eval(ExprUtils.nilBindings()).asString();
return ConstantVectorSelectors.singleValueDimensionVectorSelector(factory.getReadableVectorInspector(), constant);
}
throw new IllegalStateException("Only constant expressions currently support dimension selectors");
if (plan.is(ExpressionPlan.Trait.SINGLE_INPUT_SCALAR) && ExprType.STRING == plan.getOutputType()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also use SingleStringInputDeferredEvaluationExpressionDimensionVectorSelector when the plan is SINGLE_INPUT_MAPPABLE?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not at this time, at least for expressions that aren't also SINGLE_INPUT_SCALAR. The vector engine uses different selectors for single and multi value string dimensions, and all SINGLE_INPUT_MAPPABLE that have a single input and single output will also have the trait SINGLE_INPUT_SCALAR. This means any SINGLE_INPUT_MAPPABLE that aren't caught by this if will be on multi-valued columns so need to use the multi-valued dimension vector selector, which isn't implemented for vectorized expression processing yet.

I guess we could hypothetically unroll multi-valued rows into a vector if we had some sort of input binding which could handle that mapping and still processing them as single valued dims, but I'm not certain if the complexity would be worth it over just using the multi-valued selector. I also can imagine reworking array processing to be much closer to scalar vector processing, so I need to think a bit more about the best way to tackle this.

* construct the appropriate capabilities for virtual columns, while the base inspector directly supplies the
* capabilities for non-virtual columns.
*/
public class VirtualizedColumnInspector implements ColumnInspector
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

{
Preconditions.checkArgument(capabilities != null, "Capabilities must not be null");
Preconditions.checkArgument(capabilities.getType() == ValueType.STRING, "Must only be called on a STRING column");
return capabilities.isDictionaryEncoded().isTrue();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for recording (and helping myself to remember later in the future), this means that the groupBy vector engine will use the dictionary IDs to compute per-segment results, and decode them when merging those results, which is what non-vectorized engine does today. When the column is dictionary encoded but not unique, this optimization might not be always good because there could be some sort of tradeoff depending on the column cardinality post expression evaluation. Even though I think this optimization is likely good in most cases, it could worth investigating further later to understand the tradeoff better.

@Override
public int getMaxVectorSize()
{
return 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would worth mentioning why the vector size is 1. Can you add some comment about it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added javadocs to the bindings class to detail its usage

Copy link
Contributor

@jihoonson jihoonson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@clintropolis
Copy link
Member Author

thanks for review @jihoonson 🤘

@clintropolis clintropolis merged commit 17efa6f into apache:master Jul 6, 2021
@clintropolis clintropolis deleted the expression-plan-adjustments branch July 6, 2021 18:20
jihoonson pushed a commit to jihoonson/druid that referenced this pull request Jul 12, 2021
…er expression planning (apache#11213)

* add single input string expression dimension vector selector and better expression planning

* better

* fixes

* oops

* rework how vector processor factories choose string processors, fix to be less aggressive about vectorizing

* oops

* javadocs, renaming

* more javadocs

* benchmarks

* use string expression vector processor with vector size 1 instead of expr.eval

* better logging

* javadocs, surprising number of the the

* more

* simplify
@clintropolis clintropolis added this to the 0.22.0 milestone Aug 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants