add single input string expression dimension vector selector and better expression planning #11213

clintropolis · 2021-05-06T19:22:25Z

Description

This PR adds some improvements and fixes to the query processing system, mostly related to expression planning and execution. ExpressionPlan has been reworked to produce the ColumnCapabilities directly, instead of ExpressionVirtualColumn deriving them from the plan, which keeps everything a bit tighter together and keeps most of the actual logic out of ExpressionVirtualColumn. The ExpressionPlanner was relatively well covered by tests prior to this PR, but the coverage was almost entirely indirect, coming from the set of queries we were running, and it had no direct tests. I've changed this and added quite a few tests to model various scenarios that can occur when planning expressions against an underlying ColumnInspector, so hopefully things will be better defined.

Part of the reason for reworking this stuff is so that single input string expressions can correctly report themselves as dictionary encoded, which allows deferring expression evaluation until the time lookupName is called. This optimization had not yet made its way to the vectorized expression processing, but is available now through SingleStringInputDeferredEvaluationExpressionDimensionVectorSelector. Deferred evaluation offers quite a performance improvement: it allows using the native string grouping strategy and lazily executing the expression on value lookup, instead of eagerly evaluating it and using the dictionary building grouping strategy.
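To make the deferred evaluation idea concrete, here is a minimal sketch. It is not the code in this PR: `DictionaryColumnSketch`, `DeferredExpressionSelectorSketch`, and `DeferredGroupingSketch` are hypothetical, heavily simplified stand-ins, and the expression is modeled as a plain `Function<String, String>`. It only illustrates grouping on the underlying dictionary ids and evaluating the expression once per id at lookup time.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical, simplified stand-in for a dictionary encoded single value string column.
final class DictionaryColumnSketch
{
  private final List<String> dictionary; // dictionary id -> string value
  private final int[] rowIds;            // one dictionary id per row

  DictionaryColumnSketch(List<String> dictionary, int[] rowIds)
  {
    this.dictionary = dictionary;
    this.rowIds = rowIds;
  }

  int[] getRowIds()
  {
    return rowIds;
  }

  String lookupName(int id)
  {
    return dictionary.get(id);
  }
}

// Hypothetical selector: exposes the underlying ids, evaluates the expression only in lookupName.
final class DeferredExpressionSelectorSketch
{
  private final DictionaryColumnSketch column;
  private final Function<String, String> expression;
  private int evaluations = 0;

  DeferredExpressionSelectorSketch(DictionaryColumnSketch column, Function<String, String> expression)
  {
    this.column = column;
    this.expression = expression;
  }

  int[] getRowIds()
  {
    return column.getRowIds();
  }

  String lookupName(int id)
  {
    evaluations++;
    return expression.apply(column.lookupName(id));
  }

  int getEvaluations()
  {
    return evaluations;
  }
}

final class DeferredGroupingSketch
{
  // Counts rows per expression output, touching string values only once per distinct id.
  static Map<String, Long> countRows(DeferredExpressionSelectorSketch selector)
  {
    Map<Integer, Long> countsById = new HashMap<>();
    for (int id : selector.getRowIds()) {
      countsById.merge(id, 1L, Long::sum);
    }
    Map<String, Long> countsByValue = new HashMap<>();
    for (Map.Entry<Integer, Long> entry : countsById.entrySet()) {
      // expression evaluation is deferred until the id is decoded
      countsByValue.merge(selector.lookupName(entry.getKey()), entry.getValue(), Long::sum);
    }
    return countsByValue;
  }
}
```

In this sketch the expression runs once per distinct dictionary id rather than once per row, which is the property the benchmarks in this description are measuring.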

Given the queries:

      // 28: group by single input string low cardinality expr with expr agg
      "SELECT CONCAT(string2, '-', 'foo'), SUM(long1 * long4) FROM foo GROUP BY 1 ORDER BY 2",
      // 29: group by single input string high cardinality expr with expr agg
      "SELECT CONCAT(string3, '-', 'foo'), SUM(long1 * long4) FROM foo GROUP BY 1 ORDER BY 2"

non-vectorized:

Benchmark                        (query)  (rowsPerSegment)  (vectorize)  Mode  Cnt    Score    Error  Units
SqlExpressionBenchmark.querySql       28           5000000        false  avgt    5  686.636 ± 22.514  ms/op
SqlExpressionBenchmark.querySql       29           5000000        false  avgt    5  852.508 ± 28.924  ms/op

eager vector object selector/dictionary building grouping strategy (unoptimized previous behavior, which ended up being slower than the optimized non-vectorized engine):

Benchmark                        (query)  (rowsPerSegment)  (vectorize)  Mode  Cnt    Score     Error  Units
SqlExpressionBenchmark.querySql       28           5000000        force  avgt    5  976.128 ±  78.928  ms/op
SqlExpressionBenchmark.querySql       29           5000000        force  avgt    5  960.815 ± 272.107  ms/op

new lazy evaluated expression grouping optimization for parity with non-vectorized engine using Expr.eval:

Benchmark                        (query)  (rowsPerSegment)  (vectorize)  Mode  Cnt    Score     Error  Units
SqlExpressionBenchmark.querySql       28           5000000        force  avgt    5  158.428 ±  6.382  ms/op
SqlExpressionBenchmark.querySql       29           5000000        force  avgt    5  194.414 ± 20.348  ms/op

new lazy evaluated expression grouping optimization for parity with non-vectorized engine using ExprVectorProcessor.evalVector with a vector size of 1:

Benchmark                        (query)  (rowsPerSegment)  (vectorize)  Mode  Cnt    Score    Error  Units
SqlExpressionBenchmark.querySql       28           5000000        force  avgt    5  155.643 ± 11.507  ms/op
SqlExpressionBenchmark.querySql       29           5000000        force  avgt    5  164.360 ±  8.924  ms/op

The last one measured slightly better than using Expr.eval, so this PR went with using ExprVectorProcessor even though the vector size is only ever 1. I suspect this will often be the case, since the vectorized expression processors are more strongly typed and so can eliminate most of the branching required to process individual rows, because they know the types of everything ahead of time.

This optimization can only be used by single input string expressions, because only then can the selector delegate the dictionary ids used for grouping to the underlying column. String expressions with multiple input columns will still use the default vector object selector path.

Vector selector construction in ColumnProcessors has been adjusted to push the decision on whether to use dictionary encoded selectors or object selectors into the processor factory. The group by engine now takes a less aggressive stance, that any dictionary encoded column should use a dictionary encoded selector, which means single input string expressions will now use the SingleStringInputDeferredEvaluationExpressionDimensionVectorSelector and group on the underlying dictionary ids. Aggregators or filters on virtual columns will still use the object selector, which should still be more performant in non-grouping cases where evaluation cannot be deferred.
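A rough sketch of how that decision can live in the processor factory follows. The types here are hypothetical simplifications (`StringCapabilitiesSketch` stands in for ColumnCapabilities on a STRING column), and the body of the default method is an assumption made for illustration; only the group by override mirrors the check quoted later in this review.

```java
// Hypothetical, simplified stand-in for ColumnCapabilities; only STRING columns are modeled.
final class StringCapabilitiesSketch
{
  final boolean dictionaryEncoded;
  final boolean dictionaryValuesUnique;

  StringCapabilitiesSketch(boolean dictionaryEncoded, boolean dictionaryValuesUnique)
  {
    this.dictionaryEncoded = dictionaryEncoded;
    this.dictionaryValuesUnique = dictionaryValuesUnique;
  }
}

interface ProcessorFactorySketch
{
  // Assumed default for this sketch: use dictionary ids only when each id maps to a distinct value.
  default boolean useDictionaryEncodedSelector(StringCapabilitiesSketch capabilities)
  {
    if (capabilities == null) {
      throw new IllegalArgumentException("Capabilities must not be null");
    }
    return capabilities.dictionaryEncoded && capabilities.dictionaryValuesUnique;
  }
}

// The group by factory accepts any dictionary encoded column, unique or not.
final class GroupByProcessorFactorySketch implements ProcessorFactorySketch
{
  @Override
  public boolean useDictionaryEncodedSelector(StringCapabilitiesSketch capabilities)
  {
    if (capabilities == null) {
      throw new IllegalArgumentException("Capabilities must not be null");
    }
    return capabilities.dictionaryEncoded;
  }
}

final class SelectorChoiceSketch
{
  // Returns a label for the kind of selector that would be built for a string column.
  static String chooseStringSelector(ProcessorFactorySketch factory, StringCapabilitiesSketch capabilities)
  {
    if (capabilities == null) {
      return "nil"; // null capabilities mean the column does not exist
    }
    return factory.useDictionaryEncodedSelector(capabilities) ? "dictionary" : "object";
  }
}
```

With capabilities that are dictionary encoded but not unique, as a single input string expression might report, the group by factory in this sketch picks the dictionary encoded selector while a factory using the assumed default picks the object selector.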

This PR also fixes an unrelated issue where vector query engines did not have access to virtual column capabilities when trying to determine whether they can vectorize. It solves this by introducing a VirtualizedColumnInspector, which wraps another ColumnInspector and is now used in these engines instead of the direct segment adapter, fixing the issue described in this comment #11188 (comment)
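The wrapping behavior can be sketched roughly as below. This is a simplified illustration, not the actual class: `ColumnInspectorSketch` is a hypothetical interface that returns capabilities as a plain string, and virtual columns are modeled as functions that compute their capabilities from the base inspector.

```java
import java.util.Map;
import java.util.function.Function;

// Hypothetical, simplified inspector: returns a capabilities description, or null if the column is missing.
interface ColumnInspectorSketch
{
  String getColumnCapabilities(String column);
}

// Wraps a base inspector so that virtual columns are visible alongside physical ones.
final class VirtualizedColumnInspectorSketch implements ColumnInspectorSketch
{
  private final ColumnInspectorSketch baseInspector;
  private final Map<String, Function<ColumnInspectorSketch, String>> virtualColumns;

  VirtualizedColumnInspectorSketch(
      ColumnInspectorSketch baseInspector,
      Map<String, Function<ColumnInspectorSketch, String>> virtualColumns
  )
  {
    this.baseInspector = baseInspector;
    this.virtualColumns = virtualColumns;
  }

  @Override
  public String getColumnCapabilities(String column)
  {
    Function<ColumnInspectorSketch, String> virtualColumn = virtualColumns.get(column);
    if (virtualColumn != null) {
      // virtual columns construct their capabilities from the base inspector
      return virtualColumn.apply(baseInspector);
    }
    // non-virtual columns are answered directly by the base inspector
    return baseInspector.getColumnCapabilities(column);
  }
}
```

An engine that consults the wrapper instead of the base inspector sees capabilities for both kinds of column, which is what the vectorization check needs.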

This PR has:

been self-reviewed.
added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
been tested in a test Druid cluster.

jihoonson · 2021-06-30T05:01:02Z

...org/apache/druid/query/groupby/epinephelinae/vector/GroupByVectorColumnProcessorFactory.java

+   * the the string input.
+   */
+  @Override
+  public boolean useDictionaryEncodedSelector(ColumnCapabilities capabilities)


Please add @Nullable for capabilities.

hmm, it should never be null for this method though (nor for any of the other VectorColumnProcessorFactory methods); ColumnProcessors will return a nil vector selector if the capabilities are null, since null capabilities in the vectorized engine means that the column doesn't exist.

jihoonson · 2021-06-30T18:07:31Z

...org/apache/druid/query/groupby/epinephelinae/vector/GroupByVectorColumnProcessorFactory.java

+   * We do this even for things like virtual columns that have a single string input, because it allows deferring
+   * accessing any of the actual string values, which involves at minimum reading utf8 byte values and converting
+   * them to string form (if not already cached), and in the case of expressions, computing the expression output for
+   * the the string input.


Suggested change

-   * the the string input.
+   * the string input.

quite a few 'the the', fixed all I could find.

That's a lot! Thanks for fixing them.

jihoonson · 2021-06-30T19:50:52Z

processing/src/main/java/org/apache/druid/segment/VectorColumnProcessorFactory.java

+   * deal with the actual string value in exchange for the increased complexity of dealing with dictionary encoded
+   * selectors.
+   */
+  default boolean useDictionaryEncodedSelector(ColumnCapabilities capabilities)


Please add @Nullable for capabilities.

jihoonson · 2021-06-30T19:54:53Z

processing/src/main/java/org/apache/druid/segment/ColumnProcessors.java

+          .setDictionaryValuesSorted(sorted && dimensionSpec.getExtractionFn().preservesOrdering())
+          .setDictionaryValuesUnique(
+              unique && dimensionSpec.getExtractionFn().getExtractionType() == ExtractionFn.ExtractionType.ONE_TO_ONE
+          )
          .setHasMultipleValues(dimensionSpec.mustDecorate() || mayBeMultiValue(columnCapabilities));


nit: dimensionSpec.mustDecorate() always returns false here.

jihoonson · 2021-06-30T20:30:26Z

processing/src/main/java/org/apache/druid/segment/virtual/ExpressionPlan.java

+   * If no output type was able to be inferred during planning, returns null
+   */
+  @Nullable
+  public ColumnCapabilities inferColumnCapabilities(@Nullable ValueType hint)


Suggested change

-  public ColumnCapabilities inferColumnCapabilities(@Nullable ValueType hint)
+  public ColumnCapabilities inferColumnCapabilities(@Nullable ValueType outputTypeHint)

jihoonson · 2021-06-30T20:37:07Z

processing/src/main/java/org/apache/druid/segment/virtual/ExpressionPlan.java

+      return ColumnCapabilitiesImpl.createSimpleSingleValueStringColumnCapabilities();
+    }
+    // we don't know what we don't know
+    return null;


Can we still infer in some cases if a non-null hint is provided, such as when outputType is null but hint is ValueType.DOUBLE? I'm not sure when this case can happen though.

The contract of this method is that it only returns column capabilities if it could infer an output type, and it factors the hinted type into things (where the hint is from SQL planner or user who sent JSON query).

The caller of this method, ExpressionVirtualColumn.capabilities, will construct ColumnCapabilities using the hint if this inferColumnCapabilities method returns null.
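The contract described in this reply can be sketched as follows. All types are hypothetical simplifications: capabilities are reduced to just an output type, and how the real method combines the hint with an inferred type is deliberately left out, since it is not spelled out in this thread.

```java
// Hypothetical, simplified stand-in for ValueType.
enum OutputTypeSketch { STRING, LONG, DOUBLE }

// Hypothetical stand-in for ExpressionPlan, reduced to the inferred output type.
final class ExpressionPlanSketch
{
  private final OutputTypeSketch inferredType; // null when planning could not infer a type

  ExpressionPlanSketch(OutputTypeSketch inferredType)
  {
    this.inferredType = inferredType;
  }

  // Returns null if no output type could be inferred during planning. The real method also
  // factors the hint into the result when a type was inferred; that detail is omitted here.
  OutputTypeSketch inferColumnType(OutputTypeSketch outputTypeHint)
  {
    return inferredType;
  }
}

// Hypothetical stand-in for the caller, ExpressionVirtualColumn.capabilities.
final class VirtualColumnSketch
{
  static OutputTypeSketch capabilities(ExpressionPlanSketch plan, OutputTypeSketch outputTypeHint)
  {
    OutputTypeSketch inferred = plan.inferColumnType(outputTypeHint);
    if (inferred != null) {
      return inferred;
    }
    // nothing could be inferred, so fall back to the hint (may itself be null in this sketch)
    return outputTypeHint;
  }
}
```

The point of the split is that the plan only reports what it actually knows, and the fallback to the hint stays in the caller.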

jihoonson · 2021-06-30T20:47:06Z

processing/src/main/java/org/apache/druid/segment/virtual/ExpressionVectorSelectors.java

@@ -54,7 +55,13 @@ public static SingleValueDimensionVectorSelector makeSingleValueDimensionVectorS
      String constant = plan.getExpression().eval(ExprUtils.nilBindings()).asString();
      return ConstantVectorSelectors.singleValueDimensionVectorSelector(factory.getReadableVectorInspector(), constant);
    }
-    throw new IllegalStateException("Only constant expressions currently support dimension selectors");
+    if (plan.is(ExpressionPlan.Trait.SINGLE_INPUT_SCALAR) && ExprType.STRING == plan.getOutputType()) {


Can we also use SingleStringInputDeferredEvaluationExpressionDimensionVectorSelector when the plan is SINGLE_INPUT_MAPPABLE?

Not at this time, at least for expressions that aren't also SINGLE_INPUT_SCALAR. The vector engine uses different selectors for single and multi value string dimensions, and all SINGLE_INPUT_MAPPABLE that have a single input and single output will also have the trait SINGLE_INPUT_SCALAR. This means any SINGLE_INPUT_MAPPABLE that aren't caught by this if will be on multi-valued columns so need to use the multi-valued dimension vector selector, which isn't implemented for vectorized expression processing yet.

I guess we could hypothetically unroll multi-valued rows into a vector if we had some sort of input binding which could handle that mapping and still process them as single valued dims, but I'm not certain if the complexity would be worth it over just using the multi-valued selector. I can also imagine reworking array processing to be much closer to scalar vector processing, so I need to think a bit more about the best way to tackle this.

jihoonson · 2021-06-30T20:54:05Z

processing/src/main/java/org/apache/druid/segment/virtual/VirtualizedColumnInspector.java

+ * construct the appropriate capabilities for virtual columns, while the base inspector directly supplies the
+ * capabilities for non-virtual columns.
+ */
+public class VirtualizedColumnInspector implements ColumnInspector


jihoonson · 2021-06-30T21:26:12Z

...org/apache/druid/query/groupby/epinephelinae/vector/GroupByVectorColumnProcessorFactory.java

+  {
+    Preconditions.checkArgument(capabilities != null, "Capabilities must not be null");
+    Preconditions.checkArgument(capabilities.getType() == ValueType.STRING, "Must only be called on a STRING column");
+    return capabilities.isDictionaryEncoded().isTrue();


Just for the record (and to help myself remember later), this means that the groupBy vector engine will use the dictionary IDs to compute per-segment results, and decode them when merging those results, which is what the non-vectorized engine does today. When the column is dictionary encoded but not unique, this optimization might not always be good, because there could be some sort of tradeoff depending on the column cardinality post expression evaluation. Even though I think this optimization is likely good in most cases, it could be worth investigating further later to understand the tradeoff better.
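A small sketch of the tradeoff mentioned here, using hypothetical helper names and a plain `Function<String, String>` as the expression: when several dictionary ids evaluate to the same output, grouping on ids produces more per-segment groups than grouping on the evaluated values would, and the extra groups are only collapsed once the ids are decoded.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper comparing group counts for the two grouping strategies.
final class GroupCardinalitySketch
{
  // Number of groups when grouping on dictionary ids: one per distinct id seen.
  static int groupsByDictionaryId(int[] rowIds)
  {
    Set<Integer> ids = new HashSet<>();
    for (int id : rowIds) {
      ids.add(id);
    }
    return ids.size();
  }

  // Number of groups when grouping on the evaluated expression output instead.
  static int groupsByExpressionValue(List<String> dictionary, int[] rowIds, Function<String, String> expression)
  {
    Set<String> values = new HashSet<>();
    for (int id : rowIds) {
      values.add(expression.apply(dictionary.get(id)));
    }
    return values.size();
  }
}
```

For example, with the dictionary `["apple", "avocado", "banana"]` and an expression that keeps only the first character, grouping on ids tracks three groups while grouping on values would track two.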

jihoonson · 2021-06-30T21:44:39Z

...id/segment/virtual/SingleStringInputDeferredEvaluationExpressionDimensionVectorSelector.java

+    @Override
+    public int getMaxVectorSize()
+    {
+      return 1;


It would be worth mentioning why the vector size is 1. Can you add a comment about it?

added javadocs to the bindings class to detail its usage

jihoonson

LGTM

clintropolis · 2021-07-06T18:20:43Z

thanks for review @jihoonson 🤘

…er expression planning (apache#11213) * add single input string expression dimension vector selector and better expression planning * better * fixes * oops * rework how vector processor factories choose string processors, fix to be less aggressive about vectorizing * oops * javadocs, renaming * more javadocs * benchmarks * use string expression vector processor with vector size 1 instead of expr.eval * better logging * javadocs, surprising number of the the * more * simplify

add single input string expression dimension vector selector and bett…

a009ce2

…er expression planning

clintropolis added Bug Area - Querying labels May 6, 2021

clintropolis added 10 commits May 12, 2021 16:34

better

09913fc

Merge remote-tracking branch 'upstream/master' into expression-plan-a…

5a9ecf8

…djustments

fixes

22e8f2f

oops

516a4a5

rework how vector processor factories choose string processors, fix t…

79468b1

…o be less aggressive about vectorizing

oops

66eceb5

javadocs, renaming

1d3f590

Merge remote-tracking branch 'upstream/master' into expression-plan-a…

52cc7aa

…djustments

more javadocs

a3e98ba

benchmarks

60910e2

clintropolis added the Performance label May 18, 2021

clintropolis added 3 commits May 18, 2021 01:10

use string expression vector processor with vector size 1 instead of …

99d0b74

…expr.eval

Merge remote-tracking branch 'upstream/master' into expression-plan-a…

b934b85

…djustments

better logging

63f6277

jihoonson reviewed Jun 30, 2021

View reviewed changes

clintropolis added 4 commits July 2, 2021 05:02

Merge remote-tracking branch 'upstream/master' into expression-plan-a…

3837e2e

…djustments

javadocs, surprising number of the the

fdb7a54

more

ced4a02

simplify

76677bf

jihoonson approved these changes Jul 3, 2021

View reviewed changes

clintropolis merged commit 17efa6f into apache:master Jul 6, 2021

clintropolis deleted the expression-plan-adjustments branch July 6, 2021 18:20

clintropolis added this to the 0.22.0 milestone Aug 12, 2021

clintropolis mentioned this pull request Sep 3, 2021

[Draft] 0.22.0 Release Notes #11657

Closed
