expression aggregator #11104

Merged · 17 commits · Apr 23, 2021
Conversation

clintropolis
Member

@clintropolis clintropolis commented Apr 12, 2021

Description

I loved the idea of the JavaScript aggregator - the flexibility of being able to define arbitrary computations across any number of inputs to accumulate values is very powerful. But it turned out to be a bit too powerful, due to the number of exploitative things you can do to the host machine with it, so it broke my heart. Additionally, it was limited to aggregations of double types only, which really cuts down on the possible expressiveness.

In this PR, I have re-imagined the concept of such a flexible aggregator, but sandboxed, using native Druid expressions. Further, this ExpressionLambdaAggregatorFactory really rounds out the role of the Druid expression system in the query engine. With its introduction, Druid native expressions can now be used to perform a "fold" or "reduce" operation on any number of input columns, in addition to the previously possible "map" (ExpressionVirtualColumn), "filter" (ExpressionFilter) and post-transform (ExpressionPostAggregator).

ExpressionLambdaAggregatorFactory also supports all native Druid expression types, including array types, which means that it can be used to build things such as ARRAY_AGG and GROUP_CONCAT and similar functionality.

ExpressionLambdaAggregatorFactory offers near-complete control over AggregatorFactory behavior through expressions.

| property | description | required |
|----------|-------------|----------|
| name | aggregator name | true |
| fields | aggregator input columns | true |
| accumulatorIdentifier | variable which identifies the accumulator value in the fold and combine expressions | false (default `__acc`) |
| initialValue | initial value of the accumulator for the fold expression (and the combine expression, if initialCombineValue is null) | true |
| initialCombineValue | initial value of the accumulator for the combine expression | false (default initialValue) |
| fold | expression to accumulate values from fields. The result of the expression is stored in accumulatorIdentifier and is available to the next computation. | true |
| combine | expression to combine the results of various fold expressions. The partial result is available to the expression under the aggregator name. If not set and fold has a single input column in fields, the fold expression is used. | false (defaults to the fold expression if and only if fold has a single input in fields) |
| compare | comparator expression which can refer only to two input variables, o1 and o2, the outputs of the fold or combine expressions; it must adhere to the Java comparator contract. If not set, this falls back to a comparator appropriate to the output type. | false |
| finalize | finalize expression which can refer only to a single input variable, o, and is used to perform any final transformation of the output of the fold or combine expressions. If not set, the value is not transformed. | false |
| maxSizeBytes | maximum size in bytes that variably sized aggregator output types, such as strings and arrays, are allowed to grow before the aggregation fails | false (default 8192 bytes) |
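The lifecycle implied by the properties above can be sketched in plain Python (purely illustrative; the `expression_aggregate` helper is hypothetical and not part of Druid): each segment folds its rows into a partial value, the partials are combined, and finalize transforms the merged result.

```python
from functools import reduce

def expression_aggregate(segments, initial_value, fold, initial_combine, combine, finalize):
    # "fold": accumulate row values within each segment, starting from initialValue
    partials = [reduce(fold, rows, initial_value) for rows in segments]
    # "combine": merge per-segment partial results, starting from initialCombineValue
    merged = reduce(combine, partials, initial_combine)
    # "finalize": optional final transformation of the output
    return finalize(merged)

# A "sum"-style aggregator over two segments:
result = expression_aggregate(
    segments=[[1, 2, 3], [4, 5]],
    initial_value=0,
    fold=lambda acc, x: acc + x,           # models "__acc + column_a"
    initial_combine=0,
    combine=lambda acc, part: acc + part,  # partial result merged into the accumulator
    finalize=lambda o: o,                  # identity when no finalize is set
)
# result is 15
```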

Examples (some contrived)

"count" aggregator

    {
      "type": "expression",
      "name": "expression_count",
      "fields": [],
      "initialValue": "0",
      "fold": "__acc + 1",
      "combine": "__acc + expression_count"
    }

"sum" aggregator

    {
      "type": "expression",
      "name": "expression_sum",
      "fields": ["column_a"],
      "initialValue": "0",
      "fold": "__acc + column_a"
    }

"array" aggregator, sorted by array length

    {
      "type": "expression",
      "name": "expression_array_agg_distinct",
      "fields": ["column_a"],
      "initialValue": "[]",
      "fold": "array_set_add(__acc, column_a)",
      "combine": "array_set_add_all(__acc, expression_array_agg_distinct)",
      "compare": "if(array_length(o1) > array_length(o2), 1, if (array_length(o1) == array_length(o2), 0, -1))"
    }
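The compare expression in the example above must follow the Java comparator contract (negative, zero, or positive). A hypothetical Python rendering of that length-based comparator, just to make the semantics concrete:

```python
# Models the nested if() comparator from the JSON spec above:
# positive when o1 is longer, zero when equal length, negative otherwise.
def compare(o1, o2):
    if len(o1) > len(o2):
        return 1
    return 0 if len(o1) == len(o2) else -1
```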

"group_concat" aggregator, sorted by array length

    {
      "type": "expression",
      "name": "expression_group_concat_distinct",
      "fields": ["column_a"],
      "initialValue": "[]",
      "fold": "array_set_add(__acc, column_a)",
      "combine": "array_set_add_all(__acc, expression_group_concat_distinct)",
      "compare": "if(array_length(o1) > array_length(o2), 1, if (array_length(o1) == array_length(o2), 0, -1))",
      "finalize": "array_to_string(o, ',')"
    }

decomposed "sum" aggregator, which instead of merging by summing merges the partial results into an array of individual sums, and then sums the array values in the finalizer

    {
      "type": "expression",
      "name": "expression_decomposed_sum",
      "fields": ["column_a"],
      "initialValue": "0.0",
      "initialCombineValue": "<DOUBLE>[]",
      "fold": "__acc + column_a",
      "combine": "array_concat(__acc, expression_decomposed_sum)",
      "finalize": "fold((x, __acc) -> x + __acc, o, 0.0)"
    }
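Tracing the decomposed example above in plain Python (an illustrative simulation, not Druid code) shows how the three stages compose:

```python
from functools import reduce

segments = [[1.0, 2.0], [3.0, 4.0]]

# fold: "__acc + column_a", starting from 0.0 within each segment
partials = [reduce(lambda acc, x: acc + x, rows, 0.0) for rows in segments]

# combine: "array_concat(__acc, ...)", starting from an empty DOUBLE array,
# so each per-segment sum is appended rather than added
merged = reduce(lambda acc, part: acc + [part], partials, [])

# finalize: "fold((x, __acc) -> x + __acc, o, 0.0)" sums the collected partials
result = reduce(lambda acc, x: acc + x, merged, 0.0)
# merged is [3.0, 7.0]; result is 10.0
```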

Variably sized results such as string and array types are controlled with a maximum byte setting. Unlike some other aggregators like string first/last, the expression aggregator will currently fail if the size grows too large instead of silently truncating.

Since array types can get quite large, I've added array_set_add and array_set_add_all, which allow working with array types as sets so that they contain only unique values (this also allows modeling things such as the DISTINCT keyword with SQL functions such as ARRAY_AGG and GROUP_CONCAT).
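A rough model of those set-style functions in Python (an illustrative analogy, not Druid's implementation; sorting here is only to keep the sketch deterministic):

```python
def array_set_add(arr, value):
    # add a single element, keeping only unique values
    return sorted(set(arr) | {value})

def array_set_add_all(arr, other):
    # merge two arrays as a set union
    return sorted(set(arr) | set(other))

# Folding duplicate inputs accumulates only distinct values:
acc = []
for v in ["a", "b", "a", "c"]:
    acc = array_set_add(acc, v)
# acc is now ["a", "b", "c"]

# Combining two partial sets keeps the union distinct:
merged = array_set_add_all(acc, ["b", "d"])
# merged is ["a", "b", "c", "d"]
```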

I have not actually documented the native expression aggregator itself in this PR, because I am unsure whether it is ready for large-scale use, but I would consider adding it to the native querying docs, perhaps with a disclaimer.


Key changed/added classes in this PR
  • ExprEval
  • Parser
  • ExpressionLambdaAggregatorFactory
  • ExpressionLambdaAggregator
  • ExpressionLambdaBufferAggregator

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@lgtm-com

lgtm-com bot commented Apr 13, 2021

This pull request introduces 1 alert when merging 7007ac4 into a6a2758 - view on LGTM.com

new alerts:

  • 1 for Dereferenced variable may be null

@clintropolis clintropolis removed the WIP label Apr 14, 2021
    @Override
    public byte[] getCacheKey()
    {
      byte[] fieldsBytes = StringUtils.toUtf8WithNullToEmpty(String.join(",", fields));

Contributor

Suggest using CacheKeyBuilder.

    switch (eval.type()) {
      case LONG:
        if (eval.isNumericNull()) {
          buffer.put(offset, NullHandling.IS_NULL_BYTE);

Contributor

It seems reasonable to me to ignore maxSizeBytes for primitive types, but please document it at least in the Javadoc.

    buffer.putInt(offset, stringBytes.length);
    offset += Integer.BYTES;
    for (byte stringByte : stringBytes) {
      buffer.put(offset++, stringByte);

Contributor

Hmm, why not buffer.position(offset); buffer.put(stringBytes, offset, stringBytes.length) and restoring the original position?

Member Author

Oops, I meant to switch to doing what you suggest here; originally I was handling all cases manually, but then started to switch some over to the bulk methods and didn't finish all the way.
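The length-prefixed layout being discussed can be modeled in Python (an illustrative sketch only; the flag byte, offsets, and `write_string` helper are assumptions for the example, not Druid's exact buffer format). Note the bulk slice copy in place of the byte-at-a-time loop, mirroring the review suggestion:

```python
import struct

def write_string(buf, offset, s, max_size_bytes):
    data = s.encode("utf-8")
    needed = 1 + 4 + len(data)  # flag byte + 4-byte length prefix + payload
    if needed > max_size_bytes:
        raise ValueError("aggregation output exceeds maxSizeBytes")
    buf[offset] = 0  # not-null flag
    struct.pack_into(">i", buf, offset + 1, len(data))
    # bulk copy instead of writing one byte per iteration
    buf[offset + 5 : offset + 5 + len(data)] = data
    return offset + needed

buf = bytearray(64)
end = write_string(buf, 0, "druid", 8192)
# end is 10: 1 flag byte + 4 length bytes + 5 payload bytes
```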

    }
    } else {
      checkMaxBytes(eval.type(), 1 + Integer.BYTES, maxSizeBytes);
      buffer.putInt(offset, -1);

Contributor

Suggest adding a static variable for -1 and using it.

      return new String[]{null};
    }

    private static Class convertType(@Nullable Class existing, Class next)

Contributor

Could you please add some javadoc?

@@ -177,14 +177,17 @@ See javadoc of java.lang.Math for detailed explanation for each function.
| array_offset_of(arr,expr) | returns the 0 based index of the first occurrence of expr in the array, or `-1` (or `null` if `druid.generic.useDefaultValueForNull=false`) if no matching elements exist in the array. |
| array_ordinal_of(arr,expr) | returns the 1 based index of the first occurrence of expr in the array, or `-1` (or `null` if `druid.generic.useDefaultValueForNull=false`) if no matching elements exist in the array. |
| array_prepend(expr,arr) | adds expr to arr at the beginning, the resulting array type determined by the type of the array |
| array_append(arr1,expr) | appends expr to arr, the resulting array type determined by the type of the first array |
| array_append(arr,expr) | appends expr to arr, the resulting array type determined by the type of the first array |
Contributor

> the resulting array type determined by the type of the first array

Is this true? Should it mention how the type is determined per https://github.com/apache/druid/pull/11104/files#diff-7badc739fd6eef810cbd31950d52f28267273e129b9371beeea0bc5d125da7f7R267?

Member Author

Ah, it is true. The method you linked is used when converting values from the input binding to expressions. Within expression evaluation the types are already decided, so the first array type dictates the output type (see the case expressions in the array function eval methods).

Contributor

Ah cool, thanks for the explanation 👍

| array_concat(arr1,arr2) | concatenates 2 arrays, the resulting array type determined by the type of the first array |
| array_set_add(arr,expr) | adds expr to arr and converts the array to a new array composed of the unique set of elements. The resulting array type determined by the type of the array |
Contributor

> The resulting array type determined by the type of the array

Similarly, is the type determined by inspecting all elements in the new set?

Member Author

No, for the same reason as #11104 (comment)

    @JsonProperty("combine") @Nullable final String combineExpression,
    @JsonProperty("compare") @Nullable final String compareExpression,
    @JsonProperty("finalize") @Nullable final String finalizeExpression,
    @JsonProperty("maxSizeBytes") @Nullable final Integer maxSizeBytes,

Contributor

HumanReadableBytes?

    public class ExpressionLambdaAggregatorFactory extends AggregatorFactory
    {
      private static final String DEFAULT_ACCUMULATOR_ID = "__acc";
      private static final int DEFAULT_MAX_SIZE_BYTES = 1 << 13;

Contributor

8K by default seems pretty big. Maybe 1K instead?

      private final Supplier<Expr> combineExpression;
      private final Supplier<Expr> compareExpression;
      private final Supplier<Expr> finalizeExpression;
      private final int maxSizeBytes;

Contributor

Please document that it's ignored when combinedValue is a primitive type.

Member Author

I think it might be better to have a precondition that the size is at a minimum the size required to hold a long or double; then we wouldn't really have to ignore it, because it would be illegal.

Contributor

@jihoonson jihoonson left a comment

LGTM. Thanks @clintropolis.

Contributor

@jon-wei jon-wei left a comment

I think the interface for the aggregator makes sense, LGTM.

@jon-wei jon-wei merged commit 57ff1f9 into apache:master Apr 23, 2021
@clintropolis clintropolis deleted the lambda-the-ultimate-aggregator branch April 23, 2021 01:44
3 participants