expression aggregator #11104
Conversation
This pull request introduces 1 alert when merging 7007ac4 into a6a2758. View the new alerts on LGTM.com.
```java
@Override
public byte[] getCacheKey()
{
  byte[] fieldsBytes = StringUtils.toUtf8WithNullToEmpty(String.join(",", fields));
```
Suggest using `CacheKeyBuilder`.
```java
switch (eval.type()) {
  case LONG:
    if (eval.isNumericNull()) {
      buffer.put(offset, NullHandling.IS_NULL_BYTE);
```
It seems reasonable to me to ignore `maxSizeBytes` for primitive types, but please document it at least in the javadoc.
```java
buffer.putInt(offset, stringBytes.length);
offset += Integer.BYTES;
for (byte stringByte : stringBytes) {
  buffer.put(offset++, stringByte);
```
Hmm, why not `buffer.position(offset); buffer.put(stringBytes, 0, stringBytes.length)` and then restore the original position?
Oops, I meant to switch to doing what you suggest here. Originally I was doing all cases manually, but then started to switch some over to the bulk methods and I guess didn't finish all the way.
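For reference, a minimal standalone sketch of the bulk-put pattern the reviewer describes (the class and method names here are illustrative, not the PR's actual code):

```java
import java.nio.ByteBuffer;

// Illustrative sketch: write bytes at an absolute offset with one bulk put
// by saving and restoring the buffer's position, instead of a per-byte loop.
public class BulkPutSketch
{
  static void putBytesAt(ByteBuffer buffer, int offset, byte[] stringBytes)
  {
    final int oldPosition = buffer.position();
    try {
      buffer.position(offset);
      // in the relative bulk put, the int arguments index into the source
      // array, not the buffer
      buffer.put(stringBytes, 0, stringBytes.length);
    }
    finally {
      buffer.position(oldPosition); // restore the original position
    }
  }
}
```

A single bulk `put` avoids the per-element bounds checks of the byte-at-a-time loop and leaves the buffer's position unchanged for the caller.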
```java
  }
} else {
  checkMaxBytes(eval.type(), 1 + Integer.BYTES, maxSizeBytes);
  buffer.putInt(offset, -1);
```
Suggest adding a static variable for `-1` and using it.
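As a sketch of that suggestion (the constant name is hypothetical, not taken from the PR):

```java
import java.nio.ByteBuffer;

// Hypothetical naming of the magic value: a named constant documents that -1
// marks a null (absent) value in the serialized form.
public class NullMarkerSketch
{
  static final int NULL_LENGTH_MARKER = -1;

  static void putNull(ByteBuffer buffer, int offset)
  {
    buffer.putInt(offset, NULL_LENGTH_MARKER);
  }
}
```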
```java
  return new String[]{null};
}

private static Class convertType(@Nullable Class existing, Class next)
```
Could you please add some javadoc?
```diff
@@ -177,14 +177,17 @@ See javadoc of java.lang.Math for detailed explanation for each function.
 | array_offset_of(arr,expr) | returns the 0 based index of the first occurrence of expr in the array, or `-1` or `null` if `druid.generic.useDefaultValueForNull=false` if no matching elements exist in the array. |
 | array_ordinal_of(arr,expr) | returns the 1 based index of the first occurrence of expr in the array, or `-1` or `null` if `druid.generic.useDefaultValueForNull=false` if no matching elements exist in the array. |
 | array_prepend(expr,arr) | adds expr to arr at the beginning, the resulting array type determined by the type of the array |
-| array_append(arr1,expr) | appends expr to arr, the resulting array type determined by the type of the first array |
+| array_append(arr,expr) | appends expr to arr, the resulting array type determined by the type of the first array |
```
> the resulting array type determined by the type of the first array

Is this true? Should it mention how the type is determined per https://github.com/apache/druid/pull/11104/files#diff-7badc739fd6eef810cbd31950d52f28267273e129b9371beeea0bc5d125da7f7R267?
Ah, it is true. The method you linked is used when converting the values from the input binding to expressions. Within expression evaluation the types are already decided, and so the first array type dictates the output type (see the case expressions in the array function eval methods)
Ah cool, thanks for the explanation 👍
```
| array_concat(arr1,arr2) | concatenates 2 arrays, the resulting array type determined by the type of the first array |
| array_set_add(arr,expr) | adds expr to arr and converts the array to a new array composed of the unique set of elements. The resulting array type determined by the type of the array |
```
> The resulting array type determined by the type of the array

Similarly, is the type determined by inspecting all elements in the new set?
No, for the same reason as #11104 (comment)
```java
@JsonProperty("combine") @Nullable final String combineExpression,
@JsonProperty("compare") @Nullable final String compareExpression,
@JsonProperty("finalize") @Nullable final String finalizeExpression,
@JsonProperty("maxSizeBytes") @Nullable final Integer maxSizeBytes,
```
`HumanReadableBytes`?
```java
public class ExpressionLambdaAggregatorFactory extends AggregatorFactory
{
  private static final String DEFAULT_ACCUMULATOR_ID = "__acc";
  private static final int DEFAULT_MAX_SIZE_BYTES = 1 << 13;
```
8K by default seems pretty big. Maybe 1K instead?
```java
private final Supplier<Expr> combineExpression;
private final Supplier<Expr> compareExpression;
private final Supplier<Expr> finalizeExpression;
private final int maxSizeBytes;
```
Please document that it's ignored when combinedValue is a primitive type.
I think it might be better to have a precondition that the size is at minimum the size required to hold a long or double; then we wouldn't really have to ignore it, because it would be illegal.
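A minimal sketch of such a precondition (the method name, constant, and message are illustrative, not the PR's actual code):

```java
// Illustrative precondition: reject a maxSizeBytes too small to hold a
// null-marker byte plus the largest primitive value (a long or double).
public class MaxSizeSketch
{
  static final int MIN_SIZE_BYTES = Byte.BYTES + Long.BYTES; // 1 + 8

  static int validateMaxSizeBytes(int maxSizeBytes)
  {
    if (maxSizeBytes < MIN_SIZE_BYTES) {
      throw new IllegalArgumentException(
          "maxSizeBytes must be at least " + MIN_SIZE_BYTES + " bytes"
      );
    }
    return maxSizeBytes;
  }
}
```

Failing fast at construction time turns a silent "ignore the setting" case into an explicit configuration error.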
LGTM. Thanks @clintropolis.
I think the interface for the aggregator makes sense, LGTM
Description
I loved the idea of the Javascript aggregator - the flexibility of being able to define arbitrary computations across any number of inputs to accumulate values is very powerful. But, it turns out it was a bit too powerful due to the number of exploitive things you can do with it to the host machine, so it broke my heart. Additionally, it was limited to only supporting aggregations of double types, which really cuts down on the expressiveness possible.
In this PR, I have re-imagined the concept of such a flexible aggregator, but sandboxed, using native Druid expressions. Further, this `ExpressionLambdaAggregatorFactory` really rounds out the role of the Druid expression system in the query engine. With its introduction, Druid native expressions can now be used to perform a "fold" or "reduce" operation on any number of input columns, in addition to the previously possible "map" (`ExpressionVirtualColumn`), "filter" (`ExpressionFilter`), and post-transform (`ExpressionPostAggregator`). `ExpressionLambdaAggregatorFactory` also supports all native Druid expression types, including array types, which means that it can be used to build things such as `ARRAY_AGG` and `GROUP_CONCAT` and similar functionality.

`ExpressionLambdaAggregatorFactory` offers near complete control over the `AggregatorFactory` with expressions:

- `name`: output name of the aggregator
- `fields`: input columns available to the `fold` expression
- `accumulatorIdentifier`: identifier of the accumulator in the `fold` (and `combine`) expression (default `__acc`)
- `initialValue`: initial value of the accumulator for the `fold` (and `combine`, if `initialCombineValue` is null) expression
- `initialCombineValue`: initial value of the accumulator for the `combine` expression (default `initialValue`)
- `fold`: expression that accumulates values from `fields`. The result of the expression will be stored in `accumulatorIdentifier` and available to the next computation.
- `combine`: expression to combine the results of `fold` expressions. If not defined and `fold` has a single input column in `fields`, then the `fold` expression may be used; otherwise the input is available to the expression as the `name` (the `fold` expression can serve as the `combine` expression if and only if the expression has a single input in `fields`)
- `compare`: comparator expression which can access `o1` and `o2`, where `o1` and `o2` are the output of `fold` or `combine` expressions, and must adhere to the Java comparator contract. If not set, this will try to fall back to an output type appropriate comparator
- `finalize`: expression which can access `o`, and is used to perform any final transformation of the output of `fold` or `combine` expressions. If not set, then the value is not transformed
- `maxSizeBytes`: maximum size in bytes that variably sized aggregation results (such as strings and arrays) may occupy
Examples (some contrived):

- "count" aggregator
- "sum" aggregator
- "array" aggregator, sorted by array length
- "group_concat" aggregator, sorted by array length
- decomposed "sum" aggregator, where instead of merging by summing it merges into an array of each individual sum, before finally summing the values in a finalizer
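Based on the fields described above, a hypothetical `ARRAY_AGG`-style aggregator spec might look roughly like this (a sketch assembled from this description; the column name, expression values, and sizes are illustrative, not copied from the PR):

```json
{
  "type": "expression",
  "name": "string_values",
  "fields": ["some_string_column"],
  "initialValue": "[]",
  "fold": "array_set_add(__acc, some_string_column)",
  "combine": "array_set_add_all(__acc, string_values)",
  "maxSizeBytes": 8192
}
```

Note how the `combine` expression reads the partial result under the aggregator's `name` (`string_values`), while both expressions accumulate into the default `__acc` identifier.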
Variably sized results such as string and array types are controlled with a maximum byte setting. Unlike some other aggregators like string first/last, the expression aggregator will currently fail if the size grows too large instead of silently truncating.
Since array types can get quite large, I've added `array_set_add` and `array_set_add_all`, which allow working with array types as sets so that they contain only unique values (this also allows modeling things such as use of the `DISTINCT` keyword with SQL functions such as `ARRAY_AGG` and `GROUP_CONCAT`).

I have not actually documented the native expression aggregator itself in this PR yet, because I am unsure if it is ready for large scale use quite yet, but would consider adding it to the native querying docs, perhaps with a disclaimer.
Key changed/added classes in this PR
- `ExprEval`
- `Parser`
- `ExpressionLambdaAggregatorFactory`
- `ExpressionLambdaAggregator`
- `ExpressionLambdaBufferAggregator`
This PR has: