-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARRAY_AGG sql aggregator function #11157
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work 👍
docs/querying/sql.md
Outdated
|`ARRAY_AGG(expr)`|Collects all values of `expr` into an ARRAY, including null values, with the default limit on aggregation size of 1024 bytes. `ORDER BY` on the `ARRAY_AGG` expression is not currently supported.| | ||
|`ARRAY_AGG(DISTINCT expr)`|Collects all distinct values of `expr` into an ARRAY, including null values, with the default limit on aggregation size of 1024 bytes per aggregate. `ORDER BY` on the `ARRAY_AGG` expression is not currently supported.| | ||
|`ARRAY_AGG(expr, maxSizeBytes)`|Collects all values of `expr` into an ARRAY, including null values, with specified maximum byte size per aggregate. `ORDER BY` on the `ARRAY_AGG` expression is not currently supported.| | ||
|`ARRAY_AGG(DISTINCT expr, maxSizeBytes)`|Collects all distinct values of `expr` into an ARRAY, including null values, with specified maximum byte size per aggregate. `ORDER BY` on the `ARRAY_AGG` expression is not currently supported.| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe better to consolidate these into 2, such as ARRAY_AGG(expr)
and ARRAY_AGG(DISTINCT expr)
with an optional size parameter as in https://github.com/apache/druid/pull/11157/files#diff-cada0a48f60f548ce82ef1008918e0aed6b3de0339312c662761855a35dde6c6L339.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 changed
@Override | ||
public SqlAggFunction calciteFunction() | ||
{ | ||
return new ArrayAggFunction(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: is it worth making it as a singleton?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed
final String initialvalue; | ||
final ValueType elementType; | ||
final ValueType druidType = Calcites.getValueTypeForRelDataTypeFull(aggregateCall.getType()); | ||
switch (druidType) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can throw NPE since druidType
can be null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch, modified to default to string if null
return null; | ||
} | ||
|
||
Integer maxSizeBytes = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think probably most people just use the byte size for this function, but it seems that it can be a string literal which will allow using the HumanReadable
format?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah good point. I was just matching other sketch aggregators, I think it makes sense to be able to handle it, but maybe it makes sense to do this to all of them in a follow-up, so they are consistent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but please address my comment on docs before you merge.
return null; | ||
} | ||
|
||
Integer maxSizeBytes = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -353,6 +353,8 @@ Only the COUNT aggregation can accept DISTINCT. | |||
|`ANY_VALUE(expr)`|Returns any value of `expr` including null. `expr` must be numeric. This aggregator can simplify and optimize the performance by returning the first encountered value (including null)| | |||
|`ANY_VALUE(expr, maxBytesPerString)`|Like `ANY_VALUE(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.| | |||
|`GROUPING(expr, expr...)`|Returns a number to indicate which groupBy dimension is included in a row, when using `GROUPING SETS`. Refer to [additional documentation](aggregations.md#grouping-aggregator) on how to infer this number.| | |||
|`ARRAY_AGG(expr, [size])`|Collects all values of `expr` into an ARRAY, including null values, with `size` in bytes limit on aggregation size (default of 1024 bytes). Use of `ORDER BY` within the `ARRAY_AGG` expression is not currently supported, and the ordering of results within the output array may vary depending on processing order.| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please document that this function can return null when it processes no rows?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍, but I will do that in #11188, which adds that documentation for all aggregator functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good 👍
thanks for review @jihoonson |
Description
This PR adds
ARRAY_AGG
to SQL, building it on top of the expression aggregator provided in #11104.ARRAY_AGG(expr, [size])
expr
into an ARRAY, including null values, withsize
in bytes limit on aggregation size (default of 1024 bytes). Use ofORDER BY
within theARRAY_AGG
expression is not currently supported, and the ordering of results within the output array may vary depending on processing order.ARRAY_AGG(DISTINCT expr, [size])
expr
into an ARRAY, including null values, withsize
in bytes limit on aggregation size (default of 1024 bytes) per aggregate. Use ofORDER BY
within theARRAY_AGG
expression is not currently supported, and the ordering of results within the output array may vary depending on processing order.Since this is an expression aggregator, it is unlikely to be the most optimal way to provide this functionality, But, it importantly makes it so that we can offer this functionality relatively easily, since arrays are more or less only supported in the expression system at this point, so it is still an improvement.
Until we update Apache Calcite to a newer version, we are unable to support the
ORDER BY
that some other databases which offerARRAY_AGG
have, due to https://issues.apache.org/jira/browse/CALCITE-4335. Upgrading Calcite to any newer version is tricky because of requirements of a newer Guava dependency.We also will accept a non-standard 2 argument form of
ARRAY_AGG
as specified above, so that we can specify the maximum buffer aggregator size in bytes, which will be necessary until we have growable buffer aggregators.I added a lot of SQL query tests to try and stress some areas of array handling in addition to testing this aggregator itself, including expected failure tests such as trying to group on the output of an array agg output of a subquery or join.
It was admittedly rather pleasing to see stuff like
able to just work out of the box after I added the aggregator function, so I consider array support overall to be progressing pretty smoothly.
This PR has: