Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARRAY_AGG sql aggregator function #11157

Merged
merged 8 commits into from
May 4, 2021
Merged

Conversation

clintropolis
Copy link
Member

@clintropolis clintropolis commented Apr 23, 2021

Description

This PR adds ARRAY_AGG to SQL, building it on top of the expression aggregator provided in #11104.

function description
ARRAY_AGG(expr, [size]) Collects all values of expr into an ARRAY, including null values, with size in bytes limit on aggregation size (default of 1024 bytes). Use of ORDER BY within the ARRAY_AGG expression is not currently supported, and the ordering of results within the output array may vary depending on processing order.
ARRAY_AGG(DISTINCT expr, [size]) Collects all distinct values of expr into an ARRAY, including null values, with size in bytes limit on aggregation size (default of 1024 bytes) per aggregate. Use of ORDER BY within the ARRAY_AGG expression is not currently supported, and the ordering of results within the output array may vary depending on processing order.

Since this is an expression aggregator, it is unlikely to be the most optimal way to provide this functionality, But, it importantly makes it so that we can offer this functionality relatively easily, since arrays are more or less only supported in the expression system at this point, so it is still an improvement.

Until we update Apache Calcite to a newer version, we are unable to support the ORDER BY that some other databases which offer ARRAY_AGG have, due to https://issues.apache.org/jira/browse/CALCITE-4335. Upgrading Calcite to any newer version is tricky because of requirements of a newer Guava dependency.

We also will accept a non-standard 2 argument form of ARRAY_AGG as specified above, so that we can specify the maximum buffer aggregator size in bytes, which will be necessary until we have growable buffer aggregators.

I added a lot of SQL query tests to try and stress some areas of array handling in addition to testing this aggregator itself, including expected failure tests such as trying to group on the output of an array agg output of a subquery or join.

It was admittedly rather pleasing to see stuff like

SELECT dim2, COUNT(*)
FROM foo
WHERE ARRAY_CONTAINS((SELECT ARRAY_AGG(DISTINCT dim1) FROM foo WHERE dim1 is not null), dim1)
GROUP BY 1

able to just work out of the box after I added the aggregator function, so I consider array support overall to be progressing pretty smoothly.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Copy link
Contributor

@jihoonson jihoonson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work 👍

|`ARRAY_AGG(expr)`|Collects all values of `expr` into an ARRAY, including null values, with the default limit on aggregation size of 1024 bytes. `ORDER BY` on the `ARRAY_AGG` expression is not currently supported.|
|`ARRAY_AGG(DISTINCT expr)`|Collects all distinct values of `expr` into an ARRAY, including null values, with the default limit on aggregation size of 1024 bytes per aggregate. `ORDER BY` on the `ARRAY_AGG` expression is not currently supported.|
|`ARRAY_AGG(expr, maxSizeBytes)`|Collects all values of `expr` into an ARRAY, including null values, with specified maximum byte size per aggregate. `ORDER BY` on the `ARRAY_AGG` expression is not currently supported.|
|`ARRAY_AGG(DISTINCT expr, maxSizeBytes)`|Collects all distinct values of `expr` into an ARRAY, including null values, with specified maximum byte size per aggregate. `ORDER BY` on the `ARRAY_AGG` expression is not currently supported.|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe better to consolidate these into 2, such as ARRAY_AGG(expr) and ARRAY_AGG(DISTINCT expr) with an optional size parameter as in https://github.com/apache/druid/pull/11157/files#diff-cada0a48f60f548ce82ef1008918e0aed6b3de0339312c662761855a35dde6c6L339.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 changed

@Override
public SqlAggFunction calciteFunction()
{
return new ArrayAggFunction();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: is it worth making it as a singleton?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed

final String initialvalue;
final ValueType elementType;
final ValueType druidType = Calcites.getValueTypeForRelDataTypeFull(aggregateCall.getType());
switch (druidType) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can throw NPE since druidType can be null.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, modified to default to string if null

return null;
}

Integer maxSizeBytes = null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think probably most people just use the byte size for this function, but it seems that it can be a string literal which will allow using the HumanReadable format?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah good point. I was just matching other sketch aggregators, I think it makes sense to be able to handle it, but maybe it makes sense to do this to all of them in a follow-up, so they are consistent?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

@jihoonson jihoonson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but please address my comment on docs before you merge.

return null;
}

Integer maxSizeBytes = null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -353,6 +353,8 @@ Only the COUNT aggregation can accept DISTINCT.
|`ANY_VALUE(expr)`|Returns any value of `expr` including null. `expr` must be numeric. This aggregator can simplify and optimize the performance by returning the first encountered value (including null)|
|`ANY_VALUE(expr, maxBytesPerString)`|Like `ANY_VALUE(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|
|`GROUPING(expr, expr...)`|Returns a number to indicate which groupBy dimension is included in a row, when using `GROUPING SETS`. Refer to [additional documentation](aggregations.md#grouping-aggregator) on how to infer this number.|
|`ARRAY_AGG(expr, [size])`|Collects all values of `expr` into an ARRAY, including null values, with `size` in bytes limit on aggregation size (default of 1024 bytes). Use of `ORDER BY` within the `ARRAY_AGG` expression is not currently supported, and the ordering of results within the output array may vary depending on processing order.|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please document that this function can return null when it processes no rows?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍, but I will do that in #11188, which adds that documentation for all aggregator functions

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good 👍

@clintropolis
Copy link
Member Author

thanks for review @jihoonson

@clintropolis clintropolis merged commit 554f1ff into apache:master May 4, 2021
@clintropolis clintropolis deleted the array_agg branch May 4, 2021 05:17
@clintropolis clintropolis added this to the 0.22.0 milestone Aug 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants