Aggregate functions on string columns#17192
Conversation
Add StringMinBufferAggregator Settle all fields in StringMinAggregatorFactory Add min aggregator to Guice Fix combine logic in aggregator Fix byte logic in bufferAggregator Make Factory Java8 compatible Add String Aggregator to AggregationType Implement methods to pass Unit AggregatorFactoryTest Implement superclass for StringAggregatorFactory Remove re-implementation of superclass Remove duplicated implementation of factorize() Finish StringMinBufferAggregator Add limit to number of bytes Add and run unit tests for StringMinBufferAggregator Add MinAggregationFactory methods Add StringMinVectorAggregator Change AggregatorFactoryTest declarations Fix checkstyle
processing/src/main/java/org/apache/druid/query/aggregation/SimpleStringAggregatorFactory.java
Fixed
Show fixed
Hide fixed
| d a | ||
| e a | ||
| null a | ||
| null null |
There was a problem hiding this comment.
I would like to justify my changes to the expected test result for test cases woutPrtnBy_6 and woutPrtnBy_7.
Taking woutPrtnBy_6 for example, the query is as follows:
SELECT c2, MIN(MIN(c2)) OVER( ORDER BY c2 ) min_c2
FROM "tblWnulls.parquet"
GROUP BY c2
the original expected and actual results are as follows:
2024-10-02T08:47:04,748 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_6] org.apache.druid.sql.calcite.BaseCalciteQueryTest - -- Expected results --
new Object[]{"a", "a"},
new Object[]{"b", "a"},
new Object[]{"c", "a"},
new Object[]{"d", "a"},
new Object[]{"e", "a"},
new Object[]{null, "a"}
----
2024-10-02T08:47:04,748 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_6] org.apache.druid.sql.calcite.BaseCalciteQueryTest - -- Actual results --
new Object[]{"a", "a"},
new Object[]{"b", "a"},
new Object[]{"c", "a"},
new Object[]{"d", "a"},
new Object[]{"e", "a"},
new Object[]{null, null}
----
Turns out that either one of OVER(ORDER BY c2) or GROUP BY will a partial-result of the following order: [null, a, b, c, d, e]. As such, our MinStringAggregator#compare will be iterating strings as follows:
compare(null, null) --> null
compare(null, 'a') --> 'a'
compare('a', 'b') --> 'a'
compare('a', 'c') --> 'a'
compare('a', 'd') --> 'a'
compare('a', 'e') --> 'a'
As we can see, due to the order in which compare() runs, we are not able to do away with null. This order of things also happened for woutPrtnBy_7, which uses MAX instead of MIN.
While I continued to work and experimented with things, I realised that this order is not only limited to my new StringMin implementation. A tweak to woutPrtnBy_2 shows the following results:
Original Query:
SELECT c2, MIN(MAX(c1)) OVER( ORDER BY c2 ) min_mx_c1 FROM "tblWnulls.parquet" GROUP BY c2
Changed Query:
SELECT c1, MIN(MAX(c1)) OVER( ORDER BY c1 ) min_mx_c1 FROM "tblWnulls.parquet" GROUP BY c1
Results:
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #0: [null, null]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #1: [-1, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #2: [0, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #3: [1, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #4: [2, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #5: [4, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #6: [5, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #7: [6, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #8: [8, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #9: [9, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #10: [10, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #11: [11, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #12: [12, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #13: [13, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #14: [14, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #15: [15, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #16: [17, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #17: [19, -1]
2024-10-02T08:43:55,869 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #18: [11111, -1]
2024-10-02T08:43:55,869 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #19: [65536, -1]
2024-10-02T08:43:55,869 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #20: [1000000, -1]
2024-10-02T08:43:55,869 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #21: [2147483647, -1]
java.lang.RuntimeException: Can't parse input!
Although the test ran into a runtime exception, we can see that the results that are logged shows the presence of [null, null]. Hence, this means that the partial query generated in this case also does not follow a null last comparator logic.
Seeing that the behaviour is the same across different ColumnTypes, I think that fixing this bug (if do we see it as one) will require another PR.
|
Thanks for making the PR! I'll try to review this soon. |
|
Hello @adarshsanjeev, would like to ping and see how's it going 😄 |
|
This pull request has been marked as stale due to 60 days of inactivity. |
|
This pull request has been marked as stale due to 60 days of inactivity. |
|
This pull request/issue has been closed due to lack of activity. If you think that |
Fixes #16956.
Description
This PR introduces min() and max() aggregators for string columns:
StringMinAggregatorandStringMaxAggregator, and their respective buffer and vector aggregators. These aggregators compute the minimum and maximum string values respectively in a column during query execution.Throughout the implementation process, I had some difficulty understand some of the below, and would appreciate extra care in the reviewing process for:
expressionin the aggregator factories.SingleValueDimensionVectorSelectorandMultiValueDimensionVectorSelectorfor vector aggregation.moving-average-query.md?Regarding the issue #11659 that is linked to #16956, I did some testing, and while my string aggregators can run on datasources:
Results:
Running the following will still not work.
Results:
With that out of the way, let's go into the description of my implementations.
Add StringMin and StringMax Aggregators
StringMinAggregatorto find the minimum string value in a column.StringMaxAggregatorto find the maximum string value in a column.StringMinBufferAggregatorandStringMaxBufferAggregatorfor efficient in-memory aggregation.StringMinVectorAggregatorandStringMaxVectorAggregatorfor vectorized query execution.Setting up Aggregator Factory classes
StringMinAggregatorFactoryandStringMaxAggregatorFactoryto create the respective aggregators.StringMinAggregatorFactoryandStringMaxAggregatorFactoryundercreateMinAggregatorFactoryandcreateMaxAggregatorFactoryrespectively.Aggregations#getArgumentsForSimpleAggregatorto accommodate the new String aggregation functions.AggregatorUtilsAggregatorsModule.Unit Tests
AggregatorFactoryTest.AGGREGATION_NOT_SUPPORT_TYPEannotations underDrillWindowQueryTestto support testing of new aggregators.DrillWindowQueryTestRelease note
New: You can now query string columns with min() and max() functions.
Key changed/added classes in this PR
StringMinAggregatorStringMaxAggregatorStringMinBufferAggregatorStringMaxBufferAggregatorStringMinVectorAggregatorStringMaxVectorAggregatorStringMinAggregatorFactoryStringMaxAggregatorFactoryStringMinAggregatorTestStringMaxAggregatorTestStringMinBufferAggregatorTestStringMaxBufferAggregatorTestStringMinVectorAggregatorTestStringMaxVectorAggregatorTestStringMinAggregatorFactoryTestStringMaxAggregatorFactoryTestaggregations.mdThis PR has: