Skip to content

Comments

Aggregate functions on string columns#17192

Closed
GWphua wants to merge 34 commits intoapache:masterfrom
GWphua:aggregate-functions-on-string-columns
Closed

Aggregate functions on string columns#17192
GWphua wants to merge 34 commits intoapache:masterfrom
GWphua:aggregate-functions-on-string-columns

Conversation

@GWphua
Copy link
Contributor

@GWphua GWphua commented Sep 30, 2024

Fixes #16956.

Description

This PR introduces min() and max() aggregators for string columns: StringMinAggregator and StringMaxAggregator, and their respective buffer and vector aggregators. These aggregators compute the minimum and maximum string values respectively in a column during query execution.

Throughout the implementation process, I had some difficulty understand some of the below, and would appreciate extra care in the reviewing process for:

  • Usage of expression in the aggregator factories.
  • Use cases of SingleValueDimensionVectorSelector and MultiValueDimensionVectorSelector for vector aggregation.
  • VectorAggregator classes in general.
  • Should I mention my aggregators under moving-average-query.md?

Regarding the issue #11659 that is linked to #16956, I did some testing, and while my string aggregators can run on datasources:

SELECT min("user")
FROM "wikipedia"

Results:

*feridiák

Running the following will still not work.

SELECT
  min("start"), max("end")
FROM sys.segments

Results:

Error: RUNTIME_FAILURE (OPERATOR)
java.lang.reflect.InvocationTargetException
java.lang.RuntimeException

With that out of the way, let's go into the description of my implementations.


Add StringMin and StringMax Aggregators

  • Implemented StringMinAggregator to find the minimum string value in a column.
  • Implemented StringMaxAggregator to find the maximum string value in a column.
  • Implemented StringMinBufferAggregator and StringMaxBufferAggregator for efficient in-memory aggregation.
  • Implemented StringMinVectorAggregator and StringMaxVectorAggregator for vectorized query execution.

Setting up Aggregator Factory classes

  • Implemented StringMinAggregatorFactory and StringMaxAggregatorFactory to create the respective aggregators.
  • Allow creation of StringMinAggregatorFactory and StringMaxAggregatorFactory under createMinAggregatorFactory and createMaxAggregatorFactory respectively.
  • Change logic of Aggregations#getArgumentsForSimpleAggregator to accommodate the new String aggregation functions.
  • Add new cache key IDs under AggregatorUtils
  • Add the new aggregators as modules to AggregatorsModule.

Unit Tests

  • Added unit tests for all above classes, except VectorAggregators.
  • Added the new aggregator factory classes in AggregatorFactoryTest.
  • Remove AGGREGATION_NOT_SUPPORT_TYPE annotations under DrillWindowQueryTest to support testing of new aggregators.
  • Add new tests under DrillWindowQueryTest

Release note

New: You can now query string columns with min() and max() functions.


Key changed/added classes in this PR
  • StringMinAggregator
  • StringMaxAggregator
  • StringMinBufferAggregator
  • StringMaxBufferAggregator
  • StringMinVectorAggregator
  • StringMaxVectorAggregator
  • StringMinAggregatorFactory
  • StringMaxAggregatorFactory
  • StringMinAggregatorTest
  • StringMaxAggregatorTest
  • StringMinBufferAggregatorTest
  • StringMaxBufferAggregatorTest
  • StringMinVectorAggregatorTest
  • StringMaxVectorAggregatorTest
  • StringMinAggregatorFactoryTest
  • StringMaxAggregatorFactoryTest
  • aggregations.md

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Add StringMinBufferAggregator

Settle all fields in StringMinAggregatorFactory

Add min aggregator to Guice

Fix combine logic in aggregator

Fix byte logic in bufferAggregator

Make Factory Java8 compatible

Add String Aggregator to AggregationType

Implement methods to pass Unit AggregatorFactoryTest

Implement superclass for StringAggregatorFactory

Remove re-implementation of superclass

Remove duplicated implementation of factorize()

Finish StringMinBufferAggregator

Add limit to number of bytes

Add and run unit tests for StringMinBufferAggregator

Add MinAggregationFactory methods

Add StringMinVectorAggregator

Change AggregatorFactoryTest declarations

Fix checkstyle
d a
e a
null a
null null
Copy link
Contributor Author

@GWphua GWphua Oct 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to justify my changes to the expected test result for test cases woutPrtnBy_6 and woutPrtnBy_7.

Taking woutPrtnBy_6 for example, the query is as follows:

SELECT c2, MIN(MIN(c2)) OVER( ORDER BY c2 ) min_c2 
FROM "tblWnulls.parquet" 
GROUP BY c2

the original expected and actual results are as follows:

2024-10-02T08:47:04,748 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_6] org.apache.druid.sql.calcite.BaseCalciteQueryTest - -- Expected results --
new Object[]{"a", "a"},
new Object[]{"b", "a"},
new Object[]{"c", "a"},
new Object[]{"d", "a"},
new Object[]{"e", "a"},
new Object[]{null, "a"}
----
2024-10-02T08:47:04,748 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_6] org.apache.druid.sql.calcite.BaseCalciteQueryTest - -- Actual results --
new Object[]{"a", "a"},
new Object[]{"b", "a"},
new Object[]{"c", "a"},
new Object[]{"d", "a"},
new Object[]{"e", "a"},
new Object[]{null, null}
----

Turns out that either one of OVER(ORDER BY c2) or GROUP BY will a partial-result of the following order: [null, a, b, c, d, e]. As such, our MinStringAggregator#compare will be iterating strings as follows:

compare(null, null) --> null
compare(null, 'a') --> 'a'
compare('a', 'b') --> 'a'
compare('a', 'c') --> 'a'
compare('a', 'd') --> 'a'
compare('a', 'e') --> 'a'

As we can see, due to the order in which compare() runs, we are not able to do away with null. This order of things also happened for woutPrtnBy_7, which uses MAX instead of MIN.

While I continued to work and experimented with things, I realised that this order is not only limited to my new StringMin implementation. A tweak to woutPrtnBy_2 shows the following results:

Original Query:

SELECT c2, MIN(MAX(c1)) OVER( ORDER BY c2 ) min_mx_c1 FROM "tblWnulls.parquet" GROUP BY c2

Changed Query:

SELECT c1, MIN(MAX(c1)) OVER( ORDER BY c1 ) min_mx_c1 FROM "tblWnulls.parquet" GROUP BY c1

Results:

2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #0: [null, null]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #1: [-1, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #2: [0, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #3: [1, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #4: [2, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #5: [4, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #6: [5, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #7: [6, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #8: [8, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #9: [9, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #10: [10, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #11: [11, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #12: [12, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #13: [13, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #14: [14, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #15: [15, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #16: [17, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #17: [19, -1]
2024-10-02T08:43:55,869 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #18: [11111, -1]
2024-10-02T08:43:55,869 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #19: [65536, -1]
2024-10-02T08:43:55,869 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #20: [1000000, -1]
2024-10-02T08:43:55,869 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #21: [2147483647, -1]

java.lang.RuntimeException: Can't parse input!

Although the test ran into a runtime exception, we can see that the results that are logged shows the presence of [null, null]. Hence, this means that the partial query generated in this case also does not follow a null last comparator logic.

Seeing that the behaviour is the same across different ColumnTypes, I think that fixing this bug (if do we see it as one) will require another PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adarshsanjeev
Copy link
Contributor

Thanks for making the PR! I'll try to review this soon.

@GWphua
Copy link
Contributor Author

GWphua commented Nov 4, 2024

Hello @adarshsanjeev, would like to ping and see how's it going 😄

@github-actions
Copy link

github-actions bot commented Jan 4, 2025

This pull request has been marked as stale due to 60 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If you think
that's incorrect or this pull request should instead be reviewed, please simply
write any comment. Even if closed, you can still revive the PR at any time or
discuss it on the dev@druid.apache.org list.
Thank you for your contributions.

@github-actions github-actions bot added the stale label Jan 4, 2025
@github-actions github-actions bot removed the stale label Jan 23, 2025
@github-actions
Copy link

This pull request has been marked as stale due to 60 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If you think
that's incorrect or this pull request should instead be reviewed, please simply
write any comment. Even if closed, you can still revive the PR at any time or
discuss it on the dev@druid.apache.org list.
Thank you for your contributions.

@github-actions github-actions bot added the stale label Mar 24, 2025
@github-actions
Copy link

This pull request/issue has been closed due to lack of activity. If you think that
is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Apr 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Min() and Max() aggregate functions on string columns

3 participants