Enable multiple distinct aggregators in same query #11014

abhishekagarwal87 · 2021-03-19T07:44:09Z

Description

Running queries with multiple exact distinct aggregations requires enabling the Calcite rule AggregateExpandDistinctAggregatesRule.INSTANCE, which is currently not enabled. Instead, the AggregateExpandDistinctAggregatesRule.JOIN rule is used, which plans queries with multiple distinct aggregations as a join query with a join condition of type IS_NOT_DISTINCT_FROM. However, Druid supports only equality conditions in joins. The AggregateExpandDistinctAggregatesRule.INSTANCE rule, on the other hand, uses the grouping aggregator, which was added recently, to run queries with distinct aggregations.

With AggregateExpandDistinctAggregatesRule.INSTANCE enabled, query planning completes fine, but query execution then fails. This is due to a bug in how a group by query procures merge buffers: Druid undercounts the required merge buffers when there is a nested query and the subquery has subtotals.

This patch fixes the logic that computes the required merge buffers. Additionally, a flag has been added to switch between the old and new behavior.
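As a rough illustration of the per-query override, the sketch below builds a SQL query context that turns off approximate count distinct and turns on the grouping-set rewrite. The context key names are an assumption: they mirror the suffixes of the `druid.sql.planner.*` properties shown in the docs diff of this PR. The class name, table, and columns are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Druid: builds a query context for exact distinct queries.
final class ExactDistinctContextSketch {
  // Assumed context key names, mirroring the planner property suffixes.
  static final String USE_APPROXIMATE_COUNT_DISTINCT = "useApproximateCountDistinct";
  static final String USE_GROUPING_SET_FOR_EXACT_DISTINCT = "useGroupingSetForExactDistinct";

  static Map<String, Object> exactDistinctContext() {
    Map<String, Object> context = new LinkedHashMap<>();
    // The grouping-set flag is only relevant when approximate count distinct is disabled.
    context.put(USE_APPROXIMATE_COUNT_DISTINCT, false);
    // Ask the planner to rewrite exact distinct aggregations using grouping sets.
    context.put(USE_GROUPING_SET_FOR_EXACT_DISTINCT, true);
    return context;
  }

  public static void main(String[] args) {
    // Hypothetical table and column names, with two exact distinct aggregations.
    String sql = "SELECT COUNT(DISTINCT dim1), COUNT(DISTINCT dim2) FROM my_table";
    System.out.println(sql);
    System.out.println(exactDistinctContext());
  }
}
```

The resulting map would be sent as the `context` of the SQL request alongside the query text.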

There is still a group of queries that will fail:

A nested group by query in which the subquery has subtotals and the query runs on the broker itself. Say merging results needs 3 buffers, merging runners needs 1 buffer, and the merge buffer pool size on the broker is 3. The query will be accepted since both checks are done independently. But when you run the query, it will block since there are only 2 buffers left for merging results. This is a hypothesis; I haven't verified it with a run. This is existing behavior and is not introduced by this PR.
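The arithmetic in this hypothesis can be sketched with a semaphore standing in for the merge buffer pool. This is only a model of the scenario described above, not Druid's actual admission or buffer acquisition code; the class and method names are hypothetical.

```java
import java.util.concurrent.Semaphore;

// Hypothetical model of the scenario; does not use any Druid classes.
final class MergeBufferAdmissionSketch {
  // Each requirement is compared with the pool size on its own.
  static boolean passesIndependentChecks(int poolSize, int forResults, int forRunners) {
    return forResults <= poolSize && forRunners <= poolSize;
  }

  // Returns true only if both acquisitions succeed without waiting.
  static boolean canRunWithoutBlocking(int poolSize, int forResults, int forRunners) {
    Semaphore pool = new Semaphore(poolSize);
    // Merging runners take their buffers first.
    boolean runnersOk = pool.tryAcquire(forRunners);
    // Merging results then needs its buffers from whatever is left.
    return runnersOk && pool.tryAcquire(forResults);
  }

  public static void main(String[] args) {
    // Pool of 3, merging results needs 3, merging runners needs 1.
    System.out.println(passesIndependentChecks(3, 3, 1)); // true: the query is accepted
    System.out.println(canRunWithoutBlocking(3, 3, 1));   // false: only 2 buffers remain
    System.out.println(canRunWithoutBlocking(4, 3, 1));   // true: a pool of 4 is enough
  }
}
```

In this model, a combined check (`forResults + forRunners <= poolSize`) would reject the query up front instead of letting it block.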

This PR has:

been self-reviewed.
added documentation for new or modified features or behaviors.
added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
been tested in a test Druid cluster.

clintropolis · 2021-03-26T23:31:25Z

docs/configuration/index.md

@@ -1675,6 +1675,7 @@ The Druid SQL server is configured through the following properties on the Broke
 |`druid.sql.planner.maxTopNLimit`|Maximum threshold for a [TopN query](../querying/topnquery.md). Higher limits will be planned as [GroupBy queries](../querying/groupbyquery.md) instead.|100000|
 |`druid.sql.planner.metadataRefreshPeriod`|Throttle for metadata refreshes.|PT1M|
 |`druid.sql.planner.useApproximateCountDistinct`|Whether to use an approximate cardinality algorithm for `COUNT(DISTINCT foo)`.|true|
+|`druid.sql.planner.useGroupingSetForExactDistinct`|Only relevant when `useApproximateCountDistinct` is disabled. If set to true, exact distinct queries are re-written using grouping sets. Otherwise, exact distinct queries are re-written using joins. This should be set to true for group by query with multiple exact distinct aggregations. This flag can be overridden per query.|false|


naively this seems better maybe than using joins... is the reason to make it false by default in case there are any regressions I guess? I only ask because things that are cool, but off by default tend to take a long time to make it to being turned on, if ever.

I didn't want to accidentally break any queries that are running already. At least for the backports, we do want to keep it off by default but maybe turn it on for new releases?

I think it would be ok if we could leave off by default for next release, and maybe consider turning on in the release after

clintropolis · 2021-03-26T23:35:56Z

processing/src/test/java/org/apache/druid/query/groupby/GroupByQueryMergeBufferTest.java

    // This should be 0 because the broker needs 2 buffers and the queryable node needs one.
-    Assert.assertEquals(0, MERGE_BUFFER_POOL.getMinRemainBufferNum());
-    Assert.assertEquals(3, MERGE_BUFFER_POOL.getPoolSize());
+    Assert.assertEquals(1, MERGE_BUFFER_POOL.getMinRemainBufferNum());
+    Assert.assertEquals(4, MERGE_BUFFER_POOL.getPoolSize());


nit: this comment isn't accurate anymore

clintropolis · 2021-03-26T23:39:31Z

sql/src/test/java/org/apache/druid/sql/calcite/CalciteQueryTest.java

@@ -131,6 +131,8 @@
 import java.util.Map;
 import java.util.stream.Collectors;

+import static org.apache.druid.sql.calcite.planner.PlannerConfig.CTX_KEY_USE_GROUPING_SET_FOR_EXACT_DISTINCT;


iirc, I think we typically prefer to not use static imports, I know this is enforced in some places, but maybe not in test code because of some junit stuffs?

clintropolis

👍

abhishekagarwal87 added 2 commits March 19, 2021 13:04

Enable multiple distinct count

4745f52

Add more tests

6860bc9

abhishekagarwal87 changed the title ~~[Draft] Enable multiple distinct aggregators in same query~~ Enable multiple distinct aggregators in same query Mar 19, 2021

abhishekagarwal87 marked this pull request as ready for review March 19, 2021 10:45

abhishekagarwal87 added 2 commits March 19, 2021 21:55

fix sql test

cc95fcf

docs fix

c0f81f7

abhishekagarwal87 added Area - Querying Area - SQL Feature Release Notes labels Mar 25, 2021

clintropolis reviewed Mar 26, 2021

Address nits

061e804

abhishekagarwal87 self-assigned this Apr 6, 2021

clintropolis approved these changes Apr 7, 2021

clintropolis merged commit 0df0bff into apache:master Apr 7, 2021

clintropolis added this to the 0.22.0 milestone Aug 12, 2021

clintropolis mentioned this pull request Sep 3, 2021

[Draft] 0.22.0 Release Notes #11657

Closed
