SQL support for nested groupBys. #3806

Merged
merged 6 commits into from Jan 12, 2017

Projects

None yet

5 participants

@gianm
Member
gianm commented Dec 25, 2016

Allows, for example, doing exact count distinct by writing:

SELECT COUNT(*) FROM (SELECT DISTINCT col FROM druid.foo)

Contrast with approximate count distinct, which is:

SELECT COUNT(DISTINCT col) FROM druid.foo
@fjy fjy added this to the 0.10.0 milestone Dec 26, 2016
@gianm gianm SQL support for nested groupBys.
Allows, for example, doing exact count distinct by writing:

  SELECT COUNT(*) FROM (SELECT DISTINCT col FROM druid.foo)

Contrast with approximate count distinct, which is:

  SELECT COUNT(DISTINCT col) FROM druid.foo
6798c01
+
+- `COUNT(DISTINCT col)` aggregations use [HyperLogLog](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf), a
+fast approximate distinct counting algorithm. If you need exact distinct counts, you can instead use
+`SELECT COUNT(*) FROM (SELECT DISTINCT col FROM druid.foo)`, which will use a slower and more resource intensive exact
@nishantmonu51
nishantmonu51 Dec 31, 2016 Member

would be nice if there can be a flag where, Count(Distinct col) can also be executed using exact algo, instead of expecting the user to write a nested query instead.

@fjy
fjy Jan 2, 2017 Member

👍 on that suggestion

@gianm
gianm Jan 2, 2017 edited Member

That would be a nice feature, but imo it should be a different PR.

@jihoonson
jihoonson Jan 4, 2017 Contributor

I also prefer this approach. Different behavior depending on query structure can make users confused.

@gianm
gianm Jan 4, 2017 Member

Fair enough, I agree that would be cool, but I don't think it makes sense to change DISTINCT aggs in this PR. All this PR is doing is adding the nested query feature, it's not making any changes to how DISTINCT aggs are handled.

@jihoonson
jihoonson Jan 4, 2017 Contributor

Ok. I'm reviewing the patch.

@fjy
Member
fjy commented Jan 3, 2017

👍

@jihoonson

@gianm, this patch looks good. I left some comments.
I additionally tested the following double nested group by query using CalciteQueryTest, and found it doesn't finish. Is this query not covered in this issue?

@Test
  public void testRecursivelyNestedGroupby() throws Exception
  {
    testQuery(
        "select sum(cnt), count(*) from (select dim2, sum(t1.cnt) cnt from (select dim1, dim2, count(*) cnt from druid.foo group by dim1, dim2) t1 group by dim2) t2",
        null,
        ImmutableList.of(
            new Object[]{6L, 3L}
        )
    );
  }
+
+ final TimeseriesQuery timeseriesQuery = queryBuilder.toTimeseriesQuery(dataSource, sourceRowSignature);
+ if (timeseriesQuery != null) {
+ executeTimeseries(queryBuilder, timeseriesQuery, sink);
@jihoonson
jihoonson Jan 4, 2017 edited Contributor

I think it would be better if we are able to know which operator creates the data source ahead. But, I know this accumulate() method is just moved from DruidQueryBuilder with little changes, and adding query types will involve a lot of additional changes. Do you have any plan for this?

@gianm
gianm Jan 4, 2017 edited Member

I thought of having just one toQuery method, but the problem with that is then when we want to execute the query, we need to pair it with the correct execution strategy for that query (select needs to issue multiple queries for pagination, all query types have different result formats, etc). So that's why accumulate checks each possible query type individually.

I don't have a plan for changing this but I am open to change.

@jihoonson
jihoonson Jan 4, 2017 Contributor

Ok. We can change later if it needs.

+ if (druidRel.getQueryBuilder().getSelectProjection() != null
+ || druidRel.getQueryBuilder().getGrouping() != null
+ || druidRel.getQueryBuilder().getLimitSpec() != null) {
+ return;
@jihoonson
jihoonson Jan 4, 2017 Contributor

How about implementing public boolean matches(RelOptRuleCall call)? I think this will be a better approach for avoiding not-matched rules early.
I know these rule classes are just moved to here, but it will be good if we can improve them.

@gianm
gianm Jan 4, 2017 Member

Sounds good, I'll do that. I didn't think of this before.

+ }
+
+ if (queryBuilder.getGrouping() != null) {
+ cost *= 0.5;
@jihoonson
jihoonson Jan 4, 2017 Contributor

How about making these constants as static variables?

@gianm
gianm Jan 4, 2017 Member

Sure thing.

@gianm
Member
gianm commented Jan 4, 2017 edited

@jihoonson, on your test testRecursivelyNestedGroupby, that didn't work because it needs more merge buffers than the test allows, and since the merge buffer pool is blocking, the test blocks forever. I pushed a commit that adds your test, bumps up the test merge buffer pool to 3, adds a maxQueryCount config, and adds docs warning users about this potential deadlock in prod.

The new doc blurb is:

Note that groupBys require a separate merge buffer on the broker for each layer beyond the first layer of the groupBy. With the v2 groupBy strategy, this can potentially lead to deadlocks for groupBys nested beyond two layers, since the merge buffers are limited in number and are acquired one-by-one and not as a complete set. At this time we recommend that you avoid deeply-nested groupBys with the v2 strategy. Doubly-nested groupBys (groupBy -> groupBy -> table) are safe and do not suffer from this issue. If you like, you can forbid deeper nesting by setting druid.sql.planner.maxQueryCount = 2.

@gianm
Member
gianm commented Jan 4, 2017

@jihoonson, I just pushed commits for the rest of your comments, please let me know what you think. thanks for the review.

@gianm
Member
gianm commented Jan 4, 2017

Raised #3819 for the deeply nested groupBy thing.

@gianm gianm closed this Jan 4, 2017
@gianm gianm reopened this Jan 4, 2017
@jihoonson
Contributor

Thanks! The latest patch looks good to me.

@gianm gianm assigned fjy and jon-wei and unassigned jon-wei Jan 5, 2017
@gianm
Member
gianm commented Jan 6, 2017

Resolved conflicts.

@gianm gianm closed this Jan 7, 2017
@gianm gianm reopened this Jan 7, 2017
@jon-wei
Member
jon-wei commented Jan 12, 2017

👍

@jon-wei jon-wei merged commit e86859b into druid-io:master Jan 12, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment