groupBy sorting behaves not as expected with granularity != 'all' #1926

Closed
vogievetsky opened this issue Nov 6, 2015 · 19 comments

@vogievetsky
Contributor

I ran this query:

{
  "queryType": "groupBy",
  "dataSource": "wikipedia",
  "intervals": [
    "2015-08-14/2015-08-15"
  ],
  "granularity": {
    "type": "period",
    "period": "PT1H",
    "timeZone": "Etc/UTC"
  },
  "aggregations": [
    {
      "name": "Count",
      "type": "count"
    }
  ],
  "dimensions": ["newPage"],
  "limitSpec": {
    "type": "default",
    "limit": 7,
    "columns": [
      {
        "dimension": "Count",
        "direction": "descending"
      }
    ]
  }
}

and got this result:

[ {
  "version" : "v1",
  "timestamp" : "2015-08-14T00:00:00.000Z",
  "event" : {
    "newPage" : "false",
    "Count" : 8609
  }
}, {
  "version" : "v1",
  "timestamp" : "2015-08-14T00:00:00.000Z",
  "event" : {
    "newPage" : "true",
    "Count" : 478
  }
}, {
  "version" : "v1",
  "timestamp" : "2015-08-14T01:00:00.000Z",
  "event" : {
    "newPage" : "false",
    "Count" : 7439
  }
}, {
  "version" : "v1",
  "timestamp" : "2015-08-14T01:00:00.000Z",
  "event" : {
    "newPage" : "true",
    "Count" : 378
  }
}, {
  "version" : "v1",
  "timestamp" : "2015-08-14T02:00:00.000Z",
  "event" : {
    "newPage" : "false",
    "Count" : 6537
  }
}, {
  "version" : "v1",
  "timestamp" : "2015-08-14T02:00:00.000Z",
  "event" : {
    "newPage" : "true",
    "Count" : 327
  }
}, {
  "version" : "v1",
  "timestamp" : "2015-08-14T03:00:00.000Z",
  "event" : {
    "newPage" : "false",
    "Count" : 5061
  }
} ]

It seems like the results from the different granularity buckets (two per hour in this case) are concatenated together and never re-sorted. The limit is then applied to the entire (non-sorted) list, making it useless.

The workaround (for anyone interested), and what I expected to happen, is:

{
  "queryType": "groupBy",
  "dataSource": "wikipedia",
  "intervals": [
    "2015-08-14/2015-08-15"
  ],
  "granularity": "all",
  "aggregations": [
    {
      "name": "Count",
      "type": "count"
    }
  ],
  "dimensions": [
    {
      "type": "extraction",
      "dimension": "__time",
      "outputName": "timestamp",
      "extractionFn" : {
        "type" : "timeFormat",
        "format" : "yyyy-MM-dd'T'HH':00:00Z",
        "timeZone" : "Etc/UTC",
        "locale" : "en"
      }
    },
    "newPage"
  ],
  "limitSpec": {
    "type": "default",
    "limit": 7,
    "columns": [
      {
        "dimension": "Count",
        "direction": "descending"
      }
    ]
  }
}

There are actually two separate uses for groupBy that are being muddled up here.

  1. Granularity functions more like it does in topN: a separate groupBy is returned per granularity bucket, each with its own limit, in a nested JSON data structure.
  2. Granularity functions just like another dimension in the dimensionSpec (which is what I was trying to do here).

My proposed solutions to this issue:

  • Add a re-sorting step when granularity != 'all'.
  • Add a flag that lets the user get a topN-like groupBy result, with one group per granularity bucket, each with its own limit (solves 1).
  • Add a dimension extractionFn so that the user can use the GranularitySpec syntax to bucket in the dimensions list, avoiding the hack used above (solves 2; see the sketch below).
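
As it turns out, later versions of Druid added an optional granularity field to the timeFormat extraction function, which is close to what the third bullet asks for. A minimal sketch, assuming a Druid version that supports that field:

{
  "type": "extraction",
  "dimension": "__time",
  "outputName": "timestamp",
  "extractionFn": {
    "type": "timeFormat",
    "format": "yyyy-MM-dd'T'HH':00:00Z",
    "granularity": { "type": "period", "period": "PT1H", "timeZone": "Etc/UTC" },
    "timeZone": "Etc/UTC",
    "locale": "en"
  }
}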
@drcrallen
Contributor

This looks related to #701 and https://groups.google.com/d/msg/druid-development/BSpDmAq-7Jk/_W3q00kkwMIJ.
Can you confirm or refute whether these look similar?

@gianm
Contributor

gianm commented Nov 6, 2015

Fwiw, the way groupBy granularity and limitSpec are supposed to work now is that results are first grouped within their granular buckets (so, one set of groupings for each day if you have granularity "day"), then sorted within their granular buckets, and then the limit is applied to the global list of results, starting from the first granular bucket.

Whether or not this is useful behavior is left as an exercise to the reader.

I agree that it seems weird that the sorting is done within granular buckets, but the limiting is done globally. It would probably make more sense to do both within the buckets or both globally. If they're both local then it acts like your (1). If they're both global it acts like your (2).

@vogievetsky
Contributor Author

@drcrallen I read those issues and they do not look related, but I might be missing something. I am aware of how zero-filling works, and it is not in play here because there is data in all of those buckets.

I think the root problem is that groupBy returns results in its own special way. groupBy is the only query type that uses the key 'event' in its return. groupBy tries to de-nest itself (unlike topN) but does not go all the way, since it forgets to re-sort.

@vogievetsky
Contributor Author

> Whether or not this is useful behavior is left as an exercise to the reader.

It isn't.

@vogievetsky
Contributor Author

For (1), it would also be nice to change the flatness of groupBy and get a nested data structure like topN's, maybe behind a flag?
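
For comparison, topN already returns one nested entry per time bucket; a flagged groupBy could mirror that shape. A hedged sketch only (not an existing Druid response format), using the counts from the result above:

[ {
  "timestamp" : "2015-08-14T00:00:00.000Z",
  "result" : [
    { "newPage" : "false", "Count" : 8609 },
    { "newPage" : "true", "Count" : 478 }
  ]
}, {
  "timestamp" : "2015-08-14T01:00:00.000Z",
  "result" : [
    { "newPage" : "false", "Count" : 7439 },
    { "newPage" : "true", "Count" : 378 }
  ]
} ]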

@gianm
Contributor

gianm commented Nov 10, 2015

Related: #1780

@shanemt

shanemt commented Feb 9, 2016

+1

4 similar comments
@gesundkrank

+1

@lkokhreidze

+1

@fedecasanova72

+1

@jkukul
Contributor

jkukul commented Aug 10, 2016

+1

@julianhyde

-1 to nested data structures; relational is easier to consume

@gianm
Contributor

gianm commented Feb 9, 2018

Fwiw, when you group by FLOOR(__time TO X), Druid SQL will run a query similar to @vogievetsky's suggested workaround. The reason is that this is closer to what SQL semantics would expect.
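
A sketch of what such a SQL query might look like for the data in this issue, assuming a "wikipedia" table is available through Druid SQL:

SELECT
  FLOOR(__time TO HOUR) AS "hour",
  "newPage",
  COUNT(*) AS "Count"
FROM "wikipedia"
WHERE __time >= TIMESTAMP '2015-08-14 00:00:00'
  AND __time < TIMESTAMP '2015-08-15 00:00:00'
GROUP BY 1, 2
ORDER BY "Count" DESC
LIMIT 7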

@stale

stale bot commented Jun 21, 2019

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 2 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

@stale stale bot added the stale label Jun 21, 2019
@stale

stale bot commented Jul 5, 2019

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.

@stale stale bot closed this as completed Jul 5, 2019
@gianm
Contributor

gianm commented Jul 5, 2019

For anyone watching, I'd suggest using Druid SQL, since it has the behavior you would expect. (Under the hood it will issue a granularity 'all' groupBy query and then include a timestamp_floor expression as one of the dimensions)
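
In native query terms, that plan looks roughly like the sketch below. timestamp_floor is a Druid expression function, but the exact native query that Druid SQL generates may differ:

{
  "queryType": "groupBy",
  "dataSource": "wikipedia",
  "intervals": ["2015-08-14/2015-08-15"],
  "granularity": "all",
  "virtualColumns": [
    {
      "type": "expression",
      "name": "hour",
      "expression": "timestamp_floor(\"__time\", 'PT1H')",
      "outputType": "LONG"
    }
  ],
  "dimensions": [
    { "type": "default", "dimension": "hour", "outputType": "LONG" },
    "newPage"
  ],
  "aggregations": [{ "type": "count", "name": "Count" }],
  "limitSpec": {
    "type": "default",
    "limit": 7,
    "columns": [{ "dimension": "Count", "direction": "descending" }]
  }
}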

seoeun25 pushed a commit to seoeun25/incubator-druid that referenced this issue Jan 10, 2020
@alflennik

@gianm Thanks for the suggestion about using granularity "all" in conjunction with an extraction function on the __time dimension. It really works, but I noticed (in my version of Druid, at least) that it is a lot slower.

Version 1: "Basic Granularity"

  "granularity": "day", 
  "dimensions": []

Version 2: "JS Dimension"

"granularity": "all",
"dimensions": [{
  "type": "extraction",
  "dimension": "__time",
  "outputName": "day",
  "outputType": "STRING",
  "extractionFn": {
    "type": "javascript",
    "function": "function (milliseconds) { return new Date(milliseconds).toISOString().replace(/..:..:......Z/, '00:00:00.000Z') }"
  }
}]

Version 3: "Time Format Dimension"

"granularity": "all",
"dimensions": [{
  "type": "extraction",
  "dimension": "__time",
  "outputName": "day",
  "extractionFn": {
    "type": "timeFormat",
    "format": "yyyy-MM-dd"
  }
}]

Average query durations (in ms):

Dataset       | Basic Granularity | JS Dimension | Time Format Dimension
Small-size    | 1230              | 2430         | 2570
Medium-size   | 3370              | 6940         | 11780
Large-size    | 21090             | 51090        | 50890

Is this an inefficiency only present in older versions of Druid (v0.9.2)? Am I doing something wrong? Or is it that the native granularities are pre-optimized in a way that I can't match with manually implemented granularities?

Unfortunately, this performance penalty is serious enough that the workaround doesn't quite qualify as a solution for my use case.

@gianm
Contributor

gianm commented Jul 1, 2020

Hi @alflennik, the native granularities are indeed somewhat more optimized as of this writing (the latest Druid version is 0.18). This is something I expect will change in the future as we bring more optimization to the expression system.

@alflennik

@gianm Thanks for the reply! After thinking on it a while, I figured out a good way in my system to make a time-based dimension available for the unusual situations that demand it, while relying on the native granularities 99% of the time.
