groupBy sorting behaves not as expected with granularity != 'all' #1926

Closed
vogievetsky opened this issue Nov 6, 2015 · 19 comments

@vogievetsky
Contributor

I ran this query:

{
  "queryType": "groupBy",
  "dataSource": "wikipedia",
  "intervals": [
    "2015-08-14/2015-08-15"
  ],
  "granularity": {
    "type": "period",
    "period": "PT1H",
    "timeZone": "Etc/UTC"
  },
  "aggregations": [
    {
      "name": "Count",
      "type": "count"
    }
  ],
  "dimensions": ["newPage"],
  "limitSpec": {
    "type": "default",
    "limit": 7,
    "columns": [
      {
        "dimension": "Count",
        "direction": "descending"
      }
    ]
  }
}

and got this result:

[ {
  "version" : "v1",
  "timestamp" : "2015-08-14T00:00:00.000Z",
  "event" : {
    "newPage" : "false",
    "Count" : 8609
  }
}, {
  "version" : "v1",
  "timestamp" : "2015-08-14T00:00:00.000Z",
  "event" : {
    "newPage" : "true",
    "Count" : 478
  }
}, {
  "version" : "v1",
  "timestamp" : "2015-08-14T01:00:00.000Z",
  "event" : {
    "newPage" : "false",
    "Count" : 7439
  }
}, {
  "version" : "v1",
  "timestamp" : "2015-08-14T01:00:00.000Z",
  "event" : {
    "newPage" : "true",
    "Count" : 378
  }
}, {
  "version" : "v1",
  "timestamp" : "2015-08-14T02:00:00.000Z",
  "event" : {
    "newPage" : "false",
    "Count" : 6537
  }
}, {
  "version" : "v1",
  "timestamp" : "2015-08-14T02:00:00.000Z",
  "event" : {
    "newPage" : "true",
    "Count" : 327
  }
}, {
  "version" : "v1",
  "timestamp" : "2015-08-14T03:00:00.000Z",
  "event" : {
    "newPage" : "false",
    "Count" : 5061
  }
} ]

It seems like the results from the different granularity buckets (two per hour in this case) are concatenated together and never re-sorted. The limit is then applied to the entire (non-sorted) list, making it useless.

The workaround (for anyone interested), and what I expected to happen, is:

{
  "queryType": "groupBy",
  "dataSource": "wikipedia",
  "intervals": [
    "2015-08-14/2015-08-15"
  ],
  "granularity": "all",
  "aggregations": [
    {
      "name": "Count",
      "type": "count"
    }
  ],
  "dimensions": [
    {
      "type": "extraction",
      "dimension": "__time",
      "outputName": "timestamp",
      "extractionFn" : {
        "type" : "timeFormat",
        "format" : "yyyy-MM-dd'T'HH':00:00Z",
        "timeZone" : "Etc/UTC",
        "locale" : "en"
      }
    },
    "newPage"
  ],
  "limitSpec": {
    "type": "default",
    "limit": 7,
    "columns": [
      {
        "dimension": "Count",
        "direction": "descending"
      }
    ]
  }
}

There are actually two separate uses for groupBy that are being muddled up here.

  1. Granularity functions more like it does in topN: a separate groupBy is returned per granularity bucket, each with its own limit, in a nested JSON data structure.
  2. Granularity functions just like another dimension in the dimensionSpec (which is what I was trying to do here).

My proposed solutions to this issue:

  • Add a re-sorting step when granularity != 'all'.
  • Add a flag that lets the user get a topN-like groupBy result, with one group per granularity bucket, each with its own limit (solves 1).
  • Add a dimension extractionFn so that the user can use the GranularitySpec syntax to bucket in the dimensions list, avoiding the hack used above (solves 2; see the sketch below).
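
As it turns out, later versions of Druid added an optional granularity field to the timeFormat extraction function, which is close to what the third bullet asks for. A minimal sketch, assuming a Druid version that supports that field:

{
  "type": "extraction",
  "dimension": "__time",
  "outputName": "timestamp",
  "extractionFn": {
    "type": "timeFormat",
    "format": "yyyy-MM-dd'T'HH':00:00Z",
    "granularity": { "type": "period", "period": "PT1H", "timeZone": "Etc/UTC" },
    "timeZone": "Etc/UTC",
    "locale": "en"
  }
}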
@drcrallen
Contributor

This looks related to #701 and https://groups.google.com/d/msg/druid-development/BSpDmAq-7Jk/_W3q00kkwMIJ.
Can you confirm or refute whether these look similar?

@gianm
Contributor

gianm commented Nov 6, 2015

Fwiw, the way groupBy granularity and limitSpec are supposed to work now is that results are first grouped within their granular buckets (so, one set of groupings for each day if you have granularity "day"), then sorted within their granular buckets, and then the limit is applied to the global list of results, starting from the first granular bucket.

Whether or not this is useful behavior is left as an exercise to the reader.

I agree that it seems weird that the sorting is done within granular buckets, but the limiting is done globally. It would probably make more sense to do both within the buckets or both globally. If they're both local then it acts like your (1). If they're both global it acts like your (2).

@vogievetsky
Contributor Author

@drcrallen I read those issues and they do not look related, but I might be missing something. I am aware of how zero-filling works, and it is not in play here because there is data in all of those buckets.

I think the root problem is that groupBy returns results in its own special way. groupBy is the only query type that uses the key 'event' in its return. groupBy tries to de-nest itself (unlike topN) but does not go all the way, since it forgets to re-sort.

@vogievetsky
Contributor Author

> Whether or not this is useful behavior is left as an exercise to the reader.

It isn't.

@vogievetsky
Contributor Author

For (1), it would also be nice to change the flatness of groupBy and get a nested data structure like topN's, maybe behind a flag?
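
For comparison, topN already returns one nested entry per time bucket; a flagged groupBy could mirror that shape. A hedged sketch only (not an existing Druid response format), using the counts from the result above:

[ {
  "timestamp" : "2015-08-14T00:00:00.000Z",
  "result" : [
    { "newPage" : "false", "Count" : 8609 },
    { "newPage" : "true", "Count" : 478 }
  ]
}, {
  "timestamp" : "2015-08-14T01:00:00.000Z",
  "result" : [
    { "newPage" : "false", "Count" : 7439 },
    { "newPage" : "true", "Count" : 378 }
  ]
} ]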

@gianm
Contributor

gianm commented Nov 10, 2015

Related: #1780

@shanemt

shanemt commented Feb 9, 2016

+1

4 similar comments
@gesundkrank

+1

@lkokhreidze

+1

@fedecasanova72

+1

@jkukul
Contributor

jkukul commented Aug 10, 2016

+1

@julianhyde

-1 to nested data structures; relational is easier to consume

@gianm
Contributor

gianm commented Feb 9, 2018

Fwiw, when you group by FLOOR(__time TO X), Druid SQL will run a query similar to @vogievetsky's suggested workaround. The reason is that this is closer to what SQL semantics would expect.
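
A sketch of what such a SQL query might look like for the data in this issue, assuming a "wikipedia" table is available through Druid SQL:

SELECT
  FLOOR(__time TO HOUR) AS "hour",
  "newPage",
  COUNT(*) AS "Count"
FROM "wikipedia"
WHERE __time >= TIMESTAMP '2015-08-14 00:00:00'
  AND __time < TIMESTAMP '2015-08-15 00:00:00'
GROUP BY 1, 2
ORDER BY "Count" DESC
LIMIT 7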

@stale

stale bot commented Jun 21, 2019

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 2 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

@stale stale bot added the stale label Jun 21, 2019
@stale

stale bot commented Jul 5, 2019

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.

@stale stale bot closed this as completed Jul 5, 2019
@gianm
Contributor

gianm commented Jul 5, 2019

For anyone watching, I'd suggest using Druid SQL, since it has the behavior you would expect. (Under the hood it will issue a granularity 'all' groupBy query and then include a timestamp_floor expression as one of the dimensions)
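
In native query terms, that plan looks roughly like the sketch below. timestamp_floor is a Druid expression function, but the exact native query that Druid SQL generates may differ:

{
  "queryType": "groupBy",
  "dataSource": "wikipedia",
  "intervals": ["2015-08-14/2015-08-15"],
  "granularity": "all",
  "virtualColumns": [
    {
      "type": "expression",
      "name": "hour",
      "expression": "timestamp_floor(\"__time\", 'PT1H')",
      "outputType": "LONG"
    }
  ],
  "dimensions": [
    { "type": "default", "dimension": "hour", "outputType": "LONG" },
    "newPage"
  ],
  "aggregations": [{ "type": "count", "name": "Count" }],
  "limitSpec": {
    "type": "default",
    "limit": 7,
    "columns": [{ "dimension": "Count", "direction": "descending" }]
  }
}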

seoeun25 pushed a commit to seoeun25/incubator-druid that referenced this issue Jan 10, 2020
@alflennik

@gianm Thanks for the suggestion about using granularity "all" in conjunction with an extraction function on the __time dimension. It really works, but I noticed (in my version of Druid, at least) that it is a lot slower.

Version 1: "Basic Granularity"

  "granularity": "day", 
  "dimensions": []

Version 2: "JS Dimension"

"granularity": "all",
"dimensions": [{
  "type": "extraction",
  "dimension": "__time",
  "outputName": "day",
  "outputType": "STRING",
  "extractionFn": {
    "type": "javascript",
    "function": "function (milliseconds) { return new Date(milliseconds).toISOString().replace(/..:..:......Z/, '00:00:00.000Z') }"
  }
}]

Version 3: "Time Format Dimension"

"granularity": "all",
"dimensions": [{
  "type": "extraction",
  "dimension": "__time",
  "outputName": "day",
  "extractionFn": {
    "type": "timeFormat",
    "format": "yyyy-MM-dd"
  }
}]

Average query durations (in ms):

Dataset       | Basic Granularity | JS Dimension | Time Format Dimension
Small-size    | 1230              | 2430         | 2570
Medium-size   | 3370              | 6940         | 11780
Large-size    | 21090             | 51090        | 50890

Is this an inefficiency only present in older versions of Druid (v0.9.2)? Am I doing something wrong? Or is it that the native granularities are pre-optimized in a way that I can't match with manually implemented granularities?

Unfortunately, this performance penalty is serious enough that the workaround doesn't quite qualify as a solution for my use case.

@gianm
Contributor

gianm commented Jul 1, 2020

Hi @alflennik, the native granularities are indeed somewhat more optimized as of this writing (the latest Druid version is 0.18). This is something I expect will change in the future as we bring more optimization to the expression system.

@alflennik

@gianm Thanks for the reply! After thinking on it a while, I figured out a good way in my system to make a time-based dimension available for the unusual situations that demand it, while relying on the native granularities 99% of the time.
