groupBy sorting behaves not as expected with granularity != 'all' #1926
This looks related to #701 and https://groups.google.com/d/msg/druid-development/BSpDmAq-7Jk/_W3q00kkwMIJ
Fwiw, whether or not this is useful behavior is left as an exercise to the reader. I agree that it seems weird that the sorting is done within granular buckets while the limiting is done globally. It would probably make more sense to do both within the buckets or both globally. If they're both local, it acts like your (1); if they're both global, it acts like your (2).
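The three combinations can be sketched with toy data (plain Python with hypothetical rows, purely for illustration; this is not Druid's actual implementation):

```python
# Toy model: groupBy results arrive as per-granularity buckets of
# (dimension_value, metric) rows. "Sorting" here means by metric, descending.
buckets = [
    [("a", 5), ("b", 3)],    # hour 1
    [("c", 10), ("d", 1)],   # hour 2
]
limit = 2

def desc(rows):
    """Sort rows by metric, descending."""
    return sorted(rows, key=lambda r: -r[1])

# What the issue describes: sort within each bucket, then limit globally.
current = [r for b in buckets for r in desc(b)][:limit]

# Consistent option (1): sort AND limit within each bucket.
per_bucket = [r for b in buckets for r in desc(b)[:limit]]

# Consistent option (2): sort AND limit across all buckets.
overall = desc([r for b in buckets for r in b])[:limit]

print(current)  # [('a', 5), ('b', 3)] -- the global top row ('c', 10) is dropped
print(overall)  # [('c', 10), ('a', 5)]
```

The mixed behavior truncates the concatenation of sorted buckets, so a later bucket's top row can fall outside the limit even when it is the largest value overall.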
@drcrallen I read those issues and they do not look related, but I might be missing something. I am aware of how zero-filling works, and it is not in play here because there is data in all those buckets. I think the root problem is that groupBy returns results in its own special way: groupBy is the only query type that uses the key 'event' in the return. groupBy tries to de-nest itself (unlike topNs) but does not go all the way, since it forgets to re-sort.
It isn't
For (1), it would also be nice to change the flatness of groupBy and get a nested data structure like topN, maybe behind a flag?
Related: #1780
+1 |
+1 |
+1 |
+1 |
+1 |
-1 to nested data structures; relational is easier to consume |
Fwiw when you group by
This issue has been marked as stale due to 280 days of inactivity. It will be closed in 2 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.
This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.
For anyone watching, I'd suggest using Druid SQL, since it has the behavior you would expect. (Under the hood it will issue a granularity 'all' groupBy query and then include a …
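As an illustration of the Druid SQL approach (datasource and column names here are hypothetical), a query like the following groups by an hourly time bucket but sorts and limits over the whole result set, rather than per bucket:

```sql
-- Hypothetical datasource and columns; TIME_FLOOR buckets __time by hour.
SELECT
  TIME_FLOOR(__time, 'PT1H') AS "hour",
  "page",
  SUM("count") AS "edits"
FROM "wikipedia"
GROUP BY 1, 2
ORDER BY "edits" DESC
LIMIT 10
```

Because the time bucket is just another grouping column, the ORDER BY and LIMIT apply globally, which matches option (2) discussed above.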
@gianm Thanks for the suggestion about using granularity. I compared three versions of the query:

Version 1: "Basic Granularity"

```json
"granularity": "day",
"dimensions": []
```

Version 2: "JS Dimension"

```json
"granularity": "all",
"dimensions": [{
  "type": "extraction",
  "dimension": "__time",
  "outputName": "day",
  "outputType": "STRING",
  "extractionFn": {
    "type": "javascript",
    "function": "function (milliseconds) { return new Date(milliseconds).toISOString().replace(/..:..:......Z/, '00:00:00.000Z') }"
  }
}]
```

Version 3: "Time Format Dimension"

```json
"granularity": "all",
"dimensions": [{
  "type": "extraction",
  "dimension": "__time",
  "outputName": "day",
  "extractionFn": {
    "type": "timeFormat",
    "format": "yyyy-MM-dd"
  }
}]
```

Average query durations in ms:

| Dataset | Basic Granularity |
| --- | --- |
| Small-Size | 1230ms |
| Medium-Size | 3370ms |
| Large-Size | 21090ms |

Is this an inefficiency only present in older versions of Druid (v0.9.2)? Am I doing something wrong? Is it that the native granularities are pre-optimized in a way that I can't get with manually implemented granularities? This performance penalty is serious enough that it doesn't quite qualify as a solution for my use case, unfortunately.
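On newer Druid versions (virtual columns and the expression system were added after 0.9.2), an expression virtual column is another way to derive a day bucket without JavaScript. This is an unbenchmarked sketch using Druid's built-in timestamp_floor and timestamp_format expression functions:

```json
"virtualColumns": [{
  "type": "expression",
  "name": "day",
  "expression": "timestamp_format(timestamp_floor(__time, 'P1D'), 'yyyy-MM-dd')",
  "outputType": "STRING"
}],
"dimensions": ["day"]
```

Whether this outperforms the extraction-function versions would need to be measured; it is simply the expression-system equivalent of the timeFormat approach above.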
Hi @alflennik, the native granularities are indeed somewhat more optimized, as of this writing (latest Druid version is 0.18). This is something that I expect will change in the future as we bring more optimization to the expression system.
@gianm Thanks for the reply! After thinking on it a while I figured out a good way in my system to make a time-based dimension available for the unusual situations that demand it while 99% of the time relying on the native granularities.
I ran this query:
and got this result:
It seems like the results from the different granularity buckets (2 per hour in this case) are concatenated together and never re-sorted. The limit is then applied to the entire (non-sorted) list, making it useless.
The workaround (for anyone interested) and the expected behavior is:
There are actually two separate uses for groupBy that are being muddled up here.
My proposed solution to this issue: