
groupBy query: limit push down to segment scan causes poor performance #9689

Open
xiangqiao123 opened this issue Apr 12, 2020 · 8 comments

@xiangqiao123
Contributor

xiangqiao123 commented Apr 12, 2020

Affected Version

0.17.1

Description

With druid-0.17.1, groupBy query performance is worse than with druid-0.16.1. When
applyLimitPushDownToSegment is set to false in the query context, performance returns to normal.

From this figure, we can see that the CPU consumption is in LimitedBufferHashGrouper.
[image: profile showing CPU time spent in LimitedBufferHashGrouper]

Query performance, according to query/time from the historical log:
{"query/time":13372,"query/bytes":6118553,"success":true,"identity":"allowAll"}

cc @himanshug

@clintropolis
Member

Thanks for the report @xiangqiao123. Do you have any additional details on the query you issued, such as what the limit was, whether there were any extractionFns or lookups, and any details about segment size, column types, approximate cardinality of any string columns, or anything else you think might help us determine why the performance degraded?

@himanshug
Contributor

@xiangqiao123 LimitedBufferHashGrouper does incur some overhead to maintain a heap for applying the limit; if that overhead is greater than the cost of processing all rows from the segments at the merge stage, then this can happen.
Can you post your groupBy query?

@gianm
Contributor

gianm commented Apr 14, 2020

It looks like the reset is happening on init of LimitedBufferHashGrouper. I'm guessing it's taking a long time because the AlternatingByteBufferHashTable starts out at the max size (tableArenaSize / 2) rather than growing like the regular ByteBufferHashTable does.

The perf regression is probably because #8426 was added in 0.17.0, and before then, we wouldn't push down the limit all the way to the segment. I bet the pushdown to the merge buffer level isn't as bad because there's only one merge buffer per query, so the cost of init is only paid once, instead of per segment.

I think to improve the performance we'd need AlternatingByteBufferHashTable to implement growability.
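
For intuition, here is a minimal, self-contained sketch (hypothetical constants and methods, not Druid's actual code) of why initializing a table at its maximum size on every segment scan is costly compared with starting small and growing on demand: the up-front clear touches half the table arena for every per-segment grouper, while a growable table only clears the region it actually uses.

```java
import java.nio.ByteBuffer;

public class TableInitCostSketch
{
  static final int BUCKET_SIZE_BYTES = 32;        // assumed bucket size, for illustration only
  static final int SMALL_INITIAL_BUCKETS = 1024;  // how a growable table might start

  // Growable-style init: clear only a small initial region; newly used space is
  // cleared later, as keys actually arrive and the table grows.
  static void initGrowable(ByteBuffer tableArena)
  {
    clearRegion(tableArena, 0, SMALL_INITIAL_BUCKETS * BUCKET_SIZE_BYTES);
  }

  // "Start at max size" init, analogous to what is described above: clear half of
  // the table arena up front, paying that cost once per segment-level grouper.
  static void initAtMaxSize(ByteBuffer tableArena)
  {
    clearRegion(tableArena, 0, tableArena.capacity() / 2);
  }

  static void clearRegion(ByteBuffer buf, int start, int length)
  {
    for (int i = start; i < start + length; i++) {
      buf.put(i, (byte) 0);
    }
  }

  public static void main(String[] args)
  {
    ByteBuffer arena = ByteBuffer.allocate(64 * 1024 * 1024); // modest arena for the demo
    long t0 = System.nanoTime();
    initGrowable(arena);
    long t1 = System.nanoTime();
    initAtMaxSize(arena);
    long t2 = System.nanoTime();
    System.out.printf("growable init: %d us, max-size init: %d us%n",
                      (t1 - t0) / 1000, (t2 - t1) / 1000);
  }
}
```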

@xiangqiao123
Contributor Author

Thanks for the report @xiangqiao123. Do you have any additional details on the query you issued, such as what the limit was, whether there were any extractionFns or lookups, and any details about segment size, column types, approximate cardinality of any string columns, or anything else you think might help us determine why the performance degraded?

  1. Query statement:

     {
       "dimensions": [
         "host",
         "type",
         "status"
       ],
       "aggregations": [
         {
           "type": "longSum",
           "fieldName": "count",
           "name": "SUM(count)"
         }
       ],
       "intervals": "2020-04-08T00:00:00+08:00/2020-04-09T00:00:00+08:00",
       "limitSpec": {
         "type": "default",
         "limit": 50000,
         "columns": []
       },
       "granularity": {
         "type": "period",
         "period": "PT1M",
         "timeZone": "Asia/Shanghai",
         "origin": "1970-01-01T00:00:00+08:00"
       },
       "queryType": "groupBy",
       "dataSource": "druid_ds",
       "context": {
         "useCache": false,
         "populateCache": false
       }
     }
  2. There are about 50 segments, roughly 1 GB per segment.

@himanshug
Contributor

@xiangqiao123 That is the simplest of queries in terms of aggregation, with a largish limit. Even with efforts to reduce overhead, there would be a non-zero cost to pushing the limit down to the segment scan phase, so for your use case I'd advise disabling limit push down to segments (in reality, you should measure rather than blindly believe my claim, though :)

@gianm I agree with your observations. It's unfortunate that we need to implement grow-ability just to save the cost of zeroing out memory we have already allocated, or is there another advantage? Another consideration, which I don't know/remember whether we tried, is a different layout for marking used/unused buckets.
Currently the first byte of each bucket tells whether that bucket is used or not. Instead, we could reserve numBuckets bytes at the start of the buffer and use those for marking. Zeroing those out might be faster due to batching, and even looking them up might be faster when most of the buckets are marked used.
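
To make the suggested layout concrete, here is a rough sketch (all names hypothetical, not the existing Druid classes): the per-bucket "used" flags live in a contiguous region at the start of the buffer, so resetting them is one contiguous pass rather than a strided write into the first byte of every bucket.

```java
import java.nio.ByteBuffer;

public class FlagRegionLayoutSketch
{
  private final ByteBuffer buffer;
  private final int numBuckets;
  private final int bucketSizeBytes;
  private final int bucketsStart;   // bucket data begins after the flag region

  FlagRegionLayoutSketch(ByteBuffer buffer, int numBuckets, int bucketSizeBytes)
  {
    this.buffer = buffer;
    this.numBuckets = numBuckets;
    this.bucketSizeBytes = bucketSizeBytes;
    this.bucketsStart = numBuckets; // first numBuckets bytes hold one flag per bucket
  }

  // Resetting the table only needs to zero the contiguous flag region,
  // not the header byte of every bucket scattered through the buffer.
  void reset()
  {
    for (int i = 0; i < numBuckets; i++) {
      buffer.put(i, (byte) 0);
    }
  }

  boolean isUsed(int bucket)
  {
    return buffer.get(bucket) != 0;
  }

  void markUsed(int bucket)
  {
    buffer.put(bucket, (byte) 1);
  }

  int bucketOffset(int bucket)
  {
    return bucketsStart + bucket * bucketSizeBytes;
  }
}
```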

@gianm
Contributor

gianm commented Apr 14, 2020

It's unfortunate that we need to implement grow-ability just to save the cost of zeroing out memory we have already allocated, or is there another advantage?

The problem is that the memory is not initialized when we allocate it, so it has garbage in it, and therefore we need to initialize the parts we're going to use. And initialization can take a while if the buffer size is big (like a gigabyte, as some people do).

Instead, we could reserve numBuckets bytes at the start of the buffer and use those for marking.

I bet it would be faster, especially if we use the Memory.clear API. I did some work to migrate some of the groupBy code to use Memory instead of ByteBuffer (#9308, #9314; see also MemoryBenchmark) partially motivated by performance. If we moved this part too it should help with initialization speed.

But growability would probably help even more. It should be relatively straightforward for AlternatingByteBufferHashTable: each time we grow we swap into the other half of the buffer.
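
Sketching the grow-by-swap idea under the same assumptions (a hypothetical class, not Druid's actual AlternatingByteBufferHashTable): the table starts small in one half of the arena, and each growth rehashes into a larger region of the other half, so only the region about to be used is ever cleared.

```java
import java.nio.ByteBuffer;

public class AlternatingGrowableTableSketch
{
  private final ByteBuffer arena;   // whole table arena; each half can hold the max-size table
  private final int bucketSizeBytes;
  private int numBuckets;           // current (logical) table size in buckets
  private int currentHalf;          // 0 or 1: which half holds the live table

  AlternatingGrowableTableSketch(ByteBuffer arena, int bucketSizeBytes, int initialBuckets)
  {
    this.arena = arena;
    this.bucketSizeBytes = bucketSizeBytes;
    this.numBuckets = initialBuckets;
    this.currentHalf = 0;
    clear(halfStart(0), numBuckets * bucketSizeBytes); // only the small initial region
  }

  // Doubles the table by rehashing into the other half of the buffer. Only the
  // newly used region of the destination half needs to be initialized.
  void grow()
  {
    int newNumBuckets = numBuckets * 2;
    int destHalf = 1 - currentHalf;
    clear(halfStart(destHalf), newNumBuckets * bucketSizeBytes);
    rehashInto(destHalf, newNumBuckets);
    currentHalf = destHalf;
    numBuckets = newNumBuckets;
  }

  private int halfStart(int half)
  {
    return half * (arena.capacity() / 2);
  }

  private void clear(int start, int length)
  {
    for (int i = start; i < start + length; i++) {
      arena.put(i, (byte) 0);
    }
  }

  private void rehashInto(int destHalf, int destBuckets)
  {
    // Omitted: iterate the used buckets in the current half and re-insert them
    // into the destination half using the larger bucket count.
  }
}
```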

@himanshug
Contributor

... groupBy code to use Memory instead of ByteBuffer (#9308, #9314; see also MemoryBenchmark) partially motivated by performance. If we moved this part too it should help with initialization speed.

Thanks for pointing those out. From a quick look, it looks like we should just delete ByteBufferHashTable and replace all usages with MemoryOpenHashTable, which is a clear winner. Do you see any reason for keeping ByteBufferHashTable around anywhere?

Agreed with everything; all of these could/should be done independently.

@gianm
Contributor

gianm commented Apr 14, 2020

From a quick look, it looks like we should just delete ByteBufferHashTable and replace all usages with MemoryOpenHashTable, which is a clear winner. Do you see any reason for keeping ByteBufferHashTable around anywhere?

I don't see a reason to keep ByteBufferHashTable around. The only reason I didn't replace all usages of it was time constraints.

Although for LimitedBufferHashGrouper specifically, I think in order to migrate it to MemoryOpenHashTable, we'd also need to add a MemoryMinMaxHeap (replacing ByteBufferMinMaxOffsetHeap).

I would suggest refactoring LimitedBufferHashGrouper a bit so that, rather than subclassing the hash table and overriding its methods, it instead holds a (not-subclassed) MemoryOpenHashTable and a MemoryMinMaxHeap and uses the two together to achieve its ends.
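
A rough sketch of that composition-based shape, using hypothetical stand-in interfaces rather than the real MemoryOpenHashTable/MemoryMinMaxHeap APIs: the grouper owns a hash table and a min-max heap and coordinates them, instead of subclassing the table.

```java
public class LimitedGrouperSketch
{
  // Hypothetical minimal views of the two collaborators.
  interface HashTable
  {
    int findOrInitBucket(byte[] key);   // returns bucket id, or -1 if the table is full
    void remove(int bucket);
  }

  interface MinMaxHeap
  {
    void add(int bucket);
    int removeMax();                    // evict the bucket that can no longer make the top N
    int size();
  }

  private final HashTable table;        // owned and composed, not subclassed
  private final MinMaxHeap heap;        // tracks the top `limit` buckets
  private final int limit;

  LimitedGrouperSketch(HashTable table, MinMaxHeap heap, int limit)
  {
    this.table = table;
    this.heap = heap;
    this.limit = limit;
  }

  boolean aggregate(byte[] key)
  {
    int bucket = table.findOrInitBucket(key);
    if (bucket < 0) {
      return false;                     // table full; the caller grows or spills
    }
    // (A real implementation would only update the heap when a bucket is newly
    // created or when its ordering key changes.)
    heap.add(bucket);
    if (heap.size() > limit) {
      table.remove(heap.removeMax());
    }
    return true;
  }
}
```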
