
groupBy query: limit push down to segment scan causes poor performance #9689

Open
xiangqiao123 opened this issue Apr 12, 2020 · 8 comments

@xiangqiao123
Contributor

xiangqiao123 commented Apr 12, 2020

Affected Version

0.17.1

Description

With druid-0.17.1, groupBy query performance is worse than with druid-0.16.1. When
applyLimitPushDownToSegment is set to false in the query context, performance returns to normal.

From this figure, we can see that the CPU consumption is in LimitedBufferHashGrouper.
[image: profile showing CPU time spent in LimitedBufferHashGrouper]

Query performance, according to query/time from the historical log:
{"query/time":13372,"query/bytes":6118553,"success":true,"identity":"allowAll"}

cc @himanshug

@clintropolis
Member

Thanks for the report @xiangqiao123. Do you have any additional details on the query you issued, such as what the limit was, whether there were any extractionFns or lookups, and any details about segment size, column types, approximate cardinality of any string columns, or anything else you think might help us determine why the performance degraded?

@himanshug
Contributor

@xiangqiao123 LimitedBufferHashGrouper does incur some overhead to maintain a heap for applying the limit; if that overhead is greater than the cost of processing all rows from the segments at the merge stage, then this can happen.
Can you post your groupBy query?

@gianm
Contributor

gianm commented Apr 14, 2020

It looks like the reset is happening on init of LimitedBufferHashGrouper. I'm guessing it's taking a long time because the AlternatingByteBufferHashTable starts out at the max size (tableArenaSize / 2) rather than growing like the regular ByteBufferHashTable does.

The perf regression is probably because #8426 was added in 0.17.0, and before then, we wouldn't push down the limit all the way to the segment. I bet the pushdown to the merge buffer level isn't as bad because there's only one merge buffer per query, so the cost of init is only paid once, instead of per segment.

I think to improve the performance we'd need AlternatingByteBufferHashTable to implement growability.
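
For intuition, here is a minimal, self-contained sketch (hypothetical constants and methods, not Druid's actual code) of why initializing a table at its maximum size on every segment scan is costly compared with starting small and growing on demand: the up-front clear touches half the table arena for every per-segment grouper, while a growable table only clears the region it actually uses.

```java
import java.nio.ByteBuffer;

public class TableInitCostSketch
{
  static final int BUCKET_SIZE_BYTES = 32;        // assumed bucket size, for illustration only
  static final int SMALL_INITIAL_BUCKETS = 1024;  // how a growable table might start

  // Growable-style init: clear only a small initial region; newly used space is
  // cleared later, as keys actually arrive and the table grows.
  static void initGrowable(ByteBuffer tableArena)
  {
    clearRegion(tableArena, 0, SMALL_INITIAL_BUCKETS * BUCKET_SIZE_BYTES);
  }

  // "Start at max size" init, analogous to what is described above: clear half of
  // the table arena up front, paying that cost once per segment-level grouper.
  static void initAtMaxSize(ByteBuffer tableArena)
  {
    clearRegion(tableArena, 0, tableArena.capacity() / 2);
  }

  static void clearRegion(ByteBuffer buf, int start, int length)
  {
    for (int i = start; i < start + length; i++) {
      buf.put(i, (byte) 0);
    }
  }

  public static void main(String[] args)
  {
    ByteBuffer arena = ByteBuffer.allocate(64 * 1024 * 1024); // modest arena for the demo
    long t0 = System.nanoTime();
    initGrowable(arena);
    long t1 = System.nanoTime();
    initAtMaxSize(arena);
    long t2 = System.nanoTime();
    System.out.printf("growable init: %d us, max-size init: %d us%n",
                      (t1 - t0) / 1000, (t2 - t1) / 1000);
  }
}
```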

@xiangqiao123
Contributor Author

Thanks for the report @xiangqiao123. Do you have any additional details on the query you issued, such as what the limit was, whether there were any extractionFns or lookups, and any details about segment size, column types, approximate cardinality of any string columns, or anything else you think might help us determine why the performance degraded?

  1. Query statement:

     {
       "dimensions": [
         "host",
         "type",
         "status"
       ],
       "aggregations": [
         {
           "type": "longSum",
           "fieldName": "count",
           "name": "SUM(count)"
         }
       ],
       "intervals": "2020-04-08T00:00:00+08:00/2020-04-09T00:00:00+08:00",
       "limitSpec": {
         "type": "default",
         "limit": 50000,
         "columns": []
       },
       "granularity": {
         "type": "period",
         "period": "PT1M",
         "timeZone": "Asia/Shanghai",
         "origin": "1970-01-01T00:00:00+08:00"
       },
       "queryType": "groupBy",
       "dataSource": "druid_ds",
       "context": {
         "useCache": false,
         "populateCache": false
       }
     }
  2. There are about 50 segments, roughly 1 GB per segment.

@himanshug
Contributor

@xiangqiao123 That is the simplest of queries in terms of aggregation, with a largish limit. Even with efforts to reduce overhead, there would be a non-zero cost to pushing the limit down to the segment scan phase, so for your use case I'd advise disabling limit push down to segments (in reality, you should measure rather than blindly believe my claim, though :)

@gianm I agree with your observations. It's unfortunate that we need to implement grow-ability just to save the cost of zeroing out memory we have already allocated, or is there another advantage? Another consideration, which I don't know/remember whether we tried, is a different layout for marking used/unused buckets.
Currently the first byte of each bucket tells whether that bucket is used or not. Instead, we could reserve numBuckets bytes at the start of the buffer and use those for marking. Zeroing those out might be faster due to batching, and even looking them up might be faster when most of the buckets are marked used.
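
To make the suggested layout concrete, here is a rough sketch (all names hypothetical, not the existing Druid classes): the per-bucket "used" flags live in a contiguous region at the start of the buffer, so resetting them is one contiguous pass rather than a strided write into the first byte of every bucket.

```java
import java.nio.ByteBuffer;

public class FlagRegionLayoutSketch
{
  private final ByteBuffer buffer;
  private final int numBuckets;
  private final int bucketSizeBytes;
  private final int bucketsStart;   // bucket data begins after the flag region

  FlagRegionLayoutSketch(ByteBuffer buffer, int numBuckets, int bucketSizeBytes)
  {
    this.buffer = buffer;
    this.numBuckets = numBuckets;
    this.bucketSizeBytes = bucketSizeBytes;
    this.bucketsStart = numBuckets; // first numBuckets bytes hold one flag per bucket
  }

  // Resetting the table only needs to zero the contiguous flag region,
  // not the header byte of every bucket scattered through the buffer.
  void reset()
  {
    for (int i = 0; i < numBuckets; i++) {
      buffer.put(i, (byte) 0);
    }
  }

  boolean isUsed(int bucket)
  {
    return buffer.get(bucket) != 0;
  }

  void markUsed(int bucket)
  {
    buffer.put(bucket, (byte) 1);
  }

  int bucketOffset(int bucket)
  {
    return bucketsStart + bucket * bucketSizeBytes;
  }
}
```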

@gianm
Contributor

gianm commented Apr 14, 2020

It's unfortunate that we need to implement grow-ability just to save the cost of zeroing out memory we have already allocated, or is there another advantage?

The problem is that the memory is not initialized when we allocate it, so it has garbage in it, and therefore we need to initialize the parts we're going to use. And initialization can take a while if the buffer size is big (like a gigabyte, as some people do).

Instead, we could reserve numBuckets bytes at the start of the buffer and use those for marking.

I bet it would be faster, especially if we use the Memory.clear API. I did some work to migrate some of the groupBy code to use Memory instead of ByteBuffer (#9308, #9314; see also MemoryBenchmark) partially motivated by performance. If we moved this part too it should help with initialization speed.

But growability would probably help even more. It should be relatively straightforward for AlternatingByteBufferHashTable: each time we grow we swap into the other half of the buffer.
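
Sketching the grow-by-swap idea under the same assumptions (a hypothetical class, not Druid's actual AlternatingByteBufferHashTable): the table starts small in one half of the arena, and each growth rehashes into a larger region of the other half, so only the region about to be used is ever cleared.

```java
import java.nio.ByteBuffer;

public class AlternatingGrowableTableSketch
{
  private final ByteBuffer arena;   // whole table arena; each half can hold the max-size table
  private final int bucketSizeBytes;
  private int numBuckets;           // current (logical) table size in buckets
  private int currentHalf;          // 0 or 1: which half holds the live table

  AlternatingGrowableTableSketch(ByteBuffer arena, int bucketSizeBytes, int initialBuckets)
  {
    this.arena = arena;
    this.bucketSizeBytes = bucketSizeBytes;
    this.numBuckets = initialBuckets;
    this.currentHalf = 0;
    clear(halfStart(0), numBuckets * bucketSizeBytes); // only the small initial region
  }

  // Doubles the table by rehashing into the other half of the buffer. Only the
  // newly used region of the destination half needs to be initialized.
  void grow()
  {
    int newNumBuckets = numBuckets * 2;
    int destHalf = 1 - currentHalf;
    clear(halfStart(destHalf), newNumBuckets * bucketSizeBytes);
    rehashInto(destHalf, newNumBuckets);
    currentHalf = destHalf;
    numBuckets = newNumBuckets;
  }

  private int halfStart(int half)
  {
    return half * (arena.capacity() / 2);
  }

  private void clear(int start, int length)
  {
    for (int i = start; i < start + length; i++) {
      arena.put(i, (byte) 0);
    }
  }

  private void rehashInto(int destHalf, int destBuckets)
  {
    // Omitted: iterate the used buckets in the current half and re-insert them
    // into the destination half using the larger bucket count.
  }
}
```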

@himanshug
Contributor

... groupBy code to use Memory instead of ByteBuffer (#9308, #9314; see also MemoryBenchmark) partially motivated by performance. If we moved this part too it should help with initialization speed.

Thanks for pointing those out. From a quick look, it looks like we should just delete ByteBufferHashTable and replace all usages with MemoryOpenHashTable, which is a clear winner. Do you see any reason for keeping ByteBufferHashTable around anywhere?

Agreed with everything; all of these could/should be done independently.

@gianm
Contributor

gianm commented Apr 14, 2020

From a quick look, it looks like we should just delete ByteBufferHashTable and replace all usages with MemoryOpenHashTable, which is a clear winner. Do you see any reason for keeping ByteBufferHashTable around anywhere?

I don't see a reason to keep ByteBufferHashTable around. The only reason I didn't replace all usages of it was time constraints.

Although for LimitedBufferHashGrouper specifically, I think in order to migrate it to MemoryOpenHashTable, we'd also need to add a MemoryMinMaxHeap (replacing ByteBufferMinMaxOffsetHeap).

I would suggest refactoring LimitedBufferHashGrouper a bit so that, rather than subclassing the hash table and overriding its methods, it instead holds a (not-subclassed) MemoryOpenHashTable and a MemoryMinMaxHeap and uses the two together to achieve its ends.
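
A rough sketch of that composition-based shape, using hypothetical stand-in interfaces rather than the real MemoryOpenHashTable/MemoryMinMaxHeap APIs: the grouper owns a hash table and a min-max heap and coordinates them, instead of subclassing the table.

```java
public class LimitedGrouperSketch
{
  // Hypothetical minimal views of the two collaborators.
  interface HashTable
  {
    int findOrInitBucket(byte[] key);   // returns bucket id, or -1 if the table is full
    void remove(int bucket);
  }

  interface MinMaxHeap
  {
    void add(int bucket);
    int removeMax();                    // evict the bucket that can no longer make the top N
    int size();
  }

  private final HashTable table;        // owned and composed, not subclassed
  private final MinMaxHeap heap;        // tracks the top `limit` buckets
  private final int limit;

  LimitedGrouperSketch(HashTable table, MinMaxHeap heap, int limit)
  {
    this.table = table;
    this.heap = heap;
    this.limit = limit;
  }

  boolean aggregate(byte[] key)
  {
    int bucket = table.findOrInitBucket(key);
    if (bucket < 0) {
      return false;                     // table full; the caller grows or spills
    }
    // (A real implementation would only update the heap when a bucket is newly
    // created or when its ordering key changes.)
    heap.add(bucket);
    if (heap.size() > limit) {
      table.remove(heap.removeMax());
    }
    return true;
  }
}
```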
