Describe the bug
We have seen this internally with `HashJoinExec` and `RepartitionExec`. The repartition case is a bit hard to reproduce, since it depends on how many batches are being buffered, so I will use `HashJoinExec` as the example. I suspect any operator that holds a sequence of `RecordBatch`es in memory has this problem.

Consider a tree with an `AggregateExec` on the build-side of a hash join:

- The hash join buffers the build-side batches as a `Vec<RecordBatch>` in memory. Ref.
- The memory is accounted using `get_record_batch_memory_size` of each batch separately.
- How `AggregateExec` produces output: ref
- The same memory can end up counted `num_output_rows / batch_size` times! This can be a huge multiplier for large aggs.

To Reproduce
I have a reproducer test here: Samyak2#2
`peak_mem_used` for final aggregate is ~104MB in the test. But hash join fails even with 3GB memory limit!

We can see the test failing even with memory limit set to 30x (!) of the aggregate peak mem. I can understand it being 2x of that - the memory counting might duplicate once in join and once in agg, but 30x does not make sense. Ideally, this should be close to 1x since we're not really allocating more memory.
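
To make the multiplier concrete, here is a minimal std-only Rust sketch of the overcounting. It assumes (this is an assumption for illustration, not something the test itself shows) that the buffered batches are zero-copy slices sharing one allocation, and that the per-batch size reports the whole allocation rather than the slice. `Batch`, `emit_sliced`, and `naive_total` are hypothetical stand-ins, not Arrow or DataFusion APIs.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a record batch: a window into a shared allocation.
struct Batch {
    buffer: Arc<Vec<u8>>, // backing allocation, shared between slices
    offset: usize,        // start of this batch's window
    len: usize,           // number of one-byte "rows" in the window
}

impl Batch {
    /// Allocates a fresh batch with one byte per row.
    fn new(num_rows: usize) -> Batch {
        Batch { buffer: Arc::new(vec![0u8; num_rows]), offset: 0, len: num_rows }
    }

    /// Zero-copy slice: the new batch points at the same allocation.
    fn slice(&self, offset: usize, len: usize) -> Batch {
        assert!(offset + len <= self.len, "slice out of bounds");
        Batch { buffer: Arc::clone(&self.buffer), offset: self.offset + offset, len }
    }

    /// The rows visible through this batch's window.
    fn rows(&self) -> &[u8] {
        &self.buffer[self.offset..self.offset + self.len]
    }

    /// Naive per-batch size: reports the whole backing allocation,
    /// regardless of how small the window is (assumed behavior).
    fn memory_size(&self) -> usize {
        self.buffer.len()
    }
}

/// Produces output as `batch_size`-row slices of one large batch.
fn emit_sliced(num_output_rows: usize, batch_size: usize) -> Vec<Batch> {
    assert!(batch_size > 0, "batch_size must be non-zero");
    let big = Batch::new(num_output_rows);
    (0..num_output_rows)
        .step_by(batch_size)
        .map(|start| big.slice(start, batch_size.min(num_output_rows - start)))
        .collect()
}

/// Sums the size of each batch separately, with no de-duplication.
fn naive_total(batches: &[Batch]) -> usize {
    batches.iter().map(Batch::memory_size).sum()
}
```

With `emit_sliced(81_920, 8_192)` there are 10 batches over a single 81,920-byte allocation, and `naive_total` reports 819,200 bytes: the real footprint multiplied by `num_output_rows / batch_size`.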
Expected behavior
The query in the above test should pass with at most 2x the size of the aggregate peak mem.
Additional context
This is not limited to hash join. Any operator that can buffer data coming from an agg will show this behavior:

- `HashJoinExec`: buffers all batches coming on the build side
- `NestedLoopJoinExec`
- `CrossJoinExec`
- `SortMergeJoinExec`: although it won't buffer many records
- `RepartitionExec`: can buffer an unbounded number of batches in some cases. See: RepartitionExec channels grow unboundedly with one slow consumer #22090
- `SortExec`: buffers batches in-mem
- `TopK`: buffers heap size number of batches
- `SortPreservingMergeExec`

There may be more I'm missing.

For the fix, I'm proposing a new helper alongside `get_record_batch_memory_size` that is stateful. It keeps track of buffer pointers that were previously seen (in either a HashSet or a HashMap) and de-duplicates across batches using the state. We have been using a version of this successfully internally.
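
As a rough illustration of the stateful idea, here is a std-only Rust sketch. It is not the internal version mentioned above and does not use the real Arrow types: `Buffer`, `Batch`, and `BatchMemoryTracker` are hypothetical stand-ins, and the allocation address is used as the de-duplication key on the assumption that shared buffers point at the same allocation.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Hypothetical stand-in for a buffer: a reference-counted allocation.
type Buffer = Arc<Vec<u8>>;

/// Hypothetical stand-in for a record batch: one buffer per column.
struct Batch {
    columns: Vec<Buffer>,
}

/// Hypothetical stateful helper: remembers which allocations were
/// already counted so that later batches do not count them again.
#[derive(Default)]
struct BatchMemoryTracker {
    seen: HashSet<usize>, // addresses of allocations counted so far
}

impl BatchMemoryTracker {
    /// Returns only the bytes of allocations not seen in an earlier call.
    fn add_batch(&mut self, batch: &Batch) -> usize {
        let mut new_bytes = 0;
        for buf in &batch.columns {
            // Key on the address of the shared allocation.
            let ptr = Arc::as_ptr(buf) as usize;
            // `insert` returns true only the first time a key is added.
            if self.seen.insert(ptr) {
                new_bytes += buf.len();
            }
        }
        new_bytes
    }
}
```

An operator would keep one tracker next to its buffered batches and reserve only what `add_batch` returns for each incoming batch. A `HashSet` is enough when memory is only ever added. A `HashMap` keyed the same way could also store sizes or counts, which might help if batches are released one at a time, though that is a design guess on my part.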