Memory Limited GroupBy (Externalized / Spill) #1570

alamb · 2022-01-15T12:18:24Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Support Grouping "arbitrarily" large inputs (e.g. when the group by hash table doesn't fit in RAM or within some user defined budget)

This typically happens when there are a "large" number of distinct groups. For example, with queries like

-- Find the top ten users by event count
SELECT user_id, count(events) 
FROM users 
GROUP BY user_id 
ORDER BY count(events) DESC
limit 10

When there are large number of groups

This ticket concerns the memory used the HashAggregateExec operator -- it doesn't cover other potential targets (e.g. externalized sort or join). That will be covered by other tasks tracked by #587

Describe the solution you'd like

Allow DataFusion users to specify a RAM budget (aka via the config introduced in Initial MemoryManager and DiskManager APIs for query execution + External Sort implementation #1526) and have their queries complete running without the group by exceeding the budget allocated to it via the memory manager.

For the HashAggregateExec operator, I think the best behavior would be:

Use an in memory hash table (as is done today in HashAggregateExec ), if the memory budget allows
If all the input does not fit in the RAM budget, the hash table and partial aggregates are sorted by group key and written to temporary disk files
Temporary disk files are read / merged to produce the final results

Hopefully after #1568 is complete we'll have an efficient N-way merge that can be reused.

Some ideas of how to break this task down

Consolidate GroupByHash implementations row_hash.rs and hash.rs (remove duplication) #2723
Throw a runtime error if the memory allocated to GroupByHash exceeds a limit #3940
Improve the performance of Aggregator, grouping, aggregation #4973
Implement proper Spilling in GroupByHashExec

Describe alternatives you've considered
TBD

Context
This is follow on work from the great PR from @yjshen in #1526 and part of the story of limiting memory used by DataFusion #587

The text was updated successfully, but these errors were encountered:

jhorstmann · 2022-01-21T09:54:49Z

For hash aggregation I quite like the implementation described in https://dl.acm.org/doi/pdf/10.1145/2588555.2610507 (4.4). The algorithm there optimizes for cache locality by pre-aggregating into fixed-size thread-local hash tables and then overflowing into main memory partitions, but I think it should be easy to extend with also flushing these partitions to disk if they get too large.

milenkovicm · 2022-09-10T19:52:27Z

An idea to brainstorm this feature. master...milenkovicm:arrow-datafusion:spill_aggregation_to_mmap has a POC code which uses a memory mapped file as the aggregation store, which should spill to disc in case of memory pressure.

This solution is just a prototype, far from perfect, and good chance it is not fit for purpose.

The main change is that Vec<RowGroupState> has been replaced with Vec<u8>, which is then replaced with mmap file.

Problems with this solution:

assumption is that aggregation buffer size will not change. This assumption may go away but working with underlying state will get way more complicated, as we will need to implement buffer reallocation, eviction ...
we don't have control when the data is going to be evicted from memory, we can use lock and unlock on mmap to try to keep it in memory but not sure if that would be the case

good point(s):

this solution chunks aggregation state in multiple batches when downstream producer finishes, thus limiting memory needed to read aggregation state (in theory)

no performance numbers at the moment, and not much experience with datafusion

Any comments welcomed

alamb · 2022-09-13T11:07:50Z

I will try and review this sometime later today.

milenkovicm · 2022-09-13T11:33:37Z

thanks @alamb,

just a follow up with some numbers

running a simple query on 10M distinct keys and 50M records total (each key has 5 records) ( /usr/bin/time -l ./bench_10m)

with this change:

      112.10 real       363.15 user        21.28 sys
          2246549504  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
             3689321  page reclaims
              503785  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                5924  voluntary context switches
              442244  involuntary context switches
       2212997211010  instructions retired
       1232264629987  cycles elapsed
          1254973440  peak memory footprint

without change:

      109.09 real       355.84 user        23.10 sys
          3891322880  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
             5401700  page reclaims
                   0  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                  15  voluntary context switches
              414334  involuntary context switches
       2222284866547  instructions retired
       1208398130091  cycles elapsed
          3404832768  peak memory footprint

alamb · 2022-09-13T19:47:37Z

Thanks for the info @milenkovicm

In general I think database systems tend to try and shy away from using mmap for spilling operations, for the reasons very clearly articulated in "Are You Sure You Want to Use MMAP in Your Database Management System?" https://db.cs.cmu.edu/mmap-cidr2022/

I think the current state of the art / best practice for dealing with grouping when the hashtable doesn't fit in memory is to:

Sort the data by groups and write that data to a spill file (ideally sharing / reusing the code used for externalized sort)
Create a new hash table, spilling repeatedly if needed
When all the input has been processed, then the spill files are read back and merged (again sharing code for externalized sort) and a merge group by is performed.

Thanks to @yjshen we have much of the spilling / merging code already. We are lacking:

merge group by (where the groups are guaranteed to be contiguous in the input stream)
The orchestration logic to spill and then re-merge

Does that make sense? I know it is a high level description

milenkovicm · 2022-09-13T20:42:19Z

thanks @alamb,
your comment makes perfect sense, I can't disagree with you. As I said in my comment approach is not without its problems and shortcomings. This solution is quite simple and it is question if there is a speed from simplicity :)

Would you have any design doc regarding merge group by and re-merge which you can share?

alamb · 2022-09-13T21:01:18Z

Would you have any design doc regarding merge group by and re-merge which you can share?

I do not have a design document, sadly. I can try and help write one (perhaps I can make a diagram and example) if you are interested in working on it @milenkovicm

milenkovicm · 2022-09-13T21:05:55Z

I'd be interested. after a big sleep I think I get your approach, but if you can produce a diagram it would be great.
can't really commit on timeframe, will have to have a look at code first

thanks @alamb

alamb · 2022-09-14T18:31:27Z

😆 sounds good -- I will try and write up something in the coming days to start a discussion

* change how final aggregation row group is created ... this change would prevent of cloning of whole state, doubling memory needed for aggregation. this PR relates to #1570 * Fix clippy issues * read batch size from `session_config`

alamb · 2022-11-28T14:59:12Z

Update on this task is @crepererum has added initial support to track the memory use in the grouping operation. We @tustvold / @crepererum (and myself as ticket master) have plans to improve the grouping operation which I will make a bit more discoverable shortly

alamb · 2023-01-31T23:28:47Z

I'd be interested. after a big sleep I think I get your approach, but if you can produce a diagram it would be great.

@milenkovicm very belatedly, here is a document / diagrams: https://docs.google.com/document/d/16rm5VR1nGkY6DedMCh1NUmThwf3RduAweaBH9b1h6AY/edit?usp=sharing

mingmwang · 2023-04-25T10:38:06Z

@alamb @yjshen
Can we make the GroupState and the Accumulator states serializable ?
With this approach, we do not need to do any sort when spiiling data to disks. And when we read the data back, we reconstruct our raw hash table quickly from the hash values and indexes, because our hashmap is very lightweight, the hash value can be re-calculated from grouping rows, or we can cache the hash value inside the GroupState to avoid the re-calculating.

crepererum · 2023-04-25T14:31:50Z

Can we make the GroupState and the Accumulator states serializable ? With this approach, we do not need to do any sort when spiiling data to disks. And when we read the data back, we reconstruct our raw hash table quickly from the hash values and indexes, because our hashmap is very lightweight, the hash value can be re-calculated from grouping rows, or we can cache the hash value inside the GroupState to avoid the re-calculating.

You still need to disk spilling, no? Or where do you store the serialized state? Also I guess that serialization may become a major bottleneck for some of the accumulators.

mingmwang · 2023-04-26T09:28:46Z

Can we make the GroupState and the Accumulator states serializable ? With this approach, we do not need to do any sort when spiiling data to disks. And when we read the data back, we reconstruct our raw hash table quickly from the hash values and indexes, because our hashmap is very lightweight, the hash value can be re-calculated from grouping rows, or we can cache the hash value inside the GroupState to avoid the re-calculating.

You still need to disk spilling, no? Or where do you store the serialized state? Also I guess that serialization may become a major bottleneck for some of the accumulators.

Yes, we still need the disk spilling, the disk spilling can be managed and tracked by the disk_manager in the RuntimeEnv, but anyway it avoid sort the entire group data or hash table before the spilling.

milenkovicm · 2023-04-26T09:39:05Z

I'm not sure I see many benefits of having it serializable, would agree with @crepererum
Now this discussion would make more sense if we would know more about your implementation.

IMHO, aggregation should start with hash map, we can assume that there is not going to be spill, if we're wrong we would pay penalty of being wrong as we will have to sort it before spill.

Once we have it spill to disc I'd argue it would make more sense to switch from hash map to b-tree, as we would need to merge it with spill, it is slower but from my experience it is a bit faster than sorting hash map.

Spilling can be implemented using two column parquet file (key: blob, value: blob) .

Implementation like this works quite well from my experience, especially that in most cases we wont trigger spill

alamb · 2023-04-27T15:50:27Z

Can we make the GroupState and the Accumulator states serializable ?
With this approach, we do not need to do any sort when spiiling data to disks. And when we read the data back, we reconstruct our raw hash table quickly from the hash values and indexes, because our hashmap is very lightweight, the hash value can be re-calculated from grouping rows, or we can cache the hash value inside the GroupState to avoid the re-calculating.

@mingmwang when we serialize the groups to disk when the hash table is full, we then need to read them back in again somehow and do the final combination. My assumption is that we don't have the memory to read them all back in at once as we had to spill in the first place

If we sort the data that is spilled on the group keys, we can stream data from the different spill files in parallel, merge, and then do a streaming group by, which we will have sufficient memory to accomplished.

IMHO, aggregation should start with hash map, we can assume that there is not going to be spill, if we're wrong we would pay penalty of being wrong as we will have to sort it before spill.

Yes I agree with @milenkovicm - this approach will work well. It does have a disadvantage of a performance "cliff" when the query goes from in memory --> spilling (it doesn't degrade gracefully) but it is likely the fastest for queries that have sufficient memory

alamb · 2023-07-15T22:18:11Z

BTW after #6889 I think we'll be in pretty good shape to dump the accumulator state to disk (as Arrow arrays, perhaps) if we want to pursue this more

milenkovicm · 2023-07-15T22:46:04Z

Hi @alamb i have a poc code, a bit stale though, which will dump state to arrow. I've tried to rebase it recently but did not have time to finish it. It's in my branch at https://github.com/milenkovicm/arrow-datafusion/commit/e9d17d2a456707f88497606b3977a8a82199c7d1

Not sure if it still makes sense after all this time.

alamb · 2023-07-16T13:16:36Z

Hi @alamb i have a poc code, a bit stale though, which will dump state to arrow. I've tried to rebase it recently but did not have time to finish it. It's in my branch at https://github.com/milenkovicm/arrow-datafusion/commit/e9d17d2a456707f88497606b3977a8a82199c7d1

Thanks @milenkovicm -- that actually looks quite cool. If/when someone picks up this ticket I think it would be a good place to start from. I especially like how you encapsulated the spilling / state code in its own module

kazuyukitanimura · 2023-07-27T21:16:37Z

Hello,

We have an internal spilling implementation for partial aggregation. We are currently in the middle of applying back the recent improvements of aggregation (such as #7095). After that is done, we plan to contribute back the work to DataFusion, but I just found this ticket. I am wondering what the current status of this ticket is.
Our implementation is quite similar to what the description of this thicket says, but (obviously) different from @milenkovicm 's poc.

I am wondering if the community is willing to review yet another spilling implementation...

alamb · 2023-07-27T23:22:24Z

After that is done, we plan to contribute back the work to DataFusion, but I just found this ticket. I am wondering what the current status of this ticket is.

That sounds good

I am wondering if the community is willing to review yet another spilling implementation as it is one of the missing pices

I am willing to review another spilling implementation for sure -- it is one of the missing features for datafusion to be a complete analytic solution

To assit review, it would help to explain / document how it works (e.g. is it based on mmap, or are the intermediate groups spilled as arrow, or are they sorted?) ideally in code comments

I made some diagrams here that might be helpful (but are mostly related to streaming groupby)
https://docs.google.com/document/d/16rm5VR1nGkY6DedMCh1NUmThwf3RduAweaBH9b1h6AY/edit?usp=sharing

kazuyukitanimura · 2023-07-28T01:00:10Z

Thank you @alamb
Our implementation is using the intermediate groups spilled as arrow, very similar to ExternalSorter. I will post the PR and doc shortly

kazuyukitanimura · 2023-08-24T17:16:59Z

I finally posted a PR for this #7400
In particular, I explained how the implementation works at https://github.com/apache/arrow-datafusion/pull/7400/files#diff-0b72cdc7a5e02b6650430c3eb7bdea4f1d2d32867b927633d92abb4fcf31c81bR135
Thanks! cc @alamb

alamb mentioned this issue Jan 15, 2022

[Epic] Optionally Limit memory used by DataFusion plan #587

Closed

8 tasks

This was referenced Apr 30, 2022

Grouped Aggregate in row format #2375

Merged

Re-organize and rename aggregates physical plan #2387

Closed

Re-organize and rename aggregates physical plan #2388

Merged

milenkovicm mentioned this issue Sep 13, 2022

Generate hash aggregation output in smaller record batches #3461

Merged

alamb mentioned this issue Jan 31, 2023

Support StreamAggregation / streaming group by #5133

Closed

alamb mentioned this issue Feb 16, 2023

Update benchmarks on clickbench #5276

Closed

kazuyukitanimura mentioned this issue Aug 24, 2023

feat: Support spilling for hash aggregation #7400

Merged

alamb mentioned this issue Sep 14, 2023

Revert unecessary changes in first/last aggregator #7559

Closed

alamb closed this as completed in #7400 Sep 15, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Memory Limited GroupBy (Externalized / Spill) #1570

Memory Limited GroupBy (Externalized / Spill) #1570

alamb commented Jan 15, 2022 •

edited

Loading

jhorstmann commented Jan 21, 2022

milenkovicm commented Sep 10, 2022

alamb commented Sep 13, 2022

milenkovicm commented Sep 13, 2022

alamb commented Sep 13, 2022

milenkovicm commented Sep 13, 2022

alamb commented Sep 13, 2022

milenkovicm commented Sep 13, 2022 •

edited

Loading

alamb commented Sep 14, 2022

alamb commented Nov 28, 2022

alamb commented Jan 31, 2023

mingmwang commented Apr 25, 2023

crepererum commented Apr 25, 2023

mingmwang commented Apr 26, 2023

milenkovicm commented Apr 26, 2023

alamb commented Apr 27, 2023 •

edited

Loading

alamb commented Jul 15, 2023

milenkovicm commented Jul 15, 2023

alamb commented Jul 16, 2023

kazuyukitanimura commented Jul 27, 2023

alamb commented Jul 27, 2023

kazuyukitanimura commented Jul 28, 2023

kazuyukitanimura commented Aug 24, 2023

Memory Limited GroupBy (Externalized / Spill) #1570

Memory Limited GroupBy (Externalized / Spill) #1570

Comments

alamb commented Jan 15, 2022 • edited Loading

jhorstmann commented Jan 21, 2022

milenkovicm commented Sep 10, 2022

alamb commented Sep 13, 2022

milenkovicm commented Sep 13, 2022

alamb commented Sep 13, 2022

milenkovicm commented Sep 13, 2022

alamb commented Sep 13, 2022

milenkovicm commented Sep 13, 2022 • edited Loading

alamb commented Sep 14, 2022

alamb commented Nov 28, 2022

alamb commented Jan 31, 2023

mingmwang commented Apr 25, 2023

crepererum commented Apr 25, 2023

mingmwang commented Apr 26, 2023

milenkovicm commented Apr 26, 2023

alamb commented Apr 27, 2023 • edited Loading

alamb commented Jul 15, 2023

milenkovicm commented Jul 15, 2023

alamb commented Jul 16, 2023

kazuyukitanimura commented Jul 27, 2023

alamb commented Jul 27, 2023

kazuyukitanimura commented Jul 28, 2023

kazuyukitanimura commented Aug 24, 2023

alamb commented Jan 15, 2022 •

edited

Loading

milenkovicm commented Sep 13, 2022 •

edited

Loading

alamb commented Apr 27, 2023 •

edited

Loading