Radix Partitioned Hash Table Rework #8475
Merged
Conversation
Just as I fixed all tests, another aggregate test without ordering sneaked in from
Only the R tests are failing.
Thanks! Looks great!
what a beautiful job, history in the making, running TPCH_SF50 with only 13 GB of RAM and it just works!!!
krlmlr pushed a commit to krlmlr/duckdb-r that referenced this pull request on Sep 9, 2023: Radix Partitioned Hash Table Rework
This PR re-works RadixPartitionedHashTable, improving both in-memory and out-of-core aggregation.

Parallelization
To efficiently do aggregations in parallel, threads need a thread-local hash table. After we've seen all data, those thread-local hash tables need to be combined into a thread-global state. This combining also needs to happen in parallel, which is why the data has to be partitioned.
Our previous implementation would create multiple hash tables per thread, which would keep growing. This created a lot of cache pressure, i.e., threads needed to access large regions of memory that exceeded the size of the CPU's cache.
This PR changes that, creating just one very small hash table per thread, which should fit in the CPU's cache. Although each thread has only one hash table, the data that resides in it is partitioned. If the hash table fills up, we reset its 'first part', i.e., the hash map containing pointers to the data, and reset its count to 0. Note, however, that the hash table still owns the data that was added to it; the data is simply no longer 'active' in the hash table. Once we've done this, we keep adding more data to the hash table, and this cycle repeats until we've seen all data.
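To make this concrete, here is a minimal, self-contained sketch of the idea; it is not DuckDB's actual implementation, and all names, constants, and the data layout are simplified assumptions. A small, cache-sized hash table keeps only pointers in its 'first part', while the entries themselves live in radix-partitioned storage owned by the table; when the table fills up, the pointer array is wiped and the partitioned data is kept for the later combine.

```cpp
// Simplified sketch of a cache-sized, thread-local aggregate hash table whose
// data is radix-partitioned. All names and constants here are hypothetical.
#include <algorithm>
#include <cstdint>
#include <deque>
#include <vector>

static constexpr uint64_t RADIX_BITS = 4;          // 16 partitions
static constexpr uint64_t NUM_PARTITIONS = 1ULL << RADIX_BITS;
static constexpr uint64_t CAPACITY = 1ULL << 12;   // small enough to stay in the CPU cache

struct GroupEntry {
	uint64_t hash;
	uint64_t key;
	uint64_t aggregate; // e.g. a running COUNT(*)
};

class ThreadLocalAggregateHT {
public:
	ThreadLocalAggregateHT() : pointers(CAPACITY, nullptr) {
	}

	void Sink(uint64_t key, uint64_t hash) {
		if (count >= CAPACITY / 2) {
			// The hash table is "full": wipe the first part, keep the data.
			ResetFirstPart();
		}
		uint64_t slot = hash & (CAPACITY - 1);
		while (pointers[slot]) { // linear probing
			if (pointers[slot]->hash == hash && pointers[slot]->key == key) {
				pointers[slot]->aggregate++; // update an existing group
				return;
			}
			slot = (slot + 1) & (CAPACITY - 1);
		}
		// New group: append it to the partition chosen by the upper hash bits.
		// (std::deque keeps pointers to existing elements stable on push_back.)
		auto &partition = partitions[hash >> (64 - RADIX_BITS)];
		partition.push_back({hash, key, 1});
		pointers[slot] = &partition.back();
		count++;
	}

	// Wipe only the pointer array; the partitioned data stays owned by this
	// hash table. Groups seen again later are appended as new entries and get
	// merged during the thread-global combine.
	void ResetFirstPart() {
		std::fill(pointers.begin(), pointers.end(), nullptr);
		count = 0;
	}

	// Partitioned data handed to the thread-global state once all input is seen.
	std::deque<GroupEntry> partitions[NUM_PARTITIONS];

private:
	std::vector<GroupEntry *> pointers; // the "first part": pointers into the data
	uint64_t count = 0;
};
```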
Out-of-core
Once we've seen all data, the partitions are combined. Previously, an event was scheduled that launched tasks to combine the partitions; once that was done, the partitions were scanned and the aggregation was completed. This is very inefficient for out-of-core aggregation, because it requires going through all of the data to combine it, and then going through all of the data again to scan it.
This PR addresses this by moving the 'combine' step to the Source phase of the aggregation. Now, we finalize a partition and then immediately scan it, pushing the data to the operator above. This means we're doing one less pass over the data, causing less I/O.
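As a rough illustration, here is a self-contained sketch of this finalize-then-scan loop, using hypothetical types and helper names rather than DuckDB's actual operator interfaces: each partition is combined across threads, immediately scanned, and pushed to the parent, so no partition is visited twice.

```cpp
// Simplified sketch of finalizing and scanning one partition at a time during
// the Source phase. Types and function names are hypothetical placeholders.
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// One partition of thread-local aggregate data: group key -> partial count.
using PartitionData = std::unordered_map<uint64_t, uint64_t>;

// Stand-in for handing a chunk of finished groups to the parent operator.
void PushToOperatorAbove(const std::vector<std::pair<uint64_t, uint64_t>> &chunk) {
	(void)chunk; // e.g. the parent operator consumes these rows
}

// thread_local_data[t][p] holds thread t's data for partition p.
void SourcePhase(std::vector<std::vector<PartitionData>> &thread_local_data) {
	const std::size_t num_partitions = thread_local_data.empty() ? 0 : thread_local_data[0].size();
	for (std::size_t p = 0; p < num_partitions; p++) {
		// 1. Finalize partition p: combine every thread's partial results.
		PartitionData combined;
		for (auto &per_thread : thread_local_data) {
			for (const auto &entry : per_thread[p]) {
				combined[entry.first] += entry.second;
			}
			per_thread[p].clear(); // thread-local data can be released right away
		}
		// 2. Immediately scan the finalized partition and push it upward, so we
		//    never have to revisit this partition (one pass instead of two).
		std::vector<std::pair<uint64_t, uint64_t>> chunk(combined.begin(), combined.end());
		PushToOperatorAbove(chunk);
		// 'combined' is destroyed here, freeing memory before the next partition.
	}
}
```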
Benchmark
Of course I had to run the same benchmark as in the previous PR (#7931), again on my laptop (a 2020 MacBook Pro with an M1 CPU). The query aggregates lineitem30.parquet, the first 30 million rows of TPC-H's lineitem at SF10, which has 0 duplicates. These are the numbers:

At a memory limit of 10.0GB, the data fully fits in memory. As we can see, there is almost a 3x improvement. When we reduce the memory limit, we now see a much more graceful degradation in performance than before. Even at a memory limit of just 1.0GB, the query still finishes within 5 seconds, which is more than 2x faster than before. Interestingly, running this query with a 1.0GB memory limit is now faster than it was with the 10.0GB memory limit.
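For anyone wanting to try a run like this themselves, here is a minimal sketch using DuckDB's C++ API; the memory_limit setting is the real DuckDB setting, but the GROUP BY query is only a hypothetical stand-in, since the PR's exact benchmark query is not reproduced here.

```cpp
// Reproducing this kind of experiment with DuckDB's C++ API. The query below is
// a hypothetical stand-in for the benchmark query, which is not shown here.
#include "duckdb.hpp"

int main() {
	duckdb::DuckDB db(nullptr);
	duckdb::Connection con(db);

	// Cap DuckDB's memory so that a large aggregation is forced to go out-of-core.
	con.Query("SET memory_limit='1GB'");

	// Hypothetical aggregation over the benchmark file.
	auto result = con.Query(
	    "SELECT l_orderkey, count(*) FROM 'lineitem30.parquet' GROUP BY l_orderkey");
	result->Print();
	return 0;
}
```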
I've also re-run ClickBench on c6a.metal, which has 192 threads, to see how the parallelization has improved. Here are the results:
As we can see, we have improved by a lot on this benchmark as well.
Future Work
Currently, thread-local data is only combined after all data has been seen. This is potentially problematic when there are many distinct groups: if every thread sees the same group just once, that group resides in memory once per thread. This is wasteful, since those copies could be merged into a thread-global state, reducing the memory needed. In a future PR, I will implement "early and often" thread-global combines during the Sink phase of the operator.
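One possible shape for this, purely as a hypothetical sketch and not code from this PR, is to have each thread periodically merge its partitions into a shared, per-partition global state during Sink, so that duplicate groups across threads are collapsed before memory fills up:

```cpp
// Hypothetical sketch of "early and often" thread-global combines during Sink.
#include <cstdint>
#include <mutex>
#include <unordered_map>

static constexpr int NUM_PARTITIONS = 16;

using PartitionData = std::unordered_map<uint64_t, uint64_t>; // group key -> partial count

struct GlobalSinkState {
	std::mutex locks[NUM_PARTITIONS]; // one lock per partition keeps contention low
	PartitionData partitions[NUM_PARTITIONS];
};

struct LocalSinkState {
	PartitionData partitions[NUM_PARTITIONS];
	uint64_t rows_since_combine = 0;
};

// Called from Sink every so often: once a thread has buffered enough rows
// locally, merge its partitions into the global state and start over, so the
// same group is not kept in memory once per thread until the very end.
void MaybeCombineEarly(LocalSinkState &local, GlobalSinkState &global, uint64_t threshold) {
	if (local.rows_since_combine < threshold) {
		return;
	}
	for (int p = 0; p < NUM_PARTITIONS; p++) {
		std::lock_guard<std::mutex> guard(global.locks[p]);
		for (const auto &entry : local.partitions[p]) {
			global.partitions[p][entry.first] += entry.second;
		}
		local.partitions[p].clear();
	}
	local.rows_since_combine = 0;
}
```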