Radix Partitioned Hash Table Rework #8475


Merged
merged 174 commits into from Sep 4, 2023

Conversation

lnkuiper
Contributor

@lnkuiper lnkuiper commented Aug 4, 2023

This PR re-works RadixPartitionedHashTable, improving both in-memory and out-of-core aggregation.

Parallelization

To perform aggregations efficiently in parallel, each thread needs a thread-local hash table. After all data has been seen, those thread-local hash tables need to be combined into a thread-global state. This combining also needs to happen in parallel, which is why the data is partitioned.

Our previous implementation created multiple hash tables per thread, which kept growing. This created a lot of cache pressure: threads had to access large regions of memory that exceeded the size of the CPU's cache.

This PR changes that: it creates just one very small hash table per thread, which should fit in the CPU's cache. Although there is only one hash table per thread, the data that resides in it is partitioned. If the hash table fills up, we reset its 'first part', i.e., the hash map containing pointers to the data, and reset its count to 0. The hash table still owns the data that was added to it; the data is simply no longer 'active' in the hash table. We then keep adding data to the hash table, and this cycle repeats until all data has been seen.
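To make this concrete, here is a minimal, single-threaded sketch of the idea, with made-up names rather than DuckDB's actual classes: a small hash table whose entries point into radix-partitioned data, where filling up resets only the pointer array and the count, while the partitioned data stays owned by the thread and is combined later.

#include <algorithm>
#include <cstdint>
#include <deque>
#include <vector>

struct GroupRow {
	uint64_t hash;
	int64_t key;
	int64_t count; // example aggregate state: COUNT(*)
};

class ThreadLocalPartitionedHT {
public:
	static constexpr uint64_t CAPACITY = 1ULL << 12; // small enough to stay in the CPU cache
	static constexpr uint64_t RADIX_BITS = 4;        // 2^4 = 16 partitions
	static constexpr uint64_t NUM_PARTITIONS = 1ULL << RADIX_BITS;

	ThreadLocalPartitionedHT() : partitions(NUM_PARTITIONS), entries(CAPACITY, nullptr) {
	}

	void Sink(int64_t key, uint64_t hash) {
		if (count >= CAPACITY / 2) {
			ResetFirstPart(); // table is "full": forget the pointers, keep the data
		}
		uint64_t slot = hash & (CAPACITY - 1);
		while (entries[slot]) {
			if (entries[slot]->hash == hash && entries[slot]->key == key) {
				entries[slot]->count++; // existing group: update the aggregate
				return;
			}
			slot = (slot + 1) & (CAPACITY - 1); // linear probing
		}
		// New group: append it to its radix partition and point the slot at it.
		// std::deque keeps references to existing rows valid when appending.
		// Note: after a reset, a previously seen group gets a fresh row here;
		// such duplicates are merged later, when partitions are combined.
		auto &partition = partitions[hash >> (64 - RADIX_BITS)];
		partition.push_back({hash, key, 1});
		entries[slot] = &partition.back();
		count++;
	}

	// The partitioned data survives resets; after all input has been seen,
	// partitions from all threads are combined into the global state.
	std::vector<std::deque<GroupRow>> partitions;

private:
	void ResetFirstPart() {
		std::fill(entries.begin(), entries.end(), nullptr);
		count = 0;
	}

	std::vector<GroupRow *> entries; // the "first part": pointers into the partitioned data
	uint64_t count = 0;
};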

Out-of-core

Once all data has been seen, the partitions are combined. Previously, an event was scheduled that launched tasks to combine the partitions; only after this was done were the partitions scanned to complete the aggregation. This is very inefficient for out-of-core aggregation, because it requires going through all of the data once to combine it, and then again to scan it.

This PR addresses this by moving the 'combine' step to the Source phase of the aggregation. Now we finalize a partition and then immediately scan it, pushing the data to the operator above. This saves one pass over the data, and therefore I/O.
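The shape of this idea is sketched below (hypothetical names, single-threaded for clarity, not DuckDB's actual operator interface): each call into the Source finalizes the next radix partition by combining it from all thread-local tables, and then immediately scans it, so each partition is materialized and handed upward in a single pass.

#include <cstddef>
#include <utility>
#include <vector>

struct DataChunk {}; // stand-in for a chunk of output rows

struct ThreadLocalHT {
	// Merge this thread's data for radix partition `p` into `target`.
	// (In the real operator this is where the aggregate states are combined.)
	void CombinePartition(size_t p, std::vector<DataChunk> &target) {
		(void)p;
		(void)target;
	}
};

class AggregateSource {
public:
	AggregateSource(std::vector<ThreadLocalHT *> tables, size_t num_partitions)
	    : thread_tables(std::move(tables)), partition_count(num_partitions) {
	}

	// Called repeatedly by the pipeline; returns false when all partitions are exhausted.
	bool GetData(DataChunk &output) {
		while (current_partition < partition_count) {
			if (!finalized) {
				// Finalize: combine this partition from every thread-local table ...
				for (auto *ht : thread_tables) {
					ht->CombinePartition(current_partition, scan_buffer);
				}
				finalized = true;
			}
			if (!scan_buffer.empty()) {
				// ... and immediately scan it, pushing data to the operator above.
				output = scan_buffer.back();
				scan_buffer.pop_back();
				return true;
			}
			// Partition fully scanned: move on to the next one.
			current_partition++;
			finalized = false;
		}
		return false;
	}

private:
	std::vector<ThreadLocalHT *> thread_tables;
	size_t partition_count;
	size_t current_partition = 0;
	bool finalized = false;
	std::vector<DataChunk> scan_buffer;
};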

Benchmark

Of course I had to run the same benchmark as in the previous PR (#7931), again on my laptop (2020 MacBook Pro with an M1 CPU). The query is

select count(*) from (select distinct * from 'lineitem30.parquet');

where lineitem30.parquet is the first 30 million rows of TPC-H's lineitem at SF10, which has 0 duplicates. These are the numbers:

memory limit previous this PR
10.0GB 8.52s 2.91s
9.0GB 9.35s 3.45s
8.0GB 7.89s 3.45s
7.0GB 7.81s 3.47s
6.0GB 6.84s 3.41s
5.0GB 7.39s 3.67s
4.0GB 9.04s 3.87s
3.0GB 8.76s 4.20s
2.0GB 8.08s 4.39s
1.0GB 10.27s 4.91s

At a memory limit of 10.0GB, the data fully fits in memory, and there is almost a 3x improvement. When the memory limit is reduced, performance now degrades much more gracefully than before. At a memory limit of just 1.0GB, the query still finishes within 5 seconds, more than 2x faster than before. Interestingly, running this query with a 1.0GB memory limit is now faster than running it with the old code and a 10.0GB memory limit.
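For reference, a run like the ones in the table above could be reproduced along these lines with DuckDB's C++ API (memory_limit is a real DuckDB setting; lineitem30.parquet is the file described above):

#include "duckdb.hpp"
#include <iostream>

int main() {
	duckdb::DuckDB db(nullptr); // in-memory database
	duckdb::Connection con(db);

	// Vary the memory limit to reproduce the different rows of the table above.
	con.Query("SET memory_limit='1GB'");
	auto result = con.Query("select count(*) from (select distinct * from 'lineitem30.parquet')");
	std::cout << result->ToString() << std::endl;
	return 0;
}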

I've also re-run ClickBench on c6a.metal, which has 192 threads, to see how the parallelization has improved. Here are the results (times in seconds):

query duckdb (old) duckdb (new)
q1 0.00 0.00
q2 0.01 0.00
q3 0.02 0.01
q4 0.02 0.01
q5 0.36 0.17
q6 0.38 0.25
q7 0.01 0.01
q8 0.01 0.00
q9 0.39 0.41
q10 0.56 0.06
q11 0.27 0.06
q12 0.40 0.06
q13 0.31 0.40
q14 0.53 0.28
q15 0.31 0.22
q16 0.32 0.19
q17 0.54 0.54
q18 0.52 0.57
q19 1.00 0.59
q20 0.44 0.15
q21 0.14 0.10
q22 0.17 0.10
q23 0.21 0.22
q24 1.34 0.04
q25 0.05 0.03
q26 0.13 0.06
q27 0.08 0.06
q28 0.27 0.10
q29 1.33 0.11
q30 0.11 0.12
q31 0.24 0.19
q32 0.32 0.45
q33 1.68 1.34
q34 1.21 1.71
q35 1.04 0.22
q36 0.30 0.04
q37 0.22 0.05
q38 0.21 0.04
q39 0.33 0.05
q40 0.19 0.04
q41 0.15 0.03
q42 0.17 0.03
q43 0.10 0.03

As we can see, this benchmark improves significantly as well.

Future Work

Currently, thread-local data is only combined after all data has been seen. This is potentially problematic when there are many distinct groups: if every thread sees the same group just once, that group resides in memory once per thread. This is wasteful, since the group could be reduced into a thread-global state, lowering the memory needed. In a future PR, I will implement "early and often" thread-global combines during the Sink phase of the operator.

@github-actions github-actions bot marked this pull request as draft August 31, 2023 08:56
@lnkuiper lnkuiper marked this pull request as ready for review August 31, 2023 09:14
@github-actions github-actions bot marked this pull request as draft August 31, 2023 14:00
@lnkuiper lnkuiper marked this pull request as ready for review August 31, 2023 14:00
@github-actions github-actions bot marked this pull request as draft September 1, 2023 07:03
@lnkuiper lnkuiper marked this pull request as ready for review September 1, 2023 07:56
@github-actions github-actions bot marked this pull request as draft September 1, 2023 13:53
@lnkuiper lnkuiper marked this pull request as ready for review September 1, 2023 13:53
@lnkuiper
Contributor Author

lnkuiper commented Sep 1, 2023

Just as I fixed all tests, another aggregate test without ordering sneaked in from main. Hopefully CI is happy this time 🤞

@lnkuiper
Contributor Author

lnkuiper commented Sep 4, 2023

Only the R tests are failing (test_struct.R), which are unrelated to this PR. I think this can be merged.

@Mytherin Mytherin merged commit c54d7f5 into duckdb:main Sep 4, 2023
@Mytherin
Collaborator

Mytherin commented Sep 4, 2023

Thanks! Looks great!

@djouallah

what a beautiful job, history in the making, running TPCH_SF50 with only 13 GB of RAM and it just works !!!

krlmlr pushed a commit to krlmlr/duckdb-r that referenced this pull request Sep 9, 2023
Radix Partitioned Hash Table Rework