
Unexpected performance drop-off (~10x) aggregating 64bit int vs 32bit int groups #4208

alexander-beedie opened this issue Jul 26, 2022 · 3 comments · Fixed by #4304
alexander-beedie commented Jul 26, 2022

What happens?

I was looking at the grouped aggregation benchmark earlier, as referenced in Parallel Grouped Aggregation in DuckDB, and found a dramatic performance issue (so large that I suspect it is a bug) simply by using 64bit values, while keeping everything else about the benchmark the same.

Performance drops by a full order of magnitude (~10x) across the board.

To Reproduce

Run the grouped aggregation benchmark with the following minor change to the experiment() function:

# scale by a value that will force the underlying column dtypes to 64bit
scaling_factor = 5_000_000_000

g1 = np.repeat(np.arange(0, int(n_groups / 2)), int(n_rows / (n_groups / 2))) * scaling_factor
g2 = np.tile(np.arange(0, 2), int(n_rows / 2)) * scaling_factor
d = np.random.randint(0, n_rows, n_rows) * scaling_factor

This changes nothing about how much data is generated; it simply scales the test values to guarantee that 64bit column dtypes are applied. The result is an unexpected performance falloff.
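For convenience, a rough self-contained version of the modified experiment is sketched below. The exact query, data layout, and timing harness of the original benchmark may differ, so treat this as an approximation rather than the benchmark itself:

```python
import time

import duckdb
import numpy as np
import pandas as pd

n_rows, n_groups = 10_000_000, 1_000

# scale by a value that will force the underlying column dtypes to 64bit
scaling_factor = 5_000_000_000

g1 = np.repeat(np.arange(0, int(n_groups / 2)), int(n_rows / (n_groups / 2))) * scaling_factor
g2 = np.tile(np.arange(0, 2), int(n_rows / 2)) * scaling_factor
d = np.random.randint(0, n_rows, n_rows) * scaling_factor

df = pd.DataFrame({"g1": g1, "g2": g2, "d": d})
assert df.dtypes.eq("int64").all()  # all three columns end up as 64bit ints

con = duckdb.connect()
start = time.time()
con.execute("SELECT g1, g2, SUM(d) FROM df GROUP BY g1, g2").fetchall()
print(f"grouped aggregation: {time.time() - start:.3f}s")
```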

Results

Characterizing the slowdown under 32bit and 64bit regimes, with Polars as a reference implementation (all times in seconds):

| rows | groups | DuckDB 32bit | DuckDB 64bit | DuckDB slowdown | Polars 32bit | Polars 64bit | Polars slowdown |
|---:|---:|---:|---:|---:|---:|---:|---:|
| 10,000,000 | 1,000 | 0.0140 | 0.588 | 42.0x | 0.032 | 0.088 | 2.8x |
| 10,000,000 | 10,000 | 0.0520 | 0.664 | 12.8x | 0.031 | 0.089 | 2.9x |
| 10,000,000 | 100,000 | 0.0680 | 0.599 | 8.8x | 0.038 | 0.098 | 2.6x |
| 10,000,000 | 1,000,000 | 0.1280 | 1.481 | 11.6x | 0.090 | 0.183 | 2.0x |
| 10,000,000 | 10,000,000 | 0.1980 | 2.830 | 14.3x | 0.228 | 0.307 | 1.3x |
| 100,000,000 | 1,000 | 0.2130 | 4.145 | 19.5x | 0.319 | 0.841 | 2.6x |
| 100,000,000 | 100,000 | 0.5490 | 4.530 | 8.3x | 0.415 | 0.978 | 2.4x |
| 100,000,000 | 1,000,000 | 0.8060 | 7.525 | 9.3x | 0.841 | 1.893 | 2.3x |
| 100,000,000 | 10,000,000 | 1.5920 | 16.932 | 10.6x | 1.609 | 2.984 | 1.9x |
| 100,000,000 | 100,000,000 | 2.7520 | 31.819 | 11.6x | 3.729 | 5.861 | 1.6x |

Moving to 64bit...

  • DuckDB slows down by ~10x across the board; this degree of drop-off is not proportional to the extra work being done.

  • Polars slows down by ~2x, which is broadly what you'd expect, given that 64bit values occupy twice the memory of 32bit values (a quick check of that expectation is sketched below).
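
As a rough sanity check of the "twice the memory" expectation, the per-column footprint really does double when moving from 32bit to 64bit integers (plain numpy, nothing DuckDB-specific):

```python
import numpy as np

a32 = np.arange(10_000_000, dtype=np.int32)
a64 = a32.astype(np.int64)

# 40,000,000 bytes vs 80,000,000 bytes for the same 10M values
print(a32.nbytes, a64.nbytes)
```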

[Figure: DuckDB vs Polars grouped aggregation timings. y-axis: time in seconds (lower is better); x-axis: rows/groups buckets from the table above.]

OS:

macOS (12.4)

DuckDB Version:

0.4.0

DuckDB Client:

Python

Full Name:

Alexander Beedie

Have you tried this on the latest master branch?

  • I agree (no - tested on latest officially released package, v0.4.0)

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree
alexander-beedie changed the title from "Huge performance drop-off (~10x) aggregating 64 bit int vs 32 bit int groups" to "Unexpected performance drop-off (~10x) aggregating 64bit int vs 32bit int groups" on Jul 26, 2022.
Mytherin (Collaborator) commented:

Thanks for the report! This data set seems to trigger a worst-case behavior in our hash function, causing all the data to be handled in the same hash partition. Modifying the hash function fixes the issue. This is likely because the data all consists of very large numbers that are evenly spaced out. We should definitely have a look at fixing this, although I doubt this is a common data distribution in practice.
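
To illustrate the failure mode with a toy model (this is not DuckDB's actual hash function, just a sketch of how this can happen): every key in the modified benchmark is a multiple of 5,000,000,000, which is itself a multiple of 2^9, so the low bits of every key are zero. A multiplication-only hash preserves that property, and radix partitioning on the low bits of such a hash then sends every row to the same partition:

```python
import numpy as np

# Toy model of a weak (multiplication-only) hash plus radix partitioning on
# the low bits -- NOT DuckDB's real hash, just an illustration.
scaling_factor = 5_000_000_000               # = 5**10 * 2**9, so the low 9 bits are zero
keys = np.arange(100_000, dtype=np.uint64) * np.uint64(scaling_factor)

multiplier = np.uint64(0x9E3779B97F4A7C15)   # arbitrary odd constant
hashes = keys * multiplier                   # wraps modulo 2**64; the zero low bits survive

partitions = hashes & np.uint64(0xFF)        # radix partition on the low 8 bits
print(np.unique(partitions))                 # -> [0]: every key lands in one partition
```

A single hot partition means one thread ends up doing essentially all of the aggregation work, which would explain a slowdown far beyond the ~2x cost of wider values.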

alexander-beedie (Author) commented:

> We should definitely have a look at fixing this...

That would be great!

> ...although I doubt this is a common data distribution in practice.

Hopefully not; I might try some other workloads to see if this can be triggered in other data regimes, because when it does occur it seems to hit hard, heh. Thanks! :)

Mytherin (Collaborator) commented Aug 7, 2022:

This should be fixed on the master now by the introduction of a stronger hash function (#4304).
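
For anyone wanting to sanity-check the fix on a build that includes #4304: assuming the SQL-level hash() function exposes the same internal hash used by the aggregation hash table (an assumption, not something confirmed in this thread), one can bucket its low bits over similarly scaled keys and confirm that they now spread out; the 256-bucket split below is purely illustrative, not DuckDB's internal partition count.

```python
import duckdb
import numpy as np

con = duckdb.connect()
# hash() over keys that are multiples of 5,000,000,000, as in the report
rows = con.execute(
    "SELECT hash(range * 5000000000) FROM range(100000)"
).fetchall()

hashes = np.array([r[0] for r in rows], dtype=np.uint64)
buckets, counts = np.unique(hashes & np.uint64(0xFF), return_counts=True)

# with a well-behaved hash, all 256 buckets should be populated fairly evenly
print(len(buckets), counts.min(), counts.max())
```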
