Skip to content

[Aggregate] DISTINCT aggregates *with* GROUP BY are now executed in parallel #5146

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 65 commits into from
Nov 1, 2022

Conversation

Tishj
Copy link
Contributor

@Tishj Tishj commented Oct 31, 2022

This PR addresses #4275

Previously the GROUP BY aggregates (handled by the HashAggregateOperator) would be forced to run (essentially) single-threaded when any of the aggregates to be computed would have the DISTINCT modifier.

This was done because aggregate states were always being directly computed.
If that would be done for DISTINCT aggregates, that could result in two threads both updating their local aggregate state, and when merging there would be no way to reconcile that problem.

What now happens is for every grouping_set (a regular GROUP BY regardless of the number of columns is considered one grouping_set) is that we create RadixPartitionedHashTables for every distinct aggregate we have.

These hash tables are then grouped on the columns of the group + the children (inputs) of the distinct aggregate.
That way we can ensure that we register each new (distinct) combination of those columns, and we preserve the context so we can create these in parallel and merge without loss of information.

We also delay computing the AggregateState for these aggregates until the Finalize step, where we scan the hashtables of the distinct aggregate for a grouping, and add them to the main hashtable for that grouping.

I have added a benchmark:

load
CREATE TABLE integers AS SELECT i % 5 AS i, i % 25 as j FROM range(0, 10000000) tbl(i);

run
SELECT SUM(distinct i), COUNT(distinct i), AVG(distinct i), PRODUCT(distinct i) FROM integers group by j

old timings (master):

name    run     timing
benchmark/micro/aggregate/grouped_distinct.benchmark    1       0.399002
benchmark/micro/aggregate/grouped_distinct.benchmark    2       0.403178
benchmark/micro/aggregate/grouped_distinct.benchmark    3       0.400935
benchmark/micro/aggregate/grouped_distinct.benchmark    4       0.400398
benchmark/micro/aggregate/grouped_distinct.benchmark    5       0.403561

new timings (this branch):

name    run     timing
benchmark/micro/aggregate/grouped_distinct.benchmark    1       0.065063
benchmark/micro/aggregate/grouped_distinct.benchmark    2       0.065781
benchmark/micro/aggregate/grouped_distinct.benchmark    3       0.063701
benchmark/micro/aggregate/grouped_distinct.benchmark    4       0.064280
benchmark/micro/aggregate/grouped_distinct.benchmark    5       0.064783

Tishj added 30 commits October 21, 2022 13:35
…ut for Sink, but all aggregate children come out as NULL
…using the size of the group_expressions instead of the size of the grouping_set of the current grouping when initializing the grouping_set of the distinct aggregate data
Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Looks great. Very nice performance improvement!

Some comments below:

@Mytherin
Copy link
Collaborator

Could we also add some large grouping set tests with TPC-H, e.g.:

CALL dbgen(sf=1);

select grouping(l_returnflag, l_linestatus), l_returnflag, l_linestatus, count(distinct l_orderkey), count(distinct l_comment) 
from lineitem
group by cube(l_returnflag, l_linestatus)
order by all;



SELECT COUNT(DISTINCT l_orderkey), COUNT(DISTINCT l_partkey), COUNT(*), MIN(l_orderkey), MAX(l_orderkey), MIN(l_partkey), MAX(l_partkey), SUM(distinct_comment), AVG(distinct_comment)
FROM (
select l_orderkey, l_partkey, count(distinct l_comment)  AS distinct_comment
from lineitem
group by cube(l_orderkey, l_partkey)
);

Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - one more comment and some answers to inline questions, otherwise everything looks good.

@Tishj
Copy link
Contributor Author

Tishj commented Nov 1, 2022

Ran all the tests introduced in this PR on master, all pass except for the case mentioned in #5070 as expected.

@Mytherin Mytherin linked an issue Nov 1, 2022 that may be closed by this pull request
2 tasks
@Mytherin Mytherin merged commit 270b584 into duckdb:master Nov 1, 2022
@Mytherin
Copy link
Collaborator

Mytherin commented Nov 1, 2022

Thanks! Everything looks good now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Incorrect count results by COUNT DISTINCT with FILTER and GROUP BY query
2 participants