Count distinct support multiple expressions #5939

allenma · 2023-04-10T10:20:14Z

Which issue does this PR close?

Closes #5619.

Rationale for this change

support count distinct aggregation for multiple expressions like: select count(distinct col1, col2) from table1; will return distinct count of (col1, col2), any value is null in the columns will not be calculated.

What changes are included in this PR?

Support count distinct for multiple expressions, most code is from the previous commit, which is removed in the PR: #5408

Are these changes tested?

Yes, add new UT and leverage existing ut.

Are there any user-facing changes?

No

Dandandan · 2023-04-10T14:54:38Z

datafusion/physical-expr/src/aggregate/count_distinct.rs


-type DistinctScalarValues = ScalarValue;
+#[derive(Debug, PartialEq, Eq, Hash, Clone)]
+struct DistinctScalarValues(Vec<ScalarValue>);


The downside is some performance loss / increased memory usage for single expressions.

ozankabak · 2023-04-10T16:59:10Z

How would the code look like if we separate impls (multiple vs. single COUNT)? Is it possible to do this with tolerable code duplication so we don't suffer from the performance loss @Dandandan mentions?

allenma · 2023-04-11T04:53:22Z

@Dandandan @ozankabak , I did the benchmark with the new implementation, actually there is little performance downgrade:

Benchmarking aggregate_query_no_group_by_count_distinct_wide: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 61.0s, or reduce sample count to 10.
aggregate_query_no_group_by_count_distinct_wide
                        time:   [587.01 ms 598.43 ms 611.68 ms]
                        change: [-6.8593% -3.2992% +0.1757%] (p = 0.08 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

Benchmarking aggregate_query_no_group_by_count_distinct_narrow: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 40.1s, or reduce sample count to 10.
aggregate_query_no_group_by_count_distinct_narrow
                        time:   [399.48 ms 415.63 ms 438.63 ms]
                        change: [-1.1234% +3.7277% +10.592%] (p = 0.20 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

I increase the test array size from 65536 to 134_217_728 to reduce the env noise, and the benchmark command is:
cargo bench --bench aggregate_query_sql

alamb · 2023-05-03T20:01:30Z

What is the status of this PR? @Dandandan are you satisfied with the results? How do we move the PR forward?

ozankabak · 2023-05-03T20:05:42Z

My two cents: If we are to apply the same performance sensitivity we are recently showing to aggregation-related changes to this PR, my impression would be we'd want to think about a refactor that doesn't result in a slow down for the single argument codepath.

alamb · 2023-05-03T20:23:28Z

My two cents: If we are to apply the same performance sensitivity we are recently showing to aggregation-related changes to this PR, my impression would be we'd want to think about a refactor that doesn't result in a slow down for the single argument codepath.

I agree -- if we want to proceed with this PR I can run the bench.sh benchmark script to see what it says (it has pretty good coverage of aggregation I think)

alamb · 2023-05-22T13:40:14Z

Marking as draft as this PR seems to have stalled -- when it is next ready for review, please mark it as such. Thank you

github-actions · 2024-05-13T01:47:06Z

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

Count distinct support multiple expressions

e13de1e

github-actions bot added core Core DataFusion crate physical-expr Changes to the physical-expr crates labels Apr 10, 2023

Dandandan reviewed Apr 10, 2023

View reviewed changes

alamb marked this pull request as draft May 22, 2023 13:40

github-actions bot added the Stale PR has not had any activity for some time label May 13, 2024

github-actions bot closed this May 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Count distinct support multiple expressions #5939

Count distinct support multiple expressions #5939

Uh oh!

allenma commented Apr 10, 2023 •

edited

Loading

Uh oh!

Dandandan Apr 10, 2023

Uh oh!

ozankabak commented Apr 10, 2023

Uh oh!

allenma commented Apr 11, 2023

Uh oh!

alamb commented May 3, 2023

Uh oh!

ozankabak commented May 3, 2023

Uh oh!

alamb commented May 3, 2023

Uh oh!

alamb commented May 22, 2023

Uh oh!

github-actions bot commented May 13, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Count distinct support multiple expressions #5939

Count distinct support multiple expressions #5939

Uh oh!

Conversation

allenma commented Apr 10, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

Dandandan Apr 10, 2023

Choose a reason for hiding this comment

Uh oh!

ozankabak commented Apr 10, 2023

Uh oh!

allenma commented Apr 11, 2023

Uh oh!

alamb commented May 3, 2023

Uh oh!

ozankabak commented May 3, 2023

Uh oh!

alamb commented May 3, 2023

Uh oh!

alamb commented May 22, 2023

Uh oh!

github-actions bot commented May 13, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

allenma commented Apr 10, 2023 •

edited

Loading