Skip to content

Conversation

@allenma
Copy link
Contributor

@allenma allenma commented Apr 10, 2023

Which issue does this PR close?

Closes #5619.

Rationale for this change

support count distinct aggregation for multiple expressions like: select count(distinct col1, col2) from table1; will return distinct count of (col1, col2), any value is null in the columns will not be calculated.

What changes are included in this PR?

Support count distinct for multiple expressions, most code is from the previous commit, which is removed in the PR: #5408

Are these changes tested?

Yes, add new UT and leverage existing ut.

Are there any user-facing changes?

No

@github-actions github-actions bot added core Core DataFusion crate physical-expr Changes to the physical-expr crates labels Apr 10, 2023

type DistinctScalarValues = ScalarValue;
#[derive(Debug, PartialEq, Eq, Hash, Clone)]
struct DistinctScalarValues(Vec<ScalarValue>);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The downside is some performance loss / increased memory usage for single expressions.

@ozankabak
Copy link
Contributor

How would the code look like if we separate impls (multiple vs. single COUNT)? Is it possible to do this with tolerable code duplication so we don't suffer from the performance loss @Dandandan mentions?

@allenma
Copy link
Contributor Author

allenma commented Apr 11, 2023

@Dandandan @ozankabak , I did the benchmark with the new implementation, actually there is little performance downgrade:

Benchmarking aggregate_query_no_group_by_count_distinct_wide: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 61.0s, or reduce sample count to 10.
aggregate_query_no_group_by_count_distinct_wide
                        time:   [587.01 ms 598.43 ms 611.68 ms]
                        change: [-6.8593% -3.2992% +0.1757%] (p = 0.08 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

Benchmarking aggregate_query_no_group_by_count_distinct_narrow: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 40.1s, or reduce sample count to 10.
aggregate_query_no_group_by_count_distinct_narrow
                        time:   [399.48 ms 415.63 ms 438.63 ms]
                        change: [-1.1234% +3.7277% +10.592%] (p = 0.20 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

I increase the test array size from 65536 to 134_217_728 to reduce the env noise, and the benchmark command is:
cargo bench --bench aggregate_query_sql

@alamb
Copy link
Contributor

alamb commented May 3, 2023

What is the status of this PR? @Dandandan are you satisfied with the results? How do we move the PR forward?

@ozankabak
Copy link
Contributor

My two cents: If we are to apply the same performance sensitivity we are recently showing to aggregation-related changes to this PR, my impression would be we'd want to think about a refactor that doesn't result in a slow down for the single argument codepath.

@alamb
Copy link
Contributor

alamb commented May 3, 2023

My two cents: If we are to apply the same performance sensitivity we are recently showing to aggregation-related changes to this PR, my impression would be we'd want to think about a refactor that doesn't result in a slow down for the single argument codepath.

I agree -- if we want to proceed with this PR I can run the bench.sh benchmark script to see what it says (it has pretty good coverage of aggregation I think)

@alamb
Copy link
Contributor

alamb commented May 22, 2023

Marking as draft as this PR seems to have stalled -- when it is next ready for review, please mark it as such. Thank you

@alamb alamb marked this pull request as draft May 22, 2023 13:40
@github-actions
Copy link

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label May 13, 2024
@github-actions github-actions bot closed this May 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate physical-expr Changes to the physical-expr crates Stale PR has not had any activity for some time

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Count() and Count(Distinct )should accept multiple exprs

4 participants