Count distinct support multiple expressions #5939
type DistinctScalarValues = ScalarValue;
#[derive(Debug, PartialEq, Eq, Hash, Clone)]
struct DistinctScalarValues(Vec<ScalarValue>);
The downside is some performance loss / increased memory usage for single expressions.
What would the code look like if we separated the implementations (multiple vs. single COUNT)? Is it possible to do this with tolerable code duplication, so we don't suffer the performance loss @Dandandan mentions?
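One way to picture the separation asked about here is a storage type with a dedicated single-expression variant. This is only a hypothetical sketch, not code from the PR: `Scalar` is a stand-in for DataFusion's `ScalarValue`, `DistinctValues` and its methods are invented names, and NULL handling is left out. Whether it actually recovers the single-expression performance would need to be measured.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for DataFusion's `ScalarValue`; only what the sketch needs.
#[derive(Debug, PartialEq, Eq, Hash, Clone)]
enum Scalar {
    Int(i64),
    Text(String),
}

/// Distinct-value storage with a separate variant per case, so that
/// `COUNT(DISTINCT x)` stores bare values instead of one-element `Vec`s.
enum DistinctValues {
    Single(HashSet<Scalar>),
    Multi(HashSet<Vec<Scalar>>),
}

impl DistinctValues {
    /// Picks the variant from the number of expressions in the aggregate.
    fn new(num_exprs: usize) -> Self {
        if num_exprs == 1 {
            DistinctValues::Single(HashSet::new())
        } else {
            DistinctValues::Multi(HashSet::new())
        }
    }

    /// Records the values of one input row (NULL handling omitted).
    fn insert(&mut self, row: &[Scalar]) {
        match self {
            DistinctValues::Single(set) => {
                assert_eq!(row.len(), 1, "single-expression storage expects one value");
                // No per-row `Vec` is allocated on this path.
                set.insert(row[0].clone());
            }
            DistinctValues::Multi(set) => {
                set.insert(row.to_vec());
            }
        }
    }

    /// Number of distinct values (or value combinations) seen so far.
    fn count(&self) -> usize {
        match self {
            DistinctValues::Single(set) => set.len(),
            DistinctValues::Multi(set) => set.len(),
        }
    }
}
```

The duplication is limited to the two `match` arms per method; the cost is one branch per call, which a real implementation could hoist out of the per-row loop.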
@Dandandan @ozankabak, I ran the benchmark with the new implementation, and there is actually little performance degradation. I increased the test array size from 65536 to 134_217_728 to reduce environment noise, and the benchmark command is:
What is the status of this PR? @Dandandan are you satisfied with the results? How do we move the PR forward?
My two cents: if we apply the same performance sensitivity to this PR that we have recently been showing to aggregation-related changes, my impression is that we'd want a refactor that doesn't slow down the single-argument code path.
I agree -- if we want to proceed with this PR, I can run the bench.sh benchmark script to see what it says (I think it has pretty good coverage of aggregation).
Marking as draft as this PR seems to have stalled -- when it is next ready for review, please mark it as such. Thank you
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.
Which issue does this PR close?
Closes #5619.
Rationale for this change
Support count distinct aggregation over multiple expressions. For example, select count(distinct col1, col2) from table1; returns the number of distinct (col1, col2) combinations; rows in which any of the columns is null are not counted.
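The semantics described above can be illustrated with a small standalone sketch. It does not use DataFusion: columns are modelled as plain vectors of `Option<i64>` with `None` standing in for NULL, and the function name `count_distinct_rows` is made up for illustration.

```rust
use std::collections::HashSet;

/// Counts distinct value combinations across columns, skipping every row
/// in which at least one column is NULL (modelled here as `None`).
///
/// `columns` is column-major: `columns[c][r]` is column `c` at row `r`.
/// All columns are assumed to have the same length.
fn count_distinct_rows(columns: &[Vec<Option<i64>>]) -> usize {
    let num_rows = columns.first().map_or(0, |col| col.len());
    let mut seen: HashSet<Vec<i64>> = HashSet::new();

    for row in 0..num_rows {
        // Collecting into `Option<Vec<_>>` yields `None` as soon as any
        // column holds a NULL in this row, so such rows are never inserted.
        let key: Option<Vec<i64>> = columns.iter().map(|col| col[row]).collect();
        if let Some(key) = key {
            seen.insert(key);
        }
    }

    seen.len()
}
```

For instance, with col1 = [1, 1, 2, NULL, 2] and col2 = [10, 10, 20, 30, NULL], the rows (NULL, 30) and (2, NULL) are skipped and the remaining rows contain two distinct combinations, (1, 10) and (2, 20), so the function returns 2.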
What changes are included in this PR?
Support count distinct for multiple expressions. Most of the code comes from an earlier commit that was removed in PR #5408.
Are these changes tested?
Yes, new unit tests were added and the existing unit tests are reused.
Are there any user-facing changes?
No