[C++][Compute] Implement non-hash count_distinct aggregate kernel #29633

Closed
asfimport opened this issue Sep 18, 2021 · 8 comments

asfimport commented Sep 18, 2021

ARROW-12728 added a hash_count_distinct hash aggregate kernel, but there is no non-hash count_distinct aggregate kernel.

Reporter: Ian Cook / @ianmcook
Assignee: Percy Camilo Triveño Aucahuasi / @aucahuasi

Note: This issue was originally created as ARROW-14035. Please see the migration documentation for further details.

Percy Camilo Triveño Aucahuasi / @aucahuasi:
Can you please elaborate more about this requirement?

  1. Do we need to compute the same thing as hash_count_distinct but without using the hash table from the hash group?

  2. Are we going to offer a non-hash version for all hash_x functions too? (hash_distinct, hash_count, hash_sum)

cc @ianmcook @lidavidm

Ian Cook / @ianmcook:

  1. Do we need to compute the same thing as hash_count_distinct but without using the hash table from the hash group?
    Yes. @lidavidm if I am missing any nuance here, please let me know :)

  2. Are we going to offer a non-hash version for all hash_x functions too? (hash_distinct, hash_count, hash_sum)
    Yes, I think we should aim for that (or nearly that; there might be a few exceptions where it does not make sense). The lists of aggregation functions and hash (grouped) aggregation functions in compute.rst are already mostly the same, with just a few differences. I think this issue and ARROW-13309 are the two most important additions to bring these two lists closer to parity.
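
To make the grouped vs. non-grouped distinction concrete, here is a minimal sketch using only the C++ standard library. It is not Arrow code: `CountDistinctAll` and `CountDistinctByGroup` are hypothetical names chosen for illustration, and null handling is ignored. The first has the shape of the requested count_distinct (one result for the whole input); the second has the shape of hash_count_distinct (one result per group key).

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical helper (not an Arrow API): one distinct count for the whole
// input, i.e. the shape of a non-grouped (scalar) aggregate.
std::size_t CountDistinctAll(const std::vector<int>& values) {
  std::unordered_set<int> seen(values.begin(), values.end());
  return seen.size();
}

// Hypothetical helper (not an Arrow API): one distinct count per group key,
// i.e. the shape of a grouped (hash) aggregate.
std::unordered_map<std::string, std::size_t> CountDistinctByGroup(
    const std::vector<std::string>& keys, const std::vector<int>& values) {
  // Track the set of values seen for each key.
  std::unordered_map<std::string, std::unordered_set<int>> seen_per_key;
  for (std::size_t i = 0; i < keys.size() && i < values.size(); ++i) {
    seen_per_key[keys[i]].insert(values[i]);
  }
  // Reduce each per-key set to its size.
  std::unordered_map<std::string, std::size_t> counts;
  for (const auto& entry : seen_per_key) {
    counts[entry.first] = entry.second.size();
  }
  return counts;
}
```

Both versions can track seen values with a set internally; the difference is only whether the result is a single count or one count per key.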

Percy Camilo Triveño Aucahuasi / @aucahuasi:
Thanks @ianmcook, another question: What is the difference between value_counts and count_distinct?

https://github.com/apache/arrow/blob/master/docs/source/cpp/compute.rst#associative-transforms

David Li / @lidavidm:
value_counts gives you a histogram where the x-axis is the distinct values and the y-axis is the number of occurrences of each value. count_distinct is just COUNT(DISTINCT *).

Also, value_counts is a vector kernel whereas this should be a scalar aggregate kernel.
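
As a rough illustration of that difference, the sketch below uses only the C++ standard library and does not call Arrow. `ValueCountsSketch` and `CountDistinctSketch` are hypothetical names, and nulls are not modeled. The first returns one entry per distinct value (the histogram), while the second collapses the input to a single number.

```cpp
#include <cstddef>
#include <map>
#include <unordered_set>
#include <vector>

// Hypothetical helper (not an Arrow API): histogram of value -> occurrences,
// analogous in spirit to what value_counts reports.
std::map<int, std::size_t> ValueCountsSketch(const std::vector<int>& values) {
  std::map<int, std::size_t> histogram;
  for (int v : values) {
    ++histogram[v];  // operator[] value-initializes missing entries to 0
  }
  return histogram;
}

// Hypothetical helper (not an Arrow API): a single number, the count of
// distinct values, analogous in spirit to COUNT(DISTINCT ...).
std::size_t CountDistinctSketch(const std::vector<int>& values) {
  std::unordered_set<int> seen(values.begin(), values.end());
  return seen.size();
}
```

In this sketch the distinct count always equals the number of entries in the histogram, which is why the two are easy to confuse; the output shapes are what differ.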

Percy Camilo Triveño Aucahuasi / @aucahuasi:
Thanks David!

Neal Richardson / @nealrichardson:
Issue resolved by pull request #11257.
