Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Implement ScalarAggregateOptions for count_distinct (grouped) #29392

Closed
asfimport opened this issue Aug 26, 2021 · 8 comments
Closed

[C++] Implement ScalarAggregateOptions for count_distinct (grouped) #29392

asfimport opened this issue Aug 26, 2021 · 8 comments

Comments

@asfimport
Copy link

asfimport commented Aug 26, 2021

I'm writing the R bindings for the grouped count_distinct kernel, but the current implementation counts nulls as their own group. To match the R behaviour, I need to be able to specify whether or not to remove NA/NULL values.

Please could we have ScalarAggregateOptions implemented for count_distinct?

Reporter: Nicola Crane / @thisisnic
Assignee: David Li / @lidavidm

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-13764. Please see the migration documentation for further details.

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:
One way would probably to remove the "null" row manually from the result. I'm not sure how easy that is in R.

@asfimport
Copy link
Author

David Li / @lidavidm:
And to be clear, this is about the groups, not the values themselves? i.e.


keys   = [0,   0,    1,   1,   null]
values = ["a", null, "b", "b", "c"]

should give


counts = {0: 2, 1: 1} 

instead of


counts = {0: 2, 1: 1, null: 1} 

?

@asfimport
Copy link
Author

Neal Richardson / @nealrichardson:
Side note: should this be CountOptions and not ScalarAggregateOptions?

@asfimport
Copy link
Author

David Li / @lidavidm:
Well, it would be neither, since both of those are about the values, not the groups. If this is about the groups, maybe this should actually be an option for the aggregation itself instead. If it's for the values, then it would be CountOptions.

@asfimport
Copy link
Author

Neal Richardson / @nealrichardson:
@lidavidm Neither of those, if I understand it. Here's the expectation from dplyr:

> data.frame(keys = c(0, 0, 1, 1, NA), values = c("a", NA, "b", "c", "d")) %>% group_by(keys) %>% summarize(n_distinct(values))
# A tibble: 3 × 2
   keys `n_distinct(values)`
  <dbl>                <int>
1     0                    2
2     1                    2
3    NA                    1
> data.frame(keys = c(0, 0, 1, 1, NA), values = c("a", NA, "b", "c", "d")) %>% group_by(keys) %>% summarize(n_distinct(values, na.rm = TRUE))
# A tibble: 3 × 2
   keys `n_distinct(values, na.rm = TRUE)`
  <dbl>                              <int>
1     0                                  1
2     1                                  2
3    NA                                  1

@asfimport
Copy link
Author

David Li / @lidavidm:
Ok, so this is about the values - then we should support CountOptions which is quite doable (sorry for overlooking that initially).

@asfimport
Copy link
Author

Nicola Crane / @thisisnic:
Ah, my mistake, CountOptions would be fantastic, thanks! (and yep, what Neal said)

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:
Issue resolved by pull request 11011
#11011

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants