[C++] Implement ScalarAggregateOptions for count_distinct (grouped) #29392

asfimport · 2021-08-26T13:46:49Z

I'm writing the R bindings for the grouped count_distinct kernel, but the current implementation counts nulls as their own group. To match the R behaviour, I need to be able to specify whether or not to remove NA/NULL values.

Please could we have ScalarAggregateOptions implemented for count_distinct?

Reporter: Nicola Crane / @thisisnic
Assignee: David Li / @lidavidm

Related issues:

[R] Use Arrow engine for summarize() by default (blocks)
[R] Binding for n_distinct() (relates to)

PRs and other links:

GitHub Pull Request #11011

Note: This issue was originally created as ARROW-13764. Please see the migration documentation for further details.


asfimport · 2021-08-26T14:26:29Z

Antoine Pitrou / @pitrou:
One way would probably be to remove the "null" row manually from the result. I'm not sure how easy that is in R.

asfimport · 2021-08-26T14:39:24Z

David Li / @lidavidm:
And to be clear, this is about the groups, not the values themselves? i.e.


keys   = [0,   0,    1,   1,   null]
values = ["a", null, "b", "b", "c"]

should give


counts = {0: 2, 1: 1}

instead of


counts = {0: 2, 1: 1, null: 1}

?
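To make the two readings in this question concrete, here is a minimal sketch in plain Python (not the Arrow kernel). The function name `grouped_count_distinct` and the flag `drop_null_keys` are hypothetical and exist only for illustration, and `None` stands in for null. A null value is counted as a distinct value in both cases; only the handling of the null key changes.

```python
from collections import defaultdict


def grouped_count_distinct(keys, values, drop_null_keys=False):
    """Count distinct values per key; None stands in for null.

    Hypothetical helper for illustration only, not an Arrow API.
    """
    seen = defaultdict(set)
    for key, value in zip(keys, values):
        if drop_null_keys and key is None:
            continue  # discard every row that belongs to the null-key group
        seen[key].add(value)  # a null value still counts as one distinct value
    return {key: len(distinct) for key, distinct in seen.items()}


keys = [0, 0, 1, 1, None]
values = ["a", None, "b", "b", "c"]

# First reading: the null-key group is dropped from the result.
without_null_group = grouped_count_distinct(keys, values, drop_null_keys=True)

# Second reading: the null key forms its own group.
with_null_group = grouped_count_distinct(keys, values)

print(without_null_group)
print(with_null_group)
```

Under these assumptions the first call corresponds to the `{0: 2, 1: 1}` result and the second to the result that also carries a `null` entry.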

asfimport · 2021-08-26T15:17:04Z

Neal Richardson / @nealrichardson:
Side note: should this be CountOptions and not ScalarAggregateOptions?

asfimport · 2021-08-26T15:19:30Z

David Li / @lidavidm:
Well, it would be neither, since both of those are about the values, not the groups. If this is about the groups, maybe this should actually be an option for the aggregation itself instead. If it's for the values, then it would be CountOptions.

asfimport · 2021-08-26T15:20:33Z

Neal Richardson / @nealrichardson:
@lidavidm Neither of those, if I understand it. Here's the expectation from dplyr:

> data.frame(keys = c(0, 0, 1, 1, NA), values = c("a", NA, "b", "c", "d")) %>% group_by(keys) %>% summarize(n_distinct(values))
# A tibble: 3 × 2
   keys `n_distinct(values)`
  <dbl>                <int>
1     0                    2
2     1                    2
3    NA                    1
> data.frame(keys = c(0, 0, 1, 1, NA), values = c("a", NA, "b", "c", "d")) %>% group_by(keys) %>% summarize(n_distinct(values, na.rm = TRUE))
# A tibble: 3 × 2
   keys `n_distinct(values, na.rm = TRUE)`
  <dbl>                              <int>
1     0                                  1
2     1                                  2
3    NA                                  1

asfimport · 2021-08-26T15:22:50Z

David Li / @lidavidm:
Ok, so this is about the values. In that case we should support CountOptions, which is quite doable (sorry for overlooking that initially).
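As a rough illustration of null handling at the value level, the sketch below models the semantics in plain Python rather than calling Arrow. The function `count_distinct_by_group` is hypothetical, and the mode strings `"all"`, `"only_valid"` and `"only_null"` are an assumption modeled on the modes CountOptions offers; `None` stands in for NA/null. The data is the dplyr example from the previous comment.

```python
from collections import defaultdict

VALID_MODES = ("all", "only_valid", "only_null")


def count_distinct_by_group(keys, values, mode="all"):
    """Distinct-value count per group with value-level null handling.

    Hypothetical helper for illustration only; the mode names are assumed
    to mirror CountOptions and are not taken from this thread.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    seen = defaultdict(set)
    for key, value in zip(keys, values):
        distinct = seen[key]  # every group appears, even if all its values are skipped
        is_null = value is None
        if mode == "only_valid" and is_null:
            continue  # behaves like na.rm = TRUE
        if mode == "only_null" and not is_null:
            continue
        distinct.add(value)
    return {key: len(distinct) for key, distinct in seen.items()}


keys = [0, 0, 1, 1, None]
values = ["a", None, "b", "c", "d"]

counts_all = count_distinct_by_group(keys, values, mode="all")
counts_valid = count_distinct_by_group(keys, values, mode="only_valid")

print(counts_all)    # compare with n_distinct(values)
print(counts_valid)  # compare with n_distinct(values, na.rm = TRUE)
```

In this sketch the null key always keeps its own group, as in the dplyr output; only the null values inside each group are affected by the mode.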

asfimport · 2021-08-26T15:58:36Z

Nicola Crane / @thisisnic:
Ah, my mistake, CountOptions would be fantastic, thanks! (and yep, what Neal said)

asfimport · 2021-08-30T17:15:38Z

Antoine Pitrou / @pitrou:
Issue resolved by pull request 11011
#11011

asfimport closed this as completed Aug 30, 2021

asfimport assigned lidavidm Jan 10, 2023

asfimport added this to the 6.0.0 milestone Jan 11, 2023

This was referenced Jan 11, 2023

[R] Use Arrow engine for summarize() by default #29259

Closed

[R] Binding for n_distinct() #29261

Closed
