New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-10415: [R] Support for dplyr::distinct() #11143
Conversation
|
ed189d3
to
6e45378
Compare
r/R/dplyr-distinct.R
Outdated
|
||
if (.keep_all == TRUE) { | ||
# After ARROW-13767 is merged, we can implement this via e.g. | ||
# iris %>% group_by(Species) %>% slice(1) %>% ungroup() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is accurate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think ARROW-13767 will help us implement .keep_all = TRUE
. To implement that, I think we would need a hash aggregate function hash_first()
that returns the first value of a column within each group. However since row order within groups is potentially non-deterministic, this is perhaps better understood as a function that just grabs any one column value from within the group.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opened ARROW-13993 for a hash aggregate function that returns the first value (effectively any one arbitrary value) from within each group
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ianmcook !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@thisisnic could you please open a separate Jira to implement .keep_all = TRUE
for 7.0.0? Thanks
r/R/dplyr-distinct.R
Outdated
} | ||
|
||
# Call mutate in case any columns are expressions | ||
.data <- dplyr::mutate(.data, ...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need this: group_by will handle this for you, won't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point!
r/R/dplyr-distinct.R
Outdated
.data$group_by_vars <- gv | ||
} | ||
|
||
# Select the columns to return in the correct order |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are columns out of order? group_by should get the order right for you
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The order of columns returned by group_by()
and then summarise()
are the order in which they're supplied to group_by()
, whereas the order of columns returned from distinct()
are the order in which they occur in the data, even if there's a group_by()
call previously.
> tbl %>%
+ group_by(some_grouping, int) %>%
+ summarise() %>%
+ collect()
`summarise()` has grouped output by 'some_grouping'. You can override using the `.groups` argument.
# A tibble: 10 × 2
# Groups: some_grouping [2]
some_grouping int
<dbl> <int>
1 1 1
2 1 3
3 1 5
4 1 7
5 1 9
6 2 2
7 2 6
8 2 8
9 2 10
10 2 NA
> tbl %>%
+ group_by(some_grouping, int) %>%
+ distinct(lgl) %>%
+ collect()
# A tibble: 10 × 3
# Groups: some_grouping, int [10]
int lgl some_grouping
<int> <lgl> <dbl>
1 1 TRUE 1
2 2 NA 2
3 3 TRUE 1
4 NA FALSE 2
5 5 TRUE 1
6 6 NA 2
7 7 NA 1
8 8 FALSE 2
9 9 FALSE 1
10 10 NA 2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, that's worthy of a comment in the code :)
Maybe you can sort the group_by_vars to match their order in the data before calling summarize, then the result from summarize should match?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion, I'll do this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comment.
Am no longer using group_by_vars
after realising that group_by()
has a .add
argument, which seems cleaner to take advantage of instead of redoing all of the grouping myself. Therefore, have kept in the ordering code.
r/R/dplyr-distinct.R
Outdated
.data <- dplyr::group_by(.data, !!!syms(vars_to_group)) | ||
.data <- dplyr::summarize(.data) | ||
|
||
# Add back in any grouping which existed in the data previously |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect what you're actually hitting here is ARROW-13550, that summarize isn't restoring groups as expected yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point!
}) | ||
|
||
test_that("distinct() can retain groups", { | ||
skip("ARROW-13777 - internally uses mutate on grouped data; should work after this is merged") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can remove your mutate call though and these should pass if you let group_by do the work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will try that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also now have a PR open for ARROW-13777
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update: that PR has now merged, so you can rebase and unskip
b445d63
to
e9635a4
Compare
It it reasonable to chuck |
Yes |
r/R/dplyr-distinct.R
Outdated
arrow_not_supported("`distinct()` with `.keep_all = TRUE`") | ||
} | ||
|
||
distinct_groups <- ensure_named_exprs(quos(...)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd expect a function body something like this:
.data <- dplyr::group_by(.data, ..., .add = TRUE)
gv <- dplyr::group_vars(.data) # this is `names(.data$group_by_vars)`
if (length(gv)) {
# `data %>% group_by() %>% summarise()` returns cols in order supplied
# but distinct() returns cols in dataset order, so sort group vars
vars_in_data <- names(.data)[names(.data) %in% gv]
.data$group_by_vars <- c(
.data$group_by_vars[vars_in_data],
.data$group_by_vars[!(gv %in% vars_in_data)]
)
} else {
# distinct() with no vars specified means distinct across all cols
.data <- dplyr::group_by(.data, !!!syms(names(.data)))
}
# TODO: use .groups = "keep" (ARROW-13550)
dplyr::summarize(.data)
IIUC the only specialness about distinct() over group_by() %>% summarize() is the order of columns and the meaning of 0 groups. If that's true then we should delegate the rest of the work to group_by and summarize.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to get it more clear in my head, but I think this won't work in cases where an expression has been passed into distinct()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's more complicated than that, that's fine. But group_by handles expressions too, so I would expect that it would do the right thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the nuance is in the difference between calling group_by()
which handles expressions before adding things to .data$group_by_vars
versus directly adding things to .data$group_by_vars
, which only works (AFAIK) when it's just the names of the columns which are not expressions. I think I could do a better job of refactoring some of this to make it a bit tidier still though - will update shortly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My suggestion doesn't add things to group_by_vars directly, it only reorders them to make sure they follow dataset order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tbh I'm struggling to follow this a bit. I'm not sure what this bit is doing:
vars_in_data <- names(.data)[names(.data) %in% gv]
.data$group_by_vars <- c(
.data$group_by_vars[vars_in_data],
.data$group_by_vars[!(gv %in% vars_in_data)]
)
Specifically, vars_in_data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, never mind, I see now. You subset to get the same thing but reordered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't get it to work that way at all, sorry. I think this is ready for re-review now @nealrichardson as have made some other changes in light of new tests which I think mean that this way no longer makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried again, but I end up trying to do horrible things with lists of quosures to rearrange them and then ensure they are named, which I can't get working. Given it works the way is is written now, I'm giving up on this, unless anyone has any very specific code suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was annoying me, so I tried it again today. I got 90% of the way there, but I think ultimately, it requires us to effectively undo and redo any group_by()
which happens before the call to distinct()
I ended up with this monstrosity at the start of the code:
gv <- dplyr::group_vars(.data)
gv_as_quo <- quos(!!!map(.data$selected_columns[gv], ~sym(.x$ToString())))
This isn't correct though, as it's pulling out the Expression and not the R expression used to create it, which I guess we can't get back at this point?
There's also a global option you can turn on, if you prefer. See NEWS.md and test-dplyr-summarize.R for an example. |
02e7268
to
a6eb1d1
Compare
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Co-authored-by: Ian Cook <ianmcook@gmail.com>
289513b
to
f955e89
Compare
Closes apache#11143 from thisisnic/ARROW-10415_distinct Lead-authored-by: Nic Crane <thisisnic@gmail.com> Co-authored-by: Nic <thisisnic@gmail.com> Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com> Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
No description provided.