-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-33987: [R] Support new dplyr .by/by argument #35667
Conversation
|
@eitsupi - if you give this PR a rebase to fix the CI failures, I'll take a look at this |
Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>
Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>
Done. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making this PR; it's always good to keep up with the changes in the dplyr API, and this is really thorough.
Please could you add or update some of the tests to use more than 1 grouping variable here?
compare_dplyr_binding( | ||
.input %>% | ||
filter(int > 2, pnorm(dbl) > .99, .by = chr) %>% | ||
collect(), | ||
tbl, | ||
warning = "Expression pnorm\\(dbl\\) > 0.99 not supported in Arrow; pulling data into R" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is pulling the data into R, so we end up comparing dplyr with dplyr instead of Arrow. Is the purpose of this test to test with multiple filters? If so, how about swapping pnorm()
out for something else? Or is the intention here different?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an intentional test to ensure that no grouping occurs during conversion to data.frame.
Added a comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, thanks for clarifying that!
Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>
Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>
Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>
Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>
I think I did it, but these are pretty much the same test cases. I think it is better to parameterize these tests using |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy with this, thanks! Will leave it a little longer before merging in case any other R folk want to have a skim over.
Are there any plans for merging? (I am just worried this has been forgotten) |
Thanks for the reminder @eitsupi! |
Thanks for merging! |
Benchmark runs are scheduled for baseline = c62ce6b and contender = a0d28de. a0d28de is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Rationale for this change
Implement the
.by
argument formutate
,summarise
,filter
andslice_*
family.What changes are included in this PR?
The
.by
argument that matchesdplyr
has been added to some functions.Most of the internal functions, such as
compute_by
, are copied from the existingdplyr
backends,dbplyr
anddtplyr
.Are these changes tested?
Yes.
Are there any user-facing changes?
Yes.