-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-13766: [R] Add slice_*() methods #14361
Conversation
arrow_not_supported("Slicing Arrow data with groups") | ||
} | ||
if (with_ties) { | ||
arrow_not_supported("with_ties = TRUE") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As @paleolimbot noted on ARROW-13766, it is possible to do with_ties
but it requires doing the top-k calculation and then using that as a filtering join. I'm inclined either not to support it (as done here, before I saw that comment) or to flip the default value of with_ties
and perhaps raise a warning the first time the function is called to note the difference. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, will be excellent to have these functions in 10.0.0 :D
A few bits and pieces I noticed when reviewing are below.
} | ||
slice_head.Dataset <- slice_head.ArrowTabular <- slice_head.RecordBatchReader <- slice_head.arrow_dplyr_query | ||
|
||
slice_tail.arrow_dplyr_query <- function(.data, ..., n, prop) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's an error message for extraneous slice_*()
-related arguments which dplyr has but we don't have that here:
library(dplyr)
library(arrow)
mtcars %>%
slice_tail(n = 3, with_ties = FALSE) %>%
collect()
#> Error in `slice_tail()`:
#> ! `...` must be empty.
#> ✖ Problematic argument:
#> • with_ties = FALSE
mtcars %>%
arrow_table() %>%
slice_tail(n = 3, with_ties = FALSE) %>%
collect()
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 1 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
#> 2 15.0 8 301 335 3.54 3.57 14.6 0 1 5 8
#> 3 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, added the rlang::check_dots_empty call. I left it as rlang::
instead of adding it to the importFrom we use because I thought you had some PRs open that were touching that, can move there before merging. (Also need to re-run the docgen.R anyway.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a check, and none of my open PRs are touching that, so good to move whenever.
57a6760
to
f042e57
Compare
@thisisnic thanks for your excellent review. I've addressed your feedback and pushed a little more to unlock |
44d50d2
to
d20afba
Compare
Co-authored-by: Nic Crane <thisisnic@gmail.com>
…without projecting it first)
7e4857f
to
e62b0f7
Compare
@thisisnic want to give this one more look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Benchmark runs are scheduled for baseline = b5b41cc and contender = 80e3986. 80e3986 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
['Python', 'R'] benchmarks have high level of regressions. |
This PR implements
slice_head,()
slice_tail()
,slice_min()
,slice_max()
andslice_sample()
.slice_sample()
requires a clever hack using a UDF because therandom()
C++ function apparently does not work; see ARROW-17974.