ARROW-13766: [R] Add slice_*() methods #14361

Merged
merged 11 commits into from
Oct 13, 2022

Member

@nealrichardson nealrichardson commented Oct 10, 2022

This PR implements slice_head(), slice_tail(), slice_min(), slice_max() and slice_sample(). slice_sample() requires a clever hack using a UDF because the random() C++ function apparently does not work; see ARROW-17974.

arrow_not_supported("Slicing Arrow data with groups")
}
if (with_ties) {
arrow_not_supported("with_ties = TRUE")
Member Author

As @paleolimbot noted on ARROW-13766, it is possible to do with_ties but it requires doing the top-k calculation and then using that as a filtering join. I'm inclined either not to support it (as done here, before I saw that comment) or to flip the default value of with_ties and perhaps raise a warning the first time the function is called to note the difference. Thoughts?
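The "top-k, then filtering join" idea mentioned above can be sketched roughly as follows. This is only an illustration on a plain data frame with dplyr, not the code in this PR: the helper name slice_max_with_ties and the example data are hypothetical, and whether the same pipeline runs unchanged on an Arrow query is an assumption that is not checked here.

```r
library(dplyr)

# Hypothetical helper (not part of arrow or dplyr): emulate
# slice_max(with_ties = TRUE) with a top-k step plus a filtering join.
# `col` is the ordering column, given as a string.
slice_max_with_ties <- function(df, col, n) {
  # Step 1: top-k without ties, reduced to the distinct ordering values
  top_k <- df %>%
    arrange(desc(.data[[col]])) %>%
    head(n) %>%
    select(all_of(col)) %>%
    distinct()

  # Step 2: filtering (semi) join keeps every row whose value is in the
  # top-k set, so rows tied with the k-th value are retained
  semi_join(df, top_k, by = col)
}

# Made-up example data: two rows tie for the third-largest value of x
df <- data.frame(id = 1:5, x = c(3, 5, 5, 1, 3))
result <- slice_max_with_ties(df, "x", n = 3)
# keeps ids 1, 2, 3 and 5 (four rows, because of the tie at x == 3)
```

The cost of this approach is a second pass over the data for the join, which is presumably part of why not supporting with_ties = TRUE is on the table.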

Member

@thisisnic thisisnic left a comment

Thanks, will be excellent to have these functions in 10.0.0 :D

A few bits and pieces I noticed when reviewing are below.

}
slice_head.Dataset <- slice_head.ArrowTabular <- slice_head.RecordBatchReader <- slice_head.arrow_dplyr_query

slice_tail.arrow_dplyr_query <- function(.data, ..., n, prop) {
Member

dplyr raises an error when extraneous arguments are passed to the slice_*() functions, but we don't have that check here:

library(dplyr)
library(arrow)

mtcars %>%
  slice_tail(n = 3, with_ties = FALSE) %>%
  collect()
#> Error in `slice_tail()`:
#> ! `...` must be empty.
#> ✖ Problematic argument:
#> • with_ties = FALSE

mtcars %>%
  arrow_table() %>%
  slice_tail(n = 3, with_ties = FALSE) %>%
  collect()
#>    mpg cyl disp  hp drat   wt qsec vs am gear carb
#> 1 19.7   6  145 175 3.62 2.77 15.5  0  1    5    6
#> 2 15.0   8  301 335 3.54 3.57 14.6  0  1    5    8
#> 3 21.4   4  121 109 4.11 2.78 18.6  1  1    4    2

Member Author

Good catch, added the rlang::check_dots_empty call. I left it as rlang:: instead of adding it to the importFrom we use because I thought you had some PRs open that were touching that, can move there before merging. (Also need to re-run the docgen.R anyway.)
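For readers unfamiliar with the check being discussed, here is a minimal sketch of how rlang::check_dots_empty() rejects stray arguments. The function slice_tail_sketch is hypothetical and works on a plain data frame; it is not the method added in this PR.

```r
# Hypothetical stand-in for a slice_tail() method; only illustrates the
# dots check, not the Arrow implementation.
slice_tail_sketch <- function(.data, ..., n) {
  # Errors if anything was passed through `...`, e.g. with_ties = FALSE
  rlang::check_dots_empty()
  utils::tail(.data, n)
}

# Valid call: returns the last three rows
ok <- slice_tail_sketch(mtcars, n = 3)

# Extraneous argument: check_dots_empty() raises an error, caught here
bad <- tryCatch(
  slice_tail_sketch(mtcars, n = 3, with_ties = FALSE),
  error = function(e) "error"
)
```

Because n comes after `...` in the signature, it must be passed by name, which is what lets any unrecognised named argument fall into the dots and trigger the check.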

Member

I had a check, and none of my open PRs are touching that, so good to move whenever.

@nealrichardson
Member Author

@thisisnic thanks for your excellent review. I've addressed your feedback and pushed a little more to unlock slice_sample(), PTAL

@nealrichardson
Member Author

@thisisnic want to give this one more look?

Member

@thisisnic thisisnic left a comment

LGTM!

@nealrichardson nealrichardson merged commit 80e3986 into apache:master Oct 13, 2022
@nealrichardson nealrichardson deleted the dplyr-slice branch October 14, 2022 00:00
ursabot commented Oct 14, 2022

Benchmark runs are scheduled for baseline = b5b41cc and contender = 80e3986. 80e3986 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.0% ⬆️15.0%] test-mac-arm
[Failed ⬇️1.92% ⬆️5.75%] ursa-i9-9960x
[Finished ⬇️0.75% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 80e39862 ec2-t3-xlarge-us-east-2
[Failed] 80e39862 test-mac-arm
[Failed] 80e39862 ursa-i9-9960x
[Finished] 80e39862 ursa-thinkcentre-m75q
[Finished] b5b41ccf ec2-t3-xlarge-us-east-2
[Failed] b5b41ccf test-mac-arm
[Failed] b5b41ccf ursa-i9-9960x
[Finished] b5b41ccf ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot commented Oct 14, 2022

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x
