[R] Add Arrow methods slice_min(), slice_max() #29394

Closed
Tracked by #32656
asfimport opened this issue Aug 26, 2021 · 3 comments

@asfimport

asfimport commented Aug 26, 2021

Implement slice_min() and slice_max() methods for ArrowTabular, Dataset, and arrow_dplyr_query objects.

These dplyr functions supersede the older dplyr function top_n(), for which we should probably also consider implementing a method.

Reporter: Ian Cook / @ianmcook
Assignee: Neal Richardson / @nealrichardson

Note: This issue was originally created as ARROW-13766. Please see the migration documentation for further details.

@asfimport

Dewey Dunnington / @paleolimbot:
Some example usage that may be useful for a test:

library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

df <- tibble(a = rep(letters, 10), b = 1:260, c = 260:1)

df %>% slice_min(a, n = 5, with_ties = TRUE)
#> # A tibble: 10 × 3
#>    a         b     c
#>    <chr> <int> <int>
#>  1 a         1   260
#>  2 a        27   234
#>  3 a        53   208
#>  4 a        79   182
#>  5 a       105   156
#>  6 a       131   130
#>  7 a       157   104
#>  8 a       183    78
#>  9 a       209    52
#> 10 a       235    26
df %>% slice_min(a, n = 5, with_ties = FALSE)
#> # A tibble: 5 × 3
#>   a         b     c
#>   <chr> <int> <int>
#> 1 a         1   260
#> 2 a        27   234
#> 3 a        53   208
#> 4 a        79   182
#> 5 a       105   156

df %>% slice_min(c, n = 5)
#> # A tibble: 5 × 3
#>   a         b     c
#>   <chr> <int> <int>
#> 1 z       260     1
#> 2 y       259     2
#> 3 x       258     3
#> 4 w       257     4
#> 5 v       256     5
df %>% slice_min(c, prop = 5 / 260)
#> # A tibble: 5 × 3
#>   a         b     c
#>   <chr> <int> <int>
#> 1 z       260     1
#> 2 y       259     2
#> 3 x       258     3
#> 4 w       257     4
#> 5 v       256     5

@asfimport

Dewey Dunnington / @paleolimbot:
Without ties this isn't bad:

library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

df <- tibble(a = rep(letters, 10), b = 1:260, c = 260:1)

# slice_*() without ties is easier
record_batch(df) %>% 
  arrange(c) %>% head(5) %>%
  collect()
#> # A tibble: 5 × 3
#>   a         b     c
#>   <chr> <int> <int>
#> 1 z       260     1
#> 2 y       259     2
#> 3 x       258     3
#> 4 w       257     4
#> 5 v       256     5

record_batch(df) %>% 
  arrange(desc(c)) %>% head(5) %>%
  collect()
#> # A tibble: 5 × 3
#>   a         b     c
#>   <chr> <int> <int>
#> 1 a         1   260
#> 2 b         2   259
#> 3 c         3   258
#> 4 d         4   257
#> 5 e         5   256
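
The two pipelines above differ only in sort direction, so they could be folded into one small wrapper. This is a minimal sketch, not the arrow implementation: `slice_no_ties()` and its `descending` argument are hypothetical names introduced here for illustration. It is shown on a plain tibble so it runs with dplyr alone; the same verbs are what the examples above apply to `record_batch(df)` before `collect()`.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper (not part of arrow or dplyr): emulates
# slice_min()/slice_max() with with_ties = FALSE via arrange() + head().
slice_no_ties <- function(.data, order_by, n, descending = FALSE) {
  if (descending) {
    # slice_max(): largest values first
    sorted <- arrange(.data, desc({{ order_by }}))
  } else {
    # slice_min(): smallest values first
    sorted <- arrange(.data, {{ order_by }})
  }
  # Keep exactly n rows; ties at the cutoff are dropped arbitrarily
  head(sorted, n)
}

df <- tibble(a = rep(letters, 10), b = 1:260, c = 260:1)

res_min <- slice_no_ties(df, c, n = 5)
res_max <- slice_no_ties(df, c, n = 5, descending = TRUE)
```

Because `head()` cuts at exactly `n` rows, this only matches `slice_min()`/`slice_max()` when `with_ties = FALSE`.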

With ties it isn't too bad either (it just needs a join):

library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

df <- tibble(a = rep(letters, 10), b = 1:260, c = 260:1)

# slice_*() with ties needs a join
rb <- record_batch(df)
rb %>%
  arrange(a) %>%
  select(a) %>%
  head(5) %>%
  distinct() %>%
  left_join(rb, by = "a") %>%
  collect()
#> # A tibble: 10 × 3
#>    a         b     c
#>    <chr> <int> <int>
#>  1 a         1   260
#>  2 a        27   234
#>  3 a        53   208
#>  4 a        79   182
#>  5 a       105   156
#>  6 a       131   130
#>  7 a       157   104
#>  8 a       183    78
#>  9 a       209    52
#> 10 a       235    26
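
The join approach generalizes because the distinct values found in the first `n` sorted rows are exactly the values at or below the cutoff, so joining them back recovers every tied row. Below is a rough sketch of that idea as a reusable function. `slice_min_ties()` is a hypothetical name, not an arrow or dplyr function, and the example runs on a plain tibble so it needs only dplyr (plus rlang, which dplyr depends on).

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper (not part of arrow or dplyr): emulates
# slice_min(..., with_ties = TRUE) using arrange/head/distinct plus a join.
slice_min_ties <- function(.data, order_by, n) {
  # Column name as a string, needed for the join key
  key <- rlang::as_name(rlang::enquo(order_by))

  # Distinct values that appear among the n smallest rows
  cutoff_values <- .data %>%
    arrange({{ order_by }}) %>%
    select({{ order_by }}) %>%
    head(n) %>%
    distinct()

  # Join back to recover every row sharing one of those values
  left_join(cutoff_values, .data, by = key)
}

df <- tibble(a = rep(letters, 10), b = 1:260, c = 260:1)

res <- slice_min_ties(df, a, n = 5)
```

Row order within the result follows the join rather than dplyr's `slice_min()`, so a test comparing the two would probably need to sort both sides first.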

@asfimport

Neal Richardson / @nealrichardson:
Issue resolved by pull request 14361
#14361
