[R] slice_sample returns 0 rows #38638

thisisnic opened this issue Nov 8, 2023 · 4 comments
Labels
Component: R Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. Priority: Critical Type: bug

Comments

@thisisnic
Member

Describe the bug, including details regarding any error messages, version, and platform.

library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
tf <- tempfile()
dir.create(tf)
write_dataset(group_by(mtcars, cyl), tf)
open_dataset(tf) %>%
  slice_sample(n = 3) %>%
  collect()
#> # A tibble: 0 × 11
#> # ℹ 11 variables: mpg <dbl>, disp <dbl>, hp <dbl>, drat <dbl>, wt <dbl>,
#> #   qsec <dbl>, vs <dbl>, am <dbl>, gear <dbl>, carb <dbl>, cyl <int>

Created on 2023-11-08 with reprex v2.0.2

Component(s)

R

@thisisnic
Member Author

I think this is an implementation issue and we need to re-implement it differently; if I run this code repeatedly, I sometimes do get rows back, but the number is equal to or fewer than n.

@amoeba amoeba added the Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. label Nov 22, 2023
@lgaborini

I have a probably related issue where slice_sample(n = 100) tends to sample the same rows (out of a Table with 2922121 rows), and from the beginning of the Table.
The row count always respects n.

If I specify the expected row count with a proportion:

nr <- nrow(tbl_df)
slice_sample(tbl_df, prop = 100/nr)

I encounter the above issue (not exactly 100 rows but sometimes fewer or more), but the rows are truly randomized.

@thisisnic
Member Author

Thanks for the extra information there @lgaborini!

I've looked at this again, and I think it's an unfortunate quirk of the original implementation (i.e. a known issue): we had to implement it a little differently because the C++ random function doesn't work, e.g. #14361 (comment).

I've tried setting the min parameter in the internal UDF higher than the default (fewer rows are selected) and lower than the default (the right number of rows is selected, but with a lot of repetition).

There's this line that just takes the first n rows of data, which is probably the source of the lack of randomness. I was wondering if we can call arrange() to order by the random number and then take the top n rows, though I'm not sure if that will actually work or not.
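As an illustration only, here is a minimal sketch of the "order by a random number, then take the top n" idea, run on an in-memory data frame with plain dplyr. It does not show that the same pipeline works on an Arrow Dataset; that remains unverified. The column name `.rand` and the variable `n_rows` are hypothetical names chosen for this example.

```r
library(dplyr)

n_rows <- 3  # number of rows to sample (hypothetical name)

sampled <- mtcars %>%
  # attach one uniform random number per row
  mutate(.rand = runif(n())) %>%
  # sorting by the random key shuffles the rows
  arrange(.rand) %>%
  # the first n_rows of the shuffled data form the sample
  head(n_rows) %>%
  # drop the helper column again
  select(-.rand)

nrow(sampled)
```

Because every row receives its own random key, the sample is not biased toward the start of the data, and the row count is exactly `n_rows` whenever the input has at least that many rows.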

@amoeba amoeba added this to the 15.0.0 milestone Dec 21, 2023
@paleolimbot
Member

> I was wondering if we can call arrange() to order by the random number and then take the top n rows

I think that will work, although I don't know whether it will be slower or faster than calling compute() (i.e., get me a Table) and subsetting with integers obtained from sample(seq_len(x$num_rows)). It is essentially the same thing: to draw an accurate sample, the final number of rows is needed.

One can do a streaming (but approximate) sampling, too, which might be useful for non-statistical purposes (e.g., testing on something more realistic than the first n rows of data).
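A rough sketch of the compute()-then-subset approach described above, reusing the dataset from the original reprex. It assumes the materialized Table fits in memory and that integer row subsetting with `[` on an Arrow Table behaves like it does on a data frame; the names `tbl`, `idx`, and `sampled` are made up for this example.

```r
library(arrow)
library(dplyr)

# Recreate the partitioned dataset from the reprex
tf <- tempfile()
dir.create(tf)
write_dataset(group_by(mtcars, cyl), tf)

# Materialize the Dataset as a Table so the total row count is known
tbl <- open_dataset(tf) %>% compute()

# Draw 3 distinct 1-based row positions, then subset the Table with them
idx <- sample(seq_len(tbl$num_rows), 3)
sampled <- as.data.frame(tbl[idx, ])

nrow(sampled)
```

The trade-off is that the whole dataset is read before sampling, which is what makes the exact row count possible.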

@raulcd raulcd modified the milestones: 15.0.0, 16.0.0 Jan 8, 2024
@raulcd raulcd modified the milestones: 16.0.0, 17.0.0 Apr 8, 2024
@raulcd raulcd removed this from the 17.0.0 milestone Jun 28, 2024