Commit
ARROW-16578: [R] unique() and is.na() on a column of a tibble is much slower after writing to and reading from a parquet file (#13415)

Fixes ARROW-16578 "[R] unique() and is.na() on a column of a tibble is much slower after writing to and reading from a parquet file".

Here I'm materializing the AltrepVectorString at the first call to Elt. My thinking is that this makes sense because one call from R (e.g. from unique()) is likely to be followed by more, and because getting a string from the Array seems to be much more costly than getting it from data2. Something like a 3-strike rule may make sense too, but in this PR I'm taking the simple approach.

ARROW-16578 reprex with the fix:
```
> df1 <- tibble::tibble(x=as.character(floor(runif(1000000) * 20)))
> write_parquet(df1,"/tmp/test.parquet")
> df2 <- read_parquet("/tmp/test.parquet")
> system.time(unique(df2$x))
   user  system elapsed
  0.074   0.002   0.082
> system.time(unique(df1$x))
   user  system elapsed
  0.022   0.001   0.025
> system.time(is.na(df2$x))
   user  system elapsed
  0.006   0.001   0.006
> system.time(is.na(df1$x))
   user  system elapsed
  0.003   0.000   0.004
```

devtools::test() result:
```
[ FAIL 0 | WARN 0 | SKIP 30 | PASS 7271 ]
```

Authored-by: Hideaki Hayashi <hihayash@gmail.com>
Signed-off-by: Dewey Dunnington <dewey@fishandwhistle.net>
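
For readers unfamiliar with the "materialize on first Elt" idea, here is a minimal standalone sketch of the pattern. It is not the code from this PR: `SlowStringSource` and `LazyStringVector` are hypothetical stand-ins (for the Arrow Array and the ALTREP string vector respectively), and the cache only plays the role that data2 plays in the description above.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the backing array. Per-element access is assumed
// to be comparatively expensive, so reads are counted.
class SlowStringSource {
 public:
  explicit SlowStringSource(std::vector<std::string> values)
      : values_(std::move(values)) {}

  std::size_t size() const { return values_.size(); }

  std::string Get(std::size_t i) const {
    ++reads_;
    return values_[i];
  }

  std::size_t reads() const { return reads_; }

 private:
  std::vector<std::string> values_;
  mutable std::size_t reads_ = 0;
};

// Hypothetical lazy vector: the first Elt() call copies every element into a
// plain cache; all later calls are served from the cache only.
class LazyStringVector {
 public:
  explicit LazyStringVector(const SlowStringSource& source) : source_(source) {}

  bool is_materialized() const { return materialized_; }

  const std::string& Elt(std::size_t i) {
    if (!materialized_) Materialize();
    return cache_[i];
  }

 private:
  void Materialize() {
    cache_.reserve(source_.size());
    for (std::size_t i = 0; i < source_.size(); ++i) {
      cache_.push_back(source_.Get(i));
    }
    materialized_ = true;
  }

  const SlowStringSource& source_;
  std::vector<std::string> cache_;
  bool materialized_ = false;
};
```

In this sketch the source is read exactly once per element no matter how many times Elt() is called afterwards, which is the trade-off described above: a one-time copy in exchange for cheap repeated access.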