Skip to content

Commit

Permalink
ARROW-16578: [R] unique() and is.na() on a column of a tibble is much…
Browse files Browse the repository at this point in the history
… slower after writing to and reading from a parquet file (#13415)

Fixes ARROW-16578 "[R] unique() and is.na() on a column of a tibble is much slower after writing to and reading from a parquet file".

Here I'm materializing the AltrepVectorString at the first call to Elt.
My thought is that it would make sense since it is likely that there will be another call from R if there is one call (e.g. unique()), and also because getting a string from Array seems to be much more costly than from data2.
Something like 3-strike rule may make sense too, but here in this PR, I'm taking this simple approach.

ARROW-16578 reprex with the fix:
```
> df1 <- tibble::tibble(x=as.character(floor(runif(1000000) * 20)))
> write_parquet(df1,"/tmp/test.parquet")
> df2 <- read_parquet("/tmp/test.parquet")
> system.time(unique(df2$x))
   user  system elapsed 
  0.074   0.002   0.082 
> system.time(unique(df1$x))
   user  system elapsed 
  0.022   0.001   0.025 
> system.time(is.na(df2$x))
   user  system elapsed 
  0.006   0.001   0.006 
> system.time(is.na(df1$x))
   user  system elapsed 
  0.003   0.000   0.004 
```

devtools::test() result:
```
[ FAIL 0 | WARN 0 | SKIP 30 | PASS 7271 ]
```


Authored-by: Hideaki Hayashi <hihayash@gmail.com>
Signed-off-by: Dewey Dunnington <dewey@fishandwhistle.net>
  • Loading branch information
hideaki committed Jul 22, 2022
1 parent f9f0e65 commit 9ad2255
Showing 1 changed file with 7 additions and 26 deletions.
33 changes: 7 additions & 26 deletions r/src/altrep.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -782,34 +782,15 @@ struct AltrepVectorString : public AltrepVectorBase<AltrepVectorString<Type>> {
util::string_view view_;
};

// Get a single string, as a CHARSXP SEXP,
// either from data2 or directly from the Array
// Get a single string, as a CHARSXP SEXP from data2.
// Materialize if not done so yet, given that it is
// likely that there will be another call from R if there is a call (e.g. unique()),
// and getting a string from Array is much more costly than from data2.
static SEXP Elt(SEXP alt, R_xlen_t i) {
if (Base::IsMaterialized(alt)) {
return STRING_ELT(Representation(alt), i);
if (!Base::IsMaterialized(alt)) {
Materialize(alt);
}
BEGIN_CPP11

ArrayResolve resolve(GetChunkedArray(alt), i);
auto array = resolve.array_;
auto j = resolve.index_;

RStringViewer r_string_viewer;
r_string_viewer.SetArray(array);

// r_string_viewer.Convert() might jump so it's wrapped
// in cpp11::unwind_protect() so that string_viewer
// can be properly destructed before the unwinding continues
SEXP s = NA_STRING;
cpp11::unwind_protect([&]() {
s = r_string_viewer.Convert(j);
if (r_string_viewer.nul_was_stripped()) {
cpp11::warning("Stripping '\\0' (nul) from character vector");
}
});
return s;

END_CPP11
return STRING_ELT(Representation(alt), i);
}

static void* Dataptr(SEXP alt, Rboolean writeable) { return DATAPTR(Materialize(alt)); }
Expand Down

0 comments on commit 9ad2255

Please sign in to comment.