Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-5488: [R] Workaround when C++ lib not available #4471

Closed

Conversation

romainfrancois
Copy link
Contributor

This is very much a work in progress. The idea is to replace the code generation that is usually done by Rcpp::compileAttributes() with something custom. This is driven by the data-raw/codegen.R file, which I'll polish. I'm opening this right now for feedback opportunities.

All of the .cpp files are guarded by :

#if defined(ARROW_R_WITH_ARROW)
... 
#endif

And ARROW_R_WITH_ARROW is defined via the configure file, if the library was indeed used.

For example, this function:

// [[arrow::export]]
std::shared_ptr<arrow::Column> ipc___feather___TableReader__GetColumn(
    const std::unique_ptr<arrow::ipc::feather::TableReader>& reader, int i) {
  std::shared_ptr<arrow::Column> column;
  STOP_IF_NOT_OK(reader->GetColumn(i, &column));
  return column;
}

triggers generation of this in the generated.cpp file:

#if defined(ARROW_R_WITH_ARROW)
std::string ipc___feather___TableReader__GetColumnName(const std::unique_ptr<arrow::ipc::feather::TableReader>& reader, int i);
SEXP _arrow_ipc___feather___TableReader__GetColumnName(SEXP reader_sexp, SEXP i_sexp){
  BEGIN_RCPP
  Rcpp::traits::input_parameter<const std::unique_ptr<arrow::ipc::feather::TableReader>&>::type reader(reader_sexp);
Rcpp::traits::input_parameter<int>::type i(i_sexp);Rcpp::Shield<SEXP> rcpp_result_gen(Rcpp::wrap(ipc___feather___TableReader__GetColumnName( reader, i)));
return rcpp_result_gen;
  END_RCPP
}
#else
SEXP _arrow_ipc___feather___TableReader__GetColumnName(SEXP reader_sexp, SEXP i_sexp){
  BEGIN_RCPP
  Rcpp::stop("arrow C++ library not available");
  END_RCPP
}
#endif

So the generated R api SEXPy functions only call the real thing when the c++ library is available, otherwise they just throw an error.

and this in the generated.R file:

ipc___feather___TableReader__GetColumnName <- function(reader, i) {
    .Call(`_arrow_ipc___feather___TableReader__GetColumnName` , reader, i )
}

This also needed some extra care in test functions so that the tests only run if Arrow is available.

@wesm comment from https://issues.apache.org/jira/browse/ARROW-5488 might be more practical and closer to what @jjallaire mentioned about how the RcppParallel package does it with Intel tbb:
https://github.com/RcppCore/RcppParallel/blob/master/R/hooks.R

One possibility is to bundle the Arrow header files with the CRAN package and build against them, but do not include libarrow and libparquet when linking. When the library is loaded, the libraries must be loaded in-process via dlopen before loading the Rcpp extensions. The C++ libraries can be installed then after the fact

@romainfrancois romainfrancois added WIP PR is work in progress Component: R labels Jun 4, 2019
@wesm
Copy link
Member

wesm commented Jun 4, 2019

So what would occur if:

  • User installs the package without the C++ library available
  • User then installs the Arrow C++ libraries

Is the installed R package broken until it is reinstalled? If the Arrow C++ headers are bundled privately with the R package then it should be able to build against them, and you can omit -larrow -lparquet at link time. The question is whether you can get library(arrow) to do a dynload of libarrow.so prior to attempting to load the Rcpp extensions

@nealrichardson
Copy link
Member

Correct, under this approach, the R package would not find the C++ library if it were installed after the R package were installed. That said, when we discussed this approach before, we proposed adding an arrow::install_arrow() function that the error message would suggest that you run, and this function could, after installing the C++ library, reinstall the R package itself to fix it. So the user experience either with this approach or with the bundling of headers like you suggest would be the same: if you install the R package without Arrow C++ present, when you try to use the package, you'll be told to call arrow::install_arrow(), the result of which will be a functioning C++/R arrow environment.

@romainfrancois
Copy link
Contributor Author

@wesm I'm going to try the approach of bundling headers in inst/include perhaps only when they're not found otherwise.

I'm just a little worried about code duplication and bloat for the R package, but this is less problematic as if we were bundling all of arrow (*.h and *.cpp).

If we are successful here, then I guess we can do something similar to what @jjallaire did with RcppParallel, i.e. not using @useDynLib as per usual but rather first dynamically load the arrow lib and then load the library of the package, i.e.: https://github.com/RcppCore/RcppParallel/blob/master/R/hooks.R#L5

.onLoad <- function(libname, pkgname) {
   
   # load tbb and tbbmalloc on supported platforms   
   tbb <- tbbLibPath()
   if (!is.null(tbb)) {
      if (!file.exists(tbb)) {
         warning(paste("TBB library", tbb, "not found."))
      } else {
         dllInfo <<- dyn.load(tbb, local = FALSE, now = TRUE)
      }
   }
   tbbMalloc <- tbbLibPath("malloc")
   if (!is.null(tbbMalloc)) {
      if (!file.exists(tbbMalloc)) {
         warning(paste("TBB malloc library", tbbMalloc, "not found."))
      } else {
         mallocDllInfo <<- dyn.load(tbbMalloc, local = FALSE, now = TRUE)
      }
   }
   
   # load the package library
   library.dynam("RcppParallel", pkgname, libname)
   
   # set default thread options
   numThreads <- "auto"
   numThreadsEnv <- Sys.getenv("RCPP_PARALLEL_NUM_THREADS", unset = NA)
   if (!is.na(numThreadsEnv))
      setThreadOptions(numThreads = as.integer(numThreadsEnv))
   else
      setThreadOptions(numThreads = "auto")
}

@romainfrancois

This comment has been minimized.

@jjallaire
Copy link

I easily might be missing something here, so take my comments w/ a grain of salt if my mental picture is incomplete.

I'm wondering a couple of things about including the headers:

  1. I realize that we can optionally omit omit -larrow -lparquet when we know Arrow isn't installed, but wouldn't our C++ library still have unresolved symbols and as a result yield build errors? (or perhaps there is a way to suppress those)

  2. I would guess that we strongly prefer to compile against the headers from the same version of Arrow that we are linking against (otherwise there can be skew/incompatibility). I guess we could only build against those headers in the "arrow not installed" scenario, but in that case I'm not sure what it's buying us.

@wesm
Copy link
Member

wesm commented Jun 5, 2019

@jjallaire

On #1, on Linux I believe if the symbols that the Rcpp library are available in process (e.g. having been loaded with dlopen earlier) then they'll be resolved correctly (this will only work on Linux, though)? I might be wrong about this.

On #2, yes if the user has some other version of the shared libraries installed then ABI mismatch will most likely cause symbol resolution failure more often than a flat out crash. The shared libraries have ABI tags (e.g. libarrow.so.13) so we can check for this if it becomes a problem

@jjallaire
Copy link

@wesm My worry though is that when we are building/testing on CRAN, we won't have loaded the symbols w/ dlopen (since Arrow won't be on the build servers), so there will be a link failure. My hope is that when on systems w/o Arrow (inevitable on CRAN) we still build, link, and run tests "successfully". In this mode the package is effectively a shim that tells you to call install_arrow() after which point you'll be good to go (which is basically the experience we have now for Spark and TF).

w/r/t ABI issues, I'm wondering if we are always better off building against the headers that match the library (i.e. I'm not sure what including the headers in the package buys us). Forgive me for being dense if it's implicit in some previous discussion!

Copy link
Member

@nealrichardson nealrichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few notes. It might also be good to add a travis job to check the package without the arrow C++ present.

r/tests/testthat/test-field.R Outdated Show resolved Hide resolved
r/src/table.cpp Outdated Show resolved Hide resolved
r/src/recordbatch.cpp Outdated Show resolved Hide resolved
@@ -0,0 +1,160 @@
# pak::pkg_install("romainfrancois/decor")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When does data-raw/codegen.R get called? Can you please update the README with developer instructions (and if necessary/appropriate, add Suggests dependencies to DESCRIPTION)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Will do that when I'm sure this is solid.

@romainfrancois
Copy link
Contributor Author

I guess I need to edit the ./configure error here:

# Customize the error
if [ $? -ne 0 ]; then
  echo "------------------------- ANTICONF ERROR ---------------------------"
  echo "Configuration failed because $PKG_CONFIG_NAME was not found. Try installing:"
  echo " * brew: $PKG_BREW_NAME (Mac OSX)"
  echo "If $PKG_CONFIG_NAME is already installed, check that 'pkg-config' is in your"
  echo "PATH and PKG_CONFIG_PATH contains a $PKG_CONFIG_NAME.pc file. If pkg-config"
  echo "is unavailable you can set INCLUDE_DIR and LIB_DIR manually via:"
  echo "R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'"
  echo "--------------------------------------------------------------------"

  PKG_LIBS=""
  PKG_CFLAGS=""
else
  PKG_CFLAGS="$PKG_CFLAGS -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET"
fi

so that it says a shim package has been compiled and the user should use arrow::install_arrow() to install lib arrow and reinstall the package.

I'll probably do install_arrow() in another PR though.

The files R/generated.R and src/generated.cpp are generated by the data-raw/codegen.R file, so now it is a manual thing to source this file every time something changes in a .cpp file. Is there a hook in pkgbuild to trigger code generation with custom code instead of Rcpp::compileAttributes() ? @jimhester @jjallaire (I'm pretty sure there isn't, but could there be)

@jjallaire
Copy link

There isn't a hook like this that I'm aware of (I guess configure could do it if could successfully detect changes). There is this function from the utils package that lets you easily do before/after snapshots of a directory to detect changes: https://stat.ethz.ch/R-manual/R-devel/library/utils/html/changedFiles.html

@jimhester
Copy link

You could regenerate it unconditionally in configure, it should be fast, so there is probably no harm in doing it every time configure is called, assuming the output is deterministic across platforms for the same input file.

@jjallaire
Copy link

jjallaire commented Jun 6, 2019 via email

@romainfrancois
Copy link
Contributor Author

❯ system.time(source('~/git/apache/arrow/r/data-raw/codegen.R'))
   user  system elapsed 
  0.656   0.036   0.694 

The only thing is that it uses a bunch of packages

# pak::pkg_install("romainfrancois/decor")
library(decor)

library(dplyr)
library(stringr)
library(purrr)
library(glue)

@jjallaire
Copy link

jjallaire commented Jun 6, 2019 via email

@nealrichardson
Copy link
Member

Also, data_raw/ is in .Rbuildignore so IIUC we couldn't call data_raw/codegen.R in configure because the file won't be present in the built package (unless we were to move it elsewhere).

@romainfrancois
Copy link
Contributor Author

@nealrichardson I believe this is ready for another review now. I will open another JIRA/PR for the install_arrow() function.

Code generation is triggered on each ./configure if the ARROW_R_DEV environment variable is set to TRUE. Only on systems that use configure (i.e. not windows). I think that's fine because it's a development time thing, i.e. the generated files are pushed.

The code generation is somewhat craft right now, I'm hoping to make progress with my hobby decor package later.

@romainfrancois romainfrancois added ready-for-review and removed WIP PR is work in progress labels Jun 10, 2019
@wesm
Copy link
Member

wesm commented Jun 10, 2019

@romainfrancois since we're talking about interactions with GPL2 build tools can you confirm that this patch includes no reused code from RcppCore/Rcpp?

@romainfrancois
Copy link
Contributor Author

romainfrancois commented Jun 10, 2019

Yes. It generates similar code (and Rcpp is still being used in the package), but this patch does not borrow code from Rcpp.

@wesm
Copy link
Member

wesm commented Jun 10, 2019

OK, thanks. Note that it's fine to utilize GPL tools as part of a build system (for example: GNU autotools), but it's the code reuse / copying where things become problematic

Copy link
Member

@nealrichardson nealrichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few 💅 ✍️ comments, but assuming CHECK is passing everywhere, looks like we're really close here.

To @jjallaire 's point about the examples, that's interesting, I hadn't thought about the prospect that readers of the help pages might see the examples and then think they need to wrap everything in if(arrow_available()) {...}, so it would be better to make them all \dontrun. Have we seen that happen for other packages?

A counterargument would be that if we did keep the arrow_available() check in the examples and let them run, not only would we avoid any potential CRAN litigation about \dontrun, we would also get examples with rendered output in the pkgdown site, which would be nice.

r/configure Outdated Show resolved Hide resolved
r/configure Outdated Show resolved Hide resolved
r/data-raw/codegen.R Outdated Show resolved Hide resolved
r/data-raw/codegen.R Show resolved Hide resolved
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This script and the ARROW_R_DEV env var should be documented at least minimally for developers. Somewhere around here in the readme it should say that anyone working on the C++ code in the R package should export ARROW_R_DEV=TRUE and will also need to remotes::install_github("romainfrancois/decor").

It might then be sufficient to have a brief comment here at the top of this file (below the 🐀 license) just explaining that this is a supplement to the usual Rcpp magic and wraps all functions in a check for the existence of the Arrow library, or something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some words

r/data-raw/codegen.R Outdated Show resolved Hide resolved
glue_collapse(sep = "\n")

writeLines(con = "R/arrowExports.R", glue::glue('
# Generated by using data-raw/codegen.R -> do not edit by hand
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor point but is "codegen" the best name for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤷‍♂

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess Rcpp::compileAttributes() isn't all that vivid either ;)


#' Is the C++ Arrow library available
#'
#' @export
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function could use some more substantive docs, like to say that package users generally wouldn't need to call this, to address the concern that JJ raised about the examples.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we use \dontrun extensively for TensorFlow, Spark, and reticulate. We do however have example functions for any code that doesn't have dependencies (in this case e.g. arrow_available()) so that we don't trip the "no running examples" flag.

pkgdown is actually setup to deal with this pretty gracefully (example code is enclosed in a subtle not run treatment, e.g. see https://keras.rstudio.com/reference/keras_model.html)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, pkgdown handles dontrun gracefully, but the downside is that you don't get to see the example results, like with e.g. https://dplyr.tidyverse.org/reference/select.html#examples. Provided that we build the pkgdown static site in an environment where libarrow is installed (which we will), if we used the arrow_available() check in the examples, we'd get the example output showing. For pedagogical purposes, I'm not sure which is the better tradeoff (i.e. see real output but possibly be mislead that you always have to check if arrow is available, vs. no output but also no extraneous code).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like pkgdown has a run_dont_run argument (https://github.com/r-lib/pkgdown/blob/89af5341ef2a0eedf07770d3e98f5094dacb9029/R/build-reference.R#L86), so we can have the cleaner \dontrun and still get example output on the pkgdown site.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I'd like to document this function in a subsequent pull request, maybe together with install_arrow().

@jjallaire
Copy link

jjallaire commented Jun 10, 2019 via email

@nealrichardson
Copy link
Member

@romainfrancois just came across this script that probably needs updating with your change: https://github.com/apache/arrow/blob/master/ci/docker_build_r.sh#L26-L27

@romainfrancois
Copy link
Contributor Author

@jjallaire thanks, I did remove them based on your previous comment.

@romainfrancois
Copy link
Contributor Author

Thanks @nealrichardson I did not know about these docker files, I don't really speak docker, I guess I should. @jameslamb @xhochy I updated them to the best of my abilities given other changes in this pull request, it should be fine, but you might want to check.

@romainfrancois
Copy link
Contributor Author

As long as redoing code generation, perhaps it would be useful to spit out several files in a directory instead of the giant arrowExports.cpp to limit merge conflicts ...

Copy link
Member

@nealrichardson nealrichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM; I say merge this and we can follow with any cleanups after.

glue_collapse(sep = "\n")

writeLines(con = "R/arrowExports.R", glue::glue('
# Generated by using data-raw/codegen.R -> do not edit by hand
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess Rcpp::compileAttributes() isn't all that vivid either ;)

r/data-raw/codegen.R Show resolved Hide resolved
@romainfrancois romainfrancois deleted the ARROW-5488/workaround branch June 12, 2019 07:13
fsaintjacques pushed a commit that referenced this pull request Jun 18, 2019
This is very much a work-in-progress for running the R package tests on Appveyor. It's an attempt to bring in the "standard" R Appveyor build tooling (https://github.com/krlmlr/r-appveyor, Apache 2.0 licensed) into Arrow's configuration.

I currently have the C++ library building and installing, and R and all R package dependencies installing, and R checks beginning, but they fail. This is because https://github.com/apache/arrow/blob/master/r/tools/winlibs.R is configured only to build with a released [sic] version of the arrow C++ binaries pre-built and hosted on rwinlib. Two immediate to-dos are:

1. get the package to install and check without libarrow at all (as #4471 now allows)
2. get the package to find the libarrow we built locally from source

After that, I can reverse-engineer what appveyor does and document it for any Windows developers (https://issues.apache.org/jira/browse/ARROW-3758).

There's obviously some mess I've made of appveyor.yml that I'll clean up before I'm done, and I'll do my best to document and refactor some of the non-obvious features of the setup that I've discovered (e.g. that `ci/appveyor-cpp-build-mingw.bat` also installs GLib and Ruby and runs their test suites too) while I'm at it so that the next person might be able to pick things up faster.

cc @romainfrancois @javierluraschi

Author: Neal Richardson <neal.p.richardson@gmail.com>

Closes #4538 from nealrichardson/r-appveyor and squashes the following commits:

6d6a74e <Neal Richardson> PR feedback
84eef1b <Neal Richardson> Merge upstream/master
328bc37 <Neal Richardson> Merge remote-tracking branch 'upstream/master' into r-appveyor
53c3a20 <Neal Richardson> Makevars.win is dynamically generated now, no need to tweak it in the release script
567f041 <Neal Richardson> Cleanups
f4cf5c3 <Neal Richardson> Some appveyor.yml cleanup
19b60df <Neal Richardson> Enable R package to install on windows even if rwinlib isn't found
fb19911 <Neal Richardson> Try this name hack
15c0342 <Neal Richardson> Try building first, see where that goes
18803c3 <Neal Richardson> cd this way
b6f6c8d <Neal Richardson> No popd?
1ba4839 <Neal Richardson> let the debugging begin
d5e8f4e <Neal Richardson> Bootstrap within the package dir
f7327ed <Neal Richardson> Find shell script up a level
8c3e898 <Neal Richardson> nm we already fast finish
39236f3 <Neal Richardson> Put mine first and fast-finish
94c9a42 <Neal Richardson> Debug with fewer jobs in the matrix
af1852e <Neal Richardson> Throw this at appveyor and see how far off we are
alamb pushed a commit to apache/arrow-rs that referenced this pull request Apr 20, 2021
This is very much a work-in-progress for running the R package tests on Appveyor. It's an attempt to bring in the "standard" R Appveyor build tooling (https://github.com/krlmlr/r-appveyor, Apache 2.0 licensed) into Arrow's configuration.

I currently have the C++ library building and installing, and R and all R package dependencies installing, and R checks beginning, but they fail. This is because https://github.com/apache/arrow/blob/master/r/tools/winlibs.R is configured only to build with a released [sic] version of the arrow C++ binaries pre-built and hosted on rwinlib. Two immediate to-dos are:

1. get the package to install and check without libarrow at all (as apache/arrow#4471 now allows)
2. get the package to find the libarrow we built locally from source

After that, I can reverse-engineer what appveyor does and document it for any Windows developers (https://issues.apache.org/jira/browse/ARROW-3758).

There's obviously some mess I've made of appveyor.yml that I'll clean up before I'm done, and I'll do my best to document and refactor some of the non-obvious features of the setup that I've discovered (e.g. that `ci/appveyor-cpp-build-mingw.bat` also installs GLib and Ruby and runs their test suites too) while I'm at it so that the next person might be able to pick things up faster.

cc @romainfrancois @javierluraschi

Author: Neal Richardson <neal.p.richardson@gmail.com>

Closes #4538 from nealrichardson/r-appveyor and squashes the following commits:

6d6a74e31 <Neal Richardson> PR feedback
84eef1b2d <Neal Richardson> Merge upstream/master
328bc376f <Neal Richardson> Merge remote-tracking branch 'upstream/master' into r-appveyor
53c3a2084 <Neal Richardson> Makevars.win is dynamically generated now, no need to tweak it in the release script
567f04142 <Neal Richardson> Cleanups
f4cf5c3dd <Neal Richardson> Some appveyor.yml cleanup
19b60df07 <Neal Richardson> Enable R package to install on windows even if rwinlib isn't found
fb199110a <Neal Richardson> Try this name hack
15c0342d1 <Neal Richardson> Try building first, see where that goes
18803c353 <Neal Richardson> cd this way
b6f6c8de7 <Neal Richardson> No popd?
1ba4839e2 <Neal Richardson> let the debugging begin
d5e8f4ed4 <Neal Richardson> Bootstrap within the package dir
f7327ed23 <Neal Richardson> Find shell script up a level
8c3e898f8 <Neal Richardson> nm we already fast finish
39236f3f7 <Neal Richardson> Put mine first and fast-finish
94c9a42df <Neal Richardson> Debug with fewer jobs in the matrix
af1852e79 <Neal Richardson> Throw this at appveyor and see how far off we are
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants