
ARROW-15092: [R] Support create_package_with_all_dependencies() on non-linux systems #12849

Closed · wants to merge 10 commits into apache:master from karldw:fix-15092

Conversation

@karldw (Contributor) commented Apr 9, 2022

This PR aims to address ARROW-15092 and ARROW-15608, allowing the function create_package_with_all_dependencies() to be run on non-Linux systems.

To accomplish this, the code parses the versions.txt file in R. This process is a little hairy because the existing code in versions.txt and download_dependencies.sh depends on shell substitution and array parsing.

In writing this PR, I assumed that only base R functions were available, and the format of versions.txt couldn't be changed to make things easier. I wrote some roxygen-like documentation, but didn't actually want to generate the *.Rd files, so just started the lines with # instead of #'.
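
To give a flavor of the parsing involved, here is a minimal base-R sketch of the substitution idea. This is illustrative only: the variable name and value are just examples, and the actual helper in this PR handles more bash patterns (such as ${VAR//./_}).

```r
# Minimal sketch (not the PR's helper): substitute plain ${VAR} references in a
# string, given a named character vector of values, using only base R.
substitute_simple <- function(one_string, possible_values) {
  for (var_name in names(possible_values)) {
    one_string <- gsub(
      paste0("${", var_name, "}"),
      possible_values[[var_name]],
      one_string,
      fixed = TRUE
    )
  }
  one_string
}

substitute_simple(
  "zstd-${ARROW_ZSTD_BUILD_VERSION}.tar.gz",
  c(ARROW_ZSTD_BUILD_VERSION = "v1.5.1")
)
#> [1] "zstd-v1.5.1.tar.gz"
```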

I'd be very grateful for feedback here. The first question is whether this approach makes sense at a macro level, and then whether my implementation seems reasonable.

github-actions bot commented Apr 9, 2022

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@assignUser (Member) left a comment

Thank you for your work on this PR! I commented on some minor things, but other than that I think it is ready to go. I tested it on my Windows machine and it works!

I noticed that there are no tests for create_package_with_all_dependencies so that might be something we could add in this PR.

As I am not a committer, I will ask for a second opinion from someone who can merge this.

# Only supports a small subset of bash substitution patterns. May have multiple
# bash variables in `one_string`
# Used as a helper to parse versions.txt
..install_substitute_like_bash <- function(one_string, possible_values) {
@assignUser (Member) commented:
I think it is not necessary to use the dot prefix for internal functions, as it is done nowhere else in the package. Nice job keeping a consistent naming scheme, though! 👍

Comment on lines 140 to 150
# Substitute like the bash shell
#
# @param one_string A length-1 character vector
# @param possible_values A dictionary-ish set of variables that could provide
# values to substitute in.
# @return `one_string`, with values substituted like bash would.
#
# Only supports a small subset of bash substitution patterns. May have multiple
# bash variables in `one_string`
# Used as a helper to parse versions.txt
..install_substitute_like_bash <- function(one_string, possible_values) {
@assignUser (Member) commented:

I wrote some roxygen-like documentation, but didn't actually want to generate the *.Rd files, so just started the lines with # instead of #'.

You can prevent the creation of an Rd file by using @noRd.
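
For reference, the same comments written as roxygen with @noRd would look roughly like this (a sketch only, reusing the names from the snippet above; the body is elided):

```r
#' Substitute like the bash shell
#'
#' @param one_string A length-1 character vector
#' @param possible_values A dictionary-ish set of variables that could provide
#'   values to substitute in.
#' @return `one_string`, with values substituted like bash would.
#' @noRd
..install_substitute_like_bash <- function(one_string, possible_values) {
  # body unchanged from the version under review
}
```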

Comment on lines 169 to 176
)[[1]] # Subset [[1]] because one_string has length 1
# `matched_substrings` is a character vector with length equal to the number
# of non-overlapping matches of `version_regex` in `one_string`. `match_list`
# is a list (same length as `matched_substrings`), where each list element is
# a length-3 character vector. The first element of the vector is the value
# from `matched_substrings` (e.g. "${ARROW_ZZZ_VERSION//./_}"). The following
# two values are the captured groups specified in `version_regex` e.g.
# "ARROW_ZZZ_VERSION" and "//./_".
@assignUser (Member) commented:

👍
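
As a self-contained illustration of the structure described in that comment (the regex and input below are stand-ins, not the exact ones in the PR):

```r
one_string <- "zzz-${ARROW_ZZZ_VERSION//./_}.tar.gz"
# Stand-in regex: capture a variable name plus an optional //find/replace part.
version_regex <- "\\$\\{([A-Za-z0-9_]+)(//[^}]*)?\\}"

matched_substrings <- regmatches(one_string, gregexpr(version_regex, one_string))[[1]]
match_list <- regmatches(matched_substrings, regexec(version_regex, matched_substrings))

matched_substrings
#> [1] "${ARROW_ZZZ_VERSION//./_}"
match_list[[1]]
#> [1] "${ARROW_ZZZ_VERSION//./_}" "ARROW_ZZZ_VERSION"          "//./_"
```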

)

if (any(failed_to_parse)) {
stop(
@assignUser (Member) commented:

You could use rlang::abort() here, but its use is pretty heterogeneous across the package, so 🤷 (and you aimed for base R, which is good!)

@nealrichardson (Contributor) commented:

@karldw thanks for taking a stab at this, and apologies for missing your earlier ping on JIRA (I get lots of JIRA notifications). From the error message reported on the issue, it looks like the problem is the call to readlink in the download_dependencies.sh script. Reading the man page for that, it looks like it exists only to resolve any symlinks that might be in the DESTDIR. I'm not sure why exactly (ARROW-4033 last edited this line; the change was initially added in #2673 with no discussion). Before adding all of this R code, I might try editing that line to

DESTDIR=$(readlink -f "${DESTDIR}" || readlink "${DESTDIR}" || ${DESTDIR})

(macOS has readlink but doesn't support -f; || ${DESTDIR} because it probably doesn't matter anyway).

In general I think trying to make download_dependencies.sh work on more platforms is the right solution, rather than hacking around it in R.

@assignUser (Member) commented:

In general I think trying to make download_dependencies.sh work on more platforms is the right solution, rather than hacking around it in R.

Is there a setup-agnostic way to run bash scripts on Windows from R? The only possibility I see is using WSL or checking for git-bash (which is not on the PATH by default, afaik).

@wjones127 (Member) commented:

Is there a setup-agnostic way to run bash scripts on Windows from R? The only possibility I see is using WSL or checking for git-bash (which is not on the PATH by default, afaik).

On Windows I think we can take for granted that if a user is building a package they have already installed RTools (or should do so).

It looks like pkgbuild has a function with_build_tools() (docs) that can temporarily add the appropriate RTools to the PATH before running a supplied script. Perhaps that is the appropriate thing to use here?
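
For illustration, the usage might look something like this (a sketch under the assumption that with_build_tools() simply evaluates its argument with RTools on the PATH, as the pkgbuild docs describe; the script path is only an example):

```r
# Sketch: run the dependency download script with RTools (and its bash)
# temporarily on the PATH. The path and arguments here are illustrative.
pkgbuild::with_build_tools(
  system2("bash", c("cpp/thirdparty/download_dependencies.sh", shQuote(tempdir())))
)
```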

@nealrichardson (Contributor) commented:

Yeah it is safe to assume bash these days, especially for scripts that CRAN won't run.

@assignUser (Member) commented:

@wjones127 create_package_with_all_dependencies itself doesn't build the package, so RTools wouldn't necessarily be installed (Windows binaries are on CRAN, after all), but requiring RTools would be fine if there is a proper error message when it isn't found, versus the unclear error that we have now. That said, pkgbuild is neither a soft nor a hard dependency at the moment, so we would need to add it, which is a con in my opinion.

In the end, the R solution works everywhere and does not require new dependencies, but it adds a lot of code and is brittle to changes in versions.txt. The bash solution, on the other hand, requires fewer changes and less new code, but (maybe) disadvantages Windows users and possibly adds a new soft dependency. 🤷

I'll defer to your experience with the package and userbase @wjones127 @nealrichardson

@nealrichardson (Contributor) commented:

I'll defer to your experience with the package and userbase @wjones127 @nealrichardson

IMO the bash script should be fixed anyway since it's not robust. I suspect that will solve the issue here too, in which case we don't need all of this R code (though I recognize and appreciate the effort).

@karldw (Contributor, Author) commented Apr 12, 2022

Okay, great! @nealrichardson, just to confirm: do you want to go with the suggestion to add pkgbuild as a new dependency? I added it and tweaked the bash script; let's see how the tests go on this new version.


Specific replies:

@nealrichardson:

I think readlink without -f isn't useful here, so I removed it from the chain. I also had to change the syntax a little to get things working, but I might be missing some clever bash-ism.

Just to repeat @assignUser's earlier comment, the use case I have in my head is that the package is downloaded on one machine, then installed on another. For that reason, I was trying not to make too many assumptions about the build capabilities on the downloading machine. But this offline build is a pretty niche demand, and it's probably okay to ask those users to make sure they have bash available when downloading.

@assignUser:

I noticed that there are no tests for create_package_with_all_dependencies so that might be something we could add in this PR.

Running create_package_with_all_dependencies requires downloading ~100MB of files, which seemed like a pretty heavy test to run every time. I added a test for run_download_script that skips the actual download, but checks that the requirements are in place.
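
Roughly, that kind of test could be shaped like the sketch below (illustrative only, not the committed test):

```r
library(testthat)

# Illustrative sketch only -- not the test committed in this PR.
test_that("download-script requirements are in place", {
  skip_on_cran()
  # The real test exercises run_download_script() without performing the
  # ~100 MB download; here we only check that bash is reachable.
  expect_true(nzchar(Sys.which("bash")))
})
```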

@assignUser and @wjones127, thanks for the tips!

Review thread on r/R/install-arrow.R (outdated, resolved).
@karldw (Contributor, Author) commented Apr 13, 2022

There are three categories of failed test here:

  1. Test failures I think I understand and can fix. In the first two builds below, I think the issue is that the build doesn't copy files into tools/ when I expected; that's fine.
    • R / AMD64 Ubuntu 20.04 R 4.1
    • R / rstudio/r-base:4.0-centos7
    • R / rhub/debian-gcc-devel:latest (This one fails because wget isn't installed. Surprising to me, but also fine.)
  2. Test failures I think aren't related to the changes I made, but please let me know if that's wrong.
    • Python / AMD64 MacOS 10.15 Python 3: Segfault in pytest.
    • C++ / AMD64 MacOS 10.15 C++: Core dump in TestS3FS.GetFileInfoRoot
  3. Test failures I don't understand, and could use help diagnosing. All of these fail at is_wget_available, but I thought bash -c 'wget -V' would run successfully on all of these:
    • R / AMD64 Windows R 3.6 RTools 35
    • R / AMD64 Windows R 4.1 RTools 40
    • R / AMD64 Windows R devel RTools 42

I can add code to skip the test if wget isn't available, but first it seems important to track down what's happening in the cases where I thought wget would be available but it isn't. I don't have a Windows system to test on -- any tips would be very welcome.
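
For context, that availability check boils down to something like this (a sketch; the helper in the PR may differ in detail):

```r
# Sketch of the wget availability check discussed above (not the exact helper).
is_wget_available <- function() {
  status <- suppressWarnings(
    system2("bash", c("-c", shQuote("wget -V")), stdout = FALSE, stderr = FALSE)
  )
  identical(status, 0L)
}

is_wget_available()  # FALSE on systems where wget (or bash) is missing
```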

@wjones127 (Member) commented:

I don't have a Windows system to test on -- any tips would be very welcome.

I'll take a look tomorrow on my Windows machine.

@assignUser (Member) commented:

@karldw I checked, and it looks like rtools40 does not come with wget (and it can't be installed with pacman either). That would explain the test failures under 3.

@wjones127 (Member) commented:

Yeah maybe we need to download with curl instead?

@nealrichardson (Contributor) commented:

What about a different approach? We don't need to use the download script itself to download the files, but we do need to use versions.txt. We could use a slightly different bash script to generate R syntax that would download the files, then use R to do the downloading: no wget or curl or other system tools required beyond what R already has; we just need bash.

To illustrate, I patched the existing download script:

diff --git a/cpp/thirdparty/download_dependencies.sh b/cpp/thirdparty/download_dependencies.sh
index 7ffffa08c..2c0447cdb 100755
--- a/cpp/thirdparty/download_dependencies.sh
+++ b/cpp/thirdparty/download_dependencies.sh
@@ -30,14 +30,11 @@ else
   DESTDIR=$1
 fi
 
-DESTDIR=$(readlink -f "${DESTDIR}")
-
 download_dependency() {
   local url=$1
   local out=$2
 
-  wget --quiet --continue --output-document="${out}" "${url}" || \
-    (echo "Failed downloading ${url}" 1>&2; exit 1)
+  echo 'download.file("'${url}'", "'${out}'")'
 }
 
 main() {
@@ -46,7 +43,6 @@ main() {
   # Load `DEPENDENCIES` variable.
   source ${SOURCE_DIR}/versions.txt
 
-  echo "# Environment variables for offline Arrow build"
   for ((i = 0; i < ${#DEPENDENCIES[@]}; i++)); do
     local dep_packed=${DEPENDENCIES[$i]}
 
@@ -55,8 +51,6 @@ main() {
 
     local out=${DESTDIR}/${dep_tar_name}
     download_dependency "${dep_url}" "${out}"
-
-    echo "export ${dep_url_var}=${out}"
   done
 }

in R we can get:

> system("source ../cpp/thirdparty/download_dependencies.sh /tmp", intern=TRUE)
 [1] "download.file(\"https://github.com/abseil/abseil-cpp/archive/20210324.2.tar.gz\", \"/tmp/absl-20210324.2.tar.gz\")"                                                                 
 [2] "download.file(\"https://github.com/aws/aws-sdk-cpp/archive/1.8.133.tar.gz\", \"/tmp/aws-sdk-cpp-1.8.133.tar.gz\")"                                                                  
 [3] "download.file(\"https://github.com/awslabs/aws-checksums/archive/v0.1.12.tar.gz\", \"/tmp/aws-checksums-v0.1.12.tar.gz\")"                                                          
 [4] "download.file(\"https://github.com/awslabs/aws-c-common/archive/v0.6.9.tar.gz\", \"/tmp/aws-c-common-v0.6.9.tar.gz\")" 
...

which you could then source.
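
Concretely, with a patch like the one above applied, the R side could be as small as this (a sketch; the relative path assumes the working directory is r/, as in the example above):

```r
# Sketch: capture the generated download.file() calls and evaluate them in R,
# so no wget or curl is needed on the machine doing the downloading.
dest_dir <- tempdir()
generated <- system2(
  "bash",
  c("../cpp/thirdparty/download_dependencies.sh", shQuote(dest_dir)),
  stdout = TRUE
)
eval(parse(text = generated))
```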

I'm also ok with just fixing the one readlink line and saying that the script requires bash and wget, possibly just adding that to the error message that happens if the download script errors. That's much simpler. I don't think R code and tests for whether we can determine if the current machine has bash and wget installed are worth the trouble.

@karldw (Contributor, Author) commented Apr 16, 2022

Thanks, that's clever!

Since we're only using versions.txt from cpp/thirdparty/, I also changed the Makefile to skip the other bundled files.

I also left the readlink changes in the original download_dependencies.sh file -- let me know if you want them out.

@karldw (Contributor, Author) commented Apr 17, 2022

If any of the folks here want to try this out:

git clone --single-branch --branch=fix-15092 --depth 1 git@github.com:karldw/arrow.git
cd arrow/r
make sync-cpp
R CMD build --no-build-vignettes .
R -e 'source("R/install-arrow.R"); create_package_with_all_dependencies(source_file = "arrow_7.0.0.9000.tar.gz")'

@nealrichardson (Contributor) left a comment

I tested this out, and while the tarball built successfully, it did not install because we needed the thirdparty dir as it was before. With the suggested changes, I was able to install successfully.

Review threads on r/Makefile and r/tools/download_dependencies_R.sh (outdated, resolved).
@nealrichardson (Contributor) commented:

Not immediately relevant here, but in tools/nixlibs.R there is code in set_thirdparty_urls() that generates the env vars that point to the files, and there are more cases that need special handling now: google-cloud-cpp (comes out as ARROW_GOOGLE_URL) and nlohmann json (comes out as ARROW_NLOHMANN_URL). These aren't used now because we haven't added GCP bindings to the R package yet. But whenever we do, we'll need to fix those URLs or else this fat package won't work for them.

(Feel free to ignore for now or ticket for later, just commenting because I noticed it when trying to debug why it wasn't installing before.)

@karldw (Contributor, Author) commented Apr 20, 2022

Not immediately relevant here, but in tools/nixlibs.R there is code in set_thirdparty_urls() that generates the env vars that point to the files, and there are more cases that need special handling now: google-cloud-cpp (comes out as ARROW_GOOGLE_URL) and nlohmann json (comes out as ARROW_NLOHMANN_URL). These aren't used now because we haven't added GCP bindings to the R package yet. But whenever we do, we'll need to fix those URLs or else this fat package won't work for them.

One option might be to modify download_dependencies_R.sh to pull the actual names from versions.txt? Here's a demo that spits out the names we need in CSV format. (In reality we would want a command line flag rather than simply removing the download code.)

@@ -44,20 +44,25 @@ download_dependency() {
   echo 'download.file("'${url}'", "'${out}'", quiet = TRUE)'
 }
 
+print_tar_name() {
+  local url_var=$1
+  local tar_name=$2
+  echo "'${url_var}','${tar_name}'"
+}
+
 main() {
-  mkdir -p "${DESTDIR}"
 
   # Load `DEPENDENCIES` variable.
   source ${SOURCE_DIR}/cpp/thirdparty/versions.txt
 
+  echo 'env_var,filename'
   for ((i = 0; i < ${#DEPENDENCIES[@]}; i++)); do
     local dep_packed=${DEPENDENCIES[$i]}
 
     # Unpack each entry of the form "$home_var $tar_out $dep_url"
     IFS=" " read -r dep_url_var dep_tar_name dep_url <<< "${dep_packed}"
 
-    local out=${DESTDIR}/${dep_tar_name}
-    download_dependency "${dep_url}" "${out}"
+    print_tar_name "${dep_url_var}" "${dep_tar_name}"
   done
 }
Output
env_var,filename
'ARROW_ABSL_URL','absl-20210324.2.tar.gz'
'ARROW_AWSSDK_URL','aws-sdk-cpp-1.8.133.tar.gz'
'ARROW_AWS_CHECKSUMS_URL','aws-checksums-v0.1.12.tar.gz'
'ARROW_AWS_C_COMMON_URL','aws-c-common-v0.6.9.tar.gz'
'ARROW_AWS_C_EVENT_STREAM_URL','aws-c-event-stream-v0.1.5.tar.gz'
'ARROW_BOOST_URL','boost-1.75.0.tar.gz'
'ARROW_BROTLI_URL','brotli-v1.0.9.tar.gz'
'ARROW_BZIP2_URL','bzip2-1.0.8.tar.gz'
'ARROW_CARES_URL','cares-1.17.2.tar.gz'
'ARROW_CRC32C_URL','crc32c-1.1.2.tar.gz'
'ARROW_GBENCHMARK_URL','gbenchmark-v1.6.0.tar.gz'
'ARROW_GFLAGS_URL','gflags-v2.2.2.tar.gz'
'ARROW_GLOG_URL','glog-v0.5.0.tar.gz'
'ARROW_GOOGLE_CLOUD_CPP_URL','google-cloud-cpp-v1.39.0.tar.gz'
'ARROW_GRPC_URL','grpc-v1.35.0.tar.gz'
'ARROW_GTEST_URL','gtest-1.11.0.tar.gz'
'ARROW_JEMALLOC_URL','jemalloc-5.2.1.tar.bz2'
'ARROW_LZ4_URL','lz4-8f61d8eb7c6979769a484cde8df61ff7c4c77765.tar.gz'
'ARROW_MIMALLOC_URL','mimalloc-v1.7.3.tar.gz'
'ARROW_NLOHMANN_JSON_URL','nlohmann-json-v3.10.2.tar.gz'
'ARROW_OPENTELEMETRY_URL','opentelemetry-cpp-v1.2.0.tar.gz'
'ARROW_OPENTELEMETRY_PROTO_URL','opentelemetry-proto-v0.11.0.tar.gz'
'ARROW_ORC_URL','orc-1.7.3.tar.gz'
'ARROW_PROTOBUF_URL','protobuf-v3.18.1.tar.gz'
'ARROW_RAPIDJSON_URL','rapidjson-1a803826f1197b5e30703afe4b9c0e7dd48074f5.tar.gz'
'ARROW_RE2_URL','re2-2021-11-01.tar.gz'
'ARROW_SNAPPY_URL','snappy-1.1.9.tar.gz'
'ARROW_THRIFT_URL','thrift-0.13.0.tar.gz'
'ARROW_UTF8PROC_URL','utf8proc-v2.7.0.tar.gz'
'ARROW_XSIMD_URL','xsimd-7d1778c3b38d63db7cec7145d939f40bc5d859d1.tar.gz'
'ARROW_ZLIB_URL','zlib-1.2.12.tar.gz'
'ARROW_ZSTD_URL','zstd-v1.5.1.tar.gz'
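
(As an aside, output in that shape is straightforward to consume from R; an illustrative snippet, with the rows truncated:)

```r
# Illustrative only: read the single-quoted CSV produced above into R.
csv_lines <- c(
  "env_var,filename",
  "'ARROW_ABSL_URL','absl-20210324.2.tar.gz'",
  "'ARROW_ZSTD_URL','zstd-v1.5.1.tar.gz'"
)
deps <- read.csv(text = csv_lines, quote = "'", stringsAsFactors = FALSE)
deps$env_var
#> [1] "ARROW_ABSL_URL" "ARROW_ZSTD_URL"
```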

@nealrichardson (Contributor) commented:

Yeah, something like that is probably the right solution. Would you mind making a JIRA for that? I don't think we need to tackle that here, and I want to make sure this fix gets in the upcoming release.

@nealrichardson (Contributor) left a comment

Thanks!

@jonkeane closed this in c73870a on Apr 20, 2022
ursabot commented Apr 22, 2022

Benchmark runs are scheduled for baseline = 20bc63a and contender = c73870a. c73870a is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.12% ⬆️0.0%] test-mac-arm
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.25% ⬆️0.04%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] c73870ac ec2-t3-xlarge-us-east-2: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/561
[Finished] c73870ac test-mac-arm: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/549
[Failed] c73870ac ursa-i9-9960x: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/547
[Finished] c73870ac ursa-thinkcentre-m75q: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/559
[Finished] 20bc63a8 ec2-t3-xlarge-us-east-2: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/560
[Finished] 20bc63a8 test-mac-arm: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/548
[Failed] 20bc63a8 ursa-i9-9960x: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/546
[Finished] 20bc63a8 ursa-thinkcentre-m75q: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/558
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@karldw deleted the fix-15092 branch on April 23, 2022 at 19:55
nealrichardson added a commit that referenced this pull request May 10, 2022
…ne build

As Neal mentioned in #12849 (comment), the current code in nixlibs.R doesn't handle URL variable name components that have multiple words (because of the way it parses variable names from filenames). Until now, we've had a special case for the AWS variables, but `ARROW_GOOGLE_CLOUD_CPP_URL` and `ARROW_NLOHMANN_JSON_URL` also need handling. Instead of adding special cases, we can provide the correct `ARROW_*_URL` values with the new bash script added as part of ARROW-15092 (in PR #12849).

Please let me know what you think!

Closes #12973 from karldw/fix-16297

Lead-authored-by: karldw <karldw@users.noreply.github.com>
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>