Arrow CUDA build v1.0.1 from source missing build.sh #10222

Closed
Atharex opened this issue May 3, 2021 · 8 comments

@Atharex

Atharex commented May 3, 2021

Hi!

I have been trying to compile Apache Arrow 1.0.1 with CUDA support for use with RAPIDS, but the build gets stuck, apparently because of the Boost dependency.

These are the commands I run in a Docker container that has the CUDA libraries preinstalled:

## Apache Arrow repo  (instructions from https://randyzwitch.com/pyarrow-cuda-support/)
export ARROW_HOME=/root/home/arrow
git clone https://github.com/apache/arrow.git $ARROW_HOME
cd $ARROW_HOME
git submodule update --init
git checkout apache-arrow-1.0.1
mkdir -p $ARROW_HOME/cpp/build && cd $ARROW_HOME/cpp/build

## Apache Arrow build dependencies
bash $ARROW_HOME/cpp/thirdparty/download_dependencies.sh

## Apache Arrow C++ build
cmake -DCMAKE_INSTALL_PREFIX=/usr/local/lib/arrow \
-DCMAKE_INSTALL_LIBDIR=lib \
-DARROW_FLIGHT=ON \
-DARROW_ORC=ON \
-DARROW_WITH_BZ2=ON \
-DARROW_WITH_ZLIB=ON \
-DARROW_WITH_ZSTD=ON \
-DARROW_WITH_LZ4=ON \
-DARROW_WITH_SNAPPY=ON \
-DARROW_WITH_BROTLI=ON \
-DARROW_PARQUET=ON \
-DARROW_PYTHON=ON \
-DARROW_PLASMA=ON \
-DARROW_CUDA=ON \
..
make -j 24

The make command fails quickly with this output:

CMake Error at /root/home/arrow/cpp/build/boost_ep-prefix/src/boost_ep-stamp/boost_ep-configure-RELEASE.cmake:37 (message):
  Command failed: 1

   './bootstrap.sh' '--prefix=/root/home/arrow/cpp/build/boost_ep-prefix/src/boost_ep' '--with-libraries=filesystem,regex,system'

  See also

    /root/home/arrow/cpp/build/boost_ep-prefix/src/boost_ep-stamp/boost_ep-configure-*.log


-- stdout output is:
Building Boost.Build engine with toolset ... 
Failed to build Boost.Build build engine
Consult 'bootstrap.log' for more details

-- stderr output is:
./bootstrap.sh: line 196: ./tools/build/src/engine/build.sh: No such file or directory

CMake Error at /root/home/arrow/cpp/build/boost_ep-prefix/src/boost_ep-stamp/boost_ep-configure-RELEASE.cmake:47 (message):
  Stopping after outputting logs.

Am I missing an installation step here?

When I run the download_dependencies.sh script, it appears to fail. Could that be the reason for the missing build.sh script?


bash $ARROW_HOME/cpp/thirdparty/download_dependencies.sh
# Environment variables for offline Arrow build
export ARROW_ABSL_URL=/root/home/arrow/cpp/build/absl-2eba343b51e0923cd3fb919a6abd6120590fc059.tar.gz
export ARROW_AWSSDK_URL=/root/home/arrow/cpp/build/aws-sdk-cpp-1.7.160.tar.gz
Failed downloading https://dl.bintray.com/ursalabs/arrow-boost/boost_1_71_0.tar.gz
@jorisvandenbossche
Member

The Bintray URL has been retired, so the download_dependencies.sh script included in the older Arrow 1.0.1 release no longer works out of the box.

You can update the URL to download from SourceForge or https://github.com/ursa-labs/thirdparty instead; see the changes in #9483.

@Atharex
Author

Atharex commented May 3, 2021

Thanks @jorisvandenbossche! I've manually set the download links to the ursa-labs/thirdparty repo, and the downloads then completed successfully. However, I still encounter the same error as before.

It seems the Boost build script itself is what fails: I get the same error if I extract the Boost archive from the new thirdparty repo and try to build it separately by hand.

bash-4.2# bash bootstrap.sh
bootstrap.sh: line 196: ./tools/build/src/engine/build.sh: No such file or directory
Building Boost.Build engine with toolset ... 
Failed to build Boost.Build build engine
Consult 'bootstrap.log' for more details

Are there any additional steps I need to take besides the ones in my first post?

@edponce
Contributor

edponce commented May 3, 2021

@Atharex Besides downloading the dependency packages (e.g., Boost), you can also force CMake to use the downloaded (bundled) packages and ignore system-wide versions.

I am able to do an offline build of Arrow 1.0.1 via the following steps:

  1. Because Bintray has been retired, change the Boost URL in cpp/thirdparty/versions.txt from https://dl.bintray.com/ursalabs/arrow-boost to https://github.com/ursa-labs/thirdparty/releases/download/27apr2021.
  2. In the cmake command, set the dependency resolution to use the bundled versions; see https://arrow.apache.org/docs/developers/cpp/building.html#individual-dependency-resolution. To use all the downloaded packages, add -DARROW_DEPENDENCY_SOURCE=BUNDLED; to bundle only the Boost library, add -DBOOST_SOURCE=BUNDLED instead.
  3. Starting from your initial build, use the following commands:
## Apache Arrow build dependencies and source corresponding environment variables
bash $ARROW_HOME/cpp/thirdparty/download_dependencies.sh > environ.sh
source environ.sh

## Apache Arrow C++ build
cmake -DCMAKE_INSTALL_PREFIX=/usr/local/lib/arrow \
-DCMAKE_INSTALL_LIBDIR=lib \
-DARROW_FLIGHT=ON \
-DARROW_ORC=ON \
-DARROW_WITH_BZ2=ON \
-DARROW_WITH_ZLIB=ON \
-DARROW_WITH_ZSTD=ON \
-DARROW_WITH_LZ4=ON \
-DARROW_WITH_SNAPPY=ON \
-DARROW_WITH_BROTLI=ON \
-DARROW_PARQUET=ON \
-DARROW_PYTHON=ON \
-DARROW_PLASMA=ON \
-DARROW_CUDA=ON \
-DARROW_DEPENDENCY_SOURCE=BUNDLED \
..
make -j 24

@edponce
Contributor

edponce commented May 3, 2021

The Ursa Labs thirdparty project has been updated so you can use https://github.com/ursa-labs/thirdparty/releases/download/latest for the ARROW_BOOST_URL in cpp/thirdparty/versions.txt.
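
A quick way to confirm that such a URL change took effect is to patch a copy and count the remaining references to the retired host. The sketch below is only an illustration: it works on a throwaway stand-in file rather than the real cpp/thirdparty/versions.txt, and the single line it contains is an approximation of that file's contents, not its actual format.

```shell
# Stand-in for cpp/thirdparty/versions.txt; the real file has more entries
# and this single line is only an approximation of its format.
demo_versions="$(mktemp)"
echo 'https://dl.bintray.com/ursalabs/arrow-boost/boost_1_71_0.tar.gz' > "$demo_versions"

# Rewrite the host and path, writing to a new file so the original stays
# untouched (this form works with both GNU and BSD sed).
sed 's@dl.bintray.com/ursalabs/arrow-boost@github.com/ursa-labs/thirdparty/releases/download/latest@' \
    "$demo_versions" > "$demo_versions.patched"

# Count the lines that still mention the retired host; 0 means the patch applied.
remaining="$(grep -c 'bintray' "$demo_versions.patched" || true)"
patched_line="$(cat "$demo_versions.patched")"
echo "remaining bintray references: $remaining"
echo "$patched_line"
```

Running the same grep against the real versions.txt before calling download_dependencies.sh would show whether any entry still points at Bintray.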

@Atharex
Author

Atharex commented May 5, 2021

Thanks @edponce, this solved the C++ build issues!
My other mistake was running download_dependencies.sh with bash without sourcing its output, so the exported variables were never set in my current session.
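
The difference between running such a script and sourcing its output can be reproduced with a small stand-in. This is a minimal sketch, assuming a hypothetical script fake_download.sh that, like download_dependencies.sh, only prints export lines to stdout; the variable DEMO_BOOST_URL and its value are made up for the demonstration.

```shell
# Hypothetical stand-in for download_dependencies.sh: it only PRINTS an export line.
workdir="$(mktemp -d)"
cat > "$workdir/fake_download.sh" <<'EOF'
echo "export DEMO_BOOST_URL=/tmp/boost_demo.tar.gz"
EOF

# Running the script in a child shell just prints the line; nothing is set here.
unset DEMO_BOOST_URL
bash "$workdir/fake_download.sh" > /dev/null
before="${DEMO_BOOST_URL:-unset}"

# Capturing stdout in a file and sourcing that file sets the variable in this shell.
# ("." is the portable spelling of bash's "source".)
bash "$workdir/fake_download.sh" > "$workdir/environ.sh"
. "$workdir/environ.sh"
after="${DEMO_BOOST_URL:-unset}"

echo "before sourcing: $before"
echo "after sourcing:  $after"
```

The variable only becomes visible after the captured output is sourced, which is why the `> environ.sh` followed by `source environ.sh` pair matters.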

I also noticed that if I run cmake with -DCMAKE_INSTALL_PREFIX=/usr/local/lib/arrow,
I cannot build the Python wheel afterwards: it complains that it cannot find ARROW_INCLUDE_DIR and ARROW_LIB_DIR.

cd $ARROW_HOME/python
export PYARROW_WITH_PARQUET=1
export PYARROW_WITH_CUDA=1
python3 setup.py build_ext -j 24 --build-type=release  --bundle-arrow-cpp bdist_wheel

Call Stack (most recent call first):
  /opt/cmake-3.20.1-linux-x86_64/share/cmake-3.20/Modules/FindPkgConfig.cmake:70 (find_package_handle_standard_args)
  cmake_modules/FindArrow.cmake:39 (include)
  cmake_modules/FindArrowPython.cmake:46 (find_package)
  CMakeLists.txt:210 (find_package)
This warning is for project developers.  Use -Wno-dev to suppress it.

-- Found PkgConfig: /usr/bin/pkg-config (found version "0.27.1") 
CMake Error at /opt/cmake-3.20.1-linux-x86_64/share/cmake-3.20/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
  Could NOT find Arrow (missing: ARROW_INCLUDE_DIR ARROW_LIB_DIR
  ARROW_FULL_SO_VERSION ARROW_SO_VERSION)
Call Stack (most recent call first):
  /opt/cmake-3.20.1-linux-x86_64/share/cmake-3.20/Modules/FindPackageHandleStandardArgs.cmake:594 (_FPHSA_FAILURE_MESSAGE)
  cmake_modules/FindArrow.cmake:412 (find_package_handle_standard_args)
  cmake_modules/FindArrowPython.cmake:46 (find_package)
  CMakeLists.txt:210 (find_package)

When I use -DCMAKE_INSTALL_PREFIX=$ARROW_HOME, the Python wheel build finds all of those variables and works normally. Any idea why that would be the case?

@Atharex
Author

Atharex commented May 5, 2021

Also, the wheel that gets built in the end fails on import.

bash-4.2# pip3 install $ARROW_HOME/python/dist/pyarrow*.whl
Processing ./dist/pyarrow-1.0.1.dev0+g886d87bde.d20210505-cp38-cp38-linux_x86_64.whl
Requirement already satisfied: numpy>=1.14 in /usr/local/lib/python3.8/site-packages (from pyarrow==1.0.1.dev0+g886d87bde.d20210505) (1.19.5)
Installing collected packages: pyarrow
Successfully installed pyarrow-1.0.1.dev0+g886d87bde.d20210505
WARNING: You are using pip version 20.2.4; however, version 21.1.1 is available.
You should consider upgrading via the '/usr/local/bin/python3.8 -m pip install --upgrade pip' command.
bash-4.2# python3
Python 3.8.9 (default, May  3 2021, 14:20:19) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-44)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/home/arrow/python/pyarrow/__init__.py", line 62, in <module>
    import pyarrow.lib as _lib
ModuleNotFoundError: No module named 'pyarrow.lib'

The package itself seems to be correctly installed, though, since the lib.pyx file is present:

bash-4.2# ls /root/home/arrow/python/pyarrow/
__init__.pxd  _csv.pxd      _flight.pyx            _json.pyx     _plasma.pyx    builder.pxi  config.pxi  feather.pxi    gandiva.pyx  ipc.pxi  lib.pyx           parquet.py         serialization.py  types.pxi
__init__.py   _csv.pyx      _fs.pxd                _orc.pxd      _s3fs.pyx      cffi.py      csv.py      feather.py     hdfs.py      ipc.py   memory.pxi        plasma.py          table.pxi         types.py
__pycache__   _cuda.pxd     _fs.pyx                _orc.pyx      array.pxi      compat.pxi   cuda.py     filesystem.py  includes     json.py  orc.py            public-api.pxi     tensor.pxi        util.py
_compute.pxd  _cuda.pyx     _generated_version.py  _parquet.pxd  benchmark.pxi  compat.py    dataset.py  flight.py      io-hdfs.pxi  jvm.py   pandas-shim.pxi   scalar.pxi         tensorflow
_compute.pyx  _dataset.pyx  _hdfs.pyx              _parquet.pyx  benchmark.py   compute.py   error.pxi   fs.py          io.pxi       lib.pxd  pandas_compat.py  serialization.pxi  tests

Am I missing something in the build steps for the python wheel here?

@edponce
Contributor

edponce commented Jun 2, 2021

@Atharex This error usually occurs when you run the Python interpreter from the ARROW_ROOT/python directory: the import pyarrow statement then picks up the local ARROW_ROOT/python/pyarrow source directory instead of the installed pyarrow wheel.
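
The shadowing effect can be reproduced without Arrow at all. The following sketch uses a hypothetical package name, shadow_demo, and throwaway directories; it assumes python3 is on the PATH and that neither PYTHONPATH nor PYTHONSAFEPATH alters the default import path.

```shell
# Hypothetical package name, used only for this demonstration.
workdir="$(mktemp -d)"
mkdir -p "$workdir/src/shadow_demo" "$workdir/elsewhere"
echo 'ORIGIN = "local source tree"' > "$workdir/src/shadow_demo/__init__.py"

# From inside src/, the current directory comes first on sys.path, so the
# local folder wins over anything installed in site-packages.
inside="$(cd "$workdir/src" && python3 -c 'import shadow_demo; print(shadow_demo.ORIGIN)')"

# From an unrelated directory the local folder is not visible, so the import
# falls through to site-packages (and fails here, because nothing is installed).
if (cd "$workdir/elsewhere" && python3 -c 'import shadow_demo' 2>/dev/null); then
    outside="importable"
else
    outside="not importable"
fi

echo "inside src/: $inside"
echo "elsewhere:   $outside"
```

For the real case, printing `pyarrow.__file__` after importing from a directory outside the repository shows which copy was actually loaded.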

Also, I noticed from the previous comments that ARROW_HOME is used somewhat incorrectly. ARROW_HOME should point to the installation directory of Arrow C++; use ARROW_ROOT for the top directory of the repo. That likely explains why the Python build only found Arrow when CMAKE_INSTALL_PREFIX was set to $ARROW_HOME. Putting all of this together, here are steps you can follow to build and install Arrow 1.0.1 system-wide:

export ARROW_HOME=/usr/local/lib/arrow
export ARROW_ROOT=/root/home/arrow

# Download Apache Arrow repo
git clone https://github.com/apache/arrow.git "$ARROW_ROOT"
cd "$ARROW_ROOT"
git checkout apache-arrow-1.0.1

# (Optional) Update submodules for unit tests
git submodule update --init
export PARQUET_TEST_DATA="$ARROW_ROOT/cpp/submodules/parquet-testing/data"
export ARROW_TEST_DATA="$ARROW_ROOT/testing/data"

# Download Arrow C++ third party build dependencies
cd "$ARROW_ROOT/cpp/thirdparty"

# Fix Boost URL issue
sed -i 's@dl.bintray.com/ursalabs/arrow-boost@github.com/ursa-labs/thirdparty/releases/download/latest@' versions.txt

bash download_dependencies.sh > environ.sh
source environ.sh

# Apache Arrow C++ build
export ARROW_CPP_BUILD="$ARROW_ROOT/cpp/build"
mkdir -p "$ARROW_CPP_BUILD"
cd "$ARROW_CPP_BUILD"
cmake \
    -DCMAKE_BUILD_TYPE=release \
    -DCMAKE_INSTALL_PREFIX="$ARROW_HOME" \
    -DCMAKE_INSTALL_LIBDIR=lib \
    -DARROW_FLIGHT=ON \
    -DARROW_ORC=ON \
    -DARROW_WITH_BZ2=ON \
    -DARROW_WITH_ZLIB=ON \
    -DARROW_WITH_ZSTD=ON \
    -DARROW_WITH_LZ4=ON \
    -DARROW_WITH_SNAPPY=ON \
    -DARROW_WITH_BROTLI=ON \
    -DARROW_PARQUET=ON \
    -DARROW_PYTHON=ON \
    -DARROW_PLASMA=ON \
    -DARROW_CUDA=ON \
    -DARROW_DEPENDENCY_SOURCE=BUNDLED \
    ..
make -j 24
make install

# Set up Python requirements and environment
cd "$ARROW_ROOT/python"
export PYARROW_BUILD_TYPE=release
export PYARROW_WITH_FLIGHT=1
export PYARROW_WITH_ORC=1
export PYARROW_WITH_PARQUET=1
export PYARROW_WITH_PLASMA=1
export PYARROW_WITH_CUDA=1
pip3 install -r requirements-wheel-build.txt
python3 setup.py build_ext -j 24 --bundle-arrow-cpp bdist_wheel
pip3 install "$ARROW_ROOT"/python/dist/pyarrow*.whl
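
Since the Python build depends on ARROW_HOME pointing at the install prefix, it can help to sanity-check that directory before running setup.py. The helper below, check_arrow_home, is a hypothetical name introduced for this sketch; it assumes the layout produced by `make install` with -DCMAKE_INSTALL_LIBDIR=lib, that is, headers under include/arrow and libraries under lib, and it is demonstrated on a throwaway directory rather than a real installation.

```shell
# Hypothetical helper: report whether a directory looks like an Arrow C++ install prefix.
check_arrow_home() {
    prefix="$1"
    if [ -d "$prefix/include/arrow" ] && [ -d "$prefix/lib" ]; then
        echo "ok"
    else
        echo "missing"
    fi
}

# Demonstration on a throwaway directory that mimics the assumed install layout.
demo_prefix="$(mktemp -d)"
mkdir -p "$demo_prefix/include/arrow" "$demo_prefix/lib"
installed="$(check_arrow_home "$demo_prefix")"
not_installed="$(check_arrow_home "$demo_prefix/does-not-exist")"

echo "install prefix: $installed"
echo "wrong path:     $not_installed"
```

Calling `check_arrow_home "$ARROW_HOME"` after `make install` would flag the earlier situation in this thread, where ARROW_HOME pointed at the repository checkout instead of the install prefix.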

@Atharex
Author

Atharex commented Jun 7, 2021

Ah true, the import was failing because I was testing it from the ARROW_ROOT/python directory. Works fine now!

Also, thanks for the updated steps @edponce

Atharex closed this as completed Jun 7, 2021