
C++ Development

System setup

Arrow uses CMake as a build configuration system. We recommend building out-of-source. If you are not familiar with this terminology:

  • In-source build: cmake is invoked directly from the cpp directory. This can be inflexible when you wish to maintain multiple build environments (e.g. one for debug builds and another for release builds)
  • Out-of-source build: cmake is invoked from another directory, creating an isolated build environment that does not interact with any other build environment. For example, you could create cpp/build-debug and invoke cmake $CMAKE_ARGS .. from this directory

Building requires:

  • A C++11-enabled compiler. On Linux, gcc 4.8 and higher should be sufficient. For Windows, at least Visual Studio 2015 is required.
  • CMake 3.2 or higher
  • Boost 1.58 or higher, though some unit tests require 1.64 or newer.
  • bison and flex (only needed to build Apache Thrift, an Apache Parquet dependency, from source)

Running the unit tests using ctest requires:

  • python

On Ubuntu/Debian you can install the requirements with:

sudo apt-get install \
     autoconf \
     build-essential \
     cmake \
     libboost-dev \
     libboost-filesystem-dev \
     libboost-regex-dev \
     libboost-system-dev \
     python \
     bison \
     flex

On Alpine Linux:

apk add autoconf \
        bash \
        boost-dev \
        cmake \
        g++ \
        gcc \
        make

On macOS, you can use Homebrew.

git clone https://github.com/apache/arrow.git
cd arrow
brew update && brew bundle --file=cpp/Brewfile

On MSYS2:

pacman --sync --refresh --noconfirm \
  ccache \
  git \
  mingw-w64-${MSYSTEM_CARCH}-boost \
  mingw-w64-${MSYSTEM_CARCH}-brotli \
  mingw-w64-${MSYSTEM_CARCH}-cmake \
  mingw-w64-${MSYSTEM_CARCH}-double-conversion \
  mingw-w64-${MSYSTEM_CARCH}-flatbuffers \
  mingw-w64-${MSYSTEM_CARCH}-gcc \
  mingw-w64-${MSYSTEM_CARCH}-gflags \
  mingw-w64-${MSYSTEM_CARCH}-glog \
  mingw-w64-${MSYSTEM_CARCH}-gtest \
  mingw-w64-${MSYSTEM_CARCH}-lz4 \
  mingw-w64-${MSYSTEM_CARCH}-protobuf \
  mingw-w64-${MSYSTEM_CARCH}-python3-numpy \
  mingw-w64-${MSYSTEM_CARCH}-rapidjson \
  mingw-w64-${MSYSTEM_CARCH}-snappy \
  mingw-w64-${MSYSTEM_CARCH}-thrift \
  mingw-w64-${MSYSTEM_CARCH}-uriparser \
  mingw-w64-${MSYSTEM_CARCH}-zlib \
  mingw-w64-${MSYSTEM_CARCH}-zstd

Building

The build system uses CMAKE_BUILD_TYPE=release by default, so if this argument is omitted then a release build will be produced.

Note

You need more options to build on Windows. See :ref:`developers-cpp-windows` for details.

Minimal release build:

git clone https://github.com/apache/arrow.git
cd arrow/cpp
mkdir release
cd release
cmake -DARROW_BUILD_TESTS=ON ..
make unittest

Minimal debug build:

git clone https://github.com/apache/arrow.git
cd arrow/cpp
mkdir debug
cd debug
cmake -DCMAKE_BUILD_TYPE=Debug -DARROW_BUILD_TESTS=ON ..
make unittest

If you do not need to build the test suite, you can omit the ARROW_BUILD_TESTS option (the default is not to build the unit tests).

On some Linux distributions, running the test suite might require setting an explicit locale. If you see any locale-related errors, try setting the environment variable (which requires the locales package or equivalent):

export LC_ALL="en_US.UTF-8"

Faster builds with Ninja

Many contributors use the Ninja build system to get faster builds. It especially speeds up incremental builds. To use ninja, pass -GNinja when calling cmake and then use the ninja command instead of make.

Optional Components

By default, the C++ build system creates a fairly minimal build. We have several optional system components which you can opt into building by passing boolean flags to cmake.

  • -DARROW_CUDA=ON: CUDA integration for GPU development. Depends on NVIDIA CUDA toolkit. The CUDA toolchain used to build the library can be customized by using the $CUDA_HOME environment variable.
  • -DARROW_FLIGHT=ON: Arrow Flight RPC system, which depends at least on gRPC
  • -DARROW_GANDIVA=ON: Gandiva expression compiler, depends on LLVM, Protocol Buffers, and re2
  • -DARROW_GANDIVA_JAVA=ON: Gandiva JNI bindings for Java
  • -DARROW_HDFS=ON: Arrow integration with libhdfs for accessing the Hadoop Filesystem
  • -DARROW_HIVESERVER2=ON: Client library for HiveServer2 database protocol
  • -DARROW_ORC=ON: Arrow integration with Apache ORC
  • -DARROW_PARQUET=ON: Apache Parquet libraries and Arrow integration
  • -DARROW_PLASMA=ON: Plasma Shared Memory Object Store
  • -DARROW_PLASMA_JAVA_CLIENT=ON: Build Java client for Plasma
  • -DARROW_PYTHON=ON: Arrow Python C++ integration library (required for building pyarrow). This library must be built against the same Python version for which you are building pyarrow, e.g. Python 2.7 or Python 3.6. NumPy must also be installed.

Some features of the core Arrow shared library can be switched off for improved build times if they are not required for your application:

  • -DARROW_COMPUTE=ON: build the in-memory analytics module
  • -DARROW_IPC=ON: build the IPC extensions (requiring Flatbuffers)

CMake version requirements

While we support CMake 3.2 and higher, some features require a newer version of CMake:

  • Building the benchmarks requires 3.6 or higher
  • Building zstd from source requires 3.7 or higher
  • Building Gandiva JNI bindings requires 3.11 or higher

LLVM and Clang Tools

We are currently using LLVM 7 for library builds and for other developer tools such as code formatting with clang-format. LLVM can be installed via most modern package managers (apt, yum, conda, Homebrew, chocolatey).

Build Dependency Management

The build system supports a number of third-party dependencies:

  • BOOST: for cross-platform support
  • BROTLI: for data compression
  • double-conversion: for text-to-numeric conversions
  • Snappy: for data compression
  • gflags: for command line utilities (formerly Googleflags)
  • glog: for logging
  • Thrift: Apache Thrift, for data serialization
  • Protobuf: Google Protocol Buffers, for data serialization
  • GTEST: Googletest, for testing
  • benchmark: Google benchmark, for testing
  • RapidJSON: for data serialization
  • Flatbuffers: for data serialization
  • ZLIB: for data compression
  • BZip2: for data compression
  • LZ4: for data compression
  • ZSTD: for data compression
  • RE2: for regular expressions
  • gRPC: for remote procedure calls
  • c-ares: a dependency of gRPC
  • LLVM: a dependency of Gandiva

The CMake option ARROW_DEPENDENCY_SOURCE is a global option that instructs the build system how to resolve each dependency. There are a few options:

  • AUTO: try to find the package in the system default locations, and build it from source if not found
  • BUNDLED: build the dependency automatically from source
  • SYSTEM: find the dependency in system paths using CMake's built-in find_package function, or using pkg-config for packages that do not support find_package
  • BREW: use Homebrew default paths as an alternative SYSTEM path
  • CONDA: use $CONDA_PREFIX as an alternative SYSTEM path

The default method is AUTO unless you are developing within an active conda environment (detected by presence of the $CONDA_PREFIX environment variable), in which case it is CONDA.

Individual Dependency Resolution

While -DARROW_DEPENDENCY_SOURCE=$SOURCE sets a global default for all packages, the resolution strategy can be overridden for individual packages by setting -D$PACKAGE_NAME_SOURCE=... For example, to build Protocol Buffers from source, set

-DProtobuf_SOURCE=BUNDLED

This variable is unfortunately case-sensitive; the name used for each package is listed above, but the most up-to-date listing can be found in cpp/cmake_modules/ThirdpartyToolchain.cmake.

Bundled Dependency Versions

When using the BUNDLED method to build a dependency from source, the version number from cpp/thirdparty/versions.txt is used. There is also a dependency source downloader script (see below), which can be used to set up offline builds.

Boost-related Options

We depend on some Boost C++ libraries for cross-platform support. In most cases, the Boost version available from your package manager is new enough, and the build system will find it automatically. If you have Boost installed in a non-standard location, you can specify it by passing -DBOOST_ROOT=$MY_BOOST_ROOT or setting the BOOST_ROOT environment variable.

Offline Builds

If you do not use the above variables to direct the Arrow build system to preinstalled dependencies, they will be built automatically by the Arrow build system. The source archive for each dependency will be downloaded via the internet, which can cause issues in environments with limited access to the internet.

To enable offline builds, you can download the source artifacts yourself and use environment variables of the form ARROW_$LIBRARY_URL to direct the build system to read from a local file rather than accessing the internet.

To make this easier for you, we have prepared a script thirdparty/download_dependencies.sh which will download the correct version of each dependency to a directory of your choosing. It will print a list of bash-style environment variable statements at the end to use for your build script.

# Download tarballs into $HOME/arrow-thirdparty
$ ./thirdparty/download_dependencies.sh $HOME/arrow-thirdparty

You can then invoke CMake to create the build directory, and it will use the declared environment variables pointing to the downloaded archives instead of downloading them again (which otherwise happens once per build directory).

General C++ Development

This section provides information for developers who wish to contribute to the C++ codebase.

Note

Since most of the project's developers work on Linux or macOS, not all features or developer tools are uniformly supported on Windows. If you are on Windows, have a look at :ref:`developers-cpp-windows`.

Compiler warning levels

The BUILD_WARNING_LEVEL CMake option switches between sets of predetermined compiler warning levels that we use for code tidiness. For release builds, the default warning level is PRODUCTION, while for debug builds the default is CHECKIN.

When using CHECKIN for debug builds, -Werror is added when using gcc and clang, causing build failures for any warning; with MSVC, /WX is set, having the same effect.

Code Style, Linting, and CI

This project follows Google's C++ Style Guide with minor exceptions:

  • We relax the line length restriction to 90 characters.
  • We use the NULLPTR macro in header files (instead of nullptr) defined in src/arrow/util/macros.h to support building C++/CLI (ARROW-1134)

Our continuous integration builds in Travis CI and Appveyor run the unit test suites on a variety of platforms and configurations, including using valgrind to check for memory leaks or bad memory accesses. In addition, the codebase is subjected to a number of code style and code cleanliness checks.

In order to have a passing CI build, your modified git branch must pass the following checks:

  • C++ builds with the project's active version of clang without compiler warnings with -DBUILD_WARNING_LEVEL=CHECKIN. Note that there are classes of warnings (such as -Wdocumentation, see more on this below) that are not caught by gcc.
  • C++ unit test suite with valgrind enabled, use -DARROW_TEST_MEMCHECK=ON when invoking CMake
  • Passes cpplint checks, checked with make lint
  • Conforms to clang-format style, checked with make check-format
  • Passes C++/CLI header file checks, invoked with cpp/build-support/lint_cpp_cli.py cpp/src
  • CMake files pass style checks, can be fixed by running run-cmake-format.py from the root of the repository. This requires Python 3 and cmake_format (note: this currently does not work on Windows)

In order to account for variations in the behavior of clang-format between major versions of LLVM, we pin the version of clang-format used (currently LLVM 7).

Depending on how you installed clang-format, the build system may not be able to find it. You can provide an explicit path to your LLVM installation (or the root path for the clang tools) with the environment variable $CLANG_TOOLS_PATH or by passing -DClangTools_PATH=$PATH_TO_CLANG_TOOLS when invoking CMake.

To make linting more reproducible for everyone, we provide a docker-compose target that is executable from the root of the repository:

docker-compose run lint

See :ref:`integration` for more information about the project's docker-compose configuration.

API Documentation

We use Doxygen style comments (///) in header files for comments that we wish to show up in API documentation for classes and functions.

When using clang and building with -DBUILD_WARNING_LEVEL=CHECKIN, the -Wdocumentation flag is used, which checks for common documentation inconsistencies, like documenting some, but not all, function parameters with \param. See the LLVM documentation warnings section for more about this.

While we publish the API documentation as part of the main Sphinx-based documentation site, you can also build the C++ API documentation anytime using Doxygen. Run the following command from the cpp/apidoc directory:

doxygen Doxyfile

This requires Doxygen to be installed.

Modular Build Targets

Since there are several major parts of the C++ project, we have provided modular CMake targets for building each library component, group of unit tests and benchmarks, and their dependencies:

  • make arrow for Arrow core libraries
  • make parquet for Parquet libraries
  • make gandiva for Gandiva (LLVM expression compiler) libraries
  • make plasma for Plasma libraries, server

To build the unit tests or benchmarks, add -tests or -benchmarks to the target name. So make arrow-tests will build the Arrow core unit tests. Using the -all target, e.g. parquet-all, will build everything.

If you wish to only build and install one or more project subcomponents, we have provided the CMake option ARROW_OPTIONAL_INSTALL to only install targets that have been built. For example, if you only wish to build the Parquet libraries, its tests, and its dependencies, you can run:

cmake .. -DARROW_PARQUET=ON \
      -DARROW_OPTIONAL_INSTALL=ON \
      -DARROW_BUILD_TESTS=ON
make parquet
make install

If you omit an explicit target when invoking make, all targets will be built.

Benchmarking

Follow the directions for simple build except run cmake with the ARROW_BUILD_BENCHMARKS parameter set to ON:

cmake -DARROW_BUILD_TESTS=ON -DARROW_BUILD_BENCHMARKS=ON ..

and instead of make unittest run either make; ctest to run both unit tests and benchmarks or make benchmark to run only the benchmarks. Benchmark logs will be placed in the build directory under build/benchmark-logs.

You can also invoke a single benchmark executable directly:

./release/arrow-builder-benchmark

The build system uses CMAKE_BUILD_TYPE=release by default, which enables compiler optimizations. It is also recommended to disable CPU throttling and hardware features such as "Turbo Boost" to obtain more consistent and comparable benchmark results.

Testing with LLVM AddressSanitizer

To use AddressSanitizer (ASAN) to find bad memory accesses or leaks with LLVM, pass -DARROW_USE_ASAN=ON when building. You must use clang to compile with ASAN, and ARROW_USE_ASAN is mutually exclusive with the valgrind option ARROW_TEST_MEMCHECK.

Fuzz testing with libfuzzer

Fuzzers can help find unhandled exceptions and problems with untrusted input that may lead to crashes, security issues, and undefined behavior. They do this by generating random input data and observing the behavior of the executed code. Building the fuzzer code requires LLVM (GCC-based compilers won't work). You can build the fuzzers as follows:

export CC=clang
export CXX=clang++
cmake -DARROW_FUZZING=ON -DARROW_USE_ASAN=ON -DCMAKE_BUILD_TYPE=RelWithDebInfo ..
make

ARROW_FUZZING will enable building of fuzzer executables as well as enable the addition of coverage helpers via ARROW_USE_COVERAGE, so that the fuzzer can observe the program execution.

It is also wise to enable some sanitizers like ARROW_USE_ASAN (see above), which activates the address sanitizer. This way, we ensure that bad memory operations provoked by the fuzzer will be found early. You may also enable other sanitizers as well. Just keep in mind that some of them do not work together and some may result in very long execution times, which will slow down the fuzzing procedure.

We use the RelWithDebInfo build type, which is optimized like Release but contains debug information. Debug alone would be too slow to get proper fuzzing results, while Release would make it impossible to get proper tracebacks. Also, some bugs might (though hopefully not) be specific to the release build due to misoptimization.

Now you can start one of the fuzzers, e.g.:

./relwithdebinfo/arrow-ipc-fuzzing-test corpus

This will try to find a malformed input that crashes the payload. A corpus of interesting inputs will be stored in the corpus directory. You can save and share it with others, or even pre-fill it with files to give the fuzzer a warm start. Apache provides a test corpus at https://github.com/apache/arrow-testing. If a crash is found, the program shows the stack trace as well as the input data; the input data is also written to a file named crash-<some id>. After a problem is found this way, it should be reported and fixed. Usually the fuzzing process cannot be continued until the fix is applied, since the fuzzer will typically converge on the same problem again. To debug the underlying issue, you can use GDB:

env ASAN_OPTIONS=abort_on_error=1 gdb -ex r --args ./relwithdebinfo/arrow-ipc-fuzzing-test crash-<some id>

For more options, use:

./relwithdebinfo/arrow-ipc-fuzzing-test -help=1

or visit the libFuzzer documentation.

If you build fuzzers with ASAN, you need to set the ASAN_SYMBOLIZER_PATH environment variable to the absolute path of llvm-symbolizer, which is a tool that ships with LLVM.

export ASAN_SYMBOLIZER_PATH=$(type -p llvm-symbolizer)

Note that some fuzzer builds currently reject paths with a version qualifier (like llvm-symbolizer-5.0). To overcome this, set an appropriate symlink (here, when using LLVM 5.0):

ln -sf /usr/bin/llvm-symbolizer-5.0 /usr/bin/llvm-symbolizer

There are some problems that may occur during the compilation process:

  • libfuzzer was not distributed with your LLVM: ld: file not found: .../libLLVMFuzzer.a
  • your LLVM is too old: clang: error: unsupported argument 'fuzzer' to option 'fsanitize='

Extra debugging help

If you use the CMake option -DARROW_EXTRA_ERROR_CONTEXT=ON it will compile the libraries with extra debugging information on error checks inside the RETURN_NOT_OK macro. In unit tests with ASSERT_OK, this will yield error outputs like:

../src/arrow/ipc/ipc-read-write-test.cc:609: Failure
Failed
../src/arrow/ipc/metadata-internal.cc:508 code: TypeToFlatbuffer(fbb, *field.type(), &children, &layout, &type_enum, dictionary_memo, &type_offset)
../src/arrow/ipc/metadata-internal.cc:598 code: FieldToFlatbuffer(fbb, *schema.field(i), dictionary_memo, &offset)
../src/arrow/ipc/metadata-internal.cc:651 code: SchemaToFlatbuffer(fbb, schema, dictionary_memo, &fb_schema)
../src/arrow/ipc/writer.cc:697 code: WriteSchemaMessage(schema_, dictionary_memo_, &schema_fb)
../src/arrow/ipc/writer.cc:730 code: WriteSchema()
../src/arrow/ipc/writer.cc:755 code: schema_writer.Write(&dictionaries_)
../src/arrow/ipc/writer.cc:778 code: CheckStarted()
../src/arrow/ipc/ipc-read-write-test.cc:574 code: writer->WriteRecordBatch(batch)
NotImplemented: Unable to convert type: decimal(19, 4)

Deprecations and API Changes

We use the compiler definition ARROW_NO_DEPRECATED_API to disable APIs that have been deprecated. It is good practice to compile third-party applications with this flag to proactively catch and account for API changes.

Cleaning includes with include-what-you-use (IWYU)

We occasionally use Google's include-what-you-use tool, also known as IWYU, to remove unnecessary includes. Since setting up IWYU can be a bit tedious, we provide a docker-compose target for running it on the C++ codebase:

make -f Makefile.docker build-iwyu
docker-compose run lint

Checking for ABI and API stability

To build ABI compliance reports, you need to install the two tools abi-dumper and abi-compliance-checker.

Build Arrow C++ in Debug mode; alternatively, you can use -Og, which also builds with the necessary symbols but includes a bit of code optimization. Once the build has finished, you can generate ABI reports using:

abi-dumper -lver 9 debug/libarrow.so -o ABI-9.dump

The above version number is freely selectable. Since we want to compare versions, you should now git checkout the version you want to compare against and re-run the above command using a different version number. Once both reports are generated, you can build a comparison report using:

abi-compliance-checker -l libarrow -d1 ABI-9.dump -d2 ABI-10.dump

The report is then generated in compat_reports/libarrow as an HTML page.

Debugging with Xcode on macOS

Xcode is the IDE provided with macOS and can be used to develop and debug Arrow by generating an Xcode project:

cd cpp
mkdir xcode-build
cd xcode-build
cmake .. -G Xcode -DARROW_BUILD_TESTS=ON
open arrow.xcodeproj

This will generate a project and open it in the Xcode app. As an alternative, the command xcodebuild will perform a command-line build using the generated project. It is recommended to use the "Automatically Create Schemes" option when first launching the project. Selecting an auto-generated scheme will allow you to build and run a unittest with breakpoints enabled.

Developing on Windows

As on Linux and macOS, we have worked to enable builds to work "out of the box" with CMake for a reasonably large subset of the project.

System Setup

Microsoft provides the free Visual Studio Community edition. When doing development in the shell, you must initialize the development environment.

For Visual Studio 2015, execute the following batch script:

"C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\vcvarsall.bat" amd64

For Visual Studio 2017, the script is:

"C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\Common7\Tools\VsDevCmd.bat" -arch=amd64

One can configure a console emulator like cmder to automatically launch this when starting a new development console.

Using conda-forge for build dependencies

Miniconda is a minimal Python distribution including the conda package manager. Some members of the Apache Arrow community participate in the maintenance of conda-forge, a community-maintained cross-platform package repository for conda.

To use conda-forge for your C++ build dependencies on Windows, first download and install a 64-bit distribution from the Miniconda homepage.

To configure conda to use the conda-forge channel by default, launch a command prompt (cmd.exe) and run the command:

conda config --add channels conda-forge

Now, you can bootstrap a build environment (call from the root directory of the Arrow codebase):

conda create -y -n arrow-dev --file=ci\conda_env_cpp.yml

Then "activate" this conda environment with:

activate arrow-dev

If the environment has been activated, the Arrow build system will automatically see the %CONDA_PREFIX% environment variable and use that for resolving the build dependencies. This is equivalent to setting

-DARROW_DEPENDENCY_SOURCE=SYSTEM ^
-DARROW_PACKAGE_PREFIX=%CONDA_PREFIX%\Library

Note that these packages are only supported for release builds. If you intend to use -DCMAKE_BUILD_TYPE=debug then you must build the packages from source.

Note

If you run into any problems using conda packages for dependencies, a very common cause is mixing packages from the defaults channel with those from conda-forge. You can examine the installed packages in your environment (and their origin) with conda list.

Building using Visual Studio (MSVC) Solution Files

Change working directory in cmd.exe to the root directory of Arrow and do an out of source build by generating a MSVC solution:

cd cpp
mkdir build
cd build
cmake .. -G "Visual Studio 14 2015 Win64" ^
      -DARROW_BUILD_TESTS=ON
cmake --build . --config Release

Building with Ninja and clcache

The Ninja build system offers better build parallelization, and the optional clcache compiler cache keeps track of past compilations to avoid running them over and over again (in a way similar to the Unix-specific ccache).

First activate your conda build environment, then install those utilities:

activate arrow-dev
conda install -c conda-forge ninja
pip install git+https://github.com/frerich/clcache.git

Change working directory in cmd.exe to the root directory of Arrow and do an out of source build by generating Ninja files:

cd cpp
mkdir build
cd build
cmake -G "Ninja" -DARROW_BUILD_TESTS=ON ^
      -DGTest_SOURCE=BUNDLED ..
cmake --build . --config Release

Building with NMake

Change working directory in cmd.exe to the root directory of Arrow and do an out of source build using nmake:

cd cpp
mkdir build
cd build
cmake -G "NMake Makefiles" ..
nmake

Building on MSYS2

You can build from an MSYS2 terminal, or from a cmd.exe or PowerShell terminal.

On MSYS2 terminal:

cd cpp
mkdir build
cd build
cmake -G "MSYS Makefiles" ..
make

On cmd.exe or PowerShell terminal, you can use the following batch file:

setlocal

REM For 64bit
set MINGW_PACKAGE_PREFIX=mingw-w64-x86_64
set MINGW_PREFIX=c:\msys64\mingw64
set MSYSTEM=MINGW64

set PATH=%MINGW_PREFIX%\bin;c:\msys64\usr\bin;%PATH%

rmdir /S /Q cpp\build
mkdir cpp\build
pushd cpp\build
cmake -G "MSYS Makefiles" .. || exit /B
make || exit /B
popd

Debug builds

To build a Debug version of Arrow, you need a Debug version of Boost installed. It is recommended to configure the CMake build with the following variables for a Debug build:

  • -DARROW_BOOST_USE_SHARED=OFF: enables static linking with boost debug libs and simplifies run-time loading of 3rd parties
  • -DBOOST_ROOT: sets the root directory of boost libs. (Optional)
  • -DBOOST_LIBRARYDIR: sets the directory with boost lib files. (Optional)

The command line to build Arrow in Debug will look something like this:

cd cpp
mkdir build
cd build
cmake .. -G "Visual Studio 14 2015 Win64" ^
      -DARROW_BOOST_USE_SHARED=OFF ^
      -DCMAKE_BUILD_TYPE=Debug ^
      -DBOOST_ROOT=C:/local/boost_1_63_0  ^
      -DBOOST_LIBRARYDIR=C:/local/boost_1_63_0/lib64-msvc-14.0
cmake --build . --config Debug

Windows dependency resolution issues

Because Windows uses .lib files for both static and dynamic linking of dependencies, a static library may sometimes be named differently, like %PACKAGE%_static.lib, to distinguish itself. If you are statically linking some dependencies, we provide the following options:

  • -DBROTLI_MSVC_STATIC_LIB_SUFFIX=%BROTLI_SUFFIX%
  • -DSNAPPY_MSVC_STATIC_LIB_SUFFIX=%SNAPPY_SUFFIX%
  • -DLZ4_MSVC_STATIC_LIB_SUFFIX=%LZ4_SUFFIX%
  • -DZSTD_MSVC_STATIC_LIB_SUFFIX=%ZSTD_SUFFIX%

To get the latest build instructions, you can reference ci/appveyor-build.bat, which is used by automated Appveyor builds.

Statically linking to Arrow on Windows

The Arrow headers on Windows static library builds (enabled by the CMake option ARROW_BUILD_STATIC) use the preprocessor macro ARROW_STATIC to suppress dllimport/dllexport marking of symbols. Projects that statically link against Arrow on Windows additionally need this definition. The Unix builds do not use the macro.

Replicating Appveyor Builds

For developers more familiar with Linux who need to replicate a failing Appveyor build, here are some rough notes on replicating the Static_Crt_Build job (make unittest will probably still fail, but many unit tests can be built with their individual make targets).

  1. Microsoft offers trial VMs for Windows with Microsoft Visual Studio. Download and install a version.
  2. Run the VM and install CMake and Miniconda or Anaconda (these instructions assume Anaconda).
  3. Download pre-built Boost debug binaries and install them (run from the command prompt opened by "Developer Command Prompt for MSVC 2017"):
cd $EXTRACT_BOOST_DIRECTORY
.\bootstrap.bat
@rem This is for the static libraries needed for Static_Crt_Build in appveyor
.\b2 link=static --with-filesystem --with-regex --with-system install
@rem this should put libraries and headers in c:\Boost
  4. Activate Anaconda/Miniconda:
@rem this might differ for miniconda
C:\Users\User\Anaconda3\Scripts\activate
  5. Clone and change directories to the arrow source code (you might need to install git).
  6. Set up environment variables:
@rem Change the build type based on which appveyor job you want.
SET JOB=Static_Crt_Build
SET GENERATOR=Ninja
SET APPVEYOR_BUILD_WORKER_IMAGE=Visual Studio 2017
SET USE_CLCACHE=false
SET ARROW_BUILD_GANDIVA=OFF
SET ARROW_LLVM_VERSION=7.0.*
SET PYTHON=3.6
SET ARCH=64
SET PATH=C:\Users\User\Anaconda3;C:\Users\User\Anaconda3\Scripts;C:\Users\User\Anaconda3\Library\bin;%PATH%
SET BOOST_LIBRARYDIR=C:\Boost\lib
SET BOOST_ROOT=C:\Boost
  7. Run the appveyor scripts:
.\ci\appveyor-install.bat
@rem this might fail, but at this point most unit tests should be buildable via their individual targets
@rem see the next line for an example.
.\ci\appveyor-build.bat
cmake --build . --config Release --target arrow-compute-hash-test

Apache Parquet Development

To build the C++ libraries for Apache Parquet, add the flag -DARROW_PARQUET=ON when invoking CMake. The Parquet libraries and unit tests can be built with the parquet make target:

make parquet

Running ctest -L unittest will run all built C++ unit tests, while ctest -L parquet will run only the Parquet unit tests. The unit tests depend on the PARQUET_TEST_DATA environment variable, which points at data from a git submodule of the repository https://github.com/apache/parquet-testing:

git submodule update --init
export PARQUET_TEST_DATA=$ARROW_ROOT/cpp/submodules/parquet-testing/data

Here $ARROW_ROOT is the absolute path to the Arrow codebase.

Arrow Flight RPC

In addition to the Arrow dependencies, Flight requires:

  • gRPC (>= 1.14, roughly)
  • Protobuf (>= 3.6, earlier versions may work)
  • c-ares (used by gRPC)

By default, Arrow will try to download and build these dependencies when building Flight.

The optional flight libraries and tests can be built by passing -DARROW_FLIGHT=ON.

cmake .. -DARROW_FLIGHT=ON -DARROW_BUILD_TESTS=ON
make

You can also use existing installations of the extra dependencies. When building, set the environment variables gRPC_ROOT and/or Protobuf_ROOT and/or c-ares_ROOT.

We are developing against recent versions of gRPC. The grpc-cpp package available from https://conda-forge.org/ is one reliable way to obtain gRPC in a cross-platform way. You may try using system libraries for gRPC and Protobuf, but these are likely to be too old. On macOS, you can try Homebrew:

brew install grpc

Development Conventions

This section provides some information about some of the abstractions and development approaches we use to solve problems common to many parts of the C++ project.

Memory Pools

We provide a default memory pool with arrow::default_memory_pool(). As a matter of convenience, some of the array builder classes have constructors which use the default pool without explicitly passing it. You can disable these constructors in your application (so that you are accounting properly for all memory allocations) by defining ARROW_NO_DEFAULT_MEMORY_POOL.

Header files

We use the .h extension for C++ header files. Any header file name not containing internal is considered to be a public header, and will be automatically installed by the build.

Error Handling and Exceptions

For error handling, we use arrow::Status values instead of throwing C++ exceptions. Since the Arrow C++ libraries are intended to be useful as a component in larger C++ projects, using Status objects can help with good code hygiene by making explicit when a function is expected to be able to fail.

For expressing invariants and "cannot fail" errors, we use DCHECK macros defined in arrow/util/logging.h. These checks are disabled in release builds and are intended to catch internal development errors, particularly when refactoring. These macros are not to be included in any public header files.

Since we do not use exceptions, we avoid doing expensive work in object constructors. Objects that are expensive to construct may often have private constructors, with public static factory methods that return Status.

There are a number of object constructors, like arrow::Schema and arrow::RecordBatch where larger STL container objects like std::vector may be created. While it is possible for std::bad_alloc to be thrown in these constructors, the circumstances where they would are somewhat esoteric, and it is likely that an application would have encountered other more serious problems prior to having std::bad_alloc thrown in a constructor.
