Problem Statement
Currently, paimon-cpp uses a fixed BUNDLED approach for all third-party dependencies through CMake's ExternalProject_Add. While this ensures build reproducibility, it has several limitations:
Current Limitations
- No choice in dependency sources: All dependencies are downloaded and built from source, even when system libraries are already available
- Long build times: Building all dependencies (Arrow, ORC, Protobuf, compression libraries, etc.) from source can take significant time
- No reuse of existing installations: Users cannot leverage pre-installed libraries from system package managers (apt, yum, brew, conda, vcpkg)
- Inflexible for different environments: Different deployment scenarios (development, CI, production) may benefit from different dependency management strategies
- No per-dependency control: Cannot selectively choose BUNDLED for some dependencies and SYSTEM for others
Example Use Cases That Are Currently Difficult
- Development: Developer has Arrow 17.0.0 already installed system-wide, but still needs to rebuild it
- CI/CD: Build containers with pre-installed dependencies to speed up CI pipelines
- Custom builds: Organizations with specific library versions or patches in non-standard locations
- Conda environments: Users working within conda environments want to use conda-provided libraries
Proposed Solution
Implement a flexible dependency management system similar to Apache Arrow C++, which provides:
1. Global Dependency Source Control
Add a PAIMON_DEPENDENCY_SOURCE option:
-DPAIMON_DEPENDENCY_SOURCE=<AUTO|BUNDLED|SYSTEM|CONDA>
- AUTO (default): Try to find system libraries first, fall back to bundled build if not found
- BUNDLED: Always download and build dependencies from source (current behavior)
- SYSTEM: Use only system-installed libraries (fail if not found)
- CONDA: Use libraries from the $CONDA_PREFIX environment
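As a rough illustration of how such an option could be declared and validated, here is a minimal sketch. Only PAIMON_DEPENDENCY_SOURCE and PAIMON_ACTUAL_DEPENDENCY_SOURCE come from this proposal; the helper variable names are made up, and treating CONDA as "SYSTEM plus $CONDA_PREFIX on the search path" is an assumption about one possible mapping, not a settled design.

```cmake
# Sketch only: declare the global option and reject unknown values.
set(PAIMON_DEPENDENCY_SOURCE "AUTO" CACHE STRING
    "Source of third-party dependencies: AUTO, BUNDLED, SYSTEM or CONDA")
set(_paimon_valid_sources AUTO BUNDLED SYSTEM CONDA)  # hypothetical helper variable
set_property(CACHE PAIMON_DEPENDENCY_SOURCE PROPERTY STRINGS ${_paimon_valid_sources})

list(FIND _paimon_valid_sources "${PAIMON_DEPENDENCY_SOURCE}" _paimon_source_index)
if(_paimon_source_index EQUAL -1)
  message(FATAL_ERROR
          "Unknown PAIMON_DEPENDENCY_SOURCE='${PAIMON_DEPENDENCY_SOURCE}', "
          "expected one of: ${_paimon_valid_sources}")
endif()

# Assumption: CONDA behaves like SYSTEM, with $CONDA_PREFIX added to the
# prefixes that find_package() searches.
if(PAIMON_DEPENDENCY_SOURCE STREQUAL "CONDA")
  if("$ENV{CONDA_PREFIX}" STREQUAL "")
    message(FATAL_ERROR "PAIMON_DEPENDENCY_SOURCE=CONDA requires CONDA_PREFIX to be set")
  endif()
  list(APPEND CMAKE_PREFIX_PATH "$ENV{CONDA_PREFIX}")
  set(PAIMON_ACTUAL_DEPENDENCY_SOURCE "SYSTEM")
else()
  set(PAIMON_ACTUAL_DEPENDENCY_SOURCE "${PAIMON_DEPENDENCY_SOURCE}")
endif()
```

Keeping the user-facing value and the "actual" value in separate variables means later code only has to handle AUTO, BUNDLED and SYSTEM.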
2. Per-Dependency Source Control
Allow users to override individual dependencies:
-DArrow_SOURCE=SYSTEM
-DArrow_ROOT=/usr/local/arrow-17.0.0
-Dzstd_SOURCE=BUNDLED
-Dglog_SOURCE=AUTO
3. Unified Path Prefix
Support a common prefix for all unspecified dependencies:
-DPAIMON_PACKAGE_PREFIX=/opt/mylibs
This automatically sets Arrow_ROOT, zstd_ROOT, etc. to /opt/mylibs.
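One way the prefix could be propagated is sketched below. The function name paimon_apply_package_prefix is hypothetical, and the package list in the call is only an example. The sketch relies on find_package() honoring <PackageName>_ROOT variables, which CMake does from version 3.12 on (policy CMP0074).

```cmake
# Hypothetical helper: give every listed package a default <Pkg>_ROOT taken
# from PAIMON_PACKAGE_PREFIX, without overriding a value the user passed.
function(paimon_apply_package_prefix)
  if(NOT PAIMON_PACKAGE_PREFIX)
    return()  # nothing to propagate
  endif()
  foreach(_pkg IN LISTS ARGN)
    if(NOT DEFINED ${_pkg}_ROOT)
      # PARENT_SCOPE makes the variable visible to the caller of this function.
      set(${_pkg}_ROOT "${PAIMON_PACKAGE_PREFIX}" PARENT_SCOPE)
    endif()
  endforeach()
endfunction()

# Example call; an explicit -DArrow_ROOT=... on the command line still wins.
paimon_apply_package_prefix(Arrow zstd glog)
```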
4. Shared vs Static Library Control
-DPAIMON_DEPENDENCY_USE_SHARED=OFF
-DPAIMON_ARROW_USE_SHARED=ON
Implementation Approach
Following Arrow's design pattern:
- Create a resolve_dependency() macro that:
  - Checks the ${DEPENDENCY_NAME}_SOURCE variable
  - Falls back to PAIMON_ACTUAL_DEPENDENCY_SOURCE if not set
  - Calls find_package() for SYSTEM/AUTO or build_dependency() for BUNDLED
- Create Find modules (e.g., FindArrowAlt.cmake) that:
  - Respect the ${PACKAGE}_ROOT CMake variable
  - Search in ${PACKAGE}_ROOT/{include,lib} with NO_DEFAULT_PATH
  - Fall back to system paths if _ROOT is not set
  - Support both shared and static library preferences
- Update ThirdpartyToolchain.cmake:
  - Replace direct build_<dependency>() calls with resolve_dependency()
  - Set default _SOURCE values based on PAIMON_DEPENDENCY_SOURCE
- Maintain backward compatibility:
  - Default to AUTO or BUNDLED to preserve current behavior
  - Existing build commands work without changes
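The control flow of the first item could look roughly like the sketch below. It is not Arrow's actual macro. build_dependency() is stubbed out here so the flow can be read on its own, and the AUTO default for PAIMON_ACTUAL_DEPENDENCY_SOURCE is an assumption for the example.

```cmake
# Stub standing in for the real per-dependency build logic (assumption).
macro(build_dependency DEPENDENCY_NAME)
  message(STATUS "Would build ${DEPENDENCY_NAME} from source (stub)")
endmacro()

# Assumed default so the sketch works when no global source was configured.
if(NOT DEFINED PAIMON_ACTUAL_DEPENDENCY_SOURCE)
  set(PAIMON_ACTUAL_DEPENDENCY_SOURCE "AUTO")
endif()

macro(resolve_dependency DEPENDENCY_NAME)
  # A per-dependency override (e.g. -DArrow_SOURCE=SYSTEM) wins; otherwise
  # fall back to the global setting.
  if("${${DEPENDENCY_NAME}_SOURCE}" STREQUAL "")
    set(${DEPENDENCY_NAME}_SOURCE "${PAIMON_ACTUAL_DEPENDENCY_SOURCE}")
  endif()

  if("${${DEPENDENCY_NAME}_SOURCE}" STREQUAL "AUTO")
    # Try the system first and quietly fall back to a bundled build.
    find_package(${DEPENDENCY_NAME} QUIET)
    if(NOT ${DEPENDENCY_NAME}_FOUND)
      build_dependency(${DEPENDENCY_NAME})
    endif()
  elseif("${${DEPENDENCY_NAME}_SOURCE}" STREQUAL "SYSTEM")
    # REQUIRED makes configuration fail if the package is missing.
    find_package(${DEPENDENCY_NAME} REQUIRED)
  elseif("${${DEPENDENCY_NAME}_SOURCE}" STREQUAL "BUNDLED")
    build_dependency(${DEPENDENCY_NAME})
  else()
    message(FATAL_ERROR
            "Unknown ${DEPENDENCY_NAME}_SOURCE='${${DEPENDENCY_NAME}_SOURCE}'")
  endif()
endmacro()

# Example use inside ThirdpartyToolchain.cmake:
# resolve_dependency(zstd)
```

A macro rather than a function keeps variables such as zstd_FOUND visible in the calling scope, which matters if later toolchain code checks them.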
Example Usage
Use System Libraries
cmake -B build \
-DPAIMON_DEPENDENCY_SOURCE=SYSTEM \
-DArrow_ROOT=/usr/local \
-Dglog_ROOT=/usr/local \
-Dzstd_ROOT=/usr/local
Mixed Approach (some bundled, some system)
cmake -B build \
-DPAIMON_DEPENDENCY_SOURCE=AUTO \
-DArrow_SOURCE=SYSTEM \
-DArrow_ROOT=/custom/arrow \
-Dzstd_SOURCE=BUNDLED
Conda Environment
cmake -B build \
-DPAIMON_DEPENDENCY_SOURCE=CONDA
Benefits
- Faster builds: Reuse pre-installed libraries, especially for iterative development
- Flexible deployment: Support diverse environments (bare metal, containers, HPC clusters)
- Better CI integration: Cache dependencies across builds
- Ecosystem compatibility: Work seamlessly with conda, vcpkg, conan, system package managers
- Gradual adoption: Users can opt-in to new features without breaking existing builds
- Resource efficiency: Avoid rebuilding large dependencies like Arrow (which itself has many dependencies)
Reference Implementation
Apache Arrow C++ has successfully implemented this pattern:
- ARROW_DEPENDENCY_SOURCE: https://github.com/apache/arrow/blob/main/cpp/cmake_modules/DefineOptions.cmake#L456-L464
- resolve_dependency() macro: https://github.com/apache/arrow/blob/main/cpp/cmake_modules/ThirdpartyToolchain.cmake#L252-L366
- Find modules: https://github.com/apache/arrow/tree/main/cpp/cmake_modules
Questions for Discussion
- Should the default be AUTO (convenient) or BUNDLED (current, most reproducible)?
- Which dependencies should support this first? (Suggestion: Start with Arrow, compression libraries)
- Should we maintain compatibility with older CMake package formats, or require modern targets?
- How to handle transitive dependency conflicts between SYSTEM and BUNDLED libraries?
Implementation Phases
Phase 1: Core infrastructure
- Implement resolve_dependency() macro
- Add PAIMON_DEPENDENCY_SOURCE option
- Support <PACKAGE>_SOURCE and <PACKAGE>_ROOT variables
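To make the <PACKAGE>_ROOT handling concrete, here is a minimal sketch of what a Find module such as FindArrowAlt.cmake might contain. The imported target name ArrowAlt::arrow, the ArrowAlt_* result variables and the choice of arrow/api.h as the probe header are illustrative assumptions. Arrow_ROOT and PAIMON_ARROW_USE_SHARED are the names used in this proposal.

```cmake
# FindArrowAlt.cmake (sketch, not a complete module)

# Pick the library file to look for based on the shared/static preference.
if(PAIMON_ARROW_USE_SHARED)
  set(_arrow_alt_names arrow)
else()
  # A full file name (e.g. libarrow.a) restricts the search to the static library.
  set(_arrow_alt_names
      "${CMAKE_STATIC_LIBRARY_PREFIX}arrow${CMAKE_STATIC_LIBRARY_SUFFIX}")
endif()

if(Arrow_ROOT)
  # Look only under the user-supplied root.
  find_path(ArrowAlt_INCLUDE_DIR NAMES arrow/api.h
            PATHS "${Arrow_ROOT}" PATH_SUFFIXES include NO_DEFAULT_PATH)
  find_library(ArrowAlt_LIBRARY NAMES ${_arrow_alt_names}
               PATHS "${Arrow_ROOT}" PATH_SUFFIXES lib lib64 NO_DEFAULT_PATH)
else()
  # No root given: use CMake's default system search paths.
  find_path(ArrowAlt_INCLUDE_DIR NAMES arrow/api.h)
  find_library(ArrowAlt_LIBRARY NAMES ${_arrow_alt_names})
endif()

include(FindPackageHandleStandardArgs)
find_package_handle_standard_args(ArrowAlt
  REQUIRED_VARS ArrowAlt_LIBRARY ArrowAlt_INCLUDE_DIR)

if(ArrowAlt_FOUND AND NOT TARGET ArrowAlt::arrow)
  # Hypothetical imported target for consumers to link against.
  add_library(ArrowAlt::arrow UNKNOWN IMPORTED)
  set_target_properties(ArrowAlt::arrow PROPERTIES
    IMPORTED_LOCATION "${ArrowAlt_LIBRARY}"
    INTERFACE_INCLUDE_DIRECTORIES "${ArrowAlt_INCLUDE_DIR}")
endif()

mark_as_advanced(ArrowAlt_INCLUDE_DIR ArrowAlt_LIBRARY)
```

Because the results are cached, switching Arrow_ROOT or the shared/static preference on an existing build directory would require clearing ArrowAlt_INCLUDE_DIR and ArrowAlt_LIBRARY first.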
Phase 2: Major dependencies
- Arrow (including Parquet)
- ORC + Protobuf
- Compression libraries (Snappy, zstd, lz4, zlib)
Phase 3: Additional dependencies
- Avro
- glog, fmt, RapidJSON
- TBB
- Testing libraries (GTest)
Phase 4: Advanced features
- Conda/vcpkg integration
- Shared vs static library preferences
- Better error messages and diagnostics
Compatibility
- ✅ Backward compatible: Existing build scripts work unchanged
- ✅ Opt-in: New features are optional
- ✅ No breaking changes: Default behavior can remain BUNDLED initially
Would appreciate feedback from maintainers and community members on this proposal!