[Feature] Flexible Third-Party Dependency Management System #103

@suxiaogang223

Problem Statement

Currently, paimon-cpp uses a fixed BUNDLED approach for all third-party dependencies through CMake's ExternalProject_Add. While this ensures build reproducibility, it has several limitations:

Current Limitations

  1. No choice in dependency sources: All dependencies are downloaded and built from source, even when system libraries are already available
  2. Long build times: Building all dependencies (Arrow, ORC, Protobuf, compression libraries, etc.) from source can take significant time
  3. No reuse of existing installations: Users cannot leverage pre-installed libraries from system package managers (apt, yum, brew, conda, vcpkg)
  4. Inflexible for different environments: Different deployment scenarios (development, CI, production) may benefit from different dependency management strategies
  5. No per-dependency control: Cannot selectively choose BUNDLED for some dependencies and SYSTEM for others

Example Use Cases That Are Currently Difficult

  • Development: A developer already has Arrow 17.0.0 installed system-wide but must still build it from source
  • CI/CD: Build containers that ship pre-installed dependencies cannot use them to speed up CI pipelines
  • Custom builds: Organizations with specific library versions or patches in non-standard locations
  • Conda environments: Users working within conda environments want to use conda-provided libraries

Proposed Solution

Implement a flexible dependency management system similar to Apache Arrow C++, which provides:

1. Global Dependency Source Control

Add a PAIMON_DEPENDENCY_SOURCE option:

-DPAIMON_DEPENDENCY_SOURCE=<AUTO|BUNDLED|SYSTEM|CONDA>
  • AUTO (proposed default, open for discussion below): Try to find system libraries first, fall back to a bundled build if not found
  • BUNDLED: Always download and build dependencies from source (current behavior)
  • SYSTEM: Use only system-installed libraries (fail if not found)
  • CONDA: Use libraries from $CONDA_PREFIX environment
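
A minimal sketch of how the global option could be declared and validated. Only the option name and its four values come from this proposal. The validation, the treatment of CONDA as SYSTEM rooted at $CONDA_PREFIX, and the use of PAIMON_ACTUAL_DEPENDENCY_SOURCE to hold the resolved value are assumptions loosely modeled on Arrow's approach. It assumes cmake_minimum_required(VERSION 3.12) or newer.

```cmake
# Sketch only: declare the global option with a drop-down of allowed values.
set(PAIMON_DEPENDENCY_SOURCE "AUTO" CACHE STRING
    "How to obtain third-party dependencies: AUTO, BUNDLED, SYSTEM, or CONDA")
set_property(CACHE PAIMON_DEPENDENCY_SOURCE
             PROPERTY STRINGS AUTO BUNDLED SYSTEM CONDA)

# Reject typos early instead of silently falling through later.
set(_paimon_valid_sources AUTO BUNDLED SYSTEM CONDA)
if(NOT PAIMON_DEPENDENCY_SOURCE IN_LIST _paimon_valid_sources)
  message(FATAL_ERROR
          "Unknown PAIMON_DEPENDENCY_SOURCE '${PAIMON_DEPENDENCY_SOURCE}'; "
          "expected one of: ${_paimon_valid_sources}")
endif()

# Assumption: CONDA behaves like SYSTEM with $CONDA_PREFIX added to the
# search path, so later code only has to handle AUTO, BUNDLED, and SYSTEM.
if(PAIMON_DEPENDENCY_SOURCE STREQUAL "CONDA")
  if("$ENV{CONDA_PREFIX}" STREQUAL "")
    message(FATAL_ERROR
            "PAIMON_DEPENDENCY_SOURCE=CONDA requires CONDA_PREFIX to be set")
  endif()
  set(PAIMON_ACTUAL_DEPENDENCY_SOURCE "SYSTEM")
  list(APPEND CMAKE_PREFIX_PATH "$ENV{CONDA_PREFIX}")
else()
  set(PAIMON_ACTUAL_DEPENDENCY_SOURCE "${PAIMON_DEPENDENCY_SOURCE}")
endif()
```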

2. Per-Dependency Source Control

Allow users to override individual dependencies:

-DArrow_SOURCE=SYSTEM
-DArrow_ROOT=/usr/local/arrow-17.0.0
-Dzstd_SOURCE=BUNDLED
-Dglog_SOURCE=AUTO
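
One way the per-dependency variables could inherit from the global setting is sketched here. The dependency list is hypothetical and only for illustration; the real list would come from ThirdpartyToolchain.cmake. PAIMON_ACTUAL_DEPENDENCY_SOURCE is assumed to hold the resolved global value.

```cmake
# Hypothetical dependency list, for illustration only.
set(PAIMON_THIRDPARTY_DEPENDENCIES Arrow zstd glog)

# Assumed to be set earlier from PAIMON_DEPENDENCY_SOURCE.
if(NOT DEFINED PAIMON_ACTUAL_DEPENDENCY_SOURCE)
  set(PAIMON_ACTUAL_DEPENDENCY_SOURCE "AUTO")
endif()

foreach(dep IN LISTS PAIMON_THIRDPARTY_DEPENDENCIES)
  # A value passed with -D<dep>_SOURCE=... wins; otherwise inherit the global.
  if("${${dep}_SOURCE}" STREQUAL "")
    set(${dep}_SOURCE "${PAIMON_ACTUAL_DEPENDENCY_SOURCE}")
  endif()
  message(STATUS "${dep}: source=${${dep}_SOURCE} root=${${dep}_ROOT}")
endforeach()
```

With the flags above, this would leave Arrow on SYSTEM and zstd on BUNDLED, while glog keeps its explicit AUTO.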

3. Unified Path Prefix

Support a common prefix for all unspecified dependencies:

-DPAIMON_PACKAGE_PREFIX=/opt/mylibs

This sets Arrow_ROOT, zstd_ROOT, etc. to /opt/mylibs for every dependency whose _ROOT is not given explicitly.
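
A sketch of how the prefix could be propagated, assuming an explicit <Package>_ROOT always takes precedence. The dependency list is hypothetical. Since CMake 3.12 (policy CMP0074), find_package() searches <Package>_ROOT on its own, so setting the variable is enough for config-mode and well-behaved Find modules.

```cmake
set(PAIMON_PACKAGE_PREFIX "" CACHE PATH
    "Common install prefix for dependencies without an explicit <Package>_ROOT")

# Hypothetical dependency list, for illustration only.
set(PAIMON_THIRDPARTY_DEPENDENCIES Arrow zstd glog)

if(NOT "${PAIMON_PACKAGE_PREFIX}" STREQUAL "")
  foreach(dep IN LISTS PAIMON_THIRDPARTY_DEPENDENCIES)
    # Only fill in roots the user did not set with -D<dep>_ROOT=...
    if(NOT DEFINED ${dep}_ROOT)
      set(${dep}_ROOT "${PAIMON_PACKAGE_PREFIX}")
    endif()
  endforeach()
endif()
```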

4. Shared vs Static Library Control

-DPAIMON_DEPENDENCY_USE_SHARED=OFF
-DPAIMON_ARROW_USE_SHARED=ON
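
The following sketch shows one way a Find module could honor these two options together with Arrow_ROOT. It assumes the per-dependency option overrides the global one, that the library is named "arrow", and that the code runs after project() so the platform prefix and suffix variables are populated. The PAIMON_ARROW_LIBRARY result variable is a hypothetical name. The approach is Unix-oriented; on Windows a shared library is linked through an import library, which needs different handling.

```cmake
option(PAIMON_DEPENDENCY_USE_SHARED "Link dependencies as shared libraries" ON)

# Assumption: the per-dependency option defaults to the global one.
if(NOT DEFINED PAIMON_ARROW_USE_SHARED)
  set(PAIMON_ARROW_USE_SHARED ${PAIMON_DEPENDENCY_USE_SHARED})
endif()

# Ask for an exact file name so find_library() cannot pick the other flavor.
if(PAIMON_ARROW_USE_SHARED)
  set(_arrow_lib_name
      "${CMAKE_SHARED_LIBRARY_PREFIX}arrow${CMAKE_SHARED_LIBRARY_SUFFIX}")
else()
  set(_arrow_lib_name
      "${CMAKE_STATIC_LIBRARY_PREFIX}arrow${CMAKE_STATIC_LIBRARY_SUFFIX}")
endif()

if(DEFINED Arrow_ROOT)
  # Search only under the requested root.
  find_library(PAIMON_ARROW_LIBRARY
               NAMES ${_arrow_lib_name}
               PATHS "${Arrow_ROOT}"
               PATH_SUFFIXES lib lib64
               NO_DEFAULT_PATH)
else()
  # No root given: fall back to the default system search paths.
  find_library(PAIMON_ARROW_LIBRARY NAMES ${_arrow_lib_name})
endif()

message(STATUS "Arrow library: ${PAIMON_ARROW_LIBRARY}")
```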

Implementation Approach

Following Arrow's design pattern:

  1. Create a resolve_dependency() macro that:

    • Checks ${DEPENDENCY_NAME}_SOURCE variable
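    • Falls back to PAIMON_ACTUAL_DEPENDENCY_SOURCE (the resolved value of PAIMON_DEPENDENCY_SOURCE) if not set
    • Calls find_package() for SYSTEM and build_dependency() for BUNDLED; for AUTO, tries find_package() first and falls back to build_dependency() if the package is not found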
  2. Create Find modules (e.g., FindArrowAlt.cmake) that:

    • Respect ${PACKAGE}_ROOT CMake variable
    • Search in ${PACKAGE}_ROOT/{include,lib} with NO_DEFAULT_PATH
    • Fall back to system paths if _ROOT is not set
    • Support both shared and static library preferences
  3. Update ThirdpartyToolchain.cmake:

    • Replace direct build_<dependency>() calls with resolve_dependency()
    • Set default _SOURCE values based on PAIMON_DEPENDENCY_SOURCE
  4. Maintain backward compatibility:

    • Keep the default conservative: BUNDLED preserves current behavior exactly, while AUTO changes it only when a system copy is found
    • Existing build commands work without changes
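
The dispatch described in step 1 can be sketched roughly as follows. This is an illustration, not Arrow's or paimon-cpp's actual code: build_dependency() is named in this proposal, but its signature and the stub body here are assumptions standing in for the existing ExternalProject_Add logic. A macro is used rather than a function so that the variables set by find_package() stay visible to the caller.

```cmake
# Stand-in for the existing ExternalProject_Add-based build logic.
function(build_dependency name)
  message(STATUS "Building ${name} from source (bundled)")
endfunction()

# Assumed to be derived earlier from PAIMON_DEPENDENCY_SOURCE.
if(NOT DEFINED PAIMON_ACTUAL_DEPENDENCY_SOURCE)
  set(PAIMON_ACTUAL_DEPENDENCY_SOURCE "AUTO")
endif()

macro(resolve_dependency DEPENDENCY_NAME)
  # Per-dependency setting wins; otherwise inherit the global one.
  if("${${DEPENDENCY_NAME}_SOURCE}" STREQUAL "")
    set(${DEPENDENCY_NAME}_SOURCE "${PAIMON_ACTUAL_DEPENDENCY_SOURCE}")
  endif()

  if(${DEPENDENCY_NAME}_SOURCE STREQUAL "AUTO")
    # Try the system first, then fall back to a bundled build.
    find_package(${DEPENDENCY_NAME} QUIET)
    if(NOT ${DEPENDENCY_NAME}_FOUND)
      build_dependency(${DEPENDENCY_NAME})
    endif()
  elseif(${DEPENDENCY_NAME}_SOURCE STREQUAL "SYSTEM")
    # Fail the configure step if the package is missing.
    find_package(${DEPENDENCY_NAME} REQUIRED)
  elseif(${DEPENDENCY_NAME}_SOURCE STREQUAL "BUNDLED")
    build_dependency(${DEPENDENCY_NAME})
  else()
    message(FATAL_ERROR
            "Unknown ${DEPENDENCY_NAME}_SOURCE: ${${DEPENDENCY_NAME}_SOURCE}")
  endif()
endmacro()
```

ThirdpartyToolchain.cmake would then call, for example, resolve_dependency(Arrow) where it currently calls the Arrow build function directly.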

Example Usage

Use System Libraries

cmake -B build \
  -DPAIMON_DEPENDENCY_SOURCE=SYSTEM \
  -DArrow_ROOT=/usr/local \
  -Dglog_ROOT=/usr/local \
  -Dzstd_ROOT=/usr/local

Mixed Approach (some bundled, some system)

cmake -B build \
  -DPAIMON_DEPENDENCY_SOURCE=AUTO \
  -DArrow_SOURCE=SYSTEM \
  -DArrow_ROOT=/custom/arrow \
  -Dzstd_SOURCE=BUNDLED

Conda Environment

cmake -B build \
  -DPAIMON_DEPENDENCY_SOURCE=CONDA

Benefits

  1. Faster builds: Reuse pre-installed libraries, especially for iterative development
  2. Flexible deployment: Support diverse environments (bare metal, containers, HPC clusters)
  3. Better CI integration: Cache dependencies across builds
  4. Ecosystem compatibility: Work seamlessly with conda, vcpkg, conan, system package managers
  5. Gradual adoption: Users can opt in to new features without breaking existing builds
  6. Resource efficiency: Avoid rebuilding large dependencies like Arrow (which itself has many dependencies)

Reference Implementation

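Apache Arrow C++ has successfully implemented this pattern through its ARROW_DEPENDENCY_SOURCE option and per-package <Package>_SOURCE and <Package>_ROOT variables.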

Questions for Discussion

  1. Should the default be AUTO (convenient) or BUNDLED (current, most reproducible)?
  2. Which dependencies should support this first? (Suggestion: Start with Arrow, compression libraries)
  3. Should we maintain compatibility with older CMake package formats, or require modern targets?
  4. How should we handle transitive dependency conflicts between SYSTEM and BUNDLED libraries?

Implementation Phases

Phase 1: Core infrastructure

  • Implement resolve_dependency() macro
  • Add PAIMON_DEPENDENCY_SOURCE option
  • Support <PACKAGE>_SOURCE and <PACKAGE>_ROOT variables

Phase 2: Major dependencies

  • Arrow (including Parquet)
  • ORC + Protobuf
  • Compression libraries (Snappy, zstd, lz4, zlib)

Phase 3: Additional dependencies

  • Avro
  • glog, fmt, RapidJSON
  • TBB
  • Testing libraries (GTest)

Phase 4: Advanced features

  • Conda/vcpkg integration
  • Shared vs static library preferences
  • Better error messages and diagnostics

Compatibility

  • Backward compatible: Existing build scripts work unchanged
  • Opt-in: New features are optional
  • No breaking changes: Default behavior can remain BUNDLED initially

Would appreciate feedback from maintainers and community members on this proposal!
