[feat](paimon) backport paimon-cpp reader integration chain for 2.1.11-rc01#61125
Backport and integrate the following PR chain into tmp-branch-2.1.11-rc01-paimon-cpp:
- apache#60296
- apache#60676
- apache#60711
- apache#60730
- apache#60795
- apache#60946

Not included: apache#60876
Pull request overview
Backports the paimon-cpp reader integration chain into the release branch, including thirdparty build integration, FE/BE wiring to select the C++ reader, and accompanying test coverage.
Changes:
- Integrate paimon-cpp (git-based) + required thirdparty updates (RapidJSON, pugixml) and Arrow 17 patches into the thirdparty toolchain.
- Add FE session/query option (enable_paimon_cpp_reader) and propagate Paimon table location + split serialization suitable for paimon-cpp.
- Add BE paimon-cpp reader, predicate pushdown converter, Doris-backed paimon file system factory, plus regression/unit tests.
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| thirdparty/vars.sh | Adds paimon-cpp + pugixml thirdparty entries; updates RapidJSON source to match paimon-cpp needs. |
| thirdparty/patches/paimon-cpp-buildutils-static-deps.patch | Patches paimon-cpp build to better support static-only deps and optionally reuse external Arrow. |
| thirdparty/patches/apache-arrow-17.0.0-paimon.patch | Applies Arrow 17 parquet/thrift-related patches needed by paimon-cpp integration. |
| thirdparty/paimon-cpp-cache.cmake | CMake initial cache to reuse Doris thirdparty libs (notably Arrow) when building paimon-cpp. |
| thirdparty/download-thirdparty.sh | Adds git-repo cloning flow for git-based thirdparty packages and patches paimon-cpp + Arrow 17. |
| thirdparty/build-thirdparty.sh | Builds Arrow with additional modules; adds pugixml + paimon-cpp build/install steps; adds post-build source cleanup. |
| regression-test/suites/external_table_p0/paimon/test_paimon_cpp_reader.groovy | Regression validating JNI vs paimon-cpp reader result parity. |
| gensrc/thrift/PaloInternalService.thrift | Adds thrift query option field for enabling paimon-cpp reader. |
| fe/fe-core/src/main/java/org/apache/doris/qe/SessionVariable.java | Adds enable_paimon_cpp_reader session var and forwards it into TQueryOptions. |
| fe/fe-core/src/main/java/org/apache/doris/datasource/paimon/source/PaimonSource.java | Adds table location resolution for paimon-cpp. |
| fe/fe-core/src/main/java/org/apache/doris/datasource/paimon/source/PaimonScanNode.java | Encodes DataSplit using Paimon native serialization when paimon-cpp is enabled; passes table location. |
| fe/be-java-extensions/paimon-scanner/src/main/java/org/apache/doris/paimon/PaimonUtils.java | Adds Base64 decoding fallback (URL-safe → standard) for split deserialization compatibility. |
| be/test/vec/exec/format/table/paimon_cpp_reader_test.cpp | Unit tests for count pushdown path and missing split error handling. |
| be/src/vec/exec/scan/vfile_scanner.cpp | Selects paimon-cpp reader under query option and wires predicate conversion for pushdown conjuncts. |
| be/src/vec/exec/format/table/paimon_predicate_converter.h | Declares converter from Doris VExpr conjuncts to paimon::Predicate. |
| be/src/vec/exec/format/table/paimon_predicate_converter.cpp | Implements predicate conversion (binary comparisons, IN, IS NULL, LIKE-prefix). |
| be/src/vec/exec/format/table/paimon_doris_file_system.h | Declares a force-link registration hook for paimon-cpp FS factory. |
| be/src/vec/exec/format/table/paimon_doris_file_system.cpp | Implements Doris-backed paimon::FileSystem using Doris FileFactory/FSProperties. |
| be/src/vec/exec/format/table/paimon_cpp_reader.h | Declares paimon-cpp GenericReader implementation. |
| be/src/vec/exec/format/table/paimon_cpp_reader.cpp | Implements split decode, option mapping, Arrow RecordBatch import, and COUNT pushdown via table-level row count. |
| be/cmake/thirdparty.cmake | Adds pugixml + paimon-cpp libs to BE thirdparty linkage when enabled. |
| be/CMakeLists.txt | Adds ENABLE_PAIMON_CPP + Arrow library selection logic and whole-archive linkage for paimon factory libs. |
| .github/workflows/build-thirdparty.yml | Adjusts thirdparty CI workflow (disables space maximization; lowers build parallelism). |
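
To illustrate the "LIKE-prefix" case listed for paimon_predicate_converter.cpp, here is a minimal C++17 sketch of how a pattern could be recognized as a pure prefix match before being pushed down. The helper name `like_prefix` and its rules (no escape handling, only a single trailing `%`) are assumptions made for illustration; the PR's actual converter may differ.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical helper, not taken from the PR: returns the literal prefix when
// `pattern` has the shape "literal%" with no other wildcard, otherwise nullopt.
// Escape sequences are not interpreted; any backslash makes the pattern ineligible.
std::optional<std::string> like_prefix(const std::string& pattern) {
    if (pattern.size() < 2 || pattern.back() != '%') {
        return std::nullopt;
    }
    std::string prefix = pattern.substr(0, pattern.size() - 1);
    if (prefix.find_first_of("%_\\") != std::string::npos) {
        return std::nullopt;
    }
    return prefix;
}

// Usage sketch with made-up patterns.
void like_prefix_examples() {
    assert(*like_prefix("abc%") == "abc"); // eligible: plain prefix
    assert(!like_prefix("a_c%"));          // '_' wildcard inside the prefix
    assert(!like_prefix("%abc%"));         // leading '%' is not a prefix match
    assert(!like_prefix("abc"));           // no trailing '%'
}
```

Patterns that return `nullopt` would simply stay as non-pushed conjuncts evaluated by the engine, which keeps the sketch conservative.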
    std::ranges::transform(value, value.begin(),
                           [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
paimon_doris_file_system.cpp uses C++20 APIs (std::ranges::transform), but Doris BE is built with C++17 in this branch (e.g., other thirdparty builds explicitly set CPP_STANDARD=17). This will fail to compile on C++17 toolchains. Replace with C++17 equivalents (e.g., std::transform with iterators).
Suggested change:

    - std::ranges::transform(value, value.begin(),
    -                        [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    + std::transform(value.begin(), value.end(), value.begin(),
    +                [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
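
For reference, the C++17 form suggested above can be checked in isolation with a small wrapper. This is only a sketch: the function name `to_lower_copy` is hypothetical and does not appear in the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical wrapper: lowercases a copy of `value` using only C++17 facilities.
// The lambda takes unsigned char so that std::tolower never receives a negative value.
std::string to_lower_copy(std::string value) {
    std::transform(value.begin(), value.end(), value.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return value;
}

// Usage sketch with a made-up input.
void to_lower_copy_example() {
    assert(to_lower_copy("S3A://Bucket") == "s3a://bucket");
}
```

The iterator-based `std::transform` from `<algorithm>` performs the same in-place rewrite as the ranges overload, so behavior should be unchanged.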
    }

    std::string normalize_local_path(const std::string& path) {
        if (!path.starts_with("file:")) {
C++20-only APIs are used here (e.g., std::string::starts_with). If BE is compiled as C++17 on this branch, this will not compile. Please switch to a C++17-compatible prefix check (e.g., rfind("file:", 0) == 0 / compare(0, ...) == 0).
Suggested change:

    - if (!path.starts_with("file:")) {
    + if (path.compare(0, 5, "file:") != 0) {
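
If more than one prefix check is needed, the hard-coded length in the suggestion above can be avoided with a small C++17 helper. The name `has_prefix` is hypothetical and is not part of the PR; this is a sketch of one possible shape.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical C++17 stand-in for std::string::starts_with (C++20).
// The size check keeps the comparison well defined when `text` is shorter than `prefix`.
bool has_prefix(std::string_view text, std::string_view prefix) {
    return text.size() >= prefix.size() &&
           text.compare(0, prefix.size(), prefix) == 0;
}

// Usage sketch with made-up paths.
void has_prefix_examples() {
    assert(has_prefix("file:///tmp/warehouse", "file:"));
    assert(!has_prefix("hdfs://ns/warehouse", "file:"));
    assert(!has_prefix("fil", "file:"));
}
```

Taking `std::string_view` parameters lets the helper accept both `std::string` and string literals without copying.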
    if (type == doris::TFileType::FILE_HTTP && !options_.contains("uri") &&
        !uri.scheme.empty()) {
C++20-only API std::map::contains is used in this condition. If the BE toolchain is C++17, this will not compile; use options_.find("uri") == options_.end() instead.
Suggested change:

    - if (type == doris::TFileType::FILE_HTTP && !options_.contains("uri") &&
    -     !uri.scheme.empty()) {
    + if (type == doris::TFileType::FILE_HTTP &&
    +     options_.find("uri") == options_.end() && !uri.scheme.empty()) {
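
The `find(...) == end()` idiom in the suggestion above can also be wrapped once and reused. The following C++17 sketch assumes a hypothetical helper named `contains_key`; it works for any associative container that exposes `find` and `end`.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical C++17 stand-in for the C++20 member function contains().
template <typename Map, typename Key>
bool contains_key(const Map& map, const Key& key) {
    return map.find(key) != map.end();
}

// Usage sketch with a made-up options map.
void contains_key_example() {
    const std::map<std::string, std::string> options{{"uri", "http://example.invalid/data"}};
    assert(contains_key(options, std::string("uri")));
    assert(!contains_key(options, std::string("fs.defaultFS")));
}
```

With such a helper the condition would read `!contains_key(options_, "uri")`, which stays close to the original C++20 wording.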
    #./build-thirdparty.sh -j "$(nproc)"
    ./build-thirdparty.sh -j 2
The workflow now hard-codes the thirdparty build to ./build-thirdparty.sh -j 2 (and comments out the -j "$(nproc)" line). This can greatly slow CI and may lead to job timeouts compared to the prior configuration; consider restoring $(nproc) or making the reduced parallelism conditional (e.g., only for troubleshooting).
Suggested change:

    - #./build-thirdparty.sh -j "$(nproc)"
    - ./build-thirdparty.sh -j 2
    + ./build-thirdparty.sh -j "$(nproc)"
    #- name: Maximize build space
    #  uses: ./maximize-build-space
    #  with:
    #    root-reserve-mb: 4096
    #    temp-reserve-mb: 4096
    #    swap-size-mb: 8192
    #    remove-dotnet: 'true'
    #    remove-android: 'true'
    #    remove-haskell: 'true'
    #    remove-codeql: 'true'
    #    remove-docker-images: 'true'
The disk-space maximization step has been commented out. If thirdparty builds still approach runner disk limits, this may cause intermittent failures; consider keeping it enabled or replacing it with an alternative space management step.
Suggested change:

    - #- name: Maximize build space
    - #  uses: ./maximize-build-space
    - #  with:
    - #    root-reserve-mb: 4096
    - #    temp-reserve-mb: 4096
    - #    swap-size-mb: 8192
    - #    remove-dotnet: 'true'
    - #    remove-android: 'true'
    - #    remove-haskell: 'true'
    - #    remove-codeql: 'true'
    - #    remove-docker-images: 'true'
    + - name: Maximize build space
    +   uses: ./maximize-build-space
    +   with:
    +     root-reserve-mb: 4096
    +     temp-reserve-mb: 4096
    +     swap-size-mb: 8192
    +     remove-dotnet: 'true'
    +     remove-android: 'true'
    +     remove-haskell: 'true'
    +     remove-codeql: 'true'
    +     remove-docker-images: 'true'
aa776d0 into apache:tmp-branch-2.1.11-rc01-paimon-cpp
Summary
Backport paimon-cpp reader integration chain into tmp-branch-2.1.11-rc01-paimon-cpp as a single squashed change.
Included upstream PRs:
Notes