[feature](lance) Add Rust-based Lance format reader for AI-native dat…#62182
BiteTheDDDDt merged 8 commits into apache:master
We need to install the cargo toolchain so that the Rust FFI code compiles.
The TeamCity pipeline uses regression-test/pipeline/common/custom_env.sh when compiling; please set BUILD_RUST_READERS=on in that file for testing.
Revert review-bot-driven changes that broke local dev ergonomics and CI:
- Restore CMake auto-detect of pre-built libdoris_ffi.a (no env var required for dev builds)
- Restore rust.cmake PARENT_SCOPE (its include() context had already set the cache variable; PARENT_SCOPE is the correct and safe form)
- Remove enable_rust_lance_reader session variable guard in file_scanner.cpp
- Revert case-insensitive column matching and per-unit timestamp scale in lance_rust_reader.cpp

Keep the multi-fragment duplicate-row fix: lance_rust_reader.cpp extracts fragment_file from the scan range path and passes it through the JSON config; doris-ffi filters by matching data-file path so each scan range reads exactly one fragment (not the whole dataset).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
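The fragment-matching idea in this commit message can be sketched roughly as follows. This is not the code from lance_rust_reader.cpp or doris-ffi: the `Fragment` struct and the helper names `split_scan_path` and `fragments_for_file` are hypothetical, and the sketch assumes the Lance dataset layout `<dataset root>/data/<file>.lance`.

```rust
/// Splits a scan-range path into (dataset root, data-file name).
/// Assumption: data files live at `<root>/data/<file>.lance`.
fn split_scan_path(path: &str) -> Option<(&str, &str)> {
    let (dir, file) = path.rsplit_once('/')?;
    if !file.ends_with(".lance") {
        return None;
    }
    // Drop the trailing `/data` component to recover the dataset root.
    let root = dir.strip_suffix("/data")?;
    Some((root, file))
}

/// Hypothetical stand-in for a fragment entry in the dataset manifest.
#[derive(Debug, Clone, PartialEq)]
struct Fragment {
    id: u64,
    /// Data-file paths relative to the dataset root, e.g. "data/a.lance".
    data_files: Vec<String>,
}

/// Keeps only the fragments that own `fragment_file`, comparing by file name.
fn fragments_for_file<'a>(all: &'a [Fragment], fragment_file: &str) -> Vec<&'a Fragment> {
    all.iter()
        .filter(|f| {
            f.data_files
                .iter()
                .any(|p| p.rsplit('/').next() == Some(fragment_file))
        })
        .collect()
}
```

Under these assumptions, a scan range for `s3://bucket/ds.lance/data/a.lance` opens the dataset at `s3://bucket/ds.lance` and reads only the fragment that owns `a.lance`, so two scan ranges over the same dataset no longer return the same rows twice.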
Upstream master refactored GenericReader virtual methods:
  get_next_block -> _do_get_next_block (protected hook)
  get_columns(name_to_type, missing_cols) -> _get_columns_impl(name_to_type)

Update LanceRustReader's overrides to match the new signatures so CI (which builds our PR on top of current master) compiles.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…o lance-support
So many PRs per day...
        fuzzy = true,
        description = {"使用 Rust Lance 读取器读取 Lance 格式数据",
                "Use Rust-based Lance reader for Lance format data"})
private boolean enableRustLanceReader = false;
Do we really need this variable?
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested.
Summary
Add native Lance format support to Doris via Rust FFI integration, enabling SQL queries over AI-native Lance datasets from local disk and S3.
Lance is a columnar format designed for vector search, multimodal data (images, embeddings), and fast random access -- widely used in AI/ML pipelines.
Quick Examples
Read from S3:
Read from local disk (for testing):
Aggregation across multi-fragment dataset:
Architecture
What Works (Verified on Live Cluster)
Known Limitations
.lance data file inside the dataset; the reader auto-strips the path back to the dataset root and reads all fragments. If the TVF glob matches multiple .lance files (multi-fragment dataset), each scan range reopens the full dataset, causing duplicate rows. Workaround: ensure the file_path glob matches exactly one data file per dataset.
How to Build
Changes
Thrift (2 files): FORMAT_LANCE = 19, TLanceFileDesc, enable_rust_lance_reader
FE (4 files): LanceFileFormatProperties, FileFormatProperties factory, FileFormatConstants, SessionVariable
BE - Rust (be/src/rust/doris-native/, 6 files): error.rs, lance_reader.rs (LanceReaderConfig with S3/version/vector/FTS support), ffi.rs (extern C functions), lib.rs
BE - C++ (5 files): lance_ffi.h, lance_rust_reader.h/cpp (GenericReader with Arrow import), file_scanner.cpp (FORMAT_LANCE dispatch), internal_service.cpp (fetch_table_schema)
Build (3 files): rust.cmake (Corrosion v0.5), CMakeLists.txt (BUILD_RUST_READERS option), format/CMakeLists.txt
Tests (4 files): 24 Rust tests, C++ GTest (7), C++ standalone (8), Groovy regression (9)
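To make the "extern C functions" entry for ffi.rs more concrete, here is a minimal sketch of the opaque-handle pattern such an FFI layer typically uses. It is not the PR's ffi.rs: `DemoReader` and the `demo_reader_*` functions are hypothetical names, and the handle only stores the JSON config text instead of opening a Lance dataset.

```rust
use std::ffi::{c_char, CStr};

/// Opaque handle returned to the C++ side (hypothetical stand-in for a reader).
pub struct DemoReader {
    config_json: String,
}

/// Creates a handle from a NUL-terminated JSON config string.
/// Returns null for a null pointer or for text that is not valid UTF-8.
/// A real export would also carry `#[no_mangle]`
/// (written `#[unsafe(no_mangle)]` in the 2024 edition).
///
/// # Safety
/// `config_json` must be null or point to a valid NUL-terminated string.
pub unsafe extern "C" fn demo_reader_open(config_json: *const c_char) -> *mut DemoReader {
    if config_json.is_null() {
        return std::ptr::null_mut();
    }
    let text = match unsafe { CStr::from_ptr(config_json) }.to_str() {
        Ok(s) => s.to_owned(),
        Err(_) => return std::ptr::null_mut(),
    };
    // Ownership moves to the caller until demo_reader_close is called.
    Box::into_raw(Box::new(DemoReader { config_json: text }))
}

/// Returns the stored config length in bytes, or 0 for a null handle.
///
/// # Safety
/// `handle` must be null or a live pointer from `demo_reader_open`.
pub unsafe extern "C" fn demo_reader_config_len(handle: *const DemoReader) -> usize {
    match unsafe { handle.as_ref() } {
        Some(reader) => reader.config_json.len(),
        None => 0,
    }
}

/// Frees a handle created by `demo_reader_open`; a null handle is ignored.
///
/// # Safety
/// `handle` must be null or a live pointer that has not been closed yet.
pub unsafe extern "C" fn demo_reader_close(handle: *mut DemoReader) {
    if !handle.is_null() {
        drop(unsafe { Box::from_raw(handle) });
    }
}
```

Returning a raw pointer made with `Box::into_raw` keeps the Rust type opaque to the caller, so a C++ header would only need to declare matching function prototypes and treat the handle as an opaque pointer.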
Future Work