[feature](lance) Add Rust-based Lance format reader for AI-native dat…#62182

Merged
BiteTheDDDDt merged 8 commits into apache:master from tomz-alt:lance-support on Apr 22, 2026

Conversation

@tomz-alt
Contributor

@tomz-alt tomz-alt commented Apr 7, 2026

Summary

Add native Lance format support to Doris via Rust FFI integration, enabling SQL queries over AI-native Lance datasets from local disk and S3.

Lance is a columnar format designed for vector search, multimodal data (images, embeddings), and fast random access -- widely used in AI/ML pipelines.

Quick Examples

Read from S3:

SELECT * FROM s3(
    "uri" = "s3://bucket/embeddings.lance/data/fragment.lance",
    "format" = "lance",
    "s3.access_key" = "...",
    "s3.secret_key" = "...",
    "s3.region" = "us-east-1",
    "s3.endpoint" = "https://s3.us-east-1.amazonaws.com"
) ORDER BY id LIMIT 10;

Read from local disk (for testing):

-- Get backend_id from: SHOW BACKENDS;
SELECT * FROM local(
    "file_path" = "data/my_dataset.lance/data/fragment.lance",
    "backend_id" = "<backend_id from SHOW BACKENDS>",
    "format" = "lance"
) ORDER BY id LIMIT 10;

Aggregation across multi-fragment dataset:

SELECT count(*), min(id), max(id) FROM s3(
    "uri" = "s3://bucket/large.lance/data/fragment.lance",
    "format" = "lance",
    "s3.access_key" = "...", "s3.secret_key" = "...",
    "s3.region" = "us-east-1",
    "s3.endpoint" = "https://s3.us-east-1.amazonaws.com"
);

Architecture

  • Data exchange: Arrow C Data Interface (zero-copy between Rust and C++)
  • Async containment: block_on() with single-threaded tokio runtime (zero extra OS threads)
  • Build gating: BUILD_RUST_READERS=OFF by default, zero impact on existing builds
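
The async-containment point above can be illustrated without tokio. The sketch below is a std-only stand-in, not the PR's code: it drives a future to completion on the calling thread and spawns no threads, which is the property the PR attributes to block_on() on a single-threaded tokio runtime. The names `ThreadWaker`, `block_on`, and `next_batch_len` are illustrative assumptions.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

/// Wakes the blocked caller by unparking its thread (hypothetical helper).
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Polls `fut` on the current thread until it completes; spawns no threads.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(value) => return value,
            // Sleep until the waker unparks this thread, then poll again.
            Poll::Pending => thread::park(),
        }
    }
}

/// Stand-in for an async read, such as fetching the next batch (assumption).
async fn next_batch_len() -> usize {
    15
}
```

A synchronous FFI entry point could then call `block_on(next_batch_len())` and return the result to C++ without exposing any async types across the boundary.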

What Works (Verified on Live Cluster)

| Feature | Status |
|---|---|
| SELECT * / column projection | Tested |
| WHERE filter / LIMIT / COUNT(*) | Tested |
| SUM() / AVG() aggregation | Tested |
| Multi-fragment datasets (3 fragments, 15 rows) | Tested |
| S3 access with AWS credentials | Tested |
| Schema inference (fetch_table_schema) | Tested |
| Time travel version (config wired) | Config ready |
| Vector ANN search / FTS / filter pushdown | Config ready |

Known Limitations

  • TVF path only: No CREATE CATALOG support yet. Must use local() or s3() TVF
  • Directory-based format workaround: Lance datasets are directories. The TVF file_path must point to a single .lance data file inside the dataset; the reader auto-strips the path back to the dataset root and reads all fragments. If the TVF glob matches multiple .lance files (multi-fragment dataset), each scan range reopens the full dataset causing duplicate rows. Workaround: ensure the file_path glob matches exactly one data file per dataset
  • No Doris data cache integration: Lance reads bypass BlockFileCache. S3 reads are not cached on local SSD
  • No filter/vector pushdown from FE: The Rust config supports filter, vector_search, full_text_search but the FE planner does not populate them yet
  • BUILD_RUST_READERS=OFF default: Requires explicit opt-in and Rust toolchain
  • Binary size: Rust static lib ~430MB (.a), adds ~50-80MB to final doris_be after LTO
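
The path stripping described in the directory-based workaround can be sketched as follows. This is an illustration under an assumed layout of `<dataset>.lance/data/<file>.lance`, matching the paths in the Quick Examples; `dataset_root` is a hypothetical name and not necessarily how lance_reader.rs implements it.

```rust
/// Returns the dataset root for a path shaped like
/// `<root>.lance/data/<file>.lance`, or `None` for any other shape.
/// Works on plain strings so that local paths and `s3://` URIs are
/// handled the same way.
fn dataset_root(file_path: &str) -> Option<&str> {
    // Split at the last "/data/" segment: left is the candidate root,
    // right is the candidate data-file name.
    let (root, file_name) = file_path.rsplit_once("/data/")?;
    let is_data_file = file_name.ends_with(".lance") && !file_name.contains('/');
    if root.ends_with(".lance") && is_data_file {
        Some(root)
    } else {
        None
    }
}
```

For example, `dataset_root("s3://bucket/embeddings.lance/data/fragment.lance")` yields `Some("s3://bucket/embeddings.lance")`, while a path without a `/data/` segment yields `None`.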

How to Build

# Rust tests only:
cd be/src/rust/doris-native && cargo test

# Full BE with Lance:
BUILD_RUST_READERS=ON ./build.sh --be

# Regression test:
./run-regression-test.sh --run -s test_lance_tvf

Changes

Thrift (2 files): FORMAT_LANCE = 19, TLanceFileDesc, enable_rust_lance_reader

FE (4 files): LanceFileFormatProperties, FileFormatProperties factory, FileFormatConstants, SessionVariable

BE - Rust (be/src/rust/doris-native/, 6 files): error.rs, lance_reader.rs (LanceReaderConfig with S3/version/vector/FTS support), ffi.rs (extern C functions), lib.rs

BE - C++ (5 files): lance_ffi.h, lance_rust_reader.h/cpp (GenericReader with Arrow import), file_scanner.cpp (FORMAT_LANCE dispatch), internal_service.cpp (fetch_table_schema)

Build (3 files): rust.cmake (Corrosion v0.5), CMakeLists.txt (BUILD_RUST_READERS option), format/CMakeLists.txt

Tests (4 files): 24 Rust tests, C++ GTest (7), C++ standalone (8), Groovy regression (9)
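
The Arrow import on the C++ side relies on the Arrow C Data Interface, where the producer hands over a struct of raw pointers plus a release callback. The sketch below shows that handoff for a single non-null int64 column. The `ArrowArray` field layout follows the Arrow specification; `Int64Holder`, `export_int64`, `release_int64`, and `read_int64` are hypothetical names, and the PR may well use an existing Arrow crate's FFI helpers instead of hand-written code like this.

```rust
use std::ffi::c_void;
use std::ptr;

/// Field layout of `struct ArrowArray` from the Arrow C Data Interface.
#[repr(C)]
pub struct ArrowArray {
    pub length: i64,
    pub null_count: i64,
    pub offset: i64,
    pub n_buffers: i64,
    pub n_children: i64,
    pub buffers: *mut *const c_void,
    pub children: *mut *mut ArrowArray,
    pub dictionary: *mut ArrowArray,
    pub release: Option<unsafe extern "C" fn(*mut ArrowArray)>,
    pub private_data: *mut c_void,
}

/// Producer-side state that must outlive the exported array (hypothetical).
struct Int64Holder {
    values: Vec<i64>,
    buffers: [*const c_void; 2], // [validity bitmap (absent: no nulls), values]
}

/// Release callback: frees the producer state and marks the array released.
unsafe extern "C" fn release_int64(array: *mut ArrowArray) {
    if array.is_null() {
        return;
    }
    // SAFETY: the caller passes an array produced by `export_int64`.
    let array = unsafe { &mut *array };
    if array.release.is_none() {
        return; // already released
    }
    // SAFETY: `private_data` came from `Box::into_raw` in `export_int64`.
    drop(unsafe { Box::from_raw(array.private_data as *mut Int64Holder) });
    array.release = None;
}

/// Exports `values` without copying: the array points at the Vec's buffer.
pub fn export_int64(values: Vec<i64>) -> ArrowArray {
    let length = values.len() as i64;
    let raw = Box::into_raw(Box::new(Int64Holder { values, buffers: [ptr::null(); 2] }));
    // SAFETY: `raw` was just created from a Box and is uniquely owned here.
    let holder = unsafe { &mut *raw };
    holder.buffers[1] = holder.values.as_ptr() as *const c_void;
    ArrowArray {
        length,
        null_count: 0,
        offset: 0,
        n_buffers: 2,
        n_children: 0,
        buffers: holder.buffers.as_mut_ptr(),
        children: ptr::null_mut(),
        dictionary: ptr::null_mut(),
        release: Some(release_int64),
        private_data: raw as *mut c_void,
    }
}

/// Consumer side: reads the values buffer in place (copied out only for the demo).
/// Safety: `array` must be a live, unreleased array from `export_int64`.
pub unsafe fn read_int64(array: &ArrowArray) -> Vec<i64> {
    unsafe {
        let data = *array.buffers.add(1) as *const i64;
        std::slice::from_raw_parts(data.add(array.offset as usize), array.length as usize).to_vec()
    }
}
```

The consumer is responsible for calling `release` exactly once when it is done; until then the producer's buffers stay alive, which is what makes the transfer zero-copy.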

Future Work

  • Lance Catalog: CREATE CATALOG for dataset discovery (eliminates TVF path limitations and backend_id requirement)
  • Fragment-level scan ranges: FE lists fragments and creates one scan range per fragment with fragment ID, avoiding duplicate reads
  • FE filter/vector pushdown: Pass WHERE predicates and ANN queries to lance-rs scanner
  • Doris BlockFileCache integration: Route lance I/O through Doris cache for S3 data caching
  • Lance Session cache: Shared IndexCache for vector/FTS index reuse across queries
  • Time travel SQL: FOR VERSION AS OF N via LanceMvccSnapshot

@Thearas
Contributor

Thearas commented Apr 7, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@morningman morningman self-assigned this Apr 7, 2026
@tomz-alt tomz-alt force-pushed the lance-support branch 5 times, most recently from 9825d0e to 6f655a4 Compare April 7, 2026 18:15
@morningman
Contributor

run buildall

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 18.18% (2/11) 🎉
Increment coverage report
Complete coverage report

@tomz-alt
Contributor Author

tomz-alt commented Apr 7, 2026

We need the Cargo toolchain installed for the Rust FFI code to compile.

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.98% (20105/37947)
Line Coverage 36.53% (188940/517204)
Region Coverage 32.78% (146566/447093)
Branch Coverage 33.93% (64209/189233)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100% (0/0) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.69% (27386/37162)
Line Coverage 57.31% (295496/515622)
Region Coverage 54.51% (245979/451226)
Branch Coverage 56.18% (106642/189815)

@hello-stephen
Contributor

run buildall

@hello-stephen
Contributor

The TeamCity pipeline uses regression-test/pipeline/common/custom_env.sh when compiling; please set BUILD_RUST_READERS=on in that file for testing.

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.00% (20112/37947)
Line Coverage 36.56% (189072/517225)
Region Coverage 32.81% (146716/447111)
Branch Coverage 33.94% (64233/189251)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 89.87% (71/79) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.72% (27397/37162)
Line Coverage 57.36% (295782/515643)
Region Coverage 54.61% (246412/451244)
Branch Coverage 56.29% (106852/189833)

@hello-stephen
Contributor

run buildall

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 18.18% (2/11) 🎉
Increment coverage report
Complete coverage report

@tomz-alt tomz-alt force-pushed the lance-support branch 3 times, most recently from a02a47a to 9dcf75e Compare April 13, 2026 06:19
@morningman
Contributor

run buildall

@tomz-alt tomz-alt force-pushed the lance-support branch 2 times, most recently from d306c78 to f0bad39 Compare April 13, 2026 18:41
@tomz-alt
Contributor Author

run buildall

@hello-stephen
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 78.48% (1798/2291)
Line Coverage 64.17% (32304/50345)
Region Coverage 65.13% (16260/24967)
Branch Coverage 55.63% (8689/15620)

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 18.18% (2/11) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 78.48% (1798/2291)
Line Coverage 64.10% (32273/50345)
Region Coverage 65.06% (16244/24967)
Branch Coverage 55.58% (8681/15620)

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 90.91% (10/11) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.27% (20296/38100)
Line Coverage 36.76% (191118/519960)
Region Coverage 33.04% (148446/449349)
Branch Coverage 34.13% (64891/190108)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100% (0/0) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.68% (27490/37309)
Line Coverage 57.41% (297600/518344)
Region Coverage 54.59% (247557/453460)
Branch Coverage 56.18% (107131/190678)

Revert review-bot-driven changes that broke local dev ergonomics and CI:
- Restore CMake auto-detect of pre-built libdoris_ffi.a (no env var required for dev builds)
- Restore rust.cmake PARENT_SCOPE (its include() context had already set the cache variable; PARENT_SCOPE is the correct and safe form)
- Remove enable_rust_lance_reader session variable guard in file_scanner.cpp
- Revert case-insensitive column matching and per-unit timestamp scale in lance_rust_reader.cpp

Keep the multi-fragment duplicate-row fix: lance_rust_reader.cpp extracts
fragment_file from the scan range path and passes it through the JSON
config; doris-ffi filters by matching data-file path so each scan range
reads exactly one fragment (not the whole dataset).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
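
The duplicate-row fix in the commit message above (one fragment per scan range, selected by data-file path) can be sketched like this. It is a simplified illustration, not the doris-ffi code: `Fragment`, `fragment_file`, and `select_fragments` are hypothetical names, and matching on the file name alone is an assumption about how the data-file path is compared.

```rust
/// Hypothetical, simplified view of a dataset fragment.
#[derive(Debug, Clone, PartialEq)]
struct Fragment {
    id: u64,
    /// Data-file paths relative to the dataset root, e.g. "data/a.lance".
    data_files: Vec<String>,
}

/// Extracts the data-file name from a path (the part after the last '/').
fn fragment_file(path: &str) -> &str {
    path.rsplit('/').next().unwrap_or(path)
}

/// Keeps only the fragments that own `file_name`, so a scan range reads
/// its own fragment instead of the whole dataset.
fn select_fragments<'a>(fragments: &'a [Fragment], file_name: &str) -> Vec<&'a Fragment> {
    fragments
        .iter()
        .filter(|f| f.data_files.iter().any(|p| fragment_file(p) == file_name))
        .collect()
}
```

With three fragments backed by `a.lance`, `b.lance`, and `c.lance`, a scan range whose path ends in `data/b.lance` selects only the second fragment, so three scan ranges together return each row once.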
@tomz-alt
Contributor Author

run buildall

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 18.18% (2/11) 🎉
Increment coverage report
Complete coverage report

@tomz-alt
Contributor Author

run buildall

tom zhang and others added 3 commits April 20, 2026 15:42
Upstream master refactored GenericReader virtual methods:
  get_next_block  -> _do_get_next_block  (protected hook)
  get_columns(name_to_type, missing_cols) -> _get_columns_impl(name_to_type)

Update LanceRustReader's overrides to match the new signatures so CI
(which builds our PR on top of current master) compiles.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@morningman
Contributor

run buildall

@tomz-alt
Contributor Author

run buildall

@suxiaogang223
Member

run buildall

@morningman
Contributor

run buildall

@tomz-alt
Contributor Author

So many PRs per day...

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 90.91% (10/11) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 0.68% (2/296) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.71% (26833/37418)
Line Coverage 54.95% (285305/519212)
Region Coverage 51.76% (235098/454223)
Branch Coverage 53.27% (101803/191106)

fuzzy = true,
description = {"使用 Rust Lance 读取器读取 Lance 格式数据",
"Use Rust-based Lance reader for Lance format data"})
private boolean enableRustLanceReader = false;
Contributor

Do we really need this variable?

@github-actions github-actions Bot added the approved Indicates a PR has been approved by one committer. label Apr 21, 2026
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

@BiteTheDDDDt BiteTheDDDDt merged commit b613c37 into apache:master Apr 22, 2026
30 of 33 checks passed
Labels

approved (indicates a PR has been approved by one committer), kind/need-document, reviewed

6 participants