feat: add lance format for Flink MOR table #18911

Merged
danny0405 merged 3 commits into apache:master from danny0405:lance-mor on Jun 5, 2026

@danny0405

Contributor

Describe the issue this Pull Request addresses

This closes #18907.

Flink Lance base-file support was previously scoped away from merge-on-read tables, so MOR write/read flows could not use Lance base files even when the Flink path had Lance-specific readers available. This blocked users from combining Lance base files with MOR log-file merging, CDC base-file reads, and async compaction in Flink SQL pipelines.

This PR expands the Flink Lance path to support merge-on-read writes and reads while keeping the existing schema-evolution restriction for Lance files.

Summary and Changelog

This PR enables Lance base files for Flink merge-on-read tables, wires Lance readers into MOR and CDC base-file reads, and updates tests to cover both Parquet and Lance MOR base/log-file reads.

Working tree: Support Flink Lance MOR write and read path

  • Allows hoodie.table.base.file.format = LANCE for Flink merge-on-read tables by removing the previous MOR rejection in HoodieTableFactory.
  • Adds Lance base-file handling in MergeOnReadInputFormat using HoodieRowDataLanceReader and requested-schema projection.
  • Adds Lance base-file handling in HoodieCdcSplitReaderFunction so CDC split reads can load Lance base files.
  • Narrows FlinkRowDataReaderContext schema-evolution rejection to only fail when a non-empty merge schema is required, while still rejecting actual Lance schema evolution.
  • Updates the Lance unsupported-path error message to avoid saying Lance is Spark-only.
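
As a rough illustration of the extension-based branch used by the MOR and CDC base-file readers (the diff excerpts in this PR show `path.endsWith(HoodieFileFormat.LANCE.getFileExtension())`), here is a minimal, self-contained sketch. The `Format` enum and the string labels are stand-ins rather than Hudi classes, and the Parquet fallthrough is an assumption about what the non-Lance branch does.

```java
/** Hypothetical sketch of extension-based base-file reader selection. */
final class BaseFileDispatchSketch {

  /** Stand-in for HoodieFileFormat, limited to the two formats discussed in this PR. */
  enum Format {
    PARQUET(".parquet"),
    LANCE(".lance");

    private final String fileExtension;

    Format(String fileExtension) {
      this.fileExtension = fileExtension;
    }

    String getFileExtension() {
      return fileExtension;
    }
  }

  /** Returns a label for the reader that would be opened for the given base-file path. */
  static String readerFor(String path) {
    if (path.endsWith(Format.LANCE.getFileExtension())) {
      return "lance";
    }
    // Assumption: every other base file keeps going through the existing Parquet path.
    return "parquet";
  }

  public static void main(String[] args) {
    System.out.println(readerFor("/tbl/par1/f1.lance"));
    System.out.println(readerFor("/tbl/par1/f1.parquet"));
  }
}
```

Branching on the file extension keeps the decision local to each split, so a table whose file groups contain both Parquet and Lance base files would still resolve a reader per file in this sketch.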

Working tree: Tests and validation

  • Adds ITTestHoodieDataSource.testLanceFormatMergeOnReadUpsertWriteAndRead for Flink SQL MOR upsert/write/read with Lance base files and async compaction enabled through SQL table options.
  • Parameterizes TestInputFormat.testReadBaseAndLogFiles to run for both PARQUET and LANCE.
  • Updates table-factory and Hive catalog assertions for the new Lance support boundary.
  • Adds a test utility helper for checking completed compaction timeline state.
  • Validation run:
    • mvn -pl hudi-flink-datasource/hudi-flink -am -DskipITs -DskipIT -Dcheckstyle.skip -Drat.skip=true -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false -Dtest=TestInputFormat#testReadBaseAndLogFiles test
    • mvn -pl hudi-flink-datasource/hudi-flink -am -DskipITs=false -DskipIT=false -Dcheckstyle.skip -Drat.skip=true -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false -Dtest=ITTestHoodieDataSource#testLanceFormatMergeOnReadUpsertWriteAndRead test
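
For anyone who wants to try the SQL path by hand, the sketch below assembles a Flink SQL DDL string with the table options this PR is about. The option keys (`connector`, `path`, `table.type`, `hoodie.table.base.file.format`, `compaction.async.enabled`, `compaction.delta_commits`) are existing Hudi Flink options; the table name, columns, path, and the compaction values are hypothetical and are not copied from the integration test.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper that builds a CREATE TABLE statement for a Lance MOR table. */
final class LanceMorDdlSketch {

  /** Table options for a merge-on-read table with Lance base files and async compaction. */
  static Map<String, String> lanceMorOptions(String path) {
    Map<String, String> options = new LinkedHashMap<>();
    options.put("connector", "hudi");
    options.put("path", path);
    options.put("table.type", "MERGE_ON_READ");
    options.put("hoodie.table.base.file.format", "LANCE");
    // Assumed values: compact asynchronously after every delta commit.
    options.put("compaction.async.enabled", "true");
    options.put("compaction.delta_commits", "1");
    return options;
  }

  /** Renders the DDL; the three columns are illustrative only. */
  static String createTableDdl(String tableName, Map<String, String> options) {
    StringJoiner with = new StringJoiner(",\n", "WITH (\n", "\n)");
    for (Map.Entry<String, String> e : options.entrySet()) {
      with.add("  '" + e.getKey() + "' = '" + e.getValue() + "'");
    }
    return "CREATE TABLE " + tableName + " (\n"
        + "  uuid STRING PRIMARY KEY NOT ENFORCED,\n"
        + "  name STRING,\n"
        + "  ts TIMESTAMP(3)\n"
        + ") " + with;
  }

  public static void main(String[] args) {
    System.out.println(
        createTableDdl("lance_mor_t", lanceMorOptions("file:///tmp/hudi/lance_mor_t")));
  }
}
```

The resulting string could be passed to `TableEnvironment.executeSql` in a Flink job; that call is left out here so the sketch has no Flink dependency.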

Impact

This expands Flink user-facing behavior by allowing Lance base files with merge-on-read tables and by enabling MOR/CDC read paths to read Lance base files. Schema evolution for Flink Lance base files remains unsupported. There is no new public API, but the accepted configuration surface changes because Flink MOR tables can now use hoodie.table.base.file.format = LANCE.
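
The narrowed schema-evolution check can be pictured roughly as follows. This is a hypothetical stand-alone sketch, not the code in FlinkRowDataReaderContext: the method name, the list-of-field-names representation of a merge schema, and the error message are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of a guard that rejects only real schema evolution on Lance files. */
final class LanceSchemaEvolutionGuardSketch {

  // Assumed wording; the actual message in Hudi may differ.
  static final String MESSAGE = "Schema evolution is not supported for Lance base files";

  /**
   * Fails only when the base file is Lance AND a non-empty merge schema is required.
   * An absent or empty merge schema means no evolution is needed, so the read proceeds.
   */
  static void validate(boolean lanceBaseFile, List<String> mergeSchemaFields) {
    boolean mergeSchemaRequired = mergeSchemaFields != null && !mergeSchemaFields.isEmpty();
    if (lanceBaseFile && mergeSchemaRequired) {
      throw new UnsupportedOperationException(MESSAGE);
    }
  }

  public static void main(String[] args) {
    validate(true, Collections.<String>emptyList()); // passes: nothing to merge
    validate(false, Arrays.asList("new_col"));       // passes: not a Lance file
    try {
      validate(true, Arrays.asList("new_col"));      // rejected: Lance plus evolution
    } catch (UnsupportedOperationException e) {
      System.out.println(e.getMessage());
    }
  }
}
```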

Risk Level

medium

This touches Flink MOR read/write behavior, CDC split reads, table factory validation, and a storage-format-specific reader path. Risk is mitigated by targeted unit and integration coverage for Lance MOR SQL writes/reads, MOR base/log-file reads, and table factory validation. One targeted IT run completed successfully with Surefire retry after an initial transient row assertion mismatch.

Documentation Update

Required. The Flink/base-file-format support matrix or configuration documentation should be updated to note that Lance base files are supported for Flink merge-on-read tables, with schema evolution still unsupported.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions bot added the size:M label (PR with lines of changes in (100, 300]) on Jun 4, 2026
@@ -56,8 +56,8 @@ public enum HoodieFileFormat {
LANCE(".lance");

public static final String LANCE_SPARK_ONLY_ERROR_MSG =

Collaborator

Can we also rename the field?

Contributor Author

Done. Renamed the constant to LANCE_UNSUPPORTED_ERROR_MSG and updated all call sites.

}

protected ClosableIterator<RowData> getBaseFileIterator(String path) throws IOException {
if (path.endsWith(HoodieFileFormat.LANCE.getFileExtension())) {

Collaborator

Could we add a CDC coverage case for Lance MOR here? This PR adds Lance base-file handling for BASE_FILE_INSERT in both the Source V2 CDC reader and the legacy CDC path, but the new tests only cover snapshot reads. A CDC-enabled test that exercises a Lance MOR base-file CDC inference case would make this path much safer.

Contributor Author

Added CDC coverage by parameterizing TestInputFormat.testReadChangelogIncrementallyForMorWithCompaction over PARQUET and LANCE. The Lance case exercises MOR compaction with CDC enabled so the base-file CDC inference path reads Lance base files.

- /** Reads a parquet CDC base file returning required-schema records. */
+ /** Reads a CDC base file returning required-schema records. */
private ClosableIterator<RowData> getBaseFileIterator(String path) throws IOException {
if (path.endsWith(HoodieFileFormat.LANCE.getFileExtension())) {

Collaborator

The Lance base-file reader setup is now duplicated between MergeOnReadInputFormat and HoodieCdcSplitReaderFunction. Could we extract a small shared helper for building the selected DataType / requested HoodieSchema and opening HoodieRowDataLanceReader? That would reduce drift if schema, predicate, or close/error handling needs to change later.

Contributor Author

Done. Extracted the shared Lance RowData reader setup into FormatUtils.getLanceRecordIterator and reused it from MergeOnReadInputFormat, HoodieCdcSplitReaderFunction, and the existing COW Lance path so schema construction and close/error handling stay in one place.
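
To illustrate why a single helper helps with close/error handling, here is a small stand-alone sketch of the pattern. It does not reproduce FormatUtils.getLanceRecordIterator: the `Reader` and `ClosableIter` interfaces, the `openIterator` name, and the list-of-field-names projection are hypothetical stand-ins for the Hudi types.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch of one shared entry point for opening a projected record iterator. */
final class SharedReaderHelperSketch {

  /** Stand-in for a closable record iterator. */
  interface ClosableIter<T> extends Iterator<T>, Closeable {}

  /** Stand-in for a format-specific base-file reader. */
  interface Reader<T> extends Closeable {
    Iterator<T> read(List<String> requestedFields) throws IOException;
  }

  /** Opens the iterator; if opening fails, the reader is closed before rethrowing. */
  static <T> ClosableIter<T> openIterator(Reader<T> reader, List<String> requestedFields)
      throws IOException {
    final Iterator<T> delegate;
    try {
      delegate = reader.read(requestedFields);
    } catch (IOException | RuntimeException e) {
      try {
        reader.close();
      } catch (IOException closeFailure) {
        e.addSuppressed(closeFailure);
      }
      throw e;
    }
    // Closing the returned iterator releases the underlying reader.
    return new ClosableIter<T>() {
      @Override public boolean hasNext() { return delegate.hasNext(); }
      @Override public T next() { return delegate.next(); }
      @Override public void close() throws IOException { reader.close(); }
    };
  }

  public static void main(String[] args) throws IOException {
    // In-memory reader that simply echoes the requested field names.
    Reader<String> inMemory = new Reader<String>() {
      @Override public Iterator<String> read(List<String> requestedFields) {
        return requestedFields.iterator();
      }
      @Override public void close() {
        System.out.println("reader closed");
      }
    };
    try (ClosableIter<String> it = openIterator(inMemory, Arrays.asList("uuid", "name"))) {
      while (it.hasNext()) {
        System.out.println(it.next());
      }
    }
  }
}
```

With one helper, each caller (MOR, CDC, COW in the PR's case) only decides which fields to request, and any later change to cleanup behaviour is made in a single place.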

@codecov-commenter

Codecov Report

❌ Patch coverage is 4.16667% with 23 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.90%. Comparing base (b7adecc) to head (7d532b8).
⚠️ Report is 9 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| ...java/org/apache/hudi/table/format/FormatUtils.java | 0.00% | 11 Missing ⚠️ |
| .../reader/function/HoodieCdcSplitReaderFunction.java | 0.00% | 3 Missing ⚠️ |
| ...e/hudi/table/format/FlinkRowDataReaderContext.java | 0.00% | 1 Missing and 1 partial ⚠️ |
| .../hudi/table/format/cow/CopyOnWriteInputFormat.java | 0.00% | 2 Missing ⚠️ |
| .../hudi/table/format/mor/MergeOnReadInputFormat.java | 0.00% | 1 Missing and 1 partial ⚠️ |
| ...pache/hudi/io/storage/HoodieFileReaderFactory.java | 0.00% | 1 Missing ⚠️ |
| ...pache/hudi/io/storage/HoodieFileWriterFactory.java | 0.00% | 1 Missing ⚠️ |
| ...rg/apache/hudi/hadoop/HiveHoodieReaderContext.java | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18911      +/-   ##
============================================
+ Coverage     68.81%   68.90%   +0.09%     
+ Complexity    29160    29110      -50     
============================================
  Files          2520     2517       -3     
  Lines        140056   139802     -254     
  Branches      17209    17204       -5     
============================================
- Hits          96373    96329      -44     
+ Misses        35909    35695     -214     
- Partials       7774     7778       +4     
| Flag | Coverage Δ |
| common-and-other-modules | 44.64% <4.16%> (+0.31%) ⬆️ |
| hadoop-mr-java-client | 44.78% <0.00%> (-0.10%) ⬇️ |
| spark-client-hadoop-common | 48.04% <0.00%> (-0.12%) ⬇️ |
| spark-java-tests | 49.33% <0.00%> (-0.03%) ⬇️ |
| spark-scala-tests | 45.21% <0.00%> (-0.04%) ⬇️ |
| utilities | 37.28% <0.00%> (-0.10%) ⬇️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
| ...org/apache/hudi/common/model/HoodieFileFormat.java | 80.95% <ø> (ø) |
| ...java/org/apache/hudi/table/HoodieTableFactory.java | 75.94% <ø> (-0.21%) ⬇️ |
| ...g/apache/hudi/table/catalog/HoodieHiveCatalog.java | 48.46% <100.00%> (ø) |
| ...pache/hudi/io/storage/HoodieFileReaderFactory.java | 58.33% <0.00%> (ø) |
| ...pache/hudi/io/storage/HoodieFileWriterFactory.java | 71.87% <0.00%> (ø) |
| ...rg/apache/hudi/hadoop/HiveHoodieReaderContext.java | 80.64% <0.00%> (ø) |
| ...e/hudi/table/format/FlinkRowDataReaderContext.java | 77.10% <0.00%> (+6.37%) ⬆️ |
| .../hudi/table/format/cow/CopyOnWriteInputFormat.java | 45.51% <0.00%> (+2.22%) ⬆️ |
| .../hudi/table/format/mor/MergeOnReadInputFormat.java | 88.77% <0.00%> (-1.85%) ⬇️ |
| .../reader/function/HoodieCdcSplitReaderFunction.java | 16.21% <0.00%> (-0.46%) ⬇️ |
... and 1 more

... and 82 files with indirect coverage changes

@hudi-bot

hudi-bot commented Jun 5, 2026

Collaborator

CI report:

@cshuo left a comment

Collaborator

lgtm.

@danny0405 merged commit 091caad into apache:master on Jun 5, 2026
58 checks passed

Labels

size:M PR with lines of changes in (100, 300]

Development

Successfully merging this pull request may close these issues.

Support lance format for Flink MOR table

5 participants