
feat(metadata): Defer RLI initialization for fresh tables to optimize file group allocation#18353

Merged
nsivabalan merged 7 commits into apache:master from nsivabalan:deferRliFreshTable
Mar 25, 2026

@nsivabalan nsivabalan commented Mar 19, 2026

Describe the issue this Pull Request addresses

This PR optimizes Record Level Index (RLI) initialization for fresh Hudi tables by deferring RLI bootstrapping to the second commit. This enhancement allows for dynamic file group count determination based on actual data characteristics, improving both performance and resource utilization.

Key Benefits:

  • Reduces initial metadata table setup overhead for fresh tables
  • Enables data-driven file group count estimation for RLI partitions
  • Improves scalability for tables with varying data volumes
  • Works for both global and partitioned RLI configurations

Summary and Changelog

Core Implementation (HoodieBackedTableMetadataWriter.java)

  • Added logic to defer RLI partition initialization when the metadata table is first created (that is, when the data table has 0 completed instants)
  • Deferred initialization applies to both RECORD_INDEX (partitioned RLI) and global RLI when enabled via config
  • The RLI partition is automatically initialized on a subsequent commit with a programmatically determined file group count
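
The deferral rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in HoodieBackedTableMetadataWriter.java: the class name, method names, and the partition strings used here are all hypothetical, and the only facts taken from the description are the two conditions (config enabled, data table has 0 completed instants).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the deferral decision; none of these names are Hudi's actual API.
final class DeferredRliSketch {
    // Illustrative label for the RLI metadata partition.
    static final String RECORD_INDEX = "record_index";

    /** True when RLI initialization should be postponed to a later commit. */
    static boolean shouldDeferRli(boolean deferConfigEnabled, int completedDataTableInstants) {
        // Defer only for a fresh table: opt-in config is on and nothing has been committed yet.
        return deferConfigEnabled && completedDataTableInstants == 0;
    }

    /** Returns the metadata partitions to initialize on the current commit. */
    static List<String> partitionsToInitialize(List<String> enabledPartitions,
                                               boolean deferConfigEnabled,
                                               int completedDataTableInstants) {
        List<String> result = new ArrayList<>();
        for (String partition : enabledPartitions) {
            boolean isRli = RECORD_INDEX.equals(partition);
            if (isRli && shouldDeferRli(deferConfigEnabled, completedDataTableInstants)) {
                continue; // skipped now, picked up once the data table has a completed instant
            }
            result.add(partition);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> enabled = List.of("files", "record_index");
        System.out.println(partitionsToInitialize(enabled, true, 0));  // [files]
        System.out.println(partitionsToInitialize(enabled, true, 1));  // [files, record_index]
        System.out.println(partitionsToInitialize(enabled, false, 0)); // [files, record_index]
    }
}
```

With the config disabled, the sketch behaves the same on every commit, which mirrors the backward-compatibility claim in this PR.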

Configuration (HoodieMetadataConfig.java)

  • New Config: hoodie.metadata.record.level.index.defer.for.fresh.table
    • Type: Boolean (default: false)
    • Category: Advanced config for power users
    • Purpose: Controls whether RLI initialization is deferred to 2nd commit
    • Impact: When enabled, allows file group count to be estimated based on initial data patterns rather than using defaults
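
As a rough illustration of opting in, the property can be set like any other key-value writer option. The sketch below uses only java.util.Properties and the key named above; the class and helper names are hypothetical, and enabling RLI itself is assumed to be done separately through its own existing configuration, which is not shown here.

```java
import java.util.Properties;

// Hypothetical helper; only the property key comes from this PR.
final class DeferRliConfigSketch {
    static final String DEFER_KEY = "hoodie.metadata.record.level.index.defer.for.fresh.table";

    /** Builds writer properties with deferred RLI initialization switched on. */
    static Properties writerProps() {
        Properties props = new Properties();
        props.setProperty(DEFER_KEY, "true"); // opt in; the default is false
        return props;
    }

    /** Reads the flag back, falling back to the documented default of false. */
    static boolean isDeferEnabled(Properties props) {
        return Boolean.parseBoolean(props.getProperty(DEFER_KEY, "false"));
    }

    public static void main(String[] args) {
        System.out.println(isDeferEnabled(writerProps()));    // true
        System.out.println(isDeferEnabled(new Properties())); // false
    }
}
```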

Testing

  • Added comprehensive tests validating deferred RLI initialization behavior:
    • testPartitionedRecordIndexDeferredInitializationForFreshTable(): Validates partitioned RLI is NOT initialized on first commit but is initialized on second commit with correct file group count
    • testGlobalRecordIndexDeferredInitialization(): Validates global RLI deferred initialization with programmatic file group determination
  • Updated existing tests to account for the new RLI initialization timing:
    • TestMetadataWriterCommit.testCreateHandleRLIStats(): Added data table timeline commit to properly test deferred initialization
    • TestSparkRDDMetadataWriteClient: Tests remain unchanged as config is disabled by default

Breaking Changes

None - This is a backward-compatible change controlled by an opt-in configuration flag.

Impact

User-facing changes:

  • Behavioral change (non-breaking): For new tables with RLI enabled and hoodie.metadata.record.level.index.defer.for.fresh.table set to true, the RLI partition is not available after the first commit; it becomes available starting from the second commit
  • Performance improvement: Reduced overhead for small tables, better scaling for large bootstrap scenarios
  • Config changes needed: Existing configurations continue to work; the optimization is opt-in.

Performance impact:

  • Small tables (< 1000 records): Expected reduction from 10 file groups to 1-2 file groups, reducing metadata table overhead
  • Large bootstrap tables (> 100K records): Better distribution across more file groups within max bounds
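
To make the idea of a data-driven file group count concrete, here is one way such an estimate could be computed: divide the observed record count by a target number of records per file group, round up, and clamp to configured bounds. The formula, the parameter names, and the numbers in the example are assumptions for illustration only; the PR description does not state the estimation logic Hudi actually uses.

```java
// Hypothetical estimator; the formula and all parameter names are illustrative assumptions.
final class RliFileGroupEstimateSketch {
    /**
     * Estimates a file group count as ceil(recordCount / targetRecordsPerFileGroup),
     * clamped to the range [minFileGroups, maxFileGroups].
     */
    static int estimateFileGroupCount(long recordCount, long targetRecordsPerFileGroup,
                                      int minFileGroups, int maxFileGroups) {
        if (targetRecordsPerFileGroup <= 0 || minFileGroups < 1 || maxFileGroups < minFileGroups) {
            throw new IllegalArgumentException("invalid target or bounds");
        }
        long records = Math.max(recordCount, 0L);
        // Integer ceiling division without floating point.
        long needed = (records + targetRecordsPerFileGroup - 1) / targetRecordsPerFileGroup;
        return (int) Math.max(minFileGroups, Math.min(maxFileGroups, needed));
    }

    public static void main(String[] args) {
        // Example bounds (1..10) and target (100,000 records per file group) are made up.
        System.out.println(estimateFileGroupCount(800L, 100_000L, 1, 10));        // 1
        System.out.println(estimateFileGroupCount(500_000L, 100_000L, 1, 10));    // 5
        System.out.println(estimateFileGroupCount(50_000_000L, 100_000L, 1, 10)); // 10 (clamped to max)
    }
}
```

Deferring initialization to the second commit is what makes the record count from the first commit available as an input to an estimate of this kind.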

Risk Level

Low

Documentation Update

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Mar 19, 2026
@nsivabalan

@codope : addressed all feedback

@yihua yihua left a comment

The approach of deferring RLI initialization for fresh tables is clean and well-contained. A couple of minor items in the inline comments; the main logic looks sound.

@yihua yihua left a comment

LGTM

@hudi-bot

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@codecov-commenter

Codecov Report

❌ Patch coverage is 93.75000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 68.37%. Comparing base (a16d431) to head (46cc6b3).
⚠️ Report is 23 commits behind head on master.

Files with missing lines                                 Patch %   Lines
...hudi/metadata/HoodieBackedTableMetadataWriter.java    75.00%    0 Missing and 1 partial ⚠️

Additional details and impacted files
@@              Coverage Diff              @@
##             master   #18353       +/-   ##
=============================================
+ Coverage     57.27%   68.37%   +11.09%     
- Complexity    18669    27572     +8903     
=============================================
  Files          1957     2433      +476     
  Lines        107176   133268    +26092     
  Branches      13267    16034     +2767     
=============================================
+ Hits          61388    91125    +29737     
+ Misses        39974    35085     -4889     
- Partials       5814     7058     +1244     

Flag                          Coverage Δ
common-and-other-modules      44.34% <68.75%> (?)
hadoop-mr-java-client         45.16% <68.75%> (-0.01%) ⬇️
spark-client-hadoop-common    48.57% <68.75%> (?)
spark-java-tests              48.71% <93.75%> (+1.24%) ⬆️
spark-scala-tests             45.39% <56.25%> (-0.15%) ⬇️
utilities                     38.53% <68.75%> (?)

Flags with carried forward coverage won't be shown.

Files with missing lines                                 Coverage Δ
...pache/hudi/common/config/HoodieMetadataConfig.java    88.65% <100.00%> (+5.29%) ⬆️
...hudi/metadata/HoodieBackedTableMetadataWriter.java    84.01% <75.00%> (+4.22%) ⬆️

... and 1295 files with indirect coverage changes

@nsivabalan

azure CI succeeded

@nsivabalan nsivabalan merged commit 69fa35b into apache:master Mar 25, 2026
55 of 56 checks passed

Labels

size:M PR with lines of changes in (100, 300]

5 participants