test(metadata): Add test coverage for deferred RLI init and bulk_insert by nsivabalan · Pull Request #18865 · apache/hudi

nsivabalan · 2026-05-27T17:37:15Z

Describe the issue this Pull Request addresses

Follow-up to #18353 (defer RLI init for fresh tables) and #18836 (robust schema resolution during RLI bootstrap). #18836 already lands the schema-fallback fix on dataWriteConfig.getWriteSchema() for the RLI bootstrap path, which also unblocks the deferred-init scenario. This PR adds the test coverage that exercises the deferred RLI init flow end-to-end, including the bulk_insert path, which previously had no dedicated tests.

Tracking issue: #18866 (kept open as the test-coverage gap; the underlying fix is in #18836).

Summary and Changelog

Test-only changes in TestRecordLevelIndex.scala:

Extend testRecordLevelIndex with a deferRLIInit parameter. When set, the test asserts that after the first save the RLI metadata partition is not yet present; the subsequent metadata-writer entry then triggers the deferred bootstrap and the rest of the test validates the resulting index.
Add testPartitionedRecordLevelIndexDefer(streamingWriteEnabled): drives the deferred path via the existing helper and verifies compaction afterwards.
Add testPartitionedRecordLevelIndexDeferWithBulkInsert(streamingWriteEnabled): commit #1 and commit #2 are both bulk_insert against a fresh table with defer enabled. Validates:
- After commit #1 the RLI metadata partition is not initialized.
- After commit #2 the deferred RLI bootstrap completes (partition present, partitioned RLI type).
- Record-key -> location mapping is correct across all data partitions for both batches, including cross-partition negative lookups.
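
The behaviour these checks pin down can be illustrated with a small, self-contained toy model. This is only a sketch of the semantics described above, not Hudi code: `ToyTable`, `Location`, `bulkInsert`, and `lookup` are hypothetical names invented for illustration, and the real test drives Spark writes against an actual table instead. The model assumes that with defer enabled the first commit skips RLI initialization, and that the next commit bootstraps the index from the data already written before indexing its own records.

```scala
// Toy model only: these types are hypothetical and are not part of Hudi.
final case class Location(partition: String, fileId: String)

final class ToyTable(deferRliInit: Boolean) {
  // Partitioned RLI: a record key is scoped to its data partition.
  private type Key = (String, String)
  private var data = Map.empty[Key, Location]
  private var rli: Option[Map[Key, Location]] = None
  private var commits = 0

  def rliInitialized: Boolean = rli.isDefined

  def bulkInsert(partition: String, keys: Seq[String]): Unit = {
    // With defer enabled, the first commit on a fresh table skips RLI init.
    val skipInit = deferRliInit && commits == 0
    // Otherwise bootstrap the index from everything committed so far.
    if (!skipInit && rli.isEmpty) rli = Some(data)
    commits += 1
    val written =
      keys.map(k => (partition, k) -> Location(partition, s"file-$commits")).toMap
    data = data ++ written
    // Index the new batch only if the RLI exists at this point.
    rli = rli.map(_ ++ written)
  }

  def lookup(partition: String, key: String): Option[Location] =
    rli.flatMap(_.get((partition, key)))
}

object DeferredRliDemo {
  def main(args: Array[String]): Unit = {
    val table = new ToyTable(deferRliInit = true)
    table.bulkInsert("2024/01/01", Seq("k1", "k2"))
    assert(!table.rliInitialized) // commit #1: RLI not yet present
    table.bulkInsert("2024/01/02", Seq("k3"))
    assert(table.rliInitialized)  // commit #2: deferred bootstrap ran
    // Both batches are indexed; a key looked up in the wrong partition misses.
    assert(table.lookup("2024/01/01", "k1").contains(Location("2024/01/01", "file-1")))
    assert(table.lookup("2024/01/02", "k1").isEmpty)
  }
}
```

The cross-partition miss at the end mirrors the negative lookups in the third validation: a key written to one data partition should not resolve when queried against another.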

Impact

User-facing changes: none. Test coverage only.

Performance impact: none.

Risk Level

low

Documentation Update

none

Contributor's checklist

Read through contributor's guide
Enough context is provided in the sections above
Adequate tests were added if applicable

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR fixes RLI initialization when the data write config doesn't carry the Avro schema string (deferred init + 2nd commit, or metadata writer constructed for reads), by resolving the schema at the driver with a sensible fallback to tryResolveSchemaForTable. The plumbing is consistent across the RLI init chain and the new tests exercise the bulk_insert defer path end-to-end. No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.

cc @yihua

Follow-up to apache#18353 (defer RLI init for fresh tables) and apache#18836 (robust schema resolution during RLI bootstrap). Adds test coverage for the deferred RLI init flow:

- Extend testRecordLevelIndex with a deferRLIInit parameter so the existing assertions can also exercise the deferred-init configuration. When set, the test asserts that after the first save the RLI metadata partition is not yet present; the subsequent metadata writer entry then triggers the deferred bootstrap and the rest of the test flow validates the resulting index.
- Add testPartitionedRecordLevelIndexDefer which drives the deferred path via the helper above and verifies compaction afterwards.
- Add testPartitionedRecordLevelIndexDeferWithBulkInsert which issues two bulk_insert commits on a fresh table with defer enabled and validates the record key -> location mapping after the deferred RLI bootstraps on the second commit, including cross-partition negative lookups.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

nsivabalan · 2026-05-27T18:10:42Z

Rebased on top of latest master.

Note: #18836 (merged before this PR) already lands the schema-resolution fix on the RLI bootstrap path that my original commit was addressing. The Java change in this PR was therefore redundant on rebase and has been dropped — the PR is now scoped to test coverage only:

Extend testRecordLevelIndex with a deferRLIInit parameter.
Add testPartitionedRecordLevelIndexDefer.
Add testPartitionedRecordLevelIndexDeferWithBulkInsert (two bulk_insert commits on a fresh table with defer enabled, validating the record-key -> location mapping after the deferred RLI bootstraps on the 2nd commit).

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR is test-only, adding coverage for the deferred RLI init flow (including the previously untested bulk_insert path) as a follow-up to #18353 and #18836. No correctness issues found. The inline comments carry a couple of minor naming nits; otherwise the test code is clean and well-commented. Please take a look, after which a Hudi committer or PMC member can take it from here.

cc @yihua

hudi-agent · 2026-05-27T20:05:58Z

@@ -146,7 +146,8 @@ class TestRecordLevelIndex extends RecordLevelIndexTestBase with SparkDatasetMix
      "Metadata files partition count should be lower than data table file count after rebootstrap")
  }


🤖 nit: deferRLIInit reads as an imperative verb rather than a boolean state — could you rename it to rliInitDeferred (or isRliInitDeferred) to match the past-participle style used by the sibling parameter streamingWriteEnabled?

- AI-generated; verify before applying. React 👍/👎 to flag quality.

hudi-agent · 2026-05-27T20:05:58Z

+      metaClient.getIndexMetadata.get().getIndexDefinitions.get(HoodieTableMetadataUtil.PARTITION_NAME_RECORD_INDEX)),
+      "RLI should be initialized as partitioned RLI")
+
+    // Validate record key -> location mapping for both batches against the data.


🤖 nit: df is a bit misleading here since .collect() returns an Array[Row], not a DataFrame — something like allRows or tableRows would make the type/intent clearer for validateDFWithLocations callers reading below.

- AI-generated; verify before applying. React 👍/👎 to flag quality.

codecov-commenter · 2026-05-27T20:08:01Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.77%. Comparing base (1bf6b44) to head (eac93d0).

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #18865      +/-   ##
============================================
- Coverage     68.78%   68.77%   -0.02%     
+ Complexity    29142    29118      -24     
============================================
  Files          2514     2514              
  Lines        139914   139914              
  Branches      17184    17187       +3     
============================================
- Hits          96240    96224      -16     
- Misses        35903    35910       +7     
- Partials       7771     7780       +9

Flag	Coverage Δ
common-and-other-modules	`44.32% <ø> (-0.01%)`	⬇️
hadoop-mr-java-client	`44.91% <ø> (-0.01%)`	⬇️
spark-client-hadoop-common	`48.22% <ø> (-0.01%)`	⬇️
spark-java-tests	`49.34% <ø> (-0.03%)`	⬇️
spark-scala-tests	`45.27% <ø> (-0.01%)`	⬇️
utilities	`37.44% <ø> (+0.01%)`	⬆️

Flags with carried forward coverage won't be shown.
see 13 files with indirect coverage changes

hudi-bot · 2026-05-27T20:16:55Z

CI report:

eac93d0 Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

github-actions Bot added the size:M PR with lines of changes in (100, 300] label May 27, 2026

hudi-agent reviewed May 27, 2026

nsivabalan force-pushed the rliDeferInitbulkInsert branch from 8cffc8e to eac93d0 Compare May 27, 2026 18:09

nsivabalan changed the title ~~fix(metadata): Robust RLI init schema resolution for deferred fresh-table init~~ test(metadata): Add test coverage for deferred RLI init and bulk_insert May 27, 2026

hudi-agent reviewed May 27, 2026

nsivabalan wants to merge 1 commit into apache:master from nsivabalan:rliDeferInitbulkInsert

4 participants
