test(metadata): Add test coverage for deferred RLI init and bulk_insert#18865
nsivabalan wants to merge 1 commit into
Conversation
hudi-agent
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the contribution! This PR fixes RLI initialization when the data write config doesn't carry the Avro schema string (deferred init + 2nd commit, or metadata writer constructed for reads), by resolving the schema at the driver with a sensible fallback to tryResolveSchemaForTable. The plumbing is consistent across the RLI init chain and the new tests exercise the bulk_insert defer path end-to-end. No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.
cc @yihua
Follow-up to apache#18353 (defer RLI init for fresh tables) and apache#18836 (robust schema resolution during RLI bootstrap). Adds test coverage for the deferred RLI init flow:
- Extend testRecordLevelIndex with a deferRLIInit parameter so the existing assertions can also exercise the deferred-init configuration. When set, the test asserts that after the first save the RLI metadata partition is not yet present; the subsequent metadata writer entry then triggers the deferred bootstrap and the rest of the test flow validates the resulting index.
- Add testPartitionedRecordLevelIndexDefer, which drives the deferred path via the helper above and verifies compaction afterwards.
- Add testPartitionedRecordLevelIndexDeferWithBulkInsert, which issues two bulk_insert commits on a fresh table with defer enabled and validates the record key -> location mapping after the deferred RLI bootstraps on the second commit, including cross-partition negative lookups.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
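The deferred-init behavior these tests target can be illustrated with a small toy model. This is only a sketch under stated assumptions: the class DeferredIndexSketch and its methods are hypothetical and are not Hudi APIs, the "table" and "index" are plain in-memory maps, and the rule "skip index creation on the first commit when defer is enabled, bootstrap from existing data on the next commit" is a simplified reading of the flow described above.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only: none of these names are Hudi APIs.
final class DeferredIndexSketch {
    // Stand-in for the data table: record key -> partition path.
    private final Map<String, String> table = new LinkedHashMap<>();
    // Stand-in for the record-level index; null until it is bootstrapped.
    private Map<String, String> index;
    private final boolean deferInit;
    private int commitCount;

    DeferredIndexSketch(boolean deferInit) {
        this.deferInit = deferInit;
    }

    void commit(Map<String, String> batch) {
        boolean firstCommit = commitCount == 0;
        // With defer enabled, the first commit skips index creation entirely.
        if (index == null && !(deferInit && firstCommit)) {
            // Bootstrap from whatever the table already contains.
            index = new HashMap<>(table);
        }
        table.putAll(batch);
        if (index != null) {
            index.putAll(batch);
        }
        commitCount++;
    }

    boolean isIndexInitialized() {
        return index != null;
    }

    // Partition-scoped lookup: a key is found only in the partition it lives in.
    boolean lookup(String partition, String recordKey) {
        return index != null && partition.equals(index.get(recordKey));
    }

    public static void main(String[] args) {
        DeferredIndexSketch t = new DeferredIndexSketch(true);
        t.commit(Map.of("k1", "p1"));
        System.out.println(t.isIndexInitialized()); // false: init was deferred
        t.commit(Map.of("k2", "p2"));
        System.out.println(t.isIndexInitialized()); // true: bootstrapped on commit 2
        System.out.println(t.lookup("p1", "k1"));   // true: first batch was picked up
        System.out.println(t.lookup("p2", "k1"));   // false: cross-partition lookup
    }
}
```

The two checks after the second commit mirror the shape of the assertions described in the commit message: keys from the first batch must be present once the index bootstraps, and a key must not resolve under a partition it does not belong to.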
8cffc8e to eac93d0
Rebased on top of latest master. Note: #18836 (merged before this PR) already lands the schema-resolution fix on the RLI bootstrap path that my original commit was addressing. The Java change in this PR was therefore redundant on rebase and has been dropped — the PR is now scoped to test coverage only:
hudi-agent
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the contribution! This PR is test-only, adding coverage for the deferred RLI init flow (including the previously untested bulk_insert path) as a follow-up to #18353 and #18836. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. A couple of minor naming nits, otherwise clean and well-commented test code.
cc @yihua
@@ -146,7 +146,8 @@ class TestRecordLevelIndex extends RecordLevelIndexTestBase with SparkDatasetMix
"Metadata files partition count should be lower than data table file count after rebootstrap")
}
🤖 nit: deferRLIInit reads as an imperative verb rather than a boolean state — could you rename it to rliInitDeferred (or isRliInitDeferred) to match the past-participle style used by the sibling parameter streamingWriteEnabled?
- AI-generated; verify before applying. React 👍/👎 to flag quality.
metaClient.getIndexMetadata.get().getIndexDefinitions.get(HoodieTableMetadataUtil.PARTITION_NAME_RECORD_INDEX)),
"RLI should be initialized as partitioned RLI")

// Validate record key -> location mapping for both batches against the data.
🤖 nit: df is a bit misleading here since .collect() returns an Array[Row], not a DataFrame — something like allRows or tableRows would make the type/intent clearer for validateDFWithLocations callers reading below.
- AI-generated; verify before applying. React 👍/👎 to flag quality.
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18865      +/-   ##
============================================
- Coverage     68.78%   68.77%   -0.02%
+ Complexity    29142    29118      -24
============================================
  Files          2514     2514
  Lines        139914   139914
  Branches      17184    17187       +3
============================================
- Hits          96240    96224      -16
- Misses        35903    35910       +7
- Partials       7771     7780       +9
Flags with carried forward coverage won't be shown.
Describe the issue this Pull Request addresses
Follow-up to #18353 (defer RLI init for fresh tables) and #18836 (robust schema resolution during RLI bootstrap). #18836 already lands the schema-fallback fix on dataWriteConfig.getWriteSchema() for the RLI bootstrap path, which also unblocks the deferred-init scenario. This PR adds the test coverage that exercises the deferred RLI init flow end-to-end, including the bulk_insert path, which previously had no dedicated tests.
Tracking issue: #18866 (kept open as the test-coverage gap; the underlying fix is in #18836).
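For readers unfamiliar with the schema-fallback idea referenced above, the general pattern can be sketched as follows. This is a hedged illustration, not the code from #18836: SchemaFallbackSketch, resolveSchema, and the supplier argument are hypothetical names, and schemas are represented as plain strings. The only assumption taken from the description is the order of preference: use the schema carried by the write config when it is present, otherwise resolve it from the table.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical sketch of a "configured value, else resolve from table" fallback.
final class SchemaFallbackSketch {
    static Optional<String> resolveSchema(String configuredSchema,
                                          Supplier<Optional<String>> tableSchemaResolver) {
        // Prefer the schema string carried by the write config, if any.
        if (configuredSchema != null && !configuredSchema.trim().isEmpty()) {
            return Optional.of(configuredSchema);
        }
        // Otherwise fall back to resolving it from the table; this is only
        // invoked when needed, so a costly lookup is skipped in the common case.
        return tableSchemaResolver.get();
    }

    public static void main(String[] args) {
        Supplier<Optional<String>> fromTable = () -> Optional.of("schema-from-table");
        System.out.println(resolveSchema("schema-from-config", fromTable).get());
        System.out.println(resolveSchema("", fromTable).get());
        System.out.println(resolveSchema(null, Optional::empty).isPresent());
    }
}
```

Returning an Optional keeps the "no schema anywhere" case explicit, which matters for a fresh table that has no committed data to resolve a schema from.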
Summary and Changelog
Test-only changes in TestRecordLevelIndex.scala:
- Extend testRecordLevelIndex with a deferRLIInit parameter. When set, the test asserts that after the first save the RLI metadata partition is not yet present; the subsequent metadata-writer entry then triggers the deferred bootstrap and the rest of the test validates the resulting index.
- testPartitionedRecordLevelIndexDefer(streamingWriteEnabled): drives the deferred path via the existing helper and verifies compaction afterwards.
- testPartitionedRecordLevelIndexDeferWithBulkInsert(streamingWriteEnabled): commit #1 and commit #2 are both bulk_insert against a fresh table with defer enabled. Validates the record key -> location mapping after the deferred RLI bootstraps on the second commit, including cross-partition negative lookups.
Impact
User-facing changes: none. Test coverage only.
Performance impact: none.
Risk Level
low
Documentation Update
none
Contributor's checklist