refactor(core): Unify record key splitting and extraction#18842
Conversation
hudi-agent
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the contribution! This PR consolidates record key field parsing across the codebase by routing all call sites through a single KeyGenUtils.getRecordKeyFields helper. A few behavioral edge cases worth double-checking in the inline comments, particularly around an existing test that asserts untrimmed values and one place where only one side of a containsAll comparison is being normalized. Please take a look at any inline comments, and this should be ready for a Hudi committer or PMC member to take it from here. A couple of nits below — one test file that appears accidentally included, and one spot where the raw .split(",") wasn't replaced to complete the unification.
| return new String[]{}; | ||
| } | ||
| return recordKeyStr.split(","); | ||
| return KeyGenUtils.getRecordKeyFields(recordKeyStr).toArray(new String[0]); |
There was a problem hiding this comment.
🤖 Heads up — the existing TestOptionsResolver.testGetRecordKeys asserts new String[]{"uuid", " name"} (note the leading space) for input "uuid, name". Routing through KeyGenUtils.getRecordKeyFields will now trim, so that test (and testGetBucketIndexKeys at line 99) should break. Could you update them to reflect the new trimming behavior?
- AI-generated; verify before applying. React 👍/👎 to flag quality.
| .collect(Collectors.toSet()) | ||
| .containsAll(Arrays.asList(hoodieIndexConfig.getString(BUCKET_INDEX_HASH_FIELD).split(","))); | ||
| List<String> recordKeyFields = KeyGenUtils.getRecordKeyFields(hoodieIndexConfig.getString(KeyGeneratorOptions.RECORDKEY_FIELD_NAME)); | ||
| boolean valid = recordKeyFields.containsAll(Arrays.asList(hoodieIndexConfig.getString(BUCKET_INDEX_HASH_FIELD).split(","))); |
There was a problem hiding this comment.
🤖 The two sides of the containsAll are now normalized inconsistently — recordKeyFields is trimmed/empty-filtered via KeyGenUtils.getRecordKeyFields, but BUCKET_INDEX_HASH_FIELD is still a raw .split(","). If a user configures the hash field with whitespace (e.g. "a, b"), containsAll will now incorrectly fail since the record-key side is ["a", "b"] but the hash side is ["a", " b"]. Should the hash field be normalized through getRecordKeyFields too?
- AI-generated; verify before applying. React 👍/👎 to flag quality.
|
|
||
| class TestTryWithResourcesExceptionHandling { | ||
|
|
||
| @Test |
There was a problem hiding this comment.
🤖 This test file looks unrelated to the record-key parsing refactor — and as written the test has no assertions and unconditionally throws from the finally block, so it will always fail. Was it accidentally included in this PR?
- AI-generated; verify before applying. React 👍/👎 to flag quality.
|
|
||
| @Test | ||
| void testResourceCloseExceptionIsSwallowedWhenFinallyThrows() { | ||
| try (ThrowingResource ignored = new ThrowingResource("resource close failed")) { |
There was a problem hiding this comment.
🤖 nit: this file looks like it may have been accidentally included — it tests try-with-resources exception suppression, not record key splitting, and the test has no assertions (assertEquals/assertThrows are imported but never used). As written, the test also throws unconditionally from the finally block, which would always make it fail.
- AI-generated; verify before applying. React 👍/👎 to flag quality.
8774815 to
900e018
Compare
hudi-agent
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the contribution! This PR centralizes record key field splitting/extraction through KeyGenUtils.getRecordKeyFields, replacing several duplicated parsing snippets across the key gen, index config, Flink resolver, and Spark delete paths. Existing reviewers have flagged the main concerns (OptionsResolver test compatibility, asymmetric normalization in HoodieIndexConfig, and the seemingly unrelated try-with-resources test file). Please take a look at the inline comments from earlier rounds, and this should be ready for a Hudi committer or PMC member to take it from here. One consistency nit on the hash field split — otherwise the refactoring looks clean.
cc @yihua
| .collect(Collectors.toSet()) | ||
| .containsAll(Arrays.asList(hoodieIndexConfig.getString(BUCKET_INDEX_HASH_FIELD).split(","))); | ||
| List<String> recordKeyFields = KeyGenUtils.getRecordKeyFields(hoodieIndexConfig.getString(KeyGeneratorOptions.RECORDKEY_FIELD_NAME)); | ||
| boolean valid = recordKeyFields.containsAll(Arrays.asList(hoodieIndexConfig.getString(BUCKET_INDEX_HASH_FIELD).split(","))); |
There was a problem hiding this comment.
🤖 nit: the hash field is still split with a raw .split(",") here while the record key now goes through getRecordKeyFields() — could you apply the same helper to BUCKET_INDEX_HASH_FIELD so both sides get consistent trimming/empty-field filtering?
- AI-generated; verify before applying. React 👍/👎 to flag quality.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #18842 +/- ##
============================================
- Coverage 68.92% 66.52% -2.40%
+ Complexity 29095 28258 -837
============================================
Files 2509 2509
Lines 139528 139521 -7
Branches 17127 17127
============================================
- Hits 96167 92818 -3349
- Misses 35603 39087 +3484
+ Partials 7758 7616 -142
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
Describe the issue this Pull Request addresses
Record key configuration parsing was handled independently across key generation, index configuration, Flink option resolution, and Spark SQL delete logic. These duplicated parsing paths could diverge around trimming, empty entries, and field counting, making record-key behavior harder to reason about consistently.
This PR centralizes record key field splitting and extraction through
KeyGenUtils, then updates callers to use the shared helper.Summary and Changelog
KeyGenUtils.getRecordKeyFields(String)and routed the existingTypedPropertieshelper through it.CustomAvroKeyGenerator, SparkCustomKeyGenerator,HoodieIndexConfig, FlinkOptionsResolver, and Spark 3/4DeleteHoodieTableCommand.Impact
Record key parsing is normalized across affected Hudi modules, especially for whitespace and empty comma-separated entries, and reduces duplicated string-splitting logic and makes future record key parsing changes local to
KeyGenUtils.Risk Level
low
Documentation Update
Contributor's checklist