refactor(core): Unify record key splitting and extraction#18842

Open
cshuo wants to merge 1 commit into apache:master from cshuo:unify_record_key_extraction

Collaborator

@cshuo cshuo commented May 26, 2026

Describe the issue this Pull Request addresses

Record key configuration parsing was handled independently across key generation, index configuration, Flink option resolution, and Spark SQL delete logic. These duplicated parsing paths could diverge around trimming, empty entries, and field counting, making record-key behavior harder to reason about consistently.

This PR centralizes record key field splitting and extraction through KeyGenUtils, then updates callers to use the shared helper.

Summary and Changelog

  • Added KeyGenUtils.getRecordKeyFields(String) and routed the existing TypedProperties helper through it.
  • Updated key generator type inference to count parsed record key fields using the shared helper.
  • Replaced duplicated record key parsing in CustomAvroKeyGenerator, Spark CustomKeyGenerator, HoodieIndexConfig, Flink OptionsResolver, and Spark 3/4 DeleteHoodieTableCommand.
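
The normalization that the changelog describes can be sketched as follows. This is only an illustration in plain JDK code, assuming the shared helper trims each entry and drops empty ones (as the Impact section suggests); `RecordKeyFieldsSketch` and `splitRecordKeyFields` are hypothetical names, not the actual `KeyGenUtils` implementation.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for a shared record key splitter; not the actual KeyGenUtils code.
final class RecordKeyFieldsSketch {

  private RecordKeyFieldsSketch() {
  }

  // Splits a comma-separated record key config, trims each entry, and drops empty entries.
  static List<String> splitRecordKeyFields(String recordKeyConfig) {
    if (recordKeyConfig == null || recordKeyConfig.trim().isEmpty()) {
      return Collections.emptyList();
    }
    return Arrays.stream(recordKeyConfig.split(","))
        .map(String::trim)
        .filter(field -> !field.isEmpty())
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    System.out.println(splitRecordKeyFields("uuid, name"));  // prints [uuid, name]
    System.out.println(splitRecordKeyFields("a,,b, "));      // prints [a, b]
    System.out.println(splitRecordKeyFields("  ").size());   // prints 0
  }
}
```

With a single splitter like this, counting fields (for example, to infer a simple versus complex key generator) and extracting them read from the same parsed list, so the two cannot disagree about blank entries.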

Impact

Record key parsing is normalized across the affected Hudi modules, especially for whitespace and empty comma-separated entries. The change also reduces duplicated string-splitting logic and keeps future record key parsing changes local to KeyGenUtils.

Risk Level

low

Documentation Update

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

Contributor

@hudi-agent hudi-agent left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR consolidates record key field parsing across the codebase by routing all call sites through a single KeyGenUtils.getRecordKeyFields helper. The inline comments flag a few behavioral edge cases worth double-checking, particularly an existing test that asserts untrimmed values and one place where only one side of a containsAll comparison is normalized. Please take a look at the inline comments; after that, this should be ready for a Hudi committer or PMC member to take it from here. There are also a couple of nits: one test file that appears to have been accidentally included, and one spot where the raw .split(",") wasn't replaced to complete the unification.

  return new String[]{};
}
- return recordKeyStr.split(",");
+ return KeyGenUtils.getRecordKeyFields(recordKeyStr).toArray(new String[0]);
Contributor

🤖 Heads up — the existing TestOptionsResolver.testGetRecordKeys asserts new String[]{"uuid", " name"} (note the leading space) for input "uuid, name". Routing through KeyGenUtils.getRecordKeyFields will now trim, so that test (and testGetBucketIndexKeys at line 99) should break. Could you update them to reflect the new trimming behavior?

- AI-generated; verify before applying. React 👍/👎 to flag quality.
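
To illustrate the difference this comment points at, here is a minimal plain-JDK sketch comparing a raw split with a trimming split on the input "uuid, name". It does not call any Hudi code; `RawSplitVersusTrimmedSplit` and its methods are hypothetical names, and the trimming variant only assumes the behavior attributed to the shared helper above.

```java
import java.util.Arrays;

// Plain-JDK illustration; names are hypothetical and no Hudi classes are involved.
final class RawSplitVersusTrimmedSplit {

  private RawSplitVersusTrimmedSplit() {
  }

  // Old-style parsing: split on commas and keep whatever whitespace surrounds each entry.
  static String[] rawSplit(String config) {
    return config.split(",");
  }

  // Assumed new-style parsing: trim each entry and drop entries that end up empty.
  static String[] trimmedSplit(String config) {
    return Arrays.stream(config.split(","))
        .map(String::trim)
        .filter(field -> !field.isEmpty())
        .toArray(String[]::new);
  }

  public static void main(String[] args) {
    String config = "uuid, name";
    // The second raw entry keeps its leading space: " name".
    System.out.println("[" + rawSplit(config)[1] + "]");      // prints [ name]
    // The second trimmed entry has no leading space: "name".
    System.out.println("[" + trimmedSplit(config)[1] + "]");  // prints [name]
  }
}
```

A test that expects `" name"` encodes the raw behavior, so it would need its expected array changed to the trimmed values once the call site uses a trimming helper.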

- .collect(Collectors.toSet())
- .containsAll(Arrays.asList(hoodieIndexConfig.getString(BUCKET_INDEX_HASH_FIELD).split(",")));
+ List<String> recordKeyFields = KeyGenUtils.getRecordKeyFields(hoodieIndexConfig.getString(KeyGeneratorOptions.RECORDKEY_FIELD_NAME));
+ boolean valid = recordKeyFields.containsAll(Arrays.asList(hoodieIndexConfig.getString(BUCKET_INDEX_HASH_FIELD).split(",")));
Contributor

🤖 The two sides of the containsAll are now normalized inconsistently — recordKeyFields is trimmed/empty-filtered via KeyGenUtils.getRecordKeyFields, but BUCKET_INDEX_HASH_FIELD is still a raw .split(","). If a user configures the hash field with whitespace (e.g. "a, b"), containsAll will now incorrectly fail since the record-key side is ["a", "b"] but the hash side is ["a", " b"]. Should the hash field be normalized through getRecordKeyFields too?

- AI-generated; verify before applying. React 👍/👎 to flag quality.
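
The mismatch described in this comment can be reproduced with plain collections. The sketch below is an assumption-laden illustration, not Hudi code: `AsymmetricContainsAllSketch`, `rawFields`, and `normalizedFields` are hypothetical names, and the normalized variant simply trims and drops empty entries.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Plain-JDK illustration; the class and method names are hypothetical, not Hudi APIs.
final class AsymmetricContainsAllSketch {

  private AsymmetricContainsAllSketch() {
  }

  // Raw split: keeps surrounding whitespace on every entry.
  static List<String> rawFields(String csv) {
    return Arrays.asList(csv.split(","));
  }

  // Normalized split: trims every entry and drops empty ones.
  static List<String> normalizedFields(String csv) {
    return Arrays.stream(csv.split(","))
        .map(String::trim)
        .filter(field -> !field.isEmpty())
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    String recordKeys = "a, b";
    String hashFields = "a, b";

    // Both sides raw: each list holds "a" and " b", so the check passes.
    System.out.println(rawFields(recordKeys).containsAll(rawFields(hashFields)));         // prints true
    // Only the record key side normalized: ["a", "b"] does not contain " b".
    System.out.println(normalizedFields(recordKeys).containsAll(rawFields(hashFields)));  // prints false
  }
}
```

The same configuration that validated when both sides were split raw stops validating once only one side is normalized, which is the regression risk the comment raises.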


class TestTryWithResourcesExceptionHandling {

@Test
Contributor

🤖 This test file looks unrelated to the record-key parsing refactor — and as written the test has no assertions and unconditionally throws from the finally block, so it will always fail. Was it accidentally included in this PR?

- AI-generated; verify before applying. React 👍/👎 to flag quality.


@Test
void testResourceCloseExceptionIsSwallowedWhenFinallyThrows() {
try (ThrowingResource ignored = new ThrowingResource("resource close failed")) {
Contributor

🤖 nit: this file looks like it may have been accidentally included — it tests try-with-resources exception suppression, not record key splitting, and the test has no assertions (assertEquals/assertThrows are imported but never used). As written, the test also throws unconditionally from the finally block, which would always make it fail.

- AI-generated; verify before applying. React 👍/👎 to flag quality.
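
For context on what such a test would need to assert, the sketch below shows the Java language behavior the test name refers to: when `close()` throws and the `finally` block then throws as well, the exception from `finally` is the one that escapes, and the close failure is discarded rather than attached as suppressed. The `ThrowingResource` here is an assumed reconstruction (only its name and constructor argument appear in the PR excerpt), and `CloseExceptionMaskedByFinally` and `runAndCapture` are hypothetical names.

```java
// Illustration of try-with-resources semantics; ThrowingResource is an assumed reconstruction.
final class CloseExceptionMaskedByFinally {

  private CloseExceptionMaskedByFinally() {
  }

  // Resource whose close() always fails with the given message.
  static final class ThrowingResource implements AutoCloseable {
    private final String message;

    ThrowingResource(String message) {
      this.message = message;
    }

    @Override
    public void close() {
      throw new IllegalStateException(message);
    }
  }

  // Captures the exception that escapes when close() throws and finally then throws too.
  static RuntimeException runAndCapture() {
    try {
      try (ThrowingResource ignored = new ThrowingResource("resource close failed")) {
        // The body completes normally, so the close() failure becomes the pending exception.
      } finally {
        // Throwing here replaces the pending close() exception.
        throw new IllegalArgumentException("finally failed");
      }
    } catch (RuntimeException e) {
      return e;
    }
  }

  public static void main(String[] args) {
    RuntimeException escaped = runAndCapture();
    System.out.println(escaped.getClass().getSimpleName());  // prints IllegalArgumentException
    System.out.println(escaped.getSuppressed().length);      // prints 0
  }
}
```

Capturing the escaping exception and asserting on its type and suppressed list is one way to make a test like this pass deterministically instead of failing on the unconditional throw.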

@cshuo cshuo force-pushed the unify_record_key_extraction branch from 8774815 to 900e018 Compare May 26, 2026 07:26
@github-actions github-actions Bot added the size:S PR with lines of changes in (10, 100] label May 26, 2026
Contributor

@hudi-agent hudi-agent left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR centralizes record key field splitting/extraction through KeyGenUtils.getRecordKeyFields, replacing several duplicated parsing snippets across the key gen, index config, Flink resolver, and Spark delete paths. Existing reviewers have flagged the main concerns (OptionsResolver test compatibility, asymmetric normalization in HoodieIndexConfig, and the seemingly unrelated try-with-resources test file). Please take a look at the inline comments from earlier rounds, and this should be ready for a Hudi committer or PMC member to take it from here. One consistency nit on the hash field split — otherwise the refactoring looks clean.

cc @yihua

- .collect(Collectors.toSet())
- .containsAll(Arrays.asList(hoodieIndexConfig.getString(BUCKET_INDEX_HASH_FIELD).split(",")));
+ List<String> recordKeyFields = KeyGenUtils.getRecordKeyFields(hoodieIndexConfig.getString(KeyGeneratorOptions.RECORDKEY_FIELD_NAME));
+ boolean valid = recordKeyFields.containsAll(Arrays.asList(hoodieIndexConfig.getString(BUCKET_INDEX_HASH_FIELD).split(",")));
Contributor

🤖 nit: the hash field is still split with a raw .split(",") here while the record key now goes through getRecordKeyFields() — could you apply the same helper to BUCKET_INDEX_HASH_FIELD so both sides get consistent trimming/empty-field filtering?

- AI-generated; verify before applying. React 👍/👎 to flag quality.
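
One possible shape for the suggested fix is to run both configuration strings through the same splitter before comparing them. The sketch below uses plain JDK code under that assumption; `SymmetricHashFieldCheck`, `splitFields`, and `hashFieldsAreRecordKeys` are hypothetical names, and `splitFields` merely stands in for whatever shared helper the PR settles on.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Plain-JDK illustration of symmetric normalization; names are hypothetical, not Hudi APIs.
final class SymmetricHashFieldCheck {

  private SymmetricHashFieldCheck() {
  }

  // Stand-in for a shared splitter: trims every entry and drops empty ones.
  static List<String> splitFields(String csv) {
    return Arrays.stream(csv.split(","))
        .map(String::trim)
        .filter(field -> !field.isEmpty())
        .collect(Collectors.toList());
  }

  // Checks that every hash field is also a record key field, normalizing both sides the same way.
  static boolean hashFieldsAreRecordKeys(String recordKeyConfig, String hashFieldConfig) {
    return splitFields(recordKeyConfig).containsAll(splitFields(hashFieldConfig));
  }

  public static void main(String[] args) {
    System.out.println(hashFieldsAreRecordKeys("a, b, c", "a, b"));  // prints true
    System.out.println(hashFieldsAreRecordKeys("a,b", "a, d"));      // prints false
  }
}
```

One caveat of this sketch: a hash field config that contains only blanks or commas normalizes to an empty list, and `containsAll` of an empty list is vacuously true, so a real validator may want a separate non-empty check.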

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 66.52%. Comparing base (b5c5801) to head (900e018).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18842      +/-   ##
============================================
- Coverage     68.92%   66.52%   -2.40%     
+ Complexity    29095    28258     -837     
============================================
  Files          2509     2509              
  Lines        139528   139521       -7     
  Branches      17127    17127              
============================================
- Hits          96167    92818    -3349     
- Misses        35603    39087    +3484     
+ Partials       7758     7616     -142     
Flag                          Coverage Δ
common-and-other-modules      37.20% <90.00%> (-7.21%) ⬇️
hadoop-mr-java-client         44.91% <0.00%> (+0.01%) ⬆️
spark-client-hadoop-common    48.22% <100.00%> (-0.01%) ⬇️
spark-java-tests              49.34% <100.00%> (+<0.01%) ⬆️
spark-scala-tests             45.27% <100.00%> (-0.02%) ⬇️
utilities                     37.42% <100.00%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines                                  Coverage Δ
...java/org/apache/hudi/config/HoodieIndexConfig.java     89.29% <100.00%> (-0.07%) ⬇️
...org/apache/hudi/keygen/CustomAvroKeyGenerator.java     80.64% <100.00%> (-1.18%) ⬇️
.../main/java/org/apache/hudi/keygen/KeyGenUtils.java     86.77% <100.00%> (+0.48%) ⬆️
...ava/org/apache/hudi/keygen/CustomKeyGenerator.java     76.47% <100.00%> (-2.11%) ⬇️
...org/apache/hudi/configuration/OptionsResolver.java     72.07% <100.00%> (-0.36%) ⬇️

... and 134 files with indirect coverage changes

@hudi-bot
Collaborator

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

Labels

size:S PR with lines of changes in (10, 100]

4 participants