refactor(core): Unify record key splitting and extraction by cshuo · Pull Request #18842 · apache/hudi

cshuo · 2026-05-26T06:53:58Z

Describe the issue this Pull Request addresses

Record key configuration parsing was handled independently across key generation, index configuration, Flink option resolution, and Spark SQL delete logic. These duplicated parsing paths could diverge around trimming, empty entries, and field counting, making record-key behavior harder to reason about consistently.

This PR centralizes record key field splitting and extraction through KeyGenUtils, then updates callers to use the shared helper.

Summary and Changelog

Added KeyGenUtils.getRecordKeyFields(String) and routed the existing TypedProperties helper through it.
Updated key generator type inference to count parsed record key fields using the shared helper.
Replaced duplicated record key parsing in CustomAvroKeyGenerator, Spark CustomKeyGenerator, HoodieIndexConfig, Flink OptionsResolver, and Spark 3/4 DeleteHoodieTableCommand.
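
The shared helper's behavior, as described above (split on commas, trim, drop empty entries), can be sketched as follows. This is a minimal illustration rather than the actual KeyGenUtils code: the class name RecordKeyFieldsSketch and the null/blank handling are assumptions, not taken from the PR.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for the shared helper; not the actual KeyGenUtils code.
final class RecordKeyFieldsSketch {

  // Splits a comma-separated record key config value, trims each entry,
  // and drops entries that are empty after trimming.
  // Returning an empty list for null or blank input is an assumption.
  static List<String> getRecordKeyFields(String recordKeyConfig) {
    if (recordKeyConfig == null || recordKeyConfig.trim().isEmpty()) {
      return Collections.emptyList();
    }
    return Arrays.stream(recordKeyConfig.split(","))
        .map(String::trim)
        .filter(field -> !field.isEmpty())
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    System.out.println(getRecordKeyFields("uuid, name")); // [uuid, name]
    System.out.println(getRecordKeyFields("a,,b, "));      // [a, b]
  }
}
```

With one helper of this shape, callers that only need a field count can use the size of the returned list instead of re-splitting the raw string.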

Impact

Record key parsing is normalized across the affected Hudi modules, especially for whitespace and empty comma-separated entries. The change also reduces duplicated string-splitting logic and makes future record key parsing changes local to KeyGenUtils.

Risk Level

low

Documentation Update

Contributor's checklist

Read through contributor's guide
Enough context is provided in the sections above
Adequate tests were added if applicable

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR consolidates record key field parsing across the codebase by routing all call sites through a single KeyGenUtils.getRecordKeyFields helper. A few behavioral edge cases are worth double-checking in the inline comments, particularly an existing test that asserts untrimmed values and one place where only one side of a containsAll comparison is normalized. There are also a couple of nits: one test file that appears to have been accidentally included, and one spot where the raw .split(",") wasn't replaced to complete the unification. Please take a look at the inline comments, and this should then be ready for a Hudi committer or PMC member to take it from here.

hudi-agent · 2026-05-26T07:02:15Z

-      return new String[]{};
-    }
-    return recordKeyStr.split(",");
+    return KeyGenUtils.getRecordKeyFields(recordKeyStr).toArray(new String[0]);


🤖 Heads up — the existing TestOptionsResolver.testGetRecordKeys asserts new String[]{"uuid", " name"} (note the leading space) for input "uuid, name". Routing through KeyGenUtils.getRecordKeyFields will now trim, so that test (and testGetBucketIndexKeys at line 99) should break. Could you update them to reflect the new trimming behavior?

_{- AI-generated; verify before applying. React 👍/👎 to flag quality.}
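
To make the behavioral change in the comment above concrete, the sketch below contrasts a raw split with a split that trims and drops empty entries, using the same input "uuid, name". The class and method names are hypothetical, and the trimming variant only approximates what the shared helper is described as doing.

```java
import java.util.Arrays;

// Hypothetical comparison of the old and new splitting behavior.
final class RecordKeySplitComparison {

  // Raw split: whitespace around entries is preserved.
  static String[] rawSplit(String recordKeyStr) {
    return recordKeyStr.split(",");
  }

  // Split, then trim each entry and drop entries that end up empty.
  static String[] trimmedSplit(String recordKeyStr) {
    return Arrays.stream(recordKeyStr.split(","))
        .map(String::trim)
        .filter(s -> !s.isEmpty())
        .toArray(String[]::new);
  }

  public static void main(String[] args) {
    // The second element keeps its leading space only in the raw variant.
    System.out.println(Arrays.toString(rawSplit("uuid, name")));
    System.out.println(Arrays.toString(trimmedSplit("uuid, name")));
  }
}
```

A test written against the raw variant would expect {"uuid", " name"}, while one written against the trimming variant would expect {"uuid", "name"}.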

hudi-agent · 2026-05-26T07:02:15Z

-              .collect(Collectors.toSet())
-              .containsAll(Arrays.asList(hoodieIndexConfig.getString(BUCKET_INDEX_HASH_FIELD).split(",")));
+          List<String> recordKeyFields = KeyGenUtils.getRecordKeyFields(hoodieIndexConfig.getString(KeyGeneratorOptions.RECORDKEY_FIELD_NAME));
+          boolean valid = recordKeyFields.containsAll(Arrays.asList(hoodieIndexConfig.getString(BUCKET_INDEX_HASH_FIELD).split(",")));


🤖 The two sides of the containsAll are now normalized inconsistently — recordKeyFields is trimmed/empty-filtered via KeyGenUtils.getRecordKeyFields, but BUCKET_INDEX_HASH_FIELD is still a raw .split(","). If a user configures the hash field with whitespace (e.g. "a, b"), containsAll will now incorrectly fail since the record-key side is ["a", "b"] but the hash side is ["a", " b"]. Should the hash field be normalized through getRecordKeyFields too?

_{- AI-generated; verify before applying. React 👍/👎 to flag quality.}
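
The asymmetry described above can be reproduced in isolation. The sketch below is illustrative only: normalize is a hypothetical stand-in for the shared helper, and the two check methods are simplified versions of the validation, not the HoodieIndexConfig code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical reproduction of the one-sided normalization issue.
final class ContainsAllNormalizationSketch {

  // Stand-in for the shared helper: split, trim, drop empty entries.
  static List<String> normalize(String csv) {
    return Arrays.stream(csv.split(","))
        .map(String::trim)
        .filter(s -> !s.isEmpty())
        .collect(Collectors.toList());
  }

  // Only the record key side is normalized; the hash side is a raw split.
  static boolean checkWithRawHashSplit(String recordKeys, String hashFields) {
    return normalize(recordKeys).containsAll(Arrays.asList(hashFields.split(",")));
  }

  // Both sides go through the same normalization.
  static boolean checkWithNormalizedHashSplit(String recordKeys, String hashFields) {
    return normalize(recordKeys).containsAll(normalize(hashFields));
  }

  public static void main(String[] args) {
    System.out.println(checkWithRawHashSplit("a, b", "a, b"));        // false
    System.out.println(checkWithNormalizedHashSplit("a, b", "a, b")); // true
  }
}
```

The first check fails even though both settings are textually identical, because " b" on the hash side never equals the trimmed "b" on the record key side.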

hudi-agent · 2026-05-26T07:02:15Z

+
+class TestTryWithResourcesExceptionHandling {
+
+  @Test


🤖 This test file looks unrelated to the record-key parsing refactor — and as written the test has no assertions and unconditionally throws from the finally block, so it will always fail. Was it accidentally included in this PR?

_{- AI-generated; verify before applying. React 👍/👎 to flag quality.}

hudi-agent · 2026-05-26T07:02:15Z

+
+  @Test
+  void testResourceCloseExceptionIsSwallowedWhenFinallyThrows() {
+      try (ThrowingResource ignored = new ThrowingResource("resource close failed")) {


🤖 nit: this file looks like it may have been accidentally included — it tests try-with-resources exception suppression, not record key splitting, and the test has no assertions (assertEquals/assertThrows are imported but never used). As written, the test also throws unconditionally from the finally block, which would always make it fail.

_{- AI-generated; verify before applying. React 👍/👎 to flag quality.}
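
For context on why such a test would always fail as written: when a finally block attached to a try-with-resources statement throws, that exception replaces one thrown by close(), and it still escapes the test method unless something catches it. The sketch below illustrates the language behavior only. The body of ThrowingResource and the surrounding class are assumptions, since the PR's file is not shown in full.

```java
// Hypothetical illustration of a finally block overriding a close() exception.
final class FinallyOverridesCloseSketch {

  // Assumed shape of the resource: close() always throws.
  static final class ThrowingResource implements AutoCloseable {
    private final String message;

    ThrowingResource(String message) {
      this.message = message;
    }

    @Override
    public void close() {
      throw new IllegalStateException(message);
    }
  }

  // Returns the exception that escapes the try-with-resources statement.
  static RuntimeException escapingException() {
    try {
      try (ThrowingResource ignored = new ThrowingResource("resource close failed")) {
        // The body completes normally; close() throws afterwards.
      } finally {
        // This exception replaces the one thrown by close().
        throw new RuntimeException("finally failed");
      }
    } catch (RuntimeException e) {
      return e;
    }
  }

  public static void main(String[] args) {
    RuntimeException e = escapingException();
    System.out.println(e.getMessage());           // finally failed
    System.out.println(e.getSuppressed().length); // 0
  }
}
```

A test that intends to pass would need to catch the escaping exception (for example with an assertThrows-style check) and assert on it, rather than letting it propagate.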

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR centralizes record key field splitting/extraction through KeyGenUtils.getRecordKeyFields, replacing several duplicated parsing snippets across the key gen, index config, Flink resolver, and Spark delete paths. Existing reviewers have flagged the main concerns (OptionsResolver test compatibility, asymmetric normalization in HoodieIndexConfig, and the seemingly unrelated try-with-resources test file). Please take a look at the inline comments from earlier rounds, and this should be ready for a Hudi committer or PMC member to take it from here. One consistency nit on the hash field split — otherwise the refactoring looks clean.

cc @yihua

hudi-agent · 2026-05-26T08:13:56Z

-              .collect(Collectors.toSet())
-              .containsAll(Arrays.asList(hoodieIndexConfig.getString(BUCKET_INDEX_HASH_FIELD).split(",")));
+          List<String> recordKeyFields = KeyGenUtils.getRecordKeyFields(hoodieIndexConfig.getString(KeyGeneratorOptions.RECORDKEY_FIELD_NAME));
+          boolean valid = recordKeyFields.containsAll(Arrays.asList(hoodieIndexConfig.getString(BUCKET_INDEX_HASH_FIELD).split(",")));


🤖 nit: the hash field is still split with a raw .split(",") here while the record key now goes through getRecordKeyFields() — could you apply the same helper to BUCKET_INDEX_HASH_FIELD so both sides get consistent trimming/empty-field filtering?

_{- AI-generated; verify before applying. React 👍/👎 to flag quality.}

codecov-commenter · 2026-05-26T08:47:49Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 66.52%. Comparing base (b5c5801) to head (900e018).
⚠️ Report is 1 commit behind head on master.

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #18842      +/-   ##
============================================
- Coverage     68.92%   66.52%   -2.40%     
+ Complexity    29095    28258     -837     
============================================
  Files          2509     2509              
  Lines        139528   139521       -7     
  Branches      17127    17127              
============================================
- Hits          96167    92818    -3349     
- Misses        35603    39087    +3484     
+ Partials       7758     7616     -142

Flag	Coverage Δ
common-and-other-modules	`37.20% <90.00%> (-7.21%)`	⬇️
hadoop-mr-java-client	`44.91% <0.00%> (+0.01%)`	⬆️
spark-client-hadoop-common	`48.22% <100.00%> (-0.01%)`	⬇️
spark-java-tests	`49.34% <100.00%> (+<0.01%)`	⬆️
spark-scala-tests	`45.27% <100.00%> (-0.02%)`	⬇️
utilities	`37.42% <100.00%> (-0.02%)`	⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines	Coverage Δ
...java/org/apache/hudi/config/HoodieIndexConfig.java	`89.29% <100.00%> (-0.07%)`	⬇️
...org/apache/hudi/keygen/CustomAvroKeyGenerator.java	`80.64% <100.00%> (-1.18%)`	⬇️
.../main/java/org/apache/hudi/keygen/KeyGenUtils.java	`86.77% <100.00%> (+0.48%)`	⬆️
...ava/org/apache/hudi/keygen/CustomKeyGenerator.java	`76.47% <100.00%> (-2.11%)`	⬇️
...org/apache/hudi/configuration/OptionsResolver.java	`72.07% <100.00%> (-0.36%)`	⬇️

... and 134 files with indirect coverage changes

hudi-bot · 2026-05-26T09:14:45Z

CI report:

900e018 Azure: FAILURE

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure re-run the last Azure build

hudi-agent reviewed May 26, 2026

refactor(core): Unify record key splitting and extraction

900e018

cshuo force-pushed the unify_record_key_extraction branch from 8774815 to 900e018 Compare May 26, 2026 07:26

github-actions Bot added the size:S PR with lines of changes in (10, 100] label May 26, 2026

cshuo wants to merge 1 commit into apache:master from cshuo:unify_record_key_extraction

4 participants
