
feat(flink): Support dynamic bucket for flink streaming with partitioned RLI #18640

Merged
danny0405 merged 3 commits into apache:master from cshuo:dynamic_bucket_rli on May 9, 2026
Conversation

@cshuo
Collaborator

@cshuo cshuo commented Apr 29, 2026

Describe the issue this Pull Request addresses

Flink streaming writes need a partitioned record-level-index path that can support dynamic bucket assignment without relying on global record-key-to-location lookups. The existing RLI integration was centered on global record locations and did not provide a partition-scoped backend or pipeline wiring for routing records to dynamic bucket file groups.

This PR adds the Flink dynamic bucket write path backed by partitioned RLI, separates global and partitioned index backend responsibilities, and wires the corresponding index write partitioning and table options.

Summary and Changelog

  • Adds DynamicBucketAssignFunction and DynamicBucketAssignOperator to route Flink records through partition-scoped RLI-backed bucket assignment.
  • Introduces RecordLevelIndexBackend as the partitioned RLI backend with partition-local spillable caches.
  • Adds GlobalRecordIndexPartitioner and updates RecordIndexPartitioner so index writes route consistently for global and partitioned RLI layouts.
  • Updates Flink pipeline construction to select the dynamic bucket assign path for partitioned RLI, validate supported write modes, and choose the correct index write partitioner.
  • Enables RECORD_LEVEL_INDEX table setup in HoodieTableFactory, including metadata table streaming writes and RLI write buffer defaults.
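The assignment flow the bullets above describe can be sketched without any Flink machinery: each record key is first resolved against a partition-local index cache, and keys not yet seen fall back to hash-based bucket assignment. The class below is a hypothetical illustration, not code from this PR; the real operators (DynamicBucketAssignFunction and friends) are Flink functions backed by the RLI metadata table.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of partition-scoped dynamic bucket assignment.
// All names are stand-ins; they do not appear in the PR.
public class PartitionedBucketAssigner {
  private final int numBuckets;
  // partition -> (recordKey -> bucketId): a partition-local index cache
  private final Map<String, Map<String, Integer>> partitionCaches = new HashMap<>();

  public PartitionedBucketAssigner(int numBuckets) {
    this.numBuckets = numBuckets;
  }

  /** Returns the cached bucket for (partition, key), or assigns one by hashing. */
  public int assign(String partition, String recordKey) {
    Map<String, Integer> cache =
        partitionCaches.computeIfAbsent(partition, p -> new HashMap<>());
    return cache.computeIfAbsent(recordKey,
        k -> Math.floorMod(k.hashCode(), numBuckets));
  }
}
```

The key property is that a record key only has to be unique within its partition, so no global key-to-location lookup is needed.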

Impact

  • Functional impact: Adds Flink streaming support for dynamic bucket assignment with partitioned record-level index, including upsert and insert overwrite paths.
  • Maintainability: Separates global and partitioned index backend contracts, reducing overloading of the previous IndexBackend abstraction.
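The contract split can be illustrated with two hypothetical interfaces (names and signatures are stand-ins, not the PR's actual IndexBackend types): a global backend resolves a record key alone, while a partitioned backend scopes every lookup to a partition, which is what makes partition-local caches possible.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the global vs partitioned contract split.
interface GlobalIndexBackend {
  // the key alone is enough: its location is unique across all partitions
  String locate(String recordKey);
}

interface PartitionedIndexBackend {
  // lookups are scoped to a partition, enabling partition-local caches
  String locate(String partition, String recordKey);
}

class MapGlobalBackend implements GlobalIndexBackend {
  private final Map<String, String> index = new HashMap<>();
  public void put(String key, String location) { index.put(key, location); }
  @Override public String locate(String recordKey) { return index.get(recordKey); }
}

class MapPartitionedBackend implements PartitionedIndexBackend {
  private final Map<String, Map<String, String>> index = new HashMap<>();
  public void put(String partition, String key, String location) {
    index.computeIfAbsent(partition, p -> new HashMap<>()).put(key, location);
  }
  @Override public String locate(String partition, String recordKey) {
    Map<String, String> partitionIndex = index.get(partition);
    return partitionIndex == null ? null : partitionIndex.get(recordKey);
  }
}
```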

Risk Level

Low

Documentation Update

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions Bot added the size:XL PR with lines of changes > 1000 label Apr 29, 2026
@cshuo cshuo force-pushed the dynamic_bucket_rli branch 2 times, most recently from 111f50b to 36f9935 Compare April 29, 2026 03:36
@cshuo cshuo marked this pull request as draft April 29, 2026 06:45
@cshuo cshuo changed the title from [WIP] feat(flink): Support dynamic bucket for flink streaming with partitio… to feat(flink): Support dynamic bucket for flink streaming with partitio… Apr 29, 2026
@cshuo cshuo force-pushed the dynamic_bucket_rli branch from 36f9935 to 3bca7e6 Compare May 7, 2026 07:14
@cshuo cshuo marked this pull request as ready for review May 7, 2026 07:14
Contributor

@hudi-agent hudi-agent left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This chunk contains the test coverage for the new dynamic bucket / partitioned RLI write path: unit tests for the new RecordLevelIndexBackend lazy-eviction and dictionary-encoded bucket caches, the renamed GlobalRecordLevelIndexBackend and GlobalRecordIndexPartitioner, the partitioned RLI routing in RecordIndexPartitioner, and an end-to-end integration test (ITTestDynamicBucketStreamWrite) covering CoW/MoR, partitioned/non-partitioned tables, insert overwrite, and bucket scaling. I traced the lazy-eviction math, the access-order semantics of the LinkedHashMap-based partitionBucketCaches, and the partitioner routing; the assertions hold for the chosen FILE_GROUP_COUNT/NUM_PARTITIONS pair. No correctness issues found, only a few small naming and simplification suggestions in the inline comments. Please take a look; this should then be ready for a Hudi committer or PMC member to take it from here.
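For readers unfamiliar with the access-order semantics the review traces: constructing a LinkedHashMap with accessOrder=true moves an entry to the tail on every get(), so the eldest entry is always the least recently used, and overriding removeEldestEntry gives lazy eviction. A generic sketch of the pattern, not Hudi's actual partitionBucketCaches:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Generic access-order LRU cache built on LinkedHashMap.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
  private final int maxEntries;

  public LruCache(int maxEntries) {
    super(16, 0.75f, true); // accessOrder = true: get() refreshes recency
    this.maxEntries = maxEntries;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    // called after each put(); evicts the least-recently-used entry lazily
    return size() > maxEntries;
  }
}
```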

cc @yihua

*/
protected Pair<List<HoodieFileGroupId>, HoodieData<HoodieRecord>> tagRecordsWithLocationForStreamingWrites(HoodieData<HoodieRecord> untaggedRecords,
Set<String> enabledMetadataPartitions) {
// no need to tag of the incoming records is empty.
Contributor

🤖 nit: typo — "no need to tag of the incoming records is empty" should probably be "no need to tag records if the incoming data is empty".

- AI-generated; verify before applying. React 👍/👎 to flag quality.

RuntimeContextUtils.getNumberOfParallelSubtasks(getRuntimeContext()),
writeProfile,
writeConfig);
this.maxParallelism = RuntimeContextUtils.getMaxNumberOfParallelSubtasks(getRuntimeContext());
Contributor

🤖 nit: getMaxNumberOfParallelSubtasks, getNumberOfParallelSubtasks, and getIndexOfThisSubtask are each called twice — once inside the BucketAssigner constructor args and again to assign the fields. Could you assign the three fields first and then pass them to the constructor?

public static boolean isStreamingIndexWriteEnabled(Configuration conf) {
return conf.get(FlinkOptions.METADATA_ENABLED)
&& OptionsResolver.getIndexType(conf) == HoodieIndex.IndexType.GLOBAL_RECORD_LEVEL_INDEX
&& (OptionsResolver.getIndexType(conf) == HoodieIndex.IndexType.GLOBAL_RECORD_LEVEL_INDEX
Contributor

🤖 nit: OptionsResolver.getIndexType(conf) is called twice in the same boolean expression — could you extract it to a local variable like the surrounding methods do?

Contributor

@hudi-agent hudi-agent left a comment

Thanks for the updates! This round adds an INDEX_RLI_CACHE_CONCURRENT_PARTITIONS_NUM config to drive inferMemorySizeForCache, raises the SamplingActionExecutor default record interval from 1000 to 10000, and consolidates the partitioned-RLI write-mode validation from Pipelines into a switch in HoodieTableFactory.checkIndexType. One thing worth a second look: the move dropped the !OptionsResolver.isMultiWriter(conf) check (so OCC is now silently accepted for partitioned RLI), and sanityCheck doesn't run on the HoodieFlinkStreamer path — see the inline comment to confirm intent. The three NITs from the prior pass (typo in HoodieBackedTableMetadataWriter, duplicate RuntimeContextUtils calls in DynamicBucketAssignFunction, duplicate getIndexType in OptionsResolver) are still open. Please take a look at the inline comment, and this should be ready for a Hudi committer or PMC member to take it from here.

ValidationUtils.checkArgument(OptionsResolver.isUpsertOperation(conf) || OptionsResolver.isInsertOverwrite(conf),
"Partitioned record level index supports only Flink streaming upsert and insert overwrite.");
ValidationUtils.checkArgument(!OptionsResolver.isNonBlockingConcurrencyControl(conf),
"Partitioned record level index does not support non-blocking concurrency control.");
Contributor

🤖 The previous Pipelines.validateRecordLevelIndexStreamWrite had three checks; this case keeps the upsert/insertOverwrite and NBCC ones but drops !OptionsResolver.isMultiWriter(conf). Since isMultiWriter is true for both OCC and NBCC, OPTIMISTIC_CONCURRENCY_CONTROL is now silently accepted for partitioned RLI — was that intentional? Each writer bootstraps its own RecordLevelIndexBackend cache from MDT and won't observe the other writer's in-flight assignments. Also, sanityCheck doesn't run on the HoodieFlinkStreamer path, so misconfigurations there will no longer be rejected up front.

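The guard the review says was dropped would look roughly like the third check below. This is a hedged sketch: validate and its boolean parameters are stand-ins for the OptionsResolver predicates, and checkArgument is a plain replacement for Hudi's ValidationUtils, not the real method.

```java
// Hypothetical sketch of the partitioned-RLI validation with the
// multi-writer guard restored alongside the two retained checks.
public class PartitionedRliValidation {
  static void checkArgument(boolean condition, String message) {
    if (!condition) {
      throw new IllegalArgumentException(message);
    }
  }

  static void validate(boolean isUpsert, boolean isInsertOverwrite,
                       boolean isNbcc, boolean isMultiWriter) {
    checkArgument(isUpsert || isInsertOverwrite,
        "Partitioned record level index supports only Flink streaming upsert and insert overwrite.");
    checkArgument(!isNbcc,
        "Partitioned record level index does not support non-blocking concurrency control.");
    // The check the review flags as dropped: reject multi-writer (OCC) setups,
    // since each writer bootstraps its own cache and cannot see in-flight
    // assignments from other writers.
    checkArgument(!isMultiWriter,
        "Partitioned record level index does not support multi-writer concurrency control.");
  }
}
```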

@hudi-bot
Collaborator

hudi-bot commented May 8, 2026

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@codecov-commenter

Codecov Report

❌ Patch coverage is 50.14006% with 178 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.09%. Comparing base (4029560) to head (7b52b57).
⚠️ Report is 4 commits behind head on master.

Files with missing lines Patch % Lines
...ink/partitioner/index/RecordLevelIndexBackend.java 53.78% 55 Missing and 6 partials ⚠️
.../sink/partitioner/DynamicBucketAssignFunction.java 0.00% 47 Missing ⚠️
...ain/java/org/apache/hudi/sink/utils/Pipelines.java 0.00% 15 Missing ⚠️
...java/org/apache/hudi/table/HoodieTableFactory.java 35.00% 10 Missing and 3 partials ⚠️
.../hudi/sink/partitioner/RecordIndexPartitioner.java 54.54% 8 Missing and 2 partials ⚠️
.../sink/partitioner/DynamicBucketAssignOperator.java 0.00% 8 Missing ⚠️
...rtitioner/index/GlobalRecordLevelIndexBackend.java 88.23% 5 Missing and 1 partial ⚠️
...artitioner/index/DummyPartitionedIndexBackend.java 0.00% 4 Missing ⚠️
...apache/hudi/sink/utils/SamplingActionExecutor.java 55.55% 4 Missing ⚠️
...udi/sink/partitioner/index/IndexWriteFunction.java 57.14% 2 Missing and 1 partial ⚠️
... and 4 more
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18640      +/-   ##
============================================
+ Coverage     67.90%   68.09%   +0.18%     
- Complexity    28958    29119     +161     
============================================
  Files          2521     2528       +7     
  Lines        141039   141466     +427     
  Branches      17480    17541      +61     
============================================
+ Hits          95777    96330     +553     
+ Misses        37401    37218     -183     
- Partials       7861     7918      +57     
Flag Coverage Δ
common-and-other-modules 44.41% <50.14%> (+0.19%) ⬆️
hadoop-mr-java-client 45.00% <0.00%> (+0.12%) ⬆️
spark-client-hadoop-common 48.35% <0.00%> (-0.08%) ⬇️
spark-java-tests 49.00% <0.00%> (+0.34%) ⬆️
spark-scala-tests 44.90% <0.00%> (+0.13%) ⬆️
utilities 37.62% <0.00%> (-0.07%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
...org/apache/hudi/index/FlinkHoodieIndexFactory.java 28.57% <ø> (ø)
...va/org/apache/hudi/configuration/FlinkOptions.java 99.78% <100.00%> (+<0.01%) ⬆️
...org/apache/hudi/configuration/OptionsResolver.java 69.44% <100.00%> (+1.35%) ⬆️
...he/hudi/sink/partitioner/BucketAssignFunction.java 93.42% <ø> (ø)
...g/apache/hudi/sink/partitioner/BucketAssigner.java 89.32% <100.00%> (+0.21%) ⬆️
...sink/partitioner/index/FlinkStateIndexBackend.java 100.00% <ø> (ø)
...ache/hudi/sink/partitioner/index/IndexBackend.java 100.00% <ø> (ø)
...di/sink/partitioner/index/IndexBackendFactory.java 46.66% <100.00%> (ø)
...di/sink/partitioner/index/RocksDBIndexBackend.java 80.00% <ø> (ø)
...hudi/metadata/HoodieBackedTableMetadataWriter.java 83.79% <50.00%> (-0.07%) ⬇️
... and 13 more

... and 34 files with indirect coverage changes

@danny0405 danny0405 merged commit 93e334c into apache:master May 9, 2026
63 checks passed

Labels

size:XL PR with lines of changes > 1000

5 participants