feat(flink): add metrics for RLI load time for bucket assign functions by HuangZhenQiu · Pull Request #18762 · apache/hudi

HuangZhenQiu · 2026-05-17T21:35:48Z

Describe the issue this Pull Request addresses

Add metrics for RLI load time for bucket assign functions

Close #18733

Summary and Changelog

Add FlinkBucketAssignMetrics for buffer time and index lookup time
Enable FlinkBucketAssignMetrics for bucket assign functions
Add tests for FlinkBucketAssignMetrics

Impact

none

Risk Level

none

Documentation Update

none

Contributor's checklist

Read through contributor's guide
Enough context is provided in the sections above
Adequate tests were added if applicable

cshuo

Thanks for contributing; I left some comments.

HuangZhenQiu

@cshuo Thanks for fixing the index lookup. Actually, I was confused when I found that the test accesses the index 1001 times.

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR adds metrics for local/remote RLI lookup latency, key counts, and record buffering time to the Flink bucket assign functions. One edge case around partial buffer flushes is worth double-checking; see the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. A few naming and simplification suggestions are below.

hudi-agent · 2026-05-18T16:09:46Z

      // Process the buffer if it reaches the configured size
      if (recordBuffer.size() >= miniBatchSize) {
+        // Record how long the oldest record in the batch was buffered
+        delegateFunction.getMetrics().endRecordBuffering();


🤖 endRecordBuffering() is only called when the buffer fills to miniBatchSize, but processBufferedRecords() is also invoked from prepareSnapshotPreBarrier() (line 202) and endInput() (line 223) on partial buffers. In those paths the timer is never stopped, so the buffering metric is dropped for partial batches, and the next startRecordBuffering() will hit the "Restarting timer for name: record_buffering, overriding the existing value" warning in HoodieFlinkMetrics#startTimer on every checkpoint. Could you move the endRecordBuffering() call into processBufferedRecords() (after the recordBuffer.isEmpty() guard) so all three flush paths are covered?

AI-generated; verify before applying. React 👍/👎 to flag quality.
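
A minimal sketch of the suggested change, to make the three flush paths concrete. Everything here is a hypothetical stand-in: `BufferFlushSketch` and `BufferingTimer` are not the actual Hudi classes, and the sketch assumes the timer is started when the first record enters an empty buffer. It only illustrates stopping the timer inside the shared flush method, after the empty-buffer guard.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the mini-batch function; not the actual Hudi class. */
class BufferFlushSketch {

  /** Hypothetical timer with the same start/end shape as the buffering metric. */
  static class BufferingTimer {
    private long startNanos = -1;
    int completed = 0; // number of start/end pairs that finished
    int restarts = 0;  // number of times start was called on a running timer

    void startRecordBuffering() {
      if (startNanos >= 0) {
        restarts++; // this is the situation that would trigger the restart warning
      }
      startNanos = System.nanoTime();
    }

    void endRecordBuffering() {
      if (startNanos >= 0) {
        completed++;
        startNanos = -1;
      }
    }
  }

  private final List<String> recordBuffer = new ArrayList<>();
  private final int miniBatchSize;
  final BufferingTimer metrics = new BufferingTimer();

  BufferFlushSketch(int miniBatchSize) {
    this.miniBatchSize = miniBatchSize;
  }

  void processElement(String record) {
    if (recordBuffer.isEmpty()) {
      // Assumption: timing starts with the oldest record of the batch.
      metrics.startRecordBuffering();
    }
    recordBuffer.add(record);
    if (recordBuffer.size() >= miniBatchSize) {
      processBufferedRecords();
    }
  }

  // Both of these flush whatever is buffered, possibly a partial batch.
  void prepareSnapshotPreBarrier() {
    processBufferedRecords();
  }

  void endInput() {
    processBufferedRecords();
  }

  private void processBufferedRecords() {
    if (recordBuffer.isEmpty()) {
      return;
    }
    // Stopping the timer here covers the full-batch, checkpoint, and end-of-input paths.
    metrics.endRecordBuffering();
    recordBuffer.clear(); // the real function would assign buckets and emit records here
  }
}
```

With the stop call in one place, a checkpoint that flushes a partial batch still produces a completed measurement, and the next batch starts from a stopped timer.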

hudi-agent · 2026-05-18T16:09:47Z

+  }
+
+  @VisibleForTesting
+  public long getLocalLookupKeysNumCount() {


🤖 nit: getLocalLookupKeysNumCount() is easy to misread — 'Num' (keys-per-lookup distribution) plus 'Count' (histogram sample count) together read as 'total key count', which is not what this returns. Something like getLocalLookupKeysSampleCount() would be less ambiguous (same concern for getRemoteLookupKeysNumCount()).

AI-generated; verify before applying. React 👍/👎 to flag quality.
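
To illustrate the ambiguity, here is a small sketch of a histogram that records keys per lookup. `KeysHistogramSketch` and its method names are hypothetical and do not mirror the Flink `Histogram` interface or the PR's code; the point is only that the sample count and the total number of keys are different quantities.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical keys-per-lookup histogram; not a Flink or Hudi class. */
class KeysHistogramSketch {
  private final List<Long> samples = new ArrayList<>();

  /** Records one sample: the number of keys handled by a single lookup. */
  void update(long keysInThisLookup) {
    samples.add(keysInThisLookup);
  }

  /** Number of samples recorded, i.e. the number of lookups, not the number of keys. */
  long getSampleCount() {
    return samples.size();
  }

  /** Total keys across all recorded lookups. */
  long getTotalKeys() {
    long total = 0;
    for (long sample : samples) {
      total += sample;
    }
    return total;
  }
}
```

After three lookups that served 10, 25, and 5 keys, `getSampleCount()` is 3 while `getTotalKeys()` is 40, which is why a getter named with both "Num" and "Count" can be read either way.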

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! The PR adds histogram metrics for local/remote RLI cache lookup latency, key counts, and record buffering time across the Flink bucket-assign functions. The major concerns (timer leak when a partial buffer is flushed via prepareSnapshotPreBarrier/endInput, per-record hot-path overhead, and a few naming nits) have already been flagged in prior rounds. No additional correctness or architectural issues surfaced from this automated pass — a Hudi committer or PMC member can take it from here for a final review. A few small naming and readability suggestions below — magic constant, terse parameter name, and repeated null guards in the hot path.

cc @yihua

cshuo · 2026-05-19T04:03:15Z

    this.metaClient = StreamerUtil.createMetaClient(conf);
    this.conf = conf;
    this.recordIndexCache = new RecordIndexCache(conf, initCheckpointId);
+    registerMetrics(new UnregisteredMetricsGroup());


Why do we need this?

To initialize the metrics, so that we don't need a metrics != null check.

cshuo

+1
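
The pattern discussed in this thread can be sketched as a no-op default. The types below (`LookupMetrics`, `NoOpLookupMetrics`, `IndexBackend`) are hypothetical and are not the Hudi or Flink classes; in the PR the no-op role is played by metrics registered against an `UnregisteredMetricsGroup`. The lookup logic is a placeholder.

```java
/** Hypothetical illustration of a no-op metrics default; not Hudi or Flink code. */
class NullObjectMetricsSketch {

  /** Hypothetical metrics interface used by the backend. */
  interface LookupMetrics {
    void recordLookupNanos(long nanos);
  }

  /** Default that discards every measurement. */
  static final class NoOpLookupMetrics implements LookupMetrics {
    @Override
    public void recordLookupNanos(long nanos) {
      // intentionally empty
    }
  }

  /** Simple implementation that counts calls, standing in for real metrics. */
  static final class CountingLookupMetrics implements LookupMetrics {
    int calls = 0;

    @Override
    public void recordLookupNanos(long nanos) {
      calls++;
    }
  }

  static final class IndexBackend {
    // Initialized in place, so the hot path never needs a null check.
    private LookupMetrics metrics = new NoOpLookupMetrics();

    void registerMetrics(LookupMetrics metrics) {
      this.metrics = metrics;
    }

    String lookup(String recordKey) {
      long start = System.nanoTime();
      // Placeholder for the actual index lookup.
      String location = "bucket-" + Math.floorMod(recordKey.hashCode(), 4);
      metrics.recordLookupNanos(System.nanoTime() - start);
      return location;
    }
  }
}
```

The backend works before any real metrics are registered, and registering later simply swaps the sink without touching the lookup path.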

cshuo · 2026-05-19T06:15:25Z

cc @danny0405

codecov-commenter · 2026-05-19T07:14:03Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.18%. Comparing base (9026c7d) to head (8d6bdb3).
⚠️ Report is 3 commits behind head on master.

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #18762      +/-   ##
============================================
+ Coverage     68.16%   68.18%   +0.01%     
- Complexity    29158    29185      +27     
============================================
  Files          2521     2523       +2     
  Lines        141371   141427      +56     
  Branches      17549    17550       +1     
============================================
+ Hits          96365    96425      +60     
+ Misses        37076    37074       -2     
+ Partials       7930     7928       -2

Flag	Coverage Δ
common-and-other-modules	`44.46% <100.00%> (+0.03%)`	⬆️
hadoop-mr-java-client	`44.98% <ø> (-0.04%)`	⬇️
spark-client-hadoop-common	`48.31% <ø> (ø)`
spark-java-tests	`48.92% <ø> (-0.03%)`	⬇️
spark-scala-tests	`44.85% <ø> (+<0.01%)`	⬆️
utilities	`37.56% <ø> (-0.01%)`	⬇️

Files with missing lines	Coverage Δ
.../apache/hudi/metrics/FlinkBucketAssignMetrics.java	`100.00% <100.00%> (ø)`
.../apache/hudi/metrics/FlinkIndexBackendMetrics.java	`100.00% <100.00%> (ø)`
...he/hudi/sink/partitioner/BucketAssignFunction.java	`94.11% <100.00%> (+0.14%)`	⬆️
...ink/partitioner/MinibatchBucketAssignFunction.java	`100.00% <100.00%> (+10.16%)`	⬆️
...rtitioner/index/GlobalRecordLevelIndexBackend.java	`90.00% <100.00%> (+2.00%)`	⬆️
...di/sink/partitioner/index/IndexBackendFactory.java	`46.66% <ø> (ø)`

... and 16 files with indirect coverage changes

hudi-bot · 2026-05-19T07:17:07Z

CI report:

8d6bdb3 Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the iterative improvements on this PR! The buffering and index-lookup metric instrumentation looks reasonable now — start/end pairings trace through correctly for the normal flow, checkpoint flushes, endInput, and index-record interleaving paths. The constructor-initializes-with-UnregisteredMetricsGroup pattern avoids the null-guard noise from earlier rounds. No new critical correctness issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review. A couple of naming inconsistencies worth tidying up.

cc @yihua

hudi-agent · 2026-05-19T14:23:14Z

+  /** Number of keys resolved from the local cache per lookup. */
+  private final Histogram localLookupKeysNum;
+
+  /** Number of keys that missed the local cache and were fetched remotely per lookup. */


🤖 nit: the fields/registered metric names use Num (e.g. localLookupKeysNum, "localLookupKeysNum") but the public methods use Count (e.g. updateLocalLookupKeysCount). Could you pick one suffix and use it consistently? Either rename the fields/strings to localLookupKeysCount/remoteLookupKeysCount, or rename the methods to updateLocalLookupKeysNum.

AI-generated; verify before applying. React 👍/👎 to flag quality.

hudi-agent · 2026-05-19T14:23:14Z

    this.delegateFunction.setCorrespondent(correspondent);
  }

+  @VisibleForTesting


🤖 nit: getDelegateMetrics() leaks the internal delegation structure — if the implementation ever stops delegating, the name becomes misleading. Since BucketAssignFunction already exposes getMetrics(), could you use the same name here for consistency?

AI-generated; verify before applying. React 👍/👎 to flag quality.

github-actions Bot added the size:M PR with lines of changes in (100, 300] label May 17, 2026

HuangZhenQiu force-pushed the buffer-time-metrics branch from 2467ee7 to bceeae4 Compare May 17, 2026 22:58

HuangZhenQiu requested review from cshuo and danny0405 May 17, 2026 22:58

github-actions Bot added size:L PR with lines of changes in (300, 1000] and removed size:M PR with lines of changes in (100, 300] labels May 17, 2026

cshuo reviewed May 18, 2026

HuangZhenQiu force-pushed the buffer-time-metrics branch 3 times, most recently from aac24a8 to c3491bc Compare May 18, 2026 12:25

HuangZhenQiu commented May 18, 2026

Comment thread ...e/hudi-flink/src/main/java/org/apache/hudi/sink/partitioner/DynamicBucketAssignFunction.java Outdated

HuangZhenQiu commented May 18, 2026

hudi-agent reviewed May 18, 2026

HuangZhenQiu force-pushed the buffer-time-metrics branch from c3491bc to d8f6c25 Compare May 18, 2026 22:49

hudi-agent reviewed May 18, 2026

HuangZhenQiu force-pushed the buffer-time-metrics branch 3 times, most recently from 6752919 to 02d2590 Compare May 19, 2026 03:37

cshuo reviewed May 19, 2026

feat(flink): add metrics for RLI load time for bucket assign functions

8d6bdb3

HuangZhenQiu force-pushed the buffer-time-metrics branch from 02d2590 to 8d6bdb3 Compare May 19, 2026 05:49

cshuo approved these changes May 19, 2026

hudi-agent reviewed May 19, 2026

danny0405 approved these changes May 20, 2026

danny0405 merged commit 990fc29 into apache:master May 20, 2026
63 checks passed

yaoliclshlmch mentioned this pull request May 20, 2026

feat: add bucketassign minibatch cache hit/miss metrics #18761

Merged

3 tasks

dwshmilyss pushed a commit to dwshmilyss/hudi that referenced this pull request May 21, 2026

feat(flink): add metrics for RLI load time for bucket assign functions (apache#18762)

280e694

6 participants
