Skip to content

feat(flink): add metrics for RLI load time for bucket assign functions#18762

Merged
danny0405 merged 1 commit into
apache:masterfrom
HuangZhenQiu:buffer-time-metrics
May 20, 2026
Merged

feat(flink): add metrics for RLI load time for bucket assign functions#18762
danny0405 merged 1 commit into
apache:masterfrom
HuangZhenQiu:buffer-time-metrics

Conversation

@HuangZhenQiu
Copy link
Copy Markdown
Member

@HuangZhenQiu HuangZhenQiu commented May 17, 2026

Describe the issue this Pull Request addresses

Add metrics for RLI load time for bucket assign functions

Close #18733

Summary and Changelog

  1. add FlinkBucketAssignMetrics for buffer time and index lookup time
  2. Enable FlinkBucketAssignMetrics for bucket assign functions
  3. Add tests for FlinkBucketAssignMetrics

Impact

none

Risk Level

none

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions Bot added the size:M PR with lines of changes in (100, 300] label May 17, 2026
@HuangZhenQiu HuangZhenQiu force-pushed the buffer-time-metrics branch from 2467ee7 to bceeae4 Compare May 17, 2026 22:58
@HuangZhenQiu HuangZhenQiu requested review from cshuo and danny0405 May 17, 2026 22:58
@github-actions github-actions Bot added size:L PR with lines of changes in (300, 1000] and removed size:M PR with lines of changes in (100, 300] labels May 17, 2026
Copy link
Copy Markdown
Collaborator

@cshuo cshuo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thks for contributing, left some comments.

@HuangZhenQiu HuangZhenQiu force-pushed the buffer-time-metrics branch 3 times, most recently from aac24a8 to c3491bc Compare May 18, 2026 12:25
Copy link
Copy Markdown
Member Author

@HuangZhenQiu HuangZhenQiu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cshuo Thanks for fixing the index lookup. Actually, it confuses me when I find test access index for 1001 times.

Copy link
Copy Markdown
Contributor

@hudi-agent hudi-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR adds metrics for local/remote RLI lookup latency, key counts, and record buffering time to the Flink bucket assign functions. One edge case worth double-checking in the inline comments around partial buffer flushes. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. A few naming and simplification suggestions below.

// Process the buffer if it reaches the configured size
if (recordBuffer.size() >= miniBatchSize) {
// Record how long the oldest record in the batch was buffered
delegateFunction.getMetrics().endRecordBuffering();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 endRecordBuffering() is only called when the buffer fills to miniBatchSize, but processBufferedRecords() is also invoked from prepareSnapshotPreBarrier() (line 202) and endInput() (line 223) on partial buffers. In those paths the timer is never stopped, so the buffering metric is dropped for partial batches and the next startRecordBuffering() will hit the Restarting timer for name: record_buffering, overriding the existing value warning in HoodieFlinkMetrics#startTimer on every checkpoint. Could you move the endRecordBuffering() call into processBufferedRecords() (after the recordBuffer.isEmpty() guard) so all three flush paths are covered?

- AI-generated; verify before applying. React 👍/👎 to flag quality.

}

@VisibleForTesting
public long getLocalLookupKeysNumCount() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: getLocalLookupKeysNumCount() is easy to misread — 'Num' (keys-per-lookup distribution) plus 'Count' (histogram sample count) together read as 'total key count', which is not what this returns. Something like getLocalLookupKeysSampleCount() would be less ambiguous (same concern for getRemoteLookupKeysNumCount()).

- AI-generated; verify before applying. React 👍/👎 to flag quality.

@HuangZhenQiu HuangZhenQiu force-pushed the buffer-time-metrics branch from c3491bc to d8f6c25 Compare May 18, 2026 22:49
Copy link
Copy Markdown
Contributor

@hudi-agent hudi-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! The PR adds histogram metrics for local/remote RLI cache lookup latency, key counts, and record buffering time across the Flink bucket-assign functions. The major concerns (timer leak when a partial buffer is flushed via prepareSnapshotPreBarrier/endInput, per-record hot-path overhead, and a few naming nits) have already been flagged in prior rounds. No additional correctness or architectural issues surfaced from this automated pass — a Hudi committer or PMC member can take it from here for a final review. A few small naming and readability suggestions below — magic constant, terse parameter name, and repeated null guards in the hot path.

cc @yihua

@HuangZhenQiu HuangZhenQiu force-pushed the buffer-time-metrics branch 3 times, most recently from 6752919 to 02d2590 Compare May 19, 2026 03:37
this.metaClient = StreamerUtil.createMetaClient(conf);
this.conf = conf;
this.recordIndexCache = new RecordIndexCache(conf, initCheckpointId);
registerMetrics(new UnregisteredMetricsGroup());
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

initialize metrics, so that we don't need to do metrics != null check.

@HuangZhenQiu HuangZhenQiu force-pushed the buffer-time-metrics branch from 02d2590 to 8d6bdb3 Compare May 19, 2026 05:49
Copy link
Copy Markdown
Collaborator

@cshuo cshuo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@cshuo
Copy link
Copy Markdown
Collaborator

cshuo commented May 19, 2026

cc @danny0405

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.18%. Comparing base (9026c7d) to head (8d6bdb3).
⚠️ Report is 3 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18762      +/-   ##
============================================
+ Coverage     68.16%   68.18%   +0.01%     
- Complexity    29158    29185      +27     
============================================
  Files          2521     2523       +2     
  Lines        141371   141427      +56     
  Branches      17549    17550       +1     
============================================
+ Hits          96365    96425      +60     
+ Misses        37076    37074       -2     
+ Partials       7930     7928       -2     
Flag Coverage Δ
common-and-other-modules 44.46% <100.00%> (+0.03%) ⬆️
hadoop-mr-java-client 44.98% <ø> (-0.04%) ⬇️
spark-client-hadoop-common 48.31% <ø> (ø)
spark-java-tests 48.92% <ø> (-0.03%) ⬇️
spark-scala-tests 44.85% <ø> (+<0.01%) ⬆️
utilities 37.56% <ø> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
.../apache/hudi/metrics/FlinkBucketAssignMetrics.java 100.00% <100.00%> (ø)
.../apache/hudi/metrics/FlinkIndexBackendMetrics.java 100.00% <100.00%> (ø)
...he/hudi/sink/partitioner/BucketAssignFunction.java 94.11% <100.00%> (+0.14%) ⬆️
...ink/partitioner/MinibatchBucketAssignFunction.java 100.00% <100.00%> (+10.16%) ⬆️
...rtitioner/index/GlobalRecordLevelIndexBackend.java 90.00% <100.00%> (+2.00%) ⬆️
...di/sink/partitioner/index/IndexBackendFactory.java 46.66% <ø> (ø)

... and 16 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@hudi-bot
Copy link
Copy Markdown
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Copy link
Copy Markdown
Contributor

@hudi-agent hudi-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the iterative improvements on this PR! The buffering and index-lookup metric instrumentation looks reasonable now — start/end pairings trace through correctly for the normal flow, checkpoint flushes, endInput, and index-record interleaving paths. The constructor-initializes-with-UnregisteredMetricsGroup pattern avoids the null-guard noise from earlier rounds. No new critical correctness issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review. A couple of naming inconsistencies worth tidying up.

cc @yihua

/** Number of keys resolved from the local cache per lookup. */
private final Histogram localLookupKeysNum;

/** Number of keys that missed the local cache and were fetched remotely per lookup. */
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: the fields/registered metric names use Num (e.g. localLookupKeysNum, "localLookupKeysNum") but the public methods use Count (e.g. updateLocalLookupKeysCount). Could you pick one suffix and use it consistently? Either rename the fields/strings to localLookupKeysCount/remoteLookupKeysCount, or rename the methods to updateLocalLookupKeysNum.

- AI-generated; verify before applying. React 👍/👎 to flag quality.

this.delegateFunction.setCorrespondent(correspondent);
}

@VisibleForTesting
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: getDelegateMetrics() leaks the internal delegation structure — if the implementation ever stops delegating, the name becomes misleading. Since BucketAssignFunction already exposes getMetrics(), could you use the same name here for consistency?

- AI-generated; verify before applying. React 👍/👎 to flag quality.

@danny0405 danny0405 merged commit 990fc29 into apache:master May 20, 2026
63 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:L PR with lines of changes in (300, 1000]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add metrics to minitor the minibatch buffering time in BucketAssign op for RLI with remote MDT access

6 participants