feat: add bucketassign minibatch cache hit/miss metrics by yaoliclshlmch · Pull Request #18761 · apache/hudi

yaoliclshlmch · 2026-05-17T07:38:15Z

Describe the issue this Pull Request addresses

The Flink bucket-assign operator on the global record-level index (GRLI) path keeps an in-memory RecordIndexCache to serve record-key lookups, falling back to the metadata table on a miss. Today the cache is a black box at runtime: operators have no visibility into how effective it is, which makes it impossible to alert on a degraded hit ratio (e.g. due to an undersized cache, eviction churn, or key-skew patterns) or to tune INDEX_RLI_CACHE_SIZE based on real workload behavior.
This PR adds two counter metrics so the in-memory cache hit / miss volume becomes observable and alertable. It builds on top of #18762, which recently introduced FlinkIndexBackendMetrics and the per-call histograms (localLookupKeysNum, remoteLookupKeysNum). Those histograms describe the per-batch distribution; the new counters add the cumulative running totals that are cheaper to scrape and easier to ratio in dashboards/alerts.

Summary and Changelog

Adds two Flink counter metrics on the bucket-assign operator's MetricGroup, registered automatically whenever the GRLI backend is in use:

- bucketassign.minibatch.cache.hit.count: cumulative record-key lookups served by the in-memory RecordIndexCache.
- bucketassign.minibatch.cache.miss.count: cumulative record-key lookups that fell back to a metadata-table read.
Changelog:
org.apache.hudi.metrics.FlinkIndexBackendMetrics (existing class, introduced in feat(flink): add metrics for RLI load time for bucket assign functions #18762):
- Adds two SimpleCounter fields cacheHitCount / cacheMissCount and public constants BUCKET_ASSIGN_MINIBATCH_CACHE_HIT_COUNT / BUCKET_ASSIGN_MINIBATCH_CACHE_MISS_COUNT.
- Registers both counters in registerMetrics() alongside the existing histograms.
- Exposes markCacheHit(long) / markCacheMiss(long) helpers that short-circuit on n <= 0 so a single call site can drain a whole batch's hit/miss counts in one shot.
- Adds @VisibleForTesting getters getCacheHitCount() / getCacheMissCount().
org.apache.hudi.sink.partitioner.index.GlobalRecordLevelIndexBackend:
- In get(List<String> recordKeys), alongside the existing updateLocalLookupKeysCount(...) call, bumps the new hit/miss counters by recordKeys.size() - missedKeys.size() and missedKeys.size() respectively. No new field, no extra branching — same metrics object that the backend already constructs.
New tests in TestFlinkIndexBackendMetrics (counter registration, increments, zero/negative no-op) and TestGlobalRecordLevelIndexBackend (end-to-end: counters bump correctly across a cold and a warm batch lookup; counters are registered on the supplied MetricGroup).
No code copied from other projects.
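
For readers without the diff at hand, the counter behavior described in the changelog can be sketched roughly as follows. This is a minimal illustration, not the merged code: `CacheHitMissCounters` and `recordBatch` are hypothetical names, and `LongAdder` stands in for Flink's `SimpleCounter` so the snippet runs without Flink on the classpath. Only `markCacheHit(long)` / `markCacheMiss(long)`, the `n <= 0` short-circuit, and the `recordKeys.size() - missedKeys.size()` arithmetic come from the description above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical stand-in for the hit/miss counters; LongAdder replaces Flink's SimpleCounter. */
final class CacheHitMissCounters {
  private final LongAdder cacheHitCount = new LongAdder();
  private final LongAdder cacheMissCount = new LongAdder();

  /** Adds n cache hits; zero or negative values are a no-op. */
  void markCacheHit(long n) {
    if (n <= 0) {
      return;
    }
    cacheHitCount.add(n);
  }

  /** Adds n cache misses; zero or negative values are a no-op. */
  void markCacheMiss(long n) {
    if (n <= 0) {
      return;
    }
    cacheMissCount.add(n);
  }

  long getCacheHitCount() {
    return cacheHitCount.sum();
  }

  long getCacheMissCount() {
    return cacheMissCount.sum();
  }

  /** Hypothetical helper: drains one batch lookup, hits = looked-up keys minus missed keys. */
  void recordBatch(List<String> recordKeys, List<String> missedKeys) {
    markCacheHit(recordKeys.size() - missedKeys.size());
    markCacheMiss(missedKeys.size());
  }

  public static void main(String[] args) {
    CacheHitMissCounters metrics = new CacheHitMissCounters();
    List<String> keys = Arrays.asList("k1", "k2", "k3");
    metrics.recordBatch(keys, keys);                            // cold batch: every key misses
    metrics.recordBatch(keys, Collections.<String>emptyList()); // warm batch: every key hits
    System.out.println(metrics.getCacheHitCount() + " hits, "
        + metrics.getCacheMissCount() + " misses"); // prints "3 hits, 3 misses"
  }
}
```

The short-circuit means a call site can pass both derived counts unconditionally, even when one of them is zero.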

Impact

User-facing: Two new Flink counter metrics are exposed under the bucket-assign operator's metric group. Existing dashboards/alerts are unchanged; the new metrics are purely additive.
Public API: None. No new configuration keys, no behavior changes to bucket assignment, cache eviction, or metadata-table reads.
Performance: Negligible. The hot path adds one subtraction and at most two Counter.inc(long) calls per batch lookup, sitting next to the existing histogram update.

Risk Level

low
The change is observability-only: cache lookup semantics are unchanged, no new public APIs, no new config keys, and the increment path is a constant-time pair of counter bumps next to existing metric updates. The backend constructor already pre-initializes metrics with an UnregisteredMetricsGroup, so the new calls cannot NPE regardless of whether registerMetrics(...) was later invoked by the operator.
Verification: new and existing unit tests covering the bucket-assign + metrics surface pass with -Pspark3.3,flink1.18 (TestFlinkIndexBackendMetrics, TestFlinkBucketAssignMetrics, TestGlobalRecordLevelIndexBackend).

Documentation Update

none — no new config or user-facing feature beyond two additional metrics emitted by the existing operator. The metric names are self-descriptive and follow the same convention as the sibling histograms in FlinkIndexBackendMetrics.

Contributor's checklist

Read through contributor's guide
Enough context is provided in the sections above
Adequate tests were added if applicable

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR adds two simple counter metrics (bucketassign.minibatch.cache.hit.count / miss.count) to the GRLI backend by mirroring the existing RocksDBIndexBackend / FlinkRocksDBIndexMetrics pattern, with a re-registration guard and a null-guard at the call site. The hit/miss arithmetic (recordKeys.size() - missedKeys.size() vs missedKeys.size()) lines up with how the existing miss-collection loop builds missedKeys, and the test coverage exercises both the single-key and batch paths plus the unregistered-metrics path. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. A couple of small naming and delegation nits below, but overall the code is clean and well-structured.

cc @yihua

hudi-agent · 2026-05-18T22:03:14Z

+  }
+
+  public void markCacheHit() {
+    cacheHitCount.inc();


🤖 nit: could you have the no-arg markCacheHit() delegate to markCacheHit(1L) (and same for markCacheMiss())? Right now the two overloads are independent, so if any logic is ever added to the bulk path the single-increment path would silently miss it.

AI-generated; verify before applying. React 👍/👎 to flag quality.

danny0405 · 2026-05-20T06:32:07Z

+    metricGroup.counter(BUCKET_ASSIGN_MINIBATCH_CACHE_MISS_COUNT, cacheMissCount);
+  }
+
+  public void markCacheHit(long n) {


do you think the cache hit ratio makes more sense? i.e. hitCnt / (hitCnt + missCnt) instead of the specific hit/miss numbers

yeah we can emit cache hit ratio directly, but we will only get the ratio number per mini batch. keeping the cache hit / miss count can give us the flexibility of querying the ratio at any time range. wdyt?
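
The flexibility argument can be illustrated with a small sketch. Assuming two scrapes of the cumulative hit and miss counters, the ratio for any window is the hit delta over the total delta. `WindowedHitRatio`, `Sample`, and `ratioBetween` are hypothetical names used only for this example, and returning NaN for an empty window is an assumption.

```java
/** Hypothetical helper: derives a hit ratio for any time window from cumulative counters. */
final class WindowedHitRatio {

  /** One scrape of the two cumulative counters. */
  static final class Sample {
    final long hits;
    final long misses;

    Sample(long hits, long misses) {
      this.hits = hits;
      this.misses = misses;
    }
  }

  /** Hit ratio between two scrapes; NaN when no lookups happened in the window. */
  static double ratioBetween(Sample start, Sample end) {
    long hitDelta = end.hits - start.hits;
    long missDelta = end.misses - start.misses;
    long total = hitDelta + missDelta;
    return total == 0 ? Double.NaN : (double) hitDelta / total;
  }

  public static void main(String[] args) {
    Sample t0 = new Sample(1_000, 500);
    Sample t1 = new Sample(1_900, 600);
    // 900 hits out of 1000 lookups inside the window.
    System.out.println(ratioBetween(t0, t1)); // prints 0.9
  }
}
```

A per-mini-batch gauge, by contrast, only exposes the ratio of the most recent batch at scrape time.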

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR adds cumulative hit/miss counters to the Flink bucket-assign GRLI cache metrics, wired in alongside the existing histograms. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. A couple of small naming inconsistencies below — the new metric key values mix dot-separated style with the existing camelCase convention, and the n parameter in the public markCacheHit/markCacheMiss methods could use a more descriptive name.

cc @yihua

hudi-agent · 2026-05-20T08:51:13Z

  private static final String LOCAL_INDEX_LOOKUP_KEY = "local_index_lookup";
  private static final String REMOTE_INDEX_LOOKUP_KEY = "remote_index_lookup";

+  public static final String BUCKET_ASSIGN_MINIBATCH_CACHE_HIT_COUNT = "bucketassign.minibatch.cache.hit.count";


🤖 nit: the metric name values use dot-separated format ("bucketassign.minibatch.cache.hit.count") while the other metrics registered in this class use camelCase ("remoteIndexLookupLatency", "localLookupKeysNum", etc.). Could you align them — e.g. "bucketAssignMinibatchCacheHitCount" / "bucketAssignMinibatchCacheMissCount"?

AI-generated; verify before applying. React 👍/👎 to flag quality.

hudi-agent · 2026-05-20T08:51:13Z

+  }
+
+  public void markCacheHit(long n) {
+    if (n <= 0) {


🤖 nit: the parameter n is a bit opaque for a public method — could you rename it to count in both markCacheHit and markCacheMiss? Callers reading metrics.markCacheHit(hitCount) will see the argument, but the method signature itself is easier to reason about with count.

AI-generated; verify before applying. React 👍/👎 to flag quality.

cshuo · 2026-05-20T12:46:19Z

    metricGroup.histogram("localLookupKeysNum", localLookupKeysNum);
    metricGroup.histogram("remoteLookupKeysNum", remoteLookupKeysNum);
+    metricGroup.counter(BUCKET_ASSIGN_MINIBATCH_CACHE_HIT_COUNT, cacheHitCount);
+    metricGroup.counter(BUCKET_ASSIGN_MINIBATCH_CACHE_MISS_COUNT, cacheMissCount);


We already have localLookupKeysNum and remoteLookupKeysNum; can we add a hit ratio gauge metric, like 'lookupCacheHitRatio', based on these two fields?

localLookupKeysNum and remoteLookupKeysNum are histograms, which have a 100s aggregation; if we want the counter data points, we should keep these 2. we can add the hit ratio gauge to get the ratio per mini batch, but keeping the cache hit / miss count can give us the flexibility of querying the ratio at any time range. wdyt?

we can add the hit ratio gauge to get the ratio per mini batch

The per mini-batch hit ratio makes more sense than the global one. Can we reuse the numbers used by updateRemoteLookupKeysCount & updateLocalLookupKeysCount, and keep the ratio as a gauge metric?

makes sense. addressed

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR replaces the previous hit/miss counter design with a single per-mini-batch hit-ratio gauge in FlinkIndexBackendMetrics and wires it up at the local-cache lookup site in GlobalRecordLevelIndexBackend. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One naming inconsistency in the metric key value worth resolving for dashboard clarity.

cc @yihua

hudi-agent · 2026-05-21T06:16:17Z

  private static final String LOCAL_INDEX_LOOKUP_KEY = "local_index_lookup";
  private static final String REMOTE_INDEX_LOOKUP_KEY = "remote_index_lookup";

+  public static final String BUCKET_ASSIGN_MINIBATCH_CACHE_HIT_RATIO = "bucketassign.minibatch.cache.hit_ratio";


🤖 nit: the value "bucketassign.minibatch.cache.hit_ratio" uses dot-separated lowercase with an underscore, while every other metric registered in this class uses flat camelCase ("remoteIndexLookupLatency", "localLookupKeysNum", etc.). In Flink, dots in a metric name create sub-group path segments, so this gauge would land in a different namespace than the histograms — likely surprising on a dashboard. Could you align it to something like "minibatchCacheHitRatio" or "bucketAssignMinibatchCacheHitRatio"?

AI-generated; verify before applying. React 👍/👎 to flag quality.

Co-authored-by: Cursor <cursoragent@cursor.com>

hudi-bot · 2026-05-21T07:36:04Z

CI report:

bbdf42f Azure: SUCCESS
d1435b0 UNKNOWN
e3548ca Azure: PENDING

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

codecov-commenter · 2026-05-21T07:43:52Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.25%. Comparing base (0138f41) to head (e3548ca).
⚠️ Report is 3 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff            @@
##             master   #18761   +/-   ##
=========================================
  Coverage     68.25%   68.25%           
- Complexity    29324    29340   +16     
=========================================
  Files          2527     2527           
  Lines        141804   141868   +64     
  Branches      17622    17628    +6     
=========================================
+ Hits          96787    96836   +49     
- Misses        37052    37067   +15     
  Partials       7965     7965

Flag	Coverage Δ
common-and-other-modules	`44.42% <100.00%> (+0.01%)`	⬆️
hadoop-mr-java-client	`44.91% <ø> (-0.03%)`	⬇️
spark-client-hadoop-common	`48.23% <ø> (-0.03%)`	⬇️
spark-java-tests	`48.85% <ø> (-0.01%)`	⬇️
spark-scala-tests	`44.93% <ø> (-0.03%)`	⬇️
utilities	`37.49% <ø> (+0.03%)`	⬆️

Files with missing lines	Coverage Δ
.../apache/hudi/metrics/FlinkIndexBackendMetrics.java	`100.00% <100.00%> (ø)`
...rtitioner/index/GlobalRecordLevelIndexBackend.java	`90.32% <100.00%> (+0.32%)`	⬆️

... and 21 files with indirect coverage changes

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! This PR replaces the proposed counter pair with a single per-mini-batch lookupCacheHitRatio gauge on FlinkIndexBackendMetrics, wired up from GlobalRecordLevelIndexBackend.get(...) after the local cache lookup. The volatile-double + early-return-on-zero-total pattern looks safe for the Flink reporter/task-thread interaction, and the tests cover the relevant edge cases (zero total, negative inputs, all-hit, all-miss, end-to-end via the backend). No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One javadoc phrasing could mislead future readers into thinking the method reads shared state; everything else looks clean.

cc @yihua

hudi-agent · 2026-05-21T08:20:13Z

+   * When the lookup observed no keys, the previous value is preserved so dashboards
+   * don't oscillate back to zero on idle mini-batches.
+   */
+  public void updateLookupCacheHitRatio(long hitCount, long missCount) {


🤖 nit: the Javadoc says "from the hit/miss counts already fed into updateLocalLookupKeysCount and updateRemoteLookupKeysCount", which implies this method reads from shared state populated by those calls — but it actually uses its own hitCount/missCount parameters directly. Could you drop that cross-reference and just describe what the two parameters represent?

AI-generated; verify before applying. React 👍/👎 to flag quality.
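
A rough sketch of the gauge pattern discussed in this thread, under stated assumptions: `LookupCacheHitRatio` and `getValue` are hypothetical names, the initial value of 0.0 and the choice to ignore negative inputs are guesses (the review only says negative inputs are tested), and the real class registers a Flink gauge rather than exposing a plain getter. What the thread does confirm is the `updateLookupCacheHitRatio(long hitCount, long missCount)` signature, the volatile double, and the early return that preserves the previous value when no keys were observed.

```java
/** Hypothetical sketch of a per-mini-batch hit-ratio holder backing a gauge. */
final class LookupCacheHitRatio {
  // volatile: written by the task thread, read by the metrics reporter thread.
  private volatile double hitRatio = 0.0; // initial value is an assumption

  /**
   * Updates the ratio from one mini-batch lookup.
   *
   * @param hitCount  keys served by the in-memory cache in this lookup
   * @param missCount keys that fell back to the metadata table in this lookup
   */
  void updateLookupCacheHitRatio(long hitCount, long missCount) {
    if (hitCount < 0 || missCount < 0) {
      return; // assumption: negative inputs are ignored
    }
    long total = hitCount + missCount;
    if (total == 0) {
      return; // idle mini-batch: keep the previous value
    }
    hitRatio = (double) hitCount / total;
  }

  /** Value a gauge would report when scraped. */
  double getValue() {
    return hitRatio;
  }

  public static void main(String[] args) {
    LookupCacheHitRatio gauge = new LookupCacheHitRatio();
    gauge.updateLookupCacheHitRatio(3, 1);
    gauge.updateLookupCacheHitRatio(0, 0); // no keys observed, value unchanged
    System.out.println(gauge.getValue());  // prints 0.75
  }
}
```

Both arguments are plain per-call inputs; the method does not read any state populated by the histogram update methods.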

github-actions Bot added the size:L PR with lines of changes in (300, 1000] label May 17, 2026

yaoliclshlmch changed the title ~~[HUDI-18735][FLINK] Add bucketassign minibatch cache hit/miss metrics~~ feat: [HUDI-18735][FLINK] Add bucketassign minibatch cache hit/miss metrics May 17, 2026

hudi-agent reviewed May 18, 2026

yaoliclshlmch force-pushed the yaoli/18735_add_metrics_for_cache_hit_ratio branch from 508d5d8 to 3fc37dc Compare May 20, 2026 05:19

github-actions Bot added size:M PR with lines of changes in (100, 300] and removed size:L PR with lines of changes in (300, 1000] labels May 20, 2026

danny0405 reviewed May 20, 2026

hudi-agent reviewed May 20, 2026

cshuo reviewed May 20, 2026

danny0405 changed the title ~~feat: [HUDI-18735][FLINK] Add bucketassign minibatch cache hit/miss metrics~~ feat: add bucketassign minibatch cache hit/miss metrics May 21, 2026

yaoliclshlmch force-pushed the yaoli/18735_add_metrics_for_cache_hit_ratio branch 2 times, most recently from bbdf42f to d1435b0 Compare May 21, 2026 06:15

hudi-agent reviewed May 21, 2026

add metrics for bucketassign.minibatch cache hit ratio

e3548ca

Co-authored-by: Cursor <cursoragent@cursor.com>

yaoliclshlmch force-pushed the yaoli/18735_add_metrics_for_cache_hit_ratio branch from d1435b0 to e3548ca Compare May 21, 2026 06:24

hudi-agent reviewed May 21, 2026

danny0405 approved these changes May 21, 2026

danny0405 merged commit b82a5b2 into apache:master May 21, 2026
62 of 63 checks passed
