[improve][ml] Warn and emit metric when cursor ack state exceeds persist limits #25548
…ist limits Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…maxRanges Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
lhotari
left a comment
Good work @ng-galien! Some comments on naming. In the metric names, I think it's clearer to describe the effect, "persisted unacked ranges being truncated" or "persisted batch deleted indexes being truncated" — rather than using "overflow". The former tells operators directly what happened (state was dropped on persist), whereas "overflow" is a more abstract term that requires them to infer the consequence.
Hi @lhotari, thanks for your review. The naming now follows the "truncated" semantics, and the telemetry is more precise, as you suggested.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lhotari
lhotari
left a comment
A few comments about details that Claude Code spotted.
In addition, two comments about the description:
- reconcile the PR description (it claims broker‑level / no cardinality growth, but the code emits per‑cursor labels),
- update the Modifications section to use the final metric names.
Since the counter will only be emitted in Otel when the threshold has been crossed, it's fine to increase cardinality. In Prometheus this would be different.
After review on apache/pulsar#25548, the two new counters emit the cursor's standard attribute set (pulsar.namespace, pulsar.managed_ledger.name, pulsar.managed_ledger.cursor.name) instead of custom managedLedger/cursor keys. Update the reference doc to match. See apache/pulsar#25548.
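The cardinality point above can be illustrated with a minimal sketch. It is a plain-Java stand-in, not the OpenTelemetry API and not the code from this PR: the class `TruncationCounterSketch` and its methods are hypothetical, and only the three attribute keys come from the discussion. The idea it models is that a per-cursor series exists only once a truncation has been recorded for that cursor.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stand-in for a counter with per-cursor attributes.
final class TruncationCounterSketch {

    // One series per distinct attribute set, created lazily on first truncation.
    private final Map<Map<String, String>, LongAdder> series = new ConcurrentHashMap<>();

    // Builds the cursor's attribute set using the keys named in the review.
    static Map<String, String> cursorAttributes(String namespace, String ledgerName, String cursorName) {
        return Map.of(
                "pulsar.namespace", namespace,
                "pulsar.managed_ledger.name", ledgerName,
                "pulsar.managed_ledger.cursor.name", cursorName);
    }

    // Called only when persisted state was actually truncated.
    void recordTruncation(Map<String, String> attributes) {
        series.computeIfAbsent(attributes, k -> new LongAdder()).increment();
    }

    // Number of series that would be exported.
    int seriesCount() {
        return series.size();
    }

    // Current value for one attribute set; 0 if no truncation was ever recorded.
    long value(Map<String, String> attributes) {
        LongAdder adder = series.get(attributes);
        return adder == null ? 0L : adder.sum();
    }

    public static void main(String[] args) {
        TruncationCounterSketch counter = new TruncationCounterSketch();
        // Illustrative names only.
        Map<String, String> attrs =
                cursorAttributes("public/default", "public/default/persistent/orders", "sub-a");
        counter.recordTruncation(attrs);
        System.out.println(counter.seriesCount() + " series, value " + counter.value(attrs));
    }
}
```

Cursors that never cross a limit contribute no series in this model, which is why per-cursor labels stay affordable here.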
Fixes #25540
Motivation
When a cursor persists more ack ranges than managedLedgerMaxUnackedRangesToPersist or more batch deleted indexes than managedLedgerMaxBatchDeletedIndexToPersist, the excess is silently truncated. On broker restart those acks are lost and messages are redelivered. Today there is no signal when this happens; operators have to monitor totalNonContiguousDeletedMessagesRange manually. The issue discussion asks for a WARN log (with tuning advice) and a cursor-level metric.
Modifications
Commit 1 - warn and emit metric on truncation:
buildIndividualDeletedMessageRanges() and buildBatchEntryDeletionIndexInfoList(), with tuning advice covering the two limits, managedLedgerPersistIndividualAckAsLongArray, and managedCursorInfoCompressionType.
broker.conf.
Commit 2 - fix pre-existing off-by-one in the ranges cap:
buildIndividualDeletedMessageRanges used to persist maxRanges + 1 entries because the forEach callback added before testing rangeList.size() <= maxRanges. Regression introduced in #3819 when stream().limit(N) was dropped. Without this fix the new WARN/counter fire spuriously when totalRanges == maxRanges + 1. Fixed by switching to a check-before-add pattern (symmetric with buildBatchEntryDeletionIndexInfoList) with a MutableBoolean truncated flag.
Verifying this change
- ManagedCursorTest.testPersistUnackedRangesTruncatedCounter
- ManagedCursorTest.testPersistBatchDeletedIndexesTruncatedCounter
Does this pull request potentially affect one of the following parts:
broker.conf).
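For reference, the off-by-one described under Commit 2 can be reproduced with a small self-contained sketch. This is not the ManagedCursorImpl code: `RangeCapSketch`, `forEachRange`, and the integer "ranges" are hypothetical stand-ins, and `AtomicBoolean` is used in place of the `MutableBoolean` mentioned above so that the example needs only the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.IntPredicate;

final class RangeCapSketch {

    // Hypothetical stand-in for a range set's forEach: visits ranges 0..total-1
    // and stops as soon as the callback returns false.
    static void forEachRange(int total, IntPredicate callback) {
        for (int i = 0; i < total; i++) {
            if (!callback.test(i)) {
                return;
            }
        }
    }

    // Add-then-check: the entry is added before the size is tested,
    // so up to maxRanges + 1 entries end up in the list.
    static List<Integer> addThenCheck(int total, int maxRanges) {
        List<Integer> rangeList = new ArrayList<>();
        forEachRange(total, range -> {
            rangeList.add(range);
            return rangeList.size() <= maxRanges;
        });
        return rangeList;
    }

    // Check-before-add: the cap is tested first, and the flag records
    // that at least one range was dropped.
    static List<Integer> checkBeforeAdd(int total, int maxRanges, AtomicBoolean truncated) {
        List<Integer> rangeList = new ArrayList<>();
        forEachRange(total, range -> {
            if (rangeList.size() >= maxRanges) {
                truncated.set(true);
                return false;
            }
            rangeList.add(range);
            return true;
        });
        return rangeList;
    }

    public static void main(String[] args) {
        AtomicBoolean truncated = new AtomicBoolean(false);
        System.out.println(addThenCheck(4, 3).size());              // 4, one over the cap
        System.out.println(checkBeforeAdd(4, 3, truncated).size()); // 3
        System.out.println(truncated.get());                        // true
    }
}
```

With four ranges and a cap of three, the add-then-check variant keeps all four entries, so a warning based on the total exceeding the cap would fire even though nothing was dropped. The check-before-add variant keeps exactly three and sets the flag only when a range is really left out.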