Skip to content

[improve][ml] Warn and emit metric when cursor ack state exceeds persist limits#25548

Merged
lhotari merged 5 commits intoapache:masterfrom
ng-galien:fix/warn-ack-holes-exceed-persist-limit
Apr 21, 2026
Merged

[improve][ml] Warn and emit metric when cursor ack state exceeds persist limits#25548
lhotari merged 5 commits intoapache:masterfrom
ng-galien:fix/warn-ack-holes-exceed-persist-limit

Conversation

@ng-galien
Copy link
Copy Markdown
Contributor

@ng-galien ng-galien commented Apr 17, 2026

Fixes #25540

Motivation

When a cursor persists more ack ranges than managedLedgerMaxUnackedRangesToPersist or more batch deleted indexes than managedLedgerMaxBatchDeletedIndexToPersist, the excess is silently truncated. On broker restart those acks are lost and messages are redelivered. Today there is no signal when this happens — operators have to monitor totalNonContiguousDeletedMessagesRange manually. The issue discussion asks for a WARN log (with tuning advice) and a cursor-level metric.

Modifications

Commit 1 — warn and emit metric on truncation:

  • WARN log emitted once per crossing (edge-detected) in both buildIndividualDeletedMessageRanges() and buildBatchEntryDeletionIndexInfoList(), with tuning advice covering the two limits, managedLedgerPersistIndividualAckAsLongArray, and managedCursorInfoCompressionType.
  • Two OTel counters tagged with the cursor's standard attributes (pulsar.managed_ledger.name, pulsar.managed_ledger.cursor.name, pulsar.namespace)
  • Signals documented next to the settings in broker.conf.

Commit 2 — fix pre-existing off-by-one in the ranges cap:

buildIndividualDeletedMessageRanges used to persist maxRanges + 1 entries because the forEach callback added before testing rangeList.size() <= maxRanges. Regression introduced in #3819 when stream().limit(N) was dropped. Without this fix the new WARN/counter fire spuriously when totalRanges == maxRanges + 1. Fixed by switching to a check-before-add pattern (symmetric with buildBatchEntryDeletionIndexInfoList) with a MutableBoolean truncated flag.

Verifying this change

  • ManagedCursorTest.testPersistUnackedRangesTruncatedCounter
  • ManagedCursorTest.testPersistBatchDeletedIndexesTruncatedCounter

Does this pull request potentially affect one of the following parts:

  • The metrics — adds two cursor-level OTel counters (also documented in broker.conf).

ng-galien and others added 2 commits April 17, 2026 13:41
…ist limits

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…maxRanges

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ng-galien ng-galien marked this pull request as ready for review April 17, 2026 12:03
Copy link
Copy Markdown
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work @ng-galien! Some comments on naming. In the metric names, I think it's clearer to describe the effect, "persisted unacked ranges being truncated" or "persisted batch deleted indexes being truncated" — rather than using "overflow". The former tells operators directly what happened (state was dropped on persist), whereas "overflow" is a more abstract term that requires them to infer the consequence.

Copy link
Copy Markdown
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, good work @ng-galien

@ng-galien
Copy link
Copy Markdown
Contributor Author

Hi @lhotari, thanks for your review. Semantic is aligned with truncate and telemetry is more precise as you suggest.
I've just noticed the logger already have ledger and cursor name at constructor level.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ng-galien
Copy link
Copy Markdown
Contributor Author

@lhotari managed-ledger:checkstyleTest and and managed-ledger:test are now green on side, my bad.

Copy link
Copy Markdown
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments about details that Claude Code spotted.

In addition 2 comments about the description:

  • reconcile the PR description (it claims broker‑level / no cardinality growth, but the code emits per‑cursor labels),
  • update the Modifications section to use the final metric names.

Since the counter will only be emitted in Otel when the threshold has been crossed, it's fine to increase cardinality. In Prometheus this would be different.

@lhotari lhotari merged commit d553cec into apache:master Apr 21, 2026
44 checks passed
@lhotari lhotari added this to the 4.3.0 milestone Apr 21, 2026
ng-galien added a commit to ng-galien/pulsar-site that referenced this pull request Apr 21, 2026
After review on apache/pulsar#25548, the two new counters emit the
cursor's standard attribute set (pulsar.namespace, pulsar.managed_ledger.name,
pulsar.managed_ledger.cursor.name) instead of custom managedLedger/cursor
keys. Update the reference doc to match.

See apache/pulsar#25548.
lhotari pushed a commit to apache/pulsar-site that referenced this pull request Apr 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add WARN log when cursor ack holes exceed managedLedgerMaxUnackedRangesToPersist

2 participants