Skip to content

[fix][broker] Fix stuck chunks in SharedConsumerAssignor permit tracking#25620

Merged
lhotari merged 1 commit into
apache:masterfrom
merlimat:fix-chunked-shared-assignor-permits
Apr 30, 2026
Merged

[fix][broker] Fix stuck chunks in SharedConsumerAssignor permit tracking#25620
lhotari merged 1 commit into
apache:masterfrom
merlimat:fix-chunked-shared-assignor-permits

Conversation

@merlimat
Copy link
Copy Markdown
Contributor

@merlimat merlimat commented Apr 29, 2026

Motivation

MessageChunkingSharedTest.testMultiConsumers is flaky — it hangs intermittently and was bumped from 3s to 15s in #25475 to mask the symptom. The root cause is a real correctness bug in SharedConsumerAssignor that can cause chunked messages on Shared subscriptions to get permanently stuck in the redelivery queue, producing partial reassemblies on the consumer that never complete.

Bug

In SharedConsumerAssignor.getConsumerForUuid:

consumerToPermits.put(consumer, currentAvailablePermits - 1);

currentAvailablePermits is the assignor loop's local tracker for the loop's defaultConsumer. But when a cache hit on uuidToConsumer redirects a chunk to a different consumer, consumer and defaultConsumer are not the same. The line above then writes the loop consumer's tracker into the target consumer's slot, corrupting consumerToPermits and making the assignor's per-consumer accounting drift over-optimistic.

The chain that produces a stuck chunk:

  1. Over-allocation makes consumerToPermits[X] claim more permits than X actually has.
  2. The assignor allocates more entries to X than X can take.
  3. sendChunkedMessagesToConsumers trims via Math.min(consumer.getAvailablePermits(), entryList.size()) and pushes the excess to the redelivery queue.
  4. For a last-chunk entry that gets trimmed, the assignor already removed uuidToConsumer[uuid] on line 145.
  5. When that last-chunk entry is re-read from redelivery, the assignor sees no cache entry and chunkId != 0, so it returns null — and the entry goes right back to redelivery. Forever.
  6. The consumer holds chunks 0..N-2 in chunkedMessagesMap waiting for chunk N-1 that will never arrive; the partial message never reassembles, never acks.

When this happens the test hangs until TestNG's retry kicks in 30s later, masking the failure as a "slow run."

Modifications

getConsumerForUuid now decrements the target consumer's permits using the value it just read from consumerToPermits, not the caller's local tracker. The currentAvailablePermits parameter is no longer needed and was removed.

Verification

Locally, before the fix the test was bimodal: either ~0.5s or ~30s (TestNG retry threshold). After the fix, 0/30 runs were slow:

Before fix After fix
Slow runs (>5s) 2/10 0/30

Other chunking suites (MessageChunkingTest, MessageChunkingSharedTest, MessageChunkingDeduplicationTest, SharedConsumerAssignorTest) still pass.

Test plan

  • mvn test -pl pulsar-broker -Dtest='MessageChunkingSharedTest' — passes
  • mvn test -pl pulsar-broker -Dtest='SharedConsumerAssignorTest' — passes
  • Stress run: 30 iterations of MessageChunkingSharedTest.testMultiConsumers — 0 slow / 0 failed

Matching PR types in the table

  • area/broker

`getConsumerForUuid` decremented `consumerToPermits` for the target consumer
using `currentAvailablePermits - 1`, but `currentAvailablePermits` tracks the
loop's `defaultConsumer`. When a cache hit redirects a chunk to a different
consumer, those two are different, so the target consumer's tracked permits
drift over-optimistic. The assignor then over-allocates entries to that
consumer; `sendChunkedMessagesToConsumers` trims the excess to the redelivery
queue. For a last-chunk entry whose `uuidToConsumer` cache was already removed
on line 145, the chunk re-reads from redelivery as `chunkId != 0` with no
cache entry — unassignable, stuck forever — and the partial message never
reassembles on the consumer.

Decrement `consumer`'s permits using the value just read from the map.
@merlimat merlimat requested a review from lhotari April 29, 2026 20:25
@lhotari lhotari merged commit 759a5f5 into apache:master Apr 30, 2026
82 of 84 checks passed
poorbarcode pushed a commit to poorbarcode/pulsar that referenced this pull request May 6, 2026
@lhotari lhotari added this to the 5.0.0-M1 milestone Jun 1, 2026
lhotari pushed a commit that referenced this pull request Jun 1, 2026
lhotari pushed a commit that referenced this pull request Jun 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants