[fix][broker] Fix stuck chunks in SharedConsumerAssignor permit tracking#25620
Merged
lhotari merged 1 commit intoApr 30, 2026
Merged
Conversation
`getConsumerForUuid` decremented `consumerToPermits` for the target consumer using `currentAvailablePermits - 1`, but `currentAvailablePermits` tracks the loop's `defaultConsumer`. When a cache hit redirects a chunk to a different consumer, those two are different, so the target consumer's tracked permits drift over-optimistic. The assignor then over-allocates entries to that consumer; `sendChunkedMessagesToConsumers` trims the excess to the redelivery queue. For a last-chunk entry whose `uuidToConsumer` cache was already removed on line 145, the chunk re-reads from redelivery as `chunkId != 0` with no cache entry — unassignable, stuck forever — and the partial message never reassembles on the consumer. Decrement `consumer`'s permits using the value just read from the map.
lhotari
approved these changes
Apr 30, 2026
poorbarcode
pushed a commit
to poorbarcode/pulsar
that referenced
this pull request
May 6, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
MessageChunkingSharedTest.testMultiConsumersis flaky — it hangs intermittently and was bumped from 3s to 15s in #25475 to mask the symptom. The root cause is a real correctness bug inSharedConsumerAssignorthat can cause chunked messages on Shared subscriptions to get permanently stuck in the redelivery queue, producing partial reassemblies on the consumer that never complete.Bug
In
SharedConsumerAssignor.getConsumerForUuid:currentAvailablePermitsis the assignor loop's local tracker for the loop'sdefaultConsumer. But when a cache hit onuuidToConsumerredirects a chunk to a different consumer,consumeranddefaultConsumerare not the same. The line above then writes the loop consumer's tracker into the target consumer's slot, corruptingconsumerToPermitsand making the assignor's per-consumer accounting drift over-optimistic.The chain that produces a stuck chunk:
consumerToPermits[X]claim more permits than X actually has.sendChunkedMessagesToConsumerstrims viaMath.min(consumer.getAvailablePermits(), entryList.size())and pushes the excess to the redelivery queue.uuidToConsumer[uuid]on line 145.chunkId != 0, so it returnsnull— and the entry goes right back to redelivery. Forever.chunkedMessagesMapwaiting for chunk N-1 that will never arrive; the partial message never reassembles, never acks.When this happens the test hangs until TestNG's retry kicks in 30s later, masking the failure as a "slow run."
Modifications
getConsumerForUuidnow decrements the target consumer's permits using the value it just read fromconsumerToPermits, not the caller's local tracker. ThecurrentAvailablePermitsparameter is no longer needed and was removed.Verification
Locally, before the fix the test was bimodal: either ~0.5s or ~30s (TestNG retry threshold). After the fix, 0/30 runs were slow:
Other chunking suites (
MessageChunkingTest,MessageChunkingSharedTest,MessageChunkingDeduplicationTest,SharedConsumerAssignorTest) still pass.Test plan
mvn test -pl pulsar-broker -Dtest='MessageChunkingSharedTest'— passesmvn test -pl pulsar-broker -Dtest='SharedConsumerAssignorTest'— passesMessageChunkingSharedTest.testMultiConsumers— 0 slow / 0 failedMatching PR types in the table
area/broker