Fix consumer block problem: skip no entry #14543
Closed
lordcheng10 wants to merge 2 commits into apache:master from
Motivation
We ran into the following problem: the broker stops pushing messages to consumers because it tries to read an entry that does not exist, and it keeps retrying the read of that same entry.
The stats command showed that some partitions were not pushing messages to consumers:

The error log is as follows:
00:00:10.094 [pulsar-io-4-122] INFO org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers - [persistent://teg_onion_onion_common_data_gz/teg_onion_onion_common_data_gz/teg_onion_onion_common_data_gz-partition-159 / t_teg_onion_b_teg_onion_onion_common_data_gz_cg_onion_common_data_cos_svr_gz_001] Retrying read operation
00:00:28.320 [BookKeeperClientWorker-OrderedExecutor-21-0] ERROR org.apache.bookkeeper.client.PendingReadOp - Read of ledger entry failed: L15759573 E15389-E15389, Sent to [11.135.219.182:3181, 11.135.217.80:3181], Heard from [] : bitset = {}, Error = 'No such entry'. First unread entry is (-1, rc = null)
00:00:28.320 [BookKeeperClientWorker-OrderedExecutor-33-0] WARN org.apache.bookkeeper.mledger.impl.OpReadEntry - [teg_onion_onion_common_data_gz/teg_onion_onion_common_data_gz/persistent/teg_onion_onion_common_data_gz-partition-275][t_teg_onion_b_teg_onion_onion_common_data_gz_cg_onion_common_data_cos_svr_gz_001] read failed from ledger at position:15759573:15389
org.apache.bookkeeper.mledger.ManagedLedgerException$NonRecoverableLedgerException: No such entry
00:00:28.320 [BookKeeperClientWorker-OrderedExecutor-33-0] ERROR org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers - [persistent://teg_onion_onion_common_data_gz/teg_onion_onion_common_data_gz/teg_onion_onion_common_data_gz-partition-275 / t_teg_onion_b_teg_onion_onion_common_data_gz_cg_onion_common_data_cos_svr_gz_001] Error reading entries at 15759573:15389 : No such entry, Read Type Normal - Retrying to read in 54.299 seconds
Running the BookKeeper command sh bin/bookkeeper shell ledger 15759573 confirmed that entryId=15389 does not exist; the smallest existing entryId is 15558:

Monitoring also showed that the consumption delay of some partitions kept rising:

Modifications
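When a read fails because the entry does not exist ("No such entry"), skip that entry instead of retrying it, so that the consumer is no longer blocked.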
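The idea of skipping a missing entry can be illustrated with a minimal, self-contained sketch. This is not the code changed in this PR and does not use the Pulsar or BookKeeper APIs; the class name SkipMissingEntrySketch, the method nextReadableEntry, and the in-memory set standing in for a ledger are all hypothetical. It assumes the set of entry IDs that still exist is known, and it moves the read position forward to the first existing entry at or after the requested one.

```java
import java.util.NavigableSet;
import java.util.OptionalLong;
import java.util.TreeSet;

public class SkipMissingEntrySketch {

    /**
     * Hypothetical helper: given the entry IDs that still exist in a ledger,
     * return the first existing entry ID at or after the requested one.
     * An empty result means nothing is left to read in this ledger.
     */
    public static OptionalLong nextReadableEntry(NavigableSet<Long> existingEntries,
                                                 long requestedEntryId) {
        // ceiling() returns the smallest element >= requestedEntryId, or null.
        Long next = existingEntries.ceiling(requestedEntryId);
        return next == null ? OptionalLong.empty() : OptionalLong.of(next);
    }

    public static void main(String[] args) {
        // Stand-in for the ledger from the report: the smallest existing entry is 15558.
        NavigableSet<Long> existing = new TreeSet<>();
        for (long id = 15558L; id <= 15560L; id++) {
            existing.add(id);
        }

        // The dispatcher asked for entry 15389, which does not exist.
        OptionalLong next = nextReadableEntry(existing, 15389L);
        if (next.isPresent()) {
            System.out.println("Skipping ahead to entry " + next.getAsLong());
        } else {
            System.out.println("No readable entry left in this ledger");
        }
    }
}
```

In this sketch a request for the missing entry 15389 resolves to 15558 instead of being retried. Skipping means the missing entries are never delivered, so whether that is acceptable depends on the use case.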
Documentation
Check the box below or label this PR directly (if you have committer privilege).
Need to update docs?
doc-required (If you need help on updating docs, create a doc issue)
no-need-doc (Please explain why)
doc (If this PR contains doc changes)