Possible LitePullConsumer rebalance bug #2732
Comments
When a new consumer sends a heartbeat to a broker, the broker notifies all consumers in the related group to rebalance. If there are multiple brokers, a consumer receives multiple notifications. During a rebalance, the consumer picks a broker at random to look up the consumer list; if the new consumer has not yet sent a heartbeat to that broker, the returned list is incorrect.
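To make the heartbeat race concrete, here is a minimal Java sketch under stated assumptions: each broker is modeled as knowing only the client IDs that have already sent it a heartbeat, and a rebalance asks one randomly chosen broker for the group's consumer list. The class `HeartbeatView` and its methods are hypothetical and are not RocketMQ's client or broker API.

```java
import java.util.*;

// Hypothetical, simplified model: each broker only knows the clients that have
// already heartbeated to it. Not RocketMQ's real classes.
public class HeartbeatView {
    // brokerName -> client IDs that have sent a heartbeat to that broker
    private final Map<String, Set<String>> registered = new TreeMap<>();

    void heartbeat(String broker, String clientId) {
        registered.computeIfAbsent(broker, b -> new TreeSet<>()).add(clientId);
    }

    // The consumer list as one broker sees it, sorted.
    List<String> consumerList(String broker) {
        return new ArrayList<>(registered.getOrDefault(broker, Collections.emptySet()));
    }

    // A rebalance asks one randomly chosen broker (assumes at least one broker is known).
    List<String> lookupRandom(Random rnd) {
        List<String> brokers = new ArrayList<>(registered.keySet());
        return consumerList(brokers.get(rnd.nextInt(brokers.size())));
    }

    public static void main(String[] args) {
        HeartbeatView view = new HeartbeatView();
        for (String b : List.of("broker-a", "broker-b")) {
            view.heartbeat(b, "C1");
            view.heartbeat(b, "C2");
        }
        // New consumer C3 has reached broker-a but not yet broker-b.
        view.heartbeat("broker-a", "C3");
        System.out.println(view.consumerList("broker-a")); // [C1, C2, C3]
        System.out.println(view.consumerList("broker-b")); // [C1, C2]
        // Depending on the random pick, a rebalance sees 3 or 2 consumers.
        System.out.println(view.lookupRandom(new Random()).size());
    }
}
```

Until C3's heartbeat reaches every broker, two consecutive rebalances on the same client can compute their allocation from different consumer lists.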
Yes, but an incorrect consumer list should not result in lost messages, since it will be corrected anyway. My main confusion is about the first point: why is one MessageQueue processed (Pull Task) by two pull threads, as the rocketmq-client log suggests? Do you have any ideas?
@Zanglei06 Multiple rebalance processes with different consumer lists will cause this issue.
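A simplified Java model of how two pull threads can end up on the same MessageQueue, assuming that removing a queue only flags its pull task as cancelled (a task already mid-pull keeps running until it next checks the flag) and that re-adding the queue starts a new task. It also assumes this client sorts first, so it holds queues 0-7 with three cids and 0-11 with two, and that no task finishes its pull between rebalances (the logged rebalances are only 2-4 ms apart). `TaskTableModel` and `PullTask` are hypothetical names, not RocketMQ classes.

```java
import java.util.*;

// Hypothetical model of a per-queue pull-task table. Removing a queue only sets
// a cancelled flag; it does not stop a task that is already mid-pull.
public class TaskTableModel {
    static final class PullTask {
        final int queueId;
        volatile boolean cancelled;
        PullTask(int queueId) { this.queueId = queueId; }
    }

    private final Map<Integer, PullTask> taskTable = new HashMap<>();
    // Every task ever started; assumed still in flight (none has finished a pull yet).
    private final List<PullTask> inFlight = new ArrayList<>();

    void rebalance(Set<Integer> assigned) {
        // Queues no longer assigned: flag the task as cancelled and forget it.
        taskTable.entrySet().removeIf(e -> {
            if (!assigned.contains(e.getKey())) {
                e.getValue().cancelled = true;
                return true;
            }
            return false;
        });
        // Newly assigned queues: start a fresh task.
        for (int q : assigned) {
            taskTable.computeIfAbsent(q, id -> {
                PullTask t = new PullTask(id);
                inFlight.add(t);
                return t;
            });
        }
    }

    long inFlightTasks(int queueId) {
        return inFlight.stream().filter(t -> t.queueId == queueId).count();
    }

    static Set<Integer> range(int from, int to) {
        Set<Integer> s = new TreeSet<>();
        for (int i = from; i < to; i++) s.add(i);
        return s;
    }

    public static void main(String[] args) {
        TaskTableModel c1 = new TaskTableModel();
        // 3, 2, 3, 2, 3 cids -> 8, 12, 8, 12, 8 queues for this client.
        for (Set<Integer> r : List.of(range(0, 8), range(0, 12), range(0, 8), range(0, 12), range(0, 8))) {
            c1.rebalance(r);
        }
        System.out.println(c1.inFlightTasks(0)); // 1
        System.out.println(c1.inFlightTasks(9)); // 2
    }
}
```

Each time queue 9 flips back into the assignment, a new task starts while the previous, cancelled one may still be inside its pull. That is one way to get two threads logging `The Pull Task is cancelled after doPullTask` for the same queue.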
Heartbeat failures may cause load-balancing problems.
Load-balancing problems should not cause message loss; at most they cause short-lived message duplication. Can you explain why the first log appears?
I have reproduced the bug in this commit: First, create a topic with only one queue; then run LitePullConsumerBug. It sends two messages and receives only one, and the offset is written to consumerOffset.json. The output:
The root cause is in DefaultLitePullConsumerImpl.PullTaskImpl.run: after a ProcessQueue is dropped, the PullTask still runs, and this stale task calls updatePullOffset.
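The race can be reduced to a small single-threaded Java model, assuming a stale task pulls a message, the rebalance drops its process queue mid-pull, and messages buffered in a dropped queue are never returned by `poll()`. `StaleTaskModel`, `SimpleProcessQueue` and the `guard` flag are hypothetical; the guard shows one way to make a stale task leave the offset alone, not necessarily how the merged fix (#2832) does it.

```java
import java.util.*;

// Hypothetical model of a stale pull task updating the pull offset.
// Not RocketMQ's DefaultLitePullConsumerImpl API.
public class StaleTaskModel {
    static final class SimpleProcessQueue {
        volatile boolean dropped;
    }

    long pullOffset;                                 // next offset to pull (later committed)
    final List<Long> delivered = new ArrayList<>();  // what the application sees via poll()

    // One pull step. dropDuringPull simulates a rebalance dropping the queue mid-pull.
    void runPullTask(SimpleProcessQueue pq, boolean dropDuringPull, boolean guard) {
        if (pq.dropped) return;                  // task notices cancellation up front
        long pulled = pullOffset;                // "broker" returns the message at pullOffset
        if (dropDuringPull) pq.dropped = true;   // rebalance drops the queue while pulling
        if (guard && pq.dropped) return;         // guarded: discard result, keep offset
        if (!pq.dropped) delivered.add(pulled);  // a dropped queue's messages never reach poll()
        pullOffset = pulled + 1;                 // updatePullOffset
    }

    // Two messages (offsets 0 and 1) on a single queue.
    static List<Long> scenario(boolean guard) {
        StaleTaskModel m = new StaleTaskModel();
        long messageCount = 2;
        SimpleProcessQueue oldPq = new SimpleProcessQueue();
        m.runPullTask(oldPq, false, guard);      // pulls and delivers offset 0
        m.runPullTask(oldPq, true, guard);       // queue dropped while pulling offset 1
        SimpleProcessQueue newPq = new SimpleProcessQueue();  // queue re-assigned
        while (m.pullOffset < messageCount) m.runPullTask(newPq, false, guard);
        return m.delivered;
    }

    public static void main(String[] args) {
        System.out.println(scenario(false)); // [0]
        System.out.println(scenario(true));  // [0, 1]
    }
}
```

Without the guard, the stale task advances the pull offset to 2 even though offset 1 was never delivered, which matches the "sends two messages, receives one" reproduction; with the guard, the re-created task pulls offset 1 again.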
Merged.
* [ISSUE #1233] Fix CVE-2011-1473
* fix Multiple instances in the same application share MQClientInstance
* [ISSUE #2748] Fix deleteSubscriptionGroup not remove consumer offset
* [ISSUE #2745] Changed the support time of the request/reply feature to 4.6.0.
  Co-authored-by: von gosling <vongosling@apache.org>
* [ISSUE #2729] Replace with Math.min method call
* [ISSUE #2801]Fix NamesrvAddr connot set in Producer
* [ISSUE 2800] optimize: the spelling of topicSynFlag
  Co-authored-by: ph3636 <tianxingguang@kanzhun.com>
* [ISSUE #2803] Fix the endpoint cannot get instanceId without http (#2804)
* fix the endpoint cannot get instanceId without http
* fix the endpoint cannot get instanceId without http
* add unit test
* add unit test
* add unit test
  Co-authored-by: panzhi33 <wb-pz502261@alibaba-inc.com>
* fix messageArrivingListener NPE
* [ISSUE #2538]Optimize log output when message trace saving fails
* [ISSUE #2811] Fix the wrong topic was consumed in the DefaultMessageStoreTest test program
* [ISSUE #2821] Overriding the ServiceThread#shutdown in HAClient class
* [ISSUE #2805] remove redundant package imports
* [ISSUE #2833] Support trace for TranscationProducer (#2834)
* [ISSUE #2732] Fix message loss problem when rebalance with LitePullConsumer (#2832)
* [ISSUE #2732] Fix message loss problem when rebalance with LitePullConsumer
* Fix message loss problem when rebalance with LitePullConsumer, update 2
* [ISSUE #2846]fix -E might not port to other systems
* fix some nonconformity after checkstyle
* Support OpenTracing(#2861)
* [ISSUE #2872] remove log files created by integration test when mvn clean
* [ISSUE #2872] move log files created by integration test to target dir
* Change log level to debug: "Half offset {} has been committed/rolled back"
* Fix unit test stability Bump mockito-core to 3.10.0, remove powermock dependency, suppress useless logging
* [ISSUE #2898] Resolve rocketmq-example project failed during checkstyle execution (#2899)

Co-authored-by: SSpirits <shadowyspirits@outlook.com>
Co-authored-by: panzhi33 <wb-pz502261@alibaba-inc.com>
Co-authored-by: panzhi <panzhi33@qq.com>
Co-authored-by: ArronHuang <41609451+ArronHuang@users.noreply.github.com>
Co-authored-by: von gosling <vongosling@apache.org>
Co-authored-by: drgnchan <40224023+drgnchan@users.noreply.github.com>
Co-authored-by: zhangjidi2016 <zhangjidi@cmss.chinamobile.com>
Co-authored-by: ph3636 <38041490+ph3636@users.noreply.github.com>
Co-authored-by: ph3636 <tianxingguang@kanzhun.com>
Co-authored-by: BurningCN <1015773611@qq.com>
Co-authored-by: francis lee <francislee.cn@outlook.com>
Co-authored-by: 灼华 <43363120+BurningCN@users.noreply.github.com>
Co-authored-by: yuz10 <845238369@qq.com>
Co-authored-by: huangli <areyouok@gmail.com>
Co-authored-by: chenrl <raymond2366@outlook.com>
Co-authored-by: ayanamist <ayanamist@gmail.com>
Co-authored-by: zhangjidi2016 <1017543663@qq.com>
…PullConsumer (apache#2832)
* [ISSUE apache#2732] Fix message loss problem when rebalance with LitePullConsumer
* Fix message loss problem when rebalance with LitePullConsumer, update 2
What did you do (The steps to reproduce)?
What did you expect to see?
What did you see instead?
In our production environment, I found that some messages were lost when a new consumer started (triggering a rebalance). We use RMQ 4.7.1 with the new LitePullConsumer API.
From the rocketmq-client log, something unexpected happened:
2021-03-09 20:16:19.911 WARN [PullMsgThread-c_g1] (Slf4jLoggerFactory.java:115) - The Pull Task is cancelled after doPullTask, MessageQueue [topic=t, brokerName=rmq-b5, queueId=3]
2021-03-09 20:16:19.911 WARN [PullMsgThread-c_g2] (Slf4jLoggerFactory.java:115) - The Pull Task is cancelled after doPullTask, MessageQueue [topic=t, brokerName=rmq-b5, queueId=3]
Below are the logs for the rebalance results (I changed some internal IPs and broker names for security reasons):
2021-03-09 20:16:19.777 INFO [RebalanceService] (Slf4jLoggerFactory.java:100) - rebalanced result changed. allocateMessageQueueStrategyName=AVG, group=c_g, topic=t, clientId=XXX_C1, mqAllSize=24, cidAllSize=3, rebalanceResultSize=8, rebalanceResultSet=XXX (3 cid, correct)
2021-03-09 20:16:19.779 INFO [RebalanceService] (Slf4jLoggerFactory.java:100) - rebalanced result changed. allocateMessageQueueStrategyName=AVG, group=c_g, topic=t, clientId=XXX_C1, mqAllSize=24, cidAllSize=2, rebalanceResultSize=12, rebalanceResultSet=XXX (2 cid, wrong)
2021-03-09 20:16:19.781 INFO [RebalanceService] (Slf4jLoggerFactory.java:100) - rebalanced result changed. allocateMessageQueueStrategyName=AVG, group=c_g, topic=t, clientId=XXX_C1, mqAllSize=24, cidAllSize=3, rebalanceResultSize=8, rebalanceResultSet=XXX (3 cid, correct)
2021-03-09 20:16:19.784 INFO [RebalanceService] (Slf4jLoggerFactory.java:100) - rebalanced result changed. allocateMessageQueueStrategyName=AVG, group=c_g, topic=t, clientId=XXX_C1, mqAllSize=24, cidAllSize=2, rebalanceResultSize=12, rebalanceResultSet=XXX (2 cid, wrong)
2021-03-09 20:16:19.785 INFO [RebalanceService] (Slf4jLoggerFactory.java:100) - rebalanced result changed. allocateMessageQueueStrategyName=AVG, group=c_g, topic=t, clientId=XXX_C1, mqAllSize=24, cidAllSize=3, rebalanceResultSize=8, rebalanceResultSet=XXX (3 cid, correct)
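The 8-versus-12 flip-flop follows from averaging 24 queues over 3 or 2 client IDs. The sketch below is modeled on RocketMQ's averaging strategy (`AllocateMessageQueueAveragely`, logged as `AVG`) as we understand it; queues and client IDs are plain strings here, and `C1`/`C2`/`C3` with `C3` as the newcomer are hypothetical.

```java
import java.util.*;

// Sketch of an "average" allocation modeled on RocketMQ's AVG strategy.
// mqAll and cidAll are assumed to be sorted identically on every client.
public class AvgAllocation {
    static List<String> allocate(List<String> mqAll, List<String> cidAll, String currentCid) {
        List<String> result = new ArrayList<>();
        int index = cidAll.indexOf(currentCid);
        if (index < 0) return result;                 // this client is not in the list
        int mod = mqAll.size() % cidAll.size();
        // The first 'mod' clients get one extra queue.
        int averageSize = mqAll.size() <= cidAll.size() ? 1
                : (mod > 0 && index < mod ? mqAll.size() / cidAll.size() + 1
                                          : mqAll.size() / cidAll.size());
        int startIndex = (mod > 0 && index < mod) ? index * averageSize : index * averageSize + mod;
        int range = Math.min(averageSize, mqAll.size() - startIndex);
        for (int i = 0; i < range; i++) {
            result.add(mqAll.get((startIndex + i) % mqAll.size()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> mqAll = new ArrayList<>();
        for (int i = 0; i < 24; i++) mqAll.add("q" + i);
        List<String> full = List.of("C1", "C2", "C3");   // correct view
        List<String> stale = List.of("C1", "C2");        // newcomer C3 missing

        System.out.println(allocate(mqAll, full, "C1").size());  // 8
        System.out.println(allocate(mqAll, stale, "C1").size()); // 12

        // Queues C1 claims from the stale view that C2 also claims from the full view.
        Set<String> overlap = new TreeSet<>(allocate(mqAll, stale, "C1"));
        overlap.retainAll(allocate(mqAll, full, "C2"));
        System.out.println(overlap); // [q10, q11, q8, q9]
    }
}
```

While C1 works from a consumer list that omits C3, it claims q8 to q11, which C2 also claims from the complete list, so for a short window the same queues are pulled by two clients. This fits the expectation that a wrong list alone should cause duplicates rather than loss.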
Additional info:
In one Java process we run one consumer and one producer with different clientIds. The consumer polls messages for one group and one topic (only one subscription); the producer sends messages to many topics (different from the consumer's topic).
RMQ 4.7.1
LitePullConsumer