possible LitePullConsumer rebalance bug #2732

Closed
Zanglei06 opened this issue Mar 16, 2021 · 7 comments · Fixed by #2832

Comments
@Zanglei06
Contributor

Zanglei06 commented Mar 16, 2021

  1. Please describe the issue you observed:
  • What did you do (The steps to reproduce)?

  • What did you expect to see?

  • What did you see instead?

In our production environment, I found some messages lost when a new consumer started (triggering a rebalance). The RocketMQ version we use is 4.7.1, and we use the new LitePullConsumer API.

From the rocketmq-client log, something unexpected happened:

  1. The same messageQueue was detected as needing its pull task cancelled in two different threads at almost the same time.

2021-03-09 20:16:19.911 WARN [PullMsgThread-c_g1] (Slf4jLoggerFactory.java:115) - The Pull Task is cancelled after doPullTask, MessageQueue [topic=t, brokerName=rmq-b5, queueId=3]
2021-03-09 20:16:19.911 WARN [PullMsgThread-c_g2] (Slf4jLoggerFactory.java:115) - The Pull Task is cancelled after doPullTask, MessageQueue [topic=t, brokerName=rmq-b5, queueId=3]

  2. Before the rebalance there were two consumers (two consumer cids); when a new consumer started, it became three (3 consumer cids). However, the rebalance was triggered several times with the wrong cid count, which means the findConsumerIdList API returned a wrong value. I think the wrong rebalance should not cause any message loss, since rebalance runs in a single thread and eventually converges to the correct result. But why a wrong cid list was returned is interesting (calling a wrong broker? a slave? but why?).

Below are the logs for the rebalance results (I changed some internal IP and brokerName info for security reasons):

2021-03-09 20:16:19.777 INFO [RebalanceService] (Slf4jLoggerFactory.java:100) - rebalanced result changed. allocateMessageQueueStrategyName=AVG, group=c_g, topic=t, clientId=XXX_C1, mqAllSize=24, cidAllSize=3, rebalanceResultSize=8, rebalanceResultSet=XXX (3 cid, correct)
2021-03-09 20:16:19.779 INFO [RebalanceService] (Slf4jLoggerFactory.java:100) - rebalanced result changed. allocateMessageQueueStrategyName=AVG, group=c_g, topic=t, clientId=XXX_C1, mqAllSize=24, cidAllSize=2, rebalanceResultSize=12, rebalanceResultSet=XXX (2 cid, wrong)
2021-03-09 20:16:19.781 INFO [RebalanceService] (Slf4jLoggerFactory.java:100) - rebalanced result changed. allocateMessageQueueStrategyName=AVG, group=c_g, topic=t, clientId=XXX_C1, mqAllSize=24, cidAllSize=3, rebalanceResultSize=8, rebalanceResultSet=XXX (3 cid, correct)
2021-03-09 20:16:19.784 INFO [RebalanceService] (Slf4jLoggerFactory.java:100) - rebalanced result changed. allocateMessageQueueStrategyName=AVG, group=c_g, topic=t, clientId=XXX_C1, mqAllSize=24, cidAllSize=2, rebalanceResultSize=12, rebalanceResultSet=XXX (2 cid, wrong)
2021-03-09 20:16:19.785 INFO [RebalanceService] (Slf4jLoggerFactory.java:100) - rebalanced result changed. allocateMessageQueueStrategyName=AVG, group=c_g, topic=t, clientId=XXX_C1, mqAllSize=24, cidAllSize=3, rebalanceResultSize=8, rebalanceResultSet=XXX (3 cid, correct)

  3. According to the rocketmq-client log, the rebalance notifications from the broker came from slave broker IPs.

Additional info:
In one Java process we have one consumer and one producer with different clientIds; the consumer polls messages for one group and one topic (only one subscription); the producer sends messages to many topics (different from the consumer's topic).
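
For illustration, a minimal sketch of that setup (group and topic names are placeholders taken from the logs above, and the nameserver address is an assumption):

import org.apache.rocketmq.client.consumer.DefaultLitePullConsumer;
import org.apache.rocketmq.client.producer.DefaultMQProducer;

public class OneProcessSetup {
    public static void main(String[] args) throws Exception {
        // One JVM, one LitePullConsumer and one producer. Distinct instance names
        // give them distinct clientIds (clientId = ip@instanceName), so they run
        // on separate client instances.
        DefaultLitePullConsumer consumer = new DefaultLitePullConsumer("c_g");
        consumer.setNamesrvAddr("127.0.0.1:9876");   // assumption: local nameserver
        consumer.setInstanceName("consumer-instance");
        consumer.subscribe("t", "*");                // the single subscription
        consumer.start();

        DefaultMQProducer producer = new DefaultMQProducer("p_g");
        producer.setNamesrvAddr("127.0.0.1:9876");
        producer.setInstanceName("producer-instance");
        producer.start();                            // used to send to other topics
    }
}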

  2. Please tell us about your environment:

RMQ 4.7.1

LitePullConsumer

@ifplusor
Contributor

When a new consumer sends a heartbeat to a broker, the broker will notify all consumers of the related group to do rebalance. If there are multiple brokers, a consumer will receive multiple notifications.

In the rebalance process, the consumer selects a broker randomly to look up the consumer list. If the consumer did not send any heartbeat to the broker at the time, the returned list is incorrect.
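
A toy model of why cidAllSize flips between 2 and 3 in the logs above (illustrative only, not RocketMQ client code):

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// The client asks a randomly chosen broker for the group's consumer list, so right
// after a new consumer joins, brokers that have not yet received its heartbeat
// still answer with the old, shorter list.
public class RandomBrokerLookup {
    public static void main(String[] args) {
        Map<String, List<String>> cidListPerBroker = new HashMap<>();
        // broker-a already received the new consumer's heartbeat, broker-b has not yet
        cidListPerBroker.put("broker-a", Arrays.asList("C1", "C2", "C3"));
        cidListPerBroker.put("broker-b", Arrays.asList("C1", "C2"));

        List<String> brokers = Arrays.asList("broker-a", "broker-b");
        Random random = new Random();
        for (int rebalance = 1; rebalance <= 5; rebalance++) {
            String broker = brokers.get(random.nextInt(brokers.size()));
            List<String> cidAll = cidListPerBroker.get(broker);
            // prints an alternating cidAllSize=3 / cidAllSize=2, like the log lines in the report
            System.out.printf("rebalance %d via %s -> cidAllSize=%d%n", rebalance, broker, cidAll.size());
        }
    }
}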

@Zanglei06
Contributor Author

When a new consumer sends a heartbeat to a broker, the broker will notify all consumers of the related group to do rebalance. If there are multiple brokers, a consumer will receive multiple notifications.

In the rebalance process, the consumer selects a broker randomly to look up the consumer list. If the consumer did not send any heartbeat to the broker at the time, the returned list is incorrect.

Yes, and an incorrect consumer list should not result in lost messages, since it will be corrected anyway. My major confusion is the first question: why is one messageQueue processed (Pull Task) by two pull threads, as suggested by the rocketmq-client log? Do you have any ideas?

@ifplusor
Contributor

ifplusor commented Mar 16, 2021

@Zanglei06 Multiple rebalance processes with different consumer lists will cause this issue.

Zanglei06 changed the title from "LitePullConsumer rebalance bug" to "possible LitePullConsumer rebalance bug" on Mar 16, 2021
@lizhimins
Member

Heartbeat failure may cause load balancing problems

duhenglucky added this to the 4.8.1 milestone Mar 18, 2021
@Zanglei06
Contributor Author

Heartbeat failure may cause load balancing problems

Load-balancing problems should not include message loss, only some short-lived message duplication. Can you explain why the first log appears?

@areyouok
Contributor

I have reproduced the bug in this commit:
areyouok@a54599e

First, create a topic with only one queue:
./mqadmin updateTopic -n 127.0.0.1:9876 -t TopicTest -r 1 -w 1 -c DefaultCluster

Then run LitePullConsumerBug: it sends two messages but receives only one, and the offset is still updated in consumerOffset.json.

The output:

RocketMQLog:WARN No appenders could be found for logger (io.netty.util.internal.PlatformDependent0).
RocketMQLog:WARN Please initialize the logger system properly.
SendResult [sendStatus=SEND_OK, msgId=AC1128381E4718B4AAC2691C5DF80000, offsetMsgId=AC11283800002A9F0000000000000C60, messageQueue=MessageQueue [topic=TopicTest, brokerName=broker-a, queueId=0], queueOffset=18]
09:51:07,801 1749887142 get queue
09:51:07,803 1749887142 pull step 1
09:51:08,776 start consumer 2
09:51:09,787 203029914 get queue
09:51:09,787 1749887142 lost queue
09:51:09,788 203029914 pull step 1
09:51:10,790 1749887142 get queue
09:51:10,791 1749887142 pull step 1
09:51:10,808 1749887142 pull step 2
09:51:10,814 submit 1 msgs
09:51:10,815 drop 1 msgs
09:51:12,792 203029914 pull step 2
09:51:13,793 1749887142 pull step 2
SendResult [sendStatus=SEND_OK, msgId=AC1128381E4718B4AAC2691C815A0001, offsetMsgId=AC11283800002A9F0000000000000D10, messageQueue=MessageQueue [topic=TopicTest, brokerName=broker-a, queueId=0], queueOffset=19]
09:51:15,806 submit 1 msgs
receive msg: AC1128381E4718B4AAC2691C815A0001
09:51:15,807 1749887142 pull step 1
09:51:18,807 1749887142 pull step 2
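
For reference, a rough self-contained sketch of this kind of reproduction (this is not the code from the linked commit; the topic, group and instance names as well as the sleep timings are my own assumptions):

import java.util.List;
import org.apache.rocketmq.client.consumer.DefaultLitePullConsumer;
import org.apache.rocketmq.client.producer.DefaultMQProducer;
import org.apache.rocketmq.common.message.Message;
import org.apache.rocketmq.common.message.MessageExt;

public class LitePullRebalanceRepro {
    static final String NAMESRV = "127.0.0.1:9876";  // assumption: local nameserver
    static final String TOPIC = "TopicTest";         // the single-queue topic created above
    static final String GROUP = "repro_consumer_group";

    static void startConsumer(String instanceName) throws Exception {
        DefaultLitePullConsumer consumer = new DefaultLitePullConsumer(GROUP);
        consumer.setNamesrvAddr(NAMESRV);
        consumer.setInstanceName(instanceName);      // distinct clientId per consumer
        consumer.subscribe(TOPIC, "*");
        consumer.start();
        new Thread(() -> {
            while (true) {
                List<MessageExt> msgs = consumer.poll();
                for (MessageExt msg : msgs) {
                    System.out.println(instanceName + " received " + msg.getMsgId());
                }
            }
        }).start();
    }

    public static void main(String[] args) throws Exception {
        DefaultMQProducer producer = new DefaultMQProducer("repro_producer_group");
        producer.setNamesrvAddr(NAMESRV);
        producer.start();

        startConsumer("consumer-1");
        Thread.sleep(2000);
        producer.send(new Message(TOPIC, "msg-1".getBytes()));  // pulled by consumer-1

        startConsumer("consumer-2");                             // opens the rebalance window
        Thread.sleep(10000);
        producer.send(new Message(TOPIC, "msg-2".getBytes()));
        // With the bug, "msg-1" may never be printed: consumer-1's stale pull task
        // commits an offset past it after the queue has already been dropped.
    }
}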

The root cause is in DefaultLitePullConsumerImpl.PullTaskImpl.run:

...
// at the start of the pull task, read the next pull offset for this queue
long offset = nextPullOffset(messageQueue);
...
// after the pull returns, the new offset is written back unconditionally
updatePullOffset(messageQueue, pullResult.getNextBeginOffset());

After the ProcessQueue has been dropped (for example by a rebalance), the PullTask may still be running. The stale task still calls updatePullOffset and advances the pull offset, so messages that were pulled but dropped are skipped.
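
A self-contained toy illustration of the kind of guard that prevents this (not the actual patch in #2832; the class and method names below are made up):

import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Toy model, not RocketMQ code: a pull task must not commit a new pull offset once
// its queue has been dropped by a rebalance, otherwise the offset jumps past
// messages that were pulled but never delivered to poll().
class DroppedQueueGuardSketch {
    private final AtomicBoolean dropped = new AtomicBoolean(false); // set when a rebalance takes the queue away
    private final AtomicLong pullOffset = new AtomicLong(18);       // next offset to pull, cf. queueOffset=18 above

    void onQueueDropped() {
        dropped.set(true);
    }

    // Called at the end of a pull round with the broker's nextBeginOffset.
    void finishPull(long nextBeginOffset, boolean messagesDelivered) {
        if (!dropped.get() && messagesDelivered) {
            pullOffset.set(nextBeginOffset);  // safe: the queue is still owned by this consumer
        }
        // otherwise leave pullOffset untouched so the queue's new owner re-pulls the skipped messages
    }

    long currentPullOffset() {
        return pullOffset.get();
    }
}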

areyouok added a commit to areyouok/rocketmq that referenced this issue Apr 22, 2021
areyouok added a commit to areyouok/rocketmq that referenced this issue Apr 22, 2021
RongtongJin pushed a commit that referenced this issue Apr 27, 2021
[ISSUE #2732] Fix message loss problem when rebalance with LitePullConsumer (#2832)

* [ISSUE #2732] Fix message loss problem when rebalance with LitePullConsumer

* Fix message loss problem when rebalance with LitePullConsumer, update 2
@RongtongJin
Contributor

Merged

vongosling added a commit that referenced this issue May 19, 2021
zongtanghu modified the milestones: 4.8.1, 4.9.0 Jun 4, 2021