possible LitePullConsumer rebalance bug #2732

Closed
Zanglei06 opened this issue Mar 16, 2021 · 7 comments · Fixed by #2832

Comments
@Zanglei06
Contributor

Zanglei06 commented Mar 16, 2021

  1. Please describe the issue you observed:
  • What did you do (The steps to reproduce)?

  • What did you expect to see?

  • What did you see instead?

In our production environment, I found some messages lost when a new consumer started (triggering a rebalance). The RocketMQ version we use is 4.7.1, and we use the new LitePullConsumer API.

From the rocketmq-client log, something unexpected happened:

  1. The same messageQueue was detected as needing its pull task cancelled in two different threads at almost the same time.

2021-03-09 20:16:19.911 WARN [PullMsgThread-c_g1] (Slf4jLoggerFactory.java:115) - The Pull Task is cancelled after doPullTask, MessageQueue [topic=t, brokerName=rmq-b5, queueId=3]
2021-03-09 20:16:19.911 WARN [PullMsgThread-c_g2] (Slf4jLoggerFactory.java:115) - The Pull Task is cancelled after doPullTask, MessageQueue [topic=t, brokerName=rmq-b5, queueId=3]

  2. Before the rebalance there were two consumers (two consumer cids); when a new consumer started, it became three (3 consumer cids). However, the rebalance was triggered several times with the wrong cid count, which means the findConsumerIdList API returned a wrong value. I think the wrong rebalance should not cause any message loss, since rebalance runs in a single thread and eventually converges to the correct result. But why a wrong cid list was returned is interesting (calling a wrong broker? a slave? but why?).

Below are the logs for the rebalance results (I changed some internal IP and brokerName info for security reasons):

2021-03-09 20:16:19.777 INFO [RebalanceService] (Slf4jLoggerFactory.java:100) - rebalanced result changed. allocateMessageQueueStrategyName=AVG, group=c_g, topic=t, clientId=XXX_C1, mqAllSize=24, cidAllSize=3, rebalanceResultSize=8, rebalanceResultSet=XXX (3 cid, correct)
2021-03-09 20:16:19.779 INFO [RebalanceService] (Slf4jLoggerFactory.java:100) - rebalanced result changed. allocateMessageQueueStrategyName=AVG, group=c_g, topic=t, clientId=XXX_C1, mqAllSize=24, cidAllSize=2, rebalanceResultSize=12, rebalanceResultSet=XXX (2 cid, wrong)
2021-03-09 20:16:19.781 INFO [RebalanceService] (Slf4jLoggerFactory.java:100) - rebalanced result changed. allocateMessageQueueStrategyName=AVG, group=c_g, topic=t, clientId=XXX_C1, mqAllSize=24, cidAllSize=3, rebalanceResultSize=8, rebalanceResultSet=XXX (3 cid, correct)
2021-03-09 20:16:19.784 INFO [RebalanceService] (Slf4jLoggerFactory.java:100) - rebalanced result changed. allocateMessageQueueStrategyName=AVG, group=c_g, topic=t, clientId=XXX_C1, mqAllSize=24, cidAllSize=2, rebalanceResultSize=12, rebalanceResultSet=XXX (2 cid, wrong)
2021-03-09 20:16:19.785 INFO [RebalanceService] (Slf4jLoggerFactory.java:100) - rebalanced result changed. allocateMessageQueueStrategyName=AVG, group=c_g, topic=t, clientId=XXX_C1, mqAllSize=24, cidAllSize=3, rebalanceResultSize=8, rebalanceResultSet=XXX (3 cid, correct)

  3. According to the rocketmq-client log, the rebalance notifications from the broker came from slave broker IPs.

Additional info:
In one Java process we have one consumer and one producer with different clientIds; the consumer polls messages for one group and one topic (only one subscription); the producer sends messages to many topics (different from the consumer's topic).
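
For illustration, a minimal sketch of that setup (group and topic names are placeholders taken from the logs above, and the nameserver address is an assumption):

import org.apache.rocketmq.client.consumer.DefaultLitePullConsumer;
import org.apache.rocketmq.client.producer.DefaultMQProducer;

public class OneProcessSetup {
    public static void main(String[] args) throws Exception {
        // One JVM, one LitePullConsumer and one producer. Distinct instance names
        // give them distinct clientIds (clientId = ip@instanceName), so they run
        // on separate client instances.
        DefaultLitePullConsumer consumer = new DefaultLitePullConsumer("c_g");
        consumer.setNamesrvAddr("127.0.0.1:9876");   // assumption: local nameserver
        consumer.setInstanceName("consumer-instance");
        consumer.subscribe("t", "*");                // the single subscription
        consumer.start();

        DefaultMQProducer producer = new DefaultMQProducer("p_g");
        producer.setNamesrvAddr("127.0.0.1:9876");
        producer.setInstanceName("producer-instance");
        producer.start();                            // used to send to other topics
    }
}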

  2. Please tell us about your environment:

RMQ 4.7.1

LitePullConsumer

@ifplusor
Contributor

When a new consumer sends a heartbeat to a broker, the broker will notify all consumers of the related group to do rebalance. If there are multiple brokers, a consumer will receive multiple notifications.

In the rebalance process, the consumer selects a broker randomly to look up the consumer list. If the consumer did not send any heartbeat to the broker at the time, the returned list is incorrect.
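
A toy model of why cidAllSize flips between 2 and 3 in the logs above (illustrative only, not RocketMQ client code):

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// The client asks a randomly chosen broker for the group's consumer list, so right
// after a new consumer joins, brokers that have not yet received its heartbeat
// still answer with the old, shorter list.
public class RandomBrokerLookup {
    public static void main(String[] args) {
        Map<String, List<String>> cidListPerBroker = new HashMap<>();
        // broker-a already received the new consumer's heartbeat, broker-b has not yet
        cidListPerBroker.put("broker-a", Arrays.asList("C1", "C2", "C3"));
        cidListPerBroker.put("broker-b", Arrays.asList("C1", "C2"));

        List<String> brokers = Arrays.asList("broker-a", "broker-b");
        Random random = new Random();
        for (int rebalance = 1; rebalance <= 5; rebalance++) {
            String broker = brokers.get(random.nextInt(brokers.size()));
            List<String> cidAll = cidListPerBroker.get(broker);
            // prints an alternating cidAllSize=3 / cidAllSize=2, like the log lines in the report
            System.out.printf("rebalance %d via %s -> cidAllSize=%d%n", rebalance, broker, cidAll.size());
        }
    }
}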

@Zanglei06
Contributor Author

When a new consumer sends a heartbeat to a broker, the broker will notify all consumers of the related group to do rebalance. If there are multiple brokers, a consumer will receive multiple notifications.

In the rebalance process, the consumer selects a broker randomly to look up the consumer list. If the consumer did not send any heartbeat to the broker at the time, the returned list is incorrect.

Yes, and an incorrect consumer list should not result in lost messages, since it will be corrected anyway. My major confusion is the first question: why is one messageQueue processed (Pull Task) by two pull threads, as suggested by the rocketmq-client log? Do you have any ideas?

@ifplusor
Contributor

ifplusor commented Mar 16, 2021

@Zanglei06 Multiple rebalance processes with different consumer lists will cause this issue.

Zanglei06 changed the title from "LitePullConsumer rebalance bug" to "possible LitePullConsumer rebalance bug" on Mar 16, 2021
@lizhimins
Member

Heartbeat failure may cause load balancing problems

duhenglucky added this to the 4.8.1 milestone Mar 18, 2021
@Zanglei06
Contributor Author

Heartbeat failure may cause load balancing problems

Load-balancing problems should not include message loss, only some short-lived message duplication. Can you explain why the first log appears?

@areyouok
Contributor

I have reproduced the bug in this commit:
areyouok@a54599e

First, create a topic with only one queue:
./mqadmin updateTopic -n 127.0.0.1:9876 -t TopicTest -r 1 -w 1 -c DefaultCluster

Then run LitePullConsumerBug: it sends two messages but receives only one, and the offset is still updated in consumerOffset.json.

The output:

RocketMQLog:WARN No appenders could be found for logger (io.netty.util.internal.PlatformDependent0).
RocketMQLog:WARN Please initialize the logger system properly.
SendResult [sendStatus=SEND_OK, msgId=AC1128381E4718B4AAC2691C5DF80000, offsetMsgId=AC11283800002A9F0000000000000C60, messageQueue=MessageQueue [topic=TopicTest, brokerName=broker-a, queueId=0], queueOffset=18]
09:51:07,801 1749887142 get queue
09:51:07,803 1749887142 pull step 1
09:51:08,776 start consumer 2
09:51:09,787 203029914 get queue
09:51:09,787 1749887142 lost queue
09:51:09,788 203029914 pull step 1
09:51:10,790 1749887142 get queue
09:51:10,791 1749887142 pull step 1
09:51:10,808 1749887142 pull step 2
09:51:10,814 submit 1 msgs
09:51:10,815 drop 1 msgs
09:51:12,792 203029914 pull step 2
09:51:13,793 1749887142 pull step 2
SendResult [sendStatus=SEND_OK, msgId=AC1128381E4718B4AAC2691C815A0001, offsetMsgId=AC11283800002A9F0000000000000D10, messageQueue=MessageQueue [topic=TopicTest, brokerName=broker-a, queueId=0], queueOffset=19]
09:51:15,806 submit 1 msgs
receive msg: AC1128381E4718B4AAC2691C815A0001
09:51:15,807 1749887142 pull step 1
09:51:18,807 1749887142 pull step 2
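
For reference, a rough self-contained sketch of this kind of reproduction (this is not the code from the linked commit; the topic, group and instance names as well as the sleep timings are my own assumptions):

import java.util.List;
import org.apache.rocketmq.client.consumer.DefaultLitePullConsumer;
import org.apache.rocketmq.client.producer.DefaultMQProducer;
import org.apache.rocketmq.common.message.Message;
import org.apache.rocketmq.common.message.MessageExt;

public class LitePullRebalanceRepro {
    static final String NAMESRV = "127.0.0.1:9876";  // assumption: local nameserver
    static final String TOPIC = "TopicTest";         // the single-queue topic created above
    static final String GROUP = "repro_consumer_group";

    static void startConsumer(String instanceName) throws Exception {
        DefaultLitePullConsumer consumer = new DefaultLitePullConsumer(GROUP);
        consumer.setNamesrvAddr(NAMESRV);
        consumer.setInstanceName(instanceName);      // distinct clientId per consumer
        consumer.subscribe(TOPIC, "*");
        consumer.start();
        new Thread(() -> {
            while (true) {
                List<MessageExt> msgs = consumer.poll();
                for (MessageExt msg : msgs) {
                    System.out.println(instanceName + " received " + msg.getMsgId());
                }
            }
        }).start();
    }

    public static void main(String[] args) throws Exception {
        DefaultMQProducer producer = new DefaultMQProducer("repro_producer_group");
        producer.setNamesrvAddr(NAMESRV);
        producer.start();

        startConsumer("consumer-1");
        Thread.sleep(2000);
        producer.send(new Message(TOPIC, "msg-1".getBytes()));  // pulled by consumer-1

        startConsumer("consumer-2");                             // opens the rebalance window
        Thread.sleep(10000);
        producer.send(new Message(TOPIC, "msg-2".getBytes()));
        // With the bug, "msg-1" may never be printed: consumer-1's stale pull task
        // commits an offset past it after the queue has already been dropped.
    }
}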

The root cause is in DefaultLitePullConsumerImpl.PullTaskImpl.run:

...
// at the start of the pull task, read the next pull offset for this queue
long offset = nextPullOffset(messageQueue);
...
// after the pull returns, the new offset is written back unconditionally
updatePullOffset(messageQueue, pullResult.getNextBeginOffset());

After the ProcessQueue has been dropped (for example by a rebalance), the PullTask may still be running. The stale task still calls updatePullOffset and advances the pull offset, so messages that were pulled but dropped are skipped.
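
A self-contained toy illustration of the kind of guard that prevents this (not the actual patch in #2832; the class and method names below are made up):

import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Toy model, not RocketMQ code: a pull task must not commit a new pull offset once
// its queue has been dropped by a rebalance, otherwise the offset jumps past
// messages that were pulled but never delivered to poll().
class DroppedQueueGuardSketch {
    private final AtomicBoolean dropped = new AtomicBoolean(false); // set when a rebalance takes the queue away
    private final AtomicLong pullOffset = new AtomicLong(18);       // next offset to pull, cf. queueOffset=18 above

    void onQueueDropped() {
        dropped.set(true);
    }

    // Called at the end of a pull round with the broker's nextBeginOffset.
    void finishPull(long nextBeginOffset, boolean messagesDelivered) {
        if (!dropped.get() && messagesDelivered) {
            pullOffset.set(nextBeginOffset);  // safe: the queue is still owned by this consumer
        }
        // otherwise leave pullOffset untouched so the queue's new owner re-pulls the skipped messages
    }

    long currentPullOffset() {
        return pullOffset.get();
    }
}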

areyouok added a commit to areyouok/rocketmq that referenced this issue Apr 22, 2021
areyouok added a commit to areyouok/rocketmq that referenced this issue Apr 22, 2021
RongtongJin pushed a commit that referenced this issue Apr 27, 2021
[ISSUE #2732] Fix message loss problem when rebalance with LitePullConsumer (#2832)

* [ISSUE #2732] Fix message loss problem when rebalance with LitePullConsumer

* Fix message loss problem when rebalance with LitePullConsumer, update 2
@RongtongJin
Contributor

Merged

vongosling added a commit that referenced this issue May 19, 2021
zongtanghu modified the milestones: 4.8.1, 4.9.0 Jun 4, 2021