
Key_Shared consumers with different subscriptions get messages out-of-order in 2.6.0. #7455

Closed
feeblefakie opened this issue Jul 6, 2020 · 17 comments
Labels: help wanted, type/bug

Comments

@feeblefakie

feeblefakie commented Jul 6, 2020

Describe the bug
Key_Shared consumers with different subscriptions get messages out-of-order in 2.6.0.
As far as I first checked, it did not happen with the consistent-hashing approach, but consistent hashing only makes it less likely to happen: it still happens when I run the steps below several times (try 5 times and you can observe it).

To Reproduce
Steps to reproduce the behavior:
https://github.com/feeblefakie/misc/blob/master/pulsar/HOW-TO-REPRODUCE.md
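For context, here is a minimal sketch of the kind of setup the reproduction uses, written against the Pulsar Java client and assuming a local standalone broker; the topic name, key count, and message count are illustrative, not the actual values from the linked repro (which runs several consumers per subscription, on two subscriptions, and adds/removes consumers mid-run):

```java
import org.apache.pulsar.client.api.*;

public class KeySharedRepro {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")   // assumed local standalone broker
                .build();

        String topic = "persistent://public/default/key-shared-test"; // illustrative name

        // One Key_Shared consumer on subscription "sub1".
        Consumer<String> consumer = client.newConsumer(Schema.STRING)
                .topic(topic)
                .subscriptionName("sub1")
                .subscriptionType(SubscriptionType.Key_Shared)
                .subscribe();

        // A single synchronous producer; the value is the produce-time timestamp,
        // so values for a given key must arrive in increasing order on the consumer side.
        Producer<String> producer = client.newProducer(Schema.STRING).topic(topic).create();
        for (int i = 0; i < 1000; i++) {
            producer.newMessage()
                    .key(Integer.toString(i % 32))                    // 32 distinct keys
                    .value(Long.toString(System.currentTimeMillis()))
                    .send();                                          // synchronous send
        }

        // Log "<consumerName> <key> <value> <messageId>", the format used in the logs later in this thread.
        for (int i = 0; i < 1000; i++) {
            Message<String> msg = consumer.receive();
            System.out.printf("%s %s %s %s%n",
                    consumer.getConsumerName(), msg.getKey(), msg.getValue(), msg.getMessageId());
            consumer.acknowledge(msg);
        }

        producer.close();
        consumer.close();
        client.close();
    }
}
```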

Expected behavior
Key_Shared consumers with different subscriptions should get messages in a consistent order to meet the ordering guarantee of Key_Shared.

Desktop (please complete the following information):

  • macOS (it can be reproduced in a single-node test)

Additional context

I reported this bug against 2.5.0 and it seemed to be fixed by the following PRs, but it is still happening.
http://mail-archives.apache.org/mod_mbox/pulsar-users/202003.mbox/%3CCAPDOW74LN3WtdhpG_cgCCSg9MuMmkNV6giCGD5p%3DW1wWji0W7w%40mail.gmail.com%3E

#6554
#6977
#7106

@sijie
Member

sijie commented Jul 6, 2020

@codelipenghui can you take a look at this issue?

@feeblefakie
Author

feeblefakie commented Jul 7, 2020

@codelipenghui @sijie @merlimat Sorry, I reported earlier that using consistent hashing makes the problem go away, but it actually still produces inconsistently ordered messages.
So the ordering guarantee can be violated with both auto-split and consistent hashing.
Consistent hashing just makes it less likely to happen, but you can observe it if you try several times, say 5 times.

@feeblefakie
Author

feeblefakie commented Jul 7, 2020

@codelipenghui Here is what is happening.
For example, consumers for sub1 consumed messages with key "15" as follows.

#consumerName key value(produced-time) messageId(ledgerId:entryId:partitionIdx)
46e6d 15 1594111896905 119:5:0
46e6d 15 1594111899192 119:253:0
46e6d 15 1594111901040 119:501:0
46e6d 15 1594111902763 119:749:0
46e6d 15 1594111904385 119:997:0
46e6d 15 1594111905990 119:1245:0
46e6d 15 1594111907638 119:1493:0
46e6d 15 1594111909202 119:1741:0
46e6d 15 1594111910738 119:1989:0
46e6d 15 1594111912326 119:2237:0

And, consumers for sub2 consumed messages with key "15" as follows.

17a20 15 1594111896905 119:5:0
17a20 15 1594111899192 119:253:0
17a20 15 1594111901040 119:501:0
17a20 15 1594111902763 119:749:0
17a20 15 1594111904385 119:997:0
5523e 15 1594111910738 119:1989:0 <-- this causes the inconsistency
17a20 15 1594111905990 119:1245:0
17a20 15 1594111907638 119:1493:0
17a20 15 1594111909202 119:1741:0
5523e 15 1594111912326 119:2237:0

You can see that there are two consumers serving the key in sub2, and consumer 5523e consumed the value 1594111910738 out of order. The inconsistency is easy to identify because the value is a timestamp assigned by a single synchronous producer at produce time.
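Since the value column is a monotonically increasing produce-time timestamp, the check reduces to verifying that values never decrease per key. A rough sketch of such a check (my own illustration, not the check.sh from the repro; it assumes the log format shown above and takes the log file path as the first argument):

```java
import java.nio.file.*;
import java.util.*;

public class PerKeyOrderCheck {
    public static void main(String[] args) throws Exception {
        // Each line: "<consumerName> <key> <value(produce-time millis)> <messageId>"
        Map<String, Long> lastValuePerKey = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length < 4) {
                continue; // skip headers or blank lines
            }
            String key = parts[1];
            long value = Long.parseLong(parts[2]);
            Long previous = lastValuePerKey.put(key, value);
            if (previous != null && value < previous) {
                // The produce-time went backwards for this key: out-of-order delivery.
                System.out.println("out-of-order for key " + key + ": " + value + " after " + previous);
            }
        }
    }
}
```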

Regarding the behavior, I have several questions.

  1. Is it expected to have multiple live consumers for a key?
  2. If yes, how are the multiple consumers supposed to cooperate to guarantee the ordering of the messages?
  3. If no, then that is the cause of the inconsistency.

@feeblefakie
Author

feeblefakie commented Jul 7, 2020

@codelipenghui I'm not sure, but as far as I can see from the code, the consumer selection below could return different consumers for the same key when the number of consumers changes, whether auto-split or consistent hashing is used. Since those consumers work independently and concurrently, could messages with the same key then be consumed out of order by different consumers? (A toy sketch of the hazard is shown after the link.)

https://github.com/apache/pulsar/blob/master/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/PersistentStickyKeyDispatcherMultipleConsumers.java#L151
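To illustrate the concern (this is a toy sketch, not the broker's actual selector code): if the key-to-consumer mapping is recomputed over the current consumer list, a key's owner can change when a consumer joins, and the new owner may deliver newer messages while the old owner still has older messages for that key in flight.

```java
import java.util.*;

// Toy selector (not Pulsar's implementation): hashes the key over the *current*
// consumer list, so the mapping for a key can change whenever a consumer joins or leaves.
public class NaiveKeySelector {
    private final List<String> consumers = new ArrayList<>();

    void addConsumer(String name) {
        consumers.add(name);
    }

    String select(String key) {
        return consumers.get(Math.abs(key.hashCode()) % consumers.size());
    }

    public static void main(String[] args) {
        NaiveKeySelector selector = new NaiveKeySelector();
        selector.addConsumer("17a20");

        // Snapshot the key -> consumer mapping with a single consumer attached.
        Map<String, String> before = new HashMap<>();
        for (int k = 0; k < 32; k++) {
            before.put(String.valueOf(k), selector.select(String.valueOf(k)));
        }

        selector.addConsumer("5523e");  // a second consumer joins mid-stream

        // Any key whose owner changed is now served by a consumer that can run ahead of
        // (or behind) the consumer that still holds earlier messages for that key.
        for (int k = 0; k < 32; k++) {
            String key = String.valueOf(k);
            String now = selector.select(key);
            if (!now.equals(before.get(key))) {
                System.out.println("key " + key + " moved from " + before.get(key) + " to " + now);
            }
        }
    }
}
```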

@codelipenghui
Contributor

codelipenghui commented Jul 8, 2020

@feeblefakie Could you please give me a broker dump and inconsistency logs like #7455 (comment) when the issue happens? I want to check the details of the recentlyJoinedConsumers state.

@feeblefakie
Author

@codelipenghui Sure. How can I get a broker dump?

@feeblefakie
Author

feeblefakie commented Jul 13, 2020

@codelipenghui I just re-executed it. Here are the logs.
sub1.txt
sub2.txt

You can check the inconsistency with check.sh as needed. (You need to remove the .txt suffix to run them; I added the suffix so GitHub would accept the upload.)

Regarding the broker dump, I don't know what you mean/need. Can you clarify?

@codelipenghui
Contributor

Regarding the broker dump, I don't know what you mean/need. Can you clarify ?

I meant the broker heap dump; I want to check some state of the Key_Shared subscription.

@feeblefakie
Author

@codelipenghui @merlimat Could anyone give me answers to the questions below? I want to dig into this deeper, but I'm not very experienced with the codebase, so I need more information about the intended behavior.
#7455 (comment)

@sijie
The issue is pretty critical since it violates the guarantee that the Key_Shared feature is supposed to provide, but I feel it is somehow not regarded as important. Is there a reason for that?
Anyway, we should either make the docs say that the current implementation can violate the ordering guarantee, or fix it as soon as possible. The current situation (claiming the guarantee while it does not actually hold) just isn't right.

@sijie
Member

sijie commented Jul 17, 2020

@feeblefakie We realize the importance of this issue. @codelipenghui is working on it. Please give us some time to fix it, since we also have other ongoing tasks.

@feeblefakie
Author

@sijie Thank you for the reply. It is great to hear that it is now regarded as important.
I would like to help, so feel free to ask me to do any testing.

@codelipenghui
Contributor

Is it expected to have multiple live consumers for a key ?
If yes to the above question, how are multiple consumers supposed to cooperate to guarantee the ordering of the messages ?
If no to the 1st question, that's the cause of the inconsistency.

Could you please take a look at PR #7106? I think the questions above were discussed in #7106 and #6554.

And it now looks like the problem is the one described in #7553; we will review #7553 soon.

@feeblefakie
Author

feeblefakie commented Jul 22, 2020

@codelipenghui Thanks.
I've checked #7553, but it seems to have another issue besides the out-of-order messages:
consumers are not consuming anything at all now. It can be reproduced with pulsar-perf.
#7553 (comment)

@feeblefakie
Author

@codelipenghui @sijie #7553 does not seem to be the fix for this issue.
Could you take another look?

@feeblefakie
Author

@codelipenghui @sijie Do you have any updates on this?
#7553 is not merged and does not seem to have been reviewed either.
How much priority do you assign to these kinds of critical ordering issues in Key_Shared?
(I just don't think it is right to keep claiming the ordering guarantee when there are easily reproducible bugs that violate it.)
It would be great if you could give me your honest opinion and the real situation around this.
We need to find/explore another approach depending on the answer.

@codelipenghui
Contributor

@feeblefakie I have pushed PR #8292, which fixes an ordering issue in the Key_Shared subscription. I'm not sure whether it also fixes this issue; could you please help double-check?

@wolfstudy
Member

It seems that #8292 has solved this problem; let us consider closing this issue.
