Rebalance: filter messages of revoked partitions #946
Conversation
As the `RebalanceSpec` test in akka#865 shows, the consumer stage's buffer continues to emit messages of revoked partitions that will be re-emitted by a different consumer. This adds another `PartitionAssignmentHandler` which adds filtering in the stage so that messages of revoked partitions are no longer issued.
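To make the hook concrete, here is a minimal sketch of a custom `PartitionAssignmentHandler`, assuming the 2.x scaladsl API with `onAssign`/`onRevoke`/`onLost`/`onStop` (the class name and logging bodies are illustrative; the PR itself wires an internal handler into the stage, not user code like this):

```scala
import akka.kafka.{RestrictedConsumer, Subscriptions}
import akka.kafka.scaladsl.PartitionAssignmentHandler
import org.apache.kafka.common.TopicPartition

object RebalanceHookSketch {
  // Illustrative handler: observes the same callbacks the stage-internal
  // handler uses to drop buffered messages of revoked partitions.
  class RevocationListener extends PartitionAssignmentHandler {
    override def onAssign(assigned: Set[TopicPartition], consumer: RestrictedConsumer): Unit =
      println(s"assigned: $assigned")
    override def onRevoke(revoked: Set[TopicPartition], consumer: RestrictedConsumer): Unit =
      println(s"revoked: $revoked") // the stage filters its buffer on this event
    override def onLost(lost: Set[TopicPartition], consumer: RestrictedConsumer): Unit =
      println(s"lost: $lost")
    override def onStop(current: Set[TopicPartition], consumer: RestrictedConsumer): Unit =
      println(s"stopped: $current")
  }

  // Custom handlers can be attached to an auto-subscription:
  val subscription =
    Subscriptions.topics("topic1").withPartitionAssignmentHandler(new RevocationListener)
}
```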
LGTM
@ennru @seglo Can we please also extend the stale-message filtering support to the `SubSourceLogic` class? I've run into exactly the same issue after rebalancing when using `Consumer.committablePartitionedSource`. Sometimes there are even three threads (two stale and one active) emitting messages in parallel, causing confusion and skipped offsets, leading to loss of messages. I've tested this on both head (master) and 1.1.0. Thanks.
@jhooda Thanks for reporting. We can certainly take a look at implementing this for partitioned sources. You mentioned that you tested this use case; do you happen to have a test written that you can share? I'll try to reproduce it myself in the meantime.
I recreated the issue: #992 |
Follow-up to #946 to cover partitioned sources.
@seglo @ennru Thank you for recreating the issue #992 and also committing a solution. I have now also tested after the #992 merge, and it looks like the message loss is still there. My test scenario is as follows: I am running a Kafka cluster with four brokers running kafka_2.12-2.2.0. There are 30 test topics with 10 partitions each; each topic has one replica. I publish 100K messages per topic, with identifiers 1 to 100K, and store the consumed messages in a database. Each topic has its own group id, the same as the topic name, and its own consumer source. Below is a snippet of the code showing how I'm creating the consumers:
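The snippet itself did not survive in the thread; below is a hedged reconstruction of what such a per-topic consumer might look like (the topic/group names, the `persistToDb` helper, the parallelism values, and the commit-refresh interval are placeholders, not the reporter's actual code):

```scala
import scala.concurrent.Future
import scala.concurrent.duration._
import akka.Done
import akka.actor.ActorSystem
import akka.kafka.{CommitterSettings, ConsumerSettings, Subscriptions}
import akka.kafka.scaladsl.{Committer, Consumer}
import akka.stream.scaladsl.Sink
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer

object PartitionedConsumerSketch {
  implicit val system: ActorSystem = ActorSystem("rebalance-test")
  import system.dispatcher

  // Placeholder for the reporter's database write.
  def persistToDb(value: String): Future[Done] = Future.successful(Done)

  val topic = "topic1" // group id equals the topic name, as described above

  val consumerSettings =
    ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
      .withBootstrapServers("localhost:9092")
      .withGroupId(topic)
      .withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
      // commit refreshing: the KAFKA-4682 workaround mentioned below (interval is illustrative)
      .withCommitRefreshInterval(5.minutes)

  val done =
    Consumer
      .committablePartitionedSource(consumerSettings, Subscriptions.topics(topic))
      .mapAsyncUnordered(10) { case (_, partitionSource) =>
        partitionSource
          .mapAsync(1)(msg => persistToDb(msg.record.value).map(_ => msg.committableOffset))
          .runWith(Committer.sink(CommitterSettings(system)))
      }
      .runWith(Sink.ignore)
}
```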
For rebalance testing I run four independent JVMs with an identical code base, stopping and starting them at random to trigger rebalances. At the end of the run I analyze the DB to tally the consumed message counts. Please note that without the stop/start there are mostly no issues and all messages are accounted for.
The above setting is supposed to be a workaround for KAFKA-4682, but on closer examination it looks like the Impl class also participates in the rebalance events.
A sample run resulted in the following message counts per topic (a correction: I am using 29 topics, not 30). The expected right-hand count is 100000.
Attached is a detailed look at topic26 showing how many messages got duplicated and how many were missed.
I also noticed substantial message loss when the following setting is used (to recall, the expected right-hand count was 100000).
@jhooda Thanks for doing this testing. I'll put the commit refreshing issue aside for now to reduce the number of variables. It's strange that you're seeing message loss for your use case. The reason for the fixes #872 and #992 was to reduce the number of use cases where there would be message duplication. You shouldn't see any message loss as long as your app is at-least-once, which at face value it appears to be. Some questions regarding your use case:
One way to rule out whether this is an issue with your business logic is to instead write the messages to a Kafka topic and then write a small app that asserts the consistency of those messages. We do something like this in our transactional tests to assert whether there are missing or duplicate messages. Something else that would be helpful to us is if you could reproduce this issue in our own testsuite. I understand that this may be difficult given that a 3rd-party database is involved. We have a few example tests where we introduce transient failures and then assert consistency, which maybe you could use for inspiration.
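For illustration, a minimal sketch of such a consistency assertion over the sequential identifiers (a hypothetical helper, not the project's actual test code):

```scala
// Assert that every identifier in 1..expected was consumed exactly once.
def assertConsistency(consumedIds: Seq[Int], expected: Int): Unit = {
  val counts  = consumedIds.groupBy(identity).map { case (id, hits) => id -> hits.size }
  val missing = (1 to expected).filterNot(counts.contains)
  val dupes   = counts.filter { case (_, n) => n > 1 }
  require(missing.isEmpty, s"${missing.size} missing ids, e.g. ${missing.take(10)}")
  require(dupes.isEmpty, s"${dupes.size} duplicated ids, e.g. ${dupes.take(10).keys}")
}
```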
@seglo Thank you for the pointers. Based on them, I am writing a test that matches my use case and will post the results once I have it squared away.
@seglo Sorry, I took a small break from this issue. I just submitted a pull request with a test case that can reproduce the offset-skipping issue, although it is reproducible only about 10% of the time; for example, I ran the test 100 times with about 10 failures. Reference: #1016. I would appreciate it if you could review the pull request.
Purpose
This adds another `PartitionAssignmentHandler` which will add filtering in the stage so that messages of partitions that were revoked are not issued anymore.

References
Issue #872
Test in #865
Changes
`filterRevokedPartitions` in `BaseSingleSourceLogic` to filter the internal buffer from subclasses' `onAssigned` (see the sketch below)
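A minimal sketch of the idea behind that change (the signature and buffer type are illustrative; the real method operates on the stage's internal buffer):

```scala
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.TopicPartition

// Keep only buffered records whose partition is still assigned after the rebalance.
def filterRevokedPartitions[K, V](
    buffer: Vector[ConsumerRecord[K, V]],
    assignedTps: Set[TopicPartition]): Vector[ConsumerRecord[K, V]] =
  buffer.filter(record => assignedTps.contains(new TopicPartition(record.topic, record.partition)))
```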
Background Context
As the `RebalanceSpec` test in #865 shows, the consumer stage's buffer continues to emit messages of revoked partitions that will be re-emitted by a different consumer.