Fix message-batch loss when rebalancing partitioned sources #1263

Merged
merged 8 commits into from Dec 8, 2020

Conversation

@jhooda (Contributor) commented Nov 24, 2020

Issue: When there are multiple KafkaConsumerActors, they may be assigned
disjoint sets of topic-partitions, backed by independent StageActors,
during group rebalancing. Each StageActor receives its message stream from
a KafkaConsumerActor. At the other end, the KafkaConsumerActor receives its
primary message stream from the shared Kafka consumer Fetcher, as shown below:

[Kafka Broker]-->[Fetcher]--+>[KafkaConsumerActor]--+>[StageActor]

During normal message flow the Fetcher saves the next message offset
internally and uses it as the reference offset for delivering the next
batch of messages to the KafkaConsumerActor. However, during rebalancing
the in-progress message batch can be lost, as shown below:

[Fetcher]--+>[KafkaConsumerActor]--+>[*defunct* StageActor]

Since the entire message stream is asynchronous, the Fetcher does not
always know about the lost message batch and instead delivers the next
message batch to the new post-rebalance StageActor:

[Fetcher]-->[KafkaConsumerActor]-->[New StageActor]

This fix keeps an up-to-date mapping between each topic-partition and its
newest StageActor, refreshed whenever a new StageActor is initialized.
KafkaConsumerActor consults this map to reject requests emitted by a
defunct StageActor.
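
A minimal sketch of this idea in Scala (the names, message shape, and helper class below are illustrative assumptions, not the actual KafkaConsumerActor internals):

```scala
import akka.actor.ActorRef
import org.apache.kafka.common.TopicPartition

// Hypothetical request message: a stage actor asks for the next batch of
// records for the partitions it believes it owns.
final case class RequestMessages(tps: Set[TopicPartition], requestor: ActorRef)

final class PartitionOwnerTracking {
  // Latest stage actor registered for each topic-partition; refreshed every
  // time a new stage actor is initialized after a rebalance.
  private var owners = Map.empty[TopicPartition, ActorRef]

  def registerStageActor(stageActor: ActorRef, assigned: Set[TopicPartition]): Unit =
    owners ++= assigned.map(_ -> stageActor)

  // Honour a request only if it comes from the stage actor currently mapped
  // to every partition it asks for; requests from a defunct (pre-rebalance)
  // stage actor are dropped, so a message batch is never handed out on
  // behalf of an actor that can no longer process it.
  def shouldServe(request: RequestMessages): Boolean =
    request.tps.forall(tp => owners.get(tp).contains(request.requestor))
}
```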

How to reproduce the issue?

  1. Revert the changes to the core source files included in this commit.
  2. Run the following test:

sbt 'tests/testOnly akka.kafka.scaladsl.RebalanceExtSpec -- -z "no messages should be lost when two consumers consume from one topic and two partitions and one consumer aborts mid-stream"'
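
For context on the scenario this test exercises: with a partitioned source, every assigned topic-partition is materialized as its own sub-source (backed by its own stage actor), and these sub-sources are what get torn down and recreated during a rebalance. A minimal consumption sketch, with placeholder topic, group id, and broker address:

```scala
import akka.actor.ActorSystem
import akka.kafka.{ConsumerSettings, Subscriptions}
import akka.kafka.scaladsl.Consumer
import akka.stream.scaladsl.Sink
import org.apache.kafka.common.serialization.StringDeserializer

implicit val system: ActorSystem = ActorSystem("partitioned-example")

val settings =
  ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
    .withBootstrapServers("localhost:9092")
    .withGroupId("example-group")

Consumer
  .plainPartitionedSource(settings, Subscriptions.topics("example-topic"))
  .mapAsyncUnordered(parallelism = 2) {
    case (topicPartition, partitionSource) =>
      // each partition's records flow through an independent sub-source
      partitionSource.runWith(Sink.foreach(record => println(s"$topicPartition -> ${record.value}")))
  }
  .runWith(Sink.ignore)
```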

@seglo (Member) left a comment

Welcome back @jhooda. I like the direction. The old implementation has already fallen out of my head, but I do recall asking whether there was a way to simplify the test, perhaps so that an integration test isn't required. I admit that will be a challenge, though, so I'll dig into your test this week.

@seglo changed the title from "added a fix for message-batch loss during re-balancing" to "added a fix for message-batch loss during re-balancing with partitioned sources" on Nov 26, 2020
@seglo changed the title from "added a fix for message-batch loss during re-balancing with partitioned sources" to "Fix for message-batch loss during re-balancing with partitioned sources" on Nov 26, 2020
@seglo (Member) left a comment

Ok, I can't think of a way to optimize this test further without creating just as much complexity as you already have, so I'm inclined to keep it. I did a review with suggestions to make it more readable.

@jhooda (Contributor, Author) commented Nov 28, 2020

Please do let me know if I need to rebase the commits into a single commit before the merge. Thanks.

@seglo (Member) left a comment

Looking better. There's still some more cleanup to do for readability. I think some of the new types introduced can be condensed.

@jhooda (Contributor, Author) commented Dec 2, 2020

@seglo Please do let me know if anything else is needed to close this request. Thanks.

@seglo (Member) left a comment

Almost there. Thank you for your patience. I'm just trying to distill this test down as much as possible to make it easier to maintain.

There are some build warnings failing the build:

https://travis-ci.com/github/akka/alpakka-kafka/jobs/452075335

Comment on lines 64 to 66
.withCloseTimeout(5.seconds)
.withPollInterval(300.millis)
.withPollTimeout(200.millis)
.withProperty(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, maxPollRecords) // 500 is the default value
.withProperty(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG, partitionAssignmentStrategy)
Member

I see why MAX_POLL_RECORDS_CONFIG is important to this test, but close timeout, poll interval, and poll timeout seem rather arbitrary. Do you think these are necessary to reproduce the problem?

Contributor Author

@seglo I have removed the configurations which are not relevant. I have also removed a few assertions for brevity. As I said earlier, the failure rate is not 100% (it is more like 85% on one machine vs. 50% on another). One can vary the poll interval to increase the failure rate; for me the sweet spot is about 400 ms. I'll take a look at the build failure; it doesn't seem to be related to this change, so I may need to re-merge the latest master.
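
For illustration, a trimmed-down settings block along these lines, keeping only the options that influence the reproduction (broker address, group id, and the max-poll-records value are placeholders; 400 ms is simply the poll interval that made the failure most likely on one machine):

```scala
import akka.actor.ActorSystem
import akka.kafka.ConsumerSettings
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer
import scala.concurrent.duration._

implicit val system: ActorSystem = ActorSystem("repro")

val consumerSettings =
  ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
    .withBootstrapServers("localhost:9092")
    .withGroupId("repro-group")
    // varying the poll interval changes how often the rebalance race is hit
    .withPollInterval(400.millis)
    // keep poll batches small so an in-flight batch is likely to span a rebalance
    .withProperty(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "10")
```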

Contributor Author

@seglo All checks except "Run tests with Scala 2.12 and AdoptOpenJDK 8" passed. The tests are also successful on my dev machine (for both Scala 2.12 and Scala 2.13 on a 1.8 HotSpot JVM). Can you please provide some guidance on how I should proceed? I also doubt that the failures are related to my changes.

Member

The tests look fine. That was a flaky test that failed.

@ennru (Member) left a comment

Great work!
LGTM.

@ennru changed the title from "Fix for message-batch loss during re-balancing with partitioned sources" to "Fix message-batch loss when rebalancing partitioned sources" on Dec 8, 2020
@ennru merged commit cee2927 into akka:master on Dec 8, 2020
@seglo (Member) commented Dec 8, 2020

Thanks @jhooda !

@jhooda (Contributor, Author) commented Dec 8, 2020

Thanks for the guidance, @seglo and @ennru.

@ennru (Member) commented Dec 22, 2020

This has been backported to the release-2.0.x branch to be part of 2.0.6.
