Fix message-batch loss when rebalancing partitioned sources #1263

Merged
merged 8 commits into from Dec 8, 2020

Conversation

@jhooda (Contributor) commented Nov 24, 2020

Issue: When there are multiple KafkaConsumerActors, they may be assigned
disjoint sets of topic-partitions, backed by independent StageActors,
during group rebalancing. Each StageActor receives its message stream from
a KafkaConsumerActor. At the other end, the KafkaConsumerActor receives its
primary message stream from the shared Kafka consumer Fetcher, as shown below:

[Kafka Broker]-->[Fetcher]--+>[KafkaConsumerActor]--+>[StageActor]

During normal message flow the Fetcher saves the next message offset
internally and uses it as the reference offset for delivering the next
batch of messages to the KafkaConsumerActor. However, during rebalancing
the in-progress message batch can be lost, as shown below:

[Fetcher]--+>[KafkaConsumerActor]--+>[*defunct* StageActor]

Since the entire message stream is asynchronous, the Fetcher does not
always know about the lost message batch and instead delivers the next
message batch to the new post-rebalance StageActor:

[Fetcher]-->[KafkaConsumerActor]-->[New StageActor]

This fix keeps an up-to-date mapping between each topic-partition and its
newest StageActor, refreshed whenever a new StageActor is initialized.
KafkaConsumerActor consults this map to reject requests emitted by a
defunct StageActor.
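
A minimal sketch of this idea in Scala (the names, message shape, and helper class below are illustrative assumptions, not the actual KafkaConsumerActor internals):

```scala
import akka.actor.ActorRef
import org.apache.kafka.common.TopicPartition

// Hypothetical request message: a stage actor asks for the next batch of
// records for the partitions it believes it owns.
final case class RequestMessages(tps: Set[TopicPartition], requestor: ActorRef)

final class PartitionOwnerTracking {
  // Latest stage actor registered for each topic-partition; refreshed every
  // time a new stage actor is initialized after a rebalance.
  private var owners = Map.empty[TopicPartition, ActorRef]

  def registerStageActor(stageActor: ActorRef, assigned: Set[TopicPartition]): Unit =
    owners ++= assigned.map(_ -> stageActor)

  // Honour a request only if it comes from the stage actor currently mapped
  // to every partition it asks for; requests from a defunct (pre-rebalance)
  // stage actor are dropped, so a message batch is never handed out on
  // behalf of an actor that can no longer process it.
  def shouldServe(request: RequestMessages): Boolean =
    request.tps.forall(tp => owners.get(tp).contains(request.requestor))
}
```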

How to reproduce the issue?

  1. Revert the changes to the core source files included in this commit.
  2. Run the following test:

sbt 'tests/testOnly akka.kafka.scaladsl.RebalanceExtSpec -- -z "no messages should be lost when two consumers consume from one topic and two partitions and one consumer aborts mid-stream"'
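
For context on the scenario this test exercises: with a partitioned source, every assigned topic-partition is materialized as its own sub-source (backed by its own stage actor), and these sub-sources are what get torn down and recreated during a rebalance. A minimal consumption sketch, with placeholder topic, group id, and broker address:

```scala
import akka.actor.ActorSystem
import akka.kafka.{ConsumerSettings, Subscriptions}
import akka.kafka.scaladsl.Consumer
import akka.stream.scaladsl.Sink
import org.apache.kafka.common.serialization.StringDeserializer

implicit val system: ActorSystem = ActorSystem("partitioned-example")

val settings =
  ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
    .withBootstrapServers("localhost:9092")
    .withGroupId("example-group")

Consumer
  .plainPartitionedSource(settings, Subscriptions.topics("example-topic"))
  .mapAsyncUnordered(parallelism = 2) {
    case (topicPartition, partitionSource) =>
      // each partition's records flow through an independent sub-source
      partitionSource.runWith(Sink.foreach(record => println(s"$topicPartition -> ${record.value}")))
  }
  .runWith(Sink.ignore)
```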

@seglo (Member) left a comment

Welcome back @jhooda. I like the direction. The old implementation has already fallen out of my head, but I do recall asking whether there was a way to simplify the test, perhaps so that an integration test isn't required. I admit that will be a challenge, though, so I'll dig into your test this week.

@seglo changed the title from "added a fix for message-batch loss during re-balancing" to "added a fix for message-batch loss during re-balancing with partitioned sources" on Nov 26, 2020
@seglo changed the title from "added a fix for message-batch loss during re-balancing with partitioned sources" to "Fix for message-batch loss during re-balancing with partitioned sources" on Nov 26, 2020
@seglo (Member) left a comment

Ok, I can't think of a way to optimize this test further without creating just as much complexity as you already have, so I'm inclined to keep it. I did a review with suggestions to make it more readable.

@jhooda (Contributor, Author) commented Nov 28, 2020

Please do let me know if I need to rebase the commits into a single commit before the merge. Thanks.

@seglo (Member) left a comment

Looking better. There's still some more cleanup to do for readability. I think some of the new types introduced can be condensed.

@jhooda (Contributor, Author) commented Dec 2, 2020

@seglo Please do let me know if anything else is needed to close this request. Thanks.

@seglo (Member) left a comment

Almost there. Thank you for your patience. I'm just trying to distill this test down as much as possible to make it easier to maintain.

There are some build warnings failing the build:

https://travis-ci.com/github/akka/alpakka-kafka/jobs/452075335

Comment on lines 64 to 66
.withCloseTimeout(5.seconds)
.withPollInterval(300.millis)
.withPollTimeout(200.millis)
.withProperty(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, maxPollRecords) // 500 is the default value
.withProperty(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG, partitionAssignmentStrategy)
Member

I see why MAX_POLL_RECORDS_CONFIG is important to this test, but close timeout, poll interval, and poll timeout seem rather arbitrary. Do you think these are necessary to reproduce the problem?

Contributor Author

@seglo I have removed the configurations which are not relevant. I have also removed a few assertions for brevity. As I said earlier, the failure rate is not 100% (it is more like 85% on one machine vs. 50% on another). One can vary the poll interval to increase the failure rate; for me the sweet spot is about 400 ms. I'll take a look at the build failure; it doesn't seem to be related to this change, so I may need to re-merge the latest master.
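
For illustration, a trimmed-down settings block along these lines, keeping only the options that influence the reproduction (broker address, group id, and the max-poll-records value are placeholders; 400 ms is simply the poll interval that made the failure most likely on one machine):

```scala
import akka.actor.ActorSystem
import akka.kafka.ConsumerSettings
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer
import scala.concurrent.duration._

implicit val system: ActorSystem = ActorSystem("repro")

val consumerSettings =
  ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
    .withBootstrapServers("localhost:9092")
    .withGroupId("repro-group")
    // varying the poll interval changes how often the rebalance race is hit
    .withPollInterval(400.millis)
    // keep poll batches small so an in-flight batch is likely to span a rebalance
    .withProperty(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "10")
```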

Contributor Author

@seglo All checks except "Run tests with Scala 2.12 and AdoptOpenJDK 8" passed. The tests are also successful on my dev machine (for both Scala 2.12 and Scala 2.13 on a 1.8 HotSpot JVM). Can you please provide some guidance on how I should proceed? I also doubt that the failures are related to my changes.

Member

The tests look fine. That was a flaky test that failed.

@ennru (Member) left a comment

Great work!
LGTM.

@ennru changed the title from "Fix for message-batch loss during re-balancing with partitioned sources" to "Fix message-batch loss when rebalancing partitioned sources" on Dec 8, 2020
@ennru merged commit cee2927 into akka:master on Dec 8, 2020
@seglo (Member) commented Dec 8, 2020

Thanks @jhooda !

@jhooda (Contributor, Author) commented Dec 8, 2020

Thanks for the guidance, @seglo and @ennru.

@ennru (Member) commented Dec 22, 2020

This has been backported to the release-2.0.x branch to be part of 2.0.6.
