Added a test case to validate no message loss during rebalance #1016

jhooda · 2020-01-10T00:24:58Z

Purpose

Added a test case that shows sometime offsets are skipped during re-balance
when multiple consumers belong to same group

References

References #382, #589, #946, #336

Changes

When the submitted test is run multiple times, we observed one failure approximately
every ten test runs on
OS: ubuntu 14.04LTS (6-cores),
java version: "1.8.0_144", Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode))

Background Context

There was a discussion in #946 in which @seglo suggested to submit a test to demonstrate that offsets are skipped with a Spec test.
my_test_run_loop_63510_77.txt.gz

my_test_run_loop.sh.txt

lightbend-cla-validator · 2020-01-10T00:25:00Z

At least one pull request committer is not linked to a user. See https://help.github.com/en/articles/why-are-my-commits-linked-to-the-wrong-user#commits-are-not-linked-to-any-user

seglo · 2020-01-10T14:05:18Z

@jhooda Thank you for taking the time to implement a test. I did a quick pass and nothing jumps out at me about your setup. I would expect to see message duplication since rebalances are being triggered, but not message loss. I'll dig into this test soon.

jhooda · 2020-01-10T17:40:39Z

@seglo thanks for quick response. Below is a quick view of 100 iterations of the test showing log file size. All log files with size > 2.6K are tests that failed. One of the failure run log file (gz) was also attached in the original pull request.

ubuntu@ubuntu-usnorth1:~/D total 1597384
-rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu -rw-rw-r-- 1 ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu -rw-rw-r-- 1 ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu -rw-rw-r-- 1 ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu -rw-rw-r-- 1 ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu -rw-rw-r-- 1 ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu -rw-rw-r-- 1 ubuntu ubuntu -rw-rw-r-- 1 ubuntu ownloads/alpakka-kafka/logs$ ls -lrt
2587 Jan 9 18:41 my_test_run_loop_24847_1.txt
ubuntu 68238902 Jan 9 18:44 my_test_run_loop_24847_2.txt
2586 Jan 9 18:47 my_test_run_loop_24847_3.txt
2587 Jan 9 18:51 my_test_run_loop_24847_4.txt
2587 Jan 9 18:54 my_test_run_loop_24847_5.txt
ubuntu 33886250 Jan 9 18:56 my_test_run_loop_24847_6.txt
2586 Jan 9 18:59 my_test_run_loop_24847_7.txt
ubuntu 34163080 Jan 9 19:01 my_test_run_loop_24847_8.txt
2587 Jan 9 19:04 my_test_run_loop_24847_9.txt
ubuntu 69568343 Jan 9 19:07 my_test_run_loop_24847_10.txt
2587 Jan 9 19:10 my_test_run_loop_24847_11.txt
ubuntu 34192915 Jan 9 19:12 my_test_run_loop_24847_12.txt
ubuntu 69799739 Jan 9 19:15 my_test_run_loop_24847_13.txt
2586 Jan 9 19:18 my_test_run_loop_24847_14.txt
ubuntu 63857029 Jan 9 19:21 my_test_run_loop_24847_15.txt
2587 Jan 9 19:25 my_test_run_loop_24847_16.txt
2587 Jan 9 19:28 my_test_run_loop_24847_17.txt
2587 Jan 9 19:31 my_test_run_loop_24847_18.txt
2587 Jan 9 19:34 my_test_run_loop_24847_19.txt
2586 Jan 9 19:37 my_test_run_loop_24847_20.txt
ubuntu 68755879 Jan 9 19:40 my_test_run_loop_24847_21.txt
2586 Jan 9 19:44 my_test_run_loop_24847_22.txt
2586 Jan 9 19:47 my_test_run_loop_24847_23.txt
ubuntu 72710323 Jan 9 19:50 my_test_run_loop_24847_24.txt
2587 Jan 9 19:53 my_test_run_loop_24847_25.txt
ubuntu 71248675 Jan 9 19:57 my_test_run_loop_24847_26.txt
ubuntu 68063215 Jan 9 20:00 my_test_run_loop_24847_27.txt
2587 Jan 9 20:03 my_test_run_loop_24847_28.txt
ubuntu 29770172 Jan 9 20:04 my_test_run_loop_24847_29.txt
2587 Jan 9 20:07 my_test_run_loop_24847_30.txt
2587 Jan 9 20:10 my_test_run_loop_24847_31.txt
2587 Jan 9 20:13 my_test_run_loop_24847_32.txt
2587 Jan 9 20:16 my_test_run_loop_24847_33.txt
2587 Jan 9 20:19 my_test_run_loop_24847_34.txt
2586 Jan 9 20:23 my_test_run_loop_24847_35.txt
2586 Jan 9 20:26 my_test_run_loop_24847_36.txt
2587 Jan 9 20:29 my_test_run_loop_24847_37.txt
2586 Jan 9 20:33 my_test_run_loop_24847_38.txt
2587 Jan 9 20:37 my_test_run_loop_24847_39.txt
2587 Jan 9 20:40 my_test_run_loop_24847_40.txt
2587 Jan 9 20:43 my_test_run_loop_24847_41.txt
2586 Jan 9 20:46 my_test_run_loop_24847_42.txt
ubuntu 69650696 Jan 9 20:49 my_test_run_loop_24847_43.txt
2586 Jan 9 20:53 my_test_run_loop_24847_44.txt
2587 Jan 9 20:56 my_test_run_loop_24847_45.txt
2587 Jan 9 21:00 my_test_run_loop_24847_46.txt
ubuntu 29250605 Jan 9 21:01 my_test_run_loop_24847_47.txt
ubuntu 67415474 Jan 9 21:04 my_test_run_loop_24847_48.txt
2587 Jan 9 21:07 my_test_run_loop_24847_49.txt
2587 Jan 9 21:10 my_test_run_loop_24847_50.txt
2587 Jan 9 21:14 my_test_run_loop_24847_51.txt
2587 Jan 9 21:18 my_test_run_loop_24847_52.txt
2586 Jan 9 21:21 my_test_run_loop_24847_53.txt
2587 Jan 9 21:24 my_test_run_loop_24847_54.txt
2586 Jan 9 21:28 my_test_run_loop_24847_55.txt
2586 Jan 9 21:30 my_test_run_loop_24847_56.txt
ubuntu 72402663 Jan 9 21:34 my_test_run_loop_24847_57.txt
2587 Jan 9 21:37 my_test_run_loop_24847_58.txt
2587 Jan 9 21:40 my_test_run_loop_24847_59.txt
ubuntu 67912254 Jan 9 21:42 my_test_run_loop_24847_60.txt
2587 Jan 9 21:46 my_test_run_loop_24847_61.txt
2586 Jan 9 21:49 my_test_run_loop_24847_62.txt
2587 Jan 9 21:52 my_test_run_loop_24847_63.txt
2587 Jan 9 21:55 my_test_run_loop_24847_64.txt
2587 Jan 9 21:58 my_test_run_loop_24847_65.txt
2588 Jan 9 22:01 my_test_run_loop_24847_66.txt
2587 Jan 9 22:05 my_test_run_loop_24847_67.txt
ubuntu 30209548 Jan 9 22:06 my_test_run_loop_24847_68.txt
2588 Jan 9 22:09 my_test_run_loop_24847_69.txt
ubuntu 66402980 Jan 9 22:12 my_test_run_loop_24847_70.txt
ubuntu 67261805 Jan 9 22:15 my_test_run_loop_24847_71.txt
2588 Jan 9 22:19 my_test_run_loop_24847_72.txt
2588 Jan 9 22:21 my_test_run_loop_24847_73.txt
2587 Jan 9 22:25 my_test_run_loop_24847_74.txt
2587 Jan 9 22:29 my_test_run_loop_24847_75.txt
2588 Jan 9 22:31 my_test_run_loop_24847_76.txt
ubuntu 66777819 Jan 9 22:34 my_test_run_loop_24847_77.txt
2588 Jan 9 22:37 my_test_run_loop_24847_78.txt
2588 Jan 9 22:40 my_test_run_loop_24847_79.txt
2587 Jan 9 22:43 my_test_run_loop_24847_80.txt
2588 Jan 9 22:46 my_test_run_loop_24847_81.txt
2588 Jan 9 22:49 my_test_run_loop_24847_82.txt
2588 Jan 9 22:52 my_test_run_loop_24847_83.txt
2588 Jan 9 22:55 my_test_run_loop_24847_84.txt
ubuntu 68702456 Jan 9 22:58 my_test_run_loop_24847_85.txt
2588 Jan 9 23:01 my_test_run_loop_24847_86.txt
2588 Jan 9 23:05 my_test_run_loop_24847_87.txt
ubuntu 66630102 Jan 9 23:07 my_test_run_loop_24847_88.txt
2588 Jan 9 23:10 my_test_run_loop_24847_89.txt
ubuntu 40348591 Jan 9 23:12 my_test_run_loop_24847_90.txt
ubuntu 64773970 Jan 9 23:15 my_test_run_loop_24847_91.txt
2587 Jan 9 23:18 my_test_run_loop_24847_92.txt
ubuntu 61840266 Jan 9 23:22 my_test_run_loop_24847_93.txt
2588 Jan 9 23:25 my_test_run_loop_24847_94.txt
2587 Jan 9 23:29 my_test_run_loop_24847_95.txt
2588 Jan 9 23:32 my_test_run_loop_24847_96.txt
2588 Jan 9 23:35 my_test_run_loop_24847_97.txt
ubuntu 44239131 Jan 9 23:37 my_test_run_loop_24847_98.txt
2587 Jan 9 23:41 my_test_run_loop_24847_99.txt
ubuntu 67272920 Jan 9 23:43 my_test_run_loop_24847_100.txt

Message loss grep:
ubuntu@ubuntu-usnorth1:~/Downloads/alpakka-kafka/logs$ grep 'info.*was not equal to' my_test_run_loop_24847_*txt
my_test_run_loop_24847_100.txt:[info] 99900 was not equal to 100000 (RebalanceSpec.scala:454)
my_test_run_loop_24847_10.txt:[info] 99980 was not equal to 100000 (RebalanceSpec.scala:454)
my_test_run_loop_24847_13.txt:[info] 99940 was not equal to 100000 (RebalanceSpec.scala:454)
my_test_run_loop_24847_15.txt:[info] 99900 was not equal to 100000 (RebalanceSpec.scala:454)
my_test_run_loop_24847_21.txt:[info] 99080 was not equal to 100000 (RebalanceSpec.scala:454)
my_test_run_loop_24847_24.txt:[info] 99980 was not equal to 100000 (RebalanceSpec.scala:454)
my_test_run_loop_24847_26.txt:[info] 99980 was not equal to 100000 (RebalanceSpec.scala:454)
my_test_run_loop_24847_27.txt:[info] 99960 was not equal to 100000 (RebalanceSpec.scala:454)
my_test_run_loop_24847_2.txt:[info] 99860 was not equal to 100000 (RebalanceSpec.scala:454)
my_test_run_loop_24847_43.txt:[info] 99920 was not equal to 100000 (RebalanceSpec.scala:454)
my_test_run_loop_24847_48.txt:[info] 99880 was not equal to 100000 (RebalanceSpec.scala:454)
my_test_run_loop_24847_57.txt:[info] 99980 was not equal to 100000 (RebalanceSpec.scala:454)
my_test_run_loop_24847_60.txt:[info] 99620 was not equal to 100000 (RebalanceSpec.scala:454)
my_test_run_loop_24847_70.txt:[info] 99860 was not equal to 100000 (RebalanceSpec.scala:454)
my_test_run_loop_24847_71.txt:[info] 99940 was not equal to 100000 (RebalanceSpec.scala:454)
my_test_run_loop_24847_77.txt:[info] 99940 was not equal to 100000 (RebalanceSpec.scala:454)
my_test_run_loop_24847_85.txt:[info] 96960 was not equal to 100000 (RebalanceSpec.scala:454)
my_test_run_loop_24847_88.txt:[info] 99980 was not equal to 100000 (RebalanceSpec.scala:454)
my_test_run_loop_24847_91.txt:[info] 99340 was not equal to 100000 (RebalanceSpec.scala:454)
my_test_run_loop_24847_93.txt:[info] 99860 was not equal to 100000 (RebalanceSpec.scala:454)

jhooda · 2020-01-11T02:47:28Z

@seglo After analyzing the logs there seems to be some race condition between the older dying consumers and the new consumers: offsets are missed/jumped by the new consumers in a multiple of chunk determined by the config setting ConsumerConfig.MAX_POLL_RECORDS_CONFIG (e.g., "20" in this test case). Most likely Committer.sink is corrupting the offset start value for the second poll polled by the new consumers. (This is very similar to #336 )

ennru · 2020-01-13T12:36:45Z

Hi @jhooda
Thank you for digging into this and providing the test case. The tests fail on Travis as the log grows too big. I just created #1017 which logs less.
As you created this PR from your master branch of Alpakka Kafka, I did not want to push onto it. Please create a git branch before submitting changes in a Pull Request.

ennru

I suspected the mapAsyncUnordered causes the message-loss, but that was alright.

ennru · 2020-01-13T12:50:44Z

tests/src/test/scala/akka/kafka/scaladsl/RebalanceSpec.scala

+      t2 = subscribeAndConsumeMessages(consumerClientId2, probe2subscription, topicCount, partitionCount)
+      control2 = t2._1; probe2 = t2._2


What is the purpose of re-using these variables?

There isn't much purpose, except to make them follow the client suffix, e.g., control2, probe2 are for client2, useful to correlate when that particular client is closed and restarted. Looks like you took care of it in #1017. Thanks.

Could mapAsyncUnordered in Committer::batchFlow cause this? Can we make Committer listen to rebalance events and skip commits emitted by defunct consumers?

tests/src/test/scala/akka/kafka/scaladsl/RebalanceSpec.scala

Added a test case to validate no message loss during rebalance

4da4679

probot-autolabeler bot added the tests label Jan 10, 2020

jhooda mentioned this pull request Jan 10, 2020

Rebalance: filter messages of revoked partitions #946

Merged

seglo added this to Incoming Issues and PRs in Akka streaming via automation Jan 10, 2020

seglo moved this from Incoming Issues and PRs to To do in Akka streaming Jan 10, 2020

seglo self-requested a review January 10, 2020 14:10

seglo moved this from To do to Ready for review in Akka streaming Jan 10, 2020

seglo self-assigned this Jan 10, 2020

ennru mentioned this pull request Jan 13, 2020

Message loss testcase of #1016 #1017

Closed

ennru reviewed Jan 13, 2020

View reviewed changes

seglo removed their assignment Jan 13, 2020

seglo moved this from Ready for review to Review in progress in Akka streaming Jan 17, 2020

seglo closed this Jan 22, 2020

Akka streaming automation moved this from Review in progress to Done Jan 22, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added a test case to validate no message loss during rebalance #1016

Added a test case to validate no message loss during rebalance #1016

jhooda commented Jan 10, 2020 •

edited

lightbend-cla-validator commented Jan 10, 2020

seglo commented Jan 10, 2020

jhooda commented Jan 10, 2020 •

edited

jhooda commented Jan 11, 2020

ennru commented Jan 13, 2020

ennru left a comment •

edited

ennru Jan 13, 2020

jhooda Jan 13, 2020 •

edited

jhooda Jan 14, 2020 •

edited

		t2 = subscribeAndConsumeMessages(consumerClientId2, probe2subscription, topicCount, partitionCount)
		control2 = t2._1; probe2 = t2._2

Added a test case to validate no message loss during rebalance #1016

Added a test case to validate no message loss during rebalance #1016

Conversation

jhooda commented Jan 10, 2020 • edited

Purpose

References

Changes

Background Context

lightbend-cla-validator commented Jan 10, 2020

seglo commented Jan 10, 2020

jhooda commented Jan 10, 2020 • edited

jhooda commented Jan 11, 2020

ennru commented Jan 13, 2020

ennru left a comment • edited

Choose a reason for hiding this comment

ennru Jan 13, 2020

Choose a reason for hiding this comment

jhooda Jan 13, 2020 • edited

Choose a reason for hiding this comment

jhooda Jan 14, 2020 • edited

Choose a reason for hiding this comment

jhooda commented Jan 10, 2020 •

edited

jhooda commented Jan 10, 2020 •

edited

ennru left a comment •

edited

jhooda Jan 13, 2020 •

edited

jhooda Jan 14, 2020 •

edited