Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added a test case to validate no message loss during rebalance #1016

Closed
wants to merge 1 commit into from
Closed

Added a test case to validate no message loss during rebalance #1016

wants to merge 1 commit into from

Conversation

jhooda
Copy link
Contributor

@jhooda jhooda commented Jan 10, 2020

Purpose

Added a test case that shows sometime offsets are skipped during re-balance
when multiple consumers belong to same group

References

References #382, #589, #946, #336

Changes

When the submitted test is run multiple times, we observed one failure approximately
every ten test runs on
OS: ubuntu 14.04LTS (6-cores),
java version: "1.8.0_144", Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode))

Background Context

There was a discussion in #946 in which @seglo suggested to submit a test to demonstrate that offsets are skipped with a Spec test.
my_test_run_loop_63510_77.txt.gz

my_test_run_loop.sh.txt

@lightbend-cla-validator

At least one pull request committer is not linked to a user. See https://help.github.com/en/articles/why-are-my-commits-linked-to-the-wrong-user#commits-are-not-linked-to-any-user

@seglo
Copy link
Member

seglo commented Jan 10, 2020

@jhooda Thank you for taking the time to implement a test. I did a quick pass and nothing jumps out at me about your setup. I would expect to see message duplication since rebalances are being triggered, but not message loss. I'll dig into this test soon.

@seglo seglo added this to Incoming Issues and PRs in Akka streaming via automation Jan 10, 2020
@seglo seglo moved this from Incoming Issues and PRs to To do in Akka streaming Jan 10, 2020
@seglo seglo self-requested a review January 10, 2020 14:10
@seglo seglo moved this from To do to Ready for review in Akka streaming Jan 10, 2020
@seglo seglo self-assigned this Jan 10, 2020
@jhooda
Copy link
Contributor Author

jhooda commented Jan 10, 2020

@seglo thanks for quick response. Below is a quick view of 100 iterations of the test showing log file size. All log files with size > 2.6K are tests that failed. One of the failure run log file (gz) was also attached in the original pull request.

ubuntu@ubuntu-usnorth1:~/Downloads/alpakka-kafka/logs$ ls -lrt
total 1597384
-rw-rw-r-- 1 ubuntu ubuntu 2587 Jan 9 18:41 my_test_run_loop_24847_1.txt
-rw-rw-r-- 1 ubuntu ubuntu 68238902 Jan 9 18:44 my_test_run_loop_24847_2.txt
-rw-rw-r-- 1 ubuntu ubuntu 2586 Jan 9 18:47 my_test_run_loop_24847_3.txt
-rw-rw-r-- 1 ubuntu ubuntu 2587 Jan 9 18:51 my_test_run_loop_24847_4.txt
-rw-rw-r-- 1 ubuntu ubuntu 2587 Jan 9 18:54 my_test_run_loop_24847_5.txt
-rw-rw-r-- 1 ubuntu ubuntu 33886250 Jan 9 18:56 my_test_run_loop_24847_6.txt
-rw-rw-r-- 1 ubuntu ubuntu 2586 Jan 9 18:59 my_test_run_loop_24847_7.txt
-rw-rw-r-- 1 ubuntu ubuntu 34163080 Jan 9 19:01 my_test_run_loop_24847_8.txt
-rw-rw-r-- 1 ubuntu ubuntu 2587 Jan 9 19:04 my_test_run_loop_24847_9.txt
-rw-rw-r-- 1 ubuntu ubuntu 69568343 Jan 9 19:07 my_test_run_loop_24847_10.txt
-rw-rw-r-- 1 ubuntu ubuntu 2587 Jan 9 19:10 my_test_run_loop_24847_11.txt
-rw-rw-r-- 1 ubuntu ubuntu 34192915 Jan 9 19:12 my_test_run_loop_24847_12.txt
-rw-rw-r-- 1 ubuntu ubuntu 69799739 Jan 9 19:15 my_test_run_loop_24847_13.txt
-rw-rw-r-- 1 ubuntu ubuntu 2586 Jan 9 19:18 my_test_run_loop_24847_14.txt
-rw-rw-r-- 1 ubuntu ubuntu 63857029 Jan 9 19:21 my_test_run_loop_24847_15.txt
-rw-rw-r-- 1 ubuntu ubuntu 2587 Jan 9 19:25 my_test_run_loop_24847_16.txt
-rw-rw-r-- 1 ubuntu ubuntu 2587 Jan 9 19:28 my_test_run_loop_24847_17.txt
-rw-rw-r-- 1 ubuntu ubuntu 2587 Jan 9 19:31 my_test_run_loop_24847_18.txt
-rw-rw-r-- 1 ubuntu ubuntu 2587 Jan 9 19:34 my_test_run_loop_24847_19.txt
-rw-rw-r-- 1 ubuntu ubuntu 2586 Jan 9 19:37 my_test_run_loop_24847_20.txt
-rw-rw-r-- 1 ubuntu ubuntu 68755879 Jan 9 19:40 my_test_run_loop_24847_21.txt
-rw-rw-r-- 1 ubuntu ubuntu 2586 Jan 9 19:44 my_test_run_loop_24847_22.txt
-rw-rw-r-- 1 ubuntu ubuntu 2586 Jan 9 19:47 my_test_run_loop_24847_23.txt
-rw-rw-r-- 1 ubuntu ubuntu 72710323 Jan 9 19:50 my_test_run_loop_24847_24.txt
-rw-rw-r-- 1 ubuntu ubuntu 2587 Jan 9 19:53 my_test_run_loop_24847_25.txt
-rw-rw-r-- 1 ubuntu ubuntu 71248675 Jan 9 19:57 my_test_run_loop_24847_26.txt
-rw-rw-r-- 1 ubuntu ubuntu 68063215 Jan 9 20:00 my_test_run_loop_24847_27.txt
-rw-rw-r-- 1 ubuntu ubuntu 2587 Jan 9 20:03 my_test_run_loop_24847_28.txt
-rw-rw-r-- 1 ubuntu ubuntu 29770172 Jan 9 20:04 my_test_run_loop_24847_29.txt
-rw-rw-r-- 1 ubuntu ubuntu 2587 Jan 9 20:07 my_test_run_loop_24847_30.txt
-rw-rw-r-- 1 ubuntu ubuntu 2587 Jan 9 20:10 my_test_run_loop_24847_31.txt
-rw-rw-r-- 1 ubuntu ubuntu 2587 Jan 9 20:13 my_test_run_loop_24847_32.txt
-rw-rw-r-- 1 ubuntu ubuntu 2587 Jan 9 20:16 my_test_run_loop_24847_33.txt
-rw-rw-r-- 1 ubuntu ubuntu 2587 Jan 9 20:19 my_test_run_loop_24847_34.txt
-rw-rw-r-- 1 ubuntu ubuntu 2586 Jan 9 20:23 my_test_run_loop_24847_35.txt
-rw-rw-r-- 1 ubuntu ubuntu 2586 Jan 9 20:26 my_test_run_loop_24847_36.txt
-rw-rw-r-- 1 ubuntu ubuntu 2587 Jan 9 20:29 my_test_run_loop_24847_37.txt
-rw-rw-r-- 1 ubuntu ubuntu 2586 Jan 9 20:33 my_test_run_loop_24847_38.txt
-rw-rw-r-- 1 ubuntu ubuntu 2587 Jan 9 20:37 my_test_run_loop_24847_39.txt
-rw-rw-r-- 1 ubuntu ubuntu 2587 Jan 9 20:40 my_test_run_loop_24847_40.txt
-rw-rw-r-- 1 ubuntu ubuntu 2587 Jan 9 20:43 my_test_run_loop_24847_41.txt
-rw-rw-r-- 1 ubuntu ubuntu 2586 Jan 9 20:46 my_test_run_loop_24847_42.txt
-rw-rw-r-- 1 ubuntu ubuntu 69650696 Jan 9 20:49 my_test_run_loop_24847_43.txt
-rw-rw-r-- 1 ubuntu ubuntu 2586 Jan 9 20:53 my_test_run_loop_24847_44.txt
-rw-rw-r-- 1 ubuntu ubuntu 2587 Jan 9 20:56 my_test_run_loop_24847_45.txt
-rw-rw-r-- 1 ubuntu ubuntu 2587 Jan 9 21:00 my_test_run_loop_24847_46.txt
-rw-rw-r-- 1 ubuntu ubuntu 29250605 Jan 9 21:01 my_test_run_loop_24847_47.txt
-rw-rw-r-- 1 ubuntu ubuntu 67415474 Jan 9 21:04 my_test_run_loop_24847_48.txt
-rw-rw-r-- 1 ubuntu ubuntu 2587 Jan 9 21:07 my_test_run_loop_24847_49.txt
-rw-rw-r-- 1 ubuntu ubuntu 2587 Jan 9 21:10 my_test_run_loop_24847_50.txt
-rw-rw-r-- 1 ubuntu ubuntu 2587 Jan 9 21:14 my_test_run_loop_24847_51.txt
-rw-rw-r-- 1 ubuntu ubuntu 2587 Jan 9 21:18 my_test_run_loop_24847_52.txt
-rw-rw-r-- 1 ubuntu ubuntu 2586 Jan 9 21:21 my_test_run_loop_24847_53.txt
-rw-rw-r-- 1 ubuntu ubuntu 2587 Jan 9 21:24 my_test_run_loop_24847_54.txt
-rw-rw-r-- 1 ubuntu ubuntu 2586 Jan 9 21:28 my_test_run_loop_24847_55.txt
-rw-rw-r-- 1 ubuntu ubuntu 2586 Jan 9 21:30 my_test_run_loop_24847_56.txt
-rw-rw-r-- 1 ubuntu ubuntu 72402663 Jan 9 21:34 my_test_run_loop_24847_57.txt
-rw-rw-r-- 1 ubuntu ubuntu 2587 Jan 9 21:37 my_test_run_loop_24847_58.txt
-rw-rw-r-- 1 ubuntu ubuntu 2587 Jan 9 21:40 my_test_run_loop_24847_59.txt
-rw-rw-r-- 1 ubuntu ubuntu 67912254 Jan 9 21:42 my_test_run_loop_24847_60.txt
-rw-rw-r-- 1 ubuntu ubuntu 2587 Jan 9 21:46 my_test_run_loop_24847_61.txt
-rw-rw-r-- 1 ubuntu ubuntu 2586 Jan 9 21:49 my_test_run_loop_24847_62.txt
-rw-rw-r-- 1 ubuntu ubuntu 2587 Jan 9 21:52 my_test_run_loop_24847_63.txt
-rw-rw-r-- 1 ubuntu ubuntu 2587 Jan 9 21:55 my_test_run_loop_24847_64.txt
-rw-rw-r-- 1 ubuntu ubuntu 2587 Jan 9 21:58 my_test_run_loop_24847_65.txt
-rw-rw-r-- 1 ubuntu ubuntu 2588 Jan 9 22:01 my_test_run_loop_24847_66.txt
-rw-rw-r-- 1 ubuntu ubuntu 2587 Jan 9 22:05 my_test_run_loop_24847_67.txt
-rw-rw-r-- 1 ubuntu ubuntu 30209548 Jan 9 22:06 my_test_run_loop_24847_68.txt
-rw-rw-r-- 1 ubuntu ubuntu 2588 Jan 9 22:09 my_test_run_loop_24847_69.txt
-rw-rw-r-- 1 ubuntu ubuntu 66402980 Jan 9 22:12 my_test_run_loop_24847_70.txt
-rw-rw-r-- 1 ubuntu ubuntu 67261805 Jan 9 22:15 my_test_run_loop_24847_71.txt
-rw-rw-r-- 1 ubuntu ubuntu 2588 Jan 9 22:19 my_test_run_loop_24847_72.txt
-rw-rw-r-- 1 ubuntu ubuntu 2588 Jan 9 22:21 my_test_run_loop_24847_73.txt
-rw-rw-r-- 1 ubuntu ubuntu 2587 Jan 9 22:25 my_test_run_loop_24847_74.txt
-rw-rw-r-- 1 ubuntu ubuntu 2587 Jan 9 22:29 my_test_run_loop_24847_75.txt
-rw-rw-r-- 1 ubuntu ubuntu 2588 Jan 9 22:31 my_test_run_loop_24847_76.txt
-rw-rw-r-- 1 ubuntu ubuntu 66777819 Jan 9 22:34 my_test_run_loop_24847_77.txt
-rw-rw-r-- 1 ubuntu ubuntu 2588 Jan 9 22:37 my_test_run_loop_24847_78.txt
-rw-rw-r-- 1 ubuntu ubuntu 2588 Jan 9 22:40 my_test_run_loop_24847_79.txt
-rw-rw-r-- 1 ubuntu ubuntu 2587 Jan 9 22:43 my_test_run_loop_24847_80.txt
-rw-rw-r-- 1 ubuntu ubuntu 2588 Jan 9 22:46 my_test_run_loop_24847_81.txt
-rw-rw-r-- 1 ubuntu ubuntu 2588 Jan 9 22:49 my_test_run_loop_24847_82.txt
-rw-rw-r-- 1 ubuntu ubuntu 2588 Jan 9 22:52 my_test_run_loop_24847_83.txt
-rw-rw-r-- 1 ubuntu ubuntu 2588 Jan 9 22:55 my_test_run_loop_24847_84.txt
-rw-rw-r-- 1 ubuntu ubuntu 68702456 Jan 9 22:58 my_test_run_loop_24847_85.txt
-rw-rw-r-- 1 ubuntu ubuntu 2588 Jan 9 23:01 my_test_run_loop_24847_86.txt
-rw-rw-r-- 1 ubuntu ubuntu 2588 Jan 9 23:05 my_test_run_loop_24847_87.txt
-rw-rw-r-- 1 ubuntu ubuntu 66630102 Jan 9 23:07 my_test_run_loop_24847_88.txt
-rw-rw-r-- 1 ubuntu ubuntu 2588 Jan 9 23:10 my_test_run_loop_24847_89.txt
-rw-rw-r-- 1 ubuntu ubuntu 40348591 Jan 9 23:12 my_test_run_loop_24847_90.txt
-rw-rw-r-- 1 ubuntu ubuntu 64773970 Jan 9 23:15 my_test_run_loop_24847_91.txt
-rw-rw-r-- 1 ubuntu ubuntu 2587 Jan 9 23:18 my_test_run_loop_24847_92.txt
-rw-rw-r-- 1 ubuntu ubuntu 61840266 Jan 9 23:22 my_test_run_loop_24847_93.txt
-rw-rw-r-- 1 ubuntu ubuntu 2588 Jan 9 23:25 my_test_run_loop_24847_94.txt
-rw-rw-r-- 1 ubuntu ubuntu 2587 Jan 9 23:29 my_test_run_loop_24847_95.txt
-rw-rw-r-- 1 ubuntu ubuntu 2588 Jan 9 23:32 my_test_run_loop_24847_96.txt
-rw-rw-r-- 1 ubuntu ubuntu 2588 Jan 9 23:35 my_test_run_loop_24847_97.txt
-rw-rw-r-- 1 ubuntu ubuntu 44239131 Jan 9 23:37 my_test_run_loop_24847_98.txt
-rw-rw-r-- 1 ubuntu ubuntu 2587 Jan 9 23:41 my_test_run_loop_24847_99.txt
-rw-rw-r-- 1 ubuntu ubuntu 67272920 Jan 9 23:43 my_test_run_loop_24847_100.txt

Message loss grep:
ubuntu@ubuntu-usnorth1:~/Downloads/alpakka-kafka/logs$ grep 'info.*was not equal to' my_test_run_loop_24847_*txt
my_test_run_loop_24847_100.txt:[info] 99900 was not equal to 100000 (RebalanceSpec.scala:454)
my_test_run_loop_24847_10.txt:[info] 99980 was not equal to 100000 (RebalanceSpec.scala:454)
my_test_run_loop_24847_13.txt:[info] 99940 was not equal to 100000 (RebalanceSpec.scala:454)
my_test_run_loop_24847_15.txt:[info] 99900 was not equal to 100000 (RebalanceSpec.scala:454)
my_test_run_loop_24847_21.txt:[info] 99080 was not equal to 100000 (RebalanceSpec.scala:454)
my_test_run_loop_24847_24.txt:[info] 99980 was not equal to 100000 (RebalanceSpec.scala:454)
my_test_run_loop_24847_26.txt:[info] 99980 was not equal to 100000 (RebalanceSpec.scala:454)
my_test_run_loop_24847_27.txt:[info] 99960 was not equal to 100000 (RebalanceSpec.scala:454)
my_test_run_loop_24847_2.txt:[info] 99860 was not equal to 100000 (RebalanceSpec.scala:454)
my_test_run_loop_24847_43.txt:[info] 99920 was not equal to 100000 (RebalanceSpec.scala:454)
my_test_run_loop_24847_48.txt:[info] 99880 was not equal to 100000 (RebalanceSpec.scala:454)
my_test_run_loop_24847_57.txt:[info] 99980 was not equal to 100000 (RebalanceSpec.scala:454)
my_test_run_loop_24847_60.txt:[info] 99620 was not equal to 100000 (RebalanceSpec.scala:454)
my_test_run_loop_24847_70.txt:[info] 99860 was not equal to 100000 (RebalanceSpec.scala:454)
my_test_run_loop_24847_71.txt:[info] 99940 was not equal to 100000 (RebalanceSpec.scala:454)
my_test_run_loop_24847_77.txt:[info] 99940 was not equal to 100000 (RebalanceSpec.scala:454)
my_test_run_loop_24847_85.txt:[info] 96960 was not equal to 100000 (RebalanceSpec.scala:454)
my_test_run_loop_24847_88.txt:[info] 99980 was not equal to 100000 (RebalanceSpec.scala:454)
my_test_run_loop_24847_91.txt:[info] 99340 was not equal to 100000 (RebalanceSpec.scala:454)
my_test_run_loop_24847_93.txt:[info] 99860 was not equal to 100000 (RebalanceSpec.scala:454)

@jhooda
Copy link
Contributor Author

jhooda commented Jan 11, 2020

@seglo After analyzing the logs there seems to be some race condition between the older dying consumers and the new consumers: offsets are missed/jumped by the new consumers in a multiple of chunk determined by the config setting ConsumerConfig.MAX_POLL_RECORDS_CONFIG (e.g., "20" in this test case). Most likely Committer.sink is corrupting the offset start value for the second poll polled by the new consumers. (This is very similar to #336 )

@ennru
Copy link
Member

ennru commented Jan 13, 2020

Hi @jhooda
Thank you for digging into this and providing the test case. The tests fail on Travis as the log grows too big. I just created #1017 which logs less.
As you created this PR from your master branch of Alpakka Kafka, I did not want to push onto it. Please create a git branch before submitting changes in a Pull Request.

Copy link
Member

@ennru ennru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspected the mapAsyncUnordered causes the message-loss, but that was alright.

Comment on lines +406 to +407
t2 = subscribeAndConsumeMessages(consumerClientId2, probe2subscription, topicCount, partitionCount)
control2 = t2._1; probe2 = t2._2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of re-using these variables?

Copy link
Contributor Author

@jhooda jhooda Jan 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There isn't much purpose, except to make them follow the client suffix, e.g., control2, probe2 are for client2, useful to correlate when that particular client is closed and restarted. Looks like you took care of it in #1017. Thanks.

Copy link
Contributor Author

@jhooda jhooda Jan 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could mapAsyncUnordered in Committer::batchFlow cause this? Can we make Committer listen to rebalance events and skip commits emitted by defunct consumers?

@seglo seglo removed their assignment Jan 13, 2020
@seglo seglo moved this from Ready for review to Review in progress in Akka streaming Jan 17, 2020
@seglo seglo closed this Jan 22, 2020
Akka streaming automation moved this from Review in progress to Done Jan 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

None yet

4 participants