KAFKA-8940: Tighten up SmokeTestDriver #7565

guozhangwang · 2019-10-20T18:10:58Z

After many runs trying to reproduce the failure (on my local MP5 it takes about 100 to 200 runs to hit one), I think it is more likely a flaky test than a real bug in the rebalance protocol.

What I've observed is that when the verifying consumer fetches from the output topics (there are 11 of them), it calls poll(1sec) each time and, once there is no more data to fetch, retries 30 times before stopping. This means that if no data is fetched from the output topics for 30 * 1 = 30 seconds, the verification stops (potentially too early). In the failure cases we observe continuous rebalancing among the closing and newly created clients, because closing is async: while new clients are being added, rebalances triggered by closing clients may not have completed yet. (Note that each instance is configured with 3 threads, and in the worst case 6 instances are running or pending shutdown at the same time, so a group of 3 * 6 = 18 members is possible.)
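The idle budget described above can be sketched as a small simulation. This is not the driver's code: `IdleBudgetSketch`, `idleBudgetSeconds`, and `drain` are hypothetical names, the poll is replaced by an `IntSupplier` that returns a record count, and resetting the retry counter whenever data arrives is an assumption of this sketch.

```java
import java.util.function.IntSupplier;

final class IdleBudgetSketch {

    /** Longest stretch without data that the loop tolerates, in seconds. */
    static int idleBudgetSeconds(int pollTimeoutSeconds, int maxEmptyPolls) {
        return pollTimeoutSeconds * maxEmptyPolls;
    }

    /**
     * Simulated verification loop: each call to poll returns the number of
     * records fetched. The loop gives up after maxEmptyPolls consecutive
     * empty polls (assumption: any non-empty poll resets the counter).
     * Returns the total number of records seen before giving up.
     */
    static int drain(IntSupplier poll, int maxEmptyPolls) {
        int total = 0;
        int retry = 0;
        while (retry < maxEmptyPolls) {
            int fetched = poll.getAsInt();
            if (fetched == 0) {
                retry++;            // nothing arrived during this poll
            } else {
                retry = 0;          // data arrived, start counting again
                total += fetched;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("budget with poll(1s): " + idleBudgetSeconds(1, 30) + "s");
        System.out.println("budget with poll(5s): " + idleBudgetSeconds(5, 30) + "s");
    }
}
```

If output stalls for longer than the budget, for example during a long run of rebalances, a loop of this shape stops before the remaining records arrive, which is the "potentially too early" case.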

However, there is still a possible bug in KIP-429 where the rebalance somehow never stabilizes and members keep rejoining, causing the test to fail. We have another unit test that @ableegoldman has bumped up to 3 rebalances; if that one fails again, it would be better confirmation that such a bug may exist.

So what I've done so far for this test:

1) When closing a client, wait for it to finish closing before moving on to the next iteration and starting a new instance, to reduce rebalance churn.
2) Poll for 5 seconds instead of 1 so that verification waits longer: 5 * 30 = 150 seconds. Locally, my laptop finished this test in about 50 seconds.
3) Minor debug logging improvements; some of them actually reduce redundant debug logging, since the output is too long and sometimes hides the key information.

Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)

guozhangwang

Ping @ableegoldman @vvcephei @bbejeck to take a look.

guozhangwang · 2019-10-20T18:11:43Z

streams/src/test/java/org/apache/kafka/streams/tests/SmokeTestDriver.java

            if (records.isEmpty() && recordsProcessed >= recordsGenerated) {
-                verificationResult = verifyAll(inputs, events);
+                verificationResult = verifyAll(inputs, events, false);


The motivation for printResults is to print the results only at the end of the verification, instead of on every verification attempt.

guozhangwang · 2019-10-20T18:12:34Z

streams/src/test/java/org/apache/kafka/streams/tests/SmokeTestDriver.java

+
+                    if (printResults) {
+                        resultStream.printf("\t inputEvents=%n%s%n\t" +
+                                "echoEvents=%n%s%n\tmaxEvents=%n%s%n\tminEvents=%n%s%n\tdifEvents=%n%s%n\tcntEvents=%n%s%n\ttaggEvents=%n%s%n",


Since some of the results, like dif / tagg, are composed of other results, printing only those is not sufficient for debugging, so I've decided to print all intermediate results here too.

guozhangwang · 2019-10-20T18:12:49Z

streams/src/test/java/org/apache/kafka/streams/integration/SmokeTestDriverIntegrationTest.java

+                SmokeTestClient client = clients.remove(0);
+
+                client.closeAsync();
+                while (!client.closed()) {


This is the key change 1).
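A minimal sketch of the close-then-wait pattern in the snippet above, assuming only the two methods visible in the diff (`closeAsync()` and `closed()`). The `AsyncClosable` interface, the `FakeClient` stand-in, and the timeout are hypothetical additions for illustration; the diff does not show the loop body the PR actually uses.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

final class CloseAndAwaitSketch {

    /** Hypothetical stand-in for the two client methods used in the diff. */
    interface AsyncClosable {
        void closeAsync();
        boolean closed();
    }

    /**
     * Triggers an async close and blocks until the client reports closed.
     * Returns false if the timeout elapses or the thread is interrupted.
     */
    static boolean closeAndAwait(AsyncClosable client, long timeoutMs, long pollIntervalMs) {
        client.closeAsync();
        final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (!client.closed()) {
            if (System.nanoTime() - deadline >= 0) {
                return false;                        // gave up waiting
            }
            try {
                Thread.sleep(pollIntervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();  // preserve interrupt status
                return false;
            }
        }
        return true;
    }

    /** Fake client that finishes closing on a background thread after delayMs. */
    static final class FakeClient implements AsyncClosable {
        private final AtomicBoolean closed = new AtomicBoolean(false);
        private final long delayMs;

        FakeClient(long delayMs) {
            this.delayMs = delayMs;
        }

        @Override
        public void closeAsync() {
            Thread t = new Thread(() -> {
                try {
                    Thread.sleep(delayMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                closed.set(true);
            });
            t.setDaemon(true);
            t.start();
        }

        @Override
        public boolean closed() {
            return closed.get();
        }
    }

    public static void main(String[] args) {
        System.out.println("closed in time: " + closeAndAwait(new FakeClient(20), 2000, 5));
    }
}
```

Only starting the next instance after `closeAndAwait` returns keeps at most one shutdown-triggered rebalance in flight per iteration, which is the churn reduction the PR description aims for.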

guozhangwang · 2019-10-20T18:13:03Z

streams/src/test/java/org/apache/kafka/streams/tests/SmokeTestDriver.java

@@ -331,18 +331,20 @@ public static VerificationResult verify(final String kafka,
        int retry = 0;
        final long start = System.currentTimeMillis();
        while (System.currentTimeMillis() - start < TimeUnit.MINUTES.toMillis(6)) {
-            final ConsumerRecords<String, Number> records = consumer.poll(Duration.ofSeconds(1));
+            final ConsumerRecords<String, Number> records = consumer.poll(Duration.ofSeconds(5));


This is the key change 2).

guozhangwang · 2019-10-20T21:57:48Z

Another source of flakiness that I've observed in this test: the generation may be reset by the background thread due to a heartbeat timeout, and then, when the main thread calls poll, onPartitionsLost can be triggered without committing offsets, which causes duplicates and test failures. Before KIP-429 introduced onPartitionsLost, we would still call onPartitionsRevoked, which may still pass.

So I feel that strict verification for this client may no longer be the best idea; if the flakiness still exists after this PR, we should consider a more general fix.
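To illustrate why losing partitions without a commit produces duplicates, here is a toy model that does not use the Kafka client at all. `LostPartitionsSketch` and its parameters are hypothetical: offsets are plain integers, commits happen every `commitInterval` records, and a single interruption either commits first (modelling a revoke path that can still commit) or does not (modelling the lost path described above).

```java
import java.util.ArrayList;
import java.util.List;

final class LostPartitionsSketch {

    /**
     * Processes offsets [0, endOffset) with one interruption at interruptAt.
     * After the interruption, processing resumes from the last committed offset.
     * Returns every offset in the order it was processed, duplicates included.
     */
    static List<Integer> process(int endOffset, int commitInterval,
                                 int interruptAt, boolean commitOnInterrupt) {
        List<Integer> processed = new ArrayList<>();
        int committed = 0;
        int position = 0;
        boolean interrupted = false;
        while (position < endOffset) {
            if (!interrupted && position == interruptAt) {
                interrupted = true;
                if (commitOnInterrupt) {
                    committed = position;   // commit before giving up the partition
                }
                position = committed;       // resume from the last committed offset
                continue;
            }
            processed.add(position);
            position++;
            if (position % commitInterval == 0) {
                committed = position;       // periodic commit
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println("no commit on interrupt: " + process(10, 5, 8, false));
        System.out.println("commit on interrupt:    " + process(10, 5, 8, true));
    }
}
```

With 10 records, a commit every 5, and an interruption at offset 8, the no-commit case reprocesses offsets 5, 6, and 7, so a verifier that expects each record exactly once fails even though nothing was lost.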

guozhangwang · 2019-10-21T17:29:34Z

The Scala 2.12 build failed on kafka.api.SaslSslAdminClientIntegrationTest.testCreatePartitions; the Scala 2.13 build failed on kafka.api.SaslOAuthBearerSslEndToEndAuthorizationTest.testNoDescribeProduceOrConsumeWithoutTopicDescribeAcl.

vvcephei

Thanks @guozhangwang, I was also thinking that the likely solution would be just to wait longer at the end for verification to pass. I think that blocking until each instance shuts down should also be fine.

Thanks!

bbejeck

Thanks @guozhangwang LGTM

guozhangwang added 2 commits October 18, 2019 17:33

improve error message

3534a28

actual fix

1960f84

guozhangwang commented Oct 20, 2019

checkstyle

b3fe890

vvcephei approved these changes Oct 21, 2019

bbejeck approved these changes Oct 21, 2019

guozhangwang merged commit f90b7e9 into apache:trunk Oct 21, 2019

mjsax added the streams and tests (Test fixes, including flaky tests) labels Oct 21, 2019

guozhangwang deleted the K8940-smoketestdriver-integration branch April 24, 2020 23:58
