Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KAFKA-8940: Tighten up SmokeTestDriver #7565

Merged

Conversation

guozhangwang
Copy link
Contributor

After many runs of reproducing the failure (on my local MP5 it takes about 100 - 200 run to get one) I think it is more likely a flaky one and not exposing a real bug in rebalance protocol.

What I've observed is that, when the verifying consumer is trying to fetch from the output topics (there are 11 of them), it poll(1sec) each time, and retries 30 times if there's no more data to fetch and stop. It means that if there are no data fetched from the output topics for 30 * 1 = 30 seconds then the verification would stop (potentially too early). And for the failure cases, we observe consistent rebalancing among the closing / newly created clients since the closing is async, i.e. while new clients are added it is possible that closing clients triggered rebalance are not completed yet (note that each instance is configured with 3 threads, and in the worst case there are 6 instances running / pending shutdown at the same time, so a group fo 3 * 6 = 18 members is possible).

However, there's still a possible bug that in KIP-429, somehow the rebalance can never stabilize and members keep re-rejoining and hence cause it to fail. We have another unit test that have bumped up to 3 rebalance by @ableegoldman and if that failed again then it may be a better confirmation such bug may exist.

So what I've done so far for this test:

  1. When closing a client, wait for it to complete closure before moving on to the next iteration and starting a new instance to reduce the rebalance churns.

  2. Poll for 5 seconds instead of 1 to wait for longer time: 5 * 30 = 150 seconds, and locally my laptop finished this test in about 50 seconds.

  3. Minor debug logging improvement; in fact some of them is to reduce redundant debug logging since it is too long and sometimes hides the key information.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

Copy link
Contributor Author

@guozhangwang guozhangwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ping @ableegoldman @vvcephei @bbejeck to take a look.

if (records.isEmpty() && recordsProcessed >= recordsGenerated) {
verificationResult = verifyAll(inputs, events);
verificationResult = verifyAll(inputs, events, false);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Motivation of printResults is to only print the results at the end of the verification, instead of printing at each verification.


if (printResults) {
resultStream.printf("\t inputEvents=%n%s%n\t" +
"echoEvents=%n%s%n\tmaxEvents=%n%s%n\tminEvents=%n%s%n\tdifEvents=%n%s%n\tcntEvents=%n%s%n\ttaggEvents=%n%s%n",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since some of the results like dif / tagg are composed of other results, just printing themselves are not sufficient for debugging, so I've decided to print all intermeidate results here too.

SmokeTestClient client = clients.remove(0);

client.closeAsync();
while (!client.closed()) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the key change 1).

@@ -331,18 +331,20 @@ public static VerificationResult verify(final String kafka,
int retry = 0;
final long start = System.currentTimeMillis();
while (System.currentTimeMillis() - start < TimeUnit.MINUTES.toMillis(6)) {
final ConsumerRecords<String, Number> records = consumer.poll(Duration.ofSeconds(1));
final ConsumerRecords<String, Number> records = consumer.poll(Duration.ofSeconds(5));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the key change 2).

@guozhangwang
Copy link
Contributor Author

Another flakiness that I've observed on this test is that, the generation maybe reset by the background thread due to heartbeat timeout and then when the main thread triggers poll, the onPartitionsLost can be triggered without committing offsets, therefore causes duplicates and test failures. Before KIP-429's onPartitionsLost is introduced, we would still call onPartitionsRevoked which may still pass.

So I feel that making a strict verification for this client may not be the best idea anymore; if after this PR the flakiness still exists, we should consider having a more general fix.

@guozhangwang
Copy link
Contributor Author

JDK 2.12 failed on kafka.api.SaslSslAdminClientIntegrationTest.testCreatePartitions; JDK 2.13 failed on kafka.api.SaslOAuthBearerSslEndToEndAuthorizationTest.testNoDescribeProduceOrConsumeWithoutTopicDescribeAcl.

Copy link
Contributor

@vvcephei vvcephei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @guozhangwang , I was also thinking that the likely solution would be just to wait longer at the end for verification to pass. I think that blocking until each instance shuts down should also be fine.

Thanks!

Copy link
Contributor

@bbejeck bbejeck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @guozhangwang LGTM

@guozhangwang guozhangwang merged commit f90b7e9 into apache:trunk Oct 21, 2019
@mjsax mjsax added streams tests Test fixes (including flaky tests) labels Oct 21, 2019
@guozhangwang guozhangwang deleted the K8940-smoketestdriver-integration branch April 24, 2020 23:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
streams tests Test fixes (including flaky tests)
Projects
None yet
4 participants