KAFKA-8940: Tighten up SmokeTestDriver #7565
Conversation
Ping @ableegoldman @vvcephei @bbejeck to take a look.
    if (records.isEmpty() && recordsProcessed >= recordsGenerated) {
-        verificationResult = verifyAll(inputs, events);
+        verificationResult = verifyAll(inputs, events, false);
The motivation of printResults is to print the results only at the end of verification, instead of printing at each intermediate verification pass.
    if (printResults) {
        resultStream.printf("\t inputEvents=%n%s%n\t" +
            "echoEvents=%n%s%n\tmaxEvents=%n%s%n\tminEvents=%n%s%n\tdifEvents=%n%s%n\tcntEvents=%n%s%n\ttaggEvents=%n%s%n",
Since some of the results like dif / tagg are composed of other results, printing only those is not sufficient for debugging, so I've decided to print all intermediate results here too.
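As a rough sketch of that flag in plain Java (verifyAll here is a hypothetical stand-in for the real SmokeTestDriver method, and the event-count map is fabricated for illustration): intermediate passes run silently, and only the final pass dumps every intermediate result.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.io.PrintStream;

public class PrintAtEnd {
    // Hypothetical stand-in for SmokeTestDriver.verifyAll: checks every
    // result and, only when printResults is set, prints all intermediate
    // results (derived ones like dif / tagg are composed from these, so
    // printing only the derived results would not be enough to debug).
    static boolean verifyAll(Map<String, Integer> eventCounts,
                             PrintStream resultStream,
                             boolean printResults) {
        boolean pass = eventCounts.values().stream().allMatch(c -> c > 0);
        if (printResults) {
            eventCounts.forEach((name, count) ->
                resultStream.printf("\t%sEvents=%d%n", name, count));
        }
        return pass;
    }

    public static void main(String[] args) {
        Map<String, Integer> events = new LinkedHashMap<>();
        events.put("echo", 3);
        events.put("max", 3);

        verifyAll(events, System.out, false); // intermediate pass: quiet
        verifyAll(events, System.out, true);  // final pass: print everything once
    }
}
```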
SmokeTestClient client = clients.remove(0);

client.closeAsync();
while (!client.closed()) {
This is the key change 1).
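That change can be illustrated in plain Java (FakeClient and its closeAsync()/closed() methods are made-up stand-ins for SmokeTestClient, not the real API): the driver now blocks until the outgoing instance has fully shut down before starting the next one, so at most one membership change is in flight at a time.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Stand-in for SmokeTestClient: closeAsync() starts shutdown in the
// background, closed() reports whether it has finished.
class FakeClient {
    private final AtomicBoolean closed = new AtomicBoolean(false);

    void closeAsync() {
        Thread t = new Thread(() -> {
            try {
                Thread.sleep(50); // simulated shutdown work
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            closed.set(true);
        });
        t.start();
    }

    boolean closed() {
        return closed.get();
    }
}

public class CloseThenWait {
    // Block until the client has fully shut down before starting a
    // replacement instance, to avoid overlapping rebalances.
    static void closeAndAwait(FakeClient client) throws InterruptedException {
        client.closeAsync();
        while (!client.closed()) {
            Thread.sleep(10);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        FakeClient client = new FakeClient();
        closeAndAwait(client);
        System.out.println("closed=" + client.closed()); // prints "closed=true"
    }
}
```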
@@ -331,18 +331,20 @@ public static VerificationResult verify(final String kafka,
    int retry = 0;
    final long start = System.currentTimeMillis();
    while (System.currentTimeMillis() - start < TimeUnit.MINUTES.toMillis(6)) {
-        final ConsumerRecords<String, Number> records = consumer.poll(Duration.ofSeconds(1));
+        final ConsumerRecords<String, Number> records = consumer.poll(Duration.ofSeconds(5));
This is the key change 2).
Another flakiness that I've observed on this test is that the generation may be reset by the background thread due to a heartbeat timeout, and then when the main thread triggers So I feel that making a strict verification for this client may not be the best idea anymore; if the flakiness still exists after this PR, we should consider a more general fix.
JDK 2.12 failed on
Thanks @guozhangwang, I was also thinking that the likely solution would be just to wait longer at the end for the verification to pass. I think that blocking until each instance shuts down should also be fine.
Thanks!
Thanks @guozhangwang LGTM
After many runs reproducing the failure (on my local machine it takes about 100 to 200 runs to get one), I think it is more likely a flaky test than a real bug in the rebalance protocol.

What I've observed is that when the verifying consumer fetches from the output topics (there are 11 of them), it calls poll(1 sec) each time, and retries 30 times if there's no more data to fetch before stopping. That means if no data is fetched from the output topics for 30 * 1 = 30 seconds, the verification stops (potentially too early). In the failure cases, we observe continuous rebalancing among the closing / newly created clients: since closing is async, while new clients are added it is possible that rebalances triggered by closing clients have not completed yet (note that each instance is configured with 3 threads, and in the worst case there are 6 instances running / pending shutdown at the same time, so a group of 3 * 6 = 18 members is possible).

However, there's still a possible bug in KIP-429: the rebalance might never stabilize, with members re-joining forever and hence causing the test to fail. We have another unit test by @ableegoldman that has been bumped up to 3 rebalances; if that fails again, it would be better confirmation that such a bug may exist.
So what I've done so far for this test:
1. When closing a client, wait for it to complete closure before moving on to the next iteration and starting a new instance, to reduce rebalance churn.
2. Poll for 5 seconds instead of 1 to wait longer: up to 5 * 30 = 150 seconds, while locally my laptop finishes this test in about 50 seconds.
3. Minor debug-logging improvements; in fact, some of them reduce redundant debug logging, since it is too verbose and sometimes hides the key information.
Committer Checklist (excluded from commit message)