Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KAFKA-5063: Flaky ResetIntegrationTest #2931

Closed

Conversation

mjsax
Copy link
Member

@mjsax mjsax commented Apr 28, 2017

No description provided.

@mjsax
Copy link
Member Author

mjsax commented Apr 28, 2017

Call for review @enothereska @dguy @guozhangwang

@mjsax mjsax changed the title KAFKA-5140: Flaky ResetIntegrationTest KAFKA-5063: Flaky ResetIntegrationTest Apr 28, 2017
@asfbot
Copy link

asfbot commented Apr 28, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/3258/
Test PASSed (JDK 7 and Scala 2.10).

@asfbot
Copy link

asfbot commented Apr 28, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/3263/
Test FAILed (JDK 8 and Scala 2.11).

@asfbot
Copy link

asfbot commented Apr 28, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/3254/
Test PASSed (JDK 8 and Scala 2.12).

@enothereska
Copy link
Contributor

I noticed one of the wait functions does not have the 60000 timeout.

@@ -187,7 +187,8 @@ public void testReprocessingFromScratchAfterResetWithIntermediateUserTopic() thr
streams.close();

assertThat(resultRerun, equalTo(result));
assertThat(resultRerun2, equalTo(result2));
final int maxToCompare = Math.max(result2.size(), resultRerun2.size());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% sure if this should be max or min.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be min -- good catch! (problem: the test passed for me locally all the time, so this "fix" does not change local runs)

@enothereska
Copy link
Contributor

Also there is a test that failed, but it seems the results are not there anymore.

@enothereska
Copy link
Contributor

retest this please

@asfbot
Copy link

asfbot commented Apr 30, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/3311/
Test FAILed (JDK 8 and Scala 2.11).

@asfbot
Copy link

asfbot commented Apr 30, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/3306/
Test PASSed (JDK 7 and Scala 2.10).

@asfbot
Copy link

asfbot commented Apr 30, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/3302/
Test FAILed (JDK 8 and Scala 2.12).

@mjsax
Copy link
Member Author

mjsax commented Apr 30, 2017

About the timeout -- I guess we can got back to default anyway -- increasing the timeouts to 60000 was a try to stabilize the test -- it was not a proper fix though. WDTY about this?

@asfbot
Copy link

asfbot commented Apr 30, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/3314/
Test PASSed (JDK 8 and Scala 2.11).

@asfbot
Copy link

asfbot commented Apr 30, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/3309/
Test FAILed (JDK 7 and Scala 2.10).

@asfbot
Copy link

asfbot commented Apr 30, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/3305/
Test PASSed (JDK 8 and Scala 2.12).

@@ -187,7 +185,8 @@ public void testReprocessingFromScratchAfterResetWithIntermediateUserTopic() thr
streams.close();

assertThat(resultRerun, equalTo(result));
assertThat(resultRerun2, equalTo(result2));
final int maxToCompare = Math.min(result2.size(), resultRerun2.size());
assertThat(resultRerun2.subList(0, maxToCompare), equalTo(result2.subList(0, maxToCompare)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are the results not identical?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test is somewhat non-deterministic with respect to the expected result. Because, CLUSTER mock time get's initialized with System.currentTimeMillis, the timestamps assigned to the input topic records vary, and thus, they can fall into different windows in each run. And therefore, we don't know exactly how many records we need to expect. For this reason, we only wait for the first 10 records -- however, IntegrationTestUtils.waitUntilMinKeyValueRecordsReceived does not return exactly 10 records, but at-least 10 records, because the actual fetch size of the consumer can vary -- thus, we can end up with a different number of returned record for both runs (even if the topic contains the same data). However, the common prefix must be the same and thus we use it for the comparison.

An alternative (more complex) fix would be, to use the initial mock timestamp to figure out the windows in which records fall into, and thus compute the number the expected result record exactly. But I don't think this would make the test better, but only complicates the code.

Does this make sense? \cc @enothereska @guozhangwang

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mjsax With the recent changes on IntegrationTestUtils, waitUntilXXX will use one consumer instance only so I'm wondering if it could still return more than expected results.

Also for the mock time issue, could we eliminate the non-determinism by not using System.currentTimeMillis? We can augment the MockTime to set manual initialization values if necessary.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing MockTime would fix the issue by itself. Will try to do that.

@guozhangwang
Copy link
Contributor

Is this still needed?

@guozhangwang
Copy link
Contributor

retest this please

@mjsax
Copy link
Member Author

mjsax commented May 3, 2017

I think yes -- the failure we observed is unrelated to the consumer reset strategy (cf https://github.com/apache/kafka/pull/2931/files#r114402394). The prefix did always match -- the test failed because different number of records got returned and thus the lists did not match because of different size.

@asfbot
Copy link

asfbot commented May 3, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/3447/
Test PASSed (JDK 8 and Scala 2.11).

@asfbot
Copy link

asfbot commented May 3, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/3440/
Test PASSed (JDK 8 and Scala 2.12).

@asfbot
Copy link

asfbot commented May 3, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/3449/
Test PASSed (JDK 8 and Scala 2.11).

@asfbot
Copy link

asfbot commented May 3, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/3443/
Test PASSed (JDK 7 and Scala 2.10).

@mjsax
Copy link
Member Author

mjsax commented May 9, 2017

Updated this.

@asfbot
Copy link

asfbot commented May 9, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/3668/
Test PASSed (JDK 8 and Scala 2.11).

@asfbot
Copy link

asfbot commented May 9, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/3658/
Test FAILed (JDK 8 and Scala 2.12).

@asfbot
Copy link

asfbot commented May 9, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/3662/
Test FAILed (JDK 7 and Scala 2.10).

@mjsax
Copy link
Member Author

mjsax commented May 9, 2017

Retest this please.

@asfbot
Copy link

asfbot commented May 9, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/3671/
Test FAILed (JDK 8 and Scala 2.11).

@asfbot
Copy link

asfbot commented May 9, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/3665/
Test FAILed (JDK 7 and Scala 2.10).

@asfbot
Copy link

asfbot commented May 9, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/3661/
Test PASSed (JDK 8 and Scala 2.12).

@mjsax
Copy link
Member Author

mjsax commented May 9, 2017

Retest this please

@asfbot
Copy link

asfbot commented May 9, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/3667/
Test PASSed (JDK 7 and Scala 2.10).

@asfbot
Copy link

asfbot commented May 9, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/3663/
Test FAILed (JDK 8 and Scala 2.12).

@asfbot
Copy link

asfbot commented May 9, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/3673/
Test FAILed (JDK 8 and Scala 2.11).

@guozhangwang
Copy link
Contributor

retest this please

@asfbot
Copy link

asfbot commented May 9, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/3689/
Test FAILed (JDK 8 and Scala 2.11).

@asfbot
Copy link

asfbot commented May 9, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/3679/
Test FAILed (JDK 8 and Scala 2.12).

@asfbot
Copy link

asfbot commented May 9, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/3683/
Test PASSed (JDK 7 and Scala 2.10).

@guozhangwang
Copy link
Contributor

Created https://issues.apache.org/jira/browse/KAFKA-5209 for the newly observed transient failures.

Copy link
Contributor

@guozhangwang guozhangwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merged to trunk.

@asfgit asfgit closed this in 7371bf7 May 9, 2017
@mjsax mjsax deleted the kafka-5140-flaky-reset-integration-test branch May 15, 2017 17:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants