
KAFKA-4859: Raised Timeout #2680

Closed

Conversation

original-brownbear
Member

This fixes https://issues.apache.org/jira/browse/KAFKA-4859 for me: with the raised timeout the test passed over hundreds of iterations, while without it I could easily reproduce the failure in fewer than ~30-40 iterations.

Raising the timeout also looks like a valid approach (at least on my setup), since test runtimes are consistently slightly above the 30s default timeout coming from org.apache.kafka.streams.integration.utils.IntegrationTestUtils#DEFAULT_TIMEOUT.
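For illustration, the change amounts to passing an explicit, larger wait time instead of relying on the implicit default; this is a sketch only, and the topic name, record count, and 60s value are placeholders rather than the exact diff in this PR:

```java
// Sketch only (not the exact diff in this PR): instead of relying on the implicit
// IntegrationTestUtils#DEFAULT_TIMEOUT of 30 seconds, pass an explicit larger wait time.
IntegrationTestUtils.waitUntilMinKeyValueRecordsReceived(
        consumerConfig,   // consumer Properties already used by the test
        "output-topic",   // hypothetical output topic name
        expectedRecords,  // number of key-value records the assertion waits for
        60_000L);         // raised timeout in milliseconds
```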


@original-brownbear
Member Author

@guozhangwang fyi :)


asfbot commented Mar 13, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/2156/
Test PASSed (JDK 7 and Scala 2.10).


asfbot commented Mar 13, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/2157/
Test PASSed (JDK 8 and Scala 2.12).


asfbot commented Mar 13, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/2159/
Test PASSed (JDK 8 and Scala 2.11).

@guozhangwang
Contributor

@original-brownbear I looked at the issue and here are my findings (cc @mjsax @dguy @enothereska ):

  1. The test runs both with and without caching effects (i.e. different values for cacheSizeBytes), and it only fails as `shouldCountClicksPerRegion[1]`, that is, with caching enabled.

  2. In KAFKA-4359 we removed the overridden commit interval of 1 ms, so we are now using the default of 30000 ms.

  3. On 02/02/2017 I fixed KAFKA-3896, so that we no longer create a new consumer on each iteration of waitUntilMinValuesRecordsReceived, and this issue started to happen around the same time (02/03/2017). Before that fix, each call to waitUntilMinValuesRecordsReceived actually took longer due to the creation/closure of the consumer.

So my conclusion is that the commit interval, which controls when the flush on the cache is triggered, is too close to the wait time (on paper they are the same, 30 seconds, though the wait timeout is actually only a lower bound on how long waitUntilMinValuesRecordsReceived runs).

Hence my suggestion is:

1) Add the line back, but with a reasonable interval, e.g.:

    private static final int COMMIT_INTERVAL = 3000;    // 3 seconds

    streamsConfiguration.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, COMMIT_INTERVAL);

2) When calling waitUntilMinValuesRecordsReceived, instead of just doubling the timeout, use double the commit interval:

    IntegrationTestUtils.waitUntilMinKeyValueRecordsReceived(..., COMMIT_INTERVAL * 2);

By doing this we won't increase the latency of running this integration test: today the whole Streams unit test suite takes about 3-4 minutes, while this integration test alone takes at least 30 seconds with the current setting.
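Put together, the proposed setup would look roughly like the sketch below; names such as streamsConfiguration, consumerConfig, OUTPUT_TOPIC, and expectedOutput are assumed from the test rather than quoted from it:

```java
// Commit (and hence flush the record cache) every 3 seconds instead of the 30s default.
private static final int COMMIT_INTERVAL = 3000;

// In the test configuration:
streamsConfiguration.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, COMMIT_INTERVAL);

// In the assertion: wait up to twice the commit interval, comfortably longer than
// the flush cadence but far below the old 30-second lower bound.
List<KeyValue<String, Long>> received =
        IntegrationTestUtils.waitUntilMinKeyValueRecordsReceived(
                consumerConfig, OUTPUT_TOPIC, expectedOutput.size(), COMMIT_INTERVAL * 2);
```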

@guozhangwang
Contributor

I can verify that if we reduce the wait time in waitUntilMinKeyValueRecordsReceived to 25 seconds (5 seconds less than the commit interval), this issue is very likely to be reproduced locally, and the above proposal does resolve it.
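In other words (an illustrative call with assumed names, not a quote from the test), any wait shorter than the commit interval races the cache flush:

```java
// With the default 30s commit interval still in effect, a 25s wait frequently
// expires before the final cache flush, reproducing the flaky failure.
IntegrationTestUtils.waitUntilMinKeyValueRecordsReceived(
        consumerConfig, OUTPUT_TOPIC, expectedOutput.size(), 25_000L);
```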

In addition, there are a bunch of other integration tests that have the same issue.

@guozhangwang
Contributor

I have created an alternative PR #2682 for this issue. Verified that local runs of the unit tests are shortened by about 30-45 seconds (4 GB memory).

@original-brownbear
Member Author

@guozhangwang nice, your PR is what I was looking for when I asked where those 30s came from but couldn't find it :)
So I'll just close this one, right?

@guozhangwang
Contributor

@original-brownbear Could you review the other PR, and if you feel it covers the purpose of this one, please feel free to close it.

@original-brownbear
Member Author

@guozhangwang I already reviewed it, and it looks like a much cleaner solution to me. Closing this one, thanks!
