
KAFKA-5281: System tests for transactions #3149

Closed

Conversation

@apurvam (Contributor) commented May 26, 2017

No description provided.

@asfbot commented May 26, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/4406/
Test FAILed (JDK 8 and Scala 2.12).

@asfbot commented May 26, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.11/4420/
Test FAILed (JDK 7 and Scala 2.11).

@asfbot commented May 26, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/4412/
Test FAILed (JDK 8 and Scala 2.12).

@asfbot commented May 26, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.11/4426/
Test FAILed (JDK 7 and Scala 2.11).

self.num_brokers = 3

# Test parameters
self.num_input_partitions = 1
Review comment (Contributor):
What is the reason for using a single input partition?

apurvam (Contributor, Author) replied:
It's temporary. The idea is to have 2 parallel copy jobs so that we have 2 inflight transactions at any given time. So each needs to process a single partition to ensure no overlap in the copied data.

Right now, we have only a single copy job, so there is only one input partition.
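The no-overlap constraint described above can be sketched as a disjoint partition assignment: each copy job owns its own input partition(s), so two inflight transactions never copy the same data. This is a minimal sketch; `assign_partitions` and its parameters are hypothetical, not the actual test code.

```python
# Hypothetical sketch: give each concurrent copy job an exclusive set of
# input partitions so their transactions never copy overlapping data.
def assign_partitions(num_copiers, num_input_partitions):
    """Return a disjoint round-robin assignment, one partition list per copier."""
    assert num_input_partitions >= num_copiers, \
        "each copier needs at least one exclusive partition"
    assignment = [[] for _ in range(num_copiers)]
    for p in range(num_input_partitions):
        assignment[p % num_copiers].append(p)
    return assignment
```

With a single copy job and one input partition this degenerates to `[[0]]`, matching the current test setup; two copiers over two partitions would get `[[0], [1]]`.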

@asfbot commented May 26, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.11/4457/
Test PASSed (JDK 7 and Scala 2.11).

@asfbot commented May 26, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/4443/
Test PASSed (JDK 8 and Scala 2.12).

@asfbot commented May 27, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/4455/
Test FAILed (JDK 8 and Scala 2.12).

@asfbot commented May 27, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.11/4469/
Test FAILed (JDK 7 and Scala 2.11).

@asfbot commented May 27, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.11/4478/
Test PASSed (JDK 7 and Scala 2.11).

@asfbot commented May 27, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/4464/
Test PASSed (JDK 8 and Scala 2.12).

@apurvam (Contributor, Author) commented May 27, 2017

@asfbot commented May 27, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.11/4496/
Test PASSed (JDK 7 and Scala 2.11).

@asfbot commented May 27, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/4481/
Test FAILed (JDK 8 and Scala 2.12).

@apurvam (Contributor, Author) commented May 28, 2017

The hard bounce test seems to have revealed at least one bug: https://issues.apache.org/jira/browse/KAFKA-5339

@asfbot commented May 28, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.11/4500/
Test PASSed (JDK 7 and Scala 2.11).

@asfbot commented May 28, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/4485/
Test PASSed (JDK 8 and Scala 2.12).

@apurvam force-pushed the KAFKA-5281-transactions-system-tests branch from 4df0f6b to e26e872 on May 30, 2017.
@asfbot commented May 30, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/4554/
Test PASSed (JDK 8 and Scala 2.12).

@asfbot commented May 30, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.11/4569/
Test PASSed (JDK 7 and Scala 2.11).

@apurvam (Contributor, Author) commented May 30, 2017

A quick update:

The tests revealed a bug: the TxnOffsetCommitResponse could return a REQUEST_TIMED_OUT error, which is retriable. The client did not treat it as retriable; instead it considered the error fatal and put the transaction manager into an error state.

The TransactionalMessageCopier would then get a KafkaException from the next transactional method it invoked. It would catch the exception, log it to stderr, and exit, so the copy process never finished.

Unfortunately, due to an oversight in the Python test code, we never collected the stderr stream from the message copier, so there was nothing suspicious in the collected logs. I only discovered this because, very occasionally, the background sender thread would run before the process shut down, detect that the transaction manager was in an error state, and log it. Since we did collect the transactional_message_copier logs (i.e. the actual producer logs), that is how I was alerted to the issue.

I have now updated the code to handle the REQUEST_TIMED_OUT error on the TxnOffsetCommitRequest (just as the ConsumerCoordinator does for the regular OffsetCommitRequest), and to collect the stderr log from the TransactionalMessageCopier. My testing so far shows the test is stable after these changes.
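The shape of the fix, treating REQUEST_TIMED_OUT as retriable rather than fatal, can be sketched as follows. This is an illustrative Python model only: the real logic lives in the Java client's transaction manager, and all names here are hypothetical.

```python
import time

# Illustrative error classification: which error codes warrant a retry
# versus putting the transaction manager into a fatal error state.
RETRIABLE_ERRORS = {"REQUEST_TIMED_OUT", "COORDINATOR_LOAD_IN_PROGRESS"}

def handle_txn_offset_commit(send_request, max_retries=5, backoff_s=0.1):
    """Retry the offset commit on retriable errors instead of failing
    the whole transaction; `send_request` returns an error name or None."""
    for _ in range(max_retries):
        error = send_request()
        if error is None:
            return "COMMITTED"
        if error in RETRIABLE_ERRORS:
            time.sleep(backoff_s)  # back off, then re-send the request
            continue
        return "FATAL"  # non-retriable: transition to the error state
    return "FATAL"  # retries exhausted
```

The pre-fix behavior corresponds to classifying REQUEST_TIMED_OUT outside `RETRIABLE_ERRORS`, which is why a single timed-out offset commit permanently wedged the copier.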

@asfbot commented May 30, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.11/4580/
Test PASSed (JDK 7 and Scala 2.11).

@asfbot commented May 30, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/4565/
Test PASSed (JDK 8 and Scala 2.12).

return self.message_copy_finished

def progress_percent(self):
if self.remaining < 0:

Review comment (Contributor):

You might need the lock here, since this accesses mutable variables.
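A lock-guarded accessor along the lines the comment suggests might look like this. It is a sketch with illustrative attribute names, assuming the copier mutates `remaining` from another thread.

```python
import threading

class CopierProgress:
    """Sketch of thread-safe progress tracking for a message copier."""

    def __init__(self, max_messages):
        self.lock = threading.Lock()
        self.max_messages = max_messages
        self.remaining = max_messages

    def record_copied(self, n):
        with self.lock:
            self.remaining -= n

    def progress_percent(self):
        # Take the lock: `remaining` is mutated by the copier thread.
        with self.lock:
            if self.remaining < 0:
                return 100.0
            return 100.0 * (self.max_messages - self.remaining) / self.max_messages
```

Without the lock, `progress_percent` could read `remaining` mid-update and report a stale or inconsistent value.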

remainingMessages.set(maxMessages - numMessagesProcessed.addAndGet(messagesInCurrentTransaction));
}
} catch (Exception e) {
e.printStackTrace();

Review comment (Contributor):
In VerifiableProducer and VerifiableConsumer, we pass uncaught exceptions to the argument parser's handleError. It would be nice to be consistent.

apurvam (Contributor, Author) replied:
Isn't that just for parsing arguments though?

Reviewer replied:
Yeah, that might be right. Looks like we just let application errors bubble up. Guess we could do the same here since the exception would get printed anyway. Up to you.


@cluster(num_nodes=8)
@matrix(failure_mode=["clean_bounce", "hard_bounce"],
bounce_target=["brokers", "clients"])
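For reference, ducktape's @matrix decorator runs the test once per combination in the cross product of its keyword arguments, so the values above yield four runs. A minimal sketch of that expansion (the helper name is illustrative, not ducktape's implementation):

```python
from itertools import product

def expand_matrix(**params):
    """Expand keyword args of lists into one dict per combination,
    mirroring how a @matrix-style decorator parameterizes a test."""
    keys = sorted(params)
    return [dict(zip(keys, values))
            for values in product(*(params[k] for k in keys))]

combos = expand_matrix(failure_mode=["clean_bounce", "hard_bounce"],
                       bounce_target=["brokers", "clients"])
# four combinations: each failure_mode paired with each bounce_target
```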

Review comment (Contributor):

Are we going to disable the brokers bounce tests?

apurvam (Contributor, Author) replied:

I don't think so: with the bug fix for KAFKA-5351, they seem to be quite stable.

@apurvam (Contributor, Author) commented May 31, 2017

The latest full system test suite passed, except for the log appender test, which I think is unrelated:

http://testing.confluent.io/confluent-kafka-branch-builder-system-test-results/?prefix=2017-05-31--001.1496235437--apurvam--KAFKA-5281-transactions-system-tests--d7551e9/

@apurvam (Contributor, Author) commented May 31, 2017

The bounce test failures are being fixed in #3184.

@apurvam (Contributor, Author) commented Jun 1, 2017

The bounce tests seem to be stable after incorporating #3184.

@asfbot commented Jun 1, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/4687/
Test PASSed (JDK 8 and Scala 2.12).

@asfbot commented Jun 1, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.11/4703/
Test PASSed (JDK 7 and Scala 2.11).

@hachikuji left a review comment:

Thanks for the tests and fixes. LGTM.

asfgit pushed a commit that referenced this pull request Jun 1, 2017
Author: Apurva Mehta <apurva@confluent.io>

Reviewers: Jason Gustafson <jason@confluent.io>

Closes #3149 from apurvam/KAFKA-5281-transactions-system-tests

(cherry picked from commit 1959835)
Signed-off-by: Jason Gustafson <jason@confluent.io>
@asfgit closed this in 1959835 on Jun 1, 2017.