Consumer benchmark test for paused partitions #7221

Open
wants to merge 2 commits into
base: trunk
from

@seglo
Contributor

commented Aug 19, 2019

For details about this new Kafka Consumer benchmark test see Jira issue KAFKA-8814. Original PR and Jira:

To recreate the tests from the Jira issue:

# Run on trunk
TC_PATHS="tests/kafkatest/benchmarks/core/benchmark_test.py::Benchmark.test_consumer_throughput" bash tests/docker/run_tests.sh
# Rebase onto tag 2.3.0
git rebase --onto 2.3.0 trunk
# Run on 2.3.0
TC_PATHS="tests/kafkatest/benchmarks/core/benchmark_test.py::Benchmark.test_consumer_throughput" bash tests/docker/run_tests.sh

@ijuma @hachikuji Please review at your convenience.

@seglo seglo force-pushed the seglo:seglo/KAFKA-8814 branch Aug 19, 2019

@seglo

Contributor Author

commented Aug 19, 2019

@ijuma I don't see benchmark-related test results in the PR-triggered Jenkins build. Is there a benchmark build that can be run with this branch?

@hachikuji
Contributor

left a comment

@seglo Thanks, this is pretty cool. I'm kind of debating whether this is a general enough need that it makes sense to add it to the consumer performance tool. It is definitely useful to understand how pause/resume impacts performance, but it feels a bit too tailored to the consumer API. For example, we resume immediately after each poll rather than having a pause duration or something like that. We could also try to tie it to the data more closely. I think in streams, we use the pause API to control the maximum time lag between different partitions. Would it make sense to do something similar so that the benchmark could be more realistic?

@seglo

Contributor Author

commented Aug 20, 2019

@hachikuji Thanks for the reply. When I first started exploring how to benchmark this work I had some reservations about modifying the consumer performance tool as well. It makes sense that the existing benchmarks use this tool, but it does place limits on the types of consumer scenarios that can be tested.

There does seem to be precedent for modifying the tools for system testing. Some of the apps in org.apache.kafka.tools appear to exist just for this purpose (VerifiableConsumer, VerifiableLog4jAppender, VerifiableProducer). In TransactionalMessageCopier there's an argument called --enable-random-aborts which is only used for testing:

Whether or not to enable random transaction aborts (for system testing)

I like your idea about testing how the partition pauses affect Kafka Streams, but I'm not very familiar with the use case or whether this fix has much impact on it. I can speak to how the Alpakka Kafka project will benefit from this fix. The consumer Source (which contains a Kafka Consumer) will always poll on a set interval, but it pauses partitions when there is no demand for records downstream (via Akka Streams back pressure). The source will still poll regularly to handle any offset commit acknowledgements that might be outstanding, but this would cause the consumer to throw away pre-fetched data when partitions are paused due to back pressure.
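To make the pattern described above concrete, here is a minimal toy model in Python. It is only a sketch and not Kafka's consumer code or the benchmark in this PR: `ToyConsumer`, `keep_paused_prefetch`, and `run` are hypothetical names, and the single-partition, fixed-batch behaviour is an assumption made for illustration. It counts how many records are transferred from the "broker" when a poll against a paused partition either discards or retains the data that was already pre-fetched.

```python
from collections import deque


class ToyConsumer:
    """Toy model of a consumer with a per-partition prefetch buffer.

    Hypothetical illustration only; this is not how KafkaConsumer is implemented.
    """

    def __init__(self, log, fetch_size, keep_paused_prefetch):
        self.log = log                                # partition -> list of records
        self.fetch_size = fetch_size
        self.keep_paused_prefetch = keep_paused_prefetch
        self.position = {p: 0 for p in log}           # next offset handed to the caller
        self.buffer = {p: deque() for p in log}       # fetched but not yet returned
        self.paused = set()
        self.records_fetched = 0                      # records transferred from the "broker"

    def pause(self, partitions):
        self.paused |= set(partitions)

    def resume(self, partitions):
        self.paused -= set(partitions)

    def _fetch(self, p):
        # Fetch the next batch that follows whatever is already buffered.
        start = self.position[p] + len(self.buffer[p])
        batch = self.log[p][start:start + self.fetch_size]
        self.buffer[p].extend(batch)
        self.records_fetched += len(batch)

    def poll(self):
        returned = []
        for p in self.log:
            if p in self.paused:
                if not self.keep_paused_prefetch:
                    self.buffer[p].clear()            # pre-fetched data is thrown away
                continue
            if not self.buffer[p]:
                self._fetch(p)                        # nothing buffered: fetch now
            returned.extend(self.buffer[p])
            self.position[p] += len(self.buffer[p])
            self.buffer[p].clear()
            self._fetch(p)                            # pipeline the next batch
        return returned


def run(keep_paused_prefetch, num_records=100, fetch_size=10):
    consumer = ToyConsumer({0: list(range(num_records))}, fetch_size, keep_paused_prefetch)
    consumed = []
    while len(consumed) < num_records:
        consumed.extend(consumer.poll())
        consumer.pause([0])
        consumer.poll()                               # poll while back-pressured
        consumer.resume([0])
    return consumed, consumer.records_fetched


if __name__ == "__main__":
    for keep in (False, True):
        consumed, fetched = run(keep)
        print(f"keep_paused_prefetch={keep}: consumed={len(consumed)} fetched={fetched}")
```

In this model, discarding the buffer on every paused poll nearly doubles the number of records transferred for the same records consumed, which is the kind of overhead the pause/resume benchmark is meant to surface.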

IIRC the original issue was reported by LinkedIn WRT how Samza pauses partitions during its operation, but I'm not familiar with that use case either. I think there's value in demonstrating the performance gain with a low level test like this one because it's simpler to understand, but I agree that maybe it should avoid modifying the consumer performance tool.

Perhaps I could modify VerifiableConsumer instead to support this use case since it's only used for system testing? I could also create a new tool.

@seglo seglo force-pushed the seglo:seglo/KAFKA-8814 branch Aug 24, 2019

@seglo seglo force-pushed the seglo:seglo/KAFKA-8814 branch to c895775 Sep 2, 2019

@seglo

Contributor Author

commented Sep 2, 2019

I looked at Kafka Streams partition pausing use cases, but I'm not sure how to use Kafka Streams in a way that would trigger lots of partition pause/resumes to demonstrate the issue like I have in this PR, or with external projects that use the KafkaConsumer. @mjsax @guozhangwang Do you have any ideas on how to structure a Kafka Streams perf test that would demonstrate the performance improvement from #6988 ?

I looked at org.apache.kafka.tools.VerifiableConsumer. It could be modified to support partition pausing like I've done with ConsumerPerformance, but it doesn't feel like an appropriate place to add it since it is generally used to assert consumer state rather than performance.

I considered making a copy of ConsumerPerformance and stripping it down to only support partition pausing so that it's not exposed to end users through kafka-consumer-perf-test.sh, but this wouldn't be a very DRY implementation.

I think there is precedent for modifying the public-facing perf tools for system tests, as I mentioned in this comment: #7221 (comment)

@ijuma Do you have any suggestions?
