[SPARK-56543] Add RTM stateless benchmark #55420
jerrypeng wants to merge 4 commits into apache:master
 * [[RealTimeTrigger]]. After the run it reports e2e latency percentiles.
 *
 * This benchmark intentionally runs a real local-cluster and a live Kafka broker, so it
 * is slow and is not included in the default test run. Run it explicitly when measuring
The comment claims the benchmark is "not included in the default test run", but the suite is a KafkaSourceTest, so I suppose CI will run it?
val success = new AtomicLong(0)

new Timer().scheduleAtFixedRate(
Don't we need to stop/cancel this timer?
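To make the concern concrete, here is a minimal sketch of keeping a reference to the timer and cancelling it in a `finally` block. It is not the code from this PR: `TimerCleanupSketch`, `runTicks`, and the timer name are hypothetical, and the tick counter stands in for the real data generation work.

```scala
import java.util.{Timer, TimerTask}
import java.util.concurrent.atomic.AtomicLong

// Hypothetical helper, not part of the PR: shows the cancel-in-finally pattern.
object TimerCleanupSketch {
  // Runs a fixed-rate task for roughly `durationMs`, then always cancels the timer.
  def runTicks(periodMs: Long, durationMs: Long): Long = {
    val ticks = new AtomicLong(0)
    // Keep a reference so the timer can be cancelled; daemon = true so a
    // missed cancel cannot keep the JVM alive.
    val timer = new Timer("datagen-timer", true)
    try {
      timer.scheduleAtFixedRate(new TimerTask {
        override def run(): Unit = {
          ticks.incrementAndGet()
          ()
        }
      }, 0L, periodMs)
      Thread.sleep(durationMs)
    } finally {
      // Discards scheduled tasks and lets the timer thread terminate.
      timer.cancel()
    }
    ticks.get()
  }
}
```

With `new Timer().scheduleAtFixedRate(...)` the timer is never bound to a name, so nothing can call `cancel()` on it later and its non-daemon thread outlives the test.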
}
})

latch.await()
We should probably set a timeout. If it times out, we should stop the query, stop the generator, and throw the query exception.
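A rough sketch of what a bounded wait could look like, using only `java.util.concurrent`. The names `BoundedAwaitSketch` and `awaitOrFail` are hypothetical; in the suite the `cleanup` block would presumably stop the streaming query and the generator, and the thrown exception could carry the query's own failure instead of a plain `TimeoutException`.

```scala
import java.util.concurrent.{CountDownLatch, TimeUnit, TimeoutException}

// Hypothetical helper, not part of the PR.
object BoundedAwaitSketch {
  // Waits for `latch` for at most `timeoutMs`. On timeout, runs `cleanup`
  // (assumed to stop the query and the generator) and then fails loudly.
  def awaitOrFail(latch: CountDownLatch, timeoutMs: Long)(cleanup: => Unit): Unit = {
    // await(timeout, unit) returns false if the count did not reach zero in time.
    val completed = latch.await(timeoutMs, TimeUnit.MILLISECONDS)
    if (!completed) {
      cleanup
      throw new TimeoutException(s"benchmark did not finish within $timeoutMs ms")
    }
  }
}
```

Unlike the no-argument `latch.await()`, this cannot hang the test run indefinitely if the query dies before the expected output arrives.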
@viirya thank you for your review! I have addressed your comments. PTAL.
 * is slow. Run it explicitly when measuring RTM throughput and latency for the stateless path.
 */
class RTMKafkaKafkaBenchmarkSuite
  extends KafkaSourceTest
This extends KafkaSourceTest and is a test suite, so CI will run this benchmark automatically. I think we should make it a program that is run manually.
Added a new annotation that excludes the suite from running automatically on CI. Folks can still run it manually.
getLatencies(longRunningBatchDurationMs, numBatches, outputTopic)
}

private def genData(url: String, topicName: String, throughput: Long): Unit = {
genData now cancels the timer and closes the producer in finally, which is good, but the caller only invokes dataGenThread.interrupt() and then continues without waiting for that cleanup to finish. Since this suite mixes in ThreadAudit, the test may finish while the generator thread is still unwinding, sleeping, or blocked in producer.close(), which can lead to flaky thread-leak failures. Please join the thread with a bounded timeout after interrupting it, and ideally put query shutdown plus generator stop/join in an outer try/finally so cleanup also happens if await is interrupted or another exception is thrown.
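The shape I have in mind is roughly the following sketch. It is illustrative only: `GeneratorShutdownSketch`, `stopAndJoin`, and `withGenerator` are hypothetical names, `stopQuery` stands in for whatever stops the streaming query, and the generator is assumed to exit when interrupted.

```scala
// Hypothetical helper, not part of the PR.
object GeneratorShutdownSketch {
  // Interrupts `worker` and waits up to `joinTimeoutMs` for it to exit.
  // Returns true if the thread has terminated by the time we return.
  def stopAndJoin(worker: Thread, joinTimeoutMs: Long): Boolean = {
    worker.interrupt()
    worker.join(joinTimeoutMs)
    !worker.isAlive
  }

  // Starts the generator, runs `body`, then always stops the query and the
  // generator, even if `body` throws or the waiting thread is interrupted.
  def withGenerator[T](worker: Thread, joinTimeoutMs: Long)(stopQuery: () => Unit)(body: => T): T = {
    worker.start()
    try {
      body
    } finally {
      try {
        stopQuery()
      } finally {
        if (!stopAndJoin(worker, joinTimeoutMs)) {
          // Report rather than throw, so an exception from `body` is not masked.
          System.err.println(
            s"generator thread '${worker.getName}' still alive after $joinTimeoutMs ms")
        }
      }
    }
  }
}
```

The bounded `join` is what keeps ThreadAudit from observing a generator thread that is still unwinding when the test returns.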
@viirya thank you for your review! I have addressed your comments. PTAL.
What changes were proposed in this pull request?
Adds RTMKafkaKafkaBenchmarkSuite, a stateless end-to-end benchmark for the Real-Time Mode (RTM) trigger in Structured Streaming.
The benchmark:
timestamp.
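For readers unfamiliar with how latency percentiles are typically derived, here is a minimal sketch using the nearest-rank method over a list of per-record latencies in milliseconds. This is an assumption for illustration, not necessarily how the suite computes them; `LatencyPercentilesSketch`, `percentile`, and `summarize` are hypothetical names.

```scala
// Hypothetical helper, not part of the PR.
object LatencyPercentilesSketch {
  // Nearest-rank percentile: the smallest sample such that at least
  // p percent of all samples are less than or equal to it.
  def percentile(sortedMs: IndexedSeq[Long], p: Double): Long = {
    require(sortedMs.nonEmpty, "need at least one latency sample")
    require(p > 0.0 && p <= 100.0, "p must be in (0, 100]")
    val rank = math.ceil(p / 100.0 * sortedMs.length).toInt
    sortedMs(math.max(rank, 1) - 1)
  }

  // Sorts the raw latencies once and evaluates each requested percentile.
  def summarize(
      latenciesMs: Seq[Long],
      ps: Seq[Double] = Seq(50.0, 90.0, 99.0)): Map[Double, Long] = {
    val sorted = latenciesMs.sorted.toIndexedSeq
    ps.map(p => p -> percentile(sorted, p)).toMap
  }
}
```

Each latency would be the difference between the time a record is observed on the output topic and the timestamp attached to it at the source.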
Why are the changes needed?
There are currently no benchmarks that measure RTM stateless Kafka-to-Kafka latency. This makes it hard to quantify regressions or improvements to the RTM code path in CI or local development. This benchmark provides a repeatable, self-contained way to measure that.
Does this PR introduce any user-facing change?
no
How was this patch tested?
This is a benchmark-only test suite. The suite was manually verified to compile and initialize correctly against the current codebase.
To run it explicitly:
build/sbt "sql-kafka-0-10/testOnly *RTMKafkaKafkaBenchmarkSuite"
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Sonnet 4.6 (claude-sonnet-4-6)