
[SPARK-56543] Add RTM stateless benchmark#55420

Open
jerrypeng wants to merge 4 commits into apache:master from jerrypeng:SPARK-56543

Conversation

@jerrypeng
Contributor

@jerrypeng jerrypeng commented Apr 20, 2026

What changes were proposed in this pull request?

Adds RTMKafkaKafkaBenchmarkSuite, a stateless end-to-end benchmark for the Real-Time Mode (RTM) trigger in Structured Streaming.

The benchmark:

  1. Spins up a local-cluster Spark context (local-cluster[3, 5, 1024]) and a live embedded Kafka broker.
  2. Generates synthetic records at 1,000 records/sec into an input Kafka topic (5 partitions).
  3. Runs a stateless pipeline with RealTimeTrigger: reads from Kafka → base64-encodes the value → stamps a source-timestamp header → writes to an output Kafka topic.
  4. Captures per-batch processing latency via Spark's observe() API.
  5. After N batches complete, reads back the output topic and reports e2e latency percentiles (p0, p50, p90, p95, p99, p100) by comparing the source-timestamp header to the Kafka sink timestamp.
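The percentile report in step 5 can be sketched in plain Scala over the collected per-record latencies. This is a minimal sketch using a nearest-rank percentile; the object and method names here are illustrative, not the PR's actual identifiers:

```scala
// Sketch of the step-5 latency report: given per-record e2e latencies in ms
// (Kafka sink timestamp minus the source-timestamp header), compute the
// reported percentiles. Names are illustrative, not from the PR.
object LatencyReport {
  // Nearest-rank percentile over a sorted copy of the samples.
  def percentile(latenciesMs: Seq[Long], p: Double): Long = {
    require(latenciesMs.nonEmpty, "no latency samples collected")
    val sorted = latenciesMs.sorted
    val rank = math.ceil(p / 100.0 * sorted.length).toInt
    sorted(math.max(rank - 1, 0))
  }

  // p0/p50/p90/p95/p99/p100, matching the percentiles the PR reports.
  def report(latenciesMs: Seq[Long]): Map[String, Long] =
    Seq(0.0, 50.0, 90.0, 95.0, 99.0, 100.0)
      .map(p => s"p${p.toInt}" -> percentile(latenciesMs, p))
      .toMap
}
```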

Why are the changes needed?

There are currently no benchmarks that measure RTM stateless Kafka-to-Kafka latency, which makes it hard to quantify regressions or improvements to the RTM code path in CI or local development. This benchmark provides a repeatable, self-contained way to measure it.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

This is a benchmark-only test suite. The suite was manually verified to compile and initialize correctly against the current codebase.

To run it explicitly:

```shell
build/sbt "sql-kafka-0-10/testOnly *RTMKafkaKafkaBenchmarkSuite"
```

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Sonnet 4.6 (claude-sonnet-4-6)

```scala
 * [[RealTimeTrigger]]. After the run it reports e2e latency percentiles.
 *
 * This benchmark intentionally runs a real local-cluster and a live Kafka broker, so it
 * is slow and is not included in the default test run. Run it explicitly when measuring
```
Member


This claims it is "not included in the default test run", but the suite extends KafkaSourceTest, so I suppose CI will run it?


```scala
val success = new AtomicLong(0)

new Timer().scheduleAtFixedRate(
```
Member


Don't we need to stop/cancel this timer?
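For reference, the cancellation the reviewer asks about can be sketched with `java.util.Timer.cancel()` in a `finally` block, so the generator stops even if the benchmark body throws. This is a minimal sketch; the thread name, tick period, and latch count are illustrative, not from the PR:

```scala
import java.util.{Timer, TimerTask}
import java.util.concurrent.CountDownLatch
import java.util.concurrent.atomic.AtomicLong

// Sketch: always cancel the fixed-rate generator timer, even on failure.
// Names and constants are illustrative, not the PR's actual values.
val success = new AtomicLong(0)
val enough = new CountDownLatch(5)
val timer = new Timer("data-gen", /* isDaemon = */ true)
try {
  timer.scheduleAtFixedRate(new TimerTask {
    override def run(): Unit = {
      success.incrementAndGet() // stand-in for "send one batch of records"
      enough.countDown()
    }
  }, /* delay = */ 0L, /* periodMs = */ 10L)
  enough.await() // wait until at least 5 ticks have fired
} finally {
  timer.cancel() // no further ticks are scheduled after this returns
}
```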

```scala
  }
})

latch.await()
```
Member


We probably should set a timeout. If it timeouts, we should stop query/stop generator and throw query exception.
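The bounded wait being suggested can be sketched with the timed `CountDownLatch.await(timeout, unit)` overload, which returns `false` on timeout. The helper and the two cleanup callbacks below are illustrative stand-ins for the query/generator shutdown, not the PR's actual code:

```scala
import java.util.concurrent.{CountDownLatch, TimeUnit}

// Sketch: bound the wait for N batches; on timeout, tear down and fail loudly.
// The helper name and callbacks are illustrative, not from the PR.
def awaitBatchesOrFail(latch: CountDownLatch, timeoutSec: Long)
                      (stopQuery: () => Unit, stopGenerator: () => Unit): Unit = {
  val completed = latch.await(timeoutSec, TimeUnit.SECONDS) // false on timeout
  if (!completed) {
    stopQuery()     // illustrative: would call query.stop()
    stopGenerator() // illustrative: would interrupt and join the generator thread
    throw new IllegalStateException(
      s"Benchmark timed out with ${latch.getCount} batches remaining " +
        s"after ${timeoutSec}s")
  }
}
```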

@jerrypeng
Contributor Author

@viirya thank you for your review! I have addressed your comments. PTAL.

```scala
 * is slow. Run it explicitly when measuring RTM throughput and latency for the stateless path.
 */
class RTMKafkaKafkaBenchmarkSuite
  extends KafkaSourceTest
```
Member


This extends KafkaSourceTest and is a test suite, so CI will run this benchmark automatically. I think we should make it a program that is run manually.

Contributor Author


Added a new annotation that excludes this suite from running automatically in CI. Folks can still run it manually.

```scala
  getLatencies(longRunningBatchDurationMs, numBatches, outputTopic)
}

private def genData(url: String, topicName: String, throughput: Long): Unit = {
```
Member


genData now cancels the timer and closes the producer in finally, which is good, but the caller only invokes dataGenThread.interrupt() and then continues without waiting for that cleanup to finish. Since this suite mixes in ThreadAudit, the test may finish while the generator thread is still unwinding, sleeping, or blocked in producer.close(), which can lead to flaky thread-leak failures. Please join the thread with a bounded timeout after interrupting it, and ideally put query shutdown plus generator stop/join in an outer try/finally so cleanup also happens if await is interrupted or another exception is thrown.
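The interrupt-then-bounded-join shutdown being requested can be sketched as a small helper. The method name and timeout are illustrative, not the PR's actual identifiers:

```scala
// Sketch: stop the generator thread and wait (bounded) for its cleanup to
// finish, so a thread-audit check does not see a still-unwinding thread.
// Names and the timeout are illustrative, not from the PR.
def stopGenerator(dataGenThread: Thread, joinTimeoutMs: Long = 10000L): Unit = {
  dataGenThread.interrupt()         // wakes sleep/await; finally blocks then run
  dataGenThread.join(joinTimeoutMs) // bounded wait for the thread to unwind
  if (dataGenThread.isAlive) {
    throw new IllegalStateException("data generator did not stop in time")
  }
}
```

In the benchmark body this would be called from an outer `finally`, after stopping the query, so cleanup also happens when the await is interrupted or another exception is thrown.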

Contributor Author


done

@jerrypeng
Contributor Author

@viirya thank you for your review! I have addressed your comments. PTAL.

@jerrypeng jerrypeng requested a review from viirya April 28, 2026 05:09