[SPARK-37458][SS] Remove unnecessary SerializeFromObject from the plan of foreachBatch by HeartSaVioR · Pull Request #34706 · apache/spark

HeartSaVioR · 2021-11-25T03:23:20Z

What changes were proposed in this pull request?

This PR changes the logic in ForeachBatchSink to remove the unnecessary SerializeFromObject node from the plan, by leveraging LogicalRDD instead of ExternalRDD.
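For anyone who wants to see the effect from the user side, here is a minimal sketch (not part of this PR) that prints the plan of each micro-batch Dataset handed to foreachBatch. It uses only public APIs and the built-in rate source; the object name, app name, and timeout are placeholders I picked for illustration. Running it against a build without this change, I would expect a SerializeFromObject node to show up in the printed plan, and not to show up with this change, but I have not pinned down the exact plan output here.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper for illustration only; not part of the PR.
object ForeachBatchPlanCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("foreachBatch-plan-check") // placeholder name
      .getOrCreate()

    // Built-in "rate" source: emits (timestamp, value) rows.
    val input = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "10")
      .load()

    // Declared as a Scala Function2 so the Scala overload of
    // foreachBatch is picked rather than the Java VoidFunction2 one.
    val printPlan: (DataFrame, Long) => Unit = (batch, batchId) => {
      println(s"Plan for batch $batchId:")
      batch.explain() // prints the physical plan of the batch Dataset
    }

    val query = input.writeStream.foreachBatch(printPlan).start()
    query.awaitTermination(10000L) // let a few micro-batches run
    query.stop()
    spark.stop()
  }
}
```

The function is bound to a typed val on purpose: passing an inline lambda to foreachBatch can be ambiguous between the Scala and Java overloads on some Scala versions.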

Why are the changes needed?

This brings a slight performance gain, as Spark no longer performs unnecessary serde on foreachBatch. In addition, the logic is simpler because we defer the encoding logic to the Dataset.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing UTs and new UTs.

HeartSaVioR · 2021-11-25T03:23:26Z

The performance gain is captured via custom benchmark code:

HeartSaVioR@222d7a6

Typed

[info] OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
[info] Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
[info] LogicalRDD vs ExternalRDD:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] LogicalRDD                                          421            437          15         23.8          42.1       1.0X
[info] ExternalRDD                                         851            866          11         11.7          85.1       0.5X

Untyped

[info] OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
[info] Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
[info] LogicalRDD vs ExternalRDD:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] LogicalRDD                                          469            481           9         21.3          46.9       1.0X
[info] ExternalRDD                                         913            929          13         11.0          91.3       0.5X

The original DataFrame is a typed one and we don't change the type during the transformation, so it's not odd that the untyped case could be slower.

HeartSaVioR · 2021-11-25T03:24:50Z

I didn't add the benchmark code to this PR since the code is quite specific to capturing "before vs after" for this PR. Please let me know if we'd like to add the benchmark in any way.

SparkQA · 2021-11-25T05:53:53Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50085/

SparkQA · 2021-11-25T06:42:04Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50085/

HeartSaVioR · 2021-11-25T06:58:41Z

cc. @tdas @zsxwing @viirya @xuanyuanking

SparkQA · 2021-11-25T08:32:01Z

Test build #145613 has finished for PR 34706 at commit af961b2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HeartSaVioR · 2021-12-01T03:12:08Z

Friendly reminder, @tdas @zsxwing @viirya @xuanyuanking

dongjoon-hyun

+1, LGTM.

dongjoon-hyun · 2021-12-01T06:54:24Z

cc @cloud-fan, too

HeartSaVioR · 2021-12-01T08:00:08Z

Thanks for reviewing! Let me retrigger the builds and merge once the build passes.

HeartSaVioR · 2021-12-01T08:00:16Z

retest this, please

dongjoon-hyun · 2021-12-01T08:00:53Z

Oops. Sorry, I merged it before seeing your message, @HeartSaVioR .

dongjoon-hyun · 2021-12-01T08:01:15Z

BTW, thank you so much, @HeartSaVioR and @cloud-fan .

HeartSaVioR · 2021-12-01T08:01:44Z

Never mind. I guess there's no possibility that someone else modified the relevant code, so we're good to go.
Thanks for taking care of it!

Don't add SerializeFromObjectExec for foreachBatch

af961b2

github-actions bot added the SQL and STRUCTURED STREAMING labels on Nov 25, 2021

dongjoon-hyun approved these changes Dec 1, 2021


cloud-fan approved these changes Dec 1, 2021


dongjoon-hyun closed this in 3d9c588 Dec 1, 2021
