Skip to content

[SPARK-37458][SS] Remove unnecessary SerializeFromObject from the plan of foreachBatch#34706

Closed
HeartSaVioR wants to merge 1 commit intoapache:masterfrom
HeartSaVioR:SPARK-37458
Closed

[SPARK-37458][SS] Remove unnecessary SerializeFromObject from the plan of foreachBatch#34706
HeartSaVioR wants to merge 1 commit intoapache:masterfrom
HeartSaVioR:SPARK-37458

Conversation

@HeartSaVioR
Copy link
Contributor

What changes were proposed in this pull request?

This PR proposes to change the logic on ForeachBatchSink to remove unnecessary SerializeFromObject, via leveraging LogicalRDD instead of ExternalRDD.

Why are the changes needed?

This brings slight performance gain as Spark no longer does unnecessary serde on foreachBatch. In addition, the logic is simpler as we defer the encoding logic to the Dataset.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing UT, and new UTs.

@HeartSaVioR
Copy link
Contributor Author

The performance gain is captured via custom benchmark code:

HeartSaVioR@222d7a6

Typed

[info] OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
[info] Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
[info] LogicalRDD vs ExternalRDD:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] LogicalRDD                                          421            437          15         23.8          42.1       1.0X
[info] ExternalRDD                                         851            866          11         11.7          85.1       0.5X

Untyped

[info] OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
[info] Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
[info] LogicalRDD vs ExternalRDD:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] LogicalRDD                                          469            481           9         21.3          46.9       1.0X
[info] ExternalRDD                                         913            929          13         11.0          91.3       0.5X

The origin DataFrame is typed one and we don't change the type during transformation, so it's not odd untyped case could be slower.

@HeartSaVioR
Copy link
Contributor Author

I didn't add the benchmark code to this PR since the code is quite specific to capture "before vs after" for this PR. Please let me know if we'd like to add the benchmark in any way.

@SparkQA
Copy link

SparkQA commented Nov 25, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50085/

@SparkQA
Copy link

SparkQA commented Nov 25, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50085/

@HeartSaVioR
Copy link
Contributor Author

cc. @tdas @zsxwing @viirya @xuanyuanking

@SparkQA
Copy link

SparkQA commented Nov 25, 2021

Test build #145613 has finished for PR 34706 at commit af961b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Copy link
Contributor Author

Friendly reminder, @tdas @zsxwing @viirya @xuanyuanking

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM.

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Dec 1, 2021

cc @cloud-fan, too

@HeartSaVioR
Copy link
Contributor Author

Thanks for reviewing! Let me retrigger builds and merge once build passes.

@HeartSaVioR
Copy link
Contributor Author

retest this, please

@dongjoon-hyun
Copy link
Member

Oops. Sorry, I merged it before seeing your message, @HeartSaVioR .

@dongjoon-hyun
Copy link
Member

BTW, thank you so much, @HeartSaVioR and @cloud-fan .

@HeartSaVioR
Copy link
Contributor Author

Never mind. I guess there's no possibility someone else modifies the relevant code, so we're good to go.
Thanks for taking care of!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants