[SPARK-37458][SS] Remove unnecessary SerializeFromObject from the plan of foreachBatch#34706
HeartSaVioR wants to merge 1 commit into apache:master from
Conversation
The performance gain is captured via custom benchmark code:
The original DataFrame is typed and we don't change the type during the transformation, so it is not odd that the untyped case could be slower.
I didn't add the benchmark code to this PR since the code is quite specific to capturing "before vs after" for this PR. Please let me know if we'd like to add the benchmark in any way.
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #145613 has finished for PR 34706 at commit
Friendly reminder, @tdas @zsxwing @viirya @xuanyuanking
cc @cloud-fan, too
Thanks for reviewing! Let me retrigger the builds and merge once the build passes.
retest this, please
Oops. Sorry, I merged it before seeing your message, @HeartSaVioR.
BTW, thank you so much, @HeartSaVioR and @cloud-fan.
Never mind. I guess there's no possibility that someone else modifies the relevant code, so we're good to go.
What changes were proposed in this pull request?
This PR proposes to change the logic in ForeachBatchSink to remove the unnecessary SerializeFromObject node by leveraging LogicalRDD instead of ExternalRDD.
Why are the changes needed?
This brings a slight performance gain, as Spark no longer performs unnecessary serde in foreachBatch. In addition, the logic is simpler because we defer the encoding logic to the Dataset.
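To illustrate why dropping the extra serialization step helps, here is a toy model in plain Python. It is not Spark code: `Event`, `Encoder`, `batch_via_objects`, and `batch_via_rows` are hypothetical names, and the two paths are only a simplified picture of handing a batch over as an RDD of objects (which then has to be serialized back into rows) versus handing over the internal rows directly and leaving encoding to the Dataset.

```python
from dataclasses import dataclass
from typing import List, Tuple

Row = Tuple[str, int]  # stand-in for an internal row


@dataclass
class Event:
    """Stand-in for the user's typed object."""
    user: str
    count: int


class Encoder:
    """Toy encoder that counts conversions between rows and objects."""

    def __init__(self) -> None:
        self.serialized = 0
        self.deserialized = 0

    def to_row(self, obj: Event) -> Row:
        self.serialized += 1
        return (obj.user, obj.count)

    def from_row(self, row: Row) -> Event:
        self.deserialized += 1
        return Event(*row)


def batch_via_objects(rows: List[Row], enc: Encoder) -> List[Row]:
    """Assumed 'before' shape: rows -> objects -> rows again."""
    objects = [enc.from_row(r) for r in rows]  # build a collection of objects
    return [enc.to_row(o) for o in objects]    # SerializeFromObject-like step


def batch_via_rows(rows: List[Row], enc: Encoder) -> List[Row]:
    """Assumed 'after' shape: pass internal rows through untouched."""
    return list(rows)  # the encoder is only used if the user asks for objects


def untyped_sum(batch: List[Row]) -> int:
    """A user function that only needs column values, never typed objects."""
    return sum(count for _, count in batch)


rows = [("a", 1), ("b", 2), ("c", 3)]
old_enc, new_enc = Encoder(), Encoder()
old_total = untyped_sum(batch_via_objects(rows, old_enc))
new_total = untyped_sum(batch_via_rows(rows, new_enc))
```

Both paths produce the same result, but in this sketch the first one pays one deserialization and one serialization per row before the user function even runs, while the second pays none. Real Spark costs depend on the query and the encoder, so treat the counts as illustrative only.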
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing UTs and new UTs.