[SPARK-50661][CONNECT][SS] Fix Spark Connect Scala foreachBatch impl. to support Dataset[T]. #49323

haiyangsun-db wants to merge 4 commits into apache:master
Conversation

Thank you for the fix! The change LGTM. For test complexity, can you add a custom class test case like here?

I believe this makes a 3.5 Scala client running Scala foreachBatch unable to run against a 4.0 Spark server. But for 3.5, streaming Scala is still under development, so this should be fine. It is worth noting this breaking change somewhere, though. cc @HyukjinKwon

Added a new test case for using a custom class with foreachBatch (as simple as the test case in foreach); probably good enough for now.

Merged to master.
What changes were proposed in this pull request?
This PR fixes an incorrect implementation of Scala streaming foreachBatch when the input dataset is not a DataFrame (but a Dataset[T]) in Spark Connect mode.
Note that this only affects Scala.
In DataStreamWriter: renamed ForeachWriterPacket to something more general that covers both cases.
In SparkConnectPlanner / StreamingForeachBatchHelper: use the serialized encoder to rebuild the typed Dataset[T] on the server before invoking the user function.
Why are the changes needed?
Without the fix, Scala foreachBatch will fail or give wrong results when the input dataset is not a DataFrame.
Below is a simple reproduction:
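The reproduction snippet itself did not survive extraction. A hedged sketch of what such a reproduction could look like (the remote URL, the rate source and its options, and the object name are assumptions, not the PR's exact code):

```scala
// Hypothetical reproduction sketch, NOT the PR's exact snippet.
// Assumes a Spark Connect server reachable at sc://localhost.
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}

object ForeachBatchRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().remote("sc://localhost").getOrCreate()

    val query = spark.readStream
      .format("rate")                   // emits (timestamp, value: Long) rows
      .option("rowsPerSecond", "10")
      .load()
      .select("value")
      .as(Encoders.LONG)                // typed Dataset[java.lang.Long], not a DataFrame
      .writeStream
      .foreachBatch { (ds: Dataset[java.lang.Long], batchId: Long) =>
        // With values 0 through 9 in a batch, this prints 45. Without the
        // fix, the function is handed a DataFrame instead of a typed
        // Dataset, so this typed lambda fails at runtime.
        println(ds.collect().map(_.longValue()).sum)
      }
      .start()

    query.awaitTermination(10000)
    spark.stop()
  }
}
```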
The code above should output 45 in the foreachBatch function. Without the fix, the code will fail because the foreachBatch function will be called with a DataFrame object instead of a Dataset[java.lang.Long].
Does this PR introduce any user-facing change?
Yes, this PR includes fixes to the Spark Connect client around the foreachBatch API (the client now serializes the dataset's encoder together with the foreachBatch function).
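The idea behind shipping the encoder with the function can be illustrated with a small standalone sketch. The names FnPacket, Row, and runBatch below are invented for illustration and are not Spark internals: the point is that the payload pairs the user's typed function with a decoder for its input type, so the receiving side can rebuild typed values instead of handing over untyped rows.

```scala
// Illustrative sketch only: FnPacket and runBatch are invented names,
// not Spark internals.
object EncoderPacketSketch {
  // Untyped "row" as it would travel over the wire.
  type Row = Seq[Any]

  // Packet bundling the user function with a decoder from Row to T,
  // analogous to serializing the encoder alongside the function.
  final case class FnPacket[T](fn: Seq[T] => Unit, decode: Row => T)

  // Receiving side: it only has untyped rows, but the packet's decoder
  // lets it rebuild Seq[T] before calling the user function.
  def runBatch[T](packet: FnPacket[T], rows: Seq[Row]): Unit =
    packet.fn(rows.map(packet.decode))

  def main(args: Array[String]): Unit = {
    val rows: Seq[Row] = (0L until 10L).map(v => Seq(v))
    var result = 0L
    val packet = FnPacket[Long](
      values => result = values.sum,        // user's typed function
      row => row.head.asInstanceOf[Long]    // decoder standing in for the encoder
    )
    runBatch(packet, rows)
    println(result) // 45
  }
}
```

Without the decoder in the packet, the receiving side could only invoke the function on the raw untyped rows, which is the failure mode described above.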
How was this patch tested?
Was this patch authored or co-authored using generative AI tooling?
No.