[SPARK-39840][SQL][PYTHON] Factor PythonArrowInput out as a symmetry to PythonArrowOutput#37253
[SPARK-39840][SQL][PYTHON] Factor PythonArrowInput out as a symmetry to PythonArrowOutput#37253HyukjinKwon wants to merge 3 commits intoapache:masterfrom
Conversation
|
BTW, this is a base work for the support of arbitrary stateful processing in Structured Streaming with Python ( cc @HeartSaVioR @ueshin @viirya any review would be appreciated. |
c6ea53c to
a6c59df
Compare
| protected def handleMetadataBeforeExec(stream: DataOutputStream): Unit = { | ||
| // Write config for the worker as a number of key -> value pairs of strings | ||
| stream.writeInt(workerConf.size) | ||
| for ((k, v) <- workerConf) { | ||
| PythonRDD.writeUTF(k, stream) | ||
| PythonRDD.writeUTF(v, stream) | ||
| } | ||
| } |
There was a problem hiding this comment.
For the Dataset.groupByKey().flatMapGroupsWithState, you are planning to override this method?
In that case, I guess we should implement this in ArrowPythonRunner and leave this empty the same as PythonArrowOutput.handleMetadataAfterExec?
There was a problem hiding this comment.
Maybe yes .. but my thought is that this configuration passing applies to all Arrow specific executions so we can share by calling super.. Here's the draft version I am working on: master...HeartSaVioR:spark:WIP-flatmapgroupswithstate-pyspark (see ArrowPythonRunnerWithState).
|
|
||
| protected val timeZoneId: String | ||
|
|
||
| protected def handleMetadataBeforeExec(stream: DataOutputStream): Unit = { |
There was a problem hiding this comment.
Btw I'm open to other suggestions about naming ..
sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonArrowInput.scala
Outdated
Show resolved
Hide resolved
| protected def handleMetadataAfterExec(stream: DataInputStream): Unit = { } | ||
|
|
There was a problem hiding this comment.
By default we don't use it. I don't see it is used in other place too. Do you have any plan for it?
There was a problem hiding this comment.
Oh yeah I do, see master...HeartSaVioR:spark:WIP-flatmapgroupswithstate-pyspark
|
Will merge this in few days if there are no more comments .. I believe this refactoring is pretty much consistent with the current code base, structure and hierarchy (also given the symmetry). |
|
Thank you @viirya !!!! |
|
Merged to master. |
What changes were proposed in this pull request?
This PR factors the Arrow input code path out as
PythonArrowInputas symmetry toPythonArrowOutput. The current hierarchy is not affected:In addition, this PR also factors out
handleMetadataAfterExecandhandleMetadataBeforeExecwhich contains the logic to send and receive the metadata such as runtime configurations specific to Arrow in/out.Why are the changes needed?
40485f4 factored
PythonArrowOutputout. It's better to factorPythonArrowInputout too to be consistentDoes this PR introduce any user-facing change?
No, this is refactoring.
How was this patch tested?
Existing test cases should cover.