[SPARK-39840][SQL][PYTHON] Factor PythonArrowInput out as a symmetry to PythonArrowOutput by HyukjinKwon · Pull Request #37253 · apache/spark

HyukjinKwon · 2022-07-22T09:56:27Z

What changes were proposed in this pull request?

This PR factors the Arrow input code path out as PythonArrowInput, symmetric to PythonArrowOutput. The current hierarchy is not affected:

    └── BasePythonRunner
        ├── ArrowPythonRunner with PythonArrowOutput with PythonArrowInput
        ├── CoGroupedArrowPythonRunner with PythonArrowOutput
        ├── PythonRunner
        └── PythonUDFRunner

In addition, this PR factors out handleMetadataAfterExec and handleMetadataBeforeExec, which contain the logic to send and receive metadata such as runtime configurations specific to Arrow in/out.
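As a rough illustration of the symmetry described above, the two hooks can be pictured as a pair of traits mixed into one runner. This is a simplified sketch under stated assumptions, not the actual Spark code: `ArrowInputSketch`, `ArrowOutputSketch`, `RunnerSketch`, `sendMetadata`, and the local `writeUTF` helper are hypothetical stand-ins, and the Arrow batch streaming that the real traits handle is omitted.

```scala
import java.io.{ByteArrayOutputStream, DataInputStream, DataOutputStream}
import java.nio.charset.StandardCharsets

object MetadataSketch {
  // Hypothetical stand-in for PythonRDD.writeUTF: a 4-byte length followed by UTF-8 bytes.
  def writeUTF(s: String, out: DataOutputStream): Unit = {
    val bytes = s.getBytes(StandardCharsets.UTF_8)
    out.writeInt(bytes.length)
    out.write(bytes)
  }

  // Input side: metadata sent to the Python worker before execution.
  trait ArrowInputSketch {
    protected val workerConf: Map[String, String]

    protected def handleMetadataBeforeExec(stream: DataOutputStream): Unit = {
      // Write config as a number of key -> value pairs of strings.
      stream.writeInt(workerConf.size)
      for ((k, v) <- workerConf) {
        writeUTF(k, stream)
        writeUTF(v, stream)
      }
    }
  }

  // Output side: metadata received from the worker after execution (no-op by default).
  trait ArrowOutputSketch {
    protected def handleMetadataAfterExec(stream: DataInputStream): Unit = {}
  }

  // A runner that mixes in both sides, mirroring
  // "ArrowPythonRunner with PythonArrowOutput with PythonArrowInput".
  class RunnerSketch(protected val workerConf: Map[String, String])
      extends ArrowInputSketch with ArrowOutputSketch {

    // Hypothetical helper that captures what the input hook writes.
    def sendMetadata(): Array[Byte] = {
      val buf = new ByteArrayOutputStream()
      val out = new DataOutputStream(buf)
      handleMetadataBeforeExec(out)
      out.flush()
      buf.toByteArray
    }
  }
}
```

In this sketch a runner that needs only the output side would simply drop `ArrowInputSketch` from its mixin list, which matches how CoGroupedArrowPythonRunner appears in the hierarchy above.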

Why are the changes needed?

40485f4 factored PythonArrowOutput out. It's better to factor PythonArrowInput out too, to be consistent.

Does this PR introduce any user-facing change?

No, this is refactoring.

How was this patch tested?

Existing test cases should cover this.

HyukjinKwon · 2022-07-22T09:57:29Z

BTW, this is a base work for the support of arbitrary stateful processing in Structured Streaming with Python (Dataset.groupByKey().flatMapGroupsWithState).

cc @HeartSaVioR @ueshin @viirya any review would be appreciated.

ueshin · 2022-07-23T00:55:04Z

sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonArrowInput.scala

+  protected def handleMetadataBeforeExec(stream: DataOutputStream): Unit = {
+    // Write config for the worker as a number of key -> value pairs of strings
+    stream.writeInt(workerConf.size)
+    for ((k, v) <- workerConf) {
+      PythonRDD.writeUTF(k, stream)
+      PythonRDD.writeUTF(v, stream)
+    }
+  }


For Dataset.groupByKey().flatMapGroupsWithState, are you planning to override this method?
In that case, I guess we should implement this in ArrowPythonRunner and leave this empty, the same as PythonArrowOutput.handleMetadataAfterExec?

Maybe yes .. but my thought is that this configuration passing applies to all Arrow-specific executions, so we can share it by calling super. Here's the draft version I am working on: master...HeartSaVioR:spark:WIP-flatmapgroupswithstate-pyspark (see ArrowPythonRunnerWithState).
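The "share by calling super" idea in the reply can be sketched as follows. This is only an illustration under assumptions: `ConfWriter`, `StatefulRunnerSketch`, `metadataBytes`, and the `stateSchemaJson` parameter are hypothetical names, the extra field is invented for the example (the draft branch may send something different), and `DataOutputStream.writeUTF` is used here as a simplification in place of `PythonRDD.writeUTF`.

```scala
import java.io.{ByteArrayOutputStream, DataOutputStream}

object SuperCallSketch {
  // Shared behavior: every Arrow-based runner sends the worker configuration.
  trait ConfWriter {
    protected val workerConf: Map[String, String]

    protected def handleMetadataBeforeExec(stream: DataOutputStream): Unit = {
      stream.writeInt(workerConf.size)
      for ((k, v) <- workerConf) {
        stream.writeUTF(k) // simplification of PythonRDD.writeUTF
        stream.writeUTF(v)
      }
    }
  }

  // Hypothetical stateful runner: reuses the shared config writing via super,
  // then appends its own metadata (an assumed state schema string).
  class StatefulRunnerSketch(
      protected val workerConf: Map[String, String],
      stateSchemaJson: String) extends ConfWriter {

    override protected def handleMetadataBeforeExec(stream: DataOutputStream): Unit = {
      super.handleMetadataBeforeExec(stream)
      stream.writeUTF(stateSchemaJson)
    }

    // Hypothetical helper that captures the bytes the hook writes.
    def metadataBytes(): Array[Byte] = {
      val buf = new ByteArrayOutputStream()
      val out = new DataOutputStream(buf)
      handleMetadataBeforeExec(out)
      out.flush()
      buf.toByteArray
    }
  }
}
```

Keeping the default implementation in the trait means the subclass only describes what is different, at the cost of every implementer inheriting the config-writing behavior whether or not it wants it, which is the trade-off raised in the question above.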

HyukjinKwon · 2022-07-23T01:08:59Z

sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonArrowInput.scala

+
+  protected val timeZoneId: String
+
+  protected def handleMetadataBeforeExec(stream: DataOutputStream): Unit = {


Btw I'm open to other suggestions about naming ..

viirya · 2022-07-23T07:40:14Z

sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonArrowOutput.scala

+  protected def handleMetadataAfterExec(stream: DataInputStream): Unit = { }
+


By default we don't use it, and I don't see it used anywhere else either. Do you have any plan for it?

Oh yeah I do, see master...HeartSaVioR:spark:WIP-flatmapgroupswithstate-pyspark
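To make the purpose of the empty default concrete, here is a minimal sketch of how a subclass might use the output-side hook. Everything in it is assumed for illustration: `OutputHook`, `DefaultReader`, `ExtraFieldReader`, and the extra `Long` field are hypothetical and do not reflect what the linked draft branch actually reads.

```scala
import java.io.{ByteArrayInputStream, DataInputStream}

object AfterExecSketch {
  trait OutputHook {
    // No-op by default, like PythonArrowOutput.handleMetadataAfterExec.
    protected def handleMetadataAfterExec(stream: DataInputStream): Unit = {}
  }

  // Reads a payload (here just an Int) after giving the hook a chance to
  // consume any metadata that precedes it.
  class DefaultReader extends OutputHook {
    def read(bytes: Array[Byte]): Int = {
      val in = new DataInputStream(new ByteArrayInputStream(bytes))
      handleMetadataAfterExec(in)
      in.readInt()
    }
  }

  // Hypothetical subclass that expects one extra Long before the payload.
  class ExtraFieldReader extends DefaultReader {
    var extra: Option[Long] = None

    override protected def handleMetadataAfterExec(stream: DataInputStream): Unit = {
      extra = Some(stream.readLong())
    }
  }
}
```

With the default reader the stream is left untouched before the payload, so existing runners behave as before; only a runner that overrides the hook changes the wire format it expects.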

HyukjinKwon · 2022-07-25T02:49:14Z

Will merge this in a few days if there are no more comments .. I believe this refactoring is consistent with the current code base, structure, and hierarchy (also given the symmetry).

HyukjinKwon · 2022-07-25T03:25:12Z

Thank you @viirya !!!!

HyukjinKwon · 2022-07-25T03:28:06Z

Merged to master.

github-actions bot added CORE PYTHON SQL labels Jul 22, 2022

Factor PythonArrowInput out as a symmetry to PythonArrowOutput

91017c3

HyukjinKwon force-pushed the pyarrow-output-trait branch 2 times, most recently from c6ea53c to a6c59df Compare July 22, 2022 09:58

Remove unused imports

9d3cd84

ueshin reviewed Jul 23, 2022

HyukjinKwon commented Jul 23, 2022

viirya reviewed Jul 23, 2022

viirya reviewed Jul 23, 2022

Fix comments

b4f9167

viirya approved these changes Jul 25, 2022

HyukjinKwon closed this in 2e1467f Jul 25, 2022

HyukjinKwon deleted the pyarrow-output-trait branch January 15, 2024 00:49

HyukjinKwon wants to merge 3 commits into apache:master from HyukjinKwon:pyarrow-output-trait
