
[SPARK-34799][PYTHON][SQL] Return User-defined types from Pandas UDF #31735

Closed · wants to merge 37 commits

Conversation

eddyxu
Member

@eddyxu eddyxu commented Mar 4, 2021

What changes were proposed in this pull request?

This PR allows returning user-defined types (UDTs) from a Pandas UDF.

@pandas_udf(StructType([StructField("vec", ArrayType(VectorUDT()))]))
def array_of_udt_structs(series: pd.Series) -> pd.DataFrame:
    vectors = []
    for _, i in series.items():
        vectors.append({"vec": [DenseVector([i]), DenseVector([i * 2])]})
    return pd.DataFrame(vectors)

# Or

@pandas_udf(ArrayType(VectorUDT()))
def array_of_vectors(series: pd.Series) -> pd.Series:
    vectors = []
    for _, i in series.items():
        vectors.append([DenseVector([i]), DenseVector([i * 2])])
    return pd.Series(vectors)

This PR converts a UDT into its corresponding UDT.sqlType / StructType before sending results from the PySpark worker to the JVM. On the JVM side, it relaxes schema checking so that Spark SQL considers a UDT compatible with its sqlType.
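As an illustration of the conversion described above, here is a minimal toy sketch (Point and PointUDT are hypothetical stand-ins, not pyspark classes): each UDT value is serialized into its sqlType form before the Arrow transfer, so the wire only ever carries plain SQL types.

```python
# Illustrative sketch only: Point and PointUDT are toy stand-ins for a Spark
# UserDefinedType, not real pyspark classes.
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

class PointUDT:
    """Toy UDT whose sqlType is a struct with two double fields."""
    def serialize(self, obj):
        # Map the Python object to its plain sqlType (struct-like) form.
        return {"x": obj.x, "y": obj.y}

def erase_udt(values, udt):
    # Convert every UDT value to its plain sqlType representation.
    return [udt.serialize(v) for v in values]

plain = erase_udt([Point(1.0, 2.0), Point(3.0, 4.0)], PointUDT())
```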

Why are the changes needed?

We have use cases that build UDTs to convey the semantic meaning of results. We use pandas UDFs because certain computations (e.g., model inference) require expensive initialization, which makes the iterator-based pandas UDF the desired implementation:

@pandas_udf(ArrayType(BoundingBox()))
def object_detection(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    model = load_model()  # Expensive
    for batch in batches:
        ...

Does this PR introduce any user-facing change?

Users can now specify a UDT in pandas_udf's returnType.

How was this patch tested?

This patch includes three tests returning UDTs in different forms.

@eddyxu eddyxu changed the title [SPARK-34600] Return User-defined types from Pandas UDF [SPARK-34600][Pyspark][SQL] Return User-defined types from Pandas UDF Mar 4, 2021
@attilapiros
Contributor

ok to test

@attilapiros
Contributor

jenkins retest this please

@SparkQA

SparkQA commented Mar 4, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40326/

@SparkQA

SparkQA commented Mar 4, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40326/

@SparkQA

SparkQA commented Mar 4, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40323/

@SparkQA

SparkQA commented Mar 4, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40323/

@SparkQA

SparkQA commented Mar 4, 2021

Test build #135744 has finished for PR 31735 at commit cd23b1e.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 4, 2021

Test build #135740 has finished for PR 31735 at commit cd23b1e.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@eddyxu
Member Author

eddyxu commented Mar 4, 2021

@HyukjinKwon would you mind taking a look?

@attilapiros
Contributor

@eddyxu Could you please check the failed test?

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135740/consoleText:

Starting test(pypy3): pyspark.sql.tests.test_pandas_udf_scalar
Traceback (most recent call last):
  File "/usr/lib/pypy3.6-7.2.0-linux_x86_64-portable/lib-python/3/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/pypy3.6-7.2.0-linux_x86_64-portable/lib-python/3/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/tests/test_pandas_udf_scalar.py", line 29, in <module>
    from pyspark.ml.linalg import DenseVector, VectorUDT
  File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/__init__.py", line 22, in <module>
    from pyspark.ml.base import Estimator, Model, Predictor, PredictionModel, \
  File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/base.py", line 25, in <module>
    from pyspark.ml.param.shared import HasInputCol, HasOutputCol, HasLabelCol, HasFeaturesCol, \
  File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/param/__init__.py", line 21, in <module>
    import numpy as np
ModuleNotFoundError: No module named 'numpy'

Had test failures in pyspark.sql.tests.test_pandas_udf_scalar with pypy3; see logs.
[error] running /home/jenkins/workspace/SparkPullRequestBuilder/python/run-tests --modules=pyspark-sql,pyspark-mllib,pyspark-ml --parallelism=8 ; received return code 255

@eddyxu
Member Author

eddyxu commented Mar 4, 2021

@attilapiros Thanks for pointing this out. Looking.

@eddyxu eddyxu changed the title [SPARK-34600][Pyspark][SQL] Return User-defined types from Pandas UDF [WIP][SPARK-34600][Pyspark][SQL] Return User-defined types from Pandas UDF Mar 4, 2021
@SparkQA

SparkQA commented Mar 4, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40355/

@SparkQA

SparkQA commented Mar 4, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40357/

@SparkQA

SparkQA commented Mar 4, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40355/

@SparkQA

SparkQA commented Mar 4, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40357/

@SparkQA

SparkQA commented Mar 5, 2021

Test build #135773 has finished for PR 31735 at commit 3c2c6d2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -89,9 +90,35 @@ case class ArrowEvalPythonExec(udfs: Seq[PythonUDF], resultAttrs: Seq[Attribute]

columnarBatchIter.flatMap { batch =>
val actualDataTypes = (0 until batch.numCols()).map(i => batch.column(i).dataType())
assert(outputTypes == actualDataTypes, "Invalid schema from pandas_udf: " +
s"expected ${outputTypes.mkString(", ")}, got ${actualDataTypes.mkString(", ")}")
assert(plainSchema(outputTypes) == plainSchema(actualDataTypes),
Member

I think we wouldn't need to call plainSchema(actualDataTypes), because an Arrow schema cannot contain PySpark's UDTs?

Member Author

makes sense.

@@ -54,6 +54,9 @@ class ArrowPythonRunner(
"Pandas execution requires more than 4 bytes. Please set higher buffer. " +
s"Please change '${SQLConf.PANDAS_UDF_BUFFER_SIZE.key}'.")

/** This is a private key */
private val PANDAS_UDF_RETURN_TYPE_JSON = "spark.sql.execution.pandas.udf.return.type.json"
Member

Can we avoid sending this together with configurations?

Member Author

@eddyxu eddyxu Mar 5, 2021

Hi, do you suggest that we not send the schema via conf, or not send the schema at all?

I see the schema as valuable on the worker for two purposes:

1. It amortizes the overhead of checking every row of a returned pandas.Series/DataFrame for its schema. By detecting whether the schema contains a UDT before running the @pandas_udf, we can avoid invoking the expensive code path for the existing plain-schema case.
2. It will be useful for passing UDTs into pandas_udf: since the wire format is the pyarrow schema, the wire data needs to be reconstructed before being fed into the pandas_udf on the worker.

Also, as you suggested below, this schema can be used to generate a function that avoids type dispatch when doing the UDT-to-struct conversion.

An alternative implementation could be:

dataOut.writeInt(conf.size)
for ((k, v) <- conf) {
  PythonRDD.writeUTF(k, dataOut)
  PythonRDD.writeUTF(v, dataOut)
}

PythonRDD.writeUTF(schema_json, dataOut)

PythonUDFRunner.writeUDFs(dataOut, funcs, argOffsets)

Do we have any concerns about wire compatibility here? IIUC worker.py is deployed together with ArrowEvalPythonExec, so it might not be an issue.
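A hypothetical Python-side counterpart to the wire layout sketched above might read the conf entries and then one extra UTF-8 string carrying the return-type JSON. This is a toy sketch of the framing only, not Spark's actual worker protocol; all names here are illustrative.

```python
# Toy reader for the alternative layout sketched above: an int32 conf count,
# that many UTF key/value pairs, then one extra UTF string with the
# return-type JSON. Not Spark's real protocol; for illustration only.
import io
import struct

def read_int(stream):
    return struct.unpack(">i", stream.read(4))[0]

def read_utf(stream):
    return stream.read(read_int(stream)).decode("utf-8")

def read_header(stream):
    conf = {}
    for _ in range(read_int(stream)):
        key = read_utf(stream)
        conf[key] = read_utf(stream)
    # The extra field proposed in this thread: the UDF return type as JSON.
    schema_json = read_utf(stream)
    return conf, schema_json

# Round-trip demo with a hand-built header.
def write_utf(stream, s):
    data = s.encode("utf-8")
    stream.write(struct.pack(">i", len(data)))
    stream.write(data)

buf = io.BytesIO()
buf.write(struct.pack(">i", 1))
write_utf(buf, "spark.some.conf")
write_utf(buf, "true")
write_utf(buf, '{"type": "array"}')
buf.seek(0)
conf, schema_json = read_header(buf)
```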

@@ -183,6 +212,21 @@ def create_array(s, t):
raise e
return array

def to_plain_struct(cell):
Member

The performance here will be very bad. We should create a function based on the type, and avoid type-dispatching for every value.

Member Author

let me see what i can do.
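The reviewer's suggestion above could look roughly like this toy sketch (simplified stand-in types, not pyspark code): resolve the converter once per schema, so the per-value hot path does no isinstance dispatch.

```python
# Toy stand-ins for Spark's type tree; not pyspark classes.
class ArrayType:
    def __init__(self, element_type):
        self.elementType = element_type

class PointUDT:
    def serialize(self, obj):
        return {"x": obj[0], "y": obj[1]}

def build_converter(dt):
    """Walk the declared type once and return a value -> plain-value closure."""
    if isinstance(dt, ArrayType):
        conv = build_converter(dt.elementType)
        return lambda xs: [conv(x) for x in xs]
    if hasattr(dt, "serialize"):  # a UDT in this toy model
        return dt.serialize
    return lambda x: x  # plain SQL type: identity, no per-value dispatch

conv = build_converter(ArrayType(PointUDT()))
out = conv([(1.0, 2.0), (3.0, 4.0)])
```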

import org.apache.spark.sql.util.ArrowUtils


Member

Let's remove all these unreleased changes

Member Author

will do

@@ -24,9 +24,10 @@ import org.apache.spark.api.python.ChainedPythonFunctions
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types._
Member

and avoid wildcard import

Member Author

👌

@HyukjinKwon
Member

@eddyxu, how does it work with a regular Python UDF? It looks like the performance here will be very bad. Can you do a quick benchmark?

cc @BryanCutler and @ueshin too FYI

@eddyxu
Member Author

eddyxu commented Mar 5, 2021

Thanks for the reviews, @HyukjinKwon.

TL;DR: this PR should not introduce a performance regression for any case other than pandas_udf with a user-defined type.

  • For a regular Python UDF (not the Arrow-based pandas UDF), this code path is not used. The UDT-to-StructType conversion only happens in ArrowStreamPandasUDFSerializer.
  • For a pandas UDF with regular Spark types, this is guarded by a "has_udt" flag. The flag is initialized once when the ArrowStreamPandasUDFSerializer is created, so normal Spark types incur no performance penalty.

Performance-wise, I will do some benchmarks and look into a way to erase UDTs without much type dispatching.
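A minimal sketch of the "has_udt" guard described above, using toy stand-ins for Spark's type classes: the flag is computed once from the declared return type, so the plain-schema path pays only one boolean check.

```python
# Toy type classes; real code would inspect pyspark.sql.types instances.
class StructType:
    def __init__(self, fields):
        self.fields = fields

class ArrayType:
    def __init__(self, element_type):
        self.elementType = element_type

class UserDefinedType:
    pass

def has_udt(dt):
    """Recursively check whether the declared type contains a UDT anywhere."""
    if isinstance(dt, UserDefinedType):
        return True
    if isinstance(dt, ArrayType):
        return has_udt(dt.elementType)
    if isinstance(dt, StructType):
        return any(has_udt(f) for f in dt.fields)
    return False

flag_plain = has_udt(StructType([ArrayType("double")]))   # "double" = plain type
flag_udt = has_udt(StructType([ArrayType(UserDefinedType())]))
```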

@SparkQA

SparkQA commented Mar 5, 2021

Test build #135775 has finished for PR 31735 at commit 5b07fdd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

I am sorry I wasn't clear. I meant to compare how it works in regular Python UDFs versus pandas UDFs. It would be great if we could use a similar approach for both. Furthermore, we will probably have to do the same for toPandas and createDataFrame with Arrow optimization enabled. It would be best to think about these cases as well.

@HyukjinKwon
Member

BTW, thanks for working on this :-) @eddyxu.

@SparkQA

SparkQA commented Mar 8, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40462/

@SparkQA

SparkQA commented Mar 8, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40462/

@eddyxu eddyxu changed the title [SPARK-34771][PYTHON][SQL] Return User-defined types from Pandas UDF [SPARK-34799][PYTHON][SQL] Return User-defined types from Pandas UDF Mar 23, 2021
@da-liii da-liii mentioned this pull request Mar 23, 2021

arrs = []
Contributor

@maropu Here is the refactored impl:

  1. Use create_arrs_names
  2. Use dt: Optional[DataType] to separate the logic
  3. Move the "Make input conform" step down

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, can't we extract this additional UDT code outside _create_batch? I meant something like this:

if <series has udt>:
    series = _preprocess_for_udt(series)

_create_batch(series)
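One way the sketch above could be fleshed out (hypothetical helper name, toy types; the real code would operate on pandas Series via .apply): serialize UDT-typed columns up front so batch creation only ever sees plain sqlType values.

```python
# Toy stand-in for a Spark UDT; not a pyspark class.
class VectorUDT:
    def serialize(self, v):
        return {"values": list(v)}

def _preprocess_for_udt(columns, types):
    """columns: list of value-lists; types: matching declared types.
    Hypothetical helper: erase UDTs before the batch-creation step."""
    out = []
    for col, dt in zip(columns, types):
        if hasattr(dt, "serialize"):  # a UDT in this toy model
            col = [dt.serialize(v) for v in col]
        out.append(col)
    return out

# First column is UDT-typed, second is a plain type (left untouched).
cols = _preprocess_for_udt([[(1.0,), (2.0,)], [10, 20]], [VectorUDT(), None])
```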

Contributor

Yes, this part can be improved


@SparkQA

SparkQA commented Mar 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40956/

@SparkQA

SparkQA commented Mar 23, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40956/

@da-liii
Contributor

da-liii commented Mar 23, 2021

@eddyxu I wrote a UDT with a Timestamp but failed to make it work. See the demo PR: eddyxu#4

For ExampleBox, serializing to a list works fine. But for ExamplePointWithTimeUDT, to make pa.StructArray.from_pandas work, we need to serialize it to a dict. For the demo PR above, the Python part works fine, but I failed to deserialize ExamplePointWithTime properly on the Scala side.

Do we need to make UDTs with Timestamps work in this PR? How about postponing that to another JIRA ticket?

@maropu What's your opinion? I do not want to make this PR too complicated and hard to review.

@SparkQA

SparkQA commented Mar 23, 2021

Test build #136390 has started for PR 31735 at commit ddda826.

@SparkQA

SparkQA commented Mar 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40974/

@SparkQA

SparkQA commented Mar 23, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40974/

"but got: %s" % str(type(s)))
if isinstance(dt, DataType):
type_not_match = "dt must be instance of StructType when t is pyarrow struct"
assert isinstance(dt, StructType), type_not_match
Contributor

"dt must be StructType as t is pyarrow struct"

Remove the temporary variable type_not_match and use the shortened error message above. (Will change it after the next round of code review.)

Contributor

done

@SparkQA

SparkQA commented Mar 23, 2021

Test build #136412 has started for PR 31735 at commit 1e7452d.

@da-liii
Contributor

da-liii commented Mar 23, 2021

Could you take another round of code review? @HyukjinKwon @maropu @ueshin

GitHub Actions runs are queued and waiting. I will improve this PR based on your review, together with the known nits from my own self-review.

@SparkQA

SparkQA commented Mar 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40995/

@maropu
Member

maropu commented Mar 25, 2021

@maropu What's your opinion? I do not want to make this PR too complicated and hard to review.

sgtm. I think it is better to make the PR simpler, so how about focusing on supporting a simple case (@pandas_udf(UDT)) first? The improvements to support more types (arrays of UDT, ...) can be done in follow-up PRs, I think.


# SPARK-34799
def test_user_defined_types_in_array(self):
@pandas_udf(ArrayType(ExamplePointUDT()))
Member

Could you add tests for @pandas_udf(ArrayType(ExamplePointUDT(), False))? It seems it doesn't work well.

Contributor

postponed

s = s.apply(dt.serialize)
elif isinstance(dt, ArrayType) and isinstance(dt.elementType, UserDefinedType):
udt = dt.elementType
s = s.apply(lambda x: [udt.serialize(f) for f in x])
Member

Could you add an assert to check that dt is a UDT or Array(UDT)?



def create_array(s, t):
mask = s.isnull()
def create_array(s, t: pa.DataType, dt: Optional[DataType] = None):
Member

It looks like we don't use type hints for internal funcs.


@da-liii
Contributor

da-liii commented Mar 31, 2021

@HyukjinKwon:
Furthermore, we will probably have to do it for toPandas and createDataFrame with Arrow optimization on. It should be best to think about these cases as well.

toPandas and createDataFrame are supported in the latest commits. See dca35df

Just learned about the SPIP vote by you: http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-SPIP-Support-pandas-API-layer-on-PySpark-td30996.html

I wonder whether there is any overlap between this PR and the SPIP. I'm new to PySpark; my previous experience and contributions to Apache Spark mainly focused on Yarn/SQL. If there is any overlap or conflict, please give feedback.

@maropu
I think it is better to make the PR simpler, so how about focusing on supporting a simple case (@pandas_udf(UDT)) first? The improvements to support more types (arrays of UDT, ...) can be done in follow-up PRs, I think.

Thanks for your reply and suggestion. Supporting @pandas_udf(UDT) first seems a good way to split this PR. For this PR, I think the most complicated part lies in python/pyspark/sql/pandas/serializers.py; the current implementation, with Spark DataTypes and Arrow DataTypes mixed together, hurts readability. If CI passes, dca35df will be the last commit for this PR.

And I will try to submit a small, minimal first split PR for you to review, with a better and cleaner implementation of python/pyspark/sql/pandas/serializers.py.

Here is my plan:

  1. SPARK-34711: Support UDT for Pandas/Spark conversion with Arrow support Enabled
  2. SPARK-34799: Return User-defined types from Pandas UDF case 1: @pandas_udf(UDT)

@SparkQA

SparkQA commented Mar 31, 2021

Test build #136757 has finished for PR 31735 at commit dca35df.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 31, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41340/

@SparkQA

SparkQA commented Mar 31, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41340/

@PerilousApricot

Hello, this looks like very good work. I'm having some trouble reading the code: is there a possibility that these UDTs could leverage https://pandas.pydata.org/pandas-docs/stable/development/extending.html#extension-types while they're in Pandas/Python, to skip the costly conversion to object dtype that currently happens?

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Aug 28, 2021
@github-actions github-actions bot closed this Aug 29, 2021