
[SPARK-34799][PYTHON][SQL] Return User-defined types from Pandas UDF #31735

Closed · wants to merge 37 commits

Conversation

eddyxu
Member

@eddyxu eddyxu commented Mar 4, 2021

What changes were proposed in this pull request?

This PR allows returning user-defined types (UDTs) from a Pandas UDF.

@pandas_udf(StructType([StructField("vec", ArrayType(VectorUDT()))]))
def array_of_udt_structs(series: pd.Series) -> pd.DataFrame:
    vectors = []
    for _, i in series.items():
        vectors.append({"vec": [DenseVector([i]), DenseVector([i * 2])]})
    return pd.DataFrame(vectors)

# Or

@pandas_udf(ArrayType(VectorUDT()))
def array_of_vectors(series: pd.Series) -> pd.Series:
    vectors = []
    for _, i in series.items():
        vectors.append([DenseVector([i]), DenseVector([i * 2])])
    return pd.Series(vectors)

This PR converts a UDT into its corresponding UDT.sqlType / StructType before sending results from the PySpark worker to the JVM. On the JVM side, it relaxes schema checking so that Spark SQL considers a UDT compatible with its sqlType.
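As an illustration of the conversion described above, here is a minimal toy sketch (Point and PointUDT are hypothetical stand-ins, not pyspark classes): each UDT value is serialized into its sqlType form before the Arrow transfer, so the wire only ever carries plain SQL types.

```python
# Illustrative sketch only: Point and PointUDT are toy stand-ins for a Spark
# UserDefinedType, not real pyspark classes.
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

class PointUDT:
    """Toy UDT whose sqlType is a struct with two double fields."""
    def serialize(self, obj):
        # Map the Python object to its plain sqlType (struct-like) form.
        return {"x": obj.x, "y": obj.y}

def erase_udt(values, udt):
    # Convert every UDT value to its plain sqlType representation.
    return [udt.serialize(v) for v in values]

plain = erase_udt([Point(1.0, 2.0), Point(3.0, 4.0)], PointUDT())
```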

Why are the changes needed?

We have use cases that build UDTs to convey the semantic meaning of results. We use pandas UDFs because certain computations (e.g., model inference) require expensive initialization, which makes the iterator-based pandas UDF the desired implementation:

@pandas_udf(ArrayType(BoundingBox()))
def object_detection(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    model = load_model()  # Expensive
    for batch in batches:
        ...

Does this PR introduce any user-facing change?

Users can now specify a UDT in pandas_udf's returnType.

How was this patch tested?

This patch includes three tests returning UDTs in different forms.

@eddyxu eddyxu changed the title [SPARK-34600] Return User-defined types from Pandas UDF [SPARK-34600][Pyspark][SQL] Return User-defined types from Pandas UDF Mar 4, 2021
@attilapiros
Contributor

ok to test

@attilapiros
Contributor

jenkins retest this please

@SparkQA

SparkQA commented Mar 4, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40326/

@SparkQA

SparkQA commented Mar 4, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40326/

@SparkQA

SparkQA commented Mar 4, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40323/

@SparkQA

SparkQA commented Mar 4, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40323/

@SparkQA

SparkQA commented Mar 4, 2021

Test build #135744 has finished for PR 31735 at commit cd23b1e.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 4, 2021

Test build #135740 has finished for PR 31735 at commit cd23b1e.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@eddyxu
Member Author

eddyxu commented Mar 4, 2021

@HyukjinKwon would you mind taking a look?

@attilapiros
Contributor

@eddyxu Could you please check the failed test?

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135740/consoleText:

Starting test(pypy3): pyspark.sql.tests.test_pandas_udf_scalar
Traceback (most recent call last):
  File "/usr/lib/pypy3.6-7.2.0-linux_x86_64-portable/lib-python/3/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/pypy3.6-7.2.0-linux_x86_64-portable/lib-python/3/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/tests/test_pandas_udf_scalar.py", line 29, in <module>
    from pyspark.ml.linalg import DenseVector, VectorUDT
  File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/__init__.py", line 22, in <module>
    from pyspark.ml.base import Estimator, Model, Predictor, PredictionModel, \
  File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/base.py", line 25, in <module>
    from pyspark.ml.param.shared import HasInputCol, HasOutputCol, HasLabelCol, HasFeaturesCol, \
  File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/param/__init__.py", line 21, in <module>
    import numpy as np
ModuleNotFoundError: No module named 'numpy'

Had test failures in pyspark.sql.tests.test_pandas_udf_scalar with pypy3; see logs.
[error] running /home/jenkins/workspace/SparkPullRequestBuilder/python/run-tests --modules=pyspark-sql,pyspark-mllib,pyspark-ml --parallelism=8 ; received return code 255

@eddyxu
Member Author

eddyxu commented Mar 4, 2021

@attilapiros Thanks for pointing this out. Looking.

@eddyxu eddyxu changed the title [SPARK-34600][Pyspark][SQL] Return User-defined types from Pandas UDF [WIP][SPARK-34600][Pyspark][SQL] Return User-defined types from Pandas UDF Mar 4, 2021
@SparkQA

SparkQA commented Mar 4, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40355/

@SparkQA

SparkQA commented Mar 4, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40357/

@SparkQA

SparkQA commented Mar 4, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40355/

@SparkQA

SparkQA commented Mar 4, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40357/

@SparkQA

SparkQA commented Mar 5, 2021

Test build #135773 has finished for PR 31735 at commit 3c2c6d2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -89,9 +90,35 @@ case class ArrowEvalPythonExec(udfs: Seq[PythonUDF], resultAttrs: Seq[Attribute]

columnarBatchIter.flatMap { batch =>
val actualDataTypes = (0 until batch.numCols()).map(i => batch.column(i).dataType())
assert(outputTypes == actualDataTypes, "Invalid schema from pandas_udf: " +
s"expected ${outputTypes.mkString(", ")}, got ${actualDataTypes.mkString(", ")}")
assert(plainSchema(outputTypes) == plainSchema(actualDataTypes),
Member

I think we wouldn't need to call plainSchema(actualDataTypes), because an Arrow schema cannot contain PySpark's UDTs?

Member Author

makes sense.

@@ -54,6 +54,9 @@ class ArrowPythonRunner(
"Pandas execution requires more than 4 bytes. Please set higher buffer. " +
s"Please change '${SQLConf.PANDAS_UDF_BUFFER_SIZE.key}'.")

/** This is a private key */
private val PANDAS_UDF_RETURN_TYPE_JSON = "spark.sql.execution.pandas.udf.return.type.json"
Member

Can we avoid sending this together with configurations?

Member Author

@eddyxu eddyxu Mar 5, 2021

Hi, do you suggest that we not send the schema via conf, or not send the schema at all?

I see the schema as valuable on the worker for two purposes:

1. It amortizes the overhead of checking every row of a returned pandas.Series/DataFrame for its schema. By detecting whether the schema contains a UDT before running the @pandas_udf, we can avoid invoking the expensive code path for the existing plain-schema case.
2. It will be useful for passing UDTs into pandas_udf: since the wire format is the pyarrow schema, the wire data needs to be reconstructed before being fed into the pandas_udf on the worker.

Also, as you suggested below, this schema can be used to generate a function that avoids type dispatch when doing the UDT-to-struct conversion.

An alternative implementation could be:

dataOut.writeInt(conf.size)
for ((k, v) <- conf) {
  PythonRDD.writeUTF(k, dataOut)
  PythonRDD.writeUTF(v, dataOut)
}

PythonRDD.writeUTF(schema_json, dataOut)

PythonUDFRunner.writeUDFs(dataOut, funcs, argOffsets)

Do we have any concerns about wire compatibility here? IIUC worker.py is deployed together with ArrowEvalPythonExec, so it might not be an issue.
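A hypothetical Python-side counterpart to the wire layout sketched above might read the conf entries and then one extra UTF-8 string carrying the return-type JSON. This is a toy sketch of the framing only, not Spark's actual worker protocol; all names here are illustrative.

```python
# Toy reader for the alternative layout sketched above: an int32 conf count,
# that many UTF key/value pairs, then one extra UTF string with the
# return-type JSON. Not Spark's real protocol; for illustration only.
import io
import struct

def read_int(stream):
    return struct.unpack(">i", stream.read(4))[0]

def read_utf(stream):
    return stream.read(read_int(stream)).decode("utf-8")

def read_header(stream):
    conf = {}
    for _ in range(read_int(stream)):
        key = read_utf(stream)
        conf[key] = read_utf(stream)
    # The extra field proposed in this thread: the UDF return type as JSON.
    schema_json = read_utf(stream)
    return conf, schema_json

# Round-trip demo with a hand-built header.
def write_utf(stream, s):
    data = s.encode("utf-8")
    stream.write(struct.pack(">i", len(data)))
    stream.write(data)

buf = io.BytesIO()
buf.write(struct.pack(">i", 1))
write_utf(buf, "spark.some.conf")
write_utf(buf, "true")
write_utf(buf, '{"type": "array"}')
buf.seek(0)
conf, schema_json = read_header(buf)
```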

@@ -183,6 +212,21 @@ def create_array(s, t):
raise e
return array

def to_plain_struct(cell):
Member

The performance here will be very bad. We should create a function based on the type, and avoid type-dispatching for every value.

Member Author

let me see what i can do.
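The reviewer's suggestion above could look roughly like this toy sketch (simplified stand-in types, not pyspark code): resolve the converter once per schema, so the per-value hot path does no isinstance dispatch.

```python
# Toy stand-ins for Spark's type tree; not pyspark classes.
class ArrayType:
    def __init__(self, element_type):
        self.elementType = element_type

class PointUDT:
    def serialize(self, obj):
        return {"x": obj[0], "y": obj[1]}

def build_converter(dt):
    """Walk the declared type once and return a value -> plain-value closure."""
    if isinstance(dt, ArrayType):
        conv = build_converter(dt.elementType)
        return lambda xs: [conv(x) for x in xs]
    if hasattr(dt, "serialize"):  # a UDT in this toy model
        return dt.serialize
    return lambda x: x  # plain SQL type: identity, no per-value dispatch

conv = build_converter(ArrayType(PointUDT()))
out = conv([(1.0, 2.0), (3.0, 4.0)])
```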

import org.apache.spark.sql.util.ArrowUtils


Member

Let's remove all these unreleased changes

Member Author

will do

@@ -24,9 +24,10 @@ import org.apache.spark.api.python.ChainedPythonFunctions
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types._
Member

and avoid wildcard import

Member Author

👌

@HyukjinKwon
Member

@eddyxu, how does it work with a regular Python UDF? It looks like the performance here will be very bad. Can you do a quick benchmark?

cc @BryanCutler and @ueshin too FYI

@eddyxu
Member Author

eddyxu commented Mar 5, 2021

Thanks for the reviews, @HyukjinKwon.

TL;DR: this PR should not introduce a performance regression for any case other than pandas_udf with a user-defined type.

  • For a regular Python UDF (not the Arrow-based pandas UDF), this code path is not used. The UDT-to-StructType conversion only happens in ArrowStreamPandasUDFSerializer.
  • For a pandas UDF with regular Spark types, this is guarded by a "has_udt" flag. The flag is initialized once when the ArrowStreamPandasUDFSerializer is created, so normal Spark types incur no performance penalty.

Performance-wise, I will do some benchmarks and look into a way to erase UDTs without much type dispatching.
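A minimal sketch of the "has_udt" guard described above, using toy stand-ins for Spark's type classes: the flag is computed once from the declared return type, so the plain-schema path pays only one boolean check.

```python
# Toy type classes; real code would inspect pyspark.sql.types instances.
class StructType:
    def __init__(self, fields):
        self.fields = fields

class ArrayType:
    def __init__(self, element_type):
        self.elementType = element_type

class UserDefinedType:
    pass

def has_udt(dt):
    """Recursively check whether the declared type contains a UDT anywhere."""
    if isinstance(dt, UserDefinedType):
        return True
    if isinstance(dt, ArrayType):
        return has_udt(dt.elementType)
    if isinstance(dt, StructType):
        return any(has_udt(f) for f in dt.fields)
    return False

flag_plain = has_udt(StructType([ArrayType("double")]))   # "double" = plain type
flag_udt = has_udt(StructType([ArrayType(UserDefinedType())]))
```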

@SparkQA

SparkQA commented Mar 5, 2021

Test build #135775 has finished for PR 31735 at commit 5b07fdd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

I am sorry I wasn't clear. I meant to compare how it works in regular Python UDFs versus pandas UDFs. It would be great if we could use a similar approach for both. Furthermore, we will probably have to do the same for toPandas and createDataFrame with Arrow optimization enabled. It would be best to think about these cases as well.

@HyukjinKwon
Member

BTW, thanks for working on this :-) @eddyxu.

@SparkQA

SparkQA commented Mar 8, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40462/

@SparkQA

SparkQA commented Mar 8, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40462/

@eddyxu eddyxu changed the title [SPARK-34771][PYTHON][SQL] Return User-defined types from Pandas UDF [SPARK-34799][PYTHON][SQL] Return User-defined types from Pandas UDF Mar 23, 2021
@da-liii da-liii mentioned this pull request Mar 23, 2021

arrs = []
Contributor

@maropu Here is the refactored impl:

  1. Use create_arrs_names
  2. Use dt: Optional[DataType] to separate the logic
  3. Move the "Make input conform" step down

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, can't we extract this additional UDT code outside _create_batch? I meant something like this:

if <series has udt>:
    series = _preprocess_for_udt(series)

_create_batch(series)
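One way the sketch above could be fleshed out (hypothetical helper name, toy types; the real code would operate on pandas Series via .apply): serialize UDT-typed columns up front so batch creation only ever sees plain sqlType values.

```python
# Toy stand-in for a Spark UDT; not a pyspark class.
class VectorUDT:
    def serialize(self, v):
        return {"values": list(v)}

def _preprocess_for_udt(columns, types):
    """columns: list of value-lists; types: matching declared types.
    Hypothetical helper: erase UDTs before the batch-creation step."""
    out = []
    for col, dt in zip(columns, types):
        if hasattr(dt, "serialize"):  # a UDT in this toy model
            col = [dt.serialize(v) for v in col]
        out.append(col)
    return out

# First column is UDT-typed, second is a plain type (left untouched).
cols = _preprocess_for_udt([[(1.0,), (2.0,)], [10, 20]], [VectorUDT(), None])
```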

Contributor

Yes, this part can be improved


@SparkQA

SparkQA commented Mar 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40956/

@SparkQA

SparkQA commented Mar 23, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40956/

@da-liii
Contributor

da-liii commented Mar 23, 2021

@eddyxu I wrote a UDT with a Timestamp but failed to make it work. See the demo PR: eddyxu#4

For ExampleBox, serializing to a list works fine. But for ExamplePointWithTimeUDT, to make pa.StructArray.from_pandas work, we need to serialize it to a dict. For the demo PR above, the Python part works fine, but I failed to deserialize ExamplePointWithTime properly on the Scala side.

Do we need to make UDTs with Timestamps work in this PR? How about postponing that to another JIRA ticket?

@maropu What's your opinion? I do not want to make this PR too complicated and hard to review.

@SparkQA

SparkQA commented Mar 23, 2021

Test build #136390 has started for PR 31735 at commit ddda826.

@SparkQA

SparkQA commented Mar 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40974/

@SparkQA

SparkQA commented Mar 23, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40974/

"but got: %s" % str(type(s)))
if isinstance(dt, DataType):
type_not_match = "dt must be instance of StructType when t is pyarrow struct"
assert isinstance(dt, StructType), type_not_match
Contributor

"dt must be StructType as t is pyarrow struct"

Remove the temporary variable type_not_match and use the shortened error message above. (Will change it after the next round of code review.)

Contributor

done

@SparkQA

SparkQA commented Mar 23, 2021

Test build #136412 has started for PR 31735 at commit 1e7452d.

@da-liii
Contributor

da-liii commented Mar 23, 2021

Could you take another round of code review? @HyukjinKwon @maropu @ueshin

GitHub Actions runs are queued and waiting. I will improve this PR based on your review, together with the known nits from my own self-review.

@SparkQA

SparkQA commented Mar 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40995/

@maropu
Member

maropu commented Mar 25, 2021

@maropu What's your opinion? I do not want to make this PR too complicated and hard to review.

sgtm. I think it is better to make the PR simpler, so how about focusing on supporting a simple case (@pandas_udf(UDT)) first? The improvements to support more types (arrays of UDT, ...) can be done in follow-up PRs, I think.


# SPARK-34799
def test_user_defined_types_in_array(self):
@pandas_udf(ArrayType(ExamplePointUDT()))
Member

Could you add tests for @pandas_udf(ArrayType(ExamplePointUDT(), False))? It seems it doesn't work well.

Contributor

postponed

s = s.apply(dt.serialize)
elif isinstance(dt, ArrayType) and isinstance(dt.elementType, UserDefinedType):
udt = dt.elementType
s = s.apply(lambda x: [udt.serialize(f) for f in x])
Member

Could you add an assert to check that dt is a UDT or Array(UDT)?



def create_array(s, t):
mask = s.isnull()
def create_array(s, t: pa.DataType, dt: Optional[DataType] = None):
Member

It looks like we don't use type hints for internal funcs.


@da-liii
Contributor

da-liii commented Mar 31, 2021

@HyukjinKwon:
Furthermore, we will probably have to do it for toPandas and createDataFrame with Arrow optimization on. It should be best to think about these cases as well.

toPandas and createDataFrame are supported in the latest commits. See dca35df

Just learned about the SPIP vote by you: http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-SPIP-Support-pandas-API-layer-on-PySpark-td30996.html

I wonder whether there is any overlap between this PR and the SPIP. I'm new to PySpark; my previous experience and contributions to Apache Spark mainly focused on Yarn/SQL. If there is any overlap or conflict, please give feedback.

@maropu
I think it is better to make the PR simpler, so how about focusing on supporting a simple case (@pandas_udf(UDT)) first? The improvements to support more types (arrays of UDT, ...) can be done in follow-up PRs, I think.

Thanks for your reply and suggestion. Supporting @pandas_udf(UDT) first seems a good way to split this PR. For this PR, I think the most complicated part lies in python/pyspark/sql/pandas/serializers.py; the current implementation, with Spark DataTypes and Arrow DataTypes mixed together, hurts readability. If CI passes, dca35df will be the last commit for this PR.

And I will try to submit a small, minimal first split PR for you to review, with a better and cleaner implementation of python/pyspark/sql/pandas/serializers.py.

Here is my plan:

  1. SPARK-34711: Support UDT for Pandas/Spark conversion with Arrow support Enabled
  2. SPARK-34799: Return User-defined types from Pandas UDF case 1: @pandas_udf(UDT)

@SparkQA

SparkQA commented Mar 31, 2021

Test build #136757 has finished for PR 31735 at commit dca35df.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 31, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41340/

@SparkQA

SparkQA commented Mar 31, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41340/

@PerilousApricot

Hello, this looks like very good work. I'm having some trouble reading the code: is there a possibility that these UDTs could leverage https://pandas.pydata.org/pandas-docs/stable/development/extending.html#extension-types while they're in Pandas/Python, to skip the costly conversion to object dtype that currently happens?

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Aug 28, 2021
@github-actions github-actions bot closed this Aug 29, 2021