
[SPARK-27240][PYTHON] Use pandas DataFrame for struct type argument in Scalar Pandas UDF. #24177

Closed

Conversation

@ueshin (Member) commented Mar 22, 2019

What changes were proposed in this pull request?

We now support returning a pandas DataFrame for struct type in a Scalar Pandas UDF.

If we chain another Pandas UDF after a Scalar Pandas UDF that returns a pandas DataFrame, the argument of the chained UDF will be a pandas DataFrame, but currently we don't support a pandas DataFrame as an argument of a Scalar Pandas UDF. That means there is an inconsistency between the chained UDF and the single UDF.

To be consistent, we should support taking a pandas DataFrame for a struct type argument in a Scalar Pandas UDF.
Currently only pyarrow >= 0.11 is supported.

How was this patch tested?

Modified and added some tests.
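To illustrate the calling convention this change introduces, here is a minimal pandas-level sketch (the function and column names are made up, not taken from the PR): a struct-type argument arrives at the UDF as a pandas DataFrame whose columns are the struct fields, while an ordinary column arrives as a pandas Series.

```python
import pandas as pd

def scale_point(point, factor):
    # `point` stands in for a struct<x: double, y: double> argument,
    # delivered as a pd.DataFrame; `factor` is an ordinary pd.Series.
    return point.mul(factor, axis=0)

# Simulate the inputs a Scalar Pandas UDF would receive for one batch.
point = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})
factor = pd.Series([10.0, 100.0])
result = scale_point(point, factor)
print(result["x"].tolist())  # [10.0, 200.0]
```

The same function body then works whether the DataFrame comes from Spark deserializing a struct column or from a preceding chained UDF that returned a DataFrame, which is the consistency the PR is after.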

# TODO: remove version check once minimum pyarrow version is 0.11.0
if LooseVersion(pa.__version__) < LooseVersion("0.11.0"):
    raise TypeError("Unsupported type in conversion from Arrow: " + str(at) +
                    "\nPlease install pyarrow >= 0.11.0 for StructType support.")
@ueshin (Member Author) commented:

Currently only pyarrow >= 0.11 is supported, since I couldn't find a way to reconstruct a pandas DataFrame from a pyarrow.lib.StructArray.

if self._df_for_struct and type(data_type) == StructType:
    import pandas as pd
    import pyarrow as pa
    column_arrays = zip(*[[chunk.field(i)
@ueshin (Member Author) commented:

pyarrow.lib.StructArray.field() is only available in pyarrow >=0.11.

@SparkQA commented Mar 22, 2019

Test build #103811 has finished for PR 24177 at commit d503aa2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin (Member Author) commented Mar 22, 2019

cc @BryanCutler @HyukjinKwon

@HyukjinKwon (Member) commented:

(Seems you forgot to file a JIRA.)

@ueshin ueshin changed the title Use pandas DataFrame for struct type argument in Scalar Pandas UDF. [SPARK-27240][PYTHON] Use pandas DataFrame for struct type argument in Scalar Pandas UDF. Mar 22, 2019
@ueshin (Member Author) commented Mar 22, 2019

@HyukjinKwon Thanks!
Actually I had filed one, but forgot to tag the JIRA ID and the category.

@@ -253,7 +253,9 @@ def read_udfs(pickleSer, infile, eval_type):
            "spark.sql.legacy.execution.pandas.groupedMap.assignColumnsByName", "true")\
            .lower() == "true"

        ser = ArrowStreamPandasUDFSerializer(timezone, safecheck, assign_cols_by_name)
        df_for_struct = eval_type == PythonEvalType.SQL_SCALAR_PANDAS_UDF
A reviewer (Member) commented:

It seems hard to tell why df_for_struct should be true when eval_type is PythonEvalType.SQL_SCALAR_PANDAS_UDF. A well-explained comment here would be better.

@ueshin (Member Author) replied:

Sure, will add a comment.

                      for arrays, field in zip(column_arrays, data_type)]
    s = _check_dataframe_localize_timestamps(pd.concat(series, axis=1), self._timezone)
else:
    s = super(ArrowStreamPandasUDFSerializer, self).arrow_to_pandas(arrow_column, data_type)
A reviewer (Member) commented:

Will this create a new serializer each time arrow_to_pandas is called?

@ueshin (Member Author) replied:

No, this is just calling the superclass's method.

@SparkQA commented Mar 22, 2019

Test build #103820 has finished for PR 24177 at commit f8b3404.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler (Member) left a comment:

Thanks for the PR @ueshin ! If I understand correctly, this change means that any non-nested StructType column from Spark will be converted to Pandas DataFrame for input to a pandas_udf? So if a pandas_udf had 2 arguments with one being a LongType and one being a StructType, then the user would see one Pandas Series and one Pandas DataFrame as the function input?

That behavior sounds reasonable to me, but I think it is a little different for grouped map UDFs, which merge all columns into a single pandas DataFrame, and then I'm not sure how this would handle a StructType column. I'm just wondering if this difference might end up being confusing to the user, WDYT?

import pyarrow as pa
column_arrays = zip(*[[chunk.field(i)
                       for i in range(chunk.type.num_children)]
                      for chunk in arrow_column.data.iterchunks()])
A reviewer (Member) commented:

It might be best to avoid dealing with array chunks and keep this high-level if possible. Would it be possible to build the pandas DataFrame by flattening the Arrow column, building a table from the flattened arrays, and then converting that to pandas? Something like this, I think:

pdf = pa.Table.from_arrays(arrow_column.flatten()).to_pandas()

I'm not sure if the column names in the pdf would end up as expected, though...

@ueshin (Member Author) replied:

arrow_column.flatten() is great! Then we can support pyarrow >= 0.10.

from pyspark.sql.types import StructType, \
    _arrow_column_to_pandas, _check_dataframe_localize_timestamps

if self._df_for_struct and type(data_type) == StructType:
A reviewer (Member) commented:

Does this need to check for a nested struct?

@ueshin (Member Author) replied:

I don't think so. We can't construct a pandas DataFrame containing a nested DataFrame.
Am I missing what you mean?

A reviewer (Member) replied:

I was wondering, if data_type has a nested struct, is an error raised before it gets here? That could be addressed as a follow-up. I'm not sure if there is a test for it, but I'll check.

@HyukjinKwon (Member) commented:

> I think it is a little different for grouped map udfs that merge all columns into a single Pandas DataFrame

Yes, you were virtually referring to wrap_grouped_map_pandas_udf in worker.py, IIUC, @BryanCutler? I think we had better match.

@ueshin (Member Author) commented Mar 25, 2019

@BryanCutler I'm sorry, but I couldn't figure out what you meant.
So, do you want to use multiple "flattened" arguments instead of a single DataFrame in Grouped Map Pandas UDFs?

@SparkQA commented Mar 25, 2019

Test build #103892 has finished for PR 24177 at commit 4309d46.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler (Member) commented:

> @BryanCutler I'm sorry, but I couldn't figure out what you meant.
> So, do you want to use multiple "flattened" arguments instead of a single DataFrame in Grouped Map Pandas UDFs?

Sorry, I think I wrote that a little too hastily and it might not have made much sense. Yes, I was referring to wrap_grouped_map_pandas_udf but actually I think it's not an issue since the user doesn't select columns in the same way with a groupby().apply() operation.

@BryanCutler (Member) left a review comment:

LGTM

@BryanCutler (Member) commented:

Merged to master, thanks @ueshin!

@HyukjinKwon (Member) left a comment:

A late LGTM as well :D

dlisuk pushed a commit to dlisuk/spark that referenced this pull request Jun 10, 2019
…n Scalar Pandas UDF.

Now that we support returning pandas DataFrame for struct type in Scalar Pandas UDF.

If we chain another Pandas UDF after the Scalar Pandas UDF returning pandas DataFrame, the argument of the chained UDF will be pandas DataFrame, but currently we don't support pandas DataFrame as an argument of Scalar Pandas UDF. That means there is an inconsistency between the chained UDF and the single UDF.

We should support taking pandas DataFrame for struct type argument in Scalar Pandas UDF to be consistent.
Currently pyarrow >=0.11 is supported.

Modified and added some tests.

Closes apache#24177 from ueshin/issues/SPARK-27240/structtype_argument.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
simon-slowik pushed a commit to simon-slowik/spark that referenced this pull request Jun 26, 2019
…n Scalar Pandas UDF.

## What changes were proposed in this pull request?

Now that we support returning pandas DataFrame for struct type in Scalar Pandas UDF.

If we chain another Pandas UDF after the Scalar Pandas UDF returning pandas DataFrame, the argument of the chained UDF will be pandas DataFrame, but currently we don't support pandas DataFrame as an argument of Scalar Pandas UDF. That means there is an inconsistency between the chained UDF and the single UDF.

We should support taking pandas DataFrame for struct type argument in Scalar Pandas UDF to be consistent.
Currently pyarrow >=0.11 is supported.

## How was this patch tested?

Modified and added some tests.

Closes apache#24177 from ueshin/issues/SPARK-27240/structtype_argument.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>