[SPARK-43473][PYTHON] Support struct type in createDataFrame from pandas DataFrame #41149

ueshin · 2023-05-12T00:33:36Z

What changes were proposed in this pull request?

Supports struct type in createDataFrame from pandas DataFrame.

With Arrow optimization, it works without the fallback:

>>> import pandas as pd
>>> from pyspark.sql.types import Row
>>>
>>> pdf = pd.DataFrame(
...     {"a": [Row(1, "a"), Row(2, "b")], "b": [{"s": 3, "t": "x"}, {"s": 4, "t": "y"}]}
... )
>>> schema = "a struct<x int, y string>, b struct<s int, t string>"
>>>
>>> df = spark.createDataFrame(pdf, schema)
>>> df.show()
+------+------+
|     a|     b|
+------+------+
|{1, a}|{3, x}|
|{2, b}|{4, y}|
+------+------+

and Spark Connect also works.

Why are the changes needed?

In vanilla PySpark without Arrow optimization, a Row object or dict can be handled as a struct type if the schema is provided:

>>> import pandas as pd
>>> from pyspark.sql.types import *
>>>
>>> pdf = pd.DataFrame(
...     {"a": [Row(1, "a"), Row(2, "b")], "b": [{"s": 3, "t": "x"}, {"s": 4, "t": "y"}]}
... )
>>> schema = "a struct<x int, y string>, b struct<s int, t string>"
>>>
>>> df = spark.createDataFrame(pdf, schema)
>>> df.show()
+------+------+
|     a|     b|
+------+------+
|{1, a}|{3, x}|
|{2, b}|{4, y}|
+------+------+

Whereas with Arrow optimization enabled, the Arrow conversion fails and the result is produced only through the non-Arrow fallback:

>>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', True)
>>> spark.createDataFrame(pdf, schema).show()
/.../pyspark/sql/pandas/conversion.py:329: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
  A field of type StructType expects a pandas.DataFrame, but got: <class 'pandas.core.series.Series'>
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warn(msg)
+------+------+
|     a|     b|
+------+------+
|{1, a}|{3, x}|
|{2, b}|{4, y}|
+------+------+

and Spark Connect fails:

>>> df = spark.createDataFrame(pdf, schema)
Traceback (most recent call last):
...
ValueError: A field of type StructType expects a pandas.DataFrame, but got: <class 'pandas.core.series.Series'>

Does this PR introduce any user-facing change?

Row objects or dicts in a pandas DataFrame now work as struct type in createDataFrame if the schema is provided.
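
For illustration only, the idea of treating a Row or a dict cell as one struct value can be sketched in plain Python. This is not the code in this PR: `to_struct_dict` is a hypothetical helper, and the matching rules (dicts by field name, tuples by position) are assumptions made for the sketch. A plain tuple stands in for `Row`, which subclasses `tuple` in PySpark.

```python
from typing import Any, Mapping, Optional, Sequence


def to_struct_dict(value: Any, field_names: Sequence[str]) -> Optional[dict]:
    """Hypothetical helper: normalize one cell into a dict keyed by the struct's field names."""
    if value is None:
        # A null struct stays null.
        return None
    if isinstance(value, Mapping):
        # dict-like cell: pick fields by name; missing keys become None (assumption).
        return {name: value.get(name) for name in field_names}
    if isinstance(value, tuple):
        # tuple-like cell: match fields by position (assumption).
        if len(value) != len(field_names):
            raise ValueError(
                "expected %d fields, got %d" % (len(field_names), len(value))
            )
        return dict(zip(field_names, value))
    raise TypeError("unsupported struct value: %r" % type(value))


# Column "a" holds tuple-like cells, column "b" holds dict cells.
print(to_struct_dict((1, "a"), ["x", "y"]))            # {'x': 1, 'y': 'a'}
print(to_struct_dict({"s": 3, "t": "x"}, ["s", "t"]))  # {'s': 3, 't': 'x'}
```

Normalizing every cell to the same shape first means a later conversion step only has to handle one representation per struct field.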

How was this patch tested?

Added the related test.

zhengruifeng · 2023-05-15T01:15:37Z

python/pyspark/sql/pandas/serializers.py

+                    "can be disabled by using SQL config "
+                    "`spark.sql.execution.pandas.convertToArrowArraySafely`."
+                )
+            raise ValueError(error_msg % (series.dtype, series.name, arrow_type)) from e


nit, shall we use PySparkValueError here and above?

cc @itholic

Yeah, at least here I think we should raise PySparkValueError.
The errors above seem to be generated internally by PyArrow, so I guess we may not be able to catch them as PySparkxxxError.
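
As a rough sketch of the pattern being discussed (catch a lower-level conversion error and re-raise a project-specific error with `raise ... from e`), assuming nothing about the real `PySparkValueError` constructor: `SketchValueError`, `convert_safely`, and the error-class string below are hypothetical stand-ins, not PySpark APIs.

```python
class SketchValueError(ValueError):
    """Hypothetical stand-in for a project-specific error type."""

    def __init__(self, message, error_class=None):
        super().__init__(message)
        # Machine-readable identifier for the error (hypothetical field).
        self.error_class = error_class


def convert_safely(values, target):
    """Convert each value with `target`, wrapping low-level failures."""
    try:
        return [target(v) for v in values]
    except (ValueError, TypeError) as e:
        # `from e` keeps the original exception available as __cause__.
        raise SketchValueError(
            "Exception thrown when converting %r to %s" % (values, target.__name__),
            error_class="SKETCH_CONVERSION_FAILED",
        ) from e


print(convert_safely(["1", "2"], int))  # [1, 2]
```

Errors raised inside a third-party library can only be wrapped this way at the point where the caller invokes it, which matches the concern about errors generated internally by PyArrow.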

xinrong-meng · 2023-05-15T19:01:27Z

Does that refactoring still conform to UNSUPPORTED_DATA_TYPE_FOR_ARROW_VERSION?

ueshin · 2023-05-15T19:17:17Z

@xinrong-meng

Does that refactoring still conform to UNSUPPORTED_DATA_TYPE_FOR_ARROW_VERSION?

This PR doesn't change anything related to pyarrow version.

xinrong-meng · 2023-05-15T21:29:00Z

Sorry I meant UNSUPPORTED_DATA_TYPE_FOR_ARROW_CONVERSION. Do we have plans to remove the constraints?
@ueshin

xinrong-meng · 2023-05-15T21:38:45Z

Specifically, nested StructType, and MapType with keys/values in StructType/TimestampType?

ueshin · 2023-05-15T21:58:04Z

Do we have plans to remove the constraints?

I'm not sure if it's planned, but now we can remove the constraints with a bit more work.

HyukjinKwon · 2023-05-16T02:05:54Z

Merged to master.

BryanCutler · 2023-05-24T19:25:43Z

This looks great, thanks for doing it @ueshin !

Support struct type in createDataFrame from pandas DataFrame.

6f1d358

ueshin requested review from BryanCutler, HyukjinKwon and zhengruifeng May 12, 2023 00:33

github-actions bot added CORE PYTHON SQL labels May 12, 2023

ueshin added 2 commits May 12, 2023 10:46

Merge branch 'master' into issues/SPARK-43473/rows

97314ef

Fix.

7fed8fc

zhengruifeng approved these changes May 15, 2023

ueshin added 2 commits May 15, 2023 12:07

Merge branch 'master' into issues/SPARK-43473/rows

ddfdf42

Fix.

efc2e7c

Fix.

430e900

xinrong-meng approved these changes May 15, 2023

Fix.

35a055c

github-actions bot added the CONNECT label May 15, 2023

Fix.

4e5ee21

itholic approved these changes May 16, 2023

HyukjinKwon closed this in 6221995 May 16, 2023
