Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-43473][PYTHON] Support struct type in createDataFrame from pandas DataFrame #41149

Closed
wants to merge 8 commits into from

Conversation

ueshin
Copy link
Member

@ueshin ueshin commented May 12, 2023

What changes were proposed in this pull request?

Supports struct type in createDataFrame from pandas DataFrame.

With Arrow optimization, it works without the fallback:

>>> import pandas as pd
>>> from pyspark.sql.types import Row
>>>
>>> pdf = pd.DataFrame(
...     {"a": [Row(1, "a"), Row(2, "b")], "b": [{"s": 3, "t": "x"}, {"s": 4, "t": "y"}]}
... )
>>> schema = "a struct<x int, y string>, b struct<s int, t string>"
>>>
>>> df = spark.createDataFrame(pdf, schema)
>>> df.show()
+------+------+
|     a|     b|
+------+------+
|{1, a}|{3, x}|
|{2, b}|{4, y}|
+------+------+

and Spark Connect also works.

Why are the changes needed?

In vanilla PySpark without Arrow optimization, Row object or dict can be handled as struct type if the schema is provided:

>>> import pandas as pd
>>> from pyspark.sql.types import *
>>>
>>> pdf = pd.DataFrame(
...     {"a": [Row(1, "a"), Row(2, "b")], "b": [{"s": 3, "t": "x"}, {"s": 4, "t": "y"}]}
... )
>>> schema = "a struct<x int, y string>, b struct<s int, t string>"
>>>
>>> df = spark.createDataFrame(pdf, schema)
>>> df.show()
+------+------+
|     a|     b|
+------+------+
|{1, a}|{3, x}|
|{2, b}|{4, y}|
+------+------+

Whereas with Arrow, it uses a fallback to make it:

>>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', True)
>>> spark.createDataFrame(pdf, schema).show()
/.../pyspark/sql/pandas/conversion.py:329: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
  A field of type StructType expects a pandas.DataFrame, but got: <class 'pandas.core.series.Series'>
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warn(msg)
+------+------+
|     a|     b|
+------+------+
|{1, a}|{3, x}|
|{2, b}|{4, y}|
+------+------+

and Spark Connect fails:

>>> df = spark.createDataFrame(pdf, schema)
Traceback (most recent call last):
...
ValueError: A field of type StructType expects a pandas.DataFrame, but got: <class 'pandas.core.series.Series'>

Does this PR introduce any user-facing change?

Row object or dict in pandas DataFrame works as struct type when createDataFrame if the schema is provided.

How was this patch tested?

Added the related test.

"can be disabled by using SQL config "
"`spark.sql.execution.pandas.convertToArrowArraySafely`."
)
raise ValueError(error_msg % (series.dtype, series.name, arrow_type)) from e
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, shall we use PySparkValueError here and above?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@itholic itholic May 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, at least here I think we should raise PySparkValueError.
For above errors seems like they're generated from PyArrow internally, so I guess maybe we can't catch them by PySparkxxxError.

@xinrong-meng
Copy link
Member

Does that refactoring still conform to UNSUPPORTED_DATA_TYPE_FOR_ARROW_VERSION?

@ueshin
Copy link
Member Author

ueshin commented May 15, 2023

@xinrong-meng

Does that refactoring still conform to UNSUPPORTED_DATA_TYPE_FOR_ARROW_VERSION?

This PR doesn't change anything related to pyarrow version.

@xinrong-meng
Copy link
Member

Sorry I meant UNSUPPORTED_DATA_TYPE_FOR_ARROW_CONVERSION. Do we have plans to remove the constraints?
@ueshin

@xinrong-meng
Copy link
Member

Specifically, nested StructType, and MapType with keys/values in StructType/TimestampType?

@ueshin
Copy link
Member Author

ueshin commented May 15, 2023

Do we have plans to remove the constraints?

I'm not sure if it's planned, but now we can remove the constraints with a bit more work.

@HyukjinKwon
Copy link
Member

Merged to master.

@BryanCutler
Copy link
Member

This looks great, thanks for doing it @ueshin !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants