[SPARK-43473][PYTHON] Support struct type in createDataFrame from pandas DataFrame #41149
Conversation
"can be disabled by using SQL config " | ||
"`spark.sql.execution.pandas.convertToArrowArraySafely`." | ||
) | ||
raise ValueError(error_msg % (series.dtype, series.name, arrow_type)) from e |
nit, shall we use PySparkValueError here and above?
cc @itholic
Yeah, at least here I think we should raise PySparkValueError.
The errors above seem to be generated internally by PyArrow, so I guess we can't catch them as PySparkxxxError.
Does that refactoring still conform to UNSUPPORTED_DATA_TYPE_FOR_ARROW_VERSION?
This PR doesn't change anything related to the PyArrow version.
Sorry, I meant:
Specifically, nested StructType, and MapType with keys/values in StructType/TimestampType?
I'm not sure if it's planned, but now we can remove those constraints with a bit more work.
Merged to master.
This looks great, thanks for doing it @ueshin!
What changes were proposed in this pull request?
Supports struct type in createDataFrame from pandas DataFrame.
With Arrow optimization, it works without the fallback, and Spark Connect also works.
Why are the changes needed?
In vanilla PySpark without Arrow optimization, a Row object or dict can be handled as struct type if the schema is provided.
Whereas with Arrow, it falls back to the non-Arrow path, and Spark Connect fails.
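For context, here is a pandas-only sketch (assuming pandas is installed) of why the explicit schema matters: dict values stored in a pandas column carry no struct schema of their own, so the struct layout has to come from the schema the user passes to createDataFrame:

```python
import pandas as pd

# Dict values in a pandas column (illustrative data, not from the PR).
pdf = pd.DataFrame({"s": [{"x": 1, "y": "a"}, {"x": 2, "y": "b"}]})

# pandas has no struct dtype: the column's dtype is just `object`,
# which is why the struct layout must come from a user-provided schema.
print(pdf["s"].dtype)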
Does this PR introduce any user-facing change?
A Row object or dict in a pandas DataFrame works as struct type in createDataFrame if the schema is provided.

How was this patch tested?

Added the related test.